---
title: "Stepwise Regression Demonstration"
output: html_notebook
author: Paul A. Rubin (rubin@msu.edu)
date: August 20, 2017
---

# Preamble

This notebook supplies and demonstrates some code I cobbled together to do "classical" stepwise linear regression. Both the code and the rationale for writing it are discussed in a blog post I wrote in 2011 (when I used the code in a course I taught): [Stepwise Regression in R](https://orinanobworld.blogspot.com/2011/02/stepwise-regression-in-r.html).

The function "stepwise" defined below performs stepwise regression based on a "nested model" F test for inclusion/exclusion of a predictor. In keeping with my intent to use it as an instructional tool, it spews progress reports to output (as a side effect) while returning the final model chosen (an `lm` object).
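
The nested-model F test in question is the same one `anova()` reports when comparing two nested fits. A quick illustration (not part of `stepwise` itself), using the built-in `swiss` dataset that also appears in the demonstrations:

```{r}
# The F test for adding Catholic to the smaller model, via anova().
data(swiss)
small <- lm(Fertility ~ Education, data = swiss)
big <- lm(Fertility ~ Education + Catholic, data = swiss)
anova(small, big)
```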

To keep it simple, I made no provision for forcing certain variables to be included in all models. The current version has the following properties.

* You can specify a data frame, using the optional data argument (as with `lm`).
* The code does some consistency checks (such as whether your sample size exceeds the number of variables, and whether your alpha-to-enter is less than your alpha-to-leave), but not whether the initial model is a subset of the full model. If one of the checks fails, the function will nag you and return `NA`.
* A constant term is optional. Whether or not a constant term is included is controlled by its presence/absence in the initial model specified, regardless of whether the full model has one.
* Both the full and initial models can be specified as formulas or as character vectors (strings). In other words, `y ~ x` and `"y ~ x"` should work equally well.

One other note: since the code uses R's `drop1` and `add1` functions, it respects hierarchy in models. That is, regardless of *p* values, it will not attempt to drop a term while retaining a higher order interaction involving that term, nor will it add an interaction term if the lower order components are not all present. (You can of course defeat this by putting interactions into new variables and feeding it what looks like a first-order model.)
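
To see that hierarchy protection in action, here is a small standalone sketch using the built-in `mtcars` data (not otherwise used in this notebook): with the `wt:hp` interaction in the model, `drop1` does not offer either main effect as a candidate for deletion.

```{r}
# drop1 respects marginality: only the interaction term is droppable here.
m <- lm(mpg ~ wt * hp, data = mtcars)
drop1(m, test = "F")  # wt:hp is the only term listed for deletion
```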

Consider this to be "beta" code (and feel free to improve it). I've done only limited testing on it beyond what you see in this notebook.

# Function definition

The following defines the stepwise function.

```{r}
#'
#' Perform a stepwise linear regression using F tests of significance.
#'
#' @param full.model the model containing all possible terms
#' @param initial.model the first model to consider
#' @param alpha.to.enter the significance level above which a variable may enter the model
#' @param alpha.to.leave the significance level below which a variable may be deleted from the model
#' @param data the data frame to use (optional, as with lm)
#'
#' @return the final model
#'
stepwise <- 
  function(full.model, initial.model, alpha.to.enter, alpha.to.leave, data = NULL) {
    # Sanity check: alpha.to.enter should not be greater than alpha.to.leave.
    if (alpha.to.enter > alpha.to.leave) {
      warning("Your alpha-to-enter is greater than your alpha-to-leave, which could throw the function into an infinite loop.\n")
      return(NA)
    }
    # Deal with a missing data argument.
    if (is.null(data)) {
      data <- parent.frame()
    }
    # Warning: horrible kludge coming!
    # Acquire the full and initial models as formulas. If they are
    # entered as formulas, convert them to get their environments
    # squared away.
    # Note: "showEnv = F" is necessary to avoid having an
    # environment identifier break things if the model is
    # defined inside a function.
    if (is.character(full.model)) {
      fm <- as.formula(full.model)
    } else {
      fm <- as.formula(capture.output(print(full.model, showEnv = F)))
    }
    if (is.character(initial.model)) {
      im <- as.formula(initial.model)
    } else {
      im <- as.formula(capture.output(print(initial.model, showEnv = F)))
    }
    # Fit the full model.
    full <- lm(fm, data);
    # Sanity check: do not allow an overspecified full model.
    if (full$df.residual < 1) {
      warning("Your full model does not have enough observations to properly estimate it.\n")
      return(NA)
    }
    msef <- (summary(full)$sigma)^2;  # MSE of full model
    n <- length(full$residuals);  # sample size
    # Fit the initial model.
    current <- lm(im, data);
    # Process consecutive models until we break out of the loop.
    while (TRUE) {
      # Summarize the current model.
      temp <- summary(current);
      # Print the model description.
      print(temp$coefficients);
      # Get the size, MSE and Mallows' Cp of the current model.
      p <- dim(temp$coefficients)[1]; # size
      mse <- (temp$sigma)^2; # MSE
      cp <- (n - p)*mse/msef - (n - 2*p);  # Mallows' Cp
      # Show the fit statistics.
      fit <- sprintf("\nS = %f, R-sq = %f, R-sq(adj) = %f, C-p = %f",
                     temp$sigma, temp$r.squared, temp$adj.r.squared, cp);
      # Show the fit itself.
      write(fit, file = "");
      write("=====", file = "");
      # Try to drop a term (but only if more than one is left).
      if (p > 1) {
        # Look for terms that can be dropped based on F tests.
        d <- drop1(current, test = "F");
        # Find the term with largest p-value.
        pmax <- suppressWarnings(max(d[, 6], na.rm = TRUE));
        # If the term qualifies, drop the variable.
        if (pmax > alpha.to.leave) {
          # We have a candidate for deletion.
          # Get the name of the variable to delete.
          var <- rownames(d)[d[,6] == pmax];
          # If an intercept is present, it will be the first name in the list.
          # There also could be ties for worst p-value.
          # Taking the second entry if there is more than one is a safe solution to both issues.
          if (length(var) > 1) {
            var <- var[2];
          }
          # Print out the variable to be dropped.
          write(paste("--- Dropping", var, "\n"), file = "");
          # Modify the formula to drop the chosen variable (by subtracting it from the current formula).
          f <- formula(current);
          f <- as.formula(paste(f[2], "~", paste(f[3], var, sep = " - ")), env = environment(f));
          # Fit the modified model and loop.
          current <- lm(f, data);
          next;
        }
      }
      # If we get here, we failed to drop a term; try adding one.
      # Note: add1 throws an error if nothing can be added (current == full), which we trap with tryCatch.
      a <- tryCatch(
        add1(current, fm, test = "F"),
        error = function(e) NULL
      );
      if (is.null(a)) {
        # There are no unused variables (or something went splat), so we bail out.
        break;
      }
      # Find the minimum p-value of any term (skipping terms with no
      # p-value). If none of the remaining terms has a p-value (true of
      # the intercept and any linearly dependent predictors), suppress
      # the warning about an empty list; the test for a suitable
      # candidate to add will then fail, because pmin is set to infinity.
      pmin <- suppressWarnings(min(a[, 6], na.rm = TRUE));
      if (pmin < alpha.to.enter) {
        # We have a candidate for addition to the model. Get the variable's name.
        var <- rownames(a)[a[,6] == pmin];
        # We have the same issue with ties and the presence of an intercept term, and the same solution, as above.
        if (length(var) > 1) {
          var <- var[2];
        }
        # Print the variable being added.
        write(paste("+++ Adding", var, "\n"), file = "");
        # Add it to the current formula.
        f <- formula(current);
        f <- as.formula(paste(f[2], "~", paste(f[3], var, sep = " + ")), env = environment(f));
        # Fit the modified model and loop.
        current <- lm(f, data = data);
        next;
      }
      # If we get here, we failed to make any changes to the model; time to declare victory and exit.
      break;
    }
    current
  }
```
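
As a sanity check on the Cp formula used inside `stepwise` (`cp <- (n - p)*mse/msef - (n - 2*p)`), note that evaluating it at the full model itself must yield Cp = p, since there `mse` equals `msef`. A standalone check, using the `swiss` data that appears in the demonstrations:

```{r}
# Mallows' Cp of the full model should equal its parameter count.
data(swiss)
full <- lm(Fertility ~ ., data = swiss)
n <- nrow(swiss)
p <- length(coef(full))          # intercept plus five predictors
msef <- summary(full)$sigma^2
cp <- (n - p) * msef / msef - (n - 2 * p)
cp  # equals p
```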

# Demonstrations

The rest of the notebook demonstrates the function in operation.

The first tests of the function will be done using the `swiss` dataset (47 observations of 6 variables) from the `datasets` package. We will (arbitrarily) use alpha = 0.05 to add a variable and alpha = 0.10 to remove one.

```{r}
data(swiss)
attach(swiss) # to save typing
aToEnter <- 0.05
aToLeave <- 0.10
```

The first invocation will start with just a constant term. Every predictor except `Examination` ends up in the final model.

```{r}
result <- stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, aToEnter, aToLeave)
```

The return value is an instance of a linear model.

```{r}
class(result)
```

It can be summarized, used for predictions, and so on, just like any other linear model.

```{r}
summary(result)
```
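
For instance, predictions for the first few provinces (refitting the selected model directly so this chunk stands on its own):

```{r}
# The model stepwise() settled on above, refit directly for illustration.
fit <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
          data = swiss)
predict(fit, newdata = head(swiss))
```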

The second invocation starts with the complete model and initially winnows it. We end up with the same model as the previous attempt (albeit with the variables listed in a different order).

```{r}
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, aToEnter, aToLeave)
```

Finally, we start with Education and Examination as the two predictors. The same final model wins out.

```{r}
stepwise(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ Examination + Education, aToEnter, aToLeave)
```

Whether the final model contains a constant term or not depends on how the initial model is specified (with or without one), irrespective of whether the full model contains a constant term.

First, we include the constant in the full model but not in the initial model.

```{r}
stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 0 + Examination + Education, aToEnter, aToLeave)
```

None of the models, including the last one, has an intercept. Now we reverse that, including it only in the initial model.

```{r}
stepwise(Fertility ~ 0 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1 + Examination + Education, aToEnter, aToLeave)
```

Every model has an intercept.

Before we abandon this data set, there is one other thing worth noting. Your alpha-to-enter *must not be larger* than your alpha-to-leave. Although you can get away with that sometimes, it carries the potential to put stepwise regression into an infinite loop (adding and dropping the same variable repeatedly), so the function disallows it.

```{r}
result <- stepwise(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Infant.Mortality, Fertility ~ 1, 0.2, 0.1)
```

The result of this is a missing model.

```{r}
result
```

We are done with the `swiss` dataset.

```{r}
detach(swiss)
rm(swiss)
```