
Understanding Data Mining

Lesson 2: Multivariate Least Squares Regression Model with Variable Selection

Objectives:

AP Statistics students will use R to investigate the multivariate least squares regression model with multiple explanatory variables, and to utilize three different variable selection techniques to determine the most appropriate model: forward selection, backward selection, and stepwise selection.

Background on Multivariate Least Squares Regression

A multivariate least squares regression model is based on several explanatory variables that contribute to the response variable. Recall the model for simple least squares linear regression (LSLR),

y = α + βx + ε,

which includes a solitary explanatory variable.

The model for the multivariate least squares regression (MLSR) is

yi = α + β1x1i + β2x2i + β3x3i + … + βnxni + εi

with n explanatory variables.

Each βi represents the contribution of the corresponding explanatory variable xi to the model, and α represents the y-intercept of the model.
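To make the notation concrete, here is a minimal sketch (using simulated data and made-up names, not the SENIC data) that generates observations from a two-variable version of this model and then recovers the coefficients with lm():

set.seed(1)                              # make the simulation reproducible
sim.x1 <- rnorm(100)                     # first explanatory variable
sim.x2 <- rnorm(100)                     # second explanatory variable
sim.y <- 2 + 3*sim.x1 - 1*sim.x2 + rnorm(100, sd = 0.5)   # alpha = 2, beta1 = 3, beta2 = -1
lm(sim.y ~ sim.x1 + sim.x2)              # estimates should be close to 2, 3, and -1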

We are going to consider three different methods for building the MLSR from a data set. The first method is called forward selection: starting from a model with no explanatory variables, variables are added one at a time, at each step choosing the variable that improves the model the most. The output at each step includes the MLSR equation as well as a corresponding correlation coefficient.

The second method is backward selection. All of the variables are placed into the first MLSR equation and then removed one at a time, at each step deleting the variable that contributes the least to the model. The correlation coefficient is given for the initial equation and after each subsequent deletion.

The third method is stepwise selection, which combines the other two. At each step, every candidate explanatory variable is evaluated, and a variable may be added to or removed from the equation, whichever change strengthens the model the most. With the stepwise selection method, the all-important decision is the acceptable r value at which to stop the variable selection process.

Set-up Work

  1. Check your working directory, and change it to the AP Statistics folder on the Shared Directory. (A sketch of these commands appears after this list.)
  2. Load the SENIC data set:
    senic<-read.table("senic.dat", header = T)
    
  3. Note the names of the variables in the SENIC data set: Stay, Risk, Age, Culture, X.Ray, Beds. (Refer to your first R assignment to find an explanation of these variables.)
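For steps 1 and 2, here is a minimal sketch; the path below is only a placeholder, so substitute the actual location of the AP Statistics folder on your Shared Directory:

getwd()                                           # print the current working directory
setwd("S:/Shared/AP Statistics")                  # placeholder path; use your own
senic <- read.table("senic.dat", header = TRUE)   # load the SENIC data
names(senic)                                      # confirm the variable names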

Multivariate Linear Regression

We will use the SENIC data set for practice in the lab.

Recall that in the first lab you were trying to find an explanatory variable that had a strong linear relationship with the response variable, Risk, the risk of a nosocomial infection. The possible explanatory variables are Stay, Age, Culture, X.Ray, and Beds. It would be tedious to type these names repeatedly in the MLSR commands, so let's rename them:

y<-senic$Risk
x1<-senic$Stay
x2<-senic$Age
x3<-senic$Culture
x4<-senic$X.Ray
x5<-senic$Beds

First, build a linear regression model for y using x1 and x2.

lm(y~x1+x2)

Did you get the line ŷ = 2.26510 + 0.38789x1 - 0.03106x2? What do all of these values represent?
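If the bare coefficients are hard to interpret, summary() adds standard errors, t statistics, and R-squared. A sketch (fit12 is just a name chosen here):

fit12 <- lm(y ~ x1 + x2)   # store the fitted model so we can reuse it
summary(fit12)             # coefficient table plus Multiple R-squared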

Find the correlation coefficient for this model:

cor(y, x1+x2)

Did you get 0.1977854? Do you think this is an appropriate linear model for predicting the risk of infection?
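A caution: cor(y, x1+x2) is the correlation between y and the plain sum x1 + x2, which is not the same thing as the multiple correlation coefficient of the fitted model. If you want the latter, one sketch is to take the square root of the R-squared reported by summary():

sqrt(summary(lm(y ~ x1 + x2))$r.squared)   # multiple correlation R for this model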

Let’s build another regression model for y, but using x3 and x4 this time.

lm(y~x3+x4)
cor(y, x3+x4)

Here is the output:

Coefficients:
(Intercept)        x3      x4  
 1.94092      0.05877      0.01820  

Did you get this? And what is the correlation coefficient for this new model?

Is the second model better than the first model? Why or why not?

Does the second model do a good job of predicting the risk of infection?

If we tried to do all of the possible combinations of variables with separate regression line commands, we would have 2^n - 1 = 2^5 - 1 = 31 different models to consider. That’s a lot of work!
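You can confirm that count in R; with n = 5 candidate variables there are 31 non-empty subsets:

2^5 - 1               # 31 models
sum(choose(5, 1:5))   # the same count, built up by subset size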

We’re going to let R do the work for us. There are three possible methods:

  1. Forward selection
  2. Backward selection
  3. Stepwise selection

Before we move on, we need to determine a stopping rule for the selection process. At what value of the correlation coefficient are we comfortable with the model we have identified? Let’s assume we want r > 0.8 at this point. (Keep in mind that R’s step() function, used below, adds and drops variables according to AIC, a model-comparison criterion in which lower is better; the r threshold is our own informal check on the final model.)

First, let’s fit all five variables to the model: lm(y~x1+x2+x3+x4+x5)

Output:

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5)

Coefficients: you should see the intercept followed by one estimate for each of x1 through x5.

Let’s call this the “full” model:

full<-lm(y~x1+x2+x3+x4+x5)

We will use this full model for the backward and stepwise selections. Second, we’re going to reorganize our data a bit by creating a data set from the x and y variables we have defined:

senic.dat<-cbind(y, x1, x2, x3, x4, x5)

If you type senic.dat, you will see that the data are now arranged with the response variable first, followed by the explanatory variables.
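One detail worth noting: cbind() returns a matrix rather than a data frame, which is why the selection commands below wrap it in data.frame(). A quick check (a sketch):

class(senic.dat)               # "matrix" (and "array" in recent versions of R)
head(data.frame(senic.dat))    # first few rows, viewed as a data frame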

Forward Selection Model

To fit the forward selection model, use the following general command. Here lm(y~1) is the intercept-only starting model, and scope specifies the smallest and largest models the search is allowed to consider:

forward1<-step(lm(y~1, data=data.frame(senic.dat)), scope=list(lower=~1, upper=~x1+x2+x3+x4+x5), direction="forward")

You will get a huge amount of output! Let’s try to get this output in a summary form:

summary(forward1)

And here are your results:

Call:

lm(formula = y ~ x3 + x1 + x5 + x4, data = data.frame(senic.dat))

Residuals:

     Min       1Q   Median       3Q      Max 
-1.99601 -0.73388  0.07781  0.66121  2.28819

Coefficients:

                Estimate   Std. Error  t value Pr(>|t|)    
(Intercept)    0.4137306  0.5311016   0.779  0.43768    
x3              0.0482073  0.0100638   4.790  5.34e-06 ***
x1              0.1836372  0.0578097   3.177  0.00194 ** 
x5              0.0013465  0.0005238   2.571  0.01151 *  
x4              0.0130965  0.0054932   2.384  0.01886 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9663 on 108 degrees of freedom
Multiple R-squared: 0.5001,     Adjusted R-squared: 0.4816 
F-statistic: 27.01 on 4 and 108 DF,  p-value: 1.540e-15

So your resulting line of best fit with multiple variables is

ŷ = 0.4137 + 0.0482x3 + 0.1836x1 + 0.0013x5 + 0.0131x4.

Note that the variable x2 is not present in this model! Why?
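One way to explore why (a sketch; AIC() is on a different scale from the AIC column that step() prints, but model comparisons agree, and lower is better):

AIC(lm(y ~ x1 + x3 + x4 + x5))        # the model forward selection chose
AIC(lm(y ~ x1 + x2 + x3 + x4 + x5))   # the same model with x2 added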

The command forward1$anova summarizes the selection path for the best-fitting model: which variable was added at each step, and how the AIC changed:

Step      Df   Deviance   Resid. Df   Resid. Dev          AIC
1         NA         NA         112     201.7407   67.4942836
2 + x3    -1  63.243478         111     138.4972   26.9912631
3 + x1    -1  27.627195         110     110.8700    3.8497048
4 + x5    -1   4.708954         109     106.1611    0.9453821
5 + x4    -1   5.307933         108     100.8531   -2.8506250

Backward Selection Model

Now, let’s run the backwards selection model:

backward1<-step(full, direction="backward")

Did you get the same results or model from the backward selection method? Run the summary(backward1) command and compare the Multiple R-squared values for both methods.
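As with the forward model, the object returned by step() records the path it took, so a sketch for inspecting the backward run:

summary(backward1)    # coefficients and Multiple R-squared for the chosen model
backward1$anova       # the sequence of deletions and the AIC at each step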

Stepwise Selection Method

Now that you’ve seen how the backward and forward selection methods work and their outputs, let’s run the stepwise selection method:

stepwise1<-step(full, direction="both")

With this particular data set, the stepwise selection method stopped after only two iterations.

Start:  AIC=-1.52
y ~ x1 + x2 + x3 + x4 + x5

        Df   Sum of Sq    RSS           AIC
- x2    1       0.595     100.853      -2.851
<none>                    100.258      -1.519
- x4    1       5.368     105.626       2.374
- x5    1       6.637     106.895       3.724
- x1    1       7.153     107.411       4.269
- x3    1      21.656     121.914      18.580

Step:  AIC=-2.85
y ~ x1 + x3 + x4 + x5

        Df   Sum of Sq    RSS           AIC
<none>                    100.853      -2.851
+ x2    1       0.595     100.258      -1.519
- x4    1       5.308     106.161       0.945
- x5    1       6.172     107.025       1.861
- x1    1       9.423     110.276       5.243
- x3    1      21.427     122.281       16.919

How do your summary results for the stepwise model compare to those of the first two selection models?
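One quick way to put the three side by side (a sketch) is to pull the R-squared out of each fitted model:

summary(forward1)$r.squared     # forward selection model
summary(backward1)$r.squared    # backward selection model
summary(stepwise1)$r.squared    # stepwise selection model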

What we see here is that this data set gives a reasonable linear model for predicting the risk of nosocomial infections, with a multiple regression R-squared of 0.5001 (equivalently, a multiple correlation of about 0.71, worth comparing with the r > 0.8 goal we set earlier).

Now that you’ve completed this lab, there is a group assignment to finish in groups of 2 or 3 students.