Understanding Data Mining
Author: | Celia Rowland |
Level: | High School |
Content Area: | Mathematics |
AP Statistics students will use R to investigate the multivariate least squares regression model with multiple explanatory variables, and will apply three variable selection techniques to determine the most appropriate model: forward selection, backward selection, and stepwise selection.
A multivariate least squares regression model is based on several explanatory variables that contribute to the response variable. The model for the LSLR is
y = α + βx + ε,
which includes a solitary explanatory variable. The model for the multivariate least squares regression (MLSR) is
yi = α + β1x1i + β2x2i + β3x3i + … + βnxni + εi,
with n explanatory variables.
Each βi represents the contribution of the corresponding explanatory variable to the model, and α represents the y-intercept of the model.
We are going to consider three different methods for building the MLSR from a data set. The first method is called forward selection: variables are added to the model one at a time, at each step adding whichever remaining variable improves the model the most. The output at each step will include the MLSR equation as well as a corresponding correlation coefficient.
The second method is backward selection. All of the variables are placed into the first MLSR equation and then deleted from the equation one at a time, at each step removing the variable that contributes least to the model. The correlation coefficient is given with the initial equation and after each subsequent deletion.
The third method is stepwise selection. In this process, each of the possible explanatory variables is evaluated to determine which one creates the strongest possible LSR line. The explanatory variable that provides the largest correlation coefficient is selected first. This process is repeated, adding explanatory variables to the equation in the order that maximizes the marginal gain in the correlation coefficient; unlike forward selection, variables already in the model may also be reconsidered and dropped at each step. With the stepwise selection method, the all-important decision is the acceptable r value at which to stop the variable selection process.
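Before we let R automate this, the greedy idea behind forward selection can be sketched by hand. The following illustration uses simulated data invented for this sketch (it is not the SENIC data); at each step it adds whichever remaining variable gives the fitted model the largest correlation with y:

```r
# Minimal forward-selection sketch on simulated data (illustration only).
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 + 3*x1 + x2 + rnorm(n)          # x3 carries no real signal
dat <- data.frame(y, x1, x2, x3)

chosen <- character(0)
for (i in 1:3) {
  remaining <- setdiff(c("x1", "x2", "x3"), chosen)
  # correlation of y with the fitted values of each candidate model
  r <- sapply(remaining, function(v) {
    fit <- lm(reformulate(c(chosen, v), response = "y"), data = dat)
    cor(dat$y, fitted(fit))
  })
  chosen <- c(chosen, remaining[which.max(r)])
  cat("step", i, ": y ~", paste(chosen, collapse = " + "),
      "  r =", round(max(r), 4), "\n")
}
```

With a stopping rule (say, r > 0.8), the loop would halt as soon as adding another variable no longer helps enough.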
We will use the SENIC data set for practice in the lab. Load it with:
senic<-read.table("senic.dat", header = T)
Recall that in the first lab you were trying to find an explanatory variable that had a strong linear relationship with the response variable, Risk, of a nosocomial infection. The possible explanatory variables are Stay, Age, Culture, X.Ray and Beds. It would be tedious to type these names repeatedly in the MLSR commands, so let's rename them:
y<-senic$Risk
x1<-senic$Stay
x2<-senic$Age
x3<-senic$Culture
x4<-senic$X.Ray
x5<-senic$Beds
First, build a linear regression model for y using x1 and x2.
lm(y~x1+x2)
Did you get the line y = 2.26510+0.38789x1 – 0.03106x2 ? What do all of these values represent?
Find the correlation coefficient for this model:
cor(y, x1+x2)
Did you get 0.1977854? Do you think this is an appropriate linear model for predicting the risk of infection?
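One caution worth flagging: cor(y, x1+x2) measures the correlation between y and the unweighted sum x1 + x2, which is not in general the model's multiple correlation coefficient. The multiple correlation R is the correlation between y and the model's fitted values, and it equals the square root of the Multiple R-squared that summary() reports. A small self-contained sketch on simulated data (the data here are made up for illustration):

```r
# The model's multiple correlation R vs. cor(y, x1 + x2), on simulated data.
set.seed(2)
n  <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n)

fit     <- lm(y ~ x1 + x2)
r_sum   <- cor(y, x1 + x2)        # correlation of y with the raw sum
r_multi <- cor(y, fitted(fit))    # multiple correlation R of the model
r_multi^2                         # matches summary(fit)$r.squared
```

Here r_multi agrees with sqrt(summary(fit)$r.squared), while r_sum can be quite different, since the raw sum weights x1 and x2 equally and ignores the signs and sizes of their coefficients in the model.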
Let’s build another regression model for y, but using x3 and x4 this time.
lm(y~x3+x4)
cor(y, x3+x4)
Here is the output:
Coefficients:
(Intercept)           x3           x4
    1.94092      0.05877      0.01820
Did you get this? And what is the correlation coefficient for this new model?
Is the second model better than the first model? Why or why not?
Does the second model do a good job of predicting the risk of infection?
If we tried to do all of the possible combinations of variables with separate regression line commands, we would have 2^n − 1 = 2^5 − 1 = 31 different models to consider. That's a lot of work!
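The count follows because each of the n = 5 candidate variables is either in or out of a model, minus the one empty model; a quick check in R:

```r
# Number of possible non-empty models from 5 candidate variables.
n <- 5
sum(choose(n, 1:n))   # models with 1, 2, ..., 5 variables: 31
2^n - 1               # same count in closed form: 31
```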
We’re going to let R do the work for us. There are three possible methods:
Before we move on, we need to determine a stopping rule for the selection process. At what value of the correlation coefficient are we comfortable with identification of the model? Let’s assume we want r > 0.8 at this point.
First, let’s fit all five variables to the model: lm(y~x1+x2+x3+x4+x5)
Output:
Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5)
Coefficients:
(your output will show the intercept followed by an estimate for each of x1 through x5, six values in all)
Let’s call this the “full” model:
full<-lm(y~x1+x2+x3+x4+x5)
We will use this full model for the backward and stepwise selections. Second, we’re going to change up our data a bit by creating our data set with the x and y variables we have defined:
senic.dat<-cbind(y, x1, x2, x3, x4, x5)
If you type senic.dat, you will see that the data set is now arranged with the response variable followed by the explanatory variables.
To fit the forward selection model, use the following general command:
forward1<-step(lm(y~1, data=data.frame(senic.dat)), scope=list(lower=~1, upper=~x1+x2+x3+x4+x5), direction="forward")
You will get a huge amount of output! Let’s try to get this output in a summary form:
summary(forward1)
And here are your results:
Call:
lm(formula = y ~ x3 + x1 + x5 + x4, data = data.frame(senic.dat))
Residuals:
Min 1Q Median 3Q Max
-1.99601 -0.73388 0.07781 0.66121 2.28819
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4137306 0.5311016 0.779 0.43768
x3 0.0482073 0.0100638 4.790 5.34e-06 ***
x1 0.1836372 0.0578097 3.177 0.00194 **
x5 0.0013465 0.0005238 2.571 0.01151 *
x4 0.0130965 0.0054932 2.384 0.01886 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9663 on 108 degrees of freedom
Multiple R-squared: 0.5001, Adjusted R-squared: 0.4816
F-statistic: 27.01 on 4 and 108 DF, p-value: 1.540e-15
So your resulting line of best fit with multiple variables is
ŷ = 0.413 + 0.0482x3 + 0.1836x1 + 0.0013x5 + 0.0131x4.
Note that the variable x2 is not present in this model! Why?
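The short answer is that step() does not actually use the correlation coefficient as its criterion: it compares models by AIC, which rewards fit but charges a penalty for every extra parameter, so a variable is kept only if it lowers the AIC. Since x2 barely changes the fit, dropping it lowers the AIC (from −1.52 to −2.85 in the stepwise output shown later). The same effect can be sketched on simulated data (invented for this illustration):

```r
# AIC penalizes extra parameters: adding a variable can raise AIC by at
# most 2, and lowers it only if the variable genuinely improves the fit.
# (Simulated data; x2 here plays the role of a do-nothing variable.)
set.seed(3)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 + rnorm(n)     # x2 does not enter the true model

aic1 <- AIC(lm(y ~ x1))       # the simpler model
aic2 <- AIC(lm(y ~ x1 + x2))  # one extra parameter; typically larger here
c(aic1, aic2)
```

step() simply automates this comparison across all candidate additions and deletions.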
The command forward1$anova will give you the analysis of variance results for the best fitting model:
Step Df Deviance Resid. Df Resid. Dev AIC
1 NA NA 112 201.7407 67.4942836
2 + x3 -1 63.243478 111 138.4972 26.9912631
3 + x1 -1 27.627195 110 110.8700 3.8497048
4 + x5 -1 4.708954 109 106.1611 0.9453821
5 + x4 -1 5.307933 108 100.8531 -2.8506250
Now, let’s run the backwards selection model:
backward1<-step(full, direction="backward")
Did you get the same model from the backwards selection method? Run the summary(backward1) command and compare the Multiple R-squared for both methods.
Now that you’ve seen how the backward and forward selection methods work and their outputs, let’s run the stepwise selection method:
stepwise1<-step(full, direction="both")
With this particular data set, the stepwise selection method stopped with only two iterations.
Start: AIC=-1.52
y ~ x1 + x2 + x3 + x4 + x5
Df Sum of Sq RSS AIC
- x2 1 0.595 100.853 -2.851
<none> 100.258 -1.519
- x4 1 5.368 105.626 2.374
- x5 1 6.637 106.895 3.724
- x1 1 7.153 107.411 4.269
- x3 1 21.656 121.914 18.580
Step: AIC=-2.85
y ~ x1 + x3 + x4 + x5
Df Sum of Sq RSS AIC
<none> 100.853 -2.851
+ x2 1 0.595 100.258 -1.519
- x4 1 5.308 106.161 0.945
- x5 1 6.172 107.025 1.861
- x1 1 9.423 110.276 5.243
- x3 1 21.427 122.281 16.919
How do your summary results for the stepwise model compare to that of the first two selection models?
What we see here is that these data give a reasonable linear model for predicting the risk of nosocomial infections, with a multiple regression R-squared of 0.5001.
Now that you’ve completed this lab, you have a group assignment to complete in groups of 2 or 3 students.