Kenan Fellows Program Logo and page header graphic

Understanding Data Mining

Lesson 1: Least Squares Linear Regression in R

Objectives:

AP Statistics students will use R to investigate the least squares linear regression model between two variables, the explanatory (input) variable and the response (output) variable. You will learn to identify which explanatory variable supports the strongest linear relationship with the response variable. You will examine data plots and residual plots for single-variable LSLR for goodness of fit. Residual plots will be examined for evidence of patterns that may indicate violation of underlying assumptions.

Simple Linear Regression Model

We will examine the bi-variate data set first to determine the linear relationship between two attributes of individuals or objects in a population. While it does not matter which variable is considered to be x when evaluating the strength of the relationship, it does matter when we establish the regression model. Remember that we routinely use x as the explanatory variable and y as the response variable. In R you may assign any name you so choose to your variables.

Recall that the LSLR line is written in the form y=α+ βx. If the explanatory variable is a perfect predictor of the response variable, then there will be no variation from the line. This is often not the case. The LSLR line is called the “line of best fit” because it minimizes the amount of variation or error of each observation from the fitted line. So the true model is written as y = α + βx + ε where ε represents the variation from the observed value of the response variable to the predicted value of the response variable. You should recognize those error terms as residuals.

The Problem

SENIC is the Study on the Efficacy of Nosocomial Infection Control and its purpose was to determine whether infection surveillance and control programs have reduced the rates of nosocomial (hospital-acquired) infection in U.S. hospitals. This data set consists of 113 randomly selected hospitals from the original 338 hospitals surveyed. The statistics are either average values for the hospital for the study period or ratios for the hospital based on total frequency during the study period. None of these variables are categorical variables.

The headers for each column are:

  • Stay = average length of stay of all patients in the hospital in days
  • Risk = average estimated probability of acquiring infection while in hospital (in percent) (this is generally the y value)
  • Age = average age of patients (in years)
  • Culture= the ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100.
  • X.Ray = the ratio of number of x-rays performed to number of patients without signs or symptoms of pneumonia, times 100.
  • Beds = average number of beds in hospital during study period

Source: Applied Linear Statistical Models, 3rd ed, Neter, J., Wasserman, W., Kutner, M. (Boston, 1990) Appendix B

First Tasks

  1. All data files for your labs are located in the Shared Directory in the AP Statistics folder. Use getwd() to determine which directory R is currently using. To change the working directory, use the drop-down menu under File and search for the location of the R folder you wish to use.
  2. Load the first data set – it is called “senic.dat”. Use the read.table command to load this data into R and change the name of the data file into something more reasonable to use:
    senic<-read.table(“senic.dat”, header = T)
    
  3. Type the name of the data file into R which will result in a data dump.
    senic
    

    If you don’t get a data dump, then the data file is not correctly loaded in R’s working directory.

  4. Note the names of all variables and how they are typed – R is case sensitive!!
  5. Change the names of the variables with the following commands:
    stay← senic$Stay
    risk← senic$Risk
    age←senic$Age
    lab←senic$Culture
    xray←senic$X.Ray
    beds←senic$Beds
    

    These shortened names will make it easier to work within the lab.

Data Files & Data Plots

We will begin this exercise by plotting the data set using the plot() command:

plot(filename$variable1, filename$variable2)

Because we already shortened the names, use this command:

plot(stay, risk)

Remember that your plot should have the response variable along the vertical axis and the explanatory variable along the horizontal axis. R will automatically fit the data scale for each axis for all plots.

Your plot should look like the one below.

Note that the output includes labels on the x- and y-axis in the same format as the variables you used in the plot command. If you want to change those labels, you can do so and make them more relevant to the data set. You can add a heading to the graph as well. All of this can be done at one time with the plot command.

plot(stay, risk, xlab=”Length of stay”, ylab=”Risk of infection”, main=”Risk of Nosocomial Infection”)

Here, xlab is the x-axis label and ylab is the y-axis label. The header for the graph is created with the main command. The labels and header are all entered with double quote marks.

A more professional-looking scatterplot is on the next page.

For practice create at least two scatterplots with different explanatory variables from the data set. Be sure to put in full labels for the axes and the header. If you wish to keep your plot, you can right-click on the plot and copy as bitmap. Then paste the graph into Word and save your Word file. The first scatterplot should look like this one:

And the second plot should look like this one:

Least Squares Linear Regression

Next you will run a simple linear regression with two variables from this data set. You want to find a predictor for the risk of hospital-acquired infection, the variable Risk from the SENIC data set. You can consider Length, Age, Lab, Chest or Beds for the explanatory variable. You have already created a scatterplot of Risk vs Length (of stay) in the first plot and it looks roughly linear, so let’s examine this relationship.

The command for simple linear regression is lm(response variable~explanatory variable).

lm(risk~stay)

Here is the output from that command:

 
Coefficients:
(Intercept) stay
0.7437 0.3742
 

How would you interpret this output? ____________________________________________________________________

What is the equation of the LSLR?__________________________________________________________________________

Let’s called this model1. Run the command to create model1:

model1< lm(risk~stay)

Then type the command model1$fitted.values

What do these numbers represent?

       1        2        3        4        5        6        7        8
3.411654 4.044034 3.864423 4.092679 4.934605 4.395772 4.365837 4.927122
       9       10       11       12       13       14       15       16
3.987906 4.051518 4.885961 3.849456 5.525825 3.580039 4.111388 4.889703
      17       18       19       20       21       22       23       24
3.841972 5.091765 4.133840 4.242355 3.561330 4.575383 4.403256 4.425707
      25       26       27       28       29       30       31       32
4.186226 3.841972 4.227387 3.808295 5.102991 4.444417 4.870993 4.425707
      33       34       35       36       37       38       39       40
5.147893 5.828918 4.388289 4.609060 4.474352 3.677328 4.661447 3.797069
      41       42       43       44       45       46       47       48
3.916810 4.754994 4.934605 4.530481 3.875649 4.545448 8.062830 4.822348
      49       50       51       52       53       54       55       56
3.613716 4.066485 5.039378 4.197452 5.013185 5.260150 3.972938 4.915896
      57       58       59       60       61       62       63       64
3.415396 3.606232 4.758736 5.031895 4.642737 4.927122 3.711005 4.358353
      65       66       67       68       69       70       71       72
3.654877 4.268548 4.493062 3.954229 4.339644 3.748424 3.508943 3.392944
      73       74       75       76       77       78       79       80
4.309709 4.504287 3.905584 3.250752 4.073969 4.571641 4.066485 4.597835
      81       82       83       84       85       86       87       88
4.781188 3.714747 3.598749 4.025325 3.770876 4.130098 3.703522 4.631512
      89       90       91       92       93       94       95       96
4.246096 5.013185 4.059002 4.085195 4.081453 3.793327 4.399514 3.939261
      97       98       99      100      101      102      103      104
3.984164 5.237699 3.718489 4.541706 4.395772 4.444417 3.415396 5.963627
     105      106      107      108      109      110      111      112
4.276032 4.784929 3.415396 3.744682 5.159119 4.298483 3.624942 7.456643
     113
4.264806

These values are the predicted y-values ( y ) based on the LSLR! Let’s shorten the predicted y-values to yfit:

yfit<-model1$fitted.values

Before you can plot the line of the model onto a scatterplot, you must have the scatterplot available.

plot(stay, risk)

To plot the model onto the scatterplot use the command:

lines(stay, yfit)

Your output should look like the graph below.

Well, it looks roughly linear. Let’s examine the residual plot. Remember that a residual is the difference between the observed value for y at some particular x and the predicted value for y at that same x value.

Residual = A−A.

The y-values here are the values of Risk and the predicted values are the values associated with yfit . Define the variable resid as outlined below, then plot the x-variable against the residual.

resid←risk – yfit
plot(stay, resid)

So…does the residual plot tell you anything? Describe here what you see in the graph.

Lastly, let’s look at the correlation coefficient for this particular model:

cor(stay, risk)

What value did you get for r? How would you interpret this value in the context of the problem?