# Resources

## Supplemental Information

Some basic R commands are summarized below:

| Command | Use |
| --- | --- |
| `set.seed()` | To ensure that random results are the same each time |
| `runif()` | To generate random values from a uniform distribution |
| `plot()` | To create a scatterplot of data |
| `lm()` | To fit a linear model |
| `lines()` | To add a line (such as a fitted model) to a scatterplot |
| `anova()` | To generate an analysis of variance |
| `step()` | To do forward, backward, or stepwise selection |
| `summary()` | To summarize the results of a fitted model |
| `title()` | To give a scatterplot a title |
| `cbind()` | To combine data into columns |
| `write.table()` | To write data to a file as a table |
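As a quick illustration, many of the commands above can be combined into one short R session. The data here are simulated (the variable names, sample size, and coefficients are arbitrary, and `rnorm()` is used to add noise even though it is not in the table):

```r
# Reproducible simulated data
set.seed(42)                              # same results each run
x <- sort(runif(50, min = 0, max = 10))   # 50 random predictor values
y <- 3 + 2 * x + rnorm(50)                # linear relationship plus noise

model <- lm(y ~ x)                        # fit a linear model
summary(model)                            # coefficients, R-squared, etc.

plot(x, y)                                # scatterplot of the data
lines(x, fitted(model))                   # add the fitted line
title("Simulated linear data")            # give the plot a title

out <- cbind(x, y)                        # combine the data into columns
write.table(out, "sim_data.txt", row.names = FALSE)  # save as a table
```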

### Linear Models

Students first study linear models, in their simplest form, in Algebra I. They use graphing calculators to find linear regression models of data sets and then use these equations to make predictions about other observations. In Algebra I, students write their equations in slope-intercept form, y = mx + b.

Moving beyond the Algebra I classroom, linear models tend to change in appearance. Instead of focusing on a single independent variable and a single dependent variable, a linear model can also be used to study multiple independent variables. In this case, the general form of a linear model is:

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ

where y is the dependent/response variable, β₀ is the intercept, and β₁, β₂, …, βₖ are the regression coefficients of the predictor variables x₁, x₂, …, xₖ.

Regardless of their form or appearance, linear models are useful because they summarize a dataset with a single, interpretable equation: least-squares regression chooses the line (or plane) that minimizes the sum of squared residuals. Linear models are also frequently used to make predictions for new observations. Because they show general patterns in data, they allow us to engage in conversations about a dataset that we could not have if we had to examine every observation at once.
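A brief sketch of using a fitted model for prediction, on simulated data (the variable names and coefficients are illustrative; `predict()` is standard R, though it does not appear in the command table above):

```r
# Simulated data with a known linear relationship
set.seed(1)
x <- runif(40, min = 0, max = 10)
y <- 5 + 1.5 * x + rnorm(40)

model <- lm(y ~ x)                        # fit the linear model

# Predict the response for two new observations, x = 2 and x = 8
new_obs <- data.frame(x = c(2, 8))
preds <- predict(model, newdata = new_obs)
preds
```

The fitted equation stands in for the full dataset: to discuss what response to expect at x = 2, we no longer need the original 40 points.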

### Multiple Variables

In Algebra I, students study data sets with one predictor variable and one response variable. In the real world, however, most response variables have numerous predictor variables, many of which may have a significant impact on the data. These variables may also have differing effects on the situation at hand, so it is important to identify their effects and then use them appropriately to make sound, valid predictions.
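In R, adding predictors is a matter of extending the model formula. A minimal sketch with two simulated predictors (names and coefficients are arbitrary):

```r
# Simulated data: the response depends on two predictors
set.seed(7)
n  <- 100
x1 <- runif(n, 0, 10)                 # first predictor
x2 <- runif(n, 0, 5)                  # second predictor
y  <- 2 + 0.8 * x1 - 1.2 * x2 + rnorm(n)

# Two predictors in one formula: y ~ x1 + x2
model <- lm(y ~ x1 + x2)
summary(model)   # a separate coefficient (and p-value) for each predictor
```

The `summary()` output lets us see each variable's estimated effect on the response while the other variable is held fixed.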

### Variable Selection

Sometimes, there may be a very large number of variables involved in a set of data. Some of these variables might be correlated with one another, and some might be irrelevant. As a result, we must utilize variable selection techniques to improve our ability to make good predictions and to better observe the impact of specific subsets of variables.

Numerous methods of variable selection exist. At this point, our focus is on some of the traditional methods—forward, backward, and stepwise selection.
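All three traditional methods are available through R's `step()` function via its `direction` argument. A sketch on simulated data in which one predictor (`x3`) is deliberately irrelevant (the names, sample size, and coefficients are arbitrary):

```r
# Simulated data: y depends on x1 and x2, but not x3
set.seed(11)
n  <- 100
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.5)

full <- lm(y ~ x1 + x2 + x3)   # model with all candidate variables
null <- lm(y ~ 1)              # intercept-only model

# Backward: start from the full model and drop variables
backward <- step(full, direction = "backward", trace = 0)

# Forward: start from the null model and add variables
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

# Stepwise ("both"): variables may be added and later removed
stepwise <- step(null, scope = formula(full), direction = "both", trace = 0)

formula(stepwise)   # the selected model
```

By default, `step()` compares models using AIC; `trace = 0` suppresses the step-by-step printout.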

## Critical Vocabulary

• scatterplot: used to compare the values of bivariate data
• correlation: describes the degree of relationship between two variables
• linear regression: a model of the relationship between two variables that can be used to predict other values
• data mining: a method commonly used to extract useful information from large sets of data
• forward selection: a method of variable selection in which no variables are initially considered, but then are added if found to be relevant to a given situation
• backward selection: a method of variable selection in which all variables are initially considered, but then are eliminated if found to be irrelevant to a given situation
• stepwise selection: a method of variable selection similar to forward selection, but variables that may initially be added may be later deleted in an attempt to identify the most relevant variables
• analysis of variance: a technique used to test the effect of variables on a response
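Several of these terms can be seen together in one short R session on simulated data (the variable names are arbitrary, and `cor()` is standard R for correlation, though it is not in the command table above):

```r
# Simulated bivariate data with a strong linear relationship
set.seed(3)
x <- runif(60, 0, 10)
y <- 4 + 2 * x + rnorm(60)

cor(x, y)            # correlation: close to 1 for strongly linear data

model <- lm(y ~ x)   # linear regression of y on x
anova(model)         # analysis of variance: tests the effect of x on y
```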

## Websites

Evolution of Data Mining
http://www.thearling.com/text/dmwhite/dmwhite.htm

The R Project for Statistical Computing
http://www.r-project.org/