Kenan Fellows Program

Understanding Data Mining: Extracting, Organizing, and Analyzing Large Sets of Data

Lesson Two: Learning to Use R Statistical Software for Data Mining—An Extension of Linear Regression to Multiple Variables

Introduction

In Algebra I, students study data sets with one predictor variable and one response variable. In the real world, however, most response variables depend on numerous predictor variables, many of which may have a significant effect on the response. These variables may also differ in the size of their effects, so it is important to identify those effects and use them appropriately to make sound, valid predictions. In this lesson, students will use R Statistical Software to work through the basics of data mining, a process in which the effects of individual variables can be determined. Students will use three methods of variable selection (forward selection, backward selection, and stepwise selection) to determine which variables are most influential in a given situation.
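
For teachers who want a quick preview, the three selection methods named above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not the code from the student packet: it uses R's built-in mtcars data set (not a data set from this lesson) and the step() function, which compares models by AIC. The packet may use a different data set or a different selection criterion.

```r
# Built-in example data: 32 cars, response mpg, ten candidate predictors.
# (Assumption: mtcars stands in for whatever data set the lesson packet uses.)
data(mtcars)

# Smallest candidate model (intercept only) and largest (every predictor).
null_model <- lm(mpg ~ 1, data = mtcars)
full_model <- lm(mpg ~ ., data = mtcars)

# Range of models the search is allowed to visit.
search_scope <- list(lower = ~ 1, upper = formula(full_model))

# Forward selection: start with no predictors and add one at a time.
forward_fit <- step(null_model, scope = search_scope,
                    direction = "forward", trace = 0)

# Backward selection: start with every predictor and drop one at a time.
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Stepwise selection: at each step a predictor may be added or dropped.
stepwise_fit <- step(null_model, scope = search_scope,
                     direction = "both", trace = 0)

# Show which predictors each method kept.
print(formula(forward_fit))
print(formula(backward_fit))
print(formula(stepwise_fit))
```

The three methods do not always agree on the final set of variables, which can make a useful discussion point with students.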

Learning Outcomes

At the end of this lesson, students should be able to use R to:

  • identify influential variables in a multivariate data set using forward selection, backward selection, and stepwise selection
  • develop linear models that can be used to make predictions

Students should also be able to:

  • explain the purpose of data mining
  • understand the basic principles behind the forward, backward, and stepwise selection processes
  • identify some real-world uses of data mining
  • use linear models to make predictions
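
As an illustration of the last outcome, the sketch below fits a linear model with two predictors and uses it to make a prediction. The data set (R's built-in mtcars), the choice of predictors (wt and hp), and the values for the new car are all assumptions made for the example. They are not taken from the lesson packet.

```r
# Built-in example data (assumption: not the data set used in the packet).
data(mtcars)

# Model fuel economy (mpg) from weight (wt, in 1000 lbs) and horsepower (hp).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# A hypothetical new car: 3000 lbs and 120 horsepower.
new_car <- data.frame(wt = 3.0, hp = 120)

# predict() plugs the new values into the fitted equation.
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)

# The same prediction by hand: intercept + slope_wt * wt + slope_hp * hp.
by_hand <- sum(coef(fit) * c(1, 3.0, 120))
print(by_hand)
```

Having students compute the prediction by hand from the coefficients, and then check it against predict(), reinforces the link to the one-variable regression equations from Algebra I.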

Classroom Time Required

This lesson requires one block period or two traditional periods (approximately ninety minutes in total). It is the second lesson in a set of three.

Materials Needed

  • The Basics of R handout
  • Using R for Data Mining packet
  • Pencil or pen
  • One computer per student
  • One computer with projector for teacher
  • R Statistical Software (used with permission from R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.)

Pre-Activities

Prior to completing this lesson, students should be able to:

  • Distinguish between independent and dependent variables
  • Create scatterplots and describe correlation
  • Use a graphing calculator to find linear regression models for data sets
  • Use R Statistical Software to create scatterplots, to calculate linear models, and to determine correlation of variables
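
A short warm-up along the following lines can be used to check the R skills listed above before starting the lesson. It is a sketch only, and it assumes R's built-in mtcars data set in place of any data set from The Basics of R handout.

```r
# Built-in example data (assumption: a stand-in for the handout's data).
data(mtcars)

# Scatterplot of fuel economy against weight.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")

# Linear model with one predictor, drawn on top of the scatterplot.
simple_fit <- lm(mpg ~ wt, data = mtcars)
abline(simple_fit)

# Correlation between the two variables.
r <- cor(mtcars$wt, mtcars$mpg)

print(coef(simple_fit))  # intercept and slope of the fitted line
print(r)                 # correlation coefficient
```

Students who can read the slope and the correlation from this output are ready to extend the same ideas to several predictors.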

Prior to implementing this lesson, teachers should:

  • Become familiar with basic R commands
  • Develop data sets that may be used for demonstration purposes, if necessary
  • Ensure that R (free software) is installed on student computers
  • Understand the basics of variable selection

Activities

Distribute the Using R for Data Mining packet to students. Lead students through the information in section I, and then allow students to complete section II on their own for approximately ten minutes. Have students share some of their responses with the class, and then begin guiding students through section III. Encourage students to take their time when typing the code, and also emphasize to them the complexity of the work that the software is doing. After completing section III as a group, allow students approximately ten minutes to complete section IV. At the end of class, have students share some of the uses of data mining they found, and compile a list for future reference.

Modifications

This activity can be used in almost any type of classroom setting. It may be helpful to pair students who have learning disabilities, or who are English language learners, with higher-achieving students. Also, identify “helpers”: students who quickly catch on to how the R environment works and can help you keep other students on track. One teacher trying to help thirty students write programming code can be a frustrating experience for everyone involved.

File Links for Supporting Materials