Kenan Fellows Program

Understanding Data Mining: Extracting, Organizing, and Analyzing Large Sets of Data

Lesson Three: Putting it All Together

Introduction

In Algebra I, students study data sets with one predictor variable and one response variable. In the real world, however, most response variables depend on numerous predictor variables, many of which may have a significant effect. These variables may also differ in how strongly they affect the response, so it is important to identify their effects and use them appropriately to make sound, valid predictions. In this lesson, students will create their own data set and then use R Statistical Software to mine the data, attempting to identify the variables that most significantly affect the selling price of a home. Students will use three methods of variable selection (forward selection, backward selection, and stepwise selection) to determine which variables are most influential.
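
For teachers who want to preview the three selection methods before class, the following is a minimal sketch, not the packet's own procedure. It assumes base R's step() function, which adds or drops variables by comparing AIC values; the packet may use a different criterion. The data frame homes, its column names, and every number in it are simulated purely for illustration.

```r
# Simulated home-sale data. All names and values are made up for illustration.
set.seed(42)
n <- 60
homes <- data.frame(
  sqft      = round(runif(n, 900, 3500)),
  bedrooms  = sample(2:5, n, replace = TRUE),
  age       = sample(0:50, n, replace = TRUE),
  lot_acres = round(runif(n, 0.1, 1.5), 2)
)
# By construction, price depends only on sqft and age, plus random noise.
homes$price <- 50000 + 110 * homes$sqft - 1500 * homes$age + rnorm(n, sd = 20000)

# Starting points: a model with no predictors and a model with all of them.
null_model <- lm(price ~ 1, data = homes)
full_model <- lm(price ~ sqft + bedrooms + age + lot_acres, data = homes)
largest    <- price ~ sqft + bedrooms + age + lot_acres

# Forward selection: start empty, add one variable at a time.
forward <- step(null_model, scope = largest, direction = "forward", trace = 0)

# Backward selection: start full, remove one variable at a time.
backward <- step(full_model, direction = "backward", trace = 0)

# Stepwise selection: start empty, allow both adding and removing.
stepwise <- step(null_model, scope = largest, direction = "both", trace = 0)

# Show which variables each method kept.
formula(forward)
formula(backward)
formula(stepwise)
```

Setting trace = 1 instead of trace = 0 prints each step, which may help students see how a variable enters or leaves the model.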

Learning Outcomes

At the end of this lesson, students should be able to use R to:

  • identify influential variables in a multivariate data set using forward selection, backward selection, and stepwise selection
  • develop linear models that can be used to make predictions

Students should also be able to:

  • explain the purpose of data mining
  • understand the basic principles behind the forward, backward, and stepwise selection processes
  • identify some real-world uses of data mining
  • use linear models to make predictions
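
As one possible illustration of the last outcome, a linear model fitted in R can be used to predict a selling price with predict(). This is only a sketch: the six homes below, the column names, and the new home being priced are all invented for the example.

```r
# Six invented homes. These values are for illustration only.
homes <- data.frame(
  sqft  = c(1200, 1500, 1800, 2100, 2400, 2700),
  age   = c(30, 12, 25, 8, 15, 3),
  price = c(182000, 231000, 249000, 305000, 322000, 371000)
)

# Fit a linear model with two predictor variables.
model <- lm(price ~ sqft + age, data = homes)

# Describe a hypothetical new home using the same column names.
new_home <- data.frame(sqft = 2000, age = 10)

# Predict its selling price from the fitted model.
predicted_price <- predict(model, newdata = new_home)
predicted_price

# The same prediction by hand: intercept + slope * sqft + slope * age.
by_hand <- sum(coef(model) * c(1, 2000, 10))
by_hand
```

Computing the prediction by hand from the coefficients connects the R output back to the equation of a line that students already know from Algebra I.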

Classroom Time Required

This lesson requires one block period or two traditional periods (approximately ninety minutes in total). This is the third lesson in a set of three.

Materials Needed

  • The Basics of R handout
  • Putting it All Together packet
  • Pencil or pen
  • One computer per student
  • One computer with projector for teacher
  • R Statistical Software (used with permission from R Development Core Team (2008).
    R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0,
    URL http://www.R-project.org.)
  • Microsoft Excel
  • MLS Search Engine (I recommend www.fmrealty.com for homes in the Triangle)

Pre-Activities

Prior to completing this lesson, students should be able to:

  • Distinguish between independent and dependent variables
  • Create scatterplots and describe correlation
  • Use a graphing calculator to find linear regression models for data sets
  • Use R Statistical Software to create scatterplots, calculate linear models, determine the correlation of variables, and complete basic data mining tasks
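
As a quick check of the R prerequisites above, students should be comfortable with something like the following sketch. It uses only base R functions (plot, cor, lm, abline); the two vectors are invented values, not data from the packet.

```r
# Invented square footage and selling prices for six homes.
sqft  <- c(1200, 1500, 1800, 2100, 2400, 2700)
price <- c(182000, 231000, 249000, 305000, 322000, 371000)

# Scatterplot of the response (price) against the predictor (sqft).
plot(sqft, price,
     xlab = "Square feet",
     ylab = "Selling price ($)",
     main = "Selling price vs. square footage")

# Correlation coefficient between the two variables.
r <- cor(sqft, price)
r

# Linear model with one predictor, then draw the fitted line on the plot.
fit <- lm(price ~ sqft)
abline(fit)

# Intercept and slope of the fitted line.
coef(fit)
```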

Prior to implementing this lesson, teachers should:

  • Become familiar with basic R commands
  • Develop data sets that may be used for demonstration purposes, if necessary
  • Ensure that R (free software) is installed on student computers
  • Understand the basics of variable selection

Activities

Distribute the Putting it All Together packet to students. Lead students through the overall outline of their assignment. It may be helpful to go through the initial search engine setup with students so they can easily find the information they will need to create their own set of data. At this point in the instructional unit, students should be able to use their previous lesson packets to guide them through this assignment independently. At the end of the lesson, students should print a copy of their R script (including all commands used) and attach it to their lesson packets.
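
If students record their listings in Microsoft Excel, one way to bring the data into R is to save the sheet as a CSV file and read it with read.csv(). The packet may describe a different route, so treat this as a sketch under that assumption. The file name my_homes.csv and its columns are hypothetical; the example writes a small stand-in file first so that it runs on its own.

```r
# Stand-in for a file saved from Excel as CSV.
# The file name and column names are hypothetical.
csv_path <- file.path(tempdir(), "my_homes.csv")
writeLines(c(
  "price,sqft,bedrooms,bathrooms",
  "245000,1650,3,2",
  "312000,2200,4,2.5",
  "189900,1300,2,1"
), csv_path)

# Read the file into a data frame. The first row supplies the column names.
homes <- read.csv(csv_path)

# Check that every column was read as a number and look at the prices.
str(homes)
summary(homes$price)
```

In class, students would replace csv_path with the location of their own saved file.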

Modifications

This activity can be used in almost any type of classroom setting. It may be helpful to pair students who have learning disabilities, or who are English language learners, with higher-achieving students. Also, identify “helpers”: students who quickly catch on to how the R environment works and can help you keep other students on track. One teacher trying to help thirty students write programming code can be a frustrating experience for everyone involved.

File Links for Supporting Materials