# Understanding Data Mining

Author: | Celia Rowland |

Level: | High School |

Content Area: | Mathematics |

Corporations use data mining to track consumer purchases of products ranging from gas to oranges to Coca-Cola™. Hospitals build predictive models of infection rates, demand for x-rays, and anticipated births. Today most Fortune 500 corporations use data mining to create credible models that anticipate the demand for goods and services. In these explorations of data analysis and model fitting, students go beyond the textbook lessons on linear regression in AP Statistics. Using R statistical software gives them a real taste of how data analysis is conducted in 21st-century business environments around the globe. You, the teacher, will act as a consultant rather than a director during these lessons.

Students should be able to answer these essential questions after completion of these lessons:

- What kinds of graphs show whether a predictive least-squares linear model is appropriate for a data set? How are these graphs created?
- How is the “best” linear model determined for a data set, and which statistics reported by the software commands provide this information?
- What methods are used to create predictive models with multiple explanatory variables?
- How is the “best” multiple-variable model determined?
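
The first two questions can be previewed with a minimal sketch in R. It is not taken from the lesson handouts: it assumes the `cars` data set that ships with R (stopping distance versus speed) in place of the lesson data files, and it shows one common way to fit a least-squares line, read off its summary statistics, and draw a residual plot.

```r
# Illustrative only: 'cars' is bundled with R and is not one of the lesson files.
data(cars)

# Fit a least-squares line: stopping distance (response) on speed (explanatory).
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope, then the coefficient of determination (R-squared).
coef(fit)
summary(fit)$r.squared

# Scatterplot with the fitted line.
plot(dist ~ speed, data = cars, main = "Stopping distance vs. speed")
abline(fit)

# Residual plot: a patternless band around zero supports a linear model.
plot(cars$speed, resid(fit), xlab = "speed", ylab = "residual",
     main = "Residuals vs. speed")
abline(h = 0, lty = 2)
```

Students can compare the slope and intercept printed by `coef(fit)` with the values their calculators report for the same data.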

These lesson plans are aligned to the North Carolina Standard Course of Study, the AP Statistics course requirements and the National Council of Teachers of Mathematics Standards.

NCTM Standard: Algebra – Use mathematical models to represent and understand quantitative relationships. (1)

NCSCOS (2005) Competency Goal 4 – The learner will analyze bivariate data to solve problems. (Objective 4.01a, Objective 4.01b, Objective 4.01c) (2)

AP Statistics Topic 1D – Exploring bivariate data. (3)

Lesson 1 requires one block period or two 45-minute classes on sequential days. This includes time to download and install R statistical software. Ideally, this lesson should occur immediately after linear regression has been introduced in the classroom. Cover residual analysis before this lab as well; students understand residual plots better when the concepts have already been introduced in the classroom.

Lesson 2 requires one block period or two 45-minute classes on sequential days. Lesson 2 can occur at any time following Lesson 1. For continuity’s sake, this lesson should be completed prior to moving to another unit in the course.

Each lesson has separate student handouts that contain the R commands and an outline of the lesson.

Lesson 1 has two handouts: (1) Installing and Working with R Statistical Software, (2) Least Squares Linear Regression in R.

Lesson 2 has one handout: Multivariate Least Squares Regression Model with Variable Selection.
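
As a rough preview of what multivariate regression with variable selection can look like in R, here is a minimal sketch. It is an assumption-laden illustration, not the handout's procedure: it uses the `mtcars` data set bundled with R, an arbitrary choice of four explanatory variables, and backward elimination by AIC via `step()`, which is only one of several common selection methods.

```r
# Illustrative only: 'mtcars' ships with R and is not one of the lesson data files.
data(mtcars)

# Full model: fuel economy explained by four candidate variables.
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

# Compare the two models: retained terms and adjusted R-squared.
formula(reduced)
summary(full)$adj.r.squared
summary(reduced)$adj.r.squared
```

Adjusted R-squared is used for the comparison because plain R-squared never decreases when a variable is added, so it cannot identify a "best" subset on its own.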

Data files are provided for each lesson as Excel spreadsheets with directions on how to convert the file to the proper format for use with R.

You will need the statistical software program R and computers running Windows. (A version is also available if your school has Mac computers.) Download the software at http://www.r-project.org. You will also need a central location where students can access the data files. Some examples are your own web page, the course site on Blackboard, or a shared folder on the school’s server. These files do not work well if placed in Google Documents, as the statistical software cannot read them there. Students may want a flash drive to save their files if they do not have space on the school’s server.

- Download and install the latest version of R to your school’s server, noting any steps in the installation process that differ from those in the lesson Installing and Working with R Statistical Software. You may want to modify this lesson with specific instructions related to your school’s network system.
- Convert the data files from Excel format to .dat files and install to a central location.
- Work through each lesson on your own to familiarize yourself with the work that students will undertake.
- Make sufficient copies of the handouts for students prior to computer lab time. You may wish to place a copy of the lessons on your own web page.
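
When checking the converted files, a quick round trip in R can confirm that a .dat file loads as expected. The sketch below is hypothetical: the column names and values are made up, and it assumes the .dat files are whitespace-delimited text with a header row, which the lesson directions may or may not match exactly.

```r
# Hypothetical data standing in for a lesson file; the column names are made up.
sample_data <- data.frame(price  = c(2.10, 2.35, 2.60, 2.85),
                          demand = c(410, 380, 345, 330))

# Write a whitespace-delimited text file with a header row (assumed .dat layout).
dat_path <- file.path(tempdir(), "sample.dat")
write.table(sample_data, dat_path, row.names = FALSE, quote = FALSE)

# Read it back the way a student would; read.table also accepts a URL string.
students_copy <- read.table(dat_path, header = TRUE)

# Confirm the column names and types came through intact.
str(students_copy)
```

If the files live on a web page, students can pass the file's address to `read.table()` in place of `dat_path`.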

Basic lessons on correlation, the coefficient of determination, and the least-squares linear regression (LSLR) model from the AP Statistics curriculum should be covered in the classroom prior to the lab. Optionally, you can show students how to run simple linear regression on a TI-8* series graphing calculator before any lab time. Students should already be familiar with the vocabulary related to LSLR: explanatory variable, response variable, error, slope, intercept, and residual.

A multiple-question quiz is provided for each of the two lessons. These questions can be incorporated into a unit test on all of the linear regression material covered in AP Statistics. There are also two short assignments that students may complete as a follow-up to their lab activities. Rubrics are provided for the assignments.

(1) Principles and Standards for School Mathematics, NCTM, p. 37.

(2) http://www.ncpublicschools.org/curriculum/mathematics/scos/2003/9-12/72a...

(3) AP Statistics Teacher’s Guide, p. 3.