Understanding Data Mining

Lesson 1: Installing and Working with R Statistical Software

We will use the R statistical software package many times this year. It is a very powerful statistical package that is used by the Federal Reserve Bank and by companies all over the world.

The R environment will probably be unfamiliar to you unless you have worked with Maple in an AP Calculus class. Although we are using a Windows version of R, it still shows many characteristics of the UNIX operating system on which the original version was developed. R is an open source program, so it is free for you to download to your computer in the lab as well as at home. R also runs on Mac and Linux computers.

Download and Installation of Software

  1. Go to http://www.r-project.org.
  2. Choose CRAN on the download link on the left side of the page. Scroll down to the USA download sites and select one.
  3. Choose the Windows version of R, which will take you to another page. On that page, select the base package.
  4. On the next page select R-2.10.0-win32.exe and choose Save File. Save it to your personal space on the school server. (At home you may want to save this to your Desktop or to My Documents.) This is a large file and will take several minutes to download.
  5. Create a folder on your computer’s desktop and name it APStatistics. If you are on a school computer, create this folder in your home directory.
  6. Find the downloaded R file and open it to install the program. Be sure to put all of your files in the APStatistics folder you created on your computer.

Syntax and Key Commands

  • R is case sensitive: ‘height’ and ‘Height’ are two different names.
  • All input lines begin with a red arrow: >
  • All output is shown in blue.
  • A semi-colon (;) may be used to separate commands on the same line or a new line may be started for a new command.
  • Comments can be inserted by beginning the line with the pound sign (#). Comment lines are helpful references to the tasks written in code.
  • The up/down arrow keys let you scroll through previous commands. A recalled command appears at the input arrow, where you can edit it by moving the cursor back with the left arrow key. (This key is not the same as the assignment arrow <- that you type in R commands.)
  • Always begin each session of R by seeding the random number generator, much the same way that we seed the graphing calculator. The command to seed R is set.seed(2010). We will always seed R for lab activities so that everyone gets the same output.
  • To see what objects are active in R, type ls().
  • To remove objects in R, type rm(objectname1, objectname2), listing as many object names as you need, separated by commas.
  • To quit R, type q(). Do NOT save your workspace.
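The commands in this list can be tried together in a short session. The following is only a sketch: the object names (first_draw, second_draw, height, Height, a, b) and the numbers assigned to them are made up for illustration.

```r
# Seed the random number generator so results are reproducible.
set.seed(2010)
first_draw <- sample(1:100, 5)

# Reseeding with the same value gives the same "random" numbers again.
set.seed(2010)
second_draw <- sample(1:100, 5)
identical(first_draw, second_draw)   # TRUE

# R is case sensitive: these are two separate objects.
height <- 60
Height <- 72
height == Height                     # FALSE

# A semicolon separates two commands typed on one line.
a <- 1; b <- 2

# List the active objects, remove two of them, then list again.
ls()
rm(a, b)
ls()
```

After rm(a, b), the second ls() no longer lists a or b, while height and Height both remain.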

Working Directory

Before we begin, you will need a folder that will serve as your working directory for R. When we designate this folder as the working directory, R will look in this folder for any files, such as data files, that you want to open. For the sake of consistency, in the computer lab we will name this folder APStatistics. It is easiest to keep the folder on a flash drive. Otherwise, you will need to save the folder in your student directory on the school server. The downside of saving the files on the server is that you will not be able to access them at home.

The command getwd() will give you the current working directory in R. To change the working directory, go to File→ Change directory→ then browse for the folder you created for the working directory.

If you prefer to set the directory manually, type the pathway inside quotation marks, using forward slashes (/) rather than backslashes. The command is

setwd("pathway with forward slashes")
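As a rough sketch, getwd() and setwd() can be used together as shown here. The folder location is an assumption made so that the example runs on any computer: it creates an APStatistics folder inside R's temporary directory rather than on your desktop or flash drive. The names old_dir and lab_dir are made up for illustration.

```r
# Remember where R is currently looking so it can be restored later.
old_dir <- getwd()

# Hypothetical location for illustration: a folder inside R's temporary
# directory, which exists on every computer.
lab_dir <- file.path(tempdir(), "APStatistics")
dir.create(lab_dir, showWarnings = FALSE)

# The path goes inside setwd(); when you type a path yourself,
# put it in quotes and use forward slashes (/).
setwd(lab_dir)
getwd()

# Go back to the original working directory.
setwd(old_dir)
```

In the lab, you would replace lab_dir with the quoted pathway to your own APStatistics folder.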

Existing Data Files in R

There are a number of existing data files in R. The command data() will pop up a separate window with a list of over 100 data files. Please note the names carefully, as R is case sensitive. To make a data file active for you to use, the command is data(filename). For example, to load the data file on Loblolly trees, type:

 data(Loblolly)

If you wish to examine the data file, you use a drop-down menu to open the data editor. Edit→ Data Editor→ type in name of file. You will get another window that looks similar to an Excel file format. Note the names of the variables and what they likely represent. Remember, R is case sensitive so make sure you are specific about the case.

If you want to see the data set in its entirety in the main window, just type the name of the data set. While you cannot copy and paste the data set from the data editor window, you can copy and paste it from the main R window. Be sure to put it into Notepad (not MS Office) and save it as a .txt file.
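If you would rather inspect a data set from the command line, a few standard commands give a quick summary. This is a sketch only. The last step is an alternative to copying and pasting into Notepad, and the file name loblolly.txt and its location in R's temporary directory are assumptions made for illustration.

```r
# Make the built-in Loblolly data set active.
data(Loblolly)

names(Loblolly)   # names of the variables (case matters)
head(Loblolly)    # first six rows only
nrow(Loblolly)    # number of rows
str(Loblolly)     # type of each variable

# Alternative to copy and paste: write the data set to a .txt file.
# The file name and folder here are hypothetical.
out_file <- file.path(tempdir(), "loblolly.txt")
write.table(Loblolly, out_file, row.names = FALSE, quote = FALSE)
```

Using head() is handy for large data sets, since typing the name alone prints every row.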

Existing Data Files in the Home Directory

Many times in this class you will use a data set that has been provided to you or that you create yourself as a part of your project. It is very important that you save your data in Notepad as a .dat file. Just type your filename.dat when you save it.

When you want to open a data file that is already in your home directory, use the read.table command. You will need to tell R whether or not the data columns in your file have headers. The file name MUST be typed inside quotation marks, and the file type (.dat or .txt) must be included.

Example: read.table("lob.dat", header=TRUE)

If you want to store the table under a particular name to use in R, you can assign the result of read.table to that name:

mydata <- read.table("lob.dat", header=TRUE)

If you wish to make certain that your data is there, just type mydata at the command line and R will give you the entire data file in the main window.
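The effect of the header setting is easier to see with a small example. The sketch below first writes a tiny file so that it can run on its own. The file name demo.dat, its location in R's temporary directory, and the values in it are all made up for illustration and are not real measurements.

```r
# Build a tiny made-up data file so the example is self-contained.
demo_file <- file.path(tempdir(), "demo.dat")
writeLines(c("height age",
             "4.5 3",
             "10.9 5",
             "28.7 10"),
           demo_file)

# header = TRUE tells R that the first line holds the column names.
mydata <- read.table(demo_file, header = TRUE)
mydata

# With header = FALSE, R names the columns V1, V2, ... and treats
# the first line of the file as a row of data.
no_header <- read.table(demo_file, header = FALSE)
names(no_header)
nrow(no_header)
```

If your column names show up as the first row of data, the header setting is the first thing to check.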

If there is an error in your data file, you can fix it directly in R. Just type fix(mydata) and the data editor will open and allow you to fix the data. Don’t forget to save the changes if you want to keep them.
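A correction can also be typed at the command line instead of in the data editor. The sketch below is one possible approach, not a replacement for fix(). The data frame, the mistyped value, and the file name lob_fixed.dat are all made up for illustration.

```r
# Made-up data frame standing in for mydata; the values are illustrative.
mydata <- data.frame(height = c(4.5, 109, 28.7), age = c(3, 5, 10))

# Suppose the second height was mistyped; correct it by row and column.
mydata[2, "height"] <- 10.9

# Save the corrected data to a file (hypothetical file name and folder).
fixed_file <- file.path(tempdir(), "lob_fixed.dat")
write.table(mydata, fixed_file, row.names = FALSE, quote = FALSE)

# Read the file back to confirm that the change was kept.
check <- read.table(fixed_file, header = TRUE)
check$height[2]
```

Writing the corrected data to a file means the fix is still there the next time you start R.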

Your Notes: