Understanding Data Mining
Author: Celia Rowland
Level: High School
Content Area: Mathematics
We are going to use the R statistical software package this year on numerous occasions. It is a very powerful statistical package that is used by the Federal Reserve Bank and international companies all over the world.
The R environment will probably be new to you unless you have worked with Maple in an AP Calculus class. While we are using a Windows version of R, it still exhibits many of the underlying characteristics of the UNIX operating system on which the original version was developed. R is an open source program, so it is free for you to download to your computer in the lab as well as at home. R also runs on Mac and Linux computers.
A few commands you will use in every session:
set.seed(2010) seeds R's random number generator. We will always seed R for lab activities so we can be assured of getting the same output.
ls() lists the objects currently in your workspace.
rm(objectname1, objectname2, etc) removes the named objects from your workspace.
q() quits R. Do NOT save your workspace.
Before we begin, you will want to create a folder that will serve as your working directory for R. When we designate this folder as the working directory for R, the program will go to this folder to open any files, such as data files, that you want to open. For the sake of consistency, while in the computer lab we will create a folder called APStatistics. It is easier if you save your folder onto a flash drive. If not, you will need to save the folder in your student directory on the school server. The downside to saving the files on the server is that you will not be able to access the files at home.
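To see what seeding and the workspace commands do, here is a minimal sketch you can type at the R prompt. The object names first_draw and second_draw are made up for illustration, and sample() is just one convenient way to generate random numbers. The q() command is left out on purpose because it would end your session.

```r
# Seed the random number generator, then draw 5 random numbers from 1 to 100
set.seed(2010)
first_draw <- sample(1:100, 5)

# Re-seeding with the same value reproduces exactly the same draw
set.seed(2010)
second_draw <- sample(1:100, 5)
identical(first_draw, second_draw)  # TRUE

# List the objects in the workspace, remove one, and list again
ls()
rm(second_draw)
ls()
```

Because everyone in the lab uses the same seed, everyone should see the same numbers in first_draw.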
The command getwd() will give you the current working directory in R. To change the working directory, go to File → Change directory, then browse for the folder you created for the working directory.
If you prefer to set the directory manually, the command is
setwd("path/to/your/folder")
The path must be typed inside quotation marks and written with forward slashes (/), even on Windows.
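The working-directory commands can be practiced safely with the sketch below. It assumes a practice folder named APStatistics created inside R's temporary directory, so nothing on your flash drive or the server is touched; the names old_dir and practice_dir are hypothetical.

```r
# Remember the current working directory so we can return to it
old_dir <- getwd()

# Create a practice folder inside R's temporary directory
practice_dir <- file.path(tempdir(), "APStatistics")
dir.create(practice_dir, showWarnings = FALSE)

# Change the working directory, then confirm the change
setwd(practice_dir)
getwd()

# Go back to where we started
setwd(old_dir)
```

On a flash drive the call might look like setwd("E:/APStatistics"), where the drive letter E: is only an assumption and will depend on your computer.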
There are a number of existing data files in R. The command data() will pop up a separate window with a list of over 100 data files. Please note the names carefully, as R is case sensitive. To make a data file active for you to use, the command is data(filename). For example, to load the data file on Loblolly trees, type:
data(Loblolly)
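Once a built-in data file is loaded, a few commands give a quick look at it without opening the data editor. This is only a sketch of one way to inspect the Loblolly data; names(), dim() and head() are standard R functions.

```r
# Load the built-in Loblolly pine data set
data(Loblolly)

# Variable names (note the capital S in Seed; R is case sensitive)
names(Loblolly)   # "height" "age" "Seed"

# Number of rows and columns
dim(Loblolly)     # 84 3

# First six rows of the data
head(Loblolly)
```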
If you wish to examine the data file, you use a drop-down menu to open the data editor. Edit→ Data Editor→ type in name of file. You will get another window that looks similar to an Excel file format. Note the names of the variables and what they likely represent. Remember, R is case sensitive so make sure you are specific about the case.
If you want to see the data set in its entirety in the main window, just type the name of the data set. While you cannot copy and paste the data set from the data editor window, you can copy and paste it from the main R window. Be sure to put it into Notepad (not MS Office) and save it as a .txt file.
Many times in this class you will use a data set that has been provided to you or that you create yourself as a part of your project. It is very important that you save your data in Notepad as a .dat file. Just type your filename.dat when you save it.
When you want to open a data file that is already in your working directory, you use the read.table command. You will need to tell R whether or not the columns in your data file have headers. The file name MUST be typed inside straight quotation marks (" "), not the curly ones a word processor produces, and the file type (.dat or .txt) must be included.
Example: read.table("lob.dat", header=TRUE)
If you want to store the table under a particular name to use in R, assign the result of read.table to that name:
mydata <- read.table("lob.dat", header=TRUE)
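The following sketch shows the whole round trip in one place. It assumes you do not yet have a lob.dat file, so it first writes the built-in Loblolly data to a file of that name in R's temporary directory using write.table (a swapped-in alternative to saving from Notepad), and then reads it back. The name data_file is hypothetical.

```r
# Load the built-in data and write it out as a plain-text .dat file
data(Loblolly)
data_file <- file.path(tempdir(), "lob.dat")
write.table(Loblolly, data_file, row.names = FALSE)

# Read the file back in; header = TRUE because the first line holds column names
mydata <- read.table(data_file, header = TRUE)

# Check that the data arrived as expected
names(mydata)   # "height" "age" "Seed"
nrow(mydata)    # 84
```

If your own file has no column names in its first line, use header = FALSE instead and R will name the columns V1, V2, and so on.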
If you wish to make certain that your data is there, just type mydata at the command line and R will give you the entire data file in the main window.
If there is an error in your data file, you can fix it directly in R. Just type fix(mydata) and the data editor will open and allow you to fix the data. The correction changes only the copy of the data inside R, not the file on your drive, so don't forget to save the changes to a file if you want to keep them.
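The data editor is interactive, so it cannot be shown in a script. As an alternative sketch, a single value can also be corrected by typing its row and column, and the corrected data can then be written to a file. The small data set, the "wrong" value, and the file name lob_fixed.dat below are all made up for illustration.

```r
# A small made-up data set with a typing error in the third height
mydata <- data.frame(height = c(4.51, 10.89, 82.2),
                     age    = c(3, 5, 10))

# Correct the value in row 3 of the height column
mydata[3, "height"] <- 28.72
mydata

# Save the corrected data so the change is kept after R is closed
fixed_file <- file.path(tempdir(), "lob_fixed.dat")
write.table(mydata, fixed_file, row.names = FALSE)
```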