# data analysis like a pirate…rrrrrrrrrrrrrrrr….

My Kenan project focuses on the question, “Is U.S. air pollution getting better or worse?” For the teacher side of this fellowship, I’m building a project in which my students answer questions related to that focus by analyzing data available through the E.P.A. website.

By law, the E.P.A. monitors 6 different air pollutants: ozone, carbon monoxide, nitrogen dioxide, sulfur dioxide, lead, and particulate matter. Most of these pollutants are measured daily, some several times a day. There are literally thousands of E.P.A. monitor sites across the United States.

So here’s the problem…

There are 6 different pollutants and 37 years of data available (1980-2016). That’s 6 × 37 = 222 separate data files. Each one is a spreadsheet with 29 columns and up to several hundred THOUSAND rows. Each file also covers all 50 states, so if you don’t want all 50, you need a way to separate the data.

I’ve pretty well decided that the project will be collaborative: each group gets one of four regions of the country (northeast, south, midwest, west), and each student becomes an expert on one of five of the pollutants – leaving out airborne lead, since it has not been a problem in a very long time.

As an example: suppose I want a data file on ozone for my students who will be doing the Northeast region. What I need to do is combine all 37 years of data into one file, but also limit that file to just the 14 states assigned to the Northeast region.

Well, doing this in Excel is almost unthinkable to me.

Fortunately, I’ve been introduced to a statistical analysis program called R. That’s right, just the letter R. R is an open-source language and environment for statistical computing and graphics. It is widely used in statistics, data science, and data mining. It is both a language and a tool.

This little piece of code in R will read in 37 years of data (in CSV format), select only the data that applies to the Northeast region states, and bind it all together into one single file.

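    # Assumes `files` (the 37 yearly CSV paths) and `northeast` (the region's
    # state codes) were defined earlier. The loop starts at 2, so X is
    # presumably seeded from the first year's file in the same way:
    X = read.csv(files[1])
    X = subset(X, X$State.Code %in% northeast)
    X = X[, c(1:3, 12, 17)]
    for (i in 2:37) {
      X1 = read.csv(files[i])                        # read one year of data
      X1 = subset(X1, X1$State.Code %in% northeast)  # keep Northeast states only
      X1 = X1[, c(1:3, 12, 17)]                      # keep 5 of the 29 columns
      X = rbind(X, X1)                               # append to the running total
    }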
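For anyone who wants to try the idea without downloading anything, here is a minimal, self-contained sketch of the same read, filter, and bind pattern. It is not the code I actually ran: the file names, the toy values, and the two placeholder state codes are all made up for illustration, and the helper `read_region` is a hypothetical name. It also swaps the growing `rbind` loop for `lapply` plus a single `do.call(rbind, ...)`, which is one common way to write this in R. The toy files only have five columns, so the column-selection step is left out.

```r
# Hypothetical setup: tiny made-up files stand in for the yearly downloads
# so that this sketch runs on its own.
tmp_dir <- file.path(tempdir(), "ozone_demo")
dir.create(tmp_dir, showWarnings = FALSE)

years <- 1980:1982                      # 3 toy years instead of 37
for (yr in years) {
  toy <- data.frame(
    State.Code      = c(9, 25, 48),     # made-up rows; 48 is outside the toy region
    County.Code     = c(1, 2, 3),
    Site.Num        = c(10, 20, 30),
    Date.Local      = paste0(yr, "-07-01"),
    Arithmetic.Mean = c(0.031, 0.042, 0.055)
  )
  write.csv(toy, file.path(tmp_dir, paste0("ozone_", yr, ".csv")),
            row.names = FALSE)
}

# Collect the yearly file paths and define the region's state codes.
files <- list.files(tmp_dir, pattern = "^ozone_[0-9]{4}\\.csv$",
                    full.names = TRUE)
northeast <- c(9, 25)                   # placeholder for the real list of 14 codes

# Read one file and keep only the rows whose state code is in the region.
read_region <- function(path, codes) {
  one_year <- read.csv(path)
  one_year[one_year$State.Code %in% codes, ]
}

# Build a list of per-year data frames, then bind them in a single step.
per_year <- lapply(files, read_region, codes = northeast)
combined <- do.call(rbind, per_year)

nrow(combined)                          # number of rows kept across all years
```

Binding once at the end avoids copying the growing data frame on every pass through the loop, which can matter when each yearly file has hundreds of thousands of rows.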

A bit more work needs to be done to have the data in a form ready for student use, but here’s the amazing part to me.

Those 37 original data files were anywhere from about 50,000 to 200,000 KB in size. We stick them all together, pull out what we don’t want, keep what we do, split into regions, and when we are done?

The file for Northeast region Ozone data has a size of about 60 KB.

Yup.

That’s doin’ data analysis like a PIRATE!

“RRRRRRRRRRRRRRRRRRR”

## Comments on “data analysis like a pirate…rrrrrrrrrrrrrrrr….”

1. Linda Dion says:

We were utilizing R in the Dr. Reade Roberts Lab to analyze data as well. We were comparing the significance of the data we had extracted from our DNA samples. Excited to see how you will utilize statistics and analysis in your lesson and classroom!
