# data analysis like a pirate…rrrrrrrrrrrrrrrr….

My Kenan project will be focused on the question, “Is U.S. air pollution getting better or worse?” To that end, the teacher side of this fellowship will involve a project for my students that will involve answering questions related to this focus by analyzing data available through the E.P.A. website.

By law, the E.P.A. monitors 6 different air pollutants: ozone, carbon monoxide, nitrogen dioxide, sulfur dioxide, lead, and particulate matter. Most of these pollutants are measured daily, some several times a day. There are literally thousands of E.P.A. monitor sites across the United States.

So here’s the problem…

There are 6 different pollutants and 37 years of data available (1980-2016). That’s 222 separate spreadsheet data files. Each of these data files is a spreadsheet with 29 columns and up to several hundred THOUSAND rows of data. Each of these data files has data for all 50 states as well, so if you don’t want all 50 states, you need a way to separate the data.

I’ve pretty well decided that the nature of the project will be collaborative, with each group given one of four regions of the country (northeast, south, midwest, west), and each student will become an expert on one of five of the pollutants – leaving out airborne lead since it has not been a problem in a very long time.

As an example: suppose I want a data file for my students who will be doing the Northeast region, and want data on Ozone. What I need to do is to combine all 37 years of data in to one file, but also limit the file to just the 14 states assigned to the Northeast region.

Well, doing this in Excel is almost unthinkable to me.

Fortunately, I’ve been introduced to a statistical analysis program called R. That’s right, just the letter R. R is an open source language and environment for statistical computing and graphics. It is widely used in statistical sciences, data sciences, and data mining. It is both a language and a tool.

This little piece of code in R will read in 37 years of data (in csv format), select only data that applies to the Northeast region states,  and bind them together into one single file.

for(i in 2:37){
X1=subset(X1, X1\$State.Code %in% northeast)
X1=X1[,c(1:3,12,17)]
X=rbind(X,X1)
}

A bit more work needs to be done to have the data in a form ready for student use, but here’s the amazing part to me.

Those 37 original data files were anywhere from about 50,000 to 200,000 KB in size, and we stick them all together, pull out what we don’t want, keep what we do, split into regions, and when we are done?

The file for Northeast region Ozone data has a size of about 60 KB.

Yup.

That’s doin’ data analysis like a PIRATE!

“RRRRRRRRRRRRRRRRRRR”