Starting a Data Science Project: Basic Data Exploration
Finding Interest in Your Machine Learning Development
Titanic fatalities, handwritten digits, vehicle mileage, and other famous data sets are very useful when starting out on the data science education path. There are also numerous tutorials, examples, and resources covering how to apply almost any machine learning algorithm to these data sets. While these data sets are historically important and contain interesting data, they may not be personally interesting to you or answer any of your specific questions.
This blog post will demonstrate how I start on a new data science project and perform some basic data exploration with R.
Do You Have a Question or Do You Have a Data Set?
When starting a personal (as opposed to a professional setting) data science project there are two approaches to start from:
What questions do I want answered?
What can this data set tell me?
The first approach may be more difficult because data may not be readily available for every question you have. For the second approach, you may already have a data set related to a topic you want to understand better. These two approaches are somewhat fluid and can absolutely be used in conjunction with each other.
How to Find a Data Set
In relation to the above, the data set I am using going forward has a personal connection to me. In my first year of university I became friends with a group of biology and chemistry students who, as part of their undergraduate research and volunteer work, traveled to some of the waterways and tributaries around Austin, Texas to record water quality samples. The city of Austin has a fantastic resource of publicly available data ranging from financial transactions to graffiti removal requests to crime data and much more. Your own nearby municipality likely has a similar data resource.
I will be using the “Water Quality Sampling Data” data set available here.
While you may not be interested in a city's data, there are many other avenues to find data you are interested in. Government agencies publish data on almost any topic, including labor statistics and weather and climate. Competition sites such as Kaggle host freely available data sets submitted by users or competition sponsors. Many universities and schools publish data sets from their research projects as well.
Read the Documentation
Once you have a data set that you find interesting or that pertains to your target question(s), one of your first steps should be to read any associated documentation.
There are instances where the documentation disagrees with the data, and you will have to apply some detective work and logical reasoning to decide which is more likely to be "correct" for your use case.
Basic Data Exploration
From this point forward I will go through my process for first steps in data exploration, also known as exploratory data analysis (commonly abbreviated to EDA). I will be using R with multiple additional packages loaded.
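Before any exploration, the data need to be loaded and a few packages attached. The exact package set below is an assumption on my part (the post only says "multiple additional packages"), but these two cover the functions used later:

```r
# Packages assumed for this walkthrough; install.packages() them first if needed.
library(dplyr)  # filtering helpers such as between()
library(tidyr)  # reshaping between long and wide formats
```

The data set itself would then be read in with something like `Water_Data <- read.csv(...)` pointed at your downloaded file.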
While looking at the raw data becomes cumbersome as the number of variables increases, quick insights can still be gained from the data alone when the documentation doesn't offer a full description.
Using head() or View() lets us see example values for each variable and check for any obvious discrepancies from what we expect. If the documentation didn't explain each variable, this would be the point to start understanding what each variable is, its type, how it is formatted, and whether there are any differences from what we expect. Beyond inspecting example rows, tables and numerical summaries can be produced for each variable.
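A few base R calls cover most of this first pass. Sketched here on a small stand-in data frame so the snippet runs on its own (the real columns PARAMETER and RESULT appear in the data set; the values are illustrative):

```r
# Small stand-in for Water_Data so this snippet is self-contained.
Water_Data <- data.frame(
  PARAMETER = c("PH", "DISSOLVED OXYGEN", "PH"),
  RESULT    = c(7.4, 8.1, 6.9),
  stringsAsFactors = FALSE
)

head(Water_Data)             # first rows: spot-check values and formats
str(Water_Data)              # type of each column (character, numeric, ...)
summary(Water_Data)          # numeric summaries; counts for categorical data
table(Water_Data$PARAMETER)  # frequency table for a single variable
```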
We also see that the structure of the data is "long": multiple parameters and results for the same sample are spread over multiple rows. One fix is to create an additional column in the data set for each parameter for each sample. For this data, however, many parameters appear only a few times, so such a reshape would produce many empty values. The approach I will take moving forward is to look at only the 10 most populated parameters.
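Finding the most populated parameters can be done with a sorted frequency table. A minimal sketch, again on a stand-in data frame (the parameter names and counts are illustrative):

```r
# Stand-in long-format data; the real Water_Data has many more parameters.
Water_Data <- data.frame(
  PARAMETER = c(rep("PH", 5), rep("DISSOLVED OXYGEN", 3), "TURBIDITY"),
  stringsAsFactors = FALSE
)

# Count rows per parameter, sort descending, keep the top 10 names.
param_counts <- sort(table(Water_Data$PARAMETER), decreasing = TRUE)
top_params   <- names(head(param_counts, 10))
top_params
```

The resulting vector of names can then be used to subset the data, e.g. `Water_Data[Water_Data$PARAMETER %in% top_params, ]`.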
For the sake of simplicity I will create a new data frame with only the pH data included, and later I will demonstrate how to change the shape (dimensions) of the data set into 'wide' format. In addition to subsetting the data to only the pH measurement results, I will also limit the new data set to non-null values within a reasonable range for pH measurements (pH runs from a minimum of 0 to a maximum of 14). In many situations, null values can tell you additional information about the data. For example, a stock trader not checking the markets for a day (a hypothetical situation) and recording no trades might appear as a null value in some data set rather than as a 0.
# between() comes from dplyr, so load it first with library(dplyr)
PH_Data <- Water_Data[Water_Data$PARAMETER == "PH" &
                        !is.na(Water_Data$RESULT) &
                        between(Water_Data$RESULT, 0, 14), ]
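The reshape into 'wide' format mentioned above can be sketched with tidyr's pivot_wider. This is a sketch only; the SAMPLE_ID column name and the values are illustrative, and it assumes one result per sample/parameter pair:

```r
library(tidyr)

# Toy long-format data: one row per (sample, parameter) measurement.
long_df <- data.frame(
  SAMPLE_ID = c(1, 1, 2, 2),
  PARAMETER = c("PH", "DISSOLVED OXYGEN", "PH", "DISSOLVED OXYGEN"),
  RESULT    = c(7.2, 8.0, 6.9, 7.5),
  stringsAsFactors = FALSE
)

# One row per sample; each parameter becomes a column (NA where unmeasured).
wide_df <- pivot_wider(long_df, names_from = PARAMETER, values_from = RESULT)
wide_df
```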
Now that we have just the pH data, let's check our assumption that most of the measurements should fall between 6 and 8. I don't put too much faith in histograms and charts of data, but they are useful for making sure nothing has gone terribly wrong.
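The sanity check described above amounts to a histogram over the full pH scale plus a numeric summary. Sketched here with a small stand-in for PH_Data (the values are illustrative):

```r
# Stand-in for the filtered PH_Data so this snippet runs on its own.
PH_Data <- data.frame(RESULT = c(6.8, 7.1, 7.4, 6.5, 8.2, 7.0))

# Histogram across the whole 0-14 pH scale: most mass should sit near 6-8.
hist(PH_Data$RESULT,
     breaks = seq(0, 14, by = 1),
     main   = "Distribution of pH results",
     xlab   = "pH")

summary(PH_Data$RESULT)  # median should land near 7 if the assumption holds
```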
Final Words
This project-tutorial worked through an example of starting a data science project, from finding an interest (or at least an interesting data set) to some basic EDA steps. At each step you should come to understand your data better and think about how the data relate to your target variable or the question you want answered, not to mention which algorithm or machine learning process will work best on the data set and the formatting needs that entails.
There are many other data exploration and preprocessing techniques that weren't even touched on here. In the future, additional links and edits will be made here to connect to more in-depth topics that warrant individual blog posts.