Data Janitor
What is a Data Janitor?
From Wikipedia: “A data janitor is a person who works to take big data and condense it into useful amounts of information. Also known as a ‘data wrangler,’ a data janitor sifts through data for companies in the information technology industry. A multitude of start-ups rely on large amounts of data, so a data janitor works to help these businesses with this basic, but difficult process of interpreting data.”
Search LinkedIn or AngelList and you won’t see a single job posting for a data janitor. Why is that? A data scientist’s time is valuable, and estimates put 50–80% of it toward tiresome cleaning. I am surprised companies are not capitalizing on the trend of people wanting to become data scientists by hiring data janitors into entry-level positions.
Janitorial Cleaning Cart
When I first started using R, data manipulation was daunting, and I avoided using R packages. At first I would do the cleaning manually in Excel and then import the data back into R for charting and analysis. Then I found sqldf, which allowed me to perform SQL SELECTs on R data frames. As I grew more comfortable writing SQL queries to filter and join, this was my tool of choice for a while. But sqldf() is not always the most efficient, especially with larger data frames.
```r
# sqldf example
library(sqldf)
scoredist <- sqldf("SELECT boro, score, COUNT(score) AS count
                    FROM df
                    WHERE boro <> 'Missing'
                    GROUP BY boro, score")
```
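For comparison, the same filter-and-aggregate can be written in pandas. This is a sketch assuming a data frame with `boro` and `score` columns like the one above; the sample rows here are made up for illustration.

```python
import pandas as pd

# Made-up sample mirroring the columns used in the sqldf example
df = pd.DataFrame({
    "boro": ["Bronx", "Bronx", "Queens", "Missing", "Queens"],
    "score": [10, 10, 12, 7, 12],
})

# Filter out 'Missing', then count scores per (boro, score) pair
scoredist = (
    df[df["boro"] != "Missing"]
    .groupby(["boro", "score"])
    .size()
    .reset_index(name="count")
)
print(scoredist)
```

The `groupby(...).size()` chain plays the role of `GROUP BY` plus `COUNT`, and the boolean mask replaces the `WHERE` clause.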
Hands-on dplyr tutorial for faster data manipulation in R
dplyr is “A Grammar of Data Manipulation.” I had heard of dplyr and plyr for a while before I finally gave the package a shot, and now I wish I had started earlier.
Pandas
When using Python, the library of choice for data cleaning is pandas. Here are some basic pandas commands:
```python
import pandas as pd

# View the head and summary statistics
df.head()
df.describe()

# SQL-like merge/join
df = pd.merge(yelp_df, Inspection_df, how='inner', on='Phone')

# Basic string contains (using .loc to avoid chained-assignment pitfalls)
df.loc[df['From'].str.contains("MyEmail@gmail.com"), 'Me'] = 1

# Drop rows with missing data (dropna returns a new frame, so assign it back)
df = df.dropna(how='any')

# Get rows with ratings above 3
yelp_df[yelp_df.Rating > 3]
```
What am I actually cleaning for?
A clean room is not hard to imagine: sparkling floors and an organized desk. But what does clean data look like? In a tidy data set, each variable is saved in its own column and each observation in its own row.
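To make the tidy-data idea concrete, here is a small sketch with made-up inspection scores: a “messy” table spreads one variable (year) across column headers, and `melt` reshapes it so each variable has its own column and each observation its own row.

```python
import pandas as pd

# Hypothetical "messy" table: one row per restaurant, one column per year's score
messy = pd.DataFrame({
    "restaurant": ["A", "B"],
    "2019": [88, 72],
    "2020": [91, 75],
})

# Tidy form: each variable (restaurant, year, score) gets its own column,
# and each observation (one inspection) gets its own row
tidy = messy.melt(id_vars="restaurant", var_name="year", value_name="score")
print(tidy)
```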
- Standardize formats
  - Dates should follow the same syntax (e.g., YYYY-MM-DD)
  - Correct data types (should the columns be characters or factors?)
  - Is the currency used consistent and labeled?
  - Trim extraneous spaces
- Label columns to remove ambiguity
- Remove duplicate records
- Sanity-check the aggregation type (i.e., are you looking for a count or a sum?)
- Check the length of variables (e.g., a zip code should have a length of 5)
- Sort the data and check for outliers (flag them for removal)
- Missing data
  - Is the data complete? (Why does one month have a third of the data points of another?)
  - NaN or blank rows and columns
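Several of the checklist items above map directly to one-liners in pandas. This is a sketch on made-up data; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw data illustrating several items from the checklist above
raw = pd.DataFrame({
    "Name": ["  Cafe A", "Cafe B ", "Cafe B "],
    "Date": ["01/15/2020", "01/20/2020", "01/20/2020"],
    "Zip": ["10001", "1001", "1001"],
})

clean = raw.copy()
clean["Name"] = clean["Name"].str.strip()                         # trim extraneous spaces
clean["Date"] = pd.to_datetime(clean["Date"], format="%m/%d/%Y")  # standardize date format
clean = clean.drop_duplicates()                                   # remove duplicate records
bad_zips = clean[clean["Zip"].str.len() != 5]                     # flag zip codes of the wrong length
print(clean)
print(bad_zips)
```

Note that the two “Cafe B” rows only become true duplicates after the whitespace is trimmed, which is one reason to standardize formats before deduplicating.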
Don’t forget to automate when possible. I noticed an account manager spent an hour each week cleaning data in Excel to upload to the UI. Using Python, I was able to write a script to automate the steps and save her that time each week. Data cleaning is not the most fun, and if the cost-benefit math works out, why not automate?
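A weekly cleanup like that can often be reduced to one small function. This is a sketch with hypothetical file paths and cleaning steps, using CSV for illustration; `pd.read_excel` works the same way for .xlsx files.

```python
import pandas as pd

def clean_report(in_path: str, out_path: str) -> pd.DataFrame:
    """One-command version of a manual weekly cleanup (steps are illustrative)."""
    df = pd.read_csv(in_path)            # pd.read_excel(in_path) for Excel workbooks
    df.columns = df.columns.str.strip()  # normalize header whitespace
    df = df.dropna(how="all")            # drop fully blank rows
    df = df.drop_duplicates()            # remove duplicate records
    df.to_csv(out_path, index=False)     # write the file ready for upload
    return df
```

Scheduling this with cron (or Task Scheduler on Windows) turns an hour of manual Excel work into a job that runs before anyone gets to the office.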
And no package that I know of can replace a fresh set of eyes looking at the data to see whether the values are reasonable. Sure, it’d be great if you had a million page views in a day, and that would make sense for some blogs, but you need to know your baselines. Garbage in, garbage out: as a data janitor, one should make sure to take the trash out.