Data Janitor

What is a Data Janitor?

From Wikipedia: “A data janitor is a person who works to take big data and condense it into useful amounts of information. Also known as a “data wrangler,” a data janitor sifts through data for companies in the information technology industry. A multitude of start-ups rely on large amounts of data, so a data janitor works to help these businesses with this basic, but difficult process of interpreting data.”

Search LinkedIn or AngelList and you won’t see a single job posting for a data janitor. Why is that? A data scientist’s time is valuable, and estimates put 50-80% of it toward tiresome cleaning. I am surprised companies are not capitalizing on the trend of people wanting to become data scientists by hiring data janitors into entry-level positions.

[Image: janitorial cleaning cart]

When I first started using R, data manipulation was daunting, and I avoided R packages. At first I would do the cleaning manually in Excel and then import the data back into R for charting and analysis. Then I found sqldf, which let me run SQL SELECT statements on R data frames. As I grew more comfortable writing SQL queries to filter and join, this was my tool of choice for a while. But sqldf() is not always the most efficient, especially with larger data frames.

# sqldf example
library(sqldf)

scoredist <- sqldf("SELECT boro, score, COUNT(score) AS count
                    FROM df
                    WHERE boro <> 'Missing'
                    GROUP BY boro, score")

Hands-on dplyr tutorial for faster data manipulation in R

dplyr describes itself as “a grammar of data manipulation.” I had heard about dplyr and plyr for a while before I finally gave the package a shot, and now I wish I had started earlier.

Pandas

When using Python, the library of choice for data cleaning is pandas. Some basic pandas commands:

import pandas as pd

# View the head and summary statistics
df.head()
df.describe()

# SQL-like merge/join
df = pd.merge(yelp_df, Inspection_df, how='inner', on='Phone')

# Basic string contains: flag rows where 'From' matches my address
# (use .loc rather than chained indexing, which may silently fail to assign)
df.loc[df['From'].str.contains("MyEmail@gmail.com"), 'Me'] = 1

# Drop rows with any missing data (dropna returns a copy, so reassign)
df = df.dropna(how='any')

# Get data with Ratings above 3
yelp_df[yelp_df.Rating > 3]
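The sqldf query from earlier maps naturally onto pandas as well. A minimal sketch, assuming a `df` with `boro` and `score` columns like the inspection data above (the values here are made up for illustration):

```python
import pandas as pd

# Hypothetical inspection data; 'Missing' marks an unknown borough
df = pd.DataFrame({
    'boro':  ['Bronx', 'Bronx', 'Queens', 'Missing'],
    'score': [10, 10, 12, 7],
})

# Equivalent of: SELECT boro, score, COUNT(score) AS count
#                FROM df WHERE boro <> 'Missing' GROUP BY boro, score
scoredist = (df[df.boro != 'Missing']
             .groupby(['boro', 'score'])
             .size()
             .reset_index(name='count'))
```

The `groupby(...).size()` chain counts rows per group, and `reset_index(name='count')` turns the result back into a flat data frame, matching the shape of the SQL output.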

What am I actually cleaning for?

A clean room is easy to picture: sparkling floors and an organized desk. But what does clean data look like? In a tidy data set, each variable is saved in its own column and each observation in its own row.

  • Standardize formats
    • Dates should follow the same syntax (e.g. YYYY-MM-DD)
    • Correct data types (should the columns be characters or factors?)
    • Is the currency consistent and labeled?
  • Trim extraneous spaces
  • Label columns to remove ambiguity
  • Remove duplicate records
  • Sanity-check the aggregation type (i.e. are you looking for a count or a sum?)
  • Check the length of variables (e.g. a zip code should have a length of 5)
  • Sort the data and check for outliers (flag them for removal)
  • Missing data
    • Is the data complete? (Why does one month have a third of the data points of another?)
    • NaN or blank rows and columns
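A few of the checks above, sketched in pandas on a small made-up frame (the column names and values are hypothetical, chosen only to exercise each step):

```python
import pandas as pd

# Hypothetical messy records: stray spaces, MM/DD/YYYY dates,
# a duplicate row, and a too-short zip code
df = pd.DataFrame({
    'name': [' Joe ', 'Ann', 'Ann'],
    'date': ['01/02/2015', '01/03/2015', '01/03/2015'],
    'zip':  ['1001', '10002', '10002'],
})

# Trim extraneous spaces
df['name'] = df['name'].str.strip()

# Standardize dates to YYYY-MM-DD
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')

# Remove duplicate records
df = df.drop_duplicates()

# Check variable lengths: flag zip codes that are not 5 characters
bad_zips = df[df['zip'].str.len() != 5]
```

Each step is a one-liner, which is exactly why a checklist like this is worth scripting instead of repeating by hand in Excel.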

Don’t forget to automate when possible. I noticed an account manager spent an hour each week cleaning data in Excel before uploading it to the UI. Using Python, I wrote a script to automate the steps and save her that hour every week. Data cleaning is not the most fun, and if the cost-benefit math works out, why not automate?
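A weekly routine like that can often collapse into a few lines. A sketch, assuming the manual steps were trim, dedupe, and drop blank rows (the file name, columns, and steps here are hypothetical, not the actual report):

```python
import pandas as pd

def clean_report(in_path, out_path):
    """Apply the weekly manual cleaning steps in one shot."""
    df = pd.read_csv(in_path)

    # Trim whitespace in every string column
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].str.strip()

    df = df.drop_duplicates()         # remove duplicate records
    df = df.dropna(how='all')         # drop fully blank rows

    df.to_csv(out_path, index=False)  # ready to upload to the UI
    return df

# clean_report('weekly_report.csv', 'weekly_report_clean.csv')
```

Wrap that in a scheduled job (cron, Task Scheduler) and the hour a week disappears entirely.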

And no package that I know of can replace a fresh set of eyes looking at the data to see whether the values are reasonable. Sure, it’d be great if you had a million page views in a day, and that would make sense for some blogs, but you need to know your baselines. Garbage in, garbage out; as a data janitor, one should make sure to take the trash out.
