The One with the Ratings
Last summer I binged Friends on Netflix: three episodes on a light day, up to a no-shame ten-episode vegfest. Chandler is one of my favorite characters. When people ask what I do, my default answer is “statistical analysis and data reconfiguration,” or, you know, a transponster.
Graph TV is a site that allows one to visualize IMDB ratings of various shows, or you can use ggplot/plotly to create a custom graphic.
IMDB ratings
library(ggplot2)
library(ggrepel)
library(dplyr)

ggplot(combinedrating, aes(x = EpNumber, y = UserRating, color = as.factor(Season))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Episode #", y = "Average Rating",
       title = "Friends Episode IMDB Ratings by Season") +
  # Mark the clip shows with open black circles
  geom_point(data = clips, aes(x = EpNumber, y = UserRating),
             color = "black", shape = 1, size = 5) +
  # Label only the very best and worst episodes
  geom_text_repel(data = combinedrating %>% filter(UserRating > 9.4 | UserRating < 8),
                  aes(x = EpNumber, y = UserRating, color = as.factor(Season), label = Name)) +
  ylim(c(5, 10))
Immediately I notice that the lowest-rated episodes are clip shows, which are marked with a black circle, the exception being ‘The One with the Donor.’ Not even John Stamos and Hank Azaria could save that episode. The series also ended at a good time; quality was starting to trend downward in seasons 8 and 9.
I then combined the IMDB ratings with the Nielsen ratings to look at the connection between ratings and viewers. No surprise that season finales have high ratings and viewership compared to clip shows. While clip shows can be useful for cutting costs, they can hurt viewership (and thus ad dollars) for the following episode despite no change in its quality. This raises the question: should syndication skip clip shows in the rotation to keep viewers?
Interactive Plotly Chart
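The interactive version was built with plotly. Here is a minimal sketch (not the original code), assuming combinedrating also carries the Nielsen Viewers column (in millions) used in the comparison:

library(plotly)

# Wrap a ggplot in ggplotly() to get hover tooltips showing the
# episode name alongside rating and viewership.
gg <- ggplot(combinedrating,
             aes(x = UserRating, y = Viewers,
                 color = as.factor(Season), text = Name)) +
  geom_point() +
  labs(x = "Average IMDB Rating", y = "Viewers (millions)", color = "Season")

ggplotly(gg, tooltip = c("text", "x", "y"))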
Cross correlation function (CCF)
ccf(combinedrating$Viewers, combinedrating$UserRating, lag.max = 3, plot = TRUE, main = "Viewers and Rating CCF")
A CCF shows the correlation between two variables at different lags, revealing how their relationship plays out over time. The most significant correlations fall within +/- 1 episode, which makes intuitive sense: the quality of an episode three weeks away has little relationship with this week’s viewership. Viewership also responds more to the previous episode’s quality than to the next one’s, seen in the CCF being higher at lag -1 than at +1.
A Sequel to SQL
An Introduction to SQL should give you a primer on the “Q” part of SQL: how to query. This post is a SQL sequel: how to add and alter data.
Top level review of basic SQL structure
SELECT      The columns you are SELECTing
FROM        The name of the table FROM which the data comes
JOIN ... ON JOINing a second table based ON a shared column {optional}
WHERE       Condition the selection should follow {optional}
GROUP BY    Grouping non-aggregated data {optional}
HAVING      The WHERE equivalent for aggregated selections {optional}
ORDER BY    You will never guess what this does {optional}
LIMIT       LIMITs the number of rows {optional}
One could go years as an analyst only reading data and never writing to the database. Personally, I was anxious when I first got write access, having heard stories of people accidentally deleting tables (most are hoaxes) or a wrong keystroke costing millions. Eventually you will find work easier when you can create and alter your own tables; a common example is aggregated tables.
Let’s Begin our SQL Sequel
For this tutorial I am using PostgreSQL 9.4.7. To check your version, you can use the following command.
select version()
Creating New Tables
To create a table from existing data, use the “CREATE TABLE ... AS” command.
CREATE TABLE KPI_agg AS (
  SELECT date_trunc('month', date_field) AS Month
       , User_class
       , COUNT(myid) AS Count_field
       , AVG(KPI) AS AVG_KPI
  FROM Your_Table
  GROUP BY 1, 2
);
Or, if the data does not already exist in your database, create a table to load from a CSV.
CREATE TABLE KPI_agg (
  Month timestamp,
  User_class varchar,
  Count_field bigint,
  AVG_KPI numeric
);
There are several data types to choose from; see the Postgres datatypes documentation. Varchars should be avoided when possible, but sometimes they are your only option. Changing a column’s type can be done with the ALTER command.
ALTER TABLE KPI_agg ALTER COLUMN User_class TYPE text USING (User_class::text);
After creating an empty table you can upload the CSV.
\copy KPI_agg from 'filelocation/datatoupload.csv' DELIMITER ',' CSV HEADER;
The above line copies data from your local computer, using a comma delimiter, stored as a .csv with a header row. Some SQL clients don’t allow a \copy command, so you may need to use psql or a SQL shell.
New Columns
To add a new column, use the ALTER and ADD commands.
ALTER TABLE KPI_agg ADD COLUMN NEW_Class text;
Then you can SET this column’s information based on existing data.
UPDATE KPI_agg SET NEW_Class = 'New Name' WHERE User_class = 'Old Name'
Later on, you can update the table by only inserting data that meets a certain condition
INSERT INTO KPI_agg (Month, User_class, Count_field, AVG_KPI)
(
  SELECT date_trunc('month', date_field) AS Month
       , User_class
       , COUNT(myid) AS Count_field
       , AVG(KPI) AS AVG_KPI
  FROM Your_Table
  WHERE date_field::date >= '2017-03-01'
    AND date_field::date <= '2017-03-31'
  GROUP BY 1, 2
);
Removing Data
Drop the entire table
DROP TABLE KPI_agg;
Delete rows that match a condition
DELETE FROM KPI_agg where User_class = 'Undesired_class';
Don’t forget to create indexes on your new tables
CREATE INDEX myidx ON Your_Table (myid);
A guide on Indexes is outside the scope of this post, see Efficient Use of PostgreSQL Indexes for more information.
TV watching habits from GSS data
Last week I went to a Data Visualization Meetup where Hadley Wickham gave a talk on visualization challenges. He went over the following points:
Eight visualization challenges
- Labeling plots: A problem ignored for too long
- Use the labs() function
- Dual axes: usually wrong, sometimes useful
- geom_label and ggrepel: labeling your data
- Ordering
- Visualizing missing values
- Why are histograms so hard?
- Bar charts & variations
- ggplot2 extensions from the community
My main takeaway was from the labeling challenge, particularly how useful geom_label_repel() is. If you have many points you’d like to annotate, the overlap can make the visualization illegible. This function takes the place of geom_text() and places the labels so they don’t overlap. Documentation can be found here.
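As a quick illustration (not from the talk), here is a minimal geom_label_repel() example on the built-in mtcars data:

library(ggplot2)
library(ggrepel)

# Labels are nudged automatically so they don't overlap the points
# or each other.
ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
  geom_point() +
  geom_label_repel(size = 3)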
One slide in particular made me want to do some data wrangling to learn more. The slide was about ordering, and the content was the relationship between religion and TV consumption. Hadley’s slide showed the mean for each religion, and I wanted to dig deeper into the distribution and the size of the groupings. The dataset can be found here: General Social Survey (GSS). I used data from 2000 – 2014.
R code
library(readxl)
library(tidyverse)
library(plotly)

# Set correct path
setwd("C:/Users/Matt/Downloads")

# Read data
GSS_Data <- read_excel("GSS.xls", sheet = "Data")

# Remove last two rows of data
GSS_Data <- head(GSS_Data, -2)
# Basic exploration
summary(GSS_Data)
str(GSS_Data)

# Rename columns
GSS_Data <- rename(GSS_Data,
                   year = `Gss year for this respondent `,
                   Religion = `Religion in which raised`,
                   id = `Respondent id number`,
                   `Work Category` = `R self-emp or works for somebody`)

# Change data types
GSS_Data$Religion <- as.factor(GSS_Data$Religion)
GSS_Data$`Hours per day watching tv` <- as.numeric(GSS_Data$`Hours per day watching tv`)
GSS_Data$`Work Category` <- as.factor(GSS_Data$`Work Category`)

# Select columns
Religion_TV <- select(GSS_Data, id, year, Religion, `Hours per day watching tv`)

# Remove NAs
Religion_TV <- Religion_TV[complete.cases(Religion_TV), ]

# Distinct religions
Religion_TV %>% select(Religion) %>% distinct()  # 15 religions

# Filter to look at 2014 data
tv_2014 <- filter(Religion_TV, year == 2014)

# Reorder religion based on mean TV watching time
# (Hadley used fct_reorder())
tv_2014$Religion <- with(tv_2014, reorder(Religion, `Hours per day watching tv`, mean))
# Create plot
ggplot(tv_2014, aes(Religion, `Hours per day watching tv`, fill = Religion)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("TV Watched per day by Religion \n in 2014")

# Get counts
tv_2014 %>%
  group_by(Religion) %>%
  summarise(avg_watched = mean(`Hours per day watching tv`),
            counts = n()) %>%
  arrange(desc(avg_watched))
We see that the ‘Don’t Know’ group in 2014 consisted of only one member, so we should not draw any conclusions from that group. Despite the small sample sizes, we notice that the Eastern religions watched less television than the Abrahamic religions in 2014. When looking at a larger time period (2000 – 2014), the ‘Don’t Know’ category jumps to the top of the list.
We also notice some atypical behavior: a few people report watching TV for the whole day. Let’s see how many people are watching TV for 24 hours.
Religion_TV %>%
  group_by(Religion) %>%
  filter(`Hours per day watching tv` == 24) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
We see 16 Protestants, 3 Catholics, 2 None, and 1 Don’t Know. I can’t tell whether that’s bad data or people leaving the TV on all day. Next, let’s look at how these habits changed over time.
Times <- Religion_TV %>%
  group_by(Religion, year) %>%
  summarise(`Avg TV watched` = mean(`Hours per day watching tv`),
            counts = n()) %>%
  filter(Religion %in% c('Protestant', 'Jewish', 'Moslem/islam', 'None')) %>%
  arrange(desc(`Avg TV watched`))

ab <- ggplot(Times, aes(x = year, y = `Avg TV watched`, color = Religion)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 4)) +
  scale_x_continuous(breaks = seq(2000, 2014, 2))

ggplotly(ab)
I’m using Plotly since the charts tend to pop more and can be interactive. I was then curious about who watches more TV: people who are self-employed or those employed by someone else.
worker <- GSS_Data %>%
  group_by(`Work Category`, year) %>%
  summarise(`Avg TV watched` = mean(`Hours per day watching tv`, na.rm = TRUE),
            counts = n()) %>%
  arrange(desc(`Avg TV watched`))

cd <- ggplot(worker, aes(x = year, y = `Avg TV watched`, color = `Work Category`)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 8)) +
  scale_x_continuous(breaks = seq(2000, 2014, 2))

ggplotly(cd)

GSS_Data %>%
  group_by(`Work Category`) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
Unsurprisingly, people whose work category is not applicable watch more TV than either type of employed respondent. The self-employed watch just slightly less TV than those employed by others. Last week was also Plotcon; hearing great talks from amazing speakers has me fired up to create more engaging R charts, do all my data munging in R, and keep relying on Excel charting less and less.
Data Janitor
What is a Data Janitor?
From Wikipedia: “A data janitor is a person who works to take big data and condense it into useful amounts of information. Also known as a “data wrangler,” a data janitor sifts through data for companies in the information technology industry. A multitude of start-ups rely on large amounts of data, so a data janitor works to help these businesses with this basic, but difficult process of interpreting data.”
Search LinkedIn or AngelList and you won’t see a single job posting for a data janitor. Why is that? A data scientist’s time is valuable, and estimates put 50–80% of it toward tiresome cleaning. I am surprised companies are not capitalizing on the trend of people wanting to become data scientists by hiring data janitors into entry-level positions.
Janitorial Cleaning Cart
When I first started using R, data manipulation was daunting, and I avoided using R packages. At first I would do the cleaning manually in Excel and then import the data back into R for charting and analysis. Then I found sqldf, which allowed me to perform SQL selects on R data frames. Since I was more comfortable writing SQL queries to filter and join, this was my tool of choice for a while. But sqldf() is not always the most efficient, especially with larger data frames.
# sqldf example
library(sqldf)
scoredist = sqldf("SELECT boro, score, COUNT(score) AS count
                   FROM df
                   WHERE boro <> 'Missing'
                   GROUP BY boro, score")
Hands-on dplyr tutorial for faster data manipulation in R
dplyr is “a grammar of data manipulation.” I had heard of dplyr and plyr for a while before I finally gave the package a shot, and now I wish I had started earlier.
Pandas
When using Python, the library of choice for data cleaning is pandas. Here are some basic pandas commands.
import pandas as pd

# View the head and summary statistics
df.head()
df.describe()

# SQL-like merge/join
df = pd.merge(yelp_df, Inspection_df, how='inner', on='Phone')

# Basic string contains
df['Me'][df['From'].str.contains("MyEmail@gmail.com")] = 1

# Drop rows with missing data
df.dropna(how='any')

# Get data with Ratings above 3
yelp_df[yelp_df.Rating > 3]
What am I actually cleaning for?
A clean room is not hard to imagine: sparkling floors and an organized desk. But what does clean data look like? In a tidy data set, each variable is saved in its own column and each observation in its own row. Some things to check for (a few are sketched in R after the list):
- Standardize formats
- Dates should follow the same syntax (e.g YYYY-MM-DD)
- Correct data types: should the columns be characters or factors?
- Is the currency used consistent and labeled?
- Trim extraneous spaces
- Label columns to remove ambiguity
- Remove duplicate records
- Sanity checking aggregation type (i.e. are you looking for a count or sum?)
- Check the length of variables (e.g. A zip code should have a length of 5)
- Sorting the data and checking for outliers (flag for removal)
- Missing Data
- Is the data complete (why does one month have a third as many data points as another?)
- NaN or blank rows and columns
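Here is a minimal sketch of a few of these checks in R; the data frame and column names (df, name, date_field, zip_code, value) are placeholders, not from a real dataset.

library(dplyr)
library(stringr)

# Hypothetical cleaning steps on a data frame `df`
df_clean <- df %>%
  mutate(
    date_field = as.Date(date_field, format = "%Y-%m-%d"),  # standardize dates
    name       = str_trim(name)                             # trim extraneous spaces
  ) %>%
  distinct() %>%                                            # remove duplicate records
  filter(nchar(zip_code) == 5)                              # keep only 5-digit zip codes

# Sort and eyeball the extremes for outliers
df_clean %>% arrange(desc(value)) %>% head()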
Don’t forget to automate when possible. I noticed an account manager spent an hour each week cleaning data in Excel to upload to the UI. Using Python I was able to write a script that automates the steps and saves her time each week. Data cleaning is not the most fun, and if the cost-benefit math works out, why not automate?
And no package that I know of can replace a fresh set of eyes looking at the data to see whether the values are reasonable. Sure, it’d be great if you had a million page views in a day, and that would make sense for some blogs, but you need to know your baselines. Garbage in, garbage out; as a data janitor, make sure to take the trash out.
Medicare payment comparisons
For this post I wanted to experiment with choropleth maps and interactive visuals. At my past company I could have used Tableau, which has built-in geo charts that are really pretty, but for a more open alternative I am using Plotly, which has an R integration I would have needed anyway to clean the data first.
A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita income.
The data comes from Enigma.io: Payment Comparisons by Hospital
Assumption made: for rows with an asterisk (*) under number_of_cases, I have substituted a 4.
An asterisk (*) appears in the table where data cannot be disclosed to protect personal health information due to the small number of Medicare patients (fewer than 11)
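A sketch of how that substitution could be done in R (the original cleaning may well have happened in the spreadsheet); assumes the raw sheet was read into a data frame `raw` with number_of_cases as text.

# Replace suppressed counts with 4, then convert the column to numeric
raw$number_of_cases[raw$number_of_cases == "*"] <- 4
raw$number_of_cases <- as.numeric(raw$number_of_cases)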
# R code
# Load libraries
library(devtools)
library(readxl)
library(dplyr)
library(plotly)

# Read in data
State <- read_excel("MedicareCompare.xlsx", sheet = "State")
Clean <- read_excel("MedicareCompare.xlsx", sheet = "Clean")

# Change data types
Clean$Name <- as.factor(Clean$Name)
Clean$State <- as.factor(Clean$State)
Clean$zip_code <- as.factor(Clean$zip_code)
Clean$county_name <- as.factor(Clean$county_name)
Clean$diagnosis_related_group <- as.factor(Clean$diagnosis_related_group)

# Explore the structure and head of the data
str(Clean)
head(Clean)

# Use dplyr to get the median payment, taking frequency into account
Clean2 <- Clean[rep(1:nrow(Clean), Clean$Frequency), ] %>%
  group_by(State) %>%
  summarise(Median = median(medicare_average_payment))
head(Clean2)

# Give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)

# Specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

plot_ly(Clean2, z = Median, locations = State, type = 'choropleth',
        locationmode = 'USA-states', color = Median, colors = 'Purples',
        marker = list(line = l), colorbar = list(title = "USD")) %>%
  layout(title = 'Median Medicare payment by State<br>(Hover for breakdown)',
         geo = g)
Due to the large right skew across diagnosis groups, I used the median instead of the mean to plot the above map. I find it interesting that Alaska has the highest median Medicare payment; that was unexpected, so let’s dig into it. For a few comparison states (AK, AL, KS, NY), I graphed the Medicare payment by diagnosis group, with frequency on a separate axis, sorted by payment. A sketch of the underlying aggregation is below.
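# A sketch (not the original code) of the per-state comparison data,
# reusing the `Clean` data frame and column names from the code above.
state_compare <- Clean %>%
  filter(State %in% c("AK", "AL", "KS", "NY")) %>%
  group_by(State, diagnosis_related_group) %>%
  summarise(avg_payment = mean(medicare_average_payment),
            frequency   = sum(Frequency)) %>%
  arrange(State, desc(avg_payment))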
Based on the information above, we see that the other comparison states have higher frequency in the lower-cost diagnostic groups, which brings down their median payment.
Below are some sorted tables to satisfy my own curiosity.
Ranked first by average payment and then by frequency, top and bottom 5.
The most expensive procedures are clustered around cardiac defibrillators with MCCs, and the most frequent procedures have to do with heart failure. The correlation between average payment and frequency was -0.3; it makes sense that the most common procedures are less expensive than the least common ones.
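The correlation itself can be checked with a quick sketch like the following, aggregating to one row per diagnosis group first (again reusing the column names from the code above):

by_drg <- Clean %>%
  group_by(diagnosis_related_group) %>%
  summarise(avg_payment = mean(medicare_average_payment),
            frequency   = sum(Frequency))

# Pearson correlation between average payment and frequency
cor(by_drg$avg_payment, by_drg$frequency)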
By Hospital
This one brought back some memories; I worked on the advertising strategy for the CTCA account for a year. Midwestern Region is also a branch of CTCA. For context, CTCA is a for-profit cancer hospital that generally did not take Medicare (at least as of three years ago). Its average was pulled up by the combination of low frequency and one large “kidney and urinary tract procedures w MCC” procedure.
Note: MCC Stands for Major Complications or Comorbidities
“A” for Awful?
Introduction
Walking around NYC looking at restaurants, I notice many more “A” inspection grades than B’s or C’s. I was curious what the distribution was, and what correlation, if any, exists with Yelp reviews.
Data Sets
The New York City Restaurant Inspection Results data set was exported from Enigma.io
This data is provided by the Department of Health and Mental Hygiene (DOHMH). The date range used was from 2013 to the start of April 2016. I filtered out pre-permit inspections, negative scores, and malformed phone numbers (mainly those containing “_” or having fewer than 7 digits). I then joined to the Yelp data on the condition that the restaurant had at least 25 Yelp reviews. In total I have a list of 10,114 restaurants in the NYC area. A sketch of this filtering and join is below.
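A rough sketch of that filtering and join in R; the data frame and column names here (inspections, yelp, phone, review_count) are placeholders, not the original preparation code.

library(dplyr)

restaurants <- inspections %>%
  filter(inspection_type != "Pre-permit",   # drop pre-permit inspections
         score >= 0,                        # drop negative scores
         !grepl("_", phone),                # drop malformed phone numbers
         nchar(phone) >= 7) %>%
  inner_join(yelp, by = "phone") %>%        # match on phone number
  filter(review_count >= 25)                # at least 25 Yelp reviews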
ID Mapping
I am using the Business Phone column as my unique identifier to connect with the Yelp API’s Phone Search Protocol.
Grade scale: “Restaurants with a score between 0 and 13 points earn an A, those with 14 to 27 points receive a B and those with 28 or more a C.”
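Those cut-offs translate into a simple mapping; here is a quick R sketch:

# Map an inspection score to its letter grade per the cut-offs above
score_to_grade <- function(score) {
  ifelse(score <= 13, "A",
         ifelse(score <= 27, "B", "C"))
}

score_to_grade(c(0, 13, 14, 27, 28))
# "A" "A" "B" "B" "C"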
Exploratory Analysis:
We see a large step decrease across all boroughs between scores of 13 and 14, which is also the cut-off between an A and a B. I was curious why this occurs, as the rest of the distribution is smoother. Any restaurant initially falling short of an A gets a repeat visit within two to three weeks, which allows it to improve its food safety practices. This is very likely the reason for the strong right skew in the data.
The jump between an A and a B is highlighted above. There are about 3x as many scores of 13, the cut-off for an A, as scores of 14. No such jump appears at the cut-off between a B and a C.
By Boro
The distributions across the boroughs are fairly similar, except for Staten Island, which over-indexes on restaurants with a C grade.
What surprised me most was what I learned from the How we score and grade document the DOH put out.
- A public health hazard, such as failing to keep food at the right temperature, triggers a minimum of 7 points. If the violation can’t be corrected before the inspection ends, the Health Department may close the restaurant until it’s fixed.
- A critical violation, for example serving raw food such as a salad without properly washing it first, carries a minimum of 5 points.
So a restaurant can have mice and food at unsafe temperatures and still earn an A! Based on this, I really hope “critical” violations will trigger a minimum of 14 points in the future.
The correlation (a mutual relationship or connection between two or more things) between Yelp ratings and health inspection scores is weak and negative (-0.03). In other words, the higher the Yelp rating, the lower the health inspection score, and remember that a lower score is better. But the correlation coefficient is so small that, even with this large sample, the relationship is practically negligible.
I believe this comes from the false assumption that a higher Yelp rating implies a better health score: service is not reflected in health inspections, nor is price, and a restaurant can be squeaky clean and still serve tasteless food. My takeaway was to create quadrants for the different Yelp rating and health score pairs (a sketch of the assignment follows the list below):
Q1 High scores and Ratings (Maybe best not to know)
Q2 Low Scores and High Ratings (Crème de la crème)
Q3 Low scores and Ratings (Clean but bland)
Q4 High Scores, Low Ratings (Run don’t walk)
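A sketch of that quadrant assignment with dplyr; the joined `restaurants` data frame and its `Rating`/`Score` column names, as well as the 4-star split, are assumptions.

library(dplyr)

restaurants <- restaurants %>%
  mutate(Quadrant = case_when(
    Score >  13 & Rating >= 4 ~ "Q1: high score, high rating",
    Score <= 13 & Rating >= 4 ~ "Q2: low score, high rating",
    Score <= 13 & Rating <  4 ~ "Q3: low score, low rating",
    TRUE                      ~ "Q4: high score, low rating"
  ))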
Some Q2 Listings (Score less than 5 so no mice):
- Perfect Picnic NYC
- D’Amore Winebar & Ristorante
- Da Capo
- Naturally Delicious
- L’industrie Pizzeria
- Aux Merveilleux De Fred
- Peasant Stock
I am not including names of restaurants from other Quadrants, but the data is open, so please explore and let me know of any cool findings.
Inbox Analysis
Just another day staring at my inbox, attempting to get to inbox zero. I noticed that Tim Ferriss and Ramit Sethi sent out email blasts within minutes of each other, which got me thinking: “When do I get the most emails?” Those two are pretty big on A/B testing and quantitative marketing, so I wanted to see if there were any patterns in my inbox.
I started out messing around with the Gmail API, but after a few misfires I did some research and found I did not have to reinvent the wheel. You can export your Google data using Google Takeout; the process took about 24 hours and left me with a 2.8 GB MBOX file of all the information associated with my Gmail account.
I wanted to look at the following information, “To,” “From,” and “Date.”
Using Python I could parse the MBOX file into a more malleable csv
import mailbox
import csv
from email.utils import parsedate_tz

mbox = mailbox.mbox('Matt_mail.mbox')

outputFile = open('rawcsv_412.csv', 'wb')
outputWriter = csv.writer(outputFile)

for message in mbox:
    outputWriter.writerow([message['to'], message['from'], message['date'],
                           parsedate_tz(message['date'])])

outputFile.close()
The date field in an MBOX file is hard to work with, so I used parsedate_tz from email.utils. This is the point where I used some Excel magic to clean up the CSV a bit (yes, this could have been 100% Python, but I know I can do this in Excel much faster). I removed the parentheses around the parsed date and used the text-to-columns function to separate out the datetime information.
Since string manipulation is easier in pandas, I ran the following script for further cleaning:
import pandas as pd
import numpy as np
from pandas import ExcelWriter

data = pd.ExcelFile('Email_Draft.xlsx')
df = data.parse('rawcsv_412')
print len(df.index)

# Clean out nulls in the To column
df_1 = df[df.To.notnull()]
print len(df_1.index)

# Create new columns for selected senders
df_1['Me'] = 0
df_1['Ramit'] = 0
df_1['Tim'] = 0
df_1['Insider'] = 0
df_1['Me'][df_1['From'].str.contains("MyEmail@gmail.com")] = 1
df_1['Ramit'][df_1['From'].str.contains("ramit@iwillteachyoutoberich.com")] = 1
df_1['Tim'][df_1['From'].str.contains("tim@fourhourbody.com")] = 1
df_1['Insider'][df_1['From'].str.contains("newsletter@businessinsider.com")] = 1

# Concatenate in Python; in the future I'm sticking with Excel, but hey, why not
def concat(*args):
    strs = [str(arg) for arg in args if not pd.isnull(arg)]
    return '/'.join(strs) if strs else np.nan

np_concat = np.vectorize(concat)
df_1['ShortDate'] = np_concat(df_1['Month'], df_1['Date'], df_1['Year'])

# Write to a new Excel file
writer = ExcelWriter('PythonExport.xlsx')
df_1.to_excel(writer, 'Sheet1')
writer.save()
Then add a day-of-week column with =TEXT(Cell, "DDDD").
Now I had a spreadsheet that looked like the following.
After some sanity checks, it’s time for some exploratory analysis.
Looks like getting to Inbox Zero is getting harder every year!
We see a clear trend: Ramit sends 41% of his emails at 11 AM EST, while the all-email average for that hour is ~7%. Does this mean we should all send emails at 11? Not necessarily; look at Business Insider, which primarily sends emails at 4 PM EST. Different email categories are better suited to different times: emails with action steps may be better earlier in the day, while easy-to-digest articles might be better near the end of the work day. And without click-through-rate data from multiple people, I can’t give an accurate answer on the best time to send emails.
Then I got curious about who is sending the most emails over time:
Thankfully I turned off Facebook email notifications in 2010, and if anyone is familiar with chapterspot, you probably have memories of long frat email chains about nothing. As expected, emails are concentrated during typical work hours and fall off during nights and weekends. From the heatmap below, I see the weekend starts at 9 PM on Friday.
I use Unroll.me to clean up my inbox; the total email count stays the same, but the amount of time spent in Gmail goes way down. Also, Immersion by MIT has a pretty cool visual based on your email metadata, graphing your email network and showing whom you message most.