The One with the Ratings
Last summer I binged Friends on Netflix: three episodes on a light day, up to a no-shame ten-episode vegfest. Chandler is one of my favorite characters. When people ask what I do, my default answer is “statistical analysis and data reconfiguration,” or, you know, a transponster.
Graph TV is a site that allows one to visualize IMDB ratings of various shows, or you can use ggplot/plotly to create a custom graphic.
IMDB ratings
library(ggplot2)
library(ggrepel)
library(dplyr)

ggplot(combinedrating, aes(x = EpNumber, y = UserRating, color = as.factor(Season))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Episode #", y = "Average Rating",
       title = "Friends Episode IMDB Ratings by Season") +
  # Mark the clip shows with open black circles
  geom_point(data = clips, aes(x = EpNumber, y = UserRating),
             color = "black", shape = 1, size = 5) +
  # Label only the very best and worst episodes
  geom_text_repel(data = combinedrating %>% filter(UserRating > 9.4 | UserRating < 8),
                  aes(x = EpNumber, y = UserRating, color = as.factor(Season), label = Name)) +
  ylim(c(5, 10))
Immediately I notice that the lowest-rated episodes are clip shows, which are marked with a black circle, the exception being ‘The One with the Donor.’ Not even John Stamos and Hank Azaria could save that episode. The series also ended at a good time; quality was starting to trend downward in seasons 8 and 9.
I then combined the IMDB ratings with the Nielsen ratings to look at the connection between ratings and viewers. No surprise that season finales have high ratings and viewership compared to clip shows. While clip shows can be useful for cutting costs, they can hurt viewership (and thus ad dollars) for the following episode despite no change in its quality. This raises the question: should syndication skip clip shows in the rotation to keep viewers?
Interactive Plotly Chart
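The interactive version was built with plotly. Here is a minimal sketch (not the original code), assuming combinedrating also carries the Nielsen Viewers column (in millions) used in the comparison:

library(plotly)

# Wrap a ggplot in ggplotly() to get hover tooltips showing the
# episode name alongside rating and viewership.
gg <- ggplot(combinedrating,
             aes(x = UserRating, y = Viewers,
                 color = as.factor(Season), text = Name)) +
  geom_point() +
  labs(x = "Average IMDB Rating", y = "Viewers (millions)", color = "Season")

ggplotly(gg, tooltip = c("text", "x", "y"))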
Cross correlation function (CCF)
ccf(combinedrating$Viewers, combinedrating$UserRating, lag.max = 3, plot = TRUE, main = "Viewers and Rating CCF")
A CCF shows the correlation between two variables at different lags, revealing how their relationship plays out over time. The most significant correlations fall within +/- 1 episode, which makes intuitive sense: the quality of an episode three weeks away has little relationship with this week’s viewership. Viewership also responds more to the previous episode’s quality than to the next one’s, seen in the CCF being higher at lag -1 than at +1.
A Sequel to SQL
An Introduction to SQL should give you a primer on the “Q” part of SQL: how to query. This post is a SQL sequel: how to add and alter data.
Top level review of basic SQL structure
SELECT      The columns you are SELECTing
FROM        The name of the table FROM which the data comes
JOIN ... ON JOINing a second table based ON a shared column {optional}
WHERE       Condition the selection should follow {optional}
GROUP BY    Grouping non-aggregated data {optional}
HAVING      The WHERE equivalent for aggregated selections {optional}
ORDER BY    You will never guess what this does {optional}
LIMIT       LIMITs the number of rows {optional}
One could go years as an analyst only reading data and never writing to the database. Personally, I was anxious when I first got write access, having heard stories of people accidentally deleting tables (most are hoaxes) or a wrong keystroke costing millions. Eventually you will find work easier when you can create and alter your own tables; a common example is aggregated tables.
Let’s Begin our SQL Sequel
For this tutorial I am using PostgreSQL 9.4.7. To check your version, you can use the following command.
select version()
Creating New Tables
To create a table from existing data, use the “CREATE TABLE ... AS” command.
CREATE TABLE KPI_agg AS (
  SELECT date_trunc('month', date_field) AS Month
       , User_class
       , COUNT(myid) AS Count_field
       , AVG(KPI) AS AVG_KPI
  FROM Your_Table
  GROUP BY 1, 2
);
Or, if the data does not already exist in your database, create a table to load from a CSV.
CREATE TABLE KPI_agg (
  Month timestamp,
  User_class varchar,
  Count_field bigint,
  AVG_KPI numeric
);
There are several data types to choose from; see the Postgres datatypes documentation. Varchars should be avoided when possible, but sometimes they are your only option. Changing a column’s type can be done with the ALTER command.
ALTER TABLE KPI_agg ALTER COLUMN User_class TYPE text USING (User_class::text);
After creating an empty table you can upload the CSV.
\copy KPI_agg from 'filelocation/datatoupload.csv' DELIMITER ',' CSV HEADER;
The above line copies data from your local computer, using a comma delimiter, stored as a .csv with a header row. Some SQL clients don’t allow a \copy command, so you may need to use psql or a SQL shell.
New Columns
To add a new column, use the ALTER and ADD commands.
ALTER TABLE KPI_agg ADD COLUMN NEW_Class text;
Then you can SET this column’s information based on existing data.
UPDATE KPI_agg SET NEW_Class = 'New Name' WHERE User_class = 'Old Name'
Later on, you can update the table by only inserting data that meets a certain condition
INSERT INTO KPI_agg (Month, User_class, Count_field, AVG_KPI)
(
  SELECT date_trunc('month', date_field) AS Month
       , User_class
       , COUNT(myid) AS Count_field
       , AVG(KPI) AS AVG_KPI
  FROM Your_Table
  WHERE date_field::date >= '2017-03-01'
    AND date_field::date <= '2017-03-31'
  GROUP BY 1, 2
);
Removing Data
Drop the entire table
DROP TABLE KPI_agg;
Delete rows that match a condition
DELETE FROM KPI_agg where User_class = 'Undesired_class';
Don’t forget to create indexes on your new tables
CREATE INDEX myidx ON Your_Table (myid);
A guide on Indexes is outside the scope of this post, see Efficient Use of PostgreSQL Indexes for more information.
TV watching habits from GSS data
Last week I went to a Data Visualization Meetup where Hadley Wickham gave a talk on visualization challenges. He went over the following points:
Eight visualization challenges
- Labeling plots: A problem ignored for too long
- Use the labs() function
- Dual axes: usually wrong, sometimes useful
- geom_label and ggrepel: labeling your data
- Ordering
- Visualizing missing values
- Why are histograms so hard?
- Bar charts & variations
- ggplot2 extensions from the community
My main takeaway was from the labeling challenge, particularly how useful geom_label_repel() is. If you have many points you’d like to annotate, the overlap can make the visualization illegible. This function takes the place of geom_text() and places the labels so they don’t overlap. Documentation can be found here.
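As a quick illustration (not from the talk), here is a minimal geom_label_repel() example on the built-in mtcars data:

library(ggplot2)
library(ggrepel)

# Labels are nudged automatically so they don't overlap the points
# or each other.
ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
  geom_point() +
  geom_label_repel(size = 3)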
One slide in particular made me want to do some data wrangling to learn more. The slide was about ordering, and the content was the relationship between religion and TV consumption. Hadley’s slide showed the mean for each religion, and I wanted to dig deeper into the distribution and the size of the groupings. The dataset can be found here: General Social Survey (GSS). I used data from 2000 – 2014.
R code
library(readxl)
library(tidyverse)
library(plotly)

# Set correct path
setwd("C:/Users/Matt/Downloads")

# Read data
GSS_Data <- read_excel("GSS.xls", sheet = "Data")

# Remove last two rows of data
GSS_Data <- head(GSS_Data, -2)
# Basic exploration
summary(GSS_Data)
str(GSS_Data)

# Rename columns
GSS_Data <- rename(GSS_Data,
                   year = `Gss year for this respondent `,
                   Religion = `Religion in which raised`,
                   id = `Respondent id number`,
                   `Work Category` = `R self-emp or works for somebody`)

# Change data types
GSS_Data$Religion <- as.factor(GSS_Data$Religion)
GSS_Data$`Hours per day watching tv` <- as.numeric(GSS_Data$`Hours per day watching tv`)
GSS_Data$`Work Category` <- as.factor(GSS_Data$`Work Category`)

# Select columns
Religion_TV <- select(GSS_Data, id, year, Religion, `Hours per day watching tv`)

# Remove NAs
Religion_TV <- Religion_TV[complete.cases(Religion_TV), ]

# Distinct religions
Religion_TV %>% select(Religion) %>% distinct()  # 15 religions

# Filter to look at 2014 data
tv_2014 <- filter(Religion_TV, year == 2014)

# Reorder religion based on mean TV watching time
# (Hadley used fct_reorder())
tv_2014$Religion <- with(tv_2014, reorder(Religion, `Hours per day watching tv`, mean))
# Create plot
ggplot(tv_2014, aes(Religion, `Hours per day watching tv`, fill = Religion)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("TV Watched per day by Religion \n in 2014")

# Get counts
tv_2014 %>%
  group_by(Religion) %>%
  summarise(avg_watched = mean(`Hours per day watching tv`),
            counts = n()) %>%
  arrange(desc(avg_watched))
We see that the ‘Don’t Know’ group in 2014 consisted of only one member, so we should not draw any conclusions from that group. Despite the small sample sizes, we notice that the Eastern religions watched less television than the Abrahamic religions in 2014. When looking at a larger time period (2000 – 2014), the ‘Don’t Know’ category jumps to the top of the list.
We also notice some atypical behavior: a few people report watching TV for the whole day. Let’s see how many people are watching TV for 24 hours.
Religion_TV %>%
  group_by(Religion) %>%
  filter(`Hours per day watching tv` == 24) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
We see 16 Protestants, 3 Catholics, 2 None, and 1 Don’t Know. I can’t tell whether that’s bad data or people leaving the TV on all day. Next, let’s look at how these habits changed over time.
Times <- Religion_TV %>%
  group_by(Religion, year) %>%
  summarise(`Avg TV watched` = mean(`Hours per day watching tv`),
            counts = n()) %>%
  filter(Religion %in% c('Protestant', 'Jewish', 'Moslem/islam', 'None')) %>%
  arrange(desc(`Avg TV watched`))

ab <- ggplot(Times, aes(x = year, y = `Avg TV watched`, color = Religion)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 4)) +
  scale_x_continuous(breaks = seq(2000, 2014, 2))

ggplotly(ab)
I’m using Plotly since the charts tend to pop more and can be interactive. I was then curious about who watches more TV: people who are self-employed or those employed by someone else.
worker <- GSS_Data %>%
  group_by(`Work Category`, year) %>%
  summarise(`Avg TV watched` = mean(`Hours per day watching tv`, na.rm = TRUE),
            counts = n()) %>%
  arrange(desc(`Avg TV watched`))

cd <- ggplot(worker, aes(x = year, y = `Avg TV watched`, color = `Work Category`)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 8)) +
  scale_x_continuous(breaks = seq(2000, 2014, 2))

ggplotly(cd)

GSS_Data %>%
  group_by(`Work Category`) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
Unsurprisingly, people whose work category is not applicable watch more TV than either type of employed respondent. The self-employed watch just slightly less TV than those employed by others. Last week was also Plotcon; hearing great talks from amazing speakers has me fired up to create more engaging R charts, do all my data munging in R, and keep relying on Excel charting less and less.
Data Janitor
What is a Data Janitor?
From Wikipedia: “A data janitor is a person who works to take big data and condense it into useful amounts of information. Also known as a “data wrangler,” a data janitor sifts through data for companies in the information technology industry. A multitude of start-ups rely on large amounts of data, so a data janitor works to help these businesses with this basic, but difficult process of interpreting data.”
Search LinkedIn or AngelList and you won’t see a single job posting for a data janitor. Why is that? A data scientist’s time is valuable, and estimates put 50–80% of it toward tiresome cleaning. I am surprised companies are not capitalizing on the trend of people wanting to become data scientists by hiring data janitors into entry-level positions.
Janitorial Cleaning Cart
When I first started using R, data manipulation was daunting, and I avoided using R packages. At first I would do the cleaning manually in Excel and then import the data back into R for charting and analysis. Then I found sqldf, which allowed me to perform SQL selects on R data frames. Since I was more comfortable writing SQL queries to filter and join, this was my tool of choice for a while. But sqldf() is not always the most efficient, especially with larger data frames.
# sqldf example
library(sqldf)
scoredist = sqldf("SELECT boro, score, COUNT(score) AS count
                   FROM df
                   WHERE boro <> 'Missing'
                   GROUP BY boro, score")
Hands-on dplyr tutorial for faster data manipulation in R
dplyr is “a grammar of data manipulation.” I had heard of dplyr and plyr for a while before I finally gave the package a shot, and now I wish I had started earlier.
Pandas
When using Python, the library of choice for data cleaning is pandas. Here are some basic pandas commands.
import pandas as pd

# View the head and summary statistics
df.head()
df.describe()

# SQL-like merge/join
df = pd.merge(yelp_df, Inspection_df, how='inner', on='Phone')

# Basic string contains
df['Me'][df['From'].str.contains("MyEmail@gmail.com")] = 1

# Drop rows with missing data
df.dropna(how='any')

# Get data with Ratings above 3
yelp_df[yelp_df.Rating > 3]
What am I actually cleaning for?
A clean room is not hard to imagine: sparkling floors and an organized desk. But what does clean data look like? In a tidy data set, each variable is saved in its own column and each observation in its own row. Some things to check for (a few are sketched in R after the list):
- Standardize formats
- Dates should follow the same syntax (e.g YYYY-MM-DD)
- Correct data types: should the columns be characters or factors?
- Is the currency used consistent and labeled?
- Trim extraneous spaces
- Label columns to remove ambiguity
- Remove duplicate records
- Sanity checking aggregation type (i.e. are you looking for a count or sum?)
- Check the length of variables (e.g. A zip code should have a length of 5)
- Sorting the data and checking for outliers (flag for removal)
- Missing Data
- Is the data complete (why does one month have a third as many data points as another?)
- NaN or blank rows and columns
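Here is a minimal sketch of a few of these checks in R; the data frame and column names (df, name, date_field, zip_code, value) are placeholders, not from a real dataset.

library(dplyr)
library(stringr)

# Hypothetical cleaning steps on a data frame `df`
df_clean <- df %>%
  mutate(
    date_field = as.Date(date_field, format = "%Y-%m-%d"),  # standardize dates
    name       = str_trim(name)                             # trim extraneous spaces
  ) %>%
  distinct() %>%                                            # remove duplicate records
  filter(nchar(zip_code) == 5)                              # keep only 5-digit zip codes

# Sort and eyeball the extremes for outliers
df_clean %>% arrange(desc(value)) %>% head()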
Don’t forget to automate when possible. I noticed an account manager spent an hour each week cleaning data in Excel to upload to the UI. Using Python I was able to write a script that automates the steps and saves her time each week. Data cleaning is not the most fun, and if the cost-benefit math works out, why not automate?
And no package that I know of can replace a fresh set of eyes looking at the data to see whether the values are reasonable. Sure, it’d be great if you had a million page views in a day, and that would make sense for some blogs, but you need to know your baselines. Garbage in, garbage out; as a data janitor, make sure to take the trash out.
Medicare payment comparisons
For this post I wanted to experiment with choropleth maps and interactive visuals. At my past company I could have used Tableau, which has built-in geo charts that are really pretty, but for a more open alternative I am using Plotly, which has an R integration I would have needed anyway to clean the data first.
A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita income.
The data comes from Enigma.io: Payment Comparisons by Hospital
Assumption made: for rows with an asterisk (*) under number_of_cases, I have substituted a 4.
An asterisk (*) appears in the table where data cannot be disclosed to protect personal health information due to the small number of Medicare patients (fewer than 11)
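A sketch of how that substitution could be done in R (the original cleaning may well have happened in the spreadsheet); assumes the raw sheet was read into a data frame `raw` with number_of_cases as text.

# Replace suppressed counts with 4, then convert the column to numeric
raw$number_of_cases[raw$number_of_cases == "*"] <- 4
raw$number_of_cases <- as.numeric(raw$number_of_cases)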
# R code
# Load libraries
library(devtools)
library(readxl)
library(dplyr)
library(plotly)

# Read in data
State <- read_excel("MedicareCompare.xlsx", sheet = "State")
Clean <- read_excel("MedicareCompare.xlsx", sheet = "Clean")

# Change data types
Clean$Name <- as.factor(Clean$Name)
Clean$State <- as.factor(Clean$State)
Clean$zip_code <- as.factor(Clean$zip_code)
Clean$county_name <- as.factor(Clean$county_name)
Clean$diagnosis_related_group <- as.factor(Clean$diagnosis_related_group)

# Explore the structure and head of the data
str(Clean)
head(Clean)

# Use dplyr to get the median payment, taking frequency into account
Clean2 <- Clean[rep(1:nrow(Clean), Clean$Frequency), ] %>%
  group_by(State) %>%
  summarise(Median = median(medicare_average_payment))
head(Clean2)

# Give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)

# Specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

plot_ly(Clean2, z = Median, locations = State, type = 'choropleth',
        locationmode = 'USA-states', color = Median, colors = 'Purples',
        marker = list(line = l), colorbar = list(title = "USD")) %>%
  layout(title = 'Median Medicare payment by State<br>(Hover for breakdown)',
         geo = g)
Due to the large right skew across diagnosis groups, I used the median instead of the mean to plot the above map. I find it interesting that Alaska has the highest median Medicare payment; that was unexpected, so let’s dig into it. For a few comparison states (AK, AL, KS, NY), I graphed the Medicare payment by diagnosis group, with frequency on a separate axis, sorted by payment. A sketch of the underlying aggregation is below.
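# A sketch (not the original code) of the per-state comparison data,
# reusing the `Clean` data frame and column names from the code above.
state_compare <- Clean %>%
  filter(State %in% c("AK", "AL", "KS", "NY")) %>%
  group_by(State, diagnosis_related_group) %>%
  summarise(avg_payment = mean(medicare_average_payment),
            frequency   = sum(Frequency)) %>%
  arrange(State, desc(avg_payment))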
Based on the information above, we see that the other comparison states have higher frequency in the lower-cost diagnostic groups, which brings down their median payment.
Below are some sorted tables to satisfy my own curiosity.
Ranked first by average payment and then by frequency, top and bottom 5.
The most expensive procedures are clustered around cardiac defibrillators with MCCs, and the most frequent procedures have to do with heart failure. The correlation between average payment and frequency was -0.3; it makes sense that the most common procedures are less expensive than the least common ones.
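The correlation itself can be checked with a quick sketch like the following, aggregating to one row per diagnosis group first (again reusing the column names from the code above):

by_drg <- Clean %>%
  group_by(diagnosis_related_group) %>%
  summarise(avg_payment = mean(medicare_average_payment),
            frequency   = sum(Frequency))

# Pearson correlation between average payment and frequency
cor(by_drg$avg_payment, by_drg$frequency)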
By Hospital
This one brought back some memories; I worked on the advertising strategy for the CTCA account for a year. Midwestern Region is also a branch of CTCA. For context, CTCA is a for-profit cancer hospital that generally did not take Medicare (at least as of three years ago). Its average was pulled up by the combination of low frequency and one large “kidney and urinary tract procedures w MCC” procedure.
Note: MCC Stands for Major Complications or Comorbidities
“A” for Awful?
Introduction
Walking around NYC looking at restaurants, I notice many more “A” inspection grades than B’s or C’s. I was curious what the distribution was, and what correlation, if any, exists with Yelp reviews.
Data Sets
The New York City Restaurant Inspection Results data set was exported from Enigma.io
This data is provided by the Department of Health and Mental Hygiene (DOHMH). The date range used was from 2013 to the start of April 2016. I filtered out pre-permit inspections, negative scores, and malformed phone numbers (mainly those containing “_” or having fewer than 7 digits). I then joined to the Yelp data on the condition that the restaurant had at least 25 Yelp reviews. In total I have a list of 10,114 restaurants in the NYC area. A sketch of this filtering and join is below.
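A rough sketch of that filtering and join in R; the data frame and column names here (inspections, yelp, phone, review_count) are placeholders, not the original preparation code.

library(dplyr)

restaurants <- inspections %>%
  filter(inspection_type != "Pre-permit",   # drop pre-permit inspections
         score >= 0,                        # drop negative scores
         !grepl("_", phone),                # drop malformed phone numbers
         nchar(phone) >= 7) %>%
  inner_join(yelp, by = "phone") %>%        # match on phone number
  filter(review_count >= 25)                # at least 25 Yelp reviews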
ID Mapping
I am using the Business Phone column as my unique identifier to connect with the Yelp API’s Phone Search Protocol.
Grade scale: “Restaurants with a score between 0 and 13 points earn an A, those with 14 to 27 points receive a B and those with 28 or more a C.”
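Those cut-offs translate into a simple mapping; here is a quick R sketch:

# Map an inspection score to its letter grade per the cut-offs above
score_to_grade <- function(score) {
  ifelse(score <= 13, "A",
         ifelse(score <= 27, "B", "C"))
}

score_to_grade(c(0, 13, 14, 27, 28))
# "A" "A" "B" "B" "C"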
Exploratory Analysis:
We see a large step decrease across all boroughs between scores of 13 and 14, which is also the cut-off between an A and a B. I was curious why this occurs, as the rest of the distribution is smoother. Any restaurant initially falling short of an A gets a repeat visit within two to three weeks, which allows it to improve its food safety practices. This is very likely the reason for the strong right skew in the data.
The jump between an A and a B is highlighted above. There are about 3x as many scores of 13, the cut-off for an A, as scores of 14. No such jump appears at the cut-off between a B and a C.
By Boro
The distributions across the boroughs are fairly similar, except for Staten Island, which over-indexes on restaurants with a C grade.
What surprised me most was what I learned from the How we score and grade document the DOH put out.
- A public health hazard, such as failing to keep food at the right temperature, triggers a minimum of 7 points. If the violation can’t be corrected before the inspection ends, the Health Department may close the restaurant until it’s fixed.
- A critical violation, for example serving raw food such as a salad without properly washing it first, carries a minimum of 5 points.
So a restaurant can have mice and food at unsafe temperatures and still earn an A! Based on this, I really hope “critical” violations will trigger a minimum of 14 points in the future.
The correlation (a mutual relationship or connection between two or more things) between Yelp ratings and health inspection scores is weak and negative (-0.03). In other words, the higher the Yelp rating, the lower the health inspection score, and remember that a lower score is better. But the correlation coefficient is so small that, even with this large sample, the relationship is practically negligible.
I believe this comes from the false assumption that a higher Yelp rating implies a better health score: service is not reflected in health inspections, nor is price, and a restaurant can be squeaky clean and still serve tasteless food. My takeaway was to create quadrants for the different Yelp rating and health score pairs (a sketch of the assignment follows the list below):
Q1 High scores and Ratings (Maybe best not to know)
Q2 Low Scores and High Ratings (Crème de la crème)
Q3 Low scores and Ratings (Clean but bland)
Q4 High Scores, Low Ratings (Run don’t walk)
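A sketch of that quadrant assignment with dplyr; the joined `restaurants` data frame and its `Rating`/`Score` column names, as well as the 4-star split, are assumptions.

library(dplyr)

restaurants <- restaurants %>%
  mutate(Quadrant = case_when(
    Score >  13 & Rating >= 4 ~ "Q1: high score, high rating",
    Score <= 13 & Rating >= 4 ~ "Q2: low score, high rating",
    Score <= 13 & Rating <  4 ~ "Q3: low score, low rating",
    TRUE                      ~ "Q4: high score, low rating"
  ))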
Some Q2 Listings (Score less than 5 so no mice):
- Perfect Picnic NYC
- D’Amore Winebar & Ristorante
- Da Capo
- Naturally Delicious
- L’industrie Pizzeria
- Aux Merveilleux De Fred
- Peasant Stock
I am not including names of restaurants from other Quadrants, but the data is open, so please explore and let me know of any cool findings.
Inbox Analysis
Just another day staring at my inbox, attempting to get to inbox zero. I noticed that Tim Ferriss and Ramit Sethi sent out email blasts within minutes of each other, which got me thinking: “When do I get the most emails?” Those two are pretty big on A/B testing and quantitative marketing, so I wanted to see if there were any patterns in my inbox.
I started out messing around with the Gmail API, but after a few misfires I did some research and found I did not have to reinvent the wheel. You can export your Google data using Google Takeout; the process took about 24 hours and left me with a 2.8 GB MBOX file of all the information associated with my Gmail account.
I wanted to look at the following information, “To,” “From,” and “Date.”
Using Python I could parse the MBOX file into a more malleable csv
import mailbox
import csv
from email.utils import parsedate_tz

mbox = mailbox.mbox('Matt_mail.mbox')

outputFile = open('rawcsv_412.csv', 'wb')
outputWriter = csv.writer(outputFile)

for message in mbox:
    outputWriter.writerow([message['to'], message['from'], message['date'],
                           parsedate_tz(message['date'])])

outputFile.close()
The date field in an MBOX file is hard to work with, so I used parsedate_tz from email.utils. This is the point where I used some Excel magic to clean up the CSV a bit (yes, this could have been 100% Python, but I know I can do this in Excel much faster). I removed the parentheses around the parsed date and used the text-to-columns function to separate out the datetime information.
Since string manipulation is easier in pandas, I ran the following script for further cleaning:
import pandas as pd
import numpy as np
from pandas import ExcelWriter

data = pd.ExcelFile('Email_Draft.xlsx')
df = data.parse('rawcsv_412')
print len(df.index)

# Clean out nulls in the To column
df_1 = df[df.To.notnull()]
print len(df_1.index)

# Create new columns for selected senders
df_1['Me'] = 0
df_1['Ramit'] = 0
df_1['Tim'] = 0
df_1['Insider'] = 0
df_1['Me'][df_1['From'].str.contains("MyEmail@gmail.com")] = 1
df_1['Ramit'][df_1['From'].str.contains("ramit@iwillteachyoutoberich.com")] = 1
df_1['Tim'][df_1['From'].str.contains("tim@fourhourbody.com")] = 1
df_1['Insider'][df_1['From'].str.contains("newsletter@businessinsider.com")] = 1

# Concatenate in Python; in the future I'm sticking with Excel, but hey, why not
def concat(*args):
    strs = [str(arg) for arg in args if not pd.isnull(arg)]
    return '/'.join(strs) if strs else np.nan

np_concat = np.vectorize(concat)
df_1['ShortDate'] = np_concat(df_1['Month'], df_1['Date'], df_1['Year'])

# Write to a new Excel file
writer = ExcelWriter('PythonExport.xlsx')
df_1.to_excel(writer, 'Sheet1')
writer.save()
Then add a day-of-week column with =TEXT(Cell, "DDDD").
Now I had a spreadsheet that looked like the following.
After some sanity checks, it’s time for some exploratory analysis.
Looks like getting to Inbox Zero is getting harder every year!
We see a clear trend: Ramit sends 41% of his emails at 11 AM EST, while the all-email average for that hour is ~7%. Does this mean we should all send emails at 11? Not necessarily; look at Business Insider, which primarily sends emails at 4 PM EST. Different email categories are better suited to different times: emails with action steps may be better earlier in the day, while easy-to-digest articles might be better near the end of the work day. And without click-through-rate data from multiple people, I can’t give an accurate answer on the best time to send emails.
Then I got curious about who is sending the most emails over time:
Thankfully I turned off Facebook email notifications in 2010, and if anyone is familiar with chapterspot, you probably have memories of long frat email chains about nothing. As expected, emails are concentrated during typical work hours and fall off during nights and weekends. From the heatmap below, I see the weekend starts at 9 PM on Friday.
I use Unroll.me to clean up my inbox; the total email count stays the same, but the amount of time spent in Gmail goes way down. Also, Immersion by MIT has a pretty cool visual based on your email metadata, graphing your email network and showing whom you message most.