TV watching habits from GSS data
Last week I went to a Data Visualization Meetup where Hadley Wickham gave a talk on visualization challenges. He went over the following points:
Eight visualization challenges
- Labeling plots: A problem ignored for too long
  - Use the labs() function
- Dual axes: usually wrong, sometimes useful
- geom_label and ggrepel: labeling your data
- Ordering
- Visualizing missing values
- Why are histograms so hard?
- Bar charts & variations
- ggplot2 extensions from the community
My main takeaway was from the labeling challenge, particularly how useful geom_label_repel() from the ggrepel package is. If you have many points you’d like to annotate, overlapping labels make the visualization illegible. This function takes the place of geom_text() or geom_label() and positions the labels so that they don’t overlap. Documentation can be found here.
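As a minimal sketch of the two labeling points above, the following combines labs() with geom_label_repel(). It uses the built-in mtcars data rather than anything from the talk, and the title and axis text are my own placeholders.

```r
library(ggplot2)
library(ggrepel)

# Built-in mtcars data stands in for any scatter plot with many labeled points
cars <- mtcars
cars$model <- rownames(mtcars)

p <- ggplot(cars, aes(x = wt, y = mpg, label = model)) +
  geom_point() +
  # Used in place of geom_label(): labels are pushed apart so they don't overlap
  geom_label_repel(size = 3) +
  # labs() sets the title and axis labels in one call
  labs(
    title = "Fuel economy by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
```

Printing p renders the chart. Swapping geom_label_repel() for geom_label() on the same data shows how quickly the labels pile on top of each other.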
One slide in particular made me want to do some data wrangling to learn more. The slide was about ordering, and the content was the relationship between religion and TV consumption. Hadley’s slide showed the mean for each religion, and I wanted to dig deeper into the distribution and the size of the groupings. The dataset can be found here: General Social Survey (GSS). I used data from 2000 – 2014.
R code
library(readxl)
library(tidyverse)
library(plotly)

# Set correct path
setwd("C:/Users/Matt/Downloads")

# Read data
GSS_Data <- read_excel("GSS.xls", sheet = "Data")

# Remove last two rows of data
GSS_Data <- head(GSS_Data, -2)
# Basic exploration
summary(GSS_Data)
str(GSS_Data)

# Rename columns
GSS_Data <- rename(GSS_Data,
                   year = `Gss year for this respondent `,
                   Religion = `Religion in which raised`,
                   id = `Respondent id number`,
                   `Work Category` = `R self-emp or works for somebody`)

# Change data types
GSS_Data$Religion <- as.factor(GSS_Data$Religion)
GSS_Data$`Hours per day watching tv` <- as.numeric(GSS_Data$`Hours per day watching tv`)
GSS_Data$`Work Category` <- as.factor(GSS_Data$`Work Category`)

# Select columns
Religion_TV <- select(GSS_Data, id, year, Religion, `Hours per day watching tv`)

# Remove NAs
Religion_TV <- Religion_TV[complete.cases(Religion_TV), ]

# Distinct religions
Religion_TV %>%
  select(Religion) %>%
  distinct()
# 15 religions

# Filter to look at 2014 data
tv_2014 <- filter(Religion_TV, year == 2014)

# Reorder religion based on mean TV watching time
# Hadley used fct_reorder()
tv_2014$Religion <- with(tv_2014, reorder(Religion, `Hours per day watching tv`, mean))
# Create plot
ggplot(tv_2014, aes(Religion, `Hours per day watching tv`, fill = Religion)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("TV Watched per day by Religion \n in 2014")

# Get counts
tv_2014 %>%
  group_by(Religion) %>%
  summarise(
    avg_watched = mean(`Hours per day watching tv`),
    counts = n()) %>%
  arrange(desc(avg_watched))
We see that the ‘Don’t Know’ group in 2014 consisted of only 1 member, so we should not draw any conclusions from that group. Although the sample sizes are small, the Eastern religions watched less television than the Abrahamic religions in 2014. When looking at a larger time period (2000 – 2014), the ‘Don’t Know’ category jumps to the top of the list.
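The ordering in the boxplot comes from base R’s reorder(), while the talk used forcats. Here is a small sketch of how the two compare, using toy data I made up for illustration (these are not GSS values). One thing to watch: fct_reorder() summarises with the median by default, so mean has to be passed explicitly to match the code above.

```r
library(forcats)

# Toy data, made up for illustration (not GSS values)
toy <- data.frame(
  Religion = factor(c("A", "A", "B", "B", "C", "C")),
  tv_hours = c(4, 6, 1, 2, 3, 3)
)

# Base R: order the factor levels by group mean
base_order <- with(toy, reorder(Religion, tv_hours, mean))

# forcats: same idea; .fun defaults to median, so pass mean explicitly
fct_order <- fct_reorder(toy$Religion, toy$tv_hours, .fun = mean)

levels(base_order)
levels(fct_order)
```

Both calls put the levels in ascending order of mean TV hours, which is what drives the order of the boxes after coord_flip().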
We also notice some atypical behavior with a few people that are watching TV for the whole day. Let’s see how many people are watching TV for 24 hours.
Religion_TV %>%
  group_by(Religion) %>%
  filter(`Hours per day watching tv` == 24) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
We see 16 Protestants, 3 Catholics, 2 None, and 1 Don’t Know. From this I can’t tell whether that’s bad data or people leaving the TV on all day. Next, let’s look at how these habits changed over time.
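Before that, a quick aside on the 24-hour responses. One way to gauge how much they matter is to compare each group’s mean with its median, and with the mean once those responses are set aside. This is only a sketch on toy data I made up (not GSS values), and the column names are my own.

```r
library(dplyr)

# Toy data, made up for illustration (not GSS values)
toy <- tibble(
  Religion = c("A", "A", "A", "A", "B", "B", "B", "B"),
  tv_hours = c(2, 3, 2, 24, 2, 3, 2, 3)
)

sensitivity <- toy %>%
  group_by(Religion) %>%
  summarise(
    mean_all      = mean(tv_hours),                 # pulled up by the 24
    median_all    = median(tv_hours),               # barely moves
    mean_under_24 = mean(tv_hours[tv_hours < 24]),  # mean without 24-hour rows
    n_at_24       = sum(tv_hours == 24)             # how many such rows exist
  )

sensitivity
```

In small groups a single 24-hour answer can shift the mean a lot, so the group averages are worth reading alongside the counts. With that caveat in mind, here is the trend over time: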
Times <- Religion_TV %>%
  group_by(Religion, year) %>%
  summarise(
    `Avg TV watched` = mean(`Hours per day watching tv`),
    counts = n()) %>%
  filter(Religion %in% c('Protestant', 'Jewish', 'Moslem/islam', 'None')) %>%
  arrange(desc(`Avg TV watched`))

ab <- ggplot(Times, aes(x = year, y = `Avg TV watched`, color = Religion)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 4)) +
  scale_x_continuous(breaks = seq(2000, 2014, 2))

ggplotly(ab)
I used Plotly here because its charts tend to pop more and can be interactive. I was then curious about who watches more TV: people who are self-employed or people employed by someone else.
worker <- GSS_Data %>%
  group_by(`Work Category`, year) %>%
  summarise(
    `Avg TV watched` = mean(`Hours per day watching tv`, na.rm = TRUE),
    counts = n()) %>%
  arrange(desc(`Avg TV watched`))

cd <- ggplot(worker, aes(x = year, y = `Avg TV watched`, color = `Work Category`)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 8)) +
  scale_x_continuous(breaks = seq(2000, 2014, 2))

ggplotly(cd)

GSS_Data %>%
  group_by(`Work Category`) %>%
  summarise(counts = n()) %>%
  arrange(desc(counts))
Unsurprisingly, people whose work category is not applicable watch more TV than either group of employed people. The self-employed watch just slightly less TV than those employed by others. Last week was also Plotcon. Hearing some great talks from amazing speakers has me fired up to create more engaging R charts, do all my data munging in R, and keep using Excel charting less and less.