TV watching habits from GSS data

Last week I went to a Data Visualization Meetup where Hadley Wickham gave a talk on visualization challenges. He went over the following points:

Eight visualization challenges

- Labeling plots: A problem ignored for too long
  - Use the labs() function
- Dual axes: usually wrong, sometimes useful
- geom_label and ggrepel: labeling your data
- Ordering
- Visualizing missing values
- Why are histograms so hard?
- Bar charts & variations
- ggplot2 extensions from the community

My main takeaway was from the labeling challenge, particularly how useful geom_label_repel() is. If you have many points you’d like to annotate, overlapping labels make the visualization illegible. This function, from the ggrepel package, takes the place of geom_label() (geom_text_repel() is the counterpart to geom_text()) and positions the labels so that they don’t overlap. Documentation can be found here.
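As a rough illustration of the idea, here is a minimal sketch using the built-in mtcars data rather than anything from the talk. The data frame `cars`, the plot title, and the axis labels are my own choices for the example; it also uses labs(), as suggested in the first challenge.

```r
library(ggplot2)
library(ggrepel)

# Built-in mtcars data stands in for any scatter plot with many labeled points
cars <- data.frame(
  model = rownames(mtcars),
  wt    = mtcars$wt,
  mpg   = mtcars$mpg
)

p <- ggplot(cars, aes(x = wt, y = mpg, label = model)) +
  geom_point() +
  # geom_label_repel() pushes labels away from each other and from the points
  geom_label_repel(size = 3) +
  labs(
    title = "Fuel economy by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

p
```

Swapping geom_label_repel() for geom_label() in the same plot shows how quickly the labels pile up on top of each other.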

One slide in particular made me want to do some data wrangling to learn more. The slide was about ordering, and the content was the relationship between Religion and TV consumption. Hadley’s slide showed the mean for each Religion, and I wanted to dig deeper into the distribution of viewing hours and the size of each group. The dataset can be found here: General Social Survey (GSS). I used data from 2000 – 2014.

R code

library(readxl)
library(tidyverse)
library(plotly)

#Set Correct Path
setwd("C:/Users/Matt/Downloads")

#read Data
GSS_Data <- read_excel("GSS.xls", sheet = "Data")

#remove last two rows of data
GSS_Data <- head(GSS_Data, -2)

# Basic exploration

summary(GSS_Data)
str(GSS_Data)

# Rename columns

GSS_Data <- rename(GSS_Data, 
              year = `Gss year for this respondent                       `,
              Religion = `Religion in which raised`,
              id = `Respondent id number`,
              `Work Category` = `R self-emp or works for somebody`)

# Change data types

GSS_Data$Religion <- as.factor(GSS_Data$Religion)
GSS_Data$`Hours per day watching tv` <- as.numeric(GSS_Data$`Hours per day watching tv`)
GSS_Data$`Work Category` <- as.factor(GSS_Data$`Work Category`)

# Select columns

Religion_TV <- select(GSS_Data, id, year, Religion, `Hours per day watching tv`)

# Remove rows with NAs
Religion_TV <- Religion_TV[complete.cases(Religion_TV),]


# Distinct Religions
Religion_TV %>% select(Religion) %>% distinct()
# 15 religions

#Filter to look at 2014 data 
tv_2014 <- filter(Religion_TV, year == 2014)

# reorder religion based on mean TV watching time
# Hadley used fct_reorder() from the forcats package
tv_2014$Religion <- with(tv_2014, reorder(Religion, `Hours per day watching tv`, mean))

# Create plot

ggplot(tv_2014, 
       aes(Religion, 
       `Hours per day watching tv`, 
       fill = Religion))  + geom_boxplot() + coord_flip() + 
       ggtitle("TV Watched per day by Religion \n in 2014")

# Get counts

tv_2014 %>% 
  group_by(Religion) %>% 
  summarise(
    avg_watched = mean(`Hours per day watching tv`),
    counts = n()) %>% 
      arrange(desc(avg_watched))

We see that the ‘Don’t Know’ group in 2014 consisted of only 1 member, so we should not draw any conclusions from that group. Although the sample sizes are small, we notice that the Eastern religions watched less television than the Abrahamic religions in 2014. When looking at a larger time period (2000 – 2014), the ‘Don’t Know’ category jumps to the top of the list.
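Since group size matters this much, one option is to drop groups below a minimum count before ranking by the mean. Here is a minimal sketch on made-up numbers, not GSS values. The column names `religion` and `tv_hours` and the threshold `min_n` are hypothetical, picked only for the example, and it uses forcats::fct_reorder() in place of base reorder().

```r
library(dplyr)
library(forcats)

# Toy data: made-up values, not taken from the GSS
toy <- tibble(
  religion = c(rep("A", 5), rep("B", 4), "C"),
  tv_hours = c(2, 3, 4, 3, 3,  1, 2, 1, 2,  10)
)

min_n <- 3  # hypothetical cutoff for a group to be worth comparing

toy_summary <- toy %>%
  group_by(religion) %>%
  summarise(
    avg_watched = mean(tv_hours),
    counts = n()
  ) %>%
  # Drop groups that are too small to say anything about
  filter(counts >= min_n) %>%
  # Order the factor levels by mean so plots and tables sort the same way
  mutate(religion = fct_reorder(religion, avg_watched)) %>%
  arrange(desc(avg_watched))

toy_summary
```

Group C has the highest mean but only one respondent, so it is dropped before the ranking, much like the single ‘Don’t Know’ respondent in 2014.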

We also notice some atypical behavior with a few people that are watching TV for the whole day. Let’s see how many people are watching TV for 24 hours.

Religion_TV %>% 
  group_by(Religion) %>%
  filter(`Hours per day watching tv` == 24) %>% 
  summarise(
    counts = n()) %>% 
  arrange(desc(counts))

We see 16 Protestants, 3 Catholics, 2 None, and 1 Don’t Know. From this alone I can’t tell whether that’s bad data or people leaving the TV on all day. Next, let’s look at how these habits changed over time.
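Before that, a quick aside: one way to gauge how much the 24-hour answers could matter is to compare a group's mean with and without them, alongside the median. This is a sketch on made-up numbers, not GSS values; the column names `religion` and `tv_hours` are hypothetical.

```r
library(dplyr)

# Toy data: made-up values, not taken from the GSS
toy <- tibble(
  religion = c("A", "A", "A", "A", "B", "B", "B"),
  tv_hours = c(2, 3, 1, 24, 2, 4, 3)
)

sens <- toy %>%
  group_by(religion) %>%
  summarise(
    mean_all     = mean(tv_hours),                  # mean including 24-hour answers
    mean_under24 = mean(tv_hours[tv_hours < 24]),   # mean excluding them
    median_all   = median(tv_hours),                # median is less sensitive to extremes
    n_24h        = sum(tv_hours == 24)              # how many 24-hour answers
  )

sens
```

In this toy example a single 24-hour answer in a group of four pulls the mean from 2 up to 7.5, while the median stays at 2.5. With that in mind, here is the trend over time: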

Times <- Religion_TV %>% 
    group_by(Religion, year) %>% 
    summarise(
      `Avg TV watched` = mean(`Hours per day watching tv`),
      counts = n()) %>% 
    filter( Religion %in% c('Protestant','Jewish','Moslem/islam', 'None')) %>% 
  arrange(desc(`Avg TV watched`))


ab <- ggplot(Times, aes(x = year, y = `Avg TV watched`, color = Religion)) + 
    geom_line() + coord_cartesian(ylim= c(0,4)) + scale_x_continuous(breaks=seq(2000,2014,2))  

ggplotly(ab)

I used Plotly because the charts tend to pop more and can be interactive. I was then curious about who watches more TV: people who are self-employed or people employed by someone else.

worker <- GSS_Data %>% 
  group_by(`Work Category`, year) %>% 
  summarise(
    `Avg TV watched` = mean(`Hours per day watching tv`, na.rm = TRUE),
    counts = n())  %>% 
  arrange(desc(`Avg TV watched`))

cd <- ggplot(worker, aes(x = year, y = `Avg TV watched`, color = `Work Category`)) + 
  geom_point() + coord_cartesian(ylim= c(0,8)) + scale_x_continuous(breaks=seq(2000,2014,2))  

ggplotly(cd)


GSS_Data %>% 
 group_by(`Work Category`) %>%
 summarise(
 counts = n()) %>% 
 arrange(desc(counts))

Unsurprisingly, people whose work category is not applicable watch more TV than either group of the employed. The self-employed watch just slightly less TV than those employed by others. Last week was also Plotcon. Hearing some great talks from amazing speakers has me fired up to create more engaging R charts, do all the data munging in R, and keep using Excel charting less and less.
