TV GSS

TV watching habits from GSS data

Last week I went to a Data Visualization Meetup where Hadley Wickham gave a talk on visualization challenges. He went over the following points:

Eight visualization challenges

  1. Labeling plots: A problem ignored for too long
    • Use the labs() function
  2. Dual axes: usually wrong, sometimes useful
  3. geom_label and ggrepel: labeling your data
  4. Ordering
  5. Visualizing missing values
  6. Why are histograms so hard?
  7. Bar charts & variations
  8. ggplot2 extensions from the community

My main takeaway was from the labeling challenge, particularly how useful geom_label_repel() is. If you have many points you’d like to annotate, overlapping labels make the visualization illegible. This function, from the ggrepel package, takes the place of geom_text() or geom_label() and positions the labels so that they don’t overlap each other or the points. Documentation can be found here.
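As a rough illustration of the first and third challenges, here is a minimal sketch that combines labs() with geom_label_repel(). It is not from the talk: the built-in mtcars data stands in for a real dataset, and the `model` column and the axis titles are my own choices for the example.

```r
library(ggplot2)
library(ggrepel)

# Built-in mtcars data stands in for a real dataset (an assumption for this
# sketch); the row names become the point labels.
cars <- mtcars
cars$model <- rownames(cars)

p <- ggplot(cars, aes(x = wt, y = mpg, label = model)) +
  geom_point() +
  # Repel the labels away from each other and from the points
  geom_label_repel(size = 3) +
  # labs() sets the title and axis labels in one place
  labs(
    title = "Fuel economy by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

print(p)
```

Swapping geom_label_repel() for geom_text_repel() gives the same behavior without the label boxes.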

[Figure: repel]

 

One slide in particular made me want to do some data wrangling to learn more. The slide was about ordering, and the content was the relationship between Religion and TV consumption. Hadley’s slide showed the mean for each Religion, and I wanted to dig deeper on the distribution and the size of the groupings. The dataset can be found here: General Social Survey (GSS). I used data from 2000 – 2014.

 

R code

library(readxl)
library(tidyverse)
library(plotly)

#Set Correct Path
setwd("C:/Users/Matt/Downloads")

#read Data
GSS_Data <- read_excel("GSS.xls", sheet = "Data")

#remove last two rows of data
GSS_Data <- head(GSS_Data, -2)
# Basic exploration

summary(GSS_Data)
str(GSS_Data)

# Rename columns

GSS_Data <- rename(GSS_Data, 
              year = `Gss year for this respondent                       `,
              Religion = `Religion in which raised`,
              id = `Respondent id number`,
              `Work Category` = `R self-emp or works for somebody`)

# Change data types

GSS_Data$Religion <- as.factor(GSS_Data$Religion)
GSS_Data$`Hours per day watching tv` <- as.numeric(GSS_Data$`Hours per day watching tv`)
GSS_Data$`Work Category` <- as.factor(GSS_Data$`Work Category`)

# Select columns

Religion_TV <- select(GSS_Data, id, year, Religion, `Hours per day watching tv`)

# remove Na's
Religion_TV <- Religion_TV[complete.cases(Religion_TV),]


# Distinct Religions
Religion_TV %>% select(Religion) %>% distinct
# 15 religions

#Filter to look at 2014 data 
tv_2014 <- filter(Religion_TV, year == 2014)

# reorder religion based on mean TV watching time
# Hadley used fct_reorder() from the forcats package
tv_2014$Religion <- with(tv_2014, reorder(Religion, `Hours per day watching tv`, mean))
# Create plot

ggplot(tv_2014,
       aes(Religion,
           `Hours per day watching tv`,
           fill = Religion)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("TV Watched per day by Religion \n in 2014")

# Get counts

tv_2014 %>% 
  group_by(Religion) %>% 
  summarise(
    avg_watched = mean(`Hours per day watching tv`),
    counts = n()) %>% 
  arrange(desc(avg_watched))
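For comparison with the reorder() call in the code above, here is a small sketch of the forcats equivalent. The `toy` data frame and its values are made up for illustration and stand in for tv_2014. Note that fct_reorder() summarises with the median by default, so the mean has to be passed explicitly to match.

```r
library(forcats)

# Toy data standing in for tv_2014 (values invented for illustration)
toy <- data.frame(
  Religion = factor(c("A", "A", "B", "B", "C", "C")),
  hours    = c(4, 6, 1, 2, 3, 3)
)

# Base R: order the factor levels by the group mean of hours
base_order <- with(toy, reorder(Religion, hours, mean))

# forcats: same ordering; .fun defaults to median, so pass mean explicitly
fct_order <- fct_reorder(toy$Religion, toy$hours, .fun = mean)

print(levels(base_order))
print(levels(fct_order))
```

Both calls order the levels from lowest to highest group mean, which is what puts the boxplots in a readable order.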

 

[Figure: tv_2014]

 

We see that the ‘Don’t Know’ group in 2014 consisted of only 1 member, so we should not draw any conclusions from that group. Although the sample sizes are small, the Eastern religions appear to watch less television than the Abrahamic religions in 2014. When looking at a larger time period (2000 – 2014), the ‘Don’t Know’ category jumps to the top of the list.
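One way to keep tiny groups like this out of a ranking is to filter on the group count before sorting. The sketch below is an assumption-laden illustration, not part of the original analysis: the `toy` data is invented, and the `min_n` cutoff is a hypothetical threshold you would pick to suit your data.

```r
library(dplyr)

# Toy data standing in for tv_2014 (values invented for illustration)
toy <- tibble(
  Religion = c(rep("Protestant", 5), rep("Catholic", 4), "Don't know"),
  hours    = c(3, 2, 4, 1, 5, 2, 3, 2, 1, 10)
)

min_n <- 3  # hypothetical minimum group size

summary_tbl <- toy %>%
  group_by(Religion) %>%
  summarise(avg_watched = mean(hours), counts = n()) %>%
  # Drop groups too small to say anything about
  filter(counts >= min_n) %>%
  arrange(desc(avg_watched))

print(summary_tbl)
```

Keeping the counts column in the output makes it easy to see how much data sits behind each average.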

 

[Figure: tv_2000_2014]

We also notice some atypical behavior with a few people that are watching TV for the whole day. Let’s see how many people are watching TV for 24 hours.

Religion_TV %>% 
  group_by(Religion) %>%
  filter(`Hours per day watching tv` == 24) %>% 
  summarise(
    counts = n()) %>% 
  arrange(desc(counts))

We see 16 Protestants, 3 Catholics, 2 None, and 1 Don’t Know. From this I can’t tell whether that’s bad data or people leaving the TV on all day. Next, let’s look at how these habits changed over time.

Times <- Religion_TV %>% 
  group_by(Religion, year) %>% 
  summarise(
    `Avg TV watched` = mean(`Hours per day watching tv`),
    counts = n()) %>% 
  filter(Religion %in% c('Protestant', 'Jewish', 'Moslem/islam', 'None')) %>% 
  arrange(desc(`Avg TV watched`))


ab <- ggplot(Times, aes(x = year, y = `Avg TV watched`, color = Religion)) + 
    geom_line() + coord_cartesian(ylim= c(0,4)) + scale_x_continuous(breaks=seq(2000,2014,2))  

ggplotly(ab)

[Figure: tv_time]

I used Plotly here because its charts tend to pop more and can be interactive. I was then curious about who watches more TV: people who are self-employed or people employed by someone else.

worker <- GSS_Data %>% 
  group_by(`Work Category`, year) %>% 
  summarise(
    `Avg TV watched` = mean(`Hours per day watching tv`, na.rm = TRUE),
    counts = n())  %>% 
  arrange(desc(`Avg TV watched`))

cd <- ggplot(worker, aes(x = year, y = `Avg TV watched`, color = `Work Category`)) + 
  geom_point() + coord_cartesian(ylim= c(0,8)) + scale_x_continuous(breaks=seq(2000,2014,2))  

ggplotly(cd)


GSS_Data %>% 
 group_by(`Work Category`) %>%
 summarise(
 counts = n()) %>% 
 arrange(desc(counts))

[Figure: work_cat]

Unsurprisingly, people whose work category is not applicable watch more TV than either group of employed people. The self-employed watch just slightly less TV than those employed by others. Last week was also Plotcon. Hearing some great talks from amazing speakers has me fired up to create more engaging R charts, do all my data munging in R, and keep using Excel charting less and less.

 

 
