Gooooood Morning Rabosai

This post is going to go over pulling YouTube statistics using Python's pyyoutube library, data cleaning, charting with Plotly, and correlations, to answer: do longer Daf Yomi videos get fewer views?

What is MDY?

Mercaz Daf Yomi (MDY) was created by R’ Eli Stefansky. It includes a unique program for chazzering the Daf in just 8 minutes, with clear, concise, illustrated recap videos delivered to your WhatsApp inbox every day. I first heard of MDY through the Free Artscroll Gemara campaign – join the MDY Shiur and get a free Gemara in English or Hebrew.

I learn with a Chavruta three times a week, and after each daf I do a quiz from dafyomi.co.il and a weekly recap from dafdigest.org that I post to mywifeisadoctor, but the best chazara is listening to R’ Eli Stefansky – if not the full daf, then the 8-minute videos.

Less is more?

In one of the shiurs someone asked why the YouTube videos are not longer and don't include everything that happens on the Zoom calls, and R’ Stefansky said that if the videos were 2 hours long no one would watch them – people skip the long videos. So I wanted to look into this data and the general trends with the MDY videos.

MD(P)Y

Next I’ll break down how to pull YouTube stats using Python.

# import libraries (pyyoutube comes from the python-youtube package: pip install python-youtube)
import plotly.express as px
import pandas as pd
import pyyoutube
import datetime
import re

Then you need an API key for the YouTube Data API, which you can create in the Google Cloud Console.
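
The function below reads the key from a module-level API_KEY variable, so set that first – a minimal sketch, assuming the key is stored in an environment variable named YOUTUBE_API_KEY:

import os

# assumed setup: the key lives in an environment variable named YOUTUBE_API_KEY
API_KEY = os.environ["YOUTUBE_API_KEY"]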

def get_videos(channel_id):
    api = pyyoutube.Api(api_key=API_KEY)
    channel_info = api.get_channel_info(channel_id=channel_id)

    # the channel's "uploads" playlist holds every video it has published
    playlist_id = channel_info.items[0].contentDetails.relatedPlaylists.uploads

    # count is the total number of items to pull, limit is the page size per request
    uploads_playlist_items = api.get_playlist_items(
        playlist_id=playlist_id, count=2100, limit=10
    )

    # fetch the full record (snippet, statistics, contentDetails) for each video
    videos = []
    for item in uploads_playlist_items.items:
        video_id = item.contentDetails.videoId
        video = api.get_video_by_id(video_id=video_id)
        videos.append(video.items[0])

    df = pd.DataFrame(videos, columns=['id', 'etag', 'kind', 'snippet', 'statistics', 'contentDetails', 'status'])

    return df

channel_id = "UCKwQa5DB_VR98ac_r-Wyl-g" # this is the MDY channel
df = get_videos(channel_id)

We can then look at the data and see 2,041 videos, the latest of which as of today (2023-03-29) is “Daf Yomi Nazir Daf 65 by R’ Eli Stefansky”. The snippet, statistics, and contentDetails columns come back as nested JSON-style objects, so we need to extract the fields we care about.
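
That quick look takes a couple of lines – a small sketch, assuming the uploads playlist comes back newest-first so row 0 is the latest video:

print(len(df))                         # ~2041 videos
print(df['snippet'].iloc[0]['title'])  # title of the latest upload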

# pull the fields we care about out of the nested snippet / statistics / contentDetails objects
df['title'] = df['snippet'].apply(lambda x: x['title'])
df['viewcount'] = df['statistics'].apply(lambda x: x['viewCount'])
df['published_at'] = df['snippet'].apply(lambda x: x['publishedAt'])
df['duration'] = df['contentDetails'].apply(lambda x: x['duration'])

df = df[["title", "viewcount","published_at","duration"]]

It’s About the Yomi

After that we want to filter:

Only titles that contain the words "Daf Yomi", removing the Siyum and "The Magic of MDY" videos.

Then, on the date side, only videos from the current cycle, which started 2020-01-05, and dropping the last 7 days so that the newest videos don’t fall off a cliff.

df = df[df['title'].str.contains("Daf Yomi")]
df = df[~df['title'].str.contains("Siyum")]
df = df[~df['title'].str.contains("The Magic of MDY")]
df['date'] = df['published_at'].astype('datetime64[ns]')
# the current Daf Yomi cycle started 2020-01-05
df = df[df['date'] > pd.Timestamp(2020, 1, 5)]
# drop the most recent week so new uploads don't drag the tail down
df = df[df['date'] < datetime.datetime.now() - pd.to_timedelta("7day")]
df.sort_values(by=['date']).head()

We can then use a regex to get the tractate and daf number from the title.

# the daf number follows the word "Daf", e.g. "Daf Yomi Nazir Daf 65"
df['daf'] = df['title'].str.extract(r'Daf\s+(\d+)')
# the tractate name sits between "Daf Yomi" and "Daf"; strip the surrounding spaces
df['tractate'] = df['title'].str.extract(r'(?<=Daf Yomi)(.*)(Daf)')[0].str.strip()
df = df[df['tractate'].notna()]

Next we can check for typos using a groupby; we need to update some titles, since we see that “berachos” and “Brachos” should be “Berachos”.
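
A sketch of that check and fix – the exact replacement mapping here is my assumption based on the typos mentioned above:

# misspelled tractates show up as separate groups with only a handful of videos
df.groupby('tractate')['title'].count().sort_values().head(10)

# fold the variant spellings into one
df['tractate'] = df['tractate'].replace({'Brachos': 'Berachos', 'berachos': 'Berachos'})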

Next, some type conversions, and getting the duration into a readable format:

df = df.sort_values(by=['date'])
df['date'] = pd.to_datetime(df['date']).dt.date
df['viewcount'] = df['viewcount'].astype(str).astype(int)
df['daf'] = df['daf'].astype(str).astype(int)

def convert_duration_to_minutes(duration):
    # YouTube returns ISO 8601 durations like "PT1H2M3S";
    # strip the leading "PT", then split out hours, minutes, and seconds
    duration = duration[2:]
    hours, minutes, seconds = 0, 0, 0
    if 'H' in duration:
        hours, duration = duration.split('H')
        hours = int(hours)
    if 'M' in duration:
        minutes, duration = duration.split('M')
        minutes = int(minutes)
    if 'S' in duration:
        seconds = int(duration[:-1])
    
    # Calculate the total duration in minutes
    total_minutes = (hours * 60) + minutes + (seconds / 60)
    return total_minutes

# Apply the function to the duration column and update it with the converted values
df['duration'] = df['duration'].apply(convert_duration_to_minutes)
df.sort_values(by=['duration']).head(10)
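
As an aside, pandas can parse these ISO 8601 duration strings itself, so an equivalent one-liner (a sketch of an alternative, not what the post originally used) would be:

# pd.Timedelta understands "PT1H2M3S"-style strings directly
df['duration'] = df['duration'].apply(lambda d: pd.Timedelta(d).total_seconds() / 60)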

We can then filter out unusually short videos, such as “Daf Yomi Shabbos Daf 63 by R’ Eli Stefansky”, which was uploaded twice.
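
One way to drop them, assuming a 10-minute cutoff (the post doesn't show the exact threshold it used):

# anything much shorter than a real shiur is a duplicate or partial upload; 10 minutes is an assumed cutoff
df = df[df['duration'] > 10]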

Plotly Charts

daf = df

fig = px.line(daf, x="date", y="viewcount", color="tractate")
fig.update_traces(textposition="bottom center")
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1
))
fig.show()

We can see a positive trend over time, showing growth in the Chevra, and something interesting: the first Daf always has a large spike in views. Let’s dig in.

from plotly.subplots import make_subplots

# Create the first plot
fig1 = px.line(df, x="daf", y="viewcount", color="tractate")
fig1.update_traces(textposition="bottom center")

# Create the second plot
fig2 = px.line(df, x="daf", y="viewcount", color="tractate")
fig2.update_traces(textposition="bottom center")
fig2.update_xaxes(range=[2, 8])  # Update x-axis range

# Create the subplot figure
fig_combined = make_subplots(rows=1, cols=2)

# Add traces from the first plot to the first subplot
for trace in fig1.data:
    fig_combined.add_trace(trace, row=1, col=1)

# Add traces from the second plot to the second subplot
for trace in fig2.data:
    trace.showlegend = False  # Disable legend for traces in the second subplot
    fig_combined.add_trace(trace, row=1, col=2)

# Update the layout for the combined figure
fig_combined.update_layout(
    title_text="Views Per Daf",
    height=500,
    width=1000
)

# Set the x-axis range for the second subplot explicitly
fig_combined.update_xaxes(range=[2, 8], row=1, col=2)

# Display the combined plot
fig_combined.show()

Using `make_subplots` we can show two charts side by side – we make the x-axis the daf number instead of the date, and then zoom in on the right to see what viewership looks like for Dafs 2–8. We see the drop more clearly, and it starts to stabilize after the 4th daf.

Correlations

dafcorr = daf[daf['daf'] > 2]

# keep the tractates in the order they were learned
sorted_dafcorr = dafcorr.sort_values(by='date')
subset = sorted_dafcorr[['tractate', 'duration', 'viewcount']]
tractate_order = sorted_dafcorr['tractate'].unique()

# per-tractate duration/viewcount correlation;
# .iloc[::2, -1] pulls the duration-vs-viewcount value out of each 2x2 correlation matrix
correlations = subset.groupby('tractate')[['duration', 'viewcount']].corr().iloc[::2, -1].reset_index()
correlations['tractate'] = pd.Categorical(correlations['tractate'], categories=tractate_order)
correlations = correlations.sort_values(['tractate']).reset_index(drop=True)
correlations.head()

The `.corr()` function does the heavy lifting once the data has been cleaned. We break the correlations out by Gemara, since we saw that viewership changes per Gemara, and we exclude the first Daf, since it’s an outlier in view count.

fig = px.scatter(dafcorr, x='duration', y='viewcount', color='tractate', hover_data=["daf", "duration", "viewcount"])

fig.show()

For the videos overall there is not a strong correlation, but within the individual Gemaras there looks to be a trend.
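
That overall number is easy to check directly (a one-line sketch):

# duration vs. view count correlation across all tractates combined
print(dafcorr['duration'].corr(dafcorr['viewcount']))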

correlations = correlations.rename(columns={'viewcount': 'correlation'})
fig = px.bar(correlations, x='tractate', y='correlation')
fig.show()

We see that for Berachos, yes, there was a negative correlation – longer videos did get fewer views – but of the last 5 Gemaras, 3 had a positive correlation. As time has gone on, people have seen it’s Geshmak to do the Daf, and people want more R’ Stefansky.

fig = px.line(daf, x="date", y="duration", color="tractate", hover_data=["daf", "duration", "viewcount"])
fig.update_traces(textposition="bottom center")
fig.show()

Over time the Shiurs have stayed in the 40–60 minute range with few outliers, and we learned that people will not run away from the longer videos.

Here at Tevunah, for anyone who is part of the MDY Chevra – whether you have been listening since Berachos or just happened to watch one 8-minute daf – we put our Maaser contribution toward MDY. We are a full-stack Data Consultancy, from Data Analytics to Warehousing.