
ShowTime OTT Business Report

The document presents a business report on predictive modeling using linear regression for ShowTime, an OTT service provider, aiming to identify factors influencing first-day content viewership. It highlights the growth of the global OTT market, the impact of COVID-19 on viewership, and provides a detailed analysis of various factors such as visitor numbers, ad impressions, genre, and release timing. The findings suggest actionable insights for improving viewership, including optimal release days and seasons, as well as the effects of major sports events on content consumption.

Uploaded by

murali.dhiviya96

PREDICTIVE MODELLING – LINEAR REGRESSION

SHOWTIME – OTT SERVICES

BUSINESS REPORT

PGPDSBA.O. AUG24.A

DHIVIYA MURALIDHARAN

An over-the-top (OTT) media service is a media service offered directly to viewers via the
internet. The term is most synonymous with subscription-based video-on-demand services that offer
access to film and television content, including existing series acquired from other producers, as well
as original content produced specifically for the service. They are typically accessed via websites on
personal computers, apps on smartphones and tablets, or televisions with integrated Smart TV
platforms.

OTT services are still at a relatively nascent stage, but they are widely accepted as a trending
technology across the globe. Customers' social behaviour is changing every year, shifting from
traditional subscriptions to broadcasting services toward OTT on-demand video and music
subscriptions, so OTT streaming is expected to grow at a very fast pace. The global OTT
market size was valued at $121.61 billion in 2019 and is projected to reach $1,039.03 billion by 2027,
growing at a CAGR of 29.4% from 2020 to 2027. The shift from television to OTT services for
entertainment is driven by benefits such as on-demand services, ease of access, and access to better
networks and digital connectivity.

With the outbreak of COVID-19, OTT services are striving to meet the growing entertainment appetite
of viewers, with some platforms already experiencing a 46% increase in consumption and subscriber
count as viewers seek fresh content. With innovations and advanced transformations that enable
customers to access everything they want in a single space, OTT platforms across the world are
expected to attract more and more subscribers.

Objective
ShowTime is an OTT service provider that offers a wide variety of content (movies, web shows, etc.)
to its users. The company wants to determine the driver variables for first-day content viewership so
that it can take the necessary measures to improve the viewership of the content on its platform.
Possible reasons for a decline in viewership include a drop in the number of people coming to the
platform, decreased marketing spend, content timing clashes, weekends and holidays, etc. They have
hired you as a Data Scientist, shared the data of the current content on their platform, and asked you
to analyse the data and come up with a linear regression model to determine the driving factors for
first-day viewership.

Data Description
The data contains the different factors to analyse for the content. The detailed data dictionary is
given below.

VARIABLE              DESCRIPTION
visitors              Average number of visitors, in millions, to the platform in the past week
ad_impressions        Number of ad impressions, in millions, across all ad campaigns for the content (running and completed)
major_sports_event    Any major sports event on the day
genre                 Genre of the content
dayofweek             Day of the release of the content
season                Season of the release of the content
views_trailer         Number of views, in millions, of the content trailer
views_content         Number of first-day views, in millions, of the content

Data Overview
• Data Type

• Statistical Summary

• Statistical summary of categorical variable

Observations and Insights


• The data set contains information about the first-day views of content together with the day of
the week, genre, and season of release.
• The data set contains 1000 rows and 8 attributes.
• There are 5 numeric (float and int type) and 3 string (object type) columns in the data.
• The target variable is the first-day views of the content, which is of float type.
• The data covers four seasons, eight genres, and all days of the week.
• The mean and standard deviation of the content views are 0.47 and 0.12, respectively.

UNIVARIATE ANALYSIS
Visitor Distribution

• The distribution is approximately normal, with very few outliers.
• The mean is around 1.7 million.

Trailer views Distribution


• There are many outliers.
• The data is heavily right-skewed.
• The mean is around 70 million.

Content Views Distribution


• Outliers are present.
• The data is slightly right-skewed but otherwise close to a normal distribution.
• The mean is around 0.47 million.

Ad Impressions Distribution

• There are very few outliers.
• The data is right-skewed and shows no sign of a normal distribution.
• The mean is around 1,400 million.

Genre Distribution

• The Others genre has the highest number of releases, followed by Comedy and Thriller.

Weekday Distribution

• Friday has the highest number of content releases, followed by Wednesday.
• Monday and Tuesday have very few content releases.

Season Distribution
• Winter and Fall have the highest number of releases.
• The other two seasons have roughly the same share of releases.

Major Event Distribution


• Most content is released on days when no major sports event is happening.

BIVARIATE ANALYSIS
Correlation between numerical values
• There is a positive correlation between trailer views and content views, suggesting that
content whose trailer is watched more also tends to be watched more on the first day.
• There is a mild correlation between visitors and first-day views of the content.
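
The correlation referred to above is the Pearson coefficient. The following is a minimal sketch of how it is computed; the numbers are made up for illustration and are not taken from the ShowTime data, and the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (in millions); NOT the report's data.
views_trailer = [50.0, 55.0, 60.0, 80.0, 120.0]
views_content = [0.40, 0.42, 0.45, 0.50, 0.62]

r = pearson_r(views_trailer, views_content)
print(round(r, 3))  # close to +1 for this near-linear toy data
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 a weak one. Correlation describes how the two columns move together across content items; it does not by itself show that the same viewers watched both the trailer and the content.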

• Outliers are present in all the genres.
• Content released on days without a major sports event clearly gets a better response in
views.
• The average views range from 0.4 to 0.5 million.

• Ad campaigns show no major change or benefit across seasons.
• Running ad campaigns on days with or without a sports event does not make a noticeable
difference.
• There is not much variation due to the change in season or genre.
• Winter had the peak number of visitors, and we can focus on Action-genre releases to get
high views.

• The winter season has the maximum content releases and first-day views.
• We can see many outliers in summer.

• Outliers are present on all days of the week except Monday.
• Wednesday and Saturday have the highest mean values.

• There is a positive correlation between visitors and first-day views of the content,
suggesting that when more people visit the platform, more of them watch the content on the
first day of release.
KEY QUESTIONS
1. What does the distribution of content views look like?

The distribution is approximately normal but slightly right-skewed, and the mean lies between
0.4 and 0.5 million.

2. What does the distribution of genres look like?


Genre (Others) has the highest number of content releases, and most of the data falls under
this genre, followed by Comedy and Thriller.

3. The day of the week on which content is released generally plays a key role in the
viewership. How does the viewership vary with the day of release?

Wednesday and Saturday have the highest average viewership. Friday has the least
viewership, while the other days have about the same average views.

4. How does the viewership vary with the season of release?

Season does not have much impact on viewership. Summer has the highest viewership,
followed by Winter. The other two seasons have slightly fewer views, but not by much.

5. What is the correlation between trailer views and content views?

There is a positive relationship between trailer views and first-day views: when trailer
views increase, content views tend to increase correspondingly.

DATA PROCESSING
1. Duplicate and Missing Values
There are no duplicate or missing values in the provided data set.
2. Feature Engineering
Since the Major Sports Event variable contains the values “0” and “1”, we have replaced them
with “0” = No and “1” = Yes.

3. Outlier Detection and Treatment

We can see many outliers in trailer views and content views. They are considered valid and are
not removed from the data set, since there may be rare cases where a release is a blockbuster and
its views peak well above the average. Hence these outliers are treated as informative: they are
left untreated and kept in the data for further insights.

4. Data Preparation for Modelling

• The target variable is “views of the content”.
• The categorical features are major sports event, genre, season, and day of the week.
• The data is split in an 80:20 ratio into train and test sets.
• The train data has 800 rows.
• The test data has 200 rows.
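
The preparation steps above can be sketched as follows. This is a library-free illustration on synthetic rows, not the report's actual code; in practice `pandas.get_dummies` and scikit-learn's `train_test_split` are the usual tools. The helper names, the random seed, and the choice to drop the first level of each categorical as the baseline are assumptions, since the report does not state them.

```python
import random

def one_hot_drop_first(rows, column):
    """Replace a categorical column with 0/1 dummy columns.

    The first level (in sorted order) is dropped and acts as the baseline
    (an assumption; the report does not say which level was the baseline).
    """
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels[1:]:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

def split_train_test(rows, test_fraction=0.2, seed=1):
    """Shuffle the rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic rows for illustration only; NOT the ShowTime data.
seasons = ["Fall", "Spring", "Summer", "Winter"]
rows = [
    {"visitors": 1.5 + 0.001 * i, "season": seasons[i % 4], "views_content": 0.4 + 0.0001 * i}
    for i in range(1000)
]

encoded = one_hot_drop_first(rows, "season")
train, test = split_train_test(encoded)
print(len(train), len(test))  # 800 200
```

Dropping one level per categorical feature avoids perfect collinearity with the intercept, which matters for the VIF and p-value checks later in the report.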
Model Building – Linear Regression
This is the first OLS regression model, fitted before checking the VIF and p-values. Its summary
also reports the other necessary information.

Interpreting the Regression Results


1. Adjusted R-squared: It reflects the fit of the model.

o Adjusted R-squared values generally range from 0 to 1, where a higher value generally
indicates a better fit, assuming certain conditions are met.

o In our case, the value of adj. R-squared is 0.780, which is good; the model is not
underfitting.

2. Constant coefficient: It is the Y-intercept.

o It means that if all the predictor variables are zero, then the expected output (i.e., Y)
would be equal to the constant coefficient.

o In our case, the value of the constant coefficient is 0.0568.
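
As a sketch of how adjusted R-squared relates to R-squared, the standard formula is adj. R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The report gives n = 800 training rows and a training R-squared of about 0.79, but it does not state p, so the value of p below is purely hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalises R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

n_train = 800        # training rows, from the report
r2_train = 0.79      # rounded training R-squared, from the report
p_hypothetical = 20  # HYPOTHETICAL predictor count; not stated in the report

adj = adjusted_r_squared(r2_train, n_train, p_hypothetical)
print(round(adj, 3))
```

Because the penalty term is at least 1 whenever there is at least one predictor, adjusted R-squared is never larger than R-squared for the same model.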


Observations
• The training R-squared is 0.79, so the model is not underfitting.

• The train and test RMSE and MAE are comparable, so the model is not overfitting either.

• The MAE suggests that the model can predict views of the content within a mean error of
0.039 on the test data.

• A MAPE of 8.6 on the test data means that, on average, our predictions are within 8.6% of
the actual views of the content.
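
For reference, the three error metrics quoted above can be computed as in this minimal sketch. The actual and predicted values are made up for illustration and are not the report's data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only (millions of first-day views); NOT the report's data.
actual = [0.40, 0.50, 0.45, 0.60]
predicted = [0.42, 0.47, 0.45, 0.54]

print(round(rmse(actual, predicted), 4))  # 0.035
print(round(mae(actual, predicted), 4))   # 0.0275
print(round(mape(actual, predicted), 2))  # 5.25
```

RMSE and MAE are in the units of the target (millions of views), while MAPE is a percentage, which is why the report can read it as "within 8.6% of the views".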

To make statistical inferences from the model, we have to test whether the Linear Regression
assumptions are met.

We will be checking the following Linear Regression assumptions:

1. No Multicollinearity – By checking VIF

2. Linearity of variables

3. Independence of error terms

4. Normality of error terms

5. No Heteroscedasticity

1. No Multicollinearity
The variance inflation factor (VIF) of every predictor is below 5, which indicates that there is
no problematic multicollinearity.
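
A minimal sketch of the VIF check, using NumPy on synthetic data. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The function name `vif` and the data are assumptions for illustration; the report most likely used a library routine such as statsmodels' `variance_inflation_factor`, but it does not say so.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (2-D array, no constant column)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on a constant plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return np.array(factors)

# Synthetic predictors for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a

vif_independent = vif(np.column_stack([a, b]))    # both close to 1
vif_collinear = vif(np.column_stack([a, b, c]))   # a and c far above 5
print(vif_independent)
print(vif_collinear)
```

The threshold of 5 used in the report is a common rule of thumb; independent predictors give values near 1, while a near-duplicate column pushes the VIF of both copies far above the threshold.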

Dealing with High P-values

• Build a model, check the p-values of the variables, and drop the
column with the highest p-value.

• Create a new model without the dropped feature, check the p-values
of the variables, and drop the column with the highest p-value.

• Repeat the above two steps until there are no columns with p-value > 0.05.
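
The loop described above can be sketched as follows. This is not the report's code: it uses NumPy on synthetic data, and it approximates the t-distribution p-values with a normal distribution, which is reasonable for about 800 training rows but is an assumption. The function names, the column names, and the decision to always keep the constant are hypothetical as well.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS (X must include a constant column) and return two-sided p-values.

    Uses a normal approximation to the t distribution (an assumption that is
    adequate for large samples).
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = (resid @ resid) / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

def backward_eliminate(X, y, names, alpha=0.05):
    """Repeatedly drop the predictor with the highest p-value until all are <= alpha."""
    keep = list(names)
    while True:
        cols = [names.index(name) for name in keep]
        p = ols_p_values(X[:, cols], y)
        # The constant is never a candidate for removal (an assumption).
        candidates = [(p[i], name) for i, name in enumerate(keep) if name != "const"]
        if not candidates:
            return keep
        worst_p, worst_name = max(candidates)
        if worst_p <= alpha:
            return keep
        keep.remove(worst_name)

# Synthetic data for illustration: x1 and x2 drive y, noise_col does not.
rng = np.random.default_rng(0)
n = 800
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise_col = rng.normal(size=n)
y = 0.1 + 0.5 * x1 - 0.3 * x2 + 0.1 * rng.normal(size=n)

names = ["const", "x1", "x2", "noise_col"]
X = np.column_stack([np.ones(n), x1, x2, noise_col])
selected = backward_eliminate(X, y, names)
print(selected)
```

Dropping one column at a time matters because p-values change whenever the set of predictors changes, so the model is refitted after every removal.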
Observations after dealing with high p-values
• Now no feature has p-value greater than 0.05, so we'll consider the features in x_train1 as
the final set of predictor variables and olsmod2 as the final model to move forward with

• Now adjusted R-squared is 0.78, i.e., our model is able to explain ~78% of the variance

• RMSE and MAE values are comparable for train and test sets, indicating that the model is not
overfitting

2 & 3. Linearity of Variables and Independence of Error Terms

Observations
• The scatter plot shows the distribution of residuals (errors) vs fitted values (predicted
values).

• If there is any pattern in this plot, we consider it a sign of non-linearity in the data: a
pattern means that the model does not capture non-linear effects.

• We see no pattern in the plot. Hence, the assumptions of linearity and independence
are satisfied.
4. Normality of Error Terms

Observations
• The histogram of residuals has a bell shape.

• In the Q-Q plot, the residuals more or less follow a straight line except at the tails.
• Since the p-value of the Shapiro-Wilk test is > 0.05, we fail to reject the hypothesis that
the residuals are normally distributed.
• So, the assumption is satisfied.

5. No Heteroscedasticity

Observations

• The residuals vs fitted values plot can be used to check for homoscedasticity. In the case
of heteroscedasticity, the residuals form an arrow (funnel) shape or some other
non-symmetrical shape. Since the plot does not form such a shape, there is no sign of
heteroscedasticity.
• The Goldfeld-Quandt test can also be used. If the p-value is > 0.05, we can say that the
residuals are homoscedastic; otherwise, they are heteroscedastic. Since the p-value is > 0.05
here, the residuals are homoscedastic. So, this assumption is satisfied.
PREDICTIONS ON TEST DATA

We can observe that our model has returned good prediction results: the actual and
predicted values are comparable.

FINAL MODEL

Observations

• The model is able to explain ~78% of the variation in the data.
• The train and test RMSE and MAE are low and comparable, so our model is not suffering
from overfitting.
• The MAPE on the test set suggests we can predict within 8.8% of the first-day views of the
content.
• Hence, we can conclude that the model olsmodel_final is good for prediction as well as
inference purposes.
Actionable Insights & Recommendations

• The model's adjusted R-squared is approximately 0.78 and its training R-squared is about
0.79, indicating that the model can explain roughly 78-79% of the variance in the data. This is
quite satisfactory.
• A one-unit increase in the 'visitors' variable is associated with a 0.1253 unit increase in
content viewership, with all other variables held constant.
• Releasing content on Saturdays and Wednesdays will boost viewership, provided no major
sports events occur on those days.
• Releasing content during the summer season can enhance viewership.
• To improve content viewership, it is recommended to avoid releasing content on days when
major sports events are happening.
• In Genre, the "Others" category has the highest count, followed by Comedy, Thriller, Drama,
etc.
• A major sports event will lead to a 0.0605 unit decrease in content viewership.
• As the data set has few variables, more detailed information or additional variables should
be provided to improve the predictions and help increase viewership.
• The summer season can result in a 0.0442 unit increase in content viewership.
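
To make the coefficient statements above concrete, the sketch below combines the three coefficients quoted in this report into an expected change in first-day views, holding everything else constant. The dictionary keys (for example `season_summer`) and the helper name are hypothetical, and reading the summer effect as relative to a baseline season is an assumption.

```python
# Coefficients quoted in the report (units: millions of first-day views).
coefficients = {
    "visitors": 0.1253,             # per additional one million weekly visitors
    "major_sports_event": -0.0605,  # release day coincides with a major sports event
    "season_summer": 0.0442,        # summer release (hypothetical key; baseline season assumed)
}

def change_in_views(**changes):
    """Expected change in first-day views for the given changes in predictors,
    with all other variables held constant (linear model, so effects add up)."""
    return sum(coefficients[name] * delta for name, delta in changes.items())

# Example: 0.2 million more weekly visitors, but release on a major sports day.
delta = change_in_views(visitors=0.2, major_sports_event=1)
print(round(delta, 4))  # -0.0354
```

In this example the sports-event penalty outweighs the gain from the extra visitors, which is consistent with the recommendation to avoid releasing content on days with major sports events.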
