ShowTime OTT Business Report
BUSINESS REPORT
PGPDSBA.O. AUG24.A
DHIVIYA MURALIDHARAN
An over-the-top (OTT) media service is a media service offered directly to viewers via the
internet. The term is most synonymous with subscription-based video-on-demand services that offer
access to film and television content, including existing series acquired from other producers, as well
as original content produced specifically for the service. They are typically accessed via websites on
personal computers, apps on smartphones and tablets, or televisions with integrated Smart TV
platforms.
Presently, OTT services are at a relatively nascent stage and are widely accepted as a trending
technology across the globe. Customers' social behaviour is changing every year, shifting away from
traditional subscriptions to broadcasting services and toward OTT on-demand video and music
subscriptions, so OTT streaming is expected to grow at a very fast pace. The global OTT
market size was valued at $121.61 billion in 2019 and is projected to reach $1,039.03 billion by 2027,
growing at a CAGR of 29.4% from 2020 to 2027. The shift from television to OTT services for
entertainment is driven by benefits such as on-demand services, ease of access, and access to better
networks and digital connectivity.
With the outbreak of COVID-19, OTT services are striving to meet the growing entertainment appetite
of viewers, with some platforms already experiencing a 46% increase in consumption and subscriber
count as viewers seek fresh content. With innovations and advanced transformations, which will
enable the customers to access everything they want in a single space, OTT platforms across the
world are expected to increasingly attract subscribers on a concurrent basis.
Objective
ShowTime is an OTT service provider and offers a wide variety of content (movies, web shows, etc.)
for its users. They want to determine the driver variables for first-day content viewership so that they
can take necessary measures to improve the viewership of the content on their platform. Some of
the reasons for the decline in viewership of content would be the decline in the number of people
coming to the platform, decreased marketing spends, content timing clashes, weekends and
holidays, etc. They have hired you as a Data Scientist, shared the data of the current content in their
platform, and asked you to analyse the data and come up with a linear regression model to
determine the driving factors for first-day viewership.
Data Description
The data contains the different factors to analyse for the content. The detailed data dictionary is
given below.
VARIABLE              DESCRIPTION
visitors              Average number of visitors, in millions, to the platform in the past week
ad_impressions        Number of ad impressions, in millions, across all ad campaigns for the
                      content (running and completed)
major_sports_event    Any major sports event on the day
genre                 Genre of the content
dayofweek             Day of the release of the content
season                Season of the release of the content
views_trailer         Number of views, in millions, of the content trailer
views_content         Number of first-day views, in millions, of the content
Data Overview
• Data Type
• Statistical Summary
Genre Distribution
Weekday Distribution
• Friday has the highest number of content releases, followed by Wednesday.
• Monday and Tuesday have very few content releases.
Season Distribution
• Winter and Fall have the highest share of releases.
• The other two seasons have more or less the same percentage of releases.
• No major change in ad campaigns, or in their benefit, is observed across seasons.
• Running ad campaigns on days with or without a major sports event makes no noticeable
difference.
• There is not much variation due to the change in season or genre.
• Winter had the peak number of visitors, and focusing releases on the Action genre could help
achieve high views.
3. The day of the week on which content is released generally plays a key role in the
viewership. How does the viewership vary with the day of release?
DATA PROCESSING
1. Duplicate and Missing Values
There are no duplicate or missing values in the provided data set.
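A check of this kind can be sketched with pandas as follows. This is a minimal illustration, not the report's actual code: the frame `df` and its values are hypothetical stand-ins for the ShowTime data, with column names taken from the data dictionary.

```python
import pandas as pd

# Hypothetical toy rows; the real data set has the columns listed in the data dictionary.
df = pd.DataFrame({
    "visitors": [1.67, 1.46, 1.47, 1.85],
    "views_trailer": [56.70, 52.69, 48.74, 49.81],
    "views_content": [0.51, 0.32, 0.39, 0.44],
})

n_duplicates = int(df.duplicated().sum())  # rows that fully repeat an earlier row
n_missing = df.isnull().sum()              # missing values counted per column

print("duplicate rows:", n_duplicates)
print(n_missing)
```

If either count were non-zero, the affected rows would need to be dropped or imputed before modelling.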
2. Feature Engineering
Since the major_sports_event variable contains the values 0 and 1, we have recoded 0 as “No”
and 1 as “Yes”.
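The recoding described here could look like the sketch below. The frame `df` is a hypothetical example; only the column name `major_sports_event` comes from the data dictionary.

```python
import pandas as pd

# Hypothetical values for the binary flag from the data dictionary.
df = pd.DataFrame({"major_sports_event": [0, 1, 1, 0]})

# Map the numeric flag to readable labels so it is treated as a category.
df["major_sports_event"] = df["major_sports_event"].map({0: "No", 1: "Yes"})

print(df["major_sports_event"].value_counts())
```

With labels in place, the column can later be one-hot encoded together with genre, dayofweek and season.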
o In our case, the value for adj. R-squared is 0.780, which is good, and the model is not
underfitting.
o The constant coefficient is the Y-intercept: if all the predictor variable coefficients are
zero, then the expected output (i.e., Y) would be equal to the constant coefficient.
• The train and test RMSE and MAE are comparable, so the model is not overfitting either.
• MAE suggests that the model can predict views of the content within a mean error of 0.039
on the test data
• MAPE of 8.6 on the test data means that we are able to predict within 8.6% of the views of
the content
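For reference, the three error metrics quoted above can be computed as in this minimal sketch. The helper name `regression_metrics` and the sample values are illustrative assumptions, not the report's own code or data.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    return {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
        # MAPE is undefined when y_true contains zeros.
        "MAPE": float(np.mean(np.abs(errors / y_true)) * 100),
    }

# Hypothetical first-day views (in millions) and model predictions.
metrics = regression_metrics([0.50, 0.40, 0.25], [0.45, 0.44, 0.25])
print(metrics)
```

Computing the same dictionary on the train and test sets and comparing the values side by side is what supports the overfitting statement.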
In order to make statistical inferences from the model, we have to test that the linear regression
assumptions are met.
1. No Multicollinearity
2. Linearity of variables
3. Independence of error terms
4. Normality of error terms
5. No Heteroscedasticity
1. No Multicollinearity
By checking the variance inflation factor (VIF), we observe that no value is greater than 5,
indicating that there is no significant multicollinearity.
• Build a model, check the p-values of the variables, and drop the
column with the highest p-value
• Repeat the above until there are no columns with p-value > 0.05
Observations after dealing with high p-values
• Now no feature has p-value greater than 0.05, so we'll consider the features in x_train1 as
the final set of predictor variables and olsmod2 as the final model to move forward with
• Now adjusted R-squared is 0.78, i.e., our model is able to explain ~78% of the variance
• RMSE and MAE values are comparable for train and test sets, indicating that the model is not
overfitting
Observations
• The scatter plot shows the distribution of residuals (errors) vs fitted values (predicted
values).
• If any pattern exists in this plot, we consider it a sign of non-linearity in the data: a
pattern means that the model does not capture non-linear effects.
• We see no pattern in the plot above. Hence, the assumptions of linearity and independence
are satisfied.
4. Normality of error terms
Observations
• The histogram of residuals has a bell shape.
• In the Q-Q plot, the residuals more or less follow a straight line except for the tails.
• Since the p-value is > 0.05, we fail to reject the hypothesis that the residuals are normally
distributed as per the Shapiro-Wilk test.
• So, the assumption is satisfied.
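The Shapiro-Wilk test mentioned above is available in SciPy. In this minimal sketch, `residuals` is a randomly generated stand-in (an assumption) for the residuals of the fitted model.

```python
import numpy as np
from scipy import stats

# Hypothetical residuals; in practice these come from the fitted model.
rng = np.random.default_rng(42)
residuals = rng.normal(0, 0.05, 300)

stat, p_value = stats.shapiro(residuals)
print(f"W = {stat:.4f}, p-value = {p_value:.4f}")

# The null hypothesis is that the residuals are normally distributed.
if p_value > 0.05:
    print("Fail to reject normality at the 5% level")
else:
    print("Reject normality at the 5% level")
```

The test is sensitive with large samples, so it is usually read together with the histogram and the Q-Q plot.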
5. No Heteroscedasticity
Observations
• The residuals vs fitted values plot can be used to check for homoscedasticity. Under
heteroscedasticity, the residuals form an arrow shape or some other non-symmetrical shape.
Since the plot does not form such a shape, there is no visible sign of heteroscedasticity.
• The Goldfeld-Quandt test can also be used. If we get a p-value > 0.05 we can say that the
residuals are homoscedastic. Otherwise, they are heteroscedastic. Since the p-value is > 0.05,
we can say that the residuals are homoscedastic. So, this assumption is satisfied.
PREDICTIONS ON TEST DATA
FINAL MODEL
Observations
• The model's R-squared value is approximately 0.784, and the adjusted R-squared is 0.780,
indicating that the model can explain about 78% of the variance in the data. This is quite
satisfactory.
• An increase of one unit in the 'visitors' variable results in a 0.1253 unit increase in content
viewership, with all other variables held constant.
• Releasing content on Saturdays and Wednesdays will boost viewership, provided no major
sports events occur on those days.
• Releasing content during the summer season can enhance viewership.
• To improve content viewership, it is recommended to avoid releasing content on days when
major sports events are happening.
• Within genre, the "Other" category has the highest count, followed by Comedy, Thriller, Drama,
etc.
• A major sports event will lead to a 0.0605 unit decrease in content viewership.
• Because the data set has few variables, more detailed information or additional variables
should be provided to improve the prediction of viewership.
• The summer season can result in a 0.0442 unit increase in content viewership.
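The coefficient statements above can be combined to estimate how predicted viewership changes between two scenarios. The sketch below uses only the three coefficients quoted in these observations; the dictionary keys `major_sports_event_Yes` and `season_Summer` are assumed dummy-variable names, and the intercept and remaining coefficients are not needed because only differences are computed.

```python
# Coefficients quoted in the observations (same units as views_content).
COEF = {
    "visitors": 0.1253,
    "major_sports_event_Yes": -0.0605,  # assumed dummy name
    "season_Summer": 0.0442,            # assumed dummy name
}

def change_in_views(delta):
    """Change in predicted views for the given predictor changes, others held constant."""
    return sum(COEF[name] * step for name, step in delta.items())

# A summer release that coincides with a major sports event.
summer_with_event = change_in_views({"season_Summer": 1, "major_sports_event_Yes": 1})

# Half a unit more visitors in the past week.
more_visitors = change_in_views({"visitors": 0.5})

print(round(summer_with_event, 4), round(more_visitors, 5))
```

Under these coefficients, the summer gain does not fully offset the loss from a major sports event, which is consistent with the recommendation to avoid event days.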