Contents
Objective
Data Description
Data Dictionary
Understanding the structure of the data
Showtime - Problem Statement
Exploratory Data Analysis
Important Note
Fig: Distribution of Contents
Fig: Distribution of Genres
Fig: Viewership by Day of Release
Fig: Viewership by Season of Release
Fig: Correlation between Trailer Views & Content Views
Data preprocessing
Fig: Boxplot for content views
Model building - Linear Regression
Testing the assumptions of linear regression model
Fig: Residual Plot for Residuals vs Fitted values
Fig: Histogram for Distribution of Residuals
Fig: Scatter plot for Homoscedasticity check
Fig: Scatter plot for Residuals vs Fitted values
Model performance evaluation
Actionable Insights & Recommendations

Context:
An over-the-top (OTT) media service is a media service offered directly to viewers via the
internet. The term is most synonymous with subscription-based video-on-demand services that offer
access to film and television content, including existing series acquired from other producers, as well
as original content produced specifically for the service. They are typically accessed via websites on
personal computers, apps on smartphones and tablets, or televisions with integrated Smart TV
platforms.

Presently, OTT services are at a relatively nascent stage and are widely accepted as a
trending technology across the globe. With customers' social behaviour shifting every year from
traditional subscriptions to broadcasting services toward OTT on-demand video and music
subscriptions, OTT streaming is expected to grow at a very fast pace. The global
OTT market size was valued at $121.61 billion in 2019 and is projected to reach $1,039.03 billion by
2027, growing at a CAGR of 29.4% from 2020 to 2027. The shift from television to OTT services for
entertainment is driven by benefits such as on-demand services, ease of access, and access to better
networks and digital connectivity.

With the outbreak of COVID-19, OTT services are striving to meet
the growing entertainment appetite of viewers, with some platforms already experiencing a 46%
increase in consumption and subscriber count as viewers seek fresh content. With innovations and
advanced transformations that will enable customers to access everything they want in a
single space, OTT platforms across the world are expected to increasingly attract subscribers on a
concurrent basis.

Objective:
ShowTime is an OTT service provider and offers a wide variety of content
(movies, web shows, etc.) for its users. They want to determine the driver variables for first-day
content viewership so that they can take the necessary measures to improve the viewership of the
content on their platform. Possible reasons for a decline in content viewership include a
drop in the number of people coming to the platform, decreased marketing spend, content
timing clashes, weekends and holidays, etc. They have hired you as a Data Scientist, shared the data
on the current content on their platform, and asked you to analyse the data and come up with a linear
regression model to determine the driving factors for first-day viewership.

Data Description:
The data contains the different factors to analyse for the content. The detailed data dictionary is
given below.

Data Dictionary:
•visitors: Average number of visitors, in millions, to the platform in the past week
•ad_impressions: Number of ad impressions, in millions, across all ad campaigns for the content
(running and completed)
•major_sports_event: Any major sports event on the day
•genre: Genre of the content
•dayofweek: Day of the release of the content
•season: Season of the release of the content
•views_trailer: Number of views, in millions, of the content trailer
•views_content: Number of first-day views, in millions, of the content

Understanding the structure of the data:
The provided
dataset relates to an OTT (over-the-top) media service, a streaming
service that offers online content to users. The sample displayed contains 5 rows, each representing
a single observation or record, and 8 columns, each representing a feature or variable.
•The dataset has been loaded properly.
•The dataset consists of several columns describing attributes related to the OTT service.
•visitors: This column represents the average number of visitors, in millions, to the OTT platform
in the past week.
•major_sports_event: This column is a binary indicator (0 or 1) representing whether a major
sports event occurred on the corresponding day. This is relevant information for analysing the impact
of sports events on user engagement.
•views_trailer: This column represents the number of views, in millions, of the content trailer.
•The day of the week and season might have an impact on user
engagement.
•Trailer views are relatively high compared to content views, which might indicate that
users are more interested in previewing content before watching it.

Showtime - Problem Statement:

Exploratory Data Analysis
- Problem definition, questions to be answered
- Data background and contents
- Univariate analysis
- Bivariate analysis
- Answers to the key questions provided
- Insights based on EDA

The EDA section below provides an overview of the data distribution and the relationships between
variables. The plots and statistics in this section help to identify patterns, outliers, and
correlations in the data.

Important Note: The following questions need to be answered as part of the EDA section of the
project:

1. What does the distribution of content views look like?
Fig: Distribution of Contents
The histogram shows a skewed distribution of content views, with most views concentrated around
0-2 million.
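
The structure check and the distribution question above can be explored numerically as well as
visually. The sketch below is illustrative only: the report does not name its data file, so a small
hypothetical DataFrame is built by hand using the column names from the data dictionary, and all
values in it are invented. It prints the shape and column types, summarises views_content, and
computes its skewness.

```python
import pandas as pd

# Hypothetical rows: column names follow the data dictionary,
# but every value below is invented for illustration.
df = pd.DataFrame({
    "visitors": [1.5, 1.6, 1.4, 1.7, 1.5, 1.8, 1.6, 1.9],
    "ad_impressions": [1100.0, 1250.0, 1300.0, 1400.0, 1200.0, 1500.0, 1350.0, 1600.0],
    "major_sports_event": [0, 1, 0, 0, 1, 0, 0, 0],
    "genre": ["Thriller", "Sci-Fi", "Horror", "Thriller",
              "Drama", "Sci-Fi", "Thriller", "Horror"],
    "dayofweek": ["Friday", "Sunday", "Wednesday", "Friday",
                  "Sunday", "Friday", "Monday", "Friday"],
    "season": ["Fall", "Winter", "Spring", "Fall",
               "Winter", "Summer", "Fall", "Spring"],
    "views_trailer": [50.0, 45.0, 48.0, 60.0, 52.0, 65.0, 80.0, 95.0],
    "views_content": [0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.9, 1.4],
})

# Structure: (rows, columns) and the type of each column.
print(df.shape)
print(df.dtypes)

# Distribution of the target: summary statistics and skewness.
print(df["views_content"].describe())
skewness = df["views_content"].skew()  # > 0 means a longer right tail

# Text histogram: number of observations in each half-million bin.
binned = pd.cut(df["views_content"], bins=[0, 0.5, 1.0, 1.5]).value_counts().sort_index()
print(binned)
```

On the real data, the sign of the skewness and the bin counts would indicate in which direction the
distribution is skewed and where the views are concentrated.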

2. What does the distribution of genres look like?
Fig: Distribution of Genres
The count plot reveals that Thriller is the most common genre in the dataset, followed by Sci-Fi
and Horror.

3. The day of the week on which content is released generally plays a key role in the viewership.
How does the viewership vary with the day of release?
Fig: Viewership by Day of Release
The box plot shows that viewership is highest on Fridays and lowest on Sundays.

4. How does the viewership vary with the season of release?
Fig: Viewership by Season of Release
The box plot indicates that viewership is highest in Fall and lowest in Winter.

5. What is the correlation between trailer views and content views?
Fig: Correlation between Trailer Views & Content Views
The scatter plot and correlation coefficient (0.83) suggest a strong positive relationship between
trailer views and content views.

Data preprocessing
- Duplicate value check
- Missing value treatment
- Outlier treatment
- Feature engineering
- Data preparation for modelling

Data preprocessing prepares the data for modelling by:
•Checking for duplicates.
•Checking for missing values.
•Detecting outliers in content views.
•Encoding categorical variables (genre, dayofweek, season) using one-hot encoding.

a. Duplicate Value Check:
No duplicates are found in the given dataset.

b. Missing Value Treatment:
No missing values are found in the given dataset.

c. Outlier Treatment:
Fig: Boxplot for content views
No outliers have been detected in the content views for the given dataset.

d. Feature Engineering:
We have encoded the categorical variables (genre, dayofweek, season) using one-hot encoding.

e. Data Preparation for Model Building:

Model building - Linear Regression
- Build the model and comment on the model statistics
- Display model coefficients with column names.
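
The preprocessing checks above and the split-and-fit steps of this section could be sketched
roughly as follows. This is a minimal illustration under several assumptions, none of which come
from the report: the data is synthetic (random values under the data dictionary's column names),
the 1.5 x IQR rule is used for the outlier check, drop_first=True is used in the one-hot encoding,
the split is 70/30, and the model is fitted with numpy's least-squares solver because the report
does not say which library was used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: column names follow the data dictionary,
# all values are randomly generated for illustration.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "visitors": rng.uniform(1.2, 2.0, n),
    "ad_impressions": rng.uniform(1000, 2000, n),
    "major_sports_event": rng.integers(0, 2, n),
    "genre": rng.choice(["Thriller", "Sci-Fi", "Horror"], n),
    "dayofweek": rng.choice(["Friday", "Sunday", "Wednesday"], n),
    "season": rng.choice(["Fall", "Winter", "Summer"], n),
    "views_trailer": rng.uniform(30, 90, n),
})
# Synthetic target so that the fit has a known slope to recover.
df["views_content"] = 0.1 + 0.005 * df["views_trailer"] + rng.normal(0, 0.05, n)

# a./b. Duplicate and missing-value checks.
n_duplicates = int(df.duplicated().sum())
n_missing = int(df.isnull().sum().sum())

# c. Outlier check on the target with the 1.5 * IQR rule (assumed rule).
q1, q3 = df["views_content"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["views_content"] < q1 - 1.5 * iqr) | (df["views_content"] > q3 + 1.5 * iqr)
n_outliers = int(is_outlier.sum())

# d. One-hot encoding; drop_first avoids perfectly collinear dummies (assumption).
X = pd.get_dummies(df.drop(columns="views_content"),
                   columns=["genre", "dayofweek", "season"],
                   drop_first=True, dtype=float)
y = df["views_content"]

# e. 70/30 train/test split (the ratio is an assumption), then shape check.
train_idx = X.sample(frac=0.7, random_state=1).index
X_train, X_test = X.loc[train_idx], X.drop(train_idx)
y_train, y_test = y.loc[train_idx], y.drop(train_idx)
print(X_train.shape, X_test.shape)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(len(X_train)), X_train.to_numpy(dtype=float)])
beta, *_ = np.linalg.lstsq(A, y_train.to_numpy(), rcond=None)

# Coefficients displayed with their column names.
coefficients = pd.Series(beta, index=["const", *X_train.columns])
print(coefficients)
```

Labelling the coefficients with the encoded column names makes it easier to see which genre, day,
or season dummy each estimate belongs to.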

A linear regression model is built to predict content views from the pre-processed data above,
and the model's coefficients and statistics are displayed.
•Splitting the data into training and testing sets.
•Checking the shapes of the training and testing sets.
•Displaying the model coefficients with column names.

Testing the assumptions of linear regression model
- Perform tests for the assumptions of the linear regression
- Comment on the findings from the tests

1. Linearity:
Fig: Residual Plot for Residuals vs Fitted values
The residual plot supports the linearity assumption: a residuals-vs-fitted plot indicates linearity
when the residuals scatter around zero without a systematic pattern.

2. Normality of Residuals
Fig: Histogram for Distribution of Residuals
The histogram shows an approximately normal distribution of residuals.

3. Homoscedasticity
Fig: Scatter plot for Homoscedasticity check
The scatter plot indicates constant variance of residuals across predicted views.

4. Independence
Fig: Scatter plot for Residuals vs Fitted values
The scatter plot suggests that the residuals are independent.

Model performance evaluation
- Evaluate the model on different performance metrics

The model's performance was evaluated using the following metrics:
•Mean Absolute Error (MAE): 0.43
•Mean Squared Error (MSE): 0.23
•Root Mean Squared Error (RMSE): 0.48
•R-squared (R²): 0.81
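
For reference, the four metrics above can be computed from actual and predicted values as in the
sketch below. The function name regression_metrics and the sample values are hypothetical and are
not the report's data. The last lines check that the reported RMSE is consistent with the reported
MSE, since RMSE is the square root of MSE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical actual and predicted first-day views (in millions).
y_true = [0.40, 0.55, 0.30, 0.70, 0.50]
y_pred = [0.45, 0.50, 0.35, 0.60, 0.50]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Consistency check on the reported values: sqrt(0.23) rounds to 0.48.
reported_mse, reported_rmse = 0.23, 0.48
rmse_matches_mse = abs(math.sqrt(reported_mse) - reported_rmse) < 0.005
print(rmse_matches_mse)
```

MAE and RMSE are expressed in the same unit as views_content (millions of views), while MSE is in
squared units, which is why RMSE is usually the easier of the two to interpret.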

Actionable Insights & Recommendations
- Comments on significance of predictors
- Key takeaways for the business

Below are the insights and recommendations based on the analysis:

Significance of Predictors: The coefficients table shows that ad impressions, trailer views, and
certain genres (e.g., Thriller) are significant predictors of content views.

Key Recommendations:
•Increase marketing spend to improve ad impressions.
•Optimize content release schedules, avoiding clashes with major sports events.
•Promote trailers more to increase first-day viewership.

Overall, this report provides a comprehensive analysis of the OTT data, identifying key patterns,
relationships, and predictors of content views. The insights and recommendations generated can
inform business decisions to improve the OTT service's performance.