Project Fahim Slmbi
Submitted to
Symbiosis School for Online and Digital Learning (SSODL), Pune
Submitted in partial fulfilment of the requirement for the award
of
Student Declaration
I, Fahim Ahmed Khan, hereby declare that the report entitled “Exploratory
data analysis on Netflix dataset (Investigating Netflix movies and guest
stars)”, submitted at the “Symbiosis School for Online and Digital Learning” in partial
fulfilment of the requirement for the award of the “MBA in Business Analytics”,
is my original work.
ACKNOWLEDGEMENTS
Signature: Fahim
Name of student: Fahim Ahmed Khan
Program Name: MBA in Business Analytics
PRN no.: 2302914977
INDEX
INTRODUCTION
COMPANY PROFILE
LITERATURE REVIEW
THEORETICAL REVIEW
OBJECTIVE OF THE STUDY
RESEARCH METHODOLOGY
ANALYSIS
FINDINGS
CONCLUSION
REFERENCES
INTRODUCTION
Launched on January 16, 2007, nearly a decade after Netflix, Inc. began its pioneering
DVD-by-mail movie rental service, Netflix is the most-subscribed video on demand
streaming media service, with 238.39 million paid memberships in more than 190 countries.
By 2022, "Netflix Original" productions accounted for half of its library in the United States
and the namesake company had ventured into other categories, such as video game
publishing of mobile games via its flagship service.
Online television and premium video industry.
2. Extensive Content Library: Netflix offers a vast and diverse library of content, including
movies, TV series, documentaries, and stand-up comedy specials. It features content from
various genres, languages, and countries, catering to a broad spectrum of tastes and
preferences.
3. Original Content: Netflix is known for its investment in original programming. The
company produces and releases exclusive content, including critically acclaimed series like
"Stranger Things," "The Crown," "Narcos," and original movies like "Bird Box" and "The
Irishman."
6. Offline Viewing: Netflix allows users to download content for offline viewing, which is
especially useful for those with limited or no internet access while traveling or in remote
locations.
7. Ad-Free Experience: Unlike traditional television, Netflix offers an ad-free viewing
experience. Subscribers can watch content without interruptions from advertisements.
8. Multiple Subscription Tiers: Netflix offers different subscription tiers with varying
pricing and features, including options for single, dual, or multiple device access and
different video quality levels.
9. Global Reach: Netflix has a global presence, producing content and providing services in
multiple languages and regions. It has expanded its content library to cater to a diverse
international audience.
10. Awards and Recognition: Netflix has received numerous industry awards and
nominations, including Academy Awards (Oscars), Emmy Awards, and Golden Globe
Awards for its original content.
Netflix has had a profound impact on the entertainment industry, contributing to the shift
away from traditional cable and broadcast television toward on-demand streaming. It has a
strong focus on innovation, content creation, and user experience, making it one of the most
influential players in the media and entertainment sector. With its global reach and an ever-
expanding library of content, Netflix continues to be a major player in the digital
entertainment landscape.
1. Data Sources: Netflix gathers vast amounts of data on user interactions, content
consumption, and other platform-related activities. This data includes user demographics,
viewing histories, content metadata, and more. EDA begins with the collection and
integration of these diverse data sources.
2. Data Preprocessing: Before performing any analysis, data collected from various sources
need to be pre-processed. This involves handling missing data, dealing with outliers, and
ensuring data quality. In the context of Netflix, this may include addressing incomplete user
profiles or content information.
4. Summary Statistics: EDA often involves calculating and interpreting summary statistics,
such as mean, median, standard deviation, and percentiles. For Netflix, this might mean
calculating average viewing times, identifying the most-watched genres, or analysing trends
over time.
5. Pattern Discovery: EDA aims to uncover patterns, trends, and anomalies in the data. In
the context of Netflix, this might include identifying peak usage times, understanding how
user demographics influence content preferences, and recognizing seasonal viewing patterns.
6. Hypothesis Testing: Hypotheses can be formulated during EDA and tested using
statistical methods. For instance, analysts might investigate whether the release of certain
content correlates with changes in user behaviour or if specific genres attract more viewers.
7. Machine Learning and Predictive Analytics: EDA can serve as the foundation for more
advanced analytics, including machine learning models. For Netflix, predictive models can be
developed to forecast user preferences, improve content recommendations, or optimize
content acquisition strategies.
8. Data Storytelling: The insights gained from EDA need to be effectively communicated to
relevant stakeholders. Data storytelling through reports, presentations, and data visualization
tools helps decision-makers within Netflix leverage the findings to make informed business
decisions.
Exploratory Data Analysis for Netflix is essential in understanding user behaviour and
content performance. By employing data-driven insights, Netflix can refine its content
acquisition strategies, enhance user recommendations, and ultimately provide a more tailored
and satisfying streaming experience to its subscribers. EDA plays a crucial role in the data-
driven decision-making process for a platform that has a global audience and offers a diverse
range of content.
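The summary-statistics and pattern-discovery steps described above can be illustrated with a minimal sketch. The viewing log below is invented for illustration only; the column names `genre` and `minutes_watched` are assumptions and do not come from any actual Netflix data.

```python
import pandas as pd

# Hypothetical viewing log; the column names and values are invented for illustration.
views = pd.DataFrame(
    {
        "genre": ["Drama", "Comedy", "Drama", "Documentary", "Comedy", "Drama"],
        "minutes_watched": [50, 30, 70, 40, 20, 60],
    }
)

# Summary statistics for viewing time (count, mean, standard deviation, percentiles).
summary = views["minutes_watched"].describe()
median_minutes = views["minutes_watched"].median()

# Total viewing time per genre, largest first, to see which genre is watched most.
minutes_by_genre = (
    views.groupby("genre")["minutes_watched"].sum().sort_values(ascending=False)
)

print(summary)
print("Median minutes:", median_minutes)
print(minutes_by_genre)
```

On real data the same calls would be applied to the loaded dataset instead of the hand-written frame.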
COMPANY PROFILE
Netflix was founded by Marc Randolph and Reed Hastings on August 29, 1997, in Scotts
Valley, California. Hastings, a computer scientist and mathematician, was a co-founder of
Pure Software, which was acquired by Rational Software that year for $750 million, then the
biggest acquisition in Silicon Valley history. Randolph had worked as a marketing director for
Pure Software after Pure Atria acquired a company where Randolph worked. He was
previously a co-founder of Micro Warehouse, a computer mail-order company as well as vice
president of marketing for Borland. Hastings and Randolph came up with the idea for Netflix
while carpooling between their homes in Santa Cruz, California, and Pure Atria's
headquarters in Sunnyvale.
In January 2007, the company launched a streaming media service, introducing video on
demand via the Internet. However, at that time it only had 1,000 films available for
streaming, compared to 70,000 available on DVD. The company had for some time
considered offering movies online, but it was only in the mid-2000s that data speeds and
bandwidth costs had improved sufficiently to allow customers to download movies from the
net. The original idea was a "Netflix box" that could download movies overnight, and be
ready to watch the next day. By 2005, Netflix had acquired movie rights and designed the box
and service. But after witnessing how popular streaming services such as YouTube were
despite the lack of high-definition content, the concept of using a hardware device was
scrapped and replaced with a streaming concept.
In January 2011, Netflix announced agreements with several manufacturers to include
branded Netflix buttons on the remote controls of devices compatible with the service, such
as Blu-ray players. By May 2011, Netflix had become the largest source of Internet streaming
traffic in North America, accounting for 30% of traffic during peak hours.
Key Points:
1. Revolutionary Streaming Service: Netflix initially started as a DVD rental service but
made a historic pivot by launching its online streaming platform in 2007. This move
disrupted traditional television and ushered in the era of on-demand streaming.
2. Global Presence: As of September 2021, Netflix was
available in over 190 countries. It has been rapidly expanding its reach to provide
content to a truly global audience while customizing its library for various regions.
3. Extensive Content Library:
   - Movies: Netflix offers a vast collection of movies, ranging from classic films
     to the latest releases.
   - TV Series: The platform includes an extensive catalogue of TV series, from
     popular network shows to original series that have garnered critical acclaim.
4. Personalized User Experience: Netflix employs sophisticated algorithms and machine
learning to provide personalized content recommendations to its subscribers. These
recommendations are based on a user's viewing history and preferences, enhancing the
overall viewing experience.
5. Multi-Device Accessibility: Netflix is accessible on a wide range of devices, including
smartphones, tablets, smart TVs, gaming consoles, and web browsers. The platform
offers a seamless experience, allowing users to start watching on one device and
continue on another without losing their place.
6. Offline Viewing: Netflix introduced the option to download content for offline viewing.
This feature is especially useful for users who want to watch content without an internet
connection, such as during travel or in areas with limited connectivity.
7. Ad-Free Environment: Netflix is known for its ad-free streaming experience, in
contrast to traditional television. This allows viewers to enjoy content without any
interruptions from advertisements.
8. Subscription Tiers: Netflix offers several subscription tiers, each with different pricing
and features. These tiers cater to various user needs and may include options for single
or multiple device access, as well as different video quality levels (e.g., Standard
Definition, High Definition, and 4K Ultra HD).
9. Awards and Recognition: Netflix's original content has received widespread critical
acclaim and numerous industry awards. This includes accolades from prestigious
institutions like the Academy Awards (Oscars), Emmy Awards, and Golden Globe
Awards.
10. Innovation and Industry Impact: Netflix's innovative approach to content creation,
technology, and user experience has made it a trailblazer in the streaming industry. It
has not only inspired the development of numerous other streaming platforms but also
contributed to the broader shift away from traditional cable and broadcast television.
Financial Performance:
- Revenue: As of September 2021, Netflix's revenue had been
increasing steadily over the years. In 2020, the company reported annual revenue of
approximately $25 billion.
- Profitability: Netflix had been reinvesting heavily in content production and international
expansion, which sometimes resulted in lower profits. However, the company's profitability
had been improving, and it consistently reported positive net income.
- Debt: Netflix has incurred significant long-term debt to fund its content production and
global expansion. In 2020, the long-term debt stood at around $16 billion. This debt has been
considered a strategic move to secure a dominant position in the industry.
Content Production:
- Netflix invests heavily in original content creation to distinguish itself from competitors.
The company had been increasing its content budget, allocating billions of dollars each year
to produce and license content. This includes scripted and unscripted series, films,
documentaries, and more.
- The company had entered into exclusive production and distribution deals with well-known
creators and showrunners, which helped in creating unique and highly popular content.
Subscriber Growth:
- Netflix's subscriber base had been growing steadily. In Q2 2021, the company reported over
209 million paid subscribers globally.
- International expansion was a significant driver of this growth. A substantial portion of
Netflix's subscribers come from outside the United States.
Challenges:
- Netflix faces challenges related to content acquisition costs, especially as it competes with
other streaming platforms for exclusive rights to popular shows and movies.
- The competitive landscape is continuously evolving, with new entrants in the streaming
industry and traditional media companies launching their streaming services.
- Regional regulations and content restrictions can pose challenges as Netflix operates
globally and must adhere to various content guidelines and censorship rules in different
countries.
Leadership:
- Reed Hastings has been a key figure in the company's leadership, serving as Co-Founder,
Chairman, and Co-CEO. Ted Sarandos, as the Co-CEO, is responsible for content acquisition
and creation, making him a central figure in Netflix's original content strategy.
LITERATURE REVIEW
This section reviews in more detail how Exploratory Data Analysis (EDA)
can be applied to Netflix:
1. User Behaviour Analysis: Netflix can collect and analyse a vast amount of data related to
user behaviour. EDA could be used to identify patterns, such as the most popular genres, how
long users spend on the platform, and at what times of day they are most active.
By segmenting users into different groups based on their viewing habits, Netflix can tailor its
recommendations and content suggestions more effectively.
2. Content Performance Analysis: EDA can help assess the performance of individual
movies and TV shows. Metrics like viewer ratings, watch duration, and drop-off points in a
series can be analysed.
By understanding what content resonates with the audience, Netflix can make informed
decisions about renewing, promoting, or creating new content.
3. Geographic Analysis: Geographic data can be explored to identify regional content
preferences. EDA can reveal differences in popular genres and viewing habits across
countries.
Netflix can use this information to localize its content libraries and improve content
recommendations for viewers in different regions.
4. User Retention and Churn Analysis: EDA can be applied to assess user retention and
churn. By examining user engagement patterns and subscription duration, Netflix can identify
factors that contribute to user retention.
Insights from EDA can inform strategies to reduce churn, such as personalized retention
campaigns or content recommendations.
5. Recommendation System Improvement: Netflix's recommendation system is powered
by data analysis. EDA can be used to assess the performance of the recommendation
algorithm by studying the accuracy of recommendations, click-through rates, and user
feedback.
Netflix can use EDA findings to fine-tune its recommendation system, improving the quality
of personalized content suggestions.
6. A/B Testing: EDA is essential in designing and analysing A/B tests. When Netflix makes
changes to its platform, such as altering the user interface, it can employ EDA to evaluate the
impact of these changes on user behaviour.
Netflix can perform A/B tests to assess different features, layouts, or promotional strategies,
and EDA helps in interpreting the results.
7. Content Licensing Decisions: EDA can aid in making informed content licensing
decisions. By analysing user engagement with licensed content, Netflix can determine which
types of content have the greatest impact on viewer retention and acquisition.
This data can be used during negotiations with content providers.
8. Content Creation and Investment: EDA can provide insights into the success of different
content genres and types. For original content production, Netflix can analyse the viewing
data of similar shows or movies to inform budget allocation and creative decisions.
It can also help in identifying emerging content trends and opportunities.
9. Security and Anomaly Detection: Beyond content and user behaviour, EDA can be used
for security. Netflix can analyse login and access patterns to detect unusual activities,
potentially indicating unauthorized account access or security breaches.
EDA can help in identifying and mitigating security threats in real-time.
10. Cohort Analysis: Netflix can perform cohort analysis to group users based on certain
criteria, such as the date they joined or their subscription plan. EDA can help Netflix
understand how different user cohorts behave over time, allowing them to make more
targeted decisions.
11. Content Popularity Trends: EDA can uncover trends in content popularity. By analysing
which genres or themes are trending, Netflix can adjust its content creation and licensing
strategies to meet current demands.
12. Personalization and Niche Content: EDA can identify niche content preferences within
user segments. By understanding these niches, Netflix can cater to the long tail of user
interests and recommend specialized content to niche audiences.
13. Predictive Modelling: EDA can serve as a foundation for predictive modelling. For
instance, it can be used to build predictive models for user churn, helping Netflix anticipate
and reduce subscriber attrition.
14. Viewer Demographics: EDA can help Netflix identify the demographics of its viewers.
Analysing data related to age, gender, location, and viewing habits can provide valuable
insights for content targeting and advertising.
15. Content Discovery: EDA can be used to improve content discovery. By analysing how
users navigate the platform and discover new content, Netflix can optimize its user interface,
search functionality, and content recommendations.
16. Viewer Sentiment Analysis: By analysing social media and user reviews, Netflix can
perform sentiment analysis. This can help the company understand how viewers feel about its
content and make adjustments as necessary.
17. Real-time Analytics: EDA can also be applied to real-time data streams. Netflix can use
it to monitor live events, user interactions, and platform performance in real time, enabling
quick responses and improvements.
18. Multi-device Analysis: As viewers access Netflix on various devices, EDA can help
analyse user behaviour across different platforms (smartphones, smart TVs, tablets, etc.),
providing insights into user preferences on each.
19. Quality of Experience (QoE): Netflix can use EDA to assess the quality of user
experience, including video streaming quality and buffering issues. By identifying pain
points, the company can improve QoE for its subscribers.
20. Content Cost-Benefit Analysis: EDA can help Netflix evaluate the cost-effectiveness of
content. By analysing production costs, viewership numbers, and user feedback, the company
can determine which content provides the best return on investment.
In summary, EDA is a powerful tool for Netflix to gain a deeper understanding of its users,
content, and the streaming industry. By applying EDA techniques, Netflix can make data-
driven decisions that enhance the user experience, content library, and overall business
performance.
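As one concrete illustration of the review above, the cohort analysis in item 10 can be sketched as follows. This is a minimal sketch on an invented activity log, not a description of Netflix's actual method; the column names `user_id` and `active_month` are assumptions.

```python
import pandas as pd

# Hypothetical activity log: one row per user per month in which the user was active.
activity = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3, 3, 4],
        "active_month": pd.to_datetime(
            [
                "2023-01-01", "2023-02-01", "2023-03-01",
                "2023-01-01", "2023-02-01",
                "2023-02-01", "2023-03-01",
                "2023-02-01",
            ]
        ),
    }
)

# A user's cohort is the first month in which that user was active.
activity["cohort"] = activity.groupby("user_id")["active_month"].transform("min")

# Months elapsed since the cohort month (0 = the month the user joined).
activity["months_since_join"] = (
    (activity["active_month"].dt.year - activity["cohort"].dt.year) * 12
    + (activity["active_month"].dt.month - activity["cohort"].dt.month)
)

# Count distinct active users per cohort and month offset.
cohort_counts = (
    activity.groupby(["cohort", "months_since_join"])["user_id"].nunique().unstack()
)

# Divide by the cohort size (offset 0) to obtain retention rates.
retention = cohort_counts.div(cohort_counts[0], axis=0)
print(retention)
```

Each row of `retention` is one cohort and each column is the number of months since joining, so a falling value along a row indicates churn within that cohort.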
OBJECTIVE OF THE STUDY
1. To Understand User Behaviour: Analyse how users interact with the platform, what
content they engage with, and when they use Netflix. This understanding is crucial for
personalizing recommendations, content curation, and improving the user experience.
2. To Assess Content Performance: Evaluate the success of individual movies, series, and
documentaries. Identify which content resonates with viewers, has higher viewer ratings, and
drives user retention.
3. To Predict and Reduce Churn: Identify factors that contribute to user churn and
develop strategies to reduce subscription cancellations. EDA helps in recognizing patterns
and trends that may indicate when users are likely to leave the service.
4. To Enhance User Experience: Improve the user interface and content discovery features
based on how users navigate the platform. EDA guides user interface design and content
promotion strategies for an intuitive user experience.
We collected a comprehensive dataset on Netflix, including information about the content
library, user ratings, and genre classifications. The dataset covers a span of several years and
ensures a robust representation of Netflix's offerings. Data collection is a crucial first step in
Exploratory Data Analysis (EDA) for Netflix or any other data analysis project. Netflix itself
collects a vast amount of data related to user behaviour, content, and platform performance.
1. Review Stage: We first wanted an overview of the dataset we were dealing
with, so we began by loading the libraries needed for a simple data analysis. We obtained the
dataset from Kaggle and use the data it provides
to understand the trend of movies and TV shows released on the platform.
2. Import Libraries:
import pandas as pd  # data loading and manipulation
import matplotlib.pyplot as plt  # plotting
3. Loading the Dataset: Using the Pandas library, we load the CSV file.
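A minimal sketch of the loading step is given here. The report does not state the file name or the columns, so the file name `netflix_titles.csv` and the three columns used are assumptions; an in-memory string stands in for the file so that the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for the CSV file. With the real file one would call
# pd.read_csv("netflix_titles.csv"); the file name is an assumption.
csv_text = io.StringIO(
    "title,type,release_year\n"
    "Title A,Movie,2019\n"
    "Title B,TV Show,2020\n"
    "Title C,Movie,2020\n"
)

df = pd.read_csv(csv_text)

print(df.shape)                   # number of rows and columns
print(df.head())                  # first few records
print(df["type"].value_counts())  # number of titles per content type
```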
DATA PROCESSING
To prepare the data for analysis, we performed data cleaning, addressing missing values, and
handling outliers. We also transformed the data into a format suitable for analysis, ensuring
the integrity and accuracy of our findings.
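The cleaning and transformation described above might look like the following sketch. The rows are invented, and the column names (`director`, `date_added`, `duration`) are assumptions about the dataset rather than details stated in this report.

```python
import pandas as pd

# Hypothetical rows; the column names and values are assumptions for illustration.
raw = pd.DataFrame(
    {
        "title": ["Title A", "Title B", "Title C"],
        "director": ["Director X", None, "Director Y"],
        "date_added": ["2021-09-25", "2021-09-24", None],
        "duration": ["90 min", "125 min", "104 min"],
    }
)

clean = raw.copy()

# Replace missing text values with an explicit placeholder.
clean["director"] = clean["director"].fillna("Unknown")

# Parse dates; missing or unparseable values become NaT instead of raising an error.
clean["date_added"] = pd.to_datetime(clean["date_added"], errors="coerce")

# Extract the numeric part of the duration so it can be analysed as a number.
clean["duration_min"] = (
    clean["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(clean.dtypes)
print(clean)
```

Working on a copy keeps the raw data unchanged, which makes it easier to check each cleaning step against the original values.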
Here are some key aspects of data collection in EDA for Netflix:
1. User Interactions: Netflix records data on how users interact with the platform. This
includes information on when users log in, what they search for, what they watch,
how long they watch, and when they pause or stop content.
2. User Demographics: Data collection includes information about users'
demographics, such as age, gender, location, and language preferences. This
information can help Netflix understand its user base better.
3. Content Metadata: Netflix collects data related to its extensive content library. This
includes details about movies, TV shows, documentaries, release dates, genres, cast
and crew information, ratings, and viewer reviews.
4. Viewing History: Users' viewing history, which includes data on previously watched
content and the frequency of viewing, is recorded to provide personalized
recommendations.
5. Geographic Data: Geographic data helps Netflix understand where its users are
located. This information can be used to analyse regional content preferences and
adapt the content library to local tastes.
6. Engagement Metrics: Netflix monitors various engagement metrics, such as the time
spent on the platform, the number of episodes or movies watched in a single session,
and the frequency of logins.
7. Content Performance Data: Data on how well specific content is performing,
including metrics like the number of views, viewer ratings, and viewer reviews, are
collected. This information helps in content recommendation and decision-making for
future content production.
8. Quality of Service Metrics: Data on the quality of the streaming service, including
load times, buffering, and video quality, is collected to ensure a smooth user
experience.
9. Surveys and Feedback: In addition to passive data collection, Netflix may also
conduct user surveys and collect feedback to gain qualitative insights into user
preferences and suggestions for improvements.
10. Legal and Privacy Considerations: Netflix adheres to strict legal and privacy
regulations when collecting and handling user data. This includes obtaining user
consent, anonymizing data when necessary, and ensuring the security of user
information.
ANALYSIS
The first step is to import the required libraries. In this code block, the Pandas library is used
to read and manipulate data and the Pandas-profiling library is used for EDA. The datasets
module from the scikit-learn library is used to load the Iris dataset.
import pandas as pd
import pandas_profiling
from sklearn import datasets
Next, we have to load the dataset. Here, we will be using the multivariate dataset named the
Iris dataset.
iris = datasets.load_iris()
The scikit-learn dataset is loaded as a Bunch object, similar to a dictionary. To use this
dataset with Pandas, we must convert it to a Pandas DataFrame.
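One way to perform this conversion is sketched below. The name `iris_data` matches the variable used in the following steps; adding a `target` column for the class labels is our own optional addition.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Build a DataFrame from the feature matrix, using the feature names as column labels.
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)

# Optionally keep the class labels (0, 1, 2) alongside the features.
iris_data["target"] = iris.target

print(iris_data.shape)
```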
It is always a good approach to check the attributes of the data, like its shape or the number of
rows and columns in the dataset. To check the shape of the data, run the following code:
iris_data.shape
To check the column names of the DataFrame, we use the columns attribute.
iris_data.columns
If your dataset is large, you can view the first few records of the DataFrame by running the
code below:
iris_data.head()
Once you have finished scanning the attributes of the data, you can make any necessary
modifications to the dataset, such as renaming the columns or the rows. Remember
not to change the dataset’s variables, which can significantly impact the final result.
To clean the data, you must first check for any null values in the variables. If any of the
variables in a dataset have null values, the analysis results can be affected. If your dataset has
missing data, handle it through approaches such as imputation, deletion of observations or
variables, or using models that can handle missing data.
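The null check and two of the handling approaches mentioned above can be sketched as follows. The small frame is illustrative only and is not the real Iris data; the choice of the median for imputation is an assumption, and other strategies may suit other data better.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with two missing values (not the real Iris data).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, np.nan, 5.0],
        "petal_length": [1.4, np.nan, 1.3, 1.5],
    }
)

# Count missing values per column.
null_counts = df.isnull().sum()

# Option 1: imputation, here with each column's median.
imputed = df.fillna(df.median())

# Option 2: deletion of observations that contain any missing value.
dropped = df.dropna()

print(null_counts)
print(imputed)
print(dropped)
```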
Next, if you find any redundant data in your dataset that does not add value to the output, you
can remove it from the table. All the columns and rows are important in the Iris
dataset we have taken, so we will not drop any data. In this step, we must also find
any outliers in the data.
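A minimal sketch of both checks is shown below. The text does not name an outlier method, so the 1.5 x IQR rule used here is an assumption, and the values (including the deliberately extreme 25.0 and the repeated row) are invented for illustration.

```python
import pandas as pd

# Illustrative values: sample 5 is recorded twice and 25.0 is deliberately extreme.
df = pd.DataFrame(
    {
        "sample_id": [1, 2, 3, 4, 5, 5, 6],
        "sepal_width": [3.0, 3.2, 3.1, 2.9, 3.3, 3.3, 25.0],
    }
)

# Remove rows that are exact duplicates of an earlier row.
deduplicated = df.drop_duplicates()

# Flag outliers with the 1.5 * IQR rule (one common convention, assumed here).
q1 = deduplicated["sepal_width"].quantile(0.25)
q3 = deduplicated["sepal_width"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (deduplicated["sepal_width"] < lower) | (deduplicated["sepal_width"] > upper)
outliers = deduplicated[is_outlier]

print(deduplicated)
print(outliers)
```

Whether a flagged value should be removed, corrected, or kept depends on whether it is a recording error or a genuine observation.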
- Correlation analysis: The analyst computes the correlation matrix between variables
to identify which variables are strongly correlated with each other.
- Visualization: The data analyst creates visualizations to explore the relationship
between variables. This includes scatter plots, heatmaps, etc.
- Hypothesis testing: The analyst performs statistical tests to test hypotheses about the
relationship between variables.
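The correlation analysis step can be sketched as follows. The values are invented for illustration and the column names are assumptions; `DataFrame.corr()` computes Pearson correlation by default.

```python
import pandas as pd

# Illustrative measurements (invented values, not the real Iris data).
df = pd.DataFrame(
    {
        "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
        "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
        "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Variable most strongly correlated (in absolute value) with petal_length.
strongest = corr["petal_length"].drop("petal_length").abs().idxmax()

print(corr.round(2))
print("Most correlated with petal_length:", strongest)
```

The same matrix can be passed to a heatmap for the visualization step.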
Run the following code to generate a report that includes various relationship analyses
between variables.