Business Case - Netflix - Data Exploration and Visualisation - Ipynb - Colab
Netflix is one of the most popular media and video streaming platforms. It has over 10,000 movies and TV shows available on its platform and,
as of mid-2021, over 222M subscribers globally. This tabular dataset lists all the movies and TV shows available on
Netflix, along with details such as cast, directors, ratings, release year, and duration.
Business Problem
Analyze the data and generate insights that could help Netflix in deciding which types of shows/movies to produce and how it can grow the
business in different countries.
Import Libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
url="/content/netflix_df.csv"
netflix_data = pd.read_csv(url)
netflix_data.head()
   show_id  type     title                  director         cast                                               country        date_added
0  s1       Movie    Dick Johnson Is Dead   Kirsten Johnson  NaN                                                United States  September 25, 2021
1  s2       TV Show  Blood & Water          NaN              Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...  South Africa   September 24, 2021
...
netflix_data
   show_id  type     title                  director         cast                                               country        date_added
0  s1       Movie    Dick Johnson Is Dead   Kirsten Johnson  NaN                                                United States  September 25, 2021
1  s2       TV Show  Blood & Water          NaN              Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...  South Africa   September 24, 2021
2  s3       TV Show  Ganglands              Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...  NaN            September 24, 2021
3  s4       TV Show  Jailbirds New Orleans  NaN              NaN                                                NaN            September 24, 2021
...
https://colab.research.google.com/drive/1KbfWHQLw9KiiquBGbykxBYE5IYKCnNEV#scrollTo=DlGL8sAaDCt-&printMode=true 1/9
9/28/24, 2:50 AM Business Case: Netflix - Data Exploration and Visualisation.ipynb - Colab
The dataset contains 8807 titles and 12 columns. After a quick view of the data frame, it looks like a typical movie/TV show data frame
without user ratings. We can also see that there are NaN values in some columns.
netflix_data.columns
netflix_data.ndim
netflix_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
duration True
listed_in False
description False
dtype: bool
netflix_data.describe()
release_year duration_int
From the info, we know that there are 8807 entries and 12 columns to work with for this EDA. Six columns contain null values:
"director", "cast", "country", "date_added", "rating", and "duration".
show_id 0
type 0
title 0
director 2634
cast 825
country 831
date_added 10
release_year 0
rating 4
duration 3
listed_in 0
description 0
netflix_data.isnull().sum().sum()
4307
There are a total of 4307 null values across the entire dataset, with 2634 missing points under "director", 825 under "cast", 831 under "country",
10 under "date_added", 4 under "rating", and 3 under "duration". We will have to handle all null data points before we can dive into EDA and
modelling.
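One way to handle the null values counted above is sketched below. It is a minimal illustration, not the notebook's own cleaning step: the `sample` frame is a made-up stand-in for `netflix_data`, and the policy (label missing text fields as "Unknown", drop rows without a `date_added`) is an assumption.

```python
import pandas as pd

# Tiny stand-in for netflix_data; the rows are illustrative only.
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None, "Julien Leclercq"],
    "country": ["United States", "South Africa", None],
    "date_added": ["September 25, 2021", None, "September 24, 2021"],
})

# Share of missing values per column, in percent.
missing_pct = sample.isnull().mean().mul(100).round(1)

# Assumed policy: label missing text fields, drop rows with no date_added.
cleaned = sample.fillna({"director": "Unknown", "country": "Unknown"})
cleaned = cleaned.dropna(subset=["date_added"]).reset_index(drop=True)

print(missing_pct)
print("Nulls remaining:", cleaned.isnull().sum().sum())
```

Filling text columns keeps every title in the analysis, while dropping the few rows without a date avoids inventing dates; other policies may suit a different analysis better.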
Non-Graphical Analysis involves calculating the summary statistics, without using pictorial or graphical representations.
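A minimal sketch of such summary statistics follows. The `sample` frame and its values are hypothetical stand-ins for `netflix_data`; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "Movie"],
    "release_year": [2020, 2021, 2021, 1993, 2018],
    "rating": ["PG-13", "TV-MA", "TV-MA", "TV-MA", "PG"],
})

# Percentage of titles per content type.
type_share = sample["type"].value_counts(normalize=True).mul(100).round(1)

# Number of distinct values in each column.
unique_counts = sample.nunique()

# Basic summary of the only numeric column.
year_summary = sample["release_year"].agg(["min", "median", "max"])

print(type_share)
print(unique_counts)
print(year_summary)
```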
netflix_data.head()
   show_id  type     title                  director         cast                                               country        date_added
0  s1       Movie    Dick Johnson Is Dead   Kirsten Johnson  NaN                                                United States  September 25, 2021
1  s2       TV Show  Blood & Water          NaN              Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...  South Africa   September 24, 2021
...
We analyze the entire Netflix dataset, which consists of both movies and TV shows. Let's compare the total number of movies and TV shows in
this dataset to see which one is the majority.
plt.figure(figsize=(6,3))
plt.title("Percentage of Netflix Titles that are either Movies or TV Shows")
# Count titles per type once and reuse the result for values and labels
type_counts = netflix_data.type.value_counts()
g=plt.pie(type_counts,explode=(0.025,0.025),
          labels=type_counts.index, colors=['red','pink'],autopct='%1.1f%%',
          startangle=180)
plt.show()
4.1 For Continuous Variables: Distplot, Countplot, Histogram for Univariate Analysis
plt.ylabel('release_year')
plt.show()
Cleaning Data
# Create a new column 'duration_int' to store the integer part of the duration
# (raw string avoids an invalid-escape warning; expand=False returns a Series)
netflix_data['duration_int'] = netflix_data['duration'].str.extract(r'(\d+)', expand=False).astype(float)
# Create a new column 'duration_unit' to store the unit of the duration
netflix_data['duration_unit'] = netflix_data['duration'].str.extract(r'(min|Season)', expand=False).fillna('Unknown')
# Print the updated DataFrame
print(netflix_data[['duration','duration_int','duration_unit']].head())
Pairplots
What is an outlier?
In a random sampling from a population, an outlier is defined as an observation that deviates abnormally from the standard data.
Outliers can lead to vague or misleading predictions while using machine learning models. Specific models like linear regression, logistic
regression, and support vector machines are susceptible to outliers.
Q1 = netflix_data['duration_int'].quantile(0.25)
Q3 = netflix_data['duration_int'].quantile(0.75)
IQR = Q3 - Q1
outliers = netflix_data[(netflix_data['duration_int'] < (Q1 - 1.5
* IQR)) | (netflix_data['duration_int'] > (Q3 + 1.5 * IQR))]
print("Number of outliers in duration:", len(outliers))
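Because `duration_int` holds minutes for movies and season counts for TV shows, a single set of IQR bounds mixes two different scales. The sketch below applies the same 1.5 * IQR rule separately per content type. The `sample` values and the helper name `iqr_outlier_mask` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical durations: minutes for movies, seasons for TV shows.
sample = pd.DataFrame({
    "type": ["Movie"] * 6 + ["TV Show"] * 6,
    "duration_int": [88, 92, 95, 101, 104, 240, 1, 1, 1, 2, 2, 9],
})

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Compute the mask within each content type, then combine the results.
mask = pd.Series(False, index=sample.index)
for _, group in sample.groupby("type"):
    mask.loc[group.index] = iqr_outlier_mask(group["duration_int"])

outliers_by_type = sample[mask]
print(outliers_by_type)
```

With these sample values, the 240-minute movie and the 9-season show are the only rows flagged.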
In a dataset, we often see the presence of empty cells, rows, and columns, also referred to as Missing values.
The dataset includes a wide range of content types (movies vs TV shows), various genres, and a diverse set of countries contributing to
Netflix's library.
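The `country` and `listed_in` columns can hold several comma-separated values per title, so counting countries or genres needs one row per value. A minimal sketch, assuming that comma-separated layout; the `sample` rows and the helper name `count_values` are hypothetical.

```python
import pandas as pd

# Hypothetical titles with multi-valued country and genre fields.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", None],
    "listed_in": ["Dramas, International Movies", "Comedies", "Dramas"],
})

def count_values(frame: pd.DataFrame, column: str) -> pd.Series:
    """Split a comma-separated column into one row per value and count them."""
    values = frame[column].dropna().str.split(",").explode().str.strip()
    return values.value_counts()

country_counts = count_values(sample, "country")
genre_counts = count_values(sample, "listed_in")

print(country_counts)
print(genre_counts)
```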
6.2 Comments on the distribution of the variables and relationship between them
The distribution plots indicate a significant increase in content production since around 2010, with a notable preference for shorter movies
compared to longer TV series.
The histogram of durations shows that most movies are around 90-120 minutes long.
The boxplot indicates that TV shows generally have longer durations when considering multiple seasons.
The correlation heatmap suggests weak correlations between numerical variables but highlights that longer durations do not necessarily
correlate with newer releases.
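The relationship between release year and duration can also be checked numerically. The sketch below computes a Pearson correlation for movies only, so that minutes are not mixed with season counts. The `sample` values are hypothetical and do not reproduce the notebook's heatmap.

```python
import pandas as pd

# Hypothetical rows; duration_int is in minutes for movies.
sample = pd.DataFrame({
    "type": ["Movie"] * 6 + ["TV Show"],
    "release_year": [1993, 2005, 2015, 2018, 2020, 2021, 2021],
    "duration_int": [120, 98, 105, 91, 110, 95, 2],
})

# Restrict to movies before correlating the two numeric columns.
movies = sample[sample["type"] == "Movie"]
corr = movies[["release_year", "duration_int"]].corr()  # Pearson by default

print(corr)
```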
7. Business Insights
1. Content Trends: There is a clear trend toward producing more TV shows than movies in recent years.
2. Geographic Preferences: Different countries exhibit distinct preferences for genres, indicating potential areas for localized content
development.
3. Optimal Launch Timing: The analysis suggests that launching new content during peak viewing months could enhance audience
engagement.
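The dataset has no viewing figures, so one rough proxy for timing is the month in which titles were added. The sketch below counts additions per month, assuming `date_added` follows the "September 25, 2021" format seen in the data and that month names parse in an English locale. The `sample` rows are hypothetical.

```python
import pandas as pd

# Hypothetical date_added values, including stray whitespace and a null.
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", "September 24, 2021", " July 1, 2020", None],
})

# Parse the text dates; unparseable or missing entries become NaT.
added = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Count titles per calendar month (missing dates are skipped).
titles_per_month = added.dt.month_name().value_counts()

print(titles_per_month)
```

This only shows when titles were added to the catalogue, not when they were watched.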
8. Recommendations
1. Increase Production of International Content: Focus on creating more international shows to cater to diverse audiences.
2. Prioritize Original Series Development: Given the trend towards TV shows, invest more resources into developing original series rather
than standalone films.