Business Case - Netflix - Data Exploration and Visualisation - Ipynb - Colab

Uploaded by

Ghar Ka Khana
Copyright
© © All Rights Reserved

9/28/24, 2:50 AM Business Case: Netflix - Data Exploration and Visualisation.ipynb - Colab

Netflix is one of the most popular media and video streaming platforms. It has over 10,000 movies and TV shows available on its platform and,
as of mid-2021, over 222M subscribers globally. This tabular dataset lists the movies and TV shows available on Netflix, along with details
such as cast, directors, ratings, release year, duration, etc.

Business Problem

Analyze the data and generate insights that could help Netflix in deciding which types of shows/movies to produce and how it can grow the
business in different countries.

1. Defining Problem Statement and Analysing basic metrics

Import Libraries

Importing the libraries we need

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


Loading The Dataset

url="/content/netflix_df.csv"
netflix_data = pd.read_csv(url)

netflix_data.head()

  show_id     type                 title         director                                               cast        country          date_added  ...
0      s1    Movie  Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021  ...
1      s2  TV Show         Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021  ...

[rows 2-4 and the columns from release_year onward are cut off in this printout]


netflix_data

  show_id     type                  title         director                                               cast        country          date_added  ...
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021  ...
1      s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021  ...
2      s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN  September 24, 2021  ...
3      s4  TV Show  Jailbirds New Orleans              NaN                                                NaN            NaN  September 24, 2021  ...
4      s5  TV Show           Kota Factory              NaN  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India  September 24, 2021  ...

The dataset contains 8807 titles described by 12 columns. After a quick view of the data frame, it looks like a typical movie/TV show data frame
without ratings. We can also see that there are NaN values in some columns.

2. Observations on the shape of data, data types of all the attributes, conversion of categorical attributes to 'category' (If required), missing
value detection, statistical summary

To get all attributes

netflix_data.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

The shape of data

netflix_data.shape  # (rows, columns); .ndim would only return the number of dimensions (2)


Data types of all the attributes

netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
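
The section heading mentions converting categorical attributes to 'category', but the info() output above shows every text column still stored as object. A minimal sketch of what that conversion could look like follows. It runs on a tiny hypothetical stand-in frame (`sample`) rather than the real file, and the choice of "type" and "rating" as the categorical columns and the date format string are assumptions based on the values visible in this notebook.

```python
import pandas as pd

# Tiny hypothetical stand-in for netflix_data (same column names, made-up rows)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["PG-13", "TV-MA", "TV-MA", None],
    "date_added": ["September 25, 2021", "September 24, 2021", None, " August 1, 2020"],
})

# Low-cardinality text columns are natural candidates for the 'category' dtype
for col in ["type", "rating"]:
    sample[col] = sample[col].astype("category")

# date_added is text such as "September 25, 2021": strip stray spaces, then parse.
# errors="coerce" turns anything unparseable (including missing values) into NaT.
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

print(sample.dtypes)
```

On the full dataset the same loop would be applied to netflix_data; whether other columns (for example "country") are worth converting depends on how many distinct values they hold.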


Missing Value Detection

print('\nColumns with missing value:')
print(netflix_data.isnull().any())

Columns with missing value:
show_id         False
type            False
title           False
director         True
cast             True
country          True
date_added       True
release_year    False
rating           True
duration         True
listed_in       False
description     False
dtype: bool

Statistical Summary Before Data Cleaning:

netflix_data.describe()

       release_year  duration_int
count   8807.000000   8804.000000
mean    2014.180198     69.846888
std        8.819312     50.814828
min     1925.000000      1.000000
25%     2013.000000      2.000000
50%     2017.000000     88.000000
75%     2019.000000    106.000000
max     2021.000000    312.000000

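From the info, we know that there are 8807 entries and 12 columns to work with for this EDA. Six columns contain null values:
"director", "cast", "country", "date_added", "rating" and "duration". The duration_int column in the statistical summary is not part of the
raw file; it is created by the cleaning step in section 4.3, which suggests that cell had already been run when the summary was captured.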

netflix_data.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0

netflix_data.isnull().sum().sum()

4307

There are a total of 4307 null values across the entire dataset: 2634 under "director", 825 under "cast", 831 under "country",
10 under "date_added", 4 under "rating" and 3 under "duration". We will have to handle all null data points before we can dive into EDA and
modelling.

3. Non-Graphical Analysis: Value counts and unique attributes

Non-Graphical Analysis involves calculating the summary statistics, without using pictorial or graphical representations.
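
As an illustration of the value counts and unique-attribute checks named in this section's title, here is a minimal sketch. It uses a small hypothetical frame (`sample`) with made-up rows that reuse the dataset's column names; splitting "listed_in" on ", " is an assumption about how multiple genres are stored in one cell.

```python
import pandas as pd

# Hypothetical rows that mimic the structure of netflix_data
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "country": ["United States", "South Africa", None, "India, United States"],
    "listed_in": ["Documentaries", "International TV Shows, TV Dramas",
                  "Dramas, Comedies", "Dramas"],
})

# How many titles of each type
type_counts = sample["type"].value_counts()

# Number of distinct non-null values per column
unique_counts = sample.nunique()

# Multi-valued cells: split into lists, explode to one genre per row, then count
genre_counts = sample["listed_in"].str.split(", ").explode().value_counts()

print(type_counts)
print(unique_counts)
print(genre_counts)
```

The same three calls can be pointed at netflix_data; the explode step matters because counting the raw "listed_in" strings would treat every genre combination as its own category.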

netflix_data.head()

  show_id     type                 title         director                                               cast        country          date_added  ...
0      s1    Movie  Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021  ...
1      s2  TV Show         Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021  ...

[rows 2-4 and the columns from release_year onward are cut off in this printout]


4. Visual Analysis - Univariate, Bivariate after pre-processing of the data

Univariate analysis: analysis based on only one variable.

We analyse the entire Netflix dataset, which consists of both movies and TV shows. Let's compare the total number of movies and TV shows in
this dataset to see which one is the majority.

# Share of Movies vs TV Shows in the catalogue
type_counts = netflix_data['type'].value_counts()

plt.figure(figsize=(6, 3))
plt.title("Percentage of Netflix Titles that are either Movies or TV Shows")
plt.pie(type_counts, explode=(0.025, 0.025), labels=type_counts.index,
        colors=['red', 'pink'], autopct='%1.1f%%', startangle=180)
plt.show()


4.1 For Continuous Variables: Distplot, Countplot, Histogram for Univariate Analysis

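# Plotting histogram for movie durations
# 'duration' is text such as "90 min", so pull out the number and keep movies only
movie_minutes = (netflix_data.loc[netflix_data['type'] == 'Movie', 'duration']
                 .str.extract(r'(\d+)', expand=False).astype(float).dropna())

plt.figure(figsize=(10, 6))
sns.histplot(movie_minutes, bins=30, kde=True)
plt.title('Distribution of Movie Durations')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.show()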


# Plotting release year distribution


plt.figure(figsize=(10, 6))
sns.histplot(netflix_data['release_year'], bins=30, kde=True)
plt.title('Distribution of Release Years')
plt.xlabel('Release Year')
plt.ylabel('Frequency')
plt.show()

4.2 For Categorical Variables: Boxplot

# Boxplot comparing release_year of Movies and TV Shows

plt.figure(figsize=(10, 6))
sns.boxplot(x='type', y='release_year', data=netflix_data)
plt.title('release_year of Movies and TV Shows')
plt.xlabel('Type')
plt.ylabel('release_year')
plt.show()


4.3 For Correlation: Heatmaps, Pairplots

Cleaning Data

# Create a new column 'duration_int' to store the numeric part of the duration
# (raw string avoids an invalid-escape warning; expand=False returns a Series)
netflix_data['duration_int'] = netflix_data['duration'].str.extract(r'(\d+)', expand=False).astype(float)
# Create a new column 'duration_unit' to store the unit of the duration ('min' or 'Season')
netflix_data['duration_unit'] = netflix_data['duration'].str.extract(r'(min|Season)', expand=False).fillna('Unknown')
# Print the updated DataFrame
print(netflix_data[['duration', 'duration_int', 'duration_unit']].head())

    duration  duration_int duration_unit
0     90 min          90.0           min
1  2 Seasons           2.0        Season
2   1 Season           1.0        Season
3   1 Season           1.0        Season
4  2 Seasons           2.0        Season

Heatmap and Plots

# Correlation heatmap for numerical variables


correlation_matrix = netflix_data.select_dtypes(include=['number']).corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


Pairplots

# Pairplot for duration vs release year colored by type


sns.pairplot(netflix_data, vars=['duration_int', 'release_year'], hue='type')
plt.show()


5. Missing Value & Outlier check (Treatment optional)

What is an outlier?

In a random sample from a population, an outlier is an observation that deviates markedly from the rest of the data.


Why do we need to treat outliers?

Outliers can lead to vague or misleading predictions while using machine learning models. Specific models like linear regression, logistic
regression, and support vector machines are susceptible to outliers.

Q1 = netflix_data['duration_int'].quantile(0.25)
Q3 = netflix_data['duration_int'].quantile(0.75)
IQR = Q3 - Q1
outliers = netflix_data[(netflix_data['duration_int'] < (Q1 - 1.5 * IQR)) |
                        (netflix_data['duration_int'] > (Q3 + 1.5 * IQR))]
print("Number of outliers in duration:", len(outliers))

Number of outliers in duration: 2
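
The check above applies the 1.5 * IQR rule to duration_int as a whole, which mixes movie runtimes in minutes with TV show season counts (the statistical summary shows a lower quartile of 2 and an upper quartile of 106). A possible refinement, sketched here on hypothetical numbers, is to apply the same rule to one content type at a time. The helper name `iqr_outliers` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical durations: minutes for movies, season counts for TV shows
sample = pd.DataFrame({
    "type": ["Movie"] * 7 + ["TV Show"] * 5,
    "duration_int": [88, 90, 95, 100, 105, 110, 312, 1, 1, 2, 2, 3],
})

# Apply the rule within each type so the fences are in a single unit
movie_minutes = sample.loc[sample["type"] == "Movie", "duration_int"]
show_seasons = sample.loc[sample["type"] == "TV Show", "duration_int"]

movie_outliers = iqr_outliers(movie_minutes)
show_outliers = iqr_outliers(show_seasons)

print("Movie outliers:", movie_outliers.tolist())
print("TV show outliers:", show_outliers.tolist())
```

Computing the fences per type keeps each threshold in one unit, so a long movie is judged against other movies rather than against season counts.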


What are Missing values?

In a dataset, we often see the presence of empty cells, rows, and columns, also referred to as Missing values.

print('\nColumns with missing value:')


print(netflix_data.isnull().any())

Columns with missing value:


show_id False
type False
title False
director True
cast True
country True
date_added True
release_year False
rating True
duration True
listed_in False
description False
duration_int True
duration_unit False
dtype: bool
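
The notebook detects missing values but marks treatment as optional and does not show one. As one possible approach, the sketch below fills gaps in descriptive text columns with a placeholder and drops rows where a sparsely missing field is absent. The placeholder label "Unknown", the split of columns between the two strategies, and the small `sample` frame are all assumptions made for illustration.

```python
import pandas as pd

# Small stand-in frame using values that appear in the head() output above
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None, "Julien Leclercq"],
    "cast": [None, "Ama Qamata, Khosi Ngema", "Sami Bouajila"],
    "country": ["United States", "South Africa", None],
    "date_added": ["September 25, 2021", None, "September 24, 2021"],
})

cleaned = sample.copy()

# Columns with many gaps: keep the row and label the gap explicitly
for col in ["director", "cast", "country"]:
    cleaned[col] = cleaned[col].fillna("Unknown")

# Columns with only a handful of gaps: dropping those rows loses little data
cleaned = cleaned.dropna(subset=["date_added"]).reset_index(drop=True)

print(cleaned)
print("Remaining nulls:", int(cleaned.isnull().sum().sum()))
```

Filling rather than dropping matters most for "director", where 2634 of 8807 rows are missing; dropping those rows would discard close to a third of the catalogue.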

6. Insights Based on Non-Graphical and Visual Analysis


6.1 Comments on the Range of Attributes

The dataset includes a wide range of content types (movies vs TV shows), various genres, and a diverse set of countries contributing to
Netflix's library.


6.2 Comments on the distribution of the variables and relationship between them

The distribution plots indicate a significant increase in content released since around 2010. Movie durations are measured in minutes and
TV show durations in seasons, so the two should be compared separately rather than on one scale.


6.3 Comments for each univariate and bivariate plot

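The histogram of durations shows that most movies are around 90-120 minutes long.
The boxplot compares release years for movies and TV shows; it does not show durations. TV show durations are recorded in seasons, so they
are not directly comparable with movie runtimes in minutes.
The correlation heatmap suggests only a weak correlation between release_year and duration_int, i.e. longer durations do not necessarily go
with newer releases. Because duration_int mixes minutes (movies) and seasons (TV shows), this figure should be read with caution.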



7. Business Insights

1. Content Trends: There is a clear trend toward producing more TV shows than movies in recent years.
2. Geographic Preferences: Different countries exhibit distinct preferences for genres, indicating potential areas for localized content
development.
3. Optimal Launch Timing: The analysis suggests that launching new content during peak viewing months could enhance audience
engagement.
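
The cells shown in this printout do not include a computation behind the content-trend insight. A minimal sketch of how it could be checked is to count titles per year added, split by type, and look at the TV show share over time. The rows below are hypothetical, and `year_added`, `titles_per_year` and `tv_share` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; on the real data this would be netflix_data
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "TV Show", "Movie"],
    "date_added": ["January 5, 2019", "March 9, 2019", "July 1, 2020",
                   "August 3, 2021", "September 24, 2021", "September 25, 2021"],
})

# Parse the text dates and keep only the year each title was added
sample["year_added"] = pd.to_datetime(
    sample["date_added"], format="%B %d, %Y"
).dt.year

# Rows: year added, columns: type, values: number of titles
titles_per_year = pd.crosstab(sample["year_added"], sample["type"])

# Fraction of each year's additions that are TV shows
tv_share = titles_per_year["TV Show"] / titles_per_year.sum(axis=1)

print(titles_per_year)
print(tv_share)
```

A rising tv_share across years on the full dataset would support the first insight; a flat or falling share would call it into question.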


8. Recommendations

1. Increase Production of International Content: Focus on creating more international shows to cater to diverse audiences.
2. Prioritize Original Series Development: Given the trend towards TV shows, invest more resources into developing original series rather
than standalone films.
