Business Case - Netflix - Data Exploration and Visualisation - Ipynb - Colab

Uploaded by

Ghar Ka Khana


9/28/24, 2:50 AM Business Case: Netflix - Data Exploration and Visualisation.ipynb - Colab

Netflix is one of the most popular media and video streaming platforms. It has over 10,000 movies and TV shows available on its platform
and, as of mid-2021, over 222M subscribers globally. This tabular dataset consists of listings of all the movies and TV shows available on
Netflix, along with details such as cast, directors, ratings, release year, duration, etc.

Business Problem

Analyze the data and generate insights that could help Netflix in deciding which type of shows/movies to produce and how they can grow the
business in different countries.

1. Defining Problem Statement and Analysing basic metrics

Import Libraries

Importing the libraries we need

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


Loading The Dataset

url="/content/netflix_df.csv"
netflix_data = pd.read_csv(url)

netflix_data.head()

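   show_id     type                  title         director                                               cast        country          date_added
0       s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021
1       s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021
2       s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN  September 24, 2021
3       s4  TV Show  Jailbirds New Orleans              NaN                                                NaN            NaN  September 24, 2021
4       s5  TV Show           Kota Factory              NaN  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India  September 24, 2021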


netflix_data

   show_id     type                  title         director                                               cast        country          date_added
0       s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021
1       s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021
2       s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN  September 24, 2021
3       s4  TV Show  Jailbirds New Orleans              NaN                                                NaN            NaN  September 24, 2021
4       s5  TV Show           Kota Factory              NaN  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India  September 24, 2021

The dataset contains 8807 titles and 12 columns. After a quick view of the data frame, it looks like a typical movie/TV show data frame
without viewer ratings. We can also see that there are NaN values in some columns.

2. Observations on the shape of data, data types of all the attributes, conversion of categorical attributes to 'category' (If required), missing
value detection, statistical summary

To get all attributes

netflix_data.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

The shape of data

netflix_data.shape


Data types of all the attributes

netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
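
The info() output above lists every text column as object. The conversion to 'category' is not shown in the notebook, so the following is only a minimal sketch of how it could look. It assumes that type and rating are the low-cardinality columns worth converting, and it uses a small hypothetical sample frame in place of netflix_data.

```python
import pandas as pd

# Hypothetical miniature stand-in for netflix_data (illustrative values only)
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["PG-13", "TV-MA", "TV-MA", None],
})

# Assumption: 'type' and 'rating' have few distinct values, so 'category' suits them
for col in ["type", "rating"]:
    sample[col] = sample[col].astype("category")

print(sample.dtypes)
print(sample["type"].cat.categories.tolist())
```

Missing values stay missing after the conversion; they are simply not part of the category list.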


Missing Value Detection

print('\nColumns with missing value:')

print(netflix_data.isnull().any())

Columns with missing value:

show_id False
type False
title False
director True
cast True
country True
date_added True
release_year False
rating True

duration True
listed_in False
description False
dtype: bool

Statistical Summary Before Data Cleaning:

netflix_data.describe()

       release_year  duration_int
count   8807.000000   8804.000000
mean    2014.180198     69.846888
std        8.819312     50.814828
min     1925.000000      1.000000
25%     2013.000000      2.000000
50%     2017.000000     88.000000
75%     2019.000000    106.000000
max     2021.000000    312.000000

From the info, we know that there are 8807 entries and 12 columns to work with for this EDA. A few columns contain null values:
"director", "cast", "country", "date_added", "rating" and "duration".

netflix_data.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0

netflix_data.isnull().sum().sum()

4307

There are a total of 4307 null values across the entire dataset: 2634 under "director", 825 under "cast", 831 under "country",
10 under "date_added", 4 under "rating" and 3 under "duration". We will have to handle all null data points before we can dive into EDA and
modelling.
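
The notebook does not show how the nulls are handled, so the following is one possible treatment rather than the author's method. It assumes that a text placeholder is acceptable for descriptive columns and that rows without date_added can be dropped. The sample frame is a hypothetical stand-in for netflix_data.

```python
import pandas as pd

# Hypothetical miniature stand-in for netflix_data (illustrative values only)
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None, "Julien Leclercq"],
    "cast": [None, "Ama Qamata, Khosi Ngema", "Sami Bouajila, Tracy Gotoas"],
    "country": ["United States", "South Africa", None],
    "date_added": ["September 25, 2021", "September 24, 2021", None],
})

# Assumption: 'Unknown' is an acceptable placeholder for descriptive text columns
text_cols = ["director", "cast", "country"]
cleaned = sample.copy()
cleaned[text_cols] = cleaned[text_cols].fillna("Unknown")

# Assumption: rows without a date are few enough to drop instead of imputing
cleaned = cleaned.dropna(subset=["date_added"]).reset_index(drop=True)

print(cleaned.isnull().sum())
```

Filling with a placeholder keeps the row count stable for columns such as director, where roughly 30% of the values are missing; dropping those rows would discard a large part of the catalogue.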

3. Non-Graphical Analysis: Value counts and unique attributes

Non-Graphical Analysis involves calculating the summary statistics, without using pictorial or graphical representations.
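
As a short illustration of the value counts and unique attributes named in the heading, the sketch below uses value_counts() and nunique() on a small hypothetical sample frame. The column names follow the dataset, but the values are made up for the example.

```python
import pandas as pd

# Hypothetical miniature stand-in for netflix_data (illustrative values only)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "Movie"],
    "rating": ["PG-13", "TV-MA", "TV-MA", "TV-14", None],
    "release_year": [2020, 2021, 2021, 2019, 2021],
})

# Absolute and relative frequency of each content type
type_counts = sample["type"].value_counts()
type_share = sample["type"].value_counts(normalize=True).round(2)

# Number of distinct non-null values in every column
unique_per_column = sample.nunique()

print(type_counts, type_share, unique_per_column, sep="\n\n")
```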

netflix_data.head()


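   show_id     type                  title         director                                               cast        country          date_added
0       s1    Movie   Dick Johnson Is Dead  Kirsten Johnson                                                NaN  United States  September 25, 2021
1       s2  TV Show          Blood & Water              NaN  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa  September 24, 2021
2       s3  TV Show              Ganglands  Julien Leclercq  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN  September 24, 2021
3       s4  TV Show  Jailbirds New Orleans              NaN                                                NaN            NaN  September 24, 2021
4       s5  TV Show           Kota Factory              NaN  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India  September 24, 2021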


4. Visual Analysis - Univariate, Bivariate after pre-processing of the data

Univariate analysis is based on only one variable.

The Netflix dataset consists of both movies and TV shows. Let's compare the total number of movies and TV shows in this dataset to
see which one is in the majority.

plt.figure(figsize=(6, 3))
plt.title("Percentage of Netflix Titles that are either Movies or TV Shows")
# Count titles per type once and reuse the result for both sizes and labels
type_counts = netflix_data['type'].value_counts()
plt.pie(type_counts, explode=(0.025, 0.025), labels=type_counts.index,
        colors=['red', 'pink'], autopct='%1.1f%%', startangle=180)
plt.show()


4.1 For Continuous Variables: Distplot, Countplot, Histogram for Univariate Analysis

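# Plotting histogram for movie durations
# 'duration' is text (e.g. '90 min'), so extract the minutes for movies first

movie_minutes = (netflix_data.loc[netflix_data['type'] == 'Movie', 'duration']
                 .str.extract(r'(\d+)', expand=False).astype(float).dropna())
plt.figure(figsize=(10, 6))
sns.histplot(movie_minutes, bins=30, kde=True)
plt.title('Distribution of Movie Durations')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.show()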


# Plotting release year distribution

plt.figure(figsize=(10, 6))
sns.histplot(netflix_data['release_year'], bins=30, kde=True)
plt.title('Distribution of Release Years')
plt.xlabel('Release Year')
plt.ylabel('Frequency')
plt.show()

4.2 For Categorical Variables: Boxplot

# Boxplot comparing release_year of Movies and TV Shows

plt.figure(figsize=(10, 6))
sns.boxplot(x='type', y='release_year', data=netflix_data)
plt.title('release_year of Movies and TV Shows')
plt.xlabel('Type')
plt.ylabel('release_year')
plt.show()


4.3 For Correlation: Heatmaps, Pairplots

Cleaning Data

# Create a new column 'duration_int' to store the integer part of the duration
netflix_data['duration_int'] = netflix_data['duration'].str.extract(r'(\d+)', expand=False).astype(float)
# Create a new column 'duration_unit' to store the unit of the duration ('min' or 'Season')
netflix_data['duration_unit'] = netflix_data['duration'].str.extract(r'(min|Season)', expand=False).fillna('Unknown')
# Print the updated DataFrame
print(netflix_data[['duration','duration_int','duration_unit']].head())

duration duration_int duration_unit

0 90 min 90.0 min
1 2 Seasons 2.0 Season
2 1 Season 1.0 Season
3 1 Season 1.0 Season
4 2 Seasons 2.0 Season

Heatmap and Plots

# Correlation heatmap for numerical variables

correlation_matrix = netflix_data.select_dtypes(include=['number']).corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f",
cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


Pairplots

# Pairplot for duration vs release year colored by type

sns.pairplot(netflix_data, vars=['duration_int', 'release_year'], hue='type')
plt.show()


5. Missing Value & Outlier check (Treatment optional)

What is an outlier?

In a random sampling from a population, an outlier is defined as an observation that deviates abnormally from the standard data.


Why do we need to treat outliers?

Outliers can lead to vague or misleading predictions while using machine learning models. Specific models like linear regression, logistic
regression, and support vector machines are susceptible to outliers.

Q1 = netflix_data['duration_int'].quantile(0.25)
Q3 = netflix_data['duration_int'].quantile(0.75)
IQR = Q3 - Q1
outliers = netflix_data[(netflix_data['duration_int'] < (Q1 - 1.5 * IQR)) |
                        (netflix_data['duration_int'] > (Q3 + 1.5 * IQR))]
print("Number of outliers in duration:", len(outliers))

Number of outliers in duration: 2
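
The check above applies one IQR fence to duration_int, which mixes movie minutes with TV show season counts. A possible refinement, sketched below, is to compute the fence separately for each duration_unit. The sample frame and the iqr_outliers helper are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 'duration' column (illustrative values only)
sample = pd.DataFrame({
    "duration": ["90 min", "95 min", "100 min", "105 min", "312 min",
                 "1 Season", "2 Seasons", "1 Season", "2 Seasons", "17 Seasons"],
})
sample["duration_int"] = sample["duration"].str.extract(r"(\d+)", expand=False).astype(float)
sample["duration_unit"] = sample["duration"].str.extract(r"(min|Season)", expand=False)


def iqr_outliers(group):
    """Return the rows of one group that fall outside the 1.5 * IQR fences."""
    q1 = group["duration_int"].quantile(0.25)
    q3 = group["duration_int"].quantile(0.75)
    iqr = q3 - q1
    mask = (group["duration_int"] < q1 - 1.5 * iqr) | (group["duration_int"] > q3 + 1.5 * iqr)
    return group[mask]


# One outlier table per unit, so minutes and seasons are never compared directly
outliers_by_unit = {unit: iqr_outliers(group) for unit, group in sample.groupby("duration_unit")}
for unit, rows in outliers_by_unit.items():
    print(unit, len(rows))
```

With separate fences, an unusually long movie and a show with unusually many seasons are each judged against titles of the same kind.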


What are Missing values?

In a dataset, we often see the presence of empty cells, rows, and columns, also referred to as Missing values.

print('\nColumns with missing value:')

print(netflix_data.isnull().any())

Columns with missing value:

show_id False
type False
title False
director True
cast True
country True
date_added True
release_year False
rating True
duration True
listed_in False
description False
duration_int True
duration_unit False
dtype: bool

6. Insights Based on Non-Graphical and Visual Analysis


6.1 Comments on the Range of Attributes

The dataset includes a wide range of content types (movies vs TV shows), various genres, and a diverse set of countries contributing to
Netflix's library.
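
The range of genres and countries can be quantified with a country-by-genre table. The sketch below assumes that country and listed_in hold comma-separated lists, as the truncated cast values in the preview suggest for multi-valued columns. The sample frame is hypothetical and the notebook does not include this step.

```python
import pandas as pd

# Hypothetical miniature stand-in for netflix_data (illustrative values only)
sample = pd.DataFrame({
    "country": ["United States", "India", "India, United States", None],
    "listed_in": ["Documentaries", "Comedies, Dramas", "Dramas", "Comedies"],
})

# Assumption: multi-valued cells are separated by ', '; split them into one row per value
pairs = (
    sample.dropna(subset=["country"])
    .assign(country=lambda d: d["country"].str.split(", "),
            genre=lambda d: d["listed_in"].str.split(", "))
    .explode("country")
    .explode("genre")
    .reset_index(drop=True)
)

# Rows are countries, columns are genres, cells are title counts
genre_by_country = pd.crosstab(pairs["country"], pairs["genre"])
print(genre_by_country)
```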


6.2 Comments on the distribution of the variables and relationship between them

The distribution plots indicate a significant increase in content production since around 2010, with a notable preference for shorter movies
compared to longer TV series.


6.3 Comments for each univariate and bivariate plot

The histogram of durations shows that most movies are around 90-120 minutes long.
The boxplot indicates that TV shows generally have longer durations when considering multiple seasons.
The correlation heatmap suggests weak correlations between numerical variables but highlights that longer durations do not necessarily
correlate with newer releases.



7. Business Insights

1. Content Trends: There is a clear trend toward producing more TV shows than movies in recent years.
2. Geographic Preferences: Different countries exhibit distinct preferences for genres, indicating potential areas for localized content
development.
3. Optimal Launch Timing: The analysis suggests that launching new content during peak viewing months could enhance audience
engagement.
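
The content trend and timing insights above could be checked against date_added with a sketch along these lines. It assumes the date format seen in the preview ("September 25, 2021") and uses a hypothetical sample frame. The dataset records when a title was added, not when it was watched, so the month counts only describe additions to the catalogue.

```python
import pandas as pd

# Hypothetical miniature stand-in for netflix_data (illustrative values only)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie", "TV Show", "Movie"],
    "date_added": ["September 25, 2021", "September 24, 2021", "July 1, 2020",
                   "January 5, 2020", None, "March 3, 2020"],
})

# Assumption: dates follow the 'Month day, year' pattern; unparseable values become NaT
added = pd.to_datetime(sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce")

dated = (
    sample.assign(added=added)
    .dropna(subset=["added"])
    .assign(year_added=lambda d: d["added"].dt.year,
            month_added=lambda d: d["added"].dt.month_name())
)

# Titles added per year, split by type, and titles added per calendar month
titles_per_year = dated.groupby(["year_added", "type"]).size().unstack(fill_value=0)
titles_per_month = dated["month_added"].value_counts()

print(titles_per_year)
print(titles_per_month)
```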


8. Recommendations

1. Increase Production of International Content: Focus on creating more international shows to cater to diverse audiences.
2. Prioritize Original Series Development: Given the trend towards TV shows, invest more resources into developing original series rather
than standalone films.
