Netflix Data - Cleaning, Analysis and Visualization - (Data Analyst)
Dataset: The dataset is available at the given link and can be downloaded at your convenience.
About Dataset
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This
dataset is a cleaned version of the original, which can be found here. The data consists of titles added to
Netflix from 2008 to 2021. The oldest title was released in 1925 and the newest in 2021. The dataset was
cleaned with PostgreSQL and visualized with Tableau. Its purpose is to test my data cleaning and
visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.
Data Cleaning
Example
Steps to follow
This project involves loading, cleaning, analyzing, and visualizing data from a Netflix
dataset. We'll use Python libraries like Pandas, Matplotlib, and Seaborn to work
through the project. The goal is to explore the dataset, derive insights, and prepare
for potential machine learning tasks.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
Identify and handle missing data, correct data types, and drop duplicates.
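The cleaning steps named above can be sketched on a tiny hand-made frame. The rows and the `toy` name are made up for illustration and are not taken from netflix1.csv, and the `Not Given` placeholder is only one possible way to handle missing values; the real notebook may have done this differently.

```python
import pandas as pd

# Hypothetical toy rows (not from the real dataset) with one missing
# director value repeated in an exact duplicate row.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "director": ["A. Director", None, None, "B. Director"],
    "date_added": ["9/25/2021", "9/24/2021", "9/24/2021", "9/22/2021"],
})

# 1. Count missing values per column.
missing_before = toy.isna().sum()

# 2. Fill missing directors with a placeholder label.
toy["director"] = toy["director"].fillna("Not Given")

# 3. Convert the month/day/year strings to a datetime column.
toy["date_added"] = pd.to_datetime(toy["date_added"], format="%m/%d/%Y")

# 4. Drop exact duplicate rows and renumber the index.
toy = toy.drop_duplicates().reset_index(drop=True)

print(missing_before)
print(toy.dtypes)
print(len(toy))
```

Running the same checks on the real frame (`data.isna().sum()`, `data.duplicated().sum()`) before and after cleaning is a quick way to confirm that each step did what was intended.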
Sample code
import pandas as pd
import numpy as np
In [2]:
data=pd.read_csv("/kaggle/input/netflix-data-cleaning-analysis-and-visualization/netflix1.csv")
data.head()
Out[2]:
  show_id     type                                title         director        country date_added  release_year rating  duration                                          listed_in
1      s3  TV Show                            Ganglands  Julien Leclercq         France  9/24/2021          2021  TV-MA  1 Season  Crime TV Shows, International TV Shows, TV Act...
2      s6  TV Show                        Midnight Mass    Mike Flanagan  United States  9/24/2021          2021  TV-MA  1 Season                 TV Dramas, TV Horror, TV Mysteries
3     s14    Movie  Confessions of an Invisible Girl    Bruno Garotti         Brazil  9/22/2021          2021  TV-PG    91 min                 Children & Family Movies, Comedies
4      s8    Movie                              Sankofa     Haile Gerima  United States  9/24/2021          1993  TV-MA   125 min   Dramas, Independent Movies, International Movies
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8790 entries, 0 to 8789
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8790 non-null object
1 type 8790 non-null object
2 title 8790 non-null object
3 director 8790 non-null object
4 country 8790 non-null object
5 date_added 8790 non-null object
6 release_year 8790 non-null int64
7 rating 8790 non-null object
8 duration 8790 non-null object
9 listed_in 8790 non-null object
dtypes: int64(1), object(9)
memory usage: 686.8+ KB
In [4]:
data.shape
Out[4]:
(8790, 10)
In [5]:
data=data.drop_duplicates()
In [6]:
data['type'].value_counts()
Out[6]:
type
Movie 6126
TV Show 2664
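To read this split as proportions rather than raw counts, `value_counts` accepts `normalize=True`. The sketch below rebuilds the two counts shown above in a small Series instead of reading the CSV; `types` and `share` are names introduced here for illustration.

```python
import pandas as pd

# Rebuild a 'type' column with the counts reported above
# (6126 movies, 2664 TV shows) so the example runs without the CSV.
types = pd.Series(["Movie"] * 6126 + ["TV Show"] * 2664, name="type")

# normalize=True returns fractions that sum to 1 instead of counts;
# multiply by 100 and round to get percentages.
share = types.value_counts(normalize=True).mul(100).round(1)
print(share)
```

With these counts, movies make up about 69.7% of the titles and TV shows about 30.3%.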
In [7]:
freq=data['type'].value_counts()
In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8790 entries, 0 to 8789
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8790 non-null object
1 type 8790 non-null object
2 title 8790 non-null object
3 director 8790 non-null object
4 country 8790 non-null object
5 date_added 8790 non-null object
6 release_year 8790 non-null int64
7 rating 8790 non-null object
8 duration 8790 non-null object
9 listed_in 8790 non-null object
dtypes: int64(1), object(9)
memory usage: 686.8+ KB
Visual representation of rating frequency of movies and TV Shows on Netflix.
In [9]:
data['rating'].value_counts()
Out[9]:
rating
TV-MA 3205
TV-14 2157
TV-PG 861
R 799
PG-13 490
TV-Y7 333
TV-Y 306
PG 287
TV-G 220
NR 79
G 41
TV-Y7-FV 6
NC-17 3
UR 3
In [10]:
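# Count titles per rating. The 'count' column name assumes pandas 2.0 or later.
ratings=data['rating'].value_counts().reset_index().sort_values(by='count', ascending=False)
# Bar chart of rating frequencies, most common rating first.
plt.bar(ratings['rating'], ratings['count'])
plt.xticks(rotation=45, ha='right')
plt.xlabel("Rating Types")
plt.ylabel("Rating Frequency")
plt.show()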
In [13]:
data.describe()
Out[13]:
date_added release_year
count 8790 8790.000000
In [14]:
data['country'].value_counts()
Out[14]:
country
United States 3240
India 1057
United Kingdom 638
Pakistan 421
Not Given 287
...
Iran 1
West Germany 1
Greece 1
Zimbabwe 1
Soviet Union 1
In [15]:
top_ten_countries=data['country'].value_counts().reset_index().sort_values(by='count', ascending=False)[:10]
plt.figure(figsize=(10, 6))
plt.bar(top_ten_countries['country'], top_ten_countries['count'])
plt.xticks(rotation=45, ha='right')
plt.xlabel("Country")
plt.ylabel("Frequency")
plt.suptitle("Top 10 countries with most content on Netflix")
plt.show()
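The country counts above include a `Not Given` placeholder, which can end up in a top-ten chart as if it were a country. A minimal sketch of filtering it out before ranking follows. The `counts` Series is rebuilt by hand from the five largest values printed above rather than read from the CSV, and whether to exclude the placeholder is a judgment call, not something the notebook does.

```python
import pandas as pd

# Hand-built stand-in for data['country'].value_counts(), using the
# five largest values shown in the output above.
counts = pd.Series(
    {"United States": 3240, "India": 1057, "United Kingdom": 638,
     "Pakistan": 421, "Not Given": 287},
    name="count",
)

# Drop the placeholder label, then keep the largest remaining entries.
top_countries = counts.drop("Not Given").nlargest(3)
print(top_countries)
```

On the full data the same idea would be `data['country'].value_counts().drop('Not Given').nlargest(10)`, assuming the placeholder label is spelled exactly as in the output above.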
In [16]:
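# The .dt accessor needs a datetime column, so convert date_added first.
data['date_added']=pd.to_datetime(data['date_added'])
data['year']=data['date_added'].dt.year
data['month']=data['date_added'].dt.month
data['day']=data['date_added'].dt.day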
In [17]:
monthly_movie_release=data[data['type']=='Movie']['month'].value_counts().sort_index()
monthly_series_release=data[data['type']=='TV Show']['month'].value_counts().sort_index()
In [18]:
yearly_movie_releases=data[data['type']=='Movie']['year'].value_counts().sort_index()
yearly_series_releases=data[data['type']=='TV Show']['year'].value_counts().sort_index()
Out[18]:
<matplotlib.legend.Legend at 0x7a14cb8327a0>
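The plotting code for this cell did not survive, but the `Legend` object in the output suggests both series were drawn on one axes with a legend. A sketch of one way to do that follows. The yearly numbers are made-up placeholders, and the choice of a line plot is an assumption rather than the notebook's confirmed method.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder counts (invented for illustration), indexed by year added.
yearly_movies = pd.Series({2018: 10, 2019: 14, 2020: 12, 2021: 9})
yearly_series = pd.Series({2018: 4, 2019: 6, 2020: 6, 2021: 5})

fig, ax = plt.subplots(figsize=(10, 6))
# One line per content type; the label text is what the legend displays.
ax.plot(yearly_movies.index, yearly_movies.values, label="Movies")
ax.plot(yearly_series.index, yearly_series.values, label="TV Shows")
ax.set_xlabel("Year added")
ax.set_ylabel("Titles added")
ax.set_title("Titles added to Netflix per year")
legend = ax.legend()
```

In the notebook, `yearly_movie_releases` and `yearly_series_releases` would take the place of the placeholder Series, and the same pattern applies to the monthly counts.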
In [19]:
popular_movie_genre=data[data['type']=='Movie'].groupby("listed_in").size().sort_values(ascending=False)[:10]
popular_series_genre=data[data['type']=='TV Show'].groupby("listed_in").size().sort_values(ascending=False)[:10]
plt.bar(popular_movie_genre.index, popular_movie_genre.values)
plt.xticks(rotation=45, ha='right')
plt.xlabel("Genres")
plt.ylabel("Movies Frequency")
plt.suptitle("Top 10 popular genres for movies on Netflix")
plt.show()
Top 10 TV Shows genres
In [20]:
plt.bar(popular_series_genre.index, popular_series_genre.values)
plt.xticks(rotation=45, ha='right')
plt.xlabel("Genres")
plt.ylabel("TV Shows Frequency")
plt.suptitle("Top 10 popular genres for TV Shows on Netflix")
plt.show()
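Because `listed_in` holds comma-separated genre lists, grouping on the raw column counts genre combinations rather than individual genres. If per-genre counts are wanted, one option is to split and explode the column first. The sketch below uses three `listed_in` values from the sample rows plus one hypothetical row; the `toy` frame is an illustration, not the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie", "Movie"],
    "listed_in": [
        "TV Dramas, TV Horror, TV Mysteries",
        "Children & Family Movies, Comedies",
        "Dramas, Independent Movies, International Movies",
        "Dramas, Comedies",  # hypothetical extra row, not in the dataset
    ],
})

movies = toy[toy["type"] == "Movie"]

# Split each comma-separated string into a list, give every genre its
# own row, trim stray spaces, then count occurrences per genre.
genre_counts = (
    movies["listed_in"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(genre_counts)
```

Counting this way lets a title tagged with several genres contribute to each of them, so the totals add up to more than the number of titles.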
Top 15 directors across Netflix with high frequency of movies and shows.
In [21]:
# Skip the first entry (presumably the 'Not Given' placeholder) and keep the next 15.
directors=data['director'].value_counts().reset_index().sort_values(by='count', ascending=False)[1:16]
plt.bar(directors['director'], directors['count'])
plt.xticks(rotation=45, ha='right')
plt.show()
1 Reference link
2 Reference link for ML project