EDA Final Report
J Component report
Slot : F2
In this project we perform exploratory data analysis (EDA) to uncover hidden
trends and patterns in the dataset. We load and read the data using pandas,
clean and process it, explore it through visualizations and graphs built with
matplotlib and seaborn, and finally answer some questions about the dataset.
We also try to implement a predictive system using k-nearest neighbors that
produces the top 10 suggestions for a user based on their input and predicts
the IMDb score of the searched title as the average of the suggestions'
ratings.
1.2 Statement
The dataset used in this project combines 5 datasets taken from Kaggle and
IMDb directories, which have been merged and filtered to form a unique
dataset.
1.4 Challenges
The challenges faced by the author include cleaning the data before it is used
for analysis, extracting the features needed for visualization and content
recommendation from the processed data, and analysing the final results to
draw appropriate conclusions.
ALGORITHM OF RECOMMENDATION SYSTEM
Step 3: Create duplicate columns of description and listed_in in the data frame for
the prediction model and testing, and convert them to strings.
Step 4: Rename the description duplicate to keywords, split it into a list of individual
words and remove stopwords using nltk. Similarly, rename the listed_in duplicate to
genre and split it into individual genres.
Step 5: Create binary encodings for genre, cast, director and keywords using the
binary() function, and store them in new columns named genre_bin, cast_bin,
director_bin and keywords_bin.
Step 6: Use cosine similarity on the bins to find similar titles. The user-defined
function similarity returns the distance between 2 titles.
Step 7: Take the 10 most similar titles and average their IMDb scores to predict the
score of the searched title.
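The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the report's own implementation: it encodes only the genre field (the report also uses cast, director and keywords), the helper names and signatures are hypothetical, and the four-title catalog is made-up data.

```python
import math

def binary(values, vocabulary):
    """Return a 0/1 list marking which vocabulary entries occur in values."""
    present = set(values)
    return [1 if item in present else 0 for item in vocabulary]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an all-zero vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def predict_score(query, catalog, vocabulary, k=10):
    """Return the k nearest titles and the mean of their IMDb scores."""
    query_bin = binary(query["genre"], vocabulary)
    scored = []
    for item in catalog:
        if item["title"] == query["title"]:
            continue  # do not recommend the searched title itself
        dist = cosine_distance(query_bin, binary(item["genre"], vocabulary))
        scored.append((dist, item["title"], item["imdb"]))
    scored.sort(key=lambda entry: entry[0])  # smallest distance first
    neighbours = scored[:k]
    if not neighbours:
        return [], None
    predicted = sum(score for _, _, score in neighbours) / len(neighbours)
    return [title for _, title, _ in neighbours], predicted

# Made-up catalog, for illustration only.
catalog = [
    {"title": "A", "genre": ["Drama", "Crime"], "imdb": 8.0},
    {"title": "B", "genre": ["Drama"], "imdb": 7.0},
    {"title": "C", "genre": ["Kids", "Animation"], "imdb": 6.0},
    {"title": "D", "genre": ["Crime", "Thriller"], "imdb": 9.0},
]
vocabulary = sorted({g for item in catalog for g in item["genre"]})
titles, predicted = predict_score(catalog[0], catalog, vocabulary, k=2)
print(titles, predicted)
```

With k=10 and one binary vector per feature (genre, cast, director, keywords), the same loop would follow Steps 5 to 7; how the per-feature distances are combined is left open here because the report does not state it in this section.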
import re
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
import random
import stemgraphic
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
DATA DESCRIPTION
data = pd.read_csv('Ottdataset.csv',encoding='unicode_escape')
data.info()
data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8834 entries, 0 to 8833
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8834 non-null int64
1 Platform 8834 non-null object
2 type 8834 non-null object
3 title 8834 non-null object
4 director 6393 non-null object
5 cast 8098 non-null object
6 country 5889 non-null object
7 date_added 5970 non-null object
8 release_year 8834 non-null int64
9 rating 8678 non-null object
10 duration 8834 non-null object
11 listed_in 8834 non-null object
12 description 8834 non-null object
13 IMDb 8538 non-null object
14 Rotten Tomatoes 8827 non-null object
dtypes: int64(2), object(13)
memory usage: 1.0+ MB
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
data.shape
(8834, 15)
data.columns
print(data.describe)
title \
0 Jingle Pols
1 Bluey
2 Breaking Bad
3 Alaska Animal Rescue
4 Avatar: The Last Airbender
... ...
8829 Love Naggers
8830 Ratones Paranoicos: The Band that Rocked Argen...
8831 Leo the Truck
8832 Steps
8833 Pinkfong! Healthy Habit Songs
director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
8829 NaN
8830 Alejandro Ruax, Ramiro Mart??nez
8831 NaN
8832 Rock Davis, Jay Rodriguez
8833 NaN
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
... ...
8829 Seo Jang-hoon, Kim Sook, Han Hye-jin, Kwak Jun...
8830 Juan Sebasti?Ân Guti??rrez, Pablo Cano, Pablo...
8831 Maria Poddubnaya, Sveta Lebedeva, Anuar Shalab...
8832 Rob Morgan, Walter Fauntleroy, Robert G. McKay...
8833 NaN
duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
... ... ...
8829 1 Season International TV Shows, Stand-Up Comedy & Talk...
8830 76 min Documentaries, International Movies, Music & M...
8831 1 Season Kids
8832 118 min Drama
8833 1 Season Animation, Kids
description IMDb \
0 Nat Geo WILD re-joins the Pols in central Mich... 9.6/10
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6/10
2 A high school chemistry teacher dying of cance... 9.4/10
3 Conservation heroes rescue and rehabilitate th... 9.4/10
4 Siblings Katara and Sokka wake young Aang from... 9.3/10
... ... ...
8829 From the quirky to the scandalous, any relatio... NaN
8830 The irrepressible Ratones Paranoicos, Argentin... NaN
8831 Adventures of Leo and friends continue in a ne... NaN
8832 Years after a life-altering robbery, a home he... NaN
8833 Sing with Pinkfong and learn how to form healt... NaN
Rotten Tomatoes
0 44/100
1 71/100
2 100/100
3 42/100
4 93/100
... ...
8829 16/100
8830 12/100
8831 10/100
8832 46/100
8833 10/100
show_id 8834
Platform 3
type 2
title 8480
director 4844
cast 7895
country 539
date_added 1398
release_year 96
rating 23
duration 222
listed_in 1013
description 8828
IMDb 81
Rotten Tomatoes 89
dtype: int64
data.isnull().sum()
show_id 0
Platform 0
type 0
title 0
director 2441
cast 736
country 2945
date_added 2864
release_year 0
rating 156
duration 0
listed_in 0
description 0
IMDb 296
Rotten Tomatoes 7
dtype: int64
if data.isnull().any(axis=None):
    print("\nPreview of data with null values:\nxxxxxxxxxxxxx")
    print(data[data.isnull().any(axis=1)].head(3))
    missingno.matrix(data)
    plt.show()
duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
PREPARING DATA FOR PERFORMING EXPLORATORY DATA ANALYSIS
display(data.describe().T)
75% max
show_id 11986.75 19923.0
release_year 2019.00 2021.0
data.IMDb=data.IMDb.str.replace("/10","",regex=True)
data
title \
0 Jingle Pols
1 Bluey
2 Breaking Bad
3 Alaska Animal Rescue
4 Avatar: The Last Airbender
... ...
8829 Love Naggers
8830 Ratones Paranoicos: The Band that Rocked Argen...
8831 Leo the Truck
8832 Steps
8833 Pinkfong! Healthy Habit Songs
director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
8829 NaN
8830 Alejandro Ruax, Ramiro Mart??nez
8831 NaN
8832 Rock Davis, Jay Rodriguez
8833 NaN
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
... ...
8829 Seo Jang-hoon, Kim Sook, Han Hye-jin, Kwak Jun...
8830 Juan Sebasti?Ân Guti??rrez, Pablo Cano, Pablo...
8831 Maria Poddubnaya, Sveta Lebedeva, Anuar Shalab...
8832 Rob Morgan, Walter Fauntleroy, Robert G. McKay...
8833 NaN
duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
... ... ...
8829 1 Season International TV Shows, Stand-Up Comedy & Talk...
8830 76 min Documentaries, International Movies, Music & M...
8831 1 Season Kids
8832 118 min Drama
8833 1 Season Animation, Kids
data['IMDb'] = pd.to_numeric(data['IMDb'],errors='coerce')
display(data.describe().T)
75% max
show_id 11986.75 19923.0
release_year 2019.00 2021.0
IMDb 7.30 9.6
data.loc[1:100]
director cast \
1 NaN Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 NaN Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 NaN Victoria Vosburg
4 NaN Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
5 NaN NaN
.. ... ...
96 Jonathan Demme David Byrne, Chris Frantz, Jerry Harrison, Tin...
97 NaN Dan Castellaneta, Julie Kavner, Nancy Cartwrig...
98 Chris Bould Bill Hicks
99 Chris Bould Bill Hicks
100 Chris Bould Bill Hicks
duration listed_in \
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
5 3 Seasons Kids
.. ... ...
96 88 min Documentary, Music Videos and Concerts
97 32 Seasons Animation, Comedy
98 61 min Stand-Up Comedy
99 49 min Arts, Entertainment, and Culture, Comedy, Docu...
100 56 min Stand-Up Comedy
print(data['rating'].unique())
['TV-14' 'TV-Y' 'TV-MA' 'TV-PG' 'TV-Y7' '13+' 'NR' '18+' '16+' 'R' 'PG'
'ALL' 'TV-G' '7+' nan 'PG-13' 'G' 'TV-NR' 'AGES_16_' '16' 'AGES_18_'
'TV-Y7-FV' 'NC-17' 'UNRATED']
# Map every platform-specific rating label onto one of five age bands.
# Exact-value replacement avoids regex side effects such as "+" being
# read as a quantifier.
rating_map = {
    'TV-G': 'all', 'TV-Y': 'all', 'G': 'all', 'ALL': 'all',
    'NR': 'all', 'TV-NR': 'all', 'UNRATED': 'all',
    'TV-PG': '13+', 'TV-Y7': '13+', 'TV-Y7-FV': '13+', 'PG': '13+',
    'PG-13': '16+', 'TV-14': '16+', '16': '16+',
    'TV-MA': '18+', 'NC-17': '18+', 'R': '18+',
    'AGES_18_': '18+', 'AGES_16_': '18+',
}
data['rating'] = data['rating'].replace(rating_map)
print(data['rating'].unique())
data.groupby('Platform')['IMDb'].mean().plot.bar()
plt.show()
EXPLORATORY DATA ANALYSIS WITH THE HELP OF VISUALIZATIONS
data.loc[1:100].groupby('Platform')['IMDb'].mean().plot.bar()
plt.show()
The bar plot shows the mean IMDb score of rows 1 to 100 for each of the Disney, Netflix and
Prime Video platforms.
data.Platform.value_counts(normalize=True).plot.pie()
plt.show()
The pie plot shows the share of titles on each of the Disney, Netflix and Prime Video platforms.
import numpy as np
import matplotlib.pyplot as plt
data.rating.value_counts(normalize=True).plot.pie()
circle = plt.Circle( (0,0), 0.7, color='White')
p=plt.gcf()
p.gca().add_artist(circle)
plt.show()
The donut chart above shows the share of titles in each rating band; 18+ is the largest band
and 7+ the smallest.
plt.subplots(figsize=(6,4))
sns.barplot(x="rating", y="Rotten Tomatoes",
            data=data.sort_values("Rotten Tomatoes", ascending=False).head(20))
The bar plot shows Rotten Tomatoes scores by rating band, with 18+ the highest and all the
lowest.
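# Restrict the correlation to numeric columns; most columns here are text.
data_corr = data.corr(numeric_only=True)
# Count titles per rating band; naming the columns explicitly gives the same
# 'rating' / 'Count' layout across pandas versions.
age_groups = data['rating'].value_counts().rename_axis('rating').reset_index(name='Count')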
age_groups
rating Count
0 18+ 3409
1 16+ 2210
2 13+ 1914
3 all 1042
4 7+ 103
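plt.figure(figsize = (20,16))
plt.figtext(x=0.14, y=0.95,
            s='Distribution of TV Shows based on Ratings',
            fontsize=25, fontname='monospace')
# Assumption: the plotting call that defines `a` is missing from this export;
# a bar plot of age_groups matches the axis labels set below.
a = sns.barplot(x='rating', y='Count', data=age_groups)
plt.xticks(fontsize=20, fontname='monospace')
plt.yticks(fontsize=20, fontname='monospace')
plt.xlabel('Rating', fontsize=14)
plt.ylabel('Count', fontsize=14)
# Thicken the bottom and left spines and hide the other two.
for q in [a]:
    for w in ['bottom', 'left']:
        q.spines[w].set_linewidth(3)
    for w in ['right', 'top']:
        q.spines[w].set_visible(False)
plt.show()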
From the graph above we can infer that 18+ is the most common rating band, with more than
3000 titles, and 7+ is the least common.
SEPARATING GENRE TO MAKE GENRE BASED ANALYSIS
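def get_unique_values(genre_list):
    """Collect the distinct genre labels from comma-separated genre strings."""
    more_than_one = 0  # titles tagged with several genres
    only_one = 0       # titles tagged with exactly one genre
    unique_genre = []
    for listed_in in genre_list:
        try:
            values = listed_in.split(",")
        except AttributeError:
            # Missing values (NaN) cannot be split; skip them.
            continue
        if len(values) > 1:
            more_than_one += 1
        else:
            only_one += 1
        for genre in values:
            if genre not in unique_genre:
                unique_genre.append(genre)
    return unique_genre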
unique_genre1=[]
unique_genre1=unique_genre.copy()
len(unique_genre1)
unique_genre
168
unique_genre
temp = []
for i in unique_genre:
if i not in temp:
temp.append(i)
len(temp)
100
unique_genre=temp.copy()
genre_dict = {}
new_df = data[data['listed_in'].notna()]
39
genre_count
Genre Count
0 Drama 3265
1 International 2516
2 Movies 2387
3 Dramas 1943
4 International Movies 1545
.. ... ...
95 Concert Film 2
96 Disaster 2
97 Romantic Comedy 2
98 Fitness 1
99 Travel 1
print(unique_genre)
plt.figure(figsize=(12,10))
plt.grid(axis='x',color='black', linestyle = ':', alpha=0.5)
plt.title('Top 10 TV Show Genres', fontname='monospace', fontsize=25, y=1.05)
a = sns.barplot(x='Count', y='Genre', data=genre_count[:10], palette='rocket')
genres = genre_count['Genre'][:10].tolist()
# Label each of the ten plotted bars with its count.
for i, val in enumerate(genres):
    x_val = genre_count[genre_count['Genre'] == val]['Count'].values[0]
    a.text(y=i, x=x_val - 300,
           s=str(x_val),
           fontsize=14, fontname='monospace', color='white')
for q in [a]:
    for w in ['bottom', 'left']:
        q.spines[w].set_linewidth(1.5)
    for w in ['right', 'top']:
        q.spines[w].set_visible(False)
plt.xlabel('Count', fontsize=15)
plt.ylabel('Genre', fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
<AxesSubplot:xlabel='release_year', ylabel='Count'>
The figure above shows the number of shows released per year: 2020 has the highest count
and 1920 the lowest.
print("TV Shows with highest IMDb ratings are= ")
print((data.sort_values("IMDb",ascending=False).head(20))['title'])
plt.subplots(figsize=(8,6))
sns.barplot(x="IMDb", y="title",
            data=data.sort_values("IMDb", ascending=False).head(20))
<AxesSubplot:xlabel='IMDb', ylabel='title'>
The bar plot shows the 20 titles with the highest IMDb ratings, sorted from highest to
lowest.
#barplot of rating
plt.subplots(figsize=(8,6))
sns.barplot(x="IMDb", y="title",
            data=data.sort_values("IMDb", ascending=True).head(20))
<AxesSubplot:xlabel='IMDb', ylabel='title'>
The bar plot shows the 20 titles with the lowest IMDb ratings, sorted in ascending order.
#Overall data of IMDb ratings
plt.figure(figsize=(16, 6))
sns.scatterplot(data=data['IMDb'])
plt.ylabel("Rating")
plt.xlabel('Movies')
plt.title("IMDb Rating Distribution")
#barplot of rating
plt.subplots(figsize=(8,6))
sns.barplot(x="Rotten Tomatoes", y="title",
            data=data.sort_values("Rotten Tomatoes", ascending=False).head(20))
<AxesSubplot:xlabel='Rotten Tomatoes', ylabel='title'>
The bar plot shows the 20 titles with the highest Rotten Tomatoes scores, sorted from
highest to lowest.
print("TV Shows with lowest Rotten Tomatoes scores are= ")
print((data.sort_values("Rotten Tomatoes",ascending=True).head(20))['title'])
The 20 titles with the lowest Rotten Tomatoes scores are listed in ascending order.
#Overall data of Rotten Tomatoes scores
plt.figure(figsize=(16, 6))
sns.scatterplot(data=data['Rotten Tomatoes'])
plt.ylabel("Rotten Tomatoes score")
plt.xlabel('Movies')
plt.title("Rotten Tomatoes Score Distribution")
plt.subplots(figsize=(8,6))
sns.histplot(netflix["release_year"],kde=False, color="blue")
<AxesSubplot:xlabel='release_year', ylabel='Count'>
The figure above shows the number of Netflix shows released per year: 2020 has the highest
count and 1960 the lowest.
plt.subplots(figsize=(8,6))
sns.histplot(netflix["rating"],kde=False, color="cyan")
<AxesSubplot:xlabel='rating', ylabel='Count'>
From the graph above we can infer that TV-MA is the most common rating among Netflix
shows, with more than 2000 titles, and NC-17 is the least common, with zero shows.
plt.subplots(figsize=(8,6))
sns.histplot(netflix["IMDb"],kde=False, color="purple")
<AxesSubplot:xlabel='IMDb', ylabel='Count'>
From the histogram we can observe the number of shows at each rating. Nearly 350 shows
have an IMDb rating of 7.5/10, and approximately 2 shows have a rating of 10/10.
plt.subplots(figsize=(8,6))
sns.histplot(netflix["Rotten Tomatoes"],kde=False, color="blue")
<AxesSubplot:xlabel='Rotten Tomatoes', ylabel='Count'>
From the histogram we can observe the number of shows at each score. Nearly 400 shows
have a Rotten Tomatoes score of 58/100, and approximately 2 shows have a score of 100/100.
print("Netflix Shows with highest IMDb ratings are= ")
print((netflix.sort_values("IMDb",ascending=False).head(10))['title'])
plt.subplots(figsize=(8,6))
sns.histplot(Prime["release_year"],kde=False, color="blue")
<AxesSubplot:xlabel='release_year', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Prime["rating"],kde=False, color="cyan")
<AxesSubplot:xlabel='rating', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Prime["IMDb"],kde=False, color="purple")
<AxesSubplot:xlabel='IMDb', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Prime["Rotten Tomatoes"],kde=False, color="blue")
<AxesSubplot:xlabel='Rotten Tomatoes', ylabel='Count'>
print("Prime video Shows with highest IMDb ratings are= ")
print((Prime.sort_values("IMDb",ascending=False).head(10))['title'])
plt.subplots(figsize=(8,6))
sns.histplot(Disney["release_year"],kde=False, color="blue")
<AxesSubplot:xlabel='release_year', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Disney["rating"],kde=False, color="cyan")
<AxesSubplot:xlabel='rating', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Disney["IMDb"],kde=False, color="purple")
<AxesSubplot:xlabel='IMDb', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Disney["Rotten Tomatoes"],kde=False, color="blue")
<AxesSubplot:xlabel='Rotten Tomatoes', ylabel='Count'>
print("Disney Shows with highest IMDb ratings are= ")
print((Disney.sort_values("IMDb",ascending=False).head(10))['title'])
0 Jingle Pols
1 Bluey
3 Alaska Animal Rescue
9 Cosmos: Possible Worlds
14 Heartland Docs, DVM
25 The Imagineering Story
28 Critter Fixers: Country Vets
31 Incredible! The Story of Dr. Pol
49 The Mandalorian
41 One Strange Rock
Name: title, dtype: object
STOPWORDS AND WORDCLOUD BASED ANALYSIS
titles=data["title"].values
text=' '.join(titles)
len(text)
153309
text[1000:1500]
"Masters: Rust to Riches Bo Burnham: Inside Navillera The Family Man The Caro
l Burnett Show House Chappelle's Show Invincible Code Geass: Lelouch of the R
ebellion WWII in HD WWII in HD The Universe Victorian Farm Downton Abbey Haik
yu!! Puffin Rock Downton Abbey Dave Chappelle House of Cards The Repair Shop
Moving Art Norm Macdonald Has a Show Demon Slayer: Kimetsu no Yaiba Anne with
an E Crash Landing on You Stranger Things Arrested Development The Marvelous
Mrs. Maisel Line of Duty Fleabag Sky T"
text = re.sub(r'[^\w\s]','',text)
len(text)
150872
text[1000:1500]
'iches Bo Burnham Inside Navillera The Family Man The Carol Burnett Show Hous
e Chappelles Show Invincible Code Geass Lelouch of the Rebellion WWII in HD W
WII in HD The Universe Victorian Farm Downton Abbey Haikyu Puffin Rock Downto
n Abbey Dave Chappelle House of Cards The Repair Shop Moving Art Norm Macdona
ld Has a Show Demon Slayer Kimetsu no Yaiba Anne with an E Crash Landing on Y
ou Stranger Things Arrested Development The Marvelous Mrs Maisel Line of Duty
Fleabag Sky Tour The Movie Lenox Hill '
#Creating the tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
#Tokenizing the cleaned title text
tokens = tokenizer.tokenize(text)
len(tokens)
25049
tokens[1000:1010]
['to',
'the',
'Edge',
'Sense8',
'Patriot',
'Hunting',
'ISIS',
'Breathe',
'Tumbbad',
'Travel']
words = []
stopwords = nltk.corpus.stopwords.words('english')
#nltk.download()
words_new = []
#Now we need to remove the stop words from the words variable
#Appending to words_new all words that are in words but not in sw
freq_dist = nltk.FreqDist(words_new)
#Frequency Distribution Plot
plt.subplots(figsize=(20,12))
freq_dist.plot(50)
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
Stop words are generally the most common words in a language. From the frequency
distribution above we can observe that 'love' has the highest frequency and 'dead' the
lowest among the 50 plotted words from the TV show titles in the dataset.
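The stop-word filtering and counting described above can be sketched without nltk as follows. The stop-word set here is a tiny hypothetical stand-in for nltk's English list, lower-casing the tokens is an assumption, and the three titles are made-up examples rather than rows from the dataset.

```python
import re
from collections import Counter

# Hypothetical miniature stop-word list; the report uses nltk's English list.
STOP_WORDS = {"the", "of", "a", "in", "and", "to"}

def title_word_frequencies(titles, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs across a list of titles."""
    text = " ".join(titles)
    text = re.sub(r"[^\w\s]", "", text)      # strip punctuation
    tokens = re.findall(r"\w+", text)        # split into word tokens
    words = [token.lower() for token in tokens]
    kept = [word for word in words if word not in stop_words]
    return Counter(kept)

# Made-up titles, for illustration only.
sample_titles = ["Love in the Air", "The Love Boat", "Dead to Me"]
freq = title_word_frequencies(sample_titles)
print(freq.most_common(3))
```

A Counter plays the role of nltk.FreqDist here; both map each word to its number of occurrences, so the most frequent entries correspond to the tallest points of the frequency plot.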
WORD CLOUD WITH STOPWORDS
#wordcloud
plt.subplots(figsize=(16,10))
wordcloud = WordCloud(
stopwords=STOPWORDS,
background_color='white',
max_words=100,
width=1400,
height=1200
).generate(res)
plt.imshow(wordcloud)
plt.title('TV Show Title WordCloud 100 Words')
plt.axis('off')
plt.show()
Word clouds are graphical representations of word frequency that give greater prominence
to words that appear more frequently in a source text. The larger the word in the visual, the
more common the word is in the dataset. This word cloud shows 100 words from the TV show
titles, with 'love' as the most frequently used word.
plt.subplots(figsize=(16,10))
wordcloud = WordCloud(
stopwords=STOPWORDS,
background_color='white',
max_words=500,
width=1400,
height=1200
).generate(res)
plt.imshow(wordcloud)
plt.title('TV Show Title WordCloud 500 Words')
plt.axis('off')
plt.show()
From the word cloud above we can infer the 500 most commonly used words in TV show
titles. 'Love' is the most commonly used word, followed by 'Christmas', 'girl', 'life', 'man'
and so on.
print("Netflix Shows with lowest IMDb ratings are= ")
print((netflix.sort_values("IMDb",ascending=True).head(10))['title'])
netflix1=netflix.sort_values("IMDb",ascending=False).head(100)[['title',"IMDb"]]
netflix1.head()
title IMDb
2 Breaking Bad 9.4
6 Our Planet 9.3
4 Avatar: The Last Airbender 9.3
11 Fullmetal Alchemist: Brotherhood 9.1
12 Reply 1988 9.1
From the output above, Breaking Bad has the highest rating at 9.4, followed by Our Planet
and Avatar: The Last Airbender at 9.3. Fullmetal Alchemist: Brotherhood and Reply 1988
have the lowest rating of the five at 9.1.
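#Converting it into a tuple
# Assumption: the conversion line is missing from this export; each row of
# netflix1 becomes a (title, IMDb) pair.
tuples_netflix_imdb = [tuple(row) for row in netflix1.values]
tuples_netflix_imdb[0:10]
#Making a wordcloud
wordcloud_netflix_imdb = WordCloud(width=1400,height=1200).generate_from_frequencies(dict(tuples_netflix_imdb))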
plt.subplots(figsize=(12,12))
plt.imshow(wordcloud_netflix_imdb)
plt.title("TV Shows based on IMDb rating(Top 100)")
netflix2=netflix.sort_values("Rotten Tomatoes",ascending=False).head(100)[['title',"Rotten Tomatoes"]]
netflix2.head()
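#Converting to Tuple
# Assumption: the conversion line is missing from this export; each row of
# netflix2 becomes a (title, Rotten Tomatoes) pair.
tuples_netflix_tomatoes = [tuple(row) for row in netflix2.values]
wordcloud_netflix_tomatoes = WordCloud(width=1400,height=1200).generate_from_frequencies(dict(tuples_netflix_tomatoes))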
plt.subplots(figsize=(12,12))
plt.imshow(wordcloud_netflix_tomatoes)
ratings=data[["title",'IMDb',"Rotten Tomatoes"]]
ratings.head()
len(ratings)
8834
ratings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8834 entries, 0 to 8833
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 8834 non-null object
1 IMDb 8538 non-null float64
2 Rotten Tomatoes 8827 non-null float64
dtypes: float64(2), object(1)
memory usage: 207.2+ KB
label1=data[['IMDb','Rotten Tomatoes']]
label1=label1.dropna()
print(label)
[5 8 6 ... 3 9 3]
from sklearn.cluster import KMeans
import numpy as np
# k means
kmeans = KMeans(n_clusters=3, random_state=0)
label1['cluster'] = kmeans.fit_predict(label1[['IMDb', 'Rotten Tomatoes']])
# get centroids
centroids = kmeans.cluster_centers_
cen_x = [i[0] for i in centroids]
cen_y = [i[1] for i in centroids]
## add to df
label1['cen_x'] = label1.cluster.map({0:cen_x[0], 1:cen_x[1], 2:cen_x[2]})
label1['cen_y'] = label1.cluster.map({0:cen_y[0], 1:cen_y[1], 2:cen_y[2]})
# define and map colors
colors = ['#DF2020', '#81DF20', '#2095DF']
label1['c'] = label1.cluster.map({0:colors[0], 1:colors[1], 2:colors[2]})
<matplotlib.collections.PathCollection at 0x2801ebe6520>
#Removing the data
ratings=ratings.dropna()
ratings["IMDb"]=ratings["IMDb"]*10
#New data
ratings.head()
#Input data
X=ratings[["IMDb","Rotten Tomatoes"]]
plt.figure(figsize=(10,6))
sns.scatterplot(x = 'IMDb',y = 'Rotten Tomatoes', data = X ,s = 60 )
plt.xlabel('IMDb rating (multiplied by 10)')
plt.ylabel('Rotten Tomatoes')
plt.title('IMDb rating (multiplied by 10) vs Rotten Tomatoes Score')
plt.show()
#KMeans was imported from sklearn.cluster above
#Fit k-means for k = 1..10 and record the within-cluster sum of squares (WCSS)
wcss=[]
for i in range(1,11):
    km=KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)
wcss
[2658782.8560552797,
1416468.3483010572,
971006.2047273932,
744730.8390479111,
595017.879951996,
508191.7320234856,
436636.2687543623,
384902.21546598733,
344940.126989912,
313659.9308850105]
ELBOW CURVE
#The elbow curve
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss)
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()
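The WCSS values plotted in the elbow curve are what scikit-learn reports as inertia_: the sum of squared distances from each sample to its closest cluster centre. The sketch below computes that quantity by hand for a made-up set of four points, with hypothetical labels and centroids, to show why the curve falls as the number of clusters grows.

```python
def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centre = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centre))
    return total

# Made-up 2-D points forming two obvious groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

# One cluster: every point is assigned to the overall mean (5, 1).
wcss_k1 = wcss(points, [0, 0, 0, 0], [(5, 1)])

# Two clusters: each group is assigned to its own mean.
wcss_k2 = wcss(points, [0, 0, 1, 1], [(0, 1), (10, 1)])

print(wcss_k1, wcss_k2)
```

Going from one cluster to two removes most of the spread in this toy example; the elbow is the value of k after which further clusters bring only small reductions of this kind.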
#Taking 3 clusters
km=KMeans(n_clusters=3)
km.fit(X)
KMeans(n_clusters=3)
y=km.predict(X)
ratings["label"] = y
ratings.head()
plt.figure(figsize=(10,6))
sns.scatterplot(x = 'IMDb',y = 'Rotten Tomatoes',hue="label",
                palette=['orange','red','green'], legend='full',data = ratings ,s = 60 )
print('TV Shows in cluster 0')
print(ratings[ratings["label"]==0]["title"].values)
TV Shows in cluster 0
['Bluey' 'Breaking Bad' 'Avatar: The Last Airbender' ...
'After We Collided' 'The Twilight Saga: Eclipse'
'Masters of the Universe: Revelation']
print('TV Shows in cluster 1')
print(ratings[ratings["label"]==1]["title"].values)
TV Shows in cluster 1
['Jingle Pols' 'Alaska Animal Rescue' 'Harmony with A R Rahman' ...
'Dismissed' 'Dismissed' 'The Operative']
print('TV Shows in cluster 2')
print(ratings[ratings["label"]==2]["title"].values)
TV Shows in cluster 2
['Tjovitjo' 'Kibaoh Klashers' 'Robozuna' ... "Izzie's Way Home"
'Finding Jesus' 'Racket Boys']
data.description=data.description.str.replace(" ",",",regex=True)
data = pd.read_csv('Ottdataset.csv',encoding='unicode_escape')
data.info()
data.head()[:20]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8834 entries, 0 to 8833
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8834 non-null int64
1 Platform 8834 non-null object
2 type 8834 non-null object
3 title 8834 non-null object
4 director 6393 non-null object
5 cast 8098 non-null object
6 country 5889 non-null object
7 date_added 5970 non-null object
8 release_year 8834 non-null int64
9 rating 8678 non-null object
10 duration 8834 non-null object
11 listed_in 8834 non-null object
12 description 8834 non-null object
13 IMDb 8538 non-null object
14 Rotten Tomatoes 8827 non-null object
dtypes: int64(2), object(13)
memory usage: 1.0+ MB
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
data.IMDb=data.IMDb.str.replace("/10","",regex=True)
data['IMDb'] = pd.to_numeric(data['IMDb'],errors='coerce')
data.rating=data.rating.str.replace("/10","",regex=True)
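The cleanup above strips the "/10" suffix and then coerces unparseable entries to NaN. As a minimal, library-free sketch of the same idea (the helper name `parse_score` and the sample strings are assumptions made for illustration, not part of the dataset):

```python
import math

def parse_score(raw, suffix="/10"):
    """Turn a string such as '9.6/10' into a float; return NaN when it cannot be parsed."""
    if raw is None:
        return math.nan
    text = str(raw).strip()
    if text.endswith(suffix):
        text = text[: -len(suffix)]
    try:
        return float(text)
    except ValueError:
        # mirrors errors='coerce': bad values become NaN instead of raising
        return math.nan

samples = ["9.6/10", "7/10", "N/A", None]  # hypothetical raw values
parsed = [parse_score(s) for s in samples]
print(parsed)
```

Coercing instead of raising keeps the rows in the table, so they can still be counted or filtered out later.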
plt.subplots(figsize=(10,10))
list1 = []
for i in data['genre']:
    list1.extend(i)
# count the genres once and reuse the ten most frequent for the bars and their labels
top_genres = pd.Series(list1).value_counts()[:10].sort_values(ascending=True)
ax = top_genres.plot.barh(width=0.9,color=sns.color_palette('hls',10))
for i, v in enumerate(top_genres.values):
    ax.text(.8, i, v,fontsize=12,color='white',weight='bold')
plt.title('Top Genres')
plt.show()
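The counting step behind the chart can be illustrated without pandas. The sketch below assumes each title carries a list of genres; the rows are invented for the example, and `collections.Counter` stands in for `value_counts()`:

```python
from collections import Counter

# Hypothetical rows; each inner list holds the genres of one title.
genre_lists = [
    ["Animation", "Kids"],
    ["Docuseries", "Family"],
    ["Kids"],
    ["Animation", "Kids", "Family"],
]

# Flatten all lists and count how often each genre occurs.
counts = Counter(g for genres in genre_lists for g in genres)
top = counts.most_common(2)  # the two most frequent genres with their counts
print(top)
```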
RECOMMENDATION SYSTEM USING K-NEAREST NEIGHBORS:
PREDICT IMDB SCORES
data.head(10)
director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 Elliot Weaver, Zander Weaver
8 NaN
9 NaN
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
5 NaN
6 David Attenborough
7 Tom England, Arjun Singh Panam, Joshua Ford, B...
8 A R Rahman, Sajith Vijayan, Bahauddin Dagar, B...
9 Neil deGrasse Tyson
duration genre \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
5 3 Seasons Kids
6 1 Season Docuseries, Science & Nature TV
7 129 min Action, Science Fiction
8 1 Season Arts, Entertainment, and Culture, Documentary
9 1 Season Action-Adventure, Docuseries, Family
CONTENT-BASED FILTERING
These filtering methods are based on the description of an item and a profile of the user’s
preferred choices. In a content-based recommendation system, keywords are used to
describe the items, and a user profile is built to capture the type of item this user likes. In
other words, the algorithm tries to recommend products that are similar to the ones that the
user has liked in the past.
Now let’s generate a list ‘genreList’ with all possible unique genres mentioned in the
dataset.
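genreList = []
for index, row in data.iterrows():
    genres = row["genre"]
    for genre in genres:  # add each genre the first time it is seen
        if genre not in genreList:
            genreList.append(genre)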
Next, we add a column to the dataframe that holds, for each title, a list of binary values
indicating whether each genre is present (1) or absent (0). First, let’s create a function that
returns this list of binary values for the genres of a single title. The ‘genreList’ is now
useful as the reference against which the title's genres are compared.
return binaryList
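The encoding described above can be sketched in a self-contained way as follows. The function names `unique_values` and `to_binary` are illustrative assumptions, not the names used in the notebook; the three sample rows are taken from the genre column shown earlier.

```python
def unique_values(rows):
    """Collect every distinct label in first-seen order."""
    seen = []
    for labels in rows:
        for label in labels:
            if label not in seen:
                seen.append(label)
    return seen

def to_binary(labels, vocabulary):
    """Return one 0/1 flag per vocabulary entry: 1 if the label is present."""
    present = set(labels)
    return [1 if v in present else 0 for v in vocabulary]

rows = [
    ["Animals&Nature", "Documentary", "Medical"],
    ["Animation", "Kids"],
    ["Animals&Nature", "Docuseries", "Family"],
]
vocab = unique_values(rows)
encoded = [to_binary(r, vocab) for r in rows]
for vector in encoded:
    print(vector)
```

Every vector has the same length as the vocabulary, which is what allows two titles to be compared position by position later on.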
We will follow the same notations for other features like the cast, director, and the
keywords.
data['genre_bin'] = data['genre'].apply(lambda x: binary(x))
data['genre_bin'].head()
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
Name: genre_bin, dtype: object
data['cast'] = data['cast'].astype(str)
data.head()
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
genre_bin
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
Now let’s generate a list ‘castList’ with all possible unique cast members mentioned in the
dataset.
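castList = []
for index, row in data.iterrows():
    casts = row["cast"]
    for cast in casts:  # add each cast member the first time they are seen
        if cast not in castList:
            castList.append(cast)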
['Dr.Pol',
'DaveMccormack',
'MelanieZanetti',
'BradElliot',
'Hsiao-LingTang',
'BryanCranston',
'AaronPaul',
'AnnaGunn',
'DeanNorris',
'BetsyBrandt']
return binaryList
Let’s create a new column in the dataframe that holds binary values indicating whether each
cast member appears in the title or not.
data['cast_bin'] = data['cast'].apply(lambda x: binary(x))
data['cast_bin'].head()
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: cast_bin, dtype: object
data.head()
cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...
duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
genre_bin \
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
cast_bin
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Generating a list ‘directorList’ with all possible unique directors mentioned in the dataset.
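directorList = []
for index, row in data.iterrows():
    directors = row["director"]
    for director in directors:  # add each director the first time they are seen
        if director not in directorList:
            directorList.append(director)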
['nan',
'ElliotWeaver',
'ZanderWeaver',
'YasuhiroIrie',
'GarySing',
'SamirAlAsfory',
'PeterMarcy',
'StéphaneRybojad',
'AdamWingard',
'AlastairFothergill']
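def binary(director_list):
    # 1 where an entry of directorList appears in this title's directors, else 0
    binaryList = [1 if d in director_list else 0 for d in directorList]
    return binaryList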
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: director_bin, dtype: object
Since we need keywords for identifying similar movies and TV shows, we extract keywords
from the description column.
Converting description to string
data['description'] = data['description'].astype(str)
cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...
duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
genre_bin \
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
cast_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
director_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
keywords
0 Nat Geo WILD re-joins the Pols in central Mich...
1 Bluey is a six year-old Blue Heeler dog, who t...
2 A high school chemistry teacher dying of cance...
3 Conservation heroes rescue and rehabilitate th...
4 Siblings Katara and Sokka wake young Aang from...
Using nltk's stop-word list, we remove common English words that serve only to build sentences
from nltk.corpus import stopwords  # the list must be fetched once with nltk.download('stopwords')
stop_words = stopwords.words('english')
data['keywords'] = data['keywords'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
data.head()
cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...
duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
description IMDb Rotten Tomatoes
\
0 Nat Geo WILD re-joins the Pols in central Mich... 9.6 44.0
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6 71.0
2 A high school chemistry teacher dying of cance... 9.4 100.0
3 Conservation heroes rescue and rehabilitate th... 9.4 42.0
4 Siblings Katara and Sokka wake young Aang from... 9.3 93.0
genre_bin \
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
cast_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
director_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
keywords
0 Nat Geo WILD re-joins Pols central Michigan ge...
1 Bluey six year-old Blue Heeler dog, turns ever...
2 A high school chemistry teacher dying cancer t...
3 Conservation heroes rescue rehabilitate wild a...
4 Siblings Katara Sokka wake young Aang long hib...
duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
genre_bin \
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
cast_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
director_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
keywords
0 [Nat, Geo, WILD, re-joins, Pols, central, Mich...
1 [Bluey, six, year-old, Blue, Heeler, dog, , tu...
2 [A, high, school, chemistry, teacher, dying, c...
3 [Conservation, heroes, rescue, rehabilitate, w...
4 [Siblings, Katara, Sokka, wake, young, Aang, l...
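The stop-word removal and splitting steps shown in the outputs above can be sketched without nltk. The small `STOP_WORDS` set below is a hand-picked stand-in for nltk's English list, and the sample sentence is illustrative. Unlike the notebook's filter, this sketch compares words case-insensitively, so a leading "A" is also dropped.

```python
# A tiny hand-picked stop-word set, standing in for nltk's English list (assumption).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "and", "who"}

def strip_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and keep only the words that are not stop words."""
    return [w for w in text.split() if w.lower() not in stop_words]

description = "A high school chemistry teacher dying of cancer"
keywords = strip_stop_words(description)
print(keywords)
```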
The keywords or tags contain a lot of information about the movie, and they are a key feature in
finding similar movies. For example, movies like “Avengers” and “Ant-Man” may share
keywords such as superheroes or Marvel.
For analyzing keywords, we will try something different and plot a word cloud to get a
better intuition:
import nltk
from wordcloud import WordCloud, STOPWORDS
words=data['keywords'].dropna().apply(nltk.word_tokenize)
word=[]
for i in words:
    word.extend(i)
word=pd.Series(word)
word=([i for i in word.str.lower() if i not in stop_words])
wc = WordCloud(background_color="black", max_words=2000, stopwords=STOPWORDS,
               max_font_size=60, width=1000, height=1000)
wc.generate(" ".join(word))
plt.imshow(wc)
plt.axis('off')
fig=plt.gcf()
fig.set_size_inches(10,10)
plt.show()
Cleaning the keywords column as a string
data['keywords'] = data['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
data['keywords'] = data['keywords'].str.split(',')
for i,j in zip(data['keywords'],data.index):
    list2 = i
    data.loc[j,'keywords'] = str(list2)
data['keywords'] = data['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['keywords'] = data['keywords'].str.split(',')
for i,j in zip(data['keywords'],data.index):
    list2 = i
    list2.sort()  # sort each title's keywords alphabetically
    data.loc[j,'keywords'] = str(list2)
data['keywords'] = data['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['keywords'] = data['keywords'].str.split(',')
We compute ‘words_bin’ from the keywords, then filter out rows with an IMDb score of 0 or
without a director name.
data['words_bin'] = data['keywords'].apply(lambda x: binary(x))
data = data[(data['IMDb']!=0)] # removing the movies with 0 score and without director names
data = data[data['director']!='']
data = data[data['director']!='[nan]']
data = data[data['director']!='[NaN]']
data.head(10)
director \
0 [nan]
1 [nan]
2 [nan]
3 [nan]
4 [nan]
5 [nan]
6 [nan]
7 [ElliotWeaver, ZanderWeaver]
8 [nan]
9 [nan]
cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...
5 [nan]
6 [DavidAttenborough]
7 [TomEngland, ArjunSinghPanam, JoshuaFord, BenV...
8 [ARRahman, SajithVijayan, BahauddinDagar, Beda...
9 [NeildeGrasseTyson]
duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
5 3 Seasons [Kids]
6 1 Season [Docuseries, Science&NatureTV]
7 129 min [Action, ScienceFiction]
8 1 Season [Arts, Entertainment, andCulture, Documentary]
9 1 Season [Action-Adventure, Docuseries, Family]
genre_bin \
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
5 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, ...
7 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
8 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
cast_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
director_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
keywords \
0 [Christmas, Geo, Michigan, Nat, Pols, WILD, ce...
1 [, Blue, Bluey, Heeler, adventures., dog, ever...
2 [A, cancer, chemistry, crystal, dying, familys...
3 [AmericaÆ\\\\x92??s, Conservation, animals, fr...
4 [, Aang, Avatar, Fire, Katara, Nation., Siblin...
5 [Aang, Azula, Ba, Black, Day, Fire, Firelord, ...
6 [Experience, ambitious, beauty, change, climat...
7 [, Three, accidentally, alien, amateur, astron...
8 [, , A., A.R, Harmony, Indian, IndiaÆ\\\\x92??...
9 [40, COSMOS:, Carl, POSSIBLE, SaganÆ\\\\x92??s...
words_bin
0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
CREATING New_id FOR EACH ROW
We generate a new column, new_id, to identify each movie and TV show.
new_id = list(range(0,data.shape[0]))
data['new_id']=new_id
data=data[['title','genre','IMDb','genre_bin','cast_bin','new_id','director','director_bin','words_bin']]
data.head()
title genre
\
0 Jingle Pols [Animals&Nature, Documentary, Medical]
1 Bluey [Animation, Kids]
2 Breaking Bad [CrimeTVShows, TVDramas, TVThrillers]
3 Alaska Animal Rescue [Animals&Nature, Docuseries, Family]
4 Avatar: The Last Airbender [Classic&CultTV, KidsTV, TVAction&Adventure]
IMDb genre_bin \
0 9.6 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 9.6 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 9.4 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 9.4 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 9.3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
director_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
words_bin
0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
RECOMMENDATION SYSTEM USING K-NEAREST NEIGHBORS –
COSINE SIMILARITY
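Below we define a function, Similarity, which measures how close two movies or TV shows
are. It sums the cosine distances between their genre, cast, director, and keyword vectors,
so a smaller value means more similar titles.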
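from scipy import spatial

def Similarity(movieId1, movieId2):
    # look up both titles by row position
    a = data.iloc[movieId1]
    b = data.iloc[movieId2]
    genresA = a['genre_bin']
    genresB = b['genre_bin']
    genreDistance = spatial.distance.cosine(genresA, genresB)
    scoreA = a['cast_bin']
    scoreB = b['cast_bin']
    scoreDistance = spatial.distance.cosine(scoreA, scoreB)
    directA = a['director_bin']
    directB = b['director_bin']
    directDistance = spatial.distance.cosine(directA, directB)
    descriptionA = a['words_bin']
    descriptionB = b['words_bin']
    # compare the keyword vectors
    wordsDistance = spatial.distance.cosine(descriptionA, descriptionB)
    return genreDistance + directDistance + scoreDistance + wordsDistance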
Similarity(12,7)
4.0
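Each of the four cosine distances lies between 0 and 1 for binary vectors, so 4.0 is the
largest possible total. Reply 1988 and Cosmos therefore have nothing in common on the
compared features and are very different titles.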
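To see how the cosine distance behaves on binary vectors, here is a small pure-Python sketch. The function `cosine_distance` is written out by hand for illustration and follows the usual definition, one minus the cosine of the angle between the vectors. The three vectors are made up, and the sketch assumes neither input is all zeros.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

a = [1, 1, 1, 0, 0]
b = [0, 0, 0, 1, 1]  # shares no feature with a
c = [1, 1, 0, 0, 0]  # shares two features with a

d_ab = cosine_distance(a, b)  # no overlap gives the maximum distance
d_ac = cosine_distance(a, c)  # partial overlap gives a value strictly between 0 and 1
d_aa = cosine_distance(a, a)  # identical vectors give (numerically) zero
print(d_ab, d_ac, d_aa)
```

Summing four such distances, one per feature, gives a total between 0 and 4.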
print(data.iloc[12])
print(data.iloc[7])
We will now build the score predictor. It relies on the Similarity() function to compute the
distance between the selected title and every other title, and then keeps the 10 closest
ones. The predicted score for the selected title is the average of the IMDb scores of these
10 neighbors.
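import operator
def predict_score(name):
    #name = input('Enter a movie title: ')
    new_movie = data[data['title'].str.contains(name)].iloc[0].to_frame().T
    print('Selected Movie: ',new_movie.title.values[0])
    def getNeighbors(baseMovie, K):
        distances = []
        # distance from the selected title to every other title
        for index, movie in data.iterrows():
            if movie['new_id'] != baseMovie['new_id'].values[0]:
                dist = Similarity(baseMovie['new_id'].values[0], movie['new_id'])
                distances.append((movie['new_id'], dist))
        distances.sort(key=operator.itemgetter(1))
        neighbors = []
        for x in range(K):
            neighbors.append(distances[x])
        return neighbors
    K = 10
    avgRating = 0
    neighbors = getNeighbors(new_movie, K)
    print('\n')
    # sum the IMDb scores of the K nearest titles
    for neighbor in neighbors:
        avgRating = avgRating + data.iloc[neighbor[0]]['IMDb']
        print(data.iloc[neighbor[0]]['title'], '| Rating:', data.iloc[neighbor[0]]['IMDb'])
    avgRating = avgRating/K
    print('The predicted rating for %s is: %f' %(new_movie['title'].values[0],avgRating))
    print('The actual rating for %s is %f' %(new_movie['title'].values[0],new_movie['IMDb'].values[0]))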
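The core of the prediction step (sort by distance, keep the K nearest, average their scores) can be isolated in a few lines. The sketch below is independent of the dataframe. The helper name `predict_from_neighbors` and all numbers are made up for illustration, and it assumes at least one candidate is supplied.

```python
def predict_from_neighbors(distances, scores, k):
    """Average the scores of the k titles with the smallest distance.

    distances: list of (title_id, distance) pairs
    scores:    dict mapping title_id to its known score
    """
    nearest = sorted(distances, key=lambda pair: pair[1])[:k]
    neighbor_ids = [title_id for title_id, _ in nearest]
    predicted = sum(scores[i] for i in neighbor_ids) / len(neighbor_ids)
    return neighbor_ids, predicted

# Made-up distances and scores, for illustration only.
distances = [(0, 3.2), (1, 1.1), (2, 2.5), (3, 0.9), (4, 3.9)]
scores = {0: 9.0, 1: 8.0, 2: 7.0, 3: 6.0, 4: 5.0}
ids, predicted = predict_from_neighbors(distances, scores, k=3)
print(ids, predicted)
```

Dividing by the number of neighbors actually found, rather than by K, keeps the average correct when fewer than K candidates exist.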
Now we simply call predict_score('title name') with the name of the movie or TV show for
which we want 10 similar titles and a predicted rating.
predict_score('Cosmos')
Recommended to Watch:
Thus, we have completed the Movie Recommendation System implementation using the K-
Nearest Neighbors algorithm.
Sidenote — K Value
In this project, we have arbitrarily chosen the value K=10. In other applications of KNN,
however, choosing K is not simple. A small value of K means that noise will have a higher
influence on the result. Research papers and data scientists usually choose an odd K when
the number of classes is 2, so that a majority vote cannot tie. Another simple approach is
to set K=sqrt(n), where n is the number of samples.
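As a rough illustration of the square-root heuristic, the sketch below computes a starting value of K from the number of samples. The helper name `sqrt_rule_k` is an assumption, and the odd-number adjustment only matters for two-class voting, not for the score averaging used in this project.

```python
import math

def sqrt_rule_k(n_samples, force_odd=True):
    """Heuristic starting point for K: round(sqrt(n)), optionally bumped to the next odd number."""
    k = max(1, round(math.sqrt(n_samples)))
    if force_odd and k % 2 == 0:
        k += 1
    return k

# data.info() reports 8834 rows before any filtering.
k_plain = sqrt_rule_k(8834, force_odd=False)
k_odd = sqrt_rule_k(8834)
print(k_plain, k_odd)
```

A value obtained this way is only a starting point; comparing prediction error for several values of K on held-out titles is a more reliable way to choose.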