Eda Final Report

The document is a report from a team of students on their exploratory data analysis project on OTT platforms and movie recommendations. It includes an introduction describing the objectives of analyzing a dataset on movies from different platforms and implementing a recommendation system. It also describes the challenges faced and provides an algorithm for the proposed recommendation system involving preprocessing the data, calculating similarities between titles, and using K-nearest neighbors to predict ratings and recommend titles.


School of Computer Science and Engineering

J Component report

Programme : MTech Integrated CSE with Spl. In BA

Course Title : Exploratory Data Analytics

Course Code : CSE3040

Slot : F2

Title: EDA On OTT Platforms with Movie Recommendation

Team Members: Antony George Mathew K (20MIA1022)

J. Shree Nikhila (20MIA1023)

Simran Bohra (20MIA1024)

A Sri Karthik (20MIA1032)

Faculty: Sweetlin Hemalatha C Sign:

Date:
1. INTRODUCTION

A recommendation system is an information-filtering tool built to predict the
“rating” or “preference” a user would give an item, based on the user's input.
It collects data about the user's preferences, directly or indirectly, for
different kinds of items such as movies, shows, shopping, tourism and TV. A
movie recommendation system, in particular, uses the movies the user has
previously watched. Collaborative filtering is a method that filters or scores
items using the opinions of other users. It first collects movie ratings or
preferences, here the scores given by IMDb and Rotten Tomatoes, and then
suggests movies to different users based on similar tastes and interests in
the past. K-Nearest Neighbor is implemented on this dataset in order to obtain
the best-optimized result. In the available techniques the data is scattered,
which results in a high number of clusters, while in the proposed technique the
data is gathered, which results in a low number of clusters. This simplifies
the process of recommending a title. The recommender system predicts the user's
preference for a title on the basis of parameters such as genre, cast, director
and keywords. It works on the premise that people share common preferences or
like similar content. The report is organized as follows. Section 1 is the
introduction, with the basics of data handling. Section 2 discusses the work
done by the group on visualizing the data. Section 3 describes the evolution of
the proposed recommendation system. Section 4 presents the algorithm of the
proposed system. Section 5 shows the implementation of the proposed system.
1.1 Objective

In this project we perform exploratory data analysis to find hidden trends and
patterns in the dataset. We load and read the data using pandas, clean and
process it, explore it through visualizations and graphs using matplotlib and
seaborn, and finally answer some questions about the dataset. We also implement
a predictive system using k-nearest neighbors to produce the top 10 suggestions
for the user based on their input, and to predict the IMDb score of the
searched title as the average rating of those suggestions.

1.2 Statement

When it comes to OTT platforms, the main pitfall is content recommendation for
the subscriber. Companies often profile their audience wrongly, and subscribers
spend more time choosing a platform or movie because there are too many options.
This can be remedied by analysing the data of individual customers and using AI
to suggest a movie or show across platforms.

1.3 Preparing the Data

The dataset used in this project is a combination of 5 datasets taken from Kaggle
and IMDb directories, which have been merged and filtered to form a unique
dataset.

Numerical- show id, release year, IMDb, Rotten Tomatoes

Categorical- platform, type, title, director, cast, country, rating, listed in,
description

1.4 Challenges

The challenges faced by the authors include cleaning the data before using it
for analysis, extracting the features needed for visualizing and recommending
content from the processed data, and analysing the final results to draw
appropriate conclusions.
ALGORITHM OF RECOMMENDATION SYSTEM

The algorithm for the proposed system is as follows:

Step 1: Import the Python libraries: NumPy, pandas, Matplotlib, sklearn,
seaborn, missingno, stemgraphic, wordcloud, nltk, random, re.

Step 2: Read the CSV file as a data frame.

Step 3: Create duplicate columns of description and listed_in in the data frame for
the prediction model and testing. Convert the columns to strings.

Step 4: Rename the description duplicate to keywords, split it into a list of
individual words and remove stopwords using nltk. Similarly, rename the listed_in
duplicate to genre and split it into individual genres.

Step 5: Create binary bins for genre, cast, director and keywords using the binary()
function, and store them in new columns named genre_bin, cast_bin, director_bin and
keywords_bin.

Step 6: Use cosine similarity on the bins to find similar titles. The defined
function similarity gives the distance between two titles.

Step 7: Take the average of the IMDb scores of the 10 most similar titles to predict
the score of the searched title.

Step 8: Use the predict_score() function to generate the recommended titles and the
predicted score using K-nearest neighbor.
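
Steps 5 to 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the toy catalogue, the bodies of `binary`, `cosine_similarity` and `predict_score`, and the choice of k=2 are all hypothetical, and only genre is encoded here, whereas the report also encodes cast, director and keywords.

```python
import math

# Hypothetical toy catalogue: title -> (list of genres, IMDb score)
catalogue = {
    "A": (["Drama", "Crime"], 9.0),
    "B": (["Drama", "Crime", "Thriller"], 8.0),
    "C": (["Kids", "Animation"], 7.0),
    "D": (["Drama"], 6.0),
}

# Vocabulary of every genre that occurs in the catalogue
all_genres = sorted({g for genres, _ in catalogue.values() for g in genres})


def binary(items, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries occur in items."""
    return [1 if v in items else 0 for v in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_score(title, k=2):
    """Return the k most similar titles and the mean of their IMDb scores."""
    target = binary(catalogue[title][0], all_genres)
    scored = []
    for other, (genres, score) in catalogue.items():
        if other == title:
            continue  # a title is not its own neighbour
        sim = cosine_similarity(target, binary(genres, all_genres))
        scored.append((sim, other, score))
    scored.sort(key=lambda t: t[0], reverse=True)  # most similar first
    neighbours = scored[:k]
    predicted = sum(s for _, _, s in neighbours) / len(neighbours)
    return [name for _, name, _ in neighbours], predicted


neighbours, predicted = predict_score("A", k=2)
print(neighbours, predicted)
```

With the report's setting, k would be 10 and the similarity would combine the genre, cast, director and keyword bins.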
IMPORTING NECESSARY PACKAGES AND LIBRARIES

import re
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
import random
import stemgraphic
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

DATA DESCRIPTION
data = pd.read_csv('Ottdataset.csv',encoding='unicode_escape')
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8834 entries, 0 to 8833
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8834 non-null int64
1 Platform 8834 non-null object
2 type 8834 non-null object
3 title 8834 non-null object
4 director 6393 non-null object
5 cast 8098 non-null object
6 country 5889 non-null object
7 date_added 5970 non-null object
8 release_year 8834 non-null int64
9 rating 8678 non-null object
10 duration 8834 non-null object
11 listed_in 8834 non-null object
12 description 8834 non-null object
13 IMDb 8538 non-null object
14 Rotten Tomatoes 8827 non-null object
dtypes: int64(2), object(13)
memory usage: 1.0+ MB

show_id Platform type title director \

0 977 Disney Movie Jingle Pols NaN
1 209 Disney TV Show Bluey NaN
2 7391 Netflix TV Show Breaking Bad NaN
3 119 Disney TV Show Alaska Animal Rescue NaN
4 3970 Netflix TV Show Avatar: The Last Airbender NaN

cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
3 United States September 1, 2021 2019 TV-PG
4 United States May 15, 2020 2007 TV-Y7
duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...

description IMDb Rotten Tomatoes

0 Nat Geo WILD re-joins the Pols in central Mich... 9.6/10 44/100
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6/10 71/100
2 A high school chemistry teacher dying of cance... 9.4/10 100/100
3 Conservation heroes rescue and rehabilitate th... 9.4/10 42/100
4 Siblings Katara and Sokka wake young Aang from... 9.3/10 93/100

data.shape

(8834, 15)

data.columns

Index(['show_id', 'Platform', 'type', 'title', 'director', 'cast', 'country',

'date_added', 'release_year', 'rating', 'duration', 'listed_in',
'description', 'IMDb', 'Rotten Tomatoes'],
dtype='object')

print(data.describe)  # without parentheses this prints the bound method and the frame, not summary statistics

<bound method NDFrame.describe of show_id Platform type \

0 977 Disney Movie
1 209 Disney TV Show
2 7391 Netflix TV Show
3 119 Disney TV Show
4 3970 Netflix TV Show
... ... ... ...
8829 2483 Netflix TV Show
8830 2887 Netflix Movie
8831 15369 Prime Video TV Show
8832 16870 Prime Video Movie
8833 19044 Prime Video TV Show

title \
0 Jingle Pols
1 Bluey
2 Breaking Bad
3 Alaska Animal Rescue
4 Avatar: The Last Airbender
... ...
8829 Love Naggers
8830 Ratones Paranoicos: The Band that Rocked Argen...
8831 Leo the Truck
8832 Steps
8833 Pinkfong! Healthy Habit Songs

director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
8829 NaN
8830 Alejandro Ruax, Ramiro Mart??nez
8831 NaN
8832 Rock Davis, Jay Rodriguez
8833 NaN

cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
... ...
8829 Seo Jang-hoon, Kim Sook, Han Hye-jin, Kwak Jun...
8830 Juan Sebasti?Ân Guti??rrez, Pablo Cano, Pablo...
8831 Maria Poddubnaya, Sveta Lebedeva, Anuar Shalab...
8832 Rob Morgan, Walter Fauntleroy, Robert G. McKay...
8833 NaN

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
3 United States September 1, 2021 2019 TV-PG
4 United States May 15, 2020 2007 TV-Y7
... ... ... ... ...
8829 South Korea April 16, 2021 2021 TV-14
8830 NaN January 6, 2021 2021 TV-MA
8831 NaN NaN 2021 TV-Y
8832 NaN NaN 2021 16+
8833 NaN NaN 2021 ALL

duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
... ... ...
8829 1 Season International TV Shows, Stand-Up Comedy & Talk...
8830 76 min Documentaries, International Movies, Music & M...
8831 1 Season Kids
8832 118 min Drama
8833 1 Season Animation, Kids

description IMDb \
0 Nat Geo WILD re-joins the Pols in central Mich... 9.6/10
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6/10
2 A high school chemistry teacher dying of cance... 9.4/10
3 Conservation heroes rescue and rehabilitate th... 9.4/10
4 Siblings Katara and Sokka wake young Aang from... 9.3/10
... ... ...
8829 From the quirky to the scandalous, any relatio... NaN
8830 The irrepressible Ratones Paranoicos, Argentin... NaN
8831 Adventures of Leo and friends continue in a ne... NaN
8832 Years after a life-altering robbery, a home he... NaN
8833 Sing with Pinkfong and learn how to form healt... NaN

Rotten Tomatoes
0 44/100
1 71/100
2 100/100
3 42/100
4 93/100
... ...
8829 16/100
8830 12/100
8831 10/100
8832 46/100
8833 10/100

[8834 rows x 15 columns]>

data.nunique()

show_id 8834
Platform 3
type 2
title 8480
director 4844
cast 7895
country 539
date_added 1398
release_year 96
rating 23
duration 222
listed_in 1013
description 8828
IMDb 81
Rotten Tomatoes 89
dtype: int64

NULL VALUE ANALYSIS

data.isnull()

show_id Platform type title director cast country date_added

\
0 False False False False True False True False
1 False False False False True False False False
2 False False False False True False False False
3 False False False False True False False False
4 False False False False True False False False
... ... ... ... ... ... ... ... ...
8829 False False False False True False False False
8830 False False False False False False True False
8831 False False False False True False True True
8832 False False False False False False True True
8833 False False False False True True True True

release_year rating duration listed_in description IMDb \

0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False False False False
4 False False False False False False
... ... ... ... ... ... ...
8829 False False False False False True
8830 False False False False False True
8831 False False False False False True
8832 False False False False False True
8833 False False False False False True
Rotten Tomatoes
0 False
1 False
2 False
3 False
4 False
... ...
8829 False
8830 False
8831 False
8832 False
8833 False

[8834 rows x 15 columns]

data.isnull().sum()

show_id 0
Platform 0
type 0
title 0
director 2441
cast 736
country 2945
date_added 2864
release_year 0
rating 156
duration 0
listed_in 0
description 0
IMDb 296
Rotten Tomatoes 7
dtype: int64
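
The per-column null counts are easier to compare as a share of all rows. The sketch below assumes a small hypothetical frame `toy` in place of the report's data; the column names are reused only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the OTT dataset
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "director": [np.nan, "x", np.nan, "y"],
    "country": ["US", np.nan, "UK", "IN"],
})

missing = toy.isnull().sum()                        # null count per column
missing_pct = (toy.isnull().mean() * 100).round(1)  # share of rows, in percent

summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
# keep only the columns that actually have gaps, largest first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to the counts above, 2441 missing director values out of 8834 rows is about 27.6%, and 2945 missing country values is about 33.3%.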

if data.isnull().any(axis=None):
    print("\nPreview of data with null values:\nxxxxxxxxxxxxx")
    print(data[data.isnull().any(axis=1)].head(3))
    missingno.matrix(data)
    plt.show()

Preview of data with null values:

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA

duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers

description IMDb Rotten Tomatoes

0 Nat Geo WILD re-joins the Pols in central Mich... 9.6/10 44/100
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6/10 71/100
2 A high school chemistry teacher dying of cance... 9.4/10 100/100
if data.isnull().any(axis=None):
    print("\nPreview of data with null values:\nxxxxxxxxxxxxx")
    print(data[data.isnull().any(axis=1)].head(3))
    missingno.dendrogram(data)
    plt.show()

Preview of data with null values:

xxxxxxxxxxxxx
show_id Platform type title director \
0 977 Disney Movie Jingle Pols NaN
1 209 Disney TV Show Bluey NaN
2 7391 Netflix TV Show Breaking Bad NaN

cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA

duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers

description IMDb Rotten Tomatoes

0 Nat Geo WILD re-joins the Pols in central Mich... 9.6/10 44/100
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6/10 71/100
2 A high school chemistry teacher dying of cance... 9.4/10 100/100
if data.isnull().any(axis=None):
    print("\nPreview of data with null values:\nxxxxxxxxxxxxx")
    print(data[data.isnull().any(axis=1)].head(3))
    missingno.heatmap(data)
    plt.show()

Preview of data with null values:

xxxxxxxxxxxxx
show_id Platform type title director \
0 977 Disney Movie Jingle Pols NaN
1 209 Disney TV Show Bluey NaN
2 7391 Netflix TV Show Breaking Bad NaN
cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
country date_added release_year rating \
0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
duration listed_in \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
description IMDb Rotten Tomatoes
0 Nat Geo WILD re-joins the Pols in central Mich... 9.6/10 44/100
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6/10 71/100
2 A high school chemistry teacher dying of cance... 9.4/10 100/100
PREPARING DATA FOR PERFORMING EXPLORATORY DATA ANALYSIS
display(data.describe().T)

count mean std min 25% 50% \

show_id 8834.0 7460.033960 5151.490348 8.0 3286.25 6055.5
release_year 8834.0 2010.738963 16.273802 1920.0 2011.00 2017.0

75% max
show_id 11986.75 19923.0
release_year 2019.00 2021.0

data.IMDb=data.IMDb.str.replace("/10","",regex=True)

data

show_id Platform type \

director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
8829 NaN
8830 Alejandro Ruax, Ramiro Mart??nez
8831 NaN
8832 Rock Davis, Jay Rodriguez
8833 NaN

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
3 United States September 1, 2021 2019 TV-PG
4 United States May 15, 2020 2007 TV-Y7
... ... ... ... ...
8829 South Korea April 16, 2021 2021 TV-14
8830 NaN January 6, 2021 2021 TV-MA
8831 NaN NaN 2021 TV-Y
8832 NaN NaN 2021 16+
8833 NaN NaN 2021 ALL

description IMDb Rotten Tomatoes

0 Nat Geo WILD re-joins the Pols in central Mich... 9.6 44/100
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6 71/100
2 A high school chemistry teacher dying of cance... 9.4 100/100
3 Conservation heroes rescue and rehabilitate th... 9.4 42/100
4 Siblings Katara and Sokka wake young Aang from... 9.3 93/100
... ... ... ...
8829 From the quirky to the scandalous, any relatio... NaN 16/100
8830 The irrepressible Ratones Paranoicos, Argentin... NaN 12/100
8831 Adventures of Leo and friends continue in a ne... NaN 10/100
8832 Years after a life-altering robbery, a home he... NaN 46/100
8833 Sing with Pinkfong and learn how to form healt... NaN 10/100

[8834 rows x 15 columns]

data['IMDb'] = pd.to_numeric(data['IMDb'],errors='coerce')

display(data.describe().T)

count mean std min 25% 50% \

show_id 8834.0 7460.033960 5151.490348 8.0 3286.25 6055.5
release_year 8834.0 2010.738963 16.273802 1920.0 2011.00 2017.0
IMDb 8538.0 6.452331 1.196736 1.1 5.70 6.6

75% max
show_id 11986.75 19923.0
release_year 2019.00 2021.0
IMDb 7.30 9.6

data['Rotten Tomatoes']=data['Rotten Tomatoes'].str.replace("/100","",regex=True)

data

show_id Platform type \

director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
8829 NaN
8830 Alejandro Ruax, Ramiro Mart??nez
8831 NaN
8832 Rock Davis, Jay Rodriguez
8833 NaN

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
3 United States September 1, 2021 2019 TV-PG
4 United States May 15, 2020 2007 TV-Y7
... ... ... ... ...
8829 South Korea April 16, 2021 2021 TV-14
8830 NaN January 6, 2021 2021 TV-MA
8831 NaN NaN 2021 TV-Y
8832 NaN NaN 2021 16+
8833 NaN NaN 2021 ALL

description IMDb Rotten Tomatoes

0 Nat Geo WILD re-joins the Pols in central Mich... 9.6 44
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6 71
2 A high school chemistry teacher dying of cance... 9.4 100
3 Conservation heroes rescue and rehabilitate th... 9.4 42
4 Siblings Katara and Sokka wake young Aang from... 9.3 93
... ... ... ...
8829 From the quirky to the scandalous, any relatio... NaN 16
8830 The irrepressible Ratones Paranoicos, Argentin... NaN 12
8831 Adventures of Leo and friends continue in a ne... NaN 10
8832 Years after a life-altering robbery, a home he... NaN 46
8833 Sing with Pinkfong and learn how to form healt... NaN 10

[8834 rows x 15 columns]

data['Rotten Tomatoes'] = pd.to_numeric(data['Rotten Tomatoes'],errors='coerce')
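
The two-step cleaning of the score columns (strip the "/10" or "/100" suffix, then convert to numbers) can be folded into one helper. This is a sketch only: `score_to_number` and the sample series `raw` are hypothetical names, not part of the report's code.

```python
import pandas as pd

# Hypothetical sample of raw score strings, including a missing and a malformed value
raw = pd.Series(["9.6/10", "44/100", None, "n/a"])


def score_to_number(series):
    """Keep the part before the first '/' and convert it to a float.

    Values that cannot be parsed become NaN instead of raising an error.
    """
    numerator = series.str.split("/", n=1).str[0]
    return pd.to_numeric(numerator, errors="coerce")


scores = score_to_number(raw)
print(scores)
```

Because the helper drops whatever follows the slash, the same function covers both the IMDb and the Rotten Tomatoes columns, although the two remain on different scales (out of 10 and out of 100).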

data.loc[1:100]

show_id Platform type title \

1 209 Disney TV Show Bluey
2 7391 Netflix TV Show Breaking Bad
3 119 Disney TV Show Alaska Animal Rescue
4 3970 Netflix TV Show Avatar: The Last Airbender
5 13448 Prime Video TV Show Avatar: The Last Airbender
.. ... ... ... ...
96 11145 Prime Video Movie Stop Making Sense
97 93 Disney TV Show The Simpsons
98 5682 Netflix Movie Bill Hicks: Relentless
99 14171 Prime Video Movie Bill Hicks: Relentless
100 5683 Netflix Movie Bill Hicks: Revelations

director cast \
1 NaN Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 NaN Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 NaN Victoria Vosburg
4 NaN Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
5 NaN NaN
.. ... ...
96 Jonathan Demme David Byrne, Chris Frantz, Jerry Harrison, Tin...
97 NaN Dan Castellaneta, Julie Kavner, Nancy Cartwrig...
98 Chris Bould Bill Hicks
99 Chris Bould Bill Hicks
100 Chris Bould Bill Hicks

country date_added release_year rating \

1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
3 United States September 1, 2021 2019 TV-PG
4 United States May 15, 2020 2007 TV-Y7
5 NaN NaN 2008 TV-Y7
.. ... ... ... ...
96 NaN NaN 1984 7+
97 United States September 29, 2021 1989 TV-PG
98 United Kingdom December 31, 2018 1992 TV-MA
99 NaN NaN 1992 NaN
100 United Kingdom December 31, 2018 1993 TV-MA

duration listed_in \
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
5 3 Seasons Kids
.. ... ...
96 88 min Documentary, Music Videos and Concerts
97 32 Seasons Animation, Comedy
98 61 min Stand-Up Comedy
99 49 min Arts, Entertainment, and Culture, Comedy, Docu...
100 56 min Stand-Up Comedy

description IMDb Rotten Tomatoes

1 Bluey is a six year-old Blue Heeler dog, who t... 9.6 71.0
2 A high school chemistry teacher dying of cance... 9.4 100.0
3 Conservation heroes rescue and rehabilitate th... 9.4 42.0
4 Siblings Katara and Sokka wake young Aang from... 9.3 93.0
5 Fire promises to be the most exciting season y... 9.3 93.0
.. ... ... ...
96 Stop Making Sense is director Jonathan Demme's... 8.6 72.0
97 The worldÆ’??s favorite nuclear family, in the... 8.6 91.0
98 In one of his most iconic performances, late c... 8.6 63.0
99 In this classic special from 1992, comedian Bi... 8.6 63.0
100 In his final recorded special, the iconoclasti... 8.6 62.0

[100 rows x 15 columns]

print(data['rating'].unique())

['TV-14' 'TV-Y' 'TV-MA' 'TV-PG' 'TV-Y7' '13+' 'NR' '18+' '16+' 'R' 'PG'
'ALL' 'TV-G' '7+' nan 'PG-13' 'G' 'TV-NR' 'AGES_16_' '16' 'AGES_18_'
'TV-Y7-FV' 'NC-17' 'UNRATED']
# Exact-match lookup from the raw rating labels to five age bands.
# Labels that are not listed ('7+', '13+', '16+', '18+') are left unchanged.
rating_map = {
    'TV-Y': 'all', 'TV-G': 'all', 'G': 'all', 'ALL': 'all',
    'NR': 'all', 'TV-NR': 'all', 'UNRATED': 'all',
    'TV-Y7': '13+', 'TV-Y7-FV': '13+', 'TV-PG': '13+', 'PG': '13+',
    'PG-13': '16+', 'TV-14': '16+', '16': '16+',
    'TV-MA': '18+', 'NC-17': '18+', 'R': '18+',
    'AGES_18_': '18+', 'AGES_16_': '18+',
}
data['rating'] = data['rating'].replace(rating_map)

print(data['rating'].unique())

['16+' 'all' '18+' '13+' '7+' nan]

data.groupby('Platform')['IMDb'].mean().plot.bar()
plt.show()
EXPLORATORY DATA ANALYSIS WITH THE HELP OF VISUALIZATIONS

data.loc[1:100].groupby('Platform')['IMDb'].mean().plot.bar()
plt.show()

The bar plot shows the mean IMDb score of the titles in rows 1 to 100 for the Disney, Netflix and Prime Video platforms.

data.Platform.value_counts(normalize=True).plot.pie()
plt.show()
The pie chart shows the share of titles on the Disney, Netflix and Prime Video platforms.
import numpy as np
import matplotlib.pyplot as plt
data.rating.value_counts(normalize=True).plot.pie()
circle = plt.Circle( (0,0), 0.7, color='White')
p=plt.gcf()
p.gca().add_artist(circle)
plt.show()

The donut chart above shows the share of titles in each rating group, with 18+ (which includes TV-MA) as the largest and 7+ as the smallest.
plt.subplots(figsize=(6,4))
sns.barplot(x="rating", y="Rotten Tomatoes", data=data.sort_values("Rotten Tomatoes", ascending=False).head(20))

<AxesSubplot:xlabel='rating', ylabel='Rotten Tomatoes'>

The bar plot shows the Rotten Tomatoes scores by rating for the 20 titles with the highest Rotten Tomatoes score, with 18+ as the highest and all as the lowest.
# correlation matrix of the numeric columns only
data_corr = data.select_dtypes(include='number').corr()

fig, ax = plt.subplots(figsize=(8, 6))

# mask the upper triangle, including the diagonal
mask = np.triu(np.ones_like(data_corr, dtype=bool))
# drop the first row and the last column, which are fully masked
mask = mask[1:, :-1]
corr = data_corr.iloc[1:, :-1].copy()
# color map
cmap = sns.diverging_palette(0, 230, 90, 60, as_cmap=True)
# plot heatmap
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            linewidths=5, cmap=cmap, vmin=-1, vmax=1,
            cbar_kws={"shrink": .8}, square=True)
# ticks
yticks = [i.upper() for i in corr.index]
xticks = [i.upper() for i in corr.columns]
plt.yticks(plt.yticks()[0], labels=yticks, rotation=0)
plt.xticks(plt.xticks()[0], labels=xticks)
plt.show()
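
The effect of the triangular mask is easier to see on a small matrix. The sketch below uses a hypothetical 3x3 correlation matrix and only NumPy; it mirrors the masking and trimming steps of the heatmap code without drawing anything.

```python
import numpy as np

# Hypothetical symmetric correlation matrix for three variables
corr = np.array([
    [1.0, 0.2, -0.5],
    [0.2, 1.0, 0.1],
    [-0.5, 0.1, 1.0],
])

# True on and above the diagonal: these cells repeat the lower triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# The first row and last column are entirely masked, so they can be dropped
trimmed_mask = mask[1:, :-1]
trimmed_corr = corr[1:, :-1]

# Cells a heatmap would leave blank are shown here as NaN
visible = np.where(trimmed_mask, np.nan, trimmed_corr)
print(visible)
```

Each pair of variables then appears exactly once, and the trivial self-correlations on the diagonal are hidden.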

# count titles per rating; naming the index and the counts explicitly works across pandas versions
age_groups = data['rating'].value_counts().rename_axis('rating').reset_index(name='Count')
age_groups

rating Count
0 18+ 3409
1 16+ 2210
2 13+ 1914
3 all 1042
4 7+ 103

plt.figure(figsize = (20,16))

a = sns.barplot(x='rating', y='Count', data = age_groups, palette='Spectral',
                linewidth=3)

plt.figtext(x=0.14, y=0.95,
            s='Distribution of TV Shows based on Ratings',
            fontsize=25, fontname='monospace')
plt.xticks(fontsize=20, fontname='monospace')
plt.yticks(fontsize=20, fontname='monospace')
plt.xlabel('Rating', fontsize=14)
plt.ylabel('Count', fontsize=14)

plt.grid(axis='y', color='black', linestyle = ':', alpha=0.5)

for q in [a]:
    for w in ['bottom', 'left']:
        q.spines[w].set_linewidth(3)
    for w in ['right', 'top']:
        q.spines[w].set_visible(False)

plt.show()

From the graph above we can infer that 18+ is the most common rating, with more than
3000 titles, and 7+ is the least common.
SEPARATING GENRE TO MAKE GENRE BASED ANALYSIS
def get_unique_values(genre_list):
    """Collect the distinct genres and count multi-genre and single-genre entries."""
    more_than_one = 0
    only_one = 0
    unique_genre = []
    for listed_in in genre_list:
        try:
            values = listed_in.split(",")
        except AttributeError:
            # skip missing (non-string) entries
            continue
        if len(values) > 1:
            more_than_one += 1
        else:
            only_one += 1
        for genre in values:
            if genre not in unique_genre:
                unique_genre.append(genre)

    return unique_genre, more_than_one, only_one

unique_genre, more_than_one, only_one = get_unique_values(data['listed_in'].unique())

print('Total Number of Unique Genres are: ', len(unique_genre))

print('Movies having more than one genre: ', more_than_one)
print('Movies having only one genre: ', only_one)

Total Number of Unique Genres are: 168

Movies having more than one genre: 967
Movies having only one genre: 46

unique_genre1=[]
unique_genre1=unique_genre.copy()
len(unique_genre1)

168

unique_genre = [x.strip(' ') for x in unique_genre]

unique_genre

['Animals & Nature',

'Documentary',
'Medical',
'Animation',
'Kids',
'Crime TV Shows',
'TV Dramas',
'TV Thrillers',
'Docuseries',
'Family',
'Classic & Cult TV',
"Kids' TV",
'TV Action & Adventure',
'Kids',
'Docuseries',
'Science & Nature TV',
'Action',
'Science Fiction',
'Arts',
'Entertainment',
'and Culture',
'Action-Adventure',
'Documentary',
'Unscripted',
'Anime Series',
'International TV Shows',
'International TV Shows',
'Korean TV Shows',
'Romantic TV Shows',
'Crime TV Shows',
'Drama',
'Romance',
'Comedies',
'Dramas',
'International Movies',
'Drama',
'Suspense',
'Horror Movies',
'Thrillers',
'Historical',
'TV Mysteries',
'Documentaries',
'Biographical',
'British TV Shows',
'Classic & Cult TV',
'Spanish-Language TV Shows',
'Special Interest',
'TV Comedies',
"Kids' TV",
'Reality TV',
'Stand-Up Comedy',
'International',
'Comedy',
'Horror',
'TV Comedies',
'Biographical',
'Sports',
'Teen TV Shows',
'Stand-Up Comedy & Talk Shows',
'TV Dramas',
'Reality TV',
'TV Horror',
'TV Sci-Fi & Fantasy',
'Adventure',
'Music & Musicals',
'Animation',
'Music Videos and Concerts',
'Comedy',
'TV Action & Adventure',
'Dramas',
'Independent Movies',
'TV Horror',
'TV Shows',
'LGBTQ Movies',
'Action & Adventure',
'Classic Movies',
'Fantasy',
'Mystery',
'Faith and Spirituality',
'Children & Family Movies',
'Documentaries',
'Stand-Up Comedy & Talk Shows',
'Coming of Age',
'Fantasy',
'Variety',
'Special Interest',
'Reality',
'Survival',
'Sports Movies',
'Music & Musicals',
'Stand-Up Comedy',
'Coming of Age',
'Romantic TV Shows',
'Romantic Movies',
'Sci-Fi & Fantasy',
'Western',
'Musical',
'Classic Movies',
'Cult Movies',
'Buddy',
'Game Show / Competition',
'Animals & Nature',
'International Movies',
'Suspense',
'Comedies',
'LGBTQ',
'Thrillers',
'Horror Movies',
'Military and War',
'Anthology',
'Romance',
'Talk Show and Variety',
'Faith & Spirituality',
'International',
'Science Fiction',
'Anime',
'Anime',
'Horror',
'Crime',
'Anime Features',
'Spy/Espionage',
'Dance',
'Adventure',
'Family',
'Music',
'Arthouse',
'Anime Features',
'Children & Family Movies',
'Lifestyle',
'Sports',
'Soap Opera / Melodrama',
'Music',
'Young Adult Audience',
'Western',
'Arthouse',
'Unscripted',
'Travel',
'TV Sci-Fi & Fantasy',
'Thriller',
'Arts',
'Independent Movies',
'Music Videos and Concerts',
'Spanish-Language TV Shows',
'Military and War',
'Buddy',
'Parody',
'Musical',
'Historical',
'Concert Film',
'LGBTQ',
'Cult Movies',
'Disaster',
'Faith and Spirituality',
'Anthology',
'Crime',
'Game Show / Competition',
'Talk Show and Variety',
'Reality',
'Sci-Fi & Fantasy',
'Movies',
'Young Adult Audience',
'Romantic Comedy',
'LGBTQ Movies',
'Dance',
'Romantic Movies',
'Superhero',
'Fitness',
'Talk Show']

temp = []
for i in unique_genre:
    if i not in temp:
        temp.append(i)

len(temp)

100
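
The strip-then-deduplicate step can also be written without an explicit loop. The sketch below uses a hypothetical short list `genres`; it relies on `dict.fromkeys` keeping the first occurrence of each key in insertion order.

```python
# Hypothetical genre labels with stray spaces, as produced by splitting on ","
genres = [" Kids", "Drama", "Kids", "Drama ", "Family"]

stripped = [g.strip() for g in genres]

# dict keys are unique and keep insertion order, so this removes repeats
# while preserving the order of first appearance
unique = list(dict.fromkeys(stripped))
print(unique)
```

This explains the drop from 168 to 100 entries: labels that differ only by a leading space collapse into one.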

unique_genre=temp.copy()

genre_dict = {}

for val in unique_genre:
    genre_dict[val] = 0

unique_genre = [x.strip(' ') for x in unique_genre]

new_df = data[data['listed_in'].notna()]

for listed_in in unique_genre:
    count = new_df[new_df['listed_in'].str.contains(listed_in)].shape[0]
    genre_dict[listed_in] = count
count

genre_count = pd.DataFrame(
    data={'Genre': [val for val in genre_dict.keys()],
          'Count': [val for val in genre_dict.values()]},
    columns=['Genre', 'Count'],
).sort_values(by='Count', ascending=False).reset_index(drop=True)

genre_count

Genre Count
0 Drama 3265
1 International 2516
2 Movies 2387
3 Dramas 1943
4 International Movies 1545
.. ... ...
95 Concert Film 2
96 Disaster 2
97 Romantic Comedy 2
98 Fitness 1
99 Travel 1

[100 rows x 2 columns]
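
The counts above rely on `str.contains`, which matches substrings, so a title listed under 'Dramas' is also counted under 'Drama'. As a minimal sketch of an alternative, assuming each 'listed_in' value is a comma-separated string, genres could be counted as whole tokens instead. The sample values and the helper name `count_genres` below are hypothetical and are not part of the project code.

```python
from collections import Counter

# Hypothetical sample of 'listed_in' values (not taken from the dataset)
listed_in_values = [
    "Drama, International",
    "Dramas, International Movies",
    "Drama, Comedy",
    None,  # a missing genre entry
]

def count_genres(values):
    """Count each genre as a whole comma-separated token."""
    counts = Counter()
    for value in values:
        if not value:  # skip missing or empty entries
            continue
        # a set avoids counting the same genre twice within one title
        tokens = {token.strip() for token in value.split(",") if token.strip()}
        counts.update(tokens)
    return counts

genre_counts = count_genres(listed_in_values)
print(genre_counts.most_common(3))
```

With whole-token counting, 'Drama' and 'Dramas' stay separate categories; whether they should be merged is a separate cleaning decision.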

print(unique_genre)

['Animals & Nature', 'Documentary', 'Medical', 'Animation', 'Kids', 'Crime TV

Shows', 'TV Dramas', 'TV Thrillers', 'Docuseries', 'Family', 'Classic & Cult
TV', "Kids' TV", 'TV Action & Adventure', 'Science & Nature TV', 'Action', 'S
cience Fiction', 'Arts', 'Entertainment', 'and Culture', 'Action-Adventure',
'Unscripted', 'Anime Series', 'International TV Shows', 'Korean TV Shows', 'R
omantic TV Shows', 'Drama', 'Romance', 'Comedies', 'Dramas', 'International M
ovies', 'Suspense', 'Horror Movies', 'Thrillers', 'Historical', 'TV Mysteries
', 'Documentaries', 'Biographical', 'British TV Shows', 'Spanish-Language TV
Shows', 'Special Interest', 'TV Comedies', 'Reality TV', 'Stand-Up Comedy', '
International', 'Comedy', 'Horror', 'Sports', 'Teen TV Shows', 'Stand-Up Come
dy & Talk Shows', 'TV Horror', 'TV Sci-Fi & Fantasy', 'Adventure', 'Music & M
usicals', 'Music Videos and Concerts', 'Independent Movies', 'TV Shows', 'LGB
TQ Movies', 'Action & Adventure', 'Classic Movies', 'Fantasy', 'Mystery', 'Fa
ith and Spirituality', 'Children & Family Movies', 'Coming of Age', 'Variety'
, 'Reality', 'Survival', 'Sports Movies', 'Romantic Movies', 'Sci-Fi & Fantas
y', 'Western', 'Musical', 'Cult Movies', 'Buddy', 'Game Show / Competition',
'LGBTQ', 'Military and War', 'Anthology', 'Talk Show and Variety', 'Faith & S
pirituality', 'Anime', 'Crime', 'Anime Features', 'Spy/Espionage', 'Dance', '
Music', 'Arthouse', 'Lifestyle', 'Soap Opera / Melodrama', 'Young Adult Audie
nce', 'Travel', 'Thriller', 'Parody', 'Concert Film', 'Disaster', 'Movies', '
Romantic Comedy', 'Superhero', 'Fitness', 'Talk Show']
unique_genre

['Animals & Nature',


plt.figure(figsize=(12,10))
plt.grid(axis='x',color='black', linestyle = ':', alpha=0.5)
plt.title('Top 10 TV Show Genres', fontname='monospace', fontsize=25, y=1.05)
a = sns.barplot(x='Count', y='Genre', data=genre_count[:10], palette='rocket')

# label each bar with its count, looping over the ten plotted genres
genres = genre_count['Genre'][:10].tolist()
for i, val in enumerate(genres):
    x_val = genre_count[genre_count['Genre'] == val]['Count'].values[0]
    a.text(y=i, x=x_val - 300,
           s=str(x_val),
           fontsize=14, fontname='monospace', color='white')

# thicken the bottom and left spines, hide the right and top ones
for q in [a]:
    for w in ['bottom', 'left']:
        q.spines[w].set_linewidth(1.5)
    for w in ['right', 'top']:
        q.spines[w].set_visible(False)

plt.xlabel('Count', fontsize=15)
plt.ylabel('Genre', fontsize=15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.show()

The above graph shows the top ten TV show and movie genres.
plt.subplots(figsize=(8,6))
sns.histplot(data["release_year"],kde=False, color="green")

<AxesSubplot:xlabel='release_year', ylabel='Count'>
The histogram above shows the number of titles by release year: 2020 has the highest
count, and the earliest years, around 1920, have the lowest.
print("TV Shows with highest IMDb ratings are= ")
print((data.sort_values("IMDb",ascending=False).head(20))['title'])

TV Shows with highest IMDb ratings are=

0 Jingle Pols
1 Bluey
2 Breaking Bad
3 Alaska Animal Rescue
4 Avatar: The Last Airbender
5 Avatar: The Last Airbender
6 Our Planet
7 Cosmos
8 Harmony with A R Rahman
9 Cosmos: Possible Worlds
10 Clarkson's Farm
15 Okupas
18 Word of Honor
17 Word of Honor
16 The Last Dance
12 Reply 1988
14 Heartland Docs, DVM
13 My Mister
11 Fullmetal Alchemist: Brotherhood
25 The Imagineering Story
Name: title, dtype: object

plt.subplots(figsize=(8,6))
sns.barplot(x="IMDb", y="title", data=data.sort_values("IMDb", ascending=False).head(20))
<AxesSubplot:xlabel='IMDb', ylabel='title'>

The TV shows with the highest IMDb ratings are shown as a bar plot, sorted in descending
order of rating.

print("TV Shows with lowest IMDb ratings are= ")

print((data.sort_values("IMDb",ascending=True).head(20))['title'])

TV Shows with lowest IMDb ratings are=

8537 Racket Boys
8536 Finding Jesus
8534 Aerials
8535 Izzie's Way Home
8533 Terror at Bigfoot Pond
8532 Jonas Brothers: The Concert Experience
8531 Himmatwala
8530 Virus Shark
8529 Hampton's Legion
8528 Race 3
8527 Myriam Fares: The Journey
8526 Cross: Rise of the Villains
8525 Stinger
8524 Snitch'd
8522 Hajwala: The Missing Engine
8523 Maximum Impact
8521 Student of the Year 2
8520 Romina
8519 Deewana Main Deewana
8518 Who's Your Caddy?
Name: title, dtype: object

#barplot of rating
plt.subplots(figsize=(8,6))
sns.barplot(x="IMDb", y="title", data=data.sort_values("IMDb", ascending=True).head(20))
<AxesSubplot:xlabel='IMDb', ylabel='title'>

The TV shows with the lowest IMDb ratings are shown as a bar plot, sorted in ascending
order of rating.
#Overall data of IMDb ratings

plt.figure(figsize=(16, 6))

sns.scatterplot(data=data['IMDb'])
plt.ylabel("Rating")
plt.xlabel('Movies')
plt.title("IMDb Rating Distribution")

Text(0.5, 1.0, 'IMDb Rating Distribution')

The figure above is a scatter plot of the IMDb rating of every title, plotted against its row index.

print("TV Shows with highest Rotten Tomatoes scores are= ")
print((data.sort_values("Rotten Tomatoes",ascending=False).head(20))['title'])

TV Shows with highest Rotten Tomatoes scores are=

2 Breaking Bad
1051 The Irishman
237 Dangal
83 Stranger Things
930 Mary Poppins
29 David Attenborough: A Life on Our Planet
21 Attack on Titan
204 Loki
46 Better Call Saul
49 The Mandalorian
363 Tumbbad
45 Dark
50 Peaky Blinders
1479 The Social Dilemma
465 The Walking Dead
55 Dark
5 Avatar: The Last Airbender
4 Avatar: The Last Airbender
92 The Boys
473 Article 15
Name: title, dtype: object

#barplot of rating
plt.subplots(figsize=(8,6))
sns.barplot(x="Rotten Tomatoes", y="title", data=data.sort_values("Rotten Tomatoes", ascending=False).head(20))
<AxesSubplot:xlabel='Rotten Tomatoes', ylabel='title'>

The TV shows with the highest Rotten Tomatoes scores are shown as a bar plot, sorted in
descending order of score.
print("TV Shows with lowest Rotten Tomatoes scores are= ")
print((data.sort_values("Rotten Tomatoes",ascending=True).head(20))['title'])

TV Shows with lowest Rotten Tomatoes scores are=

8833 Pinkfong! Healthy Habit Songs
8673 Field of Stars
8674 Dropping the Soap
8677 Chasing November
8681 Sirenetta & the Second Star
8682 Mexico Untamed
8689 Best Job Ever
8690 Fearless Adventures with Jack Randall
8693 Wild Russia
8700 Falz Experience
8703 The Confrontation
8722 Uchimura Summers
8724 SAS Rogue Warriors
8725 Quark Science
8727 Pinkfong! Christmas Carols
8728 Little Big Awesome
8752 The Last Bomb of the Second World War
8764 ChuChuTV Bedtime Stories & Moral Stories for K...
8672 Learn with Ted The Train
8765 ChuChuTV Surprise Eggs Learning Videos (English)
Name: title, dtype: object
#barplot of rating
plt.subplots(figsize=(8,6))
sns.barplot(x="Rotten Tomatoes", y="title", data=data.sort_values("Rotten Tomatoes", ascending=True).head(20))

<AxesSubplot:xlabel='Rotten Tomatoes', ylabel='title'>

The TV shows with the lowest Rotten Tomatoes scores are shown as a bar plot, sorted in
ascending order of score.
#Overall data of Rotten Tomatoes scores

plt.figure(figsize=(16, 6))
sns.scatterplot(data=data['Rotten Tomatoes'])
plt.ylabel("Rotten Tomatoes score")
plt.xlabel('Movies')
plt.title("Rotten Tomatoes Score Distribution")

Text(0.5, 1.0, 'Rotten Tomatoes Score Distribution')

ANALYSIS OF DATA ACROSS PLATFORMS
NETFLIX
#selecting netflix shows
netflix=data[data["Platform"]=="Netflix"]

print("Number of shows on Netflix= ", len(netflix))

Number of shows on Netflix= 4954

plt.subplots(figsize=(8,6))
sns.histplot(netflix["release_year"],kde=False, color="blue")

<AxesSubplot:xlabel='release_year', ylabel='Count'>

The histogram above shows the number of Netflix titles by release year: 2020 has the
highest count, and the years around 1960 have the lowest.
plt.subplots(figsize=(8,6))
sns.histplot(netflix["rating"],kde=False, color="cyan")

<AxesSubplot:xlabel='rating', ylabel='Count'>

From the above graph we can infer that TV-MA is the most common rating among Netflix
shows, with a count of more than 2000, and that NC-17 is the least common, with zero shows.
plt.subplots(figsize=(8,6))
sns.distplot(netflix["IMDb"],kde=False, color="purple")

c:\users\91812\appdata\local\programs\python\python39\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='IMDb'>
From the histogram we can observe the number of shows at each rating. Nearly 350 shows
have an IMDb rating of about 7.5/10, and approximately 2 shows fall in the highest bin
(the top-rated Netflix title, Breaking Bad, has 9.4).

plt.subplots(figsize=(8,6))
sns.distplot(netflix["Rotten Tomatoes"],kde=False, color="blue")

<AxesSubplot:xlabel='Rotten Tomatoes'>
From the histogram we can observe the number of shows at each score. Nearly 400 shows
have a Rotten Tomatoes score of about 58/100, and approximately 2 shows have a score
of 100/100.
print("Netflix Shows with highest IMDb ratings are= ")
print((netflix.sort_values("IMDb",ascending=False).head(10))['title'])

Netflix Shows with highest IMDb ratings are=

2 Breaking Bad
6 Our Planet
4 Avatar: The Last Airbender
11 Fullmetal Alchemist: Brotherhood
12 Reply 1988
13 My Mister
15 Okupas
16 The Last Dance
17 Word of Honor
24 Leah Remini: Scientology and the Aftermath
Name: title, dtype: object
PRIME VIDEO

#selecting prime videos shows

Prime=data[data["Platform"]=="Prime Video"]

print("Number of shows on Prime videos= ", len(Prime))

Number of shows on Prime videos= 2902

plt.subplots(figsize=(8,6))
sns.histplot(Prime["release_year"],kde=False, color="blue")

<AxesSubplot:xlabel='release_year', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Prime["rating"],kde=False, color="cyan")

<AxesSubplot:xlabel='rating', ylabel='Count'>

plt.subplots(figsize=(8,6))
sns.distplot(Prime["IMDb"],kde=False, color="purple")

<AxesSubplot:xlabel='IMDb'>
plt.subplots(figsize=(8,6))
sns.distplot(Prime["Rotten Tomatoes"],kde=False, color="blue")

<AxesSubplot:xlabel='Rotten Tomatoes'>
print("Prime video Shows with highest IMDb ratings are= ")
print((Prime.sort_values("IMDb",ascending=False).head(10))['title'])

Prime video Shows with highest IMDb ratings are=

5 Avatar: The Last Airbender
7 Cosmos
8 Harmony with A R Rahman
10 Clarkson's Farm
18 Word of Honor
20 Firefly
22 Special Forces
27 The Untamed
57 Uncle Tom
61 The Family Man
Name: title, dtype: object
DISNEY

#selecting Disney shows

Disney=data[data["Platform"]=="Disney"]

print("Number of shows on Disney= ", len(Disney))

Number of shows on Disney= 978

plt.subplots(figsize=(8,6))
sns.histplot(Disney["release_year"],kde=False, color="blue")

<AxesSubplot:xlabel='release_year', ylabel='Count'>
plt.subplots(figsize=(8,6))
sns.histplot(Disney["rating"],kde=False, color="cyan")

<AxesSubplot:xlabel='rating', ylabel='Count'>

plt.subplots(figsize=(8,6))
sns.distplot(Disney["IMDb"],kde=False, color="purple")

<AxesSubplot:xlabel='IMDb'>
plt.subplots(figsize=(8,6))
sns.distplot(Disney["Rotten Tomatoes"],kde=False, color="blue")

<AxesSubplot:xlabel='Rotten Tomatoes'>
print("Disney Shows with highest IMDb ratings are= ")
print((Disney.sort_values("IMDb",ascending=False).head(10))['title'])

Disney Shows with highest IMDb ratings are=

0 Jingle Pols
1 Bluey
3 Alaska Animal Rescue
9 Cosmos: Possible Worlds
14 Heartland Docs, DVM
25 The Imagineering Story
28 Critter Fixers: Country Vets
31 Incredible! The Story of Dr. Pol
49 The Mandalorian
41 One Strange Rock
Name: title, dtype: object
STOPWORDS AND WORDCLOUD BASED ANALYSIS

#Taking the values

titles=data["title"].values

#Joining into a single string

text=' '.join(titles)

len(text)

153309

text[1000:1500]

"Masters: Rust to Riches Bo Burnham: Inside Navillera The Family Man The Caro
l Burnett Show House Chappelle's Show Invincible Code Geass: Lelouch of the R
ebellion WWII in HD WWII in HD The Universe Victorian Farm Downton Abbey Haik
yu!! Puffin Rock Downton Abbey Dave Chappelle House of Cards The Repair Shop
Moving Art Norm Macdonald Has a Show Demon Slayer: Kimetsu no Yaiba Anne with
an E Crash Landing on You Stranger Things Arrested Development The Marvelous
Mrs. Maisel Line of Duty Fleabag Sky T"

#Removing the punctuation

text = re.sub(r'[^\w\s]','',text)

len(text)

150872

#Punctuation has been removed

text[1000:1500]

'iches Bo Burnham Inside Navillera The Family Man The Carol Burnett Show Hous
e Chappelles Show Invincible Code Geass Lelouch of the Rebellion WWII in HD W
WII in HD The Universe Victorian Farm Downton Abbey Haikyu Puffin Rock Downto
n Abbey Dave Chappelle House of Cards The Repair Shop Moving Art Norm Macdona
ld Has a Show Demon Slayer Kimetsu no Yaiba Anne with an E Crash Landing on Y
ou Stranger Things Arrested Development The Marvelous Mrs Maisel Line of Duty
Fleabag Sky Tour The Movie Lenox Hill '
#Creating the tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

#Tokenizing the text

tokens = tokenizer.tokenize(text)

len(tokens)

25049

tokens[1000:1010]

['to',
'the',
'Edge',
'Sense8',
'Patriot',
'Hunting',
'ISIS',
'Breathe',
'Tumbbad',
'Travel']

#now we shall make everything lowercase for uniformity

#to hold the new lower case words

words = []

# Looping through the tokens and make them lower case

for word in tokens:
    words.append(word.lower())

#Stop words are generally the most common words in a language.

#English stop words from nltk.

stopwords = nltk.corpus.stopwords.words('english')

#nltk.download()

words_new = []

#Now we need to remove the stop words from the words variable
#Appending to words_new all words that are in words but not in stopwords

for word in words:
    if word not in stopwords:
        words_new.append(word)

freq_dist = nltk.FreqDist(words_new)
#Frequency Distribution Plot
plt.subplots(figsize=(20,12))
freq_dist.plot(50)

<AxesSubplot:xlabel='Samples', ylabel='Counts'>

Stop words are generally the most common words in a language. From the frequency
distribution above, which plots the 50 most frequent words, 'love' has the highest frequency
and 'dead' the lowest among those plotted for the given dataset of TV show titles.
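
The pipeline above (join the titles, strip punctuation, tokenize, lowercase, drop stop words, count) can be sketched end to end with the standard library alone. This is an illustrative sketch: the sample titles, the tiny stop-word set, and the function name `title_word_counts` are assumptions, whereas the report itself uses `nltk.corpus.stopwords.words('english')` and `nltk.FreqDist`.

```python
import re
from collections import Counter

# Hypothetical titles and a tiny stop-word set, for illustration only
sample_titles = ["Love on the Farm", "The Walking Dead", "Love, Death & Robots"]
sample_stopwords = {"the", "on", "a", "of", "and"}

def title_word_counts(titles, stopwords):
    """Return word frequencies of the titles after removing stop words."""
    text = " ".join(titles)                    # join into a single string
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    tokens = re.findall(r"\w+", text.lower())  # tokenize and lowercase
    return Counter(t for t in tokens if t not in stopwords)

word_counts = title_word_counts(sample_titles, sample_stopwords)
print(word_counts.most_common(1))  # [('love', 2)]
```

`Counter.most_common(n)` plays the same role here as the first `n` entries plotted by `freq_dist.plot(n)`.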
WORD CLOUD WITH STOPWORDS

#converting into string

res=' '.join([i for i in words_new if not i.isdigit()])

#wordcloud

plt.subplots(figsize=(16,10))
wordcloud = WordCloud(
    stopwords=STOPWORDS,
    background_color='white',
    max_words=100,
    width=1400,
    height=1200
).generate(res)

plt.imshow(wordcloud)
plt.title('TV Show Title WordCloud 100 Words')
plt.axis('off')
plt.show()
Word clouds are graphical representations of word frequency that give greater prominence
to words that appear more frequently in a source text. The larger a word appears in the
visual, the more common it is in the dataset. This word cloud shows 100 words from the TV
show titles, with 'love' as the most frequently used word.
plt.subplots(figsize=(16,10))
wordcloud = WordCloud(
    stopwords=STOPWORDS,
    background_color='white',
    max_words=500,
    width=1400,
    height=1200
).generate(res)

plt.imshow(wordcloud)
plt.title('TV Show Title WordCloud 500 Words')
plt.axis('off')
plt.show()

From the above word cloud we can infer the 500 most commonly used words in TV show
titles. 'Love' is the most commonly used word, followed by 'Christmas', 'girl', 'life', 'man'
and so on.
print("Netflix Shows with lowest IMDb ratings are= ")
print((netflix.sort_values("IMDb",ascending=True).head(10))['title'])

Netflix Shows with lowest IMDb ratings are=

8537 Racket Boys
8534 Aerials
8531 Himmatwala
8527 Myriam Fares: The Journey
8526 Cross: Rise of the Villains
8522 Hajwala: The Missing Engine
8520 Romina
8519 Deewana Main Deewana
8515 Game Winning Hit
8509 Joker
Name: title, dtype: object

print("Netflix Shows with highest Rotten Tomatoes score are= ")

print((netflix.sort_values("Rotten Tomatoes",ascending=False).head(10))['title'])

Netflix Shows with highest Rotten Tomatoes score are=

2 Breaking Bad
1051 The Irishman
237 Dangal
83 Stranger Things
21 Attack on Titan
29 David Attenborough: A Life on Our Planet
46 Better Call Saul
50 Peaky Blinders
1479 The Social Dilemma
465 The Walking Dead
Name: title, dtype: object

print("Netflix Shows with lowest Rotten Tomatoes score are= ")

print((netflix.sort_values("Rotten Tomatoes",ascending=True).head(10))['title'])

Netflix Shows with lowest Rotten Tomatoes score are=

8791 Dear Affy
8541 Lock Your Girls In
8555 Monty Python Conquers America
8764 ChuChuTV Bedtime Stories & Moral Stories for K...
8765 ChuChuTV Surprise Eggs Learning Videos (English)
8703 The Confrontation
8603 Dhia Sofea
8604 Be with Me
8700 Falz Experience
8614 You're Everything To Me
Name: title, dtype: object
#Taking the title and rating data

netflix1=netflix.sort_values("IMDb",ascending=False).head(100)[['title',"IMDb"]]
netflix1.head()

title IMDb
2 Breaking Bad 9.4
6 Our Planet 9.3
4 Avatar: The Last Airbender 9.3
11 Fullmetal Alchemist: Brotherhood 9.1
12 Reply 1988 9.1

From the above observation, Breaking Bad has the highest rating at 9.4, followed by Our
Planet and Avatar: The Last Airbender at 9.3. Fullmetal Alchemist: Brotherhood and
Reply 1988 have the lowest rating of the five rows shown, 9.1.
#Converting it into a tuple

tuples_netflix_imdb = [tuple(x) for x in netflix1.values]

#Looks like this

tuples_netflix_imdb[0:10]

[('Breaking Bad', 9.4),

('Our Planet', 9.3),
('Avatar: The Last Airbender', 9.3),
('Fullmetal Alchemist: Brotherhood', 9.1),
('Reply 1988', 9.1),
('My Mister', 9.1),
('Okupas', 9.1),
('The Last Dance', 9.1),
('Word of Honor', 9.1),
('Leah Remini: Scientology and the Aftermath', 9.0)]

#Making a wordcloud

wordcloud_netflix_imdb = WordCloud(width=1400,height=1200).generate_from_frequencies(dict(tuples_netflix_imdb))

plt.subplots(figsize=(12,12))
plt.imshow(wordcloud_netflix_imdb)
plt.title("TV Shows based on IMDb rating(Top 100)")

Text(0.5, 1.0, 'TV Shows based on IMDb rating(Top 100)')

#Taking the title value and Rotten Tomatoes Score

netflix2=netflix.sort_values("Rotten Tomatoes",ascending=False).head(100)[['title',"Rotten Tomatoes"]]
netflix2.head()

title Rotten Tomatoes

2 Breaking Bad 100.0
1051 The Irishman 98.0
237 Dangal 97.0
83 Stranger Things 96.0
21 Attack on Titan 95.0

#Converting to Tuple

tuples_netflix_tomatoes = [tuple(x) for x in netflix2.values]

#Word Cloud generation

wordcloud_netflix_tomatoes = WordCloud(width=1400,height=1200).generate_from_frequencies(dict(tuples_netflix_tomatoes))

plt.subplots(figsize=(12,12))
plt.imshow(wordcloud_netflix_tomatoes)

plt.title("TV Shows based on Rotten Tomatoes Score(Top 100)")

Text(0.5, 1.0, 'TV Shows based on Rotten Tomatoes Score(Top 100)')

CLUSTER ANALYSIS

#Taking the relevant data

ratings=data[["title",'IMDb',"Rotten Tomatoes"]]
ratings.head()

title IMDb Rotten Tomatoes

0 Jingle Pols 9.6 44.0
1 Bluey 9.6 71.0
2 Breaking Bad 9.4 100.0
3 Alaska Animal Rescue 9.4 42.0
4 Avatar: The Last Airbender 9.3 93.0

len(ratings)

8834

ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8834 entries, 0 to 8833
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 8834 non-null object
1 IMDb 8538 non-null float64
2 Rotten Tomatoes 8827 non-null float64
dtypes: float64(2), object(1)
memory usage: 207.2+ KB

label1=data[['IMDb','Rotten Tomatoes']]

label1=label1.dropna()

from sklearn.cluster import KMeans

import numpy as np
kmeans = KMeans(n_clusters= 10)

#predict the labels of clusters.

label = kmeans.fit_predict(label1)

print(label)

[5 8 6 ... 3 9 3]
from sklearn.cluster import KMeans
import numpy as np
# k means
kmeans = KMeans(n_clusters=3, random_state=0)
label1['cluster'] = kmeans.fit_predict(label1[['IMDb', 'Rotten Tomatoes']])
# get centroids
centroids = kmeans.cluster_centers_
cen_x = [i[0] for i in centroids]
cen_y = [i[1] for i in centroids]
## add to df
label1['cen_x'] = label1.cluster.map({0:cen_x[0], 1:cen_x[1], 2:cen_x[2]})
label1['cen_y'] = label1.cluster.map({0:cen_y[0], 1:cen_y[1], 2:cen_y[2]})
# define and map colors
colors = ['#DF2020', '#81DF20', '#2095DF']
label1['c'] = label1.cluster.map({0:colors[0], 1:colors[1], 2:colors[2]})

import matplotlib.pyplot as plt

plt.scatter(label1.IMDb, label1["Rotten Tomatoes"], c=label1.c, alpha = 0.6, s=10)

<matplotlib.collections.PathCollection at 0x2801ebe6520>
#Removing the data

ratings=ratings.dropna()

ratings["IMDb"]=ratings["IMDb"]*10

#New data

ratings.head()

title IMDb Rotten Tomatoes

0 Jingle Pols 96.0 44.0
1 Bluey 96.0 71.0
2 Breaking Bad 94.0 100.0
3 Alaska Animal Rescue 94.0 42.0
4 Avatar: The Last Airbender 93.0 93.0

#Input data

X=ratings[["IMDb","Rotten Tomatoes"]]


#Scatterplot of the input data

plt.figure(figsize=(10,6))
sns.scatterplot(x = 'IMDb',y = 'Rotten Tomatoes', data = X ,s = 60 )
plt.xlabel('IMDb rating (multiplied by 10)')
plt.ylabel('Rotten Tomatoes')
plt.title('IMDb rating (multiplied by 10) vs Rotten Tomatoes Score')
plt.show()
#Importing KMeans from sklearn

from sklearn.cluster import KMeans

wcss=[]

for i in range(1,11):
    km=KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)

wcss

[2658782.8560552797,
1416468.3483010572,
971006.2047273932,
744730.8390479111,
595017.879951996,
508191.7320234856,
436636.2687543623,
384902.21546598733,
344940.126989912,
313659.9308850105]
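
Each value in the list above is scikit-learn's `inertia_`, the sum of squared distances from every sample to its closest cluster centre. As a rough sketch of what that quantity measures, the function below computes a within-cluster sum of squares by hand for a given labelling. The function name `wcss_for_labels` and the four sample points are hypothetical and only illustrate the calculation; they are not taken from the dataset.

```python
def wcss_for_labels(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # the centroid is the per-dimension mean of the cluster members
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

# Hypothetical (IMDb x 10, Rotten Tomatoes) pairs
points = [(90.0, 95.0), (92.0, 97.0), (40.0, 20.0), (42.0, 22.0)]

one_cluster = wcss_for_labels(points, [0, 0, 0, 0])   # everything in one cluster
two_clusters = wcss_for_labels(points, [0, 0, 1, 1])  # two tight clusters
print(one_cluster, two_clusters)  # 8133.0 8.0
```

The sharp drop from one cluster to two is the kind of bend the elbow curve looks for: beyond the elbow, adding clusters reduces WCSS only slowly.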

ELBOW CURVE
#The elbow curve

plt.figure(figsize=(12,6))

plt.plot(range(1,11),wcss)

plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")

plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")

plt.show()
#Taking 3 clusters

km=KMeans(n_clusters=3)

#Fitting the input data

km.fit(X)

KMeans(n_clusters=3)

#predicting the labels of the input data

y=km.predict(X)

#adding the labels to a column named label

ratings["label"] = y

#The new dataframe with the clustering done

ratings.head()

title IMDb Rotten Tomatoes label

0 Jingle Pols 96.0 44.0 1
1 Bluey 96.0 71.0 0
2 Breaking Bad 94.0 100.0 0
3 Alaska Animal Rescue 94.0 42.0 1
4 Avatar: The Last Airbender 93.0 93.0 0

#Scatterplot of the clusters

plt.figure(figsize=(10,6))
sns.scatterplot(x = 'IMDb',y = 'Rotten Tomatoes',hue="label",
                palette=['orange','red','green'], legend='full',data = ratings ,s = 60 )

plt.xlabel('IMDb rating(Multiplied by 10)')

plt.ylabel('Rotten Tomatoes score')
plt.title('IMDb rating(Multiplied by 10) vs Rotten Tomatoes score')
plt.show()
print('Number of Cluster 0 TV Shows are=')
print(len(ratings[ratings["label"]==0]))
print("--------------------------------------------")
print('Number of Cluster 1 TV Shows are=')
print(len(ratings[ratings["label"]==1]))
print("--------------------------------------------")
print('Number of Cluster 2 TV Shows are=')
print(len(ratings[ratings["label"]==2]))
print("--------------------------------------------")

Number of Cluster 0 TV Shows are=

2581
--------------------------------------------
Number of Cluster 1 TV Shows are=
3510
--------------------------------------------
Number of Cluster 2 TV Shows are=
2447
--------------------------------------------

print('TV Shows in cluster 0')

print(ratings[ratings["label"]==0]["title"].values)

TV Shows in cluster 0
['Bluey' 'Breaking Bad' 'Avatar: The Last Airbender' ...
'After We Collided' 'The Twilight Saga: Eclipse'
'Masters of the Universe: Revelation']

print('TV Shows in cluster 1')

print(ratings[ratings["label"]==1]["title"].values)

TV Shows in cluster 1
['Jingle Pols' 'Alaska Animal Rescue' 'Harmony with A R Rahman' ...
'Dismissed' 'Dismissed' 'The Operative']
print('TV Shows in cluster 2')

print(ratings[ratings["label"]==2]["title"].values)

TV Shows in cluster 2
['Tjovitjo' 'Kibaoh Klashers' 'Robozuna' ... "Izzie's Way Home"
'Finding Jesus' 'Racket Boys']

MACHINE LEARNING USING KNN ALGORITHM

data.description=data.description.str.replace(" ",",",regex=True)

data

show_id Platform type \

director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
8829 NaN
8830 Alejandro Ruax, Ramiro Mart??nez
8831 NaN
8832 Rock Davis, Jay Rodriguez
8833 NaN

country date_added release_year rating \

0 NaN November 12, 2019 2013 16+
1 Australia, United Kingdom May 28, 2021 2019 all
2 United States August 2, 2013 2013 18+
3 United States September 1, 2021 2019 13+
4 United States May 15, 2020 2007 13+
... ... ... ... ...
8829 South Korea April 16, 2021 2021 16+
8830 NaN January 6, 2021 2021 18+
8831 NaN NaN 2021 all
8832 NaN NaN 2021 16+
8833 NaN NaN 2021 all

description IMDb Rotten Tomatoes

0 Nat,Geo,WILD,re-joins,the,Pols,in,central,Mich... 9.6 44.0
1 Bluey,is,a,six,year-old,Blue,Heeler,dog,,who,t... 9.6 71.0
2 A,high,school,chemistry,teacher,dying,of,cance... 9.4 100.0
3 Conservation,heroes,rescue,and,rehabilitate,th... 9.4 42.0
4 Siblings,Katara,and,Sokka,wake,young,Aang,from... 9.3 93.0
... ... ... ...
8829 From,the,quirky,to,the,scandalous,,any,relatio... NaN 16.0
8830 The,irrepressible,Ratones,Paranoicos,,Argentin... NaN 12.0
8831 Adventures,of,Leo,and,friends,continue,in,a,ne... NaN 10.0
8832 Years,after,a,life-altering,robbery,,a,home,he... NaN 46.0
8833 Sing,with,Pinkfong,and,learn,how,to,form,healt... NaN 10.0

[8834 rows x 15 columns]

data = pd.read_csv('Ottdataset.csv',encoding='unicode_escape')
data.info()
data.head()[:20]

show_id Platform type title director \

cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...

country date_added release_year rating \

description IMDb Rotten Tomatoes

data.IMDb=data.IMDb.str.replace("/10","",regex=True)

data['IMDb'] = pd.to_numeric(data['IMDb'],errors='coerce')

data['Rotten Tomatoes']=data['Rotten Tomatoes'].str.replace("/100","",regex=True)
data['Rotten Tomatoes'] = pd.to_numeric(data['Rotten Tomatoes'],errors='coerce')

data = data.rename(columns = {"listed_in":"genre"})

data.rating=data.rating.str.replace("/10","",regex=True)

plt.subplots(figsize=(10,10))
list1 = []
for i in data['genre'].dropna():
    # each entry is a comma-separated string; collect the individual genres
    list1.extend(g.strip() for g in i.split(','))
top_genres = pd.Series(list1).value_counts()[:10].sort_values(ascending=True)
ax = top_genres.plot.barh(width=0.9,color=sns.color_palette('hls',10))
for i, v in enumerate(top_genres.values):
    ax.text(.8, i, v,fontsize=12,color='white',weight='bold')
plt.title('Top Genres')
plt.show()
RECOMMENDATION SYSTEM USING K-NEAREST NEIGHBORS:
PREDICT IMDB SCORES
data.head(10)

show_id Platform type title \

0 977 Disney Movie Jingle Pols
1 209 Disney TV Show Bluey
2 7391 Netflix TV Show Breaking Bad
3 119 Disney TV Show Alaska Animal Rescue
4 3970 Netflix TV Show Avatar: The Last Airbender
5 13448 Prime Video TV Show Avatar: The Last Airbender
6 5389 Netflix TV Show Our Planet
7 13038 Prime Video Movie Cosmos
8 12507 Prime Video TV Show Harmony with A R Rahman
9 327 Disney TV Show Cosmos: Possible Worlds

director \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 Elliot Weaver, Zander Weaver
8 NaN
9 NaN

cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...
5 NaN
6 David Attenborough
7 Tom England, Arjun Singh Panam, Joshua Ford, B...
8 A R Rahman, Sajith Vijayan, Bahauddin Dagar, B...
9 Neil deGrasse Tyson

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
3 United States September 1, 2021 2019 TV-PG
4 United States May 15, 2020 2007 TV-Y7
5 NaN NaN 2008 TV-Y7
6 United States, United Kingdom April 5, 2019 2019 TV-PG
7 NaN NaN 2019 13+
8 NaN NaN 2018 NR
9 United States December 25, 2020 2020 TV-14

duration genre \
0 45 min Animals & Nature, Documentary, Medical
1 2 Seasons Animation, Kids
2 5 Seasons Crime TV Shows, TV Dramas, TV Thrillers
3 2 Seasons Animals & Nature, Docuseries, Family
4 3 Seasons Classic & Cult TV, Kids' TV, TV Action & Adven...
5 3 Seasons Kids
6 1 Season Docuseries, Science & Nature TV
7 129 min Action, Science Fiction
8 1 Season Arts, Entertainment, and Culture, Documentary
9 1 Season Action-Adventure, Docuseries, Family

description IMDb Rotten Tomatoes

0 Nat Geo WILD re-joins the Pols in central Mich... 9.6 44.0
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6 71.0
2 A high school chemistry teacher dying of cance... 9.4 100.0
3 Conservation heroes rescue and rehabilitate th... 9.4 42.0
4 Siblings Katara and Sokka wake young Aang from... 9.3 93.0
5 Fire promises to be the most exciting season y... 9.3 93.0
6 Experience our planet's natural beauty and exa... 9.3 82.0
7 Three amateur astronomers accidentally interce... 9.3 82.0
8 Harmony with A.R Rahman' is a curated explorat... 9.2 52.0
9 COSMOS: POSSIBLE WORLDS continues Carl Sagan'... 9.2 62.0

RECOMMENDATION SYSTEM USING K-NEAREST NEIGHBORS:

PREDICT IMDB SCORES

Recommendation systems are becoming increasingly important in today’s hectic world.

People are always on the lookout for the products and services that are best suited to them.
In this part of the code, we will cover the basics of recommendation systems and
build a movie recommendation system using collaborative filtering by implementing the
K-Nearest Neighbors algorithm.

CONTENT-BASED FILTERING
These filtering methods are based on a description of each item and a profile of the user’s
preferred choices. In a content-based recommendation system, keywords describe the
items, and a user profile records the type of item the user likes. In other words, the
algorithm tries to recommend products that are similar to the ones the user has liked in
the past.
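
A minimal sketch of this idea, assuming each item is described by a set of keywords and the user profile is the union of the keywords of liked items. The catalog, the titles, the helper names, and the choice of Jaccard similarity are illustrative assumptions, not the method used later in this report.

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical catalog: title -> descriptive keywords
catalog = {
    "Title A": {"crime", "drama", "thriller"},
    "Title B": {"crime", "drama"},
    "Title C": {"kids", "animation"},
    "Title D": {"drama", "romance"},
}

def recommend(liked_titles, catalog, top_n=2):
    """Rank unseen titles by keyword similarity to the user's profile."""
    # the user profile is the union of keywords from everything the user liked
    profile = set().union(*(catalog[t] for t in liked_titles))
    scored = [(jaccard(profile, keywords), title)
              for title, keywords in catalog.items()
              if title not in liked_titles]
    # highest similarity first; ties broken alphabetically for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]

recommendations = recommend(["Title A"], catalog)
print(recommendations)  # ['Title B', 'Title D']
```

A user who liked 'Title A' is pointed to 'Title B' first because it shares two of the three profile keywords, while 'Title C' shares none.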

WORKING WITH THE GENRE COLUMN

We will clean the genre column to find the genre_list.

We split individual genres using the string split() function.

data['genre'] = data['genre'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['genre'] = data['genre'].str.split(',')

Now let’s generate a list ‘genreList’ with all possible unique genres mentioned in the
dataset.

genreList = []
for index, row in data.iterrows():
    genres = row["genre"]

    for genre in genres:
        if genre not in genreList:
            genreList.append(genre)
genreList[:10] #now we have a list with unique genres
['Animals&Nature',
'Documentary',
'Medical',
'Animation',
'Kids',
'CrimeTVShows',
'TVDramas',
'TVThrillers',
'Docuseries',
'Family']

Let’s create a new column in the dataframe that holds binary values indicating whether
each genre is present for a title. First, let’s create a function that returns a list of binary
values for the genres of each movie. The ‘genreList’ built above is what each title’s genres
are compared against.

Creating the binary values for genre
def binary(genre_list):
    binaryList = []

    for genre in genreList:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)

    return binaryList

We will follow the same notation for other features like the cast, director, and the
keywords.

Applying the binary() function to the ‘genre’ column to get ‘genre_bin’
data['genre_bin'] = data['genre'].apply(lambda x: binary(x))
data['genre_bin'].head()

0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
Name: genre_bin, dtype: object
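
To make the encoding concrete, here is a minimal sketch on a five-genre vocabulary. Unlike binary() above, the hypothetical encode() helper takes the vocabulary as an explicit argument, which is an assumption made for illustration so that one function can serve several columns.

```python
# Hypothetical five-entry vocabulary, in the same spirit as genreList.
vocabulary = ["Animals&Nature", "Documentary", "Medical", "Animation", "Kids"]


def encode(items, vocabulary):
    """Return one 0/1 flag per vocabulary entry: 1 if the entry is in items."""
    present = set(items)  # set lookup keeps each membership test cheap
    return [1 if term in present else 0 for term in vocabulary]


print(encode(["Animation", "Kids"], vocabulary))
# [0, 0, 0, 1, 1]
print(encode(["Animals&Nature", "Documentary", "Medical"], vocabulary))
# [1, 1, 1, 0, 0]
```

Every title ends up with a vector of the same length, which is what allows two titles to be compared position by position later on.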

WORKING WITH THE CAST COLUMN

Now let’s do the same for the cast column.

Here we convert the cast column to string

data['cast'] = data['cast'].astype(str)
data.head()

show_id Platform type title director \

cast \
0 Dr. Pol
1 Dave Mccormack, Melanie Zanetti, Brad Elliot, ...
2 Bryan Cranston, Aaron Paul, Anna Gunn, Dean No...
3 Victoria Vosburg
4 Zach Tyler, Mae Whitman, Jack De Sena, Dee Bra...

country date_added release_year rating \

description IMDb Rotten Tomatoes

genre_bin
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...

We split individual cast members using the string split() function

data['cast'] = data['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['cast'] = data['cast'].str.split(',')

Now let’s generate a list ‘castList’ with all possible unique cast members mentioned in the
dataset.

castList = []
for index, row in data.iterrows():
    casts = row["cast"]
    for cast in casts:
        if cast not in castList:
            castList.append(cast)
castList[:10]  # now we have a list with unique cast members

['Dr.Pol',
'DaveMccormack',
'MelanieZanetti',
'BradElliot',
'Hsiao-LingTang',
'BryanCranston',
'AaronPaul',
'AnnaGunn',
'DeanNorris',
'BetsyBrandt']

Creating the binary values for cast

def binary(cast_list):
    binaryList = []
    for cast in castList:
        if cast in cast_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList

Let’s create a new column in the dataframe that holds binary values indicating whether
each cast member is present, by applying the binary() function to the ‘cast’ column to get
‘cast_bin’.

data['cast_bin'] = data['cast'].apply(lambda x: binary(x))
data['cast_bin'].head()

0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: cast_bin, dtype: object

data.head()

show_id Platform type title director \

cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...

country date_added release_year rating \

description IMDb Rotten Tomatoes

cast_bin
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

WORKING WITH THE DIRECTOR COLUMN

Here we convert the director column to string and split it into a list of directors

data['director'] = data['director'].astype(str)

data['director'] = data['director'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['director'] = data['director'].str.split(',')

data.head()

show_id Platform type title director \

cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...

country date_added release_year rating \

description IMDb Rotten Tomatoes

directorList = []
for index, row in data.iterrows():
    directors = row["director"]
    for director in directors:
        if director not in directorList:
            directorList.append(director)
directorList[:10]  # now we have a list with unique directors

['nan',
'ElliotWeaver',
'ZanderWeaver',
'YasuhiroIrie',
'GarySing',
'SamirAlAsfory',
'PeterMarcy',
'St??phaneRybojad',
'AdamWingard',
'AlastairFothergill']

Creating the binary values for director

def binary(director_list):
    binaryList = []
    for director in directorList:
        if director in director_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList

We create a new column ‘director_bin’ by applying binary() to the ‘director’ column, as we
have done earlier

data['director_bin'] = data['director'].apply(lambda x: binary(x))
data['director_bin'].head()

0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: director_bin, dtype: object

DUPLICATING DESCRIPTION TO GENERATE KEYWORDS IN A NEW COLUMN

Since we need keywords to identify similar movies and TV shows, we extract keywords
from the description column.

Converting description to string

data['description'] = data['description'].astype(str)

Duplicating the description column and naming the copy keywords

data['keywords'] = data['description']
data.head()
data.head()

show_id Platform type title director \

cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...

country date_added release_year rating \

description IMDb Rotten Tomatoes

keywords
0 Nat Geo WILD re-joins the Pols in central Mich...
1 Bluey is a six year-old Blue Heeler dog, who t...
2 A high school chemistry teacher dying of cance...
3 Conservation heroes rescue and rehabilitate th...
4 Siblings Katara and Sokka wake young Aang from...

Using the nltk stop word list to remove common English words from the keywords

from nltk.corpus import stopwords

# Requires the NLTK 'stopwords' corpus, e.g. via nltk.download('stopwords')
stop_words = stopwords.words('english')
data['keywords'] = data['keywords'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

data.head()

show_id Platform type title director \

cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...

country date_added release_year rating \

duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
description IMDb Rotten Tomatoes
\
0 Nat Geo WILD re-joins the Pols in central Mich... 9.6 44.0
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6 71.0
2 A high school chemistry teacher dying of cance... 9.4 100.0
3 Conservation heroes rescue and rehabilitate th... 9.4 42.0
4 Siblings Katara and Sokka wake young Aang from... 9.3 93.0

keywords
0 Nat Geo WILD re-joins Pols central Michigan ge...
1 Bluey six year-old Blue Heeler dog, turns ever...
2 A high school chemistry teacher dying cancer t...
3 Conservation heroes rescue rehabilitate wild a...
4 Siblings Katara Sokka wake young Aang long hib...
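
Note that row 2 still begins with "A": the comparison against the stop word list is case-sensitive, and the list contains only lowercase words. The sketch below reproduces the effect on one sentence. It deliberately uses a small hand-picked stop word set instead of the NLTK corpus, and the remove_stop_words() helper with its lowercase flag is a hypothetical illustration rather than the project code.

```python
# Small hand-picked stop word set (an assumption; the project uses NLTK's list).
STOP_WORDS = {"a", "the", "of", "in", "is", "who", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS, lowercase=True):
    """Drop stop words from text, optionally comparing case-insensitively."""
    kept = []
    for word in text.split():
        key = word.lower() if lowercase else word
        if key not in stop_words:
            kept.append(word)  # keep the original spelling of the word
    return " ".join(kept)


sentence = "A high school chemistry teacher dying of cancer"
print(remove_stop_words(sentence, lowercase=False))
# A high school chemistry teacher dying cancer
print(remove_stop_words(sentence))
# high school chemistry teacher dying cancer
```

Lowercasing only the lookup key removes capitalised stop words while leaving proper nouns such as "Katara" untouched in the output.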

Cleaning the keywords column

data.keywords=data.keywords.str.replace(" ",",",regex=True)
data['keywords'] = data['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['keywords'] = data['keywords'].str.split(',')
data.head()

show_id Platform type title director \

0 977 Disney Movie Jingle Pols [nan]
1 209 Disney TV Show Bluey [nan]
2 7391 Netflix TV Show Breaking Bad [nan]
3 119 Disney TV Show Alaska Animal Rescue [nan]
4 3970 Netflix TV Show Avatar: The Last Airbender [nan]
cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...

country date_added release_year rating \

description IMDb Rotten Tomatoes

keywords
0 [Nat, Geo, WILD, re-joins, Pols, central, Mich...
1 [Bluey, six, year-old, Blue, Heeler, dog, , tu...
2 [A, high, school, chemistry, teacher, dying, c...
3 [Conservation, heroes, rescue, rehabilitate, w...
4 [Siblings, Katara, Sokka, wake, young, Aang, l...

WORKING WITH THE KEYWORDS COLUMN

The keywords or tags contain a lot of information about the movie, and they are a key feature
in finding similar movies. For example, movies like “Avengers” and “Ant-man” may have
common keywords like superheroes or Marvel.

For analyzing keywords, we will try something different and plot a word cloud to get a
better intuition:

Converting the keywords column to string

import nltk
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS

data['keywords'] = data['keywords'].astype(str)
plt.subplots(figsize=(12,12))
stop_words = set(stopwords.words('english'))
# Pass the extra tokens as one list: update() iterates over each argument,
# so a bare string such as '...' would be added character by character
stop_words.update([',',';','!','?','.','(',')','$','#','+',':','...',' ',''])

# word_tokenize needs the NLTK 'punkt' tokenizer data to be installed
words = data['keywords'].dropna().apply(nltk.word_tokenize)
word = []
for i in words:
    word.extend(i)
word = pd.Series(word)
word = [i for i in word.str.lower() if i not in stop_words]
wc = WordCloud(background_color="black", max_words=2000, stopwords=STOPWORDS,
               max_font_size=60, width=1000, height=1000)
wc.generate(" ".join(word))
plt.imshow(wc)
plt.axis('off')
fig = plt.gcf()
fig.set_size_inches(10,10)
plt.show()

Cleaning the keywords column as a string

data['keywords'] = data['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
data['keywords'] = data['keywords'].str.split(',')
# Store each keyword list back as its string form
for i,j in zip(data['keywords'],data.index):
    list2 = i
    data.loc[j,'keywords'] = str(list2)
data['keywords'] = data['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['keywords'] = data['keywords'].str.split(',')
# Sort each keyword list alphabetically, then store it back as a string
for i,j in zip(data['keywords'],data.index):
    list2 = i
    list2.sort()
    data.loc[j,'keywords'] = str(list2)
data['keywords'] = data['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
data['keywords'] = data['keywords'].str.split(',')
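
The net effect of these passes is a sorted list of keyword tokens per title. A more direct way to reach a similar result is sketched below in plain Python. It is an alternative under stated assumptions, not the report's method: the hypothetical tokenize_keywords() helper also trims surrounding punctuation and drops empty and duplicate tokens, which the code above does not do.

```python
import string


def tokenize_keywords(text):
    """Split text on whitespace, trim surrounding punctuation, return sorted unique tokens."""
    tokens = set()
    for raw in text.split():
        token = raw.strip(string.punctuation)  # 'dog,' -> 'dog'; 'year-old' is unchanged
        if token:  # skip tokens that were punctuation only
            tokens.add(token)
    return sorted(tokens)  # uppercase letters sort before lowercase ones


print(tokenize_keywords("Bluey six year-old Blue Heeler dog, turns"))
# ['Blue', 'Bluey', 'Heeler', 'dog', 'six', 'turns', 'year-old']
```

Dropping empty tokens matters for similarity: an empty string that appears in many keyword lists would otherwise count as a shared keyword between unrelated titles.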

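Making the words_list

words_list = []
for index, row in data.iterrows():
    keywords = row["keywords"]
    for keyword in keywords:
        if keyword not in words_list:
            words_list.append(keyword)

We compute ‘words_bin’ from the keywords and filter on the IMDb score and the director
column

# Encode against words_list: the most recently defined binary() compares against
# directorList, which gives all-zero keyword vectors like those in the output below
def binary_words(keyword_list):
    return [1 if word in keyword_list else 0 for word in words_list]

data['words_bin'] = data['keywords'].apply(binary_words)
data = data[(data['IMDb']!=0)]  # removing the movies with 0 score and without director names
# 'director' holds lists at this point, so these string comparisons are always
# True and every row is kept (rows with [nan] still appear in the output below)
data = data[data['director']!='']
data = data[data['director']!='[nan]']
data = data[data['director']!='[NaN]']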

data.head(10)

show_id Platform type title \

director \
0 [nan]
1 [nan]
2 [nan]
3 [nan]
4 [nan]
5 [nan]
6 [nan]
7 [ElliotWeaver, ZanderWeaver]
8 [nan]
9 [nan]

cast \
0 [Dr.Pol]
1 [DaveMccormack, MelanieZanetti, BradElliot, Hs...
2 [BryanCranston, AaronPaul, AnnaGunn, DeanNorri...
3 [VictoriaVosburg]
4 [ZachTyler, MaeWhitman, JackDeSena, DeeBradley...
5 [nan]
6 [DavidAttenborough]
7 [TomEngland, ArjunSinghPanam, JoshuaFord, BenV...
8 [ARRahman, SajithVijayan, BahauddinDagar, Beda...
9 [NeildeGrasseTyson]

country date_added release_year rating \

0 NaN November 12, 2019 2013 TV-14
1 Australia, United Kingdom May 28, 2021 2019 TV-Y
2 United States August 2, 2013 2013 TV-MA
3 United States September 1, 2021 2019 TV-PG
4 United States May 15, 2020 2007 TV-Y7
5 NaN NaN 2008 TV-Y7
6 United States, United Kingdom April 5, 2019 2019 TV-PG
7 NaN NaN 2019 13+
8 NaN NaN 2018 NR
9 United States December 25, 2020 2020 TV-14

duration genre \
0 45 min [Animals&Nature, Documentary, Medical]
1 2 Seasons [Animation, Kids]
2 5 Seasons [CrimeTVShows, TVDramas, TVThrillers]
3 2 Seasons [Animals&Nature, Docuseries, Family]
4 3 Seasons [Classic&CultTV, KidsTV, TVAction&Adventure]
5 3 Seasons [Kids]
6 1 Season [Docuseries, Science&NatureTV]
7 129 min [Action, ScienceFiction]
8 1 Season [Arts, Entertainment, andCulture, Documentary]
9 1 Season [Action-Adventure, Docuseries, Family]

description IMDb Rotten Tomatoes

\
0 Nat Geo WILD re-joins the Pols in central Mich... 9.6 44.0
1 Bluey is a six year-old Blue Heeler dog, who t... 9.6 71.0
2 A high school chemistry teacher dying of cance... 9.4 100.0
3 Conservation heroes rescue and rehabilitate th... 9.4 42.0
4 Siblings Katara and Sokka wake young Aang from... 9.3 93.0
5 Fire promises to be the most exciting season y... 9.3 93.0
6 Experience our planet's natural beauty and exa... 9.3 82.0
7 Three amateur astronomers accidentally interce... 9.3 82.0
8 Harmony with A.R Rahman' is a curated explorat... 9.2 52.0
9 COSMOS: POSSIBLE WORLDS continues Carl SaganÆ’... 9.2 62.0

genre_bin \
0 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
5 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, ...
7 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
8 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...

cast_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

director_bin \
0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

keywords \
0 [Christmas, Geo, Michigan, Nat, Pols, WILD, ce...
1 [, Blue, Bluey, Heeler, adventures., dog, ever...
2 [A, cancer, chemistry, crystal, dying, familys...
3 [AmericaÆ\\\\x92??s, Conservation, animals, fr...
4 [, Aang, Avatar, Fire, Katara, Nation., Siblin...
5 [Aang, Azula, Ba, Black, Day, Fire, Firelord, ...
6 [Experience, ambitious, beauty, change, climat...
7 [, Three, accidentally, alien, amateur, astron...
8 [, , A., A.R, Harmony, Indian, IndiaÆ\\\\x92??...
9 [40, COSMOS:, Carl, POSSIBLE, SaganÆ\\\\x92??s...

words_bin
0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

CREATING new_id FOR EACH ROW

We generate a new column, new_id, that identifies each movie and TV show by its row
position

new_id = list(range(0,data.shape[0]))
data['new_id']=new_id
data=data[['title','genre','IMDb','genre_bin','cast_bin','new_id','director','director_bin','words_bin']]
data.head()

title genre
\
0 Jingle Pols [Animals&Nature, Documentary, Medical]
1 Bluey [Animation, Kids]
2 Breaking Bad [CrimeTVShows, TVDramas, TVThrillers]
3 Alaska Animal Rescue [Animals&Nature, Docuseries, Family]
4 Avatar: The Last Airbender [Classic&CultTV, KidsTV, TVAction&Adventure]

IMDb genre_bin \
0 9.6 [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 9.6 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 9.4 [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
3 9.4 [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
4 9.3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...

cast_bin new_id director \

0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0 [nan]
1 [0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 1 [nan]
2 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2 [nan]
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 3 [nan]
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 4 [nan]

Below we define a function Similarity, which measures how far apart two movies or TV
shows are. It returns the sum of four cosine distances (genre, cast, director, and keywords),
so a smaller value means the two titles are more similar.

from scipy import spatial

def Similarity(show_id1, show_id2):
    a = data.iloc[show_id1]
    b = data.iloc[show_id2]

    def distance(column):
        u, v = a[column], b[column]
        # Assumption: an all-zero vector has no direction, so the cosine distance
        # is undefined; treat such a pair as maximally distant
        if not any(u) or not any(v):
            return 1.0
        return spatial.distance.cosine(u, v)

    genreDistance = distance('genre_bin')
    scoreDistance = distance('cast_bin')       # cast overlap
    directDistance = distance('director_bin')
    wordsDistance = distance('words_bin')      # keywords, not the director vectors
    return genreDistance + directDistance + scoreDistance + wordsDistance

Let’s check the Similarity between 2 random movies

Similarity(12,7)

4.0
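
Each of the four cosine distances lies between 0 and 1 for these binary vectors, so 4.0 is the
largest value the function can return. Reply 1988 and Cosmos have nothing in common on
any of the compared features, so they are very different titles.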
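
To see where such values come from, the cosine distance 1 - (u·v)/(|u||v|) can be worked out by hand on short binary vectors. The sketch below reimplements the formula in plain Python for illustration; the cosine_distance() name and the choice to return 1.0 for an all-zero vector are assumptions of this example, not the behaviour of scipy.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - (u . v) / (|u| |v|) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat an all-zero vector as sharing nothing
    return 1.0 - dot / (norm_u * norm_v)


same = cosine_distance([1, 1, 0], [1, 1, 0])      # identical genres  -> about 0.0
partial = cosine_distance([1, 1, 0], [1, 0, 0])   # one shared genre  -> about 0.293
disjoint = cosine_distance([1, 0, 0], [0, 1, 1])  # nothing in common -> 1.0
print(round(same, 3), round(partial, 3), round(disjoint, 3))
```

Summing four such distances gives a value between 0 (identical on every feature) and 4 (nothing shared on any feature).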

print(data.iloc[12])
print(data.iloc[7])

title Reply 1988

genre [InternationalTVShows, KoreanTVShows, Romantic...
IMDb 9.1
genre_bin [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
cast_bin [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
new_id 12
director [nan]
director_bin [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
words_bin [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 12, dtype: object
title Cosmos
genre [Action, ScienceFiction]
IMDb 9.3
genre_bin [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
cast_bin [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
new_id 7
director [ElliotWeaver, ZanderWeaver]
director_bin [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
words_bin [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 7, dtype: object
SCORE PREDICTOR

We will now build the score predictor. The main function working under the hood is the
Similarity() function, which computes the distance between the selected title and every other
title so that the 10 most similar ones can be found. The predicted score for the selected title
is the average of the IMDb scores of these 10 neighbours.
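
Before the full implementation, the averaging step can be sketched on toy numbers. The distances, ratings, and the predict_from_neighbours() helper below are hypothetical and only illustrate the idea of sorting by distance, keeping the K closest titles, and averaging their ratings.

```python
def predict_from_neighbours(distances, ratings, k):
    """Return the k closest titles and the mean of their ratings.

    distances: {title: distance to the selected title}
    ratings:   {title: known rating}
    """
    nearest = sorted(distances, key=distances.get)[:k]  # smallest distance first
    predicted = sum(ratings[title] for title in nearest) / len(nearest)
    return nearest, predicted


distances = {"A": 0.5, "B": 2.0, "C": 1.0, "D": 3.5}
ratings = {"A": 8.0, "B": 6.0, "C": 7.0, "D": 9.0}

nearest, predicted = predict_from_neighbours(distances, ratings, k=2)
print(nearest, predicted)
# ['A', 'C'] 7.5
```

Title D has the highest rating but is the farthest away, so it does not influence the prediction when K=2.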

import operator

def predict_score(name):
    # name = input('Enter a movie title: ')
    # Select the first title that contains the given text
    new_movie = data[data['title'].str.contains(name)].iloc[0].to_frame().T
    print('Selected Movie: ', new_movie.title.values[0])

    def getNeighbors(baseMovie, K):
        distances = []
        for index, movie in data.iterrows():
            if movie['new_id'] != baseMovie['new_id'].values[0]:
                dist = Similarity(baseMovie['new_id'].values[0], movie['new_id'])
                distances.append((movie['new_id'], dist))
        # Smallest distance first, then keep the K closest titles
        distances.sort(key=operator.itemgetter(1))
        return distances[:K]

    K = 10
    avgRating = 0
    neighbors = getNeighbors(new_movie, K)

    print('\nRecommended to Watch: \n')
    for neighbor in neighbors:
        row = data.iloc[neighbor[0]]
        avgRating = avgRating + row['IMDb']
        print(row['title'] + " | Genres: " + str(row['genre']).strip('[]').replace(' ','') + " | Rating: " + str(row['IMDb']))

    print('\n')
    avgRating = avgRating/K
    print('The predicted rating for %s is: %f' % (new_movie['title'].values[0], avgRating))
    print('The actual rating for %s is %f' % (new_movie['title'].values[0], new_movie['IMDb'].values[0]))

Now we simply call predict_score('title name') with the name of a movie or TV show to get
the 10 most similar titles and the predicted rating.

predict_score('Cosmos')

Selected Movie: Cosmos

Recommended to Watch:

The Saturn V Story | Genres: 'Documentary' | Rating: 7.6

The predicted rating for Cosmos is: 7.340000

The actual rating for Cosmos is 9.300000
predict_score('Better Call Saul')

Selected Movie: Better Call Saul

Recommended to Watch:

Breaking Bad | Genres: 'CrimeTVShows','TVDramas','TVThrillers' | Rating: 9.4

The predicted rating for Better Call Saul is: 7.880000

The actual rating for Better Call Saul is 8.800000

Thus, we have completed the Movie Recommendation System implementation using the
K-Nearest Neighbors algorithm.

Sidenote: K Value

In this project, we have arbitrarily chosen the value K=10. In other applications of KNN,
however, choosing K is not simple. A small value of K means that noise will have a higher
influence on the result. Practitioners usually choose an odd K when the number of classes
is 2, so that a vote cannot end in a tie; another simple rule of thumb is to set K=sqrt(n),
where n is the number of samples.
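
The two rules of thumb can be combined in a small helper. This is a minimal sketch under stated assumptions: the sqrt_k() name and the choice to round up to the next odd number are illustrative, not a standard definition.

```python
import math


def sqrt_k(n_samples):
    """Rule-of-thumb K: round sqrt(n) to an integer, then bump even values to odd."""
    k = max(1, round(math.sqrt(n_samples)))
    if k % 2 == 0:
        k += 1  # an odd K avoids tied votes in two-class problems
    return k


print(sqrt_k(50), sqrt_k(100), sqrt_k(1))
# 7 11 1
```

Such a value is only a starting point; comparing the predicted and actual ratings for several values of K is a more reliable way to choose it.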
