Overview
These slides describe clustering the 250 top-rated movies on IMDb. Features such as genre, actors, and directors were used, and plot-summary text was represented with TF-IDF. Dimensionality was reduced with PCA, and K-means, EM (GMM), DBSCAN, and hierarchical clustering were applied. K-means identified five clusters with interpretable themes; DBSCAN found five clusters grouped by director and genre, with fewer outliers. Future work includes scaling the predictors, choosing the best k, alternative distance measures, and hierarchical clustering.


Clustering of Top 250 Movies from IMDb

Mustafa Panbiharwala and Siddharth Ajit


Broad goals
• Find out whether there are intrinsic patterns or clusters among top-rated movies (based on IMDb ratings).
• Identify similarities among movies in the same cluster.
• Use dimensionality reduction to improve clustering output.
• Repeat the above steps for different clustering algorithms.
Data acquisition
• Movie features were acquired directly from OMDb's database through its public API.
• The API returns each movie's IMDb ID as one of its outputs.
• The IMDb ID was used to scrape plot summaries from IMDB (https://www.imdb.com/chart/top); a sketch of this step follows.
• OMDb's API produced the following output (screenshot of the returned fields omitted).
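As an illustration, a minimal Python sketch of the acquisition step, using the `requests` library and a placeholder API key (the slides do not show their actual code):

```python
import requests

def fetch_movie(imdb_id, api_key):
    """Fetch one movie's features (Title, Year, Genre, Director,
    Actors, Plot, Language, Country, imdbID, ...) from OMDb."""
    resp = requests.get("https://www.omdbapi.com/",
                        params={"i": imdb_id, "apikey": api_key})
    resp.raise_for_status()
    return resp.json()

# "YOUR_KEY" is a placeholder; tt0111161 is The Shawshank Redemption.
movie = fetch_movie("tt0111161", api_key="YOUR_KEY")
print(movie["Title"], movie["Genre"])
```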


Feature selection
• Features used for clustering: year, runtime, genre, director, actors, plot, language, country.
• Certain features, such as year of release, were discretized into categorical variables by choosing a suitable cutoff value.
• For text-heavy columns, important words were pulled after cleaning and sorting by TF-IDF score.
(Figures summarizing the Genre, Actors, Directors, Language, and Countries features omitted.)
Feature representation
• The Genre, Actors, Directors, Language, and Countries features were one-hot encoded (sketched below).
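The slides only state that these features were one-hot encoded; since fields like Genre or Actors hold several comma-separated values per movie, a multi-label encoding such as the following is one plausible way to do it (toy rows, using scikit-learn's `MultiLabelBinarizer`):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows standing in for the OMDb "Genre" field.
movies = pd.DataFrame({
    "Title": ["Jaws", "Toy Story"],
    "Genre": ["Adventure, Thriller", "Animation, Comedy"],
})

# Split the comma-separated field into lists, then binarize:
# one 0/1 column per distinct genre.
mlb = MultiLabelBinarizer()
onehot = mlb.fit_transform(movies["Genre"].str.split(", "))
encoded = pd.DataFrame(onehot, columns=mlb.classes_, index=movies["Title"])
print(encoded)
```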
Text processing of plot summaries
These are standard procedures for NLP problems (a short sketch follows the list):
• Stripping all punctuation, stop words, etc.
• Tokenizing – breaking text into sentences and words.
• Stemming – reducing inflected words to their root form.
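A minimal sketch of this pipeline using NLTK (illustrative only; the slides do not specify which library was used):

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# One-time setup: nltk.download("punkt"); nltk.download("stopwords")

def preprocess(plot):
    # Lowercase and strip punctuation.
    plot = plot.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(plot)                 # tokenize into words
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Drop stop words, then stem what remains.
    return [stemmer.stem(t) for t in tokens if t not in stops]

print(preprocess("Two imprisoned men bond over a number of years."))
# roughly -> ['two', 'imprison', 'men', 'bond', 'number', 'year']
```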
TF-IDF
• Stands for term frequency–inverse document frequency.
• Reveals the importance of a word to a document in a corpus (a code sketch follows).

TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log(total number of documents / number of documents containing term t)
TF-IDF(t) = TF(t) × IDF(t)
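In practice this can be computed with scikit-learn (stand-in plot strings; note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF formula above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

plots = [
    "A young boy and his father flee from the war.",
    "Police pursue a gang after a brutal murder.",
]  # stand-ins for the 250 plot summaries

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(plots)        # sparse (n_docs, n_terms) matrix

# Rank terms by their highest TF-IDF score in any document.
scores = np.asarray(tfidf.max(axis=0).todense()).ravel()
order = np.argsort(scores)[::-1]
print([vec.get_feature_names_out()[i] for i in order[:5]])
```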


Terms with TF-IDF scores
• The list was combed through for relevant words.
• The top 20 words were selected and one-hot encoded.

Words selected: father, police, family, men, young, war, child, home, son, love, money, friend, mother, escape, boy, girl, murder, brother, german, gang
Final Dataframe
Curse of dimensionality
• The final data frame is sparse.
• As the predictor space explodes, clustering becomes difficult for several reasons: the effect of noise and high correlation among predictors.
• Approach: dimensionality reduction (PCA).

Principal component vectors
• An orthogonal transformation converts a set of correlated variables into linearly uncorrelated variables called principal components.
• The first principal component has the highest variance among all the principal components.
• The top 24 principal components were used for clustering; they explained 99.8% of the variance in our dataset.
PCA solution
• Solve via singular value decomposition: S = AᵀΛA.
• The rows of A contain the eigenvectors of S.
• Λ is a diagonal matrix holding the eigenvalue for each eigenvector.
• Retain the top q eigenvalues of Λ.
• The predictor subspace has then been mapped from p dimensions down to q dimensions (sketched below).
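A sketch of this reduction with scikit-learn (random stand-in data in place of the real one-hot feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((250, 120))     # stand-in for the one-hot feature matrix

pca = PCA(n_components=24)     # keep the top 24 components, as in the slides
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (250, 24)
print(pca.explained_variance_ratio_.sum())  # the slides report ~0.998 on their data
```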
Predictors sorted by loading vector weights
t-SNE 2-D visualization
Clustering
• K-means
• EM clustering
• DBSCAN
• Hierarchical clustering
K-means clustering (sketched in code below)
• Randomly assign data points to clusters 1…K.
• Compute the centroid of each cluster.
• Calculate the distance from each data point to each centroid and reassign every point to its nearest centroid.
• Repeat the above steps until no data points are reassigned.
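These steps translate almost line for line into NumPy. A from-scratch sketch for illustration (in practice a library implementation such as `sklearn.cluster.KMeans` would normally be used):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))       # step 1: random assignment
    for _ in range(n_iter):
        # Step 2: centroid of each cluster (re-seed a cluster if it empties).
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j)
                              else X[rng.integers(len(X))]
                              for j in range(k)])
        # Step 3: distances to centroids; reassign to the nearest one.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # step 4: stop when stable
            break
        labels = new_labels
    return labels, centroids
```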
EM clustering (GMM)

Choosing K (K-means)
• K = 5 (from DBSCAN).
• Cluster formations were difficult to interpret at higher K values.
• Minimum SSE was obtained at k = 8.
K-means results – Cluster 0
• Number of clusters: 5
• Theme: family, biography
• Genre: fantasy, comedy, biography

Movies: Star Wars: Episode V, Star Wars: Episode IV, The Great Dictator, Mad Max: Fury Road, Whiplash, Jaws, The Passion of Joan of Arc
K-means results – Cluster 1
• Theme: son, family, murder
• Runtime is very high in comparison to the median value.
• Genre: dark, history, biography

Movies: The Godfather Part I, The Godfather Part II, Schindler's List, Once Upon a Time in America, Gandhi, Gone with the Wind, Ben-Hur
K-means results – Cluster 2
• Theme: war, kill
• Runtime is very high in comparison to the median value.
• Genre: action, crime, thriller

Movies: The Dark Knight, The Dark Knight Rises, Pulp Fiction, Inglourious Basterds, Django Unchained, Scarface, There Will Be Blood, Once Upon a Time in the West
K-means results – Cluster 3
• Theme: young
• Genre: comedy, animation, romance

Movies: Spirited Away, Finding Nemo, Modern Times, City Lights, Casablanca, Toy Story, Hachi: A Dog's Tale, Monty Python, The Truman Show
K-means results – Cluster 4
• Theme: home, family
• Actor: Al Pacino
• Genre: thriller, crime

Movies: The Shawshank Redemption, The Matrix, The Shining, Terminator 2, A Beautiful Mind, To Kill a Mockingbird, Se7en, The Prestige, Good Will Hunting, V for Vendetta, La La Land, Catch Me If You Can
DBSCAN (a usage sketch follows)
• DBSCAN is a density-based clustering algorithm.
• Robust for outlier detection.
• The most common distance metric used is Euclidean distance; for high-dimensional data in particular, this metric can be rendered almost useless by the so-called "curse of dimensionality".
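A usage sketch with scikit-learn (random stand-in data; the `eps` and `min_samples` values are illustrative guesses, since the slides do not report the parameters used):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((250, 24))          # stand-in for the 24 PCA components

db = DBSCAN(eps=1.5, min_samples=4).fit(X)
labels = db.labels_                # label -1 marks points flagged as outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "outliers:", int((labels == -1).sum()))
```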
DBSCAN results – Cluster 1
• Number of clusters: 5
• Director: Hayao Miyazaki
• Theme: young, world
• Genre: animation, adventure

Movies: Spirited Away, Finding Nemo, Nausicaä of the Valley of the Wind, WALL-E, The Lion King
Cluster 2
• Director: Christopher Nolan
• Theme: help, dark
• Actors: Christian Bale, Leonardo DiCaprio
• Genre: drama, thriller

Movies: The Dark Knight, Inception, The Prestige, Batman Begins
Cluster 3
• Actor: Charlie Chaplin
• Theme: love, family
• Genre: comedy, drama

Movies: City Lights, Modern Times, The Great Dictator, The Gold Rush
Cluster 4
• Director: Stanley Kubrick
• Theme: war, men, German
• Genre: drama, autobiography
• All the movies in this cluster have extremely high runtimes.

Movies: Barry Lyndon, Gandhi, Judgment at Nuremberg, Lawrence of Arabia, Schindler's List, Seven Samurai
Cluster 5
• Director: Clint Eastwood
• Theme: war, gang
• Genre: action, fantasy
• Movies in this cluster are also longer than the median runtime.

Movies: The Lord of the Rings series, The Good, the Bad and the Ugly, Mad Max: Fury Road, Gangs of Wasseypur, Rashomon, Yojimbo
Outlier clusters
• DBSCAN assigned many data points as outliers.
• Why? DBSCAN looks for spatial density when clustering, and in higher dimensions it is difficult to find such dense regions.
• The clusters formed by DBSCAN are easier to interpret than those from K-means.
Future work
• Scaling of predictors.
• In K-means clustering, finding the value of k that is optimal according to the gap statistic and silhouette score while retaining good interpretability (see the sketch below).
• Different distance measures, such as Jaccard and Gower distance.
• Hierarchical clustering.
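For the k-selection step, a minimal silhouette sweep might look like this (random stand-in data; scikit-learn has no built-in gap statistic, so only the silhouette score is shown):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((250, 24))          # stand-in for the PCA-reduced features

# Score each candidate k; higher silhouette is better.
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```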
