Overview
These slides describe clustering the 250 top-rated movies on IMDb. Features such as genre, actors, and directors were used, and plot-summary text was represented with TF-IDF. Dimensionality was reduced with PCA, and K-means, EM (GMM), DBSCAN, and hierarchical clustering were applied. K-means identified five clusters with interpretable themes; DBSCAN found five clusters grouped by director and genre, with fewer outliers. Future work includes scaling the predictors, choosing the best k, alternative distance measures, and hierarchical clustering.


Clustering of Top 250 Movies from IMDb

Mustafa Panbiharwala and Siddharth Ajit


Broad goals
• Find out whether there are intrinsic patterns or clusters among top-rated movies (based on IMDb ratings).
• Identify similarities among movies in the same cluster.
• Use dimensionality reduction to improve clustering output.
• Repeat the above steps for different clustering algorithms.
Data acquisition
• Movie features were acquired directly from OMDb's database through its public API.
• The API returns each movie's IMDb ID as one of its outputs.
• The IMDb ID was used to scrape plot summaries from IMDB (https://www.imdb.com/chart/top); a sketch of this step follows.
• OMDb's API produced the following output (screenshot of the returned fields omitted).
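As an illustration, a minimal Python sketch of the acquisition step, using the `requests` library and a placeholder API key (the slides do not show their actual code):

```python
import requests

def fetch_movie(imdb_id, api_key):
    """Fetch one movie's features (Title, Year, Genre, Director,
    Actors, Plot, Language, Country, imdbID, ...) from OMDb."""
    resp = requests.get("https://www.omdbapi.com/",
                        params={"i": imdb_id, "apikey": api_key})
    resp.raise_for_status()
    return resp.json()

# "YOUR_KEY" is a placeholder; tt0111161 is The Shawshank Redemption.
movie = fetch_movie("tt0111161", api_key="YOUR_KEY")
print(movie["Title"], movie["Genre"])
```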


Feature selection
• Features used for clustering: year, runtime, genre, director, actors, plot, language, country.
• Certain features, such as year of release, were discretized into categorical variables by choosing a suitable cutoff value.
• For text-heavy columns, important words were pulled after cleaning and sorting by TF-IDF score.
(Figures summarizing the Genre, Actors, Directors, Language, and Countries features omitted.)
Feature representation
• The Genre, Actors, Directors, Language, and Countries features were one-hot encoded (sketched below).
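The slides only state that these features were one-hot encoded; since fields like Genre or Actors hold several comma-separated values per movie, a multi-label encoding such as the following is one plausible way to do it (toy rows, using scikit-learn's `MultiLabelBinarizer`):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows standing in for the OMDb "Genre" field.
movies = pd.DataFrame({
    "Title": ["Jaws", "Toy Story"],
    "Genre": ["Adventure, Thriller", "Animation, Comedy"],
})

# Split the comma-separated field into lists, then binarize:
# one 0/1 column per distinct genre.
mlb = MultiLabelBinarizer()
onehot = mlb.fit_transform(movies["Genre"].str.split(", "))
encoded = pd.DataFrame(onehot, columns=mlb.classes_, index=movies["Title"])
print(encoded)
```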
Text processing of plot summaries
These are standard procedures for NLP problems (a short sketch follows the list):
• Stripping all punctuation, stop words, etc.
• Tokenizing – breaking text into sentences and words.
• Stemming – reducing inflected words to their root form.
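A minimal sketch of this pipeline using NLTK (illustrative only; the slides do not specify which library was used):

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# One-time setup: nltk.download("punkt"); nltk.download("stopwords")

def preprocess(plot):
    # Lowercase and strip punctuation.
    plot = plot.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(plot)                 # tokenize into words
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Drop stop words, then stem what remains.
    return [stemmer.stem(t) for t in tokens if t not in stops]

print(preprocess("Two imprisoned men bond over a number of years."))
# roughly -> ['two', 'imprison', 'men', 'bond', 'number', 'year']
```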
TF-IDF
• Stands for term frequency–inverse document frequency.
• Reveals the importance of a word to a document in a corpus (a code sketch follows).

TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log(total number of documents / number of documents containing term t)
TF-IDF(t) = TF(t) × IDF(t)
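In practice this can be computed with scikit-learn (stand-in plot strings; note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF formula above):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

plots = [
    "A young boy and his father flee from the war.",
    "Police pursue a gang after a brutal murder.",
]  # stand-ins for the 250 plot summaries

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(plots)        # sparse (n_docs, n_terms) matrix

# Rank terms by their highest TF-IDF score in any document.
scores = np.asarray(tfidf.max(axis=0).todense()).ravel()
order = np.argsort(scores)[::-1]
print([vec.get_feature_names_out()[i] for i in order[:5]])
```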


Terms with TF-IDF scores
• The list was combed through for relevant words.
• The top 20 words were selected and one-hot encoded.

Words selected: father, police, family, men, young, war, child, home, son, love, money, friend, mother, escape, boy, girl, murder, brother, german, gang
Final Dataframe
Curse of dimensionality
• The final data frame is sparse.
• As the predictor space explodes, clustering becomes difficult for several reasons: the effect of noise and high correlation among predictors.
• Approach: dimensionality reduction (PCA).

Principal component vectors
• An orthogonal transformation converts a set of correlated variables into linearly uncorrelated variables called principal components.
• The first principal component has the highest variance among all the principal components.
• The top 24 principal components were used for clustering; they explained 99.8% of the variance in our dataset.
PCA solution
• Solve via singular value decomposition: S = AᵀΛA.
• The rows of A contain the eigenvectors of S.
• Λ is a diagonal matrix holding the eigenvalue for each eigenvector.
• Retain the top q eigenvalues of Λ.
• The predictor subspace has then been mapped from p dimensions down to q dimensions (sketched below).
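A sketch of this reduction with scikit-learn (random stand-in data in place of the real one-hot feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((250, 120))     # stand-in for the one-hot feature matrix

pca = PCA(n_components=24)     # keep the top 24 components, as in the slides
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (250, 24)
print(pca.explained_variance_ratio_.sum())  # the slides report ~0.998 on their data
```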
Predictors sorted by loading vector weights
t-SNE 2-D visualization
Clustering
• K-means
• EM clustering
• DBSCAN
• Hierarchical clustering
K-means clustering (sketched in code below)
• Randomly assign data points to clusters 1…K.
• Compute the centroid of each cluster.
• Calculate the distance from each data point to each centroid and reassign every point to its nearest centroid.
• Repeat the above steps until no data points are reassigned.
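These steps translate almost line for line into NumPy. A from-scratch sketch for illustration (in practice a library implementation such as `sklearn.cluster.KMeans` would normally be used):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))       # step 1: random assignment
    for _ in range(n_iter):
        # Step 2: centroid of each cluster (re-seed a cluster if it empties).
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j)
                              else X[rng.integers(len(X))]
                              for j in range(k)])
        # Step 3: distances to centroids; reassign to the nearest one.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # step 4: stop when stable
            break
        labels = new_labels
    return labels, centroids
```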
EM clustering (GMM)

Choosing K (K-means)
• K = 5 (from DBSCAN).
• Cluster formations were difficult to interpret at higher K values.
• Minimum SSE was obtained at k = 8.
K-means results – Cluster 0
• Number of clusters: 5
• Theme: family, biography
• Genre: fantasy, comedy, biography

Movies: Star Wars: Episode V, Star Wars: Episode IV, The Great Dictator, Mad Max: Fury Road, Whiplash, Jaws, The Passion of Joan of Arc
K-means results – Cluster 1
• Theme: son, family, murder
• Runtime is very high in comparison to the median value.
• Genre: dark, history, biography

Movies: The Godfather Part I, The Godfather Part II, Schindler's List, Once Upon a Time in America, Gandhi, Gone with the Wind, Ben-Hur
K-means results – Cluster 2
• Theme: war, kill
• Runtime is very high in comparison to the median value.
• Genre: action, crime, thriller

Movies: The Dark Knight, The Dark Knight Rises, Pulp Fiction, Inglourious Basterds, Django Unchained, Scarface, There Will Be Blood, Once Upon a Time in the West
K-means results – Cluster 3
• Theme: young
• Genre: comedy, animation, romance

Movies: Spirited Away, Finding Nemo, Modern Times, City Lights, Casablanca, Toy Story, Hachi: A Dog's Tale, Monty Python, The Truman Show
K-means results – Cluster 4
• Theme: home, family
• Actor: Al Pacino
• Genre: thriller, crime

Movies: The Shawshank Redemption, The Matrix, The Shining, Terminator 2, A Beautiful Mind, To Kill a Mockingbird, Se7en, The Prestige, Good Will Hunting, V for Vendetta, La La Land, Catch Me If You Can
DBSCAN (a usage sketch follows)
• DBSCAN is a density-based clustering algorithm.
• Robust for outlier detection.
• The most common distance metric used is Euclidean distance; for high-dimensional data in particular, this metric can be rendered almost useless by the so-called "curse of dimensionality".
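A usage sketch with scikit-learn (random stand-in data; the `eps` and `min_samples` values are illustrative guesses, since the slides do not report the parameters used):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((250, 24))          # stand-in for the 24 PCA components

db = DBSCAN(eps=1.5, min_samples=4).fit(X)
labels = db.labels_                # label -1 marks points flagged as outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "outliers:", int((labels == -1).sum()))
```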
DBSCAN results – Cluster 1
• Number of clusters: 5
• Director: Hayao Miyazaki
• Theme: young, world
• Genre: animation, adventure

Movies: Spirited Away, Finding Nemo, Nausicaä of the Valley of the Wind, WALL-E, The Lion King
Cluster 2
• Director: Christopher Nolan
• Theme: help, dark
• Actors: Christian Bale, Leonardo DiCaprio
• Genre: drama, thriller

Movies: The Dark Knight, Inception, The Prestige, Batman Begins
Cluster 3
• Actor: Charlie Chaplin
• Theme: love, family
• Genre: comedy, drama

Movies: City Lights, Modern Times, The Great Dictator, The Gold Rush
Cluster 4
• Director: Stanley Kubrick
• Theme: war, men, German
• Genre: drama, autobiography
• All the movies in this cluster have extremely high runtimes.

Movies: Barry Lyndon, Gandhi, Judgment at Nuremberg, Lawrence of Arabia, Schindler's List, Seven Samurai
Cluster 5
• Director: Clint Eastwood
• Theme: war, gang
• Genre: action, fantasy
• Movies in this cluster are also longer than the median runtime.

Movies: The Lord of the Rings series, The Good, the Bad and the Ugly, Mad Max: Fury Road, Gangs of Wasseypur, Rashomon, Yojimbo
Outlier clusters
• DBSCAN assigned many data points as outliers.
• Why? DBSCAN looks for spatial density when clustering, and in higher dimensions it is difficult to find such dense regions.
• The clusters formed by DBSCAN are easier to interpret than those from K-means.
Future work
• Scaling of predictors.
• In K-means clustering, finding the value of k that is optimal according to the gap statistic and silhouette score while retaining good interpretability (see the sketch below).
• Different distance measures, such as Jaccard and Gower distance.
• Hierarchical clustering.
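For the k-selection step, a minimal silhouette sweep might look like this (random stand-in data; scikit-learn has no built-in gap statistic, so only the silhouette score is shown):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((250, 24))          # stand-in for the PCA-reduced features

# Score each candidate k; higher silhouette is better.
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```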
