APPLYING DATA MINING ALGORITHMS ON IMDb DATASET

BITS Pilani Hyderabad Campus
CS F415 Data Mining Project
Shobhit Sharma ([email protected])

ABSTRACT

This paper applies data mining algorithms to the IMDb Non-Commercial Datasets to gain insight into the data. We focus primarily on association rules.

The challenge is to learn what makes a good movie good using association rule mining.

The first part explores the data to draw out insights that inform the second part, where we apply the
Apriori algorithm to try to learn what makes a good movie good using association rules.

This paper draws some important insights from the data and shows the failure of the Apriori algorithm
when working with these datasets.

Keywords:

INTRODUCTION

Everyone enjoys watching “good” movies. But what constitutes a “good” movie? Although this is a
rather subjective question, data mining can help us understand what constitutes a good movie, and
a bad one as well.

RELATED WORK
APPROACH/METHODOLOGY

Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8
character set. The first line in each file contains headers that describe what is in each column. A ‘\N’
is used to denote that a particular field is missing or null for that title/name. The available datasets
are as follows:

title.akas.tsv.gz

 titleId (string) - a tconst, an alphanumeric unique identifier of the title

 ordering (integer) – a number to uniquely identify rows for a given titleId
 title (string) – the localized title
 region (string) - the region for this version of the title
 language (string) - the language of the title
 types (array) - Enumerated set of attributes for this alternative title. One or more of the
following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay".
New values may be added in the future without warning
 attributes (array) - Additional terms to describe this alternative title, not enumerated
 isOriginalTitle (boolean) – 0: not original title; 1: original title

title.basics.tsv.gz

 tconst (string) - alphanumeric unique identifier of the title

 titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video,
etc)
 primaryTitle (string) – the more popular title / the title used by the filmmakers on
promotional materials at the point of release
 originalTitle (string) - original title, in the original language
 isAdult (boolean) - 0: non-adult title; 1: adult title
 startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the
series start year
 endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
 runtimeMinutes – primary runtime of the title, in minutes
 genres (string array) – includes up to three genres associated with the title

title.crew.tsv.gz

 tconst (string) - alphanumeric unique identifier of the title

 directors (array of nconsts) - director(s) of the given title
 writers (array of nconsts) – writer(s) of the given title

title.episode.tsv.gz

 tconst (string) - alphanumeric identifier of episode

 parentTconst (string) - alphanumeric identifier of the parent TV Series
 seasonNumber (integer) – season number the episode belongs to
 episodeNumber (integer) – episode number of the tconst in the TV series

title.principals.tsv.gz

 tconst (string) - alphanumeric unique identifier of the title

 ordering (integer) – a number to uniquely identify rows for a given titleId
 nconst (string) - alphanumeric unique identifier of the name/person
 category (string) - the category of job that person was in
 job (string) - the specific job title if applicable, else '\N'
 characters (string) - the name of the character played if applicable, else '\N'

title.ratings.tsv.gz

 tconst (string) - alphanumeric unique identifier of the title

 averageRating – weighted average of all the individual user ratings
 numVotes - number of votes the title has received

name.basics.tsv.gz

 nconst (string) - alphanumeric unique identifier of the name/person

 primaryName (string)– name by which the person is most often credited
 birthYear – in YYYY format
 deathYear – in YYYY format if applicable, else '\N'
 primaryProfession (array of strings)– the top-3 professions of the person
 knownForTitles (array of tconsts) – titles the person is known for
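
The file format described at the start of this section (gzipped TSV, header row, ‘\N’ for missing values) could be read along the following lines. This is a minimal sketch that assumes pandas is used for loading; the helper name load_imdb_tsv and the tiny stand-in file are hypothetical and only there so the example runs without the real data.

```python
import csv
import gzip
import os
import tempfile

import pandas as pd


def load_imdb_tsv(path):
    """Read one gzipped IMDb TSV file, mapping the '\\N' marker to NaN."""
    return pd.read_csv(
        path,
        sep="\t",
        compression="gzip",
        na_values="\\N",         # IMDb's marker for a missing or null field
        quoting=csv.QUOTE_NONE,  # titles may contain unbalanced quote characters
        low_memory=False,
    )


# Tiny stand-in file (hypothetical rows) shaped like title.ratings.tsv.gz.
sample = "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2000\ntt0000002\t\\N\t12\n"
tmp_path = os.path.join(tempfile.mkdtemp(), "title.ratings.tsv.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as fh:
    fh.write(sample)

ratings = load_imdb_tsv(tmp_path)
print(ratings)
```

Mapping ‘\N’ to NaN at load time means later numeric operations (means, quantiles, year arithmetic) skip missing values instead of failing on strings.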

EXPERIMENTS

Let’s go over the gzipped files one by one, except title.episode.tsv.gz, as we are not interested in TV
series.

title.ratings.tsv.gz

The ratings table contains over 1 million movie rating entries.

The average rating distribution is negatively skewed, with the median larger than the mean. The vote
count distribution is heavily clustered around small values (0–1000 votes).

We then used the pandas.qcut function to discretise the variable into equal-sized buckets based on
rank or sample quantiles.
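
A minimal sketch of quantile bucketing with pandas.qcut is given here. The text does not say which variable was bucketed or how many buckets were used, so the column (averageRating), the four buckets and the label names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ratings; in the project this would be a column of title.ratings.
ratings = pd.Series([2.1, 3.4, 4.8, 5.5, 6.2, 7.0, 7.9, 9.1], name="averageRating")

# Four quantile-based buckets, so each bucket receives the same number of rows.
labels = ["low", "mid-low", "mid-high", "high"]
buckets = pd.qcut(ratings, q=4, labels=labels)

print(buckets.value_counts(sort=False))
```

For a heavily tied variable such as the vote count, several quantile edges can coincide and qcut raises an error; calling it without labels and with duplicates="drop" merges the coinciding edges.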

title.basics.tsv.gz

This dataset is much larger than the ratings table, with over 6 million entries, each of which describes
basic information about a video title. I use the word “video” because the table includes not only films
but also TV series, short videos (short films and music clips), and even video games.

We now plot the title type distribution.

We then trimmed the dataset to only include movies. After trimming, the majority of the titles are
now movies (82%) and the remaining 120K titles are TV movies.
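
The trimming step could look like the sketch below. It assumes the filter keeps the titleType values "movie" and "tvMovie"; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like title.basics.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "titleType": ["movie", "tvEpisode", "tvMovie", "short", "movie"],
})

# Keep only theatrical movies and TV movies.
movies = basics[basics["titleType"].isin(["movie", "tvMovie"])]

# Share of each remaining title type, as fractions that sum to 1.
share = movies["titleType"].value_counts(normalize=True)
print(share)
```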

Next, we check the distribution of genres. The genres column contains many different combinations
of genre keywords, each of which is treated as a distinct value, hence the pie chart.

We can use Scikit-Learn’s CountVectorizer feature extraction technique to detect and count each
unique genre.

We can plot the distribution of these genres; however, since a movie may have multiple genres at a
time, their shares will not add up to 100%.

Next, we plot the trend for the number of voters per year and the voter count per film per year.

The next graph has two subplots. The first shows the average film rating per year, while the
second shows the average number of voters per film per year. Average movie ratings have been
erratic since the 1920s, with no discernible upward or downward trend.

Now let's see how movie runtimes are distributed, in minutes. The single bar in the plot shows
that a few films stand out with runtimes of over 50,000 minutes.

These outliers are, for example, films shot with a camera mounted on a ship for 35 days. We will
filter them out and look at the distribution of films with a runtime of 300 minutes or less.

We now list the Top 20 movies with the highest voter count. Accordingly:

 Popular movies like Fight Club, The Matrix, and the Lord of the Rings trilogy made the list, as
expected
 Except for The Godfather, the Top 20 most-voted films are from the 90s to the present

Now that we have the combined table, let's see which 20 movies are rated highest. We simply sort
the table by average rating. As with the IMDb website's Top 250 list, we impose a minimum vote
requirement to avoid listing obscure movies with high ratings. Consequently, we only consider
movies that have at least 25,000 user ratings:
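
The ranking with a vote threshold could be sketched as follows. The 25,000-vote threshold comes from the text; the sample rows, the constant name MIN_VOTES and the use of an inner merge on tconst are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for title.basics and title.ratings.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["Film A", "Film B", "Film C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [9.5, 8.8, 9.0],
    "numVotes": [120, 500_000, 30_000],
})

MIN_VOTES = 25_000  # threshold stated in the text

combined = basics.merge(ratings, on="tconst", how="inner")
top_rated = (
    combined[combined["numVotes"] >= MIN_VOTES]
    .sort_values("averageRating", ascending=False)
    .head(20)
)
print(top_rated)
```

Here Film A has the highest rating but only 120 votes, so the threshold removes it before sorting.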

Let’s now list the 20 worst-rated films:

Let's visualize the median ratings for various film genres. We reuse the same genre counter as
before:

The documentary and news categories (both political) have the highest average ratings; sci-fi and
horror come last (perhaps because there are too many extreme examples in both).
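
One way to obtain a median rating per genre is sketched here. It uses pandas explode rather than the genre counter mentioned above, which is a substitution made to keep the example short; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical movies with comma-separated genres and their average ratings.
movies = pd.DataFrame({
    "genres": ["Documentary", "Horror,Sci-Fi", "Documentary,News", "Horror"],
    "averageRating": [8.0, 4.0, 7.0, 5.0],
})

# One row per (movie, genre) pair, then the median rating within each genre.
per_genre = (
    movies.assign(genre=movies["genres"].str.split(","))
    .explode("genre")
    .groupby("genre")["averageRating"]
    .median()
    .sort_values(ascending=False)
)
print(per_genre)
```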

We now look at the relationship between rating and runtime.

 Boxplot medians (vertical lines in each bin) show that the average rating tends to increase
when movie runtime increases
 Films with a runtime of 84 minutes or less are an exception, possibly because this group
contains many animated films, which tend to have high ratings

name.basics.tsv.gz

The dataset head is shown below.

The table includes nearly ten million names from a variety of professions, including writers,
directors, actors, and actresses. It lists the titles each person is best known for, in addition to
their birth and death years.

The birth year distribution of the persons in the dataset is displayed below. The dataset even
contains ancient writers, with birth years as early as 4 A.D.

Distribution of lifespans of persons in the dataset:

Let's look at the Top 10 individuals in the sample with the longest lifespans:

 The longest verified human life span in recorded history was that of Jeanne Louise Calment,
who passed away at the age of 122.
 She did not act in any films, but she did appear in a documentary about her, which is why
she is included in our dataset.
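
The lifespan ranking can be sketched as below, assuming lifespan is computed as deathYear minus birthYear. Since the table stores years only, the result can be off by up to one year. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like name.basics; a missing deathYear is NaN.
names = pd.DataFrame({
    "primaryName": ["Person A", "Person B", "Person C"],
    "birthYear": [1875, 1900, 1950],
    "deathYear": [1997, 1960, None],
})

# Year-level lifespan; rows without a death year are dropped before ranking.
names["lifespan"] = names["deathYear"] - names["birthYear"]
longest = names.dropna(subset=["lifespan"]).nlargest(10, "lifespan")
print(longest)
```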

title.crew.tsv.gz

This table contains the director and writer information for all the titles in IMDb, with 6.6 million entries.

 The pie chart below shows that it is less common for a film to have multiple directors (8%)
than multiple writers (44%)
 The vast majority of films (92%) have a single director; almost one third of films have a writer duo

We can also sort by writer count and director count.
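
Counting directors and writers per title could be done as in this sketch, which assumes the comma-separated nconst lists described for title.crew. The sample rows and the _count column names are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like title.crew; None stands for a missing credit.
crew = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "directors": ["nm1", "nm2,nm3", None],
    "writers": ["nm4,nm5", "nm6", "nm7,nm8,nm9"],
})

for col in ["directors", "writers"]:
    # Split the list and count its entries; missing credits count as 0.
    crew[col + "_count"] = crew[col].str.split(",").str.len().fillna(0).astype(int)

print(crew.sort_values("writers_count", ascending=False))
```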

Let's now examine directors' success analytically. First, we must define our own criteria for
success. The director's average rating is the first criterion that comes to mind. Once more, to
exclude little-known local directors, we need to set some thresholds:

 Only movies with 25,000 or more votes are included.
 Only directors with three or more films are included.
 Only directors with a median vote count of at least 100,000 per film are included.

We are searching for accomplished directors with a global presence. To accomplish this, we
must merge the four tables that we have looked at thus far. We can then apply the thresholds and
reveal the 10 most successful directors.
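
The three thresholds above can be applied as in the following sketch. It assumes a merged table with one row per film; the function name successful_directors, the column layout and the sample rows are hypothetical, and the order in which the filters are applied is an assumption.

```python
import pandas as pd


def successful_directors(films, min_votes=25_000, min_films=3,
                         min_median_votes=100_000):
    """films: one row per title with columns directors, averageRating, numVotes."""
    # Threshold 1: drop films with too few votes.
    eligible = films[films["numVotes"] >= min_votes]
    per_director = (
        eligible.assign(director=eligible["directors"].str.split(","))
        .explode("director")  # one row per (film, director) pair
        .groupby("director")
        .agg(
            film_count=("averageRating", "size"),
            mean_rating=("averageRating", "mean"),
            median_votes=("numVotes", "median"),
        )
    )
    # Thresholds 2 and 3: enough films and a high enough median vote count.
    keep = (per_director["film_count"] >= min_films) & (
        per_director["median_votes"] >= min_median_votes
    )
    return per_director[keep].sort_values("mean_rating", ascending=False)


# Hypothetical merged table.
films = pd.DataFrame({
    "directors": ["nm1", "nm1", "nm1", "nm2", "nm2", "nm2", "nm1,nm3"],
    "averageRating": [8.0, 9.0, 8.5, 7.0, 6.0, 6.5, 5.0],
    "numVotes": [200_000, 300_000, 150_000, 30_000, 40_000, 50_000, 10_000],
})
result = successful_directors(films)
print(result.head(10))
```

In the sample, nm2 has three eligible films but a median of 40,000 votes, so only nm1 passes all three thresholds.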

We visualize the directors that meet our criteria in a scatter plot.

 The font size and colour of each director's name are proportional to their average rating
 The circle diameter for each director is proportional to their total number of films
 The x-axis is displayed in log scale to avoid tightly packed dots
 Directors higher on the y-axis have higher rating averages, whereas directors on the right-
hand side of the graph have more popular films
 Christopher Nolan sits at the top right of the graph; one can say his films are both popular
and of high quality

We can do the same for the writers.

title.principals.tsv.gz

This table contains over 38 million principal cast/crew entries (actors, actresses, writers, directors,
producers, editors, etc.) associated with each film.

We can see the distribution of professions in the dataset.

Who has been busiest in the film industry? Let's find out:

William Shakespeare and Ilaiyaraaja are the people with the most credits. As an actor,
Brahmanandam appeared in over a thousand films. Apart from Shakespeare, there are not many
widely known figures on the list.

title.akas.tsv.gz

This table consists of “also known as” (AKA) information for each title, such as the variations of
the title in different countries. There are 21 million rows in the table.

Let's first translate region codes into country names, then count them and plot the Top 20:

The count indicates how many distinct films were shown in theaters in each country; the greater
the number, the more varied the films exhibited. It does not reflect the overall number of
theaters or tickets sold.
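
The per-country count of distinct titles could be computed as in this sketch. The region_names lookup is a hypothetical stand-in for whatever code-to-country mapping was used, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical rows shaped like title.akas.
akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt2", "tt3"],
    "region": ["US", "US", "US", "DE", "US"],
})
region_names = {"US": "United States", "DE": "Germany"}  # hypothetical lookup

# Count distinct titles per country, so repeated AKA rows are not double counted.
titles_per_country = (
    akas.assign(country=akas["region"].map(region_names))
    .groupby("country")["titleId"]
    .nunique()
    .sort_values(ascending=False)
    .head(20)
)
print(titles_per_country)
```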

PART TWO – APPLYING THE APRIORI ALGORITHM

First, we merge the title_rating, title_basics, name_basics and title_principals tables into one CSV
file and drop some unneeded columns and rows.

Next, we create itemsets.

Figure 1: Example of an itemset.

After that, I generated the binary (one-hot) representation of the itemsets and searched for
interesting co-occurring items using the Apriori algorithm:
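
To make the frequent-itemset search concrete, here is a from-scratch sketch of Apriori in plain Python. It is not the implementation used in the project (the text does not name one), and it works directly on sets of items rather than on a binary matrix. The item names and the 0.5 support threshold are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): support} for every frequent itemset."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in sets) / n

    # Level 1: frequent single items.
    items = sorted({item for t in sets for item in t})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical itemsets, one per film (genre, rating bucket, director).
transactions = [
    {"Drama", "rating_high", "director_nm1"},
    {"Drama", "rating_high"},
    {"Comedy", "rating_low"},
    {"Drama", "rating_low"},
]
for itemset, s in sorted(apriori(transactions, 0.5).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

Items such as individual directors appear in very few transactions, so they fall below the support threshold at level 1 and can never appear in a larger frequent itemset.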

RESULTS

The data exploration phase surfaced many research points and areas to work with.

 The mean rating is 6.89, which differs from the median (7.1); we will examine this further.
The mean number of votes per film is close to a thousand, while the median (20) is far
lower. The average rating distribution is negatively skewed, with the median larger than
the mean.
 Over 140K samples (more than 10%) have only 5 votes. Films with more than 1095 votes
are represented by a single bar in the plot, which counts around 50K titles (these are the
significant/popular films we are mostly interested in). The distribution resembles a
Poisson distribution.
 The majority of the titles are TV episodes (71%). Movies (including TV movies) make up only
10% of the dataset. Of that 10%, 82% are movies and the remaining 18% are TV movies.
 Films released in 2013, the peak year, received 37 million votes in total. The count declines
dramatically after that, although the number of films made per year keeps increasing
until 2017. More research has to be done in this area.
 Since the 1920s, average film ratings have fluctuated without any monotonic trend. The
average rating peaks in the 90s and early 2000s and has been dropping dramatically
since. Further analysis could establish whether 90s films received the most attention
from users. This may be related to the dominant age group among IMDb voters.

 The documentary and news categories (both political) have the highest average ratings.
Horror and sci-fi are at the bottom of the list (possibly due to too many extreme examples in both)
 Average rating tends to increase when movie runtime increases. Films with a runtime of
84 minutes or less are an exception, possibly because this group contains many animated
films, which tend to have high ratings
 It is less common for a film to have multiple directors (8%) than multiple writers (44%).
The vast majority of films (92%) have a single director; almost one third of films have a
writer duo.
 Christopher Nolan takes first place among directors ordered by mean film rating and also
by median vote count per film.
 The US clearly leads in the number of distinct films shown in cinemas in the country.

 Using apriori, we were able to generate the following itemsets.

Figure 2: Frequent itemsets for a minimum support of 0.1

CONCLUSION

The work done in this project is mostly exploratory in nature. We took the IMDb non-commercial
dataset and analysed the data with various graphs and visualization techniques to gain insights.
Lastly, we also analysed how well the Apriori algorithm generates meaningful associations on a
sparse, real-world dataset.

In general, I consider the results to be a significant setback for the Apriori method. To obtain any
significant rules, I had to lower the minimum support to 0.1; even then, I was unable to generate any
rules including more than two elements. I think this is because the itemsets are too sparse in this
example. It might be acceptable to lower the minimum support to 0.1, but I would still feel more at
ease with a greater support.

Another problem is that, in order to fit the binary representation of the itemset into RAM, I had to
significantly filter the original data which may lead to some loss of information.

Future work in this area could be to research deeper into the areas
