APPLYING DATA MINING
ALGORITHMS ON IMDb DATASET
BITS Pilani Hyderabad Campus
CS F415 Data Mining Project
Shobhit Sharma (
[email protected])
ABSTRACT
This paper applies data mining algorithms to the IMDb Non-Commercial Datasets to gain insight into
the data. We look primarily at association rules.
The challenge is to learn what makes a good movie good using association rule mining.
The first part explores the data to gather insights that inform the second part, where we apply the
Apriori algorithm to try to learn what makes a good movie good using association rules.
The paper reports several insights into the data and shows how the Apriori algorithm fails to work
well with these datasets.
Keywords:
INTRODUCTION
Everyone enjoys watching “good” movies. But what constitutes a “good” movie? Although this is a
rather subjective question, data mining can help us understand what constitutes a good movie and,
equally, what constitutes a bad one.
RELATED WORK
APPROACH/METHODOLOGY
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8
character set. The first line in each file contains headers that describe what is in each column. A ‘\N’
is used to denote that a particular field is missing or null for that title/name. The available datasets
are as follows:
title.akas.tsv.gz
titleId (string) - a tconst, an alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
title (string) – the localized title
region (string) - the region for this version of the title
language (string) - the language of the title
types (array) - Enumerated set of attributes for this alternative title. One or more of the
following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay".
New values may be added in the future without warning
attributes (array) - Additional terms to describe this alternative title, not enumerated
isOriginalTitle (boolean) – 0: not original title; 1: original title
title.basics.tsv.gz
tconst (string) - alphanumeric unique identifier of the title
titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video,
etc)
primaryTitle (string) – the more popular title / the title used by the filmmakers on
promotional materials at the point of release
originalTitle (string) - original title, in the original language
isAdult (boolean) - 0: non-adult title; 1: adult title
startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the
series start year
endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
runtimeMinutes – primary runtime of the title, in minutes
genres (string array) – includes up to three genres associated with the title
title.crew.tsv.gz
tconst (string) - alphanumeric unique identifier of the title
directors (array of nconsts) - director(s) of the given title
writers (array of nconsts) – writer(s) of the given title
title.episode.tsv.gz
tconst (string) - alphanumeric identifier of episode
parentTconst (string) - alphanumeric identifier of the parent TV Series
seasonNumber (integer) – season number the episode belongs to
episodeNumber (integer) – episode number of the tconst in the TV series
title.principals.tsv.gz
tconst (string) - alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
nconst (string) - alphanumeric unique identifier of the name/person
category (string) - the category of job that person was in
job (string) - the specific job title if applicable, else '\N'
characters (string) - the name of the character played if applicable, else '\N'
title.ratings.tsv.gz
tconst (string) - alphanumeric unique identifier of the title
averageRating – weighted average of all the individual user ratings
numVotes - number of votes the title has received
name.basics.tsv.gz
nconst (string) - alphanumeric unique identifier of the name/person
primaryName (string)– name by which the person is most often credited
birthYear – in YYYY format
deathYear – in YYYY format if applicable, else '\N'
primaryProfession (array of strings)– the top-3 professions of the person
knownForTitles (array of tconsts) – titles the person is known for
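The file format described above (gzipped TSV, header row, '\N' for missing values) can be read with pandas roughly as follows. This is a minimal sketch, not the code used in the project: the two sample rows are made up, and an in-memory buffer stands in for the real title.ratings.tsv.gz file.

```python
import gzip
import io

import pandas as pd

# Tiny in-memory stand-in for title.ratings.tsv.gz (the values are made up).
raw = "tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t2000\ntt0000002\t\\N\t\\N\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

# sep="\t" reads tab-separated values; na_values="\\N" turns IMDb's null
# marker into NaN; compression="gzip" is explicit because we pass a buffer
# instead of a path ending in .gz.
ratings = pd.read_csv(
    buffer, sep="\t", na_values="\\N", compression="gzip", encoding="utf-8"
)
print(ratings)
```

With a real download, the buffer would be replaced by the path to the .gz file.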
EXPERIMENTS
Let’s go over the gzipped files one by one, except title.episode.tsv.gz, as we are not interested in
TV series.
title.ratings.tsv.gz
The ratings table contains over 1 million rating entries.
The average rating distribution is negatively skewed, so the median is larger than the mean. The
vote count distribution is heavily clustered around small values (0–1000 votes).
We then used the pandas.qcut function to discretise the variable into equal-sized buckets based on
sample quantiles.
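The quantile-based discretisation can be illustrated with a short sketch. The vote counts, the number of buckets and the bucket labels below are assumptions chosen for the example, not the values used in the project.

```python
import pandas as pd

# Hypothetical vote counts with the heavy right skew described above.
votes = pd.Series([5, 8, 20, 45, 300, 1200, 25000, 900000], name="numVotes")

# q=4 asks for quartiles, so each bucket receives the same number of rows
# even though the bucket widths differ enormously.
vote_bucket = pd.qcut(votes, q=4, labels=["low", "medium", "high", "very high"])
print(vote_bucket.value_counts().sort_index())
```

On the real data, where many titles share the same small vote count, the quantile edges may coincide; pandas.qcut accepts duplicates="drop" for that case.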
title.basics.tsv.gz
The dataset is much larger than the ratings table, with over 6 million entries, each of which
describes basic information about a video title. I use the word “video” because it includes not only
films but also TV series, short videos (short films and music clips), and even video games.
We now plot the title type distribution.
We then trimmed the dataset to only include movies. After trimming, the majority of the titles are
now movies (82%) and the remaining 120K titles are TV movies.
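The trimming step can be sketched as below. The rows are made up, and the exact titleType spellings ("movie", "tvMovie") are assumed to match the values in the real file.

```python
import pandas as pd

# Hypothetical slice of title.basics with mixed title types.
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6"],
    "titleType": ["movie", "tvEpisode", "movie", "tvMovie", "short", "movie"],
})

# Share of each title type before trimming.
print(basics["titleType"].value_counts(normalize=True))

# Keep only theatrical movies and TV movies.
films = basics[basics["titleType"].isin(["movie", "tvMovie"])].reset_index(drop=True)

# Share of each remaining type after trimming.
share = films["titleType"].value_counts(normalize=True)
print(share)
```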
Next, we check the distribution of genres. The genres field lists up to three keywords per title, and
each distinct combination or ordering of keywords appears as its own category, so a pie chart of
the raw field is fragmented.
We can use scikit-learn’s CountVectorizer feature extraction technique to detect and count each
unique genre.
We can then plot the distribution of these genres; however, since a movie may have multiple genres
at a time, the percentages will not sum to 100%.
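The effect of counting individual genres rather than whole genre strings can be shown with a small example. Note that this sketch swaps in a pandas-only approach (split and explode) in place of CountVectorizer, and the rows are made up.

```python
import pandas as pd

# Hypothetical films; the first two rows hold the same genres in a different order.
films = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "genres": ["Drama,Romance", "Romance,Drama", "Sci-Fi,Action,Drama"],
})

# Split the comma-separated string into a list, then give each genre its own row.
genre_counts = films["genres"].str.split(",").explode().value_counts()

# Share of films carrying each genre; a film is counted once per genre,
# so these shares add up to more than 1.
genre_share = genre_counts / len(films)
print(genre_share)
```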
Next we plot the total number of votes per year and the number of votes per film per year.
The next graph has two subplots: the first shows the average film rating per year, and the second
shows the average number of votes per film per year. Average movie ratings have been
erratic since the 1920s, with no discernible upward or downward trend.
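The yearly series behind these plots can be computed with a single group-by. This is a sketch under the assumption that ratings and basics have already been joined into one table; the rows and the output column names are made up.

```python
import pandas as pd

# Hypothetical merged table of films with ratings.
films = pd.DataFrame({
    "startYear": [1994, 1994, 2013, 2013, 2013],
    "averageRating": [9.0, 8.0, 6.0, 7.0, 8.0],
    "numVotes": [2000, 1000, 300, 600, 900],
})

# One row per release year with the quantities plotted above.
per_year = films.groupby("startYear").agg(
    film_count=("numVotes", "size"),
    total_votes=("numVotes", "sum"),
    votes_per_film=("numVotes", "mean"),
    mean_rating=("averageRating", "mean"),
)
print(per_year)
```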
Now let's see how movie runtimes are distributed, in minutes. The single bar in the plot shows
that a few films stand out with runtimes of over 50,000 minutes.
These outliers are, for example, films shot with a camera mounted on a ship for 35 days. We filter
them out and look at the distribution of films with a runtime of 300 minutes or less.
We now list the top 20 movies by vote count. Accordingly:
Popular movies like Fight Club, The Matrix and the Lord of the Rings trilogy made the list, as
expected
Except for The Godfather, the top 20 most-voted films are from the 90s to the present
Now that we have the combined table back, let's see which are the 20 highest-rated movies. We
simply sort by average rating. As with the IMDb website's Top 250 rated movies, we
impose a requirement to avoid listing little-known movies with high ratings. Consequently, we only
consider movies that have at least 25,000 user ratings:
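The threshold-and-sort step can be sketched as follows. The film rows are made up; only the 25,000-vote threshold and the list length of 20 come from the text.

```python
import pandas as pd

# Hypothetical merged table of films with ratings.
films = pd.DataFrame({
    "primaryTitle": ["Film A", "Film B", "Film C", "Film D"],
    "averageRating": [9.8, 9.2, 8.7, 3.1],
    "numVotes": [40, 1_500_000, 30_000, 60_000],
})

MIN_VOTES = 25_000  # threshold stated in the text

# Drop little-known titles first, then rank by average rating.
eligible = films[films["numVotes"] >= MIN_VOTES]
top_rated = eligible.sort_values("averageRating", ascending=False).head(20)

# The most-voted list needs no threshold.
most_voted = films.nlargest(20, "numVotes")
print(top_rated)
```

Sorting the same eligible table in ascending order gives the worst-rated list.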
Let’s now list the 20 worst-rated films:
Let's visualize the average (median) ratings for various film genres. We make use of the same
counter as before:
Documentaries and news categories (both political) have the highest rating averages;
sci-fi and horror come in last (maybe because there are too many extreme examples in both).
We now look at the correlation between rating and runtime.
Boxplot medians (vertical lines in each bin) show that the rating tends to increase
as movie runtime increases
The exception is the group of films with a runtime of 84 minutes or less, possibly because this
group contains many animated films, which have high ratings
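The per-bin medians shown in the boxplot can be reproduced numerically. This sketch assumes the runtime bins are quantile-based and uses four bins; both are assumptions, and the rows are made up to mimic the pattern described above.

```python
import pandas as pd

# Hypothetical films: short films rate high, then ratings rise with runtime.
films = pd.DataFrame({
    "runtimeMinutes": [70, 80, 90, 95, 100, 110, 120, 150],
    "averageRating": [7.0, 7.2, 5.5, 5.8, 6.0, 6.2, 6.8, 7.5],
})

# Quantile bins hold equal numbers of films (assumed binning scheme).
films["runtime_bin"] = pd.qcut(films["runtimeMinutes"], q=4)

# Median rating within each runtime bin, shortest bin first.
median_by_bin = films.groupby("runtime_bin", observed=True)["averageRating"].median()
print(median_by_bin)
```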
name.basics.tsv.gz
The dataset head is shown below.
Nearly ten million names from a variety of professions, including writers, directors, actors, and
actresses, are included in the table. It lists the titles each person is best known for, in addition
to their birth and death years.
The birth year distribution of the persons in the dataset is displayed below. The dataset even
contains ancient writers born as early as 4 A.D.
Distribution of lifespans of persons in the dataset:
Let's look at the Top 10 individuals in the sample with the longest lifespans:
The longest verified human life span in recorded history was that of Jeanne Louise Calment,
who passed away at the age of 122.
She did not participate in any films at all, but she did appear in a documentary
about her, which is why she is included in our dataset.
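The lifespan ranking reduces to a subtraction and a sort. In this sketch the first row uses Jeanne Calment's birth and death years (1875 and 1997); the other two rows are hypothetical.

```python
import pandas as pd

# One real example row plus two hypothetical people; None marks a living person.
people = pd.DataFrame({
    "primaryName": ["Jeanne Louise Calment", "Person B", "Person C"],
    "birthYear": [1875, 1900, 1950],
    "deathYear": [1997, 1980, None],
})

# Lifespan in years; it is NaN whenever either year is missing.
people["lifespan"] = people["deathYear"] - people["birthYear"]

# Ten longest lifespans among people with both years known.
longest = people.dropna(subset=["lifespan"]).nlargest(10, "lifespan")
print(longest)
```

Because only years are stored, the computed lifespan can be off by one year from a person's true age at death.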
title.crew.tsv.gz
This table contains the director and writer information for all the titles in IMDb and has 6.6
million entries.
The pie charts show that it is less common for a film to have multiple directors (8%) than
multiple writers (44%)
The vast majority of films (92%) have a single director; almost 1/3 of films have a writer duo
We can also sort by writer count and director count.
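Director and writer counts follow from splitting the comma-separated nconst lists. The rows, the helper name count_ids and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical slice of title.crew; None stands for a missing ('\N') field.
crew = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "directors": ["nm1", "nm1,nm2", None, "nm3"],
    "writers": ["nm4,nm5", "nm6", "nm7,nm8,nm9", None],
})


def count_ids(column: pd.Series) -> pd.Series:
    """Number of comma-separated ids per row; missing values count as 0."""
    return column.str.split(",").str.len().fillna(0).astype(int)


crew["directorCount"] = count_ids(crew["directors"])
crew["writerCount"] = count_ids(crew["writers"])

# Share of titles with more than one director, among titles that list any.
has_director = crew["directorCount"] > 0
multi_director_share = (crew["directorCount"] > 1).sum() / has_director.sum()
print(crew.sort_values("writerCount", ascending=False))
```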
Let's now examine directors' success analytically. First, we must establish our own standards for
success. The director's average rating score is the first criterion that comes to mind.
Once more, in order to exclude little-known local directors, we need to set some thresholds.
Limits:
Only movies with 25,000 or more votes are included.
Only directors with three or more films are eligible.
Only directors with a median vote count of at least 100,000 are accepted.
We are searching for accomplished directors with a global presence. To do this, we
must merge the four tables that we have looked at thus far. We can then apply the thresholds and
reveal the 10 most successful directors.
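The merge and the three limits above can be sketched as follows. The thresholds are those stated in the text; the four tiny tables, the director names and the aggregate column names are made up, and the project's actual merge may have been structured differently.

```python
import pandas as pd

# Hypothetical stand-ins for the ratings, basics, crew and name tables.
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5", "tt6", "tt7"],
    "averageRating": [8.8, 8.6, 8.4, 7.0, 6.5, 6.0, 9.5],
    "numVotes": [2_000_000, 1_500_000, 1_200_000, 30_000, 40_000, 50_000, 100],
})
basics = pd.DataFrame({"tconst": ratings["tconst"], "titleType": "movie"})
crew = pd.DataFrame({
    "tconst": ratings["tconst"],
    "directors": ["nmA", "nmA", "nmA", "nmB", "nmB", "nmB", "nmA"],
})
names = pd.DataFrame({"nconst": ["nmA", "nmB"],
                      "primaryName": ["Director A", "Director B"]})

MIN_VOTES_PER_FILM = 25_000
MIN_FILMS = 3
MIN_MEDIAN_VOTES = 100_000

# Limit 1: keep only movies with enough votes.
films = ratings.merge(basics, on="tconst").merge(crew, on="tconst")
films = films[(films["titleType"] == "movie")
              & (films["numVotes"] >= MIN_VOTES_PER_FILM)]

# One row per (film, director) pair, since a film can list several directors.
credits = films.assign(nconst=films["directors"].str.split(",")).explode("nconst")

stats = credits.groupby("nconst").agg(
    films=("tconst", "nunique"),
    mean_rating=("averageRating", "mean"),
    median_votes=("numVotes", "median"),
).reset_index()

# Limits 2 and 3: enough films and a high enough median vote count.
successful = stats[(stats["films"] >= MIN_FILMS)
                   & (stats["median_votes"] >= MIN_MEDIAN_VOTES)]
top = (successful.merge(names, on="nconst")
       .sort_values("mean_rating", ascending=False)
       .head(10))
print(top)
```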
We visualize the directors that meet our criteria in a scatter plot.
The font size and colour of each director's name are proportional to their average rating
The circle diameter for each director is proportional to their total number of films
The x-axis is displayed in log scale to avoid tightly packed dots
Directors towards the top of the y-axis have higher rating averages, whereas directors on the
right-hand side of the graph have more popular films
Christopher Nolan sits at the top right of the graph; one can say his films are both popular
and of high quality
We can do the same for the writers.
title.principals.tsv.gz
It contains over 38 million entries for principal cast and crew members, such as actors/actresses,
writers, directors, producers, editors, etc., associated with each film.
We can see the distribution of professions in the dataset.
Who has been busiest in the film industry? Let's find out:
William Shakespeare and Ilaiyaraaja are the people with the most contributions. As an actor,
Brahmanandam appeared in over a thousand films. Apart from Shakespeare, there aren't many
widely known figures in the list.
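Finding the busiest people amounts to counting distinct titles per person. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical slice of title.principals.
principals = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt1", "tt2", "tt3"],
    "nconst": ["nm1", "nm1", "nm1", "nm2", "nm2", "nm3"],
    "category": ["writer", "writer", "writer", "actor", "actor", "composer"],
})

# Distinct titles per person within each category, busiest first.
busiest = (principals.groupby(["category", "nconst"])["tconst"]
           .nunique()
           .sort_values(ascending=False))
print(busiest.head(10))
```

Joining the result with name.basics on nconst would replace the identifiers with names.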
title.akas.tsv.gz
The table consists of each title’s “also known as” (AKA) information, such as the variations of the
title in different countries. There are 21 million rows in the table.
First we translate region codes into country names. Next, we tally them and plot the top 20:
The count indicates how many distinct films were shown in each country; the greater the number,
the more varied the films exhibited. It does not give the number of theaters or tickets sold.
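The region tally can be sketched as below. The three-entry REGION_NAMES dictionary is a hypothetical stand-in for a full region-code lookup, and the rows are made up.

```python
import pandas as pd

# Hypothetical slice of title.akas; a title can have several rows per region.
akas = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt1", "tt2", "tt2", "tt3"],
    "region": ["US", "US", "DE", "US", "FR", None],
})

# Hypothetical lookup table; a real one would cover every region code.
REGION_NAMES = {"US": "United States", "DE": "Germany", "FR": "France"}

# Count distinct titles per country, so duplicate AKA rows are not double counted.
titles_per_country = (
    akas.assign(country=akas["region"].map(REGION_NAMES))
    .dropna(subset=["country"])
    .groupby("country")["titleId"]
    .nunique()
    .sort_values(ascending=False)
    .head(20)
)
print(titles_per_country)
```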
PART TWO – APPLYING THE APRIORI ALGORITHM
First we merge title_rating, title_basics, name_basics and title_principals into one CSV file and
drop some unneeded columns and rows.
Next we create itemsets.
Figure 1. Example of an itemset.
After that, I generated the binary (one column per item) representation of the data and searched
for interesting co-occurring items using the Apriori algorithm:
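To make the mining step concrete, here is a minimal pure-Python sketch of Apriori. It is not the implementation used in the project: it works on sets of items rather than a binary matrix, and the item labels and the 0.5 support threshold are made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 2
    while current:
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical itemsets, one per film.
transactions = [
    {"genre=Drama", "rating=high", "runtime=long"},
    {"genre=Drama", "rating=high"},
    {"genre=Comedy", "rating=low"},
    {"genre=Drama", "rating=low", "runtime=long"},
]
frequent = apriori(transactions, min_support=0.5)

# Confidence of the rule genre=Drama -> rating=high.
pair = frozenset({"genre=Drama", "rating=high"})
confidence = frequent[pair] / frequent[frozenset({"genre=Drama"})]
print(len(frequent), round(confidence, 3))
```

Even in this toy example no three-item set survives, because one of its two-item subsets is infrequent; sparse data makes this pruning bite early.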
RESULTS
The data exploration phase surfaced many research points and areas to work with:
The mean rating is 6.89, which is lower than the median (7.1); we will
examine this further. The mean number of votes per film is close to a thousand, whereas
the median (20) is far lower. The average rating distribution
is negatively skewed, so the median is larger than the mean.
Over 140K samples (more than 10%) have only 5 votes. Films with more than
1095 votes are represented by a single bar in the plot, which counts around 50K titles (these
are the significant/popular films we are mostly interested in). The distribution resembles a
Poisson distribution.
The majority of the titles are TV episodes (71%). Movies (including TV movies) cover only 10% of
the dataset. Of those 10%, 82% are movies and the remaining 18% are TV movies.
Films released in 2013, the peak year, received 37 million votes in total. The count declines
dramatically afterwards, although the number of films made per year keeps increasing
until 2017. More research has to be done in this area.
Since the 1920s, average film ratings have fluctuated without any monotonic trend. The
average rating reaches its peak in the 90s and early 2000s and has been dropping
dramatically since. We can further analyse and establish that 90s films got the most attention
from users. This may be related to the dominant age group among IMDb voters.
The highest rating averages belong to the documentary and news categories (both political).
Horror and sci-fi are at the bottom of the list (possibly due to too many extreme examples in both).
Average rating tends to increase as movie runtime increases. The exception is the group of films
with a runtime of 84 minutes or less, possibly because this group contains many animated
films, which have high ratings.
It is less common for a film to have multiple directors (8%) than
multiple writers (44%). The vast majority of films (92%) have a single director, and
almost 1/3 of films have a writer duo.
Christopher Nolan takes the first place in the Directors ordered by their mean film ratings
and also median vote counts per film.
The US clearly leads in the number of distinct films shown in cinemas in
the country.
Using apriori, we were able to generate the following itemsets.
Figure 2. Frequent itemsets for a minimum support of 0.1
CONCLUSION
The work done in this project is mostly exploratory in nature. We took the IMDb non-commercial
datasets and analysed the data with various graphs and visualization techniques to gain insights.
Lastly, we analysed how well the Apriori algorithm generates meaningful associations on a sparse
real-world dataset.
In general, I consider the results to be a significant setback for the Apriori method. To obtain any
significant rules, I had to lower the minimum support to 0.1; even then, I was unable to generate any
rules with more than two items. I think this is because the data is too sparse in this
example. Lowering the minimum support to 0.1 may be acceptable, but I would still feel more at
ease with a higher support.
Another problem is that, in order to fit the binary representation of the itemsets into RAM, I had to
filter the original data significantly, which may have led to some loss of information.
Future work in this area could be to research deeper into the areas