IMDB Movie Analysis
Result:
Data Pre-Processing
2. Checking for duplicates of entire row
Action:
• We kept only the rows whose movie_title is not duplicated, the first instance of each duplicated movie_title, and all
rows with the duplicated titles Out of the Blue and The Host.
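The selection rule above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code behind these slides: the function name `dedupe_by_title`, the `keep_all` parameter and the sample rows are hypothetical, and each row is assumed to be a dictionary with a `movie_title` key.

```python
def dedupe_by_title(rows, keep_all=("Out of the Blue", "The Host")):
    """Keep the first row per movie_title; keep every row for exempt titles."""
    seen = set()
    kept = []
    for row in rows:
        title = row["movie_title"]
        # Exempt titles are always kept; other titles only on first sight.
        if title in keep_all or title not in seen:
            kept.append(row)
        seen.add(title)
    return kept


# Hypothetical sample rows, for illustration only.
sample = [
    {"movie_title": "Avatar", "title_year": 2009},
    {"movie_title": "Avatar", "title_year": 2009},
    {"movie_title": "The Host", "title_year": 2006},
    {"movie_title": "The Host", "title_year": 2013},
]
deduped = dedupe_by_title(sample)
```

With a dataframe, the same rule would combine a "first occurrence" mask with a title membership mask.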
4 a. Checking for null values row wise
4 b. Checking frequency of row wise null values
Action:
• We dropped all rows in which the number of null values was greater than 9.
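A minimal sketch of this row filter, assuming each row is a dictionary and nulls are represented as `None`. The name `drop_sparse_rows` and the 12-column sample rows are hypothetical.

```python
def drop_sparse_rows(rows, max_nulls=9):
    """Keep rows whose count of None values is at most max_nulls."""
    return [
        row for row in rows
        if sum(value is None for value in row.values()) <= max_nulls
    ]


# Hypothetical rows with 12 columns each.
full_row = {f"col_{i}": i for i in range(12)}                        # 0 nulls, kept
sparse_row = {f"col_{i}": None for i in range(12)}
sparse_row["col_0"] = "only value"                                   # 11 nulls, dropped
edge_row = {f"col_{i}": (None if i < 9 else i) for i in range(12)}   # exactly 9 nulls, kept
cleaned = drop_sparse_rows([full_row, sparse_row, edge_row])
```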
5. Checking for null values column wise
6. Dealing with Null values in gross column
Action:
We fitted a Linear Regression model on imdb_score and gross, then predicted the gross value of each row with a null
gross from that row's IMDB score.
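The imputation step can be sketched with an ordinary least squares line fitted by hand. This is an illustration, not the model code used for the slides: `fit_line`, `impute_from_score` and the sample data are hypothetical, and nulls are assumed to be `None`.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_from_score(rows, target="gross", feature="imdb_score"):
    """Fill missing target values with predictions from the fitted line."""
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line([r[feature] for r in known],
                                [r[target] for r in known])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        if r[target] is None:
            r[target] = slope * r[feature] + intercept
        filled.append(r)
    return filled


# Hypothetical data lying exactly on gross = 10 * imdb_score.
sample = [
    {"imdb_score": 5.0, "gross": 50.0},
    {"imdb_score": 7.0, "gross": 70.0},
    {"imdb_score": 9.0, "gross": 90.0},
    {"imdb_score": 6.0, "gross": None},
]
imputed = impute_from_score(sample)
```

Passing `target="budget"` would apply the same helper to the budget column.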
7. Dealing with Null values in budget column
Action:
We fitted a Linear Regression model on imdb_score and budget, then predicted the budget value of each row with a null
budget from that row's IMDB score.
8. Dealing with Null values in aspect_ratio column
Action:
We first found the most frequent aspect_ratio for each language. We then replaced each null aspect_ratio with the most
frequent aspect_ratio of that row’s language.
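A minimal sketch of this group-wise fill, assuming dictionary rows with `None` for nulls. The helper name `fill_by_group_mode` and the sample values are hypothetical.

```python
from collections import Counter, defaultdict


def fill_by_group_mode(rows, group_key="language", target="aspect_ratio"):
    """Fill missing target values with the most frequent value in the row's group."""
    counts = defaultdict(Counter)
    for r in rows:
        if r[target] is not None:
            counts[r[group_key]][r[target]] += 1
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        # Rows whose group has no observed value are left as None.
        if r[target] is None and counts[r[group_key]]:
            r[target] = counts[r[group_key]].most_common(1)[0][0]
        filled.append(r)
    return filled


# Hypothetical sample rows.
sample = [
    {"language": "English", "aspect_ratio": 2.35},
    {"language": "English", "aspect_ratio": 2.35},
    {"language": "English", "aspect_ratio": 1.85},
    {"language": "English", "aspect_ratio": None},
    {"language": "French", "aspect_ratio": 1.85},
    {"language": "French", "aspect_ratio": None},
]
filled = fill_by_group_mode(sample)
```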
9. Dealing with Null values in content_rating column
Action:
First, we found the most frequent content_rating for each genre, using the rows that did not have a null
content_rating. Then, for each row with a null content_rating, we took the most common content_rating among that
row’s genres and used it to replace the null.
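The two-stage rule above can be read as a vote over per-genre modal ratings. The sketch below is one such reading, not the original code: it assumes a `genres` column holding a pipe-separated string such as "Action|Thriller", and the function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def impute_content_rating(rows):
    """Fill missing content_rating by a vote over per-genre modal ratings."""
    per_genre = defaultdict(Counter)
    for r in rows:
        if r["content_rating"] is not None:
            for genre in r["genres"].split("|"):
                per_genre[genre][r["content_rating"]] += 1
    # Most frequent rating for each genre, from rows with a known rating.
    genre_mode = {g: c.most_common(1)[0][0] for g, c in per_genre.items()}

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        if r["content_rating"] is None:
            votes = Counter(genre_mode[g] for g in r["genres"].split("|")
                            if g in genre_mode)
            if votes:
                r["content_rating"] = votes.most_common(1)[0][0]
        filled.append(r)
    return filled


# Hypothetical sample rows.
sample = [
    {"genres": "Action|Thriller", "content_rating": "R"},
    {"genres": "Action", "content_rating": "R"},
    {"genres": "Family|Animation", "content_rating": "PG"},
    {"genres": "Thriller|Crime", "content_rating": "R"},
    {"genres": "Action|Thriller|Family", "content_rating": None},
]
rated = impute_content_rating(sample)
```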
10. Dealing with Null values in plot_keywords column
Action:
For each movie in the above dataframe, we scraped the movie’s Wikipedia page to get a plot summary and then
used KeywordExtractor from the yake library to get the top 5 keywords.
11 a. Dealing with Null values in title_year column
Action:
For each movie in the dataframe obtained by extracting rows where title_year is null, we scraped the movie’s
Wikipedia page to get the release year of the movie and stored it in the original dataframe.
11 b. Dealing with Null values in title_year column which were not affected by the previous process
Action:
The title_year of the above two rows was filled in manually after looking it up on the internet.
12 a. Dealing with Null values in director_name column
Action:
For each movie in the dataframe obtained by extracting rows where director_name is null, we scraped the
movie’s Wikipedia page to get the director's name and stored it in the original dataframe.
12 b. Dealing with Null values in director_name column which were not affected by the previous process
Action:
The director_name of such movies is replaced with the actor_1_name value of the same movie.
13 a. Dealing with Null values in director_facebook_likes column
Action:
We searched the dataset for other rows with the same director_name and used their director_facebook_likes value to
replace the null values of director_facebook_likes.
13 b. Dealing with Null values in director_facebook_likes column which were not affected by the previous process
Action:
The director_facebook_likes of such rows is replaced with the difference between cast_total_facebook_likes and the
sum of actor_1_facebook_likes, actor_2_facebook_likes and actor_3_facebook_likes of the same row.
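Steps 13 a and 13 b can be sketched together as a lookup followed by a fallback. This is an illustration under assumptions: rows are dictionaries, nulls are `None`, and the function name `fill_director_likes` and the sample rows are hypothetical.

```python
def fill_director_likes(rows):
    """Fill null director_facebook_likes by lookup, then by the cast-total fallback."""
    # Step 13 a: remember the first known value for each director.
    known = {}
    for r in rows:
        if r["director_facebook_likes"] is not None:
            known.setdefault(r["director_name"], r["director_facebook_likes"])

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        if r["director_facebook_likes"] is None:
            if r["director_name"] in known:
                r["director_facebook_likes"] = known[r["director_name"]]
            else:
                # Step 13 b: cast total minus the three listed actors.
                actors = (r["actor_1_facebook_likes"]
                          + r["actor_2_facebook_likes"]
                          + r["actor_3_facebook_likes"])
                r["director_facebook_likes"] = r["cast_total_facebook_likes"] - actors
        filled.append(r)
    return filled


# Hypothetical sample rows.
base = {"cast_total_facebook_likes": 600, "actor_1_facebook_likes": 300,
        "actor_2_facebook_likes": 200, "actor_3_facebook_likes": 100}
sample = [
    {**base, "director_name": "Director A", "director_facebook_likes": 100},
    {**base, "director_name": "Director A", "director_facebook_likes": None},
    {"director_name": "Director B", "director_facebook_likes": None,
     "cast_total_facebook_likes": 1000, "actor_1_facebook_likes": 500,
     "actor_2_facebook_likes": 300, "actor_3_facebook_likes": 150},
]
likes = [r["director_facebook_likes"] for r in fill_director_likes(sample)]
```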
14. Dealing with Null values in num_critic_for_reviews column
Action:
The null values are replaced with 0.
15. Dealing with Null values in actor_1_name, actor_2_name, actor_3_name, actor_1_facebook_likes,
actor_2_facebook_likes, actor_3_facebook_likes column
Action:
The null values of actor names are replaced with ‘N.A.’ and null values of actor’s facebook likes are replaced with 0.
17 b. Dealing with Null values in duration column which were not affected by the previous process
Action:
The null values of duration of such rows are replaced with values that we found through manual check on the internet.
18. Dealing with Null values in facenumber_in_poster column
Action:
The null values of facenumber_in_poster column of such rows are replaced with 0 after confirming them on the internet
Action:
The null values of language column of such rows are replaced with ‘English’ after confirming them on the internet
Action:
The null values of country column of such rows are replaced with ‘USA’ after confirming them on the internet
21. Dealing with Outliers in duration column
Action:
We replaced the values above the upper whisker and below the lower whisker with the median.
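A minimal sketch of this replacement, assuming the whiskers follow the common box-plot convention of 1.5 times the interquartile range beyond the quartiles. The slides do not state the whisker rule, so both that factor and the quartile method are assumptions, and the sample durations are hypothetical.

```python
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside the box-plot fences with the median."""
    # Assumption: quartiles by linear interpolation over the data ("inclusive").
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    median = statistics.median(values)
    return [median if (v < lower or v > upper) else v for v in values]


# Hypothetical durations in minutes, with one very long and one very short entry.
durations = [90, 95, 100, 105, 110, 115, 120, 300, 10]
cleaned = replace_outliers_with_median(durations)
```

Here the quartiles are 95 and 115, so the fences sit at 65 and 145 and the two extreme entries are replaced with the median of 105.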
22. Dealing with Errors in aspect_ratio column
Action:
Only 16.00 seems to be an error; it is most probably the ratio 16/9. So we replaced every aspect_ratio value of 16.00
with 1.78 (16/9).
23. Dealing with Errors in country column
Action:
Only ‘Official site’ seems to be an error, which we replaced with USA after a manual check on the internet.
24. Dealing with Outliers in title_year column
Action:
Values less than 1916 seem to be outliers, which we replaced with the correct release years of the movies after a
manual check on the internet.
Insights
A. Movie Genre Analysis: Analyze the distribution of movie genres and their impact on the IMDB score.
Task: Determine the most common genres of movies in the dataset. Then, for each genre, calculate descriptive statistics (mean,
median, mode, range, variance, standard deviation) of the IMDB scores.
Result:
• The top 7 most common genres are Drama, Comedy, Thriller, Action, Romance, Adventure and Crime.
[Figure: Distribution of Movie Genres. Bar chart of Count by genre.]
• The descriptive statistics of all the top 7 genres are at almost the same level.
[Figure: Descriptive Statistics of Top 7 Genres. Bar chart of IMDB Score by statistic: Mean, Median, Mode, Max, Min, Range, Variance, Standard Deviation.]
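The statistics named in the task can be computed per genre along the following lines. This is a sketch under assumptions rather than the code behind the chart: it assumes a pipe-separated `genres` string, uses population variance and standard deviation (the slides do not say which variant was used), and the function name and sample rows are hypothetical.

```python
import statistics
from collections import defaultdict


def genre_score_stats(rows):
    """Descriptive statistics of imdb_score for every genre."""
    scores = defaultdict(list)
    for r in rows:
        # A movie with several genres contributes its score to each of them.
        for genre in r["genres"].split("|"):
            scores[genre].append(r["imdb_score"])
    return {
        genre: {
            "count": len(vals),
            "mean": statistics.mean(vals),
            "median": statistics.median(vals),
            "mode": statistics.mode(vals),
            "range": max(vals) - min(vals),
            "variance": statistics.pvariance(vals),
            "stdev": statistics.pstdev(vals),
        }
        for genre, vals in scores.items()
    }


# Hypothetical sample rows.
sample = [
    {"genres": "Drama|Romance", "imdb_score": 7.0},
    {"genres": "Drama", "imdb_score": 7.0},
    {"genres": "Drama|Comedy", "imdb_score": 5.0},
    {"genres": "Comedy", "imdb_score": 6.0},
]
stats = genre_score_stats(sample)
```

Sorting the result by `count` gives the most common genres.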
Q1. Why is Drama the most common genre?
• The top 5 genres that occur together with Drama are Comedy, Thriller, Action, Romance and Crime. These are also
among the top 7 most common genres overall. This suggests that Drama is the most common genre because it is so often
combined with the other popular genres.
[Figure: Genres common with Drama. Bar chart of Count by genre.]
Q2. Why does Drama, the most common genre, have a large Range but a small Standard Deviation of IMDB Score?
• The distribution of IMDB Scores of the Drama genre closely follows a Normal Distribution. Since it follows a Normal
Distribution, we can say that 99.7% of its data lies between IMDB Scores of 4.0 and 9.5. This is why its range, i.e. the
difference between the maximum and minimum score, is large while its Standard Deviation is small: the range is set by
a few extreme scores, whereas the standard deviation reflects how tightly most scores cluster around the mean.
[Figure: Distribution of IMDB Score of Drama Genre. Histogram of Count with PDF overlay against IMDB Score; series: Total Count, PDF, Mean, 1 STDev, 2 STDev, 3 STDev.]
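The contrast between range and standard deviation can be checked numerically. The sketch below uses made-up scores, not the Drama data: most values sit close to 7, with one very low and one very high score. The function name and the choice of population standard deviation are assumptions.

```python
import statistics


def three_sigma_summary(scores):
    """Mean, standard deviation, the mean +/- 3 SD band, and the range."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    low, high = mean - 3 * sd, mean + 3 * sd
    inside = sum(low <= s <= high for s in scores) / len(scores)
    return {
        "mean": mean,
        "stdev": sd,
        "band": (low, high),
        "share_inside": inside,
        "range": max(scores) - min(scores),
    }


# Hypothetical scores: a tight cluster plus two extreme values.
scores = [6.5] * 20 + [7.0] * 40 + [7.5] * 20 + [2.0, 9.5]
summary = three_sigma_summary(scores)
```

In this sample the two extreme scores stretch the range to 7.5 while the standard deviation stays below 1, because 80 of the 82 scores lie within half a point of 7.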
B. Movie Duration Analysis: Analyze the distribution of movie durations and its impact on the IMDB score.
Task: Analyze the distribution of movie durations and identify the relationship between movie duration and IMDB score.
Result:
[Figure: Distribution of Movie Duration. Histogram of Count by Duration (mins) bins.]
[Figure: Distribution of Movie Duration with its PDF. Count and PDF against Duration, with the Mean and the 1, 2 and 3 STDev marks on either side.]
• The adjacent scatter plot shows that duration and IMDB scores have a positive relationship.
[Figure: Duration vs. IMDB Score. Scatter plot of IMDB Score against Duration (mins); legend: 0-1, 1-2, 2-3, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, 9-10.]
Q1. Why do we have a positive-slope trendline in the above plot?
• The first plot shows that there are more movies with a higher duration than with a lower duration. The second plot shows
that the datapoints in the higher-duration area are dense and concentrated in the higher IMDB ratings area, whereas the
datapoints in the lower-duration area are comparatively sparse and spread across both higher and lower IMDB ratings.
This is the reason for the positive-slope trendline.
Q2. Why do movies with a higher duration tend to have higher ratings?
• The plot shows that movies with a higher duration have higher budgets.
[Figure: Duration vs. Budget. Average Budget (Millions) against Duration.]
Q3. Why do movies with a higher budget tend to have higher ratings?
• The plot shows that as the budget of films increases, the popularity of the cast increases. That is, with a larger budget the
producer is able to cast more popular actors and actresses, which in turn increases the popularity of the movie and thus
its IMDB Score.
[Figure: Popularity of Cast vs. Budget. Bar chart of Average Facebook Likes of Cast by Budget Bins.]
C. Language Analysis: Situation: Examine the distribution of movies based on their language.
Task: Determine the most common languages used in movies and analyze their impact on the IMDB score using descriptive statistics.
Result:
[Figure: Distribution of Languages. Bar chart of Count by language.]
[Figure: Descriptive Statistics of Languages. Bar chart of IMDB Score statistics by language.]
• The left plot above shows that English is the most common language used in movies, followed by French, Spanish, Hindi and
Mandarin. The right plot above shows that French has a comparatively higher mean and median but lower variance and
standard deviation, implying that most French-language movies have IMDB scores on the higher side.
Q1. Why is English the most common language?
• The adjacent plot shows that USA is the most common country in the dataset, and most films in USA are in the English
language, as it is the most spoken language in the country.
[Figure: Bar chart of Count by country; series: Total, English.]
maintained consistently high scores for large number of movies to be compared for top directors to those who has
performed well in few movies.
[Figure: Chart by director with a Percentile axis. Directors shown: Tony Scott, Shawn Levy, Richard Linklater, Bobby Farrelly, Clint Eastwood, Richard Donner, Brian De Palma, Steven Spielberg, Stephen Frears, Wes Craven, Peter Jackson, Michael Bay, Sam Raimi, Oliver Stone, Renny Harlin, Tim Burton, Brett Ratner, Chris Columbus, Ron Howard, Woody Allen, David Fincher, Kevin Smith, Rob Reiner, Tyler Perry, Martin Scorsese, Robert Zemeckis.]
• The average IMDB Scores of the top directors under the above condition are between 7 and 8. Also, their percentile
scores are above 80%.
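One way to compute an average score and a percentile per director is sketched below. The slides do not give the exact percentile definition or the minimum number of movies, so both are assumptions here: the percentile is taken as the share of all directors whose average is at or below this director's average, and `min_movies`, the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def director_percentiles(rows, min_movies=2):
    """Average imdb_score and percentile rank for directors with enough movies."""
    by_director = defaultdict(list)
    for r in rows:
        by_director[r["director_name"]].append(r["imdb_score"])
    averages = {d: sum(s) / len(s) for d, s in by_director.items()}
    all_avgs = list(averages.values())

    result = {}
    for director, avg in averages.items():
        if len(by_director[director]) >= min_movies:
            # Percentile rank among the averages of all directors.
            rank = sum(a <= avg for a in all_avgs) / len(all_avgs) * 100
            result[director] = (avg, rank)
    return result


# Hypothetical sample rows.
sample = [
    {"director_name": "Director A", "imdb_score": 8.0},
    {"director_name": "Director A", "imdb_score": 8.0},
    {"director_name": "Director B", "imdb_score": 6.0},
    {"director_name": "Director B", "imdb_score": 7.0},
    {"director_name": "Director C", "imdb_score": 9.0},
    {"director_name": "Director D", "imdb_score": 5.0},
    {"director_name": "Director D", "imdb_score": 5.0},
]
ranked = director_percentiles(sample)
```

Director C has the highest average but only one movie, so the minimum-count condition excludes that entry from the ranking.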
D. Director Analysis: Influence of directors on movie ratings.
Task: Identify the top directors based on their average IMDB score and analyze their contribution to the success of movies using
percentile calculations.
Result:
Q1. Why are the directors in the above list considered top directors?
[Figure: Distribution of IMDB Scores Director wise. x-axis: IMDB Score; series: Total, Peter Jackson, David Fincher, Martin Scorsese, Steven Spielberg, Richard Linklater, Robert Zemeckis, Clint Eastwood.]
• The above plot shows that the top 7 directors have all of their ratings on the higher side of the x-axis.
Q2. Why do the above directors have their movies' ratings on the higher side of the x-axis?
[Figures: Scatter plots of Margin (Millions) against IMDB Score for the top directors, each with a linear trendline, Linear (Margin). The only panel title that survives is Steven Spielberg.]
• The above plots show the margins of the movies of the above top directors with respect to their IMDB Scores. In all the plots, we
Q3. Why does margin have a positive relation with IMDB Score for the above top directors?
[Figure: Linear trendlines against Margin (Millions) for Clint Eastwood, David Fincher, Martin Scorsese, Peter Jackson, Richard Linklater, Robert Zemeckis and Steven Spielberg.]
Q4. Why does a high gross collection lead to a high IMDB Score for these directors?
• A high gross collection implies high footfall for the movies. This makes the movies more popular, leading to high IMDB
Scores.
E. Budget Analysis: Explore the relationship between movie budgets and their financial success.
Task: Analyze the correlation between movie budgets and gross earnings, and identify the movies with the highest profit margin.
Result:
[Figure: Gross vs. Budget. y-axis: Gross (Millions).]
• The above table shows that the correlation between Gross and Budget is positive and more than 0.5. That is, the
relationship shows that as the budget of movies increases, there is a very high probability that the gross collection of the movie
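The correlation referred to above can be computed as a Pearson coefficient, which measures the strength of the linear association between two columns. The sketch below is illustrative only: the slides do not name the coefficient used, and the `pearson` helper and the budget and gross figures are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical budgets and grosses, in millions.
budgets = [10, 20, 30, 40, 50]
grosses = [12, 25, 20, 55, 60]
r = pearson(budgets, grosses)
```

For this made-up sample the coefficient comes out at roughly 0.92, a strong positive linear association.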
Q1. Why is the overall trendline's slope close to one and not greater than one?
[Figure: Scatter plot of Margin against IMDB Score.]
Q3. Why do many movies with negative margins have high IMDB Scores?
[Figure: Movie Genres with Negative Margins and 5 or more IMDB Score. Bar chart of Percentage of Entire Data by genre.]
• The top 5 genres for movies with negative margins and IMDB Scores of 5 or more are Biography, War, History,
Sport and Documentary. These genres have a domain-specific target audience. So the IMDB Score can be high, as
critics might like the movies, but the number of footfalls will be low, leading to less Gross collection and
Q4. Why do some movies with negative margins have low IMDB Scores?
• The top 5 genres for movies with negative margins and IMDB Scores of less than 5 are Sci-Fi, Fantasy, Horror,
Family and Musical. Although the percentage of such films is very small, these are popular genres and ideally they
should not have negative margins.
[Figure: Movie Genres with Negative Margins and 5 or more IMDB Score. Bar chart of Percentage of Total by genre.]
Q5. Why do movies of popular genres have low IMDB Scores and negative margins?
• Movies with an IMDB Score of less than 5 have casts whose total Facebook likes are very low.
[Figure: IMDB Score wise average total facebook likes of cast of Movies. FB Likes of Cast (Thousands) against IMDB Score.]
Result