
Project 5

IMDB Movie Analysis


From: Syed Afnan
Project Description

• This project provides insights into the success of a movie based on the IMDB data provided.

• These insights will be helpful for filmmakers and other stakeholders during the production of a movie.

• As data analysts, we received a dataset with details about movies and their IMDB Scores, together with a set of queries. We are required to analyze the data and provide insights for those queries.

• To do this, we use the Python programming language for data pre-processing and Microsoft Excel to extract and filter data from the pre-processed dataset and analyse it to get the required insights.



Approach

• We received a dataset containing information about movies from the IMDB database, such as director name, actor names, genres, plot keywords, budget, gross collection, IMDB Score etc.

• We first tried to understand the whole dataset.

• Then we performed data pre-processing, where we checked for and made the necessary changes to missing data, errors in data, outliers in data and duplicates in data.

• Then we filtered the data as per the queries we had. We also plotted some graphs for better understanding. Then we drew conclusions and insights based on the filtered data and the plots.

• Now that we have all the required insights, we will make a detailed report about them and hand it over to the hiring department.



Tech-Stack Used

• Python 3.10.9 – Programming language used for data pre-processing.

• Jupyter Notebook 6.5.2 – Interactive platform to write and execute code in various programming languages (in this case Python).

• Microsoft Excel 2016 – A spreadsheet editor used mainly by professionals to enter data in table format, perform computations, plot graphs etc. Here Microsoft Excel is used to filter data and plot graphs to get insights about the movies.



Dataset Information (column)
• color: Whether movie is colored or black and white
• director_name: Name of the Director
• num_critic_for_reviews: Number of critics that gave reviews for the movie
• duration: Duration of the movie
• director_facebook_likes: Number of facebook likes the director has
• actor_3_facebook_likes: Number of facebook likes actor 3 has
• actor_2_name: Name of actor 2
• actor_1_facebook_likes: How many facebook likes actor 1 has
• gross: Gross collection of the movie
• genres: Genres of the movie
• actor_1_name: Name of actor 1
• movie_title: Name of the movie
• num_voted_users: Number of users voted for the movie
• cast_total_facebook_likes: Number of facebook likes the whole cast has together
• actor_3_name: Name of actor 3
• facenumber_in_poster: Number of faces in the movie poster
• plot_keywords: Keywords of the movie plot
• movie_imdb_link: Link to IMDB page of the movie
• num_user_for_reviews: Number of users who gave review to the movie
• language: Language of the movie
• country: Country in which the movie was made
• content_rating: The certificate that the movie got
• budget: Budget of the movie
• title_year: Release year of the movie
• actor_2_facebook_likes: Number of facebook likes actor 2 has
• imdb_score: IMDB Score of the movie
• aspect_ratio: The aspect ratio in which the movie was made
• movie_facebook_likes: Number of facebook likes the movie has
Dataset Information

The dataset has 5043 rows and 28 columns in total


Data Pre-Processing
1. Rearranging columns of the dataset for convenience during analysis

2. Checking for duplicates of entire rows

Action:

• Kept one copy of each duplicated row and dropped the rest.
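The duplicate-row check can be sketched with pandas as follows. This is a minimal illustration on a made-up three-row sample, not the author's notebook code; the variable names are assumptions.

```python
import pandas as pd

# Tiny made-up sample standing in for the IMDB dataset (not real data).
movies = pd.DataFrame({
    "movie_title": ["A", "A", "B"],
    "imdb_score": [7.1, 7.1, 6.4],
})

# Count rows that repeat an earlier row in every column.
n_duplicates = int(movies.duplicated().sum())

# Keep the first copy of each fully duplicated row and drop the rest.
deduplicated = movies.drop_duplicates(keep="first").reset_index(drop=True)
```
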
3 a. Checking for duplicates in movie_title column
3 b. Further checking which column values differ between rows with the same movie_title
Result:
We observed that, except for the movies "Out of the Blue" and "The Host", for almost all the other movies the rows with the same movie_title differ only in the columns related to facebook likes and in "num_voted_users". So we can drop the duplicate rows, leaving just one copy, without much effect on the overall analysis.
Action:

• We kept the rows whose movie_title is not duplicated, the first instance of each duplicated movie_title, and all rows with the movie_title Out of the Blue or The Host.
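One possible way to express the movie_title rule in pandas is shown below. It is a sketch on made-up rows; the `keep_all` list and the other variable names are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample rows (not real data).
movies = pd.DataFrame({
    "movie_title": ["Movie A", "Movie A", "The Host", "The Host", "Movie B"],
    "num_voted_users": [100, 120, 50, 60, 30],
})

# Titles whose duplicate rows are all kept, as named in the text above.
keep_all = ["Out of the Blue", "The Host"]

# True for the second and later rows that share a movie_title.
later_duplicate = movies.duplicated(subset="movie_title", keep="first")

# Keep a row if it is a first occurrence or if its title is in keep_all.
mask = ~later_duplicate | movies["movie_title"].isin(keep_all)
cleaned = movies[mask].reset_index(drop=True)
```
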
4 a. Checking for null values row wise
4 b. Checking frequency of row wise null values

Action:

• We dropped all the rows where the number of null values was greater than 9.
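The row-wise null check, its frequency table and the drop rule might look like this in pandas. The sample is made up, and because it has only four columns the threshold is lowered from 9 to 2 so that the filter has a visible effect.

```python
import numpy as np
import pandas as pd

# Made-up sample (not real data): row B has 2 nulls, row C has 3 nulls.
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "gross": [1.0e6, np.nan, np.nan],
    "budget": [5.0e5, np.nan, np.nan],
    "duration": [100.0, 90.0, np.nan],
})

# Number of null values in each row, and how often each count occurs.
nulls_per_row = movies.isnull().sum(axis=1)
null_count_frequency = nulls_per_row.value_counts().sort_index()

# Number of null values in each column (the column-wise check).
nulls_per_column = movies.isnull().sum()

# The text uses a threshold of 9; this small sample uses 2 instead.
max_nulls = 2
filtered = movies[nulls_per_row <= max_nulls]
```
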
5. Checking for null values column wise
6. Dealing with Null values in gross column
Action:
Fitted a Linear Regression model on imdb_score and gross, and then predicted the gross value from the IMDB Score of each row with a null value in gross.
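A regression-based fill for gross can be sketched as below. The source does not show which library fitted the model, so `np.polyfit` (an ordinary least-squares straight line) is used here as a stand-in, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample (not real data); the last row has a null gross.
movies = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0],
    "gross": [10.0, 20.0, 30.0, np.nan],
})

known = movies[movies["gross"].notna()]
missing = movies["gross"].isna()

# Least-squares straight line: gross ~ slope * imdb_score + intercept.
slope, intercept = np.polyfit(known["imdb_score"], known["gross"], deg=1)

# Predict gross only for the rows where it is null.
movies.loc[missing, "gross"] = slope * movies.loc[missing, "imdb_score"] + intercept
```

The same pattern would apply to the budget column by swapping the column name.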
7. Dealing with Null values in budget column
Action:
Fitted a Linear Regression model on imdb_score and budget, and then predicted the budget value from the IMDB Score of each row with a null value in budget.
8. Dealing with Null values in aspect_ratio column
Action:
First found the most frequent aspect_ratio for each language. Then replaced each null value of aspect_ratio with the most frequent aspect_ratio of that row's language.
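Filling aspect_ratio from the most frequent value per language could be written as follows. This is a sketch on made-up rows; the helper `most_frequent` is a hypothetical name introduced for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample (not real data).
movies = pd.DataFrame({
    "language": ["English", "English", "English", "French", "French"],
    "aspect_ratio": [2.35, 2.35, np.nan, 1.85, np.nan],
})

def most_frequent(values: pd.Series) -> float:
    """Return the most frequent non-null value, or NaN if there is none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Most frequent aspect_ratio for each language, aligned to every row.
language_mode = movies.groupby("language")["aspect_ratio"].transform(most_frequent)

# Fill only the null entries with their language's most frequent value.
movies["aspect_ratio"] = movies["aspect_ratio"].fillna(language_mode)
```
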
9. Dealing with Null values in content_rating column
Action:
First, we found the most frequent content_rating for each genre using the rows that did not have null values for content_rating. Then, for each row with a null content_rating, we found the most common content_rating among that row's genres and replaced the null value with it.

10. Dealing with Null values in plot_keywords column
Action:
For each movie with a null value in plot_keywords, we scraped the movie’s Wikipedia page to get a plot summary and then used KeywordExtractor from the yake library to get the top 5 keywords.

11 a. Dealing with Null values in title_year column
Action:
For each movie in the dataframe obtained by extracting rows where title_year is null, we scraped the movie’s Wikipedia page to get the release year of the movie and stored it in the original dataframe.

11 b. Dealing with Null values in title_year column which were not affected by the previous process
Action:
The title_year of the two remaining rows was filled in manually after looking it up on the internet.

12 a. Dealing with Null values in director_name column
Action:
For each movie in the dataframe obtained by extracting rows where director_name is null, we scraped the movie’s Wikipedia page to get the director name of the movie and stored it in the original dataframe.

12 b. Dealing with Null values in director_name column which were not affected by the previous process
Action:
The director_name of such movies is replaced with the actor_1_name value of the same movie.

13 a. Dealing with Null values in director_facebook_likes column
Action:
The director_facebook_likes of the corresponding director_name is looked up in the dataset, and that value replaces the null value of director_facebook_likes.

13 b. Dealing with Null values in director_facebook_likes column which were not affected by the previous process
Action:
The director_facebook_likes of such rows is replaced with the difference between cast_total_facebook_likes and the sum of actor_1_facebook_likes, actor_2_facebook_likes and actor_3_facebook_likes of the same row.

14. Dealing with Null values in num_critic_for_reviews column
Action:
The null values are replaced with 0.

15. Dealing with Null values in actor_1_name, actor_2_name, actor_3_name, actor_1_facebook_likes, actor_2_facebook_likes, actor_3_facebook_likes columns
Action:
The null values of actor names are replaced with ‘N.A.’ and the null values of the actors’ facebook likes are replaced with 0.

16. Dealing with Null values in color column
Action:
The null values of color are replaced with ‘Color’, which we found through a manual check on the internet.

17 a. Dealing with Null values in duration column
Action:
For each movie in the dataframe obtained by extracting rows where duration is null, we scraped the movie’s Wikipedia page to get the duration of the movie and stored it in the original dataframe.

17 b. Dealing with Null values in duration column which were not affected by the previous process
Action:
The null values of duration of such rows are replaced with values that we found through a manual check on the internet.

18. Dealing with Null values in facenumber_in_poster column
Action:
The null values of the facenumber_in_poster column are replaced with 0 after confirming them on the internet.

19. Dealing with Null values in language column
Action:
The null values of the language column are replaced with ‘English’ after confirming them on the internet.

20. Dealing with Null values in country column
Action:
The null values of the country column are replaced with ‘USA’ after confirming them on the internet.

21. Dealing with Outliers in duration column

Action:
Replaced the values above the upper whisker and below the lower whisker with the median.
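The whisker-based replacement for duration can be illustrated as below. The source does not state how the whiskers were computed, so the common box-plot convention of 1.5 times the interquartile range is assumed, and the durations are made up.

```python
import pandas as pd

# Made-up durations in minutes (not real data); 10 and 300 are the outliers.
duration = pd.Series([10, 90, 95, 100, 105, 110, 300], dtype=float)

q1 = duration.quantile(0.25)
q3 = duration.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR box-plot convention (an assumption).
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Replace values outside the whiskers with the median of the column.
median = duration.median()
is_outlier = (duration < lower_whisker) | (duration > upper_whisker)
cleaned = duration.mask(is_outlier, median)
```
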
22. Dealing with Errors in aspect_ratio column

Action:
Only 16.00 seems to be an error, and it most probably stands for a ratio of 16/9. So replaced every aspect_ratio value of 16.00 with 1.78 (16/9).
23. Dealing with Errors in country column

Action:
Only ‘Official site’ seems to be an error, which we replaced with USA after a manual check on the internet.
24. Dealing with Outliers in title_year column

Action:
Values less than 1916 seem to be outliers, which we replaced with the correct release years of the movies after a manual check on the internet.

25. Dealing with Outliers in budget column


Action:
Replaced values of budget less than 0 with median value

26. Dealing with Outliers in gross column


Action:
Replaced values of gross less than 0 with median value

27. Dealing with errors in director_name column found during analysis

Action:
Replaced error values with the correct name after checking on the internet. Where a single row listed multiple names, kept only the first name.

28. Dealing with errors in country column found during analysis

Action:
Replaced ‘New Line’ with USA, ‘Soviet Union’ with Russia and ‘West Germany’ with Germany.

29. Dealing with errors in gross column found during analysis

Action:
Found that some rows of the gross column have very low values. On further checking, found that they were gross collections from the USA only, not worldwide collections. So replaced them with the worldwide collection values after checking on the internet.

30. Dealing with errors in movie_facebook_likes column found during analysis

Action:
Found that some rows of the movie_facebook_likes column are 0. So replaced them with the average movie_facebook_likes of movies with the same imdb_score value.

31. Dealing with errors in cast_total_facebook_likes column found during analysis

Action:
Found that some rows of the cast_total_facebook_likes column are 0. So replaced them with the average cast_total_facebook_likes of movies with the same imdb_score value.
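Replacing zero facebook likes with the average for the same imdb_score might be sketched as follows. The data is made up, and excluding the zeros from the average is an assumption, since the source does not say whether they were counted.

```python
import numpy as np
import pandas as pd

# Made-up sample (not real data).
movies = pd.DataFrame({
    "imdb_score": [7.0, 7.0, 7.0, 6.0, 6.0],
    "movie_facebook_likes": [100.0, 300.0, 0.0, 50.0, 0.0],
})

column = "movie_facebook_likes"

# Treat 0 as missing so it does not pull the group average down (an assumption).
likes = movies[column].replace(0, np.nan)

# Average of the non-zero likes among movies with the same imdb_score.
group_mean = likes.groupby(movies["imdb_score"]).transform("mean")

# Fill the former zeros with their group's average.
movies[column] = likes.fillna(group_mean)
```

Setting `column = "cast_total_facebook_likes"` would apply the same rule to the cast column.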

32. Feature Engineering involving budget and gross column.

Action:
Created a new column margin which holds the difference between the gross and budget values.

33. Feature Engineering involving plot_keywords column.

Action:
Separated the plot keywords using pipe (|) as the separator and then deleted the plot_keywords column.

34. Feature Engineering involving genres column.

Action:
Separated the genres using pipe (|) as the separator and then deleted the genres column.

35. Feature Engineering involving genres column.

Action:
Created a new column no_genre which holds the number of genres of each row.
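The feature-engineering steps for margin, genres and no_genre can be sketched together as below. The sample rows are made up, and the `genre_1`, `genre_2`, ... column names are hypothetical, since the source does not show how the separated columns were named.

```python
import pandas as pd

# Made-up sample (not real data).
movies = pd.DataFrame({
    "movie_title": ["A", "B"],
    "budget": [100.0, 80.0],
    "gross": [250.0, 60.0],
    "genres": ["Action|Adventure|Sci-Fi", "Drama"],
})

# margin: difference between gross and budget.
movies["margin"] = movies["gross"] - movies["budget"]

# Split the pipe-separated genres into one column per position.
genre_columns = movies["genres"].str.split("|", expand=True)
genre_columns.columns = [f"genre_{i + 1}" for i in range(genre_columns.shape[1])]

# no_genre: number of genres listed for each movie.
movies["no_genre"] = genre_columns.notna().sum(axis=1)

# Attach the separated genres and drop the original combined column.
movies = pd.concat([movies.drop(columns="genres"), genre_columns], axis=1)
```
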
Insights

A. Movie Genre Analysis: Analyze the distribution of movie genres and their impact on the IMDB score.
Task: Determine the most common genres of movies in the dataset. Then, for each genre, calculate descriptive statistics (mean,
median, mode, range, variance, standard deviation) of the IMDB scores.
Result:
• The top 7 most common genres are Drama, Comedy, Thriller, Action, Romance, Adventure and Crime.

[Figure: Distribution of Movie Genres. Bar chart of Count per genre; the seven largest counts are 2533, 1847, 1363, 1112, 1083, 887 and 869.]

• The descriptive statistics of all the top 7 genres are at almost the same level.

[Figure: Descriptive Statistics of Top 7 Genres. Grouped bar chart of IMDB Score statistics (Mean, Median, Mode, Max, Min, Range, Variance, Standard Deviation) for Drama, Comedy, Thriller, Action, Romance, Adventure and Crime.]
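The genre counts and per-genre descriptive statistics for this task could be computed in pandas as sketched below. The charts in this report were made in Excel, so this is only an illustrative equivalent on made-up data; the helper names are hypothetical, and the variance and standard deviation use the sample (ddof=1) definition as an assumption.

```python
import pandas as pd

# Made-up sample (not real data).
movies = pd.DataFrame({
    "genres": ["Drama|Comedy", "Drama", "Comedy|Action", "Drama|Action"],
    "imdb_score": [7.0, 8.0, 6.0, 6.0],
})

# One row per (movie, genre) pair.
exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Number of movies per genre, ordered from most to least common.
genre_counts = exploded["genre"].value_counts()

def score_range(scores: pd.Series) -> float:
    """Difference between the highest and lowest score."""
    return scores.max() - scores.min()

def score_mode(scores: pd.Series) -> float:
    """Most frequent score; the smallest one is returned when several tie."""
    return scores.mode().iloc[0]

# Sample variance and standard deviation (pandas default, ddof=1).
genre_stats = exploded.groupby("genre")["imdb_score"].agg(
    mean="mean", median="median", mode=score_mode,
    range=score_range, variance="var", std="std",
)
```
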


Q1. Why is Drama the most common genre?
• The top 5 genres that occur together with Drama are Comedy, Thriller, Action, Romance and Crime. These are also among the top 7 most common genres overall.

[Figure: Genres common with Drama. Bar chart of Count per genre occurring together with Drama.]

Q2. Why does Drama, the most common genre, have a large range but a small standard deviation of IMDB Score?
• The distribution of IMDB Scores of the Drama genre closely follows a Normal Distribution. Since it follows a Normal Distribution, we can say that 99.7% of its data lies between IMDB Scores of 4.0 and 9.5. This is the reason that its range, i.e. the difference between the maximum score and the minimum score, is large while its standard deviation is small: a few scores in the tails stretch the range, but most scores sit close to the mean.

[Figure: Distribution of IMDB Score of Drama Genre. Count and PDF against IMDB Score (1.5 to 10.0), with the mean and the 1, 2 and 3 standard deviation marks.]

Insights

B. Movie Duration Analysis: Analyze the distribution of movie durations and its impact on the IMDB score.
Task: Analyze the distribution of movie durations and identify the relationship between movie duration and IMDB score.
Result:
[Figure: Distribution of Movie Duration. Histogram of Count per Duration (mins) bin.]

[Figure: Distribution of Movie Duration with its PDF. Count and PDF against Duration (50 to 160), with the mean and the 1, 2 and 3 standard deviation marks.]

• The distribution of movie durations closely follows a Normal Distribution.

• The scatter plot shows that duration and IMDB Score have a positive relationship.

[Figure: Duration vs. IMDB Score. Scatter plot of IMDB Score (0 to 10) against Duration (mins, 0 to 180), with points grouped by IMDB Score band (0-1 through 9-10).]

Q1. Why do we have a trendline with a positive slope in the above plot?
• The first plot shows that there are more movies with longer durations than with shorter durations. The second plot shows that the data points in the longer-duration region are dense and concentrated at higher IMDB ratings, whereas the data points in the shorter-duration region are comparatively sparse and spread across both higher and lower IMDB ratings. This is the reason for the positive slope of the trendline.

Q2. Why do movies with longer durations tend to have higher ratings?
• The plot shows that movies with longer durations have higher budgets.

[Figure: Duration vs. Budget. Average Budget (millions, 0 to 100) against Duration (0 to 160).]

Q3. Why do movies with higher budgets tend to have higher ratings?
• The plot shows that as the budget of films increases, the popularity of the cast increases. That is, the producer is able to cast more popular actors and actresses as the budget increases, which in turn increases the popularity of the movie and thus its IMDB Score.

[Figure: Popularity of Cast vs. Budget. Bar chart of Average Facebook Likes of Cast (0 to 60000) per Budget Bin (0-10000000 up to 260000000-270000000).]

Insights

C. Language Analysis: Situation: Examine the distribution of movies based on their language.
Task: Determine the most common languages used in movies and analyze their impact on the IMDB score using descriptive statistics.
Result:
[Figure: Distribution of Languages. Bar chart of Count per language; English has the largest count, 4592.]

[Figure: Descriptive Statistics of Languages. Grouped bar chart of IMDB Score statistics (Mean, Median, Mode, Variance, Standard Deviation) for English, French, Spanish, Hindi and Mandarin.]

• The left plot shows that English is the most common language used in movies, followed by French, Spanish, Hindi and Mandarin. The right plot shows that French has a comparatively higher mean and median but lower variance and standard deviation, implying that most French-language movies have their IMDB Score on the higher side.

Q1. Why is English the most common language?
• The plot shows that USA is the most common country in the dataset, and most films from the USA are in English as it is the most spoken language in the country.

[Figure: Distribution of Countries. Bar chart of Count per country, showing the total number of movies and the number of English-language movies.]

Insights

D. Director Analysis: Influence of directors on movie ratings.


Task: Identify the top directors based on their average IMDB score and analyze their contribution to the success of movies using
percentile calculations.
Result:
• The plot considers only those directors who have more than 9 movies and whose range of IMDB Scores is less than or equal to 3. Otherwise, it would be unfair to compare directors who have maintained consistently high scores across a large number of movies with those who have performed well in only a few movies.

[Figure: Director wise average IMDB Score and its percentile. Average IMDB Score (0 to 9) and Percentile (0% to 100%) per director.]


• For top directors, only the top 7 directors are considered as there is a drop in percentile after the first 7 directors in the previous
plot.

• The average IMDB Scores are between 7 and 8 for the top directors with the above condition. Also their percentile score is above
80%.
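The director filter and percentile ranking described above could be reproduced in pandas roughly as follows. The report's percentiles were calculated in Excel and the exact formula is not shown, so `rank(pct=True)` is an assumed stand-in. The director names are placeholders, and the movie-count threshold is lowered from 9 to 2 to suit the made-up sample.

```python
import pandas as pd

# Made-up sample (not real data); director names are placeholders.
movies = pd.DataFrame({
    "director_name": ["D1"] * 3 + ["D2"] * 3 + ["D3"] * 2 + ["D4"] * 3,
    "imdb_score": [8.0, 7.5, 8.5, 6.0, 6.5, 7.0, 9.0, 9.0, 3.0, 8.0, 7.0],
})

per_director = movies.groupby("director_name")["imdb_score"].agg(
    movie_count="count",
    average_score="mean",
    score_range=lambda s: s.max() - s.min(),
)

# The text keeps directors with more than 9 movies; 2 is used for this sample.
min_movies = 2
max_range = 3
eligible = per_director[
    (per_director["movie_count"] > min_movies)
    & (per_director["score_range"] <= max_range)
].copy()

# Percentile rank of each eligible director's average score (1.0 is the highest).
eligible["percentile"] = eligible["average_score"].rank(pct=True)
top_directors = eligible.sort_values("average_score", ascending=False)
```

Here D3 is excluded for having too few movies and D4 for having too wide a score range.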
Q1. Why are the above directors considered the top directors?

[Figure: Distribution of IMDB Scores Director wise. Total Count (0 to 250) and Director Wise Count (0 to 5) against IMDB Score (0 to 10) for Peter Jackson, David Fincher, Martin Scorsese, Steven Spielberg, Richard Linklater, Robert Zemeckis and Clint Eastwood.]

• The plot shows that the top 7 directors have all of their ratings on the higher side of the x axis.

Q2. Why do the above directors have their movies' ratings on the higher side of the x axis?

[Figures: Margin (millions) against IMDB Score (0 to 10) with a linear trendline, one panel each for Clint Eastwood, David Fincher, Martin Scorsese, Peter Jackson, Richard Linklater, Robert Zemeckis and Steven Spielberg.]

• The plots show the margins of the top directors' movies with respect to their IMDB Scores. In all the plots, the trendline for margin has a positive slope. That is, margin and IMDB Score increase together.

Q3. Why does margin have a positive relationship with IMDB Score for the above top directors?

• The margin vs. gross plot shows that margin and gross are directly proportional for all the top 7 directors. So a high margin implies a high gross collection.

[Figure: Margin vs. Gross for Top 7 Directors. Linear trendlines of Gross (millions, 0 to 400) against Margin (millions, -100 to 300), one each for Clint Eastwood, David Fincher, Martin Scorsese, Peter Jackson, Richard Linklater, Robert Zemeckis and Steven Spielberg.]

Q4. Why does a high gross collection lead to a high IMDB Score for these directors?

• A high gross collection implies a high footfall for the movies. This makes the movies more popular, leading to high IMDB Scores.
Insights

E. Budget Analysis: Explore the relationship between movie budgets and their financial success.
Task: Analyze the correlation between movie budgets and gross earnings, and identify the movies with the highest profit margin.
Result:

• The correlation table shows that the correlation between Gross and Budget is positive and more than 0.5. That is, as the budget of a movie increases, its gross collection also tends to increase.

• The plot shows the relationship between Gross and Budget. The overall trendline has a slope close to 1.

[Figure: Gross vs. Budget. Scatter plot of Gross (millions, 0 to 800) against Budget (millions, 0 to 350), with loss-making and profit-making movies marked separately and linear trendlines for Loss, Profit and Overall.]
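The correlation, the trendline slope and the margin ranking for this task can be sketched in pandas and NumPy as below. The figures in this report came from Excel, so this is only an illustrative equivalent, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample in millions (not real data).
movies = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0],
    "gross": [5.0, 30.0, 25.0, 60.0],
})

# Pearson correlation between budget and gross.
correlation = movies["budget"].corr(movies["gross"])

# Slope and intercept of the least-squares trendline gross ~ budget.
slope, intercept = np.polyfit(movies["budget"], movies["gross"], deg=1)

# Profit margin, with the rows ordered from highest to lowest margin.
movies["margin"] = movies["gross"] - movies["budget"]
by_margin = movies.sort_values("margin", ascending=False)
```
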
Q1. Why is the overall trendline's slope close to one and not greater than one?

• Because of the presence of movies with negative margins, that is, movies whose gross is lower than their budget.


Q2. Why do movies have negative margins?
• Many movies with negative margins typically have high IMDB Scores.

[Figure: Margin vs. IMDB Score Count. Margin (millions, -600 to 600) against IMDB Score (0 to 10).]

Q3. Why do many movies with negative margins have high IMDB Scores?
• The top 5 genres for movies with negative margins and IMDB Scores of 5 or more are Biography, War, History, Sport and Documentary. These genres have a domain-specific target audience. So the IMDB Score can be high, as critics might like the movies, but the footfall will be low, leading to a lower gross collection and a negative margin.

[Figure: Movie Genres with Negative Margins and 5 or more IMDB Score. Bar chart of Percentage of Entire Data (0% to 100%) per genre.]

Q4. Why do some movies with negative margins have low IMDB Scores?
• The top 5 genres for movies with negative margins and IMDB Scores of less than 5 are Sci-Fi, Fantasy, Horror, Family and Musical. Although the percentage of such films is very low, these are popular genres and ideally they should not have negative margins.

[Figure: Movie Genres with Negative Margins and 5 or more IMDB Score. Bar chart of Percentage of Total (0% to 100%) per genre.]

Q5. Why do movies of popular genres have low IMDB Scores and negative margins?
• Movies with an IMDB Score of less than 5 have casts whose total facebook likes are very low. That is, their casts are not popular, so the movies became less popular.

[Figure: IMDB Score wise average total facebook likes of cast of Movies. FB Likes of Cast (thousands, 0 to 800) against IMDB Score.]

Result

• Through this project, I was able to understand the importance of Data Analytics in movie analysis, as it provides valuable insights such as the director’s relationship with IMDB Score, the genre’s relationship with IMDB Score, the budget’s relationship with IMDB Score etc., which help in making data-driven decisions.

• In this project, I was able to get insights into various questions, such as the relationship between budget and gross collection and the relationship between genres and ratings. I also gained experience in data pre-processing, including data cleaning, handling outliers and feature engineering. I can communicate the insights to relevant stakeholders as per their requirements, so that they can make proper data-driven decisions.

• This project has also helped me understand the various functions of MS-Excel and Python and how they work.



Thank You

Contact: [email protected]
