IMDB Movie Analysis
by Biswajeet Nayak
Project Description
IMDb registered users can cast a vote (from 1 to 10) on every released
title in the database. Individual votes are aggregated and summarized
as a single IMDb rating, which reflects a movie's popularity with the
general public.
A single CSV file, IMDB_Movies.csv, has been used in this project for
the analysis.
The libraries used for data analysis and visualization in this project
are NumPy, Pandas and Matplotlib.
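The original IMDB_Movies.csv is not bundled with this report. For readers who want to try the snippets below without the file, a tiny hypothetical stand-in frame (invented values, only a few of the real columns — an assumption, not actual IMDb data) can be built like this:

```python
import pandas as pd

# Hypothetical stand-in for IMDB_Movies.csv -- invented values,
# using a subset of the columns referenced later in this report
movies = pd.DataFrame({
    'movie_title':   ['A', 'B', 'C'],
    'director_name': ['X', 'Y', None],
    'language':      ['English', None, 'French'],
    'gross':         [100.0, None, 30.0],
    'budget':        [40.0, 20.0, None],
    'imdb_score':    [7.1, 8.2, 6.5],
    'title_year':    [1994, 2008, 2015],
})
print(movies.shape)
```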
Your tasks:
Find out the top 10 directors for whom the mean of imdb_score is the
highest and store them in top10director. In case of a tie in IMDb
score between two directors, sort them alphabetically.
Append the rows of the per-actor dataframes and store them in a new
dataframe named combined.
Approach and Tech Used
For this project I used Jupyter Notebook (from the Anaconda
distribution) to run the analysis code and generate the charts.
Dataset
First, we imported all the libraries needed and loaded the dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
movies = pd.read_csv('IMDB_Movies.csv')
OrgData = movies.copy()  # keep an untouched copy of the original data
Output: a preview of the loaded dataframe.
Cleaning the data
We find out the number of null values in the dataset:
movies.isnull().sum(axis=0).sort_values(ascending=False)
gross 884
budget 492
aspect_ratio 329
content_rating 303
plot_keywords 153
title_year 108
director_name 104
director_facebook_likes 104
num_critic_for_reviews 50
actor_3_name 23
actor_3_facebook_likes 23
num_user_for_reviews 20
color 19
duration 15
facenumber_in_poster 13
actor_2_name 13
actor_2_facebook_likes 13
language 12
actor_1_name 7
actor_1_facebook_likes 7
country 5
movie_facebook_likes 0
genres 0
movie_title 0
num_voted_users 0
movie_imdb_link 0
imdb_score 0
cast_total_facebook_likes 0
dtype: int64
movies.isnull().sum(axis=1).sort_values(ascending=False)
279 15
4 13
4945 11
2241 11
2342 10
..
2708 0
2707 0
2706 0
2705 0
0 0
Length: 5043, dtype: int64
movies.isnull().sum(axis=0).sort_values(ascending=False)/len(movies) * 100
gross 17.529248
budget 9.756098
aspect_ratio 6.523895
content_rating 6.008328
plot_keywords 3.033908
title_year 2.141582
director_name 2.062265
director_facebook_likes 2.062265
num_critic_for_reviews 0.991473
actor_3_name 0.456078
actor_3_facebook_likes 0.456078
num_user_for_reviews 0.396589
color 0.376760
duration 0.297442
facenumber_in_poster 0.257783
actor_2_name 0.257783
actor_2_facebook_likes 0.257783
language 0.237954
actor_1_name 0.138806
actor_1_facebook_likes 0.138806
country 0.099147
movie_facebook_likes 0.000000
genres 0.000000
movie_title 0.000000
num_voted_users 0.000000
movie_imdb_link 0.000000
imdb_score 0.000000
cast_total_facebook_likes 0.000000
dtype: float64
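The per-column percentage above is simply the null count divided by the number of rows. A minimal, self-contained illustration on a toy frame (invented values, not the real dataset):

```python
import pandas as pd

# Toy frame: 4 rows, 'gross' missing twice, 'budget' missing once
df = pd.DataFrame({'gross':  [1.0, None, None, 4.0],
                   'budget': [1.0, 2.0, None, 4.0]})

# Same expression as in the report, on the toy frame
null_pct = df.isnull().sum(axis=0).sort_values(ascending=False) / len(df) * 100
print(null_pct)
```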
There are many columns which are not that important for our study so
we will drop those columns:
movies = movies.drop(['color', 'director_facebook_likes',
                      'actor_1_facebook_likes', 'actor_2_facebook_likes',
                      'actor_3_facebook_likes', 'actor_2_name',
                      'cast_total_facebook_likes', 'actor_3_name',
                      'duration', 'facenumber_in_poster', 'content_rating',
                      'country', 'movie_imdb_link', 'aspect_ratio',
                      'plot_keywords'], axis=1)
After that we will drop rows, using the columns that still have high
Null percentages:
round(movies.isnull().sum(axis=0).sort_values(ascending=False)/len(movies)*100, 2)
gross 17.53
budget 9.76
title_year 2.14
director_name 2.06
num_critic_for_reviews 0.99
num_user_for_reviews 0.40
language 0.24
actor_1_name 0.14
movie_facebook_likes 0.00
imdb_score 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
dtype: float64
movies=movies[movies['gross'].notnull()]
movies=movies[movies['budget'].notnull()]
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100, 2)
language 0.08
actor_1_name 0.08
num_critic_for_reviews 0.03
movie_facebook_likes 0.00
imdb_score 0.00
title_year 0.00
budget 0.00
num_user_for_reviews 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
gross 0.00
director_name 0.00
dtype: float64
Some of the rows might have more than five NaN values. Such rows
aren't of much use for the analysis and hence should be removed:
(movies.isnull().sum(axis=1) > 5).sum()
movies = movies[movies.isnull().sum(axis=1) <= 5]
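The row filter works because a boolean mask aligns with the dataframe on its index, so each row is kept or dropped by its own NaN count. A small sketch with toy data:

```python
import numpy as np
import pandas as pd

# Toy frame: 3 rows x 7 columns of ones
df = pd.DataFrame(np.ones((3, 7)))
df.iloc[0, 0:6] = np.nan   # row 0 has 6 NaNs -> should be dropped
df.iloc[1, 0:2] = np.nan   # row 1 has 2 NaNs -> should be kept

kept = df[df.isnull().sum(axis=1) <= 5]
print(kept.index.tolist())
```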
After that we will fill in the remaining missing values. First, a check
of what is still missing:
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100, 2)
language 0.08
actor_1_name 0.08
num_critic_for_reviews 0.03
movie_facebook_likes 0.00
imdb_score 0.00
title_year 0.00
budget 0.00
num_user_for_reviews 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
gross 0.00
director_name 0.00
dtype: float64
We can see that the language column still has some NaN values; we will
replace them with 'English', since it is by far the most frequent value:
movies.groupby('language').language.count().sort_values(ascending=False)
language
English 3707
French 37
Spanish 26
Mandarin 15
German 13
Japanese 12
Hindi 10
Cantonese 8
Italian 7
Korean 5
Portuguese 5
Norwegian 4
Hebrew 3
Persian 3
Dutch 3
Danish 3
Thai 3
Dari 2
Indonesian 2
Aboriginal 2
Icelandic 1
Hungarian 1
Arabic 1
Aramaic 1
Bosnian 1
Telugu 1
Czech 1
Swedish 1
Russian 1
Romanian 1
Dzongkha 1
None 1
Filipino 1
Mongolian 1
Maya 1
Kazakh 1
Vietnamese 1
Zulu 1
Name: language, dtype: int64
movies.language = movies.language.fillna('English')
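Hard-coding 'English' works here, but the same fill can be derived from the data itself by taking the most frequent value (the mode). A small sketch on toy values:

```python
import pandas as pd

lang = pd.Series(['English', 'English', 'French', None, None])

# mode()[0] is the most frequent value -- 'English' in this toy series
filled = lang.fillna(lang.mode()[0])
print(filled.tolist())
```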
Movies with highest profit:
We will convert the budget and gross columns from dollars to millions
of dollars, and then compute the profit for each movie:
movies['budget']=movies['budget']/1000000
movies['gross']=movies['gross']/1000000
movies['profit']=movies['gross']-movies['budget']
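The profit column is a plain vectorized subtraction, applied row by row. With toy figures (invented values in million $, not real movies):

```python
import pandas as pd

df = pd.DataFrame({'gross':  [760.5, 658.7],   # toy values, million $
                   'budget': [237.0, 200.0]})
df['profit'] = df['gross'] - df['budget']
print(df['profit'].tolist())
```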
After that I sorted the movies by profit in descending order:
movies.sort_values(by='profit', ascending=False)
Before picking the winners I removed duplicate rows, and then found
the top 10 movies that made the most profit:
movies.drop_duplicates(keep='first', inplace=True)
top10 = movies.sort_values(by='profit', ascending=False).head(10)
IMDb Top 250:
I created a new dataframe IMDb_Top_250 holding the 250 movies with the
highest IMDb ratings (considering only movies with more than 25,000
votes), and gave each a rank:
IMDb_Top_250 = movies[movies['num_voted_users'] > 25000].sort_values(by='imdb_score', ascending=False).head(250)
IMDb_Top_250['Rank'] = IMDb_Top_250['imdb_score'].rank(method='first', ascending=False)
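The choice of rank(method='first') matters: with ties in imdb_score it assigns distinct consecutive ranks in order of appearance, so no two movies share a rank. A toy illustration:

```python
import pandas as pd

scores = pd.Series([9.2, 9.2, 9.0])   # two tied scores
# method='first' breaks the tie by position; ascending=False ranks high scores first
ranks = scores.rank(method='first', ascending=False)
print(ranks.tolist())
```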
After that I listed the movies in the Top 250 that are not in the
English language:
IMDb_Top_250[IMDb_Top_250['language'] != 'English']
Best directors:
I computed top10director, the ten directors with the highest mean IMDb
score across their movies:
top10director = movies.groupby('director_name').imdb_score.mean().sort_values(ascending=False).head(10)
director_name
Charles Chaplin 8.600000
Tony Kaye 8.600000
Ron Fricke 8.500000
Damien Chazelle 8.500000
Majid Majidi 8.500000
Alfred Hitchcock 8.500000
Sergio Leone 8.433333
Christopher Nolan 8.425000
Asghar Farhadi 8.400000
Richard Marquand 8.400000
Name: imdb_score, dtype: float64
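The task also asked that tied directors be sorted alphabetically, which a plain sort_values on the score alone does not guarantee. One way to enforce the tie-break (a sketch, not the notebook's original code) is to sort on both the score and the name:

```python
import pandas as pd

# Toy data: two directors tied at a mean score of 8.6
df = pd.DataFrame({
    'director_name': ['Tony Kaye', 'Tony Kaye', 'Charles Chaplin'],
    'imdb_score':    [8.6, 8.6, 8.6],
})

top10director = (df.groupby('director_name', as_index=False)['imdb_score'].mean()
                   .sort_values(['imdb_score', 'director_name'],
                                ascending=[False, True])   # score desc, name asc
                   .head(10))
print(top10director['director_name'].tolist())
```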
Popular genres:
A film might have multiple genres, so here I am just taking the first two
which are the main genres and creating a column for them:
TempGenre = movies.genres.str.split('|', expand=True).iloc[:, 0:2]
TempGenre.columns = ['genre_1', 'genre_2']
TempGenre['genre_2'] = TempGenre['genre_2'].fillna(TempGenre['genre_1'])
movies[['genre_1', 'genre_2']] = TempGenre  # attach the genre columns to movies
After that I found out the top genres by finding the mean of the gross
values:
movies.groupby(['genre_1', 'genre_2']).gross.mean().sort_values(ascending=False).head(5)
genre_1 genre_2
Family Sci-Fi 434.949459
Adventure Sci-Fi 228.627758
Family 118.919540
Animation 116.998550
Action Adventure 109.595465
Name: gross, dtype: float64
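The genre split relies on str.split with expand=True, which pads rows that have fewer genres with None; the fillna then copies genre_1 into a missing genre_2. A toy illustration:

```python
import pandas as pd

genres = pd.Series(['Action|Adventure|Sci-Fi', 'Drama'])   # toy genre strings
g = genres.str.split('|', expand=True).iloc[:, 0:2]
g.columns = ['genre_1', 'genre_2']
g['genre_2'] = g['genre_2'].fillna(g['genre_1'])   # single-genre rows repeat genre_1
print(g.values.tolist())
```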
Critic-favorite and audience-favorite
actors:
There are a lot of actors in the list; I chose three well-known lead
actors among them and compared their mean critic and user review
counts:
# The per-actor frames were not shown in the report; these filters are
# an assumed reconstruction based on the groupby below.
Meryl_Streep = movies[movies['actor_1_name'] == 'Meryl Streep']
Leo_Caprio = movies[movies['actor_1_name'] == 'Leonardo DiCaprio']
Brad_Pitt = movies[movies['actor_1_name'] == 'Brad Pitt']
combined = Meryl_Streep.append([Leo_Caprio, Brad_Pitt])
combined.groupby('actor_1_name')[['num_critic_for_reviews', 'num_user_for_reviews']].mean()
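Note that DataFrame.append was deprecated and removed in pandas 2.0; pd.concat does the same job. A minimal sketch with toy rows (invented numbers):

```python
import pandas as pd

# Toy per-actor frames (invented numbers)
a = pd.DataFrame({'actor_1_name': ['Meryl Streep'], 'num_user_for_reviews': [500.0]})
b = pd.DataFrame({'actor_1_name': ['Brad Pitt'],    'num_user_for_reviews': [900.0]})

combined = pd.concat([a, b], ignore_index=True)   # replaces a.append([b])
means = combined.groupby('actor_1_name')['num_user_for_reviews'].mean()
print(means.to_dict())
```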
After that I computed each movie's decade from its release year:
movies['decade'] = movies['title_year'].apply(lambda x: (x // 10) * 10).astype(np.int64)
movies['decade'] = movies['decade'].astype(str) + 's'
movies = movies.sort_values(['decade'])
df_by_decade = movies.groupby('decade')
df_by_decade['num_voted_users'].sum()
df_by_decade = pd.DataFrame(df_by_decade['num_voted_users'].sum())
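The decade formula is integer division by 10 followed by multiplication by 10, then a string suffix. A quick check on toy years:

```python
import pandas as pd

years = pd.Series([1994.0, 2008.0])   # title_year is float because of earlier NaNs
# (year // 10) * 10 truncates to the start of the decade
decades = years.apply(lambda x: int(x // 10) * 10).astype(str) + 's'
print(decades.tolist())
```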
After that I plotted the bar graph for number of voted users vs
decade:
df_by_decade.plot.bar(figsize=(15, 8), width=0.8, hatch='//', edgecolor='k')
plt.xlabel("Decade")
plt.ylabel("Voted Users")
plt.title("Voted users over Decades")
plt.yscale('log')
plt.show()