0% found this document useful (0 votes)
116 views23 pages

IMDB Movie Analysis: by Biswajeet Nayak

The document analyzes an IMDB movie dataset. It cleans the data by dropping unnecessary columns and rows, fills missing values, and finds movies with highest profits, IMDB top 250 movies, best directors, popular genres, and critic-favorite and audience-favorite actors.

Uploaded by

ACE 2111
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
116 views23 pages

IMDB Movie Analysis: by Biswajeet Nayak

The document analyzes an IMDB movie dataset. It cleans the data by dropping unnecessary columns and rows, fills missing values, and finds movies with highest profits, IMDB top 250 movies, best directors, popular genres, and critic-favorite and audience-favorite actors.

Uploaded by

ACE 2111
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

IMDB Movie

Analysis

by
Biswajeet Nayak
Project Description
IMDb registered users can cast a vote (from 1 to 10) on every released
title in the database. Individual votes are then aggregated and
summarized as a single IMDb rating, this rating describes the
popularity of a movie in the public.
The dataset contains only one csv file IMDB_Movies.csv has been used
in this project for the analysis.
The libraries for data analysis and visualization used in this project are
Numpy & Pandas.

A. Cleaning the data: This is one of the most important step to


perform before moving forward with the analysis. Use your
knowledge learned till now to do this. (Dropping columns,
removing null values, etc.)
Your task: Clean the data

B. Movies with highest profit: Create a new column


called profit which contains the difference of the two
columns: gross and budget. Sort the column using
the profit column as reference. Plot profit (y-axis) vs budget (x-
axis) and observe the outliers using the appropriate chart type.
Your task: Find the movies with the highest profit?

C. Top 250: Create a new column IMDb_Top_250 and store the


top 250 movies with the highest IMDb Rating (corresponding to
the column: imdb_score). Also make sure that for all of these
movies, the num_voted_users is greater than 25,000. Also add
a Rank column containing the values 1 to 250 indicating the ranks
of the corresponding films.

Extract all the movies in the IMDb_Top_250 column which are


not in the English language and store them in a new column
named Top_Foreign_Lang_Film. You can use your own
imagination also!
Your task: Find IMDB Top 250

D. Best Directors: TGroup the column using the director_name

1|P a ge
column.
Find out the top 10 directors for whom the mean of imdb_score is
the highest and store them in a new column top10director. In case
of a tie in IMDb score between two directors, sort them
alphabetically.
Your task: Find the best directors

E. Popular Genres: Perform this step using the knowledge gained


while performing previous steps.
Your task: Find popular genres

F. Charts: Create three new columns


namely, Meryl_Streep, Leo_Caprio, and Brad_Pitt which contain
the movies in which the actors: 'Meryl Streep', 'Leonardo
DiCaprio', and 'Brad Pitt' are the lead actors. Use only
the actor_1_name column for extraction. Also, make sure that
you use the names 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad
Pitt' for the said extraction.

Append the rows of all these columns and store them in a new
column named Combined.

Group the combined column using the actor_1_name column.

Find the mean of


the num_critic_for_reviews and num_users_for_review and
identify the actors which have the highest mean.

Observe the change in number of voted users over decades using a


bar chart. Create a column called decade which represents the
decade to which every movie belongs to. For example,
the title_year year 1923, 1925 should be stored as 1920s. Sort the
column based on the column decade, group it by decade and find
the sum of users voted in each decade. Store this in a new data
frame called df_by_decade.

Your task: Find the critic-favorite and audience-favorite actors

2|P a ge
Approach and Tech Used
For this project I used Jupyter Notebook (Anaconda) to run my
queries and charts.

The notebook extends the console-based approach to interactive


computing in a qualitatively new direction, providing a web-based
application suitable for capturing the whole computation process:
developing, documenting, and executing code, as well as
communicating the results. The Jupyter notebook combines two
components:

A web application: a browser-based tool for interactive authoring


of documents which combine explanatory text, mathematics,
computations and their rich media output.

Notebook documents: a representation of all content visible in


the web application, including inputs and outputs of the
computations, explanatory text, mathematics, images, and rich
media representations of objects.

This project helped me in understanding the tables at a much-


detailed manner and helped to improve my strength in extracting
data from tables in a more efficient manner.

3|P a ge
Dataset
First, we imported all the libraries needed:

Next, we read the dataset file given to us:

movies = pd.read_csv('IMDB_Movies.csv')
OrgData = movies

Output:

4|P a ge
Cleaning the data
We find out the number of null values in the dataset:

For column-wise null count:

movies.isnull().sum(axis=0).sort_values(ascending=False)
gross 884
budget 492
aspect_ratio 329
content_rating 303
plot_keywords 153
title_year 108
director_name 104
director_facebook_likes 104
num_critic_for_reviews 50
actor_3_name 23
actor_3_facebook_likes 23
num_user_for_reviews 20
color 19
duration 15
facenumber_in_poster 13
actor_2_name 13
actor_2_facebook_likes 13
language 12
actor_1_name 7
actor_1_facebook_likes 7
country 5
movie_facebook_likes 0
genres 0
movie_title 0
num_voted_users 0
movie_imdb_link 0
imdb_score 0
cast_total_facebook_likes 0
dtype: int64

For row-wise null count:

movies.isnull().sum(axis=1).sort_values(ascending=False)
279 15
4 13
4945 11
2241 11
2342 10
..
2708 0
2707 0
5|P a ge
2706 0
2705 0
0 0
Length: 5043, dtype: int64

For column-wise null percentages:

movies.isnull().sum(axis=0).sort_values(ascending=False)/len(movies
) * 100
gross 17.529248
budget 9.756098
aspect_ratio 6.523895
content_rating 6.008328
plot_keywords 3.033908
title_year 2.141582
director_name 2.062265
director_facebook_likes 2.062265
num_critic_for_reviews 0.991473
actor_3_name 0.456078
actor_3_facebook_likes 0.456078
num_user_for_reviews 0.396589
color 0.376760
duration 0.297442
facenumber_in_poster 0.257783
actor_2_name 0.257783
actor_2_facebook_likes 0.257783
language 0.237954
actor_1_name 0.138806
actor_1_facebook_likes 0.138806
country 0.099147
movie_facebook_likes 0.000000
genres 0.000000
movie_title 0.000000
num_voted_users 0.000000
movie_imdb_link 0.000000
imdb_score 0.000000
cast_total_facebook_likes 0.000000
dtype: float64

There are many columns which are not that important for our study so
we will drop those columns:

movies=movies.drop(['color','director_facebook_likes','actor_1_facebo
ok_likes','actor_2_facebook_likes','actor_3_facebook_likes','actor_2_
name','cast_total_facebook_likes','actor_3_name','duration','facenum
ber_in_poster','content_rating','country','movie_imdb_link','aspect_r
atio','plot_keywords'],axis=1)

6|P a ge
After that we will drop unnecessary rows using columns with high Null
percentages.

For dropping the rows:

round(movies.isnull().sum(axis=0).sort_values(ascending=False)/len(
movies)*100,2)
gross 17.53
budget 9.76
title_year 2.14
director_name 2.06
num_critic_for_reviews 0.99
num_user_for_reviews 0.40
language 0.24
actor_1_name 0.14
movie_facebook_likes 0.00
imdb_score 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
dtype: float64

7|P a ge
movies=movies[movies['gross'].notnull()]
movies=movies[movies['budget'].notnull()]
round(movies.isnull().sum().sort_values(ascending=False)/len(movie
s)*100,2)
language 0.08
actor_1_name 0.08
num_critic_for_reviews 0.03
movie_facebook_likes 0.00
imdb_score 0.00
title_year 0.00
budget 0.00
num_user_for_reviews 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
gross 0.00
director_name 0.00
dtype: float64

Some of the rows might have greater than five NaN values. Such rows
aren't of much use for the analysis and hence, should be removed.

For dropping the rows:

(movies.isnull().sum(axis=1).sort_values(ascending=False)>5).sum()

movies=movies[movies.isnull().sum(axis=1).sort_values(ascending=F
alse) <= 5]

8|P a ge
After that we will fill the missing NaN values:

round(movies.isnull().sum().sort_values(ascending=False)/len(movie
s)*100,2)
language 0.08
actor_1_name 0.08
num_critic_for_reviews 0.03
movie_facebook_likes 0.00
imdb_score 0.00
title_year 0.00
budget 0.00
num_user_for_reviews 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
gross 0.00
director_name 0.00
dtype: float64

We can see that the Language column has some NaN values, we will
replace that with English since its in the Max.

movies.groupby('language').language.count().sort_values(ascending=F
alse)
language
English 3707
French 37
Spanish 26
Mandarin 15
German 13
Japanese 12
Hindi 10
Cantonese 8
Italian 7
Korean 5
Portuguese 5
Norwegian 4
Hebrew 3
Persian 3
Dutch 3
Danish 3
Thai 3
Dari 2
Indonesian 2
Aboriginal 2
Icelandic 1
Hungarian 1
Arabic 1
Aramaic 1
Bosnian 1
Telugu 1

9|P a ge
Czech 1
Swedish 1
Russian 1
Romanian 1
Dzongkha 1
None 1
Filipino 1
Mongolian 1
Maya 1
Kazakh 1
Vietnamese 1
Zulu 1
Name: language, dtype: int64

movies.language = movies.language.fillna('English')

10 | P a g e
Movies with highest profit:
We will change the unit of the budget and gross columns
from $ to million $:

movies['budget']=movies['budget']/1000000
movies['gross']=movies['gross']/1000000

I created a new column named ‘profit’.

movies['profit']=movies['gross']-movies['budget']

11 | P a g e
After that I sorted it in Descending order:

movies.sort_values(by='profit',ascending=False)

12 | P a g e
And then I found out the top 10 movies that made the most profit:

top10 = movies.sort_values(by='profit',ascending=False).head(10)

After we found the top 10 profiting movies , we can notice a


duplicate value. So,it seems like the dataframe has duplicate values as
well.

Hence we drop the duplicates and repeat the steps:

movies.drop_duplicates(keep='first',inplace=True)

13 | P a g e
And then I found out the top 10 movies that made the most profit:

top10 = movies.sort_values(by='profit',ascending=False).head(10)

14 | P a g e
IMDb Top 250:
I created a new dataframe IMDb_Top_250 to show the top 250 movies
with the highest IMDb Ratings.

IMDb_Top_250=movies[movies['num_voted_users']>25000].sort_v
alues(by='imdb_score',ascending=False).head(250)

After that I sorted it in Descending order:

IMDb_Top_250['Rank']=IMDb_Top_250['imdb_score'].rank(method
='first',ascending=False)

15 | P a g e
After that I found the Top250 movies in the “English” language:
IMDb_Top_250[IMDb_Top_250['language']!='English']

16 | P a g e
Best directors:
I created a new dataframe top10director to show the best directors
with the highest IMDb Ratings.

top10director=movies.groupby('director_name').imdb_score.mean().s
ort_values(ascending=False).head(10)
director_name
Charles Chaplin 8.600000
Tony Kaye 8.600000
Ron Fricke 8.500000
Damien Chazelle 8.500000
Majid Majidi 8.500000
Alfred Hitchcock 8.500000
Sergio Leone 8.433333
Christopher Nolan 8.425000
Asghar Farhadi 8.400000
Richard Marquand 8.400000
Name: imdb_score, dtype: float64

17 | P a g e
Popular genres:
A film might have multiple genres, so here I am just taking the first two
which are the main genres and creating a column for them:

TempGenre=movies.genres.str.split('|',expand=True).iloc[:,0:2]
TempGenre.columns=['genre_1','genre_2']
TempGenre.genre_2.fillna(TempGenre.genre_1,inplace=True)

After that I found out the top genres by finding the mean of the gross
values:

movies.groupby(['genre_1','genre_2']).gross.mean().sort_values(ascen
ding=False).head(5)
genre_1 genre_2
Family Sci-Fi 434.949459
Adventure Sci-Fi 228.627758
Family 118.919540
Animation 116.998550
Action Adventure 109.595465
Name: gross, dtype: float64

18 | P a g e
Critic-favorite and audience-favorite
actors:
There are a lot of actors in the list, I chose 4 top actors among them to
check their ratings:

Meryl_Streep = movies[movies['actor_1_name']=='Meryl Streep']


Leo_Caprio = movies[movies['actor_1_name']=='Leonardo DiCaprio']
Brad_Pitt = movies[movies['actor_1_name']=='Brad Pitt']

combined=Meryl_Streep.append([Leo_Caprio,Brad_Pitt])

combined.groupby('actor_1_name')[['num_critic_for_reviews','num_
user_for_reviews']].mean()

19 | P a g e
After ther I calculated the decades:

movies['decade']=movies['title_year'].apply(lambda x: (x//10)
*10).astype(np.int64)
movies['decade']=movies['decade'].astype(str)+'s'
movies=movies.sort_values(['decade'])

Then I created the data frame df_by_decade:

df_by_decade=movies.groupby('decade')
df_by_decade['num_voted_users'].sum()
df_by_decade=pd.DataFrame(df_by_decade['num_voted_users'].su
m())
20 | P a g e
After that I plotted the bar graph for number of voted users vs
decade:

df_by_decade.plot.bar(figsize=(15,8),width=0.8,hatch="//",edgecolor
='k')
plt.xlabel("Decade")
plt.ylabel("Voted Users")
plt.title("Voted users over Decades")
plt.yscale('log')
plt.show()

21 | P a g e
22 | P a g e

You might also like