IMDB Movie Analysis
by Biswajeet Nayak
Project Description
IMDb registered users can cast a vote (from 1 to 10) on every released
title in the database. Individual votes are aggregated and summarized
as a single IMDb rating, which reflects a movie's popularity with the
general public.
A single CSV file, IMDB_Movies.csv, has been used in this project for
the analysis.
The libraries used for data analysis and visualization in this project
are NumPy, Pandas and Matplotlib.
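The original IMDB_Movies.csv is not bundled with this report. For readers who want to try the snippets below without the file, a tiny hypothetical stand-in frame (invented values, only a few of the real columns — an assumption, not actual IMDb data) can be built like this:

```python
import pandas as pd

# Hypothetical stand-in for IMDB_Movies.csv -- invented values,
# using a subset of the columns referenced later in this report
movies = pd.DataFrame({
    'movie_title':   ['A', 'B', 'C'],
    'director_name': ['X', 'Y', None],
    'language':      ['English', None, 'French'],
    'gross':         [100.0, None, 30.0],
    'budget':        [40.0, 20.0, None],
    'imdb_score':    [7.1, 8.2, 6.5],
    'title_year':    [1994, 2008, 2015],
})
print(movies.shape)
```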
Your tasks:
Find out the top 10 directors for whom the mean of imdb_score is the
highest and store them in top10director. In case of a tie in IMDb
score between two directors, sort them alphabetically.
Append the rows of the per-actor dataframes and store them in a new
dataframe named combined.
Approach and Tech Used
For this project I used Jupyter Notebook (from the Anaconda
distribution) to run the analysis code and generate the charts.
Dataset
First, we imported all the libraries needed and loaded the dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
movies = pd.read_csv('IMDB_Movies.csv')
OrgData = movies.copy()  # keep an untouched copy of the original data
Output: a preview of the loaded dataframe.
Cleaning the data
We find out the number of null values in the dataset:
movies.isnull().sum(axis=0).sort_values(ascending=False)
gross 884
budget 492
aspect_ratio 329
content_rating 303
plot_keywords 153
title_year 108
director_name 104
director_facebook_likes 104
num_critic_for_reviews 50
actor_3_name 23
actor_3_facebook_likes 23
num_user_for_reviews 20
color 19
duration 15
facenumber_in_poster 13
actor_2_name 13
actor_2_facebook_likes 13
language 12
actor_1_name 7
actor_1_facebook_likes 7
country 5
movie_facebook_likes 0
genres 0
movie_title 0
num_voted_users 0
movie_imdb_link 0
imdb_score 0
cast_total_facebook_likes 0
dtype: int64
movies.isnull().sum(axis=1).sort_values(ascending=False)
279 15
4 13
4945 11
2241 11
2342 10
..
2708 0
2707 0
2706 0
2705 0
0 0
Length: 5043, dtype: int64
movies.isnull().sum(axis=0).sort_values(ascending=False)/len(movies) * 100
gross 17.529248
budget 9.756098
aspect_ratio 6.523895
content_rating 6.008328
plot_keywords 3.033908
title_year 2.141582
director_name 2.062265
director_facebook_likes 2.062265
num_critic_for_reviews 0.991473
actor_3_name 0.456078
actor_3_facebook_likes 0.456078
num_user_for_reviews 0.396589
color 0.376760
duration 0.297442
facenumber_in_poster 0.257783
actor_2_name 0.257783
actor_2_facebook_likes 0.257783
language 0.237954
actor_1_name 0.138806
actor_1_facebook_likes 0.138806
country 0.099147
movie_facebook_likes 0.000000
genres 0.000000
movie_title 0.000000
num_voted_users 0.000000
movie_imdb_link 0.000000
imdb_score 0.000000
cast_total_facebook_likes 0.000000
dtype: float64
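The per-column percentage above is simply the null count divided by the number of rows. A minimal, self-contained illustration on a toy frame (invented values, not the real dataset):

```python
import pandas as pd

# Toy frame: 4 rows, 'gross' missing twice, 'budget' missing once
df = pd.DataFrame({'gross':  [1.0, None, None, 4.0],
                   'budget': [1.0, 2.0, None, 4.0]})

# Same expression as in the report, on the toy frame
null_pct = df.isnull().sum(axis=0).sort_values(ascending=False) / len(df) * 100
print(null_pct)
```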
There are many columns which are not that important for our study so
we will drop those columns:
movies = movies.drop(['color', 'director_facebook_likes',
                      'actor_1_facebook_likes', 'actor_2_facebook_likes',
                      'actor_3_facebook_likes', 'actor_2_name',
                      'cast_total_facebook_likes', 'actor_3_name',
                      'duration', 'facenumber_in_poster', 'content_rating',
                      'country', 'movie_imdb_link', 'aspect_ratio',
                      'plot_keywords'], axis=1)
After that we will drop rows, using the columns that still have high
Null percentages:
round(movies.isnull().sum(axis=0).sort_values(ascending=False)/len(movies)*100, 2)
gross 17.53
budget 9.76
title_year 2.14
director_name 2.06
num_critic_for_reviews 0.99
num_user_for_reviews 0.40
language 0.24
actor_1_name 0.14
movie_facebook_likes 0.00
imdb_score 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
dtype: float64
movies=movies[movies['gross'].notnull()]
movies=movies[movies['budget'].notnull()]
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100, 2)
language 0.08
actor_1_name 0.08
num_critic_for_reviews 0.03
movie_facebook_likes 0.00
imdb_score 0.00
title_year 0.00
budget 0.00
num_user_for_reviews 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
gross 0.00
director_name 0.00
dtype: float64
Some of the rows might have more than five NaN values. Such rows
aren't of much use for the analysis and hence should be removed:
(movies.isnull().sum(axis=1) > 5).sum()
movies = movies[movies.isnull().sum(axis=1) <= 5]
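The row filter works because a boolean mask aligns with the dataframe on its index, so each row is kept or dropped by its own NaN count. A small sketch with toy data:

```python
import numpy as np
import pandas as pd

# Toy frame: 3 rows x 7 columns of ones
df = pd.DataFrame(np.ones((3, 7)))
df.iloc[0, 0:6] = np.nan   # row 0 has 6 NaNs -> should be dropped
df.iloc[1, 0:2] = np.nan   # row 1 has 2 NaNs -> should be kept

kept = df[df.isnull().sum(axis=1) <= 5]
print(kept.index.tolist())
```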
After that we will fill in the remaining missing values. First, a check
of what is still missing:
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100, 2)
language 0.08
actor_1_name 0.08
num_critic_for_reviews 0.03
movie_facebook_likes 0.00
imdb_score 0.00
title_year 0.00
budget 0.00
num_user_for_reviews 0.00
num_voted_users 0.00
movie_title 0.00
genres 0.00
gross 0.00
director_name 0.00
dtype: float64
We can see that the language column still has some NaN values; we will
replace them with 'English', since it is by far the most frequent value:
movies.groupby('language').language.count().sort_values(ascending=False)
language
English 3707
French 37
Spanish 26
Mandarin 15
German 13
Japanese 12
Hindi 10
Cantonese 8
Italian 7
Korean 5
Portuguese 5
Norwegian 4
Hebrew 3
Persian 3
Dutch 3
Danish 3
Thai 3
Dari 2
Indonesian 2
Aboriginal 2
Icelandic 1
Hungarian 1
Arabic 1
Aramaic 1
Bosnian 1
Telugu 1
Czech 1
Swedish 1
Russian 1
Romanian 1
Dzongkha 1
None 1
Filipino 1
Mongolian 1
Maya 1
Kazakh 1
Vietnamese 1
Zulu 1
Name: language, dtype: int64
movies.language = movies.language.fillna('English')
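Hard-coding 'English' works here, but the same fill can be derived from the data itself by taking the most frequent value (the mode). A small sketch on toy values:

```python
import pandas as pd

lang = pd.Series(['English', 'English', 'French', None, None])

# mode()[0] is the most frequent value -- 'English' in this toy series
filled = lang.fillna(lang.mode()[0])
print(filled.tolist())
```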
Movies with highest profit:
We will convert the budget and gross columns from dollars to millions
of dollars, and then compute the profit for each movie:
movies['budget']=movies['budget']/1000000
movies['gross']=movies['gross']/1000000
movies['profit']=movies['gross']-movies['budget']
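The profit column is a plain vectorized subtraction, applied row by row. With toy figures (invented values in million $, not real movies):

```python
import pandas as pd

df = pd.DataFrame({'gross':  [760.5, 658.7],   # toy values, million $
                   'budget': [237.0, 200.0]})
df['profit'] = df['gross'] - df['budget']
print(df['profit'].tolist())
```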
After that I sorted the movies by profit in descending order:
movies.sort_values(by='profit', ascending=False)
Before picking the winners I removed duplicate rows, and then found
the top 10 movies that made the most profit:
movies.drop_duplicates(keep='first', inplace=True)
top10 = movies.sort_values(by='profit', ascending=False).head(10)
IMDb Top 250:
I created a new dataframe IMDb_Top_250 holding the 250 movies with the
highest IMDb ratings (considering only movies with more than 25,000
votes), and gave each a rank:
IMDb_Top_250 = movies[movies['num_voted_users'] > 25000].sort_values(by='imdb_score', ascending=False).head(250)
IMDb_Top_250['Rank'] = IMDb_Top_250['imdb_score'].rank(method='first', ascending=False)
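The choice of rank(method='first') matters: with ties in imdb_score it assigns distinct consecutive ranks in order of appearance, so no two movies share a rank. A toy illustration:

```python
import pandas as pd

scores = pd.Series([9.2, 9.2, 9.0])   # two tied scores
# method='first' breaks the tie by position; ascending=False ranks high scores first
ranks = scores.rank(method='first', ascending=False)
print(ranks.tolist())
```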
After that I listed the movies in the Top 250 that are not in the
English language:
IMDb_Top_250[IMDb_Top_250['language'] != 'English']
Best directors:
I computed top10director, the ten directors with the highest mean IMDb
score across their movies:
top10director = movies.groupby('director_name').imdb_score.mean().sort_values(ascending=False).head(10)
director_name
Charles Chaplin 8.600000
Tony Kaye 8.600000
Ron Fricke 8.500000
Damien Chazelle 8.500000
Majid Majidi 8.500000
Alfred Hitchcock 8.500000
Sergio Leone 8.433333
Christopher Nolan 8.425000
Asghar Farhadi 8.400000
Richard Marquand 8.400000
Name: imdb_score, dtype: float64
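The task also asked that tied directors be sorted alphabetically, which a plain sort_values on the score alone does not guarantee. One way to enforce the tie-break (a sketch, not the notebook's original code) is to sort on both the score and the name:

```python
import pandas as pd

# Toy data: two directors tied at a mean score of 8.6
df = pd.DataFrame({
    'director_name': ['Tony Kaye', 'Tony Kaye', 'Charles Chaplin'],
    'imdb_score':    [8.6, 8.6, 8.6],
})

top10director = (df.groupby('director_name', as_index=False)['imdb_score'].mean()
                   .sort_values(['imdb_score', 'director_name'],
                                ascending=[False, True])   # score desc, name asc
                   .head(10))
print(top10director['director_name'].tolist())
```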
Popular genres:
A film might have multiple genres, so here I am just taking the first two
which are the main genres and creating a column for them:
TempGenre = movies.genres.str.split('|', expand=True).iloc[:, 0:2]
TempGenre.columns = ['genre_1', 'genre_2']
TempGenre['genre_2'] = TempGenre['genre_2'].fillna(TempGenre['genre_1'])
movies[['genre_1', 'genre_2']] = TempGenre  # attach the genre columns to movies
After that I found out the top genres by finding the mean of the gross
values:
movies.groupby(['genre_1', 'genre_2']).gross.mean().sort_values(ascending=False).head(5)
genre_1 genre_2
Family Sci-Fi 434.949459
Adventure Sci-Fi 228.627758
Family 118.919540
Animation 116.998550
Action Adventure 109.595465
Name: gross, dtype: float64
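The genre split relies on str.split with expand=True, which pads rows that have fewer genres with None; the fillna then copies genre_1 into a missing genre_2. A toy illustration:

```python
import pandas as pd

genres = pd.Series(['Action|Adventure|Sci-Fi', 'Drama'])   # toy genre strings
g = genres.str.split('|', expand=True).iloc[:, 0:2]
g.columns = ['genre_1', 'genre_2']
g['genre_2'] = g['genre_2'].fillna(g['genre_1'])   # single-genre rows repeat genre_1
print(g.values.tolist())
```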
Critic-favorite and audience-favorite
actors:
There are a lot of actors in the list; I chose three well-known lead
actors among them and compared their mean critic and user review
counts:
# The per-actor frames were not shown in the report; these filters are
# an assumed reconstruction based on the groupby below.
Meryl_Streep = movies[movies['actor_1_name'] == 'Meryl Streep']
Leo_Caprio = movies[movies['actor_1_name'] == 'Leonardo DiCaprio']
Brad_Pitt = movies[movies['actor_1_name'] == 'Brad Pitt']
combined = Meryl_Streep.append([Leo_Caprio, Brad_Pitt])
combined.groupby('actor_1_name')[['num_critic_for_reviews', 'num_user_for_reviews']].mean()
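Note that DataFrame.append was deprecated and removed in pandas 2.0; pd.concat does the same job. A minimal sketch with toy rows (invented numbers):

```python
import pandas as pd

# Toy per-actor frames (invented numbers)
a = pd.DataFrame({'actor_1_name': ['Meryl Streep'], 'num_user_for_reviews': [500.0]})
b = pd.DataFrame({'actor_1_name': ['Brad Pitt'],    'num_user_for_reviews': [900.0]})

combined = pd.concat([a, b], ignore_index=True)   # replaces a.append([b])
means = combined.groupby('actor_1_name')['num_user_for_reviews'].mean()
print(means.to_dict())
```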
After that I computed each movie's decade from its release year:
movies['decade'] = movies['title_year'].apply(lambda x: (x // 10) * 10).astype(np.int64)
movies['decade'] = movies['decade'].astype(str) + 's'
movies = movies.sort_values(['decade'])
df_by_decade = movies.groupby('decade')
df_by_decade['num_voted_users'].sum()
df_by_decade = pd.DataFrame(df_by_decade['num_voted_users'].sum())
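The decade formula is integer division by 10 followed by multiplication by 10, then a string suffix. A quick check on toy years:

```python
import pandas as pd

years = pd.Series([1994.0, 2008.0])   # title_year is float because of earlier NaNs
# (year // 10) * 10 truncates to the start of the decade
decades = years.apply(lambda x: int(x // 10) * 10).astype(str) + 's'
print(decades.tolist())
```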
After that I plotted the bar graph for number of voted users vs
decade:
df_by_decade.plot.bar(figsize=(15, 8), width=0.8, hatch='//', edgecolor='k')
plt.xlabel("Decade")
plt.ylabel("Voted Users")
plt.title("Voted users over Decades")
plt.yscale('log')
plt.show()