COM 428 - Jupyter Notebook2 - 101223
GROUP MEMBERS
1. JAMES KIMONDO P107/1111G/18
2. AGNES MUTISYA P107/1180G/19
3. KIPKOECH ENOCK P107/1202G/19
4. KIPNGETICH TALAM P107/1138G/18
5. CHEPNGENO MILLICENT P107/1215G/19
6. KIOKO TIMOTHY P107/1196G/19
7. POLINE SILAS P107/1191G/19
8. JEDIDAH NDERITU P107/1167G/19
9. SIMON MUTIO P107/1149G/19
10. IVY MBOGO P107/1209G/17
Introduction
Dataset Description
This dataset supports a study of the factors associated with the revenue that movies
generate. It contains information on about 10,000 movies collected from The Movie Database
(TMDb), including user ratings and revenue. Variables include genres, cast, budget, release
date, and revenue, among others. The study aims to identify the characteristics of
high-revenue movies and to track the most popular genres over the years in order to analyze
trends in movie preferences. The dataset thus makes it possible to analyze and model the
factors associated with revenue and genre preferences.
Problem Statement
The aim of this project is to explore The Movie Database (TMDb) dataset,
identify the most popular genres from year to year, and determine the
properties associated with movies that have high revenues.
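The first goal, finding the most popular genre in each year, can be sketched as follows. This is a minimal illustration on made-up rows, assuming that the genres column holds pipe-separated genre names (as the cast and keywords values elsewhere in this notebook suggest); the names movies, exploded, counts and top_genre_per_year are hypothetical and do not come from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the TMDb data.
movies = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015],
    "genres": ["Drama|Thriller", "Drama", "Comedy", "Action|Comedy", "Comedy"],
})

# Split the pipe-separated genres so each (movie, genre) pair gets its own row.
exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Count movies per year and genre.
counts = (
    exploded.groupby(["release_year", "genre"])
    .size()
    .reset_index(name="n_movies")
)

# Keep the genre with the highest count in each year
# (if two genres tie for first place, one of them is kept arbitrarily).
top_genre_per_year = (
    counts.sort_values("n_movies", ascending=False)
    .drop_duplicates("release_year")
    .set_index("release_year")["genre"]
    .sort_index()
)
print(top_genre_per_year)
```

Counting movies is only one possible measure of genre popularity; the same grouping could instead average the popularity column per year and genre.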
Hypothesis
Data Collection
The TMDb dataset was obtained from the online platform Kaggle.com.
In [3]: import pandas as pd
# load the TMDb movie data from the CSV file
TMD = pd.read_csv('tmdb_movies_data.csv')
TMD.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   id                    10866 non-null  int64
 1   imdb_id               10856 non-null  object
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64
 4   revenue               10866 non-null  int64
 5   original_title        10866 non-null  object
 6   cast                  10790 non-null  object
 7   homepage              2936 non-null   object
 8   director              10822 non-null  object
 9   tagline               8042 non-null   object
 10  keywords              9373 non-null   object
 11  overview              10862 non-null  object
 12  runtime               10866 non-null  int64
 13  genres                10843 non-null  object
 14  production_companies  9836 non-null   object
 15  release_date          10866 non-null  object
 16  vote_count            10866 non-null  int64
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
Data Preprocessing
Observation From The Dataset
Data Cleaning
Out[4]: 1
4. Missing values
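The cleaning cells themselves are not shown above. As a rough sketch of the checks usually run at this stage, the following counts fully duplicated rows, drops them, and then counts missing values per column. The toy frame and the names df, n_duplicates, df_clean and missing_per_column are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one duplicated row and one missing genre.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "genres": ["Drama", "Comedy", "Comedy", np.nan],
    "revenue": [100, 0, 0, 50],
})

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(df.duplicated().sum())

# Drop the duplicates, keeping the first occurrence of each row.
df_clean = df.drop_duplicates()

# Missing values per column after removing duplicates.
missing_per_column = df_clean.isnull().sum()

print(n_duplicates)
print(missing_per_column)
```

Whether rows with missing values should then be dropped or kept depends on which columns the later analysis actually uses.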
Descriptive Statistics
In [9]: TMDB.describe()
Univariate Analysis
id          46528                                               287524
revenue     11087569                                            0
cast        Kate Bosworth|Jang Dong-gun|Geoffrey Rush|Dann...   Thomas Dekker|Robert Englund|Cleopatra Coleman...
keywords    assassin|small town|revenge|deception|super speed   phobia|doctor|fear
runtime     100                                                 95
vote_count  74                                                  15
In [11]: find_minmax('revenue')
Out[11]:
          1386        48
id        19995       265208
revenue   2781505847  0
runtime   162         92
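The definition of find_minmax is not shown in the cells above. A helper of this kind could look roughly like the sketch below, which places the row with the largest value and the row with the smallest value of a column side by side. The function name find_minmax_sketch, its signature and the toy data are hypothetical, not the notebook's own code.

```python
import pandas as pd


def find_minmax_sketch(df, column):
    """Return the rows holding the max and min of `column` as two columns."""
    high = df.loc[df[column].idxmax()]  # row with the largest value
    low = df.loc[df[column].idxmin()]   # row with the smallest value
    # Each row becomes a column labelled with its original index.
    return pd.concat([high, low], axis=1)


# Hypothetical toy data.
toy = pd.DataFrame({
    "id": [10, 20, 30],
    "revenue": [50, 0, 900],
    "runtime": [95, 92, 162],
})

result = find_minmax_sketch(toy, "revenue")
print(result)
```

Note that idxmin returns only the first row with the minimum value, so when many movies report a revenue of 0 the "minimum" row is just the first of them.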
In [19]: # make group for each year and count the number of movies in each year
year=TMDB.groupby('release_year').count()['id']
#make group of the data according to their release year and count the tota
TMDB.groupby('release_year').count()['id'].plot(xticks = np.arange(1950,20
#set the figure size and labels
sb.set(rc={'figure.figsize':(12,6)})
plt.title("Year Vs Number Of Movies",fontsize = 16)
plt.xlabel('Release year',fontsize = 14)
plt.ylabel('Number Of Movies',fontsize = 14)
In this dataset, 2014 has the highest number of movie releases.
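The year with the most releases can also be read off directly instead of from the plot. A minimal sketch on made-up years follows; the names years, movies_per_year and busiest_year are hypothetical.

```python
import pandas as pd

# Hypothetical release years for six movies.
years = pd.Series([2013, 2014, 2014, 2015, 2014, 2015], name="release_year")

# Number of movies per year, ordered by year.
movies_per_year = years.value_counts().sort_index()

# Year with the largest number of releases.
busiest_year = movies_per_year.idxmax()
print(busiest_year, movies_per_year.loc[busiest_year])
```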
Drama is the most common genre among the movies released.
In [14]: i = 0
genre_count = []
# build [genre, count] pairs from the per-genre movie counts
for genre in total_genre_movies.index:
    genre_count.append([genre, total_genre_movies.iloc[i]])
    i = i+1
plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(8, 8))
genre_count.sort(key = lambda x:x[1], reverse = True)
labels, sizes = zip(*genre_count)
labels_selected = [n if v > sum(sizes) * 0.01 else '' for n, v in genre_co
ax.pie(sizes, labels=labels_selected,
autopct = lambda x:'{:2.0f}%'.format(x) if x > 1 else '',
shadow=False, startangle=0)
ax.axis('equal')
plt.tight_layout()
From the chart, we can see that Drama is the most common movie genre,
with the highest count of movies. Comedy and Action are also popular
genres with a relatively high count of movies. Other genres such as Horror
and War have a relatively lower count of movies.
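The Series total_genre_movies used for the pie chart is not defined in the cells shown. One way such per-genre counts could be built, assuming pipe-separated values in the genres column, is sketched below on made-up data; the names genres, total_genre_movies_sketch and share are hypothetical.

```python
import pandas as pd

# Hypothetical genres column, including one missing value.
genres = pd.Series(
    ["Drama|Thriller", "Drama", "Comedy|Drama", None, "Action|Comedy"]
)

# Drop missing entries, split on "|", and count each genre once per movie.
total_genre_movies_sketch = (
    genres.dropna().str.split("|").explode().value_counts()
)

# Share of all genre labels, which is what each pie slice represents.
share = total_genre_movies_sketch / total_genre_movies_sketch.sum()
print(total_genre_movies_sketch)
print(share.round(2))
```

Because a movie can carry several genres, the shares are fractions of genre labels rather than fractions of movies.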
Bivariate Analysis
- Movies with higher budgets, higher ratings, and higher popularity tend to have higher
revenues.
In [15]: # restrict to numeric columns; text columns such as cast cannot be correlated
correlation=TMDB.corr(numeric_only=True)
correlation
In [16]: plt.figure(figsize=(12,8))
sb.heatmap(data=correlation, annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.yticks(rotation = 45)
plt.show()
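The heatmap can be hard to scan for one target variable. As a sketch, the correlations with revenue can be pulled out of the matrix and sorted, which separates the positive from the negative relationships. The toy values below are invented, and the names df, corr and revenue_corr are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the text column is ignored by the correlation step.
df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "budget": [1, 2, 3, 4, 5],
    "popularity": [1, 3, 2, 5, 4],
    "runtime": [5, 4, 3, 2, 1],
    "revenue": [2, 4, 6, 8, 10],
})

# Pearson correlation between the numeric columns only.
corr = df.select_dtypes(include="number").corr()

# Correlation of every other variable with revenue, strongest positive first.
revenue_corr = corr["revenue"].drop("revenue").sort_values(ascending=False)
print(revenue_corr)
```

A strong correlation with revenue does not by itself show that a variable causes higher revenue.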
Positive Correlation
Negative Correlation
- How do revenue and popularity vary with budget and runtime, and how does popularity
depend on profit?
- Which movie lengths are most liked by audiences, according to their popularity?
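One way to approach these two questions is sketched below: profit is taken as revenue minus budget, and runtime is grouped into length bands whose mean popularity can be compared. The toy values, the band edges (90 and 120 minutes) and the names df, profit, length and mean_popularity are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "runtime": [85, 95, 110, 125, 150, 170],
    "popularity": [0.4, 0.6, 1.0, 1.4, 2.0, 1.8],
    "budget": [1, 2, 3, 4, 5, 6],
    "revenue": [1, 3, 6, 10, 15, 12],
})

# Profit as the difference between revenue and budget.
df["profit"] = df["revenue"] - df["budget"]

# Group runtimes into bands; the edges are arbitrary and right-inclusive.
df["length"] = pd.cut(
    df["runtime"], bins=[0, 90, 120, 180], labels=["short", "medium", "long"]
)

# Mean popularity per length band.
mean_popularity = df.groupby("length", observed=True)["popularity"].mean()

# Linear association between profit and popularity.
profit_popularity_corr = df["profit"].corr(df["popularity"])

print(mean_popularity)
print(round(profit_popularity_corr, 3))
```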
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.5e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
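The regression that produced these notes is not shown. A large condition number does not have to mean correlated predictors; it can also arise when predictors sit on very different scales, for example a budget in dollars next to a rating. The sketch below illustrates this on simulated, independent predictors by comparing the condition number of the design matrix before and after standardizing. All data and names (budget, popularity, X_raw, X_scaled) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two independent simulated predictors on very different scales.
budget = rng.uniform(1e6, 2e8, n)    # dollar amounts
popularity = rng.uniform(0, 10, n)   # small score

# Design matrix with an intercept column, as in an ordinary least squares fit.
X_raw = np.column_stack([np.ones(n), budget, popularity])
cond_raw = np.linalg.cond(X_raw)

# Standardize the predictors (zero mean, unit variance) and rebuild the matrix.
Z = np.column_stack([budget, popularity])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
X_scaled = np.column_stack([np.ones(n), Z])
cond_scaled = np.linalg.cond(X_scaled)

print(f"raw: {cond_raw:.3g}, standardized: {cond_scaled:.3g}")
```

If the condition number stays large after rescaling, correlated predictors become the more likely explanation (budget and budget_adj, for instance, would be natural candidates to check in this dataset).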