
COM 428 - Jupyter Notebook2 - 101223

Uploaded by

Kimondo King

4/3/23, 10:04 PM COM 428 - Jupyter Notebook

COM 428E Data Mining and Warehousing

GROUP MEMBERS
1. JAMES KIMONDO P107/1111G/18
2. AGNES MUTISYA P107/1180G/19
3. KIPKOECH ENOCK P107/1202G/19
4. KIPNGETICH TALAM P107/1138G/18
5. CHEPNGENO MILLICENT P107/1215G/19
6. KIOKO TIMOTHY P107/1196G/19
7. POLINE SILAS P107/1191G/19
8. JEDIDAH NDERITU P107/1167G/19
9. SIMON MUTIO P107/1149G/19
10. IVY MBOGO P107/1209G/17

Project: The Movie Database Analysis


Table of Contents
Introduction
Problem Statement and Formulate Hypothesis
Data Mining Technique
Data Collection
Data Preprocessing
Exploratory Data Analysis
Estimate the Model
Model Interpretation and Conclusion

Introduction

Dataset Description
This project aims to understand the factors associated with the revenue
generated by movies. The dataset contains information about roughly 10,000 movies
collected from The Movie Database (TMDb), including user ratings and revenue. It
includes variables such as genres, cast, budget, release date and revenue, among
others. The analysis aims to identify the properties and characteristics of movies that
generate high revenue, and to explore the most popular genres over the years in order to
analyze trends in movie preferences. The dataset therefore provides an opportunity to
model the factors associated with revenue and genre preferences.

localhost:8888/notebooks/Downloads/COM 428.ipynb# 1/16



In [33]:  #importation of libraries


import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.simplefilter('ignore')

Problem Statement and Formulate Hypothesis

Problem Statement

The aim of this project is to explore The Movie Database (TMDb),
identify the most popular genres from year to year, and determine the
properties associated with movies that have high revenues.

Hypotheses

Movies with higher budgets tend to have higher revenues.
Movies with certain genres tend to have higher revenues.
Movies with higher user ratings tend to have higher revenues.
Movies with certain cast members tend to have higher revenues.

Data Mining Technique


Regression analysis is a statistical technique used in data mining to model the
relationship between a dependent variable and one or more independent variables, and
to make predictions about new data points.
We will use a multiple regression model, a type of regression analysis that uses more
than one independent variable to predict the dependent variable. It extends simple
linear regression by including several predictors at once, with the goal of finding the
combination of predictor variables that best explains the variation in the outcome
variable. Multiple regression suits this project because it can handle data sets with
many variables, indicates which variables contribute most to predicting the outcome,
can control for confounding variables, and allows the statistical significance of each
relationship to be assessed.
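
As a minimal sketch of what a multiple regression fit does, the cell below generates synthetic data in which revenue depends linearly on two predictors and recovers the coefficients by ordinary least squares. The data, the true coefficients (5, 2 and 3) and all variable names are assumptions made up for illustration; they are not taken from the TMDb dataset.

```python
import numpy as np

# Synthetic data (hypothetical values, not from the TMDb file)
rng = np.random.default_rng(42)
n = 200
budget = rng.uniform(1.0, 100.0, n)
popularity = rng.uniform(0.0, 10.0, n)
noise = rng.normal(0.0, 1.0, n)
revenue = 5.0 + 2.0 * budget + 3.0 * popularity + noise

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), budget, popularity])

# Ordinary least squares: minimise ||X @ coef - revenue||^2
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)

# R-squared: share of the variance in revenue explained by the model
predicted = X @ coef
ss_res = np.sum((revenue - predicted) ** 2)
ss_tot = np.sum((revenue - revenue.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("intercept, budget, popularity:", coef)
print("R-squared:", r_squared)
```

The fitted coefficients should land close to the values used to generate the data, which is the sense in which the model "explains" the outcome through its predictors.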

Data Collection
The TMDb dataset was obtained from the online platform Kaggle.com.

The data contains:

Total Rows = 10866
Total Columns = 21

In [3]:  TMD=pd.read_csv('tmdb_movies_data.csv')
TMD.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 10866 non-null int64
1 imdb_id 10856 non-null object
2 popularity 10866 non-null float64
3 budget 10866 non-null int64
4 revenue 10866 non-null int64
5 original_title 10866 non-null object
6 cast 10790 non-null object
7 homepage 2936 non-null object
8 director 10822 non-null object
9 tagline 8042 non-null object
10 keywords 9373 non-null object
11 overview 10862 non-null object
12 runtime 10866 non-null int64
13 genres 10843 non-null object
14 production_companies 9836 non-null object
15 release_date 10866 non-null object
16 vote_count 10866 non-null int64
17 vote_average 10866 non-null float64
18 release_year 10866 non-null int64
19 budget_adj 10866 non-null float64
20 revenue_adj 10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Data Preprocessing
Observation From The Dataset


The currency of the columns 'budget', 'revenue', 'budget_adj' and 'revenue_adj'
has not been given, so for this dataset we will assume that all amounts are in
US dollars.
The dataset contains many movies where the budget or revenue has a
value of '0'.
The dataset has many missing values.
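
The last two observations can be checked with a couple of pandas one-liners. The sketch below uses a small made-up frame (the rows and the name `movies` are assumptions for illustration) rather than the real file.

```python
import pandas as pd

# Toy stand-in for the movie table (hypothetical rows)
movies = pd.DataFrame({
    "budget": [0, 15000000, 0, 30000000],
    "revenue": [0, 24000000, 1000000, 0],
    "genres": ["Drama", None, "Comedy|Action", "Horror"],
})

# Number of rows where budget or revenue is recorded as 0
zero_counts = (movies[["budget", "revenue"]] == 0).sum()

# Number of missing values in each column
missing_counts = movies.isna().sum()

print(zero_counts)
print(missing_counts)
```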

Data Cleaning

Remove duplicate rows from the dataset.
Change the format of the release date to datetime.
Remove the columns that are not needed during the data mining
process.
Replace the missing values.

1. Removal of Duplicate values

In [4]:  #total duplicates in the dataset


sum(TMD.duplicated())

Out[4]: 1

In [5]:  #dropping of the duplicated


TMD.drop_duplicates(subset=None, keep='first', inplace=True)
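
For readers unfamiliar with these two calls, here is a small self-contained sketch of the same count-then-drop pattern on a made-up frame (the rows and names are assumptions for illustration).

```python
import pandas as pd

# Toy frame with one fully duplicated row (hypothetical data)
movies = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "original_title": ["A", "B", "B", "C"],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(movies.duplicated().sum())

# keep='first' retains the first copy and drops the later ones
deduplicated = movies.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```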

2. Release Date Format

In [6]:  #changing from object to datetime format


TMD['release_date'] = pd.to_datetime(TMD['release_date'])
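
One thing worth checking after this conversion: the notebook does not show the raw date format, but if the file stores dates as month/day/two-digit-year strings (an assumption), years such as '66' are parsed as 2066 under the usual `%y` convention. The sketch below shows one possible correction that uses the separate `release_year` column as the reference. The sample rows are made up.

```python
import pandas as pd

# Hypothetical rows in an assumed month/day/two-digit-year format
movies = pd.DataFrame({
    "release_date": ["6/9/15", "5/25/77", "12/15/66"],
    "release_year": [2015, 1977, 1966],
})

parsed = pd.to_datetime(movies["release_date"], format="%m/%d/%y")

# A parsed year later than release_year means the wrong century was chosen
wrong_century = parsed.dt.year > movies["release_year"]

# Shift only the affected dates back by 100 years
fixed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))

print(fixed)
```

Comparing against `release_year` keeps the correction independent of the exact pivot year used by the parser.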

3. Removal of Unrequired Columns

In [7]:  #remove the columns that are not needed for data mining


TMD.drop(['budget_adj','revenue_adj','overview','imdb_id','homepage','tagl

4. Missing values


In [8]:  #replacing with zero


TMDB=TMD.fillna(0)
TMDB.info()

Data columns (total 15 columns):


# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 10865 non-null int64
1 popularity 10865 non-null float64
2 budget 10865 non-null int64
3 revenue 10865 non-null int64
4 original_title 10865 non-null object
5 cast 10865 non-null object
6 director 10865 non-null object
7 keywords 10865 non-null object
8 runtime 10865 non-null int64
9 genres 10865 non-null object
10 production_companies 10865 non-null object
11 release_date 10865 non-null datetime64[ns]
12 vote_count 10865 non-null int64
13 vote_average 10865 non-null float64
14 release_year 10865 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(6), object(6)
memory usage: 1.3+ MB
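
Filling every column with 0 also writes 0 into text columns such as genres and cast, where it later shows up as a category of its own. A possible alternative, sketched below on made-up rows, is to pass `fillna` a per-column mapping. The placeholder label "Unknown" is an assumption, not something used in the notebook.

```python
import pandas as pd

# Toy frame with missing text values (hypothetical data)
movies = pd.DataFrame({
    "genres": ["Drama", None, "Comedy|Action"],
    "cast": ["Actor A|Actor B", "Actor C", None],
    "runtime": [100, 95, 120],
})

# Fill each text column with an explicit label instead of the number 0
filled = movies.fillna({"genres": "Unknown", "cast": "Unknown"})

print(filled)
```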

Exploratory Data Analysis

Descriptive Statistics

In [9]:  TMDB.describe()

Out[9]:
       id             popularity    budget        revenue       runtime       vote_coun
count  10865.000000   10865.000000  1.086500e+04  1.086500e+04  10865.000000  10865.00000
mean   66066.374413   0.646446      1.462429e+07  3.982690e+07  102.071790    217.39963
std    92134.091971   1.000231      3.091428e+07  1.170083e+08  31.382701     575.64462
min    5.000000       0.000065      0.000000e+00  0.000000e+00  0.000000      10.00000
25%    10596.000000   0.207575      0.000000e+00  0.000000e+00  90.000000     17.00000
50%    20662.000000   0.383831      0.000000e+00  0.000000e+00  99.000000     38.00000
75%    75612.000000   0.713857      1.500000e+07  2.400000e+07  111.000000    146.00000
max    417859.000000  32.985763     4.250000e+08  2.781506e+09  900.000000    9767.00000

Univariate Analysis

- Highest And Lowest Movie Budget


In [10]:  def find_minmax(x):
    #use 'idxmin' to find the index of the movie with the lowest value of x
    min_index = TMDB[x].idxmin()
    #use 'idxmax' to find the index of the movie with the highest value of x
    high_index = TMDB[x].idxmax()
    high = pd.DataFrame(TMDB.loc[high_index,:])
    low = pd.DataFrame(TMDB.loc[min_index,:])
    #print the movies with the highest and lowest value of x
    print("Movie Which Has Highest "+ x + " : ",TMDB['original_title'][hig
    print("Movie Which Has Lowest "+ x + " : ",TMDB['original_title'][min
    return pd.concat([high,low],axis = 1)

#treat a budget of 0 as missing before looking for the extremes
TMDB['budget'] = TMDB['budget'].replace(0,np.nan)
find_minmax('budget')

Movie Which Has Highest budget : The Warrior's Way
Movie Which Has Lowest budget : Fear Clinic

Out[10]:
                      2244                                               1151
id                    46528                                              287524
popularity            0.25054                                            0.177102
budget                425000000.0                                        1.0
revenue               11087569                                           0
original_title        The Warrior's Way                                  Fear Clinic
cast                  Kate Bosworth|Jang Dong-gun|Geoffrey Rush|Dann...  Thomas Dekker|Robert Englund|Cleopatra Coleman...
director              Sngmoo Lee                                         Robert Hall
keywords              assassin|small town|revenge|deception|super speed  phobia|doctor|fear
runtime               100                                                95
genres                Adventure|Fantasy|Action|Western|Thriller          Horror
production_companies  Boram Entertainment Inc.                           Dry County Films|Anchor Bay Entertainment|Movi...
release_date          2010-12-02 00:00:00                                2014-10-31 00:00:00
vote_count            74                                                 15
vote_average          6.4                                                4.1
release_year          2010                                               2014

- Highest And Lowest Movie Revenue


In [11]:  find_minmax('revenue')

Movie Which Has Highest revenue : Avatar
Movie Which Has Lowest revenue : Wild Card

Out[11]:
                1386                                              48
id              19995                                             265208
popularity      9.432768                                          2.93234
budget          237000000.0                                       30000000.0
revenue         2781505847                                        0
original_title  Avatar                                            Wild Card
cast            Sam Worthington|Zoe Saldana|Sigourney Weaver|S...  Jason Statham|Michael Angarano|Milo Ventimigli...
director        James Cameron                                     Simon West
keywords        culture clash|future|space war|space colony|so...  gambling|bodyguard|remake
runtime         162                                               92
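
The lowest revenue reported above is 0. Since many rows carry a revenue of 0, the minimum may reflect an unrecorded value rather than a real figure. A hedged sketch of how the extremes could be found while ignoring zeros, mirroring what the notebook already does for budget, is shown below on made-up rows (titles and values are assumptions).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical titles and revenues)
movies = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "revenue": [0, 2500000, 900000000, 0],
})

# Treat 0 as missing; idxmin and idxmax skip NaN by default
revenue = movies["revenue"].replace(0, np.nan)

lowest = movies.loc[revenue.idxmin(), "original_title"]
highest = movies.loc[revenue.idxmax(), "original_title"]

print("Lowest non-zero revenue:", lowest)
print("Highest revenue:", highest)
```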

- Year with the highest number of movie releases

In [19]:  # make group for each year and count the number of movies in each year
year=TMDB.groupby('release_year').count()['id']

#make group of the data according to their release year and count the tota
TMDB.groupby('release_year').count()['id'].plot(xticks = np.arange(1950,20

#set the figure size and labels
sb.set(rc={'figure.figsize':(12,6)})
plt.title("Year Vs Number Of Movies",fontsize = 16)
plt.xlabel('Release year',fontsize = 14)
plt.ylabel('Number Of Movies',fontsize = 14)

Out[19]: Text(0, 0.5, 'Number Of Movies')


2014 has the highest number of movie releases in the dataset.
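
The year with the most releases can also be read off numerically instead of from the plot. The sketch below shows the idea on a made-up frame; the rows are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical ids and years)
movies = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "release_year": [2013, 2014, 2014, 2015, 2014, 2013],
})

# Count the movies released in each year
movies_per_year = movies.groupby("release_year")["id"].count()

# idxmax returns the index label (the year) with the largest count
busiest_year = movies_per_year.idxmax()

print(movies_per_year)
print("Year with most releases:", busiest_year)
```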

- Genre with the highest number of movie releases

In [13]:  def count_genre(x):
    # convert the column to string data type
    TMDB[x] = TMDB[x].astype(str)

    # concatenate all the rows of the genres
    data_plot = TMDB[x].str.cat(sep='|')
    data = pd.Series(data_plot.split('|'))

    # count each of the genres and return
    info = data.value_counts(ascending=False)
    return info

# call the function for counting the movies of each genre
total_genre_movies = count_genre('genres')

# plot a 'barh' plot using plot function for 'genre vs number of movies'
total_genre_movies.plot(kind='barh', figsize=(15, 7), fontsize=12)

# setup the title and the labels of the plot
plt.title("Genre With Highest Release", fontsize=16)
plt.xlabel('Number Of Movies', fontsize=14)
plt.ylabel("Genres", fontsize=14)

Out[13]: Text(0, 0.5, 'Genres')

Drama is the most frequent genre among the released movies.
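
An alternative way to count pipe-separated genres, sketched here on made-up rows, is to split each entry into a list and use `explode`. This is not the notebook's method, only an illustration; dropping missing entries first keeps filled-in placeholders out of the counts.

```python
import pandas as pd

# Toy frame (hypothetical rows, one with a missing genre)
movies = pd.DataFrame({
    "genres": ["Drama|Comedy", "Drama", None, "Action|Drama|Thriller"],
})

genre_counts = (
    movies["genres"]
    .dropna()           # ignore rows without a genre
    .str.split("|")     # "Drama|Comedy" -> ["Drama", "Comedy"]
    .explode()          # one row per (movie, genre) pair
    .value_counts()     # sorted from most to least frequent
)

print(genre_counts)
```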


In [14]:  #collect [genre, count] pairs from the counts computed above
genre_count = []
for genre, count in total_genre_movies.items():
    genre_count.append([genre, count])

plt.rc('font', weight='bold')
f, ax = plt.subplots(figsize=(8, 8))
genre_count.sort(key = lambda x:x[1], reverse = True)
labels, sizes = zip(*genre_count)
labels_selected = [n if v > sum(sizes) * 0.01 else '' for n, v in genre_co
ax.pie(sizes, labels=labels_selected,
       autopct = lambda x:'{:2.0f}%'.format(x) if x > 1 else '',
       shadow=False, startangle=0)
ax.axis('equal')
plt.tight_layout()


From the chart, we can see that Drama is the most common movie genre,
with the highest count of movies. Comedy and Action are also popular
genres with a relatively high count of movies. Other genres such as Horror
and War have a relatively lower count of movies.

Bivariate Analysis

- Movies with higher budgets, higher ratings and higher popularity tend to have
higher revenues.

In [15]:  correlation=TMDB.corr(numeric_only=True)
correlation

Out[15]:
              id         popularity  budget     revenue    runtime    vote_count  vote_averag
id            1.000000   -0.014351   -0.075766  -0.099235  -0.088368  -0.035555   -0.05839
popularity    -0.014351  1.000000    0.479961   0.663360   0.139032   0.800828    0.2095
budget        -0.075766  0.479961    1.000000   0.700162   0.265575   0.580050    0.0920
revenue       -0.099235  0.663360    0.700162   1.000000   0.162830   0.791174    0.17254
runtime       -0.088368  0.139032    0.265575   0.162830   1.000000   0.163273    0.1568
vote_count    -0.035555  0.800828    0.580050   0.791174   0.163273   1.000000    0.2538
vote_average  -0.058391  0.209517    0.092014   0.172541   0.156813   0.253818    1.00000
release_year  0.511393   0.089806    0.215402   0.057070   -0.117187  0.107962    -0.11757


In [16]:  plt.figure(figsize=(12,8))
sb.heatmap(data=correlation, annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.yticks(rotation = 45)
plt.show()

Positive correlation

popularity and budget (0.48)
popularity and revenue (0.66)
user rating and popularity (0.21)

Negative correlation

runtime and release year (-0.12, a weak relationship)
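
Rather than reading pairs off the heatmap, the correlation matrix can be flattened and sorted. The sketch below does this on made-up numbers (all values and titles are assumptions); `numeric_only=True` needs a reasonably recent pandas and skips text columns such as the title.

```python
import pandas as pd

# Toy frame (hypothetical values)
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 35, 70, 85, 120],
    "runtime": [130, 95, 110, 100, 90],
})

# Pearson correlation between the numeric columns only
correlation = movies.corr(numeric_only=True)

# Flatten to (column, column) pairs and keep each unordered pair once
pairs = correlation.unstack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second]

# Strongest positive relationships first
ranked = pairs.sort_values(ascending=False)
print(ranked)
```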

- How do revenue and popularity vary with budget and runtime, and how does popularity
depend on profit?


In [17]:  ax = sb.regplot(x=TMDB['revenue'], y=TMDB['budget'])



#set the title and labels of the figure
ax.set_title("Revenue Vs Budget",fontsize=16)
ax.set_xlabel("Revenue",fontsize=14)
ax.set_ylabel("Budget",fontsize=14)
#set the figure size
sb.set(rc={'figure.figsize':(12,6)})

- Which movie lengths are most liked by audiences, according to popularity?

In [18]:  TMDB.groupby('runtime')['popularity'].mean().plot(figsize = (13,5),xticks=


plt.title("Runtime Vs Popularity",fontsize = 16)
plt.xlabel('Runtime',fontsize = 14)
plt.ylabel('Average Popularity',fontsize = 14)
sb.set(rc={'figure.figsize':(16,6)})


Movies with a runtime between 100 and 200 minutes have a higher average
popularity than movies with a runtime outside this range.

In [31]:  from sklearn.preprocessing import LabelEncoder


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Convert cast and genres columns to string
TMDB_model[['cast','genres']] = TMDB_model[['cast','genres']].astype(str)

# Encode categorical variables
le = LabelEncoder()
TMDB_model['cast'] = le.fit_transform(TMDB_model['cast'])
TMDB_model['genres'] = le.fit_transform(TMDB_model['genres'])

# Prepare data for modeling
X = TMDB_model[['popularity', 'budget', 'cast', 'genres']]
y = TMDB_model['revenue']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, r

# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
score = model.score(X_test, y_test)
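
Label encoding turns each distinct genres string into an arbitrary integer, so a linear model treats the codes as if they were ordered. One commonly used alternative, shown here only as a sketch and not as the notebook's method, is one-hot encoding the pipe-separated genres with `Series.str.get_dummies`. The rows and the name `features` are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical rows)
movies = pd.DataFrame({
    "genres": ["Action|Adventure", "Drama", "Action|Drama"],
    "budget": [150.0, 10.0, 60.0],
})

# One 0/1 column per individual genre; a movie can belong to several
genre_dummies = movies["genres"].str.get_dummies(sep="|")

# Combine the indicators with the numeric predictors
features = pd.concat([movies[["budget"]], genre_dummies], axis=1)

print(features)
```

Each indicator then gets its own coefficient, which can be interpreted as the revenue difference associated with that genre, holding the other predictors fixed.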


In [32]:  import statsmodels.api as sm



# Add constant to the independent variables
X_train = sm.add_constant(X_train)

# Fit the model
model = sm.OLS(y_train, X_train).fit()

# Print the summary
print(model.summary())


                            OLS Regression Results
==============================================================================
Dep. Variable:                revenue   R-squared:                       0.629
Model:                            OLS   Adj. R-squared:                  0.628
Method:                 Least Squares   F-statistic:                     1749.
Date:                Mon, 03 Apr 2023   Prob (F-statistic):               0.00
Time:                        18:35:54   Log-Likelihood:                -82016.
No. Observations:                4135   AIC:                         1.640e+05
Df Residuals:                    4130   BIC:                         1.641e+05
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -3.986e+07   4.69e+06     -8.496      0.000   -4.91e+07   -3.07e+07
popularity  4.807e+07   1.32e+06     36.331      0.000    4.55e+07    5.07e+07
budget         2.1768      0.047     46.708      0.000       2.085       2.268
cast        2106.5788   1041.907      2.022      0.043      63.881    4149.277
genres      1398.0277   4863.308      0.287      0.774   -8136.675    1.09e+04
==============================================================================
Omnibus:                     3436.760   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           503740.129
Skew:                           3.248   Prob(JB):                         0.00
Kurtosis:                      56.680   Cond. No.                     1.50e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.5e+08. This might indicate that there are
strong multicollinearity or other numerical problems.


Model Interpretation and Conclusion


The overall model is statistically significant (Prob (F-statistic) below 0.05),
indicating that the predictors jointly explain revenue better than a model
with no predictors.
The R-squared value of 0.629 means that 62.9% of the variance in
revenue is explained by the independent variables.
The popularity and budget variables have a statistically significant positive
effect on revenue (p < 0.001), indicating that movies with higher popularity
and budget tend to generate higher revenues.
The cast variable is statistically significant at the 5% level (p = 0.043) with a
positive coefficient. Because cast was label-encoded, however, the integer
codes are arbitrary, so the coefficient cannot be read as showing that more
popular or higher-paid actors generate higher revenues.
The genres variable does not have a statistically significant effect on
revenue, as its p-value (0.774) is greater than 0.05.
The condition number of 1.5e+08 suggests that there may be strong
multicollinearity or other numerical problems in the model.
The skewness value of 3.248 indicates strong positive skew in
the distribution of the residuals.
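
A large condition number does not have to come from multicollinearity. It can also arise simply because predictors sit on very different scales, for example a budget in the tens of millions next to a popularity score below 10. The sketch below illustrates this on synthetic, independent predictors (all values are made up) by comparing the condition number of the design matrix before and after standardising the columns.

```python
import numpy as np

# Synthetic, independent predictors on very different scales (hypothetical)
rng = np.random.default_rng(0)
n = 500
popularity = rng.uniform(0.1, 10.0, n)
budget = rng.uniform(1e6, 2e8, n)

# Design matrix as used for OLS: intercept plus raw predictors
X_raw = np.column_stack([np.ones(n), popularity, budget])

# Same predictors after standardising to mean 0 and standard deviation 1
X_std = np.column_stack([
    np.ones(n),
    (popularity - popularity.mean()) / popularity.std(),
    (budget - budget.mean()) / budget.std(),
])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)

print("Condition number, raw predictors:", cond_raw)
print("Condition number, standardised predictors:", cond_std)
```

If the condition number of the real model drops sharply after rescaling, the warning is more likely a scaling issue than evidence that the predictors are strongly collinear.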
