Project: Predicting box office revenues

A report submitted to

Prof. Ujjwal Das

In partial fulfilment of the course

Advanced Methods for Data Analysis (AMDA)

By
Faisal Abrar - 2011072
Sai Jyothi - 2011299
Soumava Ghosh - 2011247
Vivek Sreenivasan - 2011278

On
19-12-2020

Business Problem
The business problem we focus on is predicting box office revenues. Changes in technology have
modernized the entertainment industry and shifted consumer preferences, which has increased the
need for managers to estimate a film's revenue before its release. The costs of creating a film
have also risen over the years. With ROI and box office revenue treated as critical measures of
a film house's success, the industry is investing in better prediction accuracy to support
informed decisions such as movie scheduling and advertising strategy.

The project aims to build a machine learning model that estimates the revenue generated at the
box office from factors such as popularity, budget, runtime, genre, production company and
release date.

Data Set Used

The data set covers movies released worldwide, with 4398 movie entries containing several
details, including movie details, overviews and credits. It is extracted from the TMDB API and
is certified by TMDB. Data on many additional movies, actors and actresses, crew members, and
TV shows can also be accessed through the APIs provided.

Variables Used

id                    belongs_to_collection  budget
genre                 homepage               imdb_id
original_language     original_title         overview
popularity            poster_path            production_companies
production_countries  release_date           run_time
spoken_languages      status                 tagline
title                 keywords               cast
crew                  revenue

Exploratory Data Analysis

The complete analysis of this dataset has been done in R. In the first step of our study,
exploratory data analysis, we tried to identify the variables that affect revenue and to
understand their relationships with the dependent variable. Some of the graphs are shown below;
they were plotted using the ggplot and ggplotextra packages.

The first 3 plots show the relationship of 3 variables, namely popularity, budget and runtime,
with the revenue earned.

As can be seen from the graphs, all 3 have a positive relationship with revenue, i.e., higher
values of each are associated with higher revenue. This seems logical as well: a larger budget
can raise the quality of production and cast, leading to a higher preference among consumers,
and more popularity (marketing and advertising of the movie) leads to better reception. Runtime
shows a similar trend, but its relationship with revenue is weaker than that of budget and
popularity.
Next, we have tried to see the impact of genre on the revenues earned by plotting the number of
movies for different genres and the median revenue earned by different genres.

It can be observed that movies in the action and science fiction genres have a higher median
revenue. However, for genres with low movie counts, such as foreign or history, the median is
not a reliable indication of the impact on revenue because of the small sample size.
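
The genre comparison above can be sketched in a few lines. The report's own analysis was done in R; the Python below is only an illustrative stand-in, and the function name and toy rows are hypothetical rather than taken from the TMDB data. Reporting the count next to the median makes the small-sample caveat visible.

```python
from collections import defaultdict
from statistics import median


def genre_summary(movies):
    """Return {genre: (movie_count, median_revenue)} from (genre, revenue) pairs."""
    by_genre = defaultdict(list)
    for genre, revenue in movies:
        by_genre[genre].append(revenue)
    return {g: (len(revs), median(revs)) for g, revs in by_genre.items()}


# Hypothetical toy rows for illustration only (not real TMDB values).
sample = [
    ("Action", 300), ("Action", 500), ("Action", 100),
    ("Drama", 60), ("Drama", 80),
    ("History", 40),
]
summary = genre_summary(sample)
for genre, (count, med) in sorted(summary.items()):
    print(genre, count, med)
```

A genre such as "History" in the toy rows has a single movie, so its median says little about the genre as a whole.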

Next, we have tried to see the effect of the production house on the revenues earned.

The revenues earned by the popular big production houses are much higher than those of the
small production companies.

In the next set of plots, we have tried to see the effect of the movie's release time (year, quarter,
month, week and day) on the revenues earned.
From the above plots, it can be seen that:
1. Average revenue has increased over the years.
2. Movies released in 3 months, June, July, and December, earn much higher revenues than
those released in other months. A possible reason is that many big movies target a
summer release (June and July), while some major blockbusters aim for a December
release to capitalize on the winter holiday season.
3. The revenue earned by movies released on Wednesdays seems to be higher than on the
other days of the week.
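
The release-time variables used in these plots can be derived from a single date. The following is a minimal Python sketch (the report itself used R); it assumes release_date is an ISO "YYYY-MM-DD" string, which the report does not state, and the function name is hypothetical.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def release_features(release_date):
    """Split an ISO date string (assumed format) into release-time features."""
    d = date.fromisoformat(release_date)
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,      # months 1-3 -> 1, 4-6 -> 2, ...
        "month": d.month,
        "week": d.isocalendar()[1],             # ISO week number
        "weekday": WEEKDAYS[d.isoweekday() - 1],
    }


# Illustrative date only, not a row from the dataset.
features = release_features("2015-07-01")
print(features)
```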

In the next set of plots, we counted the number of entries each film has for the following
variables: 1. Genres, 2. Production Companies, 3. Production Countries, 4. Spoken Languages,
5. Keywords, and created a correlation matrix of these counts with the revenue.
We made the following observations:
1. Films with more genres tend to have a higher median revenue.
2. The larger the number of production companies on a film, up to six, the higher the
revenue. Beyond that, the outcomes appear more erratic, which smaller sample sizes could
account for.
3. There does not appear to be a clear correlation between the number of production
countries and revenue, and there is no discernible trend for the number of spoken
languages either.
4. Having more keywords is correlated with greater revenue.
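
The correlation between a per-film count and revenue can be illustrated with a hand-written Pearson coefficient. This is a Python sketch under stated assumptions (the report used R and does not say which correlation measure it computed); the function name and the toy numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy data: keyword count per film and revenue (arbitrary units).
keyword_counts = [1, 2, 3, 4]
revenues = [10, 20, 30, 40]
r = pearson(keyword_counts, revenues)
print(round(r, 3))
```

A coefficient near 1 indicates that films with more keywords tend to earn more, near 0 indicates no linear trend, and near -1 indicates the opposite trend.
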
Data Processing
1. We created a new variable that converts the "belongs to collection" variable into a
categorical variable with 2 values, "Collection" and "No Collection".
2. We extracted the main genre from the "genre" column to create a new variable named
"main genre".
3. We extracted the first main id from the "production companies" column to create a new
variable named "prod comp id".
4. Next, we extracted the production company name to create a new variable named "top
prod comp" and categorized it such that companies with fewer than 60 movies have the
value "others".
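
The four processing steps above can be sketched as follows. This is Python rather than the R used in the report, and several details are assumptions: the record layout (lists of genres and of (id, name) company pairs), treating the first listed genre and company as the "main" ones, and assigning "others" to films with no company. The function name is hypothetical.

```python
from collections import Counter


def derive_features(movies, min_movies=60):
    """Derive collection flag, main genre, company id and grouped company name."""
    # Assumption: the first listed company is the main one.
    first = [m["production_companies"][0] if m["production_companies"]
             else (None, None) for m in movies]
    counts = Counter(name for _, name in first if name is not None)
    rows = []
    for m, (comp_id, comp_name) in zip(movies, first):
        frequent = comp_name is not None and counts[comp_name] >= min_movies
        rows.append({
            "collection": "Collection" if m["belongs_to_collection"] else "No Collection",
            # Assumption: the first listed genre is the main genre.
            "main_genre": m["genres"][0] if m["genres"] else None,
            "prod_comp_id": comp_id,
            "top_prod_comp": comp_name if frequent else "others",
        })
    return rows


# Hypothetical toy records; a threshold of 2 stands in for the report's 60.
movies = [
    {"belongs_to_collection": "Some Series", "genres": ["Action", "Drama"],
     "production_companies": [(1, "Studio A"), (2, "Studio B")]},
    {"belongs_to_collection": None, "genres": ["Comedy"],
     "production_companies": [(1, "Studio A")]},
    {"belongs_to_collection": None, "genres": [],
     "production_companies": [(3, "Studio C")]},
]
rows = derive_features(movies, min_movies=2)
print(rows)
```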

Models

Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm that can be used for both classification and
regression. Our problem is a regression problem, so we have used the regression form of SVM
(support vector regression) to predict the revenue earned. We have used the e1071 package in R
to build the support vector machines.

Data Preparation
After removing all the null values from the full_dataset, we split the data into training and
testing parts. After data preparation, we examined the different variables to select those
that are important for predicting revenue.
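
The preparation step can be sketched as below. This is an illustrative Python version (the report used R); the 80/20 split ratio and the fixed seed are assumptions, since the report does not state how the data was divided, and the function names are hypothetical.

```python
import random


def drop_nulls(rows):
    """Keep only rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy rows: 11 movies, one with a missing budget.
full_dataset = [{"id": i, "budget": 100 + i, "revenue": 200 + i} for i in range(10)]
full_dataset.append({"id": 10, "budget": None, "revenue": 50})

complete = drop_nulls(full_dataset)
train, test = train_test_split(complete)
print(len(complete), len(train), len(test))
```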

About the model and test data
Using the svm and predict functions, we fitted the SVM model and generated predictions. The
fitted model is of type eps-regression with a radial kernel, and it has 2441 support vectors
in total.

Next, we predicted revenue for the test data and plotted the results to compare them visually
with the actual values. The plot shows that revenue is unevenly distributed across movies.
Accuracy
We have calculated the MAE, MSE, RMSE and R-squared, along with the accuracy. The accuracy
turned out to be about 97.2%.
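
For reference, the error metrics named above have standard definitions, sketched here in Python (the report computed them in R). The report does not say how its accuracy figure was derived, so that figure is not reproduced; the function name and the toy values are hypothetical.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R-squared for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = sqrt(mse)                               # root mean squared error
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot                       # coefficient of determination
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical toy values, not results from the report.
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(metrics)
```

Lower MAE, MSE and RMSE and a higher R-squared indicate a better fit, which is the basis on which two regression models can be compared.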

Comparison
Alongside the SVM, we fitted a random forest, since it can also be used as a regression model.
The summary of the random forest is given below. About 501 trees were grown in total, and the
accuracy turned out to be 89.56%. Comparing the two models on accuracy, MSE and MAE, the SVM
is the better one.
Prediction
We have built and trained our model. Using the SVM, we predict the revenue for each movie id
in the test data. We have saved the predictions to a CSV file. A glimpse of the data is given
below.

Conclusion
In the future, this dataset could also be used to predict the ratings of released movies, in
addition to their revenues, based on the cast and crew, using machine learning algorithms.
