Project: Predicting Box Office Revenues: A Report Submitted To
Project: Predicting Box Office Revenues: A Report Submitted To
A report submitted to
By
Faisal Abrar - 2011072
Sai Jyothi - 2011299
Soumava Ghosh - 2011247
Vivek Sreenivasan - 2011278
On
19-12-2020
Business Problem
The business problem we are focusing on is predicting the box office revenues. The
entertainment industry has observed various changes in technology, which has modernized the
industry. Accordingly, consumer preferences have changed, leading to an increased need for
managers to estimate the predicted revenue for the film planned to be released. The costs of
creating a film have increased over the years. Additionally, due to the increased focus on ROI
and box office revenue as a critical parameter for the success of the film house, the industry is
investing in improving their prediction accuracy to make informed decisions like movie
scheduling, advertising strategy etc.
The project aims to build a machine learning predictive model to estimate the revenue generated
at the box office, taking into consideration the various factors.
The data set contains data regarding movies released worldwide, with 4398 movie entries
containing several details, including movie details, overviews and credits. This dataset is
extracted from TMDB API and is certified by TMDB. We can also access data of many other
additional movies, actors and actresses, crew members, and TV shows from the API's provided.
Variable Used
id belongs_to_collection budget
The complete analysis for this dataset has been done using R software. In the first step of our
study, exploratory data analysis, we have tried to identify the various variables that impact the
revenue and understand the relationships we have with the dependent variable. Some of the
graphs have been shown below, which have been plotted using ggplot and ggplotextra packages.
The first 3 plots show the relationship of the 3 variables, viz, popularity, budget and runtime,
with the revenue earned.
As can be seen from the graphs, all 3 have a positive relationship with the revenue, ie, increasing
each of them leads to an increase in revenue. This seems logical as well, as an increase in budget
increases the quality of production and cast, leading to a high preference by consumers. Also,
more popularity (marketing and advertising of the movie) leads to better reception. Runtime
shows a similar trend but does not have a strong relationship like budget and popularity.
Next, we have tried to see the impact of genre on the revenues earned by plotting the number of
movies for different genres and the median revenue earned by different genres.
It can be observed that movies in action and science fiction genres have a higher median
revenue. But genres with low movie counts like foreign or history do not represent a correct
representation of the impact on revenue due to the low sample size.
Next, we have tried to see the effect of the production house on the revenues earned.
The revenues earned by popular big production houses is much more than the small production
companies.
In the next set of plots, we have tried to see the effect of the movie's release time (year, quarter,
month, week and day) on the revenues earned.
From the above plots, it can be seen that:
1. The revenue on average has increased with the increase in year
2. The revenues earned in 3 months, June, July, and December, are much higher than those
released in other months. A possible reason for this could be that many big movies target
a summer release (June and July), while some major blockbuster movies aim for the
December release to capitalize on the winter holiday season.
3. The revenue earned for movies releasing on Wednesdays seems to be higher than the
other days of the week.
In the next set of plots, we estimated the number of occurrences of variables for the following
parameters 1. Genres, 2. Production Companies 3. Production Countries 4. Spoken Languages 5.
Keywords and created a correlation matrix with the revenue.
We found the following observations:
1. The higher the median revenue, the more genres a film has. The bigger the number of
production companies in a film, up to six, the higher the revenue.
2. A greater number appears to produce more erratic outcomes. Smaller sample sizes could
account for this.
3. There does not appear to be a clear correlation between the number of producing
countries and revenue. There appears to be no discernible trend in the number of spoken
languages as well.
4. There is a correlation between having more keywords and having greater revenue.
Data Processing
1. We created a new variable to convert the “belongs to collection” variable to a categorical
variable containing 2 values “ Collection” and “ No Collection”
2. We extracted the main genre from the “genre” column to create a new variable named
“main genre”
3. We extracted the first main id from the “production companies” column to create a new
variable named “ prod comp id”
4. Next, we extracted the production company name to create a new variable named “ top
prod comp” and categorized it such a that companies with less than 60 movies have value
as “others”
Models
Multi-class SVM
SVM is a supervised machine learning algorithm which can be used for both classification and
regression. Our problem is a regression problem and we have used the extension, multi-class
SVM to predict the revenue earned. We have used the e1071 package to build support vector
machines.
Data Preparation
After removing all the null values from the full_dataset we have splitted the data into testing and
training parts. After data preparation, we looked into different variables to select important
variables for the revenue.
Next, we predicted the test data and results plot to compare visually. This shows that the revenue
is distributed unevenly with different figures.
Accuracy
We have calculated the MAE, MSE, RMSE, R-squared along with accuracy. The accuracy
turned out to be about 97.2%.
Comparison
Along with the multiclass SVM we have used Random forest since its regression model. The
summary of the random forest is given below. Total about 501 trees were obtained and accuracy
turned out to be 89.56%. Of two models comparing the accuracy, MSE, MAE the multi classifier
SVM is the best one.
Prediction
We have created our model and trained the model. Using the multi classifier SVM we will
predict the revenue of the test data using the movie id. We have saved the predicted data into a
csv file. The glimpse of data is given below.
Conclusion
Using this dataset in the future we can predict the ratings of the movies releases based on the
cast, crew along with the revenues using machine learning algorithms.