
Final Report

The document discusses predicting the financial success of films using machine learning techniques. It proposes a decision support system that analyzes historical film data to predict a film's approximate success rate and profitability based on features like budget, ratings, genre, etc. The system aims to help film investors avoid risks by forecasting revenue from pre-release and post-release features using methods like neural networks and support vector machines.

FORECASTING FILM FINANCIAL SUCCESS

WITH MACHINE LEARNING


PROJECT REPORT
Submitted by

GOKULA KRISHNAN.A (211520205043)

GOKULNATH.K (211520205044)

RAM VILAS.H (211520205112)

In partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY

PANIMALAR INSTITUTE OF TECHNOLOGY


ANNA UNIVERSITY: CHENNAI 600 025

MAY 2024
PANIMALAR INSTITUTE OF TECHNOLOGY
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE
Certified that this project report titled “Forecasting Film Financial
Success With Machine Learning” is the bonafide work of
“GOKULA KRISHNAN.A (211520205043), GOKULNATH.K
(211520205044) and RAM VILAS.H (211520205112)” who carried out
the project work under my supervision.

SIGNATURE
Mrs. V. RAJESHWARI, M.E.,
ASSISTANT PROFESSOR,
Department of Information Technology,
Panimalar Institute of Technology, Poonamallee,
Chennai 600 123.

SIGNATURE
Dr. S. SUMA CHRISTAL MARY, M.E., Ph.D.,
HEAD OF THE DEPARTMENT,
Department of Information Technology,
Panimalar Institute of Technology, Poonamallee,
Chennai 600 123.

Certified that the candidates were examined in the university project
viva-voce examination held on ____________ at Panimalar Institute
of Technology, Chennai 600 123.

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

A project of this magnitude and nature requires kind co-operation and support
from many, for successful completion. We wish to express our sincere thanks to all
those who were involved in the completion of this project.

We seek the blessing from the Founder of our institution Dr. JEPPIAAR,
M.A., Ph.D., for having been a role model who has been our source of inspiration
behind our success in education in his premier institution.

We would like to express our deep gratitude to our beloved Secretary and
Correspondent Dr. P. CHINNADURAI, M.A., Ph.D., for his kind words and
enthusiastic motivation which inspired us a lot in completing this project.

We also express our sincere thanks and gratitude to our dynamic Directors
Mrs. C. VIJAYA RAJESHWARI, Dr. C. SAKTHI KUMAR, M.E., Ph.D., and
Dr. SARANYA SREE SAKTHI KUMAR, B.E., M.B.A, for providing us with
necessary facilities for completion of this project.

We also express our appreciation and gratefulness to our respected Principal
Dr. T. JAYANTHY, M.E., Ph.D., who helped us in the completion of the project.
We wish to convey our thanks and gratitude to our Head of the Department,
Dr. S. SUMA CHRISTAL MARY, M.E., Ph.D., for her full support by providing
ample time to complete our project.
Special thanks to our Project Guide Mrs.V.RAJESHWARI, M.E., Assistant
Professor for her expert advice, valuable information and guidance throughout the
completion of the project.
Last, we thank our parents and friends for providing their extensive moral
support and encouragement during the course of the project.

ABSTRACT

Predicting society's reaction to a new product in terms of popularity and
adoption rate has become an emerging field of data analysis. The motion picture
industry is a multi-billion-dollar business, and a massive amount of film-related
data is available on the internet. This study proposes a decision support system
for the film investment sector using machine learning techniques, helping
investors associated with this business avoid investment risks.
The system predicts an approximate success rate of a film based on its
profitability by analyzing historical data from sources such as IMDb, Rotten
Tomatoes, Box Office Mojo, and Metacritic. Using Support Vector Machine
(SVM), Neural Network, and Natural Language Processing, the system predicts a
film's box office profit from pre-release and post-release features. The Neural
Network gives an accuracy of 84.1% for pre-release features and 89.27% for all
features, while SVM gives 83.44% and 88.87% for pre-release features and all
features respectively, when one-away prediction is considered. Moreover, we find
that budget, IMDb votes, and number of screens are the most important features,
playing a vital role in predicting a film's box office success.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES

1 INTRODUCTION
1.1 Overview
1.2 Machine Learning
1.3 Purpose

2 LITERATURE SURVEY

3 SYSTEM ANALYSIS & SPECIFICATION
3.1 Existing System
3.1.1 Drawbacks of Existing System
3.2 Proposed System
3.2.1 Advantages of Proposed System
3.3 Hardware Requirements
3.4 Software Requirements

4 SYSTEM DESIGN AND SYSTEM IMPLEMENTATION
4.1 System Architecture
4.2 Methodology
4.2.1 Data Extraction
4.2.2 Data Processing and Analysis
4.2.3 Model Building
4.3 Data Gathering
4.4 Data Description
4.5 Flow Chart

5 EXPERIMENTAL INVESTIGATION
5.1 Support Vector Machine
5.2 Random Forest Classification
5.3 KNN Classification Algorithm
5.4 Linear Regression

6 EXPERIMENTAL RESULTS
6.1 Flask
6.2 Home Page
6.3 Result Page
6.4 Advantages and Disadvantages

7 CONCLUSION AND FUTURE ENHANCEMENT
7.1 Conclusion
7.2 Future Enhancement

8 REFERENCES

9 APPENDIX
9.1 Colab Notebook
9.2 Flask Code
9.3 HTML Files

LIST OF FIGURES

FIGURE NO    FIGURE NAME

1.1    PROCESS OF MACHINE LEARNING
4.1    SYSTEM ARCHITECTURE
4.2    DATA GATHERING
4.3    DATA DESCRIPTION
4.4    FLOW CHART
5.1    SUPPORT VECTOR MACHINE
5.2    RANDOM FOREST CLASSIFIER
5.3    KNN CLASSIFICATION ALGORITHM
5.4    LINEAR REGRESSION
6.1    FLASK
6.2    HOME PAGE
6.3    RESULT PAGE
9.1    FLASK CODE
9.2    MOVIE BOX OFFICE OUTPUT
9.3    REVENUE PREDICT OUTPUT

CHAPTER-1
INTRODUCTION

1.1 OVERVIEW
Predicting society's reaction to a new product in terms of popularity and
adoption rate has become an emerging field of data analysis, and this kind of
analysis can help the film industry make appropriate decisions. Can film studios
and their stakeholders use a forecasting method to predict the revenue that a
new film will generate from a few input attributes such as budget, runtime,
release year, and popularity? This study addresses that question with a decision
support system for the film investment sector using machine learning techniques.
The project helps investors associated with this business avoid investment
risks. The system predicts an approximate success rate of a film based on its
profitability by analyzing historical data on features such as online rating,
director, budget, pre-release business, and genre.

1.2 MACHINE LEARNING


Machine learning predicts future outcomes from past data. Machine learning (ML)
is a type of artificial intelligence (AI) that provides computers with the ability to
learn without being explicitly programmed. It focuses on the development of
computer programs that can change when exposed to new data. In this project,
the machine learning algorithms are implemented in Python. The process of
training and prediction involves the use of specialized algorithms: the training
data is fed to an algorithm, and the algorithm uses this training data to give
predictions on new test data. Machine learning can be roughly separated into
three categories: supervised learning, unsupervised learning, and reinforcement
learning.

Figure 1.1 Process of machine learning

Data scientists use many different kinds of machine learning algorithms to
discover patterns in data that lead to actionable insights. At a high level, most
of these algorithms can be classified into two groups based on the way they
“learn” about data to make predictions: supervised and unsupervised learning.

Classification predictive modelling is the task of approximating a mapping
function from input variables (X) to discrete output variables (y). In machine
learning and statistics, classification is a supervised learning approach in which
the computer program learns from the data input given to it and then uses this
learning to classify new observations.

The majority of practical machine learning uses supervised learning. In
supervised learning there are input variables (X) and an output variable (y),
and an algorithm is used to learn the mapping function from the input to the
output, y = f(X). The goal is to approximate the mapping function so well that,
given new input data (X), the output variable (y) can be predicted for that data.
The aim is to construct a succinct model that can predict the value of the
dependent attribute from the independent attributes.

Supervised learning covers two tasks, regression and classification. The
difference between the two is that the dependent attribute is numerical for
regression and categorical for classification. A classification model attempts to
draw some conclusion from observed values: given one or more inputs, it tries
to predict the value of one or more outcomes. A classification problem is one
where the output variable is a category, such as “red” or “blue”.
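
To make the distinction concrete, the minimal Python sketch below derives both kinds of target from the same film record: a numerical profit (a regression target) and a categorical label (a classification target). The field names, the sample figures, and the break-even rule for the label are illustrative assumptions, not values taken from the project's dataset.

```python
# Illustrative records; the titles and figures are made up for this sketch.
films = [
    {"title": "Film A", "budget": 50_000_000, "revenue": 120_000_000},
    {"title": "Film B", "budget": 80_000_000, "revenue": 60_000_000},
]


def regression_target(film):
    """Numerical dependent attribute: profit in dollars."""
    return film["revenue"] - film["budget"]


def classification_target(film):
    """Categorical dependent attribute: a discrete success label.

    The break-even rule (profit > 0) is an assumption for illustration.
    """
    return "hit" if regression_target(film) > 0 else "flop"


profits = [regression_target(f) for f in films]
labels = [classification_target(f) for f in films]
```

A regression model would be trained to output values like those in `profits`, while a classifier would be trained to output values like those in `labels`.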

1.3 PURPOSE

The film industry has grown immensely over the past few decades, generating
billions of dollars of revenue for its stakeholders. People can now watch films
online and offline on a variety of mobile devices during leisure or travel through
Netflix, YouTube, and downloads. A prediction system that assesses the box office
success of new films can help film producers and directors make informed
decisions when making a film, in order to increase the chance of profitability
and box office success. New social media tools are constantly appearing,
enabling people to gather information on films and post comments about them.
These comments can influence the initial prediction of a film's box office
success, which some of the existing research does not consider.
Critic reviews often come out a few days before a film is released and may
therefore help in prediction while also influencing the box office revenue.

CHAPTER-2
LITERATURE SURVEY

How the success of a film is judged depends primarily on the perspective from
which the film is evaluated. Early studies prioritized gross box office
revenue ([2], [3], [4]). A few previous works ([4], [5], [6]) forecast the gross
of a film with stochastic and regression models using IMDb data.
Some of them categorized films as either a success or a flop based on revenue and
applied binary classification for the forecast. However, the success of a film
does not depend solely on revenue; it relies on numerous factors such as
actors and actresses, director, time of release, and background story. A few
researchers built prediction models that used pre-release data as their
features [7]. In most cases, only very few features were considered, and as a
result the models performed poorly. They also ignored the participation of
audiences, on whom the success of a film mostly depends. A few researchers
applied NLP to sentiment analysis ([8], [9]) and gathered film reviews as their
test domain, but the accuracy of prediction depends on how large the test domain
is, and a small domain is not a sound basis for measurement. Again, most
of them did not take critics' reviews into account. Besides, user reviews can be
biased, as a fan of an actor or actress may fail to give an unbiased opinion.

The main contribution of M. T. Lash and K. Zhao [10] was a decision support
system that uses machine learning, text mining, and social network analysis to
predict film profitability rather than revenue. Their work introduces several
features, such as dynamic network features, plot topic distributions, the match
between “what” and “who”, the match between “what” and “when”, and profit-based
star power measures. They analyzed film success in three categories:
audience-based, release-based, and film-based. Their hypothesis is that the more
optimistic, positive, or excited audiences are about a film, the more likely it
is to have higher revenue. Similarly, a film with a more pessimistic and
negative reception from the public may attract fewer people to fill seats.

They retrieved data from different types of media, such as Twitter, comments from
YouTube, blogs, news articles, and film reviews. Star ratings from reviews and the
sentiment of reviews or comments were used as a means of assessing
audience excitement towards a film. Their original dataset was collected from both
Box Office Mojo and IMDb. They focused on films released in the USA and
excluded all foreign films from their experiment. In [11], A. Sivasantoshreddy,
P. Kasat, and A. Jain tried to predict a film's box office opening using hype
analysis.

Current research in the field involves using machine learning models to predict
box office success in terms of revenue. Many models have been built for this
purpose and analyzed for performance. Research can be broadly divided into
two categories: quantitative and qualitative. Quantitative research predicts
success in numbers, while qualitative research predicts whether the film will be
successful or not. Another closely related field in the qualitative domain builds
film recommendation systems based on personal choices.

However, it is essential to note that predicting a film's success is most relevant
at the beginning of or before its production. It is even more helpful if the
producers are informed about what decisions to make for the film to succeed.
Predicting the requirements for a film to reach a required, quantified level of
success helps production companies in decision-making. A model that predicts a
film's characteristics based on the revenue to be generated and a particular
genre has not been given much attention in the past and forms the motivation for
this project.

A neural network was used to predict the financial success of a box
office film before its release in theaters [12]. The forecasting task was
converted into a classification problem with 9 classes. The model was
represented with very few features. In [13], film gross prediction was improved
through news analysis, where quantitative news data was generated by
Lydia (a high-speed text processing system for collecting and analyzing news data).
The study contained two different models (regression and k-nearest neighbor
models), but it considered only high-budget films. The model failed if a common
word was used as a film's name, and it could not make a prediction if there was
no news about a film.

CHAPTER 3
SYSTEM ANALYSIS &
SPECIFICATION

3.1 EXISTING SYSTEM

Films, in general, are products that have a long development stage before they
reach final consumers, normally at a high cost. A film development process can be
described broadly as being composed of four stages: pre-production, production,
post-production, and distribution. Given the large growth in the number of films
released over the past few decades, film prediction is necessary.
At present, the only way people can check whether a film is worth watching is
through review applications, but the reviews posted by other users are too
numerous for a user to read and can be confusing, so this system analyzes them
instead. The aims and objectives of our system are as follows. The first step is
to identify a dataset of film data that is suitable for analysis. Relevant attributes
need to be selected from the film data. Attributes can be general pre-production
information about film productions, such as film title, sequel, genre, language,
and information about writers, actors, and directors. The data must also
include some measure of success, such as user film ratings. Secondly, the relevant
dataset has to be prepared and structured in such a way that the data used is
representative of the film scene at large, as well as suitable for analysis by the
relevant machine learning techniques and algorithms. Further, correlation analysis
is performed on the dataset to find the relationships between all the variables.
The important step in training our system is to apply a classification
model, for which many classifiers are available. Lastly, the prediction
performance of the relevant machine learning algorithm has to be evaluated on the
dataset in order to determine the success or failure of a film accurately.

3.1.1 DRAWBACKS OF EXISTING SYSTEM

● Bias and subjectivity: User reviews can be biased and subjective, leading
to inaccurate predictions.

● Limited scope: The system may struggle to predict the success of niche or
unconventional films that appeal to a smaller audience.

● Influence of external factors: Early reviews and ratings can be influenced
by factors such as hype, marketing, and preconceived notions about the
film or its creators, impacting the system's predictions.

● Reliability of reviews: The reliability of user reviews can vary, making it
challenging to accurately predict a film's success based on these reviews
alone.

3.2 PROPOSED SYSTEM

This is a classic example of supervised learning. We have been provided with a
fixed number of features for each data point, and our aim is to train a variety
of supervised learning algorithms on this data so that, when a new data point
arises, our best-performing classifier can categorize it as a positive or a
negative example. The algorithms used for training are described in Chapter 5.
This project draws on related work on box office gross prediction, and the
algorithms were implemented in Python using Google Colab, a hosted notebook
environment. Various attributes that are essential in the prediction of film
gross were examined, and the dataset of films was also evaluated. This project
compares classification algorithms such as Random Forest, Support Vector
Machine, and KNN with the aim of identifying the best technique. Based
on this study, Linear Regression achieved the highest accuracy, outperforming the
other algorithms, and can be further utilized to predict the film box office
gross presented to the user. Finally, a Flask app with HTML files provides a
user interface that displays the predicted film box office gross values.

3.2.1 ADVANTAGES OF PROPOSED SYSTEM

● Accurate predictions: By training various supervised learning algorithms
on a dataset of film attributes, the system can accurately predict the box
office gross of films, helping users make informed decisions.

● Comparison of algorithms: The system compares the performance of
different classification algorithms, such as Random Forest, Support Vector
Machine, and KNN, to identify the best-performing technique, ensuring
the use of the most effective method for prediction.

● Utilization of Linear Regression: Based on the study, Linear Regression is
identified as the algorithm with the highest accuracy, providing a reliable
method for predicting film box office gross.

● User-friendly interface: By using Flask to create a web application, the
system provides a user-friendly interface for users to input data and view
the predicted box office gross values, enhancing usability and accessibility.

3.3 HARDWARE REQUIREMENTS

The following hardware is required to complete this project:

● Internet connection to download and activate the software.
● Administrator access to install and run Anaconda Navigator.
● Minimum 10 GB free disk space.
● Windows 8.1 or 10 (64-bit or 32-bit version), or a cloud environment
(cloud account required).

To run Microsoft Excel 2013, the computer needs to meet the following minimum
hardware requirements:

● 500 megahertz (MHz) processor.
● 256 megabytes (MB) RAM.
● 1.5 gigabytes (GB) available disk space.
● 1024x768 or higher resolution monitor.

3.4 SOFTWARE REQUIREMENTS

The software requirements specify what the system should do rather than how it
should do it, and they provide the basis for creating the software requirement
specification. They are useful for estimating cost, planning team activities,
performing tasks, and tracking the team's progress throughout the development
activity.

● Google Colaboratory Notebook or Jupyter Notebook.
● Spyder and PyCharm Community.
● Microsoft Excel 2013.

CHAPTER 4
SYSTEM DESIGN &
SYSTEM IMPLEMENTATION

4.1 SYSTEM ARCHITECTURE

Figure 4.1 system architecture

4.2 METHODOLOGY

The project was completed in three phases:

1. Data Extraction
2. Data Processing and Analysis
3. Model Building

4.2.1 Data Extraction

A significant part of the dataset required for the project was extracted from the
global TMDB dataset using its APIs. Following this, the OMDB API was used to
extract the MPAA rating, IMDb rating, and IMDb vote count of each film in the
dataset. The final dataset has the following features: Genres, ID, Original
Language, Original Title, Overview, Popularity Rating, Release Date, Title, TMDB
Rating, TMDB Vote Count, IMDb ID, Budget, Revenue, Production Companies, Cast,
Crew, Production Countries, Spoken Languages, Runtime, Tagline, MPAA Rating,
IMDb Rating, IMDb Vote Count, and Star Power.

The final dataset has 6065 films. For Genres, Cast, Crew, and
Production Companies, the API returned a JSON array of responses. The
popularity index of the most popular cast member in the film was taken as the
star power. Only the top few values in each JSON array were kept in the final
dataset, since these are the elements that, along with star power, attract the
majority of the audience.
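
The star power rule and the "top few values" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the payload below is made up, and the key names `name` and `popularity` are assumed rather than taken from the project's notebook.

```python
import json

# Hypothetical cast payload; the keys "name" and "popularity" are assumptions.
cast_json = """[
  {"name": "Actor One", "popularity": 12.5},
  {"name": "Actor Two", "popularity": 48.2},
  {"name": "Actor Three", "popularity": 7.9},
  {"name": "Actor Four", "popularity": 30.1}
]"""


def top_names(entries, limit=3):
    """Keep only the first `limit` names, in the order they were returned."""
    return [entry["name"] for entry in entries[:limit]]


def star_power(entries):
    """Popularity index of the most popular cast member (0.0 for an empty cast)."""
    return max((entry["popularity"] for entry in entries), default=0.0)


cast = json.loads(cast_json)
top_cast = top_names(cast)
power = star_power(cast)
```

Here `top_cast` keeps the first three names and `power` is the largest popularity value found in the array.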

The extracted dataset was modified, and some of the essential features were
updated. Features like ID, IMDb ID, Original Title, and Tagline were not relevant
to the prediction model and were thus removed. NULL values in the dataset were
handled by replacing the NULL values in ‘runtime’ with the median
value and replacing the NULL values in the other columns with an empty set.
Once no NULL values remained, the ‘release date’ feature
was split into three distinct features for the day, month, and
year of the release. All data except the first three members in the cast of each
entry were deleted. Then, a check was done for outliers, which are data points
distant from the rest of the data in the dataset. They can distort the
final result and prediction, and thus were removed.
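
The cleaning steps described above might look like the following sketch, written with the Python standard library only. The row layout, the ISO date format, and in particular the interquartile-range (IQR) rule with factor 1.5 for outliers are assumptions made for illustration; the report does not state which outlier test was used.

```python
from datetime import date
from statistics import median, quantiles


def fill_missing_runtime(rows):
    """Replace missing (None) runtimes with the median of the known runtimes."""
    known = [r["runtime"] for r in rows if r["runtime"] is not None]
    mid = median(known)
    return [
        {**r, "runtime": mid if r["runtime"] is None else r["runtime"]}
        for r in rows
    ]


def split_release_date(row):
    """Split an ISO 'YYYY-MM-DD' release date into day, month, and year features."""
    d = date.fromisoformat(row["release_date"])
    out = {k: v for k, v in row.items() if k != "release_date"}
    out.update(release_day=d.day, release_month=d.month, release_year=d.year)
    return out


def drop_outliers(rows, key, k=1.5):
    """Drop rows whose `key` lies outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = quantiles([r[key] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[key] <= high]


# Made-up rows for illustration.
rows = [
    {"runtime": 100, "release_date": "2015-06-12"},
    {"runtime": None, "release_date": "2016-11-04"},
    {"runtime": 110, "release_date": "2017-03-17"},
    {"runtime": 95, "release_date": "2018-07-20"},
    {"runtime": 105, "release_date": "2019-12-25"},
    {"runtime": 900, "release_date": "2020-01-10"},  # implausible runtime
]
filled = fill_missing_runtime(rows)
cleaned = [split_release_date(r) for r in drop_outliers(filled, "runtime")]
```

In this toy input the missing runtime is filled with the median of the known values, and the 900-minute entry falls outside the IQR fence and is dropped.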

The dataset was then explored to understand the relationships between the
features and how they interact with each other, to spot anomalies in the data, and
to find patterns that help build the model. For this purpose, histograms were
plotted to study the range of features like runtime, budget, popularity, release
date, IMDb rating, and revenue. Following this, a correlation
matrix was plotted for the same features to find the linear
interaction between every pair of features; it contains the correlation
coefficient for each pair. In addition, bar plots and frequency
polygons were plotted to study the data.
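
As a rough illustration of what the correlation matrix holds, the sketch below computes the Pearson correlation coefficient for every pair of features in plain Python. The column values are made up, and the helper names are hypothetical; the project itself most likely relied on a library routine for this step.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each (feature, feature) pair to its correlation coefficient."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Made-up values in arbitrary units, for illustration only.
columns = {
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 45, 65, 85, 105],  # exactly linear in budget
    "runtime": [120, 95, 130, 90, 110],
}
matrix = correlation_matrix(columns)
```

A coefficient near 1 or -1 indicates a strong linear relationship between two features, while a value near 0 indicates little linear relationship.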

4.2.2 Data Processing and Analysis

CLEANING AND VALIDATION

● Identifying Inconsistencies, Errors, and Missing Values: This involves
thoroughly examining the dataset to spot any discrepancies, such as
outliers, incorrect data entries, or missing values. For instance, missing
values might occur due to data collection errors or incomplete records.

● Rectifying Data Integrity Issues: Once identified, inconsistencies and
errors need to be addressed. This may involve imputing missing values,
removing outliers, or correcting erroneous entries. The goal is to ensure
that the data is accurate and reliable for analysis.

● The library ‘sklearn’, a robust and commonly used library for machine
learning problems, was used to preprocess the dataset. SimpleImputer
replaced the missing values with median values, and a PowerTransformer
was applied to make the data look more Gaussian, stabilizing variance and
minimizing skewness for features like budget, runtime, and popularity. After
dropping all unwanted features, label encoding was performed on the
categorical features, as they needed to be expressed numerically. The
original category columns were removed, and a transformed version of the
dataset was returned, which was suitable for modeling. After
all the processing, transformation, and label encoding, the final dataset
contained eleven features. This data was then split into training and test sets.
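
The label encoding and train/test split steps can be illustrated with the standard-library sketch below. It is a stand-in for the scikit-learn utilities (`LabelEncoder` and `train_test_split`), not the project's actual code; the helper names, the 80/20 ratio, and the seed are assumptions, since the report does not state the split used.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer code, with classes in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up genre column and row identifiers.
genres = ["Drama", "Action", "Comedy", "Action", "Drama"]
codes, classes = label_encode(genres)
train, test = split_train_test(list(range(10)), test_fraction=0.2)
```

Each category is replaced by its position in the sorted class list, so the model receives integers instead of strings, and the held-out rows are never seen during training.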

4.2.3 Model Building

The preprocessed data was passed through different regression models. Specific
functions were written to calculate each model's performance, i.e., the mean
train RMSE and the mean test RMSE for the final model. The four most important
features according to the results of each regression model were identified.
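
For reference, the root mean squared error (RMSE) used as the performance measure can be computed as in the sketch below. The function names are hypothetical, and reading "mean RMSE" as an average over several evaluation folds is an assumption; the report does not spell out how the mean was taken.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def mean_rmse(fold_scores):
    """Average the RMSE over several evaluation folds (assumed interpretation)."""
    return sum(fold_scores) / len(fold_scores)


# Errors are 1, 0, and -2, so the RMSE is sqrt((1 + 0 + 4) / 3).
example = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

A lower RMSE means the predictions are closer to the actual values; comparing the train and test figures indicates how well a model generalizes.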

4.3 Data Gathering

Figure 4.2 Data Gathering

● Data: This is the starting point, where raw data related to films is collected.

● Feature Extraction: This involves extracting useful features from the
data. The figure shows several types of feature extraction, including:

● Textual Embedding: A technique used to convert text data into
numerical vectors that can be processed by machine learning algorithms.

● Network Embedding: A method for converting network data (such
as a casting network) into numerical vectors.

● Visual Embedding: This likely refers to extracting features from visual
data, such as film posters or frames from the film itself.

● Story abstracts: This could refer to extracting features from story
summaries or synopses.

● Casting network: This refers to the network of actors and their
relationships in a film.

● Production: This could refer to features related to the production of the
film, such as budget, crew, or shooting location.

● Trailers: This could involve extracting features from film trailers, such as
audio or visual cues.

● Distribution data: This likely refers to data related to the distribution of
the film, such as box office revenue (BOR) or release dates.

● Prediction: After extracting features, the next step is to use them to make
predictions. This could involve predicting the success of a film, the target
audience, or other factors.

● Ranking: Finally, the predictions are ranked to determine which films are
most likely to be successful or meet other criteria.

Overall, the figure describes a system for processing film data,
extracting useful features, making predictions, and ranking those predictions.

4.4 Data Description

The dataset we used to train and test our model came from kaggle.com. The
dataset includes information about several films on IMDb, including film titles,
directors, genres, countries of origin, and the Facebook popularity of the top three
actors featured in the film. This information was provided in a variety of formats,
including strings, integers, and floating-point data. To implement the
machine learning algorithms effectively and avoid underutilizing certain
aspects of the films in the dataset, the data was converted to numerical
values and the features were scaled using the scikit-learn preprocessing module.

Figure 4.3 Data Description

The dataset contains 28 variables for 5043 films, spanning 100 years and
66 countries. There are 2399 unique director names and thousands of
actors and actresses. “imdb_score” is the response variable, while the other 27
variables are possible predictors. The IMDb website is a good source to refer to
at this time: due to its popularity, it contains a great deal of
information about films and comments from audiences. The scores that
IMDb gives are highly recognized by the public, representing the quality of
content as well as the audience's favor to some extent. Roughly speaking, half of
the variables are directly related to the films themselves, such as title, year,
and duration. The other half is related to the people involved in the production
of the films, e.g., director names, director Facebook popularity, and film
ratings from critics.

4.5 Flow Chart

Figure 4.4 Flow Chart

CHAPTER-5
EXPERIMENTAL INVESTIGATIONS

For the analysis, three supervised learning approaches are
selected for this problem. Care is taken that these approaches are
fundamentally different from each other, so that we cover as wide a range
of possible approaches as we can. For each algorithm, we try out
different values of a few hyperparameters to arrive at the best possible classifier.
This is carried out with the help of the grid search cross-validation technique.
Several machine learning algorithms can be used, depending on the data
to be processed, such as images, sound, text, and numerical values. Depending
on the objective, the algorithms chosen may be classification algorithms or
regression algorithms.
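
Grid search with cross-validation can be sketched as follows: every combination of hyperparameter values is scored on several held-out folds, and the combination with the best mean score is kept. This is a minimal standard-library illustration, not the project's code; the function names, the contiguous unshuffled folds, and the toy threshold rule are all assumptions made for the example.

```python
from itertools import product


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each contiguous fold."""
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        stop = n_samples if fold == n_folds - 1 else start + fold_size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test


def grid_search_cv(fit_and_score, param_grid, n_samples, n_folds=3):
    """Return (best_params, best_mean_score) over all parameter combinations."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            fit_and_score(params, train, test)
            for train, test in k_fold_indices(n_samples, n_folds)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy problem: classify x as 1 when it exceeds a tunable threshold.
xs = [1, 9, 2, 8, 3, 7, 4, 6, 0]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0]


def fit_and_score(params, train, test):
    """Accuracy on the held-out fold; this toy rule has nothing to fit on `train`."""
    hits = sum((xs[i] > params["threshold"]) == bool(ys[i]) for i in test)
    return hits / len(test)


best_params, best_score = grid_search_cv(
    fit_and_score, {"threshold": [2, 5, 8]}, len(xs)
)
```

With a real classifier, `fit_and_score` would train on the `train` indices before scoring on the `test` indices; the search loop itself stays the same.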

5.1 Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm
that can be used for both classification and regression challenges. However, it is
mostly used in classification problems. Support vectors are the individual
observations that lie closest to the separating hyperplane. The goal of a support
vector machine is not only to draw a hyperplane that divides the data points, but
to draw the hyperplane that separates the data points with the largest margin,
that is, with the most space between the dividing line and the nearest data
point.
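
The idea of a margin can be made concrete with a short sketch that measures, for a given hyperplane w·x + b = 0, the distance to its closest data point. The points and the two candidate hyperplanes are made up for illustration; a real SVM solves an optimization problem to find the maximum-margin hyperplane rather than comparing hand-picked candidates.

```python
from math import sqrt


def distance_to_hyperplane(point, weights, bias):
    """Perpendicular distance from `point` to the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    norm = sqrt(sum(w * w for w in weights))
    return abs(score) / norm


def margin(points, weights, bias):
    """Distance from the hyperplane to its closest data point."""
    return min(distance_to_hyperplane(p, weights, bias) for p in points)


# Two linearly separable clusters in 2-D (made-up points).
points = [(1, 1), (2, 1), (4, 4), (5, 4)]

# Both candidates separate the clusters, but the second leaves more room.
margin_a = margin(points, weights=(1, 0), bias=-2.5)  # vertical line x = 2.5
margin_b = margin(points, weights=(1, 1), bias=-5.5)  # line x + y = 5.5
```

Of the two candidates, the second has the larger margin, which is the property an SVM seeks when it selects its separating hyperplane.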

Figure 5.1 Support Vector Machine

2. Random Forest Classification

Random forests, or random decision forests, are an ensemble method for
classification, regression and other tasks that operates by constructing a multitude
of decision trees at training time and outputting the class that is the mode of the
classes (classification) or the mean prediction (regression) of the individual trees.
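
The aggregation step can be illustrated with a few lines of Python. This sketch covers only how the trees' outputs are combined; the three hand-written rules stand in for trained decision trees, and every name, threshold, and value in it is a hypothetical chosen for the example.

```python
from statistics import mean, mode


def forest_classify(trees, sample):
    """Classification: return the most common class among the trees' votes."""
    return mode([tree(sample) for tree in trees])


def forest_regress(tree_predictions):
    """Regression: return the average of the individual trees' predictions."""
    return mean(tree_predictions)


# Hypothetical stand-ins for trained trees: each is a single hand-written rule.
trees = [
    lambda film: "hit" if film["budget"] > 50 else "flop",
    lambda film: "hit" if film["vote_average"] > 6.5 else "flop",
    lambda film: "hit" if film["popularity"] > 30 else "flop",
]

film = {"budget": 80, "vote_average": 6.0, "popularity": 45}
predicted_class = forest_classify(trees, film)           # votes: hit, flop, hit
predicted_revenue = forest_regress([150.0, 90.0, 120.0])  # $ million, made up
print(predicted_class, predicted_revenue)
```

A real random forest also trains each tree on a bootstrap sample with a random subset of features; that training step is omitted here.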

Figure 5.2 Random Forest classifier

3. K-Nearest Neighbor (KNN) Classification Algorithm

K-Nearest Neighbor is one of the simplest machine learning algorithms based
on the supervised learning technique. The K-NN algorithm assumes that similar
cases lie close to each other: it compares a new case with the available cases
and puts the new case into the category held by the majority of its k most
similar (nearest) available cases.
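
A minimal K-NN classifier can be written in a few lines, as sketched below. The function name `knn_predict`, the six training points, and the two feature columns are assumptions for illustration; the features are taken to be already scaled to a comparable range, since Euclidean distance is otherwise dominated by the largest-valued feature.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-scaled features: (budget, vote_average).
train_points = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.30),
                (0.80, 0.90), (0.90, 0.70), (0.85, 0.80)]
train_labels = ["flop", "flop", "flop", "hit", "hit", "hit"]

print(knn_predict(train_points, train_labels, (0.70, 0.75)))
print(knn_predict(train_points, train_labels, (0.20, 0.20)))
```

The value of k is the main hyperparameter: a small k follows local noise, while a large k smooths over genuine local structure.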

Figure 5.3 KNN Classification Algorithm

4. Linear Regression

Linear regression analysis is used to predict the value of a variable based on the
value of another variable. The variable you want to predict is called the dependent
variable. The variable you are using to predict the other variable's value is called
the independent variable. This form of analysis estimates the coefficients of the
linear equation, involving one or more independent variables, that best predict the
value of the dependent variable. Linear regression fits a straight line or surface
that minimizes the discrepancies between predicted and actual output values.
Simple linear regression uses the “least squares” method to find the best-fit line
for a set of paired data. You then estimate the value of Y (the dependent variable)
from X (the independent variable).
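
The least squares fit for a single independent variable can be sketched directly from its closed-form solution. In this sketch x denotes the independent variable and y the dependent variable; the function name `fit_line` and the four data pairs are assumptions chosen so that the result is easy to check by hand.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical paired data: budget (x) and revenue (y), both in $ million.
budgets = [10, 20, 30, 40]
revenues = [25, 45, 65, 85]

slope, intercept = fit_line(budgets, revenues)
predicted = slope * 50 + intercept  # estimate y for a new x
print(slope, intercept, predicted)
```

With several independent variables the same idea applies, but the coefficients are found by solving a system of equations rather than with this two-term formula.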

Figure 5.4 Linear Regression

CHAPTER 6
EXPERIMENTAL RESULTS


The figures below show the results of the module implementation. These
screenshots show the user interface of the developed modules.

6.1 Flask

Flask is a lightweight WSGI web application framework. It is designed to make
getting started quick and easy, with the ability to scale up to complex applications.
It began as a simple wrapper around Werkzeug and Jinja and has become one of
the most popular Python web application frameworks. Flask offers suggestions,
but doesn't enforce any dependencies or project layout. It is up to the developer
to choose the tools and libraries they want to use. There are many extensions
provided by the community that make adding new functionality easy.

Figure 6.1 Flask

6.2 Home Page

The process involves inputting several variables, which include:

● Budget: The amount of money invested in making the film can
significantly influence its financial success. A larger budget might lead to
higher expectations, and the film would need to perform well to cover the
costs and generate a profit.

● Genres: Certain genres may be more popular than others, leading to higher
box office revenues. For example, action, adventure, and superhero films
often generate higher revenues compared to dramas or independent films.

● Popularity: The popularity of the film, which could be influenced by
factors such as the cast, director, or marketing campaign, can impact its
financial success. A highly anticipated film is more likely to draw larger
audiences and generate higher revenues.

● Runtime: The length of the film can also affect its financial success.
Shorter films might be more appealing to audiences with limited time,
while longer films may attract dedicated fans or those looking for a more
immersive experience.

● Vote average: The average rating given to the film by audiences can
impact its success. Films with higher ratings are more likely to attract
viewers, while those with lower ratings might struggle to draw audiences.

● Vote count: The total number of votes a film receives can also be an
indicator of its popularity and potential financial success. A higher vote
count might suggest that the film has a larger audience and could generate
more revenue.

● Director: The director's previous successes and reputation can influence a
film's financial success. A well-known and respected director might attract
a larger audience, leading to higher revenues.

● Release date: The month and week of release can impact a film's financial
success. Releasing a film during a popular season or during a week with
less competition can lead to higher revenues.

These factors, along with others, can contribute to the overall financial success of
a film. However, it's important to note that the specific combination and weight
of these factors can vary for each film, and the outcome might not always be
predictable.
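
The inputs listed above have to reach the model as one row of numbers in a fixed order. The sketch below shows one way to build that row by field name rather than by position. It is a hedged illustration, not the project's implementation: the helper `build_feature_row` is hypothetical, the feature names follow the list used in the Flask code in the appendix, and the sketch assumes the form fields carry exactly those names, which is an assumption about the HTML form.

```python
# Feature order expected by the model (names as in the appendix Flask code).
FEATURE_ORDER = ["budget", "genres", "popularity", "runtime", "vote_average",
                 "vote_count", "director", "release_month", "release_DOW"]


def build_feature_row(form):
    """Convert a mapping of submitted form fields into a list of floats
    arranged in the model's feature order. Raises ValueError on bad input."""
    missing = [name for name in FEATURE_ORDER if name not in form]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    row = []
    for name in FEATURE_ORDER:
        try:
            row.append(float(form[name]))
        except (TypeError, ValueError):
            raise ValueError(
                f"field {name!r} must be numeric, got {form[name]!r}") from None
    return row


# Hypothetical submission; genre and director are already encoded as numbers.
submitted = {
    "budget": "150", "genres": "0", "popularity": "45", "runtime": "120",
    "vote_average": "7.2", "vote_count": "3000", "director": "2108",
    "release_month": "6", "release_DOW": "4",
}
print(build_feature_row(submitted))
```

Looking fields up by name means the prediction no longer depends on the order in which the browser submits the form, and a non-numeric entry produces a clear error instead of a failed conversion deep inside the request handler.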

Figure 6.2 Home Page

6.3 Result Page

There is no long wait for the results: the predicted amount is returned within
a second.

Figure 6.3 Result Page

6.4 ADVANTAGES AND DISADVANTAGES

ADVANTAGES:

● Efficiency in workflow: The system predicts an approximate
success rate of a film based on its profitability by analyzing
data, and the web interface makes that prediction available to
as many people as possible.

● Reduced cost: There is no long wait for the results. The
predicted amount is returned within a second.

● Data-driven prediction: Machine learning algorithms are trained
on various kinds of data sets to predict film box office gross
with good accuracy.

DISADVANTAGES:

● A single error in the data set can distort the results.
● The supervised machine learning algorithms must reach adequate
accuracy for the predictions to be useful.
● The Python code must be free of errors.

CHAPTER-7
CONCLUSION AND
FUTURE ENHANCEMENT

7.1 CONCLUSION
The proposed system addresses the limitations of the existing system by
predicting film box office gross using various supervised learning algorithms.
By comparing the performance of these algorithms, the system identifies the
best-performing technique, ensuring the use of the most effective method for
prediction. Linear Regression, which was identified as the algorithm with the
highest accuracy, is used to predict film box office gross. The
user-friendly interface created using Flask enhances usability and accessibility,
allowing users to input data and view the predicted box office gross values easily.

7.2 FUTURE ENHANCEMENT


The proposed system can be further improved by incorporating additional
features, such as social media sentiment analysis, to provide a more
comprehensive prediction of film success. The system can also be expanded to
include real-time predictions, allowing users to access up-to-date predictions as
soon as they become available. Additionally, the system can be integrated with
existing film streaming platforms, such as Netflix or Amazon Prime, to provide
personalized recommendations based on predicted box office gross.

Furthermore, the system can be scaled up to analyze larger datasets, allowing for
more accurate predictions and improved performance. The system can also be
adapted to predict the success of other forms of media, such as music or books,
by using similar supervised learning algorithms and techniques.
Overall, the proposed system provides a reliable and accurate method for
predicting film box office gross, and has the potential for further development
and expansion in the future.

CHAPTER-9
APPENDIX


9.1 COLAB NOTEBOOK


FORECASTING FILM FINANCIAL SUCCESS WITH MACHINE LEARNING.ipynb - Colab
(google.com)

9.2 FLASK CODE

from flask import Flask, request, render_template
import pickle

import numpy as np  # needed for np.array below
import pandas as pd

app = Flask(__name__)  # initialising the Flask app

filepath = "model_movies.pkl"
model = pickle.load(open(filepath, 'rb'))  # loading the saved model
scalar = pickle.load(open("scalar_movies.pkl", "rb"))  # loading the saved scaler file


@app.route('/')
def home():
    return render_template('movieboxoffice.html')


@app.route('/y_predict', methods=['POST'])
def y_predict():
    # For rendering results on HTML.
    # Values are read in the order the fields appear in the form,
    # which must match the order of feature_name below.
    input_feature = [float(x) for x in request.form.values()]
    features_values = [np.array(input_feature)]
    feature_name = ['budget', 'genres', 'popularity', 'runtime', 'vote_average',
                    'vote_count', 'director', 'release_month', 'release_DOW']
    x_df = pd.DataFrame(features_values, columns=feature_name)
    x = scalar.transform(x_df)
    # predictions using the loaded model file
    prediction = model.predict(x)
    print("Prediction is:", prediction)
    return render_template("revenuepredict.html", prediction_text=prediction[0])


if __name__ == "__main__":
    app.run(debug=False)

Figure 9.1 Flask Code

9.3 HTML FILES

i. movieboxoffice.html

<!DOCTYPE html>
<html lang="en">
<head>
<title>MOVIE BOX OFFICE GROSS PREDICTION</title>
<style>

body{
opacity : 0.75;
text-align: center;
font-family: Verdana, Tahoma, sans-serif;
font-size: larger;
background-image:
url('https://i.postimg.cc/6q9K4Hrm/filmpicture.jpg');
background-repeat: no-repeat;
background-attachment: fixed;
background-size: cover;
}
h1,p {
animation-duration: 3s;
animation-name: slidein;
}

@keyframes slidein {
from {
margin-left: 100%;
width: 300%;
}

to {
margin-left: 0%;
width: 100%;
}
}
form{
padding: 10px;

}
td{
padding: 6px;
}
input[type=number]{

padding: 8px 0px;


margin: 8px 0;
border: 2px solid #ccc;
border-radius: 6px;
}
#btn{
background-color: #52abf3;
width: 100%;
border: 10px;
border-radius: 7px;
color: white;
padding: 15px 32px;
font-size: 16px;
cursor: pointer;
}
#btn:hover{
background-color:#008bf9;
}
table{
border-radius: 5px;
background-color: #f2f2f2;
padding: 20px;
}

#prediction {
color: white ;
}
input[type=text],select,button{
width: 100%;
padding: 12px 20px;
margin: 8px 0;
box-sizing: border-box;
border: 2px solid red;
border-radius: 4px;
}

</style>
</head>
<body>
<!-- <div class="box"> -->
<h1 style="background-color:powderblue;">MOVIE BOX OFFICE
GROSS PREDICTION</h1>
<!-- </div> -->

<!-- <div class="box"> -->


<center>
<form action="{{ url_for('y_predict')}}" method="post">
<table>
<tr>

<td>
<label for="budget">Budget</label>
</td>
<td>:</td>
<td>
<input type="text" id="budget" name="budget"
placeholder="Budget in $ Million"
required="required"/>
</td>
</tr>
<tr>
<td>
<label for="genres">Genres</label>
</td>
<td>:</td>
<td>
<select id="genres" name="genres" >
<option>Select the genres</option>
<option value="6">Drama</option>
<option value="3">Comedy</option>
<option value="0">Action</option>
<option value="1">Adventure</option>
<option value="10">Horror</option>
<option value="4">Crime</option>
<option value="16">Thriller</option>
<option value="2">Animation</option>
<option value="8">Fantasy</option>
<option value="14">Science Fiction</option>
<option value="13">Romance</option>

<option value="7">Family</option>
<option value="12">Mystery</option>
<option value="5">Documentary</option>
<option value="18">Western</option>
<option value="17">War</option>
<option value="9">History</option>
<option value="15">TV Movie</option>
<option value="11">Music</option></select>
</td>
</tr>
<tr>
<td>
<label for="Enter popularity">Enter popularity</label>
</td>
<td>:</td>
<td>
<input type="text" id="Enter popularity" name="Enter popularity"
placeholder="Enter the popularity"
required="required" />
</td>
</tr>

<tr>
<td>
<label for="Enter runtime ">Enter runtime </label>
</td>
<td>:</td>
<td>
<input type="text" id="Enter runtime " name="Enter runtime "
placeholder="Enter runtime"
required="required"/>
</td>
</tr>

<tr>
<td>
<label for="Enter vote_average">Enter vote_average</label>
</td>
<td>:</td>
<td>
<input type="text" id="Enter vote_average" name="Enter vote_average"
placeholder="Enter vote_average"
required="required"/>
</td>
</tr>

<tr>
<td>
<label for="Enter vote_count">Enter vote_count </label>
</td>
<td>:</td>
<td>
<input type="text" id="Enter vote_count" name="Enter vote_count"
placeholder="Enter vote_count"
required="required"/>
</td>
</tr>

<tr>
<td>
<label for="director">Director </label>
</td>
<td>:</td>
<td>
<select id="director" name="director" >
<option>Select the director</option>
<option value="2108">Steven Spielberg</option>
<option value="2323">Woody Allen</option>
<option value="1431">Martin Scorsese</option>
<option value="377">Clint Eastwood</option>
<option value="1851">Ridley Scott</option>
<option value="1894">Robert Rodriguez</option>
<option value="2051">Spike Lee</option>
<option value="2107">Steven Soderbergh</option>
<option value="1810">Renny Harlin</option>
<option value="2169">Tim Burton</option>
<option value="1654">Oliver Stone</option>
<option value="1904">Robert Zemeckis</option>
<option value="1930">Ron Howard</option>
<option value="1034">Joel Schumacher</option>
<option value="156">Barry Levinson</option>
<option value="1480">Michael Bay</option>
<option value="2234">Tony Scott</option>
<option value="245">Brian De Palma</option>
<option value="667">Francis Ford Coppola</option>
<option value="1256">Kevin Smith</option>
<option value="1973">Sam Raimi</option>

<option value="2025">Shawn Levy</option>
<option value="1823">Richard Donner</option>
<option value="320">Chris Columbus</option></select><br>
</td>
</tr>

<tr>
<td>
<label for="Enter the month of release">Enter the month of
release</label>
</td>
<td>:</td>
<td>
<input type="text" id="Enter the month of release"
name="Enter the month of release"
placeholder="Enter the month of release " required="required" />
</td>
</tr>

<tr>
<td>
<label for="Enter the week of the month">Enter the week of the
month</label>
</td>
<td>:</td>
<td>
<input type="text" id="Enter the week of the month"
name="Enter the week of the month"
placeholder="Enter the week of the month " min="0" />
</td>
</tr>

<tr>
<td colspan="3">
<input id="btn" type="submit" value="Predict" >
</td>
</tr>
</table>
<p id="prediction"> {{prediction_text}}</p>
</form>
</center>
<!-- </div> -->
</body>
</html>

Figure 9.2 movie boxoffice output

ii. revenuepredict.html

<!DOCTYPE html>
<html lang="en">
<head>
<title>Movie Box Office Gross Revenue</title>
<style>
.idiv{
border-radius:10px;
}
body{
background-image:url('../static/projector_image/lights.jpg');
background-repeat: no-repeat;
background-position: center;
font-family:sans-serif;
background-size:cover;
}
input{
font-size:1.3em;
width:80%;
text-align:center;
}
input::placeholder{
text-align:center;
}
button{
outline:0;
border:0;
background-color:darkred;
color:white;
width:100px;
height:40px;
}
button:hover{
background-color:brown;
border:solid 1px black;
}
h1{
color:white;
text-shadow: 2px 2px 5px blue;
}
h2{
color:lightyellow;
text-shadow: 2px 2px 5px orange;
}
</style>
</head>
<body>

<div class='idiv'>
<br/>
<h1 align="right">Movie Box Office Gross Revenue : </h1>
<br/>
<h2 align="right">The Revenue predicted is $ {{prediction_text}}
million </h2>

<br/>
<br/>
<br/>
</div>

</body>
</html>

Figure 9.3 revenue predict output
