By Rishitha Padi 2020IMG-047: Flight Fare Prediction System
by
Rishitha Padi
2020IMG-047
in
IPG-MBA
ATAL BIHARI VAJPAYEE
I hereby certify that the work presented in the report entitled Flight Fare Prediction System,
in IPG-MBA and submitted to the institution, is an authentic record of my own work carried
out during the period May-2021 to August-2021 under the supervision of Prof. Vinal Patel. I have
also cited the references for the text(s)/figure(s)/table(s) from where they have been taken.
Date
The final copy of this report has been examined by the signatories, and we find that both the
content and the form meet acceptable presentation standards of scholarly work in the above
mentioned discipline.
Candidate’s Declaration
I hereby certify that I have properly checked and verified all the items as prescribed in the
check-list and ensure that my thesis is in the proper format as specified in the guideline for thesis
preparation.
I declare that the work contained in this report is my own work. I understand that plagiarism is
defined as any one or combination of the following:
(1) To steal and pass off (the ideas or words of another) as one’s own
(4) To present as new and original idea or product derived from an existing source.
I understand that plagiarism involves an intentional act by the plagiarist of using someone else’s
work/ideas completely/partially and claiming authorship/originality of the work/ideas. Verbatim
copying as well as close resemblance to someone else’s work constitutes plagiarism.
I have given due credit to the original authors/sources for all the words, ideas, diagrams, graphics,
computer programmes, experiments, results, and websites that are not my original contribution.
I have used quotation marks to identify verbatim sentences and given credit to the original
authors/sources.
I affirm that no portion of my work is plagiarized, and the experiments and results reported in
the report/dissertation/thesis are not manipulated. In the event of a complaint of plagiarism and
the manipulation of the experiments and results, I shall be fully responsible and answerable. My
faculty supervisor(s) will not be responsible for the same.
Signature:
Date: 14-10-2022
Abstract
Air travel has made life easier, and people have grown accustomed to it. Flying has become
an essential component of modern life as more and more people choose it over other, quicker forms
Customers want to pay as little as possible, and airline firms want to make as much money as
they can. Fares can change within hours, and they also vary with the number of stops, the flight
timing, the flight duration, the destination, and factors such as holiday or festive seasons.
Hence, having a general understanding of flight costs before a trip helps people save money
and time.
By applying machine learning algorithms to the gathered historical flight data, the suggested
system will generate a prediction model. This model will give users an understanding of the pricing
strategy that airlines follow and a predicted value that they can use to make better decisions
The advantages of the Flight Fare Prediction system’s design, operation, and performance are
Keywords: Feature selection, Flight price, Machine learning, Pricing Models, Prediction Model,
Random Forest.
Dedication
As a student who lives far from home, travelling home for the holidays has
become a problem. It takes me almost 24 hours to travel by train, so the best option is a flight, but
To solve this, I thought it would be beneficial for the many people who want to book tickets to know
In the present scenario, with more things moving to the digital platform, a digital solution
Acknowledgments
I want to extend my heartfelt thanks to Dr. Vinal Patel, Dr. Santosh Singh Rathore,
Dr. Karm Veer Arya, and Dr. Gaurav Kaushal for constantly guiding me through the project.
They helped me to develop an excellent practice of reading the recent literature before pursuing the
work. They provided invaluable advice, recommendations, wise judgement, constructive criticism,
and an eye for perfection that helped the current work grow and flourish. It is only because of their
overwhelming interest and helpful attitude that the present work has attained the stage it has. I can
sincerely say that this project made me explore many areas of computer science, and along the
way I found out about new topics. This practice undoubtedly increased my thirst for knowledge.
Their expertise in the field of Machine Learning helped me to think better and more innovatively.
I am indebted to all the professors for allowing us to develop an industry-grade project. They
gave their valuable time to evaluate the project and provide much-needed insights.
(Rishitha Padi)
Contents
Chapter
ABBREVIATION
1 Introduction
1.1 Context
1.2 Objectives
2 Literature review
2.1 Background
3 Methodology
3.1 Tools
3.2 Workflow
3.3 Mechanism/Algorithm
4.1 Conclusion
Bibliography
LIST OF FIGURES
Figure
3.5 Correlation
4.5 Output
ABBREVIATIONS
Introduction
This chapter presents an overview of the context of the project in Section 1.1. Section 1.2
introduces the objectives of the project. Next, Section 1.3 presents the implementation
workflow step by step. Finally, in Section 1.4, the result of the research carried out is briefly
introduced.
1.1 Context
In the present-day scenario, airline corporations keep altering the cost of airline tickets
to increase their revenues. This predictive system can help users save money by giving them an
estimate of ticket prices so that they do not end up buying costly tickets. This
project aims to create an application that uses machine learning to estimate the fares of
various flights. It also suggests to the customer whether to "buy" or "wait". If there is a chance
that the flight price will decrease, then waiting is recommended. When the least possible price is
Airline ticket prices fluctuate drastically from day to day, even for the same flight. Customers
find it extremely difficult to purchase airline tickets at the lowest price because the price changes
so often. The price of the cheapest ticket for a specific flight increases with time. Airlines do this
to maximize revenue and to make sure customers understand that last-minute purchases
are expensive.
1.2 Objectives
The objective is to develop a machine learning model and deploy it using Flask as a portal
for predicting flight fares based on the user's arrival time, departure time, source,
destination, number of stops, and airline. The model is trained on a data set covering different
conditions and predicts fares using Random Forest Regression. In this project, our main goals
were to identify underlying patterns in travel costs in India using historical data and to recommend
• To analyse every single detail of every ticket and develop a Flight Fare Predictor using Ex-
• To deploy the model into a web application using Flask, HTML, CSS.
After the modifications, training accuracy increases slightly and test accuracy
is marginally better. The best part of the results is that the training time gets drastically
Literature review
This chapter presents an overview of the literature background in Section 2.1. In Section
2.2, the key related research is discussed. This related research is analysed in Section
2.3. Finally, in Section 2.4, the overall conclusion from the literature review is discussed.
2.1 Background
• One might expect distance to be the only factor that determines the flight price. Although
it still plays a significant role, it is not the only factor that dictates the pricing strategy.
• For airlines to modify strategy and resources for a particular route, they must be capable
of accurately predicting the trend in airfare at the level of a given market segment.
• We could help travellers reduce expenses on their journeys if we could provide them with
the best time to purchase their flight tickets based on historical data and also show them
• Juhar Ahmed Abdella, Nazar Zaki, and Khaled Shuaib: Automatic detection of airline ticket
price and demand: In this paper, the authors review customer-side and airline-side prediction
models. This review study demonstrates that models on both sides rely on a small number
of features, such as historical information on ticket prices, the date that tickets were
purchased, and the departure date. Advanced machine learning methods combined with
external factors like social media data and search engine queries are not taken into account.
study’s goal is to examine the Russian aviation industry and contrast how prices behave
on domestic and international flights. An empirical data-driven model was created for the
purpose of predicting air prices for various trip directions using the information that was
gathered from two independent ticket price information aggregators (AviaSales and Sabre)
• Collecting the data: To train a prediction model, we need a data set that contains
past records. For this model, we need historical ticket prices, which give an idea of how
prices have behaved in the past. Collecting this data manually is impractical, so a Python
program is required to collect the data automatically at a specific time each day.
• Data cleaning and Preparation: Once we obtain the data, it may have missing values,
and the format in which it is given may not suit the model. So, we must clean and
prepare it in accordance with the model’s specifications. This is a very crucial step in building
a model. For this we use a variety of statistical techniques from built-in Python
packages.
• Building the model: Once the data is prepared, it is analysed to reveal the hidden
trends. Predictive models and classification models are then applied to the training set.
The models used include Random Forest, Linear Regression, Decision Trees, Extra Trees
Regressor, and combinations of these models to improve the accuracy.
we must test them on a testing set and determine whether or not each user request resulted
in savings or losses. Later after comparing loss of the predicted and actual value, the
• Deploying the model: Once the model is ready, a function is written to prepare the data, and
a web application is then created using Flask, with the frontend designed using HTML, CSS
Flight prices vary so much that the fare of a flight ticket today may be different from the
price of the same ticket tomorrow. So, it has become very difficult to predict the flight fare.
To solve this issue, I used a data set of flight ticket prices for different airlines and for
routes between different places. The model is built on this data so that it can predict
Methodology
This section describes the tools and methods used during implementation. It lays
down a clear description of the process used during development and testing.
3.1 Tools
A variety of tools were used during development of the project. The major tools are
listed below:
• Prediction system:
∗ Jupyter Notebook: Jupyter Notebook allows users to compile all aspects of a data
project in one place making it easier to show the entire process of a project to your
intended audience. Through the web-based application, users can create data visual-
izations and other components of a project to share with others via the platform.
• Web interface:
∗ HTML: The coding that organises a web page’s content is called HTML, (HyperText
3.2 Workflow
3.3 Mechanism/Algorithm
In this project, machine learning models are first applied to build a predictive model, and
then a web application is built that takes the input and returns the predicted value as output.
The collected data is saved in a CSV file. The dataset was collected from Kaggle and has a total
of almost ten thousand data points. Every data point contains: Date of Journey, Airline,
Source, Destination, Departure Time, Arrival Time, Route, Total number
Cleaning the data involves removing all rows with null values. These are very few in our dataset,
so removing them had no impact on the results. During data pre-processing we found that the vast
majority of the data was in string format, which is not suitable for applying a machine learning model.
Each feature’s data is extracted, including the day and month from the journey’s date in integer
format and the hours and minutes from the departure time. Categorical variables are transformed
into model-recognizable values using one-hot encoding and label encoding.
The data types were modified, and certain attributes had to be split to make them more useful.
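The cleaning and extraction steps described above can be sketched with pandas as follows. This is a minimal illustration and not the project's actual code: the column names (Date_of_Journey, Dep_Time, Price), the date and time formats, and the sample rows are all assumptions.

```python
import pandas as pd

# Tiny made-up sample; column names and formats are assumptions.
raw = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "01/05/2019", None],
    "Dep_Time": ["22:20", "05:50", "09:25"],
    "Price": [3897, 7662, 13882],
})

# Remove rows that contain null values.
clean = raw.dropna().copy()

# Parse the journey date and keep the day and month as integers.
journey = pd.to_datetime(clean["Date_of_Journey"], format="%d/%m/%Y")
clean["Journey_day"] = journey.dt.day
clean["Journey_month"] = journey.dt.month

# Parse the departure time and keep the hour and minute as integers.
departure = pd.to_datetime(clean["Dep_Time"], format="%H:%M")
clean["Dep_hour"] = departure.dt.hour
clean["Dep_min"] = departure.dt.minute

# The original string columns are no longer needed.
clean = clean.drop(columns=["Date_of_Journey", "Dep_Time"])
print(clean)
```

After this step every remaining column is numeric, which is what the regression models require.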
One Hot Encoding: The data set contains categorical data that cannot be used in a
model directly, so one-hot encoding generates new binary features, one for each distinct
value present in the categorical feature. Each of these values is now a feature in the data
Date of Journey is split into date, month and year. Every detail is now a feature. Similarly,
departure time and arrival time are each divided into hours and minutes, each becoming a feature.
These changes are made within the data set, while the categorical data is converted separately.
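As a rough sketch of how the categorical columns might be converted separately, the example below one-hot encodes nominal columns with pandas and maps an ordered column to integers. The column names, the sample values, and the choice of which columns receive which encoding are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample; column names and category labels are assumptions.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo"],
    "Source": ["Delhi", "Kolkata", "Delhi", "Mumbai"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop", "non-stop"],
})

# Nominal features (no natural order): one new 0/1 column per distinct value.
encoded = pd.get_dummies(df, columns=["Airline", "Source"], dtype=int)

# Ordered feature: label encoding, here a plain mapping from label to integer.
stops_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
encoded["Total_Stops"] = encoded["Total_Stops"].map(stops_map)

print(encoded)
```

One-hot encoding avoids implying an order between airlines or cities, while the integer mapping keeps the natural order of the stop counts.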
The dataset has a feature named Duration, in which hours and minutes are stored
together. During preprocessing, this feature is converted into two features,
Duration hours and Duration minutes, which store only the hours and the minutes respectively.
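A possible way to split the Duration feature is sketched below. It assumes that durations are stored as strings such as "2h 50m", "19h" or "45m"; that format, the helper name split_duration, and the output column names are assumptions and may differ from the actual dataset.

```python
import re

import pandas as pd


def split_duration(text):
    """Return (hours, minutes) from strings such as '2h 50m', '19h' or '45m'."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    return (
        int(hours.group(1)) if hours else 0,
        int(minutes.group(1)) if minutes else 0,
    )


df = pd.DataFrame({"Duration": ["2h 50m", "19h", "45m"]})
parts = [split_duration(value) for value in df["Duration"]]
df["Duration_hours"] = [h for h, _ in parts]
df["Duration_mins"] = [m for _, m in parts]
df = df.drop(columns=["Duration"])
print(df)
```

A missing hour or minute part is treated as zero, so a value like "19h" does not break the conversion.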
Figure 3.1 shows the box plot of airline versus fare, that is, the fare charged by
different airlines. The figure makes it clear that Jet Airways charges more compared to
Similarly, the box plot of source versus price shows the impact of the source city on the price
charged by the airlines operating there. According to the graph, Bangalore is the only place that
offers flights in every price range. Delhi has more expensive flight tickets compared to the others.
Similarly, the box plot of destination versus price shows the impact of the destination city on
the price charged by the airlines operating there. From the graph, we can see that Delhi is the
only place that offers flights in every price range. Cochin has more expensive flight tickets compared
Total stops are stored as strings, which are converted into integers using Python functions.
Additional info is "no info" for about 90 percent of the data, so it has little effect, and because the
route and the number of stops describe the same thing, the route feature is removed. Total stops vs price concludes that as
The relationship between date of journey and price varies from one route to another. Here I
took the Mumbai-Delhi route as an example. The time of day at which we book our tickets also
matters: morning, afternoon, or evening. The number of days in advance that we book matters
as well. The price does not necessarily increase; if there is a holiday period now but not
later, the prices may decrease.
After the categorical data has been analysed separately, the features need to be concatenated. The
correlation between the features is given in Figure 3.4, where negative values indicate that
two features are negatively correlated, that is, as one increases the other tends to decrease. However,
Extra Trees Regressor is a class that is used to increase the accuracy and to
control overfitting. It is also a commonly used method for finding the most important
features. From Figure 3.6, we learned that Total-stops is the most important feature in our
model prediction.
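To illustrate how feature importances can be read from an Extra Trees model, here is a small sketch using scikit-learn's ExtraTreesRegressor. The data is synthetic and deliberately constructed so that the number of stops dominates the price; the feature names and coefficients are assumptions and do not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: the price is built to depend mostly on Total_Stops.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Total_Stops": rng.integers(0, 4, size=n),
    "Journey_day": rng.integers(1, 31, size=n),
    "Dep_hour": rng.integers(0, 24, size=n),
})
y = 3000 + 2500 * X["Total_Stops"] + 20 * X["Dep_hour"] + rng.normal(0, 200, size=n)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Importances sum to 1; larger values mean the feature was more useful for splits.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On real data the ranking depends on the dataset, so the importances should be recomputed after every change to the feature set.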
After applying few regression models and verifying the models based on R-square, MAE,
Linear Regression
This is a supervised learning technique. It is a model in which we assume that the input and
output are linearly related. In the flight price prediction data set, the price may depend on many
independent variables, so multiple linear regression is used.
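As a small illustration of multiple linear regression, the sketch below fits scikit-learn's LinearRegression to synthetic data generated from a known linear relationship and recovers the coefficients. The three features and their coefficients are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: price = 2000 + 1500*x1 - 40*x2 + 25*x3.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 3))
true_coef = np.array([1500.0, -40.0, 25.0])
y = 2000.0 + X @ true_coef

model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

With real fare data the relationship is only approximately linear, so the fitted coefficients describe an average trend and not an exact rule.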
Decision Tree
This is a tree structure which is used in both regression and classification models. The
independent variables of the dataset are used as decision nodes, and the resulting tree is
used to make decisions. A decision tree has two kinds of nodes: internal (decision) nodes, which
test the features of the dataset, and leaf nodes, which indicate the result. The branches
illustrate the decision rules.
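The structure described above can be inspected directly. The following sketch trains a shallow DecisionTreeRegressor from scikit-learn on a handful of made-up rows and prints its decision rules with export_text; the feature names and prices are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up rows: [number of stops, duration in hours] -> price.
X = np.array([[0, 2], [0, 3], [1, 5], [1, 6], [2, 9], [2, 10]])
y = np.array([3000, 3200, 5000, 5200, 9000, 9400])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Internal nodes appear as feature thresholds, leaves as predicted values.
rules = export_text(tree, feature_names=["Total_Stops", "Duration_hours"])
print(rules)

prediction = tree.predict([[0, 2]])[0]
print(prediction)
```

Each printed line with a threshold is an internal node, each indented branch is a decision rule, and each "value" line is a leaf.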
Random Forest is a machine learning algorithm that draws random samples from
the dataset, builds a decision tree for every sample, and, for regression, averages the predictions
of all the trees to produce the final prediction. Accuracy obtained from this model is 80.8 percent which
This section discusses the experiment conducted and the corresponding results obtained.
On calculating the accuracy scores of test data for these models the results are:
After this, the data is split into training and test sets, where the training data is used to fit
the model and the test data is used to measure the accuracy. However, the models should be
RandomizedSearchCV
This is a technique in which randomly sampled hyperparameter combinations are compared to find the
best configuration for the model. Trying many combinations requires more memory and consumes a lot of
time.
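A minimal sketch of a randomized hyperparameter search for a Random Forest regressor with scikit-learn is given below. The synthetic data, the parameter grid, the number of iterations and the scoring metric are all assumptions chosen to keep the example small; they are not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the prepared flight features.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 4))
y = 3000 + 800 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 100, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate values; only n_iter random combinations are tried, not all of them.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X_train, y_train)

# R-squared of the best model on the held-out test set.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, test_r2)
```

Increasing n_iter explores more combinations at the cost of longer training, which is the trade-off noted above.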
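The absolute error is the magnitude of the difference between the predicted value and the actual
value of an observation; the mean absolute error (MAE) averages it over all observations.
The mean squared error (MSE) is the average of the squared errors, which is used as the loss function for least-squares regression.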
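These error measures can be computed by hand for a few values. The prices below are invented purely to show the arithmetic; RMSE and R-squared are included because they are derived from the same errors.

```python
import numpy as np

# Invented actual and predicted fares.
y_true = np.array([4000.0, 5500.0, 7200.0, 10000.0])
y_pred = np.array([4200.0, 5000.0, 7000.0, 11000.0])

errors = y_pred - y_true                 # 200, -500, -200, 1000

mae = np.mean(np.abs(errors))            # mean absolute error
mse = np.mean(errors ** 2)               # mean squared error
rmse = np.sqrt(mse)                      # root mean squared error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
```

Here the MAE is 475 and the MSE is 332500; the single error of 1000 dominates the MSE because errors are squared before averaging.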
The backend of this model is created using the Flask framework, in which API endpoints
handling GET and POST requests are built to receive the data and then display the output in the
frontend.
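A minimal sketch of such a backend is shown below. The route names, the form field total_stops, and the predict_fare function with its formula are hypothetical stand-ins; in the actual application the trained model would be loaded and called in their place.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_fare(features):
    """Hypothetical stand-in for the trained model's predict call."""
    return 3000.0 + 2500.0 * features["total_stops"]


@app.route("/", methods=["GET"])
def home():
    # In the real application this would render the HTML form.
    return "Flight Fare Prediction"


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted form fields and convert them to model features.
    features = {"total_stops": int(request.form.get("total_stops", 0))}
    fare = predict_fare(features)
    return jsonify({"predicted_fare": round(fare, 2)})
```

The application can be started locally with the flask command-line tool or by calling app.run(); a form posted to /predict with total_stops set to 2 would receive a JSON response containing the predicted fare.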
The frontend application is built using HTML, CSS, and Bootstrap, where users can choose the
flight details they want. These details are sent to the ML model backend, where the flight price is
predicted, and the output is then sent back to the frontend and displayed. We have deployed the ML
model through Flask, which lets us run the model on localhost. The server can be further deployed
4.1 Conclusion
The Random Forest Regressor gave good accuracy compared to the other algorithms, so it is
comparatively the best ML model. Figure 4.3 is a line graph in which a few randomly chosen values
from the test data are used to compare the original and predicted flight fares. In
Figure 4.4, we can select the details to check the fare. On submitting the details, the backend
receives them from the frontend, the price is predicted, and the final value is shown in
Figure 4.5.
Bibliography
[1] Juhar Ahmed Abdella, Nazar Zaki, and Khaled Shuaib. Automatic detection of airline ticket
price and demand: A review. In 2018 International Conference on Innovations in Information
Technology (IIT), volume 33, pages 169–174, 2018.
[2] Anastasia Lantseva, Ksenia Mukhina, Anna Nikishova, Sergey Ivanov, and Konstantin Knyazkov.
Data-driven modeling of airlines pricing. Volume 66, pages 267–276, 2015.
[3] W. Groves and M. Gini. An agent for optimizing airline ticket purchasing. Pages 1341–1342,
2013. Accessed: 2022-07-30.
[4] Supriya Rajankar, Neha Sakhrakar, and Omprakash Rajankar. Flight fare prediction using
machine learning algorithms. In International Journal of Engineering Research and Technology
(IJERT), June 2019. Accessed: 2022-07-30.
[5] K. Tziridis, Th. Kalampokas, G. A. Papakostas, and K. I. Diamantaras. Airfare prices prediction
using machine learning techniques. In 2017 25th European Signal Processing Conference
(EUSIPCO), pages 1036–1039, 2017.