


Flight Fare Prediction System

Rishitha Padi

2020IMG-047

A report submitted for Summer Project

IPG-MBA

ATAL BIHARI VAJPAYEE

INDIAN INSTITUTE OF INFORMATION TECHNOLOGY AND MANAGEMENT

GWALIOR - 474015, MADHYA PRADESH, INDIA

Report Certificate

I hereby certify that the work presented in the report entitled Flight Fare Prediction System,
in IPG-MBA and submitted to the institution, is an authentic record of my/our own work carried
out during the period May-2021 to August-2021 under the supervision of Prof. Vinal Patel. I have also
cited the references for the text(s)/figure(s)/table(s) from where they have been taken.

Prof. Vinal Patel

Date

The final copy of this report has been examined by the signatories, and we find that both the
content and the form meet acceptable presentation standards of scholarly work in the above
mentioned discipline.
Candidate’s Declaration

I hereby certify that I have properly checked and verified all the items as prescribed in the
check-list and ensure that my thesis is in the proper format as specified in the guideline for thesis
preparation.
I declare that the work contained in this report is my own work. I understand that plagiarism is
defined as any one or combination of the following:

(1) To steal and pass off (the ideas or words of another) as one’s own

(2) To use (another’s production) without crediting the source

(3) To commit literary theft

(4) To present as new and original idea or product derived from an existing source.

I understand that plagiarism involves an intentional act by the plagiarist of using someone else’s
work/ideas completely/partially and claiming authorship/originality of the work/ideas. Verbatim
copy as well as close resemblance to someone else’s work constitute plagiarism.
I have given due credit to the original authors/sources for all the words, ideas, diagrams, graphics,
computer programmes, experiments, results, websites, that are not my original contribution.
I have used quotation marks to identify verbatim sentences and given credit to the original
authors/sources.
I affirm that no portion of my work is plagiarized, and the experiments and results reported in
the report/dissertation/thesis are not manipulated. In the event of a complaint of plagiarism and
the manipulation of the experiments and results, I shall be fully responsible and answerable. My
faculty supervisor(s) will not be responsible for the same.

Signature:

Name: Rishitha Padi

Roll. No: 2020IMG-047

Date: 14-10-2022
Abstract

Air travel has made life easier, and people have become accustomed to it. Flying has become an essential component of modern life as more and more people choose it over other forms of transportation. Although plane tickets are readily available, many people struggle to purchase them under ideal conditions.

Customers want to pay as little as possible, and airlines want to make as much money as they can. Prices can change within hours, and they also differ with the number of stops. Fares vary with parameters such as flight timing, flight duration and destination, as well as factors like holidays or the festive season. Hence, having a general understanding of flight costs before a trip helps people save money and time.

By applying machine learning algorithms to gathered historical flight data, the proposed system generates a prediction model. This model gives users an understanding of the pricing strategy that airlines follow, and provides a predicted fare that they can use to make better decisions about their travel bookings.

This report discusses the design, operation and performance of the Flight Fare Prediction system, along with its advantages and drawbacks.

Keywords: Feature selection, Flight price, Machine learning, Pricing Models, Prediction Model,

Random Forest.
Dedication

As a student who lives far from home, travelling home for the holidays has become a problem. The journey takes almost 24 hours by train, so flying is the best option, but booking a flight ticket has been a challenge for me.

To solve this, I thought it would be beneficial for the many people who want to book tickets to know the ideal conditions for booking them. Although, initially

In the present scenario, with more things moving to digital platforms, a digital solution is the most fitting way to overcome such a challenge.

Acknowledgments

I want to extend my heartfelt thanks to Dr. Vinal Patel, Dr. Santosh Singh Rathore, Dr. Karm Veer Arya and Dr. Gaurav Kaushal for constantly guiding me through the project. They helped me develop the excellent practice of reading the recent literature before pursuing the work. They provided invaluable advice, recommendations, wise judgement, constructive criticism, and an eye for perfection that helped the current work grow and flourish. It is only because of their overwhelming interest and helpful attitude that the present work has attained the stage it has. I can sincerely say that this project made me explore many areas of computer science, and along the way I found out about new topics. This practice undoubtedly increased my thirst for knowledge. Their expertise in the field of Machine Learning helped me to think better and more innovatively.

I am indebted to all the professors for allowing us to develop an industry-grade project. They gave their valuable time to evaluate the project and to give much-needed insights about it.

Specialisation was one of the primary sources of my learning.

(Rishitha Padi)

Contents

Abbreviations

1 Introduction
  1.1 Context
  1.2 Objectives
  1.3 Implementation workflow
  1.4 Research Results

2 Literature review
  2.1 Background
  2.2 Key related research
  2.3 Proposed hypothesis
  2.4 Problem formulation

3 Methodology
  3.1 Tools
  3.2 Workflow
  3.3 Mechanism/Algorithm
    3.3.1 Data Collection
    3.3.2 Data Cleaning and Preparing
    3.3.3 Exploratory Data Analysis/ Data Visualization
    3.3.4 Regression Models

4 Experiments and results
    4.0.1 Model Training and Hypertuning
    4.0.2 Deploying the ML model with Flask
  4.1 Conclusion

Bibliography

LIST OF FIGURES

Figure 1.1 Work Flow Diagram
Figure 3.1 Airline vs Price
Figure 3.2 Source vs Price
Figure 3.3 Destination vs Price
Figure 3.4 Departure Time Vs Fare
Figure 3.5 Correlation
Figure 3.6 Feature Importance
Figure 4.1 Parameters before RandomizedSearchCV
Figure 4.2 Parameters after RandomizedSearchCV
Figure 4.3 Original Vs Predicted
Figure 4.4 Entering Details
Figure 4.5 Output

ABBREVIATIONS

CSS Cascading Style Sheets

HTML Hypertext Markup Language
LR Linear Regression
MAE Mean Absolute Error
ML Machine Learning
MSE Mean Squared Error
RMSE Root Mean Squared Error
SVM Support Vector Machine
Chapter 1

Introduction

This chapter presents an overview of the project. Section 1.1 describes the context in which the project was developed. Section 1.2 introduces the objectives of the project. Next, section 1.3 presents the implementation workflow step by step. Finally, section 1.4 briefly introduces the results of the research carried out.

1.1 Context

In the present-day scenario, airline corporations frequently alter the cost of airline tickets to increase their revenues. This predictive system can save users millions of rupees by giving them an estimate of ticket prices so that they do not end up buying costly tickets. This project aims to create an application that uses machine learning to estimate the fares of various flights. It also suggests to the customer whether to "buy" or "wait". If the flight price is likely to decrease, waiting is recommended. When the lowest possible price is reached, it suggests buying.

Airline ticket prices for the same flight fluctuate drastically from day to day. Customers find it extremely difficult to purchase tickets at the lowest price because the price changes so often. The price of the cheapest ticket for a specific flight generally increases as the departure date approaches. Airlines do this to maximize revenue and to make sure customers understand that last-minute purchases are expensive.

1.2 Objectives

The objective is to develop a Machine Learning model and deploy it using Flask as a portal for predicting flight fares based on the user's arrival time, departure time, source, destination, number of stops and airline. The model is trained on a data set covering different conditions and predicts fares using Random Forest Regression. In this project, our main goals were to identify underlying patterns in travel costs in India using historical data and to recommend the most advantageous time to purchase a ticket.

The objectives are:

• To analyse every detail of every ticket and develop a Flight Fare Predictor using ExtraTreesRegressor, Random Forest Regression and hyperparameter tuning (RandomizedSearchCV).

• To deploy the model as a web application using Flask, HTML and CSS.

1.3 Implementation workflow

The workflow followed during the implementation is as follows:

Step 1: Data collection for Machine learning model

Step 2: Exploratory Data Analysis

Step 3: Required Pre-processing of the data.

Step 4: Feature Selection

Step 5: Applying ML algorithm

Step 6: Pickling Model in a File

Step 7: Build UI for web application using HTML and CSS

Step 10: Integrate Frontend and Backend

Step 11: Test, and document the final system.

Figure 1.1: Work Flow Diagram

1.4 Research Results

Training and test accuracy both increase slightly after the modifications. The most notable result is that the training time is drastically reduced after the modifications. Detailed results are discussed in Chapter 4.

Chapter 2

Literature review

This chapter reviews the literature behind the project. Section 2.1 gives an overview of the background. Section 2.2 discusses the key related research. Section 2.3 presents the hypothesis proposed on the basis of this research. Finally, section 2.4 formulates the problem addressed in this project.

2.1 Background

• Ideally, distance would be the only factor that determines a flight's price. Distance still plays a significant role, but it is not the only factor that dictates the pricing strategy.

• For airlines to adjust strategy and resources for a particular route, they must be able to accurately predict the trend in airfare at the level of a given market segment.

• We could help travellers reduce the cost of their journeys if we could tell them the best time to purchase their flight tickets based on historical data, and also show them various trends in the airline sector.

2.2 Key related research

The references used for the project are:

• Juhar Ahmed Abdella, Nazar Zaki, and Khaled Shuaib: Automatic detection of airline ticket price and demand: A review [1]: This paper reviews customer-side and airline-side prediction models. The review demonstrates that models on both sides rely on a small number of features, such as historical information on ticket prices, the date that tickets were purchased, and the departure date. Advanced machine learning methods combined with external factors like social media data and search engine queries are not taken into account.

• Anastasia Lantseva, Ksenia Mukhina, Anna Nikishova, Sergey Ivanov, and Konstantin Knyazkov: Data-driven Modeling of Airlines Pricing [2]: The goal of this study is to examine the Russian aviation industry and contrast how prices behave on domestic and international flights. An empirical data-driven model was created to predict air fares for various trip directions using information gathered from two independent ticket price aggregators (AviaSales and Sabre) for the spring-summer period of 2015.

2.3 Proposed hypothesis

• Collecting the data: To train a prediction model, we need a data set containing past records. For this model, we need historical prices, which give us an idea of how fares have behaved. Collecting the data manually is impractical, so a Python program is required to collect the data automatically at a specific time each day.

• Data cleaning and Preparation: Once we obtain the data, it may have missing values, and its format may not suit the model. So, we must clean and prepare it in accordance with the model’s requirements. This is a crucial step in building a model. For this we use a variety of statistical techniques provided by built-in Python packages.

• Building the model: Once the data is prepared, it is analysed to reveal hidden trends. Predictive models are then applied to the training set. The models used include Random Forest, Linear Regression, Decision Trees, Extra Trees Regressor and combinations of these models to improve the accuracy.

• Combining models and calculating accuracy: After developing a variety of models, we must test them on a testing set and determine whether each user request resulted in savings or losses. The accuracy of each model is then calculated by comparing the predicted values with the actual values.

• Deploying the model: Once the model is ready, a function is written to prepare the input data, and a web application is created using Flask, with a frontend designed using HTML, CSS and Bootstrap. With the machine learning model as the backend, the application can predict fares.

2.4 Problem formulation

Flight prices vary so much that the fare of a ticket today may differ from the price of the same ticket tomorrow. This makes the flight fare very difficult to predict.

To address this issue, I used a data set of ticket prices for different airlines and for routes between different places. The model is built on this data so that it can predict the price for a given set of input features.

Chapter 3

Methodology

This chapter describes the tools and methods used during implementation. It lays down a clear description of the process used during development and testing.

3.1 Tools

A variety of tools were used during development of the project. The major tools are listed below:

• Prediction system:

∗ Jupyter Notebook: Jupyter Notebook allows users to compile all aspects of a data project in one place, making it easier to show the entire process of a project to the intended audience. Through the web-based application, users can create data visualizations and other components of a project and share them with others.

∗ Other libraries: NumPy, Matplotlib, seaborn, pandas.

• Web interface:

∗ HTML: HTML (HyperText Markup Language) is the markup that organises a web page’s content. Content may be organised using paragraphs, bulleted lists, graphics, and data tables, among other options.

∗ CSS: CSS stands for Cascading Style Sheets. It is a style sheet language used to lay out and style web pages.

∗ Flask: Flask is a framework for developing web applications in Python. It provides a built-in development server and a fast debugger.

3.2 Workflow

The workflow followed during the implementation is as follows:

Step 1: Data collection for Machine learning model

Step 2: Exploratory Data Analysis

Step 3: Required Pre-processing of the data.

Step 4: Feature Selection

Step 5: Applying ML algorithm

Step 6: Pickling Model in a File

Step 7: Build UI for web application using HTML and CSS

Step 10: Integrate Frontend and Backend

Step 11: Test, and document the final system.

3.3 Mechanism/Algorithm

In this project, machine learning models are first applied to build a predictive model, and then a web application is built that takes the input and returns the predicted value as output.

3.3.1 Data Collection

The dataset was collected from Kaggle and is saved as a CSV file. It has a total of almost ten thousand data points. Every data point contains: Date of Journey, Airline, Source, Destination, Departure Time, Arrival Time, Route, Total number of Stops, Additional information, and Price.
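As a minimal sketch of this step, the CSV data can be loaded and inspected with pandas. The column names and the two sample rows below are hypothetical stand-ins for the Kaggle file, which is not reproduced in this report.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the Kaggle CSV file.
# Column names and values are illustrative assumptions, not the real data.
SAMPLE_CSV = """Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
IndiGo,24/03/2019,Bangalore,Delhi,BLR-DEL,22:20,01:10,2h 50m,non-stop,No info,3897
Air India,01/05/2019,Kolkata,Bangalore,CCU-BBI-BLR,05:50,13:15,7h 25m,1 stop,No info,7662
"""

# For a file on disk, pass its path to pd.read_csv instead of the StringIO object.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)               # number of rows and columns
print(df.dtypes)              # most columns are read as strings (object)
missing_per_column = df.isnull().sum()
print(missing_per_column)     # count of null values in each column
```

Inspecting the data types at this point shows which columns are still strings and therefore need the pre-processing described in the next section.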

3.3.2 Data Cleaning and Preparing

Cleaning the data involves removing the rows with null values. There are very few of these in our dataset, so removing them had no impact on the results. During pre-processing we found that the vast majority of the data was in string format, which is not suitable for a machine learning model. Numeric values are therefore extracted from each feature, including the day and month from the journey date in integer format and the hours and minutes from the departure time. Categorical variables are transformed into model-recognizable values using one-hot encoding and label encoding. The data types were modified, and certain attributes had to be split to make them more useful.

One Hot Encoding: The data set contains categorical data that cannot be used in the model directly. One Hot Encoding generates one new binary feature for each distinct value present in a categorical feature. Each distinct value thus becomes a feature of its own in the data set. In other words, it creates dummy variables.
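The cleaning and encoding steps described above might look like the following sketch. The tiny DataFrame, its column names, and the choice of which column receives which encoding are assumptions made for illustration.

```python
import pandas as pd

# Small illustrative frame; the values are made up.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo", None],
    "Source": ["Bangalore", "Kolkata", "Delhi", "Kolkata", "Delhi"],
    "Price": [3900, 7700, 13900, 6200, 4100],
})

# Remove rows that contain null values.
clean = df.dropna().reset_index(drop=True)

# One-hot encoding: one 0/1 dummy column per distinct airline.
airline_dummies = pd.get_dummies(clean["Airline"], prefix="Airline", dtype=int)

# Label encoding: each distinct source city is mapped to an integer code
# (codes follow the alphabetical order of the categories).
clean["Source_code"] = clean["Source"].astype("category").cat.codes

# Replace the original string columns with their encoded versions.
encoded = pd.concat(
    [clean.drop(columns=["Airline", "Source"]), airline_dummies], axis=1
)
print(encoded)
```

Each row has exactly one dummy column set to 1, which is the defining property of a one-hot encoding.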

3.3.3 Exploratory Data Analysis/ Data Visualization

Date of Journey is split into day, month and year, and each part becomes a feature. Similarly, departure time and arrival time are each divided into hours and minutes, making each one a feature. These changes are made within the data set, while the categorical data is converted separately.

The dataset has a column named Duration, which stores hours and minutes together. This feature is preprocessed and split into two, Duration hours and Duration minutes, which store only the hours and the minutes respectively.
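The splitting of dates, times and durations can be sketched with plain Python helpers. The input formats assumed here (day/month/year dates, 24-hour times, and durations written like "2h 50m") are assumptions about the raw strings, and the function names are hypothetical.

```python
import re
from datetime import datetime


def split_date(date_text):
    """Return (day, month, year) from a string such as '24/03/2019'."""
    parsed = datetime.strptime(date_text, "%d/%m/%Y")
    return parsed.day, parsed.month, parsed.year


def split_time(time_text):
    """Return (hour, minute) from a 24-hour string such as '22:20'."""
    parsed = datetime.strptime(time_text, "%H:%M")
    return parsed.hour, parsed.minute


def split_duration(duration_text):
    """Return (hours, minutes) from strings such as '2h 50m', '19h' or '5m'."""
    hours = re.search(r"(\d+)\s*h", duration_text)
    minutes = re.search(r"(\d+)\s*m", duration_text)
    return (
        int(hours.group(1)) if hours else 0,
        int(minutes.group(1)) if minutes else 0,
    )


print(split_date("24/03/2019"))
print(split_time("22:20"))
print(split_duration("2h 50m"))
```

Handling a missing hour or minute part matters because a duration may be written with only one of the two components.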

Figure 3.1 shows a box plot of fare against airline, that is, the fares charged by different airlines. The figure makes it clear that Jet Airways charges more than the other airlines. Trujet is the cheapest of all.

Figure 3.1: Airline vs Price

Similarly, Figure 3.2 shows a box plot of price against source, illustrating the impact of the source city on the price of the flights departing from there. According to the graph, Bangalore is the only source that offers flights in every price range. Delhi has more expensive tickets than the other sources.

Figure 3.2: Source vs Price

Likewise, Figure 3.3 shows a box plot of price against destination, illustrating the impact of the destination city on the price of the flights arriving there. From the graph, we can see that Delhi is the only destination that offers flights in every price range. Cochin has more expensive tickets than the other destinations, followed by Bangalore.

Figure 3.3: Destination vs Price

Total stops are stored as strings, which are converted into integers using Python functions. About 90 percent of the Additional info values are "no info", so this feature carries little information. The route and the number of stops describe the same thing, so the route feature is removed. The plot of total stops against price shows that the price increases as the number of stops increases.
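A small sketch of the stop conversion follows. It assumes the raw values look like "non-stop", "1 stop" and "2 stops"; the helper name and the sample rows are hypothetical.

```python
def stops_to_int(text):
    """Convert 'non-stop' to 0 and strings like '2 stops' to the integer 2."""
    cleaned = text.strip().lower()
    if cleaned == "non-stop":
        return 0
    # The count is the first whitespace-separated token, e.g. '2' in '2 stops'.
    return int(cleaned.split()[0])


# Illustrative rows; the values are made up.
rows = [
    {"Route": "BLR-DEL", "Total_Stops": "non-stop", "Price": 3900},
    {"Route": "CCU-BBI-BLR", "Total_Stops": "1 stop", "Price": 6200},
    {"Route": "DEL-LKO-BOM-COK", "Total_Stops": "2 stops", "Price": 13900},
]

# Keep the numeric stop count and drop Route, which repeats the same information.
prepared = [
    {"Total_Stops": stops_to_int(row["Total_Stops"]), "Price": row["Price"]}
    for row in rows
]
print(prepared)
```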

The relationship between date of journey and price varies from one route to another. Here I took the Mumbai-Delhi route as an example. The time of day at which we book our tickets also matters, that is, whether we book in the morning, afternoon or evening. The number of days before departure on which we book matters as well. The price does not necessarily increase over time: if there is a holiday period now and none later, prices may decrease.

Figure 3.4: Departure Time Vs Fare

After the categorical data has been analysed separately, the encoded columns need to be concatenated with the rest of the data set. The correlation between the features is shown in Figure 3.5, where negative numbers indicate that two features are negatively correlated, that is, when one increases the other tends to decrease. Positive numbers indicate positive correlation.
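A correlation matrix of this kind can be computed with pandas, as in the sketch below. The numbers are made up for illustration and the column names are assumptions.

```python
import pandas as pd

# Illustrative numeric features; values are invented for the example.
df = pd.DataFrame({
    "Total_Stops": [0, 1, 1, 2, 3],
    "Duration_hours": [2, 5, 7, 12, 20],
    "Journey_day": [27, 1, 15, 9, 21],
    "Price": [3900, 6200, 7700, 10500, 13900],
})

# Pearson correlation between every pair of columns (values lie in [-1, 1]).
corr = df.corr()
print(corr.round(2))

# With seaborn installed, the matrix could be drawn as a heatmap, e.g.
# seaborn.heatmap(corr, annot=True), to obtain a figure like Figure 3.5.
```

A value close to +1 means two features rise together, a value close to -1 means one falls as the other rises, and a value near 0 indicates little linear relationship.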

Figure 3.5: Correlation

Extra Trees Regressor is a class used to increase accuracy and to control overfitting. It is also a common method for finding the most important features. From Figure 3.6, we learned that Total-stops is the most important feature in our model's prediction.
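The feature ranking can be sketched with scikit-learn's ExtraTreesRegressor and its feature_importances_ attribute. The data here is synthetic and deliberately constructed so that the stop count dominates the price; it is not the project's data, and the feature names are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
n = 300

# Synthetic features (assumed names) and a price driven mostly by the stop count.
total_stops = rng.integers(0, 4, size=n)
dep_hour = rng.integers(0, 24, size=n)
journey_day = rng.integers(1, 31, size=n)
price = 3000 + 2500 * total_stops + 20 * dep_hour + rng.normal(0, 300, size=n)

X = np.column_stack([total_stops, dep_hour, journey_day])
feature_names = ["Total_Stops", "Dep_hour", "Journey_day"]

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, price)

# Importances sum to 1; a larger value means the feature contributed more to the splits.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```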

Figure 3.6: Feature Importance

3.3.4 Regression Models

Several regression models were applied and then evaluated using R-squared, MAE, MSE and RMSE.

Linear Regression

This is a supervised learning technique. The model assumes that the input and output are linearly related. In the flight price prediction data set, the price may depend on many independent variables, so multiple linear regression is used.
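For reference, the general form of a multiple linear regression model can be written as follows. The symbols are introduced here for illustration and do not appear elsewhere in the report: the predicted fare is denoted by y-hat, the p input features by x_1 to x_p, and the learned coefficients by beta_0 to beta_p.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```

The coefficients are typically chosen to minimise the sum of squared differences between the predicted and actual prices on the training data.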

Decision Tree

This is a tree structure that is used in both regression and classification models. Independent variables from the dataset are selected as decision nodes, and the resulting tree is used to make decisions. A decision tree has two kinds of nodes: internal (decision) nodes, which test a feature of the dataset, and leaf nodes, which hold the predicted result. The branches represent the decision rules.

Random Forest Regression

Random Forest is a machine learning algorithm that draws random samples from the dataset and builds a decision tree for each sample. For regression, the final prediction is the average of the predictions of all the trees. The accuracy obtained from this model is 80.8 percent, which is the best of all the models.
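The way a random forest combines its trees for regression can be illustrated with scikit-learn: the forest's prediction is the mean of the predictions of its individual trees. The sketch below uses synthetic data and arbitrary settings, not the project's data or parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data: three numeric features and a noisy, price-like target.
X = rng.uniform(0, 10, size=(200, 3))
y = 4000 + 900 * X[:, 0] - 150 * X[:, 1] + rng.normal(0, 200, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=1)
forest.fit(X, y)

# Predictions of each individual tree for the first five rows.
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own prediction.
averaged = per_tree.mean(axis=0)
forest_pred = forest.predict(X[:5])
print(np.allclose(averaged, forest_pred))
```

Averaging many trees trained on different random samples reduces the variance of a single decision tree, which is one reason the forest tends to generalise better.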

Chapter 4

Experiments and results

This section discusses the experiment conducted and the corresponding results obtained.

Note: These results may vary from machine to machine.

The accuracy scores of these models on the test data are:

Linear Regression: 0.602

Decision Tree Regression : 0.644

Random Forest Regression : 0.811
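A comparison of this kind can be sketched as follows. It assumes the accuracy score is the R-squared value returned by the score method of scikit-learn regressors, and it uses synthetic data with an assumed 80/20 train-test split, so the numbers it prints will not match the scores reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 500

# Synthetic features (assumed names) and a noisy price with a mild non-linearity.
stops = rng.integers(0, 4, size=n)
duration = rng.uniform(1, 24, size=n)
dep_hour = rng.integers(0, 24, size=n)
price = (
    3000 + 2000 * stops + 100 * duration
    + 800 * (dep_hour < 6) * stops
    + rng.normal(0, 400, size=n)
)
X = np.column_stack([stops, duration, dep_hour])

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=42),
    "Random Forest Regression": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R-squared on the test set
    print(f"{name}: {scores[name]:.3f}")
```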

4.0.1 Model Training and Hypertuning

The data is split into training and test sets: the training data is used to fit the model and the test data is used to measure its accuracy. The models should, however, be tuned using RandomizedSearchCV before the final model is fitted.

RandomizedSearchCV

This is a technique in which randomly sampled hyperparameter combinations are compared to find the best configuration for the model. Evaluating many combinations takes considerable time and memory.
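A randomized hyperparameter search can be sketched with scikit-learn's RandomizedSearchCV as below. The search space, the number of sampled combinations, the scoring metric and the synthetic data are all assumptions chosen to keep the example small; they are not the settings shown in Figures 4.1 and 4.2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the prepared flight features.
X = rng.uniform(0, 10, size=(200, 4))
y = 3000 + 1200 * X[:, 0] + 300 * X[:, 1] + rng.normal(0, 250, size=200)

# Candidate values for each hyperparameter (an illustrative search space).
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", 1.0],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                           # number of random combinations to try
    scoring="neg_mean_squared_error",   # higher (closer to 0) is better
    cv=3,                               # 3-fold cross-validation per combination
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print(search.best_score_)
```

Only n_iter combinations are evaluated, each with cv fits, so the cost grows with both values. This is why a randomized search is usually cheaper than an exhaustive grid search over the same space.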

MAE (Mean Absolute Error)

It is the average magnitude of the difference between the predicted value and the actual value over all observations.

MSE (Mean Squared Error)

It is the average of the squared errors, and it is the loss function used in least squares regression.

RMSE (Root Mean Squared Error)

It is the square root of the MSE (Mean Squared Error).
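These metrics, together with the R-squared score mentioned earlier, can be computed directly from their definitions. The sketch below uses four made-up fares; the function names are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all observations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(actual, predicted))


def r_squared(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up fares for illustration.
actual = [4000, 6000, 8000, 10000]
predicted = [4500, 5500, 8000, 11000]

print(mean_absolute_error(actual, predicted))      # 500.0
print(mean_squared_error(actual, predicted))       # 375000.0
print(root_mean_squared_error(actual, predicted))  # about 612.37
print(r_squared(actual, predicted))                # about 0.925
```

Because RMSE is in the same unit as the fare, it is often easier to interpret than MSE, while MAE is less sensitive to a few large errors.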

Figure 4.1: Parameters before RandomizedSearchCV

Figure 4.2: Parameters after RandomizedSearchCV

4.0.2 Deploying the ML model with Flask

The backend is created using the Flask framework, in which endpoints that handle GET and POST requests are built to receive the input data and return the output for display in the frontend.

The frontend is built using HTML, CSS and Bootstrap, and lets the user choose the flight details they want. These details are sent to the ML model in the backend, where the flight price is predicted, and the output is sent back to the frontend and displayed. We have deployed the ML model through Flask, which lets us run the model on localhost. The server can be further deployed to Heroku so that it can be accessed at any time instead of being limited to the local system.
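A minimal sketch of such a backend is shown below. It assumes the trained model was pickled earlier. To stay self-contained, a simple stand-in model is pickled and unpickled in memory, whereas the real application would load the Random Forest model from a file. The route names, form field names and JSON response are hypothetical.

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.dummy import DummyRegressor

# Stand-in for the trained model: it always predicts the mean training fare.
# In the real application the pickled model would be read from a file with
# pickle.load(open(<model file>, "rb")) instead.
stand_in = DummyRegressor(strategy="mean").fit([[0, 0], [1, 12]], [4000.0, 6000.0])
model = pickle.loads(pickle.dumps(stand_in))

app = Flask(__name__)


@app.route("/", methods=["GET"])
def home():
    # The real frontend would render an HTML/CSS/Bootstrap form here.
    return (
        "<form method='post' action='/predict'>"
        "<input name='total_stops'><input name='dep_hour'>"
        "<button type='submit'>Predict</button></form>"
    )


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted form fields and convert them to model inputs.
    total_stops = int(request.form["total_stops"])
    dep_hour = int(request.form["dep_hour"])
    fare = float(model.predict([[total_stops, dep_hour]])[0])
    return jsonify({"predicted_fare": round(fare, 2)})


# To serve the application on localhost, call app.run(debug=True).
```

The GET endpoint serves the input form and the POST endpoint returns the prediction, which mirrors the request flow described above.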

4.1 Conclusion

Figure 4.3: Original Vs Predicted

Figure 4.4: Entering Details

Figure 4.5: Output

The Random Forest Regressor gave better accuracy than the other algorithms, so it is comparatively the best ML model. Figure 4.3 is a line graph in which a few values are taken at random from the test data to compare the original and predicted ticket fares. In Figure 4.4, we can select the flight details to check the fare. On submitting the details, the backend receives them from the frontend, the price is predicted, and the final value is shown as in Figure 4.5.
Bibliography

[1] Juhar Ahmed Abdella, Nazar Zaki, and Khaled Shuaib. Automatic detection of airline ticket price and demand: A review. In 2018 International Conference on Innovations in Information Technology (IIT), volume 33, pages 169–174, 2018.

[2] Anastasia Lantseva, Ksenia Mukhina, Anna Nikishova, Sergey Ivanov, and Konstantin Knyazkov. Data-driven modeling of airlines pricing. Procedia Computer Science, volume 66, pages 267–276, 2015.

[3] W. Groves and M. Gini. An agent for optimizing airline ticket purchasing. Pages 1341–1342, 2013. Accessed: 2022-07-30.

[4] Supriya Rajankar, Neha Sakhrakar, and Omprakash Rajankar. Flight fare prediction using machine learning algorithms. International Journal of Engineering Research and Technology (IJERT), June 2019. Accessed: 2022-07-30.

[5] K. Tziridis, Th. Kalampokas, G. A. Papakostas, and K. I. Diamantaras. Airfare prices prediction using machine learning techniques. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1036–1039, 2017.