Implementation of Flight Fare Prediction System Using Machine Learning
https://doi.org/10.22214/ijraset.2022.43230
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue V May 2022- Available at www.ijraset.com
Abstract: Flight ticket prices increase or decrease frequently depending on factors such as the timing of the flight, the
destination and the flight duration. In the proposed system, a predictive model is created by applying machine learning
algorithms to collected historical flight data. Choosing the optimal time to purchase an airline ticket is challenging from the
consumer's perspective, principally because buyers have insufficient information for reasoning about future price movements.
This project mainly aims to uncover underlying trends in flight prices in India using historical data and to suggest
the best time to buy a flight ticket. The project validates or refutes common myths about the airline
industry and compares various models for predicting the optimal time to buy a flight ticket and the amount that
can be saved by doing so. Remarkably, price trends are highly sensitive to the route, month of departure, day of
departure, time of departure, whether the day of departure is a holiday, and the airline carrier. Highly competitive routes such as most
business routes (tier 1 to tier 1 cities like Mumbai-Delhi) showed a non-decreasing trend in which prices increased as the days to departure
decreased, whereas other routes (tier 1 to tier 2 cities like Delhi-Guwahati) had a specific time frame in which prices were at their
minimum. Moreover, the data uncovered two basic categories of airline carriers operating in India, the economical group
and the luxurious group; in most cases, the minimum-priced flight belonged to the economical group. The data also
confirmed that there are certain periods of the day when prices are expected to be at their maximum. The scope of the
project can be extended across various routes to achieve significant savings on flight ticket purchases across
the Indian domestic airline market.
Keywords: Flight ticket, Optimal timing, historical data, competitive routes, Indian Domestic Airline market.
I. INTRODUCTION
The usual flight ticket buying strategy is to purchase a ticket many days before take-off so as to avoid the
highest fares. In practice, airlines do not follow this pattern. Airlines may reduce fares when they
need to stimulate demand and raise them when fewer tickets are available. The fare therefore depends
on several factors. To predict fares, this project uses machine learning to model how flight ticket prices behave over time. Every
airline has the right and freedom to change its ticket prices at any time. Travellers can save money by booking a
ticket at the lowest price. People who travel by air frequently are aware of these price fluctuations. Airlines use complex
revenue management policies to implement distinct pricing strategies. As a result, the pricing system changes the
fare depending on the time, season and festive days. The ultimate aim of the
airlines is to earn profit, whereas the customer searches for the minimum fare. Customers usually try to buy the ticket well in
advance of the departure date to avoid a hike in airfare as the date comes closer. In reality, this does not always work: the customer may
end up paying more than they should for the same seat.
II. MOTIVATION
The motivation is to help people who tend to pay more than necessary for flight tickets and those who are new to the ticket booking
process. The project also gives us more exposure to machine learning techniques, which will help us improve our
existing skills.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 3814
2) William Groves and Maria Gini "An agent for optimizing airline ticket purchasing" in proceedings of the 2013 international
conference on autonomous agents and multi-agent systems.
In case study [2], William Groves and Maria Gini introduce an agent that is able to optimize purchase timing on behalf of customers. A partial
least squares (PLS) regression technique is used to build the model. Their approach proceeds through several stages:
feature extraction, lagged feature computation, regression model construction and optimal model selection. Their experiments
were designed to estimate the real-world costs of using their prediction models. The lag scheme approach works well for many choices of
machine learning algorithm, but PLS regression was found to work best for this domain. The improved performance can be
attributed to its natural resistance to collinear and irrelevant variables.
3) J. Santos Dominguez-Menchero, Javier Rivera and Emilio Torres Manzanera "Optimal purchase timing in the airline market".
In this paper, the researchers study the general pattern of airline pricing behaviour and present a methodology for analysing
different routes and/or carriers. Their purpose is to provide customers with the relevant information they need to decide the best time
to purchase a ticket, striking a balance between the desire to save money and any time constraints the buyer may have. Their study
shows that non-parametric isotonic regression techniques, as opposed to standard parametric techniques, are particularly useful.
Most importantly, these techniques make it possible to determine how long consumers may delay their purchase without a significant price increase,
to specify the economic loss for each day the purchase is delayed, and to detect when it is better to wait until the last day to make a
purchase.
4) Supriya Rajankar, Neha Sakhrakar and Omprakash Rajankar “Flight fare prediction using machine learning algorithms”
International Journal of Engineering Research and Technology (IJERT) June 2019.
This survey by Supriya Rajankar et al. on flight fare prediction using machine learning algorithms uses a small dataset consisting of
flights between Delhi and Bombay. Several algorithms were implemented to predict flight ticket prices and their outcomes were
compared: Support Vector Machine (SVM), Linear Regression, K-Nearest Neighbours (KNN),
Decision Tree, Multilayer Perceptron, Gradient Boosting and Random Forest. The models were implemented using the Python library
scikit-learn. R-squared, MAE and MSE were used to assess the performance of these
models. The Decision Tree algorithm gave the best results.
5) Tianyi Wang, Samira Pouyanfar, Haiman Tian and Yudong Tao "A Framework for airline price prediction: A machine learning
approach"
In this paper, Tianyi Wang, Samira Pouyanfar, Haiman Tian and Yudong Tao [5] propose a framework in which two databases are
combined with macroeconomic data, and machine learning algorithms such as support vector machines and XGBoost are used to
model the average ticket price for each source and destination pair. The framework achieves a high prediction accuracy, with an
adjusted R-squared of 0.869. The lowest error rate they report, 0.92, was obtained with the XGBoost algorithm.
6) T. Janssen "A linear quantile mixed regression model for prediction of airline ticket prices"
In this paper, the author predicts the best time to purchase tickets. Various machine learning algorithms are used, such as
linear regression, Decision Tree, Random Forest, K-Nearest Neighbour, Multilayer Perceptron (MLP), gradient boosting and support
vector machine (SVM). For predictors, Naïve Bayes and a Stacked Prediction Model are used. In the research, the desired model is
implemented using the Linear Quantile Mixed Regression methodology for the San Francisco–New York route, where daily
airfares are provided by an online website. Two features, the number of days until departure and whether the departure falls on a weekend or a
weekday, are considered in developing the model.
7) Wohlfarth, T. Clemencon, S. Roueff “A Data mining approach to travel price forecasting” 10th international conference on
machine learning Honolulu 2011.
In research paper [7] on flight fare prediction, Wohlfarth, T. Clemencon and S. Roueff address the practice of yield
management in the air travel industry using various data mining techniques. The goal of the paper is to consider the
design of decision-making tools in the context of varying travel prices from the customer’s perspective. The machine learning
techniques mentioned in the research include clustering.
8) Vinod Kimbhaune, Harshil Donga, Ashutosh Trivedi, Sonam Mahajan and Viraj Mahajan research paper on flight fare
prediction system.
In research paper [8] on a flight fare prediction system, Vinod Kimbhaune, Harshil Donga, Ashutosh Trivedi, Sonam Mahajan
and Viraj Mahajan apply several machine learning algorithms, namely Random Forest, Decision Tree and Linear
Regression, to a dataset in order to determine the ideal purchase time for a flight ticket. Their project aims to develop an application
that predicts flight prices for various flights using a machine learning model. The performance metrics used are MAE, MSE and RMSE. The
outcome of their project was not fully accurate, but the authors note that adding more real-time data would give more accurate results.
9) W. Groves and M. Gini, “An agent for optimizing airline ticket purchasing,” 12th International Conference on Autonomous
Agents and Multiagent Systems (AAMAS 2013), St. Paul, MN, May 06 - 10, 2013, pp. 1341-1342.
This extended version of research paper [3] uses Partial Least Squares Regression (PLSR) to build a model.
The data were gathered from major travel booking sites from 22 February 2011 to 23 June 2011. Additional data
were also gathered and used to check the performance of the final model.
V. DIFFERENT APPROACHES
There are various approaches for implementing the project; below are some of the approaches used by the authors in the literature survey:
A. Linear Regression
Regression is a method of modelling a target value based on independent predictors. Regression techniques differ mainly in the number of
independent variables and the type of relationship between the independent and dependent variables.
Simple linear regression is a type of regression analysis in which there is one independent variable and the relationship between the dependent
and independent variables is linear. The key concepts for understanding linear regression are the cost function and gradient
descent.
$\hat{y} = b_0 + b_1 x$
Here $\hat{y}$ is the predicted value, $b_0$ the intercept and $b_1$ the slope.
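The role of the cost function and gradient descent can be illustrated with a minimal sketch. It assumes a mean squared error cost and uses hypothetical toy data (number of stops against fare in thousands); the function name `fit_line`, the learning rate and the epoch count are illustrative choices and are not taken from the paper.

```python
def fit_line(xs, ys, lr=0.05, epochs=5000):
    """Fit y = b0 + b1 * x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error of the current line on every training point.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Gradients of the MSE cost with respect to b0 and b1.
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient.
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


# Hypothetical toy data generated from fare = 3 + 2 * stops (in thousands).
stops = [0.0, 1.0, 2.0, 3.0]
fares = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(stops, fares)
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```

On real fare data the features would normally be scaled first, since the learning rate that converges depends on the feature range.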
B. Gradient Boosting
Gradient boosting builds an additive regression model by sequentially fitting a simple function to the current “pseudo”-residuals by least squares at each
iteration. The scikit-learn implementation uses a decision tree as the base estimator. The number of boosting stages is varied
from 10 to 1000 in steps of 10. The loss function is an important parameter in gradient boosting. The available
options are least squares regression, least absolute deviation and quantile regression.
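The idea of fitting a simple function to the current pseudo-residuals can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the scikit-learn implementation: it uses the least-squares loss, a single feature, one-split regression stumps as the simple function, and hypothetical toy data (days left before departure against fare in thousands).

```python
def fit_stump(xs, targets):
    """Least-squares stump: returns (threshold, left_mean, right_mean).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def gradient_boost(xs, ys, n_stages=200, learning_rate=0.1):
    """Squared-error gradient boosting on a single feature."""
    base = sum(ys) / len(ys)  # stage 0: constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_stages):
        # For squared error, the pseudo-residuals are y minus the current prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data: days left before departure vs. fare (thousands).
days_left = [1, 5, 10, 20, 30, 45]
fares = [9.0, 7.5, 6.0, 5.0, 4.5, 4.8]
model = gradient_boost(days_left, fares)
mse_base = mse(fares, [sum(fares) / len(fares)] * len(fares))
mse_boost = mse(fares, [boost_predict(model, x) for x in days_left])
print(mse_base, mse_boost)
```

The number of stages and the learning rate trade off against each other; scanning the stage count over a range, as described above, is one way to choose it.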
C. K- Nearest Neighbours
In K-nearest neighbours regression, the predicted output is the average of the values of the k nearest neighbours. It is a non-parametric method, like
SVM. Several values of k are evaluated and the one giving the best performance is selected.
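A minimal sketch of this averaging step is given below. It assumes Euclidean distance and hypothetical toy rows of (duration in hours, number of stops); the function name `knn_predict` is illustrative and is not taken from the paper.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict the target for `query` as the mean target of its k nearest rows."""
    # Pair every training row's distance to the query with its target value.
    distances = sorted(
        (math.dist(row, query), y) for row, y in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical toy data: (duration in hours, stops) -> fare in thousands.
train_X = [(2.0, 0), (2.5, 0), (5.0, 1), (6.0, 1), (9.0, 2)]
train_y = [3.0, 3.4, 5.0, 5.6, 8.0]
pred = knn_predict(train_X, train_y, (2.2, 0), k=2)
print(pred)
```

Because the prediction depends on raw distances, features on different scales would usually be standardized before applying the method.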
D. Multi-Layer Perceptron
A multi-layer perceptron is a class of feedforward artificial neural network. It consists of an input layer, an output layer and one or more hidden layers.
The number of hidden layers gives the depth of the neural network. The setup here uses 1 hidden layer, with the number of neurons ranging from 100 to
2000 at different intervals depending on the required condition. Each neuron applies an activation function to the weighted sum of its
inputs to determine its output. The logistic
sigmoid function is used as the activation function.
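The forward pass of such a network, with one hidden layer and the logistic sigmoid activation, can be sketched as follows. The weights, the two-neuron hidden layer and the input values are hypothetical and chosen only to keep the example small; training (for example by backpropagation) is not shown.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, W1, b1, w2, b2):
    """One hidden layer with sigmoid units, followed by a linear output for regression."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output.
W1 = [[0.4, -0.2], [0.1, 0.3]]
b1 = [0.0, -0.1]
w2 = [2.0, 4.0]
b2 = 1.0
output = mlp_forward([0.5, 1.0], W1, b1, w2, b2)
print(output)
```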
A. Random Forest
Random forest is a supervised learning algorithm. Its benefit is that it can be used for both classification and
regression problems, which make up the majority of current machine learning systems. A random forest builds numerous decision trees and
combines them to obtain a more accurate and stable prediction. Random forest has nearly the same hyperparameters as
a decision tree or a bagging classifier. Compared with other algorithms, it is also very easy to measure the importance of each feature
for the prediction. The common element in these methods is that, for the kth tree, a random vector theta_k is
generated, independent of the past random vectors theta_1, ..., theta_(k-1) but with the same distribution, and a tree is
grown using the training set and theta_k, resulting in a classifier whose input is a vector x. For instance, in bagging the
random vector is generated as the counts in N boxes, where N is the number of examples in the training set. In
random split selection, theta consists of a number of independent random integers between 1 and K. The nature and dimensionality of theta
depend on its use in the construction of the tree. After a large number of trees has been generated, they vote for the most popular class. These
procedures are called random forests.
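The two sources of randomness described above, a bootstrap sample per tree and a random choice of split variable, can be sketched in plain Python. This is a simplified illustration under stated assumptions: each "tree" is a one-split regression stump, predictions are averaged (the regression counterpart of voting for the most popular class), and the data and function names are hypothetical.

```python
import random


def fit_stump(rows, ys, feature):
    """Best least-squares split on one feature; a constant if no split exists."""
    mean = sum(ys) / len(ys)
    best = (float("inf"), None, mean, mean)
    for t in sorted(set(r[feature] for r in rows))[:-1]:
        left = [y for r, y in zip(rows, ys) if r[feature] <= t]
        right = [y for r, y in zip(rows, ys) if r[feature] > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, t, lm, rm)
    return feature, best[1], best[2], best[3]


def fit_forest(rows, ys, n_trees=50, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        # theta_k, part 1: bootstrap sample (N draws with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        # theta_k, part 2: randomly chosen split variable.
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx], [ys[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    preds = [
        lm if t is None or row[feature] <= t else rm
        for feature, t, lm, rm in forest
    ]
    return sum(preds) / len(preds)  # average over trees for regression


# Hypothetical toy data: (duration in hours, stops) -> fare in thousands.
rows = [(2.0, 0), (2.5, 0), (5.0, 1), (6.0, 1), (9.0, 2), (10.0, 2)]
ys = [3.0, 3.4, 5.0, 5.6, 8.0, 8.4]
forest = fit_forest(rows, ys)
pred = forest_predict(forest, (5.5, 1))
print(pred)
```

A full random forest grows deep trees and samples a subset of features at every split; the stump version only keeps the structure of the procedure visible.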
B. XGBoost
XGBoost is an implementation of gradient-boosted decision trees. In this algorithm, decision trees are created sequentially:
each new tree is trained to correct the errors made by the trees before it, so observations that were predicted poorly
receive more emphasis in the next tree. The individual predictors are then combined into an ensemble that gives a stronger and more
precise model. It can be applied to
regression, classification, ranking and user-defined prediction problems.
C. Performance Metrics
Performance metrics are statistical measures used to compare the accuracy of the machine learning models trained by
different algorithms. The sklearn.metrics module is used to implement the functions that measure the errors of each model
using regression metrics. The following metrics are used to check the error of each model.
G. R2 (Coefficient of Determination)
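R2 measures how much of the variance in the dependent variable is explained by the independent variables in your model.
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
Here $y_i$ is the actual value, $\hat{y}_i$ the predicted value and $\bar{y}$ the mean of the actual values. The value of R-square usually lies
between 0 and 1, although it can be negative when a model fits worse than simply predicting the mean. The closer its value is to one, the
better the model is in comparison with other models.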
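As a worked illustration, the sketch below computes R2 in plain Python as one minus the ratio of the residual sum of squares to the total sum of squares, together with MAE, MSE and RMSE, which are named elsewhere in this paper. The function name `regression_metrics` and the toy values are hypothetical; in practice the corresponding functions in sklearn.metrics would be used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical actual and predicted fares (thousands).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Predicting the mean for every sample gives an R2 of 0 under this definition, which is why values below 0 indicate a model worse than that baseline.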
IX. IMPLEMENTATION
We followed the steps below in our project to reach our ultimate goal of predicting flight fares:
1) Importing Necessary Libraries
We import Python libraries such as pandas, matplotlib, seaborn and NumPy for reading and visualizing the dataset.
7) Feature Selection
In this step, we find the features that contribute most to our target variable.
X = “Independent Feature”
Y = “Dependent Feature” i.e., “Price” column.
We place all the independent features (everything except price) in the X variable and the price in the Y variable, using the loc and iloc
methods.
We then use “ExtraTreesRegressor” to find the most important features in the data: we fit the selection variable to
the X and Y features and print its “feature_importances_” attribute to see which features matter most.
We find that “Total_stops” is the most important feature.
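The feature selection step can be sketched as follows. The data here are synthetic and generated only for illustration (the fare is constructed so that the number of stops dominates), so the printed importances say nothing about the real dataset; the column names other than “Total_stops” are hypothetical. In the actual pipeline, X and Y would come from the pandas DataFrame via loc and iloc rather than from NumPy arrays.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for the flight dataset (assumption, not the paper's data).
rng = np.random.default_rng(0)
n = 300
total_stops = rng.integers(0, 4, size=n)
duration = rng.uniform(1.0, 10.0, size=n)
unrelated = rng.normal(size=n)

feature_names = ["Total_stops", "Duration_hours", "Unrelated"]
X = np.column_stack([total_stops, duration, unrelated])  # independent features
y = 3000 + 2500 * total_stops + 100 * duration + rng.normal(scale=200, size=n)  # price

# Fit the extra-trees model and read off the impurity-based importances.
selection = ExtraTreesRegressor(n_estimators=100, random_state=0)
selection.fit(X, y)
importances = dict(zip(feature_names, selection.feature_importances_))

for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

The importances are normalized to sum to one, so they are best read as relative rankings rather than absolute effect sizes.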
As we can see, the scores of Random Forest before hyperparameter tuning and of XGBoost are the same. The scores of Random Forest change only slightly
after hyperparameter tuning.
Reference paper [1] applied various machine learning techniques and obtained
its best results with the Bagging Regression Tree method, with an accuracy rate of 87.42. A comparison with the Random Forest model
of reference paper [1] is given below:
Reference paper [15] applied various machine learning techniques and obtained
its best results with the Trend Based Model method, with an accuracy rate of 81.8. A comparison with the Random Forest model of
reference paper [15] is given below:
XI. CONCLUSIONS
Machine learning algorithms are applied to the dataset to predict the dynamic fare of flights. This gives predicted
flight fares that help in getting a flight ticket at minimum cost. The R-squared values obtained from the algorithms indicate the accuracy of the
model. In the future, if more data could be accessed, such as the current availability of seats, the predicted results would be more
accurate. Finally, we conclude that this methodology is not preferred for performing this project. We can add more methods and more
data for more accurate results.
REFERENCES
[1] K. Tziridis, T. Kalampokas, G. Papakostas and K. Diamantaras "Airfare price prediction using machine learning techniques" in European Signal Processing
Conference (EUSIPCO), DOI: 10.23919/EUSIPCO.2017.8081365.
[2] William Groves and Maria Gini "An agent for optimizing airline ticket purchasing" in proceedings of the 2013 international conference on autonomous agents
and multi-agent systems.
[3] J. Santos Dominguez-Menchero, Javier Rivera and Emilio Torres Manzanera "Optimal purchase timing in the airline market".
[4] Supriya Rajankar, Neha Sakhrakar and Omprakash Rajankar “Flight fare prediction using machine learning algorithms” International Journal of Engineering
Research and Technology (IJERT) June 2019.
[5] Tianyi Wang, Samira Pouyanfar, Haiman Tian and Yudong Tao "A Framework for airline price prediction: A machine learning approach"
[6] T. Janssen "A linear quantile mixed regression model for prediction of airline ticket prices"
[7] Wohlfarth, T. Clemencon, S. Roueff “A Data mining approach to travel price forecasting” 10th international conference on machine learning Honolulu 2011.
[8] Vinod Kimbhaune, Harshil Donga, Ashutosh Trivedi, Sonam Mahajan and Viraj Mahajan research paper on flight fare prediction system.
[9] W. Groves and M. Gini, “An agent for optimizing airline ticket purchasing,” 12th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS 2013), St. Paul, MN, May 06 - 10, 2013, pp. 1341-1342.
[10] Viet Hoang Vu, Quang Tran Minh and Phu H. Phung, “An Airfare Prediction Model for Developing Markets”, IEEE paper 2018.
[11] Dominguez-Menchero, J. Santos, Rivera, “Optimal purchase timing in airline markets”, 2014
[12] medium.com/analytics-vidhya/mae-mse-rmse-coefficient of determination-adjusted-r-squared-which-metric-is bettercd0326a5697e article on performance
metrics
[13] www.keboola.com/blog/random-forest-regression article on random forest
[14] https://towardsdatascience.com/machine-learning-basics-decisiontreeregression-1d73ea003fda article on decision tree regression.
[15] Achyut Joshi, Himanshu Sikaria, Tarun Devireddy, & Dr. Vivek Vijay. Predicting Flight Prices in India
[16] O. Etzioni, R. Tuchinda, C. A. Knoblock, and A. Yates. To buy or not to buy: mining airfare data to minimize ticket purchase price.
[17] Manolis Papadakis. Predicting Airfare Prices.
[18] Groves and Gini, 2011. A Regression Model for Predicting Optimal Purchase Timing For Airline Tickets.
[19] Modeling of United States Airline Fares – Using the Official Airline Guide (OAG) and Airline Origin and Destination Survey (DB1B), Krishna Rama-Murthy,
2006.
[20] B. S. Everitt: The Cambridge Dictionary of Statistics, Cambridge University Press, Cambridge (3rd edition, 2006). ISBN 0-521-69027-7.
[21] Bishop: Pattern Recognition and Machine Learning, Springer, ISBN 0-387-31073-8.
[22] E. Bachis and C. A. Piga. Low-cost airlines and online price dispersion. International Journal of Industrial Organization, In Press, Corrected Proof, 2011.
[23] P. P. Belobaba. Airline yield management. an overview of seat inventory control. Transportation Science, 21(2):63, 1987.
[24] Y. Levin, J. McGill, and M. Nediak. Dynamic pricing in the presence of strategic consumers and oligopolistic competition. Management Science, 55(1):32–46,
2009
[25] B. Smith, J. Leimkuhler, R. Darrow, and Samuels, “Yield management at American Airlines,” Interfaces, vol. 22, pp. 8–31, 1992.
[26] T. Janssen, “A linear quantile mixed regression model for prediction of airline ticket prices,” Bachelor Thesis, Radboud University, 2014.
[27] S.B. Kotsiantis, “Decision trees: a recent overview,” Artificial Intelligence Review, vol. 39, no. 4, pp. 261-283, 2013.
[28] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5-32, 2001.
[29] S. Haykin, Neural Networks – A Comprehensive Foundation. Prentice Hall, 2nd Edition, 1999.
[30] H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola and V. Vapnik, “Support vector regression machines,” Advances in neural information processing systems,
vol. 9, pp. 155-161, 1997.