Updated Hard Copy Final Report
Bachelor of Technology
in
Computer Engineering – AI
Faculty of Technology
Marwadi University
Computer Engineering – AI department
2023-2024
CERTIFICATE
This is to certify that the project entitled Flight fare prediction using
Date: ____________________
Acknowledgments
We would like to extend our sincere gratitude to all those who have contributed to
the successful completion of this project. Your unwavering support, guidance, and
assistance have been invaluable throughout this endeavor. First and foremost, we
would like to express our deep appreciation to Marwadi University for providing
us with a conducive research and learning environment. Your commitment to
academic excellence has been a constant source of inspiration.
Our heartfelt thanks go to Dr. Madhu Shukla, whose visionary leadership has not
only guided the department but also provided us with the opportunity to explore our
research interests. Your mentorship has played a pivotal role in our growth. We are
immensely grateful to our mentor, Prof. Abdul Kalam, whose expertise and
guidance have shaped our understanding of the subject matter. We also express our
heartfelt appreciation to Prof. Anjan Kumar Sahoo, our external guide, whose
expertise and guidance were instrumental in shaping our understanding of the
subject matter.
Lastly, we would like to extend our thanks to our friends for their constant support and
understanding. Your presence has provided us with strength. This project represents
a collective effort, and we are thankful to all who have played a role in its
completion.
Flight fare prediction using machine learning
Index
Institute’s Vision and Mission
Department’s Vision and Mission
PEO, POs and PSOs
Abstract
Text
1 Introduction
1.1 Problem Summary
1.2 Aims and Objective
1.3 Problem Specifications
1.4 Literature Review
1.5 Plan of the Work
1.6 Materials and Tools Required
1.7 Motivation
2 Methodology
2.1 Design Specification
2.2 Proposed Machine Learning Algorithm
3 Implementation
4 Conclusion
4.2 Discussion
5 References
Institute’s Vision
Our vision is to address challenges facing our society and planet through sterling education
that builds capacity of our students and empowers them through their innovative thinking
practice and character building that will ultimately manifest to boost creativity and
responsibility utilizing the limited natural resources to meet the challenges of the 21st
century.
Institute’s Mission
Department’s Vision
To impart quality technical education through research, innovation and teamwork for
creating professionally superior and ethically strong manpower that meet the global
challenges of engineering industries and research organization in the area of Computer
Engineering.
Department’s Mission
• Dedicate itself to providing its students with the skills, knowledge and attitudes
that will allow its graduates to succeed as engineers, leaders, professionals and
entrepreneurs.
• Prepare its graduates for life-long learning to meet intellectual, ethical and career
challenges.
• Inspire graduates for competitive exams, higher education as well as research and
development.
Our graduated students are expected to fulfill the following Program Educational
Objectives (PEOs):
2. Breadth: Will apply current industry accepted practices, new and emerging
technologies to analyse, design, implement and maintain state of art solutions.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO8: Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability
to engage in independent and life-long learning in the broadest context of technological
change.
PSO1. Students shall demonstrate skills, the knowledge and competence in the analysis,
design and development of computer-based systems addressing industrial and social
issues.
PSO2. Students shall have competence to take challenges associated with future
technological issues associated with security, wearable devices, augmented reality,
Internet of Anything etc.
Attainment
PO / PSO Level Justification
Abstract
As demand for air travel in India grows and more flight tickets are purchased on the
internet, passengers are attempting to grasp how airline businesses set flight ticket
prices over time. A variety of strategies exist for choosing the right moment to buy.
Customers want the cheapest ticket possible, but airlines want to maximize their profit by
keeping their total income as high as feasible. To increase revenue, airlines use several
computational tactics, including demand forecasting and price discrimination. This work is
for the consumer who buys flight tickets, and it estimates the amount of the flight fare.
From the customer’s perspective, the most difficult part is finding the best value or the
ideal time to purchase tickets. The bulk of the existing techniques rely on advanced
computational intelligence, prediction models, and a branch of science called Machine
Learning (ML). This report highlights the factors that influence fares and provides
instructions for developing a machine learning-based airfare prediction model.
Table 3.1 The most frequently used metrics for evaluating machine learning
Chapter 1
1 Introduction
The prediction of flight fares has been a topic of interest in the field of transportation and
tourism for many years. Various methods have been proposed for predicting flight fares,
including statistical models, machine learning algorithms, and artificial intelligence techniques.
One common approach for flight fare prediction is the use of linear regression models. These
models use historical data on flight fares, such as the date of the flight, the destination, and the
carrier, to predict future fares. Researchers have found that linear regression models can
provide accurate predictions of flight fares, but they may not be able to capture the complex
relationships between different factors that influence fares.
Another popular approach for flight fare prediction is the use of machine learning algorithms,
such as decision trees, random forests, and neural networks. These algorithms have been found
to be effective in capturing the non-linear relationships between different factors that influence
fares. However, they may require a large amount of data to train effectively.
Artificial intelligence techniques such as deep learning models have also been used for flight
fare prediction. These models have been found to be effective in capturing the non-linear
relationships between different factors that influence fares. However, they are typically more
computationally expensive than traditional machine learning algorithms.
Several studies have been published on flight fare prediction, including those that have used
various datasets, such as airlines' pricing data, search engines' data, and travel agencies' data.
Some studies have also focused on specific industries such as low-cost carriers, full-service
carriers, or specific regions.
In conclusion, the literature on flight fare prediction has shown that there are a variety of
methods that can be used for prediction, including linear regression models, machine learning
algorithms, and artificial intelligence techniques. Each approach has its own strengths and
weaknesses, and the choice of method will depend on the specific needs of the problem and the
availability of data.
Currently, airlines use sophisticated tactics and procedures to allocate ticket pricing in a
dynamic manner. These techniques take into consideration several financial, marketing,
commercial, and societal elements that have a direct impact on the ultimate price of flight. Due
to the tremendous complexity of the pricing methods used by airlines, it is very difficult for a
passenger to acquire an airline ticket at the lowest price, since the price fluctuates constantly.
To solve this problem, we have been provided with prices of flight tickets for various airlines
between the months of March and June of 2019 and between various cities, using which we
aim to build a model which predicts the prices of the flights using various input features.
Anyone who has booked a flight ticket knows how unexpectedly the prices vary. Airlines use
sophisticated quasi-academic tactics known as "revenue management" or "yield
management". The cheapest available ticket for a given date gets more or less expensive over
time. This usually happens as an attempt to maximize revenue based on -
2. Keeping the flight as full as they want it (raising prices on a flight which is filling up in
order to reduce sales and hold back inventory for expensive last-minute
purchases)
So, if we could inform travellers of the optimal time to buy their flight tickets based on
historic data and show them various trends in the airline industry, we could help them save
money on their travels. This would be a practical application of data analysis, statistics,
and machine learning techniques to a daily problem faced by travellers.
The objectives of the project can broadly be laid down by the following questions –
1. Flight Trends
What is the best time to buy so that the consumer can save the most by taking the
least risk? So should a passenger wait to buy his ticket, or should he buy as early as
possible?
3. Verifying Myths
Does price increase as we get near to departure date? Is Indigo cheaper than Jet
Airways? Are morning flights expensive?
The scope of this project extends to the management of extensive airlines datasets,
encompassing historical flight fare records. The primary challenge is to effectively
process these vast datasets, extracting meaningful patterns and trends, while creating
algorithms that balance computational efficiency. The following problem specifications
emerged during the project, drawing insights from the works in recent research papers:
Some of the problems that were faced during the project were as follows:
Airlines generate vast amounts of data from various sources such as flight schedules,
passenger information, maintenance logs, etc. Handling this data requires robust
infrastructure and efficient algorithms to process, store, and retrieve information in a
timely manner.
Real-time predictive algorithms are essential for tasks like predicting flight delays,
Models deployed in the airline industry must be robust to changes in data patterns,
external factors (e.g., weather conditions), and operational dynamics. Regular monitoring
and updating of models are necessary to ensure they remain effective over time.
Additionally, models should be adaptable to new data sources and evolving business
requirements.
Each of these areas requires a combination of domain expertise, data engineering skills, and
advanced analytics techniques to address the specific challenges faced by the airline industry.
• Tiyani Wang [3] proposed to predict fares at the market-segment level, taking
marketing strategies into account. The framework uses the DB1B and T-100 datasets, as
well as data about the economy, and the work gives a high-level overview of the
framework's primary components. In the data preparation stage, all datasets are cleaned
to exclude any inaccurate sample data, transformed, and merged by market segment. The
feature extraction module extracts and generates handcrafted features that are intended
to characterize a market segment.
•
quite difficult for a customer to get the best deal on an airline ticket because prices
fluctuate often.
• G. A. Papakostas [7] noted that several strategies have lately been presented that can
give the optimum moment for a consumer to purchase an airline ticket by projecting the
price of the flight. The bulk of these strategies rely on advanced prediction models
developed in the Machine Learning (ML) branch of computational intelligence research.
• Janssen [8] designed a linear quantile hybrid regressor model that performs well for
predicting plane ticket prices several days before arrival.
• Ren, Yang and Yuan [9] studied the prediction of aircraft ticket prices and reported
that LR (77.06% acc.), NB (73.06% acc.), SR (76.84% acc.), and SVM (80.6% acc. for two
bins) models performed well.
Our project is a smart approach to forecasting flight fares through machine learning. It
begins by analyzing vast amounts of historical flight data, considering factors such as
departure and arrival locations, dates, and ticket types. This data is then cleaned and
organized to focus on the most relevant aspects. Using this refined dataset, the system
trains machine learning models to understand the relationship between various
parameters and ticket prices. These models, such as regression algorithms, learn from
past data to make predictions about future flight fares. The accuracy of these predictions
is continually evaluated and refined to ensure reliability. Once the model is deemed
accurate, it can be deployed to provide real-time fare estimates, helping travelers plan
their trips more effectively. Continuous monitoring and adaptation ensure that the system
remains up-to-date with changing trends and variables in the airline industry, maintaining
its usefulness and accuracy over time.
• Historical Flight Data: Access to comprehensive datasets containing past
flight information, including fares, departure/arrival airports, dates, times,
and other relevant details.
1.7 Motivation
• The motivation for developing a flight fare prediction model lies in its potential to
benefit both consumers and businesses in the travel industry.
• It offers consumers cost-saving insights, convenience, and informed decision-
making while providing businesses with market insights, competitive advantage,
and improved revenue management capabilities.
• Flight fare prediction projects provide an excellent opportunity to apply and
experiment with various data science techniques, such as machine learning
algorithms, time series analysis, and feature engineering. It's a chance to delve
into predictive modelling and gain practical experience in a real-world scenario.
Chapter 2
2 Methodology
In order to carry out more research and development for this investigation, the research
methodology is shown below. The research methodology can be broken down into a total of
six distinct stages, each of which is explained in turn below (as shown in Fig 1).
The first step, Application Understanding, establishes that the variability in airfares can
be analysed effectively using machine learning techniques. At present, airfare movements are
judged manually, based on human intuition and past experience. This approach fails to
consider other factors that affect the variability of airfares, and it takes a lot of time to
identify the right price of a flight for a customer's specific departure date. Traditional
techniques can at times be inaccurate, whereas modern methods can produce findings that are
more accurate and available more quickly. This research can help automate the customer’s
experience of booking a flight at the most optimal cost. By predicting the flight ticket
price pattern with respect to the number of days left before departure, passengers can
determine how many days in advance to book and can make the booking at an optimal cost.
The second step, Data Gathering (EDA), involves gathering the dataset used for flight price
prediction. The dataset is taken from Kaggle, which is a public and open-source repository.
It captures flight-level data over the period of 11th February 2022 to 31st March 2022,
i.e., for 50 days. The dataset covers the top 6 major metro cities of India and all the
major airline companies of India. In total, 300261 data points and 11 features are taken
into consideration. The features are the name of the airline company, the city from which
the flight takes off, the departure time of the flight, the number of stops, the arrival time
of the flight, the destination city, the cabin class, the duration of the flight and the days
left before departure; the target variable is the price of the flight.
The 6 cities are Delhi, Mumbai, Chennai, Hyderabad, Kolkata and Bangalore, and the 6 major
airline service providers are Vistara, Air India, Go First, Indigo, Air Asia and Spice Jet.
Among these, Vistara and Air India are known to be premium airline service providers, while
Go First, Indigo, Air Asia and Spice Jet are low-cost carriers.
Fig. 2(b): Source city in economy class
Fig. 2(c): Distribution of days left vs price
Fig. 2(d): Cities used in business class
Fig. 2(e): Distribution of business class
The third step, Data pre-processing and transformation, prepares the structured flight-level
dataset for modelling. The following steps were performed. Initially, the flight_details.csv
dataset was read into a data frame and the Unnamed column was dropped. The dataset was then
checked for null values, and none were found in any of the columns. Next, the distribution of
the numerical variables was inspected with box plots to check for outliers. Log scaling was
applied to the features having outlier values, and the z-score method was then applied to
detect and remove the remaining outliers. After that, the dataset was transformed so that it
could be fitted by various machine learning models. Firstly, dummy variables for all the
categorical data in the pre-processed dataset were generated using the one-hot encoding
technique. Previous research has shown that the accuracy of machine learning models
increases when one-hot encoding is applied to categorical variables, as it makes the
representation of categorical data more expressive. Then, the correlation matrix was plotted
for all the dummy variables and the original features against the flight price to check the
impact of each variable on the price. After this, the data was split into training and
testing sets in the ratio of 70:30 (i.e. train: 70% and test: 30%), and the target variable,
i.e. the flight price, was separated from the feature set. Finally, the min-max scaling
transformation was applied to the train and the test sets to obtain a better fit with the
machine learning models.
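The pre-processing steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the function names, the z-score threshold of 3, the random seed and the sample values are all assumptions, and fitting the min-max bounds on the training column only is a common convention that the text does not state.

```python
import math
import random
from statistics import mean, pstdev


def log_scale(values):
    """Compress a right-skewed numeric feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def zscore_keep_mask(values, threshold=3.0):
    """Return True for each value whose |z-score| is within the threshold.

    The threshold of 3 is an assumed, commonly used default.
    """
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [True] * len(values)
    return [abs((v - mu) / sigma) <= threshold for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator rows (columns sorted by name)."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle the rows and split them 70:30 by default."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]


def min_max_fit(column):
    """Learn the scaling bounds from a (training) column."""
    return min(column), max(column)


def min_max_apply(column, lo, hi):
    """Scale a column with previously learned bounds; a constant column maps to 0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


# Made-up values for illustration only.
airlines = ["Vistara", "Indigo", "Air India", "Indigo"]
columns, encoded = one_hot(airlines)
train_rows, test_rows = train_test_split(list(range(10)))
lo, hi = min_max_fit([10.0, 20.0, 30.0])
scaled = min_max_apply([10.0, 20.0, 30.0], lo, hi)
```

In practice the same steps are usually carried out with a data-frame library; the sketch only makes the order of operations explicit.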
The fourth step, Data modelling, predicts the continuous target variable, i.e., the flight
price. Various ML models based on regression techniques, both basic and advanced, are
applied to the processed dataset. Regression
analysis is a statistical method for connecting a dependent variable to one or more independent
(explanatory) variables. A regression model may demonstrate whether variations in the
dependent variable are related to variations in one or more explanatory variables. The
regression models used to predict the prices of the flight tickets were multiple linear
regression, decision tree regressor, K-neighbours regressor, extra trees regressor, XG Boost
regressor and bagging regressor.
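As a minimal sketch of what "connecting a dependent variable to an explanatory variable" means, the single-feature case of linear regression has a closed-form least-squares solution. The function names and the numbers below are made up for illustration; the multiple linear regression used in the project generalizes this to many features.

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Least-squares slope and intercept for y ~ intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_linear(x, slope, intercept):
    """Apply the fitted line to new inputs."""
    return [intercept + slope * xi for xi in x]


# Made-up numbers: days left before departure vs. price in INR.
days_left = [1, 2, 3, 4, 5]
price = [9000, 8000, 7000, 6000, 5000]
slope, intercept = fit_simple_linear(days_left, price)
estimate = predict_linear([6], slope, intercept)
```

A negative slope here would indicate that, in this toy data, the price falls as the number of days left grows.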
The fifth step, Evaluation, computes several metrics for the implemented machine learning
regression models and compares them to find the model that predicts the flight prices most
accurately. These evaluation
metrics are RMSE (Root Mean Squared Error), MSE (Mean Squared Error), MAE (Mean
Absolute Error) and Adjusted R-Square. The performance of all the models was analysed
graphically, and the model with the lowest error terms was selected to generate the flight
price predictions on the testing data. The detailed description and
formulas for the evaluation metrics used to measure the prediction of airline ticket prices can
be found here.
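The four evaluation metrics named above follow from their standard definitions. The sketch below is an illustration under assumptions: the function name, the returned dictionary keys and the sample prices are hypothetical, and n_features denotes the number of explanatory variables used in the adjusted R-Square correction.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R-Square and adjusted R-Square for a regression model."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-Square penalizes the number of explanatory variables.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Made-up actual and predicted prices in INR.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390], n_features=1)
```

Because RMSE is the square root of MSE, the two always rank models in the same order; MAE can rank them differently because it weights large errors less heavily.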
To predict airfares successfully, it is important to understand the design approach that we
follow. The airfare prediction starts by collecting the airlines dataset from Kaggle, which
is a public open-source repository, followed by pre-processing of the various features of the
data. The pre-processed data is then split into training and testing sets, and various
regression-based models are fitted on the training data. The models are then tested on the
test data, a predicted flight price is generated for the test data, and the model is
evaluated. The diagram below (Fig. 2) illustrates this flow.
After the train and test sets of the data are obtained, multiple machine learning models have
been applied on the training dataset to predict the flight price. The models implemented to
predict the flight price are listed below:
The Logistic Regression Classifier is used to predict the probability of a data point
belonging to a particular category. It is similar to Linear regression in that it assumes the
data can be characterized by a linear function. However, instead of using the linear output
directly, logistic regression passes it through the sigmoid function to model the data [6].
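The role of the sigmoid function can be shown in a few lines. This is a sketch only: the weights, bias and feature values are made up, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of a linear function of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Made-up weights and features; here the linear score is exactly zero.
probability = predict_proba([1.0, 2.0], [0.5, -0.25], bias=0.0)
```

A linear score of zero maps to a probability of 0.5, which is the usual decision boundary between the two classes.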
One of the most popular and useful methods for supervised learning is the decision tree. It
can solve both Regression and Classification problems, although Classification is more
often used in real-world settings. In Decision Trees for Classification, precise and
effective classifications depend on the tree asking the correct question at the right node.
Entropy and Information Gain are the two metrics used in Classification Trees to accomplish
this. However, because we are making predictions about a continuous variable, we cannot
compute the entropy and follow the same procedure. For a continuous target variable, the
mean square error (MSE) is used instead; it measures how far our predictions stray from the
actual target values.
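A regression tree chooses each split by minimizing the MSE of the resulting groups. The sketch below searches the thresholds of a single numeric feature; the function names and the sample data are assumptions, and a full tree would repeat this search recursively over every feature.

```python
def node_mse(values):
    """Mean squared error of predicting the mean for every value in a node."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (threshold, weighted_mse) of the best split 'x <= threshold'.

    Returns (None, mse_of_all_targets) when no split improves on the unsplit node.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = (None, node_mse(list(y)))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up data: days left before departure vs. price in INR.
days_left = [1, 2, 3, 10, 11, 12]
price = [9000, 9100, 8900, 4000, 4100, 3900]
threshold, score = best_split(days_left, price)
```

Each leaf then predicts the mean price of the training rows that fall into it.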
Random Forest, developed by Leo Breiman [7], is a method of ensemble learning that
incorporates numerous decision trees. Unlike a single decision tree, it combines predictions
from diverse trees trained on randomly chosen subsets of the training data, resulting in
improved performance [8]. Similar to decision trees, Random Forest calculates the Gini
Index for each parameter and utilizes the regression error formula [7] to locate the closest
node, which improves prediction accuracy.
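The resampling-and-averaging idea behind Random Forest can be sketched as bagging of one-split regression trees. This is a simplification and not the full algorithm: it uses a single feature, so the random feature selection that Random Forest performs at each split is omitted, and the function names, tree count, seed and data are all assumptions.

```python
import random


def fit_stump(x, y):
    """One-split regression tree on a single feature, chosen by squared error."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    best = None  # (sse, threshold, left_mean, right_mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # all feature values identical: predict the mean everywhere
        return lambda v: overall
    _, threshold, lm, rm = best
    return lambda v: lm if v <= threshold else rm


def fit_bagged_stumps(x, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_ensemble(trees, value):
    """Average the predictions of all trees."""
    return sum(tree(value) for tree in trees) / len(trees)


# Made-up data: days left before departure vs. price in INR.
days_left = [1, 2, 3, 10, 11, 12]
price = [9000, 9100, 8900, 4000, 4100, 3900]
forest = fit_bagged_stumps(days_left, price)
```

Averaging over trees trained on different bootstrap samples reduces the variance of a single tree, which is the source of the improved performance mentioned above.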
Chapter 3
3 Implementation
A. Metrics
In machine learning, metrics serve as benchmarks for assessing the performance of an
ML model. As summarized in Table 3.1, these metrics gauge the accuracy, efficacy, and
efficiency of a model, enabling comparisons between different models. The selection of a
specific metric hinges on the task at hand and the application's demands. Within machine
learning, various metrics are employed:
Accuracy in classification: This metric assesses the proportion of correct predictions made
by the model and is commonly used as the main assessment measure for classification tasks.
Precision and recall: These metrics assess the degree to which a model can reliably discover
positive cases, such as the identification of cancer in medical diagnostics. Recall shows the
percentage of true positive cases that are accurately detected, whereas precision represents the
percentage of accurate positive forecasts.
F1 score: An indicator of balance between recall and precision, the F1 score is a metric that
takes both metrics and combines them into a single number. Commonly, it is employed to
evaluate a model's overall efficacy in a cohesive fashion.
                        Predicted values
                        True                    False
Actual    True          True Positive (TP)      False Negative (FN)
values    False         False Positive (FP)     True Negative (TN)
                        (Type 1 Error)

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

Table 3.1: The most frequently used metrics for evaluating machine learning
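The classification metrics described in this section can all be computed from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name, the returned keys and the counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
scores = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, precision is higher than recall, so the hypothetical model misses more positive cases than it raises false alarms.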
B. Model results
Algorithms Accuracy
Chapter 4
4 Conclusion
• The Decision Tree regressor was the fourth best performing model, with an adjusted
R-Square of 0.9738, MAE of 1247.90, RMSE of 3307.63 and MSE of 1.34 x 10^7. The decision
tree regressor has a better MAE than the Extra tree regressor, which means that its
average prediction error is lower. An adjusted R-Square of 0.9738 means that the model
explains about 97.38% of the variance in the actual prices, so the predicted and the
actual values are very close; on average, the predicted prices deviate from the actual
prices by 1247.90 INR.
• The Linear regression model had the lowest accuracy in predicting the flight prices, with
an adjusted R-Square of 0.9069, MAE of 4571.18, RMSE of 6919.96 and MSE of 4.78 x 10^7.
The model therefore explains about 90.69% of the variance in the actual prices, which is
significantly lower than the top 6 models presented above, so the predicted and the
actual prices are significantly less close than for those models. On average, the
predicted prices deviate from the actual prices by 4571.18 INR, which is significantly
higher than for the other models.
• Since the XG Boost model had the best evaluation metrics in predicting the flight price,
it was then applied to the test data to obtain flight price predictions. The predicted
prices on the test data were plotted against the actual prices. The XG Boost model fit
the test data well, and the predicted prices almost matched the actual prices of the
flights.
4.2 Discussion
This project was carried out step by step, and it was difficult to identify a good dataset
and to extract, clean, and convert it. The project's goal was achieved after pre-processing
the dataset, one-hot encoding the categorical variables, and scaling the data using the
min-max scaling transformation. Understanding the purpose of the variables in the dataset and
identifying the relevant factors were the main challenges in the pre-processing of the data.
Exploratory data analysis was performed on the pre-processed dataset. It generated better
insights than the previous work on predicting airfares by (Liu et al., 2017) and provided
more information about the features that affect the variability of the airfares, such as
the time of flight, the weekday of the departure and the duration of the flight.
The number of days left before departure, the times of day at which flights depart and
arrive, the weekday on which the flight is scheduled and the duration of the flight play a
vital role in determining the price of the flight.
The case studies using a variety of sophisticated regression machine learning models have
outperformed the earlier models in the literature study. With an adjusted R-squared value of
0.9845, this research produced the best prediction metrics to predict the price of the flight
using the XG Boost model. This is significantly better than the research conducted by (Wang
et al. 2019), which produced the best adjusted R-squared value of 0.858 using the Random
Forest model. By accurately forecasting a customer's ideal flight cost, this study can have a
significant positive impact on airline passengers and improve the possibility that
they will purchase airline tickets at the most optimal cost.
• More routes can be added and the same analysis can be expanded to major airports and
travel routes in India.
• The analysis can be done by increasing the data points and increasing the historical data
used, which will train the model better, giving better accuracies and more savings.
• More rules can be added in the rule-based learning based on our understanding of the
industry, also incorporating the offer periods given by the airlines.
• Developing a more user-friendly interface for various routes giving more flexibility to
the users.
References