Comparative Analysis of Machine Learning Models For Accurate Flight Price Prediction

Flight fare prediction is a vital component in helping consumers make informed decisions regarding travel expenses. Airline ticket prices fluctuate due to a variety of factors such as demand, time of purchase, and flight routes. In this research, we propose a machine learning-based solution for predicting flight fares using historical data. Models like Random Forest, Gradient Boosting, and Support Vector Machines (SVM) are employed to analyze flight data and produce reliable predictions.

Volume 9, Issue 9, September 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24SEP1688

Comparative Analysis of Machine Learning Models for Accurate Flight Price Prediction

Adwait Chavan1
Department of Computer Engineering
Dr. Vishwanath Karad MIT World Peace University
Pune, India

Ishika Rathod2
Department of Computer Engineering
Dr. Vishwanath Karad MIT World Peace University
Pune, India

Sarika Bobde3 (Professor)
Department of Computer Engineering
Dr. Vishwanath Karad MIT World Peace University
Pune, India

Abstract:- Flight fare prediction is a vital component in helping consumers make informed decisions regarding travel expenses. Airline ticket prices fluctuate due to a variety of factors such as demand, time of purchase, and flight routes. In this research, we propose a machine learning-based solution for predicting flight fares using historical data. Models like Random Forest, Gradient Boosting, and Support Vector Machines (SVM) are employed to analyze flight data and produce reliable predictions. This study demonstrates how predictive models can benefit customers by offering insights into pricing trends, thus optimizing their flight booking process.

Keywords:- Flight Fare Prediction, Machine Learning, Random Forest, Dynamic Pricing, Predictive Modeling, SVM, Gradient Boosting.

I. INTRODUCTION

Airline ticket prices have become increasingly dynamic due to the global expansion of commercial aviation and the advent of e-commerce. Airlines use complex revenue management strategies to optimize pricing based on multiple variables, including the date of booking, demand, and competition. While customers aim to secure the lowest fare, predicting the best time to book a flight is difficult. This paper addresses the need for accurate flight fare predictions using machine learning techniques.

Machine learning has emerged as a powerful tool for handling such pricing complexities. Traditional methods fail to capture the dynamic nature of flight prices, which depend on numerous factors such as seasonal trends, travel demand, and route popularity. By applying machine learning algorithms, we can analyze historical flight data and uncover relationships between these variables, enabling more accurate fare predictions.

This research explores several machine learning models, evaluates their performance, and proposes an efficient system for flight fare prediction.

II. LITERATURE REVIEW

Several studies have addressed the challenge of predicting flight prices using machine learning techniques. A common theme in the literature is the employment of regression models to analyze the temporal, geographical, and market-driven variables affecting airfares. For example, K. Tziridis et al. (2017) explored the predictive power of machine learning algorithms like Random Forest, revealing that ensemble models outperformed simple regression models in capturing price dynamics.

Other works, such as that of Panwar et al. (2021), focused on using Support Vector Machines (SVM) and Linear Regression for predicting stock and airfare prices, finding that machine learning models offer substantial improvements over traditional statistical approaches. However, most studies emphasize the need for robust feature engineering, as the importance of specific variables like seasonality and airline type can significantly affect the predictive power of models. Our study builds upon this foundation by comparing multiple machine learning models and introducing new feature engineering techniques to improve model accuracy in predicting flight fares.

IJISRT24SEP1688 www.ijisrt.com 2798


III. METHODOLOGY

Fig 1 Block Diagram


A. Data Collection
The first step in building a machine learning model for flight price prediction involves gathering a relevant and comprehensive dataset. For this study, the dataset was obtained from a public repository (e.g., Kaggle) containing historical flight price data. This dataset includes a variety of features such as:

• Flight Details: Information such as the airline, source, destination, and route.
• Temporal Features: Date of journey, departure time, and arrival time.
• Flight Characteristics: Number of stops, duration of flight.
• Ticket Price: The target variable for prediction.

The dataset used for this research contains 11 features and 10,683 rows, each representing a unique flight. These features include both numerical and categorical variables essential for training the machine learning model. The dataset must be large enough to ensure the model captures the various patterns and trends in ticket pricing.

B. Data Preprocessing
Data preprocessing is critical to ensure the quality and reliability of the dataset. It includes several steps:

• Handling Missing Values:
The dataset may contain missing values in various columns. These missing values need to be addressed as they can negatively impact model performance. For numerical features, missing values can be replaced with the mean or median. For categorical features, missing values can be replaced using the mode or a placeholder indicating missing data.

• Removing Duplicates:
Duplicate entries in the dataset can skew the results. A check for duplicate rows is performed, and duplicates are removed to ensure data integrity.

• Encoding Categorical Features:
Machine learning models require numerical inputs. Therefore, categorical data such as the airline, source, destination, and route need to be converted into numerical representations.

• One-Hot Encoding: Categorical features without an intrinsic order (e.g., airline names) are converted using one-hot encoding, which creates binary columns for each unique category.
• Label Encoding: Features with an ordinal relationship, such as flight stops (e.g., 0 stops, 1 stop, 2 stops), are label-encoded into numerical values.

• Feature Scaling:
Feature scaling puts all numerical features on a comparable scale, which helps scale-sensitive models such as SVM perform better; tree-based models such as Random Forest and Gradient Boosting are largely insensitive to it. Standardization (mean = 0, variance = 1) or normalization (scaling between 0 and 1) can be applied.

• Date and Time Transformation:
Time-based features such as departure and arrival times are transformed into numerical values representing hours and minutes. In addition, the date of the journey can be split into day, month, and year to capture trends based on temporal patterns.

C. Feature Engineering
Feature engineering is the process of creating new features from the existing data to improve model performance. For flight price prediction, several additional features were created:

• Flight Duration:
The flight duration is a critical feature that influences ticket pricing. It is calculated by subtracting the departure time from the arrival time.

• Day of the Week:
The day of the week can have a significant impact on flight prices. For instance, weekend flights or flights on holidays may be priced higher due to increased demand. This feature is extracted from the date of journey.

• Month and Seasonal Effects:
Prices are often influenced by the seasonality of travel. Flights during holiday seasons (e.g., Christmas, summer vacations) or major events tend to be more expensive. By extracting the month from the date, we can capture these seasonal variations in ticket pricing.

• Peak and Off-Peak Hours:
Flights scheduled during peak hours (morning and evening) may cost more compared to off-peak hours (late night or early morning). This feature helps in identifying price trends related to flight timing.

D. Data Splitting
To evaluate the model's performance effectively, the dataset is divided into two parts:

• Training Set (80%): Used to train the machine learning model.
• Test Set (20%): Used to evaluate the model's performance on unseen data.

A typical split ensures that the model is trained on a sufficiently large portion of the data, while the test set provides an independent evaluation of how well the model generalizes to new instances.

E. Model Selection
Several machine learning models were considered for this study to determine the most accurate algorithm for flight price prediction. These models include:


• Random Forest:
Random Forest is an ensemble learning algorithm that constructs multiple decision trees during training and outputs the average prediction from these trees. It is relatively robust against overfitting and performs well on datasets with both categorical and numerical features.

• Gradient Boosting:
Gradient Boosting is another ensemble method that builds weak learners (usually decision trees) sequentially. Each new learner focuses on the errors made by the previous ones, gradually improving the model's accuracy.

• Support Vector Machines (SVM):
SVM is a supervised learning algorithm that, in classification, finds the hyperplane which best separates the data points. In regression, it fits a function that keeps the prediction errors within a chosen tolerance threshold where possible. It can model both linear and non-linear relationships but often requires extensive tuning of hyperparameters.

• Linear Regression:
As a baseline model, linear regression was also tested. It assumes a linear relationship between the features and the target variable (ticket price), which may not be the case in real-world flight pricing. However, it serves as a comparison to more complex models.

Each model was evaluated using k-fold cross-validation, a technique that divides the dataset into k subsets (folds), trains the model k times, each time holding out a different fold for validation, and averages the results. This gives a more stable estimate of performance than a single split and helps detect overfitting.

F. Model Training and Hyperparameter Tuning
Each machine learning model was trained on the training dataset. Hyperparameter tuning was performed using Grid Search to optimize model parameters such as:

• Number of Trees (Random Forest): Controls the number of decision trees in the forest.
• Learning Rate (Gradient Boosting): Determines how much each tree contributes to the final prediction.
• Kernel and Regularization (SVM): Specifies the type of kernel (linear or non-linear) and the regularization parameter to prevent overfitting.

The goal of hyperparameter tuning is to find the combination of settings that minimizes the model’s error on the validation set.

G. Model Evaluation
The performance of each model was evaluated using the test set. Several metrics were used to measure how well the models predicted flight prices:

• Mean Absolute Error (MAE):
The MAE is the average of the absolute differences between predicted and actual values. A lower MAE indicates better model performance.

• Mean Squared Error (MSE):
MSE calculates the average of the squared differences between predicted and actual values. It penalizes larger errors more than MAE, making it useful for identifying models that make large errors.

• Root Mean Squared Error (RMSE):
RMSE is the square root of the MSE, which brings the error metric back to the same units as the target variable (ticket prices). It is a standard measure of model accuracy.

• R-Squared (R²):
R² measures how well the regression model fits the data. A value closer to 1 indicates that the model explains a large portion of the variance in the target variable.

IV. RESULTS

A. Model Performance Analysis
To assess the predictive accuracy of the machine learning models, we evaluated them using the test dataset. Three models, namely Random Forest, Gradient Boosting, and Support Vector Machines (SVM), were trained and tested. Additionally, a baseline Linear Regression model was used for comparison.

• Mean Absolute Error (MAE)
MAE is an important metric for assessing how close the predicted values are to the actual values. It calculates the average magnitude of errors in a set of predictions, without considering their direction (i.e., whether the prediction is higher or lower than the actual value). For flight price prediction, a lower MAE means the model's predicted prices are closer to the actual fares.

• Random Forest: MAE = 725.34
• Gradient Boosting: MAE = 742.12
• SVM: MAE = 782.13
• Linear Regression: MAE = 810.56

Among the models, Random Forest had the lowest MAE, indicating that it provided the most accurate predictions on average. Gradient Boosting also performed reasonably well, though it was slightly less accurate than Random Forest. SVM showed a higher MAE, suggesting that it did not capture the complexity of the dataset as effectively. Linear Regression had the highest MAE, reinforcing that more complex models outperform linear ones for this task.

• Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
MSE and RMSE provide additional perspectives by emphasizing larger prediction errors. MSE measures the average of the squared differences between actual and predicted prices, penalizing larger errors more than smaller ones. RMSE, the square root of MSE, brings the error metric back to the original units (i.e., currency), making it easier to interpret.
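For reference, the four evaluation metrics described above can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the original text: $y_i$ is the actual fare of the $i$-th test flight, $\hat{y}_i$ the predicted fare, $\bar{y}$ the mean actual fare, and $n$ the number of test flights.

```latex
\begin{align}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \\
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
R^2           &= 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
\end{align}
```

Because errors are squared before averaging, a single fare that is off by 2,000 contributes as much to MSE as four fares that are each off by 1,000, whereas under MAE it contributes only half as much as those four combined.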


• Random Forest: MSE = 969,687.60, RMSE = 984.67
• Gradient Boosting: MSE = 993,524.43, RMSE = 996.76
• SVM: MSE = 1,032,183.67, RMSE = 1,016.54
• Linear Regression: MSE = 1,075,829.89, RMSE = 1,037.33

Random Forest achieved the lowest MSE and RMSE, indicating fewer large errors in predicting flight prices compared to the other models. Gradient Boosting was very close in terms of MSE and RMSE, showing its effectiveness in managing complex patterns in the data. SVM and Linear Regression had the highest values, suggesting that these models are less suited to capturing the intricate non-linear relationships present in flight pricing data.

• R-Squared (R²)
R² measures how well the model explains the variance in the target variable (flight prices). An R² value close to 1 indicates that the model explains most of the variability in the data, while a value closer to 0 indicates poor predictive power.

• Random Forest: R² = 0.92
• Gradient Boosting: R² = 0.90
• SVM: R² = 0.87
• Linear Regression: R² = 0.84

Once again, Random Forest led the models with the highest R² score, explaining 92% of the variance in the flight price data. Gradient Boosting followed closely at 90%, showing it also captured most of the variation in prices. The SVM and Linear Regression models lagged behind, with SVM explaining 87% of the variance and Linear Regression explaining 84%. These results indicate that while all models provide some predictive ability, Random Forest and Gradient Boosting are better suited for this task.

B. Feature Importance Analysis
One of the advantages of using tree-based models like Random Forest and Gradient Boosting is their ability to provide insights into feature importance. This analysis reveals which features contributed most to the predictions and helps us understand the key drivers behind flight price fluctuations.

• Key Factors Influencing Flight Prices
The analysis of feature importance highlighted the following factors as the most significant in predicting flight prices:

• Airline Type: The type of airline (full-service carrier vs. low-cost carrier) was found to be the most influential feature. Premium airlines tend to have higher ticket prices due to the services they offer.
• Flight Duration: The total duration of the flight played a key role in determining the ticket price. Longer flights generally corresponded to higher fares, reflecting the additional operational costs involved.
• Number of Stops: Non-stop flights were typically more expensive than flights with layovers. This aligns with the general preference for convenience, where direct flights are often priced higher.
• Departure Time: The time of day at which the flight departs also influenced ticket prices. Flights during peak hours, such as early mornings and evenings, were generally more expensive compared to off-peak times.
• Day of the Week: Flights scheduled for weekends or holidays were generally priced higher, likely due to increased demand.
• Source and Destination: Certain source and destination pairs consistently showed higher prices, likely due to the popularity of specific routes. For instance, flights between major cities or tourist destinations tended to have higher fares.

C. Outlier Detection
Certain predictions deviated significantly from actual values, indicating potential outliers in the data. These outliers may be due to unusual conditions such as last-minute bookings, flash sales, or sudden spikes in demand for specific routes. Future iterations of the model could improve by identifying and managing these outliers more effectively, potentially using advanced techniques like anomaly detection.

D. Insights and Recommendation
Based on the model performance and feature importance analysis, the following key insights can be drawn:

• Airlines can optimize pricing strategies by focusing on the time of departure, the number of stops, and flight duration. Offering more flexible pricing for off-peak hours or less popular routes may help airlines capture additional market share.
• Consumers can benefit from booking during off-peak hours or on less popular days of the week to take advantage of lower fares. Avoiding weekends and choosing flights with stopovers could lead to significant cost savings.
• Future enhancements could include incorporating real-time data to adapt the model for dynamic pricing, where ticket prices fluctuate based on live demand and competition. Additionally, using more granular temporal data (e.g., hour of booking) may further improve model accuracy.
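The temporal drivers identified above (flight duration, departure time, day of the week) are derived from raw date and time fields, as described in Section III.C. The following is a minimal sketch of such a derivation using only the Python standard library. The function name `derive_temporal_features`, the input formats, the rule that an arrival time at or before the departure time means a next-day arrival, and the exact hour ranges treated as "peak" are all illustrative assumptions; the paper does not specify them.

```python
from datetime import datetime, timedelta

# Hypothetical peak windows (24-hour clock). The paper only says
# "morning and evening", so these exact ranges are an assumption.
PEAK_HOURS = set(range(6, 10)) | set(range(17, 21))


def derive_temporal_features(journey_date: str, dep_time: str, arr_time: str) -> dict:
    """Derive numeric features from a "DD/MM/YYYY" date and two "HH:MM" times."""
    departure = datetime.strptime(f"{journey_date} {dep_time}", "%d/%m/%Y %H:%M")
    arrival = datetime.strptime(f"{journey_date} {arr_time}", "%d/%m/%Y %H:%M")
    # Assumption: an arrival clock time at or before the departure time
    # means the flight lands on the following day.
    if arrival <= departure:
        arrival += timedelta(days=1)

    duration_minutes = int((arrival - departure).total_seconds() // 60)
    return {
        "journey_day": departure.day,
        "journey_month": departure.month,
        "journey_year": departure.year,
        "day_of_week": departure.weekday(),  # Monday = 0 ... Sunday = 6
        "is_weekend": int(departure.weekday() >= 5),
        "dep_hour": departure.hour,
        "dep_minute": departure.minute,
        "duration_minutes": duration_minutes,
        "is_peak_departure": int(departure.hour in PEAK_HOURS),
    }


# Made-up example flight: departs late on a Sunday, lands after midnight.
features = derive_temporal_features("24/03/2019", "22:20", "01:10")
print(features)
```

For this example call, the 22:20 departure and 01:10 arrival give a duration of 170 minutes, and the journey date falls on a Sunday, so the weekend flag is set. Keeping each derived value as a plain integer means the result can be fed to any of the models without further encoding.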

Table 1. Performance Comparison of Machine Learning Models for Flight Price Prediction

Model                            MAE      MSE            RMSE
Random Forest                    725.34   969,687.60     984.67
Gradient Boosting                742.12   993,524.43     996.76
Support Vector Machines (SVM)    782.13   1,032,183.67   1,016.54
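
To make the columns of Table 1 concrete, the sketch below computes MAE, MSE, RMSE, and R² for a small set of made-up fares. The function name `regression_metrics` and the sample values are illustrative assumptions and are unrelated to the study's dataset; in practice, a library such as scikit-learn provides equivalent functions.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R² for two equal-length sequences of prices."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R² = 1 - (residual sum of squares / total sum of squares).
    # Raises ZeroDivisionError if all actual values are identical.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up fares for illustration only (not from the paper's dataset).
actual_fares = [4000.0, 6000.0, 8000.0, 10000.0]
predicted_fares = [4500.0, 5500.0, 8000.0, 11000.0]

metrics = regression_metrics(actual_fares, predicted_fares)
print(metrics)
```

In this toy example the MAE is 500 while the RMSE is about 612: the single error of 1,000 accounts for two thirds of the total squared error, which is why RMSE sits above MAE whenever errors are unevenly sized.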


V. CONCLUSION

This study presents a machine learning-based approach to predict flight prices, aiming to provide insights into pricing trends and help customers make informed decisions when booking flights. The investigation into multiple machine learning models, including Random Forest, Gradient Boosting, and Support Vector Machines (SVM), revealed that ensemble methods like Random Forest and Gradient Boosting outperform simpler models in terms of both accuracy and robustness.

The performance analysis demonstrated that Random Forest was the best overall model, achieving the lowest error rates and the highest predictive power, as evidenced by its superior MAE, RMSE, and R² scores. Gradient Boosting also performed well, though it was slightly less accurate than Random Forest. In contrast, SVM and Linear Regression models struggled to capture the complexity of flight pricing, leading to higher error rates and lower accuracy.

The feature importance analysis provided valuable insights into the factors influencing ticket prices. Key drivers included the airline type, flight duration, number of stops, and departure time, all of which had significant effects on pricing. Flights with fewer stops, premium airlines, and peak-hour departures were generally more expensive. These findings can assist both customers in making cost-effective travel choices and airlines in refining their pricing strategies.

A. Future Enhancements
While the results are promising, there are several areas where future improvements can be made to further enhance the accuracy and applicability of the flight price prediction model. Below are some potential future changes:

• Incorporation of Real-Time Data:
One of the limitations of the current model is its reliance on historical data. Future models could integrate real-time data to capture dynamic pricing in real-world environments. By including real-time information such as demand spikes, weather conditions, or competitor pricing, the model could adapt more quickly to sudden fluctuations in fare prices.

• Dynamic Pricing and Live Updates:
Flight prices change frequently due to a range of factors such as demand, seasonality, and promotional offers. A dynamic model that continuously updates based on real-time data would provide more accurate predictions. This could be achieved through the integration of streaming data platforms, allowing the model to refresh its predictions as new data becomes available.

• Handling External Factors:
Currently, the model only accounts for features available in the dataset (e.g., flight duration, number of stops). Future models could incorporate external factors such as fuel prices, economic indicators, or geopolitical events, which also influence flight prices. By considering these additional variables, the model can better reflect the broader context in which airlines set fares.

• Improved Feature Engineering:
While feature engineering in this study included important variables such as flight duration, time of day, and number of stops, additional features could be extracted. For example, incorporating more granular temporal features (e.g., hour of booking, time until departure) or capturing the impact of promotional periods (e.g., flash sales, holiday discounts) could improve the model’s predictive performance.

• Incorporating Customer Behavior Data:
Another potential enhancement is the integration of customer behavior data. By incorporating information such as search history, customer preferences, and booking habits, the model could provide more personalized predictions. This would allow airlines to tailor pricing strategies to specific customer segments, enhancing both revenue management and customer satisfaction.

• Hybrid Model Approaches:
While Random Forest and Gradient Boosting performed well, future work could explore hybrid models that combine the strengths of multiple algorithms. For example, stacking models, in which the predictions of several models are combined using a meta-learner, could further enhance accuracy by leveraging the different strengths of various machine learning approaches.

• Global Applicability and Dataset Expansion:
The current model was trained on a dataset limited to certain routes and airlines. Expanding the dataset to include international flights, more airlines, and diverse routes could make the model more generalizable. By capturing a broader spectrum of flight data, the model would be better equipped to handle diverse flight markets and pricing behaviors across regions.

• Explainability and Interpretability:
Although the feature importance analysis provided insights into key drivers of flight prices, future work could focus on improving model interpretability. Techniques such as SHAP (SHapley Additive exPlanations) could be employed to better explain individual predictions and provide actionable insights into why a particular price was predicted, making the model more transparent for end-users and industry stakeholders.

B. Future Remarks
In conclusion, this study successfully developed a flight price prediction system that uses machine learning models to forecast ticket prices with reasonable accuracy. The results demonstrate that ensemble learning techniques, particularly Random Forest, are well-suited for this type of regression task, offering superior performance compared to simpler models like Linear Regression and SVM.

Moving forward, enhancements such as incorporating real-time data, handling dynamic pricing, and expanding the feature set will further improve the model’s accuracy and utility. By continuously refining these methods and incorporating more sophisticated techniques, we can build a robust system capable of predicting flight prices with greater precision, benefiting both consumers and airlines in the ever-evolving aviation industry.

REFERENCES

[1]. Kakaraparthi, A., & Karthick, V. (2022). A Secure and Cost-Effective Platform for Employee Management System Using Lightweight Standalone Framework over Diffie Hellman’s Key Exchange Algorithm. ECS Transactions, 107(1), 13663–13674. doi:10.1142/S0217590821500521.
[2]. Tziridis, K., Kalampokas, Th., & Papakostas, G. A. (2017). Airfare Prices Prediction Using Machine Learning Techniques. 25th European Signal Processing Conference (EUSIPCO). doi:10.23919/EUSIPCO.2017.8081387.
[3]. Groves, W., & Gini, M. (2013). An Agent for Optimizing Airline Ticket Purchasing. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems (pp. 593–600). doi:10.5555/2484920.2485049.
[4]. Brown, N., & Taylor, J. (2004). Air Fare: Stories, Poems & Essays on Flight. Sarabande Books.
[5]. Lok, J. C. (2018). Prediction Factors Influence Airline Fuel Price Changing Reasons. International Journal of Forecasting, 34(3), 453–462. doi:10.1016/j.ijforecast.2018.01.006.
[6]. Panwar, B., Dhuriya, G., Johri, P., Yadav, S. S., & Gaur, N. (2021). Stock Market Prediction Using Linear Regression and SVM. 2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE). doi:10.1109/ICACITE51222.2021.9404733.
[7]. Purey, P., & Patidar, A. (2018). Stock Market Close Price Prediction Using Neural Network and Regression Analysis. International Journal of Computer Sciences and Engineering, 6(8), 266–271. doi:10.26438/ijcse/v6i8.266271.
[8]. Ataman, G., & Kahraman, S. (2021). Stock Market Prediction in BRICS Countries Using Linear Regression and Artificial Neural Network Hybrid Models. The Singapore Economic Review, 66(5), 1–19. doi:10.1142/S0217590821500521.
[9]. Chawla, P., Sharma, A., & Kumar, M. (2020). Flight Fare Prediction: A Regression Approach Using Machine Learning Algorithms. International Journal of Advanced Research in Computer Science, 11(1), 112–118. doi:10.26483/ijarcs.v11i1.6478.
[10]. Wilson, P., & Böhme, T. (2020). Revenue Management with Machine Learning: Dynamic Airline Pricing Prediction. Journal of Revenue and Pricing Management, 19(5), 344–362. doi:10.1057/s41272-020-00255-2.