Comparative Analysis of Machine Learning Models For Accurate Flight Price Prediction
III. METHODOLOGY
The dataset used for this research contains 11 features and 10,683 rows, each representing a unique flight. These features include both numerical and categorical variables essential for training the machine learning model. The dataset must be large enough to ensure the model captures the various patterns and trends in ticket pricing.

B. Data Preprocessing
Data preprocessing is critical to ensure the quality and reliability of the dataset. It includes several steps:

Flight Duration:
The flight duration is a critical feature that influences ticket pricing. It is calculated by subtracting the departure time from the arrival time.

Day of the Week:
The day of the week can have a significant impact on flight prices. For instance, weekend flights or flights on holidays may be priced higher due to increased demand. This feature is extracted from the date of journey.
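The two derived features described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the function names, the `HH:MM` time format, the `DD/MM/YYYY` date format, and the rule that an arrival earlier than the departure means a next-day landing are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Weekday names indexed by datetime.weekday() (Monday = 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def flight_duration_minutes(dep_time: str, arr_time: str) -> int:
    """Return arrival minus departure in minutes for 'HH:MM' strings."""
    dep = datetime.strptime(dep_time, "%H:%M")
    arr = datetime.strptime(arr_time, "%H:%M")
    if arr < dep:
        # Assumption: an arrival clock time earlier than the departure
        # means the flight lands on the following day.
        arr += timedelta(days=1)
    return int((arr - dep).total_seconds() // 60)


def day_of_week(date_of_journey: str) -> str:
    """Return the weekday name for a 'DD/MM/YYYY' date (assumed format)."""
    parsed = datetime.strptime(date_of_journey, "%d/%m/%Y")
    return WEEKDAYS[parsed.weekday()]


def is_weekend(date_of_journey: str) -> bool:
    """Return True when the journey date falls on Saturday or Sunday."""
    parsed = datetime.strptime(date_of_journey, "%d/%m/%Y")
    return parsed.weekday() >= 5
```

For example, `flight_duration_minutes("22:20", "01:10")` treats the flight as overnight, and `day_of_week("24/03/2019")` maps the journey date to a weekday name that can then be encoded as a categorical feature.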
Random Forest: MSE = 969,687.60, RMSE = 984.67
Gradient Boosting: MSE = 993,524.43, RMSE = 996.76
SVM: MSE = 1,032,183.67, RMSE = 1,016.54
Linear Regression: MSE = 1,075,829.89, RMSE = 1,037.33

Random Forest achieved the lowest MSE and RMSE, indicating fewer significant errors in predicting flight prices compared to the other models. Gradient Boosting was very close in terms of MSE and RMSE, showing its effectiveness in managing complex patterns in the data. SVM and Linear Regression had significantly higher values, demonstrating that these models may not be suitable for capturing the intricate non-linear relationships present in flight pricing data.

R-Squared (R²)
R² measures how well the model explains the variance in the target variable (flight prices). An R² value close to 1 indicates that the model explains most of the variability in the data, while a value closer to 0 indicates poor predictive power.

feature. Premium airlines tend to have higher ticket prices due to the services they offer.
Flight Duration: The total duration of the flight played a key role in determining the ticket price. Longer flights generally corresponded to higher fares, reflecting the additional operational costs involved.
Number of Stops: Non-stop flights were typically more expensive than flights with layovers. This aligns with the general preference for convenience, where direct flights are often priced higher.
Departure Time: The time of day at which the flight departs also influenced ticket prices. Flights during peak hours, such as early mornings and evenings, were generally more expensive compared to off-peak times.
Day of the Week: Flights scheduled for weekends or holidays were generally priced higher, likely due to increased demand.
Source and Destination: Certain source and destination pairs consistently showed higher prices, likely due to the popularity of specific routes. For instance, flights between major cities or tourist destinations tended to have higher fares.
Table 1. Performance Comparison of Machine Learning Models for Flight Price Prediction
Model                            MAE       MSE             RMSE
Random Forest                    725.34    969,687.60      984.67
Gradient Boosting                742.12    993,524.43      996.76
Support Vector Machines (SVM)    782.13    1,032,183.67    1,016.54
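
The metrics reported in Table 1, together with R², follow directly from their standard definitions. The sketch below computes them with the Python standard library only. The function name `regression_metrics` is hypothetical, and the prices in the usage note are made-up illustrative values, not data from this study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the price

    # R^2 = 1 - SS_res / SS_tot, where SS_tot is the spread of the
    # true prices around their mean.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
```

As a usage example with illustrative values, `regression_metrics([3000, 5000, 7000, 9000], [3500, 4500, 7500, 8500])` yields an MAE and RMSE of 500 and an R² of 0.95. Because RMSE is the square root of MSE, it is expressed in the same units as the ticket price, which makes it easier to interpret than MSE when comparing models.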
REFERENCES