Report
The growth of the aviation sector has made flight delays more common across the world. They inconvenience travellers and cause monetary losses to the airlines. We analysed the various factors responsible for flight delays and applied machine learning models such as Random Forest, XGBoost, Logistic Regression, Decision Tree and Naive Bayes to predict whether a given flight would be delayed or not. The XGBoost Classifier performed best, giving an accuracy of 0.88 and an AUC of 0.93.
GitHub: github.com/rhythm-patel/Predicting-Flight-Delays

1. Introduction

There has been a remarkable expansion in commercial aviation in the past decade, with more people preferring air travel for a fast and comfortable journey. However, flight delays have become quite common across the world with the growth of the aviation sector. Besides inconvenience to the travellers, flight delays have a negative impact on the economy. Airline companies incur substantial monetary losses and suffer a fall in reputation if their flights are often delayed. Unforeseen delays also have a cascading effect on various other sectors. According to a report by the Joint Economic Committee of the United States Congress, the total cost of flight delays to the US economy was over $40 billion, with $19 billion to the airlines, $12 billion to the passengers and around $10 billion to other industries. Delayed flights also pose environmental concerns: they consumed an additional 740 million units of jet fuel and released over 7 million metric tonnes of additional carbon dioxide [1]. Thus, the prediction of flight delays is a crucial task.

This project aims at analysing the factors responsible for flight delays and designing a machine learning model to predict them. We classify a flight as 'Delayed' if the arrival delay of the flight is more than 3 minutes. Prediction of delays can help customers choose the best flight for their journey and help airlines identify flaws in their organisation.

2. Related Work

A significant amount of research has been done in the field of air-traffic control and commercial aviation. Many researchers have attempted to solve this problem and have presented different machine learning approaches.

The researchers at the School of Computer, Wuhan Vocational College of Software and Engineering, developed a multiple linear regression model and compared its performance with other models such as Naive Bayes and C4.5. The predicted delay value was classified into two classes, with flights having a delay of more than 30 minutes being classified as 'Delayed'. The model achieved 79% accuracy, and it was observed that weather was not a significant feature for classification, except in extreme conditions. [2]

The researchers at Nova Information Management School, Portugal used various data-mining techniques along with a Knowledge Discovery Database approach and then trained Decision Tree, Logistic Regression and Multi-Layer Perceptron models. The SMOTE technique gave better results than the under-sampling technique. Features like distance and month were found to be insignificant. The Multi-Layer Perceptron gave an accuracy of 85% and emerged as the best model for prediction. [3]

The researchers at State University of New York at Binghamton, New York and Defense Sciences Institute, Turkish Military Academy, Ankara, introduced the DMP-ANN model for prediction of defects by applying it to the system of air traffic control. Traditional ANNs had a hard time handling nominal variables, which had to be converted to 1-of-N encoding, and this reduced the performance. Results showed that the new ANN outperformed the traditional ANN in terms of error (RMSE) as well as the time required for training. However, since the number of layers increases with connections, complexity becomes a limitation. The model was used to predict flight delays at JFK airport and gave remarkable results. [4]
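To make the 1-of-N encoding mentioned for traditional ANNs concrete, the sketch below expands one nominal column into indicator vectors. The airline codes are made-up illustrative values and `one_of_n` is a hypothetical helper, not code from the cited work [4].

```python
def one_of_n(values):
    """Map each nominal value to a 1-of-N (one-hot) indicator vector."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per value
        vectors.append(row)
    return categories, vectors


# Hypothetical airline codes, for illustration only
airlines = ["AA", "DL", "UA", "AA"]
categories, encoded = one_of_n(airlines)
print(categories)  # ['AA', 'DL', 'UA']
print(encoded)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each nominal column grows into as many inputs as it has distinct values, so a column with many categories widens the network input considerably.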
3. Dataset

We used the 2015 Flight Delays and Cancellations Dataset, collected and published by the U.S. DOT's Bureau of Transportation Statistics, for training and testing our prediction models. The data from the three files were merged into a single dataset. The merged dataset consisted of over 5 million samples and had information about airlines, airports, flight schedules, flight routes, delays and cancellation reasons. Most of the columns had an excellent filling factor, except those related to cancellation reasons.

It was found that the mean flight delay dropped with an increase in route distance. However, flights travelling distances over 3000 km showed a spike in delays.
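The route-distance observation amounts to grouping flights into distance bins and averaging the delay in each bin. A minimal sketch of that computation follows; the rows are made-up (distance in km, arrival delay in minutes) pairs rather than records from the dataset, and the 1000 km bin width is an assumption chosen for illustration.

```python
from collections import defaultdict


def mean_delay_by_distance(rows, bin_km=1000):
    """Average arrival delay per distance bin; rows are (distance_km, delay_min)."""
    totals = defaultdict(lambda: [0.0, 0])  # bin start -> [sum of delays, count]
    for distance_km, delay_min in rows:
        start = int(distance_km // bin_km) * bin_km
        totals[start][0] += delay_min
        totals[start][1] += 1
    return {start: s / n for start, (s, n) in sorted(totals.items())}


# Made-up rows for illustration only; not taken from the dataset
rows = [(300, 12.0), (800, 10.0), (1500, 8.0), (1900, 6.0), (3200, 15.0)]
result = mean_delay_by_distance(rows)
print(result)  # {0: 11.0, 1000: 7.0, 3000: 15.0}
```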
4. Methodology

To validate the performance of our models, we performed a train-val-test split of 70:10:20. We trained Random Forest Classifier, XGBoost Classifier, Naive Bayes Classifier, Decision Tree and Logistic Regression, from the Sklearn library, on the training set. Grid Search CV was used to find optimal hyper-parameters for the models. Appropriate graphs and metrics were generated for the analysis, and the performance of the different models was compared.

Random Forest Classifier

The Random Forest algorithm is an ensemble learning algorithm which uses Bagging. The Random Forest Classifier creates a number of Decision Trees which are trained on bootstrap samples of the training set, randomly choosing a set of attributes for each split. Each Decision Tree in the forest independently predicts the class of a test sample. The votes from all the trees in the forest are aggregated to decide the class of the test sample.
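The bagging-and-voting procedure used by Random Forest can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Sklearn implementation used in the project: each "tree" here is a one-split stump on a single randomly chosen feature (a real Random Forest samples a feature subset at every split of a full tree), and the rows, seed and tree count are made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the bagging step)."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows, rng):
    """Toy stand-in for a decision tree: threshold one randomly chosen feature."""
    feature = rng.randrange(len(rows[0][0]))
    threshold = sum(x[feature] for x, _ in rows) / len(rows)
    left = [y for x, y in rows if x[feature] <= threshold]
    right = [y for x, y in rows if x[feature] > threshold]
    left_label = Counter(left).most_common(1)[0][0]
    right_label = Counter(right).most_common(1)[0][0] if right else left_label

    def predict(x):
        return left_label if x[feature] <= threshold else right_label

    return predict


def majority_vote(votes):
    """Aggregate the per-tree predictions into one class."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, x):
    return majority_vote([tree(x) for tree in trees])


# Made-up rows: ((departure delay in minutes, distance), label), 1 = delayed
rows = [((5, 300), 0), ((40, 900), 1), ((2, 1500), 0),
        ((55, 400), 1), ((30, 2000), 1), ((0, 700), 0)]
rng = random.Random(0)
trees = [train_stump(bootstrap_sample(rows, rng), rng) for _ in range(25)]
prediction = forest_predict(trees, (50, 800))
print(prediction)
```

Because every stump sees a different bootstrap sample and feature, individual votes can disagree; the aggregated vote is what the forest reports.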
Gaussian Naive Bayes Classifier

Gaussian Naive Bayes Classifier is a supervised probabilistic machine learning model for continuous data, based on Bayes' Theorem. It assumes independence among the input features and calculates posterior probabilities of the different classes to make predictions. The Gaussian conditional probability is given in (1).

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \tag{1}$$

Logistic Regression

Logistic Regression uses the sigmoid function to transform the output into a probability value which can be used to classify the output. The hypothesis function for logistic regression is given in (2).

$$h(\theta) = \frac{1}{1 + e^{-\theta^T x}} \tag{2}$$

Decision Tree

A decision tree has a structure similar to a flowchart, in which each internal node is a test on a feature and each leaf node is a class label. Information gain is used to select the feature to split on at each step while building the tree. The information gain can be calculated using entropy (3) or the Gini impurity index (4), where p_i is the fraction of samples of class i at a node. A decision tree is easy to understand and can handle both numerical and categorical data.

$$H = -\sum_i p_i \log_2 p_i \tag{3}$$

$$I_G = 1 - \sum_i p_i^2 \tag{4}$$

5. Results

Gaussian Naive Bayes Classifier

The Gaussian Naive Bayes Classifier was used and an accuracy of 0.820 was obtained. This was a reasonably good result to start with.

Logistic Regression

Logistic Regression with default parameters was also tried on the given data and an accuracy of 0.864 was achieved.

Decision Tree

Decision Tree with default parameters was tried on the given data and an accuracy of 0.806 was achieved.

XGBoost Classifier

The XGBoost Classifier gave a base accuracy of 0.881. It was further tuned by testing different values for the hyper-parameters class_weight, learning_rate, max_depth and n_estimators using Grid-Search CV. The best accuracy obtained after tuning was 0.884.
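Grid-Search CV evaluates every combination in a hyper-parameter grid by cross-validation and keeps the best-scoring one. The sketch below shows that loop in plain Python as an illustration only: the model is a toy 1-D k-nearest-neighbour classifier standing in for the real classifiers, and the data, the grid and the three contiguous folds are all assumptions, not the project's settings.

```python
from itertools import product


def knn_predict(train_X, train_y, x, k):
    """Toy 1-D k-nearest-neighbour classifier (stand-in for a real model)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: abs(p[0] - x))[:k]
    return int(2 * sum(label for _, label in nearest) > k)


def cv_accuracy(X, y, k, n_folds=3):
    """Mean validation accuracy over n_folds contiguous folds."""
    n, fold = len(X), len(X) // n_folds
    scores = []
    for f in range(n_folds):
        lo, hi = f * fold, (n if f == n_folds - 1 else (f + 1) * fold)
        train = [i for i in range(n) if not lo <= i < hi]
        tr_X, tr_y = [X[i] for i in train], [y[i] for i in train]
        hits = [knn_predict(tr_X, tr_y, X[i], k) == y[i] for i in range(lo, hi)]
        scores.append(sum(hits) / len(hits))
    return sum(scores) / len(scores)


# Made-up data: departure delay in minutes and a 0/1 'delayed' label
X = [0, 2, 5, 7, 10, 20, 25, 30, 40, 55, 3, 35]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

grid = {"k": [1, 3, 5]}  # hypothetical hyper-parameter grid
best_params, best_score = None, -1.0
for combo in product(*grid.values()):
    params = dict(zip(grid.keys(), combo))
    score = cv_accuracy(X, y, **params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, round(best_score, 3))
```

Adding a second key to `grid` extends the search to all pairs of values, which is why the cost of a grid search grows multiplicatively with each tuned hyper-parameter.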
Classification Report
(LR: Logistic Regression, GNB: Gaussian Naive Bayes, DT: Decision Tree, XGB: XGBoost, RF: Random Forest. Precision, Recall and F1-score are per class: class 0 on the left, class 1 on the right.)

              Predicted class 0                     Predicted class 1
              LR     GNB    DT     XGB    RF        LR     GNB    DT     XGB    RF
True Class 0  54439  51967  48808  54878  54719     3372   5844   9003   2933   3092
True Class 1  8913   10430  8360   8231   8338      23734  22217  24287  24416  24259
Precision     0.86   0.83   0.85   0.87   0.86      0.88   0.79   0.72   0.89   0.88
Recall        0.94   0.89   0.84   0.95   0.94      0.73   0.68   0.74   0.75   0.74
F1-score      0.90   0.85   0.84   0.90   0.89      0.79   0.73   0.73   0.81   0.80
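The per-class scores in the table follow directly from the confusion-matrix counts. As a worked check, the sketch below recomputes them for the Logistic Regression (LR) column; `class_metrics` is a hypothetical helper written for this illustration, and only the four LR counts are taken from the table.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# LR counts from the table: (true class, predicted class) -> count
true0_pred0, true0_pred1 = 54439, 3372
true1_pred0, true1_pred1 = 8913, 23734

# Class 0 treated as the positive class
p0, r0, f0 = class_metrics(tp=true0_pred0, fp=true1_pred0, fn=true0_pred1)
# Class 1 treated as the positive class
p1, r1, f1 = class_metrics(tp=true1_pred1, fp=true0_pred1, fn=true1_pred0)

total = true0_pred0 + true0_pred1 + true1_pred0 + true1_pred1
accuracy = (true0_pred0 + true1_pred1) / total

print(round(p0, 2), round(r0, 2), round(f0, 2))  # 0.86 0.94 0.9
print(round(p1, 2), round(r1, 2), round(f1, 2))  # 0.88 0.73 0.79
print(round(accuracy, 3))                        # 0.864
```

The recomputed accuracy of 0.864 agrees with the Logistic Regression accuracy reported in the Results section.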
• criterion: entropy
• n_estimators: 400
5.1. Analysis
6. Conclusion

We can conclude that the Random Forest Classifier and the XGBoost Classifier performed well in predicting flight delays. The XGBoost Classifier had a slight edge over the Random Forest Classifier and fitted the problem statement better, with an accuracy score of 0.88 and an AUC score of 0.93.

We hope that our work will contribute to society and be utilised by the concerned parties. As future work, we aim to design an application for the same.

6.2. Learning

In this project, we studied the various factors responsible for flight delays and gained some interesting insights by conducting Exploratory Data Analysis on the dataset. We were successful in pre-processing the dataset and applying machine learning models like Random Forest Classifier, Decision Tree, XGBoost Classifier, Gaussian Naive Bayes Classifier and Logistic Regression to our problem. The project allowed us to go beyond the scope of the regular class teachings and learn more about various machine learning algorithms. We developed analytical skills to study the data and come up with appropriate machine learning algorithms which would fit well on the data.

References

[1] Joint Economic Committee Majority Staff. Your flight has been delayed again. Technical report, 2008.
[2] Yi Ding. Predicting flight delay based on multiple linear regression. In IOP Conference Series: Earth and Environmental Science, volume 81, pages 1–7, 2017.
[3] Roberto Henriques and Inês Feiteira. Predictive modelling: flight delays and associated factors, Hartsfield–Jackson Atlanta International Airport. Procedia Computer Science, 138:638–645, 2018.
[4] Sina Khanmohammadi, Salih Tutun, and Yunus Kucuk. A new multilevel input layer artificial neural network for predicting flight delays at JFK airport. Procedia Computer Science, 95:237–244, 2016.