
Predicting Flight Delays

Anuneet Anand Mohnish Agrawal Rhythm Patel


2018022 2018053 2018083

Abstract

The growth of the aviation sector has made flight delays more common across the world. They cause inconvenience to travellers and incur monetary losses for airlines. We analysed the various factors responsible for flight delays and applied machine learning models such as Random Forest, XGBoost, Logistic Regression, Decision Tree and Naive Bayes to predict whether a given flight would be delayed or not. The XGBoost Classifier performed exceptionally well, giving an accuracy of 0.88 and an AUC of 0.93. GitHub: github.com/rhythm-patel/Predicting-Flight-Delays

1. Introduction

There has been a remarkable expansion in commercial aviation in the past decade, with more people preferring air travel for a fast and comfortable journey. However, flight delays have become quite common across the world with the growth of the aviation sector. Besides inconvenience to travellers, flight delays have a negative impact on the economy. Airline companies incur substantial monetary losses and suffer a fall in reputation if their flights are delayed often. Unforeseen delays also have a cascading effect on various other sectors. According to a report by the Joint Economic Committee of the United States Congress, the total cost of flight delays to the US economy was over $40 billion, with $19 billion to the airlines, $12 billion to the passengers and around $10 billion to other industries. Delayed flights also pose environmental concerns: they consumed an additional 740 million units of jet fuel and released over 7 million metric tonnes of additional carbon dioxide [1]. Thus, the prediction of flight delays is a crucial task.

This project aims at analysing the factors responsible for flight delays and designing a machine learning model to predict them. We classify a flight as 'Delayed' if its arrival delay is more than 3 minutes. Predicting delays can help customers choose the best flight for their journey and help airlines identify flaws in their organisation.

2. Literature Review

A significant amount of research has been done in the field of air-traffic control and commercial aviation. Many researchers have attempted to solve this problem and have presented different machine learning approaches.

Researchers at the School of Computer, Wuhan Vocational College of Software and Engineering, developed a multiple linear regression model and compared its performance with other models such as Naive Bayes and C4.5. The predicted delay value was classified into two classes, with flights having a delay of more than 30 minutes being classified as 'Delayed'. The model achieved 79% accuracy, and it was observed that weather was not a significant feature for classification, except in extreme conditions. [2]

Researchers at Nova Information Management School, Portugal, used various data-mining techniques along with a Knowledge Discovery in Databases (KDD) approach and then trained Decision Tree, Logistic Regression and Multi-Layer Perceptron models. The SMOTE technique gave better results than the under-sampling technique. Features such as distance and month were found to be insignificant. The Multi-Layer Perceptron gave an accuracy of 85% and emerged as the best model for prediction. [3]

Researchers at the State University of New York at Binghamton, New York, and the Defense Sciences Institute, Turkish Military Academy, Ankara, introduced the DMP-ANN model for prediction of defects by applying it to the system of air traffic control. A traditional ANN had a hard time handling nominal variables and had to convert them to 1-of-N encoding, which reduced performance. Results showed that the new ANN outperformed the traditional ANN in terms of error (RMSE) as well as training time. However, since the number of layers increases with the connections, complexity becomes a limitation. The model was used to predict flight delays at JFK airport and gave remarkable results. [4]

3. Dataset

We used the 2015 Flight Delays and Cancellations Dataset, collected and published by the U.S. DOT's Bureau of Transportation Statistics, for training and testing our prediction models. The data from the three files were merged into a single dataset. The merged dataset consisted of over 5 million samples and had information about airlines, airports, flight schedules, flight routes, delays and cancellation reasons. Most of the columns had an excellent filling factor, except those related to cancellation reasons.

3.1. Cleaning and Feature Extraction

We selected only the data for January, keeping the computational complexity in mind. Since we were concerned only with flight delay prediction, we dropped the columns related to cancellation. Some other irrelevant and redundant columns like Tail Number, Airport Name, Airline Name and Wheels On were also dropped. The Date-Time format was corrected, and rows with NaN values were removed. We also removed outliers which had an arrival delay of more than 500 minutes. A dataset of dimensions 456685 rows × 11 columns was obtained with the following features.

• Categorical Features: Airline, Origin, Destination

• Numerical Features: Distance, Taxi Out, Departure Delay, Day of Week, Arrival Delay

• Date/Time Features: Date, Scheduled Departure, Scheduled Arrival

3.2. Exploratory Data Analysis

We identified the busiest airports and the air traffic shares of different airlines, and visualised the mean delays of airlines on different days of the week. It was observed that most flights were delayed on Sunday and Monday. Alaska Airlines Inc. and Delta Air Lines Inc. were the best performing airlines, whereas Frontier Airlines Inc. and American Eagle Airlines Inc. were often delayed.

Figure 1. Heat Map of Mean Airline Delays Vs. Days

It was found that the mean flight delay dropped with an increase in route distance. However, flights travelling distances over 3000 km showed a spike in delays.

Figure 2. Distance Vs. Delay

We also conducted a statistical study of the different numerical features in the dataset and derived important insights about their distributions. It was found that almost 68% of the flights had no delay or a delay of less than 3 minutes.

Figure 3. Distribution Of Arrival Delay

3.3. Pre-Processing

We had to pre-process the data before training and evaluating different models on it. The categorical features had to be converted into numerical form so that they could be interpreted by the machine learning models.

• One-Hot Encoding was used to handle the Airline feature [14 possible values].

• Label Encoding was used to handle the Origin and Destination features [300+ possible values].

• Date and Time were expressed using sine and cosine values of their individual attributes to incorporate their cyclic nature.

• Keeping 3 minutes as a threshold, the samples with an arrival delay of less than 3 minutes were assigned class 0, whereas the remaining samples were assigned class 1.

The features were scaled using Standard Scaler. We obtained the NumPy arrays X and Y with dimensions (456855, 32) and (456855, ) respectively.
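The cyclic encoding and the class assignment described in Section 3.3 can be illustrated with a minimal sketch. The function names and the choice of a 24-hour period are assumptions made for this illustration and are not taken from the project code; only the 3-minute threshold comes from the text above.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value (e.g. hour of day) to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def delay_label(arrival_delay_minutes, threshold=3):
    """Class 0 for an arrival delay below the threshold, class 1 otherwise."""
    return 0 if arrival_delay_minutes < threshold else 1

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hypothetical departure hours: 23:00 and 01:00 are only two hours apart,
# so their encodings should lie closer together than 23:00 and 12:00.
late = cyclic_encode(23, 24)
early = cyclic_encode(1, 24)
noon = cyclic_encode(12, 24)
```

With a plain integer hour, 23 and 1 would appear 22 units apart; on the unit circle they are neighbours, which is the property the sine and cosine features are meant to preserve.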

4. Methodology

To validate the performance of our models, we performed a train-val-test split of 70:10:20. We trained Random Forest Classifier, XGBoost Classifier, Naive Bayes Classifier, Decision Tree and Logistic Regression, from the Sklearn library, on the training set. Grid Search CV was used to find optimal hyper-parameters for the models. Appropriate graphs and metrics were generated for the analysis, and the performance of the different models was compared.

Gaussian Naive Bayes Classifier

Gaussian Naive Bayes Classifier is a supervised probabilistic machine learning model for continuous data, based on Bayes' Theorem. It assumes independence among the input features and calculates posterior probabilities of the different classes to make predictions. The Gaussian conditional probability is given in (1).

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \qquad (1)$$

Logistic Regression

Logistic Regression uses the sigmoid function to transform the output into a probability value which can be used to classify the output. The hypothesis function for logistic regression is given in (2).

$$h(\theta) = \frac{1}{1 + e^{-\theta^{T} x}} \qquad (2)$$

Decision Tree

A decision tree has a structure similar to a flowchart, in which each internal node is a test on a feature and each leaf node is a class label. Information gain is used to select the feature to split on at each step while building the tree. The information gain can be calculated using entropy loss (3) or the Gini impurity index (4). A decision tree is easy to understand and can handle both numerical and categorical data.

$$I_H = -\sum_i p_i \log_2(p_i) \qquad (3)$$

$$I_G = 1 - \sum_i p_i^2 \qquad (4)$$

XGBoost Classifier

The XGBoost algorithm is an ensemble learning algorithm which uses Boosting. The XGBoost Classifier creates a number of Decision Trees, like the Random Forest Classifier. However, unlike Random Forest, the trees are trained sequentially such that each new model corrects the errors of the previous one.

Random Forest Classifier

The Random Forest algorithm is an ensemble learning algorithm which uses Bagging. The Random Forest Classifier creates a number of Decision Trees which are trained on bootstrap samples of the training set by randomly choosing a set of attributes for each split. Each Decision Tree in the forest independently predicts the class of a test sample. The votes from all the trees in the forest are aggregated to decide the class of the test sample.

5. Results

Gaussian Naive Bayes Classifier

Gaussian Naive Bayes Classifier was used and an accuracy of 0.820 was obtained. This was a reasonably good result to start with.

Logistic Regression

Logistic Regression with default parameters was also tried on the given data and an accuracy of 0.864 was achieved.

Decision Tree

Decision Tree with default parameters was tried on the given data and an accuracy of 0.806 was achieved.

XGBoost Classifier

The XGBoost Classifier gave a base accuracy of 0.881. It was further tuned by testing different values for the hyper-parameters class_weight, learning_rate, max_depth and n_estimators using Grid Search CV.

Figure 4. Tuning XGB Classifier

The best accuracy obtained after tuning was 0.884.

Optimal Hyper-Parameters

• class_weight: None

• learning_rate: 0.1

• max_depth: 7

• n_estimators: 300
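The idea behind Grid Search CV can be sketched in a few lines: every combination in a parameter grid is scored by cross-validation and the best-scoring combination is kept. To stay self-contained, the sketch below swaps in a tiny one-feature k-nearest-neighbour classifier instead of XGBoost. The toy data, the function names and the grid values are all hypothetical, and this is not the project's tuning code.

```python
from itertools import product

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def cv_accuracy(data, k, n_folds):
    """Mean validation accuracy of k-NN over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        val = data[fold::n_folds]  # every n_folds-th sample is held out
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in val)
        scores.append(hits / len(val))
    return sum(scores) / n_folds

def grid_search(data, grid, n_folds=3):
    """Try every combination in the grid and keep the best mean CV score."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(data, n_folds=n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical toy samples: (departure delay in minutes, delayed label)
toy = [(-5, 0), (0, 0), (2, 0), (4, 0), (6, 1), (9, 1),
       (12, 1), (15, 1), (1, 0), (20, 1), (3, 1), (7, 0)]
best_params, best_score = grid_search(toy, {"k": [1, 3, 5]})
```

The cost grows with the product of the grid sizes times the number of folds, which is one reason to restrict the data volume before tuning.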

Classification Report
(LR: Logistic Regression, GNB: Gaussian Naive Bayes, DT: Decision Tree, XGB: XGBoost, RF: Random Forest)

                Predicted class 0                     Predicted class 1
                LR     GNB    DT     XGB    RF        LR     GNB    DT     XGB    RF
True Class 0    54439  51967  48808  54878  54719     3372   5844   9003   2933   3092
True Class 1    8913   10430  8360   8231   8338      23734  22217  24287  24416  24259
Precision       0.86   0.83   0.85   0.87   0.86      0.88   0.79   0.72   0.89   0.88
Recall          0.94   0.89   0.84   0.95   0.94      0.73   0.68   0.74   0.75   0.74
F1-score        0.90   0.85   0.84   0.90   0.89      0.79   0.73   0.73   0.81   0.80
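As a worked check, the class 1 scores for XGBoost can be recomputed from the confusion-matrix counts in the table using the standard definitions of precision, recall and F1-score. The function name is an assumption made for this sketch; the four counts are taken from the XGB columns above.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1-score for the class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# XGBoost counts from the table, treating class 1 (delayed) as positive:
# true class 0 -> predicted 0 / 1, true class 1 -> predicted 0 / 1
tn, fp = 54878, 2933
fn, tp = 8231, 24416

precision, recall, f1 = class_metrics(tp, fp, fn)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.89 0.75 0.81
```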

Random Forest Classifier

The Random Forest Classifier gave a base accuracy of 0.873. It was further tuned by testing different values for the hyper-parameters class_weight, criterion and n_estimators using Grid Search CV.

Figure 5. Tuning Random Forest Classifier

The best accuracy obtained after tuning was 0.875.

Optimal Hyper-Parameters

• class_weight: None

• criterion: entropy

• n_estimators: 400

5.1. Analysis

Figure 6. Receiver Operating Characteristic Curves for Random Forest Classifier, Decision Tree, XGBoost Classifier, Naive Bayes Classifier and Logistic Regression

From the Receiver Operating Characteristic (ROC) curves given in Figure 6, we note that the Area Under Curve score was highest for XGBoost Classifier and Random Forest Classifier and lowest for Decision Tree.

The classification report table shows the combined Confusion Matrix for the different classifiers along with Precision, Recall and F1-score. It helps us to compare the models on these metrics and find the optimal model for our problem statement.

Random Forest Classifier and XGBoost Classifier reported the best values across all judging parameters. For our problem statement, we need a model with a high value of Recall, so that delays are flagged in advance and the concerned professionals and passengers can adjust accordingly. Thus, XGBoost, with a recall value of 0.75, seems to be the best choice among the five classifiers.

Figure 7. Feature Importance

We calculated feature importance from our XGBoost model and observed some interesting insights. The most important features for predicting flight delays were found to be Departure Delay and Taxi-Out. We observe that the distance from origin to destination did not contribute much to the delay.
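The Area Under Curve score used to compare the classifiers in Figure 6 can be read as the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. A minimal sketch of that pairwise definition follows; the function name and the four example scores are hypothetical and are not outputs of the models in this report.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of delay for four flights
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

This pairwise form is quadratic in the number of samples, so it is suited to illustration rather than to a test set of the size used here.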

6. Conclusion

We can conclude that Random Forest Classifier and XGBoost Classifier performed well in predicting flight delays. XGBoost Classifier had a slight edge over Random Forest Classifier and fitted the problem statement better, with an accuracy score of 0.88 and an AUC score of 0.93.

We hope that our work will contribute to society and be utilised by the concerned parties. In future, we aim to design an application for the same purpose.

6.1. Member Contribution

• Anuneet: Literature survey & data collection, Training different models, Collecting and merging datasets, Statistical analysis, Exploratory Data Analysis, Training different types of models, Code Formatting, Report writing.

• Mohnish: Literature survey & data collection, Training different types of models to find the best one for our project, Plotting graphs for data analysis, Optimizing the model through hyper-parameter tuning, Report writing.

• Rhythm: Literature survey & data collection, Preprocessing the data, Testing & evaluating different models, Training different types of models, Fine-tuning them, Gaining insights with visualizations, Report writing.

6.2. Learning

In this project, we studied the various factors responsible for flight delays and gained some interesting insights by conducting Exploratory Data Analysis on the dataset. We were successful in pre-processing the dataset and applying machine learning models like Random Forest Classifier, Decision Tree, XGBoost Classifier, Gaussian Naive Bayes Classifier and Logistic Regression to our problem. The project allowed us to go beyond the scope of regular class teaching and learn more about various machine learning algorithms. We developed analytical skills to study the data and choose machine learning algorithms that fit it well.

References

[1] Joint Economic Committee Majority Staff. Your flight has been delayed again. Technical report, 2008.
[2] Yi Ding. Predicting flight delay based on multiple linear regression. In IOP Conference Series: Earth and Environmental Science, volume 81, pages 1–7, 2017.
[3] Roberto Henriques and Inês Feiteira. Predictive modelling: flight delays and associated factors, Hartsfield–Jackson Atlanta International Airport. Procedia Computer Science, 138:638–645, 2018.
[4] Sina Khanmohammadi, Salih Tutun, and Yunus Kucuk. A new multilevel input layer artificial neural network for predicting flight delays at JFK airport. Procedia Computer Science, 95:237–244, 2016.
