
This document summarizes research on predicting flight delays using machine learning models. Previous studies developed regression and classification models using factors like distance, weather, and month. The current study analyzes delay factors, applies models like Random Forest, XGBoost, and Naive Bayes to a dataset of over 5 million flights, and finds XGBoost achieves 88% accuracy in predicting delays over 3 minutes.


Predicting Flight Delays

Anuneet Anand (2018022), Mohnish Agrawal (2018053), Rhythm Patel (2018083)

Abstract

The growth of the aviation sector has made flight delays more common across the world. They cause inconvenience to travellers and monetary losses to airlines. We analysed the various factors responsible for flight delays and applied machine learning models such as Random Forest, XGBoost, Logistic Regression, Decision Tree and Naive Bayes to predict whether a given flight would be delayed or not. The XGBoost Classifier performed best, giving an accuracy of 0.88 and an AUC of 0.93. GitHub: github.com/rhythm-patel/Predicting-Flight-Delays

1. Introduction

There has been a remarkable expansion in commercial aviation in the past decade, with more people preferring air travel for a fast and comfortable journey. However, flight delays have become quite common across the world with the growth of the aviation sector. Besides inconvenience to travellers, flight delays have a negative impact on the economy. Airline companies incur substantial monetary losses and suffer a fall in reputation if their flights are delayed often. Unforeseen delays also have a cascading effect on various other sectors. According to a report by the Joint Economic Committee of the United States Congress, the total cost of flight delays to the US economy was over $40 billion, with $19 billion to the airlines, $12 billion to the passengers and around $10 billion to other industries. Delayed flights also raise environmental concerns: they consumed an additional 740 million units of jet fuel and released over 7 million metric tonnes of additional Carbon Dioxide [1]. Thus, the prediction of flight delays is a crucial task.

This project aims at analysing the factors responsible for flight delays and designing a machine learning model to predict them. We classify a flight as 'Delayed' if the arrival delay of the flight is more than 3 minutes. Prediction of delays can help customers choose the best flight for their journey and help airlines identify flaws in their organisation.

2. Literature Review

A significant amount of research work has been done in the field of air-traffic control and commercial aviation. Many researchers have made attempts to solve this problem and have presented different machine learning approaches.

The researchers at School of Computer, Wuhan Vocational College of Software and Engineering, developed a multiple linear regression model and compared its performance with other models such as Naive-Bayes and C4.5. The predicted delay value was classified into two classes, with flights having a delay of more than 30 minutes being classified as 'Delayed'. The model achieved 79% accuracy, and it was observed that weather was not a significant feature for classification, except in extreme conditions. [2]

The researchers at Nova Information Management School, Portugal used various data-mining techniques along with a Knowledge Discovery Database approach and then trained Decision Tree, Logistic Regression and Multi-Layer Perceptron models. The SMOTE technique gave better results than the Under-sampling technique. Features like distance and month were found to be insignificant. The Multi-Layer Perceptron gave an accuracy of 85% and emerged as the best model for prediction. [3]

The researchers at State University of New York at Binghamton, New York and Defense Sciences Institute, Turkish Military Academy, Ankara, introduced the DMP-ANN model for prediction of defects by applying it to the system of air traffic control. Traditional ANN had a hard time handling nominal variables and had to convert them to 1-of-N encoding, which reduced the performance. Results showed that this new ANN outperformed traditional ANN in terms of error (RMSE) as well as the time required for training. However, since the number of layers increases with connections, complexity becomes a limitation. The model was used to predict flight delays at JFK airport and gave remarkable results. [4]

3. Dataset

We used the 2015 Flight Delays and Cancellations Dataset, collected and published by U.S. DOT's Bureau of Transportation Statistics, for training and testing our prediction models. The data from the three files were merged into a single dataset. The merged dataset consisted of over 5 million samples and had information about airlines, airports, flight schedules, flight routes, delays and cancellation reasons. Most of the columns had an excellent filling factor, except those related to cancellation reasons.

3.1. Cleaning and Feature Extraction

We selected the data for January, keeping the computational complexity in mind. Since we were concerned only with flight delay prediction, we dropped the columns related to cancellation. Some other irrelevant and redundant columns like Tail Number, Airport Name, Airline Name and Wheels On were dropped. The Date-Time format was corrected, and rows with NaN values were removed. We also removed outliers which had an arrival delay of more than 500 minutes. A dataset of dimensions 456685 rows × 11 columns was obtained with the following features.

• Categorical Features: Airline, Origin, Destination

• Numerical Features: Distance, Taxi Out, Departure Delay, Day of Week, Arrival Delay

• Date/Time Features: Date, Scheduled Departure, Scheduled Arrival

3.2. Exploratory Data Analysis

We identified the busiest airports and air traffic shares of different airlines and visualised the mean delays of airlines on different days of the week. It was observed that most flights were delayed on Sunday and Monday. Alaska Airlines Inc. and Delta Air Lines Inc. were the best performing airlines whereas Frontier Airlines Inc. and American Eagle Airlines Inc. were often delayed.

Figure 1. Heat Map of Mean Airline Delays Vs. Days

It was found that the mean delay of flights dropped with an increase in route distance. However, flights travelling distances over 3000 km showed a spike in delays.

Figure 2. Distance Vs. Delay

We also conducted a statistical study of different numerical features in the dataset and derived important insights about their distributions. It was found that almost 68% of the flights had no delay or a delay of less than 3 minutes.

Figure 3. Distribution Of Arrival Delay

3.3. Pre-Processing

We had to pre-process the data before training and evaluating different models on it. The categorical features had to be converted into numerical form so that they could be interpreted by the machine learning models.

• One-Hot Encoding was used to handle the Airline feature [14 possible values].

• Label Encoding was used to handle Origin and Destination features [300+ possible values].

• Date and Time were expressed using sine and cosine values of their individual attributes to incorporate their cyclic nature.

• Keeping 3 minutes as a threshold, the samples with an arrival delay less than 3 minutes were assigned class 0 whereas the remaining samples were assigned class 1.

The features were scaled using Standard Scaler. We obtained the numpy arrays X and Y with dimensions (456855, 32) and (456855, ) respectively.
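The cyclic encoding and the class labelling described in the pre-processing step can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the project: the helper names cyclic_encode, label_delay and distance, and the choice of a 24-hour period, are assumptions made for the example.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic value (e.g. hour of day) to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def label_delay(arrival_delay_minutes, threshold=3):
    """Return 0 if the arrival delay is below the threshold, else 1."""
    return 0 if arrival_delay_minutes < threshold else 1


def distance(p, q):
    """Euclidean distance between two encoded points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


if __name__ == "__main__":
    # Hours 23 and 0 are adjacent on the clock, so their encodings are close,
    # whereas hours 0 and 12 land on opposite sides of the circle.
    print(distance(cyclic_encode(23, 24), cyclic_encode(0, 24)))
    print(distance(cyclic_encode(0, 24), cyclic_encode(12, 24)))
    # Hypothetical arrival delays in minutes, mapped to class 0 or class 1.
    print([label_delay(d) for d in (-5, 2, 3, 40)])
```

With a raw hour value, 23:00 and 00:00 would sit at opposite ends of the scale. The sine and cosine pair keeps them adjacent, which is the cyclic property the encoding is meant to preserve.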

4. Methodology

To validate the performance of our models, we performed a train-val-test split of 70:10:20. We trained Random Forest Classifier, XGBoost Classifier, Naive Bayes Classifier, Decision Tree and Logistic Regression, from the Sklearn library, on the training set. Grid Search CV was used to find optimal hyper-parameters for the models. Appropriate graphs and metrics were generated for the analysis, and the performance of the different models was compared.

Random Forest Classifier

Random Forest algorithm is an ensemble learning algorithm which uses Bagging. The Random Forest Classifier creates a number of Decision Trees which are trained on bootstrap samples of the training set by randomly choosing a set of attributes for each split. Each Decision Tree in the forest independently predicts the class of a test sample. The votes from all the trees in the forest are aggregated to decide the class of the test sample.
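The bagging and voting steps of a Random Forest can be illustrated with a small sketch. It is not the Sklearn implementation: bootstrap_sample and majority_vote are hypothetical helper names, and the votes are made-up values standing in for the predictions of individual trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree in bagging."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Aggregate per-tree class predictions into one class label."""
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    training_rows = list(range(10))  # stand-in for training samples
    print(bootstrap_sample(training_rows, rng))
    # Hypothetical predictions from five trees for one test flight.
    print(majority_vote([1, 0, 1, 1, 0]))
```

Because each tree sees a different bootstrap sample, the trees make partly independent errors, and the vote tends to cancel them out.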
Gaussian Naive Bayes Classifier

Gaussian Naive Bayes Classifier is a supervised probabilistic machine learning model for continuous data, based on the Bayes Theorem. It assumes independence among the input features and calculates posterior probabilities of different classes to make predictions. The Gaussian conditional probability is as given in (1).

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \tag{1}$$

Logistic Regression

Logistic Regression uses the sigmoid function to transform the output into a probability value which can be used to classify the output. The hypothesis function for logistic regression is given in (2).

$$h(\theta) = \frac{1}{1 + e^{-\theta^T x}} \tag{2}$$

Decision Tree

A decision tree has a structure similar to a flowchart in which each internal node is a test on a feature and each leaf node is a class label. Information gain is used to select the feature to split on at each step while building the tree. The information gain can be calculated using entropy (3) or the Gini impurity index (4). A decision tree is easy to understand and can handle both numerical and categorical data.

$$I_H = -\sum_i p_i \log_2(p_i) \tag{3}$$

$$I_G = 1 - \sum_i p_i^2 \tag{4}$$

XGBoost Classifier

XGBoost algorithm is an ensemble learning algorithm which uses Boosting. XGBoost Classifier creates a number of Decision Trees like Random Forest Classifier. However, unlike Random Forest, the trees are trained sequentially such that each new model corrects the errors of the previous one.

5. Results

Gaussian Naive Bayes Classifier

Gaussian Naive Bayes Classifier was used and an accuracy of 0.820 was obtained. This was a reasonably good result to start with.

Logistic Regression

Logistic Regression with default parameters was also tried on the given data and an accuracy of 0.864 was achieved.

Decision Tree

Decision Tree with default parameters was tried on the given data and an accuracy of 0.806 was achieved.

XGBoost Classifier

The XGBoost Classifier gave a base accuracy of 0.881. It was further tuned by testing different values for the hyper-parameters class_weight, learning_rate, max_depth and n_estimators using Grid-Search CV.

Figure 4. Tuning XGB Classifier

The best accuracy obtained after tuning was 0.884.

Optimal Hyper-Parameters

• class_weight: None

• learning_rate: 0.1

• max_depth: 7

• n_estimators: 300
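An exhaustive grid search, as used for tuning here, scores every combination of candidate hyper-parameter values and keeps the best one. The sketch below shows only that enumeration in plain Python. It is not Sklearn's GridSearchCV, which additionally cross-validates each candidate. The candidate values and the toy_score function are assumptions for illustration: toy_score is a hypothetical stand-in for a cross-validated accuracy and does not reproduce any result from this report.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a cross-validated accuracy (not real results)."""
    return -(
        abs(params["learning_rate"] - 0.1)
        + abs(params["max_depth"] - 7) / 10
        + abs(params["n_estimators"] - 300) / 1000
    )


if __name__ == "__main__":
    # Candidate values are assumptions; the report lists only the selected ones.
    grid = {
        "learning_rate": [0.01, 0.1, 0.3],
        "max_depth": [3, 5, 7],
        "n_estimators": [100, 300, 500],
    }
    best, _ = grid_search(grid, toy_score)
    print(best)
```

The cost grows multiplicatively with each added hyper-parameter: three candidates for each of three hyper-parameters already means 27 model fits per cross-validation fold.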

Classification Report
(LR = Logistic Regression, GNB = Gaussian Naive Bayes, DT = Decision Tree, XGB = XGBoost, RF = Random Forest)

                Predicted class 0                     Predicted class 1
                LR     GNB    DT     XGB    RF        LR     GNB    DT     XGB    RF
True Class 0    54439  51967  48808  54878  54719     3372   5844   9003   2933   3092
True Class 1    8913   10430  8360   8231   8338      23734  22217  24287  24416  24259
Precision       0.86   0.83   0.85   0.87   0.86      0.88   0.79   0.72   0.89   0.88
Recall          0.94   0.89   0.84   0.95   0.94      0.73   0.68   0.74   0.75   0.74
F1-score        0.90   0.85   0.84   0.90   0.89      0.79   0.73   0.73   0.81   0.80
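The precision, recall and F1-score rows follow from the confusion-matrix counts in the same table. As a minimal sketch, the function below recomputes the class 1 metrics for the XGB column. The function name binary_metrics is an assumption for this example, and class 1 ('Delayed') is treated as the positive class.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy plus precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # XGB column of the classification report:
    # true 0 / predicted 0 = 54878, true 0 / predicted 1 = 2933,
    # true 1 / predicted 0 = 8231,  true 1 / predicted 1 = 24416.
    metrics = binary_metrics(tn=54878, fp=2933, fn=8231, tp=24416)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these counts give a precision of 0.89, a recall of 0.75 and an F1-score of 0.81 for class 1, matching the XGB entries in the table, with an accuracy of 0.88.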

Random Forest Classifier

The Random Forest Classifier gave a base accuracy of 0.873. It was further tuned by testing different values for the hyper-parameters class_weight, criterion and n_estimators using Grid-Search CV.

Figure 5. Tuning Random Forest Classifier

The best accuracy obtained after tuning was 0.875.

Optimal Hyper-Parameters

• class_weight: None

• criterion: entropy

• n_estimators: 400

Figure 6. Receiver Operating Characteristic Curves for Random Forest Classifier, Decision Tree, XGBoost Classifier, Naive Bayes Classifier and Logistic Regression

From the Receiver Operating Characteristic curves given in Figure 6, we note that the Area Under Curve score was highest for XGBoost Classifier and Random Forest Classifier and lowest for Decision Tree.

The classification report table shows the combined Confusion Matrix for the different classifiers along with Precision, Recall and F1-score. It helps us compare the models on these metrics and find the optimal model for our problem statement.

Random Forest Classifier and XGBoost Classifier reported the best values across all judging parameters. For our problem statement, we need a model with a high value of Recall, so that delays are flagged in advance and the concerned professionals and passengers can adjust accordingly. Thus, XGBoost with a recall value of 0.75 seems to be the best choice among the five classifiers.

5.1. Analysis

Figure 7. Feature Importance

We calculated feature importance from our XGBoost model and observed some interesting insights. The most important features for predicting flight delays were found to be Departure Delay and Taxi-Out. We observe that distance from origin to destination did not contribute much to the delay.

6. Conclusion

We can conclude that Random Forest Classifier and XGBoost Classifier performed well in predicting flight delays. XGBoost Classifier had a slight edge over Random Forest Classifier and fitted the problem statement better, with an accuracy score of 0.88 and an AUC score of 0.93.

We hope that our work will contribute to society and be utilised by the concerned parties. In future, we aim to design an application for the same.

6.1. Member Contribution

• Anuneet: Literature survey & data collection, Training different models, Collecting and merging datasets, Statistical analysis, Exploratory Data Analysis, Training different types of models, Code Formatting, Report writing.

• Mohnish: Literature survey & data collection, Training different types of models to find the best one for our project, Plotting graphs for data analysis, Optimizing the model through hyper-parameter tuning, Report writing.

• Rhythm: Literature survey & data collection, Preprocessing the data, Testing & evaluating different models, Training different types of models, Fine-tuning them, Gaining insights with visualizations, Report writing.

6.2. Learning

In this project, we studied the various factors responsible for flight delays and gained some interesting insights by conducting Exploratory Data Analysis on the dataset. We were successful in pre-processing the dataset and applying machine learning models like Random Forest Classifier, Decision Tree, XGBoost Classifier, Gaussian Naive Bayes Classifier and Logistic Regression to our problem. The project allowed us to go beyond the scope of regular classroom teaching and learn more about various machine learning algorithms. We developed analytical skills to study the data and come up with appropriate machine learning algorithms which would fit the data well.

References

[1] Joint Economic Committee Majority Staff. Your flight has been delayed again. Technical report, 2008.

[2] Yi Ding. Predicting flight delay based on multiple linear regression. In IOP Conference Series: Earth and Environmental Science, volume 81, pages 1–7, 2017.

[3] Roberto Henriques and Inês Feiteira. Predictive modelling: flight delays and associated factors, Hartsfield–Jackson Atlanta International Airport. Procedia Computer Science, 138:638–645, 2018.

[4] Sina Khanmohammadi, Salih Tutun, and Yunus Kucuk. A new multilevel input layer artificial neural network for predicting flight delays at JFK airport. Procedia Computer Science, 95:237–244, 2016.
