Flightdelay
by
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI - 600119
MARCH - 2022
www.sathyabama.ac.in
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of BALAMURUGAN.R (REG NO:
38120019) and BARANIDARAN.GT (REG NO: 38120021), who carried out the project entitled
“PREDICTING FLIGHT DELAYS WITH ERROR CALCULATION USING MACHINE LEARNED
CLASSIFIERS” as a team under my supervision from November 2021 to April 2022.
Internal Guide
V. MARIA ANU M.E., Ph.D.,
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
Flight delay is a major problem in the aviation sector. During the last two decades, the
growth of the aviation sector has caused air traffic congestion, which in turn has caused flight
delays. Flight delays not only cause financial losses but also negatively impact the environment.
They also cause significant losses for airlines operating commercial flights, so airlines do
everything possible to prevent or avoid delays and cancellations. In this paper, using machine
learning models such as Logistic Regression, Decision Tree Regression, Bayesian Ridge, Random
Forest Regression and Gradient Boosting Regression, we predict whether the arrival of a
particular flight will be delayed or not.
TABLE OF CONTENTS
CHAPTER NO TITLE
1 INTRODUCTION
1.1 Machine Learning
3 DEVELOPMENT PROCESS
3.1 Requirement Analysis
5 TESTING
5.1 Types of Tests
5.1.1 Unit Testing
5.1.2 Integration Testing
5.1.3 Functional Test
5.1.4 System Test
A. Sample Code
B. Publication Report
IEEE Copyright Form
IEEE Acceptance Form
Reference
LIST OF FIGURES
S.NO NAME
1.1 Block Diagram
LIST OF ABBREVIATION
ML Machine Learning
CHAPTER 1
INTRODUCTION
Flight delay has been studied extensively in recent years. The growing demand
for air travel has led to an increase in flight delays. According to the Federal Aviation
Administration (FAA), the aviation industry loses more than $3 billion a year due to flight
delays and, as per the Bureau of Transportation Statistics (BTS), there were 860,646 arrival
delays in 2016. The reasons for delays of commercial scheduled flights include air traffic
congestion, the yearly increase in passenger numbers, maintenance and safety problems, adverse
weather conditions, and the late arrival of the aircraft to be used for the next flight. In the
United States, the FAA considers a flight delayed when the scheduled and actual arrival times
differ by more than 15 minutes. Since delays have become a serious problem in the United States,
the analysis and prediction of flight delays are being studied in order to reduce these large costs.
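The 15-minute rule described above can be written as a small check. This is a minimal sketch: the function name `is_delayed`, the constant name, and the sample times are illustrative assumptions and are not part of the project code. It also assumes that only late arrivals count as delays.

```python
from datetime import datetime

# Threshold taken from the FAA definition quoted in the text (15 minutes).
DELAY_THRESHOLD_MINUTES = 15

def is_delayed(scheduled: datetime, actual: datetime) -> bool:
    """Return True when the actual arrival is more than 15 minutes after the scheduled arrival."""
    late_by_minutes = (actual - scheduled).total_seconds() / 60
    return late_by_minutes > DELAY_THRESHOLD_MINUTES

# Hypothetical arrival times, for illustration only.
scheduled = datetime(2020, 2, 1, 14, 0)
arrived_10_late = datetime(2020, 2, 1, 14, 10)
arrived_25_late = datetime(2020, 2, 1, 14, 25)

print(is_delayed(scheduled, arrived_10_late))  # 10 minutes late: within the threshold
print(is_delayed(scheduled, arrived_25_late))  # 25 minutes late: beyond the threshold
```

A binary label of this kind is what a delay classifier would be trained to predict.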
these algorithms, the machine builds the logic from the data and predicts the output. Machine
learning has changed our way of thinking about the problem. The block diagram below explains
the working of a machine learning algorithm:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning in which we provide sample labeled data to
the machine learning system in order to train it; on that basis, it predicts the output.
The system builds a model from the labeled data to understand the dataset and learn about
each data point. Once training and processing are done, we test the model by providing
sample data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning is
based on supervision, much as a student learns under the supervision of a teacher. An example
of supervised learning is spam filtering.
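The train-then-test workflow described above can be sketched in a few lines. The example below is an assumption-laden illustration, not the project's model: it uses a 1-nearest-neighbour rule as a stand-in learner, and the feature values and labels are made up.

```python
# Labeled training samples: (feature vector, label). Values are made up for illustration.
train = [
    ((1.0, 1.2), "on_time"),
    ((0.8, 1.0), "on_time"),
    ((3.0, 3.5), "delayed"),
    ((3.2, 2.9), "delayed"),
]

def predict(sample, labeled_data):
    """Predict the label of the closest training sample (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(labeled_data, key=lambda item: squared_distance(item[0], sample))
    return nearest[1]

# Held-out labeled samples, used only to check the predictions.
test = [((0.9, 1.1), "on_time"), ((3.1, 3.0), "delayed")]
correct = sum(predict(features, train) == label for features, label in test)
accuracy = correct / len(test)
print(accuracy)
```

The held-out samples play the role of the "sample data" used to check whether the trained system predicts the expected output.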
Clustering
Association
Logistic Regression is very similar to Linear Regression; the two differ in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is used
for solving classification problems. In Logistic Regression, instead of fitting a regression line,
we fit an "S"-shaped logistic function whose output lies between two limiting values (0 and 1).
The curve of the logistic function indicates the likelihood of something, such as whether cells
are cancerous or not, or whether a mouse is obese or not based on its weight.
Logistic Regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using continuous and discrete datasets. It can be used to
classify observations using different types of data and can easily determine the most
effective variables for the classification. The image below shows the logistic function.
In logistic regression, we use the concept of a threshold value, which decides between
the classes 0 and 1: values above the threshold are assigned to 1, and values below
the threshold are assigned to 0.
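The logistic function and the thresholding step can be illustrated as follows. This is a minimal sketch: the threshold of 0.5 and the choice to assign a value exactly at the threshold to class 1 are assumptions, since the text does not fix either.

```python
import math

def sigmoid(z: float) -> float:
    """S-shaped logistic function; maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z: float, threshold: float = 0.5) -> int:
    """Assign class 1 when the predicted probability reaches the threshold, else class 0."""
    return 1 if sigmoid(z) >= threshold else 0

# z would normally be a weighted sum of the input features; the values here are arbitrary.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Raising the threshold makes the classifier more reluctant to predict class 1, which trades missed positives for fewer false alarms.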
Decision Tree is a supervised learning technique that can be used for both classification
and regression problems, but it is mostly preferred for solving classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make a decision and have multiple branches, whereas leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset. It is a
graphical representation for getting all the possible solutions to a problem/decision based
on given conditions. It is called a decision tree because, similar to a tree, it starts with the
root node, which expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into subtrees.
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the corresponding
attribute of the record (real dataset) and, based on the comparison, follows the branch and
jumps to the next node.
At the next node, the algorithm again compares the attribute value with the sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the algorithm below:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively build new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
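Step-2 and Step-3 above can be sketched with one common attribute selection measure. The example assumes the Gini index (the measure usually associated with CART) as the ASM; the text does not name the measure used, and the toy records, attribute names and labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, labels, attribute):
    """Impurity left after dividing the dataset into one subset per value of `attribute`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    return sum(len(subset) / len(labels) * gini(subset) for subset in subsets.values())

# Toy dataset S (hypothetical attributes and labels).
rows = [
    {"weather": "clear", "weekend": "no"},
    {"weather": "clear", "weekend": "yes"},
    {"weather": "storm", "weekend": "no"},
    {"weather": "storm", "weekend": "yes"},
]
labels = ["on_time", "on_time", "delayed", "delayed"]

# Step-2: the best attribute is the one whose split leaves the lowest impurity.
best_attribute = min(rows[0], key=lambda name: weighted_gini(rows, labels, name))
print(best_attribute)
```

Applying the same selection recursively to each subset, until the subsets are pure or no attributes remain, yields the tree described in Step-5.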
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It is
based on the concept of ensemble learning, which is a process of combining multiple classifiers
to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and predicts the final output based on the majority vote of those
predictions.
A greater number of trees in the forest generally leads to higher accuracy and helps to
prevent overfitting.
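The voting step described above can be sketched as follows. The three "trees" are hypothetical hand-written rules standing in for trained decision trees, and the flight record is made up; only the aggregation by majority vote is the point of the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Stand-ins for trained decision trees: each maps a flight record to a class.
# These rules are hypothetical and only illustrate the voting step.
trees = [
    lambda flight: "delayed" if flight["dep_hour"] >= 18 else "on_time",
    lambda flight: "delayed" if flight["distance"] > 1500 else "on_time",
    lambda flight: "delayed" if flight["day_of_week"] == 5 else "on_time",
]

flight = {"dep_hour": 19, "distance": 900, "day_of_week": 5}
votes = [tree(flight) for tree in trees]
prediction = majority_vote(votes)
print(votes, prediction)
```

For regression, the same structure applies with the vote replaced by the average of the individual tree outputs.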
Below are some points that explain why we should use the Random Forest algorithm:
It can also maintain accuracy when a large proportion of data is missing.
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions with each tree created in the first phase.
The working process can be explained in the steps and diagram below:
[2] Title: Flight delay prediction for commercial air transport: A deep learning approach
Authors: Sobhan Asian - 2019
Description:
This study analyzes high-dimensional data from Beijing International Airport and
presents a practical flight delay prediction model. Following a multifactor approach, a novel
deep belief network method is employed to mine the inner patterns of flight delays. Support
vector regression is embedded in the developed model to perform a supervised fine-tuning within
the presented predictive architecture. The proposed method has proven to be highly capable of
handling the challenges of large datasets and capturing the key factors influencing delays. This
ultimately enables connected airports to collectively alleviate delay propagation within their
network through collaborative efforts (e.g., delay prediction synchronization).
improving airport transportation efficiency, rationally scheduling flights and improving
passenger comfort. In this paper, the CatBoost model is applied to the U.S. domestic airline
on-time performance data from the U.S. Transportation Administration, combined with the
characteristics of the model, to determine the influencing factors and to predict the arrival delays
of flights within the United States. The accuracy, precision and some other criteria of the model
are given to evaluate the performance on the data. A good result is obtained: the accuracy reaches
80.44% in this case. Finally, the specific delay time is predicted; the support vector
machine gives the best prediction result for the flight delay time, with an average prediction error
of 9.733 min, which has a certain reference value for flight operation and airport scheduling.
[5] Title: A statistical approach to predict flight delay using gradient boosted decision tree
Authors: Suvojit Manna, Sanket Biswas, Riyanka Kundu, Somnath Rakshit
Description:
Supervised machine learning algorithms have been used extensively in different domains
of machine learning such as pattern recognition, data mining and machine translation. Similarly,
there have been several attempts to apply various supervised or unsupervised machine learning
algorithms to the analysis of air traffic data. However, no attempts have been made to apply
Gradient Boosted Decision Tree, one of the well-known machine learning tools, to analyse air
traffic data. This paper investigates the effectiveness of this successful paradigm in air traffic
delay prediction tasks. By using this regression model based on the machine learning
paradigm, an accurate and sturdy prediction model has been built which enables an elaborate
analysis of the patterns in air traffic delays. Gradient Boosted Decision Tree has shown great
accuracy in modeling sequential data. With the help of this model, day-to-day sequences of the
departure and arrival flight delays of an individual airport can be predicted efficiently. In this
paper, the model has been implemented on the Passenger Flight on-time Performance data taken
from the U.S. Department of Transportation to predict the arrival and departure delays of flights. It
shows better accuracy compared to other methods.
CHAPTER 2
PROBLEM STATEMENT
2.1 EXISTING SYSTEM
According to the existing system, the expected growth in air travel demand and its
positive correlation with economic factors highlight the significant contribution of the
aviation community to the U.S. economy. On-time operations play a key role in airline
performance and passenger satisfaction. Thus, an accurate investigation of the variables that
cause delays is of major importance. The application of machine learning techniques in data
mining has seen explosive growth in recent years and has garnered interest from a broadening
variety of research domains, including aviation. The existing study employed a support vector
machine (SVM) model to explore the non-linear relationship between flight delay outcomes. Its
findings provide insight for a better understanding of the causes of departure delays and the
impacts of various explanatory factors on flight delay patterns.
The primary contribution of the existing study is to investigate the possibility of using SVM
models to analyse the causes of flight delay and to investigate flight delay patterns. The
maximum precision achieved was 79.7%, with a gradient booster as the classifier, on a limited
data set.
2.1.1 DISADVANTAGES
The training and testing datasets are too small to predict flight delays.
Prediction accuracy is low when the model is tested on a new dataset.
Predicting the flight delay error takes a long time.
CHAPTER 3
DEVELOPMENT PROCESS
3.1.1 PYTHON:
Python is a dynamic, high-level, free, open-source and interpreted programming language.
It supports object-oriented programming as well as procedure-oriented programming. In
Python, we do not need to declare the type of a variable because it is a dynamically typed
language.
For example, in x = 10, x refers to an int, but the same name can later refer to a value of any
other type, such as a string.
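The x = 10 example above can be run directly. This short sketch only illustrates dynamic typing; the list name `observed_types` is introduced here for illustration.

```python
observed_types = []

x = 10                                    # x refers to an int; no type declaration is needed
observed_types.append(type(x).__name__)

x = "ten"                                 # the same name can later refer to a str
observed_types.append(type(x).__name__)

print(observed_types)
```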
Python is an interpreted, object-oriented programming language similar to Perl that has gained
popularity because of its clear syntax and readability. Python is said to be relatively easy to learn
and portable, meaning its statements can be interpreted on a number of operating systems,
including UNIX-based systems, Mac OS, MS-DOS, OS/2, and various versions of Microsoft
Windows. Python was created by Guido van Rossum, a former resident of the Netherlands,
whose favourite comedy group at the time was Monty Python's Flying Circus. The source code is
freely available and open for modification and reuse. Python has a significant number of users.
Features in Python
There are many features in Python, some of which are listed below:
Easy to code
Free and Open Source
Object-Oriented Language
GUI Programming Support
High-Level Language
Extensible Language
Portable Language
Integrated Language
Interpreted Language
Operating System: Windows 7 or later
Simulation Tool: Anaconda (Jupyter notebook)
Documentation: MS Office
HARDWARE REQUIREMENTS:
3.3 SYSTEM DESIGN
3.3.2 ADVANTAGES
The system collects a large dataset to train the model and to calculate the flight delay
prediction error.
Speed and accuracy scores are high.
The prediction rate is high.
the date and time and flight labelling along with airline airborne time are also provided. The data
set consists of 25 columns and 59986 rows. Fig. 1 shows some of the fields of the original
dataset. Many rows had missing and null values, so the data must be pre-processed for
later use.
The methodology here uses the supervised learning technique to take advantage of
having both the scheduled and the real arrival time. Initially, several candidate algorithms with a
light computational cost were considered, and the best candidate was then refined
for the final model.
We develop a system that predicts a delay in flight departure based on certain
parameters. We train our model for forecasting using various attributes of a particular flight, such
as arrival performance, flight summaries, origin/destination, etc.
Fig. 2. Snapshot of Dataset
Module 2: Pre-Processing
Once the data has been extracted from the Twitter source as a dataset, it has to
be passed to the classifier. The classifier cleans the dataset by removing redundant data such as
stop words and emoticons, so that non-textual content is identified and removed before
the analysis.
Text pre-processing is an essential part of any NLP system. Its significance is:
To reduce the indexing (or data) file size of the text documents
1. Stop words account for 20-30% of the total word count in a particular text document
2. Stemming may reduce the indexing size by as much as 40-50%
To improve the efficiency and effectiveness of the IR system
Tokenization:
Tokenization is the process of breaking a stream of text into words, phrases,
symbols, or other meaningful elements called tokens. The aim of tokenization is the
exploration of the words in a sentence. The list of tokens becomes the input for further processing
such as parsing or text mining. Tokenization is useful both in linguistics (where it is
a form of text segmentation) and in computer science, where it forms part of lexical
analysis. Textual data is only a block of characters at the beginning.
All processes in information retrieval require the words of the data set. Hence, a parser
requires the tokenization of documents. This may sound trivial because the text
is already stored in machine-readable formats. However, some problems remain,
such as the removal of punctuation marks. Other characters like brackets and hyphens
require processing as well.
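A basic tokenizer that also handles the punctuation, brackets and hyphens mentioned above can be sketched with a regular expression. The function name `tokenize`, the lowercasing step, and the sample sentence are assumptions made for illustration; the project's own cleaning code appears in the appendix.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation, brackets and hyphens."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical input sentence.
tokens = tokenize("Flight AA-102 (to Chennai) was delayed, again!")
print(tokens)
```

Note that splitting on hyphens breaks "AA-102" into two tokens; whether that is desirable depends on the application.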
Stop words are very commonly used words like ‘and’, ‘are’, ‘this’,
etc. They are not useful in the classification of documents, so they must be removed. However,
the development of such a stop word list is difficult and inconsistent between textual
sources. This process also reduces the text data and improves the system performance.
Every text document contains these words, which are not necessary for text mining
applications.
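Stop word removal amounts to filtering tokens against a list. In the sketch below the stop word set is a small hypothetical sample (it includes the three words quoted above); real lists are much longer and, as noted, vary between sources.

```python
# Small illustrative stop word list; not the list used in the project.
STOP_WORDS = {"and", "are", "this", "the", "is", "a", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = ["this", "flight", "and", "the", "crew", "are", "late"]
filtered = remove_stop_words(tokens)
print(filtered)
```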
Stemming usually refers to a crude heuristic process that chops off the ends of words in
the hope of reducing them to a common base form correctly most of the time, and often
involves the removal of derivational affixes.
Lemmatization usually refers to doing this properly with the use of a vocabulary
and morphological analysis of words, normally aiming to remove inflectional endings
only and to return the base or dictionary form of a word, which is known as the lemma.
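The contrast between the two techniques can be shown with a toy example. The suffix list and the lemma dictionary below are hypothetical and far smaller than what a real stemmer or lemmatizer uses; the sketch only illustrates that stemming chops suffixes by rule while lemmatization looks words up in a vocabulary.

```python
SUFFIXES = ("ing", "ed", "s")                            # illustrative suffix list, not a full stemmer
LEMMAS = {"flew": "fly", "flown": "fly", "went": "go"}   # hypothetical mini vocabulary

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Return the dictionary form of a word when the vocabulary knows it."""
    return LEMMAS.get(word, word)

stems = [crude_stem(w) for w in ["delayed", "delaying", "delays", "flew"]]
lemmas = [lemmatize(w) for w in ["flew", "flown", "delay"]]
print(stems)
print(lemmas)
```

The irregular form "flew" passes through the stemmer unchanged, whereas the vocabulary lookup maps it to its lemma.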
Fig 4. Flight Delay error
Day
Departure Delay
Airline
Flight Number
Destination Airport
Origin Airport
Day of Week
Taxi out
Module 4: Evaluation
After pre-processing and feature extraction, 60% of the dataset was selected for
training and 40% for testing. For error calculation, we use the
scikit-learn metrics. The results are divided into two sections, Departure Delay (A) and Arrival
Delay (B).
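The 60/40 split described above can be sketched without any external library. The function name `split_dataset`, the seed, and the placeholder rows are assumptions; the appendix code uses scikit-learn's `train_test_split` for the same purpose.

```python
import random

def split_dataset(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and cut it into a training part and a testing part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))            # placeholder for 100 pre-processed flight records
train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting avoids a split that simply follows the order of the file, for example by date.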
A. Departure Delay
Our results for departure delay compare different machine learning models, i.e. Logistic
Regression, Decision Tree Regressor, Bayesian Ridge, Random Forest Regressor and Gradient
Boosting Regressor, on various evaluation metrics. Further, we compare the models
on one evaluation metric at a time.
B. Arrival Delay
Our results for arrival delay compare different machine learning models, i.e. Logistic
Regression, Decision Tree Regressor, Bayesian Ridge, Random Forest Regressor and Gradient
Boosting Regressor, on various evaluation metrics. Further, we compare the models
on one evaluation metric at a time.
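Comparing models on one metric at a time can be sketched as below, using Mean Squared Error and Mean Absolute Error. The delay values, the predictions, and the model names `model_a` and `model_b` are made up for illustration and are not results from this project; the helper functions mirror the definitions of the scikit-learn metrics of the same name.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical delays in minutes and predictions from two placeholder models.
actual = [10, 0, 35, 5]
predictions = {
    "model_a": [12, 1, 30, 5],
    "model_b": [20, 10, 20, 0],
}

scores = {
    name: {
        "mse": mean_squared_error(actual, predicted),
        "mae": mean_absolute_error(actual, predicted),
    }
    for name, predicted in predictions.items()
}

# Compare the models on one metric at a time; lower is better for both metrics.
best_by_mse = min(scores, key=lambda name: scores[name]["mse"])
best_by_mae = min(scores, key=lambda name: scores[name]["mae"])
print(scores, best_by_mse, best_by_mae)
```

Because MSE squares each error, it penalises a few large misses more heavily than MAE does, so the two metrics need not always agree on the best model.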
CHAPTER 4
SYSTEM STUDY
burden to the company. For feasibility analysis, some understanding of the major requirements for
the system is essential. Three key considerations involved in the feasibility analysis are
i. Economic Feasibility
ii. Technical Feasibility
iii. Social Feasibility
4.1.1. Economic Feasibility
This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and development of
the system is limited, so the expenditures must be justified. The developed system is well
within the budget, and this was achieved because most of the technologies used are freely available.
Only the customized products had to be purchased.
4.1.2. Technical Feasibility
This study is carried out to check the technical feasibility, that is, the technical requirements
of the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system
must have modest requirements, as only minimal or no changes are required for implementing
this system.
4.1.3. Social Feasibility
This aspect of the study checks the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must instead accept it as a necessity. The level of acceptance by the
users depends solely on the methods that are employed to educate users about the system and to
make them familiar with it. Their level of confidence must be raised so that they are also able to
offer constructive criticism, which is welcome, as they are the final users of the system.
CHAPTER 5
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub-assemblies, assemblies and/or a finished product. It is the process of
exercising software with the intent of ensuring that the software system meets its requirements
and user expectations and does not fail in an unacceptable manner. There are various types of
tests. Each test type addresses a specific testing requirement.
5.1. TYPES OF TESTS
5.1.1. UNIT TESTING
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application. It is done after the completion of an individual unit and before integration. This is
structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at component level and test a specific business process, application, and/or
system configuration. Unit tests ensure that each unique path of a business process performs
accurately to the documented specifications and contains clearly defined inputs and expected results.
5.1.2. INTEGRATION TESTING
Integration tests are designed to test integrated software components to determine whether they
actually run as one program. Testing is event driven and is more concerned with the basic outcome
of screens or fields. Integration tests demonstrate that, although the components were individually
satisfactory, as shown by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that arise from the
combination of components.
5.1.3. FUNCTIONAL TEST
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
RESULT :
Fig 6. The week's day
Fig.7. Graph of accuracy algorithms
iii. A training algorithm for learning classification and regression rules from
data is referred to as logistic regression.
CHAPTER 6
Machine learning algorithms were applied progressively and successively to predict flight
arrival and departure delays. We built five models for this purpose. For each evaluation metric
considered, we examined the values of the models and compared them. We found that:
In Departure Delay, the Random Forest Regressor was observed to be the best model, with a Mean
Squared Error of 2261.8 and a Mean Absolute Error of 24.1, which are the minimum values found for
these respective metrics. In Arrival Delay, the Random Forest Regressor was again the best model,
with a Mean Squared Error of 3019.3 and a Mean Absolute Error of 30.8, which are the minimum
values found for these respective metrics.
In the rest of the metrics, the error of the Random Forest Regressor, although not the
minimum, is still comparatively low. For most metrics, the
Random Forest Regressor gives the best value and thus should be the model selected.
The future scope of this paper can include the application of more advanced, modern and
innovative pre-processing techniques, automated hybrid learning and sampling algorithms, and
deep learning models tuned to achieve better performance. To evolve the predictive model,
additional variables can be introduced, e.g. meteorological statistics could be used in
developing more accurate models for flight delays. In this paper we used data from the US only;
in future, the model can be trained with data from other countries as well. With models that
are complex hybrids of several other models, given appropriate processing
power and larger, more detailed datasets, more accurate predictive models can be
developed. Additionally, the model can be configured for other airports to predict their flight
delays as well, for which data from those airports would need to be incorporated into this
research.
6.2.APPENDIX:
6.2.1 SAMPLE CODE
#import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
#pip install numpy
from sklearn.model_selection import train_test_split
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SimpleRNN, SpatialDropout1D
flight = pd.read_csv("Tweets.csv")
flight
flight.shape
flight.head()
flight.tail()
flight.describe()
flight.info()
flight.isnull().sum()
flight = flight[flight['airline_sentiment_confidence'] > 0.6]
flight
flight = flight[['text', 'airline_sentiment']]
flight.head()
def clean_train_data(x):
    """Normalise one tweet: lowercase it and strip brackets, punctuation, digits and newlines."""
    text = x.lower()
    text = re.sub(r'\[.*?\]', '', text)   # remove text in square brackets
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words containing numbers
    text = re.sub(r'\n', '', text)        # remove newlines
    return text
flight['text'] = flight['text'].apply(clean_train_data)
flight.head()
data = flight.copy()
data
flight = flight[flight['airline_sentiment'] != 'neutral']
flight.head()
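print("Positive:", len(flight[flight['airline_sentiment'] == 'positive']))
print("\nNegative:", len(flight[flight['airline_sentiment'] == 'negative']))
# The tokenizer setup is missing from this listing. The three lines below are an
# assumed reconstruction; the vocabulary size of 2000 is a guess, not the original value.
max_features = 2000
token = Tokenizer(num_words=max_features, split=' ')
token.fit_on_texts(flight['text'].values)
X = token.texts_to_sequences(flight['text'].values)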
X = pad_sequences(X)
X.shape
embed_dim = 128
lstm_out = 196
model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Y = pd.get_dummies(flight['airline_sentiment']).values
Y.shape
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.33, random_state=42)
X_train
X_test
y_train
y_test
batch_size = 25
history = model.fit(X_train, y_train, epochs=10, batch_size=batch_size, verbose=2)
# score = model.predict(X_test)
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size, verbose=2)
print('score', score)
print('accuracy', acc)
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import emoji
#!pip install emoji
!pip install catboost
from sklearn.model_selection import train_test_split, GridSearchCV
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
data = pd.read_csv('2020_feb_flight_delay.csv')
data
data.shape
data.head()
data.tail()
data.info()
data.describe()
data.columns
data.notnull().sum()
#drop unnamed column
data = data.drop(['Unnamed: 9'],axis=1)
data
# distribution of our target
data['DEP_DEL15'].value_counts()
# Split the data into positive and negative
positive_rows = data.DEP_DEL15 == 1.0
data_pos = data.loc[positive_rows]
data_neg = data.loc[~positive_rows]
positive_rows.shape
data_neg.shape
data_pos.shape
data.shape
# Merge the balanced data
data = pd.concat([data_pos, data_neg.sample(n = len(data_pos))], axis = 0)
data
# Shuffle the order of data
data = data.sample(n = len(data)).reset_index(drop = True)
data
data.isna().sum()
data = data.dropna(axis=0)
data
data.info()
# Density plots of selected columns (sns.kdeplot replaces the deprecated sns.distplot)
plt.figure(figsize=(15,5))
sns.kdeplot(data['DISTANCE'], color="b", fill=True)
plt.xlabel("Distance")
plt.ylabel("Density")
plt.title("Distribution of distance")
plt.show()
plt.figure(figsize=(15,5))
sns.kdeplot(data['DEP_TIME'], color="b", fill=True)
plt.xlabel("Departure time")
plt.ylabel("Density")
plt.title("Distribution of departure time")
plt.show()
plt.figure(figsize=(15,5))
sns.kdeplot(data['DEP_DEL15'], color="b", fill=True)
plt.xlabel("Departure delay indicator (DEP_DEL15)")
plt.ylabel("Density")
plt.title("Distribution of departure delay indicator")
plt.show()
plt.figure(figsize=(15,5))
sns.kdeplot(data['DAY_OF_WEEK'], color="b", fill=True)
plt.xlabel("Day of week")
plt.ylabel("Density")
plt.title("Distribution of day of week")
plt.show()
print(f"Average distance for delay {data[data['DEP_DEL15'] == 1]['DISTANCE'].values.mean()} miles")
print(f"Average distance for no delay {data[data['DEP_DEL15'] == 0]['DISTANCE'].values.mean()} miles")
#Count of carriers in the dataset
plt.figure(figsize=(15,10))
sns.countplot(x=data['OP_UNIQUE_CARRIER'], data=data)
plt.xlabel("Carriers")
plt.ylabel("Count")
plt.title("Count of unique carrier")
plt.show()
plt.figure(figsize=(15,10))
sns.countplot(x=data['DAY_OF_WEEK'], data=data)
plt.xlabel("Day of Week")
plt.ylabel("Count")
plt.title("Count of Day of Week")
plt.show()
data = data.rename(columns={'DEP_DEL15':'TARGET'})
data
def label_encoding(categories):
    """
    Map each distinct categorical value to an integer code.
    """
    # Sorting makes the mapping reproducible between runs (set order is arbitrary)
    categories = sorted(set(categories.values))
    mapping = {}
    for idx in range(len(categories)):
        mapping[categories[idx]] = idx
    return mapping
data['OP_UNIQUE_CARRIER'] = data['OP_UNIQUE_CARRIER'].map(label_encoding(data['OP_UNIQUE_CARRIER']))
data.head()
data['ORIGIN'] = data['ORIGIN'].map(label_encoding(data['ORIGIN']))
data.head()
data['DEST'] = data['DEST'].map(label_encoding(data['DEST']))
data.head()
data['TARGET'].value_counts()
X = data[['DAY_OF_MONTH', 'DAY_OF_WEEK', 'OP_UNIQUE_CARRIER', 'ORIGIN',
'DEST', 'DEP_TIME', 'DISTANCE']].values
y = data[['TARGET']].values
X.shape
y.shape
# Splitting Train-set and Test-set
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=41)
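The evaluation loop that follows refers to names (models, X_val, y_val, acc, get_accuracy, rf) whose definitions do not appear in this listing. The block below is only a sketch of one plausible setup, not the authors' code: the synthetic data, the validation split ratio, and the hyperparameters are all assumptions, and the model order is chosen to match the model_name list used later in the listing.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def get_accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    return accuracy_score(y_true, y_pred)


# Synthetic stand-in for the seven encoded flight features (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set, then carve a validation set out of the training part.
# The 0.25 validation ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=41)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=41)

# Three classifiers in the same order as the model_name list
nb = GaussianNB()
lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=100, random_state=41)
models = [nb, lr, rf]
for model in models:
    model.fit(X_train, y_train)

# Filled in later by the evaluation loop
acc = []
```

Keeping a separate validation set means the test set is used only once, for the final model, so the reported test accuracy is not biased by model selection.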
# Validation accuracy of each fitted model
for model in models:
    preds_val = model.predict(X_val)
    accuracy = get_accuracy(y_val, preds_val)
    acc.append(accuracy)
model_name = ['Naive Bayes','Logistic Regression', 'Random Forest']
accuracy = dict(zip(model_name, acc))
print(accuracy)
plt.figure(figsize=(15,5))
ax = sns.barplot(x = list(accuracy.keys()), y = list(accuracy.values()))
# Annotate each bar with its accuracy value
for p, value in zip(ax.patches, list(accuracy.values())):
    _x = p.get_x() + p.get_width() / 2
    _y = p.get_y() + p.get_height() + 0.008
    ax.text(_x, _y, round(value, 3), ha="center")
plt.xlabel("Models")
plt.ylabel("Accuracy")
plt.title("Model vs. Accuracy")
plt.show()
test_preds = rf.predict(X_test)
get_accuracy(y_test, test_preds)
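Accuracy on its own does not show which kind of error a classifier makes. As a hedged illustration of the error calculation, the sketch below derives the error rate and the confusion-matrix counts from a pair of label arrays. The labels are hypothetical toy values, not results from this project; in the listing above, y_test and test_preds would take their place.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = delayed, 0 = not delayed
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

accuracy = accuracy_score(y_true, y_pred)
error_rate = 1.0 - accuracy

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"accuracy={accuracy:.2f}, error rate={error_rate:.2f}")
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Here a false negative is a delayed flight predicted as on time, which may be the costlier mistake in practice.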
Publication Report:
IEEE COPYRIGHT AND CONSENT FORM
To ensure uniformity of treatment among all contributors, other forms may not be substituted for
this form, nor may any wording of the form be changed. This form is intended for original material
submitted to the IEEE and must accompany any such material in order to be published by the IEEE.
Please read the form carefully and keep a copy for your files.
Error Calculation for Prediction of Flight Delays using Machine Learned Classifiers
Baranidaran GT, Balamurugan R
2022 6th International Conference on Trends in Electronics and Informatics (ICOEI)
COPYRIGHT TRANSFER
The undersigned hereby assigns to The Institute of Electrical and Electronics Engineers,
Incorporated (the "IEEE") all rights under copyright that may exist in and to: (a) the Work,
including any revised or expanded derivative works submitted to the IEEE by the undersigned
based on the Work; and (b) any associated written or multimedia components or other
enhancements accompanying the Work.
GENERAL TERMS
1. The undersigned represents that he/she has the power and authority to make and execute
this form.
2. The undersigned agrees to indemnify and hold harmless the IEEE from any damage or
expense that may arise in the event of a breach of any of the warranties set forth above.
3. The undersigned agrees that publication with IEEE is subject to the policies and
procedures of the IEEE PSPB Operations Manual.
4. In the event the above work is not accepted and published by the IEEE or is withdrawn by
the author(s) before acceptance by the IEEE, the foregoing copyright transfer shall be null
and void. In this case, IEEE will retain a copy of the manuscript for internal
administrative/record-keeping purposes.
5. For jointly authored Works, all joint authors should sign, or one of the authors should sign
as authorized agent for the others.
6. The author hereby warrants that the Work and Presentation (collectively, the "Materials")
are original and that he/she is the author of the Materials. To the extent the Materials
incorporate text passages, figures, data or other material from the works of others, the
author has obtained any necessary permissions. Where necessary, the author has obtained
all third party permissions and consents to grant the license above and has provided copies
of such permissions and consents to IEEE.
You have indicated that you DO wish to have video/audio recordings made of your
conference presentation under terms and conditions set forth in "Consent and
Release."
1. In the event the author makes a presentation based upon the Work at a conference hosted
or sponsored in whole or in part by the IEEE, the author, in consideration for his/her
participation in the conference, hereby grants the IEEE the unlimited, worldwide,
irrevocable permission to use, distribute, publish, license, exhibit, record, digitize,
broadcast, reproduce and archive, in any format or medium, whether now known or
hereafter developed: (a) his/her presentation and comments at the conference; (b) any
written materials or multimedia files used in connection with his/her presentation; and (c)
any recorded interviews of him/her (collectively, the "Presentation"). The permission
granted includes the transcription and reproduction of the Presentation for inclusion in
products sold or distributed by IEEE and live or recorded broadcast of the Presentation
during or after the conference.
2. In connection with the permission granted in Section 1, the author hereby grants IEEE the
unlimited, worldwide, irrevocable right to use his/her name, picture, likeness, voice and
biographical information as part of the advertisement, distribution and sale of products
incorporating the Work or Presentation, and releases IEEE from any claim based on right
of privacy or publicity.
BY TYPING IN YOUR FULL NAME BELOW AND CLICKING THE SUBMIT BUTTON, YOU CERTIFY THAT SUCH
ACTION CONSTITUTES YOUR ELECTRONIC SIGNATURE TO THIS FORM IN ACCORDANCE WITH UNITED STATES
LAW, WHICH AUTHORIZES ELECTRONIC SIGNATURE BY AUTHENTICATED REQUEST FROM A USER OVER THE
INTERNET AS A VALID SUBSTITUTE FOR A WRITTEN SIGNATURE.
Signature: Baranidaran GT
Date (dd-mm-yyyy): 08-03-2022
Information for Authors
AUTHOR RESPONSIBILITIES
The IEEE distributes its technical publications throughout the world and wants to ensure that the
material submitted to its publications is properly available to the readership of those publications.
Authors must ensure that their Work meets the requirements as stated in section 8.2.1 of the IEEE
PSPB Operations Manual, including provisions covering originality, authorship, author
responsibilities and author misconduct. More information on IEEE’s publishing policies may be
found at
http://www.ieee.org/publications_standards/publications/rights/authorrightsresponsibilities.html
Authors are advised especially of IEEE PSPB Operations Manual section 8.2.1.B12: "It is the
responsibility of the authors, not the IEEE, to determine whether disclosure of their material
requires the prior consent of other parties and, if so, to obtain it." Authors are also advised of IEEE
PSPB Operations Manual section 8.1.1B: "Statements and opinions given in work published by the
IEEE are the expression of the authors."
- Personal Servers. Authors and/or their employers shall have the right to post the accepted
version of IEEE-copyrighted articles on their own personal servers or the servers of their
institutions or employers without permission from IEEE, provided that the posted version
includes a prominently displayed IEEE copyright notice and, when published, a full citation to
the original IEEE publication, including a link to the article abstract in IEEE Xplore. Authors
shall not post the final, published versions of their papers.
- Classroom or Internal Training Use. An author is expressly permitted to post any portion
of the accepted version of his/her own IEEE-copyrighted articles on the author's personal web
site or the servers of the author's institution or company in connection with the author's
teaching, training, or work responsibilities, provided that the appropriate copyright, credit, and
reuse notices appear prominently with the posted material. Examples of permitted uses are
lecture materials, course packs, ereserves, conference presentations, or in-house training
courses.
- Electronic Preprints. Before submitting an article to an IEEE publication, authors
frequently post their manuscripts to their own web site, their employer's site, or to another
server that invites constructive comment from colleagues. Upon submission of an article to
IEEE, an author is required to transfer copyright in the article to IEEE, and the author must
update any previously posted version of the article with a prominently displayed IEEE
copyright notice. Upon publication of an article by the IEEE, the author must replace any
previously posted electronic versions of the article with either (1) the full citation to the IEEE
work with a Digital Object Identifier (DOI) or link to the article abstract in IEEE Xplore, or (2)
the accepted version only (not the IEEE-published version), including the IEEE copyright
notice and full citation, with a link to the final, published article in IEEE Xplore.
Questions about the submission of the form or manuscript must be sent to the
publication's editor.
Please direct all questions about IEEE copyright policy to:
IEEE Intellectual Property Rights Office, [email protected], +1-732-562-3966
REFERENCES
1. Chakrabarty, Navoneel, Tuhin Kundu, Sudipta Dandapat, Apurba Sarkar, and Dipak Kumar
Kole. "Flight arrival delay prediction using gradient boosting classifier." In Emerging
Technologies in Data Mining and Information Security, pp. 651-659. Springer, Singapore,
2019.
2. Chakrabarty, Navoneel. "A data mining approach to flight arrival delay prediction for
American Airlines." In 2019 9th Annual Information Technology, Electromechanical
Engineering and Microelectronics Conference (IEMECON), pp. 102-107. IEEE, 2019.
3. Kim, Y.J., S. Choi, S. Briceno, and D. Mavris. "A deep learning approach to flight delay
prediction." In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pp. 1-6.
IEEE, 2016.
4. Sternberg, A., J. Soares, D. Carvalho, and E. Ogasawara. "A review on flight delay
prediction." arXiv preprint arXiv:1703.06118 (2017).
5. Ding, Y. "Predicting flight delay based on multiple linear regression." In IOP
Conference Series: Earth and Environmental Science, vol. 81, no. 1, p. 012198. IOP
Publishing, 2017.
6. Manna, Suvojit, Sanket Biswas, Riyanka Kundu, Somnath Rakshit, Priti Gupta, and Subhas
Barman. "A statistical approach to predict flight delay using gradient boosted decision
tree." In 2017 International Conference on Computational Intelligence in Data Science
(ICCIDS), pp. 1-5. IEEE, 2017.
7. Dou, Xiaotong. "Flight arrival delay prediction and analysis using ensemble learning." In
2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control
Conference (ITNEC), vol. 1, pp. 836-840. IEEE, 2020.
8. Chen, Jun, and Meng Li. "Chained predictions of flight delay using machine learning." In
AIAA Scitech 2019 forum, p. 1661. 2019.
9. Rodríguez-Sanz, Álvaro, Fernando Gómez Comendador, Rosa Arnaldo Valdés, Javier
Pérez-Castán, Rocío Barragán Montes, and Sergio Cámara Serrano. "Assessment of airport
arrival congestion and delay: Prediction and reliability." Transportation Research Part C:
Emerging Technologies 98 (2019): 255-283.
10. Kuhn, Nathalie, and Navaneeth Jamadagni. "Application of machine learning algorithms to
predict flight arrival delays." CS229 (2017).