A PROJECT REPORT
ON
PREDICTION OF FLIGHT DELAY ANALYSIS
Submitted in partial fulfilment of the requirements for the award of the degree of
MASTER OF COMPUTER APPLICATIONS
SUBMITTED BY:
K. TRIVIKRAM (18MCA043)
UNDER THE GUIDANCE OF
Ms. R. JAYAMMA, MCA, M.Tech
Assistant Professor, Dept. of M.Sc(CS)
CERTIFICATE
This is to certify that the project work entitled “PREDICTION OF FLIGHT DELAY ANALYSIS”
is a bonafide work carried out by K.TRIVIKRAM(18MCA043) in partial fulfilment for the award of
the degree in MASTER OF COMPUTER APPLICATIONS of KRISHNA UNIVERSITY,
MACHILIPATNAM during the academic year 2019-2021. All corrections / suggestions indicated for
internal assessment have been incorporated in the report. The project work has been approved as it
satisfies the academic requirements in respect of project work prescribed for the above degree.
External Examiner
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of any task would be incomplete without
mentioning the people who made it possible and whose constant guidance and encouragement crown
all the efforts with success. This acknowledgement transcends the reality of formality when we would
like to express deep gratitude and respect to all those people behind the screen who guided, inspired
and helped me in the completion of the work. I wish to place on record my deep sense of gratitude to
my project guide, Ms. R. JAYAMMA, Assistant Professor, Department of M.Sc(CS), for her
constant motivation and valuable help throughout the project work.
My sincere thanks to Mrs. SHAMIM, Head of the Department of M.Sc(CS) for her guidance
regarding the project. I also extend my thanks to Dr. P. BHARATHI DEVI, Head of the
Department of MCA, for her valuable help throughout the project. I also extend my thanks to
Dr. MAZHARUNNISA BEGUM, Director of the P.G. Centre, and I extend gratitude to Sri
S. VENKATESH, Director of P.G. Courses, for his valuable suggestions.
K.TRIVIKRAM
(Regd.NO:18MCA043)
DECLARATION
I hereby declare that the project work entitled “PREDICTION OF FLIGHT DELAY ANALYSIS”,
submitted to K.B.N P.G COLLEGE, affiliated to KRISHNA UNIVERSITY, has been done under the
guidance of Ms. R. JAYAMMA, Assistant Professor, Department of M.Sc(CS), during the period of
study, and that it has not formed the basis for the award of any degree/diploma or other similar title to
any candidate of any University.
Signature of Student
Name: K.Trivikram
Regd.No:18MCA043
College name: KBN PG COLLEGE
DATE:
PLACE: VIJAYAWADA
ABSTRACT
The prediction of flight delays has been heavily investigated over the last few decades. Flight delays hurt
airlines, airports, and passengers. Developing accurate prediction models for flight delays is
cumbersome due to the complexity of the air transportation system, the number of methods available for
prediction, and the deluge of flight data. The flight delay analysis is based on scheduled arrival,
departure and actual times. In this context, this report presents a thorough literature review of
approaches used to build flight delay prediction models. We propose a taxonomy and summarize the
initiatives used to address the flight delay prediction problem according to scope, data, and
computational methods, giving particular attention to the increased usage of machine learning methods.
We also evaluate accuracy metrics for flight delay prediction.
INDEX
1. INTRODUCTION
2. SYSTEM REQUIREMENTS
3. REVIEW OF LITERATURE
4. SYSTEM DESIGN
4.1 DESIGN
4.2 UML DIAGRAM
4.3 IMPLEMENTATION
5. SAMPLE CODE
6. SCREENSHOTS
7. SYSTEM TESTING
8. RESULT ANALYSIS
9. CONCLUSION
10. REFERENCES
1. INTRODUCTION
The prediction of flight delays has been heavily investigated over the last few decades. Flight delays hurt
airlines, airports, and passengers. Developing accurate prediction models for flight delays is
cumbersome due to the complexity of the air transportation system, the number of methods available for
prediction, and the deluge of flight data. The flight delay analysis is based on scheduled arrival,
departure and actual times. In this context, this report presents a thorough literature review of
approaches used to build flight delay prediction models. We propose a taxonomy and summarize the
initiatives used to address the flight delay prediction problem according to scope, data, and
computational methods, giving particular attention to the increased usage of machine learning methods.
We also evaluate accuracy metrics for flight delay prediction.
An aircraft is said to be delayed when it departs and/or arrives later than its scheduled time. There
are several causes of an aircraft being delayed such as weather changes, problems in maintenance,
previous delays being propagated down the line, traffic congestion and many more. These delays are a
huge challenge for the aviation industry as well as their customers and passengers. In the USA alone,
these delays result in loss of about 22 billion US dollars every year. This is because aviation companies
are forced to pay the government authorities when they keep aircraft on hold for more than a certain
stipulated time. Airplane delays also cause a lot of problems for travelling passengers, as they prevent
them from fulfilling their commitments and attending preplanned events. This can result in the
passengers losing a lot of money as well as becoming frustrated and angry.
Several models have already been proposed to correctly forecast delays in flights. We utilize a
machine learning technique called logistic regression to predict delays in aircraft. This technique
takes various independent parameters and trains a model to classify whether an aircraft is going to be
delayed or not. We implemented the algorithm on the Microsoft Azure Machine Learning Studio platform.
We also utilised a weather dataset and joined it with the airport dataset at the respective locations to
determine the effect of weather conditions on flight delays as well as make the prediction more
accurate for real world scenarios. We train the model using 70 percent of the dataset and then test it
with the remaining 30 percent of the data. The model was able to successfully predict the correct
outcome in more than 80 percent of the scenarios.
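As a rough, self-contained sketch of the approach described above (using scikit-learn rather than Azure Machine Learning Studio), the following code trains a logistic-regression delay classifier on a 70/30 split; the file name, feature columns and the 15-minute delay threshold are assumptions for illustration only, not the exact setup used in this project:

# Hedged sketch: logistic regression for flight delay classification (assumed columns/threshold)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("flights.csv")                              # hypothetical file name
cols = ["MONTH", "DAY_OF_WEEK", "DISTANCE", "SCHEDULED_DEPARTURE", "ARRIVAL_DELAY"]
df = df.dropna(subset=cols)                                  # drop rows with missing values
df["DELAYED"] = (df["ARRIVAL_DELAY"] > 15).astype(int)       # assumed 15-minute threshold
x = df[["MONTH", "DAY_OF_WEEK", "DISTANCE", "SCHEDULED_DEPARTURE"]]
y = df["DELAYED"]

# 70/30 train/test split as described in the text
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
print(accuracy_score(y_test, model.predict(x_test)))         # fraction of correct predictions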
2. SYSTEM REQUIREMENTS
Libraries:
Matplotlib
Numpy
Pandas
Regex
Requests
Scikit-learn
Scipy
Language: Python
ANACONDA
Anaconda is a complete, open source data science package with a community of over 6 million users. It
is easy to download and install, and it is supported on Linux, macOS, and Windows.
The distribution comes with more than 1,000 data packages as well as the Conda package and virtual
environment manager, so it eliminates the need to learn to install each library independently.
As Anaconda’s website says, “The Python and R conda packages in the Anaconda Repository are
curated and compiled in our secure environment so you get optimized binaries that ‘just work’ on your
system”.
Jupyter Notebook
Spyder
PyCharm
VSCode
Glueviz
Orange 3 App
RStudio
Jupyter Lab
Qt Console: It is the PyQt GUI that supports inline figures, proper multiline editing with
syntax highlighting, graphical calltips and more.
Spyder: Spyder is a scientific Python Development Environment. It is a powerful Python
IDE with advanced editing, interactive testing, debugging and introspection features.
VS Code: It is a streamlined code editor with support for development operations like
debugging, task running and version control.
Glueviz: This is used for multidimensional data visualization across files. It explores
relationships within and among related datasets.
Orange 3: It is a component-based data mining framework. This can be used for data
visualization and data analysis. The workflows in Orange 3 are very interactive and provide
a large toolbox.
Rstudio: It is a set of integrated tools designed to help you be more productive with R. It
includes R essentials and notebooks.
The Jupyter Notebook is an open source web application that you can use to create and share
documents that contain live code, equations, visualizations, and text. Jupyter Notebook is maintained
by the people at Project Jupyter.
Jupyter Notebooks are a spin-off project from the IPython project, which used to have an IPython
Notebook project itself. The name Jupyter comes from the core programming languages it supports:
Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your programs
in Python, but there are currently over 100 other kernels that you can also use.
The Jupyter Notebook is not included with Python, so if you want to try it out, you will need to install
Jupyter.
There are many distributions of the Python language. For the purposes of installing Jupyter Notebook,
the most popular is CPython, which is the reference version of Python that you can get from the official
website. It is assumed here that you are using a standard Python distribution such as this.
PyCharm: It is the most popular IDE for Python, and includes great features such as
excellent code completion and inspection, an advanced debugger, and support for web
programming and various frameworks. PyCharm is created by the Czech company JetBrains,
which focuses on creating integrated development environments for various web
development languages like JavaScript and PHP. PyCharm offers some of the best features
to its users and developers in the following aspects:
Code completion and inspection.
Advanced debugging.
Support for web programming and frameworks such as Django and Flask.
Features of PyCharm
Besides, a developer will find PyCharm comfortable to work with because of the features mentioned
below −
SQLAlchemy as Debugger: You can set a breakpoint, pause in the debugger and
see the SQL representation of the user expression for SQL language code.
Git Visualization in Editor: When coding in Python, queries about recent changes are normal for a
developer. You can check the last commit easily in PyCharm, as it has blue sections in the gutter
that mark the difference between the last commit and the current state.
Code Coverage in Editor: You can run .py files outside the PyCharm editor as well,
and view code coverage details elsewhere in the project tree, in the summary section,
etc.
Package Management: All the installed packages are displayed with a proper visual
representation. This includes the list of installed packages and the ability to search for and add new
packages.
Local History: PyCharm keeps track of changes in a way that complements a version control
system such as Git. Local history gives complete details of what needs to be rolled back and
what is to be added.
Refactoring: It is the process of renaming and restructuring code across one or more files at a time,
and PyCharm includes various shortcuts for a smooth refactoring process.
LIBRARIES
Matplotlib:
Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter
notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible.
You can generate plots, histograms, power spectra, bar charts, error charts, scatterplots, etc., with
just a few lines of code.
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when
combined with IPython.
For the power user, you have full control of line styles, font properties, axes properties, etc, via an
object oriented interface or via a set of functions familiar to MATLAB users.
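For instance, a few lines of pyplot are enough to produce a simple labelled plot (the delay values below are arbitrary sample data, not project results):

import matplotlib.pyplot as plt

delays = [5, 12, 0, 30, 8]                     # arbitrary sample delays in minutes
plt.plot(range(len(delays)), delays, marker="o")
plt.xlabel("Flight index")
plt.ylabel("Delay (minutes)")
plt.title("Sample delay plot")
plt.show()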
Numpy:
NumPy is the fundamental package for scientific computing with Python. It contains among other
things:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
tools for integrating C/C++ and Fortran code
useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly
and speedily integrate with a wide variety of databases.
NumPy is licensed under the BSD license, enabling reuse with few restrictions.
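A short example of the N-dimensional array object and its vectorized operations (sample values only):

import numpy as np

delays = np.array([[5, 12, 0], [30, 8, 15]])   # 2-D array of sample delay values
print(delays.shape)                            # (2, 3)
print(delays.mean(axis=0))                     # column-wise means
print(delays * 60)                             # broadcasting a scalar over the whole array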
Pandas:
History of development
In 2008, pandas development began at AQR Capital Management. By the end of 2009 it had been open
sourced, and is actively supported today by a community of like-minded individuals around the world
who contribute their valuable time and energy to help make open source pandas possible.
Since 2015, pandas is a NumFOCUS sponsored project. This will help ensure the success of
development of pandas as a world-class open-source project.
Library Highlights
A fast and efficient DataFrame object for data manipulation with integrated indexing.
Tools for reading and writing data between in-memory data structures and different formats:
CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
Intelligent data alignment and integrated handling of missing data: gain automatic label-based
alignment in computations and easily manipulate messy data into an orderly form.
Flexible reshaping and pivoting of data sets.
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
Columns can be inserted and deleted from data structures for size mutability.
Aggregating or transforming data with a powerful group-by engine allowing
split-apply-combine operations on data sets.
High performance merging and joining of data sets.
Hierarchical axis indexing provides an intuitive way of working with high-dimensional data
in a lower-dimensional data structure.
Time series functionality: date range generation and frequency conversion, moving window
statistics, date shifting and lagging. Even create domain-specific time offsets and join time
series without losing data.
Highly optimized for performance, with critical code paths written in Cython or C.
Python with pandas is in use in a wide variety of academic and
commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising,
Web Analytics, and more.
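As a small illustration of the DataFrame object and the missing-data handling mentioned above (the column names and values are invented for the example):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "UA"],
    "DEPARTURE_DELAY": [10.0, np.nan, 25.0],   # one missing value
})
# Replace the missing delay with the column mean, a common cleaning step
df["DEPARTURE_DELAY"] = df["DEPARTURE_DELAY"].fillna(df["DEPARTURE_DELAY"].mean())
print(df)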
Mission
Pandas aims to be the fundamental high-level building block for doing practical, real world data
analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible
open source data analysis / manipulation tool available in any language.
Vision
Accessible to everyone
Flexible
Powerful
Easy to use
Fast
Values
It is at the core of pandas to be respectful of and welcoming to everybody: users, contributors and the
broader community, regardless of level of experience, gender, gender identity and expression, sexual
orientation, disability, personal appearance, body size, race, ethnicity, age, religion, or nationality.
Regex:
A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of
characters that define a search pattern.
Usually such patterns are used by string searching algorithms for "find" or "find and replace"
operations on strings, or for input validation.
It is a technique developed in theoretical computer science and formal language theory.
Regular expressions are used in search engines, search and replace dialogs of word processors
and text editors, in text processing utilities such as sed and AWK and in lexical
analysis.
Many programming languages provide regex capabilities either built-in or via libraries.
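In Python, the built-in re module applies such a search pattern to a string (the pattern and text below are purely illustrative):

import re

text = "Flight AA1234 delayed by 45 minutes"
match = re.search(r"([A-Z]{2})(\d+)", text)    # two capital letters followed by digits
if match:
    print(match.group(1), match.group(2))      # prints: AA 1234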
Requests:
Requests is a Python HTTP library, released under the Apache2 License.
The goal of the project is to make HTTP requests simpler and more human-friendly.
The current version is 2.22.0
The requests library is the de facto standard for making HTTP requests in Python.
It abstracts the complexities of making requests behind a beautiful, simple API so that you can
focus on interacting with services and consuming data in your application.
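A minimal usage example (httpbin.org is a public testing endpoint used here only for illustration):

import requests

response = requests.get("https://httpbin.org/get", params={"date": "2015-01-01"}, timeout=10)
if response.status_code == 200:
    data = response.json()          # parsed JSON body of the response
    print(data["args"])             # echoes back the query parameters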
Scikit-learn:
Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning
library for the Python programming language.
It features various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy
and SciPy.
Scikit-learn is largely written in Python, and uses numpy extensively for high- performance linear
algebra and array operations.
Furthermore, some core algorithms are written in Cython to improve performance.
Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic
regression and linear support vector machines by a similar wrapper around LIBLINEAR.
In such cases, extending these methods with Python may not be possible.
Scikit-learn integrates well with many other Python libraries, such as matplotlib and plotly for
plotting, numpy for array vectorization, pandas dataframes, scipy, and many more.
Scikit-learn is one of the most popular machine learning libraries on GitHub.
SciPy:
SciPy is a free and open-source Python library used for scientific computing and technical
computing.
SciPy contains modules for optimization, linear algebra, integration, interpolation, special
functions, FFT, signal and image processing, ODE solvers and other tasks common in science
and engineering.
SciPy builds on the NumPy array object and is part of the NumPy stack which includes tools like
Matplotlib, pandas and SymPy, and an expanding set of scientific computing libraries.
This NumPy stack has similar users to other applications such as MATLAB, GNU Octave, and
Scilab.
The NumPy stack is also sometimes referred to as the SciPy stack.
SciPy is also a family of conferences for users and developers of these tools: SciPy (in the United
States), EuroSciPy (in Europe) and SciPy.in (in India).
Enthought originated the SciPy conference in the United States and continues to sponsor many of
the international conferences as well as host the SciPy website.
The SciPy library is currently distributed under the BSD license, and its development is supported by an open community of developers.
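A small example of the scipy.optimize module (the function being minimized is arbitrary and chosen only for illustration):

from scipy import optimize

# Minimize a simple quadratic with minimum at x = 3
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2 + 1.0, x0=[0.0])
print(result.x)          # approximately [3.]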
PYTHON:
Python is a general-purpose, dynamic, high-level, interpreted programming language. It
supports the object-oriented programming approach to develop applications. It is simple and easy to
learn and provides lots of high-level data structures.
Python is an easy to learn yet powerful and versatile scripting language, which makes it attractive for
application development.
Python's syntax and dynamic typing, together with its interpreted nature, make it an ideal language for
scripting and rapid application development.
Python supports multiple programming paradigms, including object-oriented, imperative and
functional or procedural programming styles.
Python is not restricted to one special area such as web programming. That is why it is known
as a multipurpose language: it can be used for web, enterprise, 3D CAD and other applications.
We do not need to declare data types for variables because the language is dynamically typed, so we can
simply write a = 10 to assign an integer value to a variable.
Python makes development and debugging fast because there is no compilation step in the Python
development cycle, and the edit-test-debug cycle is very fast.
Python features:
Python provides lots of features that are listed below.
Easy to Learn and Use: Python is easy to learn and use. It is developer-friendly and high level
programming language.
Expressive Language: Python language is more expressive means that it is more understandable
and readable.
Interpreted Language: Python is an interpreted language, i.e. the interpreter executes the code line by
line. This makes debugging easy and thus suitable for beginners.
Cross-platform Language: Python can run equally on different platforms such as Windows,
Linux, Unix and Macintosh etc. So, we can say that Python is a portable language.
Free and Open Source: Python language is freely available at the official web address. The
source code is also available. Therefore it is open source.
Object-Oriented Language: Python supports the object-oriented paradigm, with the concepts of classes
and objects.
Extensible: It implies that other languages such as C/C++ can be used to compile code which can
then be used in our Python code.
Large Standard Library: Python has a large and broad standard library and provides a rich set of modules
and functions for rapid application development.
GUI Programming Support: Graphical user interfaces can be developed using Python.
Integrated: It can be easily integrated with languages like C, C++, JAVA etc.
Python applications:
Python is known for its general purpose nature that makes it applicable in almost each domain of
software development. Python as a whole can be used in any sphere of development.
Here, we are specifying applications areas where python can be applied.
Web Applications:
We can use Python to develop web applications. It provides libraries to handle internet content and protocols
such as HTML and XML, JSON, email processing, requests, beautifulSoup, Feedparser etc. It also provides
Frameworks such as Django, Pyramid, Flask etc to design and develop web based applications. Some
important developments are: PythonWikiEngines, Pocoo, PythonBlogSoftware etc.
Software Development:
Python is helpful for software development process. It works as a support language and can be used for
build control and management, testing etc.
Business Applications:
Python is used to build business applications like ERP and e-commerce systems. Tryton is a high level
application platform.
3D CAD Applications:
For creating CAD applications, Fandango is a real application which provides the full features of CAD.
Enterprise Applications:
Python can be used to create applications which can be used within an Enterprise or an Organization.
Some real time applications are: OpenErp, Tryton, Picalo etc.
Operational Feasibility
Operational Feasibility deals with the study of the prospects of the system to be developed. This system
operationally eliminates all the tensions of the admin and helps in effectively tracking the project
progress. This kind of automation will surely reduce the time and energy which were previously consumed
in manual work. Based on the study, the system has proved to be operationally feasible.
Economic Feasibility
Economic Feasibility or Cost-benefit is an assessment of the economic justification for a computer
based project. As hardware was installed from the beginning and serves many purposes, the hardware
cost for the project is low. Since the system is network-based, any number of employees connected
to the LAN within the organization can use this tool at any time. The Virtual Private Network is
to be developed using the existing resources of the organization. So the project is economically
feasible.
Technical Feasibility
According to Roger S. Pressman, Technical Feasibility is the assessment of the technical resources of
the organization. The organization needs IBM compatible machines with a graphical web browser
connected to the Internet and Intranet. The system is developed for platform Independent environment.
Java Server Pages, JavaScript, HTML, SQL server and WebLogic Server are used to develop the
system. The technical feasibility has been carried out. The system is technically feasible for
development and can be developed with the existing facility.
3. REVIEW OF LITERATURE
Flight delays hurt airlines, airports, and passengers. Their prediction is crucial during the
decision-making process for all players of commercial aviation. Moreover, the development of
accurate prediction models for flight delays became cumbersome due to the complexity of air
transportation system, the number of methods for prediction, and the deluge of flight data. In this
context, this paper presents a thorough literature review of approaches used to build flight delay
prediction models from the Data Science perspective. We propose a taxonomy and summarize the
initiatives used to address the flight delay prediction problem, according to scope, data, and
computational methods, giving particular attention to an increased usage of machine learning methods.
Besides, we also present a timeline of significant works that depicts relationships between flight delay
prediction problems and research trends to address them.
The expected growth in air travel demand and the positive correlation with the economic factors
highlight the significant contribution of the aviation community to the U.S. economy. On‐time
operations play a key role in airline performance and passenger satisfaction. Thus, an accurate
investigation of the variables that cause delays is of major importance. The application of machine
learning techniques in data mining has seen explosive growth in recent years and has garnered interest
from a broadening variety of research domains, including aviation. This study employed a support
vector machine (SVM) model to explore the non-linear relationship between flight delay outcomes and
their explanatory variables.
Individual flight data were gathered from 20 days in 2018 to investigate causes and patterns of air
traffic delay at three major New York City airports. Considering the black box characteristic of the
SVM, a sensitivity analysis was performed to assess the relationship between dependent and
explanatory variables. The impacts of various explanatory variables are examined in relation to delay,
weather information, airport ground operation, demand-capacity, and flow management
characteristics. The variable impact analysis reveals that factors such as pushback delay, taxi-out
delay, ground delay program, and demand-capacity imbalance with the probabilities of 0.506, 0.478,
0.339, and 0.338, respectively, are significantly associated with flight departure delay. These findings
provide insight for better understanding of the causes of departure delays and the impacts of various
explanatory factors on flight delay patterns.
Systems design is the process of defining elements of a system like modules, architecture, components
and their interfaces and data for a system based on the specified requirements. It is the process of
defining, developing and designing systems which satisfies the specific needs and requirements of a
business or organization.
This system is developed for the purpose of providing a single-platform application to multiple users. The
existing system increases the chances of errors and also causes much more stress to the people who are
engaged in the work.
UML is a method for describing the system architecture in detail using the blue print. UML represents
a collection of best engineering practice that has proven successful in the modeling of large and
complex systems. The UML is very important parts of developing object oriented software and the
software development process. The UML uses mostly graphical notations to express the design of
software projects. Using UML helps project teams communicate, explore potential designs,
and validate the architectural design of the software.
UML offers a set of standardized diagram types with which complex data, processes and systems can
easily be arranged in a clear, intuitive manner.
One major aspect of UML is the ability to use diagrams as a part of project documentation. These can
be utilised in various ways in the most diverse kinds of documents; for example, Use Case Diagrams
used in describing functional requirements can be specified in the requirements definition. Classes or
component diagrams can be used as software architecture in a design document. As a matter of
principle, UML diagrams can be used in practically any technical documentation (e.g. test plans) while
also serving as part of the user handbook.
Use case diagram represents the functionality of the system. Use case focus on the behavior of the
system from external point of view. Actors are external entities that interact with the system.
USECASE DIAGRAM
Use cases:
A use case describes a sequence of actions that provide something of measurable value to an actor and
is drawn as a horizontal ellipse.
Actors:
An actor is a person, organization, or external system that plays a role in one or more interactions with
the system.
Include:
In one form of interaction, a given use case may include another. "Include" is a directed relationship
between two use cases, implying that the behaviour of the included use case is inserted into the
behaviour of the including use case.
The first use case often depends on the outcome of the included use case. This is useful for extracting
truly common behaviours from multiple use cases into a single description. The notation is a dashed
arrow from the including to the included use case, with the label "«include»". There are no
parameters or return values. To specify the location in a flow of events in which the base use case
includes the behaviour of another, you simply write include followed by the name of the use case you want
to include, as in the following flow for track order.
Extend:
In another form of interaction, a given use case (the extension) may extend another. This relationship
indicates that the behaviour of the extension use case may be inserted in the extended use case under
some conditions. The notation is a dashed arrow from the extension to the extended use case, with the
label "«extend»". Modellers use the «extend» relationship to indicate use cases that are "optional" to
the base use case.
Generalization:
In the third form of relationship among use cases, a generalization/specialization relationship exists. A
given use case may have common behaviours, requirements, constraints, and assumptions with a more
general use case. In this case, describe the common elements once in the general use case and let the
specialized use cases inherit them. The notation is a solid line ending in a hollow triangle drawn from
the specialized to the more general use case (following the standard generalization notation).
Associations:
Associations between actors and use cases are indicated in use case diagrams by solid lines. An
association exists whenever an actor is involved with an interaction described by a use case.
Associations are modelled as lines connecting use cases and actors to one another, with an optional
arrowhead on one end of the line. The arrowhead is often used to indicate the direction of the initial
invocation of the relationship or to indicate the primary actor within the use case.
2. Class Diagram
Class-based Modeling, or more commonly class-orientation, refers to the style of object-oriented
programming in which inheritance is achieved by defining classes of objects; as opposed to the objects
themselves (compare Prototype-based programming).
The most popular and developed model of OOP is a class-based model, as opposed to an object-based
model. In this model, objects are entities that combine state (i.e., data), behavior (i.e., procedures, or
methods) and identity (unique existence among all other objects). The structure and behavior of an
object are defined by a class, which is a definition, or blueprint, of all objects of a specific type. An
object must be explicitly created based on a class and an object thus created is considered to be an
instance of that class. An object is similar to a structure, with the addition of method pointers, member
access control, and an implicit data member which locates instances of the class (i.e. actual objects of
that class) in the class hierarchy (essential for runtime features).
Class Diagram
3. Sequence Diagram:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that shows
how processes operate with one another and in what order. It is a construct of a Message Sequence
Chart.
Sequence diagrams are sometimes called event diagrams, event scenarios, and timing diagrams. A
sequence diagram shows, as parallel vertical lines (lifelines), different processes or objects that live
simultaneously, and, as horizontal arrows, the messages exchanged between them, in the order in
which they occur. This allows the specification of simple runtime scenarios in a graphical manner. If
the lifeline is that of an object, it demonstrates a role. Note that leaving the instance name blank can
represent anonymous and unnamed instances. In order to display interaction, messages are used. These
are horizontal arrows with the message name written above them. Solid arrows with full heads are
synchronous calls, solid arrows with stick heads are asynchronous calls and dashed arrows with stick
heads are return messages. This definition is true as of UML 2, considerably different from UML 1.x.
Activation boxes, or method-call boxes, are opaque rectangles drawn on top of lifelines to represent
that processes are being performed in response to the message (Execution Specifications in UML).
Objects calling methods on themselves use messages and add new activation boxes on top of any
others to indicate a further level of processing. When an object is destroyed (removed from memory),
an X is drawn on top of the lifeline, and the dashed line ceases to be drawn below it (this is not the case
in the first example though). It should be the result of a message, either from the object itself, or
another.
A message sent from outside the diagram can be represented by a message originating from a filled-in
circle (found message in UML) or from a border of sequence diagram (gate in UML).
Sequence Diagram
4. Collaboration Diagram:
A Sequence diagram is dynamic, and, more importantly, is time ordered. A Collaboration diagram is
very similar to a Sequence diagram in the purpose it achieves; in other words, it shows the dynamic
interaction of the objects in a system. A distinguishing feature of a Collaboration diagram is that it
shows the objects and their association with other objects in the system apart from how they interact
with each other. The association between objects is not represented in a Sequence diagram.
A Collaboration diagram is easily represented by modeling objects in a system and representing the
associations between the objects as links. The interaction between the objects is denoted by arrows. To
identify the sequence of invocation of these objects, a number is placed next to each of these arrows.
Collaboration diagram
Activity Diagram:
Activity diagrams are graphical representations of workflows of stepwise activities and actions with
support for choice, iteration and concurrency. In the Unified Modeling Language, activity diagrams
can be used to describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control. Activity diagrams are constructed from
a limited repertoire of shapes, connected with arrows.
Arrows run from the start towards the end and represent the order in which activities happen. However,
the join and split symbols in activity diagrams only resolve this for simple cases; the meaning of the
model is not clear when they are arbitrarily combined with the decisions or loops.
Activity diagram
The final state is denoted by a large black dot with a circle around it. Historical states are denoted as
circles with the letter H inside.
4.3 IMPLEMENTATION
System Architecture
Introduction
A delay of an aircraft can be problematic for the travelling passengers as it prevents them from
fulfilling their commitments and attending preplanned events. This can result in the passenger losing a
lot of money as well as make him or her frustrated and angry. Several models have already been
proposed to correctly forecast delays in flights. We utilize a machine learning technique called Lasso
regression to predict delays in aircraft. This technique takes various independent parameters and
trains a model to classify whether an aircraft is going to be delayed or not. We implemented the
algorithm on the Microsoft Azure Machine Learning Studio platform. We also utilised a weather dataset and
joined it with the airport dataset at the respective locations to determine the effect of weather
conditions on flight delays as well as make the prediction more accurate for real world scenarios. We
train the model using 70 percent of the dataset and then test it with the remaining 30 percent of the data.
The model was able to successfully predict the correct outcome in more than 80 percent of the
scenarios.
Dataset Description:
The sample data has been collected from the Department of Transportation and consists of records of
flight details and weather data.
Dataset: 2015 flight delays and cancellations, from Kaggle.
The dataset consists of 23,123 entries and 31 columns.
The dataset contains data on on-time, delayed, canceled and diverted flights, flight details,
arrival, departure and scheduled times of flights.
Features
YEAR: Year of the Flight Trip
MONTH: Month of the Flight Trip
DAY: Day of the Flight Trip
DAY_OF_WEEK: Day of week of the Flight Trip
AIRLINE: Airline Identifier
FLIGHT_NUMBER: Flight Identifier
TAIL_NUMBER: Aircraft Identifier
ORIGIN_AIRPORT: Starting Airport
DESTINATION_AIRPORT: Destination Airport
SCHEDULED_DEPARTURE: Planned Departure Time
DEPARTURE_TIME: WHEEL_OFF - TAXI_OUT
DEPARTURE_DELAY: Total Delay on Departure
TAXI_OUT: The time duration elapsed between departure from the origin airport gate and wheels off
WHEELS_OFF: The time point that the aircraft's wheels leave the ground
SCHEDULED_TIME: Planned time amount needed for the flight trip
ELAPSED_TIME: AIR_TIME+TAXI_IN+TAXI_OUT
AIR_TIME: The time duration between wheels_off and wheels_on time
DISTANCE: Distance between two airports
WHEELS_ON: The time point that the aircraft's wheels touch on the ground
TAXI_IN: The time duration elapsed between wheels-on and gate arrival at the destination airport
SCHEDULED_ARRIVAL: Planned arrival time
ARRIVAL_TIME: WHEELS_ON+TAXI_IN
ARRIVAL_DELAY: ARRIVAL_TIME-SCHEDULED_ARRIVAL
DIVERTED: Aircraft landed at an airport other than the scheduled one
CANCELLED: Flight Cancelled (1 = cancelled)
CANCELLATION_REASON: Reason for Cancellation of flight: A - Airline/Carrier; B - Weather; C -
National Air System; D - Security
AIR_SYSTEM_DELAY: Delay caused by air system
SECURITY_DELAY: Delay caused by security
AIRLINE_DELAY: Delay caused by the airline
LATE_AIRCRAFT_DELAY: Delay caused by a late-arriving aircraft
WEATHER_DELAY: Delay caused by weather.
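A hedged sketch of loading and inspecting such a dataset with pandas follows; the file name matches the one used in the sample code section, while the exact column names may differ between the raw Kaggle file and the file used in this project:

import pandas as pd

df = pd.read_csv("flightsdelay.csv")       # file name as used later in the sample code
print(df.shape)                            # the report states 23,123 rows and 31 columns
print(df.columns.tolist())                 # compare with the feature list above
print(df.isnull().sum())                   # count missing values per column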
Project Modules
Pre-processing:
Data pre-processing is a technique that is used to convert raw data into a clean dataset. The data
gathered from different sources is in a raw format that is not feasible for analysis. Pre-processing
for this approach takes four simple yet effective steps.
Feature Scaling:
The final step of data pre-processing is feature scaling. But what is it? It is a method used to
standardize the range of independent variables or features of data. But why is it necessary? A lot of
machine learning models are based on Euclidean distance. If, for example, the values in one column
(x) are much higher than the values in another column (y), then (x2 - x1) squared will give a far greater
value than (y2 - y1) squared. So clearly, one squared difference dominates over the other. In
the machine learning equations, the squared difference with the lower value will almost be treated as if
it does not exist in comparison to the far greater one. We do not want that to happen. That is why
it is necessary to transform all our variables to the same scale.
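A minimal sketch of such scaling with scikit-learn's StandardScaler (the two columns, e.g. distance and taxi-out time, are invented for the example):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[100.0, 5.0], [2500.0, 40.0], [800.0, 12.0]])   # columns on very different scales
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)          # each column now has zero mean and unit variance
print(x_scaled)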
Label Encoding
In machine learning, we usually deal with datasets which contain multiple labels in one or more
columns. These labels can be in the form of words or numbers. To make the data understandable or
human readable, the training data is often labeled in words.
Label encoding refers to converting the labels into numeric form so as to make them
machine-readable. Machine learning algorithms can then decide in a better way how those
labels must be operated. It is an important pre-processing step for the structured dataset in supervised
learning.
Feature Selection
Feature selection is also called variable selection or attribute selection.
It is the automatic selection of attributes in your data (such as columns in tabular data) that are most
relevant to the predictive modeling problem you are working on.
Feature selection is the process of selecting a subset of relevant features for use in model
construction.
Feature selection is different from dimensionality reduction. Both methods seek to reduce the number
of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations
of attributes, whereas feature selection methods include and exclude attributes present in the data
without changing them.
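One common way to automate such a selection is scikit-learn's SelectKBest; the sketch below uses synthetic data and is only an illustration of the idea, not the project's actual feature-selection code:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
x = rng.random((100, 5))                                  # 5 candidate features
y = 3 * x[:, 0] + 0.5 * x[:, 2] + rng.random(100) * 0.1   # only two features matter
selector = SelectKBest(score_func=f_regression, k=2)
x_selected = selector.fit_transform(x, y)                 # keeps the 2 most relevant columns
print(selector.get_support())                             # boolean mask of selected features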
Correlation matrix:
A correlation matrix is a table showing correlation coefficients between sets of variables. Each random
variable (Xi) in the table is correlated with each of the other variables in the table (Xj). This allows you
to see which pairs have the highest correlation.
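With pandas this is a one-line computation, optionally visualized as a heatmap with seaborn (the values below are sample data, not project results):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"DEPARTURE_DELAY": [5, 30, 0, 45],
                   "ARRIVAL_DELAY": [10, 35, 2, 50],
                   "DISTANCE": [300, 1200, 500, 800]})
corr = df.corr()                   # correlation coefficients between all numeric columns
sns.heatmap(corr, annot=True)      # the heatmap highlights the strongest pairs
plt.show()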
Applying Algorithms
The dataset is split into training and test data, and the model is then trained with regression algorithms
such as Support Vector Regression and LASSO regression.
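A hedged sketch of this step on synthetic data follows; the project's actual features, split ratio and hyperparameters may differ:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.random((200, 4))                                         # synthetic features
y = x @ np.array([3.0, 0.0, 1.5, 0.0]) + rng.random(200) * 0.1   # synthetic target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

svr = SVR(kernel="rbf").fit(x_train, y_train)
lasso = Lasso(alpha=0.01).fit(x_train, y_train)
print(svr.score(x_test, y_test), lasso.score(x_test, y_test))    # R^2 on the held-out data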
Validation of Model
Model validation is the process of checking whether the user input is suitable for model binding and if
not it should provide useful error messages to the user. The first part is to ensure that only valid entries
are made. This should filter inputs which don’t make any sense.
Algorithms
Another hyperparameter, called "lambda", is provided that controls the weighting of the sum of both
penalties in the loss function. A default value of 1.0 is used to apply the fully weighted penalty; a value of
0 excludes the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.
LASSO Regression:
Lasso regression is a type of linear Regression that uses shrinkage. Shrinkage is where data values are
shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models
(i.e. models with fewer parameters). This particular type of regression is well-suited for models
showing high levels of multicollinearity or when you want to automate certain parts of model selection,
like variable selection/parameter elimination.
The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator.
A tuning parameter, λ, controls the strength of the L1 penalty. λ is basically the amount of shrinkage:
When λ = 0, no parameters are eliminated. The estimate is equal to the one found with linear
regression.
As λ increases, more and more coefficients are set to zero and eliminated (theoretically, when λ
= ∞, all coefficients are eliminated).
As λ increases, bias increases.
As λ decreases, variance increases.
Lasso regression is one of the regularization methods that create parsimonious models in the
presence of a large number of features, where large means either of the two things below:
Large enough to enhance the tendency of the model to over-fit. As few as ten variables can
cause overfitting.
Large enough to cause computational challenges. This situation can arise in case of millions or
billions of features.
Lasso regression performs L1 regularization, that is, it adds a penalty equivalent to the
absolute value of the magnitude of the coefficients. Here the minimization objective is: minimize the
residual sum of squares plus λ times the sum of the absolute values of the coefficients, i.e. RSS + λ Σ|βj|.
By using a large coefficient, we put a huge emphasis on a particular feature, assuming it can be a good
predictor of the outcome. When coefficients become too large, the algorithm starts modeling intricate
relations to calculate the output and ends up overfitting to the particular data. Lasso regression adds a
factor of the sum of the absolute values of the coefficients to the optimization objective.
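To see the shrinkage effect described above, one can fit Lasso with increasing alpha values (scikit-learn's alpha plays the role of λ) and watch coefficients being driven to zero; the data here is synthetic and only illustrates the idea:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.random((100, 5))
y = 4 * x[:, 0] + 2 * x[:, 1] + rng.random(100) * 0.1   # only two informative features

for alpha in [0.001, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(x, y)
    print(alpha, model.coef_)       # larger alpha drives more coefficients exactly to zero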
5. SAMPLE CODE
SOURCE CODE:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
Load the dataset
df = pd.read_csv("flightsdelay.csv")
print("Data read Sucessfully")
df
DATA PRE-PROCESSING
wd = df['WeatherDelay'].astype('float').mean(axis = 0)
df['WeatherDelay'].replace(np.nan,wd,inplace = True)
nas = df['NASDelay'].astype('float').mean(axis = 0)
df['NASDelay'].replace(np.nan,nas,inplace = True)
sd = df['SecurityDelay'].astype('float').mean(axis = 0)
df['SecurityDelay'].replace(np.nan,sd,inplace = True)
lad = df['LateAircraftDelay'].astype('float').mean(axis = 0)
df['LateAircraftDelay'].replace(np.nan,lad,inplace = True)
df.isnull().sum()
LABEL ENCODING
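The label-encoding and train/test-split code is not reproduced in the report at this point, although x_train and y_train are used by the models below. A hedged reconstruction, assuming categorical columns such as Airline and a delay column as the target (the actual column names in flightsdelay.csv may differ), might look like this:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Assumed categorical columns; adjust to the columns actually present in the dataset
for col in ['Airline', 'Origin', 'Dest']:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

y = df['ArrDelay']                                          # assumed target column
x = df.drop(columns=['ArrDelay']).select_dtypes('number')   # keep numeric features only
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)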
LINEAR REGRESSION
Importing Model
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(x_train,y_train)
from sklearn.metrics import r2_score
r2_score(y_test, model_lr.predict(x_test))
LASSO REGRESSION
Importing Model
from sklearn.linear_model import Lasso
model_lso = Lasso()
model_lso.fit(x_train,y_train)
from sklearn.metrics import r2_score
r2_score(y_test, model_lso.predict(x_test))
Here we got 65% accuracy with the Lasso regressor.
RIDGE REGRESSION
Importing Model
from sklearn.linear_model import Ridge
model_rdg = Ridge()
model_rdg.fit(x_train,y_train)
from sklearn.metrics import r2_score
r2_score(y_test, model_rdg.predict(x_test))
Here we got 65% accuracy with the Ridge regressor.
ELASTIC NET REGRESSION
Importing Model
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.1,random_state = 10)
from sklearn.linear_model import ElasticNet
model_en = ElasticNet(alpha = 1.0)
model_en.fit(x_train,y_train)
from sklearn.metrics import r2_score
r2_score(y_test, model_en.predict(x_test))
Here we got 70% accuracy with the Elastic Net regressor.
6. SCREENSHOTS
Raw Dataset
Instance of Dataset
7. SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, subassemblies, assemblies and/or a finished product. It is the process of exercising
software with the intent of ensuring that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various types of test. Each test
type addresses a specific testing requirement.
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning
properly, and that program inputs produce valid outputs. All decision branches and internal code flow
should be validated. It is the testing of individual software units of the application. It is done after the
completion of an individual unit, before integration. This is structural testing that relies on
knowledge of its construction and is invasive. Unit tests perform basic tests at component level and test
a specific business process, application, and/or system configuration. Unit tests ensure that each
unique path of a business process performs accurately to the documented specifications and contains
clearly defined inputs and expected results.
Integration testing
Integration tests are designed to test integrated software components to determine if they actually run
as one program. Testing is event driven and is more concerned with the basic outcome of screens or
fields. Integration tests demonstrate that although the components were individually satisfactory, as
shown by successful unit testing, the combination of components is correct and consistent.
Integration testing is specifically aimed at exposing the problems that arise from the combination of
components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as specified by
the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.
Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.
Test objectives
Features to be tested
Integration Testing
Software integration testing is the incremental integration testing of two or more integrated software
components on a single platform to produce failures caused by interface defects. The task of the
integration test is to check that components or software applications,
e.g. components in a software system or – one step up – software applications at the company level –
interact without error.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the
end user. It also ensures that the system meets the functional requirements.
Test Results
All the test cases mentioned above passed successfully. No defects encountered.
Validation Testing.
Unit Testing
Unit testing focuses verification effort on the smallest unit of software design, that is, the module. Unit
testing exercises specific paths in a module's control structure to ensure complete coverage and
maximum error detection. This test focuses on each module individually, ensuring that it functions
properly as a unit; hence the name, unit testing.
During this testing, each module is tested individually and the module interfaces are verified for
consistency with the design specification. All the important processing paths are tested for the expected
results. All error handling paths are also tested.
Integration Testing
Integration testing addresses the issues associated with the dual problems of verification and program
construction. After the software has been integrated, a set of high-order tests is conducted. The main
objective in this testing process is to take unit-tested modules and build a program structure that has
been dictated by the design.
Bottom-up Integration
This method begins the construction and testing with the modules at the lowest level in the program
structure. Since the modules are integrated from the bottom up, processing required for modules
subordinate to a given level is always available and the need for stubs is eliminated. The bottom up
integration strategy may be implemented with the following steps:
The low-level modules are combined into clusters that perform a specific software
sub-function.
A driver (the control program) for testing is written to coordinate test case input and output.
The cluster is tested.
Drivers are removed and clusters are combined, moving upward in the program structure.
The bottom-up approach tests each module individually, and then each module is
integrated with a main module and tested for functionality.
Output Testing
After performing the validation testing, the next step is output testing of the proposed system, since no
system could be useful if it does not produce the required output in the specified format. The outputs
generated or displayed by the system under consideration are tested by asking the users about the format
they require. Hence the output format is considered in two ways: one on screen and the other in
printed format.
Validation Checking
Validation checks are performed on the following fields.
Text Field:
The text field can contain only a number of characters less than or equal to its size. The text fields
are alphanumeric in some tables and alphabetic in other tables. An incorrect entry always flashes an error
message.
Numeric Field:
The numeric field can contain only numbers from 0 to 9. An entry of any other character flashes an error
message. The individual modules are checked for accuracy and for what they have to perform. Each module
is subjected to a test run along with sample data. The individually tested modules are integrated into a
single system. Testing involves executing the program with real data; the existence of any program
defect is inferred from the output. The testing should be planned so that all the requirements are
individually tested.
A successful test is one that uncovers the defects for inappropriate data and produces output
revealing the errors in the system.
USER TRAINING
Whenever a new system is developed, user training is required to educate them about the working of
the system so that it can be put to efficient use by those for whom the system has been primarily
designed. For this purpose the normal working of the project was demonstrated to the prospective
users. Its working is easily understandable and since the expected users are people who have good
knowledge of computers, the use of this system is very easy.
MAINTENANCE
This covers a wide range of activities including correcting code and design errors. To reduce the need
for maintenance in the long run, we have more accurately defined the user’s requirements during the
process of system development. Depending on the requirements, this system has been developed to
satisfy the needs to the largest possible extent. With development in technology, it may be possible to
add many more features based on the requirements in future. The coding and designing is simple and
easy to understand which will make maintenance easier.
TESTING STRATEGY
A strategy for system testing integrates system test cases and design techniques into a well-planned
series of steps that results in the successful construction of software. The testing strategy must
incorporate test planning, test case design, test execution, and the resultant data collection and
evaluation. A strategy for software testing must accommodate low-level tests that are necessary to
verify that a small source code segment has been correctly implemented as well as high level tests that
validate major system functions against user requirements.
Software testing is a critical element of software quality assurance and represents the ultimate review
of specification, design and coding. Testing presents an interesting anomaly for the software engineer. Thus, a
series of testing are performed for the proposed system before the system is ready for user acceptance
testing.
8. RESULT ANALYSIS
9. CONCLUSION
Overall, our models are only of limited utility since none were capable of correctly predicting flight
delays with both precision and recall greater than 50%. This seemingly low performance is likely due
to the many causes of flight delays being outside the scope of our data. It is unclear if it is even possible
to predict whether or not a flight will be delayed so far in advance, as we have set up the problem,
because so many of the causes of delays (e.g. mechanical issues and weather) cannot be known in
advance. Despite this, we were successful in creating models that outperform baseline models, and
perform at least about as well as prior work, even when we often use less information, and generalize
to more airports.
Although imperfect, this model still makes potentially useful predictions about which flights are more
or less likely to be delayed.
FUTURE SCOPE
To improve our model it is essential to understand which features are important to it. This
can be done for logistic regression. It can help us come up with new feature ideas in both high-bias and
high-variance cases, identify the top features, and detect data leakage, which can occur when a
column directly determined by the output label is included. This is also beneficial for future feature
engineering. Any problems that arise in this regard can be addressed by updating the model, and most of
them can be resolved in this way.
10. REFERENCES
[1] Yufeng Tu, Michael Ball, and Wolfgang Jank. "Estimating Flight Departure Delay Distributions: A
Statistical Approach with Long-term Trend and Short-term Pattern." 2006.
[3] Mueller, Eric R., and Gano B. Chatterji. "Analysis of aircraft arrival and departure delay
characteristics." AIAA Aircraft Technology, Integration and Operations (ATIO) Conference, 2002.
[4] Beatty, Roger, et al. "Preliminary evaluation of flight delay propagation through an airline
schedule." Air Traffic Control Quarterly 7.4 (1999): 259-270.
[7] Shawn Allan, J. A. Beesley, Jim Evans, and Steve Gaddy. "Analysis of delay causality at
Newark International Airport." 2001.
[9] Kim Y J, Choi S, Briceno S, et al. "A deep learning approach to flight delay prediction." 35th Digital
Avionics Systems Conference, Sacramento, USA, 2016: 1-6.
[10] LeCun Y, Bengio Y, and Hinton G. "Deep learning." Nature, 2015, 521(7553): 436-444.
doi: 10.1038/nature14539.
[11] Huang G, Liu Z, and Weinberger K Q. "Densely connected convolutional networks." 30th
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, USA, 2017:
2261-2269.
[13] Nair V and Hinton G E. "Rectified linear units improve restricted Boltzmann machines." 27th
International Conference on Machine Learning, Haifa, Israel, 2010: 807-814.
[15] Duan Kaibo, Keerthi S S, Chu Wei, et al. "Multi-category classification by soft-max combination of
binary classifiers." 4th International Workshop on Multiple Classifier Systems, Guildford, United
Kingdom, 2003: 125-134.