Documentation & Report For Flyzy Flight Cancellation Project

Uploaded by

Tankiso Mofokeng
Tankiso Mofokeng

Flight cancellation
Flight cancellation prediction model for Flyzy.com

Tankiso Mofokeng
25/05/2024

About Me
About The Company
Problem statement
Solution
Data cleaning and Data preparation
Visual representation of the outlier analysis before removing the outliers
Visual representation of the outlier analysis after removing the outliers
Exploratory Data Analysis (EDA)
Correlation matrix
Display histograms of numerical columns
Count plot of cancellation status
Relationship between Previous_Flight_Delay_Minutes and Flight_Cancelled
Preprocessing and Model Building
Helper functions
Model training and results
Build Other Classification Models
Confusion matrix for Random Forest
Decision Tree
Implemented strategies, plans and research methods
Data cleaning
Exploratory Data Analysis (EDA)
Preprocessing and Model Building
Build Other Classification Models
Concluding statement/ Recommendations



About Me

Tankiso Mofokeng (25 years old)

Recent IT Graduate (Systems Development) | Aspiring Data Scientist

Highly motivated recent graduate with a Bachelor's degree in Information Technology (Systems
Development) seeking a challenging entry-level position in data science or a related field. I am
eager to leverage strong analytical skills and a passion for problem-solving to contribute to
innovative projects.

Skills

• Systems Development

• Data Science Fundamentals

• Fast Learner

• Open Communication

• Honesty and Integrity



About The Company


Flyzy is a one-of-a-kind travel-tech platform that aims to transform modern-day travel into a
completely hassle-free experience by bringing innovation and the power of technology right to your
fingertips, quite literally! Through its AI-enabled platform, it connects passengers with the
retailers, service providers, and other stakeholders that facilitate air, road, and train
travel.

Apart from creating safer, simpler, and more personalized experiences for the passengers, Flyzy
also ensures commercial benefits for the service providers and stakeholders involved. Flyzy's
services will soon be made available for other popular modes of travel like rail and road.

Problem statement
Flyzy is a company focused on providing a smooth and hassle-free air travel experience. They
offer personalized in-flight and airport recommendations, and they also provide real-time flight
tracking, mobile check-in, and more. Flyzy aims to redefine the future of air travel with a more
personalized and connected experience from the beginning of the trip to the end.

Flight cancellation is a significant issue in the aviation industry. It not only disrupts the
customers' plans but also impacts the airline's reputation and profitability. Therefore, predicting
flight cancellations can help airlines take preventive measures and minimize disruptions.

Solution

The business objective of this project is to develop a reliable predictive model for flight
cancellations to support Flyzy's mission of providing a smooth and hassle-free air travel
experience. By accurately predicting flight cancellations, Flyzy can take proactive measures to
minimize disruptions for their customers and improve overall operational efficiency for partner
airlines.

This predictive model will enable Flyzy to:

• Enhance Customer Satisfaction
• Optimize Operational Efficiency
• Improve Business Reputation
• Increase Profitability

Dataset
• Flight_ID (Numeric): This is a unique identifier for each flight. The values are randomly
generated and have no particular meaning. This column will not be used in the analysis
and modeling, as it does not contain any useful information.

• Airline (Categorical): This column represents the airline that operates the flight. It is a
categorical variable with five different airlines. Each airline has a different reputation and
operational efficiency, which might affect the likelihood of flight cancellations.

• Flight_Distance (Numeric): This column represents the distance of the flight in
kilometers. The values have been scaled to have a mean of 0 and a standard deviation of
1. Longer flights might have a higher chance of cancellation due to more complex
logistics and higher chances of encountering issues.

• Origin_Airport and Destination_Airport (Categorical): These columns represent the
airport from where the flight departs and the airport to which the flight arrives,
respectively. They are categorical variables with five different airports each. Some
airports might have a higher chance of flight cancellations due to factors like weather
conditions, operational efficiency, and local regulations.

• Scheduled_Departure_Time (Numeric): This column represents the time when the flight
is scheduled to depart. The time is represented as the number of minutes past midnight.
The departure time might affect the likelihood of flight cancellations. For example,
flights scheduled to depart late at night might have a higher chance of cancellation due to
fewer operational staff or unfavorable weather conditions.

• Day_of_Week (Numeric): This column represents the day of the week when the flight is
scheduled. The days are represented as numbers from 1 (Monday) to 7 (Sunday). The day
of the week might affect the likelihood of flight cancellations. For example, flights
scheduled on weekends might have a higher chance of cancellation due to higher
passenger load.

• Month (Numeric): This column represents the month when the flight is scheduled. The
months are represented as numbers from 1 (January) to 12 (December). The month might
affect the likelihood of flight cancellations. For example, flights scheduled in winter
months might have a higher chance of cancellation due to bad weather conditions.

• Airplane_Type (Categorical): This column represents the model or type of the airplane. It
is a categorical variable with five different airplane types. Some airplane types might
have a higher chance of flight cancellations due to factors like age, maintenance issues,
and fuel efficiency.

• Weather_Score (Numeric): This column represents a score of the severity of the weather
conditions, with higher scores indicating worse weather. Bad weather is one of the most
common reasons for flight cancellations.

• Previous_Flight_Delay_Minutes (Numeric): This column represents the delay of the
previous flight by the same airplane in minutes. If the previous flight was delayed, it
might affect the current flight's schedule and lead to its cancellation.

• Airline_Rating (Numeric): This column represents a score of the quality or reputation of
the airline, with higher scores indicating better airlines. Better-rated airlines might have
lower chances of flight cancellations due to higher operational efficiency and customer
service.

• Passenger_Load (Numeric): This column represents the load or capacity utilization of the
flight, represented as a percentage. Flights with a higher passenger load might have a
lower chance of cancellation, as airlines want to avoid inconveniencing a large number of
passengers.

• Flight_Cancelled (Binary): This is the target variable, represented as a binary variable
where '1' means the flight was cancelled and '0' means it was not cancelled. The flight
cancellations are based on various factors and some randomness.
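
Two of the column conventions above are easier to see in code: the minutes-past-midnight encoding of Scheduled_Departure_Time and the mean-0, standard-deviation-1 scaling of Flight_Distance. The sketch below is illustrative only. The helper names and the sample distances are hypothetical and are not taken from the project.

```python
from statistics import mean, pstdev


def minutes_to_clock(minutes_past_midnight):
    """Convert minutes past midnight (0-1439) to an HH:MM string."""
    hours, minutes = divmod(minutes_past_midnight, 60)
    return f"{hours:02d}:{minutes:02d}"


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical raw distances in kilometers, for illustration only.
raw_km = [300.0, 800.0, 1500.0, 2400.0]
scaled = standardize(raw_km)

print(minutes_to_clock(375))   # 06:15
print(minutes_to_clock(1380))  # 23:00
print([round(v, 2) for v in scaled])
```

After scaling, a negative Flight_Distance simply means a flight shorter than the average flight in the dataset.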

Data cleaning and Data preparation

During data pre-processing, the dataset was checked for missing values and duplicate records.
Neither was found, but outliers were present. To address them, a Z-score approach was used to
define the upper and lower bounds of the data, and records falling outside those bounds were
dropped. This trimmed the dataset from 3,000 records to 2,949, removing 51 outliers.
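
A minimal sketch of Z-score trimming is shown below. It runs on synthetic stand-in data, not the Flyzy dataset. The report does not state the cut-off that was used, so the threshold of 3 standard deviations is an assumption, as is the function name `remove_outliers_zscore`.

```python
import numpy as np
import pandas as pd


def remove_outliers_zscore(df, columns, threshold=3.0):
    """Drop rows where any listed column lies more than `threshold`
    standard deviations from that column's mean (threshold is assumed)."""
    z = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    keep = (z.abs() <= threshold).all(axis=1)
    return df[keep]


# Synthetic stand-in data, not the Flyzy dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Weather_Score": rng.normal(0.5, 0.1, 200),
    "Previous_Flight_Delay_Minutes": rng.normal(30.0, 5.0, 200),
})
demo.loc[0, "Previous_Flight_Delay_Minutes"] = 500.0  # one planted outlier

cleaned = remove_outliers_zscore(demo, list(demo.columns))
print(len(demo), "->", len(cleaned))
```

The difference between the two printed lengths is the number of rows removed, which is how a figure such as the 51 records above would be obtained.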

Visual representation of the outlier analysis before removing the outliers:



Visual representation of the outlier analysis after removing the outliers:

Exploratory Data Analysis (EDA)

Following an initial exploration of the data, a descriptive analysis was conducted to determine
measures of central tendency: mean, median, and mode. For Previous_Flight_Delay_Minutes, the
mean is greater than the median, which is in turn greater than the mode. This ordering suggests
a positive (right) skew in the distribution of Previous_Flight_Delay_Minutes. To confirm this
initial assessment and gain a deeper understanding of the data's shape, a visual representation of
the distribution was generated.
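
The mean > median > mode check can be reproduced in a few lines of pandas. The sketch below uses synthetic right-skewed delays rather than the project data, and it adds the sample skewness as a numerical cross-check. That extra statistic is not mentioned in the report.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed delays (whole minutes); not the Flyzy data.
rng = np.random.default_rng(1)
delays = pd.Series(rng.exponential(scale=30.0, size=5000).round())

mean_delay = delays.mean()
median_delay = delays.median()
mode_delay = delays.mode().iloc[0]  # smallest value if several values tie
skewness = delays.skew()            # > 0 indicates a positive (right) skew

print(f"mean={mean_delay:.1f} median={median_delay:.1f} mode={mode_delay:.1f}")
print(f"skewness={skewness:.2f}")
```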

Following the assessment of skewness, a correlation analysis was undertaken to explore the
interrelationships between the variables, summarized in a correlation matrix. Histograms were
then generated for all numerical columns. Histograms depict the distribution of continuous
data and help identify potential outliers and skewness. Count plots were created for the
categorical columns, limited to the top 50 unique values, to show the frequency of each
category. A dedicated count plot was generated to visualize the distribution of the
cancellation status (Flight_Cancelled). Finally, to investigate the potential association
between Previous_Flight_Delay_Minutes and Flight_Cancelled, a bar graph was plotted.
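
The numbers behind these plots can be computed directly with pandas. The sketch below uses synthetic data with a planted relationship between delay and cancellation, so the values it prints say nothing about the real dataset. It also assumes the bar graph shows the mean delay per cancellation class, which the report does not specify.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the relationship below is planted for illustration.
rng = np.random.default_rng(2)
n = 1000
delay = rng.exponential(scale=30.0, size=n)
weather = rng.uniform(0.0, 1.0, size=n)
# Cancellation probability rises with delay in this toy setup.
p_cancel = 1.0 / (1.0 + np.exp(-(delay - 30.0) / 15.0))
cancelled = (rng.uniform(size=n) < p_cancel).astype(int)

df = pd.DataFrame({
    "Previous_Flight_Delay_Minutes": delay,
    "Weather_Score": weather,
    "Flight_Cancelled": cancelled,
})

# Pearson correlation matrix of the numerical columns.
corr = df.corr()

# Class balance: the numbers behind a count plot of the target.
class_counts = df["Flight_Cancelled"].value_counts()

# Mean delay per class: one possible basis for the bar graph (assumed).
mean_delay_by_class = (
    df.groupby("Flight_Cancelled")["Previous_Flight_Delay_Minutes"].mean()
)

print(corr.round(2))
print(class_counts)
print(mean_delay_by_class.round(1))
```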

Correlation matrix

Display histograms of numerical columns



Count plot of cancellation status



Relationship between Previous_Flight_Delay_Minutes and Flight_Cancelled

Preprocessing and Model Building

So that the models could work with them, categorical variables were encoded using custom
helper functions. This process transforms categorical data into a numerical representation
suitable for machine learning algorithms. Feature selection was then used to identify and
remove irrelevant features. In this case, Day_of_Week and Month were determined to have
minimal predictive power and were excluded from the analysis. Following data preparation,
the dataset was separated into features (X) and the target (y), and then split into training
and testing sets. The training set is used to fit the model, while the testing set allows for
an unbiased evaluation of the model's performance. Using the prepared data, a logistic
regression model was built. This model achieved a testing accuracy of 83.05%, indicating its
potential for flight cancellation prediction.
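
The pipeline described above can be sketched with pandas and scikit-learn as follows. This is not the project's code: the data is synthetic, one-hot encoding via `pd.get_dummies` stands in for the custom helper functions, and the 80/20 stratified split is an assumption, since the report does not give the split ratio.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a planted signal; not the Flyzy dataset.
rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "Airline": rng.choice(["Airline A", "Airline B", "Airline C"], size=n),
    "Weather_Score": rng.uniform(0.0, 1.0, size=n),
    "Previous_Flight_Delay_Minutes": rng.exponential(30.0, size=n),
    "Day_of_Week": rng.integers(1, 8, size=n),
    "Month": rng.integers(1, 13, size=n),
})
signal = 4.0 * df["Weather_Score"] + df["Previous_Flight_Delay_Minutes"] / 30.0 - 3.0
df["Flight_Cancelled"] = (
    rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-signal))
).astype(int)

# Drop the columns the report excluded, then one-hot encode the categorical one.
features = df.drop(columns=["Flight_Cancelled", "Day_of_Week", "Month"])
X = pd.get_dummies(features, columns=["Airline"], drop_first=True, dtype=float)
y = df["Flight_Cancelled"]

# Assumed 80/20 split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {test_accuracy:.3f}")
```

The accuracy printed here reflects only the toy data and is not comparable to the 83.05% reported for the real dataset.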

Helper functions

Model training and results



Build Other Classification Models

To explore alternative modeling approaches and potentially improve predictive performance,
additional models were built beyond the initial logistic regression: a decision tree and a
random forest. Both are well suited to classification tasks such as flight cancellation
prediction. Each model was then evaluated on the reserved test data to give an unbiased
assessment of how well it generalizes. Accuracy is the metric reported here; precision,
recall, and F1-score are the usual companions for a classification task. The random forest
model achieved a testing accuracy of 98.3%, surpassing the 83.05% of the logistic regression
model. The decision tree model reached a testing accuracy of 96.4%.
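
A comparison of this kind can be sketched with scikit-learn as shown below. The data is synthetic, with a nonlinear rule planted so that tree-based models have something to find. The hyperparameters (`n_estimators=100`, the random seeds) and the 80/20 split are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features and target; not the Flyzy dataset.
rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 4))
# A nonlinear planted rule, so tree-based models have something to find.
y = ((X[:, 0] * X[:, 1] > 0) & (X[:, 2] > -0.5)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, predictions)
    print(name, f"accuracy={accuracies[name]:.3f}")
    # Precision, recall, and F1-score per class.
    print(classification_report(y_test, predictions, zero_division=0))

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
rf_confusion = confusion_matrix(y_test, models["random_forest"].predict(X_test))
print(rf_confusion)
```

Reading the confusion matrix alongside accuracy shows whether errors are mostly missed cancellations (false negatives) or false alarms (false positives), which accuracy alone does not reveal.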

Confusion matrix for Random Forest



Decision Tree

Implemented strategies, plans and research methods


Data cleaning

Z-score for dealing with outliers: A z-score measures how many standard deviations
above or below the mean a data point is.
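
Written as a formula, with x a data point, μ the column mean, and σ the column standard deviation, the definition above reads as follows. The cut-off k is an assumption (k = 3 is a common convention); the report does not state the value used.

```latex
z = \frac{x - \mu}{\sigma},
\qquad \text{keep } x \text{ if } |z| \le k
```

For example, with hypothetical values μ = 30 minutes and σ = 10 minutes, a delay of x = 65 minutes gives z = (65 − 30) / 10 = 3.5, which would be flagged as an outlier under k = 3.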

Exploratory Data Analysis (EDA)


● Descriptive Statistics: Used descriptive statistics of the dataset to determine the distribution of
the data.
● Distribution of data: Plotted histograms or bar charts to see the distribution of data in each
column.
● Relationship between features: Plotted scatter plots, pair plots, or correlation matrices to see the
relationship between different features.
● Relationship between features and target variable: Investigated how different features relate to
the target variable.

Preprocessing and Model Building


● Encoding categorical variables
● Feature Scaling
● Model Building
● Model Evaluation

Build Other Classification Models


● Building other models: Built other classification models, namely a Decision Tree and a Random Forest.
● Model Evaluation: Evaluated each model using appropriate metrics and the test data.
● Model Comparison: Compared the performance of all the models.

Concluding statement/ Recommendations


A comprehensive evaluation process was conducted to determine the most effective model for
flight cancellation prediction. This process involved training and evaluating several models:
logistic regression, a decision tree, and a random forest. Each model's performance was
assessed on the reserved test data, ensuring an unbiased evaluation of
generalizability.

While logistic regression achieved a respectable testing accuracy of 83.05%, further exploration
revealed superior performance from other models. Notably, the random forest model achieved a
considerably higher testing accuracy of 98.3%, making it the strongest of the three models at
predicting flight cancellations. The decision tree model also performed well, with a testing
accuracy of 96.4%.

These findings suggest that logistic regression, while a viable approach, may not be the optimal
choice for this specific prediction task. More sophisticated models, such as random forests, offer
greater potential for accurate flight cancellation prediction.
