Documentation & Report For Flyzy Flight Cancellation Project
Flight cancellation
Flight cancellation prediction model for Flyzy.com
Tankiso Mofokeng
25/05/2024
Contents
• About Me
• Solution
• Visual representation of the outlier analysis before removing the outliers
• Visual representation of the outlier analysis after removing the outliers
• Data cleaning
About Me
Highly motivated recent graduate with a Bachelor's degree in Information Technology (Systems
Development) seeking a challenging entry-level position in data science or a related field. I am
eager to leverage strong analytical skills and a passion for problem-solving to contribute to
innovative projects.
Skills
• Systems Development
• Fast Learner
• Open Communication
Apart from creating safer, simpler, and more personalized experiences for the passengers, Flyzy
also ensures commercial benefits for the service providers and stakeholders involved. Flyzy's
services will soon be made available for other popular modes of travel like rail and road.
Problem statement
Flyzy is a company focused on providing a smooth and hassle-free air travel experience. They
offer personalized in-flight and airport recommendations, and they also provide real-time flight
tracking, mobile check-in, and more. Flyzy aims to redefine the future of air travel with a more
personalized and connected experience from the beginning of the trip to the end.
Flight cancellation is a significant issue in the aviation industry. It not only disrupts the
customers' plans but also impacts the airline's reputation and profitability. Therefore, predicting
flight cancellations can help airlines take preventive measures and minimize disruptions.
Solution
The business objective of this project is to develop a reliable predictive model for flight
cancellations to support Flyzy's mission of providing a smooth and hassle-free air travel
experience. By accurately predicting flight cancellations, Flyzy can take proactive measures to
minimize disruptions for their customers and improve overall operational efficiency for partner
airlines.
Dataset
• Flight_ID (Numeric): This is a unique identifier for each flight. The values are randomly
generated and have no particular meaning. This column will not be used in the analysis
and modeling, as it does not contain any useful information.
• Airline (Categorical): This column represents the airline that operates the flight. It is a
categorical variable with five different airlines. Each airline has a different reputation and
operational efficiency, which might affect the likelihood of flight cancellations.
• Scheduled_Departure_Time (Numeric): This column represents the time when the flight
is scheduled to depart. The time is represented as the number of minutes past midnight.
The departure time might affect the likelihood of flight cancellations. For example,
flights scheduled to depart late at night might have a higher chance of cancellation due to
reduced operational staffing or unfavorable weather conditions.
• Day_of_Week (Numeric): This column represents the day of the week when the flight is
scheduled. The days are represented as numbers from 1 (Monday) to 7 (Sunday). The day
of the week might affect the likelihood of flight cancellations. For example, flights
scheduled on weekends might have a higher chance of cancellation due to higher
passenger load.
• Month (Numeric): This column represents the month when the flight is scheduled. The
months are represented as numbers from 1 (January) to 12 (December). The month might
affect the likelihood of flight cancellations. For example, flights scheduled in winter
months might have a higher chance of cancellation due to bad weather conditions.
• Airplane_Type (Categorical): This column represents the model or type of the airplane. It
is a categorical variable with five different airplane types. Some airplane types might
have a higher chance of flight cancellations due to factors like age, maintenance issues,
and fuel efficiency.
• Weather_Score (Numeric): This column represents a score of the severity of the weather
conditions, with higher scores indicating worse weather. Bad weather is one of the most
common reasons for flight cancellations.
• Passenger_Load (Numeric): This column represents the load or capacity utilization of the
flight, represented as a percentage. Flights with a higher passenger load might have a
lower chance of cancellation, as airlines want to avoid inconveniencing a large number of
passengers.
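To make the encodings described above concrete, here is a minimal sketch that decodes two of the numeric columns into a readable form. The helper names and the sample values are hypothetical and are not taken from the Flyzy dataset or from the project's own code.

```python
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def minutes_to_clock(minutes_past_midnight: int) -> str:
    """Convert Scheduled_Departure_Time (minutes past midnight) to HH:MM."""
    hours, minutes = divmod(minutes_past_midnight, 60)
    return f"{hours:02d}:{minutes:02d}"


def day_name(day_of_week: int) -> str:
    """Convert Day_of_Week (1 = Monday ... 7 = Sunday) to its name."""
    return DAY_NAMES[day_of_week - 1]


# Hypothetical example values for a single flight record.
print(minutes_to_clock(1390))  # 23:10
print(day_name(6))             # Saturday
```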
During data pre-processing, the dataset was checked thoroughly for missing values and
duplicates. Neither was found, but outliers were present. To address them, a Z-score
approach was used to define the upper and lower bounds of the data. Trimming to these
bounds reduced the initial dataset of 3,000 records to 2,949 records, removing 51 outliers.
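The trimming step can be sketched as follows. This is a minimal illustration that uses only the Python standard library. The function name, the sample values, and the cut-off of 3 standard deviations are assumptions: the report does not state which threshold or which library was actually used.

```python
from statistics import mean, pstdev


def remove_outliers_zscore(values, threshold=3.0):
    """Keep only values that lie within `threshold` standard deviations of the mean.

    threshold=3.0 is an assumed, commonly used cut-off; the report does not
    state the value that was applied.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return list(values)  # all values identical, nothing to trim
    lower = mu - threshold * sigma
    upper = mu + threshold * sigma
    return [v for v in values if lower <= v <= upper]


# Hypothetical delay values in minutes, with one extreme value (200).
delays = [10, 12, 9, 11, 13, 10, 12, 11, 9, 10,
          12, 11, 10, 13, 9, 11, 12, 10, 11, 200]
cleaned = remove_outliers_zscore(delays)
print(len(delays), "->", len(cleaned))  # 20 -> 19
```

The same bounds are applied per numeric column in practice; the records that fall outside them in any column are the ones dropped.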
Following an initial exploration of the data, a descriptive analysis was conducted to determine
measures of central tendency: mean, median, and mode. The results indicated that the mean is
greater than the median, which is in turn greater than the mode. This observation suggests a
possible positive skew in the distribution of Previous_Flight_Delay_Minutes. To confirm this
initial assessment and gain a deeper understanding of the data's shape, a visual representation of
the distribution was generated.
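The mean > median > mode comparison described above can be sketched with the standard library. The values below are hypothetical and are not taken from Previous_Flight_Delay_Minutes. The ordering is only a rule of thumb for positive skew, which is why a plot is still needed to confirm it.

```python
from statistics import mean, median, mode

# Hypothetical delay values in minutes; NOT taken from the Flyzy dataset.
delays = [0, 0, 0, 5, 5, 10, 15, 20, 45, 120]

avg = mean(delays)          # arithmetic mean
mid = median(delays)        # middle value of the sorted data
most_common = mode(delays)  # most frequent value

# Rule of thumb: mean > median > mode hints at a positive (right) skew.
suggests_positive_skew = avg > mid > most_common
print(avg, mid, most_common, suggests_positive_skew)
```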
Correlation matrix
Helper functions
Decision Tree
Z-score for dealing with outliers: A z-score measures how many standard deviations
above or below the mean a data point lies.
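Written as a formula, with x a data point, μ the mean of the column and σ its standard deviation (the symbols are conventional notation, not taken from the report):

```latex
z = \frac{x - \mu}{\sigma}
```

A point is treated as an outlier when |z| exceeds a chosen cut-off. A value of 3 is a common choice, but the report does not state which cut-off was used.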
While logistic regression achieved a respectable testing accuracy of 83.05%, further exploration
revealed superior performance from other models. Notably, the random forest model achieved a
significantly higher testing accuracy of 98.3%, and the decision tree model also performed
strongly with an accuracy of 96.4%.
These findings suggest that logistic regression, while a viable approach, may not be the optimal
choice for this specific prediction task. More sophisticated models, such as random forests, offer
greater potential for accurate flight cancellation prediction.
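A comparison of this kind can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic (generated with make_classification), and the split ratio, hyperparameters and random seeds are all assumed, so the accuracies it prints will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; NOT the Flyzy dataset.
X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=5, random_state=42)

# Assumed 80/20 train/test split, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The three model families named in the report, with assumed settings.
models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.2%}")
```

Scoring every model on the same held-out split keeps the comparison fair, since no model is evaluated on records it was trained on.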