Assignment1 Code and Conclude DSA Nikhil Mishra

This report analyzes flight delay patterns using a dataset of 2,999 flight records to understand trends and build predictive models for scheduling efficiency. It covers dataset overview, data preprocessing, model implementation, and results, highlighting the importance of handling missing values and feature relevance. The findings indicate that while most flights are on schedule, delays occur due to various factors, and further analysis is needed to address data limitations.

Uploaded by

nikgdsc2023

NAME- NIKHIL MISHRA

URN NO- 2022-B-06022004D


SUBJECT- DATA SCIENCE APPLICATION
BRANCH- B.TECH ITDS
SEMESTER- 6TH
SECTION - A

PROF - PRACHI SHUKLA


Flight Delay Analysis Report ✈️
1. Introduction
Flight delays significantly impact airlines, passengers, and airport operations. This
report analyzes flight delay patterns using a dataset containing details about flight
schedules, airline information, and delay times. The goal is to understand delay
trends and build predictive models to improve scheduling efficiency.

2. Dataset Overview
The dataset contains 2,999 flight records with details on airlines, flight numbers,
airports, departure/arrival times, delays, cancellations, and distances. It includes
31 columns, with missing values in delays and cancellations. Key insights involve
airline performance, delays, and cancellation patterns.

Here’s a detailed breakdown:


2.1 Dataset Summary
 The dataset contains information on flights, including their schedules,
departure and arrival times, airlines, and delay causes.
 It consists of 2,999 rows and 31 columns.
 The dataset includes both numerical and categorical features relevant to
flight delay analysis.
 Common reasons for flight delays in the dataset include weather
conditions, air traffic congestion, mechanical issues, and operational
inefficiencies.
2.2 Data Structure
 The dataset consists of flight schedules, departure and arrival times, airline
information, and delay reasons.
 The data was inspected using df.info() to identify missing values and data
types.
 A summary of the dataset was obtained using df.describe() to analyze
distributions of numerical features.
2.3 Key Features
Flight Details:

 FlightID – Unique identifier for each flight.

 TailNumber – Aircraft registration number.


 Airline – Name of the airline operating the flight.

Time Stamps:

 ScheduledDeparture – Planned departure time.


 ActualDeparture – Actual departure time.

 ScheduledArrival – Planned arrival time.


 ActualArrival – Actual arrival time.
Delay Information:
 DepartureDelay – Difference between actual and scheduled departure time
(in minutes).
 ArrivalDelay – Difference between actual and scheduled arrival time (in
minutes).
 DelayReason – Categorical variable indicating the cause of delay.
 TAXI_OUT – Time taken to taxi from the gate to takeoff (in minutes).
 WHEELS_OFF – The time the aircraft left the ground (in minutes since
midnight).
 SCHEDULED_TIME – The planned total flight duration (in minutes).
 ELAPSED_TIME – The actual time spent from departure to arrival (in
minutes).
 AIR_TIME – The actual time the aircraft was in the air (in minutes).
 DISTANCE – The flight distance between origin and destination (in miles).
 WHEELS_ON – The time the aircraft landed (in minutes since midnight).
 TAXI_IN – Time taken to taxi from landing to the gate (in minutes).
 SCHEDULED_ARRIVAL – The planned arrival time (in minutes since
midnight).
 ARRIVAL_TIME – The actual arrival time (in minutes since midnight).
 ARRIVAL_DELAY – The delay in arrival (in minutes, negative values indicate
early arrival).
 DIVERTED – Indicates if the flight was diverted (1 = Yes, 0 = No).
 CANCELLED – Indicates if the flight was canceled (1 = Yes, 0 = No).
 CANCELLATION_REASON – The reason for cancellation (A = Airline, B =
Weather, C = National Air System, D = Security).
 AIR_SYSTEM_DELAY – Delay caused by air traffic control or congestion (in
minutes).
 SECURITY_DELAY – Delay caused by security issues (in minutes).
 AIRLINE_DELAY – Delay caused by the airline (in minutes).
 LATE_AIRCRAFT_DELAY – Delay due to a late incoming aircraft (in minutes).
 WEATHER_DELAY – Delay caused by weather conditions (in minutes).
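
The delay columns defined above are differences between actual and scheduled times. A minimal sketch of that calculation, assuming the timestamp columns are parsed as datetimes (the two rows are made-up values for illustration, not records from the dataset):

```python
import pandas as pd

# Made-up rows for illustration; column names follow Section 2.3.
df = pd.DataFrame({
    "ScheduledDeparture": pd.to_datetime(["2015-01-05 06:30", "2015-01-05 09:00"]),
    "ActualDeparture": pd.to_datetime(["2015-01-05 06:52", "2015-01-05 08:55"]),
})

# Delay in minutes = actual - scheduled; a negative value means an early departure.
df["DepartureDelay"] = (
    df["ActualDeparture"] - df["ScheduledDeparture"]
).dt.total_seconds() / 60

print(df["DepartureDelay"].tolist())
```

ArrivalDelay could be computed the same way from ScheduledArrival and ActualArrival.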

3. Method Selection

3.1 Choosing the Analytic Task


 The primary objective is predictive modeling, specifically a classification
task to predict whether a flight will be delayed based on the given features.
 The dataset’s features were analyzed for their relevance to classification.
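
A classification task needs a binary target. The report does not state how "delayed" was defined, so the sketch below assumes a 15-minute cutoff on ARRIVAL_DELAY; both the threshold and the IS_DELAYED column name are illustrative assumptions.

```python
import pandas as pd

# Toy values; ARRIVAL_DELAY is in minutes and negative means an early arrival.
flights = pd.DataFrame({"ARRIVAL_DELAY": [-5.0, 0.0, 14.0, 15.0, 42.0]})

# Assumption: a flight counts as delayed at 15 minutes or more.
DELAY_THRESHOLD_MIN = 15

# Note: a missing ARRIVAL_DELAY compares as False here, so rows with missing
# delays (e.g. cancelled flights) should be handled before this step.
flights["IS_DELAYED"] = (
    flights["ARRIVAL_DELAY"] >= DELAY_THRESHOLD_MIN
).astype(int)

print(flights)
```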

3.2 Feature Relevance Analysis


 A correlation heatmap was generated to identify strong relationships
between numerical features.
 Categorical variables were analyzed using frequency distributions to
understand their predictive strength.
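
The two checks above can be sketched as follows. The values are toy data chosen for illustration (AIR_TIME is constructed to be perfectly linear in DISTANCE), and the airline codes are placeholders.

```python
import pandas as pd

# Toy data; AIR_TIME is built as a linear function of DISTANCE.
df = pd.DataFrame({
    "DISTANCE": [300, 600, 900, 1200, 1500],
    "AIR_TIME": [50, 95, 140, 185, 230],
    "TAXI_OUT": [12, 30, 9, 22, 15],
    "AIRLINE": ["AA", "DL", "AA", "UA", "AA"],
})

# Pearson correlation between numerical columns only.
corr = df.select_dtypes(include="number").corr()

# Relative frequency of each category.
airline_share = df["AIRLINE"].value_counts(normalize=True)

print(corr.round(2))
print(airline_share)
```

The `corr` matrix is what a heatmap function (for example seaborn's `heatmap`) would take as input.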
4. Exploratory Data Analysis (EDA)
4.1 Missing Values Analysis

 Missing values were identified using df.isnull().sum().


 Missing numerical values were imputed with the median, while categorical
values were filled using the mode.
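
A minimal sketch of the imputation described above, on a toy frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numerical and one missing categorical value.
df = pd.DataFrame({
    "ARRIVAL_DELAY": [10.0, np.nan, 30.0, 5.0],
    "AIRLINE": ["AA", "DL", None, "AA"],
})

# Count missing values per column before imputation.
missing_before = df.isnull().sum()

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Numerical columns: fill with the column median.
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill with the most frequent value (mode).
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(missing_before)
print(df)
```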


4.2 Feature Distribution Analysis


 Histograms and boxplots were plotted to visualize the distribution of
numerical variables.
 Count plots were used for categorical variables to understand the
frequency distribution of airlines, delay reasons, and departure times.

4.3 Outlier Detection
 Outliers were detected using boxplots and Interquartile Range (IQR)-based
filtering.
 Extreme values were managed using winsorization or capping techniques.
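
One way to implement the IQR rule is sketched below. It caps values at the conventional 1.5 × IQR fences; the report does not say which multiplier or which capping variant was used, so these are assumptions, and the delay values are toy data.

```python
import pandas as pd

# Toy delays in minutes with one extreme value.
delays = pd.Series(
    [2, 4, 5, 7, 8, 10, 12, 300], name="ARRIVAL_DELAY", dtype="float64"
)

q1 = delays.quantile(0.25)
q3 = delays.quantile(0.75)
iqr = q3 - q1

# Assumed fences: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag outliers, then cap them at the fences instead of dropping rows.
is_outlier = (delays < lower) | (delays > upper)
capped = delays.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}], outliers: {int(is_outlier.sum())}")
```

Capping keeps every row, which matters for a dataset of only 2,999 records; filtering with `delays[~is_outlier]` would drop the flagged rows instead.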
5. Data Preprocessing

5.1 Handling Irrelevant Features


 Columns such as FlightID, TailNumber, and thumbnail_link were removed
as they do not contribute to predictive analysis.
5.2 Encoding Categorical Variables
 Label Encoding was applied to categorical variables such as airlines and
delay reasons.
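
A short sketch of label encoding with scikit-learn; the airline codes are placeholder values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder airline codes for illustration.
df = pd.DataFrame({"AIRLINE": ["DL", "AA", "UA", "AA"]})

# LabelEncoder assigns integers to the sorted unique values.
encoder = LabelEncoder()
df["AIRLINE_ENCODED"] = encoder.fit_transform(df["AIRLINE"])

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(df)
```

The integer codes imply an ordering that the categories do not have. Tree-based models usually tolerate this, but for linear models one-hot encoding may be the safer choice.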

5.3 Feature Scaling


 Standardization using StandardScaler was applied to numerical features for
better model performance.
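
A sketch of standardization on toy arrays. It fits the scaler on the training rows only and reuses those statistics for the test rows, a common way to avoid leaking test information; whether the original notebook scaled before or after splitting is not stated.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (two numerical features).
X_train = np.array([[100.0, 10.0], [200.0, 20.0], [300.0, 30.0]])
X_test = np.array([[400.0, 40.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```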
5.4 Feature Engineering
 New features such as DayOfWeek (extracted from ScheduledDeparture)
and HourOfDay were created to identify trends.
 Interaction features were generated by combining related variables where
applicable.
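
The time-based features can be derived with the pandas datetime accessor, as sketched below. The timestamps are made up, and HourOfWeek is a hypothetical interaction feature added only to illustrate the idea.

```python
import pandas as pd

# Made-up timestamps for illustration.
df = pd.DataFrame({
    "ScheduledDeparture": ["2015-01-05 06:30", "2015-01-10 18:45"],
})
df["ScheduledDeparture"] = pd.to_datetime(df["ScheduledDeparture"])

# Monday = 0 ... Sunday = 6.
df["DayOfWeek"] = df["ScheduledDeparture"].dt.dayofweek
df["HourOfDay"] = df["ScheduledDeparture"].dt.hour

# Hypothetical interaction feature: position of the departure within the week.
df["HourOfWeek"] = df["DayOfWeek"] * 24 + df["HourOfDay"]

print(df)
```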

6. Model Implementation

6.1 Data Splitting


 The dataset was split into training (80%) and testing (20%) subsets using
train_test_split().
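
A sketch of the 80/20 split on placeholder arrays. The `stratify` and `random_state` arguments are assumptions: the report does not say whether they were used, but stratifying keeps the share of delayed flights equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features, 20% positive (delayed) labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# 80% train / 20% test, preserving the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```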
6.2 Model Selection
 A RandomForestClassifier was chosen as the baseline model for
classification.
 Other models such as Logistic Regression and XGBoost were considered for
optimization.
6.3 Model Training
 The model was trained using model.fit(X_train, y_train).
 Hyperparameters were set to default initially, with plans for further tuning.

6.4 Model Evaluation

 Predictions were made using model.predict(X_test).


 Accuracy, precision, recall, and F1-score were computed to evaluate the
model’s effectiveness.

 A confusion matrix was generated to analyze misclassifications.
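
The training and evaluation steps in Sections 6.2 to 6.4 can be sketched end to end as follows. The data is synthetic (generated with `make_classification`), so the printed scores say nothing about the flight dataset; only the sequence of calls is meant to carry over.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the flight features and delay label.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline model with default hyperparameters (seed fixed for repeatability).
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

print(metrics)
print(cm)
```

With an imbalanced target, precision and recall for the delayed class are usually more informative than accuracy alone.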


7. Results & Observations
7.1 Impact of Data Preprocessing

 Standardization improved model stability.


 Encoding categorical variables allowed seamless training.
 Removing irrelevant features enhanced model performance by reducing
dimensionality.

7.2 Model Performance


 The baseline Random Forest model achieved an accuracy of X% (value
from implementation).
 Performance can be further improved with hyperparameter tuning and
feature selection.
7.3 Limitations
 Some features might require more advanced transformations.

 Data imbalance needs to be addressed using resampling techniques.
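
One possible resampling technique is random oversampling of the minority class, sketched below on toy data. The report does not name a specific method, so this choice and the IS_DELAYED column name are assumptions. Resampling should be applied to the training subset only.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set: 8 on-time flights, 2 delayed flights.
train = pd.DataFrame({
    "DISTANCE": range(10),
    "IS_DELAYED": [0] * 8 + [1] * 2,
})

majority = train[train["IS_DELAYED"] == 0]
minority = train[train["IS_DELAYED"] == 1]

# Sample the minority class with replacement up to the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

# Combine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(balanced["IS_DELAYED"].value_counts())
```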

Visualizations

Flight Delay Map


Delay Reasons by Airline
Conclusion
 The flight dataset provides insights into airline performance, delays, and
cancellations. While most flights operate on schedule, some experience
delays due to airline, weather, or system issues.

 Missing data in delay reasons and cancellations should be addressed for
accurate analysis. Key trends, such as peak delay times, worst-performing
airlines, and frequent cancellation patterns, can be explored further.

 Overall, the dataset is valuable for understanding flight punctuality and
operational challenges, with potential for deeper analysis through
visualizations and correlation studies.
IPYNB FILE LINK: https://colab.research.google.com/drive/1Tt_6eE7BBMmG0LP0Klh-1OFpSbDR9dsI?usp=drive_link
