
DM Project - Step 4

The document analyzes flight delay data through data visualization and statistical analysis. Various visualizations are created to understand the distribution of departure delays and how they vary by factors like carrier, departure time, month, day of week, precipitation, and departing airport.

Uploaded by

BHAVIKA MALHOTRA
Copyright: © All Rights Reserved

Data Analysis & Visualization

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("full_data_flightdelay.csv")
df.drop_duplicates(inplace=True)

# Encode the categorical columns as integer codes.
# Note: refitting one LabelEncoder on every column keeps only the classes of the
# last column fitted (PREVIOUS_AIRPORT), so it cannot decode the other columns later.
label_encoder = LabelEncoder()
for col in ['DEP_TIME_BLK', 'CARRIER_NAME', 'DEPARTING_AIRPORT', 'PREVIOUS_AIRPORT']:
    df[col] = label_encoder.fit_transform(df[col])

df

         MONTH  DAY_OF_WEEK  DEP_TIME_BLK  DISTANCE_GROUP  SEGMENT_NUMBER  \
0            1            7             3               2               1
1            1            7             2               7               1
2            1            7             1               7               1
3            1            7             1               9               1
4            1            7             0               7               1
...        ...          ...           ...             ...             ...
1048570      3            7            15              10               1
1048571      3            7             3               3               1
1048572      3            7            13               7               1
1048573      3            7             3              11               1
1048574      3            7            14               5               1

         CONCURRENT_FLIGHTS  NUMBER_OF_SEATS  CARRIER_NAME  \
0                        25              143            14
1                        29              191             6
2                        27              199             6
3                        27              180             6
4                        10              182            15
...                     ...              ...           ...
1048570                  25              154            16
1048571                  33              276            16
1048572                  26              169            16
1048573                  33              235            16
1048574                  25              173            16

         AIRPORT_FLIGHTS_MONTH  AIRLINE_FLIGHTS_MONTH  ...  DEPARTING_AIRPORT  \
0                        13056                 107363  ...                 41
1                        13056                  73508  ...                 41
2                        13056                  73508  ...                 41
3                        13056                  73508  ...                 41
4                        13056                  15023  ...                 41
...                        ...                    ...  ...                ...
1048570                  11562                  53007  ...                 48
1048571                  11562                  53007  ...                 48
1048572                  11562                  53007  ...                 48
1048573                  11562                  53007  ...                 48
1048574                  11562                  53007  ...                 48

         LATITUDE  LONGITUDE  PREVIOUS_AIRPORT  PRCP  SNOW  SNWD  TMAX   AWND
0          36.080   -115.152               208  0.00   0.0   0.0  65.0   2.91
1          36.080   -115.152               208  0.00   0.0   0.0  65.0   2.91
2          36.080   -115.152               208  0.00   0.0   0.0  65.0   2.91
3          36.080   -115.152               208  0.00   0.0   0.0  65.0   2.91
4          36.080   -115.152               208  0.00   0.0   0.0  65.0   2.91
...           ...        ...               ...   ...   ...   ...   ...    ...
1048570    40.696    -74.172               208  0.03   0.0   0.0  65.0  14.09
1048571    40.696    -74.172               208  0.03   0.0   0.0  65.0  14.09
1048572    40.696    -74.172               208  0.03   0.0   0.0  65.0  14.09
1048573    40.696    -74.172               208  0.03   0.0   0.0  65.0  14.09
1048574    40.696    -74.172               208  0.03   0.0   0.0  65.0  14.09

         DEP_DEL15
0                0
1                0
2                0
3                0
4                0
...            ...
1048570          1
1048571          1
1048572          0
1048573          0
1048574          0

[1044213 rows x 26 columns]
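Reusing a single encoder for several columns means only the last fitted column can be decoded afterwards. One possible alternative, sketched below, keeps a separate label map per column. It uses `pd.factorize` rather than the notebook's `LabelEncoder`, and the toy frame, its values, and the names `toy`, `label_maps`, and `decoded_carriers` are illustrative assumptions rather than part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the flight data; the values are illustrative only.
toy = pd.DataFrame({
    "CARRIER_NAME": ["Delta", "United", "Delta", "Alaska"],
    "DEPARTING_AIRPORT": ["LAS", "EWR", "EWR", "LAS"],
})

# Keep the sorted unique labels of every column so each one can be decoded later.
label_maps = {}
for col in ["CARRIER_NAME", "DEPARTING_AIRPORT"]:
    codes, uniques = pd.factorize(toy[col], sort=True)
    toy[col] = codes          # integer codes replace the strings
    label_maps[col] = uniques  # per-column lookup: code -> original label

# Decode one column with its own label map.
decoded_carriers = label_maps["CARRIER_NAME"].take(toy["CARRIER_NAME"].to_numpy())
print(list(decoded_carriers))
```

With `sort=True` the codes follow the sorted label order, which is also how `LabelEncoder` assigns its codes, so the encoded columns should look the same while remaining decodable.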

1. What is the distribution of departure delays (DEP_DEL15)?
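import matplotlib.pyplot as plt
import seaborn as sns  # used by the later plots

# DEP_DEL15 is a 0/1 flag, so count the flights in each class
# instead of spreading two values over 20 histogram bins
delay_counts = df['DEP_DEL15'].value_counts().sort_index()
plt.bar(delay_counts.index.astype(str), delay_counts.values,
        color='skyblue', edgecolor='black')
plt.xlabel('Departure Delay (DEP_DEL15)')
plt.ylabel('Frequency')
plt.title('Distribution of Departure Delays')
plt.show()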
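Since DEP_DEL15 takes only the values 0 and 1, the plot can be complemented by the class counts and shares as numbers. The following is a minimal sketch on made-up flags; `delay_flags`, `counts`, and `shares` are hypothetical names, and the values are not taken from the data set.

```python
import pandas as pd

# Illustrative 0/1 delay flags, not taken from the real data set.
delay_flags = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name="DEP_DEL15")

counts = delay_flags.value_counts().sort_index()                # flights per class
shares = delay_flags.value_counts(normalize=True).sort_index()  # fraction per class

print(pd.DataFrame({"flights": counts, "share": shares}))
```

On the real frame the same two calls would be applied to `df['DEP_DEL15']`.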
2. How does the average departure delay vary by carrier (CARRIER_NAME)?
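# The shared label_encoder was last fitted on PREVIOUS_AIRPORT, so it cannot decode
# CARRIER_NAME. Re-read the original names and align them with the de-duplicated rows.
carrier_names = pd.read_csv("full_data_flightdelay.csv",
                            usecols=['CARRIER_NAME'])['CARRIER_NAME'].loc[df.index]

plt.figure(figsize=(10, 6))
sns.barplot(x=carrier_names, y=df['DEP_DEL15'])
plt.xticks(rotation=45, ha='right')
plt.xlabel('Carrier Name')
plt.ylabel('Average Departure Delay (DEP_DEL15)')
plt.title('Average Departure Delay by Carrier')
plt.show()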
3. Is there a relationship between departure delay and the number of flight attendants per passenger (FLT_ATTENDANTS_PER_PASS)?
sns.scatterplot(x='FLT_ATTENDANTS_PER_PASS', y='DEP_DEL15', data=df)
plt.xlabel('Flight Attendants per Passenger')
plt.ylabel('Departure Delay (DEP_DEL15)')
plt.title('Departure Delay vs Flight Attendants per Passenger')
plt.show()
The scatter plot shows no clear relationship between the number of flight attendants per passenger and departure delay.
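A scatter plot against a 0/1 variable is hard to read, so the visual impression can be checked numerically. The sketch below computes the Pearson correlation between a numeric column and the binary flag (equivalent to the point-biserial correlation) and compares group means. The toy values are constructed so that no relationship exists; they are an assumption for illustration, not results from the flight data.

```python
import pandas as pd

# Made-up data in which delayed and on-time flights have the same staffing ratio.
toy = pd.DataFrame({
    "FLT_ATTENDANTS_PER_PASS": [0.01, 0.02, 0.03, 0.04, 0.01, 0.02, 0.03, 0.04],
    "DEP_DEL15":               [0,    1,    0,    1,    1,    0,    1,    0],
})

# Pearson correlation with a 0/1 variable is the point-biserial correlation.
r = toy["FLT_ATTENDANTS_PER_PASS"].corr(toy["DEP_DEL15"])

# Mean staffing ratio for on-time (0) and delayed (1) flights.
group_means = toy.groupby("DEP_DEL15")["FLT_ATTENDANTS_PER_PASS"].mean()

print(round(r, 6))
print(group_means)
```

A correlation near zero together with near-identical group means would support the "no relationship" reading; a larger value would call for a closer look.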

4. What is the distribution of departure delays (DEP_DEL15) for different departure time blocks (DEP_TIME_BLK)?
plt.figure(figsize=(12, 6))
# DEP_TIME_BLK holds the integer codes produced by the label encoding above, so the
# x-axis shows codes (0, 1, 2, ...) rather than the original time ranges.
sns.boxplot(x='DEP_TIME_BLK', y='DEP_DEL15', data=df, palette='viridis')
plt.xticks(rotation=45)
plt.xlabel('Departure Time Block')
plt.ylabel('Departure Delay (DEP_DEL15)')
plt.title('Distribution of Departure Delays by Time Block')
plt.show()
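According to the graph, only the time blocks with encoded values 13, 14 and 16 show delays inside the box. Because DEP_DEL15 is a 0/1 indicator, the box of a group reaches 1 only when roughly a quarter or more of its flights are delayed, so the other blocks can still contain delayed flights.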
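For a 0/1 variable, a box plot mostly reflects whether the upper quartile of a group reaches 1. A per-block delay rate may be easier to interpret. The sketch below contrasts the two summaries on made-up data; `toy`, `delay_rate`, and `upper_quartile` are hypothetical names and the numbers are not results from the flight data.

```python
import pandas as pd

# Made-up flights: block 0 has 1 delay in 5 flights, block 13 has 2 delays in 4.
toy = pd.DataFrame({
    "DEP_TIME_BLK": [0, 0, 0, 0, 0, 13, 13, 13, 13],
    "DEP_DEL15":    [0, 0, 0, 0, 1, 0, 1, 1, 0],
})

grouped = toy.groupby("DEP_TIME_BLK")["DEP_DEL15"]

# Share of delayed flights per block (mean of a 0/1 flag).
delay_rate = grouped.mean()

# Upper quartile per block: this is where the top of a box plot's box sits.
upper_quartile = grouped.quantile(0.75)

print(pd.DataFrame({"delay_rate": delay_rate, "upper_quartile": upper_quartile}))
```

In this toy case block 0 still has a 20% delay rate even though its box would stay flat at 0, which is why the rate is the more informative number.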

5. How does the average departure delay vary by month (MONTH)?
import calendar

plt.figure(figsize=(10, 6))
sns.lineplot(x='MONTH', y='DEP_DEL15', data=df, estimator='mean', ci=None)

# Label the numeric month axis with month names so the months stay in calendar order
# (this also avoids adding and dropping a temporary MONTH_NAME column)
months = sorted(df['MONTH'].unique())
plt.xticks(months, [calendar.month_name[m] for m in months], rotation=45)
plt.xlabel('Month')
plt.ylabel('Average Departure Delay (DEP_DEL15)')
plt.title('Average Departure Delay by Month')
plt.show()
As the plot shows, the share of delayed flights is highest in February, followed by January, and lowest in March.

6. How does departure delay vary by the day of the week (DAY_OF_WEEK)?
sns.barplot(x='DAY_OF_WEEK', y='DEP_DEL15', data=df)
plt.xlabel('Day of Week')
plt.ylabel('Average Departure Delay (DEP_DEL15)')
plt.title('Average Departure Delay by Day of Week')
plt.show()
7. How does departure delay vary with precipitation (PRCP)?
sns.scatterplot(x='PRCP', y='DEP_DEL15', data=df)
plt.xlabel('Precipitation (PRCP)')
plt.ylabel('Departure Delay (DEP_DEL15)')
plt.title('Departure Delay vs Precipitation')
plt.show()
As the plot shows, precipitation has little influence on departure delay; its effect is almost negligible.
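One way to check this impression is to bin precipitation and compare the delay rate per bin. The sketch below is only an illustration: the bin edges, the labels, and the toy values are assumptions chosen for the example, not thresholds or results from the original analysis.

```python
import pandas as pd

# Made-up precipitation amounts and delay flags.
toy = pd.DataFrame({
    "PRCP":      [0.0, 0.0, 0.0, 0.0, 0.05, 0.2, 0.6, 1.2],
    "DEP_DEL15": [0,   0,   1,   0,   0,    1,   1,   0],
})

# Hypothetical bin edges; intervals are right-closed, so 0.0 falls into "none".
bins = [-0.01, 0.0, 0.1, 0.5, float("inf")]
labels = ["none", "light", "moderate", "heavy"]
toy["PRCP_BIN"] = pd.cut(toy["PRCP"], bins=bins, labels=labels)

# Number of flights and share of delayed flights in each precipitation bin.
rate_by_bin = toy.groupby("PRCP_BIN", observed=True)["DEP_DEL15"].agg(["size", "mean"])
print(rate_by_bin)
```

Similar delay rates across bins would back the "almost negligible" reading; the `size` column shows how many flights each rate rests on.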

8. Is there a difference in departure delay between flights departing from different airports (DEPARTING_AIRPORT)?
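# The shared label_encoder was last fitted on PREVIOUS_AIRPORT, so decoding
# DEPARTING_AIRPORT with it yields wrong names. Re-read the original names instead
# and align them with the de-duplicated rows.
airport_names = pd.read_csv("full_data_flightdelay.csv",
                            usecols=['DEPARTING_AIRPORT'])['DEPARTING_AIRPORT'].loc[df.index]

plt.figure(figsize=(50, 25))
sns.boxplot(x=airport_names, y=df['DEP_DEL15'])
plt.xticks(rotation=45, ha='right')
plt.xlabel('Departing Airport')
plt.ylabel('Departure Delay (DEP_DEL15)')
plt.title('Departure Delay vs Departing Airport')
plt.show()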
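In the box plot, only three airports stand out with a substantial share of delayed flights:

1. Alexander Hamilton Airport
2. Albuquerque International Sunport
3. Columbus Metropolitan Airport

The plot shows an association, not that these airports cause the delays. The names should also be treated with caution: a label encoder decodes a column correctly only if it was last fitted on that same column, and the shared encoder in this notebook was last fitted on PREVIOUS_AIRPORT.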
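A box plot over many airports is difficult to read, so a ranked table of delay rates may be a useful complement. The sketch below ranks airports by their share of delayed flights and ignores airports with very few flights; the airport codes, the values, and the `MIN_FLIGHTS` cut-off are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up flights for three placeholder airports.
toy = pd.DataFrame({
    "DEPARTING_AIRPORT": ["AAA"] * 4 + ["BBB"] * 4 + ["CCC"],
    "DEP_DEL15":         [1, 0, 0, 0] + [1, 1, 0, 0] + [1],
})

MIN_FLIGHTS = 3  # hypothetical cut-off so tiny samples do not dominate the ranking

# Flights and share of delayed flights per airport.
summary = toy.groupby("DEPARTING_AIRPORT")["DEP_DEL15"].agg(
    flights="size", delay_rate="mean"
)

# Keep airports with enough flights and sort by delay rate, highest first.
ranked = summary[summary["flights"] >= MIN_FLIGHTS].sort_values(
    "delay_rate", ascending=False
)
print(ranked)
```

Here `CCC` has a 100% delay rate but only one flight, so the cut-off removes it from the ranking.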
