DM Project - Step 4
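import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the flight-delay data and remove exact duplicate rows
df = pd.read_csv("full_data_flightdelay.csv")
df.drop_duplicates(inplace=True)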
df
DEP_DEL15
0 0
1 0
2 0
3 0
4 0
... ...
1048570 1
1048571 1
1048572 0
1048573 0
1048574 0
plt.figure(figsize=(10, 6))
sns.barplot(x=decoded_carrier_names, y='DEP_DEL15', data=df)
plt.xticks(rotation=45)
plt.xlabel('Carrier Name')
plt.ylabel('Average Departure Delay (DEP_DEL15)')
plt.title('Average Departure Delay by Carrier')
plt.show()
3. Is there a relationship between departure delay and the number of flight attendants per passenger (FLT_ATTENDANTS_PER_PASS)?
sns.scatterplot(x='FLT_ATTENDANTS_PER_PASS', y='DEP_DEL15', data=df)
plt.xlabel('Flight Attendants per Passenger')
plt.ylabel('Departure Delay (DEP_DEL15)')
plt.title('Departure Delay vs Flight Attendants per Passenger')
plt.show()
The scatter plot shows no apparent relationship between the number of flight attendants per
passenger and departure delay.
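The visual impression can also be checked numerically. The following is a minimal sketch and not part of the original notebook: it computes the Pearson correlation between a numeric feature and the 0/1 DEP_DEL15 flag (for a binary variable this is the point-biserial correlation). The helper name delay_correlation and the toy data are hypothetical; on the real data one would pass the df loaded earlier.

```python
import pandas as pd

def delay_correlation(frame, feature, target="DEP_DEL15"):
    """Pearson correlation between a numeric feature and the 0/1 delay flag."""
    return frame[feature].corr(frame[target])

# Toy data for illustration only (not taken from the flight-delay dataset).
toy = pd.DataFrame({
    "FLT_ATTENDANTS_PER_PASS": [0.0, 0.1, 0.2, 0.3],
    "DEP_DEL15": [0, 1, 1, 0],
})

r = delay_correlation(toy, "FLT_ATTENDANTS_PER_PASS")
print(f"correlation: {r:.3f}")
```

A value close to 0 would be consistent with the conclusion above, while values close to 1 or -1 would indicate a strong linear association. Correlation only captures linear relationships, so it complements the plot rather than replacing it.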
plt.figure(figsize=(10, 6))
sns.lineplot(x='MONTH_NAME', y='DEP_DEL15', data=df, estimator='mean',
ci=None)
plt.xlabel('Month')
plt.ylabel('Average Departure Delay (DEP_DEL15)')
plt.title('Average Departure Delay by Month')
plt.xticks(rotation=45)
plt.show()
# Drop the newly added column after plotting, if not needed anymore
df.drop('MONTH_NAME', axis=1, inplace=True)
As the plot shows, the share of delayed flights is highest in February, followed by January, and
lowest in March.
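The DEP_DEL15 values shown in the output are 0 or 1, so the per-month mean plotted above is the fraction of flights flagged as delayed. The sketch below shows one way to compute those fractions directly. It assumes a numeric MONTH column (1 to 12), which is an assumption because only MONTH_NAME appears in this notebook; the helper name delay_rate_by_month and the toy data are hypothetical.

```python
import calendar
import pandas as pd

def delay_rate_by_month(frame, month_col="MONTH", target="DEP_DEL15"):
    """Mean of the 0/1 delay flag per month, indexed by month name."""
    # Group by the numeric month so the result stays in calendar order.
    rates = frame.groupby(month_col)[target].mean().sort_index()
    # Replace month numbers with readable names (1 -> "January", ...).
    rates.index = [calendar.month_name[int(m)] for m in rates.index]
    return rates

# Toy data for illustration only (not taken from the flight-delay dataset).
toy = pd.DataFrame({
    "MONTH": [1, 1, 2, 2, 3, 3, 3, 3],
    "DEP_DEL15": [1, 0, 1, 1, 0, 0, 0, 1],
})

rates = delay_rate_by_month(toy)
print(rates)
```

Printing the resulting Series next to the line plot makes the month-to-month comparison explicit instead of relying on reading values off the chart.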
plt.figure(figsize=(50, 25))
sns.boxplot(x=decoded_airport_names, y='DEP_DEL15', data=df)
plt.xticks(rotation=45, ha='right')
plt.xlabel('Departing Airport')
plt.ylabel('Departure Delay (DEP_DEL15)')
plt.title('Departure Delay vs Departing Airport')
plt.show()
As we can see, only three airports cause a delay in flights: