Flight Fare Prediction Using ML Algorithms
September 8, 2023
1.2 3.LOAD DATA
[129]: #import data
data=pd.read_excel("Flight_Fare.xlsx")
[130]: data.head(4)
1.4 5.BASIC CHECKS
[131]: # to see the first five data
data.head()
Additional_Info Price
10678 No info 4107
10679 No info 4145
10680 No info 7229
10681 No info 12648
10682 No info 11753
[134]: # to see the size of the data
data.size
[134]: 117513
[138]: Price
count 10683.000000
mean 9087.064121
std 4611.359167
min 1759.000000
25% 5277.000000
50% 8372.000000
75% 12373.000000
max 79512.000000
[139]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Airline 10683 non-null object
1 Date_of_Journey 10683 non-null object
2 Source 10683 non-null object
3 Destination 10683 non-null object
4 Route 10682 non-null object
5 Dep_Time 10683 non-null object
6 Arrival_Time 10683 non-null object
7 Duration 10683 non-null object
8 Total_Stops 10682 non-null object
9 Additional_Info 10683 non-null object
10 Price 10683 non-null int64
dtypes: int64(1), object(10)
memory usage: 918.2+ KB
1.5.2 Insights
• Jet Airways is the costliest among all the airlines
• Jet Airways has the highest share of flights, followed by IndiGo
[141]: plt.figure(figsize=(10,6))
sns.countplot(x="Source",data=data)
plt.xticks(rotation=90)
1.5.3 Insights
• Delhi is the most common origin (take-off point) for flights, followed by Kolkata and Banglore.
[142]: plt.figure(figsize=(10,6))
sns.countplot(x="Destination",data=data)
plt.xticks(rotation=90)
1.5.4 Insights
• Cochin has the highest number of arriving flights from different places, followed by Banglore.
[143]: plt.figure(figsize=(10,6))
sns.countplot(x="Total_Stops",data=data)
plt.xticks(rotation=90)
1.5.5 Insights
• Most flights make a single stop between take-off and landing, followed by non-stop flights.
[144]: plt.figure(figsize=(10,6))
sns.countplot(x="Additional_Info",data=data)
plt.xticks(rotation=90)
1.5.6 Insights
• Most of the flights do not have any extra information
• There are a few flights with the extra information “in-flight meal not included”
[145]: # sweetviz is used for univariate
!pip install sweetviz
Report my_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY
not pop up, regardless, the report IS saved in your notebook/colab files.
1.5.7 Insights
• The majority of prices are within the 20,000 range, but there are some outliers.
• The most frequent airline is Jet Airways; however, Jet Airways Business has a much higher average price than the other airlines.
• Most flights depart from Delhi, and its average price is the highest.
• Cochin is the most heavily trafficked destination; New Delhi, however, has the highest average price.
• A little more than half of the flights make a single stop between the origin and destination, and around one-third are direct flights.
1.5.8 BIVARIATE
[147]: plt.figure(figsize=(12,6))
sns.barplot(x="Airline",y="Price",data=data)
plt.xticks(rotation=90)
1.5.9 Insights
• Jet Airways Business has the highest price when compared to others.
[148]: Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 1
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 1
Additional_Info 0
Price 0
dtype: int64
1.6.2 Insights
• There are only two null values
• 1 in Route
• 1 in Total_Stops
[149]: # We drop the two rows with null values
data.dropna(inplace=True)
[151]: data.isnull().sum()
[151]: Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 0
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 0
Additional_Info 0
Price 0
dtype: int64
1.7.2 Date
[152]: data["journey_Date"]= pd.to_datetime(data['Date_of_Journey'], format= "%d/%m/
↪%Y").dt.day
1.7.3 Month
[153]: data["journey_Month"]= pd.to_datetime(data['Date_of_Journey'], format= "%d/%m/
↪%Y").dt.month
[154]: data.head(3)
journey_Date journey_Month
0 24 3
1 1 5
2 9 6
• Since we have extracted the Date_of_Journey column into date and month, we can now drop it, as the original column is of no further use.
[155]: # dropping Date_of_Journey column as we have already extracted date and month
data.drop(['Date_of_Journey'],axis=1,inplace=True)
1.7.4 Hours
[156]: # Extracting Hours
data['Dep_hour']=pd.to_datetime(data['Dep_Time']).dt.hour #pd.to_datetime
1.7.5 Minutes
[157]: #Extracting minutes
data['Dep_min']=pd.to_datetime(data['Dep_Time']).dt.minute
[159]: data.head(5)
[159]: Airline Source Destination Route Arrival_Time \
0 IndiGo Banglore New Delhi BLR → DEL 01:10 22 Mar
1 Air India Kolkata Banglore CCU → IXR → BBI → BLR 13:15
2 Jet Airways Delhi Cochin DEL → LKO → BOM → COK 04:25 10 Jun
3 IndiGo Kolkata Banglore CCU → NAG → BLR 23:30
4 IndiGo Banglore New Delhi BLR → NAG → DEL 21:35
Dep_hour Dep_min
0 22 20
1 5 50
2 9 25
3 18 5
4 16 50
#Extracting minutes
data['Arrival_min']=pd.to_datetime(data['Arrival_Time']).dt.minute
[161]: data.head(3)
Dep_min Arrival_hour Arrival_min
0 20 1 10
1 50 13 15
2 25 4 25
duration = list(data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2:
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m"  # Adds 0 minutes
        else:
            duration[i] = "0h " + duration[i]  # Adds 0 hours

duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].split(sep="h")[0]))  # Extract hours from duration
    duration_mins.append(int(duration[i].split(sep="m")[0].split()[-1]))  # Extract minutes from duration
• Adding the “duration_hours” and “duration_mins” lists to the data frame and dropping the “Duration” column from it.
[163]: data["Duration_hours"] = duration_hours
data["Duration_mins"] = duration_mins
[164]: data.head(4)
2 No info 13882 9 6 9 25
3 No info 6218 12 5 18 5
Additional_Info
0 No info
1 No info
2 No info
3 No info
4 No info
plt.show()
1.8.1 Insights
• From the graph above we can see that Jet Airways has the highest price, while the rest are in roughly the same range.
[167]: data2=data.copy()
[168]: #OneHotEncoding
df1=pd.get_dummies(data2["Airline"],drop_first=True)
data2=pd.concat([data2,df1],axis=1).drop(["Airline"],axis=1)
[169]: data2.head(3)
[3 rows x 25 columns]
[170]: #OneHotEncoding
df2=pd.get_dummies(data2["Source"],drop_first=True)
data2=pd.concat([data2,df2],axis=1).drop(["Source"],axis=1)
[171]: data2.head(3)
1 Banglore CCU → IXR → BBI → BLR 2 stops No info 7662
2 Cochin DEL → LKO → BOM → COK 2 stops No info 13882
[3 rows x 28 columns]
[172]: #OneHotEncoding
df3=pd.get_dummies(data2["Destination"],drop_first=True)
data2=pd.concat([data2,df3],axis=1).drop(["Destination"],axis=1)
[173]: data2.head(4)
2 0 0 0
3 0 0 0
[4 rows x 32 columns]
[175]: data2.head(5)
[5 rows x 30 columns]
[176]: plt.figure(figsize=(8,5))
sns.countplot(data=data,x="Total_Stops")
[177]: data2['Total_Stops'].value_counts()
[178]: # Based on the observation from the above countplot and value counts we can manually encode the Total_Stops column
df=data2
df.head()
Arrival_hour Arrival_min Duration_hours Duration_mins … \
0 1 10 2 50 …
1 13 15 7 25 …
2 4 25 19 0 …
3 23 30 5 25 …
4 21 35 4 45 …
[5 rows x 30 columns]
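The cell that manually encodes Total_Stops is not visible in this export. Below is a sketch of the usual ordinal mapping, using the stop labels that appear in this dataset; the exact cell contents are an assumption:

```python
import pandas as pd

# Toy frame with the stop labels seen in the dataset's value counts.
stops_df = pd.DataFrame({"Total_Stops": ["non-stop", "1 stop", "2 stops", "3 stops", "4 stops"]})

# Manual ordinal encoding: each label becomes its number of stops.
stop_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
stops_df["Total_Stops"] = stops_df["Total_Stops"].map(stop_map)

print(stops_df["Total_Stops"].tolist())  # [0, 1, 2, 3, 4]
```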
[179]: x=df.drop("Price",axis=1)
x.head()
Hyderabad Kolkata New Delhi
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 1
[5 rows x 29 columns]
1.9 Scaling
[180]: from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
x=scaler.fit_transform(x)
print(x)
[[0. 0.88461538 0. … 0. 0. 1. ]
[0.5 0. 0.66666667 … 0. 0. 0. ]
[0.5 0.30769231 1. … 0. 0. 0. ]
…
[0. 1. 0.33333333 … 0. 0. 0. ]
[0. 0. 0. … 0. 0. 1. ]
[0.5 0.30769231 0.66666667 … 0. 0. 0. ]]
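Note that the scaler above is fitted on the full feature matrix before the train/test split, so test-set statistics influence the scaling parameters. A leakage-free variant fits on the training split only; this is a sketch on synthetic stand-in data, not the notebook's own cell:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for the notebook's x and y.
rng = np.random.default_rng(42)
x = rng.random((100, 3))
y = rng.random(100)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Fit the scaler on the training split only, then transform both splits.
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

print(round(x_train.min(), 6), round(x_train.max(), 6))  # 0.0 1.0
```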
5 11 25 2 25
6 10 25 15 30
7 5 5 21 5
8 10 25 25 30
9 19 15 7 50
[182]: data2.corr()
Duration_mins
Total_Stops -0.490055
Price -0.542767
journey_Date 0.487520
journey_Month -0.332520
Dep_hour 0.528181
Dep_min 0.304109
Arrival_hour 0.362001
Arrival_min 0.080203
Duration_hours -0.585949
Duration_mins 1.000000
plt.figure(figsize=(20,15))
sns.heatmap(data2.corr(),annot = True, cmap = "RdYlGn")
plt.tick_params(labelsize=11)
1.10.1 Insights
• We would have to drop a column if two independent columns were highly correlated, but we don't have any such pair.
• The few cells that do show high correlation are between independent columns and the dependent column (Price).
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=42)
[186]: print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(8011, 29)
(8011,)
(2671, 29)
(2671,)
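The cell that fits the linear model and produces the y_pred used in the metrics below is not visible in this export. Below is a runnable sketch of that step on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic regression data standing in for the notebook's x and y.
rng = np.random.default_rng(0)
x = rng.random((200, 5))
y = x @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(0, 0.1, 200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Fit and predict: this is the step that produces y_pred for the metrics below.
lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

print(round(r2_score(y_test, y_pred), 3))
```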
[189]: mse=mean_squared_error(y_test,y_pred)
print(mse)
mae=mean_absolute_error(y_test,y_pred)
print(mae)
7835152.949901841
1949.458356115105
44.15267099638599
[191]: lr_score=r2_score(y_test,y_pred)
lr_score
[191]: 0.6198931301596473
[192]: 0.6180333675296419
1.13 KNN
[193]: # for the model creation we have to separate the independent and dependent variables
x=df.drop("Price",axis=1)
y=df["Price"]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=42)
[196]: (y_test!=y_pred).sum()
[196]: 2639
[197]: len(y_test)
[197]: 2671
[198]: (y_test!=y_pred).sum()/len(y_test)
[198]: 0.9880194683639086
[200]: ERROR_RATE
[200]: [0.7865967802321228,
0.9183826282291276,
0.9591913141145638,
0.9790340696368401,
0.9880194683639086,
0.9962560838637214,
0.9970048670909771,
0.9992512167727443,
0.9996256083863722,
1.0,
1.0,
0.9996256083863722]
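The cell that builds ERROR_RATE is not visible in this export. The values are consistent with, for each k, the share of test predictions not exactly equal to the true fare (exact equality is rare in regression, so the MAE/RMSE below are more informative). A sketch on synthetic data, where the loop bounds and the metric are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the notebook's features and prices.
rng = np.random.default_rng(1)
x = rng.random((200, 4))
y = (x.sum(axis=1) * 1000).round()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# For each k, record the fraction of test predictions not exactly equal
# to the true value (the "error rate" listed above).
ERROR_RATE = []
for k in range(1, 13):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    ERROR_RATE.append(float((y_test != y_pred).mean()))

print(ERROR_RATE)
```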
[204]: mse=mean_squared_error(y_test,y_pred)
mse
[204]: 8908715.681299139
[205]: mae=mean_absolute_error(y_test,y_pred)
mae
[205]: 1845.3384500187196
[206]: knn_score=r2_score(y_test,y_pred)
knn_score
[206]: 0.5678113683844929
[207]: adj_r2=1-(1-knn_score)*(2671-1)/(2671-13-1)
adj_r2
[207]: 0.5656967834349251
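The adjusted R² used here and in later sections is adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n observations and p predictors. A small helper reproduces the value above. Note the notebook uses p = 13 throughout, while the encoded feature matrix has 29 columns; p = 29 would arguably be the consistent choice:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Reproduce the KNN value above: r2 ~ 0.5678, n = 2671 test rows, p = 13.
print(adjusted_r2(0.5678113683844929, 2671, 13))  # ~ 0.5657
```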
x_train,x_test,y_train,y_test=train_test_split(x,y, test_size=0.25,random_state=42)
[212]: mse=mean_squared_error(y_test,y_pred)
mse
[212]: 5930858.2890817
[213]: mae=mean_absolute_error(y_test,y_pred)
mae
[213]: 1350.9729190066143
[214]: dt_score=r2_score(y_test,y_pred)
dt_score
[214]: 0.7122762000762477
[215]: adj_r2=1-(1-dt_score)*(2671-1)/(2671-13-1)
adj_r2
[215]: 0.7108684434337904
[218]: x.shape
[219]: y.shape
[219]: (10682,)
[222]: 1159.2350776537098
[223]: 4118971.092235592
[224]: 2029.524843956238
[233]: r2=r2_score(y_test,y_pred)
r2
[233]: 0.8001763055752238
[234]: adj_r2=1-(1-r2)*(2671-1)/(2671-13-1)
adj_r2
[234]: 0.7991986209581662
[235]: plt.figure(figsize=(6,4))
sns.distplot(y_test-y_pred)
plt.show()
plt.ylabel("y_pred")
plt.figure(figsize=(6,4))
plt.show()
[240]: rf_random=RandomizedSearchCV(estimator=random_forest,param_distributions=random_grid,scoring='
[241]: rf_random.fit(x_train,y_train)
[242]: rf_random.best_params_
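The random_grid definition and the full RandomizedSearchCV call are truncated in this export. A self-contained sketch of the typical setup on synthetic data; the search space and settings here are assumptions, not the notebook's:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the notebook's features and prices.
rng = np.random.default_rng(7)
x = rng.random((300, 5))
y = x.sum(axis=1) * 1000 + rng.normal(0, 50, 300)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Hypothetical search space: the notebook's own random_grid is truncated
# in this export, so these ranges are illustrative only.
random_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

rf_random = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=random_grid,
    n_iter=5,                          # sample 5 of the 27 combinations
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
rf_random.fit(x_train, y_train)
print(rf_random.best_params_)
```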
[245]: MAE=mean_absolute_error(y_test,y_pred)
MAE
[245]: 1132.3857159768027
[246]: MSE=mean_squared_error(y_test,y_pred)
MSE
[246]: 3832499.572301327
[247]: 1957.6770858089255
[248]: random_forest.score(x_train,y_train)
[248]: 0.9109941911759742
[249]: random_forest.score(x_test,y_test)
[249]: 0.8140739018872378
[250]: # R2 score
rf_score=metrics.r2_score(y_test,y_pred)
rf_score
[250]: 0.8140739018872378
[251]: prediction=rf_random.predict(x_test)
[252]: plt.figure(figsize=(6,4))
sns.distplot(y_test-prediction)
plt.show()
1.16.1 Insight:
• The residuals appear approximately normally distributed
[253]: plt.scatter(y_test, y_pred, alpha = 0.9)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.figure(figsize=(6,4))
plt.show()
1.16.2 Insight:
• The observations are scattered roughly along a straight line, indicating predictions close to the actual values
1.17 GRADIENT BOOSTING
[254]: # for the model creation we have to separate the independent and dependent variables
x=df.drop("Price",axis=1)
y=df["Price"]
[258]: mse=mean_squared_error(y_test,y_hat)
mse
[258]: 4306336.397248144
[259]: mae=mean_absolute_error(y_test,y_hat)
mae
[259]: 1488.1156304342967
[260]: gb_score=r2_score(y_test,y_hat)
gb_score
[260]: 0.7910866502665936
[261]: adj_r2=1-(1-gb_score)*(2671-1)/(2671-13-1)
adj_r2
[261]: 0.7900644923642473
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'subsample': [0.8, 0.9, 1.0]
}
[274]: rsc=RandomizedSearchCV(estimator=gb_regressor,param_distributions=param_grid,scoring='neg_mean
[279]: rsc.best_params_
[293]: mse=mean_squared_error(y_test,y_hat)
mse
[293]: 3052127.0810980895
[294]: mae=mean_absolute_error(y_test,y_hat)
mae
[294]: 1165.4653521509392
[295]: gbst_score=r2_score(y_test,y_hat)
gbst_score
[295]: 0.8519321219931385
1.19 10.RESULT
1.19.1 Comparison of the Best Models Evaluated by Cross Validation
• LinearRegressor - CV: 0.61
• KNeighborsRegressor - CV: 0.56
• DecisionTreeRegressor - CV: 0.70
• RandomForestRegressor - CV: 0.81
• GradientBoostingRegressor - CV: 0.85
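The cross-validation cells behind the comparison above are not shown in this export. A sketch of how such a comparison is typically computed with cross_val_score, on synthetic data and with two of the five models; the fold count is an assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the notebook's features and prices.
rng = np.random.default_rng(5)
x = rng.random((200, 5))
y = x @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(0, 0.1, 200)

# Mean 5-fold CV R^2 per model (fold count is an assumption).
for name, model in [("LinearRegression", LinearRegression()),
                    ("DecisionTreeRegressor", DecisionTreeRegressor(random_state=42))]:
    cv = cross_val_score(model, x, y, cv=5, scoring="r2")
    print(name, round(cv.mean(), 2))
```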
[298]: scores = [lr_score,knn_score,dt_score,rf_score,gbst_score]
algorithms = ["Linear Regression","KNN","Decision Tree","Random Forest","Gradient Boosting"]
for i in range(len(algorithms)):
    print("The R2 score achieved using "+algorithms[i]+" is: "+str(scores[i]))
[306]: plt.figure(figsize=(10,6))
plt.xlabel("Algorithms")
plt.ylabel("R2 score")
ax=sns.barplot(x=algorithms,y=scores)
for label in ax.containers:
ax.bar_label(label)
plt.tight_layout()
plt.tick_params(labelsize=14)
1.20 Conclusion
• The best model is Gradient Boosting, with an r2_score of 0.85.
• The second-best model is Random Forest, with an r2_score of 0.81.
• Some of the features with the highest impact on price are Total_Stops, Duration, Airline and Route.