Flight-Price-Prediction - Flight - Price - Ipynb at Master Mandal-21 - Flight-Price-Prediction
Flight-Price-Prediction / flight_price.ipynb
Mandal-21
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
Importing dataset
1. Since the data is in the form of an Excel file, we have to use pandas read_excel to load it.
2. After loading, it is important to check the complete information of the data, as it can reveal hidden information such as null values in a column or a row.
3. Check whether any null values are present. If there are, the following can be done:
   A. Impute the data using an imputation method in sklearn
   B. Fill NaN values with the mean, median, or mode using the fillna() method
4. Describe the data --> this gives a statistical summary
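The two options in step 3 can be sketched as follows. This is a minimal illustration on a made-up frame, not the notebook's data: the column values are assumptions, and the notebook itself takes the simpler route of dropping incomplete rows with dropna().

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; the values are made up for illustration.
df = pd.DataFrame({
    "Price": [3897.0, np.nan, 13882.0, 6218.0],
    "Total_Stops": ["non-stop", "2 stops", None, "2 stops"],
})

# Option B: fillna() with the median for numbers and the mode for categories.
filled = df.copy()
filled["Price"] = filled["Price"].fillna(filled["Price"].median())
filled["Total_Stops"] = filled["Total_Stops"].fillna(filled["Total_Stops"].mode()[0])

# Option A: SimpleImputer from sklearn, here with the mean strategy.
imputer = SimpleImputer(strategy="mean")
price_imputed = imputer.fit_transform(df[["Price"]]).ravel()

print(filled.isnull().sum().sum())  # total remaining nulls
```

Which strategy is appropriate depends on how many rows are affected; with only a handful of incomplete rows, dropping them loses very little.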
In [2]:
train_data = pd.read_excel(r"E:\MachineLearni
In [3]:
pd.set_option('display.max_columns', None)
In [4]:
train_data.head()
   Airline      Date_of_Journey  Source    Destination  ...
0  IndiGo       24/03/2019       Banglore  New Delhi    ...
1  Air India    1/05/2019        Kolkata   Banglore     ...
2  Jet Airways  9/06/2019        Delhi     Cochin       ...
4  IndiGo       01/03/2019       Banglore  New Delhi    ...
In [5]:
train_data.info()
In [6]:
train_data["Duration"].value_counts()
1h 30m 386
2h 45m 337
2h 55m 337
2h 35m 329
...
42h 5m 1
28h 30m 1
36h 25m 1
40h 20m 1
30h 25m 1
In [7]:
train_data.dropna(inplace = True)
In [8]:
train_data.isnull().sum()
Out[8]: Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 0
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 0
Additional_Info 0
Price 0
dtype: int64
EDA
From the info() output we can see that Date_of_Journey has the object data type, so it has to be converted to a datetime before the day and month can be extracted.
In [9]:
train_data["Journey_day"] = pd.to_datetime(tr
In [10]:
train_data["Journey_month"] = pd.to_datetime(
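The two truncated cells above derive a day and a month column from the date string. A minimal sketch of that step, assuming the day/month/year layout visible in the head() output (the explicit format string is an assumption, not taken from the notebook):

```python
import pandas as pd

# Hypothetical sample in the day/month/year layout seen in the head() output.
df = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# Parse once with an explicit format so 1/05/2019 is read as 1 May, not 5 January.
journey_date = pd.to_datetime(df["Date_of_Journey"], format="%d/%m/%Y")

df["Journey_day"] = journey_date.dt.day
df["Journey_month"] = journey_date.dt.month

# The raw string column is no longer needed once the parts are extracted.
df = df.drop(columns=["Date_of_Journey"])
print(df)
```

The year is not extracted because every journey in the sample shown falls in 2019, so it would carry no information for the model.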
In [11]:
train_data.head()
   Airline      Date_of_Journey  Source    Destination  ...
0  IndiGo       24/03/2019       Banglore  New Delhi    ...
1  Air India    1/05/2019        Kolkata   Banglore     ...
2  Jet Airways  9/06/2019        Delhi     Cochin       ...
4  IndiGo       01/03/2019       Banglore  New Delhi    ...
In [12]:
# Since we have converted Date_of_Journey col
train_data.drop(["Date_of_Journey"], axis = 1
In [13]:
# Departure time is when a plane leaves the g
# Similar to Date_of_Journey we can extract v
# Extracting Hours
train_data["Dep_hour"] = pd.to_datetime(train
# Extracting Minutes
train_data["Dep_min"] = pd.to_datetime(train_
In [14]:
train_data.head()
   Airline      Source    Destination  Route                  Arrival_Time  ...
0  IndiGo       Banglore  New Delhi    BLR → DEL              01:10 22 Mar  ...
1  Air India    Kolkata   Banglore     CCU → IXR → BBI → BLR  13:15         ...
2  Jet Airways  Delhi     Cochin       DEL → LKO → BOM → COK  04:25 10 Jun  ...
3  IndiGo       Kolkata   Banglore     CCU → NAG → BLR        23:30         ...
4  IndiGo       Banglore  New Delhi    BLR → NAG → DEL        21:35         ...
In [15]:
# Arrival time is when the plane pulls up to
# Similar to Date_of_Journey we can extract v
# Extracting Hours
train_data["Arrival_hour"] = pd.to_datetime(t
# Extracting Minutes
train_data["Arrival_min"] = pd.to_datetime(tr
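Both time columns are split into an hour and a minute feature. The sketch below shows one way to do this that also tolerates the trailing date some arrival values carry (for example "01:10 22 Mar"). The helper name split_clock and the sample departure times are hypothetical, and slicing the first five characters is an assumption about the string layout rather than the notebook's own approach.

```python
import pandas as pd

# Hypothetical sample; some arrival values carry a trailing date such as "22 Mar".
df = pd.DataFrame({
    "Dep_Time": ["22:20", "05:50", "09:25"],
    "Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"],
})

def split_clock(series: pd.Series) -> pd.DataFrame:
    """Return hour and minute columns from strings that start with HH:MM."""
    # Keep only the leading HH:MM so a trailing date does not affect parsing.
    clock = pd.to_datetime(series.str.slice(0, 5), format="%H:%M")
    return pd.DataFrame({"hour": clock.dt.hour, "min": clock.dt.minute})

dep = split_clock(df["Dep_Time"])
arr = split_clock(df["Arrival_Time"])
df["Dep_hour"], df["Dep_min"] = dep["hour"], dep["min"]
df["Arrival_hour"], df["Arrival_min"] = arr["hour"], arr["min"]
df = df.drop(columns=["Dep_Time", "Arrival_Time"])
print(df)
```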
In [16]:
train_data.head()
   Airline      Source    Destination  Route                  Duration  ...
0  IndiGo       Banglore  New Delhi    BLR → DEL              2h 50m    ...
1  Air India    Kolkata   Banglore     CCU → IXR → BBI → BLR  7h 25m    ...
2  Jet Airways  Delhi     Cochin       DEL → LKO → BOM → COK  19h       ...
3  IndiGo       Kolkata   Banglore     CCU → NAG → BLR        5h 25m    ...
4  IndiGo       Banglore  New Delhi    BLR → NAG → DEL        4h 45m    ...
In [17]:
# Time taken by plane to reach destination is
# It is the difference between Departure Time
duration = list(train_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2: # Ch
        if "h" in duration[i]:
            duration[i] = duration[i].strip()
        else:
duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].spl
    duration_mins.append(int(duration[i].spli
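Because the loop above is cut off in this copy, here is a hedged alternative that reaches the same two columns with vectorized string methods instead of a Python loop. It is a sketch of a different technique (regular-expression extraction), not a reconstruction of the truncated lines, and the sample values are assumptions covering the cases the loop checks for: hours and minutes, hours only, and minutes only.

```python
import pandas as pd

# Hypothetical sample: hours and minutes, hours only, minutes only.
duration = pd.Series(["2h 50m", "19h", "5m", "7h 25m"])

# Capture the optional hour and minute parts; a missing part becomes NaN,
# which is replaced by "0" before converting to integers.
hours = duration.str.extract(r"(\d+)\s*h", expand=False).fillna("0").astype(int)
mins = duration.str.extract(r"(\d+)\s*m", expand=False).fillna("0").astype(int)

parsed = pd.DataFrame({"Duration_hours": hours, "Duration_mins": mins})
print(parsed)
```

Treating the missing part as zero keeps both columns purely numeric, which is what the model needs.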
In [18]:
# Adding duration_hours and duration_mins lis
train_data["Duration_hours"] = duration_hours
train_data["Duration_mins"] = duration_mins
In [19]:
train_data.drop(["Duration"], axis = 1, inpla
In [20]:
train_data.head()
   Airline      Source    Destination  Route                  Total_Stops  ...
0  IndiGo       Banglore  New Delhi    BLR → DEL              non-stop     ...
1  Air India    Kolkata   Banglore     CCU → IXR → BBI → BLR  2 stops      ...
2  Jet Airways  Delhi     Cochin       DEL → LKO → BOM → COK  2 stops      ...
3  IndiGo       Kolkata   Banglore     CCU → NAG → BLR        1 stop       ...
4  IndiGo       Banglore  New Delhi    BLR → NAG → DEL        1 stop       ...
In [21]:
train_data["Airline"].value_counts()
IndiGo 2053
SpiceJet 818
Vistara 479
GoAir 194
Trujet 1
In [22]:
# From graph we can see that Jet Airways Busi
# Apart from the first Airline almost all are
# Airline vs Price
In [23]:
# As Airline is Nominal Categorical data we w
Airline = train_data[["Airline"]]
Airline.head()
Out[23]:
   Airline_Air India  Airline_GoAir  Airline_IndiGo  Airline_Jet Airways  ...
0  0                  0              1               0                    ...
1  1                  0              0               0                    ...
2  0                  0              0               1                    ...
3  0                  0              1               0                    ...
4  0                  0              1               0                    ...
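The indicator columns shown in Out[23] are what one-hot encoding of a nominal column produces. A minimal sketch with pandas get_dummies follows; the sample rows are hypothetical, and drop_first=True is an assumption suggested by the output having no indicator for the alphabetically first airline.

```python
import pandas as pd

# Hypothetical sample of the nominal Airline column.
df = pd.DataFrame({"Airline": ["IndiGo", "Air India", "Jet Airways", "Air Asia"]})

# drop_first=True drops the alphabetically first category ("Air Asia" here),
# which is then represented by all remaining indicator columns being 0.
airline = pd.get_dummies(df[["Airline"]], drop_first=True, dtype=int)
print(airline)
```

Dropping one level avoids a redundant column, since its value is implied by the others.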
In [24]:
train_data["Source"].value_counts()
Kolkata 2871
Banglore 2197
Mumbai 697
Chennai 381
In [25]:
# Source vs Price
In [26]:
# As Source is Nominal Categorical data we wi
Source = train_data[["Source"]]
Source.head()
   Source_Chennai  Source_Delhi  Source_Kolkata  ...
0  0               0             0               ...
1  0               0             1               ...
2  0               1             0               ...
3  0               0             1               ...
4  0               0             0               ...
In [27]:
train_data["Destination"].value_counts()
Banglore 2871
Delhi 1265
Hyderabad 697
Kolkata 381
In [28]:
# As Destination is Nominal Categorical data
Destination = train_data[["Destination"]]
Destination.head()
Out[28]:
Destination_Cochin Destination_Delhi Destination_H
0 0 0
1 0 0
2 1 0
3 0 0
4 0 0
In [29]:
train_data["Route"]
Out[29]: 0    BLR → DEL
         3    CCU → NAG → BLR
         4    BLR → NAG → DEL
         ...
In [30]:
# Additional_Info contains almost 80% no_info
# Route and Total_Stops are related to each o
train_data.drop(["Route", "Additional_Info"],
In [31]:
train_data["Total_Stops"].value_counts()
non-stop 3491
2 stops 1520
3 stops 45
4 stops 1
In [32]:
# As this is case of Ordinal Categorical type
# Here Values are assigned with corresponding
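For an ordinal column the labels are replaced by integers that keep their order. A small sketch, assuming the five labels that appear in the value counts and in the test output; the dictionary name stops_map is hypothetical.

```python
import pandas as pd

# Hypothetical sample of the Total_Stops column.
df = pd.DataFrame({"Total_Stops": ["non-stop", "2 stops", "1 stop", "4 stops"]})

# Ordinal encoding: the integer is the number of stops, so order is preserved.
stops_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
df["Total_Stops"] = df["Total_Stops"].map(stops_map)

# map() yields NaN for unseen labels, which makes a typo easy to spot.
assert df["Total_Stops"].notna().all()
print(df["Total_Stops"].tolist())
```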
In [33]:
train_data.head()
   Airline      Source   Destination  Total_Stops  Price  ...
1  Air India    Kolkata  Banglore     2            7662   ...
2  Jet Airways  Delhi    Cochin       2            13882  ...
In [34]:
# Concatenate dataframe --> train_data + Airl
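The concatenation step joins the original frame with the indicator frames column-wise. A minimal sketch with a single indicator frame; the sample rows are hypothetical and only the names train_data and data_train come from the notebook.

```python
import pandas as pd

# Hypothetical frames that share the same row index.
train_data = pd.DataFrame({"Airline": ["IndiGo", "Air India"], "Price": [3897, 7662]})
airline = pd.get_dummies(train_data[["Airline"]], drop_first=True, dtype=int)

# axis=1 places the frames side by side, aligning rows on the index.
data_train = pd.concat([train_data, airline], axis=1)

# The original text column is redundant once its indicators are attached.
data_train = data_train.drop(columns=["Airline"])
print(data_train)
```

Because concat aligns on the index, frames derived from the same source rows line up correctly even after rows have been dropped.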
In [35]:
data_train.head()
Out[35]:
   Airline      Source   Destination  Total_Stops  Price  ...
1  Air India    Kolkata  Banglore     2            7662   ...
2  Jet Airways  Delhi    Cochin       2            13882  ...
In [36]:
data_train.drop(["Airline", "Source", "Destin
In [37]:
data_train.head()
Out[37]:
Total_Stops Price Journey_day Journey_month D
0 0 3897 24 3
1 2 7662 1 5
2 2 13882 9 6
3 1 6218 12 5
4 1 13302 1 3
In [38]:
data_train.shape
In [40]:
test_data.head()
   Airline            Date_of_Journey  Source    Destination  ...
0  Jet Airways        6/06/2019        Delhi     Cochin       ...
2  Jet Airways        21/05/2019       Delhi     Cochin       ...
3  Multiple carriers  21/05/2019       Delhi     Cochin       ...
4  Air Asia           24/06/2019       Banglore  Delhi        ...
In [41]:
# Preprocessing
print("-"*75)
print(test_data.info())
print()
print()
print("-"*75)
test_data.dropna(inplace = True)
print(test_data.isnull().sum())
# EDA
# Date_of_Journey
test_data["Journey_day"] = pd.to_datetime(tes
test_data["Journey_month"] = pd.to_datetime(t
test_data.drop(["Date_of_Journey"], axis = 1,
# Dep_Time
test_data["Dep_hour"] = pd.to_datetime(test_d
test_data["Dep_min"] = pd.to_datetime(test_da
test_data.drop(["Dep_Time"], axis = 1, inplac
# Arrival_Time
test_data["Arrival_hour"] = pd.to_datetime(te
test_data["Arrival_min"] = pd.to_datetime(tes
test_data.drop(["Arrival_Time"], axis = 1, in
# Duration
duration = list(test_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2: # Ch
        if "h" in duration[i]:
            duration[i] = duration[i].strip()
        else:
duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].spl
    duration_mins.append(int(duration[i].spli
test_data["Duration_hours"] = duration_hours
test_data["Duration_mins"] = duration_mins
# Categorical data
print("Airline")
print("-"*75)
print(test_data["Airline"].value_counts())
Airline = pd.get_dummies(test_data["Airline"]
print()
print("Source")
print("-"*75)
print(test_data["Source"].value_counts())
Source = pd.get_dummies(test_data["Source"],
print()
print("Destination")
print("-"*75)
print(test_data["Destination"].value_counts()
Destination = pd.get_dummies(test_data["Desti
# Replacing Total_Stops
print()
print()
---------------------------------------------------------------------------
dtypes: object(10)
None
Null values :
---------------------------------------------------------------------------
Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 0
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 0
Additional_Info 0
dtype: int64
Airline
---------------------------------------------------------------------------
IndiGo 511
SpiceJet 208
Vistara 129
Air Asia 86
GoAir 46
Source
---------------------------------------------------------------------------
Delhi 1145
Kolkata 710
Banglore 555
Mumbai 186
Chennai 75
Destination
---------------------------------------------------------------------------
Cochin 1145
Banglore 710
Delhi 317
Hyderabad 186
Kolkata 75
In [42]:
data_test.head()
Out[42]:
Total_Stops Journey_day Journey_month Dep_hou
0 1 6 6 1
1 1 12 5
2 1 21 5 1
3 1 21 5
4 0 24 6 2
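One thing to watch when the training and test sets are encoded separately: a category that occurs in only one of them (the training columns include Airline_Trujet, a carrier with a single row) leaves the two frames with different columns. The sketch below shows one possible remedy using reindex. It is an assumption about how the mismatch could be handled, not a step taken from the notebook, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical samples: "Trujet" occurs only in the training rows.
train = pd.DataFrame({"Airline": ["IndiGo", "Trujet", "Air India"]})
test = pd.DataFrame({"Airline": ["IndiGo", "Air India"]})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# Reindex the test frame to the training columns: indicators missing from the
# test set are added as all-zero columns, and extra ones would be dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```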
Feature Selection
Find the features that contribute most to, and correlate best with, the target variable.
Some feature selection methods are:
1. heatmap
2. feature_importances_
3. SelectKBest
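Of the three methods, SelectKBest is listed but not demonstrated in the cells that follow. A minimal sketch on synthetic data: the column names are borrowed for readability only, the numbers are random, and the choice of f_regression as the scoring function is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the encoded training frame; the numbers are random.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Total_Stops": rng.integers(0, 4, size=200),
    "Duration_hours": rng.integers(1, 30, size=200),
    "Dep_min": rng.integers(0, 60, size=200),
})
# Price depends on the first two columns only, plus a little noise.
y = 3000 * X["Total_Stops"] + 200 * X["Duration_hours"] + rng.normal(0, 100, size=200)

# Score each feature with a univariate F-test and keep the two best.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(selected)
```

A univariate test like this scores each feature on its own, so it cannot see interactions between features; tree-based importances can.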
In [43]:
data_train.shape
In [44]:
data_train.columns
'Airline_Multiple carriers',
'Airline_Trujet', 'Airline_Vistara',
'Airline_Vistara Premium economy',
'Source_Chennai', 'Source_Delhi', 'Source_Kolkata', 'Source_Mumbai',
'Destination_Cochin', 'Destination_Delhi', 'Destination_Hyderabad',
'Destination_Kolkata', 'Destination_New Delhi'],
dtype='object')
In [45]:
X = data_train.loc[:, ['Total_Stops', 'Journe
'Dep_min', 'Arrival_hour', 'Arrival_mi
'Duration_mins', 'Airline_Air India',
'Airline_Jet Airways', 'Airline_Jet Ai
'Airline_Multiple carriers',
Out[45]:
Total_Stops Journey_day Journey_month Dep_hou
0 0 24 3 2
1 2 1 5
2 2 9 6
3 1 12 5 1
4 1 1 3 1
In [46]:
y = data_train.iloc[:, 1]
y.head()
Out[46]: 0 3897
1 7662
2 13882
3 6218
4 13302
In [47]:
# Finds correlation between Independent and d
plt.figure(figsize = (18,18))
plt.show()
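The heatmap cell visualizes the pairwise correlation matrix. The sketch below shows how such a plot could be drawn and how the same matrix can be reduced to a ranking against the target. The data are synthetic, the helper name plot_correlation_heatmap is hypothetical, and the colour map is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def plot_correlation_heatmap(frame: pd.DataFrame) -> None:
    """Draw an annotated heatmap of pairwise Pearson correlations."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(18, 18))
    sns.heatmap(frame.corr(), annot=True, cmap="RdYlGn")
    plt.show()

# Synthetic stand-in for the encoded training frame (made-up numbers).
rng = np.random.default_rng(1)
stops = rng.integers(0, 4, size=300)
frame = pd.DataFrame({
    "Total_Stops": stops,
    "Dep_min": rng.integers(0, 60, size=300),
    "Price": 3000 * stops + rng.normal(0, 500, size=300),
})

# The same matrix the heatmap shows, reduced to the column for the target.
price_corr = frame.corr()["Price"].drop("Price").sort_values(ascending=False)
print(price_corr)
```

Calling plot_correlation_heatmap(frame) draws the figure; the sorted series is often easier to read when there are many one-hot columns.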
In [48]:
# Important feature using ExtraTreesRegressor
max_depth=None, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
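The fitting code for this cell is missing from this copy; only the estimator's printed parameters survive. As a hedged sketch of the technique the comment names, the block below fits an ExtraTreesRegressor on synthetic data and reads its feature_importances_. The variable name selection matches the later cells, while the data and random_state are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic stand-in for X and y (made-up numbers, borrowed column names).
rng = np.random.default_rng(2)
X = pd.DataFrame({
    "Total_Stops": rng.integers(0, 4, size=300),
    "Journey_day": rng.integers(1, 28, size=300),
    "Dep_min": rng.integers(0, 60, size=300),
})
y = 3000 * X["Total_Stops"] + rng.normal(0, 200, size=300)

# random_state is fixed here only to make the sketch repeatable.
selection = ExtraTreesRegressor(n_estimators=100, random_state=0)
selection.fit(X, y)

# Impurity-based importances are non-negative and sum to 1.
feat_importances = pd.Series(selection.feature_importances_, index=X.columns)
print(feat_importances.sort_values(ascending=False))
```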
In [49]:
print(selection.feature_importances_)
2.50235698e-02]
In [50]:
#plot graph of feature importances for better
plt.figure(figsize = (12,8))
feat_importances = pd.Series(selection.featur
feat_importances.nlargest(20).plot(kind='barh
plt.show()
In [51]:
from sklearn.model_selection import train_tes
X_train, X_test, y_train, y_test = train_test
In [52]:
from sklearn.ensemble import RandomForestRegr
reg_rf = RandomForestRegressor()
reg_rf.fit(X_train, y_train)
max_depth=None, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
In [53]:
y_pred = reg_rf.predict(X_test)
In [54]:
reg_rf.score(X_train, y_train)
Out[54]: 0.9539164511170628
In [55]:
reg_rf.score(X_test, y_test)
Out[55]: 0.798383043987616
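The gap between the training score (about 0.95) and the test score (about 0.80) is the usual pattern for a random forest with default settings. The following sketch reproduces the split, fit, and score sequence on synthetic data. The test_size and random_state values are assumptions, since the corresponding line is truncated above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for X and y.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# test_size and random_state are assumptions chosen for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg_rf = RandomForestRegressor(n_estimators=100, random_state=0)
reg_rf.fit(X_train, y_train)

# score() returns R^2; the training value is optimistic because the forest
# has already seen those rows, so the held-out value is the one to report.
train_r2 = reg_rf.score(X_train, y_train)
test_r2 = reg_rf.score(X_test, y_test)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```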
In [56]:
sns.histplot(y_test - y_pred, kde=True)  # distplot is deprecated in recent seaborn releases
plt.show()
In [57]:
plt.scatter(y_test, y_pred, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()
In [58]:
from sklearn import metrics
In [59]:
print('MAE:', metrics.mean_absolute_error(y_t
print('MSE:', metrics.mean_squared_error(y_te
print('RMSE:', np.sqrt(metrics.mean_squared_e
MAE: 1172.5455945373583
MSE: 4347276.1614450775
RMSE: 2085.0122688955757
In [60]:
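# Normalized RMSE = RMSE / (max(DV) - min(DV)), where DV is the dependent variable (y).
# The RMSE is hard-coded here as 2090.5509, which differs from the 2085.0123 printed above.
2090.5509/(max(y)-min(y))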
Out[60]: 0.026887077025966846
In [61]:
metrics.r2_score(y_test, y_pred)
Out[61]: 0.7983830439876158
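The metrics used in the cells above relate to each other as shown in this small sketch. The four price pairs are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted prices.
y_true = np.array([3000.0, 5000.0, 8000.0, 12000.0])
y_hat = np.array([3500.0, 4500.0, 9000.0, 11000.0])

mae = metrics.mean_absolute_error(y_true, y_hat)
mse = metrics.mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)

# Dividing RMSE by the target's range gives a unit-free error.
nrmse = rmse / (y_true.max() - y_true.min())
r2 = metrics.r2_score(y_true, y_hat)

print(f"MAE={mae:.1f} MSE={mse:.1f} RMSE={rmse:.1f} NRMSE={nrmse:.4f} R2={r2:.4f}")
```

Computing the normalized value from the rmse variable, rather than typing the number in, keeps it in step with the model if the notebook is re-run.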
In [ ]:
Hyperparameter Tuning
Choose one of the following methods for hyperparameter tuning:
1. RandomizedSearchCV --> Fast
2. GridSearchCV
Assign the hyperparameters in the form of a dictionary.
Fit the model.
Check the best parameters and the best score.
In [62]:
from sklearn.model_selection import Randomize
In [63]:
#Randomized Search CV
In [64]:
# Create the random grid
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_sampl
'min_samples_leaf': min_sample
In [65]:
# Random search of parameters, using 5 fold c
# search across 100 different combinations
rf_random = RandomizedSearchCV(estimator = re
In [66]:
rf_random.fit(X_train,y_train)
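Since the grid definition and the search call are truncated in this copy, here is a hedged, self-contained sketch of a randomized search over a random forest. The candidate values, the scoring choice, and the small n_iter and cv settings are assumptions chosen to keep the example quick, whereas the comments above describe 5 folds and 100 combinations. Note that newer scikit-learn releases no longer accept max_features='auto', so the sketch uses 'sqrt' and 1.0 instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train and y_train.
X_train, y_train = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Candidate values are illustrative; they are kept small so the search is quick.
random_grid = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", 1.0],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

rf_random = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=random_grid,
    n_iter=5,                           # number of sampled combinations
    cv=3,                               # folds per combination
    scoring="neg_mean_squared_error",
    random_state=42,
    n_jobs=1,
)
rf_random.fit(X_train, y_train)

print(rf_random.best_params_)
print(rf_random.best_score_)            # negative MSE: closer to 0 is better
```

After fitting, rf_random.predict uses the best estimator refitted on the full training data, so it can be evaluated on the held-out set in the same way as the untuned model.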