Flight-Price-Prediction - Flight - Price - Ipynb at Master Mandal-21 - Flight-Price-Prediction

Mandal-21 / Flight-Price-Prediction Public
Flight-Price-Prediction / flight_price.ipynb
Mandal-21
Flight Price Prediction
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
Importing dataset
1. Since the data is an Excel file, use pandas read_excel to load it
2. After loading, check the complete information of the data, as it can reveal hidden information such as null values in a column or a row
3. Check whether any null values are present. If they are, either of the following can be done:
A. Impute the data using an imputation method from sklearn
B. Fill NaN values with the mean, median or mode using the fillna() method
4. Describe the data --> this gives a statistical summary
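The null-handling options in step 3 can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's data: the values and the variable names `df`, `dropped`, `filled` and `price_imputed` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for train_data (illustrative values only).
df = pd.DataFrame({
    "Route": ["BLR → DEL", None, "CCU → BLR"],
    "Total_Stops": ["non-stop", "1 stop", None],
    "Price": [3897.0, np.nan, 6218.0],
})

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Simplest option: drop every row that contains a NaN.
dropped = df.dropna()

# Option A: impute a numeric column with sklearn's SimpleImputer.
imputer = SimpleImputer(strategy="median")
price_imputed = imputer.fit_transform(df[["Price"]])  # 2-D array, shape (3, 1)

# Option B: fill NaN with the median (numeric) or the mode (categorical).
filled = df.copy()
filled["Price"] = filled["Price"].fillna(filled["Price"].median())
for col in ["Route", "Total_Stops"]:
    filled[col] = filled[col].fillna(filled[col].mode()[0])
```

The notebook itself takes the simplest route, `dropna`, which is reasonable here because `info()` reports only one missing value each in Route and Total_Stops.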
In [2]:
train_data = pd.read_excel(r"E:\MachineLearni
In [3]:
pd.set_option('display.max_columns', None)
In [4]:
train_data.head()
Out[4]:   Airline      Date_of_Journey  Source    Destination  ...
0         IndiGo       24/03/2019       Banglore  New Delhi    ...
1         Air India    1/05/2019        Kolkata   Banglore     ...
2         Jet Airways  9/06/2019        Delhi     Cochin       ...
3         IndiGo       12/05/2019       Kolkata   Banglore     ...
In [5]:
train_data.info()
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Airline 10683 non-null object
1 Date_of_Journey 10683 non-null object
2 Source 10683 non-null object
3 Destination 10683 non-null object
4 Route 10682 non-null object
5 Dep_Time 10683 non-null object
6 Arrival_Time 10683 non-null object
7 Duration 10683 non-null object
8 Total_Stops 10682 non-null object
9 Additional_Info 10683 non-null object
10 Price 10683 non-null int64
dtypes: int64(1), object(10)
memory usage: 918.2+ KB
In [6]:
train_data["Duration"].value_counts()
Out[6]: 2h 50m 550
1h 30m 386
2h 45m 337
2h 55m 337
2h 35m 329
...
42h 5m 1
28h 30m 1
36h 25m 1
40h 20m 1
30h 25m 1
Name: Duration, Length: 368, dtype: int64
In [7]:
train_data.dropna(inplace = True)
In [8]:
train_data.isnull().sum()
Out[8]: Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 0
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 0
Additional_Info 0
Price 0
dtype: int64
EDA
From the description above we can see that Date_of_Journey is an object data type. Therefore, we have to convert it into a timestamp so that the column can be used properly for prediction.
For this we use pandas to_datetime to convert the object dtype to a datetime dtype.
.dt.day extracts only the day of that date
.dt.month extracts only the month of that date
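A minimal sketch of that conversion, using three date strings that appear in the head() output. The explicit `format="%d/%m/%Y"` is an assumption based on the day-first values shown there; the notebook's own call is truncated in this copy, so its exact arguments are not known.

```python
import pandas as pd

# Day-first date strings as they appear in Date_of_Journey.
dates = pd.Series(["24/03/2019", "1/05/2019", "9/06/2019"], name="Date_of_Journey")

# An explicit format avoids any day/month ambiguity (assumed format).
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# .dt is an accessor; .day and .month are attributes, not method calls.
journey_day = parsed.dt.day
journey_month = parsed.dt.month
print(journey_day.tolist(), journey_month.tolist())
```

Stating the format explicitly matters for values such as 1/05/2019, which could otherwise be read as January 5th.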
In [9]:
train_data["Journey_day"] = pd.to_datetime(tr
In [10]:
train_data["Journey_month"] = pd.to_datetime(
In [11]:
train_data.head()
Out[11]:  Airline      Date_of_Journey  Source    Destination  ...
...
1         Air India    1/05/2019        Kolkata   Banglore     ...
...
In [12]:
# Since we have converted Date_of_Journey col

train_data.drop(["Date_of_Journey"], axis = 1
In [13]:
# Departure time is when a plane leaves the g
# Similar to Date_of_Journey we can extract v

# Extracting Hours
train_data["Dep_hour"] = pd.to_datetime(train

# Extracting Minutes
train_data["Dep_min"] = pd.to_datetime(train_

# Now we can drop Dep_Time as it is of no use

train_data.drop(["Dep_Time"], axis = 1, inpla
In [14]:
train_data.head()
Out[14]:  Airline      Source    Destination  Route                  Arrival_Time  ...
0         IndiGo       Banglore  New Delhi    BLR → DEL              01:10 22 Mar  ...
1         Air India    Kolkata   Banglore     CCU → IXR → BBI → BLR  13:15         ...
2         Jet Airways  Delhi     Cochin       DEL → LKO → BOM → COK  04:25 10 Jun  ...
3         IndiGo       Kolkata   Banglore     CCU → NAG → BLR        23:30         ...
4         IndiGo       Banglore  New Delhi    BLR → NAG → DEL        21:35         ...
In [15]:
# Arrival time is when the plane pulls up to
# Similar to Date_of_Journey we can extract v

# Extracting Hours
train_data["Arrival_hour"] = pd.to_datetime(t

# Extracting Minutes
train_data["Arrival_min"] = pd.to_datetime(tr

# Now we can drop Arrival_Time as it is of no

train_data.drop(["Arrival_Time"], axis = 1, i
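As Out[14] shows, some Arrival_Time values carry a date suffix ("01:10 22 Mar") while others are a bare time ("13:15"). The notebook extracts hours and minutes through pd.to_datetime; the sketch below is an alternative, not the notebook's method, that reads the leading HH:MM with a regular expression and ignores any suffix. The group names `hour` and `minute` are chosen for this example.

```python
import pandas as pd

# Arrival_Time values as shown in Out[14]; some include a date suffix.
arrival = pd.Series(
    ["01:10 22 Mar", "13:15", "04:25 10 Jun", "23:30"], name="Arrival_Time"
)

# Capture the leading HH:MM and ignore whatever follows it.
parts = arrival.str.extract(r"^(?P<hour>\d{1,2}):(?P<minute>\d{2})").astype(int)

arrival_hour = parts["hour"]
arrival_min = parts["minute"]
print(parts)
```

The same pattern works for Dep_Time, which contains only bare times.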
In [16]:
train_data.head()
Out[16]:  Airline      Source    Destination  Route                  Duration  ...
0         IndiGo       Banglore  New Delhi    BLR → DEL              2h 50m    ...
1         Air India    Kolkata   Banglore     CCU → IXR → BBI → BLR  7h 25m    ...
2         Jet Airways  Delhi     Cochin       DEL → LKO → BOM → COK  19h       ...
3         IndiGo       Kolkata   Banglore     CCU → NAG → BLR        5h 25m    ...
4         IndiGo       Banglore  New Delhi    BLR → NAG → DEL        4h 45m    ...
In [17]:
# Time taken by plane to reach destination is
# It is the difference between Departure Time

# Assigning and converting Duration column in

duration = list(train_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2: # Ch
        if "h" in duration[i]:
            duration[i] = duration[i].strip()
        else:
            duration[i] = "0h " + duration[i]

duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].spl
    duration_mins.append(int(duration[i].spli
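Several lines of the cell above are cut off in this copy, so the following is a self-contained sketch of the same idea rather than a reconstruction of the original code: split a Duration string such as "2h 50m", "19h" or "5m" into integer hours and minutes, treating a missing part as zero. The helper name `split_duration` and the regular expression are assumptions made for this example.

```python
import re

# Optional "<n>h" followed by optional "<n>m", with flexible whitespace.
_DURATION = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def split_duration(text):
    """Return (hours, minutes) for strings like '2h 50m', '19h' or '5m'."""
    match = _DURATION.match(text)
    # Reject strings that contain neither an hour nor a minute part.
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0), int(minutes or 0)


durations = ["2h 50m", "7h 25m", "19h", "5m"]
duration_hours = [split_duration(d)[0] for d in durations]
duration_mins = [split_duration(d)[1] for d in durations]
print(duration_hours, duration_mins)
```

Raising on unrecognised input, instead of silently padding, makes malformed rows visible early.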
In [18]:
# Adding duration_hours and duration_mins lis

train_data["Duration_hours"] = duration_hours
train_data["Duration_mins"] = duration_mins
In [19]:
train_data.drop(["Duration"], axis = 1, inpla
In [20]:
train_data.head()
Out[20]:  Airline      Source    Destination  Route                  Total_Stops  ...
0         IndiGo       Banglore  New Delhi    BLR → DEL              non-stop     ...
1         Air India    Kolkata   Banglore     CCU → IXR → BBI → BLR  2 stops      ...
2         Jet Airways  Delhi     Cochin       DEL → LKO → BOM → COK  2 stops      ...
3         IndiGo       Kolkata   Banglore     CCU → NAG → BLR        1 stop       ...
4         IndiGo       Banglore  New Delhi    BLR → NAG → DEL        1 stop       ...
Handling Categorical Data

There are many ways to handle categorical data. The two main types of categorical data are:
1. Nominal data --> categories have no inherent order --> OneHotEncoder is used in this case
2. Ordinal data --> categories have a natural order --> an ordinal encoding such as LabelEncoder or an explicit value mapping is used in this case
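For the nominal case, the notebook uses pd.get_dummies rather than sklearn's OneHotEncoder. A minimal sketch of what that call produces, on four airline names taken from the value counts; `dtype=int` is added here so the output shows 0/1 on recent pandas versions, which default to booleans.

```python
import pandas as pd

df = pd.DataFrame({"Airline": ["IndiGo", "Air India", "Jet Airways", "Air Asia"]})

# One indicator column per category; drop_first removes the first category
# in sorted order ("Air Asia"), which becomes the all-zeros baseline.
dummies = pd.get_dummies(df[["Airline"]], drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids a redundant column: a row with zeros in every remaining Airline_* column can only be the dropped category.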
In [21]:
train_data["Airline"].value_counts()
Out[21]: Jet Airways 3849
IndiGo 2053
Air India 1751
Multiple carriers 1196
SpiceJet 818
Vistara 479
Air Asia 319
GoAir 194
Multiple carriers Premium economy 13
Jet Airways Business 6
Vistara Premium economy 3
Trujet 1
Name: Airline, dtype: int64
In [22]:
# From graph we can see that Jet Airways Busi
# Apart from the first Airline almost all are

# Airline vs Price
sns.catplot(y = "Price", x = "Airline", data

plt.show()
In [23]:
# As Airline is Nominal Categorical data we w

Airline = train_data[["Airline"]]
Airline = pd.get_dummies(Airline, drop_first=

Airline.head()
Out[23]:  Airline_Air India  Airline_GoAir  Airline_IndiGo  Airline_Jet Airways  ...
0         0                  0              1               0                    ...
1         1                  0              0               0                    ...
2         0                  0              0               1                    ...
3         0                  0              1               0                    ...
4         0                  0              1               0                    ...
In [24]:
train_data["Source"].value_counts()
Out[24]: Delhi 4536
Kolkata 2871
Banglore 2197
Mumbai 697
Chennai 381
Name: Source, dtype: int64
In [25]:
# Source vs Price
sns.catplot(y = "Price", x = "Source", data =

plt.show()
In [26]:
# As Source is Nominal Categorical data we wi

Source = train_data[["Source"]]
Source = pd.get_dummies(Source, drop_first= T

Source.head()
Out[26]: Source_Chennai Source_Delhi Source_Kolkata Sour
0 0 0 0
1 0 0 1
2 0 1 0
3 0 0 1
4 0 0 0
In [27]:
train_data["Destination"].value_counts()
Out[27]: Cochin 4536
Banglore 2871
Delhi 1265
New Delhi 932
Hyderabad 697
Kolkata 381
Name: Destination, dtype: int64
In [28]:
# As Destination is Nominal Categorical data

Destination = train_data[["Destination"]]
Destination = pd.get_dummies(Destination, dro

Destination.head()
Out[28]:
Destination_Cochin Destination_Delhi Destination_H
0 0 0
1 0 0
2 1 0
3 0 0
4 0 0
In [29]:
train_data["Route"]
Out[29]: 0                    BLR → DEL
1        CCU → IXR → BBI → BLR
2        DEL → LKO → BOM → COK
3              CCU → NAG → BLR
4              BLR → NAG → DEL
...
10678                CCU → BLR
10679                CCU → BLR
10680                BLR → DEL
10681                BLR → DEL
10682    DEL → GOI → BOM → COK
Name: Route, Length: 10682, dtype: object
In [30]:
# Additional_Info contains almost 80% no_info
# Route and Total_Stops are related to each o

train_data.drop(["Route", "Additional_Info"],
In [31]:
train_data["Total_Stops"].value_counts()
Out[31]: 1 stop 5625
non-stop 3491
2 stops 1520
3 stops 45
4 stops 1
Name: Total_Stops, dtype: int64
In [32]:
# As this is case of Ordinal Categorical type
# Here Values are assigned with corresponding

train_data.replace({"non-stop": 0, "1 stop":
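The replace call above is truncated after "1 stop", so the remaining key/value pairs are not visible in this copy. The sketch below assumes the natural mapping (each label to its number of stops) for the five labels listed in Out[31], and applies it to the column only instead of the whole frame. The names `stop_order` and `encoded` are made up for the example.

```python
import pandas as pd

stops = pd.Series(["non-stop", "2 stops", "1 stop", "4 stops"], name="Total_Stops")

# Assumed mapping: label -> number of stops (values beyond "1 stop" are a guess).
stop_order = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

encoded = stops.map(stop_order)

# map() yields NaN for labels missing from the dictionary; fail loudly if so.
assert encoded.notna().all(), "unmapped Total_Stops label found"
encoded = encoded.astype(int)
print(encoded.tolist())
```

Mapping a single column is narrower than DataFrame.replace, which would also rewrite a matching string in any other column.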
In [33]:
train_data.head()
Out[33]:  Airline      Source    Destination  Total_Stops  Price  ...
0         IndiGo       Banglore  New Delhi    0            3897   ...
1         Air India    Kolkata   Banglore     2            7662   ...
2         Jet Airways  Delhi     Cochin       2            13882  ...
3         IndiGo       Kolkata   Banglore     1            6218   ...
In [34]:
# Concatenate dataframe --> train_data + Airl

data_train = pd.concat([train_data, Airline,
In [35]:
data_train.head()
Out[35]:  Airline      Source    Destination  Total_Stops  Price  ...
1         Air India    Kolkata   Banglore     2            7662   ...
2         Jet Airways  Delhi     Cochin       2            13882  ...
3         IndiGo       Kolkata   Banglore     1            6218   ...
In [36]:
data_train.drop(["Airline", "Source", "Destin
In [37]:
data_train.head()
Out[37]:
Total_Stops Price Journey_day Journey_month D
0 0 3897 24 3
1 2 7662 1 5
2 2 13882 9 6
3 1 6218 12 5
4 1 13302 1 3
In [38]:
data_train.shape
Out[38]: (10682, 30)

Test set
In [39]:
test_data = pd.read_excel(r"E:\MachineLearnin
In [40]:
test_data.head()
Out[40]:  Airline      Date_of_Journey  Source    Destination  ...
...
4         Air Asia     24/06/2019       Banglore  Delhi        ...
In [41]:
# Preprocessing
print("Test data Info")
print("-"*75)
print(test_data.info())
print()
print()
print("Null values :")
print("-"*75)
test_data.dropna(inplace = True)
print(test_data.isnull().sum())
# EDA
# Date_of_Journey
test_data["Journey_day"] = pd.to_datetime(tes
test_data["Journey_month"] = pd.to_datetime(t
test_data.drop(["Date_of_Journey"], axis = 1,
# Dep_Time
test_data["Dep_hour"] = pd.to_datetime(test_d
test_data["Dep_min"] = pd.to_datetime(test_da
test_data.drop(["Dep_Time"], axis = 1, inplac
# Arrival_Time
test_data["Arrival_hour"] = pd.to_datetime(te
test_data["Arrival_min"] = pd.to_datetime(tes
test_data.drop(["Arrival_Time"], axis = 1, in
# Duration
duration = list(test_data["Duration"])
for i in range(len(duration)):
    if len(duration[i].split()) != 2: # Ch
        if "h" in duration[i]:
            duration[i] = duration[i].strip()
        else:
            duration[i] = "0h " + duration[i]
duration_hours = []
duration_mins = []
for i in range(len(duration)):
    duration_hours.append(int(duration[i].spl
    duration_mins.append(int(duration[i].spli
# Adding Duration column to test set
test_data["Duration_hours"] = duration_hours
test_data["Duration_mins"] = duration_mins
test_data.drop(["Duration"], axis = 1, inplac
# Categorical data
print("Airline")
print("-"*75)
print(test_data["Airline"].value_counts())
Airline = pd.get_dummies(test_data["Airline"]
print()
print("Source")
print("-"*75)
print(test_data["Source"].value_counts())
Source = pd.get_dummies(test_data["Source"],
print()
print("Destination")
print("-"*75)
print(test_data["Destination"].value_counts()
Destination = pd.get_dummies(test_data["Desti
# Additional_Info contains almost 80% no_info

# Route and Total_Stops are related to each o
test_data.drop(["Route", "Additional_Info"],
# Replacing Total_Stops
test_data.replace({"non-stop": 0, "1 stop": 1

# Concatenate dataframe --> test_data + Airli

data_test = pd.concat([test_data, Airline, So
data_test.drop(["Airline", "Source", "Destina
print()
print()
print("Shape of test data : ", data_test.shap
Test data Info
---------------------------------------------------------------------------
RangeIndex: 2671 entries, 0 to 2670
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Airline 2671 non-null object
1 Date_of_Journey 2671 non-null object
2 Source 2671 non-null object
3 Destination 2671 non-null object
4 Route 2671 non-null object
5 Dep_Time 2671 non-null object
6 Arrival_Time 2671 non-null object
7 Duration 2671 non-null object
8 Total_Stops 2671 non-null object
9 Additional_Info 2671 non-null object
dtypes: object(10)
memory usage: 208.8+ KB
None
Null values :
---------------------------------------------------------------------------
Airline 0
Date_of_Journey 0
Source 0
Destination 0
Route 0
Dep_Time 0
Arrival_Time 0
Duration 0
Total_Stops 0
Additional_Info 0
dtype: int64
Airline
---------------------------------------------------------------------------
Jet Airways 897
IndiGo 511
Air India 440
Multiple carriers 347
SpiceJet 208
Vistara 129
Air Asia 86
GoAir 46
Multiple carriers Premium economy 3
Jet Airways Business 2

Name: Airline, dtype: int64
Source
---------------------------------------------------------------------------
Delhi 1145
Kolkata 710
Banglore 555
Mumbai 186
Chennai 75
Name: Source, dtype: int64
Destination
---------------------------------------------------------------------------
Cochin 1145
Banglore 710
Delhi 317
New Delhi 238
Hyderabad 186
Kolkata 75
Name: Destination, dtype: int64
Shape of test data : (2671, 28)
In [42]:
data_test.head()
Out[42]:
Total_Stops Journey_day Journey_month Dep_hou
0 1 6 6 1
1 1 12 5
2 1 21 5 1
3 1 21 5
4 0 24 6 2
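The training frame has 30 columns (29 features plus Price) while the test frame has 28, and the test airline counts list no Trujet flights, so one-hot encoding the two files separately leaves their feature columns out of step. One way to line them up, sketched here on toy frames and not part of the original notebook, is to reindex the test features to the training feature columns and fill absent indicators with 0. The frames `train_features` and `test_features` are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins: the training features include a dummy column the test set lacks.
train_features = pd.DataFrame(
    {"Total_Stops": [0, 1], "Airline_IndiGo": [1, 0], "Airline_Trujet": [0, 1]}
)
test_features = pd.DataFrame({"Airline_IndiGo": [1, 0], "Total_Stops": [2, 0]})

# Same columns, same order as training; indicators missing in test become 0.
aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned)
```

A fitted model expects the same columns in the same order it was trained on, so this step would come before any call to predict on the test set.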
Feature Selection
Find the features that contribute most to, and correlate best with, the target variable. Some feature selection methods are:
1. heatmap
2. feature_importances_
3. SelectKBest
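The notebook goes on to use the heatmap and feature_importances_ but never shows SelectKBest. A minimal sketch of the third method on synthetic data, where only columns "a" and "c" drive the target; the data, the column names and the choice of `f_regression` as the scoring function are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic features: only "a" and "c" influence the target.
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 5.0 * X_demo["a"] - 3.0 * X_demo["c"] + rng.normal(scale=0.1, size=200)

# Keep the k features with the highest univariate F-statistic.
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
selected = list(X_demo.columns[selector.get_support()])
print(selected)
```

Because the score is univariate, SelectKBest judges each feature on its own and can miss features that only matter in combination, which tree-based importances can pick up.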
In [43]:
data_train.shape
Out[43]: (10682, 30)
In [44]:
data_train.columns
Out[44]: Index(['Total_Stops', 'Price', 'Journey_day', 'Journey_month', 'Dep_hour',
       'Dep_min', 'Arrival_hour', 'Arrival_min', 'Duration_hours',
       'Duration_mins', 'Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo',
       'Airline_Jet Airways', 'Airline_Jet Airways Business',
       'Airline_Multiple carriers',
       'Airline_Multiple carriers Premium economy', 'Airline_SpiceJet',
       'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy',
       'Source_Chennai', 'Source_Delhi', 'Source_Kolkata', 'Source_Mumbai',
       'Destination_Cochin', 'Destination_Delhi', 'Destination_Hyderabad',
       'Destination_Kolkata', 'Destination_New Delhi'],
      dtype='object')
In [45]:
X = data_train.loc[:, ['Total_Stops', 'Journe
'Dep_min', 'Arrival_hour', 'Arrival_mi
'Duration_mins', 'Airline_Air India',
'Airline_Jet Airways', 'Airline_Jet Ai
'Airline_Multiple carriers',
'Airline_Multiple carriers Premium eco

'Airline_Trujet', 'Airline_Vistara', '
'Source_Chennai', 'Source_Delhi', 'Sou
'Destination_Cochin', 'Destination_Del
'Destination_Kolkata', 'Destination_Ne
X.head()
Out[45]:
Total_Stops Journey_day Journey_month Dep_hou
0 0 24 3 2
1 2 1 5
2 2 9 6
3 1 12 5 1
4 1 1 3 1
In [46]:
y = data_train.iloc[:, 1]
y.head()
Out[46]: 0 3897
1 7662
2 13882
3 6218
4 13302
Name: Price, dtype: int64
In [47]:
# Finds correlation between Independent and d

plt.figure(figsize = (18,18))
sns.heatmap(train_data.corr(), annot = True,

plt.show()
In [48]:
# Important feature using ExtraTreesRegressor

from sklearn.ensemble import ExtraTreesRegres

selection = ExtraTreesRegressor()
selection.fit(X, y)
Out[48]: ExtraTreesRegressor(bootstrap=False, ccp_alpha=0.0, criterion='mse',
                    max_depth=None, max_features='auto', max_leaf_nodes=None,
                    max_samples=None, min_impurity_decrease=0.0,
                    min_impurity_split=None, min_samples_leaf=1,
                    min_samples_split=2, min_weight_fraction_leaf=0.0,
                    n_estimators=100, n_jobs=None, oob_score=False,
                    random_state=None, verbose=0, warm_start=False)
In [49]:
print(selection.feature_importances_)
[1.95434120e-01 1.44021586e-01 5.38555703e-02 2.40539410e-02
 2.17430940e-02 2.73470116e-02 1.96095029e-02 1.27500039e-01
 1.74337150e-02 1.06741541e-02 1.87128973e-03 1.65718507e-02
 1.51364596e-01 6.81637760e-02 1.89902192e-02 8.47326514e-04
 3.10615675e-03 1.08644888e-04 5.29446496e-03 9.35369428e-05
 6.27777871e-04 1.23137157e-02 3.24867396e-03 8.91085147e-03
 1.41174133e-02 1.95895945e-02 7.68279226e-03 4.01014841e-04
 2.50235698e-02]
In [50]:
#plot graph of feature importances for better

plt.figure(figsize = (12,8))
feat_importances = pd.Series(selection.featur
feat_importances.nlargest(20).plot(kind='barh
plt.show()
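The raw importance array is hard to read without the column names. A minimal sketch, on synthetic data with hypothetical column names, of pairing each importance with its feature and ranking them; fixing `random_state` is an addition here so that reruns give the same ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)

# Synthetic data: only "signal" drives the target, the others are noise.
X_demo = pd.DataFrame(
    rng.uniform(size=(300, 3)), columns=["signal", "noise_1", "noise_2"]
)
y_demo = 1000 * X_demo["signal"] + rng.normal(scale=10, size=300)

model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Label each importance with its column name and sort, highest first.
importances = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

The importances are normalised to sum to 1, so each value reads as a share of the total impurity reduction.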
Fitting model using Random Forest
1. Split the dataset into train and test sets in order to evaluate predictions w.r.t. X_test
2. If needed, scale the data
Scaling is not needed for Random Forest
3. Import the model
4. Fit the data
5. Predict w.r.t. X_test
6. For regression, check the RMSE score
7. Plot graph
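Steps 1 to 5 above can be sketched end to end on synthetic data. The split ratio, seeds and tree count below are assumptions for the example; the notebook's own train_test_split arguments are truncated in this copy.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic regression problem standing in for the flight data.
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 100 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(scale=5, size=400)

# 1. Hold out 20% of the rows for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 3-4. Import and fit the model (no scaling needed for tree ensembles).
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# 5. Predict on the held-out rows and compare R^2 on both splits.
y_hat = forest.predict(X_te)
train_r2 = forest.score(X_tr, y_tr)
test_r2 = forest.score(X_te, y_te)
print(train_r2, test_r2)
```

Comparing the two scores is the quick overfitting check: a training R^2 far above the test R^2 suggests the forest is memorising the training rows.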
In [51]:
from sklearn.model_selection import train_tes
X_train, X_test, y_train, y_test = train_test
In [52]:
from sklearn.ensemble import RandomForestRegr
reg_rf = RandomForestRegressor()
reg_rf.fit(X_train, y_train)
Out[52]: RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
In [53]:
y_pred = reg_rf.predict(X_test)
In [54]:
reg_rf.score(X_train, y_train)
Out[54]: 0.9539164511170628
In [55]:
reg_rf.score(X_test, y_test)
Out[55]: 0.798383043987616
In [56]:
sns.distplot(y_test-y_pred)
plt.show()
In [57]:
plt.scatter(y_test, y_pred, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()
In [58]:
from sklearn import metrics
In [59]:
print('MAE:', metrics.mean_absolute_error(y_t
print('MSE:', metrics.mean_squared_error(y_te
print('RMSE:', np.sqrt(metrics.mean_squared_e
MAE: 1172.5455945373583
MSE: 4347276.1614450775
RMSE: 2085.0122688955757
In [60]:
# RMSE/(max(DV)-min(DV))
2090.5509/(max(y)-min(y))
Out[60]: 0.026887077025966846
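The cell above divides a hard-coded 2090.5509 by the price range, while the RMSE printed by the previous cell is 2085.0123; the two presumably come from different runs of the unseeded forest. A small sketch, on toy arrays, of computing the error metrics and the range-normalised RMSE from variables so they cannot drift apart. The names `y_true`, `y_hat` and `nrmse` are made up for the example.

```python
import numpy as np
from sklearn import metrics

# Toy targets and predictions (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

mae = metrics.mean_absolute_error(y_true, y_hat)   # mean of |error|
mse = metrics.mean_squared_error(y_true, y_hat)    # mean of error^2
rmse = np.sqrt(mse)                                # same units as the target

# RMSE divided by the target range: RMSE / (max(DV) - min(DV)).
nrmse = rmse / (y_true.max() - y_true.min())
print(mae, mse, rmse, nrmse)
```

Here the absolute errors are 1, 0 and 2, so the MAE is 1.0 and the MSE is 5/3.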
In [61]:
metrics.r2_score(y_test, y_pred)
Out[61]: 0.7983830439876158
In [ ]:

Hyperparameter Tuning
Choose one of the following methods for hyperparameter tuning
1. RandomizedSearchCV --> Fast
2. GridSearchCV
Assign the hyperparameters in the form of a dictionary
Fit the model
Check the best parameters and best score
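The steps above can be sketched on a small synthetic problem. The grid values, `n_iter`, `cv` and the scoring string are assumptions chosen to keep the example fast. One deliberate difference from the notebook: `max_features` uses `1.0` instead of `'auto'`, because `'auto'` has been removed from recent scikit-learn releases.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(120, 4))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

# Hyperparameters as a dictionary: name -> list of candidate values.
param_distributions = {
    "n_estimators": [20, 40],
    "max_features": ["sqrt", 1.0],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

# Sample n_iter random combinations and score each with cv-fold cross-validation.
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
    n_jobs=1,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_  # negative MSE, so closer to 0 is better
print(best_params, best_score)
```

RandomizedSearchCV fits `n_iter * cv` models regardless of grid size, which is why it is the faster choice; GridSearchCV would fit every combination in the dictionary.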
In [62]:
from sklearn.model_selection import Randomize
In [63]:
#Randomized Search CV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(s

# Number of features to consider at every spl
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 3

# Minimum number of samples required to split
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each

min_samples_leaf = [1, 2, 5, 10]
In [64]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_sampl
'min_samples_leaf': min_sample
In [65]:
# Random search of parameters, using 5 fold c
# search across 10 different combinations
rf_random = RandomizedSearchCV(estimator = re
In [66]:
rf_random.fit(X_train,y_train)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] n_estimators=900, min_samples_split=5, min_samples_leaf=5, max_features=sqrt, max_depth=10
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] n_estimators=900, min_samples_split=5, min_samples_leaf=5, max_features=sqrt, max_depth=10, total= 13.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 12.9s remaining: 0.0s
...
[CV] n_estimators=1100, min_samples_split=10, min_samples_leaf=2, max_features=sqrt, max_depth=15, total= 18.6s
...
[CV] n_estimators=300, min_samples_split=100, min_samples_leaf=5, max_features=auto, max_depth=15
...
[CV] ... min_samples_leaf=10, max_features=auto, max_depth=20, total= 25.0s
...
[CV] ... min_samples_leaf=10, max_features=sqrt, max_depth=5, total= 9.4s
...
