Flight Price Prediction Capstone Project Submission 2


Flight Price Prediction

Jasvinder Singh Bindra

28/03/2020

Objective:
Flight ticket prices are hard to predict: the price we see today for a flight may be
quite different when we check the same flight tomorrow, and travelers often complain
that ticket prices are unpredictable. This project uses the prices of flight tickets for
various airlines, between various cities, for journeys between March and June 2019,
and aims to predict the ticket price from the features below.

FEATURES:
1. Airline: The name of the airline.
2. Date_of_Journey: The date of the journey
3. Source: The source from which the service begins.
4. Destination: The destination where the service ends.
5. Route: The route taken by the flight to reach the destination.
6. Dep_Time: The time when the journey starts from the source.
7. Arrival_Time: Time of arrival at the destination.
8. Duration: Total duration of the flight.
9. Total_Stops: Total stops between the source and destination.
10. Additional_Info: Additional information about the flight
11. Price: The price of the ticket

#Loading the Dataset.


library(readxl)
library(readr)
data_train <- read_excel("C:/Users/jasvi/Desktop/Capstone Project/FlightPrice_train.xlsx")
View(data_train)
train=data_train
View(train)
summary(train)

## Airline Date_of_Journey Source Destination


## Length:10683 Length:10683 Length:10683 Length:10683
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Route Dep_Time Arrival_Time Duration
## Length:10683 Length:10683 Length:10683 Length:10683
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Total_Stops Additional_Info Price
## Length:10683 Length:10683 Min. : 1759
## Class :character Class :character 1st Qu.: 5277
## Mode :character Mode :character Median : 8372
## Mean : 9087
## 3rd Qu.:12373
## Max. :79512

str(train)

## Classes 'tbl_df', 'tbl' and 'data.frame': 10683 obs. of 11 variables:


## $ Airline : chr "IndiGo" "Air India" "Jet Airways" "IndiGo" ...
## $ Date_of_Journey: chr "24/03/2019" "1/05/2019" "9/06/2019" "12/05/2019" ...
## $ Source : chr "Banglore" "Kolkata" "Delhi" "Kolkata" ...
## $ Destination : chr "New Delhi" "Banglore" "Cochin" "Banglore" ...
## $ Route : chr "BLR <U+2192> DEL" "CCU <U+2192> IXR <U+2192> BBI <U+2192> BLR" "DEL <U+2192> LKO <U+2192> BOM <U+2192> COK" "CCU <U+2192> NAG <U+2192> BLR" ...
## $ Dep_Time : chr "22:20" "05:50" "09:25" "18:05" ...
## $ Arrival_Time : chr "01:10 22 Mar" "13:15" "04:25 10 Jun" "23:30" ...
## $ Duration : chr "2h 50m" "7h 25m" "19h" "5h 25m" ...
## $ Total_Stops : chr "non-stop" "2 stops" "2 stops" "1 stop" ...
## $ Additional_Info: chr "No info" "No info" "No info" "No info" ...
## $ Price : num 3897 7662 13882 6218 13302 ...

#Loading the Packages


library(stringr)
library(stringi)
library(MASS)
library(DMwR)
library(plyr)

library(ggplot2)
library(purrr)

library(tidyr)
library(corrgram)

library(caret)

library(lubridate)

library(tidyverse)

library(rpart)
library(fastDummies)

colnames(train)

## [1] "Airline" "Date_of_Journey" "Source" "Destination"


## [5] "Route" "Dep_Time" "Arrival_Time" "Duration"
## [9] "Total_Stops" "Additional_Info" "Price"

#Checking the Class of the Data


class(train)

## [1] "tbl_df" "tbl" "data.frame"

#Checking the Missing Values


sum(is.na(train))

## [1] 2

mv=data.frame(apply(train, 2, function(x){sum(is.na(x))}))
mv

## apply.train..2..function.x...
## Airline 0
## Date_of_Journey 0
## Source 0
## Destination 0
## Route 1
## Dep_Time 0
## Arrival_Time 0
## Duration 0
## Total_Stops 1
## Additional_Info 0
## Price 0

#Missing Value Treatment


train=na.omit(train)
dim(train)

## [1] 10682 11

sum(is.na(train))

## [1] 0
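
The missing-value check and removal above can also be sketched with base R helpers. This is only an illustration on a small made-up data frame (the `flights` object and its values are hypothetical, not the project data); it assumes, as the row counts above suggest, that the missing Route and Total_Stops values sit in the same row.

```r
# Toy data frame standing in for the flight data (hypothetical values).
flights <- data.frame(
  Route       = c("BLR-DEL", NA, "DEL-COK"),
  Total_Stops = c("non-stop", NA, "1 stop"),
  Price       = c(3897, 7662, 13882),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(flights))
print(missing_per_column)

# Keep only the rows that have no missing value in any column.
flights_complete <- flights[complete.cases(flights), ]
print(dim(flights_complete))
```

Here `colSums(is.na(...))` gives the per-column counts directly as a named vector, and `complete.cases()` keeps the same rows that `na.omit()` would keep.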

#Understanding & Preparing variables:


#Variable 1:Airline:
unique(train$Airline)

## [1] "IndiGo" "Air India"


## [3] "Jet Airways" "SpiceJet"
## [5] "Multiple carriers" "GoAir"
## [7] "Vistara" "Air Asia"
## [9] "Vistara Premium economy" "Jet Airways Business"
## [11] "Multiple carriers Premium economy" "Trujet"

#Plotting frequency count of each airline:
ggplot(train, aes(x = Airline, fill = Airline)) +
  geom_bar(position = "dodge") + labs(title = "Counts of each Airline") +
  geom_text(aes(label = ..count..), stat = "count", position = position_dodge(0.9), vjust = -0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#Plotting mean price of each airline:
ggplot(train, aes(x = Airline, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price of each airline")
#Boxplot of each airline vs price:
ggplot(train, aes(x = Airline, y = Price)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Boxplot of each level in airline with price")
#From the above charts it is evident that the levels “Jet Airways Business”, “Multiple
carriers Premium economy”, “Trujet” and “Vistara Premium economy” have very few
observations, so these levels are removed.
#Creating the subset:
train=subset(train, !(Airline %in% c("Vistara Premium economy", "Jet Airways Business",
                                     "Multiple carriers Premium economy", "Trujet")))
unique(train$Airline)

## [1] "IndiGo" "Air India" "Jet Airways"


## [4] "SpiceJet" "Multiple carriers" "GoAir"
## [7] "Vistara" "Air Asia"

#Thus, the sparsely populated levels have been removed from the Airline variable.
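
The rare levels were picked by eye from the chart. As a rough sketch of how the same selection could be made from the counts, the snippet below filters on a minimum frequency. The `airline` vector and the threshold `min_count` are hypothetical illustrations, not values used in this project.

```r
# Toy airline column (hypothetical values).
airline <- c("IndiGo", "IndiGo", "IndiGo", "Trujet", "Air India", "Air India")

# Frequency of each level.
counts <- table(airline)

# Assumed cut-off: levels with fewer observations than this are dropped.
min_count <- 2
keep_levels <- names(counts)[counts >= min_count]

# Keep only observations whose level is frequent enough.
filtered <- airline[airline %in% keep_levels]
print(table(filtered))
```

With a data frame, the same logical vector `airline %in% keep_levels` can be used as the row filter.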


#Variable 2: Date_of_Journey:
unique(train$Date_of_Journey)

## [1] "24/03/2019" "1/05/2019" "9/06/2019" "12/05/2019" "01/03/2019"


## [6] "24/06/2019" "12/03/2019" "27/05/2019" "1/06/2019" "18/04/2019"
## [11] "9/05/2019" "24/04/2019" "3/03/2019" "15/04/2019" "12/06/2019"
## [16] "6/03/2019" "21/03/2019" "3/04/2019" "6/05/2019" "15/05/2019"
## [21] "18/06/2019" "15/06/2019" "6/04/2019" "18/05/2019" "27/06/2019"
## [26] "21/05/2019" "06/03/2019" "3/06/2019" "15/03/2019" "3/05/2019"
## [31] "9/03/2019" "6/06/2019" "24/05/2019" "09/03/2019" "1/04/2019"
## [36] "21/04/2019" "21/06/2019" "27/03/2019" "18/03/2019" "12/04/2019"
## [41] "9/04/2019" "1/03/2019" "03/03/2019" "27/04/2019"

#Converting the same dates written in different formats (1/03/2019 and 01/03/2019) to one format:
train$Date_of_Journey <- str_replace_all(train$Date_of_Journey, "01/03/2019",
"1/03/2019")
train$Date_of_Journey <- str_replace_all(train$Date_of_Journey, "03/03/2019",
"3/03/2019")
train$Date_of_Journey <- str_replace_all(train$Date_of_Journey, "06/03/2019",
"6/03/2019")
train$Date_of_Journey <- str_replace_all(train$Date_of_Journey, "09/03/2019",
"9/03/2019")
unique(train$Date_of_Journey)

## [1] "24/03/2019" "1/05/2019" "9/06/2019" "12/05/2019" "1/03/2019"


## [6] "24/06/2019" "12/03/2019" "27/05/2019" "1/06/2019" "18/04/2019"
## [11] "9/05/2019" "24/04/2019" "3/03/2019" "15/04/2019" "12/06/2019"
## [16] "6/03/2019" "21/03/2019" "3/04/2019" "6/05/2019" "15/05/2019"
## [21] "18/06/2019" "15/06/2019" "6/04/2019" "18/05/2019" "27/06/2019"
## [26] "21/05/2019" "3/06/2019" "15/03/2019" "3/05/2019" "9/03/2019"
## [31] "6/06/2019" "24/05/2019" "1/04/2019" "21/04/2019" "21/06/2019"
## [36] "27/03/2019" "18/03/2019" "12/04/2019" "9/04/2019" "27/04/2019"
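
The four replacements above each target one specific date. A more general sketch, assuming the only inconsistency is a leading zero in the day, is to strip that zero with a single regular expression. The `dates` vector below is a small hypothetical sample.

```r
# Sample of date strings with and without a leading zero in the day.
dates <- c("01/03/2019", "1/03/2019", "24/03/2019", "09/03/2019")

# Remove a zero only when it is the very first character of the string.
normalised <- sub("^0", "", dates)
print(unique(normalised))
```

Because the pattern is anchored with `^`, zeros inside the month or year are left untouched.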

#Changing / to - :
str(train$Date_of_Journey)

## chr [1:10659] "24/03/2019" "1/05/2019" "9/06/2019" "12/05/2019" ...

train$Date_of_Journey=str_replace_all(train$Date_of_Journey, "[/]", "-")

unique(train$Date_of_Journey)

## [1] "24-03-2019" "1-05-2019" "9-06-2019" "12-05-2019" "1-03-2019"


## [6] "24-06-2019" "12-03-2019" "27-05-2019" "1-06-2019" "18-04-2019"
## [11] "9-05-2019" "24-04-2019" "3-03-2019" "15-04-2019" "12-06-2019"
## [16] "6-03-2019" "21-03-2019" "3-04-2019" "6-05-2019" "15-05-2019"
## [21] "18-06-2019" "15-06-2019" "6-04-2019" "18-05-2019" "27-06-2019"
## [26] "21-05-2019" "3-06-2019" "15-03-2019" "3-05-2019" "9-03-2019"
## [31] "6-06-2019" "24-05-2019" "1-04-2019" "21-04-2019" "21-06-2019"
## [36] "27-03-2019" "18-03-2019" "12-04-2019" "9-04-2019" "27-04-2019"

#Variable 3: Source:
unique(train$Source)

## [1] "Banglore" "Kolkata" "Delhi" "Chennai" "Mumbai"

#Variable 4: Destination:
unique(train$Destination)

## [1] "New Delhi" "Banglore" "Cochin" "Kolkata" "Delhi" "Hyderabad"

#Replacing New Delhi with Delhi:


train$Destination=str_replace_all(train$Destination, "New Delhi", "Delhi")

unique(train$Destination)

## [1] "Delhi" "Banglore" "Cochin" "Kolkata" "Hyderabad"

#Variable 5: Route:
unique(train$Route)

## [1] "BLR <U+2192> DEL" "CCU <U+2192> IXR <U+2192> BBI <U+2192> BLR"
Continued

#Route is not an important variable, since the Total_Stops variable conveys the same information.
#Variable 6: Dep_Time:
unique(train$Dep_Time)
## [1] "22:20" "05:50" "09:25" "18:05" "16:50" "09:00" "18:55" "08:00" "08:55"
Continued

#Combining Date_of_Journey & Dep_Time into a new variable:
train$departure=paste(train$Date_of_Journey, train$Dep_Time, sep=' ')
#Transforming departure to date-time format:
train$departure=as.POSIXlt(train$departure, format = "%d-%m-%Y %H:%M")
#Sorting the dataset by departure:
train=train[ order(train$departure , decreasing = FALSE ),]
class(train$departure)

## [1] "POSIXlt" "POSIXt"
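
As a small self-contained sketch of the same step, the snippet below builds the date-time from two sample strings and pulls out the hour. It uses `as.POSIXct` rather than `as.POSIXlt`, which is often the easier class to store as a data frame column; the sample values and the `tz = "UTC"` setting are assumptions made for the illustration only.

```r
# Sample values in the formats used above (hypothetical rows).
date_of_journey <- c("24-03-2019", "1-05-2019")
dep_time <- c("22:20", "05:50")

# Paste date and time together and parse them in one step.
departure <- as.POSIXct(paste(date_of_journey, dep_time),
                        format = "%d-%m-%Y %H:%M", tz = "UTC")

# Extract the departure hour as an integer.
dep_hour <- as.integer(format(departure, "%H"))
print(departure)
print(dep_hour)
```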

str(train)

## Classes 'tbl_df', 'tbl' and 'data.frame': 10659 obs. of 12 variables:


## $ Airline : chr "Multiple carriers" "Multiple carriers" "Air India" "Air India" ...
## $ Date_of_Journey: chr "1-03-2019" "1-03-2019" "1-03-2019" "1-03-2019" ...
## $ Source : chr "Delhi" "Delhi" "Banglore" "Banglore" ...
## $ Destination : chr "Cochin" "Cochin" "Delhi" "Delhi" ...
## $ Route : chr "DEL <U+2192> BOM <U+2192> COK" "DEL <U+2192> BOM <U+2192> COK" "BLR <U+2192> AMD <U+2192> DEL" "BLR <U+2192> AMD <U+2192> DEL" ...
## $ Dep_Time : chr "00:20" "00:20" "00:30" "00:30" ...
## $ Arrival_Time : chr "13:20" "15:30" "20:30" "23:55" ...
## $ Duration : chr "13h" "15h 10m" "20h" "23h 25m" ...
## $ Total_Stops : chr "1 stop" "1 stop" "1 stop" "1 stop" ...
## $ Additional_Info: chr "No info" "No info" "1 Long layover" "1 Long layover" ...
## $ Price : num 29528 23170 14752 12599 16000 ...
## $ departure : POSIXlt, format: "2019-03-01 00:20:00" "2019-03-01 00:20:00" ...
## - attr(*, "na.action")= 'omit' Named int 9040
## ..- attr(*, "names")= chr "9040"

#Created a new variable named departure by combining Date_of_Journey & Dep_Time. #The
departure variable has been converted to date-time format. #Day, month and hour can now
be extracted separately into new columns.
#Variable 7: Arrival_Time:
unique(train$Arrival_Time)

## [1] "13:20" "15:30" "20:30" "23:55"


Continued

#Dep_Time & Arrival_Time are explained by the Duration variable, so Arrival_Time can be
left out.
#Variable 8 :Duration:
unique(train$Duration)

## [1] "13h" "15h 10m" "20h" "23h 25m" "1h 30m" "27h 40m" "16h 25m"
Continued

str(train$Duration)

## chr [1:10659] "13h" "15h 10m" "20h" "23h 25m" "1h 30m" "27h 40m" "16h 25m" ...

#Duration is stored as character and has to be changed to numeric. #So, removing the h
and m from the variable:
train$dur=str_replace_all(train$Duration, "h ", ".")
train$dur=str_replace_all(train$dur, "m", "")
train$dur=str_replace_all(train$dur, "h", ".00")
class(train$dur)

## [1] "character"

train$dur1=hm(train$dur)

str(train$dur1)

## Formal class 'Period' [package "lubridate"] with 6 slots


## ..@ .Data : num [1:10659] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ year : num [1:10659] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ month : num [1:10659] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ day : num [1:10659] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ hour : num [1:10659] 13 15 20 23 1 27 16 2 4 10 ...
## ..@ minute: num [1:10659] 0 10 0 25 30 40 25 20 45 25 ...

sum(is.na(train$dur1))

## [1] 1

class(train$dur1)

## [1] "Period"
## attr(,"package")
## [1] "lubridate"

summary(train$dur1)

## Min. 1st Qu.


## "1H 15M 0S" "2H 50M 0S"
## Median Mean
## "8H 40M 0S" "10H 43M 26.9431413023049S"
## 3rd Qu. Max.
## "15H 30M 0S" "1d 23H 40M 0S"
## NA's
## "1"

#train$dur2=as.numeric(train$dur1) #in seconds


train$duration=round(as.duration(train$dur1)/dhours(1)) #in hours (important)
#Duration --> dur --> dur1 --> duration
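
The summary above reports one NA in dur1. One possible cause (an assumption, since the offending row is not shown) is a duration string with no hours part, which would not survive the string substitutions. A hedged alternative is to extract the hour and minute numbers separately with regular expressions; `parse_duration_minutes()` below is a hypothetical helper, not part of the original analysis.

```r
# Convert strings such as "15h 10m", "13h" or "5m" to total minutes.
parse_duration_minutes <- function(x) {
  extract_unit <- function(x, unit) {
    value <- rep(0, length(x))                       # default when the unit is absent
    has_unit <- grepl(paste0("[0-9]+", unit), x)     # which strings carry this unit
    pattern <- paste0(".*?([0-9]+)", unit, ".*")     # capture the number before the unit
    value[has_unit] <- as.numeric(sub(pattern, "\\1", x[has_unit], perl = TRUE))
    value
  }
  60 * extract_unit(x, "h") + extract_unit(x, "m")
}

durations <- c("13h", "15h 10m", "5m", "1h 30m")   # sample inputs
minutes <- parse_duration_minutes(durations)
hours <- round(minutes / 60)                       # same rounding to whole hours as above
print(minutes)
print(hours)
```

Missing units default to zero, so a minutes-only value is kept instead of turning into NA.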

#Variable 9: Total_Stops:
unique(train$Total_Stops)

## [1] "1 stop" "non-stop" "2 stops" "3 stops" "4 stops"

#Variable 10: Additional Info:


unique(train$Additional_Info)

## [1] "No info" "1 Long layover"


## [3] "No Info" "Change airports"
## [5] "Business class" "2 Long layover"
## [7] "1 Short layover" "Red-eye flight"
## [9] "In-flight meal not included" "No check-in baggage included"

train$Additional_Info=str_replace_all(train$Additional_Info, "No Info", "No info")
unique(train$Additional_Info)

## [1] "No info" "1 Long layover"


## [3] "Change airports" "Business class"
## [5] "2 Long layover" "1 Short layover"
## [7] "Red-eye flight" "In-flight meal not included"
## [9] "No check-in baggage included"

#Variable 11: Price:
#Price is the numeric target variable; its distribution is shown in the summary(train)
output above.
#Dealing with the departure variable: #Earlier, “Date_of_Journey” & “Dep_Time” were
combined into a new variable named “departure”, which was converted to date-time format.
#Now taking only the hour values from it. #Extracting the hour:
train$dep_hour <- format(train$departure, "%H")

#Creating morning, day, evening, night, midnight timestamp using dep_hour variable:
str(train$dep_hour)

## chr [1:10659] "00" "00" "00" "00" "02" "04" "04" "05" "05" "05" "05" "05" ...

train$dep_hour=as.numeric(train$dep_hour)
train$dep_time_slot = ifelse(train$dep_hour < 5, "Pre_Morning",
                      ifelse(train$dep_hour < 10, "Morning",
                      ifelse(train$dep_hour < 17, "Day_Time",
                      ifelse(train$dep_hour < 22, "Evening", "Late_Night"))))
train$dep_time_slot=as.factor(train$dep_time_slot)
summary(train$dep_time_slot)

## Day_Time Evening Late_Night Morning Pre_Morning


## 3022 2846 548 3778 465
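
The nested ifelse() calls can also be written as a single binning step. The sketch below uses base R `cut()` with the same boundaries (5, 10, 17 and 22) on a hypothetical vector of hours; it is an illustration rather than the code used for the results in this report.

```r
# Sample departure hours covering both sides of every boundary.
dep_hour <- c(0, 4, 5, 9, 10, 16, 17, 21, 22, 23)

# right = FALSE makes each interval closed on the left: [0,5), [5,10), ...
dep_time_slot <- cut(
  dep_hour,
  breaks = c(0, 5, 10, 17, 22, 24),
  labels = c("Pre_Morning", "Morning", "Day_Time", "Evening", "Late_Night"),
  right  = FALSE
)
print(table(dep_time_slot))
```

One difference to keep in mind: `cut()` orders the factor levels as given in `labels`, whereas `as.factor()` on the character column orders them alphabetically.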

#Thus the departure time slot has been created from the flight departure hour. The flights
are now grouped into the slots “Day_Time”, “Evening”, “Late_Night”, “Morning” and “Pre_Morning”.
#Boxplot of each dep_time_slot vs price:
ggplot(train, aes(x = dep_time_slot, y = Price)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Boxplot of each level in dep_time_slot with price")

#Plotting mean price of each dep_time_slot:
ggplot(train, aes(x = dep_time_slot, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price of each dep_time_slot")

#(The means are more or less the same across slots (good))
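
The claim that the means are similar is read off the bar chart. A quick numeric check could look like the sketch below, which computes the mean price per slot with `aggregate()`. The `toy` data frame and its prices are made up for the illustration and say nothing about the actual group means.

```r
# Hypothetical prices for two departure slots.
toy <- data.frame(
  dep_time_slot = c("Morning", "Morning", "Evening", "Evening"),
  Price         = c(4000, 6000, 5000, 7000)
)

# Mean price within each slot.
mean_price <- aggregate(Price ~ dep_time_slot, data = toy, FUN = mean)
print(mean_price)
```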
#Creating dep_day (date+month):
train$dep_day <- format(train$departure, "%d%b")

summary(train)

## Airline Date_of_Journey Source Destination


## Length:10659 Length:10659 Length:10659 Length:10659
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Route Dep_Time Arrival_Time Duration
## Length:10659 Length:10659 Length:10659 Length:10659
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Total_Stops Additional_Info Price
## Length:10659 Length:10659 Min. : 1759
## Class :character Class :character 1st Qu.: 5266
## Mode :character Mode :character Median : 8372
## Mean : 9057
## 3rd Qu.:12373
## Max. :54826
##
## departure dur
## Min. :2019-03-01 00:20:00 Length:10659
## 1st Qu.:2019-03-27 19:15:00 Class :character
## Median :2019-05-15 09:50:00 Mode :character
## Mean :2019-05-05 11:30:24
## 3rd Qu.:2019-06-06 07:00:00
## Max. :2019-06-27 23:55:00
##
## dur1 duration dep_hour
## Min. :1H 15M 0S Min. : 1.00 Min. : 0.0
## 1st Qu.:2H 50M 0S 1st Qu.: 3.00 1st Qu.: 8.0
## Median :8H 40M 0S Median : 9.00 Median :11.0
## Mean :10H 43M 26.9431413023049S Mean :10.74 Mean :12.5
## 3rd Qu.:15H 30M 0S 3rd Qu.:16.00 3rd Qu.:18.0
## Max. :1d 23H 40M 0S Max. :48.00 Max. :23.0
## NA's :1 NA's :1
## dep_time_slot dep_day
## Day_Time :3022 Length:10659
## Evening :2846 Class :character
## Late_Night : 548 Mode :character
## Morning :3778
## Pre_Morning: 465
##
##

#dep is short for departure.
#Related variables:
#duration --> Duration, dur, dur1, Arrival_Time
#dep_time_slot --> Dep_Time, departure, dep_hour
#departure --> Date_of_Journey, Dep_Time
#dep_day --> departure, Date_of_Journey
#Here, Dep_Time is explained by the derived variable dep_time_slot. #Also, dep_day captures
the day and month of departure.
#Creating a new dataframe:
train1=train

#Dropping some variable:


colnames(train1)

## [1] "Airline" "Date_of_Journey" "Source" "Destination"


## [5] "Route" "Dep_Time" "Arrival_Time" "Duration"
## [9] "Total_Stops" "Additional_Info" "Price" "departure"
## [13] "dur" "dur1" "duration" "dep_hour"
## [17] "dep_time_slot" "dep_day"

#Since the “Total_Stops” and “Route” variables denote the same thing, I’m removing the
“Route” variable from the data:
train1$Route=NULL

#Removing old variables:
train1$Date_of_Journey=NULL
train1$Dep_Time=NULL
train1$Arrival_Time=NULL
train1$Duration=NULL
train1$dur=NULL
train1$dep_hour=NULL

#Summary of Train1
summary(train1)

## Airline Source Destination Total_Stops


## Length:10659 Length:10659 Length:10659 Length:10659
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Additional_Info Price departure
## Length:10659 Min. : 1759 Min. :2019-03-01 00:20:00
## Class :character 1st Qu.: 5266 1st Qu.:2019-03-27 19:15:00
## Mode :character Median : 8372 Median :2019-05-15 09:50:00
## Mean : 9057 Mean :2019-05-05 11:30:24
## 3rd Qu.:12373 3rd Qu.:2019-06-06 07:00:00
## Max. :54826 Max. :2019-06-27 23:55:00
##
## dur1 duration dep_time_slot
## Min. :1H 15M 0S Min. : 1.00 Day_Time :3022
## 1st Qu.:2H 50M 0S 1st Qu.: 3.00 Evening :2846
## Median :8H 40M 0S Median : 9.00 Late_Night : 548
## Mean :10H 43M 26.9431413023049S Mean :10.74 Morning :3778
## 3rd Qu.:15H 30M 0S 3rd Qu.:16.00 Pre_Morning: 465
## Max. :1d 23H 40M 0S Max. :48.00
## NA's :1 NA's :1
## dep_day
## Length:10659
## Class :character
## Mode :character
##
##
##
##

str(train1)

## Classes 'tbl_df', 'tbl' and 'data.frame': 10659 obs. of 11 variables:


## $ Airline : chr "Multiple carriers" "Multiple carriers" "Air India" "Air India" ...
## $ Source : chr "Delhi" "Delhi" "Banglore" "Banglore" ...
## $ Destination : chr "Cochin" "Cochin" "Delhi" "Delhi" ...
## $ Total_Stops : chr "1 stop" "1 stop" "1 stop" "1 stop" ...
## $ Additional_Info: chr "No info" "No info" "1 Long layover" "1 Long layover" ...
## $ Price : num 29528 23170 14752 12599 16000 ...
## $ departure : POSIXlt, format: "2019-03-01 00:20:00" "2019-03-01 00:20:00" ...
## $ dur1 :Formal class 'Period' [package "lubridate"] with 6 slots
## .. ..@ .Data : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ year : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ month : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ day : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ hour : num 13 15 20 23 1 27 16 2 4 10 ...
## .. ..@ minute: num 0 10 0 25 30 40 25 20 45 25 ...
## $ duration : num 13 15 20 23 2 28 16 2 5 10 ...
## $ dep_time_slot : Factor w/ 5 levels "Day_Time","Evening",..: 5 5 5 5 5 5 5 4 4 4 ...
## $ dep_day : chr "01Mar" "01Mar" "01Mar" "01Mar" ...
## - attr(*, "na.action")= 'omit' Named int 9040
## ..- attr(*, "names")= chr "9040"

#Data Visualization:
#Plotting frequency count of each airline:
ggplot(train1, aes(x = Airline, fill = Airline)) +
  geom_bar(position = "dodge") + labs(title = "Counts of each Airline") +
  geom_text(aes(label = ..count..), stat = "count", position = position_dodge(0.9), vjust = -0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
#Plotting mean price of each airline:
ggplot(train1, aes(x = Airline, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price of each airline")
#Plotting mean price of each Source:
ggplot(train1, aes(x = Source, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price of each Source")

#Plotting Source vs Destination to count the no. of flights:
ggplot(train1, aes(x = Source, y = ..count..)) + geom_bar(aes(fill = Destination), position = "dodge")

ggplot(train1, aes(x = Source, fill = Destination)) +
  geom_bar(position = "dodge") + labs(title = "Source vs Destination") +
  geom_text(aes(label = ..count..), stat = "count", position = position_dodge(0.9), vjust = -0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
#It’s clear that:
#Banglore flights go only to Delhi
#Chennai flights go only to Kolkata
#Delhi flights go only to Cochin
#Kolkata flights go only to Banglore
#Mumbai flights go only to Hyderabad
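
The one-destination-per-source pattern can also be checked with a cross-tabulation instead of a plot. The snippet below is a sketch on a small hypothetical `routes` data frame: it counts, for each source, how many distinct destinations have at least one flight.

```r
# Hypothetical source/destination pairs.
routes <- data.frame(
  Source      = c("Banglore", "Banglore", "Delhi", "Mumbai"),
  Destination = c("Delhi", "Delhi", "Cochin", "Hyderabad")
)

# Rows are sources, columns are destinations, cells are flight counts.
route_counts <- table(routes$Source, routes$Destination)

# Number of destinations with a non-zero count for every source.
destinations_per_source <- rowSums(route_counts > 0)
print(route_counts)
print(destinations_per_source)
```

If every entry of `destinations_per_source` equals 1, Source and Destination carry the same information, which matters when both are later turned into dummy variables.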
#Plotting frequency count of each airline vs source:
ggplot(train1, aes(x = Airline, fill = Source)) +
  geom_bar(position = "dodge") + labs(title = "Counts of each Airline from Source") +
  geom_text(aes(label = ..count..), stat = "count", position = position_dodge(0.9), vjust = -0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#Plotting mean price Vs Total_Stops:
ggplot(train1, aes(x = Total_Stops, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price vs total stops")
#Plotting mean price Vs duration:
ggplot(train1, aes(x = duration, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price vs duration")

#Plotting duration vs price: #important plot
ggplot(train1, aes(x = duration, y = Price)) + geom_point() + geom_smooth(method = "lm")
#Plotting mean price of each dep_time_slot:
ggplot(train1, aes(x = dep_time_slot, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price vs dep_time_slot")

#(The means are more or less the same (good))
#Plotting frequency count of each airline vs dep_time_slot:
ggplot(train1, aes(x = Airline, fill = dep_time_slot)) +
  geom_bar(position = "dodge") + labs(title = "Counts of each Airline vs dep_time_slot") +
  geom_text(aes(label = ..count..), stat = "count", position = position_dodge(0.9), vjust = -0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#Plotting frequency count of each Additional Info:
ggplot(train1, aes(x = Additional_Info, fill = Additional_Info)) +
  geom_bar(position = "dodge") + labs(title = "Counts of each Additional Info") +
  geom_text(aes(label = ..count..), stat = "count", position = position_dodge(0.9), vjust = -0.2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot(train1$departure,train1$Price)

#Plotting mean price of each dep_day:
ggplot(train1, aes(x = dep_day, y = Price)) + stat_summary(fun = "mean", geom = "bar") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + labs(title = "Mean price of each dep_day")
#Based on the understanding from the above visualizations, dropping some variables:
train1$Additional_Info=NULL
train1$dur1=NULL
train1$departure=NULL
train1$Route=NULL

#Creating Dummy variables for categorical data:


library(fastDummies)
colnames(train1)

## [1] "Airline" "Source" "Destination" "Total_Stops"


## [5] "Price" "duration" "dep_time_slot" "dep_day"

train2=dummy_cols(train1, select_columns = c("Airline","Source","Destination",
                  "Total_Stops","dep_time_slot","dep_day"),
                  remove_first_dummy = TRUE)
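
To illustrate what dropping the first dummy means, here is a minimal base R sketch using `model.matrix()`, which applies the same idea: the first factor level becomes the reference and gets no column of its own. The `toy` data frame is hypothetical, and the column names produced by `model.matrix()` differ slightly from those of `dummy_cols()` (no underscore).

```r
# Hypothetical categorical column with three levels.
toy <- data.frame(Source = factor(c("Banglore", "Chennai", "Delhi", "Chennai")))

# model.matrix() adds an intercept column; dropping it leaves one
# 0/1 column per level except the first ("Banglore" is the reference).
dummies <- model.matrix(~ Source, data = toy)[, -1, drop = FALSE]
print(dummies)
```

A row of all zeros therefore stands for the reference level, which avoids perfectly collinear columns in a linear model.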

unique(train2$Total_Stops)

## [1] "1 stop" "non-stop" "2 stops" "3 stops" "4 stops"

train3=train2
colnames(train3)

## [1] "Airline" "Source"


Continued**

#Again removing the original variables, since I have created the dummy variables.
train3=train3[, -c(1:4)]
train3=train3[,-c(3,4)]
summary(train3)

## Price duration Airline_Air India Airline_GoAir


## Min. : 1759 Min. : 1.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 5266 1st Qu.: 3.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 8372 Median : 9.00 Median :0.0000 Median :0.0000
## Mean : 9057 Mean :10.74 Mean :0.1643 Mean :0.0182
## 3rd Qu.:12373 3rd Qu.:16.00 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :54826 Max. :48.00 Max. :1.0000 Max. :1.0000
## NA's :1
## Airline_IndiGo Airline_Jet Airways Airline_Multiple carriers
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1926 Mean :0.3611 Mean :0.1122
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Airline_SpiceJet Airline_Vistara Source_Chennai Source_Delhi
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.07674 Mean :0.04494 Mean :0.03565 Mean :0.4241
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## Source_Kolkata Source_Mumbai Destination_Cochin Destination_Delhi
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2693 Mean :0.0653 Mean :0.4241 Mean :0.2056
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Destination_Hyderabad Destination_Kolkata Total_Stops_2 stops
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.0653 Mean :0.03565 Mean :0.1424
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## Total_Stops_3 stops Total_Stops_4 stops Total_Stops_non-stop
## Min. :0.000000 Min. :0.00e+00 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.0000
## Median :0.000000 Median :0.00e+00 Median :0.0000
## Mean :0.004222 Mean :9.38e-05 Mean :0.3272
## 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:1.0000
## Max. :1.000000 Max. :1.00e+00 Max. :1.0000
##
## dep_time_slot_Evening dep_time_slot_Late_Night dep_time_slot_Morning
## Min. :0.000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.000 Median :0.00000 Median :0.0000
## Mean :0.267 Mean :0.05141 Mean :0.3544
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.00000 Max. :1.0000
##
## dep_time_slot_Pre_Morning dep_day_01Jun dep_day_01Mar dep_day_01May
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.04363 Mean :0.03209 Mean :0.0182 Mean :0.02599
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## dep_day_03Apr dep_day_03Jun dep_day_03Mar dep_day_03May
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.01032 Mean :0.03124 Mean :0.02936 Mean :0.008444
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.000000
##
## dep_day_06Apr dep_day_06Jun dep_day_06Mar dep_day_06May
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.009288 Mean :0.04719 Mean :0.03762 Mean :0.02636
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_09Apr dep_day_09Jun dep_day_09Mar dep_day_09May
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.01173 Mean :0.04644 Mean :0.02833 Mean :0.04541
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_12Apr dep_day_12Jun dep_day_12Mar dep_day_12May
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.00591 Mean :0.04625 Mean :0.01332 Mean :0.0243
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## dep_day_15Apr dep_day_15Jun dep_day_15Mar dep_day_15May
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.000
## Mean :0.00835 Mean :0.03077 Mean :0.0152 Mean :0.038
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.000
##
## dep_day_18Apr dep_day_18Jun dep_day_18Mar dep_day_18May
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.006286 Mean :0.009851 Mean :0.01464 Mean :0.04728
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.00000
##
## dep_day_21Apr dep_day_21Jun dep_day_21Mar dep_day_21May
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.007693 Mean :0.01023 Mean :0.03847 Mean :0.04663
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_24Apr dep_day_24Jun dep_day_24Mar dep_day_24May
## Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.008631 Mean :0.03293 Mean :0.0303 Mean :0.02683
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## dep_day_27Apr dep_day_27Jun dep_day_27Mar dep_day_27May
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.008819 Mean :0.03331 Mean :0.02805 Mean :0.03584
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##

str(train3)

## Classes 'tbl_df', 'tbl' and 'data.frame': 10659 obs. of 64 variables:
## $ Price : num 29528 23170 14752 12599 16000 ...
## $ duration : num 13 15 20 23 2 28 16 2 5 10 ...
## $ Airline_Air India : int 0 0 1 1 0 1 0 0 0 0 ...
## $ Airline_GoAir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Airline_IndiGo : int 0 0 0 0 1 0 1 1 1 0 ...
## $ Airline_Jet Airways : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Airline_Multiple carriers: int 1 1 0 0 0 0 0 0 0 0 ...
## $ Airline_SpiceJet : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Airline_Vistara : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Source_Chennai : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Source_Delhi : int 1 1 0 0 0 1 1 0 0 0 ...
## $ Source_Kolkata : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Source_Mumbai : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Destination_Cochin : int 1 1 0 0 0 1 1 0 0 0 ...
## $ Destination_Delhi : int 0 0 1 1 0 0 0 0 1 1 ...
## $ Destination_Hyderabad : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Destination_Kolkata : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Total_Stops_2 stops : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Total_Stops_3 stops : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Total_Stops_4 stops : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Total_Stops_non-stop : int 0 0 0 0 1 0 0 1 0 0 ...
## $ dep_time_slot_Evening : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_time_slot_Late_Night : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_time_slot_Morning : int 0 0 0 0 0 0 0 1 1 1 ...
## $ dep_time_slot_Pre_Morning: int 1 1 1 1 1 1 1 0 0 0 ...
## $ dep_day_01Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_01Mar : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_day_01May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27May : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int 9040
## ..- attr(*, "names")= chr "9040"
## - attr(*, ".internal.selfref")=<externalptr>

dim(train3)

## [1] 10659 64

sum(is.na(train3))

## [1] 1

train3=na.omit(train3)

#Thus the train data is ready for modelling.

#TEST DATA: #Data preparation is done on the test data separately, because in real time we will get new test data only after dealing with train & modelling.
test <- read_excel("C:/Users/jasvi/Desktop/Capstone Project/FlightPrice_test.xlsx")

#Doing the same data preparation (as done on train data) on test data:
View(test)
summary(test)

## Airline Date_of_Journey Source Destination
## Length:2671 Length:2671 Length:2671 Length:2671
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Route Dep_Time Arrival_Time Duration
## Length:2671 Length:2671 Length:2671 Length:2671
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Total_Stops Additional_Info
## Length:2671 Length:2671
## Class :character Class :character
## Mode :character Mode :character

str(test)

## Classes 'tbl_df', 'tbl' and 'data.frame': 2671 obs. of 10 variables:
## $ Airline : chr "Jet Airways" "IndiGo" "Jet Airways" "Multiple carriers" ...
## $ Date_of_Journey: chr "6/06/2019" "12/05/2019" "21/05/2019" "21/05/2019" ...
## $ Source : chr "Delhi" "Kolkata" "Delhi" "Delhi" ...
## $ Destination : chr "Cochin" "Banglore" "Cochin" "Cochin" ...
## $ Route : chr "DEL <U+2192> BOM <U+2192> COK" "CCU <U+2192> MAA <U+2192> BLR" "DEL <U+2192> BOM <U+2192> COK" "DEL <U+2192> BOM <U+2192> COK" ...
## $ Dep_Time : chr "17:30" "06:20" "19:15" "08:00" ...
## $ Arrival_Time : chr "04:25 07 Jun" "10:20" "19:00 22 May" "21:00" ...
## $ Duration : chr "10h 55m" "4h" "23h 45m" "13h" ...
## $ Total_Stops : chr "1 stop" "1 stop" "1 stop" "1 stop" ...
## $ Additional_Info: chr "No info" "No info" "In-flight meal not included" "No info" ...

dim(test)

## [1] 2671 10

#Missing Values Analysis:


sum(is.na(test))

## [1] 0

mv_test=data.frame(apply(test, 2, function(x){sum(is.na(x))}))
mv_test

## apply.test..2..function.x...
## Airline 0
## Date_of_Journey 0
## Source 0
## Destination 0
## Route 0
## Dep_Time 0
## Arrival_Time 0
## Duration 0
## Total_Stops 0
## Additional_Info 0

There are no missing values in the test data.
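The per-column count above can also be obtained with base R's `colSums()`, which returns a named vector instead of a data frame with an auto-generated column name. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the flight data):

```r
# Hypothetical toy data standing in for the test set
toy <- data.frame(
  Airline  = c("IndiGo", NA, "GoAir"),
  Duration = c("2h 50m", "4h", NA),
  Source   = c("Delhi", "Mumbai", "Kolkata"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
mv_toy <- colSums(is.na(toy))
mv_toy

# Total number of missing cells, the same idea as sum(is.na(test))
total_missing <- sum(mv_toy)
total_missing
```

The same call on the real data would be `colSums(is.na(test))`.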


#Variable 1:Airline:
unique(test$Airline)

## [1] "Jet Airways" "IndiGo"
## [3] "Multiple carriers" "Air Asia"
## [5] "Air India" "Vistara"
## [7] "SpiceJet" "Vistara Premium economy"
## [9] "GoAir" "Multiple carriers Premium economy"
## [11] "Jet Airways Business"

#Plotting frequency count of each airline in test:
ggplot(test,aes(x=Airline,fill=Airline))+
geom_bar(position="dodge")+labs(title = "Counts of each Airline")+
geom_text(aes(label=..count..),stat='count',position=position_dodge(0.9),vjust=-0.2)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))

#Trujet does not appear in test$Airline. Also removing the levels in test which I removed in train. #Subset:
test=subset(test, test$Airline != "Vistara Premium economy" & test$Airline != "Jet Airways Business" & test$Airline != "Multiple carriers Premium economy")
unique(test$Airline)

## [1] "Jet Airways" "IndiGo" "Multiple carriers"
## [4] "Air Asia" "Air India" "Vistara"
## [7] "SpiceJet" "GoAir"
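The chain of `!=` comparisons used for the subset can be written more compactly with `%in%`, which may be easier to maintain if the list of dropped levels grows. A small sketch on made-up rows (`toy`, `drop_levels` and `toy_kept` are hypothetical names; the real column is test$Airline):

```r
# Hypothetical toy data; the airline names are taken from the output above
toy <- data.frame(
  Airline = c("IndiGo", "Jet Airways Business", "GoAir", "Vistara Premium economy"),
  stringsAsFactors = FALSE
)

# Levels that were removed from train and should be removed from test too
drop_levels <- c("Vistara Premium economy",
                 "Jet Airways Business",
                 "Multiple carriers Premium economy")

# Keep only the rows whose Airline is NOT in the drop list
toy_kept <- toy[!(toy$Airline %in% drop_levels), , drop = FALSE]
unique(toy_kept$Airline)
```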

#Variable 2: Date_of_Journey:
unique(test$Date_of_Journey)
## [1] "6/06/2019" "12/05/2019" "21/05/2019" "24/06/2019" "12/06/2019"
## [6] "12/03/2019" "1/05/2019" "15/03/2019" "18/05/2019" "21/03/2019"
## [11] "15/06/2019" "15/05/2019" "3/06/2019" "06/03/2019" "24/03/2019"
## [16] "6/03/2019" "9/05/2019" "18/03/2019" "6/04/2019" "1/06/2019"
## [21] "3/03/2019" "27/03/2019" "9/06/2019" "3/05/2019" "1/04/2019"
## [26] "18/06/2019" "15/04/2019" "6/05/2019" "9/03/2019" "3/04/2019"
## [31] "27/06/2019" "21/06/2019" "21/04/2019" "18/04/2019" "9/04/2019"
## [36] "24/05/2019" "01/03/2019" "09/03/2019" "27/05/2019" "03/03/2019"
## [41] "27/04/2019" "1/03/2019" "24/04/2019" "12/04/2019"

#Replacing same dates of different formats (1/03/2019 and 01/03/2019) with a single format:
test$Date_of_Journey <- str_replace_all(test$Date_of_Journey, "01/03/2019", "1/03/2019")
test$Date_of_Journey <- str_replace_all(test$Date_of_Journey, "03/03/2019", "3/03/2019")
test$Date_of_Journey <- str_replace_all(test$Date_of_Journey, "06/03/2019", "6/03/2019")
test$Date_of_Journey <- str_replace_all(test$Date_of_Journey, "09/03/2019", "9/03/2019")
unique(test$Date_of_Journey)

## [1] "6/06/2019" "12/05/2019" "21/05/2019" "24/06/2019" "12/06/2019"
## [6] "12/03/2019" "1/05/2019" "15/03/2019" "18/05/2019" "21/03/2019"
## [11] "15/06/2019" "15/05/2019" "3/06/2019" "6/03/2019" "24/03/2019"
## [16] "9/05/2019" "18/03/2019" "6/04/2019" "1/06/2019" "3/03/2019"
## [21] "27/03/2019" "9/06/2019" "3/05/2019" "1/04/2019" "18/06/2019"
## [26] "15/04/2019" "6/05/2019" "9/03/2019" "3/04/2019" "27/06/2019"
## [31] "21/06/2019" "21/04/2019" "18/04/2019" "9/04/2019" "24/05/2019"
## [36] "1/03/2019" "27/05/2019" "27/04/2019" "24/04/2019" "12/04/2019"

str(test$Date_of_Journey)

## chr [1:2664] "6/06/2019" "12/05/2019" "21/05/2019" "21/05/2019" ...

test$Date_of_Journey=str_replace_all(test$Date_of_Journey, "[/]", "-")


unique(test$Date_of_Journey)

## [1] "6-06-2019" "12-05-2019" "21-05-2019" "24-06-2019" "12-06-2019"
## [6] "12-03-2019" "1-05-2019" "15-03-2019" "18-05-2019" "21-03-2019"
## [11] "15-06-2019" "15-05-2019" "3-06-2019" "6-03-2019" "24-03-2019"
## [16] "9-05-2019" "18-03-2019" "6-04-2019" "1-06-2019" "3-03-2019"
## [21] "27-03-2019" "9-06-2019" "3-05-2019" "1-04-2019" "18-06-2019"
## [26] "15-04-2019" "6-05-2019" "9-03-2019" "3-04-2019" "27-06-2019"
## [31] "21-06-2019" "21-04-2019" "18-04-2019" "9-04-2019" "24-05-2019"
## [36] "1-03-2019" "27-05-2019" "27-04-2019" "24-04-2019" "12-04-2019"
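As an alternative to replacing each zero-padded string by hand, the dates could be parsed once with `as.Date()`. The `%d` specifier accepts both "1" and "01", so the two spellings collapse to the same date. This is a sketch of a different technique, not the approach used in this report; `raw_dates`, `parsed` and `labels` are hypothetical names:

```r
# Sample values copied from the first unique() output above
raw_dates <- c("1/03/2019", "01/03/2019", "6/06/2019", "24/06/2019")

# Parse day/month/year; single-digit and zero-padded days parse identically
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
parsed

# The two spellings of 1 March now compare equal
parsed[1] == parsed[2]

# A fixed-width day-month label, if a character key is still needed
labels <- format(parsed, "%d-%m")
labels
```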

#Variable 3: Source:
unique(test$Source)

## [1] "Delhi" "Kolkata" "Banglore" "Mumbai" "Chennai"


#unique(train$Source) #same source in train & test.

#Variable 4: Destination:
unique(test$Destination)

## [1] "Cochin" "Banglore" "Delhi" "New Delhi" "Hyderabad" "Kolkata"

#unique(train$Destination) #same destination in train & test.


#Replacing New Delhi to Delhi:
test$Destination=str_replace_all(test$Destination, "New Delhi", "Delhi")
unique(test$Destination)

## [1] "Cochin" "Banglore" "Delhi" "Hyderabad" "Kolkata"

#Variable 5: Route:
unique(test$Route)

## [1] "DEL <U+2192> BOM <U+2192> COK" "CCU <U+2192> MAA <U+2192> BLR"
Continued**

#Combining Date_of_Journey & Dep_Time into a new derived variable:
test$departure=paste(test$Date_of_Journey, test$Dep_Time, sep=' ') #paste date & time together.
test$departure=as.POSIXlt(test$departure, format = "%d-%m-%Y %H:%M")
test=test[ order(test$departure , decreasing = FALSE ),]
class(test$departure)

## [1] "POSIXlt" "POSIXt"

#Variable 7: Arrival_Time:
unique(test$Arrival_Time)

## [1] "04:25" "08:20" "08:35" "22:55" "18:40"


Continued**

arrival_test=data.frame(str_split_fixed(test$Arrival_Time, " ", 2)) #just to see the variables separately, I've formed this dataframe.
test=data.frame(test,arrival_test$X2)

#Variable 8 :Duration:
unique(test$Duration)

## [1] "1h 30m" "2h 50m" "17h 5m" "12h 50m" "2h 45m" "5h 55m" "40h 40m"

Continued**

, "h", ".00")
class(test$dur)
## [1] "character"

test$dur1=hm(test$dur)

str(test$dur1)

## Formal class 'Period' [package "lubridate"] with 6 slots
## ..@ .Data : num [1:2664] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ year : num [1:2664] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ month : num [1:2664] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ day : num [1:2664] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ hour : num [1:2664] 1 2 2 17 12 2 5 40 17 12 ...
## ..@ minute: num [1:2664] 30 50 50 5 50 45 55 40 45 10 ...

sum(is.na(test$dur1))

## [1] 1

class(test$dur1)

## [1] "Period"
## attr(,"package")
## [1] "lubridate"

summary(test$dur1)

## Min. 1st Qu.
## "1H 15M 0S" "2H 55M 0S"
## Median Mean
## "8H 40M 0S" "10H 40M 36.7254975591422S"
## 3rd Qu. Max.
## "15H 15M 0S" "1d 16H 40M 0S"
## NA's
## "1"

test$duration=round(as.duration(test$dur1)/dhours(1)) #important: duration in hours, rounded to the nearest hour
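For comparison, the same hours figure can be sketched in base R without going through a lubridate Period. The helpers `get_part()` and `duration_hours()` below are hypothetical names, not part of this report or of any package. The first three sample strings come from the str(test) output; "5m" is an assumed minute-only edge case added for illustration.

```r
# Hypothetical helper: pull the number in front of a unit letter, 0 if absent
get_part <- function(x, unit) {
  pattern <- paste0("^.*?([0-9]+)", unit, ".*$")
  found <- grepl(pattern, x, perl = TRUE)
  out <- rep(0, length(x))
  out[found] <- as.numeric(sub(pattern, "\\1", x[found], perl = TRUE))
  out
}

# Hypothetical helper: convert "10h 55m" style strings to decimal hours
duration_hours <- function(x) {
  get_part(x, "h") + get_part(x, "m") / 60
}

sample_dur <- c("10h 55m", "4h", "23h 45m", "5m")
hours <- duration_hours(sample_dur)
hours

# Rounded to whole hours, as the duration variable is in this report
round(hours)
```

Strings with only hours or only minutes yield a number rather than NA here, so no row has to be dropped later because of the duration format.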

#Variable 9: Total_Stops:
unique(test$Total_Stops)

## [1] "non-stop" "4 stops" "2 stops" "1 stop" "3 stops"

#Variable 10: Additional Info:


unique(test$Additional_Info)

## [1] "No info" "1 Long layover"
## [3] "Change airports" "In-flight meal not included"
## [5] "No check-in baggage included"

#Derived Variable: #Extract the hour and day from the departure time
test$dep_hour <- format(test$departure, "%H")
#Creating Pre_Morning, Morning, Day_Time, Evening and Late_Night time slots using the dep_hour variable:
str(test$dep_hour)

## chr [1:2664] "02" "05" "05" "05" "05" "06" "06" "06" "07" "07" "08" "08" ...

test$dep_hour=as.numeric(test$dep_hour)
test$dep_time_slot = ifelse(test$dep_hour < 5, "Pre_Morning", ifelse(test$dep_hour < 10,"Morning",ifelse(test$dep_hour < 17,"Day_Time",ifelse(test$dep_hour < 22,"Evening","Late_Night"))))
test$dep_time_slot=as.factor(test$dep_time_slot)

summary(test$dep_time_slot)

## Day_Time Evening Late_Night Morning Pre_Morning
## 791 709 127 930 107
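The nested ifelse() that builds the time slots can also be expressed with `cut()`, which states the break points in one place. The sketch below uses made-up hours (`dep_hour`, `slot_cut` and `slot_ifelse` are hypothetical local names) and checks that both forms agree:

```r
# Hypothetical departure hours covering every boundary
dep_hour <- c(2, 5, 9, 10, 16, 17, 21, 22, 23)

# right = FALSE makes the intervals [lower, upper), matching the "<" tests
slot_cut <- cut(
  dep_hour,
  breaks = c(-Inf, 5, 10, 17, 22, Inf),
  labels = c("Pre_Morning", "Morning", "Day_Time", "Evening", "Late_Night"),
  right = FALSE
)

# The nested ifelse() form used in this report, for comparison
slot_ifelse <- ifelse(dep_hour < 5, "Pre_Morning",
               ifelse(dep_hour < 10, "Morning",
               ifelse(dep_hour < 17, "Day_Time",
               ifelse(dep_hour < 22, "Evening", "Late_Night"))))

# Both approaches assign the same slot to every hour
all(as.character(slot_cut) == slot_ifelse)
```

One difference to keep in mind: `cut()` orders the factor levels as given in `labels`, while `as.factor()` orders them alphabetically, so the first level (and hence the dropped dummy, if one is dropped) may differ.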

test$dep_day <- format(test$departure, "%d%b")

summary(test)

## Airline Date_of_Journey Source Destination
## Length:2664 Length:2664 Length:2664 Length:2664
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Route Dep_Time Arrival_Time Duration
## Length:2664 Length:2664 Length:2664 Length:2664
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Total_Stops Additional_Info departure
## Length:2664 Length:2664 Min. :2019-03-01 02:55:00
## Class :character Class :character 1st Qu.:2019-03-27 17:10:00
## Mode :character Mode :character Median :2019-05-15 08:25:00
## Mean :2019-05-05 05:13:36
## 3rd Qu.:2019-06-06 07:05:00
## Max. :2019-06-27 23:05:00
##
## arrival_test.X2 dur dur1
## :1599 Length:2664 Min. :1H 15M 0S
## 07 Mar : 67 Class :character 1st Qu.:2H 55M 0S
## 13 Jun : 66 Mode :character Median :8H 40M 0S
## 10 Jun : 58 Mean :10H 40M 36.7254975591422S
## 10 May : 58 3rd Qu.:15H 15M 0S
## 07 Jun : 53 Max. :1d 16H 40M 0S
## (Other): 763 NA's :1
## duration dep_hour dep_time_slot dep_day
## Min. : 1.00 Min. : 0.00 Day_Time :791 Length:2664
## 1st Qu.: 3.00 1st Qu.: 8.00 Evening :709 Class :character
## Median : 9.00 Median :12.00 Late_Night :127 Mode :character
## Mean :10.69 Mean :12.61 Morning :930
## 3rd Qu.:15.00 3rd Qu.:18.00 Pre_Morning:107
## Max. :41.00 Max. :23.00
## NA's :1

test1=test

#Dropping some variables:
colnames(test1)

## [1] "Airline" "Date_of_Journey" "Source" "Destination"
## [5] "Route" "Dep_Time" "Arrival_Time" "Duration"
## [9] "Total_Stops" "Additional_Info" "departure" "arrival_test.X2"
## [13] "dur" "dur1" "duration" "dep_hour"
## [17] "dep_time_slot" "dep_day"

test1$Route=NULL
test1$Date_of_Journey=NULL
test1$Dep_Time=NULL
test1$Arrival_Time=NULL
test1$Duration=NULL
test1$dur=NULL
test1$dep_hour=NULL
test1$Additional_Info=NULL
test1$dur1=NULL
test1$departure=NULL

#Creating dummy variables for categorical data:
colnames(train1)

## [1] "Airline" "Source" "Destination" "Total_Stops"
## [5] "Price" "duration" "dep_time_slot" "dep_day"

test2=dummy_cols(test1, select_columns = c("Airline","Source","Destination","Total_Stops","dep_time_slot","dep_day" ),
remove_first_dummy = FALSE)
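To illustrate what the dummy encoding produces, here is a base R sketch using `model.matrix()` instead of `dummy_cols()`. It is a stand-in technique on made-up data, not the call used in this report; `toy`, `airline_levels` and `dummies` are hypothetical names. Fixing the factor levels explicitly means a level that is absent from the data still gets an (all-zero) column.

```r
# Hypothetical toy data
toy <- data.frame(
  Airline  = c("IndiGo", "GoAir", "IndiGo", "Vistara"),
  duration = c(2, 3, 5, 4),
  stringsAsFactors = FALSE
)

# Declare every expected level, including one that does not occur in toy
airline_levels <- c("GoAir", "IndiGo", "Vistara", "SpiceJet")
toy$Airline <- factor(toy$Airline, levels = airline_levels)

# "0 +" removes the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ 0 + Airline, data = toy)
colnames(dummies)

# Each row has exactly one 1 across the airline columns
rowSums(dummies)
```

Declaring the same levels for train and test is one way to make both encodings produce the same set of columns.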

test3=test2
colnames(test3)

## [1] "Airline" "Source"


Continued**
test3=test3[, -c(1:5)]
test3=test3[,-c(2,3)]

summary(test3)

## duration Airline_Air Asia Airline_Air India Airline_GoAir
## Min. : 1.00 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 3.00 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 9.00 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :10.69 Mean :0.03228 Mean :0.1652 Mean :0.01727
## 3rd Qu.:15.00 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :41.00 Max. :1.00000 Max. :1.0000 Max. :1.00000
## NA's :1
## Airline_IndiGo Airline_Jet Airways Airline_Multiple carriers
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1918 Mean :0.3367 Mean :0.1303
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Airline_SpiceJet Airline_Vistara Source_Banglore Source_Chennai
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.07808 Mean :0.04842 Mean :0.2068 Mean :0.02815
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## Source_Delhi Source_Kolkata Source_Mumbai Destination_Banglore
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.4287 Mean :0.2665 Mean :0.06982 Mean :0.2665
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## Destination_Cochin Destination_Delhi Destination_Hyderabad Destination_Kolkata
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.4287 Mean :0.2068 Mean :0.06982 Mean :0.02815
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## Total_Stops_1 stop Total_Stops_2 stops Total_Stops_3 stops Total_Stops_4 stops
## Min. :0.0000 Min. :0.0000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :1.0000 Median :0.0000 Median :0.000000 Median :0.0000000
## Mean :0.5357 Mean :0.1423 Mean :0.004129 Mean :0.0003754
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.0000 Max. :1.0000 Max. :1.000000 Max. :1.0000000
##
## Total_Stops_non-stop dep_time_slot_Day_Time dep_time_slot_Evening
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3176 Mean :0.2969 Mean :0.2661
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## dep_time_slot_Late_Night dep_time_slot_Morning dep_time_slot_Pre_Morning
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.04767 Mean :0.3491 Mean :0.04017
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## dep_day_01Apr dep_day_01Jun dep_day_01Mar dep_day_01May
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.02928 Mean :0.03303 Mean :0.01689 Mean :0.02327
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_03Apr dep_day_03Jun dep_day_03Mar dep_day_03May
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.01051 Mean :0.03453 Mean :0.03228 Mean :0.01014
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_06Apr dep_day_06Jun dep_day_06Mar dep_day_06May
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.005255 Mean :0.04767 Mean :0.04692 Mean :0.0274
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## dep_day_09Apr dep_day_09Jun dep_day_09Mar dep_day_09May
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.009009 Mean :0.04467 Mean :0.02853 Mean :0.05405
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_12Apr dep_day_12Jun dep_day_12Mar dep_day_12May
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.004129 Mean :0.05068 Mean :0.01614 Mean :0.02553
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_15Apr dep_day_15Jun dep_day_15Mar dep_day_15May
## Min. :0.000000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.008634 Mean :0.03941 Mean :0.01239 Mean :0.03979
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## dep_day_18Apr dep_day_18Jun dep_day_18Mar dep_day_18May
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.004504 Mean :0.008258 Mean :0.01539 Mean :0.04842
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.00000
##
## dep_day_21Apr dep_day_21Jun dep_day_21Mar dep_day_21May
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.008258 Mean :0.009009 Mean :0.03378 Mean :0.04429
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.00000
##
## dep_day_24Apr dep_day_24Jun dep_day_24Mar dep_day_24May
## Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.007883 Mean :0.03191 Mean :0.0289 Mean :0.02665
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## dep_day_27Apr dep_day_27Jun dep_day_27Mar dep_day_27May
## Min. :0.000000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.000000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.005631 Mean :0.02815 Mean :0.0244 Mean :0.0244
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.00000 Max. :1.0000 Max. :1.0000
##

str(test3)

## 'data.frame': 2664 obs. of 69 variables:
## $ duration : num 2 3 3 17 13 3 6 41 18 12 ...
## $ Airline_Air Asia : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Airline_Air India : int 0 0 0 1 1 0 0 1 0 0 ...
## $ Airline_GoAir : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Airline_IndiGo : int 0 1 0 0 0 1 0 0 0 0 ...
## $ Airline_Jet Airways : int 1 0 0 0 0 0 0 0 1 1 ...
## $ Airline_Multiple carriers: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Airline_SpiceJet : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Airline_Vistara : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Source_Banglore : int 0 0 1 1 1 1 1 1 1 1 ...
## $ Source_Chennai : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Source_Delhi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Source_Kolkata : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Source_Mumbai : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Destination_Banglore : int 0 1 0 0 0 0 0 0 0 0 ...
## $ Destination_Cochin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Destination_Delhi : int 0 0 1 1 1 1 1 1 1 1 ...
## $ Destination_Hyderabad : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Destination_Kolkata : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Total_Stops_1 stop : int 0 0 0 0 0 0 1 0 1 1 ...
## $ Total_Stops_2 stops : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Total_Stops_3 stops : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Total_Stops_4 stops : int 0 0 0 1 0 0 0 0 0 0 ...
## $ Total_Stops_non-stop : int 1 1 1 0 0 1 0 0 0 0 ...
## $ dep_time_slot_Day_Time : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_time_slot_Evening : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_time_slot_Late_Night : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_time_slot_Morning : int 0 1 1 1 1 1 1 1 1 1 ...
## $ dep_time_slot_Pre_Morning: int 1 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_01Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_01Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_01Mar : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_day_01May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_03May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_06May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_09May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_12May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_15May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_18May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_21May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_24May : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27Apr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27Jun : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27Mar : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dep_day_27May : int 0 0 0 0 0 0 0 0 0 0 ...

dim(test3)

## [1] 2664 69

sum(is.na(test3))

## [1] 1

test3=na.omit(test3)
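Before dropping incomplete rows it can be useful to see which row is affected, since na.omit() removes it silently. A small sketch with `complete.cases()` on made-up data (`toy`, `bad_rows` and `toy_clean` are hypothetical names):

```r
# Hypothetical toy data with one missing duration
toy <- data.frame(
  duration       = c(2, NA, 13),
  Airline_IndiGo = c(1L, 0L, 0L)
)

# Row positions that contain at least one NA
bad_rows <- which(!complete.cases(toy))
bad_rows

# Inspect the offending row(s) before removing them
toy[bad_rows, ]

# Same rows that na.omit(toy) would keep
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```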

#Thus the test data is ready for testing.


colnames(train3)
## [1] "Price" "duration"
## [3] "Airline_Air India" "Airline_GoAir"
## [5] "Airline_IndiGo" "Airline_Jet Airways"
## [7] "Airline_Multiple carriers" "Airline_SpiceJet"
## [9] "Airline_Vistara" "Source_Chennai"
## [11] "Source_Delhi" "Source_Kolkata"
## [13] "Source_Mumbai" "Destination_Cochin"
## [15] "Destination_Delhi" "Destination_Hyderabad"
## [17] "Destination_Kolkata" "Total_Stops_2 stops"
## [19] "Total_Stops_3 stops" "Total_Stops_4 stops"
## [21] "Total_Stops_non-stop" "dep_time_slot_Evening"
## [23] "dep_time_slot_Late_Night" "dep_time_slot_Morning"
## [25] "dep_time_slot_Pre_Morning" "dep_day_01Jun"
## [27] "dep_day_01Mar" "dep_day_01May"
## [29] "dep_day_03Apr" "dep_day_03Jun"
## [31] "dep_day_03Mar" "dep_day_03May"
## [33] "dep_day_06Apr" "dep_day_06Jun"
## [35] "dep_day_06Mar" "dep_day_06May"
## [37] "dep_day_09Apr" "dep_day_09Jun"
## [39] "dep_day_09Mar" "dep_day_09May"
## [41] "dep_day_12Apr" "dep_day_12Jun"
## [43] "dep_day_12Mar" "dep_day_12May"
## [45] "dep_day_15Apr" "dep_day_15Jun"
## [47] "dep_day_15Mar" "dep_day_15May"
## [49] "dep_day_18Apr" "dep_day_18Jun"
## [51] "dep_day_18Mar" "dep_day_18May"
## [53] "dep_day_21Apr" "dep_day_21Jun"
## [55] "dep_day_21Mar" "dep_day_21May"
## [57] "dep_day_24Apr" "dep_day_24Jun"
## [59] "dep_day_24Mar" "dep_day_24May"
## [61] "dep_day_27Apr" "dep_day_27Jun"
## [63] "dep_day_27Mar" "dep_day_27May"

colnames(test3)

## [1] "duration" "Airline_Air Asia"
## [3] "Airline_Air India" "Airline_GoAir"
## [5] "Airline_IndiGo" "Airline_Jet Airways"
## [7] "Airline_Multiple carriers" "Airline_SpiceJet"
## [9] "Airline_Vistara" "Source_Banglore"
## [11] "Source_Chennai" "Source_Delhi"
## [13] "Source_Kolkata" "Source_Mumbai"
## [15] "Destination_Banglore" "Destination_Cochin"
## [17] "Destination_Delhi" "Destination_Hyderabad"
## [19] "Destination_Kolkata" "Total_Stops_1 stop"
## [21] "Total_Stops_2 stops" "Total_Stops_3 stops"
## [23] "Total_Stops_4 stops" "Total_Stops_non-stop"
## [25] "dep_time_slot_Day_Time" "dep_time_slot_Evening"
## [27] "dep_time_slot_Late_Night" "dep_time_slot_Morning"
## [29] "dep_time_slot_Pre_Morning" "dep_day_01Apr"
## [31] "dep_day_01Jun" "dep_day_01Mar"
## [33] "dep_day_01May" "dep_day_03Apr"
## [35] "dep_day_03Jun" "dep_day_03Mar"
## [37] "dep_day_03May" "dep_day_06Apr"
## [39] "dep_day_06Jun" "dep_day_06Mar"
## [41] "dep_day_06May" "dep_day_09Apr"
## [43] "dep_day_09Jun" "dep_day_09Mar"
## [45] "dep_day_09May" "dep_day_12Apr"
## [47] "dep_day_12Jun" "dep_day_12Mar"
## [49] "dep_day_12May" "dep_day_15Apr"
## [51] "dep_day_15Jun" "dep_day_15Mar"
## [53] "dep_day_15May" "dep_day_18Apr"
## [55] "dep_day_18Jun" "dep_day_18Mar"
## [57] "dep_day_18May" "dep_day_21Apr"
## [59] "dep_day_21Jun" "dep_day_21Mar"
## [61] "dep_day_21May" "dep_day_24Apr"
## [63] "dep_day_24Jun" "dep_day_24Mar"
## [65] "dep_day_24May" "dep_day_27Apr"
## [67] "dep_day_27Jun" "dep_day_27Mar"
## [69] "dep_day_27May"
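Comparing the two listings, train3 has Price plus 63 predictors while test3 has 69 columns. The six extra test columns (Airline_Air Asia, Source_Banglore, Destination_Banglore, Total_Stops_1 stop, dep_time_slot_Day_Time, dep_day_01Apr) appear to be the first level of each categorical variable. predict() on an lm model looks up predictors by name, so extra columns in the new data are ignored, but an explicit alignment step makes the mismatch visible. A minimal sketch on made-up data (`train_cols`, `toy_test` and the other names are hypothetical):

```r
# Hypothetical column names standing in for colnames(train3)
train_cols <- c("Price", "duration", "Airline_IndiGo", "dep_day_01Jun")

# Hypothetical test data: one extra column, one predictor missing
toy_test <- data.frame(
  duration           = c(2, 3),
  `Airline_Air Asia` = c(0L, 1L),
  Airline_IndiGo     = c(1L, 0L),
  check.names = FALSE
)

predictors      <- setdiff(train_cols, "Price")
extra_in_test   <- setdiff(names(toy_test), predictors)
missing_in_test <- setdiff(predictors, names(toy_test))
extra_in_test
missing_in_test

# Add an all-zero column for any predictor the test data lacks
for (col in missing_in_test) toy_test[[col]] <- 0L

# Keep only the training predictors, in training order
toy_aligned <- toy_test[, predictors, drop = FALSE]
names(toy_aligned)
```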

#Modelling: #LINEAR REGRESSION MODEL:


model1=lm(Price~.,data=train3)
summary(model1)

##
## Call:
## lm(formula = Price ~ ., data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10368 -1264 -119 1255 42354
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5618.028 228.843 24.550 < 2e-16 ***
## duration -15.237 4.424 -3.444 0.000576 ***
## `Airline_Air India` 1877.469 153.411 12.238 < 2e-16 ***
## Airline_GoAir 140.971 219.458 0.642 0.520654
## Airline_IndiGo 370.118 145.660 2.541 0.011069 *
## `Airline_Jet Airways` 4553.184 145.208 31.356 < 2e-16 ***
## `Airline_Multiple carriers` 3460.609 160.002 21.629 < 2e-16 ***
## Airline_SpiceJet 85.329 160.690 0.531 0.595417
## Airline_Vistara 2388.100 177.005 13.492 < 2e-16 ***
## Source_Chennai 125.282 139.297 0.899 0.368465
## Source_Delhi 247.660 91.537 2.706 0.006830 **
## Source_Kolkata 405.453 84.780 4.782 1.76e-06 ***
## Source_Mumbai -1605.191 108.144 -14.843 < 2e-16 ***
## Destination_Cochin NA NA NA NA
## Destination_Delhi NA NA NA NA
## Destination_Hyderabad NA NA NA NA
## Destination_Kolkata NA NA NA NA
## `Total_Stops_2 stops` 2368.622 82.500 28.711 < 2e-16 ***
## `Total_Stops_3 stops` 3291.665 363.980 9.044 < 2e-16 ***
## `Total_Stops_4 stops` -901.178 2387.767 -0.377 0.705873
## `Total_Stops_non-stop` -3382.937 91.935 -36.797 < 2e-16 ***
## dep_time_slot_Evening -79.450 64.014 -1.241 0.214589
## dep_time_slot_Late_Night 412.940 113.645 3.634 0.000281 ***
## dep_time_slot_Morning -218.890 59.102 -3.704 0.000214 ***
## dep_time_slot_Pre_Morning 31.341 121.576 0.258 0.796571
## dep_day_01Jun 1518.250 201.216 7.545 4.88e-14 ***
## dep_day_01Mar 11767.683 233.231 50.455 < 2e-16 ***
## dep_day_01May 1351.393 206.350 6.549 6.06e-11 ***
## dep_day_03Apr 127.696 274.203 0.466 0.641440
## dep_day_03Jun 1471.523 202.372 7.271 3.81e-13 ***
## dep_day_03Mar 5351.423 204.900 26.117 < 2e-16 ***
## dep_day_03May 1424.722 295.252 4.825 1.42e-06 ***
## dep_day_06Apr -134.897 284.372 -0.474 0.635249
## dep_day_06Jun 1419.850 184.259 7.706 1.42e-14 ***
## dep_day_06Mar 5735.347 195.159 29.388 < 2e-16 ***
## dep_day_06May 1344.926 205.575 6.542 6.34e-11 ***
## dep_day_09Apr 251.092 262.285 0.957 0.338425
## dep_day_09Jun 1463.604 184.464 7.934 2.33e-15 ***
## dep_day_09Mar 3568.481 205.953 17.327 < 2e-16 ***
## dep_day_09May 1523.929 185.543 8.213 2.40e-16 ***
## dep_day_12Apr 1677.626 338.481 4.956 7.29e-07 ***
## dep_day_12Jun 937.374 184.882 5.070 4.04e-07 ***
## dep_day_12Mar 3374.598 256.937 13.134 < 2e-16 ***
## dep_day_12May 1491.453 209.739 7.111 1.23e-12 ***
## dep_day_15Apr 632.694 295.758 2.139 0.032440 *
## dep_day_15Jun 1048.363 203.007 5.164 2.46e-07 ***
## dep_day_15Mar 1566.667 243.330 6.438 1.26e-10 ***
## dep_day_15May 1351.489 191.071 7.073 1.61e-12 ***
## dep_day_18Apr 1782.013 330.756 5.388 7.29e-08 ***
## dep_day_18Jun 1314.179 279.684 4.699 2.65e-06 ***
## dep_day_18Mar 2471.817 245.941 10.050 < 2e-16 ***
## dep_day_18May 1730.269 184.218 9.393 < 2e-16 ***
## dep_day_21Apr 1236.848 305.272 4.052 5.12e-05 ***
## dep_day_21Jun 1494.800 276.092 5.414 6.29e-08 ***
## dep_day_21Mar -65.375 193.860 -0.337 0.735953
## dep_day_21May 1658.359 184.534 8.987 < 2e-16 ***
## dep_day_24Apr 592.712 292.320 2.028 0.042624 *
## dep_day_24Jun 901.095 200.414 4.496 6.99e-06 ***
## dep_day_24Mar 2303.058 199.418 11.549 < 2e-16 ***
## dep_day_24May 1499.174 204.747 7.322 2.62e-13 ***
## dep_day_27Apr 1031.952 290.669 3.550 0.000387 ***
## dep_day_27Jun 929.576 199.994 4.648 3.39e-06 ***
## dep_day_27Mar -412.390 207.823 -1.984 0.047245 *
## dep_day_27May 1594.541 197.109 8.090 6.64e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2380 on 10598 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.7149
## F-statistic: 453.9 on 59 and 10598 DF, p-value: < 2.2e-16

model2=lm(Price ~ duration + `Airline_Air India` + Airline_IndiGo +
    `Airline_Jet Airways` + Airline_SpiceJet + `Airline_Multiple carriers` +
    Airline_Vistara + Airline_GoAir + Source_Mumbai +
    Source_Kolkata + `Total_Stops_non-stop` + `Total_Stops_2 stops` +
    `Total_Stops_3 stops` + dep_time_slot_Morning + dep_time_slot_Late_Night +
    dep_day_03Mar + dep_day_06Mar + dep_day_09Mar + dep_day_12Mar +
    dep_day_15Mar + dep_day_18Mar + dep_day_21Mar + dep_day_24Mar +
    dep_day_27Mar + dep_day_03Apr + dep_day_06Apr +
    dep_day_09Apr + dep_day_12Apr + dep_day_15Apr + dep_day_18Apr +
    dep_day_21Apr + dep_day_24Apr + dep_day_27Apr + dep_day_01May +
    dep_day_03May + dep_day_06May + dep_day_09May + dep_day_12May +
    dep_day_15May + dep_day_18May + dep_day_21May + dep_day_24May +
    dep_day_27May + dep_day_01Jun + dep_day_03Jun + dep_day_06Jun +
    dep_day_09Jun + dep_day_12Jun + dep_day_15Jun + dep_day_18Jun +
    dep_day_21Jun + dep_day_24Jun + dep_day_27Jun, data = train3)

summary(model2)

##
## Call:
## lm(formula = Price ~ duration + `Airline_Air India` + Airline_IndiGo +
## `Airline_Jet Airways` + Airline_SpiceJet + `Airline_Multiple carriers` +
## Airline_Vistara + Airline_GoAir + Source_Mumbai + Source_Kolkata +
## `Total_Stops_non-stop` + `Total_Stops_2 stops` + `Total_Stops_3 stops` +
## dep_time_slot_Morning + dep_time_slot_Late_Night + dep_day_03Mar +
## dep_day_06Mar + dep_day_09Mar + dep_day_12Mar + dep_day_15Mar +
## dep_day_18Mar + dep_day_21Mar + dep_day_24Mar + dep_day_27Mar +
## dep_day_03Apr + dep_day_06Apr + dep_day_09Apr + dep_day_12Apr +
## dep_day_15Apr + dep_day_18Apr + dep_day_21Apr + dep_day_24Apr +
## dep_day_27Apr + dep_day_01May + dep_day_03May + dep_day_06May +
## dep_day_09May + dep_day_12May + dep_day_15May + dep_day_18May +
## dep_day_21May + dep_day_24May + dep_day_27May + dep_day_01Jun +
## dep_day_03Jun + dep_day_06Jun + dep_day_09Jun + dep_day_12Jun +
## dep_day_15Jun + dep_day_18Jun + dep_day_21Jun + dep_day_24Jun +
## dep_day_27Jun, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10774 -1384 -221 1256 42172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11055.688 211.901 52.174 < 2e-16 ***
## duration -14.080 4.917 -2.863 0.004200 **
## `Airline_Air India` 1897.934 170.431 11.136 < 2e-16 ***
## Airline_IndiGo 363.116 161.905 2.243 0.024932 *
## `Airline_Jet Airways` 4439.613 161.033 27.570 < 2e-16 ***
## Airline_SpiceJet 66.008 177.567 0.372 0.710098
## `Airline_Multiple carriers` 3351.805 177.742 18.858 < 2e-16 ***
## Airline_Vistara 2150.000 195.550 10.995 < 2e-16 ***
## Airline_GoAir -12.468 244.213 -0.051 0.959285
## Source_Mumbai -1803.485 114.709 -15.722 < 2e-16 ***
## Source_Kolkata -266.442 69.215 -3.849 0.000119 ***
## `Total_Stops_non-stop` -3555.842 87.839 -40.481 < 2e-16 ***
## `Total_Stops_2 stops` 2303.779 90.948 25.331 < 2e-16 ***
## `Total_Stops_3 stops` 3264.067 405.615 8.047 9.38e-16 ***
## dep_time_slot_Morning -163.297 55.761 -2.928 0.003413 **
## dep_time_slot_Late_Night 652.809 121.388 5.378 7.70e-08 ***
## dep_day_03Mar 155.585 197.867 0.786 0.431705
## dep_day_06Mar 535.831 185.209 2.893 0.003822 **
## dep_day_09Mar -1627.429 199.256 -8.168 3.51e-16 ***
## dep_day_12Mar -1959.766 257.640 -7.607 3.05e-14 ***
## dep_day_15Mar -3679.773 244.597 -15.044 < 2e-16 ***
## dep_day_18Mar -2756.340 247.837 -11.122 < 2e-16 ***
## dep_day_21Mar -5244.986 183.565 -28.573 < 2e-16 ***
## dep_day_24Mar -2640.551 194.111 -13.603 < 2e-16 ***
## dep_day_27Mar -5532.476 200.611 -27.578 < 2e-16 ***
## dep_day_03Apr -4969.375 284.134 -17.490 < 2e-16 ***
## dep_day_06Apr -5212.902 296.279 -17.595 < 2e-16 ***
## dep_day_09Apr -4864.394 269.610 -18.042 < 2e-16 ***
## dep_day_12Apr -3377.133 360.304 -9.373 < 2e-16 ***
## dep_day_15Apr -4453.249 309.804 -14.374 < 2e-16 ***
## dep_day_18Apr -3281.154 350.966 -9.349 < 2e-16 ***
## dep_day_21Apr -3850.110 320.836 -12.000 < 2e-16 ***
## dep_day_24Apr -4475.868 305.727 -14.640 < 2e-16 ***
## dep_day_27Apr -4068.156 303.462 -13.406 < 2e-16 ***
## dep_day_01May -3537.015 203.451 -17.385 < 2e-16 ***
## dep_day_03May -3665.521 309.606 -11.839 < 2e-16 ***
## dep_day_06May -3547.772 202.445 -17.525 < 2e-16 ***
## dep_day_09May -3484.382 174.426 -19.976 < 2e-16 ***
## dep_day_12May -3402.623 207.700 -16.382 < 2e-16 ***
## dep_day_15May -3639.648 182.433 -19.951 < 2e-16 ***
## dep_day_18May -3278.983 172.714 -18.985 < 2e-16 ***
## dep_day_21May -3349.419 173.262 -19.332 < 2e-16 ***
## dep_day_24May -3396.886 201.366 -16.869 < 2e-16 ***
## dep_day_27May -3546.017 187.085 -18.954 < 2e-16 ***
## dep_day_01Jun -3617.753 192.556 -18.788 < 2e-16 ***
## dep_day_03Jun -3661.461 193.991 -18.874 < 2e-16 ***
## dep_day_06Jun -3586.539 172.855 -20.749 < 2e-16 ***
## dep_day_09Jun -3535.248 173.297 -20.400 < 2e-16 ***
## dep_day_12Jun -4062.389 173.779 -23.377 < 2e-16 ***
## dep_day_15Jun -4070.925 194.994 -20.877 < 2e-16 ***
## dep_day_18Jun -3773.269 291.003 -12.966 < 2e-16 ***
## dep_day_21Jun -3583.450 286.687 -12.500 < 2e-16 ***
## dep_day_24Jun -4229.439 191.433 -22.094 < 2e-16 ***
## dep_day_27Jun -4201.964 190.942 -22.006 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2654 on 10604 degrees of freedom
## Multiple R-squared: 0.6472, Adjusted R-squared: 0.6454
## F-statistic: 367 on 53 and 10604 DF, p-value: < 2.2e-16

library(car)

vif(model2) #All VIFs in model2 are below 10 (the largest is about 9.05, for `Airline_Jet Airways`), so there is no severe multicollinearity; three airline dummies do exceed 5.

## duration `Airline_Air India`
## 2.613822 6.032434
## Airline_IndiGo `Airline_Jet Airways`
## 6.169329 9.053966
## Airline_SpiceJet `Airline_Multiple carriers`
## 3.381050 4.762924
## Airline_Vistara Airline_GoAir
## 2.483909 1.612935
## Source_Mumbai Source_Kolkata
## 1.213804 1.426860
## `Total_Stops_non-stop` `Total_Stops_2 stops`
## 2.570692 1.528082
## `Total_Stops_3 stops` dep_time_slot_Morning
## 1.046786 1.076707
## dep_time_slot_Late_Night dep_day_03Mar
## 1.087591 1.688896
## dep_day_06Mar dep_day_09Mar
## 1.875114 1.654257
## dep_day_12Mar dep_day_15Mar
## 1.320522 1.355257
## dep_day_18Mar dep_day_21Mar
## 1.340625 1.886172
## dep_day_24Mar dep_day_27Mar
## 1.675693 1.660654
## dep_day_03Apr dep_day_06Apr
## 1.247922 1.222472
## dep_day_09Apr dep_day_12Apr
## 1.275012 1.154405
## dep_day_15Apr dep_day_18Apr
## 1.202756 1.164450
## dep_day_21Apr dep_day_24Apr
## 1.189270 1.210444
## dep_day_27Apr dep_day_01May
## 1.218276 1.585682
## dep_day_03May dep_day_06May
## 1.214600 1.592100
## dep_day_09May dep_day_12May
## 1.995892 1.547901
## dep_day_15May dep_day_18May
## 1.841163 2.033766
## dep_day_21May dep_day_24May
## 2.019663 1.602434
## dep_day_27May dep_day_01Jun
## 1.830399 1.742737
## dep_day_03Jun dep_day_06Jun
## 1.723765 2.033243
## dep_day_09Jun dep_day_12Jun
## 2.012736 2.016165
## dep_day_15Jun dep_day_18Jun
## 1.716306 1.250084
## dep_day_21Jun dep_day_24Jun
## 1.259025 1.766253
## dep_day_27Jun
## 1.776540
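
As a cross-check on what vif() reports, the variance inflation factor of a predictor can be reproduced by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The following is a minimal sketch in base R on toy data; the columns x1, x2 and x3 are hypothetical and are not part of the flight data.

```r
# Toy predictors (hypothetical names x1, x2, x3), not the flight data.
set.seed(7)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # built to be strongly correlated with x1
x3 <- rnorm(n)                        # independent of the other two
preds <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the remaining predictors.
vif_manual <- sapply(names(preds), function(v) {
  f  <- reformulate(setdiff(names(preds), v), response = v)
  r2 <- summary(lm(f, data = preds))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

In this toy setup the correlated pair x1 and x2 should show a clearly inflated VIF, while x3 should stay close to 1.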

model2$coefficients

## (Intercept) duration
## 11055.68805 -14.07963
## `Airline_Air India` Airline_IndiGo
## 1897.93423 363.11600
## `Airline_Jet Airways` Airline_SpiceJet
## 4439.61279 66.00766
## `Airline_Multiple carriers` Airline_Vistara
## 3351.80540 2150.00034
## Airline_GoAir Source_Mumbai
## -12.46773 -1803.48536
## Source_Kolkata `Total_Stops_non-stop`
## -266.44157 -3555.84209
## `Total_Stops_2 stops` `Total_Stops_3 stops`
## 2303.77949 3264.06732
## dep_time_slot_Morning dep_time_slot_Late_Night
## -163.29670 652.80944
## dep_day_03Mar dep_day_06Mar
## 155.58477 535.83117
## dep_day_09Mar dep_day_12Mar
## -1627.42863 -1959.76598
## dep_day_15Mar dep_day_18Mar
## -3679.77324 -2756.33985
## dep_day_21Mar dep_day_24Mar
## -5244.98551 -2640.55079
## dep_day_27Mar dep_day_03Apr
## -5532.47568 -4969.37514
## dep_day_06Apr dep_day_09Apr
## -5212.90185 -4864.39387
## dep_day_12Apr dep_day_15Apr
## -3377.13272 -4453.24886
## dep_day_18Apr dep_day_21Apr
## -3281.15398 -3850.11046
## dep_day_24Apr dep_day_27Apr
## -4475.86771 -4068.15568
## dep_day_01May dep_day_03May
## -3537.01511 -3665.52117
## dep_day_06May dep_day_09May
## -3547.77164 -3484.38234
## dep_day_12May dep_day_15May
## -3402.62319 -3639.64846
## dep_day_18May dep_day_21May
## -3278.98336 -3349.41947
## dep_day_24May dep_day_27May
## -3396.88592 -3546.01709
## dep_day_01Jun dep_day_03Jun
## -3617.75298 -3661.46089
## dep_day_06Jun dep_day_09Jun
## -3586.53887 -3535.24772
## dep_day_12Jun dep_day_15Jun
## -4062.38900 -4070.92460
## dep_day_18Jun dep_day_21Jun
## -3773.26938 -3583.45018
## dep_day_24Jun dep_day_27Jun
## -4229.43878 -4201.96394

#Evaluating Linear Regression model:


mape=function(y,yhat){
mean(abs((y-yhat)/y))
}

mape(train3$Price,model2$fitted.values)

## [1] 0.2251769

library(Metrics)

rmse(train3$Price,model2$fitted.values)

## [1] 2647.094

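#Testing using LM:

x=train3[,2:64]
y=train3[,1]
x_test=test3
#predict() for lm objects takes `newdata`. An argument named `data` is silently
#ignored, and the fitted values for the training rows are returned instead.
pred1=predict(model2,newdata = x_test)
pred1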

## 1 2 3 4 5 6
## 14224.45829 14196.29903 12672.02972 12629.79084 6031.31735 12559.39270
## 7 8 9 10 11 12
Continued**
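
The argument name passed to predict() matters for lm objects. The sketch below, on toy data with hypothetical names (train_toy, test_toy), shows that `data =` is ignored, so the result is one fitted value per training row, whereas `newdata =` returns one prediction per new row.

```r
# Toy data (hypothetical), only to show how predict() treats its arguments.
set.seed(1)
train_toy <- data.frame(x = 1:10)
train_toy$y <- 2 * train_toy$x + rnorm(10, sd = 0.1)
test_toy <- data.frame(x = c(20, 30, 40))

fit <- lm(y ~ x, data = train_toy)

wrong <- predict(fit, data = test_toy)     # 'data' is not an argument: ignored
right <- predict(fit, newdata = test_toy)  # predictions for the 3 new rows

length(wrong)  # 10, one fitted value per training row
length(right)  # 3, one prediction per test row
```

A quick check after any predict() call is to compare the length of the result with nrow() of the test set.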

#TOP MOST FEATURES IMPACTING THE FLIGHT PRICE (FROM LINEAR REGRESSION MODEL):
summary(model2)

##
## Call:
## lm(formula = Price ~ duration + `Airline_Air India` + Airline_IndiGo +
## `Airline_Jet Airways` + Airline_SpiceJet + `Airline_Multiple carriers` +
## Airline_Vistara + Airline_GoAir + Source_Mumbai + Source_Kolkata +
## `Total_Stops_non-stop` + `Total_Stops_2 stops` + `Total_Stops_3 stops` +
## dep_time_slot_Morning + dep_time_slot_Late_Night + dep_day_03Mar +
## dep_day_06Mar + dep_day_09Mar + dep_day_12Mar + dep_day_15Mar +
## dep_day_18Mar + dep_day_21Mar + dep_day_24Mar + dep_day_27Mar +
## dep_day_03Apr + dep_day_06Apr + dep_day_09Apr + dep_day_12Apr +
## dep_day_15Apr + dep_day_18Apr + dep_day_21Apr + dep_day_24Apr +
## dep_day_27Apr + dep_day_01May + dep_day_03May + dep_day_06May +
## dep_day_09May + dep_day_12May + dep_day_15May + dep_day_18May +
## dep_day_21May + dep_day_24May + dep_day_27May + dep_day_01Jun +
## dep_day_03Jun + dep_day_06Jun + dep_day_09Jun + dep_day_12Jun +
## dep_day_15Jun + dep_day_18Jun + dep_day_21Jun + dep_day_24Jun +
## dep_day_27Jun, data = train3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10774 -1384 -221 1256 42172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11055.688 211.901 52.174 < 2e-16 ***
## duration -14.080 4.917 -2.863 0.004200 **
## `Airline_Air India` 1897.934 170.431 11.136 < 2e-16 ***
## Airline_IndiGo 363.116 161.905 2.243 0.024932 *
## `Airline_Jet Airways` 4439.613 161.033 27.570 < 2e-16 ***
## Airline_SpiceJet 66.008 177.567 0.372 0.710098
## `Airline_Multiple carriers` 3351.805 177.742 18.858 < 2e-16 ***
## Airline_Vistara 2150.000 195.550 10.995 < 2e-16 ***
## Airline_GoAir -12.468 244.213 -0.051 0.959285
## Source_Mumbai -1803.485 114.709 -15.722 < 2e-16 ***
## Source_Kolkata -266.442 69.215 -3.849 0.000119 ***
## `Total_Stops_non-stop` -3555.842 87.839 -40.481 < 2e-16 ***
## `Total_Stops_2 stops` 2303.779 90.948 25.331 < 2e-16 ***
## `Total_Stops_3 stops` 3264.067 405.615 8.047 9.38e-16 ***
## dep_time_slot_Morning -163.297 55.761 -2.928 0.003413 **
## dep_time_slot_Late_Night 652.809 121.388 5.378 7.70e-08 ***
## dep_day_03Mar 155.585 197.867 0.786 0.431705
## dep_day_06Mar 535.831 185.209 2.893 0.003822 **
## dep_day_09Mar -1627.429 199.256 -8.168 3.51e-16 ***
## dep_day_12Mar -1959.766 257.640 -7.607 3.05e-14 ***
## dep_day_15Mar -3679.773 244.597 -15.044 < 2e-16 ***
## dep_day_18Mar -2756.340 247.837 -11.122 < 2e-16 ***
## dep_day_21Mar -5244.986 183.565 -28.573 < 2e-16 ***
## dep_day_24Mar -2640.551 194.111 -13.603 < 2e-16 ***
## dep_day_27Mar -5532.476 200.611 -27.578 < 2e-16 ***
## dep_day_03Apr -4969.375 284.134 -17.490 < 2e-16 ***
## dep_day_06Apr -5212.902 296.279 -17.595 < 2e-16 ***
## dep_day_09Apr -4864.394 269.610 -18.042 < 2e-16 ***
## dep_day_12Apr -3377.133 360.304 -9.373 < 2e-16 ***
## dep_day_15Apr -4453.249 309.804 -14.374 < 2e-16 ***
## dep_day_18Apr -3281.154 350.966 -9.349 < 2e-16 ***
## dep_day_21Apr -3850.110 320.836 -12.000 < 2e-16 ***
## dep_day_24Apr -4475.868 305.727 -14.640 < 2e-16 ***
## dep_day_27Apr -4068.156 303.462 -13.406 < 2e-16 ***
## dep_day_01May -3537.015 203.451 -17.385 < 2e-16 ***
## dep_day_03May -3665.521 309.606 -11.839 < 2e-16 ***
## dep_day_06May -3547.772 202.445 -17.525 < 2e-16 ***
## dep_day_09May -3484.382 174.426 -19.976 < 2e-16 ***
## dep_day_12May -3402.623 207.700 -16.382 < 2e-16 ***
## dep_day_15May -3639.648 182.433 -19.951 < 2e-16 ***
## dep_day_18May -3278.983 172.714 -18.985 < 2e-16 ***
## dep_day_21May -3349.419 173.262 -19.332 < 2e-16 ***
## dep_day_24May -3396.886 201.366 -16.869 < 2e-16 ***
## dep_day_27May -3546.017 187.085 -18.954 < 2e-16 ***
## dep_day_01Jun -3617.753 192.556 -18.788 < 2e-16 ***
## dep_day_03Jun -3661.461 193.991 -18.874 < 2e-16 ***
## dep_day_06Jun -3586.539 172.855 -20.749 < 2e-16 ***
## dep_day_09Jun -3535.248 173.297 -20.400 < 2e-16 ***
## dep_day_12Jun -4062.389 173.779 -23.377 < 2e-16 ***
## dep_day_15Jun -4070.925 194.994 -20.877 < 2e-16 ***
## dep_day_18Jun -3773.269 291.003 -12.966 < 2e-16 ***
## dep_day_21Jun -3583.450 286.687 -12.500 < 2e-16 ***
## dep_day_24Jun -4229.439 191.433 -22.094 < 2e-16 ***
## dep_day_27Jun -4201.964 190.942 -22.006 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2654 on 10604 degrees of freedom
## Multiple R-squared: 0.6472, Adjusted R-squared: 0.6454
## F-statistic: 367 on 53 and 10604 DF, p-value: < 2.2e-16
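
One way to read the top features off a summary table is to rank the terms by the absolute value of their t statistic. In the table above the largest values belong to `Total_Stops_non-stop` (-40.481), dep_day_21Mar (-28.573), dep_day_27Mar (-27.578), `Airline_Jet Airways` (27.570) and `Total_Stops_2 stops` (25.331). The sketch below shows the ranking step on toy data; the columns non_stop, morning and duration and their effects are made up for illustration, and ranking by |t| is only one possible notion of importance.

```r
# Toy data with hypothetical effects, not the flight data.
set.seed(42)
n <- 300
toy <- data.frame(
  non_stop = rbinom(n, 1, 0.4),
  morning  = rbinom(n, 1, 0.5),
  duration = runif(n, 1, 20)
)
toy$Price <- 9000 - 3500 * toy$non_stop - 150 * toy$morning +
  10 * toy$duration + rnorm(n, sd = 1000)

fit <- lm(Price ~ ., data = toy)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|).
ctab <- coef(summary(fit))
ctab <- ctab[rownames(ctab) != "(Intercept)", , drop = FALSE]

# Rank terms by absolute t value, largest first.
ranked <- ctab[order(abs(ctab[, "t value"]), decreasing = TRUE), ]
ranked
```

Applied to the real model, the same lines would use coef(summary(model2)) in place of the toy fit.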


#CROSS VALIDATION:
library(caret) #provides trainControl() and train()
set.seed(100)
# Define train control for k fold cross validation
ctrl<-trainControl(method='cv',number = 10)

model_cv<-train(Price ~ ., data = train3, method ='lm',trControl = ctrl,metric='Rsquared')

summary(model_cv)

##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10368 -1264 -119 1255 42354
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5618.028 228.843 24.550 < 2e-16 ***
## duration -15.237 4.424 -3.444 0.000576 ***
## `\\`Airline_Air India\\`` 1877.469 153.411 12.238 < 2e-16 ***
## Airline_GoAir 140.971 219.458 0.642 0.520654
## Airline_IndiGo 370.118 145.660 2.541 0.011069 *
## `\\`Airline_Jet Airways\\`` 4553.184 145.208 31.356 < 2e-16 ***
## `\\`Airline_Multiple carriers\\`` 3460.609 160.002 21.629 < 2e-16 ***
## Airline_SpiceJet 85.329 160.690 0.531 0.595417
## Airline_Vistara 2388.100 177.005 13.492 < 2e-16 ***
## Source_Chennai 125.282 139.297 0.899 0.368465
## Source_Delhi 247.660 91.537 2.706 0.006830 **
## Source_Kolkata 405.453 84.780 4.782 1.76e-06 ***
## Source_Mumbai -1605.191 108.144 -14.843 < 2e-16 ***
## Destination_Cochin NA NA NA NA
## Destination_Delhi NA NA NA NA
## Destination_Hyderabad NA NA NA NA
## Destination_Kolkata NA NA NA NA
## `\\`Total_Stops_2 stops\\`` 2368.622 82.500 28.711 < 2e-16 ***
## `\\`Total_Stops_3 stops\\`` 3291.665 363.980 9.044 < 2e-16 ***
## `\\`Total_Stops_4 stops\\`` -901.178 2387.767 -0.377 0.705873
## `\\`Total_Stops_non-stop\\`` -3382.937 91.935 -36.797 < 2e-16 ***
## dep_time_slot_Evening -79.450 64.014 -1.241 0.214589
## dep_time_slot_Late_Night 412.940 113.645 3.634 0.000281 ***
## dep_time_slot_Morning -218.890 59.102 -3.704 0.000214 ***
## dep_time_slot_Pre_Morning 31.341 121.576 0.258 0.796571
## dep_day_01Jun 1518.250 201.216 7.545 4.88e-14 ***
## dep_day_01Mar 11767.683 233.231 50.455 < 2e-16 ***
## dep_day_01May 1351.393 206.350 6.549 6.06e-11 ***
## dep_day_03Apr 127.696 274.203 0.466 0.641440
## dep_day_03Jun 1471.523 202.372 7.271 3.81e-13 ***
## dep_day_03Mar 5351.423 204.900 26.117 < 2e-16 ***
## dep_day_03May 1424.722 295.252 4.825 1.42e-06 ***
## dep_day_06Apr -134.897 284.372 -0.474 0.635249
## dep_day_06Jun 1419.850 184.259 7.706 1.42e-14 ***
## dep_day_06Mar 5735.347 195.159 29.388 < 2e-16 ***
## dep_day_06May 1344.926 205.575 6.542 6.34e-11 ***
## dep_day_09Apr 251.092 262.285 0.957 0.338425
## dep_day_09Jun 1463.604 184.464 7.934 2.33e-15 ***
## dep_day_09Mar 3568.481 205.953 17.327 < 2e-16 ***
## dep_day_09May 1523.929 185.543 8.213 2.40e-16 ***
## dep_day_12Apr 1677.626 338.481 4.956 7.29e-07 ***
## dep_day_12Jun 937.374 184.882 5.070 4.04e-07 ***
## dep_day_12Mar 3374.598 256.937 13.134 < 2e-16 ***
## dep_day_12May 1491.453 209.739 7.111 1.23e-12 ***
## dep_day_15Apr 632.694 295.758 2.139 0.032440 *
## dep_day_15Jun 1048.363 203.007 5.164 2.46e-07 ***
## dep_day_15Mar 1566.667 243.330 6.438 1.26e-10 ***
## dep_day_15May 1351.489 191.071 7.073 1.61e-12 ***
## dep_day_18Apr 1782.013 330.756 5.388 7.29e-08 ***
## dep_day_18Jun 1314.179 279.684 4.699 2.65e-06 ***
## dep_day_18Mar 2471.817 245.941 10.050 < 2e-16 ***
## dep_day_18May 1730.269 184.218 9.393 < 2e-16 ***
## dep_day_21Apr 1236.848 305.272 4.052 5.12e-05 ***
## dep_day_21Jun 1494.800 276.092 5.414 6.29e-08 ***
## dep_day_21Mar -65.375 193.860 -0.337 0.735953
## dep_day_21May 1658.359 184.534 8.987 < 2e-16 ***
## dep_day_24Apr 592.712 292.320 2.028 0.042624 *
## dep_day_24Jun 901.095 200.414 4.496 6.99e-06 ***
## dep_day_24Mar 2303.058 199.418 11.549 < 2e-16 ***
## dep_day_24May 1499.174 204.747 7.322 2.62e-13 ***
## dep_day_27Apr 1031.952 290.669 3.550 0.000387 ***
## dep_day_27Jun 929.576 199.994 4.648 3.39e-06 ***
## dep_day_27Mar -412.390 207.823 -1.984 0.047245 *
## dep_day_27May 1594.541 197.109 8.090 6.64e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2380 on 10598 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.7149
## F-statistic: 453.9 on 59 and 10598 DF, p-value: < 2.2e-16
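
The four Destination rows with NA estimates correspond to the note "4 not defined because of singularities". That happens when a column is an exact linear combination of other columns, which would be the case here if each Source always maps to the same Destination (an assumption, not verified in this output). A minimal sketch with toy data and hypothetical dummies src_b and dest_b reproduces the effect and shows how alias() reports it.

```r
# Toy data: 'dest_b' duplicates 'src_b', mimicking a Destination dummy that
# is fully determined by a Source dummy (hypothetical column names).
set.seed(1)
toy <- data.frame(src_b = rep(c(0, 1), each = 10))
toy$dest_b <- toy$src_b
toy$price  <- 5000 + 1200 * toy$src_b + rnorm(20, sd = 100)

fit <- lm(price ~ src_b + dest_b, data = toy)
coef(fit)            # dest_b is NA: it adds nothing beyond src_b

# Names of the terms that lm() could not estimate.
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased

# alias() expresses each aliased term through the terms that were kept.
alias(fit)$Complete
```

Dropping the aliased columns before fitting leaves the fitted values unchanged and removes the NA rows from the summary.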

pred2=predict(model_cv,x_test)

pred2
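
Note that summary(model_cv) describes the final model refit on all rows; the cross-validated metrics themselves are stored in model_cv$results. To make clear what 10-fold cross validation computes, here is a minimal base R sketch on toy data (the columns and effect sizes are hypothetical): each fold is held out once, the model is fit on the other nine, and the out-of-fold RMSE values are averaged.

```r
# Toy data (hypothetical), standing in for the real training set.
set.seed(100)
n <- 200
toy <- data.frame(duration = runif(n, 1, 20), non_stop = rbinom(n, 1, 0.4))
toy$Price <- 9000 - 3500 * toy$non_stop + 10 * toy$duration +
  rnorm(n, sd = 1000)

# Assign every row to one of k folds at random, with equal fold sizes.
k <- 10
fold <- sample(rep(seq_len(k), length.out = n))

# Fit on k-1 folds, score on the held-out fold, repeat for every fold.
fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(Price ~ ., data = toy[fold != i, ])
  yhat <- predict(fit, newdata = toy[fold == i, ])
  sqrt(mean((toy$Price[fold == i] - yhat)^2))
})

mean(fold_rmse)  # cross-validated RMSE estimate
```

Unlike the RMSE computed on fitted values, this estimate only uses predictions for rows the model has not seen during fitting.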

#REGULARISATION:
library(glmnet)

## Loading required package: Matrix

##
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack

## Loaded glmnet 3.0-2

library(ISLR)
library(dplyr)
library(tidyr)
library(Metrics)

set.seed(100)
train_x=as.matrix(train3[,2:64])
train_y=as.matrix(train3[,1])

test_x=as.matrix(test3)

custom=trainControl(method='repeatedcv',number=10,repeats=5,verboseIter=TRUE)

#RIDGE REGRESSION:
ridge=train(Price~.,train3,method='glmnet',
tuneGrid=expand.grid(alpha=0,lambda=seq(0.001,1,length=5)),
trControl=custom)

## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0, lambda = 1 on full training set

plot(ridge)
ridge

## glmnet
##
## 10658 samples
## 63 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 9592, 9592, 9592, 9592, 9592, 9594, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00100 2409.696 0.7088255 1722.293
## 0.25075 2409.696 0.7088255 1722.293
## 0.50050 2409.696 0.7088255 1722.293
## 0.75025 2409.696 0.7088255 1722.293
## 1.00000 2409.696 0.7088255 1722.293
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
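
All five lambda values above give the same RMSE, which suggests the grid from 0.001 to 1 is too small, relative to the scale of the data, to change the fit. The sketch below illustrates the idea with a closed-form ridge estimate on toy data. It is an assumption-laden illustration: ridge_coef is a hypothetical helper, and glmnet scales its penalty differently, so the lambda values are not directly comparable with the ones in the table.

```r
# Toy standardized predictors and a centred response (hypothetical data).
set.seed(100)
n <- 200
X <- scale(matrix(rnorm(n * 3), n, 3))
y <- drop(X %*% c(3000, -1500, 500) + rnorm(n, sd = 1000))
y <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y.
# Hypothetical helper; not glmnet's internal parametrization.
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

lambdas <- c(0.001, 1, 100, 10000)
coefs <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(coefs) <- lambdas
round(coefs, 1)  # columns for 0.001 and 1 are almost identical

# A log-spaced grid covers several orders of magnitude and could be
# passed as tuneGrid in place of the narrow linear grid.
wide_grid <- expand.grid(alpha = 0, lambda = 10^seq(-3, 4, length.out = 20))
```

In this toy setup the coefficients barely move until lambda becomes comparable to the diagonal of X'X (about 199 here), which is why a log-spaced grid is usually a safer starting point.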

plot(ridge$finalModel,xvar = 'lambda',label=T)
plot(ridge$finalModel,xvar = 'dev',label=T)

plot(varImp(ridge,scale = T))
pred3=predict(ridge,test_x)
pred3

#LASSO REGRESSION:
set.seed(100)

lasso=train(Price~.,train3,method='glmnet',
tuneGrid=expand.grid(alpha=1,lambda=seq(0.001,0.5,length=5)),
trControl=custom)

## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0.5 on full training set

plot(lasso)
lasso

## glmnet
##
## 10658 samples
## 63 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 9592, 9592, 9592, 9592, 9592, 9594, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.00100 2380.943 0.7148115 1705.007
## 0.12575 2380.943 0.7148115 1705.007
## 0.25050 2380.943 0.7148115 1705.007
## 0.37525 2380.943 0.7148115 1705.007
## 0.50000 2380.939 0.7148124 1704.999
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.5.

plot(lasso$finalModel,xvar = 'lambda',label=T)
plot(lasso$finalModel,xvar = 'dev',label=T)

plot(varImp(lasso,scale = T))
pred4=predict(lasso,test_x)
pred4

#ELASTIC NET REGRESSION:


set.seed(100)
elastic=train(Price~.,train3,method='glmnet',
tuneGrid=expand.grid(alpha=seq(0,1,length=10),lambda=seq(0.0001,1,length=5)),
trControl=custom)

## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 0.111, lambda = 1 on full training set

plot(elastic)
elastic

## glmnet
##
## 10658 samples
## 63 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 9592, 9592, 9592, 9592, 9592, 9594, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.0000000 0.000100 2409.696 0.7088255 1722.293
## 0.0000000 0.250075 2409.696 0.7088255 1722.293
## 0.0000000 0.500050 2409.696 0.7088255 1722.293
## 0.0000000 0.750025 2409.696 0.7088255 1722.293
## 0.0000000 1.000000 2409.696 0.7088255 1722.293
## 0.1111111 0.000100 2380.899 0.7148185 1704.944
## 0.1111111 0.250075 2380.899 0.7148185 1704.944
## 0.1111111 0.500050 2380.899 0.7148185 1704.944
## 0.1111111 0.750025 2380.899 0.7148185 1704.944
## 0.1111111 1.000000 2380.899 0.7148185 1704.944
## 0.2222222 0.000100 2381.000 0.7147973 1704.937
## 0.2222222 0.250075 2381.000 0.7147973 1704.937
## 0.2222222 0.500050 2381.000 0.7147973 1704.937
## 0.2222222 0.750025 2381.000 0.7147973 1704.937
## 0.2222222 1.000000 2381.000 0.7147973 1704.937
## 0.3333333 0.000100 2381.008 0.7147943 1704.979
## 0.3333333 0.250075 2381.008 0.7147943 1704.979
## 0.3333333 0.500050 2381.008 0.7147943 1704.979
## 0.3333333 0.750025 2381.008 0.7147943 1704.979
## 0.3333333 1.000000 2381.008 0.7147943 1704.979
## 0.4444444 0.000100 2380.950 0.7148072 1704.945
## 0.4444444 0.250075 2380.950 0.7148072 1704.945
## 0.4444444 0.500050 2380.950 0.7148072 1704.945
## 0.4444444 0.750025 2380.950 0.7148072 1704.945
## 0.4444444 1.000000 2380.947 0.7148079 1704.941
## 0.5555556 0.000100 2380.919 0.7148163 1704.966
## 0.5555556 0.250075 2380.919 0.7148163 1704.966
## 0.5555556 0.500050 2380.919 0.7148163 1704.966
## 0.5555556 0.750025 2380.919 0.7148163 1704.966
## 0.5555556 1.000000 2380.939 0.7148113 1704.926
## 0.6666667 0.000100 2380.963 0.7148051 1705.020
## 0.6666667 0.250075 2380.963 0.7148051 1705.020
## 0.6666667 0.500050 2380.963 0.7148051 1705.020
## 0.6666667 0.750025 2380.957 0.7148067 1704.998
## 0.6666667 1.000000 2380.982 0.7148006 1704.884
## 0.7777778 0.000100 2380.988 0.7148001 1705.015
## 0.7777778 0.250075 2380.988 0.7148001 1705.015
## 0.7777778 0.500050 2380.988 0.7148001 1705.015
## 0.7777778 0.750025 2380.991 0.7147989 1704.946
## 0.7777778 1.000000 2381.026 0.7147896 1704.821
## 0.8888889 0.000100 2380.919 0.7148164 1704.999
## 0.8888889 0.250075 2380.919 0.7148164 1704.999
## 0.8888889 0.500050 2380.919 0.7148164 1704.999
## 0.8888889 0.750025 2380.952 0.7148079 1704.893
## 0.8888889 1.000000 2381.058 0.7147819 1704.767
## 1.0000000 0.000100 2380.943 0.7148115 1705.007
## 1.0000000 0.250075 2380.943 0.7148115 1705.007
## 1.0000000 0.500050 2380.939 0.7148124 1704.999
## 1.0000000 0.750025 2380.990 0.7147990 1704.851
## 1.0000000 1.000000 2381.117 0.7147674 1704.701
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1111111 and lambda = 1.

plot(elastic$finalModel,xvar = 'lambda',label=T)
plot(elastic$finalModel,xvar = 'dev',label=T)

plot(varImp(elastic,scale = T))
pred5=predict(elastic,test_x)
pred5

#Comparison of Regularisation techniques:


model_list=list(Ridge=ridge,Lasso=lasso,Elasticnet=elastic)
result=resamples(model_list)
summary(result)

##
## Call:
## summary.resamples(object = result)
##
## Models: Ridge, Lasso, Elasticnet
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Ridge 1624.772 1692.027 1726.927 1722.293 1752.307 1854.102 0
## Lasso 1606.835 1668.504 1705.873 1704.999 1737.276 1823.627 0
## Elasticnet 1607.009 1668.361 1705.925 1704.944 1737.141 1823.500 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Ridge 2139.305 2274.853 2348.793 2409.696 2567.539 2859.873 0
## Lasso 2109.680 2242.681 2319.599 2380.939 2526.477 2823.308 0
## Elasticnet 2109.715 2243.130 2319.699 2380.899 2526.683 2823.606 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Ridge 0.6486506 0.6912938 0.7129501 0.7088255 0.7271613 0.7570388 0
## Lasso 0.6582928 0.6920968 0.7219713 0.7148124 0.7335355 0.7636205 0
## Elasticnet 0.6579882 0.6921236 0.7219617 0.7148185 0.7334666 0.7636060 0

bwplot(result)

#Best model: Finalising the elastic net model, which has the lowest mean RMSE in the resampling summary (2380.899, against 2380.939 for lasso and 2409.696 for ridge).


elastic$bestTune

## alpha lambda
## 10 0.1111111 1

best=elastic$finalModel
coef(best,s=elastic$bestTune$lambda) #Coefficients from elastic net model.

## 64 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 6082.7662773
## duration -14.6227482
## `Airline_Air India` 1784.5109657
## Airline_GoAir 50.2846584
## Airline_IndiGo 284.4922025
## `Airline_Jet Airways` 4461.0194549
## `Airline_Multiple carriers` 3372.3569210
## Airline_SpiceJet 0.7277197
## Airline_Vistara 2292.8732485
## Source_Chennai -33.2679606
## Source_Delhi 30.8265216
## Source_Kolkata 192.3814137
## Source_Mumbai -915.5784353
## Destination_Cochin 22.2029862
## Destination_Delhi -196.2843698
## Destination_Hyderabad -883.5414110
## Destination_Kolkata -33.6616396
## `Total_Stops_2 stops` 2367.2896438
## `Total_Stops_3 stops` 3277.5967959
## `Total_Stops_4 stops` -864.4283169
## `Total_Stops_non-stop` -3378.2086253
## dep_time_slot_Evening -78.4496133
## dep_time_slot_Late_Night 404.5719151
## dep_time_slot_Morning -217.7848170
## dep_time_slot_Pre_Morning 25.3491013
## dep_day_01Jun 1323.4532859
## dep_day_01Mar 11568.9827586
## dep_day_01May 1166.9689651
## dep_day_03Apr -59.9696944
## dep_day_03Jun 1276.3632602
## dep_day_03Mar 5154.4813018
## dep_day_03May 1228.9349731
## dep_day_06Apr -320.4806783
## dep_day_06Jun 1230.7107164
## dep_day_06Mar 5539.4716476
## dep_day_06May 1161.8855969
## dep_day_09Apr 58.3374273
## dep_day_09Jun 1275.4566223
## dep_day_09Mar 3372.7268192
## dep_day_09May 1335.6115054
## dep_day_12Apr 1482.7007413
## dep_day_12Jun 749.9603257
## dep_day_12Mar 3182.0260132
## dep_day_12May 1307.4208109
## dep_day_15Apr 439.9495854
## dep_day_15Jun 854.6155435
## dep_day_15Mar 1375.7315131
## dep_day_15May 1163.9834251
## dep_day_18Apr 1586.0516922
## dep_day_18Jun 1120.2569813
## dep_day_18Mar 2280.2644679
## dep_day_18May 1542.6652543
## dep_day_21Apr 1042.7466924
## dep_day_21Jun 1300.7780243
## dep_day_21Mar -253.4610759
## dep_day_21May 1471.4333588
## dep_day_24Apr 400.6554778
## dep_day_24Jun 709.1695213
## dep_day_24Mar 2118.6258248
## dep_day_24May 1317.4671974
## dep_day_27Apr 836.3175362
## dep_day_27Jun 738.6326767
## dep_day_27Mar -600.2306698
## dep_day_27May 1402.8517401
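
The coefficients above can be turned into a price by adding the intercept to the coefficient of every dummy that is switched on, plus the duration coefficient times the duration value. The worked example below is a sketch for a hypothetical flight: Jet Airways, non-stop, morning departure on 01 June, with a duration value of 3 in whatever unit the duration column uses, and all source and destination dummies left at 0 (an assumption about the baseline categories).

```r
# Coefficients copied from the elastic net output above.
b <- c(intercept = 6082.7662773,
       duration  = -14.6227482,
       jet       = 4461.0194549,   # `Airline_Jet Airways`
       non_stop  = -3378.2086253,  # `Total_Stops_non-stop`
       morning   = -217.7848170,   # dep_time_slot_Morning
       jun01     = 1323.4532859)   # dep_day_01Jun

# Hypothetical flight: duration value 3 (in the units of the duration column),
# the listed dummies equal to 1, every other dummy equal to 0.
x <- c(intercept = 1, duration = 3, jet = 1, non_stop = 1, morning = 1,
       jun01 = 1)

# Linear predictor: sum of coefficient times feature value.
pred_price <- sum(b * x[names(b)])
pred_price
```

For real rows, predict(elastic, newdata) performs the same sum over all 63 predictors.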
