Predicting Airline Passengers Satisfaction
ASAAJU BABATUNDE
Title 1
EXECUTIVE SUMMARY
Customer satisfaction is an important part of staying ahead in today's
competitive business environment. Social media in particular has done much
to publicize infamous incidents of poor customer service, and now more than
ever, the perception that an airline does not care about the satisfaction of
its customers can be very damaging.
Statement of the Problem
The airline industry, like every other business, faces opportunities
(high traffic due to increased demand) as well as challenges from
competition. All players in the airline industry therefore need to
innovate continuously in service quality and technology, for the safety
of passengers and for profitability.
However, survey feedback from 90,917 respondents on customer satisfaction
(CSAT) shows 55% satisfied and 45% neutral or dissatisfied customers in
this case. The survey contains 22 variables (parameters) that may
contribute to the customer satisfaction outcome.
[Figure: Satisfaction Rate. Bar chart of respondent counts (axis from 0 to 100,000) for "Neutral or Dissatisfied" and "Satisfied".]
Many e-commerce businesses might feel pleased if their CSAT rating is
over 70%, but the average global customer satisfaction benchmark across
all industries is estimated at 86%.
Target (Response): Satisfaction
Features:
  Gender                          Online support
  Customer Type                   Ease of Online booking
  Age                             On-board service
  Type of Travel                  Leg room service
  Class                           Baggage handling
  Seat comfort                    Check in service
  Food and Drink                  Cleanliness
  Inflight wifi service           Online boarding
  Inflight entertainment          Arrival Delay-time
  Departure/Arrival Convenient    Departure Delay-time
  Flight Distance                 Gate Location
Date
Objective
The aim of this project is to determine the relative importance of each
parameter's contribution to passenger satisfaction, using both analytical
techniques and a predictive algorithm.
The specific objectives are:
a. To understand which variables play an important role in swaying
passenger feedback towards a very high percentage of 'satisfied'.
b. To determine the relative importance of each survey feedback factor
with regard to passenger satisfaction.
Methodology
The marketing survey outcome is treated as raw data and analysed to
discover information useful for business decisions on improving passenger
satisfaction. The process involves cleaning, evaluating, transforming and
modeling the survey data, carefully examining each parameter (feature) of
the data collected.
R is used for statistical computing, visualization, data preparation
(diagnostics) and predictive analysis. After different models were tried
on a train/test split of the dataset, the random forest model (a
supervised machine learning algorithm) appears to be the most suitable,
with the best accuracy and AUROC.
Insights
The dataset contains 42,075 missing values. That is about 46% when set
against the 90,917 observations, but only about 2% of all cells
(90,917 rows by 23 columns); 53,634 rows have no missing value at all.
Four variables (Departure Delay-time, Flight Distance, Arrival
Delay-time and Seat Comfort) are positively skewed, with the mean to the
right of the median, which indicates the presence and direction of
outliers in their distributions.
A positive correlation is observed between Departure Delay-time and
Arrival Delay-time, which implies a relationship between the two
variables.
The p-values and coefficients in the regression analysis of the dataset
indicate that all the independent variables (parameters) are statistically
significant predictors of the target variable (Satisfaction).
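The missing-value shares quoted above depend on the denominator. As a minimal sketch, the arithmetic can be reproduced from the counts printed in the appendix output (the variable names below are illustrative, not taken from the original scripts):

```r
# Counts quoted in this report's appendix output
n_rows     <- 90917   # observations
n_cols     <- 23      # columns at the point where NAs were counted
n_missing  <- 42075   # result of sum(is.na(aviation_data))
n_complete <- 53634   # rows remaining in the complete-case data

missing_per_row_ratio <- n_missing / n_rows             # missing cells per observation
missing_cell_share    <- n_missing / (n_rows * n_cols)  # share of all cells
incomplete_row_share  <- 1 - n_complete / n_rows        # rows with at least one NA

round(100 * c(per_row_ratio   = missing_per_row_ratio,
              cell_share      = missing_cell_share,
              incomplete_rows = incomplete_row_share), 1)
```

The first figure is a ratio of missing cells to observations rather than a share of the data, which is why it looks so much larger than the per-cell share.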
Recommendations
1. The quantitative survey method used should be made simpler and more
flexible, to avoid recording large numbers of missing values in future.
2. It is imperative for the airline management to understand which
features or parameters play an important role in swaying passengers'
feedback positively.
3. The airline management needs to innovate in both procedural and
technological services, since almost all the features are statistically
significant.
4. Improvement efforts should emphasize services that can be digitalized
for ease of use and accessibility, such as Check-In Service, Online
Support, Online Booking and Online Boarding, as indicated by the
variable importance plot.
CONTENTS
DEPLOYMENT
APPENDIX
ANALYTICAL APPROACH
With the business problem (customer satisfaction) clearly identified, an
analytical approach is needed to determine the data requirements, because
the methods of analysis to be used require specific content, formats and
data representations based on domain knowledge.
Step 1: Data Understanding
Descriptive statistics and visualization techniques (EDA): The data was
collected in two separate Excel spreadsheets (flight data and survey
data), saved in CSV format for wider compatibility, and merged using the
data dictionary provided together with a Likert chart.
The dataset has 90,917 observations and 24 variables, which must have been
collected over a period of time (supposedly at least one month, assuming
Boeing 747 capacity at 5 trips per day). The dataset contains 42,075
missing values (NAs), and possible outliers were observed.
Univariate analysis of the variables is used to find patterns within the
data in terms of mean, median, mode and variance.
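The univariate measures named above can be sketched in base R as follows. R has no built-in function for the statistical mode, so `stat_mode` is a hypothetical helper written for this illustration, and the ratings are made-up values on the survey's 0 to 5 scale:

```r
# Hypothetical helper: most frequent non-missing value
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Made-up ratings with one missing answer
ratings <- c(3, 4, 4, 5, 2, 4, NA, 1, 3, 4)

univariate <- c(
  mean     = mean(ratings, na.rm = TRUE),
  median   = median(ratings, na.rm = TRUE),
  mode     = as.numeric(stat_mode(ratings)),
  variance = var(ratings, na.rm = TRUE)
)
univariate
```

Comparing the mean with the median is also a quick skewness check: a mean well above the median suggests a long right tail.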
DEPLOYMENT
In conclusion, the variable importance plot, a key output of the best
algorithm in this case (Random Forest), is used to show the top 10
variables ranked by their Mean Decrease Accuracy values.
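Mean Decrease Accuracy is a permutation-based measure: a feature is important if shuffling its values makes the model noticeably less accurate. The sketch below illustrates that idea only. It uses a logistic regression on synthetic data and a single held-out set, whereas the randomForest package computes the measure per tree on out-of-bag samples; all names and data here are hypothetical.

```r
set.seed(42)

# Synthetic data: x1 drives the outcome, x2 is pure noise
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- as.integer(x1 + rnorm(n, sd = 0.5) > 0)
toy <- data.frame(y = y, x1 = x1, x2 = x2)

train <- toy[1:1500, ]
test  <- toy[1501:2000, ]

fit <- glm(y ~ x1 + x2, family = binomial, data = train)

# Share of correct 0/1 predictions at a 0.5 cut-off
accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(pred == data$y)
}

baseline <- accuracy(fit, test)

# Decrease in accuracy after shuffling one feature at a time
perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  baseline - accuracy(fit, shuffled)
})
perm_importance
```

Shuffling the informative feature should cost far more accuracy than shuffling the noise feature, which is the ordering a variable importance plot displays.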
APPENDIX
Load Packages
library(ggplot2) # For graphs and visualisations
library (corrplot)
library(caTools) # Split Data into Test and Train Set
library(rpart)
library(rattle) # To visualise decision tree
Import Data
aviation_data = read.csv("Marketing Project-Aviation Data.csv", sep = ",", header = TRUE)
Sanity Checks
dim(aviation_data)
## [1] 90917 24
colnames(aviation_data)
Descriptive Statistics
summary(aviation_data)
## NA's :284
str(aviation_data)
s travel",..: 3 3 3 3 1 3 3 3 3 3 ...
## $ Class : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Flight_Distance : int 265 2138 623 354 1894 227 1812 1556 104 3633 ...
## $ DepartureDelayin_Mins : int 0 0 0 0 0 17 0 30 47 0 ...
## $ ArrivalDelayin_Mins : int 0 0 0 0 0 15 0 26 48 0 ...
## $ Satisfaction : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Seat_comfort : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure.Arrival.time_convenient: int 0 0 NA 0 0 0 0 NA 0 0 ...
## $ Food_drink : int 0 0 0 0 0 NA NA 0 0 0 ...
## $ Gate_location : int 2 3 3 3 3 3 3 3 3 4 ...
## $ Inflightwifi_service : int 2 2 3 4 2 2 2 2 3 2 ...
## $ Inflight_entertainment : int 4 0 4 3 0 5 0 0 3 0 ...
## $ Online_support : int 2 2 3 4 2 5 2 2 3 2 ...
## $ Ease_of_Onlinebooking : int 3 2 1 2 2 5 2 2 3 2 ...
## $ Onboard_service : int 3 NA 1 2 5 5 3 2 3 3 ...
## $ Leg_room_service : int 0 3 0 0 4 0 3 4 0 2 ...
## $ Baggage_handling : int 3 4 1 2 5 5 4 5 1 5 ...
## $ Checkin_service : int 5 4 4 4 5 5 5 3 2 2 ...
## $ Cleanliness : int 3 4 1 2 4 5 4 4 3 5 ...
## $ Online_boarding : int 2 2 3 5 2 3 2 2 5 2 ...
colnames(aviation_data) = new_vars
names(aviation_data)
summary(aviation_data)
## NA's :7179
Univariate Analysis
Age
par(mfrow = c(1,2))
hist(aviation_data$Age, col = "pink", main = "Age")
boxplot(aviation_data$Age, horizontal = T, col = "pink", main = "Age")
boxplot(aviation_data$Departure_ArrivalTime_Covenient, horizontal = T,
col = "green", main = "Departure-ArrivalTime Convenient")
boxplot(aviation_data$Food_Drink, horizontal = T, col = "yellow", main
= "Food-Drink")
plot(aviation_data$Class)
plot(aviation_data$TypeTravel)
Satisfaction
plot(aviation_data$Satisfaction)
Bivariate Analysis
corx = cor(aviation_data[,c(4,7:9, 11:24)], use="pairwise.complete.obs")
corrplot(corx)
xtabs(~Satisfaction+CustomerType, aviation_data)
## CustomerType
## Satisfaction Neutral Disloyal Customer Loyal Customer
## neutral or dissatisfied 4225 11351 25580
## satisfied 4874 3570 41317
unlink("C:/Users/OLIVIA/Documents/R/win-library/3.6/00LOCK", recursive
= TRUE)
Load Packages
library(ggplot2)
library(backports)
library (corrplot)
library(caTools)
library(rpart)
library(rattle)
library(dummies)
library(mice)
library(gam)
library(caret)
library(psych)
Import Data
aviation_data = read.csv("Marketing Project-Aviation Data.csv", header
= TRUE, na.strings = c(""))
Sanity Checks
dim(aviation_data)
## [1] 90917 24
colnames(aviation_data)
Descriptive Statistics
str(aviation_data)
aviation_data$Gender = as.factor(as.character(aviation_data$Gender))
aviation_data$CustomerType = as.factor(as.character(aviation_data$CustomerType))
aviation_data$TypeTravel = as.factor(as.character(aviation_data$TypeTravel))
aviation_data$Class = as.factor(as.character(aviation_data$Class))
aviation_data$Satisfaction = as.factor(as.character(aviation_data$Satisfaction))
summary(aviation_data)
## Mean : 2.839
## Max. :10.000
##
Data Pre-processing
colnames(aviation_data) = new_vars
names(aviation_data)
summary(aviation_data)
DelayTime
## Business travel:56481   Business:43535   Min.   :  50   Min.   :   0.00
## Personal Travel:25348   Eco     :40758   1st Qu.:1360   1st Qu.:   0.00
## NA's           : 9088   Eco Plus: 6624   Median :1927   Median :   0.00
##                                          Mean   :1982   Mean   :  14.69
##                                          3rd Qu.:2542   3rd Qu.:  12.00
##                                          Max.   :6950   Max.   :1592.00
##
## NA's :7179
## [1] 90917 23
## [1] TRUE
sum(is.na(aviation_data))
## [1] 42075
colSums(is.na(aviation_data))
## Gender CustomerType
## 0 9099
## Age TypeTravel
## 0 9088
## Class Flight_Distance
## 0 0
## Departure_DelayTime Arrival_DelayTime
## 0 284
## Satisfaction Seat_Comfort
## 0 0
## Departure_ArrivalTime_Covenient Food_Drink
## 8244 8181
## Gate_Location Inflight_WifiService
## 0 0
## Inflight_Entertainment Online_Support
## 0 0
## Ease_Of_OnlineBooking Onboard_Service
## 0 7179
## LegRoom_Service Baggage_Handling
## 0 0
## CheckIn_Service Cleanliness
## 0 0
## Online_Boarding
## 0
p = function(x){sum(is.na(x))/length(x)*100}
apply(aviation_data, 2, p)
## Gender CustomerType
## 0.0000000 10.0080293
## Age TypeTravel
## 0.0000000 9.9959304
## Class Flight_Distance
## 0.0000000 0.0000000
## Departure_DelayTime Arrival_DelayTime
## 0.0000000 0.3123728
## Satisfaction Seat_Comfort
## 0.0000000 0.0000000
## Departure_ArrivalTime_Covenient Food_Drink
## 9.0676111 8.9983171
## Gate_Location Inflight_WifiService
## 0.0000000 0.0000000
## Inflight_Entertainment Online_Support
## 0.0000000 0.0000000
## Ease_Of_OnlineBooking Onboard_Service
## 0.0000000 7.8962130
## LegRoom_Service Baggage_Handling
## 0.0000000 0.0000000
## CheckIn_Service Cleanliness
## 0.0000000 0.0000000
## Online_Boarding
## 0.0000000
## [1] 53634 23
any(is.na(aviation_clean_data))
## [1] FALSE
summary(aviation_clean_data)
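The step that creates `aviation_clean_data` is not shown in this listing. A minimal sketch of one common way to get per-column missing percentages and a complete-case subset in base R, using a small made-up data frame in place of the survey data, might look like this (the use of `complete.cases` is an assumption, not necessarily the call used in the original script):

```r
# Small made-up frame standing in for the survey data
toy <- data.frame(
  CustomerType    = c("Loyal Customer", NA, "disloyal Customer", "Loyal Customer"),
  Food_Drink      = c(3, NA, 4, 2),
  Onboard_Service = c(5, 4, NA, 3),
  stringsAsFactors = TRUE
)

# Percentage of missing entries per column, computed on the logical NA mask
pct_missing <- colMeans(is.na(toy)) * 100

# Listwise deletion: keep only rows with no missing entry
toy_clean <- toy[complete.cases(toy), ]

pct_missing
dim(toy_clean)
```

Listwise deletion is simple but discards every row with even one gap, which is why the row count drops sharply in the output above.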
aviation_data2 = aviation_data
aviation_data2$Onboard_Service[which(is.na(aviation_data2$Onboard_Service))] = mean(aviation_data2$Onboard_Service, na.rm = TRUE)
aviation_data2$Departure_ArrivalTime_Covenient[which(is.na(aviation_data2$Departure_ArrivalTime_Covenient))] = mean(aviation_data2$Departure_ArrivalTime_Covenient, na.rm = TRUE)
aviation_data2$Food_Drink[which(is.na(aviation_data2$Food_Drink))] = mean(aviation_data2$Food_Drink, na.rm = TRUE)
aviation_data2$Arrival_DelayTime[which(is.na(aviation_data2$Arrival_DelayTime))] = mean(aviation_data2$Arrival_DelayTime, na.rm = TRUE)
summary(aviation_data2)
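The four mean-imputation statements above follow one pattern, so they could be expressed as a loop over column names. The sketch below shows that pattern on made-up values; `impute_mean` is a hypothetical helper, not a function from the original script:

```r
# Hypothetical helper: replace NAs in the named numeric columns by the column mean
impute_mean <- function(data, cols) {
  for (col in cols) {
    missing <- is.na(data[[col]])
    data[[col]][missing] <- mean(data[[col]], na.rm = TRUE)
  }
  data
}

# Made-up values for two of the imputed columns
toy <- data.frame(
  Onboard_Service   = c(3, NA, 1, 2, 5),
  Arrival_DelayTime = c(0, 15, NA, 26, 48)
)

toy_imputed <- impute_mean(toy, c("Onboard_Service", "Arrival_DelayTime"))
toy_imputed
```

Mean imputation turns integer ratings into fractional values, which is consistent with entries such as 3.47 appearing in the transformed data later in this appendix.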
##
## iter imp variable
## 1 1 CustomerType TypeTravel
## 1 2 CustomerType TypeTravel
## 1 3 CustomerType TypeTravel
## 2 1 CustomerType TypeTravel
## 2 2 CustomerType TypeTravel
## 2 3 CustomerType TypeTravel
## 3 1 CustomerType TypeTravel
## 3 2 CustomerType TypeTravel
## 3 3 CustomerType TypeTravel
## 4 1 CustomerType TypeTravel
## 4 2 CustomerType TypeTravel
## 4 3 CustomerType TypeTravel
any(is.na(aviation_imputed_data))
## [1] FALSE
summary(aviation_imputed_data)
##
## Paired t-test
##
## data: aviation_clean_data$Onboard_Service and aviation_clean_data$Departure_ArrivalTime_Covenient
## t = 57.458, df = 53633, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4608110 0.4933598
## sample estimates:
## mean of the differences
## 0.4770854
t.test(aviation_imputed_data$Onboard_Service, aviation_imputed_data$Departure_ArrivalTime_Covenient, paired = T)
##
## Paired t-test
##
## data: aviation_imputed_data$Onboard_Service and aviation_imputed_data$Departure_ArrivalTime_Covenient
## t = 77.317, df = 90916, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4612551 0.4852492
## sample estimates:
## mean of the differences
## 0.4732521
Outliers Treatment
boxplot(aviation_imputed_data$Departure_DelayTime)
quantile(aviation_imputed_data$Departure_DelayTime)
#Flight Distance
boxplot(aviation_imputed_data$Flight_Distance)
quantile(aviation_imputed_data$Flight_Distance)
quantile(aviation_imputed_data$Arrival_DelayTime)
Seat Comfort
boxplot(aviation_imputed_data$Seat_Comfort)
quantile(aviation_imputed_data$Seat_Comfort)
airline_data$Departure_DelayTime = ifelse(airline_data$Departure_DelayTime > 12, 12, airline_data$Departure_DelayTime)
boxplot(airline_data$Departure_DelayTime)
boxplot(airline_data$Flight_Distance)
airline_data$Arrival_DelayTime = ifelse(airline_data$Arrival_DelayTime > 15, 15, airline_data$Arrival_DelayTime)
boxplot(airline_data$Arrival_DelayTime)
boxplot(airline_data$Seat_Comfort)
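The delay columns above are capped at fixed values (12 and 15 minutes). The value 12 matches the third quartile of the departure delay in the summary output; that the thresholds were read off the quantiles is an assumption. A minimal sketch that derives the cap from a quantile rather than hard-coding it, with `cap_upper` as a hypothetical helper and made-up delays:

```r
# Hypothetical helper: cap values above a chosen quantile (upper-tail winsorizing)
cap_upper <- function(x, prob = 0.75) {
  limit <- quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
  pmin(x, limit)
}

# Made-up delay values in minutes with a long right tail
delay <- c(0, 0, 0, 0, 0, 17, 0, 30, 47, 0, 5, 240)

capped <- cap_upper(delay)
range(capped)
```

Capping at the third quartile is aggressive, since it flattens a quarter of the observations to a single value; a higher quantile or an IQR-based fence would keep more of the variation.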
## [1] 90917 29
str(airline_data_trans)
.
## $ Age                                 : num 65 15 60 70 30 66 10 22 58 34 ...
## $ TypeTravel.Business.travel          : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TypeTravel.Personal.Travel          : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class.Business                      : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Class.Eco                           : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class.Eco.Plus                      : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Flight_Distance                     : num 265 2138 623 354 1894 ...
## $ Departure_DelayTime                 : num 0 0 0 0 0 12 0 12 12 0 ...
## $ Arrival_DelayTime                   : num 0 0 0 0 0 15 0 15 15 0 ...
## $ Satisfaction.neutral.or.dissatisfied: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Satisfaction.satisfied              : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Seat_Comfort                        : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure_ArrivalTime_Covenient     : num 0 0 2.99 0 0 ...
## $ Food_Drink                          : num 0 0 0 0 0 ...
## $ Gate_Location                       : num 2 3 3 3 3 3 3 3 3 4 ...
## $ Inflight_WifiService                : num 2 2 3 4 2 2 2 2 3 2 ...
## $ Inflight_Entertainment              : num 4 0 4 3 0 5 0 0 3 0 ...
## $ Online_Support                      : num 2 2 3 4 2 5 2 2 3 2 ...
## $ Ease_Of_OnlineBooking               : num 3 2 1 2 2 5 2 2 3 2 ...
## $ Onboard_Service                     : num 3 3.47 1 2 5 ...
## $ LegRoom_Service                     : num 0 3 0 0 4 0 3 4 0 2 ...
## $ Baggage_Handling                    : num 3 4 1 2 5 5 4 5 1 5 ...
## $ CheckIn_Service                     : num 5 4 4 4 5 5 5 3 2 2 ...
## $ Cleanliness                         : num 3 4 1 2 4 5 4 4 3 5 ...
## $ Online_Boarding                     : num 2 2 3 5 2 3 2 2 5 2 ...
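The 0/1 indicator columns in the structure above (for example Class.Business, Class.Eco and Class.Eco.Plus) are the result of dummy encoding. The call that produced them is not shown in this listing. A minimal base R sketch of the same idea, using `model.matrix` on made-up rows rather than the package loaded in the script, might look like this:

```r
# Made-up rows with one categorical and one numeric column
toy <- data.frame(
  Class = factor(c("Eco", "Business", "Eco Plus", "Eco")),
  Age   = c(65, 15, 60, 70)
)

# "- 1" drops the intercept so every level of Class gets its own 0/1 column
class_dummies <- model.matrix(~ Class - 1, data = toy)

toy_trans <- cbind(as.data.frame(class_dummies), Age = toy$Age)
str(toy_trans)
```

Keeping one column per level makes the indicators of a factor sum to 1 in every row, which explains the correlations of exactly -1 between pairs such as Gender.Female and Gender.Male in the correlation listing.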
Correlation
corx = cor(airline_data_trans, use="pairwise.complete.obs")
corrplot(corx)
print(head(corMasterList,10))
##                                 i                              j          cor           p
## 1                   Gender.Female                    Gender.Male -1.000000000 0.000000000
## 2                   Gender.Female CustomerType.disloyal.Customer  0.031651529 0.000000000
## 3                     Gender.Male CustomerType.disloyal.Customer -0.031651529 0.000000000
## 4                   Gender.Female    CustomerType.Loyal.Customer -0.031651529 0.000000000
## 5                     Gender.Male    CustomerType.Loyal.Customer  0.031651529 0.000000000
## 6  CustomerType.disloyal.Customer    CustomerType.Loyal.Customer -1.000000000 0.000000000
## 7                   Gender.Female                            Age -0.009800715 0.003124809
## 8                     Gender.Male                            Age  0.009800715 0.003124809
## 9  CustomerType.disloyal.Customer                            Age -0.284396085 0.000000000
## 10    CustomerType.Loyal.Customer                            Age  0.284396085 0.000000000
pairs.panels(airline_data[c(bestSub,'Onboard_Service')])
unlink("C:/Users/OLIVIA/Documents/R/win-library/4.0/00LOCK", recursive
= T)
Load Packages
library(ggplot2)
library(backports)
library (corrplot)
library(caTools)
library(rpart)
library(rattle)
library(dummies)
library(mice)
library(gam)
library(caret)
library(psych)
library(randomForest)
Import Data
aviation_data = read.csv("Marketing Project-Aviation Data.csv", header
= TRUE, na.strings = c(""))
Sanity Checks
dim(aviation_data)
## [1] 90917 24
colnames(aviation_data)
Descriptive Statistics
str(aviation_data)
aviation_data$Gender = as.factor(as.character(aviation_data$Gender))
aviation_data$CustomerType = as.factor(as.character(aviation_data$CustomerType))
aviation_data$TypeTravel = as.factor(as.character(aviation_data$TypeTravel))
aviation_data$Class = as.factor(as.character(aviation_data$Class))
aviation_data$Satisfaction = as.factor(as.character(aviation_data$Satisfaction))
summary(aviation_data)
## Mean : 2.839
## Max. :10.000
##
Data Pre-processing
", "Online_Boarding")
colnames(aviation_data) = new_vars
names(aviation_data)
summary(aviation_data)
## NA's :7179
## [1] 90917 23
## [1] TRUE
sum(is.na(aviation_data))
## [1] 42075
colSums(is.na(aviation_data))
## Gender CustomerType
## 0 9099
## Age TypeTravel
## 0 9088
## Class Flight_Distance
## 0 0
## Departure_DelayTime Arrival_DelayTime
## 0 284
## Satisfaction Seat_Comfort
## 0 0
## Departure_ArrivalTime_Covenient Food_Drink
## 8244 8181
## Gate_Location Inflight_WifiService
## 0 0
## Inflight_Entertainment Online_Support
## 0 0
## Ease_Of_OnlineBooking Onboard_Service
## 0 7179
## LegRoom_Service Baggage_Handling
## 0 0
## CheckIn_Service Cleanliness
## 0 0
## Online_Boarding
## 0
p = function(x){sum(is.na(x))/length(x)*100}
apply(aviation_data, 2, p)
## Gender CustomerType
## 0.0000000 10.0080293
## Age TypeTravel
## 0.0000000 9.9959304
## Class Flight_Distance
## 0.0000000 0.0000000
## Departure_DelayTime Arrival_DelayTime
## 0.0000000 0.3123728
## Satisfaction Seat_Comfort
## 0.0000000 0.0000000
## Departure_ArrivalTime_Covenient Food_Drink
## 9.0676111 8.9983171
## Gate_Location Inflight_WifiService
## 0.0000000 0.0000000
## Inflight_Entertainment Online_Support
## 0.0000000 0.0000000
## Ease_Of_OnlineBooking Onboard_Service
## 0.0000000 7.8962130
## LegRoom_Service Baggage_Handling
## 0.0000000 0.0000000
## CheckIn_Service Cleanliness
## 0.0000000 0.0000000
## Online_Boarding
## 0.0000000
aviation_data2 = aviation_data
aviation_data2$Onboard_Service[which(is.na(aviation_data2$Onboard_Service))] = mean(aviation_data2$Onboard_Service, na.rm = TRUE)
aviation_data2$Departure_ArrivalTime_Covenient[which(is.na(aviation_data2$Departure_ArrivalTime_Covenient))] = mean(aviation_data2$Departure_ArrivalTime_Covenient, na.rm = TRUE)
aviation_data2$Food_Drink[which(is.na(aviation_data2$Food_Drink))] = mean(aviation_data2$Food_Drink, na.rm = TRUE)
aviation_data2$Arrival_DelayTime[which(is.na(aviation_data2$Arrival_DelayTime))] = mean(aviation_data2$Arrival_DelayTime, na.rm = TRUE)
summary(aviation_data2)
##
## iter imp variable
## 1 1 CustomerType TypeTravel
## 1 2 CustomerType TypeTravel
## 1 3 CustomerType TypeTravel
## 2 1 CustomerType TypeTravel
## 2 2 CustomerType TypeTravel
## 2 3 CustomerType TypeTravel
## 3 1 CustomerType TypeTravel
## 3 2 CustomerType TypeTravel
## 3 3 CustomerType TypeTravel
## 4 1 CustomerType TypeTravel
## 4 2 CustomerType TypeTravel
## 4 3 CustomerType TypeTravel
any(is.na(aviation_imputed_data))
## [1] FALSE
summary(aviation_imputed_data)
#Outliers Treatment
airline_data$Departure_DelayTime = ifelse(airline_data$Departure_DelayTime > 12, 12, airline_data$Departure_DelayTime)
airline_data$Arrival_DelayTime = ifelse(airline_data$Arrival_DelayTime > 15, 15, airline_data$Arrival_DelayTime)
DATA PREPARATION
dim(aviation_imputed_data)
## [1] 90917 23
str(aviation_imputed_data)
table(aviation_imputed_data$Satisfaction)
##
## neutral or dissatisfied satisfied
## 41156 49761
airline = aviation_imputed_data
colnames(airline)
## [23] "Online_Boarding"
#Convert to Numeric
airline$Gender = as.numeric(as.factor(airline$Gender))
airline$CustomerType = as.numeric(as.factor(airline$CustomerType))
airline$Age = as.numeric(as.integer(airline$Age))
airline$TypeTravel = as.numeric(as.factor(airline$TypeTravel))
airline$Class = as.numeric(as.factor(airline$Class))
airline$Flight_Distance = as.numeric(as.integer(airline$Flight_Distance))
airline$Departure_DelayTime = as.numeric(as.integer(airline$Departure_DelayTime))
airline$Seat_Comfort = as.numeric(as.integer(airline$Seat_Comfort))
airline$Gate_Location = as.numeric(as.integer(airline$Gate_Location))
airline$Inflight_WifiService = as.numeric(as.integer(airline$Inflight_WifiService))
airline$Inflight_Entertainment = as.numeric(as.integer(airline$Inflight_Entertainment))
airline$Online_Support = as.numeric(as.integer(airline$Online_Support))
airline$Ease_Of_OnlineBooking = as.numeric(as.integer(airline$Ease_Of_OnlineBooking))
airline$LegRoom_Service = as.numeric(as.integer(airline$LegRoom_Service))
airline$Baggage_Handling = as.numeric(as.integer(airline$Baggage_Handling))
airline$CheckIn_Service = as.numeric(as.integer(airline$CheckIn_Service))
airline$Cleanliness = as.numeric(as.integer(airline$Cleanliness))
airline$Online_Boarding = as.numeric(as.integer(airline$Online_Boarding))
str(airline)
table(airline$Satisfaction)
##
## neutral or dissatisfied satisfied
## 41156 49761
##
## 0 1
## 41156 49761
## Data Splitting
set.seed(1234)
sample = sample.split(airline$Satisfaction, SplitRatio = 0.7)
train_data = subset(airline, sample == T)
test_data = subset(airline, sample == F)
dim(train_data)
## [1] 63642 23
dim(test_data)
## [1] 27275 23
prop.table(table(train_data$Satisfaction))
##
## 0 1
## 0.4526728 0.5473272
prop.table(table(test_data$Satisfaction))
##
## 0 1
## 0.4526856 0.5473144
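The class proportions printed above are nearly identical in the training and test sets, which is what a stratified split produces. As a minimal sketch of how such a split can be done in base R, with a made-up target vector and without claiming this is how `sample.split` works internally:

```r
set.seed(1234)

# Made-up binary target with roughly the 45/55 class balance reported here
target <- factor(rep(c(0, 1), times = c(450, 550)))

# Sample 70% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(target), target), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_target <- target[train_idx]
test_target  <- target[-train_idx]

prop.table(table(train_target))
prop.table(table(test_target))
```

Sampling within each class keeps the satisfied and dissatisfied shares the same on both sides of the split, so accuracy figures on the test set are comparable with those on the training set.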
MODEL BUILDING
## Applying Logistic Regression
logr_train = train_data
logr_test = test_data
##
## Call:
## glm(formula = Satisfaction ~ ., family = binomial, data = logr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8900 -0.5829 0.1977 0.5246 3.5122
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.621e+00  1.091e-01 -60.681  < 2e-16 ***
## Gender      -9.667e-01  2.341e-02 -41.299  < 2e-16 ***
##
## FALSE TRUE
## 0 23495 5314
## 1 5256 29577
#Accuracy
TR_tpr = logr_TR[2,2]/(logr_TR[2,1] + logr_TR[2,2]) # true positive rate (sensitivity)
TR_tnr = logr_TR[1,1]/(logr_TR[1,1] + logr_TR[1,2]) # true negative rate (specificity); FPR = 1 - TNR
TR_accuracy = sum(diag(logr_TR))/sum(logr_TR)
TR_accuracy
## [1] 0.8339147
##
## FALSE TRUE
## 0 10104 2243
## 1 2197 12731
#Accuracy
TE_tpr = logr_TE[2,2]/(logr_TE[2,1] + logr_TE[2,2]) # true positive rate (sensitivity)
TE_tnr = logr_TE[1,1]/(logr_TE[1,1] + logr_TE[1,2]) # true negative rate (specificity); FPR = 1 - TNR
TE_accuracy = sum(diag(logr_TE))/sum(logr_TE)
TE_accuracy
## [1] 0.8372136
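To make the roles of the four cells explicit, the test-set table printed above can be rebuilt and the usual rates derived from it. This is a sketch for illustration; `logr_TE_demo` and `metrics` are hypothetical names, and the counts are copied from the output above (rows are actual classes, columns are predictions at a 0.5 cut-off):

```r
# Test-set confusion table as printed above
logr_TE_demo <- matrix(c(10104, 2197, 2243, 12731), nrow = 2,
                       dimnames = list(actual = c("0", "1"),
                                       predicted = c("FALSE", "TRUE")))

TN <- logr_TE_demo["0", "FALSE"]; FP <- logr_TE_demo["0", "TRUE"]
FN <- logr_TE_demo["1", "FALSE"]; TP <- logr_TE_demo["1", "TRUE"]

metrics <- c(
  accuracy    = (TP + TN) / sum(logr_TE_demo),
  sensitivity = TP / (TP + FN),   # true positive rate
  specificity = TN / (TN + FP),   # true negative rate
  fpr         = FP / (FP + TN)    # false positive rate = 1 - specificity
)
round(metrics, 4)
```

The accuracy computed this way agrees with the 0.8372136 reported above, and the specificity and false positive rate always sum to 1.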
#Alternatively
logr_pred = predict(logr_model, newdata = logr_test, type = "response")
#Accuracy
mean(y_pred == y_act)
## [1] 0.8372136
#Confusion Matrix
library(e1071)
caret::confusionMatrix(y_pred, y_act, positive = "1")
#ROC
library(InformationValue)
##
## Attaching package: 'InformationValue'
InformationValue::plotROC(y_act, logr_pred)
InformationValue::AUROC(y_act, logr_pred)
## [1] 0.9097688
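The AUROC reported above can be read as the probability that a randomly chosen satisfied passenger receives a higher predicted score than a randomly chosen dissatisfied one. A minimal base R sketch of that rank-based calculation follows; `auc_rank` is a hypothetical helper, not the InformationValue implementation, and the four data points are made up:

```r
# Hypothetical helper: AUROC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(actual, score) {
  r     <- rank(score)                    # average ranks handle tied scores
  n_pos <- as.numeric(sum(actual == 1))   # as.numeric avoids integer overflow
  n_neg <- as.numeric(sum(actual == 0))
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny made-up example
actual <- c(0, 0, 1, 1)
score  <- c(0.10, 0.40, 0.35, 0.80)
auc_rank(actual, score)
```

In this toy case three of the four positive-negative pairs are ordered correctly, giving 0.75. Because the measure depends only on the ordering of scores, it does not change with the 0.5 cut-off used for the confusion tables.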
#Random Forest
Rand_data = train_data
Rand_test = test_data
Rand_Model = randomForest(Satisfaction ~., data = Rand_data, ntree = 50
, importance = T)
Rand_Model
##
## Call:
## randomForest(formula = Satisfaction ~ ., data = Rand_data, ntree =
50, importance = T)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.31%
## Confusion matrix:
## 0 1 class.error
## 0 27250 1559 0.05411503
## 1 1821 33012 0.05227801
#Prediction
predict_rf = predict(Rand_Model, Rand_test, type = "class")
confusionMatrix(predict_rf, Rand_test$Satisfaction) # caret expects predictions first, then the reference labels
## [1] 0 1
## <0 rows> (or 0-length row.names)
#ROC Curve
library(ROCR)
Pred_rf = predict(Rand_Model, Rand_test, type = 'prob')[,2]
require(pROC)
##
## Attaching package: 'pROC'
rf_roc = roc(Rand_test$Satisfaction,Pred_rf)
plot(rf_roc)
#Boosting
boost_train = train_data
boost_test = test_data
features_train = as.matrix(boost_train[,c(1:8,10:23)])
label_train = as.matrix(boost_train[,9])
features_test = as.matrix(boost_test[,c(1:8,10:23)])
#Model
library(xgboost)
##
## Attaching package: 'xgboost'
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
#Prediction
XGB_predict = predict(XGBModel, features_test)
tabXGB = table(boost_test$Satisfaction, XGB_predict > 0.5)
tabXGB
##
## FALSE TRUE
## 0 10914 1433
## 1 2052 12876
#Confusion Matrix
XG_tpr = tabXGB[2,2]/(tabXGB[2,1] + tabXGB[2,2]) # true positive rate (sensitivity)
XG_tnr = tabXGB[1,1]/(tabXGB[1,1] + tabXGB[1,2]) # true negative rate (specificity); FPR = 1 - TNR
XG_tpr
## [1] 0.8625402
XG_tnr
## [1] 0.8839394
#Accuracy
sum(diag(tabXGB))/sum(tabXGB)
## [1] 0.8722273
#ROC
InformationValue::plotROC(bs_act, XGB_predict)
InformationValue::AUROC(bs_act, XGB_predict)
## [1] 0.8732398