Predicting Airline Passengers Satisfaction

This document provides an executive summary and methodology for a project aimed at predicting airline passenger satisfaction. It analyzed survey data from over 90,000 passengers to determine the key factors that influence satisfaction. The analysis involved cleaning the data, exploring relationships between variables, and using random forest modeling to predict satisfaction. The top factors identified were related to check-in services, online booking/support, and seat comfort. The analysis provided recommendations to airlines on improving digital services and understanding drivers of passenger satisfaction.

AIRLINE PASSENGER SATISFACTION PREDICTION

PGP-DSBA Capstone Project


Final Report

ASAAJU BABATUNDE

EXECUTIVE SUMMARY
Customer satisfaction is an important part of staying ahead in today's
competitive business environment. Social media in particular has done much
to publicize infamous incidents of poor customer service, and now more than
ever, the perception that an airline does not care about the satisfaction of
its customers can be very damaging.
Statement of the Problem
The airline industry, like every other business, faces opportunities
(high traffic due to increased demand) as well as challenges from
competition. As a result, all players in the airline industry must
continuously innovate in service quality and technology to improve
passenger safety and profitability.
However, survey feedback from 90,917 respondents on customer
satisfaction (CSAT) shows 55% satisfied and 45% neutral or dissatisfied
customers. There are 22 survey variables (parameters) that may
contribute to the customer satisfaction outcome.
[Figure: Satisfaction Rate, showing respondent counts (axis 0 to 100,000) for "Neutral or Dissatisfied" and "Satisfied"]
Many e-commerce businesses might feel pleased if their CSAT rating is over
70%, but the average global Customer Satisfaction benchmark across all
industries worldwide is estimated at 86%.
Target (Response): Satisfaction
Features:
• Gender                          • Online support
• Customer Type                   • Ease of Online booking
• Age                             • On-board service
• Type of Travel                  • Leg room service
• Class                           • Baggage handling
• Seat comfort                    • Check in service
• Food and Drink                  • Cleanliness
• Inflight wifi service           • Online boarding
• Inflight entertainment          • Arrival Delay-time
• Departure/Arrival Convenient    • Departure Delay-time
• Flight Distance                 • Gate Location


Objective
The aim of this project is to determine the relative importance of each
parameter with regard to its contribution to passenger satisfaction, using
both analytical techniques and a predictive algorithm.
The specific objectives are:
a. To understand which variables play an important role in swaying
passenger feedback towards a very high percentage of 'satisfied'.
b. To determine the relative importance of each survey feedback factor
with regard to passenger satisfaction.
Methodology
The marketing survey outcome is treated as raw data for analysis, in order to
discover useful information for business decisions on improving passenger
satisfaction. The process involves cleaning, evaluating, transforming and
modeling the survey data, using logical and analytical reasoning to
examine each parameter (feature) of the data collected.
R is used for statistical computing, visualization, data
preparation (diagnostics) and predictive analysis. After trying different
models on both training and test data (dataset splitting), the random forest
model (a supervised machine learning algorithm) appears the most suitable,
with the best accuracy and AUROC.
Insights
• The dataset contains 42,075 missing values (NAs). That count equals about
46% of the 90,917 rows; as a share of all cells it is roughly 2%.
• Four variables (Departure Delay-time, Flight Distance,
Arrival Delay-time and Seat Comfort) are positively skewed, with the
mean pulled to the right, which indicates the presence and
direction of outliers in the feature entries and their probability
distributions.
• A positive correlation is observed between Departure Delay-time and
Arrival Delay-time, which implies a strong relationship between the
two variables.
• The p-values and coefficients from the regression analysis of the dataset
indicate that all the independent variables (parameters) are statistically
significant predictors of the target variable (Satisfaction).
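
The missing-value figure in the first insight can be expressed either as a share of all cells or as a share of rows, and the two can differ widely. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical and are not taken from the survey data.

```r
# Toy data frame standing in for the survey data (hypothetical values).
toy <- data.frame(
  Age        = c(25, NA, 40, 31),
  Food_Drink = c(3, 4, NA, NA),
  Class      = c("Eco", "Business", NA, "Eco")
)

na_total   <- sum(is.na(toy))                      # total missing cells
na_per_col <- colSums(is.na(toy))                  # missing cells per variable
na_cells   <- na_total / (nrow(toy) * ncol(toy))   # share of all cells
na_rows    <- mean(!complete.cases(toy))           # share of rows with any NA

print(na_per_col)
cat("Missing cells:", na_total, "\n")
cat("Share of cells missing:", round(100 * na_cells, 1), "%\n")
cat("Share of rows with any NA:", round(100 * na_rows, 1), "%\n")
```

The same calls could be run on the imported survey data frame, provided blank fields were read in as NA.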


Recommendations
1. The (quantitative) survey research method used should be made more
flexible and simpler, to avoid recording large numbers of missing values
in future.
2. It is imperative for the airline management to understand which
features or parameters play an important role in swaying passengers'
feedback positively.
3. The airline management needs to innovate in both procedural and
technological services, since almost all the features are statistically
significant.
4. Improvement should focus on services that can be digitalized
for ease of use and accessibility, such as Check-In Service, Online
Support, Online Booking and Online Boarding, as indicated by the
variable importance plot.


CONTENTS

EXECUTIVE SUMMARY
CONTENTS
ANALYTICAL APPROACH
DEPLOYMENT
APPENDIX


ANALYTICAL APPROACH
Having clearly identified the business problem (customer satisfaction), an
analytical approach is necessary to determine the data requirements, because
the methods of analysis to be used require specific content, formats and data
representations based on domain knowledge.
Step 1: Data Understanding
Descriptive statistics and visualization techniques (EDA): The data was
collected in two separate Excel spreadsheets (flight and survey data),
saved in CSV format for wider applicability, and merged
using the data dictionary provided with a Likert chart.
The dataset has 90,917 observations and 24 variables, which must
have been collected over a period of time (at least one month, assuming
Boeing 747 capacity at 5 trips per day). There are 42,075 missing values
(NAs) in the dataset, and possible outliers are observed.

Univariate analysis of the variables is used to find patterns within the data
in terms of mean, median, mode and variance.
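
As an illustration of the univariate statistics named above, the sketch below computes them for a single numeric vector in base R. The vector `x` and the helper functions `stat_mode` and `skewness` are hypothetical names introduced here; base R has no built-in statistical mode or skewness function, so simple versions are defined.

```r
# Hypothetical delay values in minutes (illustration only).
x <- c(0, 0, 0, 5, 12, 30, 120)

stat_mode <- function(v) {
  # Most frequent value; ties resolve to the first one encountered.
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}

skewness <- function(v) {
  # Moment-based skewness: third central moment over variance^(3/2).
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

stats <- c(mean = mean(x), median = median(x), mode = stat_mode(x),
           variance = var(x), skewness = skewness(x))
print(round(stats, 3))
```

A mean well above the median together with a positive skewness value is the pattern the report describes for the delay variables.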

See appendix for capstone project notes I

Step 2: Data Pre-processing

The data pre-processing step includes all the activities used to create the
dataset used during the modeling phase. This includes cleansing the data and
transforming it into more useful variables.
• Removal of the unwanted variable (attribute): Customer ID.
• Missing value treatment using both simple and MICE imputation.
• Visualization of relationships between variables (bivariate analysis).
• Replacement of maximum outlier values, as identified from the 4th quantile
values.
• Transformation of variables using feature engineering and text analysis
to derive new structured variables, for possible enrichment of the
predictors and improvement of model predictions.
• Splitting the dataset into training and testing data in readiness for model
building. The training dataset is used to build a model,
while the testing dataset is used to validate the model built.
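
Two of the steps listed above, outlier replacement and the train/test split, can be sketched in base R as follows. This is an illustration under stated assumptions, not the report's actual procedure: the data frame `toy`, the 99th-percentile cap and the 70/30 split ratio are all assumed values chosen for the example.

```r
set.seed(123)  # reproducible sampling

# Hypothetical data: a skewed delay variable and a binary target.
toy <- data.frame(
  delay  = c(rexp(99, rate = 1 / 15), 1500),   # one extreme value
  target = factor(sample(c("satisfied", "neutral or dissatisfied"),
                         100, replace = TRUE))
)

# Cap values above the 99th percentile (assumed threshold).
cap <- quantile(toy$delay, probs = 0.99, na.rm = TRUE)
toy$delay_capped <- pmin(toy$delay, cap)

# 70/30 train/test split by random row index (assumed ratio).
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(dim(train))
print(dim(test))
```

A plain random split like this does not guarantee the same class balance in both parts; a split function that samples within each class is an alternative when the balance matters.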

See appendix for capstone project notes II


Step 3: Model Building and Validation


The modeling process is highly iterative. The general goal of the data
analysis is to acquire knowledge from the data so as to improve business
decisions towards sustainability and profitability.

Statistical models provide a convenient framework for achieving this project
objective because they make it possible to know the importance of each survey
parameter (variable) and how the parameters collectively influence passenger
satisfaction through predictions.

Considering the project's problem statement and main objective,
predictive modeling (supervised learning) is more applicable because an output
variable (response) Y exists and the goal is to build a function f(X) of the
inputs X for predicting Y.

Hence, Logistic Regression, Random Forest and XGBoost models are
considered as candidate algorithms, since the output variable is a two-level
factor.
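
Because the response is a two-level factor, a binomial logistic regression is the simplest of the candidate models to illustrate. The sketch below fits one with base R's `glm` on simulated data; the data frame `toy`, its two rating columns and the simulated relationship are assumptions made for the example and do not reproduce the report's model.

```r
set.seed(1)

# Hypothetical ratings data; the relationship below is simulated.
n <- 500
toy <- data.frame(
  seat_comfort   = sample(0:5, n, replace = TRUE),
  online_support = sample(0:5, n, replace = TRUE)
)
p <- plogis(-3 + 0.6 * toy$seat_comfort + 0.5 * toy$online_support)
toy$satisfaction <- factor(
  ifelse(runif(n) < p, "satisfied", "neutral or dissatisfied"),
  levels = c("neutral or dissatisfied", "satisfied")
)

# Binomial GLM: models the probability of the second factor level.
fit <- glm(satisfaction ~ seat_comfort + online_support,
           data = toy, family = binomial)

# Predicted probabilities, turned into class labels at a 0.5 cut-off.
prob <- predict(fit, newdata = toy, type = "response")
pred <- factor(ifelse(prob > 0.5, "satisfied", "neutral or dissatisfied"),
               levels = levels(toy$satisfaction))

print(summary(fit)$coefficients)        # estimates and p-values
print(mean(pred == toy$satisfaction))   # in-sample accuracy
```

The coefficient table is where the p-values mentioned in the Insights would be read from in a fit of this kind.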

Model Evaluation Outcome

Metric       Logistic Regression  XGBoost  Random Forest
Sensitivity  85%                  86%      95%
Specificity  82%                  88%      95%
Accuracy     84%                  87%      95%
AUROC        91%                  87%      97%

• Random Forest appears to be the best model, although it shows a tendency
towards overfitting.
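
The four metrics in the table can be computed from predicted scores and actual labels as sketched below. The labels and scores are made up for illustration, and the AUROC uses the rank-sum (Mann-Whitney) formulation rather than any particular package.

```r
# Hypothetical actual labels and predicted probabilities of "sat".
actual <- factor(c("sat", "sat", "sat", "sat", "dis",
                   "dis", "dis", "dis", "sat", "dis"),
                 levels = c("dis", "sat"))
score  <- c(0.90, 0.80, 0.70, 0.40, 0.20, 0.30, 0.10, 0.60, 0.95, 0.35)

# Class predictions at a 0.5 cut-off, then the confusion matrix.
predicted <- factor(ifelse(score > 0.5, "sat", "dis"), levels = c("dis", "sat"))
cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["sat", "sat"]; tn <- cm["dis", "dis"]
fp <- cm["sat", "dis"]; fn <- cm["dis", "sat"]

sensitivity <- as.numeric(tp / (tp + fn))    # true positive rate
specificity <- as.numeric(tn / (tn + fp))    # true negative rate
accuracy    <- as.numeric((tp + tn) / sum(cm))

# AUROC via ranks: probability that a random positive outranks a random negative.
pos   <- actual == "sat"
n_pos <- sum(pos); n_neg <- sum(!pos)
auroc <- (sum(rank(score)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(cm)
print(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, auroc = auroc))
```

Computing the same metrics on both the training and the test data is one way to check the overfitting concern noted for Random Forest: a large gap between the two sets of numbers would support it.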

See appendix for capstone project notes III


DEPLOYMENT
In conclusion, the variable importance plot, a key output of the best
algorithm (Random Forest), is used to show the top 10
variables based on their Mean Decrease Accuracy values.

S/N  Variable (Feature): Recommendation

1. Check-In Service: Check-in service should be flexible and less rigorous to improve passenger satisfaction.
2. Inflight Entertainment: Inflight entertainment must be innovative (such as movie streams) and cost effective.
3. Seat Comfort: EDA has shown that business travelers are the largest group of passengers, and seat ergonomics would be a great concern for them. Hence, there is a strong need to invest in modern aircraft with comfortable seats.
4. Online Support: We are in a digital world. Relevant mobile apps for online support should be deployed.
5. Customer Type: Customer loyalty is critical. A greater percentage of the passengers are loyal to the airline, and this must be sustained and increased with improved service delivery.
6. Baggage Handling: Professional and secure handling of baggage is required.
7. Travel Type: Most passengers are business travelers, and they are viable customers due to their frequency of travel.
8. Age: Older people use the airline. "The older the better."
9. Food and Drink: The choice of food and drink must be suitable for older and business class passengers.
10. Departure/Arrival-Time Convenient: Departure times that are too early or too late should be minimized.
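
The idea behind a Mean Decrease Accuracy ranking is permutation importance: shuffle one predictor, and measure how much prediction accuracy drops. The sketch below illustrates that idea only. It deliberately swaps in a base R logistic regression for the report's random forest so that it runs without extra packages, it measures the drop on the training data rather than on out-of-bag samples, and the data (`x1`, `x2`, `y`) are simulated.

```r
set.seed(42)

# Simulated data: x1 drives the outcome, x2 is pure noise (hypothetical).
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + rnorm(n, sd = 0.5) > 0, "satisfied", "other"),
                levels = c("other", "satisfied"))

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

accuracy <- function(model, data) {
  prob <- predict(model, newdata = data, type = "response")
  pred <- ifelse(prob > 0.5, "satisfied", "other")
  mean(pred == data$y)
}

baseline <- accuracy(fit, toy)

# Permutation importance: shuffle one predictor, record the accuracy drop,
# and average over 20 shuffles.
perm_importance <- sapply(c("x1", "x2"), function(v) {
  drops <- replicate(20, {
    shuffled <- toy
    shuffled[[v]] <- sample(shuffled[[v]])
    baseline - accuracy(fit, shuffled)
  })
  mean(drops)
})

print(sort(perm_importance, decreasing = TRUE))
```

A variable whose shuffling barely changes accuracy (here `x2`) ranks low, which is how a top-10 list like the one above would be read.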


APPENDIX

Capstone Project Notes I

Load Packages
library(ggplot2)  # For graphs and visualisations
library(corrplot) # For correlation plots
library(caTools)  # Split data into test and train sets
library(rpart)    # Decision trees
library(rattle)   # To visualise decision tree

Set Working Directory

setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")

Import Data
aviation_data = read.csv("Marketing Project-Aviation Data.csv", sep = ",",
                         header = TRUE)

Exploratory Data Analysis

Sanity Checks
dim(aviation_data)

## [1] 90917 24

colnames(aviation_data)

##  [1] "CustomerID"                        "Gender"
##  [3] "CustomerType"                      "Age"
##  [5] "TypeTravel"                        "Class"
##  [7] "Flight_Distance"                   "DepartureDelayin_Mins"
##  [9] "ArrivalDelayin_Mins"               "Satisfaction"
## [11] "Seat_comfort"                      "Departure.Arrival.time_convenient"
## [13] "Food_drink"                        "Gate_location"
## [15] "Inflightwifi_service"              "Inflight_entertainment"
## [17] "Online_support"                    "Ease_of_Onlinebooking"
## [19] "Onboard_service"                   "Leg_room_service"
## [21] "Baggage_handling"                  "Checkin_service"
## [23] "Cleanliness"                       "Online_boarding"

Descriptive Statistics
summary(aviation_data)

##    CustomerID        Gender              CustomerType        Age
##  Min.   :149965   Female:46186                    : 9099   Min.   : 7.00
##  1st Qu.:172694   Male  :44731   disloyal Customer:14921   1st Qu.:27.00
##  Median :195423                  Loyal Customer   :66897   Median :40.00
##  Mean   :195423                                            Mean   :39.45
##  3rd Qu.:218152                                            3rd Qu.:51.00
##  Max.   :240881                                            Max.   :85.00
##
##            TypeTravel          Class       Flight_Distance DepartureDelayin_Mins
##                 : 9088   Business:43535   Min.   :  50     Min.   :   0.00
##  Business travel:56481   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  Personal Travel:25348   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
##
##  ArrivalDelayin_Mins                  Satisfaction    Seat_comfort
##  Min.   :   0.00     neutral or dissatisfied:41156   Min.   : 0.000
##  1st Qu.:   0.00     satisfied              :49761   1st Qu.: 2.000
##  Median :   0.00                                     Median : 3.000
##  Mean   :  15.06                                     Mean   : 2.839
##  3rd Qu.:  13.00                                     3rd Qu.: 4.000
##  Max.   :1584.00                                     Max.   :10.000
##  NA's   :284
##
##  Departure.Arrival.time_convenient   Food_drink    Gate_location
##  Min.   :0.000                     Min.   :0.00   Min.   :0.00
##  1st Qu.:2.000                     1st Qu.:2.00   1st Qu.:2.00
##  Median :3.000                     Median :3.00   Median :3.00
##  Mean   :2.993                     Mean   :2.85   Mean   :2.99
##  3rd Qu.:4.000                     3rd Qu.:4.00   3rd Qu.:4.00
##  Max.   :5.000                     Max.   :5.00   Max.   :5.00
##  NA's   :8244                      NA's   :8181
##
##  Inflightwifi_service Inflight_entertainment Online_support
##  Min.   :0.000        Min.   :0.000          Min.   :0.000
##  1st Qu.:2.000        1st Qu.:2.000          1st Qu.:3.000
##  Median :3.000        Median :4.000          Median :4.000
##  Mean   :3.252        Mean   :3.384          Mean   :3.519
##  3rd Qu.:4.000        3rd Qu.:4.000          3rd Qu.:5.000
##  Max.   :5.000        Max.   :5.000          Max.   :5.000
##
##  Ease_of_Onlinebooking Onboard_service Leg_room_service Baggage_handling
##  Min.   :0.000         Min.   :0.000   Min.   :0.000    Min.   :1.000
##  1st Qu.:2.000         1st Qu.:3.000   1st Qu.:2.000    1st Qu.:3.000
##  Median :4.000         Median :4.000   Median :4.000    Median :4.000
##  Mean   :3.476         Mean   :3.467   Mean   :3.487    Mean   :3.697
##  3rd Qu.:5.000         3rd Qu.:4.000   3rd Qu.:5.000    3rd Qu.:5.000
##  Max.   :5.000         Max.   :5.000   Max.   :5.000    Max.   :5.000
##                        NA's   :7179
##
##  Checkin_service  Cleanliness     Online_boarding
##  Min.   :0.000    Min.   :0.000   Min.   :0.000
##  1st Qu.:3.000    1st Qu.:3.000   1st Qu.:2.000
##  Median :3.000    Median :4.000   Median :4.000
##  Mean   :3.341    Mean   :3.708   Mean   :3.352
##  3rd Qu.:4.000    3rd Qu.:5.000   3rd Qu.:4.000
##  Max.   :5.000    Max.   :5.000   Max.   :5.000

str(aviation_data)

## 'data.frame': 90917 obs. of 24 variables:
##  $ CustomerID                       : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Gender                           : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 2 1 1 ...
##  $ CustomerType                     : Factor w/ 3 levels "","disloyal Customer",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Age                              : int 65 15 60 70 30 66 10 22 58 34 ...
##  $ TypeTravel                       : Factor w/ 3 levels "","Business travel",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ Class                            : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Flight_Distance                  : int 265 2138 623 354 1894 227 1812 1556 104 3633 ...
##  $ DepartureDelayin_Mins            : int 0 0 0 0 0 17 0 30 47 0 ...
##  $ ArrivalDelayin_Mins              : int 0 0 0 0 0 15 0 26 48 0 ...
##  $ Satisfaction                     : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Seat_comfort                     : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ Departure.Arrival.time_convenient: int 0 0 NA 0 0 0 0 NA 0 0 ...
##  $ Food_drink                       : int 0 0 0 0 0 NA NA 0 0 0 ...
##  $ Gate_location                    : int 2 3 3 3 3 3 3 3 3 4 ...
##  $ Inflightwifi_service             : int 2 2 3 4 2 2 2 2 3 2 ...
##  $ Inflight_entertainment           : int 4 0 4 3 0 5 0 0 3 0 ...
##  $ Online_support                   : int 2 2 3 4 2 5 2 2 3 2 ...
##  $ Ease_of_Onlinebooking            : int 3 2 1 2 2 5 2 2 3 2 ...
##  $ Onboard_service                  : int 3 NA 1 2 5 5 3 2 3 3 ...
##  $ Leg_room_service                 : int 0 3 0 0 4 0 3 4 0 2 ...
##  $ Baggage_handling                 : int 3 4 1 2 5 5 4 5 1 5 ...
##  $ Checkin_service                  : int 5 4 4 4 5 5 5 3 2 2 ...
##  $ Cleanliness                      : int 3 4 1 2 4 5 4 4 3 5 ...
##  $ Online_boarding                  : int 2 2 3 5 2 3 2 2 5 2 ...

Renaming the variables


new_vars = c("CustomerID", "Gender", "CustomerType", "Age", "TypeTravel",
             "Class", "Flight_Distance", "Departure_DelayTime",
             "Arrival_DelayTime", "Satisfaction", "Seat_Comfort",
             "Departure_ArrivalTime_Covenient", "Food_Drink", "Gate_Location",
             "Inflight_WifiService", "Inflight_Entertainment", "Online_Support",
             "Ease_Of_OnlineBooking", "Onboard_Service", "LegRoom_Service",
             "Baggage_Handling", "CheckIn_Service", "Cleanliness",
             "Online_Boarding")

colnames(aviation_data) = new_vars
names(aviation_data)

##  [1] "CustomerID"                      "Gender"
##  [3] "CustomerType"                    "Age"
##  [5] "TypeTravel"                      "Class"
##  [7] "Flight_Distance"                 "Departure_DelayTime"
##  [9] "Arrival_DelayTime"               "Satisfaction"
## [11] "Seat_Comfort"                    "Departure_ArrivalTime_Covenient"
## [13] "Food_Drink"                      "Gate_Location"
## [15] "Inflight_WifiService"            "Inflight_Entertainment"
## [17] "Online_Support"                  "Ease_Of_OnlineBooking"
## [19] "Onboard_Service"                 "LegRoom_Service"
## [21] "Baggage_Handling"                "CheckIn_Service"
## [23] "Cleanliness"                     "Online_Boarding"

# Relabel factor levels; the empty-string level (blank responses) is renamed
levels(aviation_data$CustomerType) = c("Neutral", "Disloyal Customer",
                                       "Loyal Customer")

levels(aviation_data$TypeTravel) = c("Other", "Business Travel",
                                     "Personal Travel")

summary(aviation_data)

##    CustomerID        Gender              CustomerType        Age
##  Min.   :149965   Female:46186   Neutral          : 9099   Min.   : 7.00
##  1st Qu.:172694   Male  :44731   Disloyal Customer:14921   1st Qu.:27.00
##  Median :195423                  Loyal Customer   :66897   Median :40.00
##  Mean   :195423                                            Mean   :39.45
##  3rd Qu.:218152                                            3rd Qu.:51.00
##  Max.   :240881                                            Max.   :85.00
##
##            TypeTravel          Class       Flight_Distance Departure_DelayTime
##  Other          : 9088   Business:43535   Min.   :  50     Min.   :   0.00
##  Business Travel:56481   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  Personal Travel:25348   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
##
##  Arrival_DelayTime                  Satisfaction    Seat_Comfort
##  Min.   :   0.00   neutral or dissatisfied:41156   Min.   : 0.000
##  1st Qu.:   0.00   satisfied              :49761   1st Qu.: 2.000
##  Median :   0.00                                   Median : 3.000
##  Mean   :  15.06                                   Mean   : 2.839
##  3rd Qu.:  13.00                                   3rd Qu.: 4.000
##  Max.   :1584.00                                   Max.   :10.000
##  NA's   :284
##
##  Departure_ArrivalTime_Covenient   Food_Drink    Gate_Location
##  Min.   :0.000                   Min.   :0.00   Min.   :0.00
##  1st Qu.:2.000                   1st Qu.:2.00   1st Qu.:2.00
##  Median :3.000                   Median :3.00   Median :3.00
##  Mean   :2.993                   Mean   :2.85   Mean   :2.99
##  3rd Qu.:4.000                   3rd Qu.:4.00   3rd Qu.:4.00
##  Max.   :5.000                   Max.   :5.00   Max.   :5.00
##  NA's   :8244                    NA's   :8181
##
##  Inflight_WifiService Inflight_Entertainment Online_Support
##  Min.   :0.000        Min.   :0.000          Min.   :0.000
##  1st Qu.:2.000        1st Qu.:2.000          1st Qu.:3.000
##  Median :3.000        Median :4.000          Median :4.000
##  Mean   :3.252        Mean   :3.384          Mean   :3.519
##  3rd Qu.:4.000        3rd Qu.:4.000          3rd Qu.:5.000
##  Max.   :5.000        Max.   :5.000          Max.   :5.000
##
##  Ease_Of_OnlineBooking Onboard_Service LegRoom_Service Baggage_Handling
##  Min.   :0.000         Min.   :0.000   Min.   :0.000   Min.   :1.000
##  1st Qu.:2.000         1st Qu.:3.000   1st Qu.:2.000   1st Qu.:3.000
##  Median :4.000         Median :4.000   Median :4.000   Median :4.000
##  Mean   :3.476         Mean   :3.467   Mean   :3.487   Mean   :3.697
##  3rd Qu.:5.000         3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:5.000
##  Max.   :5.000         Max.   :5.000   Max.   :5.000   Max.   :5.000
##                        NA's   :7179
##
##  CheckIn_Service  Cleanliness     Online_Boarding
##  Min.   :0.000    Min.   :0.000   Min.   :0.000
##  1st Qu.:3.000    1st Qu.:3.000   1st Qu.:2.000
##  Median :3.000    Median :4.000   Median :4.000
##  Mean   :3.341    Mean   :3.708   Mean   :3.352
##  3rd Qu.:4.000    3rd Qu.:5.000   3rd Qu.:4.000
##  Max.   :5.000    Max.   :5.000   Max.   :5.000

Univariate Analysis

Age
par(mfrow = c(1,2))
hist(aviation_data$Age, col = "pink", main = "Age")
boxplot(aviation_data$Age, horizontal = TRUE, col = "pink", main = "Age")

Flight Distance and Departure Delay Time


par(mfrow = c(2,2))
hist(aviation_data$Flight_Distance, col = "blue", main = "Flight Distance")
hist(aviation_data$Departure_DelayTime, col = "brown",
     main = "Departure DelayTime")

boxplot(aviation_data$Flight_Distance, horizontal = TRUE, col = "blue",
        main = "Flight Distance")
boxplot(aviation_data$Departure_DelayTime, horizontal = TRUE, col = "brown",
        main = "Departure DelayTime")

Arrival Delay Time and Seat Comfort


par(mfrow = c(2,2))
hist(aviation_data$Arrival_DelayTime, col = "red", main = "Arrival DelayTime")
hist(aviation_data$Seat_Comfort, col = "purple", main = "Seat Comfort")

boxplot(aviation_data$Arrival_DelayTime, horizontal = TRUE, col = "red",
        main = "Arrival DelayTime")
boxplot(aviation_data$Seat_Comfort, horizontal = TRUE, col = "purple",
        main = "Seat Comfort")

Departure-Arrival Time Convenient and Food-Drink


par(mfrow = c(2,2))
hist(aviation_data$Departure_ArrivalTime_Covenient, col = "green",
     main = "Departure-ArrivalTime Convenient")
hist(aviation_data$Food_Drink, col = "yellow", main = "Food-Drink")

boxplot(aviation_data$Departure_ArrivalTime_Covenient, horizontal = TRUE,
        col = "green", main = "Departure-ArrivalTime Convenient")
boxplot(aviation_data$Food_Drink, horizontal = TRUE, col = "yellow",
        main = "Food-Drink")

Gate Location, Baggage Handling and Cleanliness


par(mfrow = c(3,3))
hist(aviation_data$Gate_Location, col = "blue", main = "Gate Location")
hist(aviation_data$Baggage_Handling, col = "gold", main = "Baggage Handling")
hist(aviation_data$Cleanliness, col = "brown", main = "Cleanliness")

boxplot(aviation_data$Gate_Location, horizontal = TRUE, col = "blue",
        main = "Gate Location")
boxplot(aviation_data$Baggage_Handling, horizontal = TRUE, col = "gold",
        main = "Baggage Handling")
boxplot(aviation_data$Cleanliness, horizontal = TRUE, col = "brown",
        main = "Cleanliness")

Inflight WifiService and Inflight Entertainment


par(mfrow = c(2,2))
hist(aviation_data$Inflight_WifiService, col = "red",
     main = "Inflight WifiService")
hist(aviation_data$Inflight_Entertainment, col = "purple",
     main = "Inflight Entertainment")

boxplot(aviation_data$Inflight_WifiService, horizontal = TRUE, col = "red",
        main = "Inflight WifiService")
boxplot(aviation_data$Inflight_Entertainment, horizontal = TRUE,
        col = "purple", main = "Inflight Entertainment")

Online Support, Online Boarding and Ease of Online Booking


par(mfrow = c(3,3))
hist(aviation_data$Online_Support, col = "green", main = "Online Support")
hist(aviation_data$Online_Boarding, col = "grey", main = "Online Boarding")
hist(aviation_data$Ease_Of_OnlineBooking, col = "yellow",
     main = "Ease of OnlineBooking")

boxplot(aviation_data$Online_Support, horizontal = TRUE, col = "green",
        main = "Online Support")
boxplot(aviation_data$Online_Boarding, horizontal = TRUE, col = "grey",
        main = "Online Boarding")
boxplot(aviation_data$Ease_Of_OnlineBooking, horizontal = TRUE,
        col = "yellow", main = "Ease of OnlineBooking")

Onboard Service, LegRoom Service and CheckIn Service

par(mfrow = c(3,3))
hist(aviation_data$Onboard_Service, col = "blue", main = "Onboard Service")
hist(aviation_data$LegRoom_Service, col = "brown", main = "LegRoom Service")
hist(aviation_data$CheckIn_Service, col = "pink", main = "CheckIn Service")

boxplot(aviation_data$Onboard_Service, horizontal = TRUE, col = "blue",
        main = "Onboard Service")
boxplot(aviation_data$LegRoom_Service, horizontal = TRUE, col = "brown",
        main = "LegRoom Service")
boxplot(aviation_data$CheckIn_Service, horizontal = TRUE, col = "pink",
        main = "CheckIn Service")

Gender and Class


plot(aviation_data$Gender)


plot(aviation_data$Class)

Customer Type and Travel Type


plot(aviation_data$CustomerType)


plot(aviation_data$TypeTravel)


Satisfaction
plot(aviation_data$Satisfaction)

Bivariate Analysis
# Pairwise correlations of the numeric columns, ignoring NAs pair by pair
corx = cor(aviation_data[, c(4, 7:9, 11:24)], use = "pairwise.complete.obs")
corrplot(corx)


plot(Departure_DelayTime~Arrival_DelayTime, data = aviation_data)
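
The strength of the relationship shown in this scatter plot can also be put as a number. A minimal sketch with made-up delay values follows; `dep` and `arr` are hypothetical vectors, not the survey columns, and one arrival value is deliberately missing to show the NA handling.

```r
# Hypothetical delay pairs in minutes; one arrival value is missing.
dep <- c(0, 0, 17, 30, 47, 5, 120)
arr <- c(0, 0, 15, 26, 48, NA, 118)

# Pearson correlation, dropping the incomplete pair.
r <- cor(dep, arr, use = "complete.obs")
print(r)

# Test of the null hypothesis that the true correlation is zero.
ct <- cor.test(dep, arr)
print(ct$estimate)
print(ct$p.value)
```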


Online Boarding and Satisfaction


boxplot(aviation_data$Online_Boarding ~ aviation_data$Satisfaction,
        col = c("blue", "grey", "red"), main = "Online boarding distribution",
        xlab = "Satisfaction", ylab = "Online_Boarding")

Travel Type and Ease of Online Booking


boxplot(aviation_data$Ease_Of_OnlineBooking ~ aviation_data$TypeTravel,
        col = c("blue", "grey", "red"),
        main = "Ease of online booking distribution",
        xlab = "TypeTravel", ylab = "OnlineBooking")

xtabs(~Satisfaction+CustomerType, aviation_data)

##                          CustomerType
## Satisfaction              Neutral Disloyal Customer Loyal Customer
##   neutral or dissatisfied    4225             11351          25580
##   satisfied                  4874              3570          41317

plot(xtabs(~Satisfaction+CustomerType, aviation_data),
     main = "Customer Type and Satisfaction")

Capstone Project Notes II

# Remove a stale package-installation lock directory
unlink("C:/Users/OLIVIA/Documents/R/win-library/3.6/00LOCK", recursive = TRUE)

Load Packages
library(ggplot2)
library(backports)
library(corrplot)
library(caTools)
library(rpart)
library(rattle)
library(dummies)
library(mice)
library(gam)
library(caret)
library(psych)

Set Working Directory


setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")

Import Data
aviation_data = read.csv("Marketing Project-Aviation Data.csv", header = TRUE,
                         na.strings = c(""))
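
The `na.strings` argument used in this import changes how blank fields are read. The sketch below shows the difference on a small inline CSV; the three rows in `csv_text` are made up, and `read.csv(text = ...)` is used only so the example needs no file.

```r
# Small inline CSV with one empty CustomerType field (hypothetical rows).
csv_text <- "CustomerID,CustomerType,Age
1,Loyal Customer,65
2,,15
3,disloyal Customer,60"

raw    <- read.csv(text = csv_text, header = TRUE)
marked <- read.csv(text = csv_text, header = TRUE, na.strings = c(""))

print(raw$CustomerType)      # blank field kept as an empty string
print(marked$CustomerType)   # blank field read as NA
print(sum(is.na(raw)))
print(sum(is.na(marked)))
```

Reading blanks as NA is what lets later steps count and impute them as missing values instead of treating them as a separate category.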

Sanity Checks
dim(aviation_data)

## [1] 90917 24

colnames(aviation_data)

##  [1] "CustomerID"                        "Gender"
##  [3] "CustomerType"                      "Age"
##  [5] "TypeTravel"                        "Class"
##  [7] "Flight_Distance"                   "DepartureDelayin_Mins"
##  [9] "ArrivalDelayin_Mins"               "Satisfaction"
## [11] "Seat_comfort"                      "Departure.Arrival.time_convenient"
## [13] "Food_drink"                        "Gate_location"
## [15] "Inflightwifi_service"              "Inflight_entertainment"
## [17] "Online_support"                    "Ease_of_Onlinebooking"
## [19] "Onboard_service"                   "Leg_room_service"
## [21] "Baggage_handling"                  "Checkin_service"
## [23] "Cleanliness"                       "Online_boarding"

Descriptive Statistics
str(aviation_data)

## 'data.frame': 90917 obs. of 24 variables:
##  $ CustomerID                       : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
##  $ Gender                           : chr "Female" "Female" "Female" "Female" ...
##  $ CustomerType                     : chr "Loyal Customer" "Loyal Customer" "Loyal Customer" "Loyal Customer" ...
##  $ Age                              : int 65 15 60 70 30 66 10 22 58 34 ...
##  $ TypeTravel                       : chr "Personal Travel" "Personal Travel" "Personal Travel" "Personal Travel" ...
##  $ Class                            : chr "Eco" "Eco" "Eco" "Eco" ...
##  $ Flight_Distance                  : int 265 2138 623 354 1894 227 1812 1556 104 3633 ...
##  $ DepartureDelayin_Mins            : int 0 0 0 0 0 17 0 30 47 0 ...
##  $ ArrivalDelayin_Mins              : chr "0" "0" "0" "0" ...
##  $ Satisfaction                     : chr "satisfied" "satisfied" "satisfied" "satisfied" ...
##  $ Seat_comfort                     : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ Departure.Arrival.time_convenient: int 0 0 NA 0 0 0 0 NA 0 0 ...
##  $ Food_drink                       : int 0 0 0 0 0 NA NA 0 0 0 ...
##  $ Gate_location                    : int 2 3 3 3 3 3 3 3 3 4 ...
##  $ Inflightwifi_service             : int 2 2 3 4 2 2 2 2 3 2 ...
##  $ Inflight_entertainment           : int 4 0 4 3 0 5 0 0 3 0 ...
##  $ Online_support                   : int 2 2 3 4 2 5 2 2 3 2 ...
##  $ Ease_of_Onlinebooking            : int 3 2 1 2 2 5 2 2 3 2 ...
##  $ Onboard_service                  : int 3 NA 1 2 5 5 3 2 3 3 ...
##  $ Leg_room_service                 : int 0 3 0 0 4 0 3 4 0 2 ...
##  $ Baggage_handling                 : int 3 4 1 2 5 5 4 5 1 5 ...
##  $ Checkin_service                  : int 5 4 4 4 5 5 5 3 2 2 ...
##  $ Cleanliness                      : int 3 4 1 2 4 5 4 4 3 5 ...
##  $ Online_boarding                  : int 2 2 3 5 2 3 2 2 5 2 ...

# Convert the character columns to factors
aviation_data$Gender = as.factor(as.character(aviation_data$Gender))
aviation_data$CustomerType = as.factor(as.character(aviation_data$CustomerType))
aviation_data$TypeTravel = as.factor(as.character(aviation_data$TypeTravel))
aviation_data$Class = as.factor(as.character(aviation_data$Class))
aviation_data$Satisfaction = as.factor(as.character(aviation_data$Satisfaction))

summary(aviation_data)

##    CustomerID        Gender              CustomerType        Age
##  Min.   :149965   Female:46186   disloyal Customer:14921   Min.   : 7.00
##  1st Qu.:172694   Male  :44731   Loyal Customer   :66897   1st Qu.:27.00
##  Median :195423                  NA's             : 9099   Median :40.00
##  Mean   :195423                                            Mean   :39.45
##  3rd Qu.:218152                                            3rd Qu.:51.00
##  Max.   :240881                                            Max.   :85.00
##
##            TypeTravel          Class       Flight_Distance DepartureDelayin_Mins
##  Business travel:56481   Business:43535   Min.   :  50     Min.   :   0.00
##  Personal Travel:25348   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  NA's           : 9088   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
##
##  ArrivalDelayin_Mins                  Satisfaction    Seat_comfort
##  Length:90917        neutral or dissatisfied:41156   Min.   : 0.000
##  Class :character    satisfied              :49761   1st Qu.: 2.000
##  Mode  :character                                    Median : 3.000
##                                                      Mean   : 2.839
##                                                      3rd Qu.: 4.000
##                                                      Max.   :10.000
##
##  Departure.Arrival.time_convenient   Food_drink    Gate_location
##  Min.   :0.000                     Min.   :0.00   Min.   :0.00
##  1st Qu.:2.000                     1st Qu.:2.00   1st Qu.:2.00
##  Median :3.000                     Median :3.00   Median :3.00
##  Mean   :2.993                     Mean   :2.85   Mean   :2.99
##  3rd Qu.:4.000                     3rd Qu.:4.00   3rd Qu.:4.00
##  Max.   :5.000                     Max.   :5.00   Max.   :5.00
##  NA's   :8244                      NA's   :8181
##
##  Inflightwifi_service Inflight_entertainment Online_support
##  Min.   :0.000        Min.   :0.000          Min.   :0.000
##  1st Qu.:2.000        1st Qu.:2.000          1st Qu.:3.000
##  Median :3.000        Median :4.000          Median :4.000
##  Mean   :3.252        Mean   :3.384          Mean   :3.519
##  3rd Qu.:4.000        3rd Qu.:4.000          3rd Qu.:5.000
##  Max.   :5.000        Max.   :5.000          Max.   :5.000
##
##  Ease_of_Onlinebooking Onboard_service Leg_room_service Baggage_handling
##  Min.   :0.000         Min.   :0.000   Min.   :0.000    Min.   :1.000
##  1st Qu.:2.000         1st Qu.:3.000   1st Qu.:2.000    1st Qu.:3.000
##  Median :4.000         Median :4.000   Median :4.000    Median :4.000
##  Mean   :3.476         Mean   :3.467   Mean   :3.487    Mean   :3.697
##  3rd Qu.:5.000         3rd Qu.:4.000   3rd Qu.:5.000    3rd Qu.:5.000
##  Max.   :5.000         Max.   :5.000   Max.   :5.000    Max.   :5.000
##                        NA's   :7179
##
##  Checkin_service  Cleanliness     Online_boarding
##  Min.   :0.000    Min.   :0.000   Min.   :0.000
##  1st Qu.:3.000    1st Qu.:3.000   1st Qu.:2.000
##  Median :3.000    Median :4.000   Median :4.000
##  Mean   :3.341    Mean   :3.708   Mean   :3.352
##  3rd Qu.:4.000    3rd Qu.:5.000   3rd Qu.:4.000
##  Max.   :5.000    Max.   :5.000   Max.   :5.000

Data Pre-processing

Renaming the variables

new_vars = c("CustomerID", "Gender", "CustomerType", "Age", "TypeTravel",
             "Class", "Flight_Distance", "Departure_DelayTime",
             "Arrival_DelayTime", "Satisfaction", "Seat_Comfort",
             "Departure_ArrivalTime_Covenient", "Food_Drink", "Gate_Location",
             "Inflight_WifiService", "Inflight_Entertainment", "Online_Support",
             "Ease_Of_OnlineBooking", "Onboard_Service", "LegRoom_Service",
             "Baggage_Handling", "CheckIn_Service", "Cleanliness",
             "Online_Boarding")

colnames(aviation_data) = new_vars
names(aviation_data)

##  [1] "CustomerID"                      "Gender"
##  [3] "CustomerType"                    "Age"
##  [5] "TypeTravel"                      "Class"
##  [7] "Flight_Distance"                 "Departure_DelayTime"
##  [9] "Arrival_DelayTime"               "Satisfaction"
## [11] "Seat_Comfort"                    "Departure_ArrivalTime_Covenient"
## [13] "Food_Drink"                      "Gate_Location"
## [15] "Inflight_WifiService"            "Inflight_Entertainment"
## [17] "Online_Support"                  "Ease_Of_OnlineBooking"
## [19] "Onboard_Service"                 "LegRoom_Service"
## [21] "Baggage_Handling"                "CheckIn_Service"
## [23] "Cleanliness"                     "Online_Boarding"

# Converting Arrival_DelayTime to numeric
aviation_data$Arrival_DelayTime = as.numeric(as.character(aviation_data$Arrival_DelayTime))

## Warning: NAs introduced by coercion
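
This warning means some entries in the character column could not be parsed as numbers and became NA. A minimal sketch for finding such entries before converting is given below; the vector `arrival_chr` and its values are hypothetical, and the report does not state which entries triggered the warning in the survey data.

```r
# Hypothetical character column with a blank and a non-numeric entry.
arrival_chr <- c("0", "15", "", "n/a", "48")

# Convert, silencing the coercion warning for this demonstration.
arrival_num <- suppressWarnings(as.numeric(arrival_chr))
print(arrival_num)

# Entries that were present as text but did not parse as numbers.
bad <- !is.na(arrival_chr) & is.na(arrival_num)
print(arrival_chr[bad])
print(sum(bad))
```

Inspecting the failed entries first makes it clear whether the resulting NAs are genuine missing values or a formatting problem in the source file.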

summary(aviation_data)

##    CustomerID        Gender             CustomerType         Age
##  Min.   :149965   Female:46186   disloyal Customer:14921   Min.   : 7.00
##  1st Qu.:172694   Male  :44731   Loyal Customer   :66897   1st Qu.:27.00
##  Median :195423                  NA's             : 9099   Median :40.00
##  Mean   :195423                                            Mean   :39.45
##  3rd Qu.:218152                                            3rd Qu.:51.00
##  Max.   :240881                                            Max.   :85.00
##
##            TypeTravel         Class       Flight_Distance  Departure_DelayTime
##  Business travel:56481   Business:43535   Min.   :  50     Min.   :   0.00
##  Personal Travel:25348   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  NA's           : 9088   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
##

## Arrival_DelayTime Satisfaction Seat_Comfort


## Min. : 0.00 neutral or dissatisfied:41156 Min. : 0.000
## 1st Qu.: 0.00 satisfied :49761 1st Qu.: 2.000
## Median : 0.00 Median : 3.000
## Mean : 15.06 Mean : 2.839
## 3rd Qu.: 13.00 3rd Qu.: 4.000
## Max. :1584.00 Max. :10.000
## NA's :284
## Departure_ArrivalTime_Covenient Food_Drink Gate_Location
## Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.00
## Mean :2.993 Mean :2.85 Mean :2.99
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.00
## NA's :8244 NA's :8181
## Inflight_WifiService Inflight_Entertainment Online_Support
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.252 Mean :3.384 Mean :3.519
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
##  Ease_Of_OnlineBooking  Onboard_Service  LegRoom_Service  Baggage_Handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.000
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000
##  Mean   :3.476          Mean   :3.467    Mean   :3.487    Mean   :3.697
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000
##                         NA's   :7179

## CheckIn_Service Cleanliness Online_Boarding


## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.341 Mean :3.708 Mean :3.352
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##

Removing attribute CustomerID


aviation_data$CustomerID = NULL
dim(aviation_data)

## [1] 90917 23

Missing Value (NA) Treatment


any(is.na(aviation_data))

## [1] TRUE

sum(is.na(aviation_data))

## [1] 42075

colSums(is.na(aviation_data))

## Gender CustomerType
## 0 9099
## Age TypeTravel
## 0 9088
## Class Flight_Distance
## 0 0
## Departure_DelayTime Arrival_DelayTime
## 0 284
## Satisfaction Seat_Comfort
## 0 0
## Departure_ArrivalTime_Covenient Food_Drink
## 8244 8181
## Gate_Location Inflight_WifiService
## 0 0
## Inflight_Entertainment Online_Support
## 0 0
## Ease_Of_OnlineBooking Onboard_Service
## 0 7179
## LegRoom_Service Baggage_Handling
## 0 0
## CheckIn_Service Cleanliness
## 0 0
## Online_Boarding
## 0


p = function(x){sum(is.na(x))/length(x)*100}
apply(aviation_data, 2, p)

## Gender CustomerType
## 0.0000000 10.0080293
## Age TypeTravel
## 0.0000000 9.9959304
## Class Flight_Distance
## 0.0000000 0.0000000
## Departure_DelayTime Arrival_DelayTime
## 0.0000000 0.3123728
## Satisfaction Seat_Comfort
## 0.0000000 0.0000000
## Departure_ArrivalTime_Covenient Food_Drink
## 9.0676111 8.9983171
## Gate_Location Inflight_WifiService
## 0.0000000 0.0000000
## Inflight_Entertainment Online_Support
## 0.0000000 0.0000000
## Ease_Of_OnlineBooking Onboard_Service
## 0.0000000 7.8962130
## LegRoom_Service Baggage_Handling
## 0.0000000 0.0000000
## CheckIn_Service Cleanliness
## 0.0000000 0.0000000
## Online_Boarding
## 0.0000000
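The same per-column percentages can be obtained without a helper function. A minimal sketch on a made-up data frame (`toy` is hypothetical and only stands in for `aviation_data`):

```r
# Toy data frame standing in for aviation_data (values are made up).
toy <- data.frame(
  rating = c(3, NA, 4, 5),
  type   = factor(c("Loyal", NA, NA, "disloyal")),
  age    = c(25, 40, 31, 58)
)

# is.na() on a data frame returns a logical matrix, so the column means
# are the shares of missing values; multiply by 100 for percentages.
pct_missing <- colMeans(is.na(toy)) * 100
print(pct_missing)
```

This avoids `apply()`, which first converts the whole data frame to a character matrix before counting the NAs.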

##Removing NAs using listwise deletion


aviation_data1 = aviation_data
aviation_clean_data = na.omit(aviation_data1)
dim(aviation_clean_data)

## [1] 53634 23

any(is.na(aviation_clean_data))

## [1] FALSE
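The cost of listwise deletion can be read from the two `dim()` outputs. A short worked calculation using the row counts reported above, followed by a made-up example (`toy` is hypothetical) showing that `na.omit()` keeps exactly the rows flagged by `complete.cases()`:

```r
# Row counts taken from the dim() outputs of this section.
rows_before <- 90917   # aviation_data
rows_after  <- 53634   # aviation_clean_data

rows_dropped <- rows_before - rows_after
pct_dropped  <- round(100 * rows_dropped / rows_before, 1)
print(c(rows_dropped = rows_dropped, pct_dropped = pct_dropped))

# Made-up example: na.omit() keeps only the complete rows.
toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))
kept <- na.omit(toy)
print(nrow(kept) == sum(complete.cases(toy)))
```

Roughly 41% of the passengers are discarded, which is the motivation for trying imputation next.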

summary(aviation_clean_data)

##     Gender             CustomerType         Age
##  Female:27370   disloyal Customer: 9741   Min.   : 7.0
##  Male  :26264   Loyal Customer   :43893   1st Qu.:27.0
##                                           Median :40.0
##                                           Mean   :39.4
##                                           3rd Qu.:51.0
##                                           Max.   :85.0
##            TypeTravel         Class       Flight_Distance  Departure_DelayTime
##  Business travel:36982   Business:25647   Min.   :  50     Min.   :   0.00
##  Personal Travel:16652   Eco     :24078   1st Qu.:1360     1st Qu.:   0.00
##                          Eco Plus: 3909   Median :1928     Median :   0.00
##                                           Mean   :1982     Mean   :  14.72
##                                           3rd Qu.:2540     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1128.00
## Arrival_DelayTime Satisfaction Seat_Comfort
## Min. : 0.00 neutral or dissatisfied:24243 Min. :0.000
## 1st Qu.: 0.00 satisfied :29391 1st Qu.:2.000
## Median : 0.00 Median :3.000
## Mean : 15.14 Mean :2.838
## 3rd Qu.: 13.00 3rd Qu.:4.000
## Max. :1115.00 Max. :5.000
## Departure_ArrivalTime_Covenient Food_Drink Gate_Location
## Min. :0.00 Min. :0.000 Min. :0.000
## 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000
## Median :3.00 Median :3.000 Median :3.000
## Mean :2.99 Mean :2.846 Mean :2.987
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.00 Max. :5.000 Max. :5.000
## Inflight_WifiService Inflight_Entertainment Online_Support
## Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.256 Mean :3.384 Mean :3.523
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
##  Ease_Of_OnlineBooking  Onboard_Service  LegRoom_Service  Baggage_Handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.0
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.0
##  Median :4.000          Median :4.000    Median :4.000    Median :4.0
##  Mean   :3.481          Mean   :3.467    Mean   :3.495    Mean   :3.7
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.0
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.0

## CheckIn_Service Cleanliness Online_Boarding


## Min. :1.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.338 Mean :3.712 Mean :3.355
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000


##Handling NAs using imputation


#Continuous Variables: Simple Imputation (Mean Substitution)

aviation_data2 = aviation_data
aviation_data2$Onboard_Service[which(is.na(aviation_data2$Onboard_Service))] =
  mean(aviation_data2$Onboard_Service, na.rm = TRUE)

aviation_data2$Departure_ArrivalTime_Covenient[which(is.na(aviation_data2$Departure_ArrivalTime_Covenient))] =
  mean(aviation_data2$Departure_ArrivalTime_Covenient, na.rm = TRUE)

aviation_data2$Food_Drink[which(is.na(aviation_data2$Food_Drink))] =
  mean(aviation_data2$Food_Drink, na.rm = TRUE)

aviation_data2$Arrival_DelayTime[which(is.na(aviation_data2$Arrival_DelayTime))] =
  mean(aviation_data2$Arrival_DelayTime, na.rm = TRUE)
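The four assignments follow the same pattern, so they could be wrapped in a small helper. A minimal sketch, assuming a hypothetical function `impute_mean()` and made-up toy values; it is not part of the original analysis:

```r
# Hypothetical helper: replace NAs in the named columns by the column mean.
impute_mean <- function(df, cols) {
  for (col in cols) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
  }
  df
}

# Made-up ratings with one missing value per column.
toy <- data.frame(
  Onboard_Service = c(3, NA, 1, 2, 5),
  Food_Drink      = c(0, 0, NA, 4, 4)
)

toy_imputed <- impute_mean(toy, c("Onboard_Service", "Food_Drink"))
print(toy_imputed)
```

Because the survey ratings are whole numbers, the imputed means are generally fractional, which is why these columns change from `int` to `num` in the later `str()` output.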

summary(aviation_data2)

##     Gender             CustomerType         Age
##  Female:46186   disloyal Customer:14921   Min.   : 7.00
##  Male  :44731   Loyal Customer   :66897   1st Qu.:27.00
##                 NA's             : 9099   Median :40.00
##                                           Mean   :39.45
##                                           3rd Qu.:51.00
##                                           Max.   :85.00
##            TypeTravel         Class       Flight_Distance  Departure_DelayTime
##  Business travel:56481   Business:43535   Min.   :  50     Min.   :   0.00
##  Personal Travel:25348   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  NA's           : 9088   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
## Arrival_DelayTime Satisfaction Seat_Comfort
## Min. : 0.00 neutral or dissatisfied:41156 Min. : 0.000
## 1st Qu.: 0.00 satisfied :49761 1st Qu.: 2.000
## Median : 0.00 Median : 3.000
## Mean : 15.06 Mean : 2.839
## 3rd Qu.: 13.00 3rd Qu.: 4.000
## Max. :1584.00 Max. :10.000
## Departure_ArrivalTime_Covenient Food_Drink Gate_Location
## Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.00


## Mean :2.993 Mean :2.85 Mean :2.99


## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.00
## Inflight_WifiService Inflight_Entertainment Online_Support
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.252 Mean :3.384 Mean :3.519
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
##  Ease_Of_OnlineBooking  Onboard_Service  LegRoom_Service  Baggage_Handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.000
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000
##  Mean   :3.476          Mean   :3.467    Mean   :3.487    Mean   :3.697
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000

## CheckIn_Service Cleanliness Online_Boarding


## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.341 Mean :3.708 Mean :3.352
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000

Categorical Variables: Mice Imputation


aviation_imputed_data = complete(mice(aviation_data2, m=3, maxit= 4), 3)

##
## iter imp variable
## 1 1 CustomerType TypeTravel
## 1 2 CustomerType TypeTravel
## 1 3 CustomerType TypeTravel
## 2 1 CustomerType TypeTravel
## 2 2 CustomerType TypeTravel
## 2 3 CustomerType TypeTravel
## 3 1 CustomerType TypeTravel
## 3 2 CustomerType TypeTravel
## 3 3 CustomerType TypeTravel
## 4 1 CustomerType TypeTravel
## 4 2 CustomerType TypeTravel
## 4 3 CustomerType TypeTravel

any(is.na(aviation_imputed_data))


## [1] FALSE
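Multiple imputation draws random values, so two runs of the call above can assign slightly different categories to the same passengers. A minimal sketch, assuming the `mice` package is installed and using made-up data (`toy`, `impute_once()` and the seed value are hypothetical), of how the `seed` argument of `mice()` makes a run repeatable:

```r
if (requireNamespace("mice", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mice))

  # Made-up data: two numeric predictors and a two-level factor with NAs.
  set.seed(1)
  n <- 200
  toy <- data.frame(
    age   = round(runif(n, 18, 70)),
    delay = round(rexp(n, 1 / 15)),
    type  = factor(sample(c("Business travel", "Personal Travel"),
                          n, replace = TRUE))
  )
  toy$type[sample(n, 20)] <- NA

  # Same settings as the report (m = 3, maxit = 4), plus a fixed seed.
  impute_once <- function(seed) {
    imp <- mice(toy, m = 3, maxit = 4, seed = seed, printFlag = FALSE)
    complete(imp, 3)
  }

  run_a <- impute_once(2024)
  run_b <- impute_once(2024)
  same_result <- identical(run_a, run_b)
  print(same_result)
}
```

Fixing the seed does not make the imputed values more correct; it only ensures that the category counts in the summary can be regenerated exactly.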

summary(aviation_imputed_data)

##     Gender             CustomerType         Age
##  Female:46186   disloyal Customer:16666   Min.   : 7.00
##  Male  :44731   Loyal Customer   :74251   1st Qu.:27.00
##                                           Median :40.00
##                                           Mean   :39.45
##                                           3rd Qu.:51.00
##                                           Max.   :85.00
##            TypeTravel         Class       Flight_Distance  Departure_DelayTime
##  Business travel:62785   Business:43535   Min.   :  50     Min.   :   0.00
##  Personal Travel:28132   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##                          Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
## Arrival_DelayTime Satisfaction Seat_Comfort
## Min. : 0.00 neutral or dissatisfied:41156 Min. : 0.000
## 1st Qu.: 0.00 satisfied :49761 1st Qu.: 2.000
## Median : 0.00 Median : 3.000
## Mean : 15.06 Mean : 2.839
## 3rd Qu.: 13.00 3rd Qu.: 4.000
## Max. :1584.00 Max. :10.000
## Departure_ArrivalTime_Covenient Food_Drink Gate_Location
## Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.00
## Mean :2.993 Mean :2.85 Mean :2.99
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.00
## Inflight_WifiService Inflight_Entertainment Online_Support
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.252 Mean :3.384 Mean :3.519
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
##  Ease_Of_OnlineBooking  Onboard_Service  LegRoom_Service  Baggage_Handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.000
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000
##  Mean   :3.476          Mean   :3.467    Mean   :3.487    Mean   :3.697
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000

## CheckIn_Service Cleanliness Online_Boarding


## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.341 Mean :3.708 Mean :3.352
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000

Descriptive Statistics of Missing Values’ Treatment


t.test(aviation_clean_data$Onboard_Service,
       aviation_clean_data$Departure_ArrivalTime_Covenient, paired = TRUE)

##
## Paired t-test
##
## data: aviation_clean_data$Onboard_Service and aviation_clean_data$Departure_ArrivalTime_Covenient
## t = 57.458, df = 53633, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4608110 0.4933598
## sample estimates:
## mean of the differences
## 0.4770854

t.test(aviation_imputed_data$Onboard_Service,
       aviation_imputed_data$Departure_ArrivalTime_Covenient, paired = TRUE)

##
## Paired t-test
##
## data: aviation_imputed_data$Onboard_Service and aviation_imputed_data$Departure_ArrivalTime_Covenient
## t = 77.317, df = 90916, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4612551 0.4852492
## sample estimates:
## mean of the differences
## 0.4732521
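A complementary way to judge the treatment is to compare the same variable before and after imputation. This check is not part of the original analysis; the sketch below uses made-up ratings (`ratings` is hypothetical) to show the general property that mean substitution leaves the mean unchanged but shrinks the spread:

```r
# Made-up 0-5 ratings with 40 values set to missing.
set.seed(42)
ratings <- sample(0:5, 500, replace = TRUE)
ratings[sample(500, 40)] <- NA

complete_cases <- ratings[!is.na(ratings)]
mean_imputed   <- ifelse(is.na(ratings), mean(ratings, na.rm = TRUE), ratings)

# The mean is preserved; the standard deviation becomes smaller because
# every imputed value sits exactly at the centre of the distribution.
print(c(mean_complete = mean(complete_cases), mean_imputed = mean(mean_imputed)))
print(c(sd_complete = sd(complete_cases), sd_imputed = sd(mean_imputed)))
```

The near-identical mean differences in the two t-tests above (0.477 and 0.473) are in line with this behaviour.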

Outliers Treatment

#Departure Delay Time


boxplot(aviation_imputed_data$Departure_DelayTime)

quantile(aviation_imputed_data$Departure_DelayTime)

## 0% 25% 50% 75% 100%


## 0 0 0 12 1592

#Flight Distance
boxplot(aviation_imputed_data$Flight_Distance)

quantile(aviation_imputed_data$Flight_Distance)

## 0% 25% 50% 75% 100%


## 50 1360 1927 2542 6950

Arrival Delay Time


boxplot(aviation_imputed_data$Arrival_DelayTime)


quantile(aviation_imputed_data$Arrival_DelayTime)

## 0% 25% 50% 75% 100%


## 0 0 0 13 1584

Seat Comfort
boxplot(aviation_imputed_data$Seat_Comfort)

quantile(aviation_imputed_data$Seat_Comfort)

## 0% 25% 50% 75% 100%


## 0 2 3 4 10

Capping upper outliers at a fixed upper limit per variable


airline_data = aviation_imputed_data

airline_data$Departure_DelayTime = ifelse(airline_data$Departure_DelayTime > 12, 12,
                                          airline_data$Departure_DelayTime)

boxplot(airline_data$Departure_DelayTime)


airline_data$Flight_Distance = ifelse(airline_data$Flight_Distance > 4300, 4300,
                                      airline_data$Flight_Distance)

boxplot(airline_data$Flight_Distance)

airline_data$Arrival_DelayTime = ifelse(airline_data$Arrival_DelayTime > 15, 15,
                                        airline_data$Arrival_DelayTime)

boxplot(airline_data$Arrival_DelayTime)

airline_data$Seat_Comfort = ifelse(airline_data$Seat_Comfort > 5, 5,
                                   airline_data$Seat_Comfort)

boxplot(airline_data$Seat_Comfort)
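The report does not state how the caps were chosen. As an illustration only, the conventional boxplot upper fence (Q3 + 1.5 x IQR) for `Flight_Distance`, computed from the quartiles printed above, lands close to the cap of 4300 used here; whether the cap was derived this way is an assumption. The helper `cap_upper()` is hypothetical:

```r
# Quartiles of Flight_Distance as reported by quantile() above.
q1 <- 1360
q3 <- 2542

# Conventional boxplot upper fence: Q3 + 1.5 * IQR.
upper_fence <- q3 + 1.5 * (q3 - q1)
print(upper_fence)

# Hypothetical helper: values above the cap are set to the cap.
cap_upper <- function(x, cap) pmin(x, cap)

capped <- cap_upper(c(50, 1927, 4300, 6950), upper_fence)
print(capped)
```

`pmin()` gives the same result as the `ifelse(x > cap, cap, x)` pattern used above for values without NAs, in a single vectorised call.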


Exploratory Data Analysis


str(airline_data)

## 'data.frame': 90917 obs. of 23 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 2 1 1 ...
## $ CustomerType : Factor w/ 2 levels "disloyal Customer",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : int 65 15 60 70 30 66 10 22 58 34 ...
## $ TypeTravel : Factor w/ 2 levels "Business travel",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Class : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Flight_Distance : num 265 2138 623 354 1894 ...
## $ Departure_DelayTime : num 0 0 0 0 0 12 0 12 12 0 ...
## $ Arrival_DelayTime : num 0 0 0 0 0 15 0 15 15 0 ...
## $ Satisfaction : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Seat_Comfort : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure_ArrivalTime_Covenient: num 0 0 2.99 0 0 ...
## $ Food_Drink : num 0 0 0 0 0 ...
## $ Gate_Location : int 2 3 3 3 3 3 3 3 3 4 ...
## $ Inflight_WifiService : int 2 2 3 4 2 2 2 2 3 2 ...
## $ Inflight_Entertainment : int 4 0 4 3 0 5 0 0 3 0 ...
## $ Online_Support : int 2 2 3 4 2 5 2 2 3 2 ...
## $ Ease_Of_OnlineBooking : int 3 2 1 2 2 5 2 2 3 2 ...
## $ Onboard_Service : num 3 3.47 1 2 5 ...
## $ LegRoom_Service : int 0 3 0 0 4 0 3 4 0 2 ...
## $ Baggage_Handling : int 3 4 1 2 5 5 4 5 1 5 ...
## $ CheckIn_Service : int 5 4 4 4 5 5 5 3 2 2 ...
## $ Cleanliness : int 3 4 1 2 4 5 4 4 3 5 ...
## $ Online_Boarding : int 2 2 3 5 2 3 2 2 5 2 ...

Addition of new variables


dmy = dummyVars("~.", data = airline_data)
airline_data_trans = data.frame(predict(dmy, newdata = airline_data))
dim(airline_data_trans)

## [1] 90917 29
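The jump from 23 to 29 columns follows from the factor levels: the five factors (Gender, CustomerType, TypeTravel, Class, Satisfaction) have 2 + 2 + 2 + 3 + 2 = 11 levels, so 23 - 5 + 11 = 29. A minimal base-R sketch of the same full one-hot encoding on made-up data; it uses `model.matrix()` rather than `caret::dummyVars()`, and `toy` is hypothetical:

```r
# Made-up data with two factors and one numeric column.
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Female")),
  Class  = factor(c("Business", "Eco", "Eco Plus")),
  Age    = c(65, 15, 60)
)

# Expected width: one column per factor level, one per numeric variable.
expected_cols <- sum(sapply(toy, function(col) {
  if (is.factor(col)) nlevels(col) else 1
}))

# Full indicator coding: identity contrasts keep every level of every factor.
full_contrasts <- lapply(toy[sapply(toy, is.factor)], contrasts, contrasts = FALSE)
one_hot <- model.matrix(~ . - 1, data = toy, contrasts.arg = full_contrasts)

print(expected_cols)
print(colnames(one_hot))
```

Keeping every level produces perfectly collinear pairs such as `Gender.Female` and `Gender.Male`, which is why correlations of exactly -1 appear in the correlation list later in this section.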

str(airline_data_trans)

## 'data.frame': 90917 obs. of 29 variables:
## $ Gender.Female : num 1 1 1 1 0 1 0 0 1 1 ...
## $ Gender.Male : num 0 0 0 0 1 0 1 1 0 0 ...
## $ CustomerType.disloyal.Customer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CustomerType.Loyal.Customer : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : num 65 15 60 70 30 66 10 22 58 34 ...
## $ TypeTravel.Business.travel : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TypeTravel.Personal.Travel : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Class.Eco : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class.Eco.Plus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Flight_Distance : num 265 2138 623 354 1894 ...
## $ Departure_DelayTime : num 0 0 0 0 0 12 0 12 12 0 ...
## $ Arrival_DelayTime : num 0 0 0 0 0 15 0 15 15 0 ...
## $ Satisfaction.neutral.or.dissatisfied: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Satisfaction.satisfied : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Seat_Comfort : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure_ArrivalTime_Covenient : num 0 0 2.99 0 0 ...
## $ Food_Drink : num 0 0 0 0 0 ...
## $ Gate_Location : num 2 3 3 3 3 3 3 3 3 4 ...
## $ Inflight_WifiService : num 2 2 3 4 2 2 2 2 3 2 ...
## $ Inflight_Entertainment : num 4 0 4 3 0 5 0 0 3 0 ...
## $ Online_Support : num 2 2 3 4 2 5 2 2 3 2 ...
## $ Ease_Of_OnlineBooking : num 3 2 1 2 2 5 2 2 3 2 ...
## $ Onboard_Service : num 3 3.47 1 2 5 ...
## $ LegRoom_Service : num 0 3 0 0 4 0 3 4 0 2 ...
## $ Baggage_Handling : num 3 4 1 2 5 5 4 5 1 5 ...
## $ CheckIn_Service : num 5 4 4 4 5 5 5 3 2 2 ...
## $ Cleanliness : num 3 4 1 2 4 5 4 4 3 5 ...
## $ Online_Boarding : num 2 2 3 5 2 3 2 2 5 2 ...


Correlation
corx = cor(airline_data_trans, use="pairwise.complete.obs")
corrplot(corx)

cor.prob <- function (X, dfr = nrow(X) - 2) {
  R <- cor(X, use="pairwise.complete.obs")
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr/(1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}
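`cor.prob()` stores correlations below the diagonal and p-values above it. The p-value comes from the F statistic r^2 * df / (1 - r^2), which is the square of the usual t statistic for a correlation. A minimal check on made-up data (the vectors `x` and `y` are hypothetical) that this agrees with `cor.test()`:

```r
# Made-up pair of correlated variables.
set.seed(7)
x <- rnorm(50)
y <- 0.4 * x + rnorm(50)

r   <- cor(x, y)
dfr <- length(x) - 2

# p-value as computed inside cor.prob().
Fstat    <- r^2 * dfr / (1 - r^2)
p_from_F <- 1 - pf(Fstat, 1, dfr)

# Reference p-value from the standard two-sided correlation test.
p_from_test <- cor.test(x, y)$p.value

print(c(p_from_F = p_from_F, p_from_test = p_from_test))
```

With about 90,000 observations even very weak correlations reach p-values that print as 0, so the size of the coefficient is the more informative column in the list that follows.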

flattenSquareMatrix <- function(m) {
  # is.matrix() avoids comparing against the length-2 class vector
  # c("matrix", "array") that R 4.x returns for matrices
  if (!is.matrix(m) || nrow(m) != ncol(m)) stop("Must be a square matrix.")
  if (!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.")
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor = t(m)[ut],
             p = m[ut])
}

corMasterList <- flattenSquareMatrix(cor.prob(airline_data_trans))

print(head(corMasterList,10))

##                                 i                              j          cor           p
## 1                   Gender.Female                    Gender.Male -1.000000000 0.000000000
## 2                   Gender.Female CustomerType.disloyal.Customer  0.031651529 0.000000000
## 3                     Gender.Male CustomerType.disloyal.Customer -0.031651529 0.000000000
## 4                   Gender.Female    CustomerType.Loyal.Customer -0.031651529 0.000000000
## 5                     Gender.Male    CustomerType.Loyal.Customer  0.031651529 0.000000000
## 6  CustomerType.disloyal.Customer    CustomerType.Loyal.Customer -1.000000000 0.000000000
## 7                   Gender.Female                            Age -0.009800715 0.003124809
## 8                     Gender.Male                            Age  0.009800715 0.003124809
## 9  CustomerType.disloyal.Customer                            Age -0.284396085 0.000000000
## 10    CustomerType.Loyal.Customer                            Age  0.284396085 0.000000000

corList <- corMasterList[order(corMasterList$cor),]

selectedSub <- subset(corList, (abs(cor) > 0.2 & j == 'Onboard_Service'))

bestSub <- sapply(strsplit(as.character(selectedSub$i),'[.]'), "[", 1)

pairs.panels(airline_data[c(bestSub,'Onboard_Service')])


Capstone Project Notes I

unlink("C:/Users/OLIVIA/Documents/R/win-library/4.0/00LOCK", recursive = TRUE)

Load Packages
library(ggplot2)
library(backports)
library(corrplot)
library(caTools)
library(rpart)
library(rattle)
library(dummies)
library(mice)
library(gam)
library(caret)
library(psych)
library(randomForest)

Set Working Directory


setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")

Import Data
aviation_data = read.csv("Marketing Project-Aviation Data.csv", header = TRUE,
                         na.strings = c(""))
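The `na.strings = c("")` argument matters for the text columns: by default `read.csv()` keeps a blank field in a character column as an empty string instead of NA. A minimal sketch with an inline, made-up CSV (`csv_text` is hypothetical):

```r
# Made-up CSV with one blank CustomerType field.
csv_text <- "CustomerType,Age\nLoyal Customer,65\n,15\ndisloyal Customer,60\n"

# Default settings: the blank stays an empty string.
toy_default <- read.csv(text = csv_text, header = TRUE)

# With na.strings = "": the blank is read as NA.
toy_blank_na <- read.csv(text = csv_text, header = TRUE, na.strings = c(""))

print(sum(is.na(toy_default$CustomerType)))
print(sum(is.na(toy_blank_na$CustomerType)))
```

Without this setting the blank entries would later become an extra, empty factor level rather than missing values that can be counted and imputed.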

Sanity Checks
dim(aviation_data)

## [1] 90917 24

colnames(aviation_data)

##  [1] "CustomerID"             "Gender"
##  [3] "CustomerType"           "Age"
##  [5] "TypeTravel"             "Class"
##  [7] "Flight_Distance"        "DepartureDelayin_Mins"
##  [9] "ArrivalDelayin_Mins"    "Satisfaction"
## [11] "Seat_comfort"           "Departure.Arrival.time_convenient"
## [13] "Food_drink"             "Gate_location"
## [15] "Inflightwifi_service"   "Inflight_entertainment"
## [17] "Online_support"         "Ease_of_Onlinebooking"
## [19] "Onboard_service"        "Leg_room_service"
## [21] "Baggage_handling"       "Checkin_service"
## [23] "Cleanliness"            "Online_boarding"

Descriptive Statistics
str(aviation_data)

## 'data.frame': 90917 obs. of 24 variables:
## $ CustomerID : int 149965 149966 149967 149968 149969 149970 149971 149972 149973 149974 ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ CustomerType : chr "Loyal Customer" "Loyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 65 15 60 70 30 66 10 22 58 34 ...
## $ TypeTravel : chr "Personal Travel" "Personal Travel" "Personal Travel" "Personal Travel" ...
## $ Class : chr "Eco" "Eco" "Eco" "Eco" ...
## $ Flight_Distance : int 265 2138 623 354 1894 227 1812 1556 104 3633 ...
## $ DepartureDelayin_Mins : int 0 0 0 0 0 17 0 30 47 0 ...
## $ ArrivalDelayin_Mins : chr "0" "0" "0" "0" ...
## $ Satisfaction : chr "satisfied" "satisfied" "satisfied" "satisfied" ...
## $ Seat_comfort : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure.Arrival.time_convenient: int 0 0 NA 0 0 0 0 NA 0 0 ...
## $ Food_drink : int 0 0 0 0 0 NA NA 0 0 0 ...
## $ Gate_location : int 2 3 3 3 3 3 3 3 3 4 ...
## $ Inflightwifi_service : int 2 2 3 4 2 2 2 2 3 2 ...
## $ Inflight_entertainment : int 4 0 4 3 0 5 0 0 3 0 ...
## $ Online_support : int 2 2 3 4 2 5 2 2 3 2 ...
## $ Ease_of_Onlinebooking : int 3 2 1 2 2 5 2 2 3 2 ...
## $ Onboard_service : int 3 NA 1 2 5 5 3 2 3 3 ...
## $ Leg_room_service : int 0 3 0 0 4 0 3 4 0 2 ...
## $ Baggage_handling : int 3 4 1 2 5 5 4 5 1 5 ...
## $ Checkin_service : int 5 4 4 4 5 5 5 3 2 2 ...
## $ Cleanliness : int 3 4 1 2 4 5 4 4 3 5 ...
## $ Online_boarding : int 2 2 3 5 2 3 2 2 5 2 ...

aviation_data$Gender = as.factor(as.character(aviation_data$Gender))
aviation_data$CustomerType = as.factor(as.character(aviation_data$CustomerType))
aviation_data$TypeTravel = as.factor(as.character(aviation_data$TypeTravel))
aviation_data$Class = as.factor(as.character(aviation_data$Class))
aviation_data$Satisfaction = as.factor(as.character(aviation_data$Satisfaction))
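The five conversions could also be written as one step over a vector of column names. A minimal sketch on made-up data (`toy` and `factor_cols` are hypothetical); the delay column is deliberately left out because it is meant to become numeric, not a factor:

```r
# Made-up character columns, as read.csv() returns them in R 4.x.
toy <- data.frame(
  Gender              = c("Female", "Male", "Female"),
  Class               = c("Eco", "Business", "Eco Plus"),
  ArrivalDelayin_Mins = c("0", "17", "30"),
  stringsAsFactors    = FALSE
)

# Convert only the categorical columns.
factor_cols <- c("Gender", "Class")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Naming the columns explicitly avoids turning a numeric column that was read as text into a factor by accident.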

summary(aviation_data)

##    CustomerID        Gender             CustomerType         Age
##  Min.   :149965   Female:46186   disloyal Customer:14921   Min.   : 7.00
##  1st Qu.:172694   Male  :44731   Loyal Customer   :66897   1st Qu.:27.00
##  Median :195423                  NA's             : 9099   Median :40.00
##  Mean   :195423                                            Mean   :39.45
##  3rd Qu.:218152                                            3rd Qu.:51.00
##  Max.   :240881                                            Max.   :85.00
##
##            TypeTravel         Class       Flight_Distance  DepartureDelayin_Mins
##  Business travel:56481   Business:43535   Min.   :  50     Min.   :   0.00
##  Personal Travel:25348   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  NA's           : 9088   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
##
##  ArrivalDelayin_Mins                  Satisfaction    Seat_comfort
##  Length:90917         neutral or dissatisfied:41156   Min.   : 0.000
##  Class :character     satisfied              :49761   1st Qu.: 2.000
##  Mode  :character                                     Median : 3.000
##                                                       Mean   : 2.839
##                                                       3rd Qu.: 4.000
##                                                       Max.   :10.000
##


## Departure.Arrival.time_convenient Food_drink Gate_location


## Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.00
## Mean :2.993 Mean :2.85 Mean :2.99
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.00
## NA's :8244 NA's :8181
## Inflightwifi_service Inflight_entertainment Online_support
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.252 Mean :3.384 Mean :3.519
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
##  Ease_of_Onlinebooking  Onboard_service  Leg_room_service  Baggage_handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000     Min.   :1.000
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000     1st Qu.:3.000
##  Median :4.000          Median :4.000    Median :4.000     Median :4.000
##  Mean   :3.476          Mean   :3.467    Mean   :3.487     Mean   :3.697
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000     3rd Qu.:5.000
##  Max.   :5.000          Max.   :5.000    Max.   :5.000     Max.   :5.000
##                         NA's   :7179

## Checkin_service Cleanliness Online_boarding


## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.341 Mean :3.708 Mean :3.352
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##

Data Pre-processing

Renaming the variables


new_vars = c("CustomerID", "Gender", "CustomerType", "Age", "TypeTravel",
             "Class", "Flight_Distance", "Departure_DelayTime",
             "Arrival_DelayTime", "Satisfaction", "Seat_Comfort",
             "Departure_ArrivalTime_Covenient", "Food_Drink", "Gate_Location",
             "Inflight_WifiService", "Inflight_Entertainment", "Online_Support",
             "Ease_Of_OnlineBooking", "Onboard_Service", "LegRoom_Service",
             "Baggage_Handling", "CheckIn_Service", "Cleanliness",
             "Online_Boarding")

colnames(aviation_data) = new_vars
names(aviation_data)

##  [1] "CustomerID"             "Gender"
##  [3] "CustomerType"           "Age"
##  [5] "TypeTravel"             "Class"
##  [7] "Flight_Distance"        "Departure_DelayTime"
##  [9] "Arrival_DelayTime"      "Satisfaction"
## [11] "Seat_Comfort"           "Departure_ArrivalTime_Covenient"
## [13] "Food_Drink"             "Gate_Location"
## [15] "Inflight_WifiService"   "Inflight_Entertainment"
## [17] "Online_Support"         "Ease_Of_OnlineBooking"
## [19] "Onboard_Service"        "LegRoom_Service"
## [21] "Baggage_Handling"       "CheckIn_Service"
## [23] "Cleanliness"            "Online_Boarding"

#Converting Arrival_DelayTime to numeric


aviation_data$Arrival_DelayTime = as.numeric(as.character(aviation_data$Arrival_DelayTime))

## Warning: NAs introduced by coercion

summary(aviation_data)

##    CustomerID        Gender             CustomerType         Age
##  Min.   :149965   Female:46186   disloyal Customer:14921   Min.   : 7.00
##  1st Qu.:172694   Male  :44731   Loyal Customer   :66897   1st Qu.:27.00
##  Median :195423                  NA's             : 9099   Median :40.00
##  Mean   :195423                                            Mean   :39.45
##  3rd Qu.:218152                                            3rd Qu.:51.00
##  Max.   :240881                                            Max.   :85.00
##
##            TypeTravel         Class       Flight_Distance  Departure_DelayTime
##  Business travel:56481   Business:43535   Min.   :  50     Min.   :   0.00
##  Personal Travel:25348   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  NA's           : 9088   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
##

## Arrival_DelayTime Satisfaction Seat_Comfort


## Min. : 0.00 neutral or dissatisfied:41156 Min. : 0.000
## 1st Qu.: 0.00 satisfied :49761 1st Qu.: 2.000
## Median : 0.00 Median : 3.000
## Mean : 15.06 Mean : 2.839
## 3rd Qu.: 13.00 3rd Qu.: 4.000
## Max. :1584.00 Max. :10.000
## NA's :284
## Departure_ArrivalTime_Covenient Food_Drink Gate_Location
## Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.00
## Mean :2.993 Mean :2.85 Mean :2.99
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.00
## NA's :8244 NA's :8181
## Inflight_WifiService Inflight_Entertainment Online_Support
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.252 Mean :3.384 Mean :3.519
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
##
##  Ease_Of_OnlineBooking  Onboard_Service  LegRoom_Service  Baggage_Handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.000
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000
##  Mean   :3.476          Mean   :3.467    Mean   :3.487    Mean   :3.697
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000
##                         NA's   :7179

## CheckIn_Service Cleanliness Online_Boarding


## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.341 Mean :3.708 Mean :3.352
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000
##

Removing attribute CustomerID


aviation_data$CustomerID = NULL
dim(aviation_data)

## [1] 90917 23

Missing Value (NA) Treatment


any(is.na(aviation_data))

## [1] TRUE

sum(is.na(aviation_data))

## [1] 42075

colSums(is.na(aviation_data))

## Gender CustomerType
## 0 9099
## Age TypeTravel
## 0 9088
## Class Flight_Distance
## 0 0
## Departure_DelayTime Arrival_DelayTime
## 0 284
## Satisfaction Seat_Comfort
## 0 0
## Departure_ArrivalTime_Covenient Food_Drink
## 8244 8181
## Gate_Location Inflight_WifiService
## 0 0
## Inflight_Entertainment Online_Support
## 0 0
## Ease_Of_OnlineBooking Onboard_Service
## 0 7179
## LegRoom_Service Baggage_Handling
## 0 0
## CheckIn_Service Cleanliness
## 0 0


## Online_Boarding
## 0

p = function(x){sum(is.na(x))/length(x)*100}
apply(aviation_data, 2, p)

## Gender CustomerType
## 0.0000000 10.0080293
## Age TypeTravel
## 0.0000000 9.9959304
## Class Flight_Distance
## 0.0000000 0.0000000
## Departure_DelayTime Arrival_DelayTime
## 0.0000000 0.3123728
## Satisfaction Seat_Comfort
## 0.0000000 0.0000000
## Departure_ArrivalTime_Covenient Food_Drink
## 9.0676111 8.9983171
## Gate_Location Inflight_WifiService
## 0.0000000 0.0000000
## Inflight_Entertainment Online_Support
## 0.0000000 0.0000000
## Ease_Of_OnlineBooking Onboard_Service
## 0.0000000 7.8962130
## LegRoom_Service Baggage_Handling
## 0.0000000 0.0000000
## CheckIn_Service Cleanliness
## 0.0000000 0.0000000
## Online_Boarding
## 0.0000000

##Handling NAs using imputation


#Continuous Variables: Simple Imputation (Mean Substitution)

aviation_data2 = aviation_data
aviation_data2$Onboard_Service[which(is.na(aviation_data2$Onboard_Service))] =
  mean(aviation_data2$Onboard_Service, na.rm = TRUE)

aviation_data2$Departure_ArrivalTime_Covenient[which(is.na(aviation_data2$Departure_ArrivalTime_Covenient))] =
  mean(aviation_data2$Departure_ArrivalTime_Covenient, na.rm = TRUE)

aviation_data2$Food_Drink[which(is.na(aviation_data2$Food_Drink))] =
  mean(aviation_data2$Food_Drink, na.rm = TRUE)

aviation_data2$Arrival_DelayTime[which(is.na(aviation_data2$Arrival_DelayTime))] =
  mean(aviation_data2$Arrival_DelayTime, na.rm = TRUE)

summary(aviation_data2)

##     Gender             CustomerType         Age
##  Female:46186   disloyal Customer:14921   Min.   : 7.00
##  Male  :44731   Loyal Customer   :66897   1st Qu.:27.00
##                 NA's             : 9099   Median :40.00
##                                           Mean   :39.45
##                                           3rd Qu.:51.00
##                                           Max.   :85.00
##            TypeTravel         Class       Flight_Distance  Departure_DelayTime
##  Business travel:56481   Business:43535   Min.   :  50     Min.   :   0.00
##  Personal Travel:25348   Eco     :40758   1st Qu.:1360     1st Qu.:   0.00
##  NA's           : 9088   Eco Plus: 6624   Median :1927     Median :   0.00
##                                           Mean   :1982     Mean   :  14.69
##                                           3rd Qu.:2542     3rd Qu.:  12.00
##                                           Max.   :6950     Max.   :1592.00
## Arrival_DelayTime Satisfaction Seat_Comfort
## Min. : 0.00 neutral or dissatisfied:41156 Min. : 0.000
## 1st Qu.: 0.00 satisfied :49761 1st Qu.: 2.000
## Median : 0.00 Median : 3.000
## Mean : 15.06 Mean : 2.839
## 3rd Qu.: 13.00 3rd Qu.: 4.000
## Max. :1584.00 Max. :10.000
## Departure_ArrivalTime_Covenient Food_Drink Gate_Location
## Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.00
## Mean :2.993 Mean :2.85 Mean :2.99
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.00
## Inflight_WifiService Inflight_Entertainment Online_Support
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.252 Mean :3.384 Mean :3.519
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
##  Ease_Of_OnlineBooking  Onboard_Service  LegRoom_Service  Baggage_Handling
##  Min.   :0.000          Min.   :0.000    Min.   :0.000    Min.   :1.000
##  1st Qu.:2.000          1st Qu.:3.000    1st Qu.:2.000    1st Qu.:3.000
##  Median :4.000          Median :4.000    Median :4.000    Median :4.000
##  Mean   :3.476          Mean   :3.467    Mean   :3.487    Mean   :3.697
##  3rd Qu.:5.000          3rd Qu.:4.000    3rd Qu.:5.000    3rd Qu.:5.000
##  Max.   :5.000          Max.   :5.000    Max.   :5.000    Max.   :5.000

## CheckIn_Service Cleanliness Online_Boarding


## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.341 Mean :3.708 Mean :3.352
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000

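Categorical Variables: MICE Imputation

# Generate m = 3 imputed datasets with maxit = 4 iterations each, and keep the third completed dataset
aviation_imputed_data = complete(mice(aviation_data2, m=3, maxit= 4), 3)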

##
## iter imp variable
## 1 1 CustomerType TypeTravel
## 1 2 CustomerType TypeTravel
## 1 3 CustomerType TypeTravel
## 2 1 CustomerType TypeTravel
## 2 2 CustomerType TypeTravel
## 2 3 CustomerType TypeTravel
## 3 1 CustomerType TypeTravel
## 3 2 CustomerType TypeTravel
## 3 3 CustomerType TypeTravel
## 4 1 CustomerType TypeTravel
## 4 2 CustomerType TypeTravel
## 4 3 CustomerType TypeTravel

any(is.na(aviation_imputed_data))

## [1] FALSE

summary(aviation_imputed_data)

## Gender CustomerType Age
## Female:46186 disloyal Customer:16631 Min. : 7.00
## Male :44731 Loyal Customer :74286 1st Qu.:27.00
## Median :40.00
## Mean :39.45
## 3rd Qu.:51.00
## Max. :85.00
## TypeTravel Class Flight_Distance Departure_DelayTime
## Business travel:62781 Business:43535 Min. : 50 Min. : 0.00
## Personal Travel:28136 Eco :40758 1st Qu.:1360 1st Qu.: 0.00
## Eco Plus: 6624 Median :1927 Median : 0.00
## Mean :1982 Mean : 14.69
## 3rd Qu.:2542 3rd Qu.: 12.00
## Max. :6950 Max. :1592.00
## Arrival_DelayTime Satisfaction Seat_Comfort
## Min. : 0.00 neutral or dissatisfied:41156 Min. : 0.000
## 1st Qu.: 0.00 satisfied :49761 1st Qu.: 2.000
## Median : 0.00 Median : 3.000
## Mean : 15.06 Mean : 2.839
## 3rd Qu.: 13.00 3rd Qu.: 4.000
## Max. :1584.00 Max. :10.000
## Departure_ArrivalTime_Covenient Food_Drink Gate_Location
## Min. :0.000 Min. :0.00 Min. :0.00
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.00
## Median :3.000 Median :3.00 Median :3.00
## Mean :2.993 Mean :2.85 Mean :2.99
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :5.000 Max. :5.00 Max. :5.00
## Inflight_WifiService Inflight_Entertainment Online_Support
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.252 Mean :3.384 Mean :3.519
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000
## Ease_Of_OnlineBooking Onboard_Service LegRoom_Service Baggage_Handling
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :4.000 Median :4.000 Median :4.000
## Mean :3.476 Mean :3.467 Mean :3.487 Mean :3.697
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## CheckIn_Service Cleanliness Online_Boarding
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000
## Median :3.000 Median :4.000 Median :4.000
## Mean :3.341 Mean :3.708 Mean :3.352
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :5.000

#Outliers Treatment

Cap extreme upper values at a fixed ceiling for each variable (for Departure_DelayTime, the ceiling of 12 equals the third quartile).

airline_data = aviation_imputed_data

airline_data$Departure_DelayTime = ifelse(airline_data$Departure_DelayTime > 12, 12, airline_data$Departure_DelayTime)

airline_data$Flight_Distance = ifelse(airline_data$Flight_Distance > 4300, 4300, airline_data$Flight_Distance)

airline_data$Arrival_DelayTime = ifelse(airline_data$Arrival_DelayTime > 15, 15, airline_data$Arrival_DelayTime)

airline_data$Seat_Comfort = ifelse(airline_data$Seat_Comfort > 5, 5, airline_data$Seat_Comfort)
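
The ceilings above are hard-coded per variable. As an alternative sketch only, the upper bound can be derived from the data with Tukey's rule (Q3 + 1.5 × IQR). This is a different technique from the fixed thresholds used in the report; the helper name `cap_upper` and the sample values are hypothetical:

```r
# Hypothetical helper: cap values above Q3 + 1.5 * IQR at that upper fence.
cap_upper <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- q[2] + 1.5 * (q[2] - q[1])
  pmin(x, fence)
}

delays <- c(0, 0, 0, 5, 12, 30, 1592)  # illustrative values only
capped <- cap_upper(delays)
print(capped)
```

Only values above the fence change; everything at or below it is returned untouched.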

DATA PREPARATION
dim(aviation_imputed_data)

## [1] 90917 23

str(aviation_imputed_data)

## 'data.frame': 90917 obs. of 23 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 2 1 1 ...
## $ CustomerType : Factor w/ 2 levels "disloyal Customer",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : int 65 15 60 70 30 66 10 22 58 34 ...
## $ TypeTravel : Factor w/ 2 levels "Business travel",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Class : Factor w/ 3 levels "Business","Eco",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Flight_Distance : int 265 2138 623 354 1894 227 1812 1556 104 3633 ...
## $ Departure_DelayTime : int 0 0 0 0 0 17 0 30 47 0 ...
## $ Arrival_DelayTime : num 0 0 0 0 0 15 0 26 48 0 ...
## $ Satisfaction : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Seat_Comfort : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure_ArrivalTime_Covenient: num 0 0 2.99 0 0 ...
## $ Food_Drink : num 0 0 0 0 0 ...
## $ Gate_Location : int 2 3 3 3 3 3 3 3 3 4 ...
## $ Inflight_WifiService : int 2 2 3 4 2 2 2 2 3 2 ...
## $ Inflight_Entertainment : int 4 0 4 3 0 5 0 0 3 0 ...
## $ Online_Support : int 2 2 3 4 2 5 2 2 3 2 ...
## $ Ease_Of_OnlineBooking : int 3 2 1 2 2 5 2 2 3 2 ...
## $ Onboard_Service : num 3 3.47 1 2 5 ...
## $ LegRoom_Service : int 0 3 0 0 4 0 3 4 0 2 ...
## $ Baggage_Handling : int 3 4 1 2 5 5 4 5 1 5 ...
## $ CheckIn_Service : int 5 4 4 4 5 5 5 3 2 2 ...
## $ Cleanliness : int 3 4 1 2 4 5 4 4 3 5 ...
## $ Online_Boarding : int 2 2 3 5 2 3 2 2 5 2 ...

table(aviation_imputed_data$Satisfaction)


##
## neutral or dissatisfied satisfied
## 41156 49761

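# Note: airline is built from aviation_imputed_data, not from the capped airline_data,
# so the outlier treatment is not carried into the modelling data.
airline = aviation_imputed_data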
colnames(airline)

## [1] "Gender" "CustomerType"
## [3] "Age" "TypeTravel"
## [5] "Class" "Flight_Distance"
## [7] "Departure_DelayTime" "Arrival_DelayTime"
## [9] "Satisfaction" "Seat_Comfort"
## [11] "Departure_ArrivalTime_Covenient" "Food_Drink"
## [13] "Gate_Location" "Inflight_WifiService"
## [15] "Inflight_Entertainment" "Online_Support"
## [17] "Ease_Of_OnlineBooking" "Onboard_Service"
## [19] "LegRoom_Service" "Baggage_Handling"
## [21] "CheckIn_Service" "Cleanliness"
## [23] "Online_Boarding"

#Convert to Numeric
airline$Gender = as.numeric(as.factor(airline$Gender))
airline$CustomerType = as.numeric(as.factor(airline$CustomerType))
airline$Age = as.numeric(as.integer(airline$Age))
airline$TypeTravel = as.numeric(as.factor(airline$TypeTravel))
airline$Class = as.numeric(as.factor(airline$Class))
airline$Flight_Distance = as.numeric(as.integer(airline$Flight_Distance))
airline$Departure_DelayTime = as.numeric(as.integer(airline$Departure_DelayTime))
airline$Seat_Comfort = as.numeric(as.integer(airline$Seat_Comfort))
airline$Gate_Location = as.numeric(as.integer(airline$Gate_Location))
airline$Inflight_WifiService = as.numeric(as.integer(airline$Inflight_WifiService))
airline$Inflight_Entertainment = as.numeric(as.integer(airline$Inflight_Entertainment))
airline$Online_Support = as.numeric(as.integer(airline$Online_Support))
airline$Ease_Of_OnlineBooking = as.numeric(as.integer(airline$Ease_Of_OnlineBooking))
airline$LegRoom_Service = as.numeric(as.integer(airline$LegRoom_Service))
airline$Baggage_Handling = as.numeric(as.integer(airline$Baggage_Handling))
airline$CheckIn_Service = as.numeric(as.integer(airline$CheckIn_Service))
airline$Cleanliness = as.numeric(as.integer(airline$Cleanliness))
airline$Online_Boarding = as.numeric(as.integer(airline$Online_Boarding))

str(airline)

## 'data.frame': 90917 obs. of 23 variables:
## $ Gender : num 1 1 1 1 2 1 2 2 1 1 ...
## $ CustomerType : num 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : num 65 15 60 70 30 66 10 22 58 34 ...
## $ TypeTravel : num 2 2 2 2 2 2 2 2 2 2 ...
## $ Class : num 2 2 2 2 2 2 2 2 2 2 ...
## $ Flight_Distance : num 265 2138 623 354 1894 ...
## $ Departure_DelayTime : num 0 0 0 0 0 17 0 30 47 0 ...
## $ Arrival_DelayTime : num 0 0 0 0 0 15 0 26 48 0 ...
## $ Satisfaction : Factor w/ 2 levels "neutral or dissatisfied",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Seat_Comfort : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Departure_ArrivalTime_Covenient: num 0 0 2.99 0 0 ...
## $ Food_Drink : num 0 0 0 0 0 ...
## $ Gate_Location : num 2 3 3 3 3 3 3 3 3 4 ...
## $ Inflight_WifiService : num 2 2 3 4 2 2 2 2 3 2 ...
## $ Inflight_Entertainment : num 4 0 4 3 0 5 0 0 3 0 ...
## $ Online_Support : num 2 2 3 4 2 5 2 2 3 2 ...
## $ Ease_Of_OnlineBooking : num 3 2 1 2 2 5 2 2 3 2 ...
## $ Onboard_Service : num 3 3.47 1 2 5 ...
## $ LegRoom_Service : num 0 3 0 0 4 0 3 4 0 2 ...
## $ Baggage_Handling : num 3 4 1 2 5 5 4 5 1 5 ...
## $ CheckIn_Service : num 5 4 4 4 5 5 5 3 2 2 ...
## $ Cleanliness : num 3 4 1 2 4 5 4 4 3 5 ...
## $ Online_Boarding : num 2 2 3 5 2 3 2 2 5 2 ...

table(airline$Satisfaction)

##
## neutral or dissatisfied satisfied
## 41156 49761

airline$Satisfaction = ifelse(airline$Satisfaction == "satisfied", 1, 0)
airline$Satisfaction = factor(airline$Satisfaction, levels = c(0, 1))
table(airline$Satisfaction)

##
## 0 1
## 41156 49761


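##Data Splitting
set.seed(1234)
# sample.split() comes from the caTools package and preserves the class ratio of Satisfaction
sample = sample.split(airline$Satisfaction, SplitRatio = 0.7)
train_data = subset(airline, sample == TRUE)
test_data = subset(airline, sample == FALSE)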

dim(train_data)

## [1] 63642 23

dim(test_data)

## [1] 27275 23

prop.table(table(train_data$Satisfaction))

##
## 0 1
## 0.4526728 0.5473272

prop.table(table(test_data$Satisfaction))

##
## 0 1
## 0.4526856 0.5473144
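
The near-identical class proportions in the two sets come from stratified sampling. A minimal base-R sketch of the same idea, assuming a hypothetical helper name `stratified_split` and toy labels (this is an illustration, not the caTools implementation):

```r
# Hypothetical helper: stratified split that keeps class proportions
# roughly equal in train and test, using base R only.
stratified_split <- function(y, ratio = 0.7) {
  y_chr <- as.character(y)
  in_train <- logical(length(y_chr))
  for (lvl in unique(y_chr)) {
    idx <- which(y_chr == lvl)
    n_train <- round(length(idx) * ratio)
    # Draw n_train positions within this class and mark them as training rows
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(1234)
y <- factor(rep(c(0, 1), times = c(40, 60)))  # toy labels
in_train <- stratified_split(y)
print(table(y[in_train]))
```

Because sampling happens within each class, the 40/60 ratio of the toy labels is preserved in the training subset.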

MODEL BUILDING
##Applying Logistic Regression
logr_train = train_data
logr_test = test_data

logr_model = glm(Satisfaction~ ., data = logr_train, family = binomial)


summary(logr_model)

##
## Call:
## glm(formula = Satisfaction ~ ., family = binomial, data = logr_train)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -2.8900 -0.5829  0.1977  0.5246  3.5122
##
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)                     -6.621e+00  1.091e-01 -60.681  < 2e-16 ***
## Gender                          -9.667e-01  2.341e-02 -41.299  < 2e-16 ***
## CustomerType                     2.086e+00  3.508e-02  59.470  < 2e-16 ***
## Age                             -8.079e-03  8.098e-04  -9.976  < 2e-16 ***
## TypeTravel                      -8.920e-01  3.091e-02 -28.861  < 2e-16 ***
## Class                           -5.115e-01  2.169e-02 -23.583  < 2e-16 ***
## Flight_Distance                 -1.112e-04  1.219e-05  -9.126  < 2e-16 ***
## Departure_DelayTime              2.607e-03  1.077e-03   2.421   0.0155 *
## Arrival_DelayTime               -7.166e-03  1.067e-03  -6.718 1.84e-11 ***
## Seat_Comfort                     2.631e-01  1.264e-02  20.826  < 2e-16 ***
## Departure_ArrivalTime_Covenient -1.926e-01  9.711e-03 -19.828  < 2e-16 ***
## Food_Drink                      -2.113e-01  1.292e-02 -16.352  < 2e-16 ***
## Gate_Location                    9.689e-02  1.072e-02   9.039  < 2e-16 ***
## Inflight_WifiService            -7.845e-02  1.258e-02  -6.235 4.52e-10 ***
## Inflight_Entertainment           6.937e-01  1.178e-02  58.905  < 2e-16 ***
## Online_Support                   7.884e-02  1.282e-02   6.149 7.78e-10 ***
## Ease_Of_OnlineBooking            2.535e-01  1.648e-02  15.379  < 2e-16 ***
## Onboard_Service                  2.864e-01  1.187e-02  24.127  < 2e-16 ***
## LegRoom_Service                  2.152e-01  9.948e-03  21.632  < 2e-16 ***
## Baggage_Handling                 1.127e-01  1.312e-02   8.586  < 2e-16 ***
## CheckIn_Service                  3.020e-01  9.823e-03  30.747  < 2e-16 ***
## Cleanliness                      9.612e-02  1.369e-02   7.020 2.21e-12 ***
## Online_Boarding                  1.711e-01  1.411e-02  12.126  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 87655 on 63641 degrees of freedom
## Residual deviance: 49372 on 63619 degrees of freedom
## AIC: 49418
##
## Number of Fisher Scoring iterations: 5


#Prediction - Train Dataset

logr_pred = predict(logr_model, newdata = logr_train, type = "response")
logr_TR = table(logr_train$Satisfaction, logr_pred > 0.5)
logr_TR

##
## FALSE TRUE
## 0 23495 5314
## 1 5256 29577

#Accuracy
TR_tpr = logr_TR[2,2]/(logr_TR[2,1] + logr_TR[2,2])
# True negative rate (specificity); the false positive rate is 1 minus this value
TR_tnr = logr_TR[1,1]/(logr_TR[1,1] + logr_TR[1,2])
TR_accuracy = sum(diag(logr_TR))/sum(logr_TR)

TR_accuracy

## [1] 0.8339147

#Prediction - Test Dataset

logr_pred1 = predict(logr_model, newdata = logr_test, type = "response")
logr_TE = table(logr_test$Satisfaction, logr_pred1 > 0.5)
logr_TE

##
## FALSE TRUE
## 0 10104 2243
## 1 2197 12731

#Accuracy
TE_tpr = logr_TE[2,2]/(logr_TE[2,1] + logr_TE[2,2])
# True negative rate (specificity); the false positive rate is 1 minus this value
TE_tnr = logr_TE[1,1]/(logr_TE[1,1] + logr_TE[1,2])
TE_accuracy = sum(diag(logr_TE))/sum(logr_TE)

TE_accuracy

## [1] 0.8372136

#Alternatively
logr_pred = predict(logr_model, newdata = logr_test, type = "response")

y_pred_num = ifelse(logr_pred > 0.5, 1,0)


y_pred = factor(y_pred_num, levels = c(0, 1))
y_act = logr_test$Satisfaction

#Accuracy
mean(y_pred == y_act)


## [1] 0.8372136

#Confusion Matrix
library(e1071)
caret::confusionMatrix(y_pred, y_act, positive = "1")

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 10104 2197
## 1 2243 12731
##
## Accuracy : 0.8372
## 95% CI : (0.8328, 0.8416)
## No Information Rate : 0.5473
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6714
##
## Mcnemar's Test P-Value : 0.4995
##
## Sensitivity : 0.8528
## Specificity : 0.8183
## Pos Pred Value : 0.8502
## Neg Pred Value : 0.8214
## Prevalence : 0.5473
## Detection Rate : 0.4668
## Detection Prevalence : 0.5490
## Balanced Accuracy : 0.8356
##
## 'Positive' Class : 1
##
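
The headline statistics in this output can be reproduced by hand from the four confusion counts. A short sketch using the test-set counts reported above; the object name `cm` and the variable names are illustrative:

```r
# Test-set confusion counts from the logistic regression output
# (rows = actual class, columns = predicted class)
cm <- matrix(c(10104, 2243,
               2197, 12731),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

sensitivity <- cm["1", "1"] / sum(cm["1", ])   # true positive rate
specificity <- cm["0", "0"] / sum(cm["0", ])   # true negative rate
fpr         <- 1 - specificity                 # false positive rate
accuracy    <- sum(diag(cm)) / sum(cm)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              fpr = fpr, accuracy = accuracy), 4))
```

The sensitivity, specificity, and accuracy computed this way agree with the 0.8528, 0.8183, and 0.8372 shown in the caret output.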

#ROC
library(InformationValue)

##
## Attaching package: 'InformationValue'

## The following objects are masked from 'package:caret':


##
## confusionMatrix, precision, sensitivity, specificity

InformationValue::plotROC(y_act, logr_pred)


InformationValue::AUROC(y_act, logr_pred)

## [1] 0.9097688
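
The AUROC is the probability that a randomly chosen satisfied passenger receives a higher predicted score than a randomly chosen dissatisfied one. A minimal sketch of that computation through the rank-sum (Mann-Whitney) identity, assuming a hypothetical helper name `auc_rank` and toy inputs; it is not how InformationValue computes the value internally:

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity.
auc_rank <- function(actual, score) {
  pos <- actual == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)  # average ranks handle tied scores
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actual <- c(0, 0, 1, 1)            # toy labels
score  <- c(0.1, 0.4, 0.35, 0.8)   # toy predicted probabilities
print(auc_rank(actual, score))
```

In the toy data, three of the four positive-negative pairs are ordered correctly, so the result is 0.75.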

#Random Forest
Rand_data = train_data
Rand_test = test_data
Rand_Model = randomForest(Satisfaction ~., data = Rand_data, ntree = 50, importance = T)

Rand_Model

##
## Call:
## randomForest(formula = Satisfaction ~ ., data = Rand_data, ntree = 50, importance = T)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.31%
## Confusion matrix:
## 0 1 class.error
## 0 27250 1559 0.05411503
## 1 1821 33012 0.05227801

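#Prediction
predict_rf = predict(Rand_Model, Rand_test, type = "class")
# InformationValue is attached at this point and masks caret::confusionMatrix,
# so this unqualified call fails on factor input and gives the warning below.
# A namespace-qualified call avoids the clash:
# caret::confusionMatrix(predict_rf, Rand_test$Satisfaction, positive = "1")
confusionMatrix(Rand_test$Satisfaction, predict_rf)

## Warning in Ops.factor(predictedScores, threshold): '<' not meaningful for
## factors

## [1] 0 1
## <0 rows> (or 0-length row.names)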

#ROC Curve
library(ROCR)
Pred_rf = predict(Rand_Model, Rand_test, type = 'prob')[,2]
require(pROC)

## Loading required package: pROC

## Type 'citation("pROC")' for a citation.

##
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':


##
## cov, smooth, var

rf_roc = roc(Rand_test$Satisfaction,Pred_rf)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(rf_roc)

Area under ROC curve


auc(rf_roc)

## Area under the curve: 0.9901

varImpPlot(Rand_Model, sort = T, n.var = 10, main = "Top 10 -Variable Importance")

#Boosting

boost_train = train_data
boost_test = test_data

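features_train = as.matrix(boost_train[,c(1:8,10:23)])
# Satisfaction (column 9) is a factor with levels "0"/"1"; convert it to numeric 0/1 labels
label_train = as.numeric(as.character(boost_train$Satisfaction))
features_test = as.matrix(boost_test[,c(1:8,10:23)])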

#Model
library(xgboost)

##
## Attaching package: 'xgboost'

## The following object is masked from 'package:rattle':


##
## xgboost

# Note: nfold is an xgb.cv() argument; xgboost() does not use it, which is what
# the warning below reports.
XGBModel = xgboost(data = features_train, label = label_train, eta = .001,
                   max_depth = 5, min_child_weight = 3, nrounds = 10, nfold = 5,
                   objective = "binary:logistic", verbose = 0,
                   early_stopping_rounds = 10)

## [22:22:00] WARNING: amalgamation/../src/learner.cc:480:
## Parameters: { nfold } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.

#Prediction
XGB_predict = predict(XGBModel, features_test)
tabXGB = table(boost_test$Satisfaction, XGB_predict > 0.5)
tabXGB

##
## FALSE TRUE
## 0 10914 1433
## 1 2052 12876

#Confusion Matrix
XG_tpr = tabXGB[2,2]/(tabXGB[2,1] + tabXGB[2,2])
# True negative rate (specificity); the false positive rate is 1 minus this value
XG_tnr = tabXGB[1,1]/(tabXGB[1,1] + tabXGB[1,2])

XG_tpr

## [1] 0.8625402

XG_tnr

## [1] 0.8839394

#Accuracy
sum(diag(tabXGB))/sum(tabXGB)

## [1] 0.8722273

bs_pred = ifelse(XGB_predict > 0.5, 1,0)


bs_act = boost_test$Satisfaction

#ROC
InformationValue::plotROC(bs_act, XGB_predict)


InformationValue::AUROC(bs_act, XGB_predict)

## [1] 0.8732398
