
This document forms the report to
predict whether or not an employee
will use a car as a mode of transport
using machine learning algorithms.

Project-5
Machine Learning

R S Prashanti


Table of Contents

No. Question (Pts)

1. EDA - Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and missing values and check the summary of the dataset (7 pts)
2. EDA - Illustrate the insights based on EDA (5 pts)
3. EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it (3 pts)
4. Data Preparation (SMOTE) (10 pts)
5. Applying Logistic Regression & interpret results (3 pts)
6. Applying KNN Model & interpret results (3 pts)
7. Applying Naïve Bayes Model & interpret results (is it applicable here? Comment, and if it is not applicable, how can you build an NB model in this case?) (3 pts)
8. Confusion matrix interpretation (3 pts)
9. Remarks on the model validation exercise (which model performed the best) (3 pts)
10. Bagging (7.5 pts)
11. Boosting (7.5 pts)
12. Actionable Insights and Recommendations (5 pts)

All the questions are answered in the same order as in the Table of Contents.


Project Objective

This project requires understanding which mode of transport employees prefer for commuting to their office. The attached data 'Cars.csv' includes information about each employee's mode of transport as well as personal and professional details such as age, salary and work experience. We need to predict whether or not an employee will use a car as the mode of transport, and identify which variables are significant predictors of this decision.

The following steps will be performed to predict the commute mode:

 Perform an EDA on the data
 Illustrate the insights based on EDA
 Check for Multicollinearity
 Data Preparation
 Prepare the data for analysis (SMOTE)
 Modeling
 Create multiple models and explore how each model performs, using appropriate model performance metrics:
 KNN
 Naïve Bayes
 Logistic Regression
 Apply both bagging and boosting modeling procedures to create two more models, and compare their accuracy with the best model of the above steps.

Note: Each question referred to in the above table has been answered with the same nomenclature.


Q1. Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and missing values and
check the summary of the dataset.

A1.

setwd("D:/users/Prashanti Sharma/Personal/PROJS/Pred-2")
data <- read.csv("Cars.csv")   # base R read.csv; the project dataset is Cars.csv

1.1 Understanding the data

variable.names(data)

[1] "Age" "Gender" "Engineer" "MBA" "Work.Exp" "Salary" "Distance"


[8] "license" "Transport"

str(data)

> str(data)
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...


dim(data)

> dim(data)
[1] 444 9

View(data)

#basic data summary

Variables like Engineer, MBA and license have been read as numeric, so they should be converted to factors first.


summary(data)

> summary(data)
Age Gender Engineer MBA Work.Exp Salary
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. : 6.50
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.: 9.80
Median :27.00 NA's: 1 Median : 5.0 Median :13.60
Mean :27.75 Mean : 6.3 Mean :16.24
3rd Qu.:30.00 3rd Qu.: 8.0 3rd Qu.:15.72
Max. :43.00 Max. :24.0 Max. :57.00
Distance license Transport
Min. : 3.20 0:340 2Wheeler : 83
1st Qu.: 8.80 1:104 Car : 61
Median :11.00 Public Transport:300
Mean :11.32
3rd Qu.:13.43
Max. :23.40

 The majority of employees are male (316 of 444, approx. 71%)
 Similarly, Engineers outnumber MBAs
 The total number of Engineers and MBAs (335 + 112 = 447) is greater than 444, so some candidates possibly hold dual degrees
 One data point for MBA is missing
 Salary may have a skewed distribution
 Again, public transport is the most common mode of transportation

# Checking null data

sapply(data,function(x) sum(is.na(x)))

> sapply(data,function(x) sum(is.na(x)))


Age Gender Engineer MBA Work.Exp Salary Distance license Transport
0 0 0 1 0 0 0 0 0

#Checking for Outliers

summary(data$Gender)
summary(data$Engineer)
summary(data$MBA)
summary(data$Work.Exp)
summary(data$Salary)
summary(data$Distance)
summary(data$license)


summary(data$Transport)

summary(data$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 25.00 27.00 27.75 30.00 43.00
> summary(data$Gender)
Female Male
128 316
> summary(data$Engineer)
0 1
109 335
> summary(data$MBA)
0 1 NA's
331 112 1
> summary(data$Work.Exp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 3.0 5.0 6.3 8.0 24.0
> summary(data$Salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.50 9.80 13.60 16.24 15.72 57.00
> summary(data$Distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.20 8.80 11.00 11.32 13.43 23.40
> summary(data$license)
0 1
340 104
> summary(data$Transport)
2Wheeler Car Public Transport
83 61 300


#Checking for Missing Values


sum(is.na(data))   # is.na() flags each cell; summing counts the total missing values

> sum(is.na(data))
[1] 1

As stated earlier in the summary analysis, one data point is missing, which is apparent from the above.

#Univariate analysis


Univariate analysis of all the continuous (numeric) variables:

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(col = "blue")
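The histograms above reveal the skew; for an explicit visual outlier check on the numeric variables, here is a minimal boxplot sketch (values beyond the whiskers flag potential outliers):

# Boxplots of the numeric columns; points beyond the whiskers are potential outliers
boxplot(data[, c("Age", "Work.Exp", "Salary", "Distance")],
        main = "Outlier check for numeric variables")

Salary in particular shows several high values, consistent with its skewed summary.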


#Bivariate Analysis

boxplot(data$Age ~data$Engineer, main = "Age vs Eng.")


boxplot(data$Age ~data$MBA, main ="Age Vs MBA")

As expected, there is not much difference here; people of all qualifications and all levels of work experience would be employed in the firm.

boxplot(data$Age ~data$Engineer, main = "Salary vs Eng.")


boxplot(data$Age ~data$MBA, main ="Salary vs MBA.")


We do not see any appreciable difference in the salary of Engineers vs non-Engineers or MBAs vs non-MBAs. Also, the mean salary for both MBAs and Engineers is around 16.

hist(data$Work.Exp, col = "blue", main = "Distribution of work exp")


This distribution is right-skewed; again, this is on expected lines, as there would be more juniors than seniors in any firm.

table(data$license,data$Transport)

    2Wheeler Car Public Transport
  0       60  13              267
  1       23  48               33

boxplot(data$Work.Exp ~ data$Gender)

There is not much difference between the mean work experience of the two genders, so the work-experience distribution is similar for males and females.


Hypothesis Testing

1. The higher the salary, the greater the chance of using a car for the commute.

boxplot(data$Salary~data$Transport, main="Salary vs Transport")

The plot clearly shows that as salary increases, the inclination to commute by car is higher.

2. Similarly, with age or work experience (Age and Work.Exp would be collinear), the propensity to use a car increases.

cor(data$Age, data$Work.Exp)

> cor(data$Age, data$Work.Exp)
[1] 0.9322364

boxplot(data$Age~data$Transport, main="Age vs Transport")


As was the case with salary, we can see a clear demarcation in transport usage: in the lower age group a 2-wheeler is preferred, and with higher work experience a car is preferred.

3. As distance increases, an employee would prefer a car for comfort and ease.

boxplot(data$Distance~data$Transport, main="Distance vs Transport")

There is a slight pattern to be observed here: for greater distances a car is preferred, followed by a 2-wheeler and then public transport.

4. Females would prefer private transport more than public transport.

table(data$Gender,data$Transport)


> table(data$Gender,data$Transport)

         2Wheeler Car Public Transport
  Female       38  13               77
  Male         45  48              223

We can see that around 40% of females use private transport (2-wheeler or car), with 10% using a car, compared to males, where 15% prefer a car and a total of about 30% use private transport. Thus, even though the percentage of car usage is higher among males, males are also heavy users of public transport.
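The percentages quoted above can be reproduced directly with row-wise proportions; a quick sketch:

# Row-wise (per-gender) proportions of each transport mode
prop.table(table(data$Gender, data$Transport), margin = 1)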

A.2 Insights of EDA

 The majority of employees are male (approx. 71%)
 Similarly, Engineers outnumber MBAs
 The total number of Engineers and MBAs (447) is greater than 444, so some candidates possibly hold dual degrees
 One data point for MBA is missing
 Salary may have a skewed distribution
 Again, public transport is the most common mode of transportation
 Around 40% of females use private transport and 10% use a car, compared to males, where 15% prefer a car and about 30% use private transport; even though the percentage of car usage is higher among males, they are also heavy users of public transport
 There is a slight pattern with distance: for greater distances a car is preferred, followed by a 2-wheeler and then public transport
 As with salary, there is a clear demarcation in transport usage: 2-wheelers are preferred in the lower age group, and cars are preferred at higher work experience
 Again, with age or work experience (which are collinear), the propensity to use a car increases

A.3 Multicollinearity

library(corrplot)
num.data <- data[sapply(data, is.numeric)]   # cor() accepts only numeric columns
data.matrix <- cor(num.data)
corrplot(data.matrix)
library(ppcor)
pcor(num.data, method = "pearson")
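Since Age and Work.Exp are strongly correlated (r = 0.93), one way to treat the multicollinearity is to check variance inflation factors and drop the more redundant variable. A minimal sketch, assuming the 'car' package is installed:

library(car)
# VIF is computed from a fitted model; a simple linear model on the numeric predictors suffices
vif(lm(Salary ~ Age + Work.Exp + Distance, data = data))
# If Age and Work.Exp both show a high VIF (e.g. > 5), drop one of them (we prefer keeping Work.Exp)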


A.4 SMOTE

## SMOTE

The columns Engineer, MBA and license need to be converted into factors.

library(DMwR)   # provides knnImputation() and SMOTE()
data$Engineer <- as.factor(data$Engineer)
data$MBA <- as.factor(data$MBA)
data$license <- as.factor(data$license)
data <- knnImputation(data)

data$CarUsage<-ifelse(data$Transport =='Car',1,0)
table(data$CarUsage)

sum(data$CarUsage == 1)/nrow(data)

data$CarUsage<-as.factor(data$CarUsage)

data$Engineer<-as.factor(data$Engineer)
> data$MBA<-as.factor(data$MBA)
> data$license<-as.factor( data$license)
> data<-knnImputation(data)
Warning message:
In knnImputation(data) :
No case has missing values. Stopping as there is nothing to do.
>
>
> data$CarUsage<-ifelse(data$Transport =='Car',1,0)
> table(data$CarUsage)

0 1
383 61
>
> sum(data$CarUsage == 1)/nrow(data)
[1] 0.1373874
>
> data$CarUsage<-as.factor(data$CarUsage)

The number of records for people travelling by car is in the minority. Hence, we need to use an appropriate sampling method on the training data.

##Split the data into test and train

library(caret)   # for createDataPartition()
set.seed(400)


carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)


carsdatatrain<-data[carindex,]
carsdatatest<-data[-carindex,]
prop.table(table(carsdatatrain$CarUsage))

> set.seed(400)
> carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)
> carsdatatrain<-data[carindex,]
> carsdatatest<-data[-carindex,]
> prop.table(table(carsdatatrain$CarUsage))

0 1
0.8621795 0.1378205

prop.table(table(carsdatatest$CarUsage))

> prop.table(table(carsdatatest$CarUsage))

0 1
0.8636364 0.1363636

carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]

## The train and test data have almost same percentage of cars usage as the base data
## Apply SMOTE on Training data set

library(DMwR)
attach(carsdatatrain)
carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
prop.table(table(carsdataSMOTE$CarUsage))

> carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
> prop.table(table(carsdataSMOTE$CarUsage))

0 1
0.5 0.5

We now have an equal split in the data between car users and non-car users. Let us proceed with building the models.

## Model Building

We will use the logistic regression method on the SMOTE data to understand the factors influencing car usage. Since we have only a limited number of variables, we will use them all in model building.

outcomevar<-'CarUsage'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)


carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm",
               family = "binomial",trControl = trainctrl)

> outcomevar<-'CarUsage'
> regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
> trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family = "b
inomial",trControl = trainctrl)
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(carsglm$finalModel)

> summary(carsglm$finalModel)

Call:
NULL

Deviance Residuals:
Min 1Q Median 3Q Max
-2.09669 -0.01064 0.00000 0.02798 2.11942

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -64.8739 18.4863 -3.509 0.000449 ***
Age 1.7911 0.5273 3.397 0.000681 ***
Work.Exp -0.2056 0.4347 -0.473 0.636175
Salary -0.1061 2.1572 -0.049 0.960759
Distance 0.6172 0.1957 3.154 0.001611 **
license1 1.5699 0.8530 1.840 0.065711 .
Engineer1 3.1162 1.2024 2.592 0.009554 **
MBA1 -3.4351 1.1784 -2.915 0.003557 **
GenderMale 1.6470 0.9453 1.742 0.081460 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 357.664 on 257 degrees of freedom


Residual deviance: 49.909 on 249 degrees of freedom
AIC: 67.909

Number of Fisher Scoring iterations: 9


carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)

carglmcoeff<-exp(coef(carsglm$finalModel))
> write.csv(carglmcoeff,file = "Coeffs.csv")
> varImp(object = carsglm)
glm variable importance

Overall
Age 100.00
Distance 92.74
MBA1 85.60
Engineer1 75.94
license1 53.50
GenderMale 50.57
Work.Exp 12.66
Salary 0.00

plot(varImp(object = carsglm), main="Variable Importance for Logistic Regression")


From the model we see that Age and Distance are the most important predictors, followed by MBA, Engineer and license. Looking at the odds table (the exponentiated coefficients written to Coeffs.csv), an increase in age of one year multiplies the odds of car usage by about exp(1.79) ≈ 6, and, as expected, holding a license multiplies the odds by about exp(1.57) ≈ 4.8. The null deviance of this model is 357.664 and the residual deviance is 49.909, which yields a McFadden R-squared of about 0.86, a very good fit. We also see that the Accuracy and Kappa values are high. We shall do the prediction based on this model.
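As a quick check, the McFadden pseudo R-squared follows directly from the reported deviances:

# McFadden pseudo R-squared from the model's reported deviances
null.dev  <- 357.664
resid.dev <- 49.909
1 - resid.dev / null.dev   # approx. 0.86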

#confusion matrix

carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
> carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
> confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 111 3
1 3 15

Accuracy : 0.9545
95% CI : (0.9037, 0.9831)
No Information Rate : 0.8636
P-Value [Acc > NIR] : 0.0005559

Kappa : 0.807

Mcnemar's Test P-Value : 1.0000000

Sensitivity : 0.8333
Specificity : 0.9737
Pos Pred Value : 0.8333
Neg Pred Value : 0.9737
Prevalence : 0.1364
Detection Rate : 0.1136
Detection Prevalence : 0.1364
Balanced Accuracy : 0.9035

'Positive' Class : 1


A.5 Applying Logistic Regression

1. Logistic Regression

### Model Building - Logistic regression

install.packages("MASS")
library(class)
model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)
summary(model)

> model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)
> summary(model)

Call:
glm(formula = Transport ~ ., family = binomial(link = "logit"),
data = cars_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.5733 0.1351 0.3372 0.5675 1.8440

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.48505 2.68180 -4.283 1.85e-05 ***
Age 0.52837 0.11253 4.696 2.66e-06 ***
GenderMale 1.52604 0.38092 4.006 6.17e-05 ***
Engineer1 0.33374 0.38788 0.860 0.38956
MBA1 0.64960 0.43553 1.492 0.13583
Work.Exp -0.07111 0.12017 -0.592 0.55403
Salary -2.55115 2.29565 -1.111 0.26644
Distance -0.15098 0.05167 -2.922 0.00348 **
license1 -2.32637 0.46451 -5.008 5.49e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 302.59 on 311 degrees of freedom


Residual deviance: 223.09 on 303 degrees of freedom
AIC: 241.09

Number of Fisher Scoring iterations: 6

The most relevant features are Age, Gender, Distance and license, because of their low p-values. Note that Transport has three levels; with a binomial glm, R treats the first factor level ('2Wheeler') as failure and the remaining levels as success, so this model effectively predicts whether an employee uses something other than a 2-wheeler.


Also, the AIC score is 241.09. This will be tracked in subsequent stages as we refine the model; the model with the lowest AIC score is the preferred, most optimized one.

anova(model, test="Chisq")

> anova(model, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Transport

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev


NULL 311 302.59
Age 1 26.1316 310 276.46
Gender 1 10.2970 309 266.16
Engineer 1 1.3586 308 264.80
MBA 1 0.8274 307 263.98
Work.Exp 1 2.6335 306 261.34
Salary 1 0.1789 305 261.16
Distance 1 11.1693 304 249.99
license 1 26.9018 303 223.09

## Cross Validation

fitted.results <- predict(model, newdata=cars_train, type='response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
# success (1) corresponds to the levels other than the first ('2Wheeler')
actual <- ifelse(cars_train$Transport == '2Wheeler', 0, 1)
misClasificError <- mean(fitted.results != actual)
print(paste('Accuracy', 1-misClasificError))

[1] "Accuracy 0.775925925925926"

## Decision Tree

library(rpart)
model2 <- rpart(Transport ~ Salary + Distance + license
+Work.Exp, data=cars_train, method="class")

plot(model2, uniform=TRUE,
main="Classification Tree for data")
text(model2, use.n=TRUE, all=TRUE, cex=.8)


treePred <- predict(model2, cars_train, type = 'class')


table(treePred, cars_train$Transport)
mean(treePred==cars_train$Transport)

library(rpart)
> model2 <- rpart(Transport ~ Salary + Distance + license
+ +Work.Exp, data=cars_train, method="class")
>
> plot(model2, uniform=TRUE,
+ main="Classification Tree for data")
> text(model2, use.n=TRUE, all=TRUE, cex=.8)
> treePred <- predict(model2, cars_train, type = 'class')
> table(treePred, cars_train$Transport)

treePred           2Wheeler Car Public Transport
  2Wheeler               24   0               10
  Car                     3  36                0
  Public Transport       32   7              200
> mean(treePred==cars_train$Transport)
[1] 0.8333333
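The 83% above is training accuracy. As a sketch, the same tree can also be scored on the held-out data (assuming the cars_test split created in the KNN section below):

# Evaluate the tree on the held-out test set as well
treePredTest <- predict(model2, cars_test, type = 'class')
mean(treePredTest == cars_test$Transport)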


A.6 Applying KNN Model

#KNN

library(caret)
library(mlbench)

random <- createDataPartition(data$Transport, p=0.70, list=FALSE)


cars_train <- data[ random,]
cars_test <- data[-random,]

library(class)

trControl <- trainControl(method = "cv", number = 10)


fit.knn <- train(Transport ~ .,
                 method = "knn",
                 tuneGrid = expand.grid(k = 2:20),
                 trControl = trControl,
                 metric = "Accuracy",
                 preProcess = c("center","scale"),
                 data = cars_train)
fit.knn

k-Nearest Neighbors

312 samples
8 predictor
3 classes: '2Wheeler', 'Car', 'Public Transport'

Pre-processing: centered (8), scaled (8)


Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 281, 281, 280, 281, 280, 282, ...
Resampling results across tuning parameters:

k Accuracy Kappa
2 0.7365457 0.4543489
3 0.7855712 0.5248631
4 0.7629839 0.4800127
5 0.7828562 0.5081854
6 0.7734812 0.4905393
7 0.7634005 0.4624704
8 0.7408065 0.4118105
9 0.7534005 0.4199273
10 0.7536022 0.4116860
11 0.7598454 0.4168749
12 0.7662970 0.4266860


13 0.7662970 0.4213708
14 0.7566129 0.3930122
15 0.7661895 0.4135919
16 0.7660887 0.4090611
17 0.7566129 0.3862387
18 0.7629637 0.3926229
19 0.7661895 0.4026549
20 0.7661895 0.3942178

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3

KNN_predictions = predict(fit.knn,cars_train)
table(KNN_predictions, cars_train$Transport)

KNN_predictions    2Wheeler Car Public Transport
  2Wheeler               37   0                8
  Car                     0  35                2
  Public Transport       22   8              200

KNN_predictions = predict(fit.knn,cars_test)
table(KNN_predictions, cars_test$Transport)

KNN_predictions    2Wheeler Car Public Transport
  2Wheeler                9   0               11
  Car                     1  15                3
  Public Transport       14   3               76
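A quick sanity check of the held-out accuracy implied by the table above:

# Overall test accuracy from the confusion table above
mean(KNN_predictions == cars_test$Transport)   # (9 + 15 + 76) / 132, approx. 0.76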

predict(fit.knn,carsTest)

[1] Public Transport Public Transport
Levels: 2Wheeler Car Public Transport

Data Split

For a classification problem, it is important to ensure that the train and test sets have approximately the same percentage of samples of each target class. Hence, we will use stratified sampling.

install.packages("caret")
library(caret)
set.seed(1234)
trainIndex = createDataPartition(data$class, p=0.7, list = FALSE, times = 1)
train.data = data[trainIndex, ]
test.data = data[-trainIndex,]

dim(train.data)
dim(test.data)


prop.table(table(data$class))
prop.table(table(train.data$class))
prop.table(table(test.data$class))

> set.seed(1234)
> trainIndex = createDataPartition(data$class, p=0.7, list = FALSE, times = 1)
> train.data = data[trainIndex, ]
> test.data = data[-trainIndex,]
>
> dim(train.data)
[1] 538 9
> dim(test.data)
[1] 230 9
>
> prop.table(table(data$class))

0 1
0.6510417 0.3489583
> prop.table(table(train.data$class))

0 1
0.6728625 0.3271375
> prop.table(table(test.data$class))

0 1
0.6 0.4

A.7 Applying Naïve Bayes

Data Cleaning

Missing values

anyNA(data)

anyNA(data)
[1] TRUE

Finding out where the missing value is

data[!complete.cases(data), ]

data[!complete.cases(data), ]
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
145 28 Female 0 <NA> 6 13.7 9.4 0 Public Transport


Use the KNN method to impute the missing value.

library(DMwR)
data = knnImputation(data, 5)

Continuous variables could also be normalized, though the Gaussian naive Bayes used below does not require it.

# Model Building and Predictions

Naïve Bayes

library(e1071)
Naive_Bayes_Model=naiveBayes(Transport ~., data=cars_train)
Naive_Bayes_Model

> Naive_Bayes_Model=naiveBayes(Transport ~., data=cars_train)
> Naive_Bayes_Model

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769

Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.11864 3.173624
Car 35.20930 3.211276
Public Transport 26.76190 2.952653

Gender
Y Female Male
2Wheeler 0.4915254 0.5084746
Car 0.1395349 0.8604651
Public Transport 0.2285714 0.7714286

Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627


  Car              0.1627907 0.8372093
  Public Transport 0.2714286 0.7285714

MBA
Y 0 1
2Wheeler 0.7627119 0.2372881
Car 0.8604651 0.1395349
Public Transport 0.7142857 0.2857143

Work.Exp
Y [,1] [,2]
2Wheeler 4.000000 3.449138
Car 15.046512 4.690180
Public Transport 5.004762 3.181131

Salary
Y [,1] [,2]
2Wheeler 2.442295 0.4234003
Car 3.479695 0.4330174
Public Transport 2.520766 0.3208979

Distance
Y [,1] [,2]
2Wheeler 12.03051 3.178172
Car 15.07209 3.903559
Public Transport 10.11524 2.964633

license
Y 0 1
2Wheeler 0.7288136 0.2711864
Car 0.2093023 0.7906977
Public Transport 0.8857143 0.1142857

This gives us the conditional probabilities that help explain an employee's decision to use a car or not.

The general way to interpret this output is that, for a factor variable such as license, about 73% of 2-wheeler users do not have a license and 27% do. For a continuous variable such as Distance, 2-wheelers are used by people whose mean commute distance is about 12.0 with an sd of 3.2. Is naive Bayes applicable here? Strictly speaking, it assumes the predictors are conditionally independent given the class, and Age and Work.Exp are strongly correlated (r = 0.93), so that assumption is violated. The e1071 implementation can still build a model, handling categorical predictors with frequency tables and continuous predictors with a per-class Gaussian, as seen above.
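For continuous variables the e1071 implementation assumes a per-class Gaussian distribution. As a minimal sketch using the Distance means and standard deviations from the table above, the class-conditional likelihood of a 15 km commute can be computed directly:

# Class-conditional density of Distance = 15 under the Gaussian NB assumption
dnorm(15, mean = 15.07, sd = 3.90)   # Car: the highest likelihood
dnorm(15, mean = 12.03, sd = 3.18)   # 2Wheeler
dnorm(15, mean = 10.12, sd = 2.96)   # Public Transport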

#Prediction on the test dataset

NB_Predictions=predict(Naive_Bayes_Model,cars_test)
table(NB_Predictions,cars_test$Transport)

NB_Predictions=predict(Naive_Bayes_Model,cars_test)


> table(NB_Predictions,cars_test$Transport)

NB_Predictions     2Wheeler Car Public Transport
  2Wheeler                7   0                5
  Car                     0  14                3
  Public Transport       17   4               82

# prediction for test sample

NB_Predictions=predict(Naive_Bayes_Model,carsTest)
NB_Predictions

[1] Public Transport Public Transport
Levels: 2Wheeler Car Public Transport

A.11 Boosting

#BOOSTING

install.packages("xgboost")

boostcontrol <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1,
  subsample = 1
)

carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,
                 metric = "Accuracy",method = "xgbTree")

carsxgb$finalModel

> carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,metric = "Accuracy",method = "xgbTree")
>
> carsxgb$finalModel
##### xgb.Booster
raw: 38.5 Kb
call:
xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
gamma = param$gamma, colsample_bytree = param$colsample_bytree,


min_child_weight = param$min_child_weight, subsample = param$subsample),


data = x, nrounds = param$nrounds, num_class = length(lev),
objective = "multi:softprob")
params (as set within xgb.train):
eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsam
ple = "1", num_class = "3", objective = "multi:softprob", silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
# of features: 8
niter: 50
nfeatures : 8
xNames : Age GenderMale Engineer1 MBA1 Work.Exp Salary Distance license1
problemType : Classification
tuneValue :
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
1 50 1 0.3 0 0.6 1 1
obsLevels : 2Wheeler Car Public Transport
param :
list()

Let us predict using the test dataset

predictions_xgb<-predict(carsxgb,cars_test)
confusionMatrix(predictions_xgb,cars_test$Transport)

> predictions_xgb<-predict(carsxgb,cars_test)
> confusionMatrix(predictions_xgb,cars_test$Transport)
Confusion Matrix and Statistics

Reference
Prediction 2Wheeler Car Public Transport
2Wheeler 5 0 4
Car 2 17 2
Public Transport 17 1 84

Overall Statistics

Accuracy : 0.803
95% CI : (0.7249, 0.8671)
No Information Rate : 0.6818
P-Value [Acc > NIR] : 0.001323

Kappa : 0.5514


Mcnemar's Test P-Value : 0.015591

Statistics by Class:

Class: 2Wheeler Class: Car Class: Public Transport


Sensitivity 0.20833 0.9444 0.9333
Specificity 0.96296 0.9649 0.5714
Pos Pred Value 0.55556 0.8095 0.8235
Neg Pred Value 0.84553 0.9910 0.8000
Prevalence 0.18182 0.1364 0.6818
Detection Rate 0.03788 0.1288 0.6364
Detection Prevalence 0.06818 0.1591 0.7727
Balanced Accuracy 0.58565 0.9547 0.7524

The accuracy is 80%.

A.10 Bagging

# Bagging

library(ipred)
library(rpart)

cars.bagging <- bagging(CarUsage ~.,
                        data=carsdatatrain,
                        control=rpart.control(maxdepth=5, minsplit=4))

# CarUsage is a factor here, so predict() returns class labels directly
carsdatatest$pred.CarUsage <- predict(cars.bagging, carsdatatest)
confusionMatrix(data=factor(carsdatatest$pred.CarUsage),
                reference=factor(carsdatatest$CarUsage),
                positive='1')


A.12 Actionable Insights and Recommendations

We see that all three models predict Public Transport for the two test samples.

Let us summarize the conclusions from the analysis and models for an employee's decision whether to use a car or not:

 Important variables are Age, Work.Exp, Distance and license
 Age and Work.Exp are correlated, hence we could use either one (we prefer Work.Exp here)
 Employees with work experience of 10 years and above are likely to use a car
 Employees who must commute a distance greater than 12 are more likely to prefer a car
 With license, we see that about 79% of those who commute by car have a license and about 89% of those who commute by public transport do not. But surprisingly, about 73% of 2-wheeler users do not have a license.
 Again, people with higher salaries (above 20) are likely to use cars
