67% found this document useful (3 votes)

1K views

Machine Learning Project

This document summarizes an analysis of employee commute data to predict which mode of transport employees will use. The analysis includes: 1) Exploratory data analysis including univariate, bivariate and outlier analysis to understand patterns in the age, salary, experience and other attributes between transport modes. 2) Checks for missing data and multicollinearity between variables. 3) Data preparation including handling imbalanced classes with SMOTE. 4) Application of machine learning algorithms like logistic regression, KNN, naive bayes and ensemble methods to predict employee transport mode. 5) Evaluation of model performance and selection of best model.

Uploaded by

Prasanna rs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

67% found this document useful (3 votes)

1K views

Machine Learning Project

Uploaded by

Prasanna rs

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

[Type here]

This document forms the Report to

predict whether or not an employee
will use Car as a mode of transport
using Machine Learning Algorithms.

Project -5
Machine Learning

R S Prashanti

pg. 0
Project-4 greatlearning

Table of Contents

No. Questions Ratings Pts

EDA - Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers
1. 7
and missing values and check the summary of the dataset

2. EDA - Illustrate the insights based on EDA 5

3. EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it. 3

4. Data Preparation (SMOTE) 10

5. Applying Logistic Regression & Interpret results. 3

6. Applying KNN Model & Interpret results 3

Applying Naïve Bayes Model & Interpret results (is it applicable here? comment and if
7 3
it is not applicable, how can you build an NB model in this case?)

8 Confusion matrix interpretation 3

9 Remarks on Model validation exercise <Which model performed the best> 3

10 Bagging 7.5

11 Boosting 7.5

12 Actionable Insights and Recommendations 5

All the questions are in same order as in the Table of contents

pg. 1
Project-4 greatlearning

Projective Objective

This project requires the understanding of what mode of transport employees prefers to commute to
their office. The attached data 'Cars.csv' includes employee information about their mode of transport
as well as their personal and professional details like age, salary, work exp. We need to predict whether
or not an employee will use Car as a mode of transport. Also, which variables are a significant predictor
behind this decision.

The following steps will be performed to predict the commute mode:

 Perform an EDA on the data

 Illustrate the insights based on EDA
 Check for Multicollinearity
 Data Preparation
 Prepare the data for analysis (SMOTE)
 Modeling
 Create multiple models and explore how each model perform using appropriate model
performance metrics
 KNN
 Naive
 Logistic Regression
 Apply both bagging and boosting modeling procedures to create 2 models and compare its
accuracy with the best model of the above step.

Note: Each question referred to the above table has been answered with the same nomenclature.

pg. 2
Project-4 greatlearning

Q1. Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and missing values and
check the summary of the dataset.

A1.

setwd("D:/users/Prashanti Sharma/Personal/PROJS/Pred-2")
library(readr)
data= read.csv("Diabetes.csv")

1.1 ##Understanding the data

variable.names(data)

[1] "Age" "Gender" "Engineer" "MBA" "Work.Exp" "Salary" "Distance"

[8] "license" "Transport"

str(data)

str(data)
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...

dim(data)

> dim(data)
[1] 444 9

View(data)

#basic data summary

Variables like Engineer, MBA and license has been read as numeric so should be converted to factors
first.

pg. 3
Project-4 greatlearning

summary(data)

> summary(data)
Age Gender Engineer MBA Work.Exp Salary
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. : 6.50
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.: 9.80
Median :27.00 NA's: 1 Median : 5.0 Median :13.60
Mean :27.75 Mean : 6.3 Mean :16.24
3rd Qu.:30.00 3rd Qu.: 8.0 3rd Qu.:15.72
Max. :43.00 Max. :24.0 Max. :57.00
Distance license Transport
Min. : 3.20 0:340 2Wheeler : 83
1st Qu.: 8.80 1:104 Car : 61
Median :11.00 Public Transport:300
Mean :11.32
3rd Qu.:13.43
Max. :23.40

 We can conclude that we have majority of Males approx.. 75%

# Checking null data

sapply(data,function(x) sum(is.na(x)))

> sapply(data,function(x) sum(is.na(x)))

Age Gender Engineer MBA Work.Exp Salary Distance license Transport
0 0 0 1 0 0 0 0 0

#Checking for Outliers

summary(data$Gender)
summary(data$Engineer)
summary(data$MBA)
summary(data$Work.Exp)
summary(data$Salary)
summary(data$Distance)
summary(data$license)

pg. 4
Project-4 greatlearning

summary(data$Transport)

summary(data$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 25.00 27.00 27.75 30.00 43.00
> summary(data$Gender)
Female Male
128 316
> summary(data$Engineer)
0 1
109 335
> summary(data$MBA)
0 1 NA's
331 112 1
> summary(data$Work.Exp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 3.0 5.0 6.3 8.0 24.0
> summary(data$Salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.50 9.80 13.60 16.24 15.72 57.00
> summary(data$Distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.20 8.80 11.00 11.32 13.43 23.40
> summary(data$license)
0 1
340 104
> summary(data$Transport)
2Wheeler Car Public Transport
83 61 300

#Checking for Missing Values

data.frame(data)
is.na(data)
sum(is.na(data))

> sum(is.na(data))
[1] 1

As stated earlier in the summary analysis one data point is missing which is apparent from the above.

#Univariate analysis

pg. 5
Project-4 greatlearning

Univariate analysis of all the continuous factors (numeric)

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(col="Blue")

pg. 6
Project-4 greatlearning

#Bivariate Analysis

boxplot(data$Age ~data$Engineer, main = "Age vs Eng.")

boxplot(data$Age ~data$MBA, main ="Age Vs MBA")

As expected not much of difference here, people for all qualifications and all work exp would be
employed in firm.

boxplot(data$Age ~data$Engineer, main = "Salary vs Eng.")

boxplot(data$Age ~data$MBA, main ="Salary vs MBA.")

pg. 7
Project-4 greatlearning

We do not see any appreciable difference in salary of Engs Vs Non-Engs or Mba vs Non-MBA’s
Also, mean salary for both MBA’s and Eng is around 16

hist(data$Work.Exp, col = "blue", main = "Distribution of work exp")

pg. 8
Project-4 greatlearning

This is skewed towards right, again this would be on expected lines as there would be more juniors then
seniors in any firm

table(data$license,data$Transport)

2Wheeler Car Public Transport

0 60 13 267
1 23 48 33

boxplot(data$Work.Exp ~ data$Gender)

Not much of difference between mean work experience in two genders, so population is equally
distributed for both male and females.

pg. 9
Project-4 greatlearning

Hypothesis Testing

1. Higher the salary more the chances of using car for commute.

boxplot(data$Salary~data$Transport, main="Salary vs Transport")

Plot clearly shows as salary increase, inclination of commuting by car is higher.

2. Again with age or work. Exp (Age and work exp would be collinear), propensity of using car
Increases.
cor(data$Age, data$Work.Exp)

> cor(data$Age, data$Work.Exp)

[1] 0.9322364

boxplot(data$Age~data$Transport, main="Age vs Transport")

pg. 10
Project-4 greatlearning

As was the case with salary, we could see clear demarcation in usage of transport. With lower age group
2-wheeler is preferable and with higher work exp car is preferred.

3. As distance increase employee, would prefer car for comfort and ease.

boxplot(data$Distance~data$Transport, main="Distance vs Transport")

There is a slight pattern that could be observed here. For greater distance car is preferred followed by 2-
wheeler and then public transport.

4. Females would prefer more of private transfer then public transport

table(data$Gender,data$Transport)

pg. 11
Project-4 greatlearning

table(data$Gender,data$Transport)

2Wheeler Car Public Transport

Female 38 13 77
Male 45 48 223

We could see that around 40 % of females use private transport and 10% use car compared to males wh
ere 15% prefers car and total of 30% uses private transport. Thus, even though percentage of car usage
is high but they are also high on public transport.

A.2 –Insights of EDA

 We can conclude that we have majority of Males approx.. 75%

 Similarly Engineers outnumber MBA’s
 Total number of engineers and MBA’s is greater then 444, hence possibly some of candidates ha
ve dual degree
 One of data point for MBA is missing
 Salary might have skewed distribution
 Again, public transport is most common mode of transportation
 We could see that around 40 % of females use private transport and 10% use car compared to m
ales where 15% prefers car and total of 30% uses private transport. Thus, even though percenta
ge of car usage is high but they are also high on public transport.
 There is a slight pattern that could be observed here. For greater distance car is preferred follow
ed by 2-wheeler and then public transport
 As was the case with salary, we could see clear demarcation in usage of transport. With lower ag
e group 2-wheeler is preferable and with higher work exp car is preferred.
 Again with age or work. Exp (Age and work exp would be collinear), propensity of using car
Increases.

A.3 Multicollinearity

library(corrplot)
data.matrix=cor(data)
corrplot(data.matrix)
library(ppcor)
pcor(data1, method = "pearson")

pg. 12
Project-4 greatlearning

A.4 SMOTE

## SMOTE

The columns Engineer,MBA and license need to be converted into factors

data$Engineer<-as.factor(data$Engineer)
data$MBA<-as.factor(data$MBA)
data$license<-as.factor( data$license)
data<-knnImputation(data)

data$CarUsage<-ifelse(data$Transport =='Car',1,0)
table(data$CarUsage)

sum(data$CarUsage == 1)/nrow(data)

data$CarUsage<-as.factor(data$CarUsage)

data$Engineer<-as.factor(data$Engineer)
> data$MBA<-as.factor(data$MBA)
> data$license<-as.factor( data$license)
> data<-knnImputation(data)
Warning message:
In knnImputation(data) :
No case has missing values. Stopping as there is nothing to do.
>
>
> data$CarUsage<-ifelse(data$Transport =='Car',1,0)
> table(data$CarUsage)

0 1
383 61
>
> sum(data$CarUsage == 1)/nrow(data)
[1] 0.1373874
>
> data$CarUsage<-as.factor(data$CarUsage)

The number of records for people travelling by car is in minority. Hence we need to use an appropriate
sampling method on the train data.

##Split the data into test and train

set.seed(400)

pg. 13
Project-4 greatlearning

carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)

carsdatatrain<-data[carindex,]
carsdatatest<-data[-carindex,]
prop.table(table(carsdatatrain$CarUsage))

> set.seed(400)
> carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)
> carsdatatrain<-data[carindex,]
> carsdatatest<-data[-carindex,]
> prop.table(table(carsdatatrain$CarUsage))

0 1
0.8621795 0.1378205

prop.table(table(carsdatatest$CarUsage))

> prop.table(table(carsdatatest$CarUsage))

0 1
0.8636364 0.1363636

carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]

## The train and test data have almost same percentage of cars usage as the base data
## Apply SMOTE on Training data set

library(DMwR)
attach(carsdatatrain)
carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
prop.table(table(carsdataSMOTE$CarUsage))

arsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)

> prop.table(table(carsdataSMOTE$CarUsage))

0 1
0.5 0.5

We now have an equal split in the data between car users and non car users. Let us proceed with
building the models ## Model Building We will use the Logistic regression method a model on the
SMOTE data to understand the factors influencing car usage. Since we have only limited variable, we will
use them all in model building

outcomevar<-'CarUsage'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)

pg. 14
Project-4 greatlearning

carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family =

"binomial",trControl = trainctrl)

> outcomevar<-'CarUsage'
> regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
> trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family = "b
inomial",trControl = trainctrl)
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(carsglm$finalModel)

> summary(carsglm$finalModel)

Call:
NULL

Deviance Residuals:
Min 1Q Median 3Q Max
-2.09669 -0.01064 0.00000 0.02798 2.11942

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -64.8739 18.4863 -3.509 0.000449 ***
Age 1.7911 0.5273 3.397 0.000681 ***
Work.Exp -0.2056 0.4347 -0.473 0.636175
Salary -0.1061 2.1572 -0.049 0.960759
Distance 0.6172 0.1957 3.154 0.001611 **
license1 1.5699 0.8530 1.840 0.065711 .
Engineer1 3.1162 1.2024 2.592 0.009554 **
MBA1 -3.4351 1.1784 -2.915 0.003557 **
GenderMale 1.6470 0.9453 1.742 0.081460 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 357.664 on 257 degrees of freedom

Residual deviance: 49.909 on 249 degrees of freedom
AIC: 67.909

Number of Fisher Scoring iterations: 9

pg. 15
Project-4 greatlearning

carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)

carglmcoeff<-exp(coef(carsglm$finalModel))
> write.csv(carglmcoeff,file = "Coeffs.csv")
> varImp(object = carsglm)
glm variable importance

Overall
Age 100.00
Distance 92.74
MBA1 85.60
Engineer1 75.94
license1 53.50
GenderMale 50.57
Work.Exp 12.66
Salary 0.00

plot(varImp(object = carsglm), main="Vairable Importance for Logistic Regression")

pg. 16
Project-4 greatlearning

From the model we see that Age and License are more significant. When we look at the odds and
probabilities table, we get to see that Increase in age by 1 year implies that thre is a 98% probability that
the employee will use a car. As expected , if the employee has a license, then it implies a 99% probability
that he/she will use a car. One lkah increase in salary increases the probability of car usage by 72% The
null deviance of this model is 357.664 and the residual deviance is 17.959. This yields a McFadden R
Sqaure o almost 0.94 yielding a very good fit. We get to see Accuracy and Kappa values are high We shall
do the prediction based on this model

#confusion matrix

carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")

confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')

plot(varImp(object = carsglm), main="Vairable Importance for Logistic Regression")

> carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")
> confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 111 3
1 3 15

Accuracy : 0.9545
95% CI : (0.9037, 0.9831)
No Information Rate : 0.8636
P-Value [Acc > NIR] : 0.0005559

Kappa : 0.807

Mcnemar's Test P-Value : 1.0000000

Sensitivity : 0.8333
Specificity : 0.9737
Pos Pred Value : 0.8333
Neg Pred Value : 0.9737
Prevalence : 0.1364
Detection Rate : 0.1136
Detection Prevalence : 0.1364
Balanced Accuracy : 0.9035

'Positive' Class : 1

pg. 17
Project-4 greatlearning

A.5 Applying Logistic Regression

1. Logistic Regression

### Model Building - Logistic regression

install.packages("MASS")
library(class)
model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)
summary(model)

model <-glm(Transport ~.,family=binomial(link='logit'),data=cars_train)

> summary(model)

Call:
glm(formula = Transport ~ ., family = binomial(link = "logit"),
data = cars_train)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.5733 0.1351 0.3372 0.5675 1.8440

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.48505 2.68180 -4.283 1.85e-05 ***
Age 0.52837 0.11253 4.696 2.66e-06 ***
GenderMale 1.52604 0.38092 4.006 6.17e-05 ***
Engineer1 0.33374 0.38788 0.860 0.38956
MBA1 0.64960 0.43553 1.492 0.13583
Work.Exp -0.07111 0.12017 -0.592 0.55403
Salary -2.55115 2.29565 -1.111 0.26644
Distance -0.15098 0.05167 -2.922 0.00348 **
license1 -2.32637 0.46451 -5.008 5.49e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 302.59 on 311 degrees of freedom

Residual deviance: 223.09 on 303 degrees of freedom
AIC: 241.09

Number of Fisher Scoring iterations: 6

The top three most relevant features are "Work.Exp", "Distance", "Salary " and "license" because of the
low p-values.

pg. 18
Project-4 greatlearning

Also, the AIC# Score is 241.09. This will be observed in subsequent stages when we refine the model.
The model having least AIC Score would be the most preferred and optimized one.

anova(model, train="Chisq")

anova(model, train="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Transport

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev

NULL 311 302.59
Age 1 26.1316 310 276.46
Gender 1 10.2970 309 266.16
Engineer 1 1.3586 308 264.80
MBA 1 0.8274 307 263.98
Work.Exp 1 2.6335 306 261.34
Salary 1 0.1789 305 261.16
Distance 1 11.1693 304 249.99
license 1 26.9018 303 223.09

## Cross Validation

fitted.results <- predict(model,newdata=train,type='response')

fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != train$class)
print(paste('Accuracy',1-misClasificError))

[1] "Accuracy 0.775925925925926"

## Decision Tree

library(rpart)
model2 <- rpart(Transport ~ Salary + Distance + license
+Work.Exp, data=cars_train, method="class")

plot(model2, uniform=TRUE,
main="Classification Tree for data")
text(model2, use.n=TRUE, all=TRUE, cex=.8)

pg. 19
Project-4 greatlearning

treePred <- predict(model2, cars_train, type = 'class')

table(treePred, cars_train$Transport)
mean(treePred==cars_train$Transport)

library(rpart)
> model2 <- rpart(Transport ~ Salary + Distance + license
+ +Work.Exp, data=cars_train, method="class")
>
> plot(model2, uniform=TRUE,
+ main="Classification Tree for data")
> text(model2, use.n=TRUE, all=TRUE, cex=.8)
> treePred <- predict(model2, cars_train, type = 'class')
> table(treePred, cars_train$Transport)

treePred 2Wheeler Car Public Transport

2Wheeler 24 0 10
Car 3 36 0
Public Transport 32 7 200
> mean(treePred==cars_train$Transport)
[1] 0.8333333

pg. 20
Project-4 greatlearning

A.6Applying KNN Model

#KNN

library(caret)
library(mlbench)
library(caret)

random <- createDataPartition(data$Transport, p=0.70, list=FALSE)

cars_train <- data[ random,]
cars_test <- data[-random,]

library(class)

trControl <- trainControl(method = "cv", number = 10)

fit.knn <- train(Transport ~ .,
+ method = "knn",
+ tuneGrid = expand.grid(k = 2:20),
+ trControl = trControl,
+ metric = "Accuracy",
+ preProcess = c("center","scale"),
+ data = cars_train)
fit.knn

k-Nearest Neighbors

312 samples
8 predict or
3 classes: '2Wheeler', 'Car', 'Public Transport'

Pre-processing: centered (8), scaled (8)

Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 281, 281, 280, 281, 280, 282, ...
Resampling results across tuning parameters:

k Accuracy Kappa
2 0.7365457 0.4543489
3 0.7855712 0.5248631
4 0.7629839 0.4800127
5 0.7828562 0.5081854
6 0.7734812 0.4905393
7 0.7634005 0.4624704
8 0.7408065 0.4118105
9 0.7534005 0.4199273
10 0.7536022 0.4116860
11 0.7598454 0.4168749
12 0.7662970 0.4266860

pg. 21
Project-4 greatlearning

13 0.7662970 0.4213708
14 0.7566129 0.3930122
15 0.7661895 0.4135919
16 0.7660887 0.4090611
17 0.7566129 0.3862387
18 0.7629637 0.3926229
19 0.7661895 0.4026549
20 0.7661895 0.3942178

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3

KNN_predictions = predict(fit.knn,cars_train)
table(KNN_predictions, cars_train$Transport)

KNN_predictions 2Wheeler Car Public Transport

2Wheeler 37 0 8
Car 0 35 2
Public Transport 22 8 200

KNN_predictions = predict(fit.knn,cars_test)
table(KNN_predictions, cars_test$Transport)

KNN_predictions 2Wheeler Car Public Transport

2Wheeler 9 0 11
Car 1 15 3
Public Transport 14 3 76

predict(fit.knn,carsTest)

[1] Public Transport Public Transport

Levels: 2Wheeler Car Public TransportA.2 Data Split

For classification problem, it is important to ensure that the train and test sets have approximately the
same percentage of samples of each target class. Hence, we will use stratified sampling

install.packages("caret")
library(caret)
set.seed(1234)
trainIndex = createDataPartition(class, p=0.7, list = FALSE, times = 1)
train.data = data[trainIndex, ]
test.data = data[-trainIndex,]

dim(train.data)
dim(test.data)

pg. 22
Project-4 greatlearning

prop.table(table(data$class))
prop.table(table(train.data$class))
prop.table(table(test.data$class))

> set.seed(1234)
> trainIndex = createDataPartition(class, p=0.7, list = FALSE, times = 1)
> train.data = data[trainIndex, ]
> test.data = data[-trainIndex,]
>
> dim(train.data)
[1] 538 9
> dim(test.data)
[1] 230 9
>
> prop.table(table(data$class))

0 1
0.6510417 0.3489583
> prop.table(table(train.data$class))

0 1
0.6728625 0.3271375
> prop.table(table(test.data$class))

0 1
0.6 0.4

A.7 Applying Naïve Bayes

Data Cleaning

Missing values

anyNA(data)

anyNA(data)
[1] TRUE

Finding out where the missing value is

data[!complete.cases(data), ]

data[!complete.cases(data), ]
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
145 28 Female 0 <NA> 6 13.7 9.4 0 Public Transport

pg. 23
Project-4 greatlearning

Use KNN means method to impute the missing value

library(DMwR)

data = knnImputation(data, 5)

Normalize continuous variables

# Model Building and Predictions

Naïve Bayes

library(e1071)
Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
Naive_Bayes_Model

library(e1071)
> Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
> Naive_Bayes_Model

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769

Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.11864 3.173624
Car 35.20930 3.211276
Public Transport 26.76190 2.952653

Gender
Y Female Male
2Wheeler 0.4915254 0.5084746
Car 0.1395349 0.8604651
Public Transport 0.2285714 0.7714286

Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627

pg. 24
Project-4 greatlearning

Car 0.1627907 0.8372093

Public Transport 0.2714286 0.7285714

MBA
Y 0 1
2Wheeler 0.7627119 0.2372881
Car 0.8604651 0.1395349
Public Transport 0.7142857 0.2857143

Work.Exp
Y [,1] [,2]
2Wheeler 4.000000 3.449138
Car 15.046512 4.690180
Public Transport 5.004762 3.181131

Salary
Y [,1] [,2]
2Wheeler 2.442295 0.4234003
Car 3.479695 0.4330174
Public Transport 2.520766 0.3208979

Distance
Y [,1] [,2]
2Wheeler 12.03051 3.178172
Car 15.07209 3.903559
Public Transport 10.11524 2.964633

license
Y 0 1
2Wheeler 0.7288136 0.2711864
Car 0.2093023 0.7906977
Public Transport 0.8857143 0.1142857

This gives us the rule or factors which can help us employees decision to use car or not.

General way to interpret this output is that for any factor variable say license we can say that 72% of
people without license use 2-wheeler and 27% with license.
For continuous variables for example distance we can say 2-wheeler is used by people for whom
commute distance is 11.9 with sd of 3.5.

#Prediction on the test dataset

NB_Predictions=predict(Naive_Bayes_Model,cars_test)
table(NB_Predictions,cars_test$Transport)

NB_Predictions=predict(Naive_Bayes_Model,cars_test)

pg. 25
Project-4 greatlearning

> table(NB_Predictions,cars_test$Transport)

NB_Predictions 2Wheeler Car Public Transport

2Wheeler 7 0 5
Car 0 14 3
Public Transport 17 4 82

# prediction for test sample

NB_Predictions=predict(Naive_Bayes_Model,carsTest)
NB_Predictions

Levels: 2Wheeler Car Public Transport

A.10 Boosting

#BOOSTING

install.packages("xgboost")

boostcontrol <- trainControl(number=10)

xgbGrid <- expand.grid(

eta = 0.3,
max_depth = 1,
nrounds = 50,
gamma = 0,
colsample_bytree = 0.6,
min_child_weight = 1, subsample = 1
)

carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,metric =

"Accuracy",method = "xgbTree")

carsxgb$finalModel

carsxgb <- train(Transport ~ .,cars_train,trControl = boostcontrol,tuneGrid = xgbGrid,metric = "Accuracy

",method = "xgbTree")
>
> carsxgb$finalModel
##### xgb.Booster
raw: 38.5 Kb
call:
xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
gamma = param$gamma, colsample_bytree = param$colsample_bytree,

pg. 26
Project-4 greatlearning

min_child_weight = param$min_child_weight, subsample = param$subsample),

data = x, nrounds = param$nrounds, num_class = length(lev),
objective = "multi:softprob")
params (as set within xgb.train):
eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsam
ple = "1", num_class = "3", objective = "multi:softprob", silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
# of features: 8
niter: 50
nfeatures : 8
xNames : Age GenderMale Engineer1 MBA1 Work.Exp Salary Distance license1
problemType : Classification
tuneValue :
nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
1 50 1 0.3 0 0.6 1 1
obsLevels : 2Wheeler Car Public Transport
param :
list()

Let us predict using the test dataset

predictions_xgb<-predict(carsxgb,cars_test)
confusionMatrix(predictions_xgb,cars_test$Transport)

> predictions_xgb<-predict(carsxgb,cars_test)
> confusionMatrix(predictions_xgb,cars_test$Transport)
Confusion Matrix and Statistics

Reference
Prediction 2Wheeler Car Public Transport
2Wheeler 5 0 4
Car 2 17 2
Public Transport 17 1 84

Overall Statistics

Accuracy : 0.803
95% CI : (0.7249, 0.8671)
No Information Rate : 0.6818
P-Value [Acc > NIR] : 0.001323

Kappa : 0.5514

pg. 27
Project-4 greatlearning

Mcnemar's Test P-Value : 0.015591

Statistics by Class:

Class: 2Wheeler Class: Car Class: Public Transport

Sensitivity 0.20833 0.9444 0.9333
Specificity 0.96296 0.9649 0.5714
Pos Pred Value 0.55556 0.8095 0.8235
Neg Pred Value 0.84553 0.9910 0.8000
Prevalence 0.18182 0.1364 0.6818
Detection Rate 0.03788 0.1288 0.6364
Detection Prevalence 0.06818 0.1591 0.7727
Balanced Accuracy 0.58565 0.9547 0.7524

The accuracy is 80%.

A.11 Bagging

# Bagging

library(ipred)
library(rpart)

German.bagging <- bagging(Transport ~.,

data=carsdatatrain,
control=rpart.control(maxdepth=5, minsplit=4))

carsdatatest$pred.Transport<- predict(German.bagging, carsdatatest)

carsdatatest$pred.Transport<- ifelse(carsdatatest$pred.Transport < 0.5,0,1)
confusionMatrix(data=factor(carsdatatest$pred.Transport),
reference=factor(carsdatatest$Transport),
positive='1')

pg. 28
Project-4 greatlearning

A.12 Actionable Insights and Recommendations

We see that all three models predict Public Transport for the two test samples

Let us summarize the conclusions from analysis and models for employee’s decision whether to use car
Or not:

 Important variables are Age, Work.Exp, Distance and License

 Age and Work.Exp are correlated hence we could use any one (prefer Work.Exp) here
 Hence employees with work exp of 10 and above are likely to use car
 Employees who must commute for distance greater than 12 are more likely to prefer car
 With license, we do see that 74% who commute through car have license and 89% who commut
e through bus don’t have. But surprisingly 72% without license use 2-wheeler.
 Again, people with higher salaries (>20) are likely to use cars

pg. 29

ML-2 Guided Project Report
No ratings yet
ML-2 Guided Project Report
63 pages
Business Report SMDM Project - Coded
No ratings yet
Business Report SMDM Project - Coded
27 pages
Arnab Chowdhury As1
No ratings yet
Arnab Chowdhury As1
12 pages
Flange 10 & 12 Inch
50% (2)
Flange 10 & 12 Inch
2 pages
New Wheels - Project - Report
No ratings yet
New Wheels - Project - Report
31 pages
Girish Chadha - 29th December 2022
100% (3)
Girish Chadha - 29th December 2022
35 pages
Extended Project FastKart SQLite MYSQL 1 1 PDF
No ratings yet
Extended Project FastKart SQLite MYSQL 1 1 PDF
5 pages
Clustering Clean Ads - Data
0% (1)
Clustering Clean Ads - Data
1,657 pages
Project Instructions 2
No ratings yet
Project Instructions 2
5 pages
DM Gopala Satish Kumar Business Report G8 DSBA
100% (2)
DM Gopala Satish Kumar Business Report G8 DSBA
26 pages
MRA Project Milestone2 PDF
100% (1)
MRA Project Milestone2 PDF
1 page
Week 7 Project Report 1 and 2
No ratings yet
Week 7 Project Report 1 and 2
10 pages
Harshini Week 8 Doc PDF
No ratings yet
Harshini Week 8 Doc PDF
10 pages
Application Form For Use of Library Internet Service: Questions 1-6
No ratings yet
Application Form For Use of Library Internet Service: Questions 1-6
5 pages
Science 4 WK 1 & 2
No ratings yet
Science 4 WK 1 & 2
12 pages
Machine Learning Project - Sapan Parikh
100% (1)
Machine Learning Project - Sapan Parikh
12 pages
Car Transport Machine Learning
89% (9)
Car Transport Machine Learning
28 pages
Machine Learning Solution
100% (1)
Machine Learning Solution
12 pages
Machine Learning (Project5) PDF
100% (2)
Machine Learning (Project5) PDF
13 pages
Business Report DSBA Data Mining Project - Part 2 Segmentation Using K-Means Clustering
No ratings yet
Business Report DSBA Data Mining Project - Part 2 Segmentation Using K-Means Clustering
28 pages
Sunira - Predictive Modeling
100% (1)
Sunira - Predictive Modeling
65 pages
Predictive Modelling Project Gloria Susan Raju 11 APR 2021 PDF
No ratings yet
Predictive Modelling Project Gloria Susan Raju 11 APR 2021 PDF
56 pages
FRA Business Report
100% (1)
FRA Business Report
21 pages
RACHIT MITTAL Capstone Project. Notes 2 PDF
No ratings yet
RACHIT MITTAL Capstone Project. Notes 2 PDF
39 pages
ML Quiz-2
No ratings yet
ML Quiz-2
5 pages
ML2 Easy Visa Project Business Report
100% (1)
ML2 Easy Visa Project Business Report
24 pages
Capstone-2 Market Basket Analysis Vinothkumar R
No ratings yet
Capstone-2 Market Basket Analysis Vinothkumar R
18 pages
Predictive Modelling - Linear Discriminant Analysis - Mentor Version - Jupyter Notebook
100% (1)
Predictive Modelling - Linear Discriminant Analysis - Mentor Version - Jupyter Notebook
25 pages
Data Mining Project PCA Report
100% (1)
Data Mining Project PCA Report
27 pages
Data Mining Project Report
100% (1)
Data Mining Project Report
98 pages
Machine Learning Business Report - Compress (AutoRecovered)
100% (3)
Machine Learning Business Report - Compress (AutoRecovered)
69 pages
Cart-Rf-ANN: Prepared by Muralidharan N
0% (1)
Cart-Rf-ANN: Prepared by Muralidharan N
16 pages
New Wheels Quarterly Business Report
No ratings yet
New Wheels Quarterly Business Report
20 pages
Data Mining Clustering PDF
No ratings yet
Data Mining Clustering PDF
15 pages
Project Predictive Modeling PDF
100% (1)
Project Predictive Modeling PDF
58 pages
Predictive Modeling Business Report Seetharaman Final Changes PDF
100% (1)
Predictive Modeling Business Report Seetharaman Final Changes PDF
28 pages
Tableau Questions
No ratings yet
Tableau Questions
2 pages
Project ML
100% (4)
Project ML
36 pages
Social Media Tourism - Capstone Project
No ratings yet
Social Media Tourism - Capstone Project
13 pages
Project Report Abhay PDF
100% (1)
Project Report Abhay PDF
20 pages
Business Report Problem 2
No ratings yet
Business Report Problem 2
10 pages
Fra Project Report-Bajaj Auto Ltd. Vs Hero Motocorp Ltd. (Group-X)
100% (1)
Fra Project Report-Bajaj Auto Ltd. Vs Hero Motocorp Ltd. (Group-X)
10 pages
Pranjal - Singh - 30.10.2022 SMDM PROJECT REPORT
No ratings yet
Pranjal - Singh - 30.10.2022 SMDM PROJECT REPORT
9 pages
VaibhavKumar Extendedproject PDF
100% (2)
VaibhavKumar Extendedproject PDF
10 pages
Predictive Modelling Project 2
100% (4)
Predictive Modelling Project 2
32 pages
SQL Quiz Results
No ratings yet
SQL Quiz Results
17 pages
Pranjal - Singh - 25.12.2022 - Data Mining Project
No ratings yet
Pranjal - Singh - 25.12.2022 - Data Mining Project
8 pages
Sandhya Assignment SQL
No ratings yet
Sandhya Assignment SQL
16 pages
Project Questions
No ratings yet
Project Questions
3 pages
Machine Learning Project: Problem 1
67% (3)
Machine Learning Project: Problem 1
26 pages
Problem 1:: Readingcsv PD Read - Excel (Readingcsv) Readingcsv Head
No ratings yet
Problem 1:: Readingcsv PD Read - Excel (Readingcsv) Readingcsv Head
18 pages
Marketing & Retail Analytics - Report - Part A
100% (2)
Marketing & Retail Analytics - Report - Part A
18 pages
Problem Statement1
No ratings yet
Problem Statement1
1 page
Lifi
100% (1)
Lifi
16 pages
Predictive Modelling Project - Business Report
100% (1)
Predictive Modelling Project - Business Report
23 pages
ML Quiz 2
No ratings yet
ML Quiz 2
1 page
Facebook Comment Volume Prediction
No ratings yet
Facebook Comment Volume Prediction
20 pages
Jupyter Notebook Project CART RF ANN
100% (1)
Jupyter Notebook Project CART RF ANN
41 pages
Quiz 3 Name: Kainat Iftikhar Reg# 2021630007 1. List Three Examples of Time Series Data. Time Series Data
No ratings yet
Quiz 3 Name: Kainat Iftikhar Reg# 2021630007 1. List Three Examples of Time Series Data. Time Series Data
2 pages
Great Learning DVT Final Project - Car Claims For Insurance
100% (1)
Great Learning DVT Final Project - Car Claims For Insurance
113 pages
SMDM - Project Report - Lakshmi
No ratings yet
SMDM - Project Report - Lakshmi
26 pages
Suresh-Rose Time Series Forecasting Project Report
100% (1)
Suresh-Rose Time Series Forecasting Project Report
75 pages
Project 4 - Cars-Datasets PDF
100% (2)
Project 4 - Cars-Datasets PDF
44 pages
Health Trends, Issues, and Concerns (Global Level)
No ratings yet
Health Trends, Issues, and Concerns (Global Level)
4 pages
Chapter 2 Economics
No ratings yet
Chapter 2 Economics
8 pages
Improvement Strategy and Substitutes For Strategy: From A Literature Review To A Proposed Framework For Future Research
No ratings yet
Improvement Strategy and Substitutes For Strategy: From A Literature Review To A Proposed Framework For Future Research
14 pages
BS en 81-73-2005
No ratings yet
BS en 81-73-2005
20 pages
Class VIII Subject Enrichment Activity
No ratings yet
Class VIII Subject Enrichment Activity
2 pages
Stealth Crew List PDF
No ratings yet
Stealth Crew List PDF
2 pages
(CBSE BOARD 2023) The Interview - English
No ratings yet
(CBSE BOARD 2023) The Interview - English
13 pages
Dredging Techniques
100% (1)
Dredging Techniques
34 pages
B.R.A.V.E. Ib
No ratings yet
B.R.A.V.E. Ib
10 pages
Phenolic Tolerances and Cross Reference
No ratings yet
Phenolic Tolerances and Cross Reference
3 pages
Light and Color
No ratings yet
Light and Color
28 pages
Handling of Test Item
No ratings yet
Handling of Test Item
4 pages
Three Years - William Wordsworth Answers
No ratings yet
Three Years - William Wordsworth Answers
1 page
02 Application-Notes - Determiantion of Moisture in A Protein Sample Using A Portable NIR Ins
No ratings yet
02 Application-Notes - Determiantion of Moisture in A Protein Sample Using A Portable NIR Ins
3 pages
George-J.-Borjas-We-Wanted-Workers(1)
No ratings yet
George-J.-Borjas-We-Wanted-Workers(1)
202 pages
MAE166A Homework 3 Solutuion
No ratings yet
MAE166A Homework 3 Solutuion
13 pages
Biomedical Photonics Handbook Volume 1 2nd Edition Vo-Dinh instant download
100% (1)
Biomedical Photonics Handbook Volume 1 2nd Edition Vo-Dinh instant download
56 pages
Lazarus LMath0 - 6
No ratings yet
Lazarus LMath0 - 6
173 pages
1 PB
No ratings yet
1 PB
3 pages
Learning Competencies: Predict Genotypes and Phenotypes of Parents and Offspring Using The Laws of Inheritance
No ratings yet
Learning Competencies: Predict Genotypes and Phenotypes of Parents and Offspring Using The Laws of Inheritance
33 pages
MGT 524 Individual Assignment Ethics in Research Draft
No ratings yet
MGT 524 Individual Assignment Ethics in Research Draft
8 pages
Mathematics & Filipino Kindergarten Quiz
No ratings yet
Mathematics & Filipino Kindergarten Quiz
1 page
1.2 Research Methods and Research Methodology
No ratings yet
1.2 Research Methods and Research Methodology
19 pages
MEELUX Product Introduction - EN
No ratings yet
MEELUX Product Introduction - EN
26 pages
2025-09-exam-p-syllabus
No ratings yet
2025-09-exam-p-syllabus
7 pages
Mentor Record Format
No ratings yet
Mentor Record Format
11 pages
Answers To Questions
No ratings yet
Answers To Questions
8 pages