Machine Learning Project: Choice of Employee Mode of Transport
MALAVIKA DUTTA
PGP BABI
BUSINESS PROBLEM:
The main objective of this case study is to understand which mode of transport employees
prefer for commuting to their office.
Many factors play a vital role in the choice of transport, such as:
o Monthly salary
o Expenses
o Work experience
o Distance
o Age
o Qualification
So in this case study we will use machine learning models to understand which factors
influence an employee's decision to use a car as their preferred means of transport.
#IMPORTING DATA
cars=read.csv('D:/data_science/machine_learning/cars.csv',header=TRUE)
View(cars)
#EXPLORATORY DATA ANALYSIS
str(cars)
#CONVERTING TO CATEGORICAL VARIABLES
cars$Engineer=as.factor(cars$Engineer)
cars$MBA=as.factor(cars$MBA)
cars$license=as.factor(cars$license)
str(cars)
#DATA CLEANING
#CHECKING FOR MISSING VALUES
sum(is.na(cars))
#REMOVING MISSING VALUES
cars<-na.omit(cars)
cars<-as.data.frame(cars)
Conclusion:
In the plot, 1 denotes two-wheeler, 2 denotes car and 3 denotes public transport.
Out of 444 employees, 83 travel by two-wheeler, 61 by car and 300 by public transport.
So the smallest percentage of employees uses a car.
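These counts can be reproduced directly from the data; a quick check added here, assuming the Transport column still holds the original mode labels:
#frequency of each transport mode
table(cars$Transport)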
hist(cars$Age)
hist(as.numeric(cars$Engineer))
hist(as.numeric(cars$MBA))
hist(cars$WorkExp,col="red",main="distribution of work exp")
Conclusion:
The distribution is right-skewed: there are more less-experienced employees than
experienced ones.
hist(cars$Salary)
hist(cars$Distance)
hist(as.numeric(cars$license))
Conclusion:
The number of people without a license is greater than the number with one.
#BIVARIATE ANALYSIS
boxplot(cars$Age~cars$Engineer,main="Age vs Eng")
boxplot(cars$Age~cars$MBA,main="age vs MBA")
#checking the average difference in salary for the two professions
boxplot(cars$Salary ~cars$Engineer, main = "Salary vs Eng.")
Conclusion: There is not much difference between the salaries of engineers and MBAs.
boxplot(cars$Age~cars$Transport, main="Age vs Transport")
table(cars$license,cars$Transport)
Our main interest, as per the problem statement, is to understand the factors influencing car
usage. Hence we will create a value of 0 for public transport and two-wheeler and 1 for car,
and try to understand the proportion of car users accordingly.
#CONVERTING Transport TO BINARY (1 = Car, 0 = two-wheeler or public transport)
cars$Transport<-ifelse(cars$Transport == "Car",1,0)
cars$Transport<-as.numeric(cars$Transport)
table(cars$Transport)
0 1
383 61
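The proportion of car users can be read off the same table; from the counts above, car users make up roughly 61/444 = 13.7% of the sample (a small check added for illustration):
#share of each class after the binary recoding
round(prop.table(table(cars$Transport)),3)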
#CHECKING MULTICOLLINEARITY
plot(cars)
Conclusion:
o Age is highly correlated with WorkExp and Distance.
o WorkExp is highly correlated with Distance.
o WorkExp is also correlated with Salary and Age.
#CONVERTING FACTORS TO NUMERIC FOR THE CORRELATION AND VIF CHECKS
cars$Engineer<-as.numeric(cars$Engineer)
cars$MBA<-as.numeric(cars$MBA)
cars$license<-as.numeric(cars$license)
cars$Gender<-as.numeric(cars$Gender)
install.packages("usdm")
library(usdm)
#CHECKING OUTLIERS
boxplot(cars$Age)
boxplot(cars$Salary)
boxplot(cars$Distance)
#CAPPING OUTLIERS AT THE 95TH PERCENTILE
quantile(cars$Age,c(0.95))
cars$Age[which(cars$Age>38)]<-38
quantile(cars$Salary,c(0.95))
cars$Salary[which(cars$Salary>43)]<-43
quantile(cars$Distance,c(0.95))
cars$Distance[which(cars$Distance>17.89)]<-17.89
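Since the same 95th-percentile capping is applied to three variables, a small helper function (illustrative, not part of the original script) makes the step reusable:
#cap a numeric vector at its 95th percentile
cap95<-function(x){
cutoff<-quantile(x,0.95)
x[x>cutoff]<-cutoff
x
}
#e.g. cars$Age<-cap95(cars$Age)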
#CHANGING TARGET VARIABLE TO FACTOR
cars$Transport<-as.factor(cars$Transport)
table(cars$Transport)
0 1
383 61
We can see a huge imbalance in the target variable, so we will apply SMOTE after converting
the Engineer, MBA and license variables back into factors, as sketched below.
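The conversion itself is not shown in this extract; a sketch of the step, assuming the three columns are simply recoded with as.factor before SMOTE:
#CONVERTING Engineer, MBA AND license BACK TO FACTORS (reconstructed step)
cars$Engineer<-as.factor(cars$Engineer)
cars$MBA<-as.factor(cars$MBA)
cars$license<-as.factor(cars$license)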
#APPLYING SMOTE
library(DMwR)
library(caret)
set.seed(42)
carsdata=SMOTE(Transport~.,cars)
summary(carsdata$Transport)
0 1
244 183
#Splitting data into train and test
set.seed(100)
ind<-createDataPartition(y=carsdata$Transport,p=0.7,list=FALSE)
training<-carsdata[ind,]
table(training$Transport)
test<-carsdata[-ind,]
table(test$Transport)
#CHECKING CORRELATION
library(corrplot)
cars$Transport=as.integer(cars$Transport)
#coercing every column to numeric codes so cor() also handles the factor columns
corrplot(cor(sapply(cars,as.numeric)))
Conclusion:
o Age, work experience and salary are highly correlated.
o Age, work experience and salary are moderately correlated with distance and license.
Since the correlations alone do not tell us which variables predict the mode of transport,
we will perform logistic regression.
#LOGISTIC REGRESSION MODEL
#note: the model is fitted on the full SMOTE'd data; the train/test splits are used for evaluation below
lgmodel<-glm(Transport~.,data=carsdata,family=binomial)
summary(lgmodel)
Call:
glm(formula = Transport ~ ., family = binomial, data = carsdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.15329 -0.00153 -0.00001 0.00179 1.32833
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -143.0556 33.0822 -4.324 1.53e-05 ***
Age 4.4554 1.0106 4.409 1.04e-05 ***
Gender 4.2940 1.3968 3.074 0.002110 **
Engineer 1.4398 1.2114 1.189 0.234616
MBA -1.1701 0.8967 -1.305 0.191911
WorkExp -2.7302 0.6417 -4.254 2.10e-05 ***
Salary 0.4608 0.1208 3.814 0.000137 ***
Distance 0.7937 0.2485 3.194 0.001405 **
license 5.0192 1.5167 3.309 0.000936 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
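Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret (an added check, not part of the original output):
#odds ratios: the multiplicative change in the odds of car usage per unit increase
round(exp(coef(lgmodel)),3)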
#VIF check on the fitted model (car::vif is used here, since usdm::vif expects a data frame rather than a model)
car::vif(lgmodel)
lg_predict<-predict(lgmodel,test,type="response")
lg_predict
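lg_predict contains predicted probabilities; classifying at an assumed 0.5 cutoff gives a quick confusion matrix on the test set (added for illustration):
#classify test observations at a 0.5 probability cutoff
lg_class<-as.factor(ifelse(lg_predict>0.5,1,0))
confusionMatrix(lg_class,test$Transport)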
#CHECKING MODEL PERFORMANCE WITH ROC
library(ROCR)
#ROC ON TRAIN
predroc1=predict(lgmodel,newdata=training)
pred.lg1<-prediction(predroc1,training$Transport)
perf.lg1<-performance(pred.lg1,"tpr","fpr")
plot(perf.lg1,colorize=T)
as.numeric(performance(pred.lg1,"auc")@y.values)
0.9984779
#ROC ON TEST
predroc2=predict(lgmodel,newdata=test)
pred.lg2<-prediction(predroc2,test$Transport)
perf.lg2<-performance(pred.lg2,"tpr","fpr")
plot(perf.lg2,colorize=T)
as.numeric(performance(pred.lg2,"auc")@y.values)
0.9954338
#K-S chart
#KS ON TRAIN
kstrain=max(attr(perf.lg1,'y.values')[[1]]-attr(perf.lg1,'x.values')[[1]])
kstrain
0.9611043
#KS ON TEST
kstest=max(attr(perf.lg2,'y.values')[[1]]-attr(perf.lg2,'x.values')[[1]])
kstest
0.9726027
#GINI INDEX
library(ineq)
#ctrain/ctest were not defined in this extract; reconstructed here as predicted probabilities
ctrain<-predict(lgmodel,newdata=training,type="response")
ctest<-predict(lgmodel,newdata=test,type="response")
#GINI FOR TRAIN
ginitrain=ineq(ctrain,type="Gini")
ginitrain
0.7306115
#GINI FOR TEST
ginitest=ineq(ctest,type="Gini")
ginitest
0.7193671
#**********************************************************
CREATING A K-NN MODEL
#knn model
library(class)
#caret requires class levels that are valid R variable names when classProbs=TRUE,
#and "0"/"1" are not, so the model is tuned on Accuracy here instead of ROC
trcontrol<-trainControl(method="repeatedcv",number=10,repeats=3)
set.seed(222)
fit<-caret::train(Transport~.,method='knn',
trControl=trcontrol,
preProcess=c("center","scale"),
data=training,
tuneGrid=expand.grid(k=1:60))
pred<-predict(fit,newdata=test)
pred
confusionMatrix(pred,test$Transport)
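Before trusting the test predictions, the k chosen by cross-validation can be inspected (an added check):
fit$bestTune #the value of k selected by repeated cross-validation
plot(fit) #performance across the full k grid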
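CREATING A NAÏVE BAYES MODEL
The code that builds the Naïve Bayes model is missing from this extract; a reconstruction consistent with the calls that follow, assuming e1071's naiveBayes on the SMOTE'd training set:
#NAIVE BAYES MODEL (reconstructed)
library(e1071)
nbmodel<-naiveBayes(Transport~.,data=training)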
#confusion matrix for train
prednb1=predict(nbmodel,newdata=training,type="class")
tablenb1=table(training$Transport,prednb1)
sum(diag(tablenb1))/sum(tablenb1)
0.9266667
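The ROCR prediction object used below is also missing from the extract; a reconstruction using the class-1 posterior probabilities (assumed):
#ROC ON TRAIN (reconstructed prediction object)
nbprob1=predict(nbmodel,newdata=training,type="raw")[,2]
nbpredroc1=prediction(nbprob1,training$Transport)
nbperf1=performance(nbpredroc1,"tpr","fpr")
plot(nbperf1,colorize=T)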
as.numeric(performance(nbpredroc1,"auc")@y.values)
0.986358
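Similarly for the test set (reconstructed):
#ROC ON TEST (reconstructed prediction object)
nbprob2=predict(nbmodel,newdata=test,type="raw")[,2]
nbpredroc2=prediction(nbprob2,test$Transport)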
nbperf2=performance(nbpredroc2,"tpr","fpr")
plot(nbperf2,colorize=T)
as.numeric(performance(nbpredroc2,"auc")@y.values)
0.9844828
#BAGGING
install.packages("gbm")
library(gbm)
install.packages("xgboost")
library("xgboost")
install.packages("ipred")
library(ipred)
install.packages("plyr")
library(plyr)
install.packages("rpart")
library(rpart)
mod.bagging<-bagging(Transport~.,data=training,
control=rpart.control(maxdepth=5,minsplit=4))
bag.pred<-predict(mod.bagging,test)
confusionMatrix(bag.pred,test$Transport)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 72  1
         1  1 53

               Accuracy : 0.9843
                 95% CI : (0.9443, 0.9981)
    No Information Rate : 0.5748
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.9678

            Sensitivity : 0.9863
            Specificity : 0.9815
         Pos Pred Value : 0.9863
         Neg Pred Value : 0.9815
             Prevalence : 0.5748
         Detection Rate : 0.5669
   Detection Prevalence : 0.5748
      Balanced Accuracy : 0.9839

       'Positive' Class : 0
CREATING A MODEL USING XGBOOST
#BOOSTING
#xgboost needs numeric matrices, so factor columns are coerced to numeric codes first
features_train=as.matrix(sapply(training[,1:8],as.numeric))
label_train=as.numeric(as.character(training$Transport))
features_test=as.matrix(sapply(test[,1:8],as.numeric))
xgboostmodel=xgboost(
data=features_train,
label=label_train,
eta=0.1,
max_depth=5,
min_child_weight=3,
nrounds=10,
objective="binary:logistic", #binary target, so a logistic objective is used in place of reg:linear
verbose=0,
early_stopping_rounds=10)
#CONFUSION MATRIX
xgboosttest=predict(xgboostmodel,features_test)
tabxg=table(test$Transport,xgboosttest>0.5)
tabxg
    FALSE TRUE
  0    72    1
  1     4   50
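Accuracy follows directly from the table: (72+50)/127 ≈ 0.96, computed here for reference:
#overall accuracy from the confusion table
sum(diag(tabxg))/sum(tabxg)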
Conclusion:
From the table above, the boosting model classifies car usage with roughly 96% accuracy,
representing both the majority and minority classes well; bagging performed comparably,
at about 98% accuracy on the test set.
The logistic regression, K-NN and Naïve Bayes models were also able to predict the
transport mode with very high accuracy and are more or less stable models. Logistic
regression additionally showed high sensitivity.
The key step, however, is SMOTE for balancing the minority and majority classes; without
it our models would not be nearly as accurate.