
MACHINE LEARNING PROJECT

CHOICE OF EMPLOYEE MODE OF TRANSPORT

MALAVIKA DUTTA
PGP BABI
BUSINESS PROBLEM:
The main objective of this case study is to understand which mode of transport employees
prefer for commuting to their office.
Many factors play a vital role in the choice of transport, such as:
o Monthly salary
o Expenses
o Work experience
o Distance
o Age
o Qualification
In this case study we will therefore try to understand, using machine learning models,
which factors influence an employee's decision to use a car as their preferred means of
transport.

#IMPORTING DATA
cars=read.csv('D:/data_science/machine_learning/cars.csv',header=TRUE)
View(cars)
#EXPLORATORY DATA ANALYSIS
str(cars)
#CONVERTING TO CATEGORICAL VARIABLES
cars$Engineer=as.factor(cars$Engineer)
cars$MBA=as.factor(cars$MBA)
cars$license=as.factor(cars$license)
str(cars)
#DATA CLEANING
#CHECKING FOR MISSING VALUES
sum(is.na(cars))
#REMOVING MISSING VALUES
cars<-na.omit(cars)
cars<-as.data.frame(cars)

#CONVERTING GENDER TO BINARY


cars$Gender<-ifelse(cars$Gender == "Female",1,0)
cars$Gender<-as.numeric(cars$Gender)
head(cars)
UNIVARIATE AND BIVARIATE ANALYSIS
#PLOTTING THE DATA FOR VISUAL ANALYSIS
install.packages("dplyr")
library(dplyr)
install.packages("ggplot2")
library(ggplot2)
cars %>%
ggplot(aes(x=Age,fill=Age))+
geom_bar()
#Transport is categorical, so coerce it to numeric codes before hist()
hist(as.numeric(as.factor(cars$Transport)))

Conclusion:
In the coded histogram, 1 is 2Wheeler, 2 is Car and 3 is Public Transport.
Out of 444 employees, 83 travel by two-wheeler, 61 by car and 300 by public transport.
So cars are the least-used mode of transport.

hist(cars$Age)
hist(as.numeric(cars$Engineer))
hist(as.numeric(cars$MBA))
hist(cars$WorkExp,col="red",main="distribution of work exp")

Conclusion:
The distribution is right-skewed: there are more employees with little work experience
than highly experienced ones.

hist(cars$Salary)
hist(cars$Distance)
hist(as.numeric(cars$license))

Conclusion:
The number of employees without a license is higher than the number with a license.

#BIVARIATE ANALYSIS

boxplot(cars$Age~cars$Engineer,main="Age vs Eng")

boxplot(cars$Age~cars$MBA,main="age vs MBA")
#checking the average difference in salary between the two professions

boxplot(cars$Salary ~cars$Engineer, main = "Salary vs Eng.")

boxplot(cars$Salary ~cars$MBA, main = "Salary vs MBA.")

Conclusion: There is not much difference between the salaries of engineers and MBAs.

boxplot(cars$Salary~cars$Transport, main="Salary vs Transport")


cor(cars$Age, cars$WorkExp)

boxplot(cars$Age~cars$Transport, main="Age vs Transport")

boxplot(cars$Distance~cars$Transport, main="Distance vs Transport")

table(cars$license,cars$Transport)

Our main interest, as per the problem statement, is to understand the factors influencing
car usage. Hence we will code public transport and two-wheelers as 0 and cars as 1, and
examine the proportion of car users accordingly.

#CONVERTING Transport TO BINARY (1 = Car, 0 = other modes)
cars$Transport<-ifelse(cars$Transport == "Car",1,0)
cars$Transport<-as.numeric(cars$Transport)
table(cars$Transport)
0 1
383 61
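
To quantify the imbalance, the class shares can be checked directly (a small illustrative
addition, not in the original script):

#cars make up roughly 14% of the observations (61/444)
prop.table(table(cars$Transport))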

#CHECKING MULTICOLLINEARITY
plot(cars)

Conclusion:
o Age is highly correlated with WorkExp and Distance
o WorkExp is highly correlated with Distance
o WorkExp is also correlated with Salary and Age

#convert the factor columns back to numeric so they can enter the
#correlation and VIF computations below
cars$Engineer<-as.numeric(cars$Engineer)
cars$MBA<-as.numeric(cars$MBA)
cars$license<-as.numeric(cars$license)
cars$Gender<-as.numeric(cars$Gender)

install.packages("usdm")
library(usdm)
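
The usdm package loaded above is not used anywhere else in the script; a minimal sketch of
how it can quantify the collinearity seen in the pairs plot, assuming the first eight
columns are the numeric predictors:

#VIF per predictor; large values confirm the Age/WorkExp/Salary overlap
usdm::vif(cars[,1:8])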

#CHECKING OUTLIERS
boxplot(cars$Age)
boxplot(cars$Salary)
boxplot(cars$Distance)
#REMOVING OUTLIERS
#cap each variable at (approximately) its 95th percentile
quantile(cars$Age,c(0.95))
cars$Age[which(cars$Age>38)]<-38
quantile(cars$Salary,c(0.95))
cars$Salary[which(cars$Salary>43)]<-43
quantile(cars$Distance,c(0.95))
cars$Distance[which(cars$Distance>17.89)]<-17.89
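
The same capping pattern repeats for each variable; a small helper (cap_at_p95 is an
illustrative name, not in the original script) makes the intent explicit:

#cap a numeric vector at its 95th percentile
cap_at_p95<-function(x){
  cutoff<-quantile(x,0.95)
  pmin(x,cutoff)
}
#e.g. cars$Age<-cap_at_p95(cars$Age)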
#CHANGING TARGET VARIABLE TO FACTOR
cars$Transport<-as.factor(cars$Transport)
table(cars$Transport)
0 1
383 61

We can see a huge disparity in the proportions of the target variable, so we will apply
SMOTE now that the target variable has been converted to a factor.
#APPLYING SMOTE
library(DMwR)
library(caret)
set.seed(42)
carsdata=SMOTE(Transport~.,cars)
summary(carsdata$Transport)
0 1
244 183
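
The call above relies on DMwR's defaults; the balanced counts (183 = 61 + 2x61 synthetic
minority cases, 244 = 2x122 sampled majority cases) correspond to spelling those defaults
out, roughly as follows:

#equivalent explicit call under DMwR's default settings (an assumption,
#since the original relies on the defaults implicitly)
carsdata=SMOTE(Transport~.,cars,perc.over=200,perc.under=200,k=5)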

#Splitting data into train and test
set.seed(100)
ind<-createDataPartition(y=carsdata$Transport,p=0.7,list=FALSE)
training<-carsdata[ind,]
table(training$Transport)
test<-carsdata[-ind,]
table(test$Transport)
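
createDataPartition samples within each class, which can be verified on both splits (an
illustrative check, not in the original):

#class proportions should be nearly identical in train and test
prop.table(table(training$Transport))
prop.table(table(test$Transport))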

#CHECKING CORRELATION
library(corrplot)
cars$Transport=as.integer(cars$Transport)
corrplot(cor(cars))

Conclusion:
o Age, work experience and salary are highly correlated.
o Age, work experience and salary are moderately correlated with distance and license.
Since the correlations alone do not tell us which variables predict the mode of
transport, we will perform logistic regression.
#LOGISTIC REGRESSION MODEL
#name the response directly in the formula so that "." excludes it from the predictors;
#note the model is fitted on the full SMOTE'd data, from which the train/test split is drawn
lgmodel<-glm(Transport~.,data=carsdata,family=binomial)
summary(lgmodel)
Call:
glm(formula = Transport ~ ., family = binomial, data = carsdata)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.15329 -0.00153 -0.00001 0.00179 1.32833

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -143.0556 33.0822 -4.324 1.53e-05 ***
Age 4.4554 1.0106 4.409 1.04e-05 ***
Gender 4.2940 1.3968 3.074 0.002110 **
Engineer 1.4398 1.2114 1.189 0.234616
MBA -1.1701 0.8967 -1.305 0.191911
WorkExp -2.7302 0.6417 -4.254 2.10e-05 ***
Salary 0.4608 0.1208 3.814 0.000137 ***
Distance 0.7937 0.2485 3.194 0.001405 **
license 5.0192 1.5167 3.309 0.000936 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 583.204 on 426 degrees of freedom
Residual deviance: 59.589 on 418 degrees of freedom
AIC: 77.589

Number of Fisher Scoring iterations: 10

We can see that Age, WorkExp, Salary and license are highly significant; Gender and
Distance are also significant at the 1% level.
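
Given that the Engineer and MBA coefficients are not significant, a natural follow-up (not
fitted in the original report) is a reduced model on the significant predictors only,
compared with the full model by AIC:

#refit on the significant predictors and compare with the full model
lgmodel2<-glm(Transport~Age+Gender+WorkExp+Salary+Distance+license,
              data=carsdata,family=binomial)
summary(lgmodel2)
AIC(lgmodel,lgmodel2)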


install.packages("VIF")
library(VIF)

20 | P a g e
vif(lgmodel)
lg_predict<-predict(lgmodel,test,type="response")
lg_predict
#CHECKING MODEL PERFORMANCE WITH ROC
library(ROCR)
#ROC ON TRAIN
predroc1=predict(lgmodel,newdata=training)
pred.lg1<-prediction(predroc1,training$Transport)
perf.lg1<-performance(pred.lg1,"tpr","fpr")
plot(perf.lg1,colorize=T)

as.numeric(performance(pred.lg1,"auc")@y.values)
0.9984779

#ROC ON TEST
predroc2=predict(lgmodel,newdata=test)
pred.lg2<-prediction(predroc2,test$Transport)
perf.lg2<-performance(pred.lg2,"tpr","fpr")
plot(perf.lg2,colorize=T)

as.numeric(performance(pred.lg2,"auc")@y.values)
0.9954338

#CONFUSION MATRIX ON TRAIN DATA


ctrain=predict(lgmodel,newdata=training[,-9],type="response")
tab1=table(training$Transport,ctrain>0.5)
sum(diag(tab1))/sum(tab1)

0.9633333

#CONFUSION MATRIX ON TEST DATA


ctest=predict(lgmodel,newdata=test[,-9],type="response")
tab2=table(test$Transport,ctest>0.5)
sum(diag(tab2))/sum(tab2)
0.9685039
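
Since the final conclusion refers to the sensitivity of the logistic model, caret's
confusionMatrix can report class-wise metrics under the same 0.5 cutoff; a minimal sketch:

#sensitivity/specificity for the logistic model on the test set
lg_class<-factor(ifelse(ctest>0.5,1,0),levels=levels(test$Transport))
confusionMatrix(lg_class,test$Transport)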

#K-S chart
#KS ON TRAIN
kstrain=max(attr(perf.lg1,'y.values')[[1]]-attr(perf.lg1,'x.values')[[1]])
kstrain
0.9611043

#KS ON TEST
kstest=max(attr(perf.lg2,'y.values')[[1]]-attr(perf.lg2,'x.values')[[1]])
kstest
0.9726027

#GINI INDEX
library(ineq)
#GINI FOR TRAIN
ginitrain=ineq(ctrain,type="Gini")
ginitrain
ginitrain

0.7306115
#GINI FOR TEST
ginitest=ineq(ctest,type="Gini")
ginitest

0.7193671
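
Note that ineq() measures inequality of the predicted probabilities themselves; the
ROC-based Gini usually reported for classifiers is 2 x AUC - 1, which can be checked
directly:

#ROC Gini from the test AUC (2*0.9954 - 1, roughly 0.99)
2*as.numeric(performance(pred.lg2,"auc")@y.values)-1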

#**********************************************************
CREATING A K-NN MODEL
#knn model
library(class)

#caret's classProbs=TRUE and metric="ROC" require factor levels that are valid
#R variable names, which "0"/"1" are not, so we tune on the default Accuracy
trcontrol<-trainControl(method="repeatedcv",number=10,repeats=3)

set.seed(222)
fit<-caret::train(Transport~.,method='knn',
                  trControl=trcontrol,
                  preProcess=c("center","scale"),
                  data=training,
                  tuneGrid=expand.grid(k=1:60))

#checking model performance


fit
plot(fit)
varImp(fit)

pred<-predict(fit,newdata=test)
pred
confusionMatrix(pred,test$Transport)

CREATING NAÏVE BAYES MODEL


The Naive Bayes classifier assumes that the presence of a feature in a class is unrelated
to the presence of any other feature in the same class: the posterior is proportional to
the class prior multiplied by the individual feature likelihoods,
P(class | x1, ..., xp) ∝ P(class) x P(x1 | class) x ... x P(xp | class).



install.packages('e1071')
library(e1071)
#note: usekernel is an argument of klaR::NaiveBayes / naivebayes::naive_bayes,
#not of e1071::naiveBayes, which silently ignores it
nbmodel<-naiveBayes(Transport~.,data=training)
nbmodel

#confusion matrix for train
prednb1=predict(nbmodel,newdata=training,type="class")
tablenb1=table(training$Transport,prednb1)
sum(diag(tablenb1))/sum(tablenb1)
0.9266667

#confusion matrix on test


prednb2=predict(nbmodel,newdata=test,type="class")
tablenb2=table(test$Transport,prednb2)
sum(diag(tablenb2))/sum(tablenb2)
[1] 0.8740157

#ROC ON TRAIN DATA
#score with the predicted class-1 probabilities against the true labels
nbprob1=predict(nbmodel,newdata=training,type="raw")[,2]
nbpredroc1=ROCR::prediction(nbprob1,training$Transport)
nbperf1=performance(nbpredroc1,"tpr","fpr")
plot(nbperf1,colorize=T)

as.numeric(performance(nbpredroc1,"auc")@y.values)
0.986358

#ROC ON TEST DATA
nbprob2=predict(nbmodel,newdata=test,type="raw")[,2]
nbpredroc2=ROCR::prediction(nbprob2,test$Transport)
nbperf2=performance(nbpredroc2,"tpr","fpr")
plot(nbperf2,colorize=T)

as.numeric(performance(nbpredroc2,"auc")@y.values)
0.9844828

#BAGGING
install.packages("gbm")
library(gbm)
install.packages("xgboost")
library("xgboost")
install.packages("ipred")
library(ipred)
install.packages("plyr")
library(plyr)
install.packages("rpart")
library(rpart)
mod.bagging<-bagging(Transport~.,data=training,
control=rpart.control(maxdepth=5,minsplit=4))
bag.pred<-predict(mod.bagging,test)
confusionMatrix(bag.pred,test$Transport)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 72  1
         1  1 53

Accuracy : 0.9843
95% CI : (0.9443, 0.9981)
No Information Rate : 0.5748
P-Value [Acc > NIR] : <2e-16

Kappa : 0.9678

Mcnemar's Test P-Value : 1

Sensitivity : 0.9863
Specificity : 0.9815
Pos Pred Value : 0.9863
Neg Pred Value : 0.9815
Prevalence : 0.5748
Detection Rate : 0.5669
Detection Prevalence : 0.5748
Balanced Accuracy : 0.9839

'Positive' Class : 0

CREATING A MODEL USING XGBOOST
#BOOSTING
features_train=as.matrix(training[,1:8])
#xgboost needs a numeric 0/1 label rather than a factor
label_train=as.numeric(as.character(training$Transport))
features_test=as.matrix(test[,1:8])
xgboostmodel=xgboost(
  data=features_train,
  label=label_train,
  eta=0.1,
  max_depth=5,
  min_child_weight=3,
  nrounds=10,
  #binary:logistic is the appropriate objective for a 0/1 target
  objective="binary:logistic",
  verbose=0,
  early_stopping_rounds=10)

#CONFUSION MATRIX
xgboosttest=predict(xgboostmodel,features_test)
tabxg=table(test$Transport,xgboosttest>0.5)
tabxg
   FALSE TRUE
 0    72    1
 1     4   50
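
For comparison with the earlier models, the accuracy implied by this table is
(72 + 50) / 127:

sum(diag(tabxg))/sum(tabxg)
0.9606299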
Conclusion:
This model classifies car users with very high accuracy (122 of the 127 test
observations are correct, about 96%). Bagging was slightly biased between the minority
and majority classes, whereas boosting gives a proper representation of both the majority
and minority classes.

MODEL COMPARISON AND CONCLUSION

The logistic regression, K-NN and Naïve Bayes models were all able to predict the
transport mode with very high accuracy and are more or less stable models.
With boosting, however, we can predict the choice of transport with near-perfect
accuracy. Logistic regression also showed high sensitivity.
The key step, however, is SMOTE for balancing the minority and majority classes; without
it our models would not be as accurate.

