0% found this document useful (0 votes)

77 views35 pages

Machine Learning Project: Choice of Employee Mode of Transport

1. The document describes a machine learning project to understand the factors that influence an employee's choice of transportation to work using logistic regression and KNN models. 2. Logistic regression on age, work experience, salary, distance, and license was able to accurately predict transportation choice with over 99% AUC on both training and test data. 3. KNN modeling was also applied but logistic regression achieved better performance.

Uploaded by

Datapoint Mavericks

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

77 views35 pages

Machine Learning Project: Choice of Employee Mode of Transport

Uploaded by

Datapoint Mavericks

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 35

MACHINE LEARNING PROJECT

CHOICE OF EMPLOYEE MODE OF TRANSPORT

MALAVIKA DUTTA
PGP BABI
BUSINESS PROBLEM:
The main objective of the case study is to understand the mode of transport employees
would prefer to commute to their office.
There are many factors which plays a vital role in selecting their mode of transport like:
o Monthly salary
o Expenses
o Work experience
o Distance
o Age
o Qualification
So in this case study we will try to understand that what are the factors which influence
employees decision to use cars their favorable means of transport by using machine
learning model.

1|Page
#IMPORTING DATA
cars=read.csv('D:/data_science/machine_learning/cars.csv',header=TRUE)
View(cars)
#EXPLORATORY DATA ANALYSIS
str(cars)
#CONVERTING TO CATEGORICAL VARIABLES
cars$Engineer=as.factor(cars$Engineer)
cars$MBA=as.factor(cars$MBA)
cars$license=as.factor(cars$license)
str(cars)
#DATA CLEANING
#CHECKING FOR MISSING VALUES
sum(is.na(cars))
#REMOVING MISSING VALUES
cars<-na.omit(cars)
cars<-as.data.frame(cars)

#CONVERTING GENDER TO BINARY

2|Page
cars$Gender<-ifelse(cars$Gender == "Female",1,0)
cars$Gender<-as.numeric(cars$Gender)
head(cars)
UNIVARIATE AND BIVARIATE ANALYSIS
#PLOTTING THE DATA FOR VISUAL ANALYSIS
install.packages("dplyr")
library(dplyr)
install.packages("ggplot2")
library(ggplot2)
cars %>%
ggplot(aes(x=Age,fill=Age))+
geom_bar()
hist(cars$Transport)

3|Page
Conclusion:
1 is for 2wheeler,2 is for cars and 3 is for public transport
Out of 444 employees,83 travel by two wheeler,61 by car and 300 by public transport
So min percentage of employees use cars

4|Page
hist(cars$Age)
hist(as.numeric(cars$Engineer))
hist(as.numeric(cars$MBA))
hist(cars$WorkExp,col="red",main="distribution of work exp")

5|Page
Conclusion:
Its right skewed as such there are more less experienced compared to experienced
employees

6|Page
hist(cars$Salary)
hist(cars$Distance)
hist(as.numeric(cars$license))

Conclusion:

7|Page
No of people without license is more compared to with license

#BIVARIATE ANALYSIS

boxplot(cars$Age~cars$Engineer,main="Age vs Eng")

8|Page
boxplot(cars$Age~cars$MBA,main="age vs MBA")
#checking the avg difference in salary for two profession

9|Page
boxplot(cars$Salary ~cars$Engineer, main = "Salary vs Eng.")

boxplot(cars$Salary ~cars$MBA, main = "Salary vs MBA.")

10 | P a g e
Conclusion:Not much difference between salary of engineers and salary of mba.

boxplot(cars$Salary~cars$Transport, main="Salary vs Transport")

cor(cars$Age, cars$WorkExp)

11 | P a g e
boxplot(cars$Age~cars$Transport, main="Age vs Transport")

boxplot(cars$Distance~cars$Transport, main="Distance vs Transport")

12 | P a g e
table(cars$license,cars$Transport)

Our main interest as per the problem is to understand the factors influencing car
usage.Hencewe will create a value 0 for public transport and 2 wheeler and 1 for car and
try to understand the proportion of cars in transport mode accordingly.

13 | P a g e
#CONVERTING transport TO BINARY for making it categorical
cars$Transport<-ifelse(cars$Transport == "Car",1,0)
cars$Transport<-as.numeric(cars$Transport)
table(cars$Transport)
0 1
383 61

#CHECKING MULTICOLLEANRITY
plot(cars)

14 | P a g e
Conclusion:
o Age is highly correlated to work_Exp and dist
o Work_exp is highly correlated to Distance
o Work_exp is also correlated to salary and Age

15 | P a g e
cars$Engineer<-as.numeric(cars$Engineer)
cars$MBA<-as.numeric(cars$MBA)
cars$license<-as.numeric(cars$license)
cars$Gender<-as.numeric(cars$Gender)

install.packages("usdm")
library(usdm)

#CHECKING OUTLIERS
boxplot(cars$Age)
boxplot(cars$Salary)
boxplot(cars$Distance)
#REMOVING OUTLIERS
quantile(cars$Age,c(0.95))
cars$Age[which(cars$Age>38)]<-38
quantile(cars$Salary,c(0.95))
16 | P a g e
cars$Salary[which(cars$Salary>43)]<-43
quantile(cars$Distance,c(0.95))
cars$Distance[which(cars$Distance>17.89)]<-17.89
#CHANGING TARGET VARIABLE TO FACTOR
cars$Transport<-as.factor(cars$Transport)
table(cars$Transport)
0 1
383 61

We can see a huge disparity in the proportion of the target variable so we will be applying
SMOTE after we have converted engineer,MBA and license variables into factor variables.
#APPLYING SMOTE
library(DMwR)
library(caret)
set.seed(42)
carsdata=SMOTE(Transport~.,cars)
summary(carsdata$Transport)
0 1
244 183

17 | P a g e
#Splitting data into train and test
set.seed(100)
ind<-createDataPartition(y=carsdata$Transport,p=0.7,list=FALSE)
training<-carsdata[ind,]
table(training$Transport)
test<-carsdata[-ind,]
table(test$Transport)

#CHECKING CORRELATION
cars$Transport=as.integer(cars$Transport)
corrplot(cor(cars))

18 | P a g e
Conclusion:
o Age ,work experience and salary are highly correlated.
o Age ,work experience and salary are moderately correlated to distance and license.
Since we are unable to identify the variables from which we can predict the mode of
transport,we will perform logistic regression.
19 | P a g e
#LOGISTIC REGRESSION MODEL
lgmodel<-glm(carsdata$Transport~.,data=carsdata,family=binomial)
summary(lgmodel)
Call:
glm(formula = carsdata$Transport ~ ., family = binomial, data = carsdata)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.15329 -0.00153 -0.00001 0.00179 1.32833

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -143.0556 33.0822 -4.324 1.53e-05 ***
Age 4.4554 1.0106 4.409 1.04e-05 ***
Gender 4.2940 1.3968 3.074 0.002110 **
Engineer 1.4398 1.2114 1.189 0.234616
MBA -1.1701 0.8967 -1.305 0.191911
WorkExp -2.7302 0.6417 -4.254 2.10e-05 ***
Salary 0.4608 0.1208 3.814 0.000137 ***
Distance 0.7937 0.2485 3.194 0.001405 **
license 5.0192 1.5167 3.309 0.000936 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 583.204 on 426 degrees of freedom

Residual deviance: 59.589 on 418 degrees of freedom
AIC: 77.589

Number of Fisher Scoring iterations: 10

We can see age,workexp,salary and license are highly significant.

install.packages("VIF")
library(VIF)

20 | P a g e
vif(lgmodel)
lg_predict<-predict(lgmodel,test,type="response")
lg_predict
#CHECKING MODEL PERFORMANCE WITH ROC
library(ROCR)
#ROC ON TRAIN
predroc1=predict(lgmodel,newdata=training)
pred.lg1<-prediction(predroc1,training$Transport)
perf.lg1<-performance(pred.lg1,"tpr","fpr")
plot(perf.lg1,colorize=T)

21 | P a g e
as.numeric(performance(pred.lg,"auc")@y.values)
0.9984779

#ROC ON TEST
predroc2=predict(lgmodel,newdata=test)

22 | P a g e
pred.lg2<-prediction(predroc2,test$Transport)
perf.lg2<-performance(pred.lg2,"tpr","fpr")
plot(perf.lg2,colorize=T)

as.numeric(performance(pred.lg2,"auc")@y.values)

23 | P a g e
0.9954338

#CONFUSION MATRIX ON TRAIN DATA

ctrain=predict(lgmodel,newdata=training[,-9],type="response")
tab1=table(training$Transport,ctrain>0.5)
sum(diag(tab1))/sum(tab1)

0.9633333

#CONFUSION MATRIX ON TEST DATA

ctest=predict(lgmodel,newdata=test[,-9],type="response")
tab2=table(test$Transport,ctest>0.5)
sum(diag(tab2))/sum(tab2)
0.9685039

#K-S chart
#KS ON TRAIN
kstrain=max(attr(perf.lg1,'y.values')[[1]]-attr(perf.lg1,'x.values')[[1]])

24 | P a g e
kstrain
0.9611043

#KS ON TEST
kstest=max(attr(perf.lg2,'y.values')[[1]]-attr(perf.lg2,'x.values')[[1]])
kstest
0.9726027

#GINI INDEX
#GINI FOR TRAIN
ginitrain=ineq(ctrain,type="Gini")
ginitrain

0.7306115
#GINI FOR TEST
ginitest=ineq(ctest,type="Gini")
ginitest

0.7193671

25 | P a g e
#**********************************************************
CREATING A K-NN MODEL
#knn model
library(class)

trcontrol<-trainControl(method="repeatedcv",number=10,repeats=3,
classProbs=TRUE)

set.seed(222)
fit<-caret::train(Transport~.,method='knn',
trControl=trcontrol,
preProcess=c("center","scale"),
data=training,
metric="ROC",
tuneGrid=expand.grid(k=1:60))

#checking model performance

26 | P a g e
fit
plot(fit)
varImp(fit)

pred<-predict(fit,newdata=test)
pred
confusionMatrix(pred,test$Transport)

CREATING NAÏVE BAYES MODEL

Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to the
presence of any other feature in the same class.

#NAIVE BAYES ~,,training,family=binomial)

install.packages('e1071')
library(e1071)
#USEKERNEL FOR REDUCING MISCLASSIFICATION
nbmodel<-naiveBayes(Transport~.,data=training,usekernel=T)
nbmodel

27 | P a g e
#confusion matrix for train
prednb1=predict(nbmodel,newdata=training,type="class")
tablenb1=table(training$Transport,prednb1)
sum(diag(tablenb1))/sum(tablenb1)
0.9266667

#confusion matrix on test

prednb2=predict(nbmodel,newdata=test,type="class")
tablenb2=table(test$Transport,prednb2)
sum(diag(tablenb2))/sum(tablenb2)
[1] 0.8740157

#ROC ON TRAIN DATA

nbpredroc1=ROCR::prediction(training[,1],prednb1)
nbperf1=performance(nbpredroc1,"tpr","fpr")
plot(nbperf1,colorize=T)

28 | P a g e
as.numeric(performance(nbpredroc1,"auc")@y.values)
0.986358

#ROC ON TEST DATA

nbpredroc2=ROCR::prediction(test[,1],prednb2)

29 | P a g e
nbperf2=performance(nbpredroc2,"tpr","fpr")
plot(nbperf2,colorize=T)

as.numeric(performance(nbpredroc2,"auc")@y.values)
0.9844828

30 | P a g e
#BAGGING
install.packages("gbm")
library(gbm)
install.packages("xgboost")
library("xgboost")
install.packages("ipred")
library(ipred)
install.packages("plyr")
library(plyr)
install.packages("rpart")
library(rpart)
mod.bagging<-bagging(Transport~.,data=training,
control=rpart.control(maxdepth=5,minsplit=4))
bag.pred<-predict(mod.bagging,test)
confusionMatrix(bag.pred,test$Transport)
Confusion Matrix and Statistics

Reference
Prediction 0 1

31 | P a g e
0 72 1
1 1 53

Accuracy : 0.9843
95% CI : (0.9443, 0.9981)
No Information Rate : 0.5748
P-Value [Acc > NIR] : <2e-16

Kappa : 0.9678

Mcnemar's Test P-Value : 1

Sensitivity : 0.9863
Specificity : 0.9815
Pos Pred Value : 0.9863
Neg Pred Value : 0.9815
Prevalence : 0.5748
Detection Rate : 0.5669
Detection Prevalence : 0.5748
Balanced Accuracy : 0.9839

'Positive' Class : 0

>
CREATING A MODEL USING XGBOOST
#BOOSTING
features_train=as.matrix(training[,1:8])
label_train=as.matrix(training[,9])
features_test=as.matrix(test[,1:8])
xgboostmodel=xgboost(
data=features_train,

32 | P a g e
label=label_train,
eta=0.1,
max_depth=5,
min_child_weight=3,
nrounds=10,
nfold=5,
objective="reg:linear",
verbose=0,
early_stopping_rounds = 10)

#CONFUSION MATRIX
xgboosttest=predict(xgboostmodel,features_test)
tabxg=table(test$Transport,xgboosttest>0.5)
tabxg
FALSE TRUE
0 72 1
1 4 50
Conclusion:

33 | P a g e
This model is showing almost a 100% accuracy that the customers are using cars.Bagging
was biased towards minority and majority classes but boosting is a proper representation
of both majority and minority classes.

MODEL COMPARISON AND CONCLUSION

Logistic regression model,K-NN and Naïve Bayes models were all able to predict the
transport mode with very high accuracy and are more or less stable models.
However boosting we can predict the choice of transport with 100% accuracy.Logistic
regression also showed a high sensitivity.
The key aspect is however SMOTE for balancing minorty and majority class,without which
our models will not be so accurate.

34 | P a g e

Huawei ISO9001 ISO14001
No ratings yet
Huawei ISO9001 ISO14001
30 pages
SAMPLE - Hold Departure Order
100% (1)
SAMPLE - Hold Departure Order
4 pages
Amrapali Adarsh Awas Yojna Automated - Brochure
No ratings yet
Amrapali Adarsh Awas Yojna Automated - Brochure
5 pages
Machine Learning Project
67% (3)
Machine Learning Project
30 pages
Machine Learning Project On Cars
92% (13)
Machine Learning Project On Cars
22 pages
Facing The Interview
100% (1)
Facing The Interview
331 pages
Machine Learning VIVEK
80% (5)
Machine Learning VIVEK
118 pages
Reference SAP FICO - CFIN Resume
No ratings yet
Reference SAP FICO - CFIN Resume
2 pages
Trubend 5000 Series Retrofitting Guide
No ratings yet
Trubend 5000 Series Retrofitting Guide
18 pages
Project 4 - Cars-Datasets PDF
100% (2)
Project 4 - Cars-Datasets PDF
44 pages
SJLT Book Final
No ratings yet
SJLT Book Final
164 pages
Surefield v. Redfin
No ratings yet
Surefield v. Redfin
30 pages
Machine Learning Assignment Report - Cars
100% (4)
Machine Learning Assignment Report - Cars
42 pages
Analysis of Transport Choice of Employees - A Project On Machine Learning
100% (10)
Analysis of Transport Choice of Employees - A Project On Machine Learning
24 pages
Check
No ratings yet
Check
152 pages
Car Transport Machine Learning
89% (9)
Car Transport Machine Learning
28 pages
Machine Learning
100% (2)
Machine Learning
30 pages
TCW Module 2
No ratings yet
TCW Module 2
12 pages
How Are IKEA Mattresses Packaged 800 S4
No ratings yet
How Are IKEA Mattresses Packaged 800 S4
2 pages
Overall Structural Fabrication List Prestige - 18!03!2025
No ratings yet
Overall Structural Fabrication List Prestige - 18!03!2025
55 pages
Questionnaire For Financial Audit
No ratings yet
Questionnaire For Financial Audit
2 pages
Machine Learning
100% (1)
Machine Learning
33 pages
SPE-121928 Field Development and Productivity Improvement in ... Synergistic Approach To Carbonate Fracture Acidizing
No ratings yet
SPE-121928 Field Development and Productivity Improvement in ... Synergistic Approach To Carbonate Fracture Acidizing
10 pages
2023 LSE MY474 Applied Machine Learning Social Science, Lecture2
No ratings yet
2023 LSE MY474 Applied Machine Learning Social Science, Lecture2
86 pages
Cther Crowther Marketing Event Outcomes Tactical Strategic
No ratings yet
Cther Crowther Marketing Event Outcomes Tactical Strategic
15 pages
Fauske-Presentation by Knoechel Wood
No ratings yet
Fauske-Presentation by Knoechel Wood
26 pages
Linear Regression: 1 1 N N I I I D I I
No ratings yet
Linear Regression: 1 1 N N I I I D I I
20 pages
Rv4 65d r5 Product Specifications
No ratings yet
Rv4 65d r5 Product Specifications
6 pages
Swagatam Letter
No ratings yet
Swagatam Letter
5 pages
Why Use MPC Based Contact For - Bonded - Connections
No ratings yet
Why Use MPC Based Contact For - Bonded - Connections
5 pages
Business Environment Notes
No ratings yet
Business Environment Notes
4 pages
ML - Extended Project Business Report-Richa
No ratings yet
ML - Extended Project Business Report-Richa
32 pages
White Label Atm
No ratings yet
White Label Atm
5 pages
A630 A640 Advanced - Manual.en
No ratings yet
A630 A640 Advanced - Manual.en
147 pages
Emergency
No ratings yet
Emergency
5 pages
DuPont Analysis
No ratings yet
DuPont Analysis
2 pages
Machine Learning Project - Sapan Parikh
100% (1)
Machine Learning Project - Sapan Parikh
12 pages
The Short-Run Trade-Off Between Inflation and Unemployment (Chapter 22, N. Gregory Mankiw, "Principles of Macroeconomics")
No ratings yet
The Short-Run Trade-Off Between Inflation and Unemployment (Chapter 22, N. Gregory Mankiw, "Principles of Macroeconomics")
7 pages
Thera Bank
100% (1)
Thera Bank
25 pages
Cars Project PDF
No ratings yet
Cars Project PDF
9 pages
DVT Project
No ratings yet
DVT Project
35 pages
Machine Learning Extended Project - BrahmaChari
No ratings yet
Machine Learning Extended Project - BrahmaChari
29 pages
Sarkarire DH G Sult - Com-Mr-02-2023-Batch - Watermark
No ratings yet
Sarkarire DH G Sult - Com-Mr-02-2023-Batch - Watermark
2 pages
MKT4080-Codes
No ratings yet
MKT4080-Codes
9 pages
Brembo - P 50 067
No ratings yet
Brembo - P 50 067
4 pages
Machine Learning Solution
100% (1)
Machine Learning Solution
12 pages
Project - Machine Learning (E)
No ratings yet
Project - Machine Learning (E)
34 pages
Project 2
No ratings yet
Project 2
3 pages
Machine Learning Project - Parijat
No ratings yet
Machine Learning Project - Parijat
26 pages
Assignment ML
100% (2)
Assignment ML
21 pages
Factory Mutual LPDS 1-49
100% (3)
Factory Mutual LPDS 1-49
25 pages
Project 5 Surabhi Sood - Report
No ratings yet
Project 5 Surabhi Sood - Report
34 pages
Machine Learning (Project5) PDF
100% (2)
Machine Learning (Project5) PDF
13 pages
Project 5 PDF
100% (1)
Project 5 PDF
48 pages
HACKATHON
No ratings yet
HACKATHON
8 pages
Car Transport Prediction
100% (2)
Car Transport Prediction
27 pages
Group Assignment - Data Mining
No ratings yet
Group Assignment - Data Mining
28 pages
Data Science Unit-5
No ratings yet
Data Science Unit-5
37 pages
A Note On R
No ratings yet
A Note On R
90 pages
BAUDM Assignment2
No ratings yet
BAUDM Assignment2
16 pages
Project 5 - Cars
100% (1)
Project 5 - Cars
22 pages
Assignment Week 3 500832
No ratings yet
Assignment Week 3 500832
6 pages
Supervised Learning in R Classification
No ratings yet
Supervised Learning in R Classification
7 pages
Bank Loan Title
No ratings yet
Bank Loan Title
7 pages
Lead Score Case Study
No ratings yet
Lead Score Case Study
9 pages
Lead Score Case Study
No ratings yet
Lead Score Case Study
9 pages
DSBAProject Oct 2020
No ratings yet
DSBAProject Oct 2020
24 pages
Machine Learning With Titanic Dataset Tutorial
No ratings yet
Machine Learning With Titanic Dataset Tutorial
7 pages
BDA Lab Manual (12 Weeks)
No ratings yet
BDA Lab Manual (12 Weeks)
22 pages
Predicting Mode of Transport
No ratings yet
Predicting Mode of Transport
29 pages
ML Assignemnt PDF
No ratings yet
ML Assignemnt PDF
21 pages
Note 4
No ratings yet
Note 4
18 pages
Surabhi Charu Project
No ratings yet
Surabhi Charu Project
16 pages
Lead Scoring Logistic Regression
No ratings yet
Lead Scoring Logistic Regression
19 pages
BDA MSC It
No ratings yet
BDA MSC It
35 pages
ML Lab 10 - Ensemble Learning
No ratings yet
ML Lab 10 - Ensemble Learning
7 pages
Capstone Final PPT Group 6
No ratings yet
Capstone Final PPT Group 6
19 pages
Practical Machine Learning Course Notes
No ratings yet
Practical Machine Learning Course Notes
76 pages
Sakhil Capstone
No ratings yet
Sakhil Capstone
20 pages
Classification Models
No ratings yet
Classification Models
3 pages
Logistic Regression Assignment
No ratings yet
Logistic Regression Assignment
20 pages
R Codes
No ratings yet
R Codes
23 pages
Rstudio Study Notes For PA 20181126
No ratings yet
Rstudio Study Notes For PA 20181126
6 pages
7708 - MBA PredAnanBigDataNov21
No ratings yet
7708 - MBA PredAnanBigDataNov21
11 pages
R Assignment
No ratings yet
R Assignment
8 pages
MachineLearning Project PDF
No ratings yet
MachineLearning Project PDF
32 pages
Predicting Mode of Transport (ML) : Akalya KS
No ratings yet
Predicting Mode of Transport (ML) : Akalya KS
17 pages
FRA Assignment - India Credit Model
No ratings yet
FRA Assignment - India Credit Model
14 pages
Fuel Saving Techniques for Airline Pilots
From Everand
Fuel Saving Techniques for Airline Pilots
Tomas Kopta
No ratings yet
C Programming Pocket Primer: An Essential Guide to C Programming Basics
From Everand
C Programming Pocket Primer: An Essential Guide to C Programming Basics
Mercury Learning and Information
No ratings yet

Machine Learning Project: Choice of Employee Mode of Transport

Uploaded by

Machine Learning Project: Choice of Employee Mode of Transport

Uploaded by

MACHINE LEARNING PROJECT

CHOICE OF EMPLOYEE MODE OF TRANSPORT

#CONVERTING GENDER TO BINARY

boxplot(cars$Salary ~cars$MBA, main = "Salary vs MBA.")

boxplot(cars$Salary~cars$Transport, main="Salary vs Transport")

boxplot(cars$Distance~cars$Transport, main="Distance vs Transport")

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 583.204 on 426 degrees of freedom

Number of Fisher Scoring iterations: 10

We can see age,workexp,salary and license are highly significant.

#CONFUSION MATRIX ON TRAIN DATA

#CONFUSION MATRIX ON TEST DATA

#checking model performance

CREATING NAÏVE BAYES MODEL

#NAIVE BAYES ~,,training,family=binomial)

#confusion matrix on test

#ROC ON TRAIN DATA

#ROC ON TEST DATA

Mcnemar's Test P-Value : 1

MODEL COMPARISON AND CONCLUSION

You might also like