
Project_Oct_2020

Stephen

10/30/2020
---

Introduction: This project is focused on understanding the mode of transport
employees prefer for commuting to their office. The "Cars-dataset" includes
each employee's mode of transport as well as personal and professional
details such as age, salary, and work experience.

Objectives: To predict whether or not an employee will use a car as their
mode of transport.

To determine the variables that are significant predictors behind the
employees' decision.

```{r}

setwd("C:/Users/steph/Documents/DSBA")

getwd()

#Load data set

commute=read.csv("Cars-dataset.csv")

```

Key Assumptions:

The data variables are independent of one another.

The sample is representative of the wider employee population.

I now proceed to load the necessary packages.

```{r loading the libraries}

library(ggplot2)

library(caTools)

library(corrplot)

library(gridExtra)

library(psych)
library(dplyr)

library(ROCR)

library(janitor)

```
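If any of these packages are missing, a small guard like the following (a sketch of my own, not part of the original script) installs only what is absent before loading:

```{r install missing packages}

#A minimal sketch: install any of the packages used above that are not yet present
pkgs <- c("ggplot2", "caTools", "corrplot", "gridExtra",
          "psych", "dplyr", "ROCR", "janitor")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

```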

Reviewing the data set

```{r data review}

names(commute)

commute=clean_names(commute) #The dot in work.Exp should be changed to an underscore

names(commute)

View(commute)

head(commute)

tail(commute)

dim(commute)#418 rows and 9 columns

str(commute)

```

#Work.Exp and Age should be numeric. Engineer, MBA, and license are class variables and should be factors.

```{r convert variables}

commute$age<-as.numeric(commute$age)

commute$work_exp<-as.numeric(commute$work_exp)

commute$engineer<-as.factor(commute$engineer)

commute$mba<-as.factor(commute$mba)

commute$license<-as.factor(commute$license)

#Gender and Transport should be converted to factors to aid the exploratory data analysis.

commute$gender<-as.factor(commute$gender)
commute$transport<-as.factor(commute$transport)

str(commute)

```

```{r}

summary(commute)

```

The average age of the employees is 27 years. The ages are well distributed,
with the youngest employee being 18 years old and the oldest 43.

121 of the 418 employees are female (121/(121+297), about 29%).

Of the employees, 109 have an MBA, 313 have an engineering degree, and only
85 hold a licence.

The average work experience is 5.8 years. Again, this variable does not
suggest any outliers, as the mean and median are very close. Some employees
in the organisation have no work experience at all.

The average annual salary is 15,418. There are outliers, given that the mean
and median differ. The highest earner collects 57,000.

The average distance of employees to the office is 11.29 km. The employee
with the shortest commute covers 3.2 km.

In terms of how they commute to the office, 300 employees (71%) take public
transport to work. Only 35 (8%) drive a car.
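These shares come straight from the frequency table of the transport variable; a quick sketch to reproduce them:

```{r transport shares}

#Proportion of employees per transport mode, expressed as percentages
round(prop.table(table(commute$transport)) * 100, 1)

```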

```{r}
#check for missing values

sum(is.na(commute))

#As seen in the summary above, there is one missing value.

#Since the dataset is fairly large and only one entry is missing, we can simply delete the row with the missing value.

commute.m<-na.omit(commute)

summary(commute.m)

#I need to streamline the outcome (dependent) variable by reducing its categories from 3 levels to 2.

commute.m$transport=as.character(commute.m$transport)

commute.m$transport[commute.m$transport=="2Wheeler"|

commute.m$transport=="Public Transport"]="Not_car"

commute.m$transport[commute.m$transport=="Car"]="car"

table(commute.m$transport)

```

```{r}

#Plotting charts

#Checking for outliers

attach(commute.m) #So that columns can be referenced without the data frame prefix every time

hist(age)

boxplot(commute.m[,c(1,5,6,7)],horizontal=TRUE)

boxplot(salary~transport,xlab="transport", ylab="salary", main="transport by salary",col=c("orange","blue"))

boxplot(distance~transport,xlab="transport", ylab="distance", main="transport by distance",col=c("orange","blue"))

```
```{r}

#Histogram

qplot(age,main="Age Distribution", xlab="age", ylab="Frequency")

#barplot

qplot(transport,main="Transport Distribution", xlab="Transport", ylab="Frequency")

qplot(gender,main="Gender Distribution", xlab="Gender", ylab="Frequency")

```
```{r}

#Multivariate

qplot(gender,salary,geom="boxplot",main="Salary Distribution per Gender", xlab="Gender", ylab="Salary")

qplot(transport,salary, geom="boxplot",main="Mode of Transport by Salary Earned", xlab="Transport", ylab="Salary")

qplot(salary,fill=transport, geom="density",alpha=I(.8),main="Mode of Transport by Salary Earned", xlab="Salary", ylab="Density")

qplot(salary,fill=gender, geom="density",alpha=I(.8),main="Salary Earned by Gender", xlab="Salary", ylab="Density")

qplot(salary,fill=license, geom="density",alpha=I(.8),main="Salary Earned by Licence Status", xlab="Salary", ylab="Density")

qplot(age,fill=transport, geom="bar",main="Mode of Transport vs Age", xlab="Age", ylab="Frequency")

qplot(transport,fill=gender, geom="bar",main="Mode of Transport vs Gender", xlab="Transport", ylab="Frequency")

plot(commute.m[,c(5:7)])

```
```{r}
#percentile Distribution for the treatment of outliers

#define a function

outlier_treatment_fun = function(commute.m,var_name){

capping=as.vector(quantile(commute.m[,var_name],0.99))

flooring=as.vector(quantile(commute.m[,var_name],0.01))

commute.m[,var_name][which(commute.m[,var_name]<flooring)]=flooring

commute.m[,var_name][which(commute.m[,var_name]>capping)]=capping

return(commute.m)}

new_vars=c('age','work_exp','salary','distance')

#performing outlier treatment to all the variables

for(i in new_vars){

commute.m=outlier_treatment_fun(commute.m,i)

}

correlations=cor(commute.m[,new_vars]) #Checking for multicollinearity

corrplot(correlations)

#Some of the correlation coefficients are high

```
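A quick sanity check (my own addition, not part of the original treatment) confirms that each treated variable now lies within its 1st and 99th percentile caps:

```{r outlier sanity check}

#After flooring/capping, the min and max of each treated variable should sit
#at or inside the original 1st and 99th percentiles
sapply(new_vars, function(v) range(commute.m[[v]]))

```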
```{r}

#Dealing with Multicolinearity

ncol(commute.m)

names(commute.m)

train.matrix=data.matrix(commute.m)

corr_mat=cor(train.matrix[,-9])

corrplot(corr_mat,order="hclust") #hclust ensures the variables with high correlation are grouped together.

#exclude the highly correlated variable(s)

library(usdm)

vifcor((train.matrix[,-9]),th=0.8) #At the correlation threshold of 0.8, work_exp and salary have been excluded.

```

```{r}

#### Creating a filtered data frame

str(commute.m)

filter.commute.m=commute.m[,-5:-6]

head(filter.commute.m)

filter.commute.m$transport=as.factor(filter.commute.m$transport)

```

```{r}

#modelling

#splitting data to train and test

set.seed(1234)
sample=sample.split(filter.commute.m$transport,SplitRatio = 0.75)

train=subset(filter.commute.m,sample==T)

test=subset(filter.commute.m,sample==F)

table(filter.commute.m$transport)

str(train)

dim(train)

dim(test)

```
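Because sample.split stratifies on the outcome, the car/non-car mix should be nearly identical in both partitions. A quick check (a sketch):

```{r check split balance}

#Class proportions should match closely across train and test after a stratified split
prop.table(table(train$transport))

prop.table(table(test$transport))

```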

```{r}

#KNN

#KNN is non-parametric: it classifies each observation by a majority vote of its k nearest neighbours rather than by a fitted equation.

library(class)

names(train)#Helps to identify the order of the columns

knn(train[,c(1,5)],test[,c(1,5)],train$transport,k=5)

#Convert to a model

KNN_transport=knn(train[,c(1,5)],test[,c(1,5)],train$transport,k=5) #It is good to use an odd number of neighbours (k) to avoid ties.

#Checking for the performance on the test data

table(test$transport,KNN_transport) #Confusion matrix

(8+95)/105 #98.09% Accuracy

8/9 #88.9% sensitivity

95/96 #98.9% specificity

#scale the data to see if the performance of the model will be better

KNN_transport2=knn(scale(train[,c(1,5)]),scale(test[,c(1,5)]),train$transport,k=5)

table(test$transport,KNN_transport2)

#The result did not change

```
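The choice of k = 5 above is somewhat arbitrary. A small sweep over odd values of k (a sketch, mirroring the same scaled age and distance columns) shows how sensitive test accuracy is to that choice:

```{r knn k sweep}

#Try several odd k values and report test-set accuracy for each
for (k in c(3, 5, 7, 9, 11)) {
  pred <- knn(scale(train[, c(1, 5)]), scale(test[, c(1, 5)]),
              train$transport, k = k)
  cat("k =", k, "-> accuracy =", round(mean(pred == test$transport), 4), "\n")
}

```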
```{r}

#Naive Bayes (NB)

#In its simplest form, NB runs on data where all predictor variables are categorical.

library(e1071)

naiveBayes(transport~.,train)

#Conditional distribution specifying the mean and standard deviation of each predictor, given the dependent variable ('use of car' or 'not use of car')

#To predict, I need to create a Naive Bayes object. To apply NB in a situation like this, where the predictors are continuous, we assume the data are normally distributed.

#Multicolinearity must also be avoided. This has already been addressed.

NB_transport=naiveBayes(transport~.,train)

predict(NB_transport,type="raw",newdata=train)

plot(predict(NB_transport,type="raw",newdata=train)[,2])

```
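To make Naive Bayes comparable with the other models, its class predictions can also be scored on the held-out test set (a sketch; predict() without type="raw" returns the most probable class):

```{r nb test performance}

#Confusion matrix and accuracy of Naive Bayes on the test data
NB_pred_test <- predict(NB_transport, newdata = test)

table(test$transport, NB_pred_test)

mean(NB_pred_test == test$transport)

```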
```{r}

#Logistic Regression

train$transport=as.factor(train$transport)

str(train)

glm(transport~.,data=train,family=binomial)

model=glm(transport~.,data=train,family=binomial)

summary(model)

```

```{r}

model$fitted.values

plot(train$transport,model$fitted.values)

transport.predicted=ifelse(model$fitted.values<0.4, "Car", "Not Car")

table(train$transport,transport.predicted)#Confusion Matrix

(20+214)/nrow(train) #100% Accuracy

20/20 #100% sensitivity

214/214 #100% specificity

#ROC Curve

library(pROC)

roc(train$transport,model$fitted.values)#An AUC of 1 means the classes are perfectly separated on the training data

plot.roc(train$transport,model$fitted.values)

```
```{r}

#Test data check

glm(transport~.,data=test,family=binomial)

model2=glm(transport~.,data=test,family=binomial)

model2

plot(test$transport,model2$fitted.values)

transport.predicted=ifelse(model2$fitted.values<0.4, "Car", "Not Car")

table(test$transport,transport.predicted)#Confusion Matrix

#The performance is equally excellent

#ROC Curve
roc(test$transport,model2$fitted.values)#An AUC of 1 means the classes are perfectly separated here as well

plot.roc(test$transport,model2$fitted.values)

```
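Refitting a separate model on the test data, as above, checks whether the same relationships hold there; the more conventional check is to score the test set with the model fitted on the training data. A minimal sketch of that approach, reusing the 0.4 cut-off:

```{r score test with train model}

#Predicted probabilities for the test set from the training-data model
test_probs <- predict(model, newdata = test, type = "response")

test_pred <- ifelse(test_probs < 0.4, "Car", "Not Car")

table(test$transport, test_pred) #Confusion matrix

```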

```{r}

#McFadden Rsquared interpretation

library(pscl)

pR2(model)#A McFadden R-squared of about 0.92 indicates an excellent fit

```
```{r}

#Significance of the coefficients

summary(model)

#When distance increases by one unit, the log-odds of the modelled outcome decrease by about 1.1; for age, the decrease is about 1.4.

```
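Because these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret (a sketch):

```{r odds ratios}

#exp() of a coefficient gives the multiplicative change in the odds for a
#one-unit increase in that predictor; values below 1 mean the odds fall
exp(coef(model))

exp(confint.default(model)) #Wald 95% confidence intervals on the odds-ratio scale

```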

```{r}

library(caTools) # Split Data into Test and Train Set

library(caret) # for confusion matrix function

library(randomForest) # to build a random forest model

library(rpart) # to build a decision model

library(rpart.plot) # to plot decision tree model

library(rattle)

if(!requireNamespace("xgboost", quietly = TRUE)) install.packages("xgboost") #install once if missing

library(xgboost) # to build a XGboost model

library(ipred)

library(DMwR) # for SMOTE

install.packages("adabag")

library(adabag)

```

Setting up the general parameters for training multiple models

```{r}

#Defining the training control

commutectrl= trainControl(method = 'repeatedcv',
                          number = 3,
                          repeats = 1,
                          allowParallel = TRUE,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary)

```

First Model using Gradient Boosting Machines

```{r}

train$transport=as.factor(train$transport) #caret needs a factor outcome for classification when classProbs=TRUE

GBM_model <- train(transport ~ ., data = train,

method = "gbm",

trControl = commutectrl,

verbose = FALSE)

```

# Predict using the trained model & check performance on test set

```{r}

GBM_pred_test <- predict(GBM_model, newdata = test, type = "raw")

confusionMatrix(GBM_pred_test, test$transport)

```
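The class probabilities from the GBM can also be turned into a test-set AUC with pROC, which was loaded earlier (a sketch, assuming the positive class level is named "car" as created above):

```{r gbm test auc}

#Probability of each class per test observation, then AUC on the test set
GBM_prob_test <- predict(GBM_model, newdata = test, type = "prob")

roc(test$transport, GBM_prob_test[, "car"])

```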

The second model is Random Forest

```{r}

Rmodel <- train(transport ~ ., data = train,
                method = "rf",
                ntree = 30,
                maxdepth = 5,
                tuneLength = 10,
                trControl = commutectrl)

```
# Predict using the trained model & check performance on test set

```{r}

Rf_pred_test <- predict(Rmodel, newdata = test, type = "raw")

confusionMatrix(Rf_pred_test, test$transport)

```
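caret's varImp() gives a quick view of which predictors the random forest relies on most (a sketch):

```{r rf variable importance}

#Variable importance from the fitted random forest
varImp(Rmodel)

plot(varImp(Rmodel))

```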

COMPARING MODELS

```{r}

# Compare model performances using resample()

compare <- resamples(list(Gradient_boosting = GBM_model, Random_Forest = Rmodel))

# Summary of the models performances

summary(compare)

```

# Drawing the box plots to compare models

```{r}

scales <- list(x=list(relation="free"), y=list(relation="free"))

bwplot(compare, scales=scales)

```

Conclusion:

- The models developed performed well on both the train and test data.

- Most employees will not use a car as their mode of transport.

- The most significant predictors behind their decisions are age and the
distance to be covered.
