
Project_Oct_2020

Stephen

10/30/2020
---

Introduction: This project is focused on understanding the mode of transport
employees prefer for commuting to their office. The "Cars-dataset" includes
each employee's mode of transport as well as personal and professional
details such as age, salary, and work experience.

Objectives: To predict whether or not an employee will use a car as their
mode of transport.

To determine the variables that are significant predictors behind the
employees' decision.

```{r}

setwd("C:/Users/steph/Documents/DSBA")

getwd()

#Load data set

commute=read.csv("Cars-dataset.csv")

```

Key Assumptions:

The data variables are independent of one another.

The sample is representative of the wider employee population.

I now proceed to load the necessary packages.

```{r loading the libraries}

library(ggplot2)

library(caTools)

library(corrplot)

library(gridExtra)

library(psych)
library(dplyr)

library(ROCR)

library(janitor)

```
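If any of these packages are missing, a small guard like the following (a sketch of my own, not part of the original script) installs only what is absent before loading:

```{r install missing packages}

#A minimal sketch: install any of the packages used above that are not yet present
pkgs <- c("ggplot2", "caTools", "corrplot", "gridExtra",
          "psych", "dplyr", "ROCR", "janitor")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

```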

Reviewing the data set

```{r data review}

names(commute)

commute=clean_names(commute) #The dot in work.Exp should be changed to an underscore

names(commute)

View(commute)

head(commute)

tail(commute)

dim(commute)#418 rows and 9 columns

str(commute)

```

#Work.Exp and Age should be numeric. Engineer, MBA, and license are class variables and should be factors.

```{r convert variables}

commute$age<-as.numeric(commute$age)

commute$work_exp<-as.numeric(commute$work_exp)

commute$engineer<-as.factor(commute$engineer)

commute$mba<-as.factor(commute$mba)

commute$license<-as.factor(commute$license)

#Gender and Transport should be converted to factors to aid the exploratory data analysis.

commute$gender<-as.factor(commute$gender)
commute$transport<-as.factor(commute$transport)

str(commute)

```

```{r}

summary(commute)

```

The average age of the employees is 27 years. The ages are well distributed,
with the youngest employee being 18 years old and the oldest 43.

121 of the 418 employees are female (121/(121+297), about 29%).

Of the employees, 109 have an MBA, 313 have an engineering degree, and only
85 hold a licence.

The average work experience is 5.8 years. Again, this variable does not
suggest any outliers, as the mean and median are very close. Some employees
in the organisation have no work experience at all.

The average annual salary is 15,418. There are outliers, given that the mean
and median differ. The highest earner collects 57,000.

The average distance of employees to the office is 11.29 km. The employee
with the shortest commute covers 3.2 km.

In terms of how they commute to the office, 300 employees (71%) take public
transport to work. Only 35 (8%) drive a car.
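These shares come straight from the frequency table of the transport variable; a quick sketch to reproduce them:

```{r transport shares}

#Proportion of employees per transport mode, expressed as percentages
round(prop.table(table(commute$transport)) * 100, 1)

```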

```{r}
#check for missing values

sum(is.na(commute))

#As seen in the summary above, there is one missing value.

#Since the dataset is fairly large and only one entry is missing, we can simply delete the row with the missing value.

commute.m<-na.omit(commute)

summary(commute.m)

#I need to streamline the outcome (dependent) variable by reducing its categories from 3 levels to 2.

commute.m$transport=as.character(commute.m$transport)

commute.m$transport[commute.m$transport=="2Wheeler"|

commute.m$transport=="Public Transport"]="Not_car"

commute.m$transport[commute.m$transport=="Car"]="car"

table(commute.m$transport)

```

```{r}

#Plotting charts

#Checking for outliers

attach(commute.m) #So that columns can be referenced without the data frame prefix every time

hist(age)

boxplot(commute.m[,c(1,5,6,7)],horizontal=TRUE)

boxplot(salary~transport,xlab="transport", ylab="salary", main="transport by salary",col=c("orange","blue"))

boxplot(distance~transport,xlab="transport", ylab="distance", main="transport by distance",col=c("orange","blue"))

```
```{r}

#Histogram

qplot(age,main="Age Distribution", xlab="age", ylab="Frequency")

#barplot

qplot(transport,main="Transport Distribution", xlab="Transport", ylab="Frequency")

qplot(gender,main="Gender Distribution", xlab="Gender", ylab="Frequency")

```
```{r}

#Multivariate

qplot(gender,salary,geom="boxplot",main="Salary Distribution per Gender", xlab="Gender", ylab="Salary")

qplot(transport,salary, geom="boxplot",main="Mode of Transport by Salary Earned", xlab="Transport", ylab="Salary")

qplot(salary,fill=transport, geom="density",alpha=I(.8),main="Mode of Transport by Salary Earned", xlab="Salary", ylab="Density")

qplot(salary,fill=gender, geom="density",alpha=I(.8),main="Salary Earned by Gender", xlab="Salary", ylab="Density")

qplot(salary,fill=license, geom="density",alpha=I(.8),main="Salary Earned by Licence Status", xlab="Salary", ylab="Density")

qplot(age,fill=transport, geom="bar",main="Mode of Transport vs Age", xlab="Age", ylab="Frequency")

qplot(transport,fill=gender, geom="bar",main="Mode of Transport vs Gender", xlab="Transport", ylab="Frequency")

plot(commute.m[,c(5:7)])

```
```{r}
#percentile Distribution for the treatment of outliers

#define a function

outlier_treatment_fun = function(commute.m,var_name){

capping=as.vector(quantile(commute.m[,var_name],0.99))

flooring=as.vector(quantile(commute.m[,var_name],0.01))

commute.m[,var_name][which(commute.m[,var_name]<flooring)]=flooring

commute.m[,var_name][which(commute.m[,var_name]>capping)]=capping

return(commute.m)}

new_vars=c('age','work_exp','salary','distance')

#performing outlier treatment to all the variables

for(i in new_vars){

commute.m=outlier_treatment_fun(commute.m,i)

}

correlations=cor(commute.m[,new_vars]) #Checking for multicollinearity

corrplot(correlations)

#Some of the correlation coefficients are high

```
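A quick sanity check (my own addition, not part of the original treatment) confirms that each treated variable now lies within its 1st and 99th percentile caps:

```{r outlier sanity check}

#After flooring/capping, the min and max of each treated variable should sit
#at or inside the original 1st and 99th percentiles
sapply(new_vars, function(v) range(commute.m[[v]]))

```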
```{r}

#Dealing with Multicolinearity

ncol(commute.m)

names(commute.m)

train.matrix=data.matrix(commute.m)

corr_mat=cor(train.matrix[,-9])

corrplot(corr_mat,order="hclust") #hclust ensures the variables with high correlation are grouped together.

#exclude the highly correlated variable(s)

library(usdm)

vifcor((train.matrix[,-9]),th=0.8) #At the correlation threshold of 0.8, work_exp and salary have been excluded.

```

```{r}

#### Creating a filtered data frame

str(commute.m)

filter.commute.m=commute.m[,-5:-6]

head(filter.commute.m)

filter.commute.m$transport=as.factor(filter.commute.m$transport)

```

```{r}

#modelling

#splitting data to train and test

set.seed(1234)
sample=sample.split(filter.commute.m$transport,SplitRatio = 0.75)

train=subset(filter.commute.m,sample==T)

test=subset(filter.commute.m,sample==F)

table(filter.commute.m$transport)

str(train)

dim(train)

dim(test)

```
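Because sample.split stratifies on the outcome, the car/non-car mix should be nearly identical in both partitions. A quick check (a sketch):

```{r check split balance}

#Class proportions should match closely across train and test after a stratified split
prop.table(table(train$transport))

prop.table(table(test$transport))

```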

```{r}

#KNN

#KNN is non-parametric: it classifies each observation by a majority vote of its k nearest neighbours rather than by a fitted equation.

library(class)

names(train)#Helps to identify the order of the columns

knn(train[,c(1,5)],test[,c(1,5)],train$transport,k=5)

#Convert to a model

KNN_transport=knn(train[,c(1,5)],test[,c(1,5)],train$transport,k=5) #It is good to use an odd number of neighbours (k) to avoid ties.

#Checking for the performance on the test data

table(test$transport,KNN_transport) #Confusion matrix

(8+95)/105 #98.09% Accuracy

8/9 #88.9% sensitivity

95/96 #98.9% specificity

#scale the data to see if the performance of the model will be better

KNN_transport2=knn(scale(train[,c(1,5)]),scale(test[,c(1,5)]),train$transport,k=5)

table(test$transport,KNN_transport2)

#The result did not change

```
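The choice of k = 5 above is somewhat arbitrary. A small sweep over odd values of k (a sketch, mirroring the same scaled age and distance columns) shows how sensitive test accuracy is to that choice:

```{r knn k sweep}

#Try several odd k values and report test-set accuracy for each
for (k in c(3, 5, 7, 9, 11)) {
  pred <- knn(scale(train[, c(1, 5)]), scale(test[, c(1, 5)]),
              train$transport, k = k)
  cat("k =", k, "-> accuracy =", round(mean(pred == test$transport), 4), "\n")
}

```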
```{r}

#Naive Bayes (NB)

#In its simplest form, NB runs on data where all predictor variables are categorical.

library(e1071)

naiveBayes(transport~.,train)

#Conditional distribution specifying the mean and standard deviation of each predictor, given the dependent variable ('use of car' or 'not use of car')

#To predict, I need to create a Naive Bayes object. To apply NB in a situation like this, where the predictors are continuous, we assume the data are normally distributed.

#Multicolinearity must also be avoided. This has already been addressed.

NB_transport=naiveBayes(transport~.,train)

predict(NB_transport,type="raw",newdata=train)

plot(predict(NB_transport,type="raw",newdata=train)[,2])

```
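To make Naive Bayes comparable with the other models, its class predictions can also be scored on the held-out test set (a sketch; predict() without type="raw" returns the most probable class):

```{r nb test performance}

#Confusion matrix and accuracy of Naive Bayes on the test data
NB_pred_test <- predict(NB_transport, newdata = test)

table(test$transport, NB_pred_test)

mean(NB_pred_test == test$transport)

```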
```{r}

#Logistic Regression

train$transport=as.factor(train$transport)

str(train)

glm(transport~.,data=train,family=binomial)

model=glm(transport~.,data=train,family=binomial)

summary(model)

```

```{r}

model$fitted.values

plot(train$transport,model$fitted.values)

transport.predicted=ifelse(model$fitted.values<0.4, "Car", "Not Car")

table(train$transport,transport.predicted)#Confusion Matrix

(20+214)/nrow(train) #100% Accuracy

20/20 #100% sensitivity

214/214 #100% specificity

#ROC Curve

library(pROC)

roc(train$transport,model$fitted.values)#An AUC of 1 means the classes are perfectly separated on the training data

plot.roc(train$transport,model$fitted.values)

```
```{r}

#Test data check

glm(transport~.,data=test,family=binomial)

model2=glm(transport~.,data=test,family=binomial)

model2

plot(test$transport,model2$fitted.values)

transport.predicted=ifelse(model2$fitted.values<0.4, "Car", "Not Car")

table(test$transport,transport.predicted)#Confusion Matrix

#The performance is equally excellent

#ROC Curve
roc(test$transport,model2$fitted.values)#An AUC of 1 means the classes are perfectly separated here as well

plot.roc(test$transport,model2$fitted.values)

```
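Refitting a separate model on the test data, as above, checks whether the same relationships hold there; the more conventional check is to score the test set with the model fitted on the training data. A minimal sketch of that approach, reusing the 0.4 cut-off:

```{r score test with train model}

#Predicted probabilities for the test set from the training-data model
test_probs <- predict(model, newdata = test, type = "response")

test_pred <- ifelse(test_probs < 0.4, "Car", "Not Car")

table(test$transport, test_pred) #Confusion matrix

```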

```{r}

#McFadden Rsquared interpretation

library(pscl)

pR2(model)#A McFadden R-squared of about 0.92 indicates an excellent fit

```
```{r}

#Significance of the coefficients

summary(model)

#When distance increases by one unit, the log-odds of the modelled outcome decrease by about 1.1; for age, the decrease is about 1.4.

```
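Because these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret (a sketch):

```{r odds ratios}

#exp() of a coefficient gives the multiplicative change in the odds for a
#one-unit increase in that predictor; values below 1 mean the odds fall
exp(coef(model))

exp(confint.default(model)) #Wald 95% confidence intervals on the odds-ratio scale

```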

```{r}

library(caTools) # Split Data into Test and Train Set

library(caret) # for confusion matrix function

library(randomForest) # to build a random forest model

library(rpart) # to build a decision model

library(rpart.plot) # to plot decision tree model

library(rattle)

if(!requireNamespace("xgboost", quietly = TRUE)) install.packages("xgboost") #install once if missing

library(xgboost) # to build a XGboost model

library(ipred)

library(DMwR) # for SMOTE

install.packages("adabag")

library(adabag)

```

Setting up the general parameters for training multiple models

```{r}

#Defining the training control

commutectrl= trainControl(method = 'repeatedcv',
                          number = 3,
                          repeats = 1,
                          allowParallel = TRUE,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary)

```

First Model using Gradient Boosting Machines

```{r}

train$transport=as.factor(train$transport) #caret needs a factor outcome for classification when classProbs=TRUE

GBM_model <- train(transport ~ ., data = train,

method = "gbm",

trControl = commutectrl,

verbose = FALSE)

```

# Predict using the trained model & check performance on test set

```{r}

GBM_pred_test <- predict(GBM_model, newdata = test, type = "raw")

confusionMatrix(GBM_pred_test, test$transport)

```
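The class probabilities from the GBM can also be turned into a test-set AUC with pROC, which was loaded earlier (a sketch, assuming the positive class level is named "car" as created above):

```{r gbm test auc}

#Probability of each class per test observation, then AUC on the test set
GBM_prob_test <- predict(GBM_model, newdata = test, type = "prob")

roc(test$transport, GBM_prob_test[, "car"])

```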

The second model is Random Forest

```{r}

Rmodel <- train(transport ~ ., data = train,
                method = "rf",
                ntree = 30,
                maxdepth = 5,
                tuneLength = 10,
                trControl = commutectrl)

```
# Predict using the trained model & check performance on test set

```{r}

Rf_pred_test <- predict(Rmodel, newdata = test, type = "raw")

confusionMatrix(Rf_pred_test, test$transport)

```
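caret's varImp() gives a quick view of which predictors the random forest relies on most (a sketch):

```{r rf variable importance}

#Variable importance from the fitted random forest
varImp(Rmodel)

plot(varImp(Rmodel))

```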

COMPARING MODELS

```{r}

# Compare model performances using resample()

compare <- resamples(list(Gradient_boosting = GBM_model, Random_Forest = Rmodel))

# Summary of the models performances

summary(compare)

```

# Drawing the box plots to compare models

```{r}

scales <- list(x=list(relation="free"), y=list(relation="free"))

bwplot(compare, scales=scales)

```

Conclusion:

- The models developed performed well on both the train and test data.

- Most employees will not use a car as their mode of transport.

- The most significant predictors behind their decisions are age and the
distance to be covered.
