Machine Learning Project
Project 5
R S Prashanti
Table of Contents
1. EDA - Basic data summary, Univariate, Bivariate analysis, graphs, check for outliers and missing values and check the summary of the dataset (7)
3. EDA - Check for Multicollinearity - plot the graph based on Multicollinearity & treat it (3)
7. Applying Naïve Bayes Model & interpret results (is it applicable here? Comment, and if it is not applicable, how can you build an NB model in this case?) (3)
10. Bagging (7.5)
11. Boosting (7.5)
Project Objective
This project requires understanding which mode of transport employees prefer to commute to their office. The attached data 'Cars.csv' includes employee information about their mode of transport as well as personal and professional details such as age, salary and work experience. We need to predict whether or not an employee will use a car as their mode of transport, and identify which variables are significant predictors of this decision.
Note: Each question from the table above is answered under the same numbering.
Q1. Basic data summary, Univariate, Bivariate analysis, graphs, Check for Outliers and missing values and
check the summary of the dataset.
A1.
setwd("D:/users/Prashanti Sharma/Personal/PROJS/Pred-2")
library(readr)
data = read.csv("Cars.csv")
variable.names(data)
str(data)
> str(data)
'data.frame': 444 obs. of 9 variables:
$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 1 2 2 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 3 3 3 3 3 3 1 3 3 3 ...
>
dim(data)
> dim(data)
[1] 444 9
View(data)
Variables like Engineer, MBA and license have been read as integers, so they should be converted to factors first.
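A minimal sketch of that conversion (the same calls appear later in the SMOTE section; shown here up front):
# Treat the 0/1 indicator columns as categorical rather than numeric
data$Engineer <- as.factor(data$Engineer)
data$MBA      <- as.factor(data$MBA)
data$license  <- as.factor(data$license)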
summary(data)
> summary(data)
Age Gender Engineer MBA Work.Exp Salary
Min. :18.00 Female:128 0:109 0 :331 Min. : 0.0 Min. : 6.50
1st Qu.:25.00 Male :316 1:335 1 :112 1st Qu.: 3.0 1st Qu.: 9.80
Median :27.00 NA's: 1 Median : 5.0 Median :13.60
Mean :27.75 Mean : 6.3 Mean :16.24
3rd Qu.:30.00 3rd Qu.: 8.0 3rd Qu.:15.72
Max. :43.00 Max. :24.0 Max. :57.00
Distance license Transport
Min. : 3.20 0:340 2Wheeler : 83
1st Qu.: 8.80 1:104 Car : 61
Median :11.00 Public Transport:300
Mean :11.32
3rd Qu.:13.43
Max. :23.40
sapply(data,function(x) sum(is.na(x)))
summary(data$Gender)
summary(data$Engineer)
summary(data$MBA)
summary(data$Work.Exp)
summary(data$Salary)
summary(data$Distance)
summary(data$license)
summary(data$Transport)
summary(data$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 25.00 27.00 27.75 30.00 43.00
> summary(data$Gender)
Female Male
128 316
> summary(data$Engineer)
0 1
109 335
> summary(data$MBA)
0 1 NA's
331 112 1
> summary(data$Work.Exp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 3.0 5.0 6.3 8.0 24.0
> summary(data$Salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.50 9.80 13.60 16.24 15.72 57.00
> summary(data$Distance)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.20 8.80 11.00 11.32 13.43 23.40
> summary(data$license)
0 1
340 104
> summary(data$Transport)
2Wheeler Car Public Transport
83 61 300
>
> sum(is.na(data))
[1] 1
As noted in the summary output above, exactly one data point is missing (the single NA in the MBA column).
#Univariate analysis
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(col="Blue")
#Bivariate Analysis
As expected, there is not much difference here: people of all qualifications and all levels of work experience are employed in the firm.
We do not see any appreciable difference in the salaries of engineers vs. non-engineers or of MBAs vs. non-MBAs.
Also, the mean salary for both MBAs and engineers is around 16.
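The plots themselves did not survive the export. A sketch of the kind of bivariate plot behind these observations, assuming ggplot2 (plotting choices are mine):
library(ggplot2)
# Salary split by qualification; near-identical boxes support the
# "no appreciable difference" observation above
ggplot(data, aes(x = as.factor(Engineer), y = Salary)) + geom_boxplot() + labs(x = "Engineer (0 = No, 1 = Yes)")
ggplot(data, aes(x = as.factor(MBA), y = Salary)) + geom_boxplot() + labs(x = "MBA (0 = No, 1 = Yes)")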
This is skewed to the right; again, this is on expected lines, as any firm has more juniors than seniors.
table(data$license,data$Transport)
boxplot(data$Work.Exp ~ data$Gender)
There is not much difference in mean work experience between the two genders, so the population is similarly distributed across males and females.
Hypothesis Testing
1. The higher the salary, the greater the chance of using a car for the commute.
2. Similarly, with age or work experience (Age and Work.Exp are expected to be collinear), the propensity to use a car increases.
cor(data$Age, data$Work.Exp)
As was the case with salary, we can see a clear demarcation in transport usage: the lower age group prefers the 2-wheeler, while employees with higher work experience prefer the car.
3. As distance increases, an employee would prefer a car for comfort and ease.
There is a slight pattern that can be observed here: for greater distances the car is preferred, followed by the 2-wheeler and then public transport.
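A sketch of a plot that shows this pattern (the original figure did not survive the export; plot choice is mine):
# Commute distance by mode; medians should rise from Public Transport
# towards Car if the hypothesis holds
boxplot(Distance ~ Transport, data = data, xlab = "Mode of transport", ylab = "Distance")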
table(data$Gender,data$Transport)
We can see that around 40% of females use private transport (2-wheeler or car), with 10% using a car, compared to males, where 15% prefer the car and a total of 30% use private transport. Thus, even though the percentage of car usage among males is higher, males are also heavy users of public transport.
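These percentages can be read off a row-proportion table; a minimal sketch:
# Row-wise proportions: for each gender, the share using each mode
round(prop.table(table(data$Gender, data$Transport), margin = 1), 2)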
A.3 Multicollinearity
library(corrplot)
# cor() is defined only for numeric columns, so subset them first
num.data <- data[, sapply(data, is.numeric)]
data.matrix <- cor(num.data)
corrplot(data.matrix)
library(ppcor)
pcor(num.data, method = "pearson")
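The question also asks us to treat the multicollinearity. One common check, not shown in the original output, is the variance inflation factor; a sketch assuming the car package is available, with Salary used as a stand-in response just to obtain VIFs for the remaining numeric predictors:
library(car)  # assumed available; provides vif()
# VIF above ~5-10 flags near-collinear predictors (Age and Work.Exp are
# the likely offenders here) as candidates for dropping or combining
vif(lm(Salary ~ Age + Work.Exp + Distance, data = data))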
A.4 SMOTE
## SMOTE
data$Engineer<-as.factor(data$Engineer)
data$MBA<-as.factor(data$MBA)
data$license<-as.factor( data$license)
library(DMwR)  # needed for knnImputation() and SMOTE()
data <- knnImputation(data)
data$CarUsage<-ifelse(data$Transport =='Car',1,0)
table(data$CarUsage)
sum(data$CarUsage == 1)/nrow(data)
data$CarUsage<-as.factor(data$CarUsage)
data$Engineer<-as.factor(data$Engineer)
> data$MBA<-as.factor(data$MBA)
> data$license<-as.factor( data$license)
> data<-knnImputation(data)
Warning message:
In knnImputation(data) :
No case has missing values. Stopping as there is nothing to do.
>
>
> data$CarUsage<-ifelse(data$Transport =='Car',1,0)
> table(data$CarUsage)
0 1
383 61
>
> sum(data$CarUsage == 1)/nrow(data)
[1] 0.1373874
>
> data$CarUsage<-as.factor(data$CarUsage)
Records of people travelling by car are in the minority (about 14%). Hence we need to use an appropriate sampling method on the train data.
library(caret)  # provides createDataPartition() and train()
set.seed(400)
> set.seed(400)
> carindex<-createDataPartition(data$CarUsage, p=0.7,list = FALSE,times = 1)
> carsdatatrain<-data[carindex,]
> carsdatatest<-data[-carindex,]
> prop.table(table(carsdatatrain$CarUsage))
0 1
0.8621795 0.1378205
prop.table(table(carsdatatest$CarUsage))
> prop.table(table(carsdatatest$CarUsage))
0 1
0.8636364 0.1363636
carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]
## The train and test data have almost same percentage of cars usage as the base data
## Apply SMOTE on Training data set
library(DMwR)
attach(carsdatatrain)
carsdataSMOTE<-SMOTE(CarUsage~., carsdatatrain, perc.over = 250,perc.under = 150)
prop.table(table(carsdataSMOTE$CarUsage))
0 1
0.5 0.5
We now have an equal split in the data between car users and non-car users. Let us proceed with building the models.
## Model Building
We will use logistic regression on the SMOTE data to understand the factors influencing car usage. Since we have only a limited number of variables, we will use them all in model building.
outcomevar<-'CarUsage'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> outcomevar<-'CarUsage'
> regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
> trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)
> carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm", family = "binomial",trControl = trainctrl)
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(carsglm$finalModel)
> summary(carsglm$finalModel)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.09669 -0.01064 0.00000 0.02798 2.11942
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -64.8739 18.4863 -3.509 0.000449 ***
Age 1.7911 0.5273 3.397 0.000681 ***
Work.Exp -0.2056 0.4347 -0.473 0.636175
Salary -0.1061 2.1572 -0.049 0.960759
Distance 0.6172 0.1957 3.154 0.001611 **
license1 1.5699 0.8530 1.840 0.065711 .
Engineer1 3.1162 1.2024 2.592 0.009554 **
MBA1 -3.4351 1.1784 -2.915 0.003557 **
GenderMale 1.6470 0.9453 1.742 0.081460 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)
carglmcoeff<-exp(coef(carsglm$finalModel))
> write.csv(carglmcoeff,file = "Coeffs.csv")
> varImp(object = carsglm)
glm variable importance
Overall
Age 100.00
Distance 92.74
MBA1 85.60
Engineer1 75.94
license1 53.50
GenderMale 50.57
Work.Exp 12.66
Salary 0.00
From the model we see that Age and Distance are the most significant predictors, followed by the Engineer and MBA indicators. Reading the exponentiated coefficients (the odds table written to Coeffs.csv), a one-year increase in age multiplies the odds of car usage by roughly exp(1.79) ≈ 6, other variables held fixed. As expected, holding a license also raises the odds of using a car, by a factor of about exp(1.57) ≈ 4.8, though this effect is only marginally significant. Salary and work experience are not significant in this model. The null deviance of the model is 357.664 and the residual deviance is 17.959, which yields a McFadden R-square of almost 0.95, a very good fit. The cross-validated Accuracy and Kappa values are also high. We shall do the prediction based on this model.
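A minimal sketch of the pseudo-R-square computation, using the deviances reported above:
# McFadden pseudo R-square = 1 - residual deviance / null deviance
null.dev <- 357.664
resid.dev <- 17.959
1 - resid.dev / null.dev  # ~0.95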
#confusion matrix
Reference
Prediction 0 1
0 111 3
1 3 15
Accuracy : 0.9545
95% CI : (0.9037, 0.9831)
No Information Rate : 0.8636
P-Value [Acc > NIR] : 0.0005559
Kappa : 0.807
Sensitivity : 0.8333
Specificity : 0.9737
Pos Pred Value : 0.8333
Neg Pred Value : 0.9737
Prevalence : 0.1364
Detection Rate : 0.1136
Detection Prevalence : 0.1364
Balanced Accuracy : 0.9035
'Positive' Class : 1
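The matrix above was presumably produced along these lines (a sketch; the original prediction call is not shown, and carsglmpred is a hypothetical name):
# Predict on the untouched test set and compare with actual car usage
carsglmpred <- predict(carsglm, carsdatatest)  # hypothetical object name
confusionMatrix(carsglmpred, carsdatatest$CarUsage, positive = "1")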
1. Logistic Regression
install.packages("MASS")
library(MASS)
library(class)
model <- glm(Transport ~ ., family = binomial(link = 'logit'), data = cars_train)
summary(model)
Call:
glm(formula = Transport ~ ., family = binomial(link = "logit"),
data = cars_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5733 0.1351 0.3372 0.5675 1.8440
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.48505 2.68180 -4.283 1.85e-05 ***
Age 0.52837 0.11253 4.696 2.66e-06 ***
GenderMale 1.52604 0.38092 4.006 6.17e-05 ***
Engineer1 0.33374 0.38788 0.860 0.38956
MBA1 0.64960 0.43553 1.492 0.13583
Work.Exp -0.07111 0.12017 -0.592 0.55403
Salary -2.55115 2.29565 -1.111 0.26644
Distance -0.15098 0.05167 -2.922 0.00348 **
license1 -2.32637 0.46451 -5.008 5.49e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that Transport has three levels, and a binomial glm with a factor response models the probability of the response not being the first level; here the model is effectively "2Wheeler" vs. everything else. Judging by the low p-values, the most relevant features are Age, Gender, Distance and license.
Also, the AIC score is 241.09. We will track this in subsequent stages as we refine the model; the model with the lowest AIC score is the preferred, most optimized one.
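One hedged way to do that refinement, assuming MASS is loaded (an illustration of the idea, not the original workflow):
# Stepwise selection that adds/drops terms to minimise AIC; the
# surviving terms form the refined model
model.step <- stepAIC(model, direction = "both", trace = FALSE)
summary(model.step)  # compare its AIC against 241.09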
anova(model, test = "Chisq")
> anova(model, test = "Chisq")
Analysis of Deviance Table
Response: Transport
## Cross Validation
## Decision Tree
library(rpart)
model2 <- rpart(Transport ~ Salary + Distance + license
+Work.Exp, data=cars_train, method="class")
plot(model2, uniform=TRUE,
main="Classification Tree for data")
text(model2, use.n=TRUE, all=TRUE, cex=.8)
library(rpart)
> model2 <- rpart(Transport ~ Salary + Distance + license
+ +Work.Exp, data=cars_train, method="class")
>
> plot(model2, uniform=TRUE,
+ main="Classification Tree for data")
> text(model2, use.n=TRUE, all=TRUE, cex=.8)
> treePred <- predict(model2, cars_train, type = 'class')
> table(treePred, cars_train$Transport)
#KNN
library(caret)
library(mlbench)
library(class)
k-Nearest Neighbors
312 samples
8 predictors
3 classes: '2Wheeler', 'Car', 'Public Transport'
k Accuracy Kappa
2 0.7365457 0.4543489
3 0.7855712 0.5248631
4 0.7629839 0.4800127
5 0.7828562 0.5081854
6 0.7734812 0.4905393
7 0.7634005 0.4624704
8 0.7408065 0.4118105
9 0.7534005 0.4199273
10 0.7536022 0.4116860
11 0.7598454 0.4168749
12 0.7662970 0.4266860
13 0.7662970 0.4213708
14 0.7566129 0.3930122
15 0.7661895 0.4135919
16 0.7660887 0.4090611
17 0.7566129 0.3862387
18 0.7629637 0.3926229
19 0.7661895 0.4026549
20 0.7661895 0.3942178
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 3
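The fit.knn object used below is not defined anywhere in the extract; it was presumably trained along these lines (a sketch; the resampling and preprocessing choices are my assumptions, though the grid matches the k = 2..20 table above):
# Scale the predictors (kNN is distance-based) and tune k over 2..20
trctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit.knn <- train(Transport ~ ., data = cars_train, method = "knn",
                 trControl = trctrl, tuneGrid = data.frame(k = 2:20),
                 preProcess = c("center", "scale"))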
KNN_predictions = predict(fit.knn,cars_train)
table(KNN_predictions, cars_train$Transport)
KNN_predictions = predict(fit.knn,cars_test)
table(KNN_predictions, cars_test$Transport)
For a classification problem, it is important to ensure that the train and test sets have approximately the same percentage of samples of each target class. Hence, we will use stratified sampling.
install.packages("caret")
library(caret)
set.seed(1234)
trainIndex = createDataPartition(data$class, p=0.7, list = FALSE, times = 1)
train.data = data[trainIndex, ]
test.data = data[-trainIndex,]
dim(train.data)
dim(test.data)
prop.table(table(data$class))
prop.table(table(train.data$class))
prop.table(table(test.data$class))
> set.seed(1234)
> trainIndex = createDataPartition(data$class, p=0.7, list = FALSE, times = 1)
> train.data = data[trainIndex, ]
> test.data = data[-trainIndex,]
>
> dim(train.data)
[1] 538 9
> dim(test.data)
[1] 230 9
>
> prop.table(table(data$class))
0 1
0.6510417 0.3489583
> prop.table(table(train.data$class))
0 1
0.6728625 0.3271375
> prop.table(table(test.data$class))
0 1
0.6 0.4
Data Cleaning
Missing values
anyNA(data)
> anyNA(data)
[1] TRUE
data[!complete.cases(data), ]
> data[!complete.cases(data), ]
Age Gender Engineer MBA Work.Exp Salary Distance license Transport
145 28 Female 0 <NA> 6 13.7 9.4 0 Public Transport
data = knnImputation(data, 5)
Naïve Bayes
library(e1071)
Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
Naive_Bayes_Model
library(e1071)
> Naive_Bayes_Model=naiveBayes(cars_train$Transport ~., data=cars_train)
> Naive_Bayes_Model
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
2Wheeler Car Public Transport
0.1891026 0.1378205 0.6730769
Conditional probabilities:
Age
Y [,1] [,2]
2Wheeler 25.11864 3.173624
Car 35.20930 3.211276
Public Transport 26.76190 2.952653
Gender
Y Female Male
2Wheeler 0.4915254 0.5084746
Car 0.1395349 0.8604651
Public Transport 0.2285714 0.7714286
Engineer
Y 0 1
2Wheeler 0.2542373 0.7457627
MBA
Y 0 1
2Wheeler 0.7627119 0.2372881
Car 0.8604651 0.1395349
Public Transport 0.7142857 0.2857143
Work.Exp
Y [,1] [,2]
2Wheeler 4.000000 3.449138
Car 15.046512 4.690180
Public Transport 5.004762 3.181131
Salary
Y [,1] [,2]
2Wheeler 2.442295 0.4234003
Car 3.479695 0.4330174
Public Transport 2.520766 0.3208979
Distance
Y [,1] [,2]
2Wheeler 12.03051 3.178172
Car 15.07209 3.903559
Public Transport 10.11524 2.964633
license
Y 0 1
2Wheeler 0.7288136 0.2711864
Car 0.2093023 0.7906977
Public Transport 0.8857143 0.1142857
This gives us the factors that help explain an employee's decision whether or not to use a car.
The general way to interpret this output for a factor variable, say license, is: among 2-wheeler users, about 73% do not hold a license and about 27% do.
For continuous variables, for example Distance, we can say that 2-wheeler users commute a mean distance of about 12.0 with an sd of about 3.2.
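To see how these conditional probabilities combine into a prediction, the posterior class probabilities can be requested directly (a sketch):
# Posterior class probabilities for the test employees: the per-variable
# conditional probabilities combined with the priors above
head(predict(Naive_Bayes_Model, cars_test, type = "raw"))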
NB_Predictions=predict(Naive_Bayes_Model,cars_test)
table(NB_Predictions,cars_test$Transport)
> table(NB_Predictions,cars_test$Transport)
A.10 Boosting
#BOOSTING
install.packages("xgboost")
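The carsxgb object used below is not defined in the extract; it was presumably trained via caret, along these lines (a sketch; the tuning setup is my assumption):
library(caret)
# Gradient-boosted trees on the three-class Transport outcome, tuned by
# the same repeated cross-validation used for the earlier models
carsxgb <- train(Transport ~ ., data = cars_train, method = "xgbTree",
                 trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3))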
carsxgb$finalModel
predictions_xgb<-predict(carsxgb,cars_test)
confusionMatrix(predictions_xgb,cars_test$Transport)
> predictions_xgb<-predict(carsxgb,cars_test)
> confusionMatrix(predictions_xgb,cars_test$Transport)
Confusion Matrix and Statistics
Reference
Prediction 2Wheeler Car Public Transport
2Wheeler 5 0 4
Car 2 17 2
Public Transport 17 1 84
Overall Statistics
Accuracy : 0.803
95% CI : (0.7249, 0.8671)
No Information Rate : 0.6818
P-Value [Acc > NIR] : 0.001323
Kappa : 0.5514
Statistics by Class:
A.11 Bagging
# Bagging
library(ipred)
library(rpart)
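The bagging model itself did not survive the export; a sketch of what it presumably looked like with ipred (nbagg and the out-of-bag option are my choices):
# Bagged classification trees: fit an rpart tree to each bootstrap
# resample of the training data and aggregate the votes
cars.bagging <- bagging(Transport ~ ., data = cars_train, nbagg = 25, coob = TRUE)
bag_predictions <- predict(cars.bagging, cars_test)
table(bag_predictions, cars_test$Transport)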
We see that all three models predict Public Transport for the two test samples.
Let us summarize the conclusions from the analysis and the models for the employee's decision whether or not to use a car: