Random Forest Reference Code

The document discusses random forest classification models. It shows a random forest model built with 500 trees, trying 3 variables at each split. The out-of-bag (OOB) error estimate provides an assessment of the model's performance without a separate test set. Variable importance is assessed with a variable importance plot that sorts variables by MeanDecreaseGini. The random forest model achieves 99% accuracy on both the training and test sets, indicating stability.


Classification

Random Forest

Random Forest
#Random Forest model
library(randomForest)

modelrf <- randomForest(as.factor(left) ~ . , data = trainSplit, do.trace = TRUE)
modelrf

The random forest output tells us that 500 trees were built and that 3 variables were tried at each split. The out-of-bag (OOB) estimate of the generalization error is the error rate of the out-of-bag classifier on the training set. The OOB estimate is about as accurate as using a test set of the same size as the training set, so relying on it removes the need for a set-aside test set.
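As a minimal sketch (assuming the fitted modelrf object above), the OOB error can also be read directly from the err.rate matrix stored in the model, where the last row holds the estimate after all 500 trees have been grown:

#OOB error rate after the final tree (column "OOB" of the per-tree error matrix)
oob_err <- modelrf$err.rate[nrow(modelrf$err.rate), "OOB"]
oob_err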

Random Forest
#Checking variable importance in Random Forest
importance(modelrf)

varImpPlot(modelrf)

The variable importance plot displays the predictors sorted by MeanDecreaseGini, with the most important variables at the top.
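A minimal sketch (using modelrf from above) for inspecting the same ranking numerically rather than graphically:

#Sort the importance matrix by MeanDecreaseGini, largest first
imp <- importance(modelrf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]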

Random Forest
# Prediction and Model Evaluation using Confusion Matrix
library(caret)   #provides confusionMatrix()

predrf_tr <- predict(modelrf, trainSplit)    #Train Data
predrf_test <- predict(modelrf, testSplit)   #Test Data

confusionMatrix(predrf_tr, trainSplit$left)      #Train Data
confusionMatrix(predrf_test, testSplit$left)     #Test Data

The confusion matrix on the train data gives an accuracy of 99%, and the confusion matrix on the test data also gives an accuracy of 99%.
Since the model shows similar performance on the train and test data, we can be confident that our Random Forest model is stable.
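A minimal sketch (using the predictions above) of pulling the accuracy figure out of the confusionMatrix object instead of reading it off the printed output:

#Overall test-set accuracy as a single number
cm_test <- confusionMatrix(predrf_test, testSplit$left)
cm_test$overall["Accuracy"]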
Comparing ROC curves for Decision Tree and Random Forest

# Comparing ROC curves for Decision Tree and Random Forest
library(pROC)

#Decision Tree ROC (predtest holds the decision tree predictions from the earlier slides)
auc1 <- roc(as.numeric(testSplit$left), as.numeric(predtest))
plot(auc1, col = 'blue', main = paste('AUC:', round(auc1$auc[[1]], 3)))

#Random Forest ROC (using the test-set predictions predrf_test)
aucrf <- roc(as.numeric(testSplit$left), as.numeric(predrf_test), ci = TRUE)
plot(aucrf, ylim = c(0, 1), print.thres = TRUE,
     main = paste('Random Forest AUC:', round(aucrf$auc[[1]], 3)), col = 'blue')

#Comparing both ROC curves on the same plot
plot(aucrf, ylim = c(0, 1), main = 'ROC Comparison: RF (blue), C5.0 (black)', col = 'blue')
par(new = TRUE)
plot(auc1)
The ROC curve for the Random Forest is better than that of the Decision Tree.
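As an optional sketch (using the auc1 and aucrf objects above), pROC can also compare the two curves formally rather than visually:

#DeLong test for the difference between the two AUCs
roc.test(auc1, aucrf)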

Classification Model
Naïve Bayes

Naïve Bayes
#Naive Bayes
library(e1071)

modelnb <- naiveBayes(as.factor(left) ~ . , data = trainSplit)
modelnb

The output shows the a-priori (class prior) probabilities of the target variable, followed by the conditional probability tables for each predictor.
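A minimal sketch (using the modelnb object above) for inspecting those pieces individually rather than printing the whole model:

modelnb$apriori       #class distribution of the target variable 'left'
modelnb$tables[1:2]   #conditional probability tables for the first two predictors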
Naïve Bayes
#Performance of Naïve Bayes using Confusion Matrix
prednb_tr <- predict(modelnb,trainSplit) #Train Data
prednb_test <- predict(modelnb,testSplit) #Test Data

confusionMatrix(prednb_tr,trainSplit$left) #Train Data


confusionMatrix(prednb_test,testSplit$left) #Test Data

The confusion matrix on the train data gives an accuracy of 78.84%, and the confusion matrix on the test data gives an accuracy of 78.58%.

Since the model shows similar performance on the train and test data, we can be confident that our Naïve Bayes model is stable.

Classification Model
kNN Algorithm

kNN Algorithm
#Data Preparation for kNN Algorithm
library(dummies)

#Creating dummy variables for the factor variables
dummy_df = dummy.data.frame(hr_data1[, c('role_code', 'salary.code')])

hr_data2 = hr_data1
hr_data2 = cbind.data.frame(hr_data2, dummy_df)

#Removing role_code and salary.code since we have created dummy variables
hr_data2 = hr_data2[, !(names(hr_data2) %in% c('role_code', 'salary.code'))]

#Converting variables to numeric datatype
hr_data2$Work_accident = as.numeric(hr_data2$Work_accident)
hr_data2$promotion_last_5years = as.numeric(hr_data2$promotion_last_5years)
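The dummies package has since been archived on CRAN; if it is unavailable, a minimal sketch using base R's model.matrix (keeping every factor level, as dummy.data.frame does) would be:

#Base-R alternative to dummy.data.frame(): full indicator columns for both factors
f <- hr_data1[, c('role_code', 'salary.code')]
dummy_df <- as.data.frame(
  model.matrix(~ . - 1, data = f,
               contrasts.arg = lapply(f, contrasts, contrasts = FALSE))
)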

kNN Algorithm
#Data Preparation for kNN Algorithm

#Scale the variables and check their final structure
X = hr_data2[, !(names(hr_data2) %in% c('left'))]
hr_data2_scaled = as.data.frame(scale(X))

str(hr_data2_scaled)

#Splitting the data for the model building
hr_train <- hr_data2_scaled[splitIndex,]
hr_test <- hr_data2_scaled[-splitIndex,]

hr_train_labels <- hr_data2[splitIndex, 'left']
hr_test_labels <- hr_data2[-splitIndex, 'left']
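As a quick sanity check (a minimal sketch, using hr_data2_scaled from above), the scaled columns should now have mean roughly 0 and standard deviation roughly 1:

round(colMeans(hr_data2_scaled), 3)        #all values should be approximately 0
round(apply(hr_data2_scaled, 2, sd), 3)    #all values should be approximately 1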

kNN Algorithm
#Applying kNN Algorithm on the dataset
library(class)
library(gmodels)

test_pred_1 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = 1)

CrossTable(x = hr_test_labels, y = test_pred_1, prop.chisq = FALSE)

From this crosstab we can compute the accuracy of the model for k = 1:

Accuracy = (TP + TN) / Total
         = (3311 + 1030) / 4499
         = 96.48%
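A minimal sketch (using the objects defined above) that computes the same figure directly instead of reading the cell counts off the crosstab:

#Accuracy = proportion of test labels the k = 1 model predicts correctly
mean(test_pred_1 == hr_test_labels)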

kNN Algorithm
#Applying kNN Algorithm on the dataset

As we did for k = 1, we can calculate the accuracy for k = 5, 10, 50, 100 and 122 (a short loop for doing this is sketched below). The accuracies are summarized in the table:

k       Accuracy
5       94.46%
10      94.17%
50      90.19%
100     86.48%
122     85.06%

From the accuracy table above, we observe that the accuracy decreases as k increases.
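A minimal sketch (assuming hr_train, hr_test and hr_train_labels from the earlier slides) of the loop used to fill in the table above:

#Fit kNN for each candidate k and record test-set accuracy
ks <- c(1, 5, 10, 50, 100, 122)
acc <- sapply(ks, function(k) {
  pred <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = k)
  mean(pred == hr_test_labels)
})
data.frame(k = ks, accuracy = round(acc * 100, 2))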

kNN Algorithm
# Thumb rule to decide on k for k-NN is sqrt(n)/2
k = sqrt(nrow(hr_train))/2
k
#51.2347 (which can be approximated to 51)

test_pred_rule <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = round(k))
CrossTable(x = hr_test_labels, y = test_pred_rule, prop.chisq = FALSE)
# accuracy = 4050/4499 = 90.02%

# Another method to determine k for k-NN: repeated cross-validation with caret
set.seed(400)
ct <- trainControl(method = "repeatedcv", repeats = 3)
fit <- train(as.factor(left) ~ ., data = hr_data2, method = "knn", trControl = ct,
             preProcess = c("center", "scale"), tuneLength = 20)
fit

# Checking accuracy of the model with k = 7
test_pred_7 <- knn(train = hr_train, test = hr_test, cl = hr_train_labels, k = 7)
CrossTable(x = hr_test_labels, y = test_pred_7, prop.chisq = FALSE)
# accuracy = 4357/4499 = 96.84%

#or alternatively we can use the command below (predictions first, then the reference labels)
confusionMatrix(test_pred_7, hr_test_labels)

Output on the next slide.


The output above indicates that k = 7 is the best value of k for this data, and it is preferable to use this value because it has been selected through cross-validation.
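A minimal sketch (using the fit object from the caret tuning above) for reading the selected k and visualizing the tuning curve:

fit$bestTune   #the value of k chosen by repeated cross-validation
plot(fit)      #accuracy as a function of k across the tuneLength grid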

Step 6
Model Summarization

Summary of Model Performance

Model Accuracy
Decision Tree 97.09%
Random Forest 99%
Naïve Bayes 78.84%
kNN Algorithm (Using k = 7) 96.84%

Appendix
Packages used for the Classification Analysis:

•data.table
•reshape2
•randomForest
•party # For decision tree
•rpart # for Rpart
•rpart.plot #for Rpart plot
•lattice # Used for Data Visualization
•caret # for data pre-processing, confusionMatrix() and train()
•pROC # for ROC curve
•corrplot # for correlation plot
•e1071 # for Naïve Bayes
•RColorBrewer
•dummies
•class
•gmodels

Thank You.
