
DM Assignment

Santhosh Sadasivam

12/11/2019

Description
Thera Bank - Loan Purchase Modeling
This case is about a bank (Thera Bank) which has a growing customer base. The majority of
these customers are liability customers (depositors) with deposits of varying sizes. The
number of customers who are also borrowers (asset customers) is quite small, and the
bank is interested in expanding this base rapidly to bring in more loan business and, in the
process, earn more through the interest on loans. In particular, the management wants to
explore ways of converting its liability customers to personal loan customers (while
retaining them as depositors). A campaign that the bank ran last year for liability
customers showed a healthy conversion rate of over 9%. This has encouraged the
retail marketing department to devise campaigns with better target marketing to increase
the success ratio with a minimal budget. The department wants to build a model that will
help them identify the potential customers who have a higher probability of purchasing the
loan. This will increase the success ratio while at the same time reducing the cost of the
campaign. The dataset has data on 5,000 customers. The data include customer
demographic information (age, income, etc.), the customer's relationship with the bank
(mortgage, securities account, etc.), and the customer response to the last personal loan
campaign (Personal Loan). Among these 5,000 customers, only 480 (= 9.6%) accepted the
personal loan that was offered to them in the earlier campaign.
Problem Statement
# Loading the required libraries
library(readxl)
library(readr)
library(DataExplorer)
library(caTools)
library(rpart)
library(rpart.plot)
library(rattle)

## Rattle: A free graphical interface for data science with R.


## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(data.table)
library(ROCR)
## Loading required package: gplots

##
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':


##
## lowess

library(ineq)
library(InformationValue)
library(ModelMetrics)

##
## Attaching package: 'ModelMetrics'

## The following objects are masked from 'package:InformationValue':


##
## confusionMatrix, npv, precision, sensitivity, specificity

## The following object is masked from 'package:base':


##
## kappa

library(reshape)

##
## Attaching package: 'reshape'

## The following object is masked from 'package:data.table':


##
## melt

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

##
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':


##
## importance

# Setting the working directory


setwd("C:/Users/santhosh/Desktop/R programming/DM Assignment")
getwd()

## [1] "C:/Users/santhosh/Desktop/R programming/DM Assignment"

# Reading the data file


data = read_xlsx("Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx",2)
# Exploratory data analysis
## Univariate Analysis
print(summary(data))

## ID Age (in years) Experience (in years) Income (in K/month)


## Min. : 1 Min. :23.00 Min. :-3.0 Min. : 8.00
## 1st Qu.:1251 1st Qu.:35.00 1st Qu.:10.0 1st Qu.: 39.00
## Median :2500 Median :45.00 Median :20.0 Median : 64.00
## Mean :2500 Mean :45.34 Mean :20.1 Mean : 73.77
## 3rd Qu.:3750 3rd Qu.:55.00 3rd Qu.:30.0 3rd Qu.: 98.00
## Max. :5000 Max. :67.00 Max. :43.0 Max. :224.00
##
## ZIP Code Family members CCAvg Education
## Min. : 9307 Min. :1.000 Min. : 0.000 Min. :1.000
## 1st Qu.:91911 1st Qu.:1.000 1st Qu.: 0.700 1st Qu.:1.000
## Median :93437 Median :2.000 Median : 1.500 Median :2.000
## Mean :93153 Mean :2.397 Mean : 1.938 Mean :1.881
## 3rd Qu.:94608 3rd Qu.:3.000 3rd Qu.: 2.500 3rd Qu.:3.000
## Max. :96651 Max. :4.000 Max. :10.000 Max. :3.000
## NA's :18
## Mortgage Personal Loan Securities Account CD Account
## Min. : 0.0 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0 Median :0.000 Median :0.0000 Median :0.0000
## Mean : 56.5 Mean :0.096 Mean :0.1044 Mean :0.0604
## 3rd Qu.:101.0 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :635.0 Max. :1.000 Max. :1.0000 Max. :1.0000
##
## Online CreditCard
## Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median :0.000
## Mean :0.5968 Mean :0.294
## 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.000
##

attach(data)

# The summary above shows the Mean, Median, Min, Max, 1st Qu. and 3rd Qu. for each column

18 NAs are observed in the variable Family members.


## Data Types
print(colnames(data))

## [1] "ID" "Age (in years)"


## [3] "Experience (in years)" "Income (in K/month)"
## [5] "ZIP Code" "Family members"
## [7] "CCAvg" "Education"
## [9] "Mortgage" "Personal Loan"
## [11] "Securities Account" "CD Account"
## [13] "Online" "CreditCard"

str(data)

## Classes 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 14 variables:


## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Age (in years) : num 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience (in years): num 1 19 15 9 8 13 27 24 10 9 ...
## $ Income (in K/month) : num 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP Code : num 91107 90089 94720 94112 91330 ...
## $ Family members : num 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : num 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal Loan : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Securities Account : num 1 1 0 0 0 0 0 0 0 0 ...
## $ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Online : num 0 0 0 0 0 1 1 0 1 0 ...
## $ CreditCard : num 0 0 0 0 1 0 0 1 0 0 ...

# All variables are read in as numeric, and the data is a mix of tibble and data.frame

data = data.frame(data) # converting dataset into dataframe


# All variables are stored as numbers, but a few of them need to be converted into factors

# Converting required variables into factors


data$Online=as.factor(data$Online)
data$Personal.Loan = as.factor(data$Personal.Loan)
data$Education=as.factor(data$Education)
data$Securities.Account=as.factor(data$Securities.Account)
data$CD.Account=as.factor(data$CD.Account)
data$CreditCard = as.factor(data$CreditCard)
data$Family.members = as.factor(data$Family.members)
print(str(data))

## 'data.frame': 5000 obs. of 14 variables:


## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Age..in.years. : num 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience..in.years.: num 1 19 15 9 8 13 27 24 10 9 ...
## $ Income..in.K.month. : num 49 34 11 100 45 29 72 22 81 180 ...
## $ ZIP.Code : num 91107 90089 94720 94112 91330 ...
## $ Family.members : Factor w/ 4 levels "1","2","3","4": 4 3 1 1 4 4 2 1 3 1 ...
## $ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal.Loan : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ Securities.Account : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
## $ CD.Account : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Online : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 2 1 ...
## $ CreditCard : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
## NULL

# Few more univariate analysis


print(dim(data))

## [1] 5000 14

The dataset has 5000 rows and 14 columns


# Column Names
print(colnames(data))

## [1] "ID" "Age..in.years."


## [3] "Experience..in.years." "Income..in.K.month."
## [5] "ZIP.Code" "Family.members"
## [7] "CCAvg" "Education"
## [9] "Mortgage" "Personal.Loan"
## [11] "Securities.Account" "CD.Account"
## [13] "Online" "CreditCard"

# Making column names syntactically valid

colnames(data)=make.names(colnames(data))
print(colnames(data))

## [1] "ID" "Age..in.years."


## [3] "Experience..in.years." "Income..in.K.month."
## [5] "ZIP.Code" "Family.members"
## [7] "CCAvg" "Education"
## [9] "Mortgage" "Personal.Loan"
## [11] "Securities.Account" "CD.Account"
## [13] "Online" "CreditCard"

# Removing the first column ID (a sequential number) and ZIP Code, as neither is
# required for our processing
data = data[,-1] # removing the first column, ID
data = data[,-4] # removing ZIP Code (the 4th column after dropping ID)

# Identifying NA in the dataset


sum(is.na(data))

## [1] 18
There are 18 NAs in the dataset; as observed earlier, all 18 are in the Family members column.
# Proportion of responders and non-responders to the personal loan campaign

prop.table(table(data$Personal.Loan))*100

##
## 0 1
## 90.4 9.6

9.6% of customers responded to the personal loan campaign, while 90.4% did not respond.
# missing values and plotting them
plot_missing(data)

colSums(is.na(data))

## Age..in.years. Experience..in.years. Income..in.K.month.


## 0 0 0
## Family.members CCAvg Education
## 18 0 0
## Mortgage Personal.Loan Securities.Account
## 0 0 0
## CD.Account Online CreditCard
## 0 0 0

Family members has 0.36% missing values. Since the percentage is low, we can
delete those rows from the dataset.
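Deleting these rows is the simplest option given the small percentage. A hedged alternative, sketched below on a hypothetical copy called data_imputed (so the data frame used in the rest of the analysis is untouched), would be to impute the missing values with the most frequent family size instead:

# Hedged alternative (illustration only, not applied below): impute the 18
# missing Family.members values with the most frequent level
mode_level = names(which.max(table(data$Family.members))) # modal family size
data_imputed = data # hypothetical copy, keeps 'data' unchanged
data_imputed$Family.members[is.na(data_imputed$Family.members)] = mode_level
sum(is.na(data_imputed$Family.members)) # should now be 0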
# Missing Value Treatment
print.data.frame(data[!complete.cases(data),]) # showing rows where NA is present

## Age..in.years. Experience..in.years. Income..in.K.month.


## 21 56 31 25
## 59 28 2 93
## 99 49 23 94
## 162 61 35 80
## 236 38 8 71
## 290 42 15 24
## 488 39 13 88
## 722 49 24 39
## 1461 40 16 85
## 1462 54 28 48
## 2400 62 36 41
## 2833 45 21 133
## 3702 58 33 95
## 4136 48 23 168
## 4139 47 22 114
## 4403 55 25 52
## 4404 50 24 112
## 4764 51 25 173
## Family.members CCAvg Education Mortgage Personal.Loan
## 21 <NA> 0.9 2 111 0
## 59 <NA> 0.2 1 0 0
## 99 <NA> 0.3 1 0 0
## 162 <NA> 2.8 1 0 0
## 236 <NA> 1.8 3 0 0
## 290 <NA> 1.0 2 0 0
## 488 <NA> 1.4 2 0 0
## 722 <NA> 1.4 3 0 0
## 1461 <NA> 0.2 3 0 0
## 1462 <NA> 0.2 1 0 0
## 2400 <NA> 1.0 3 154 0
## 2833 <NA> 5.7 3 0 1
## 3702 <NA> 2.6 1 0 0
## 4136 <NA> 2.8 1 308 0
## 4139 <NA> 0.6 1 0 0
## 4403 <NA> 1.4 3 207 0
## 4404 <NA> 0.0 1 0 0
## 4764 <NA> 0.5 2 0 1
## Securities.Account CD.Account Online CreditCard
## 21 0 0 1 0
## 59 0 0 0 0
## 99 0 0 1 0
## 162 0 0 1 0
## 236 0 0 1 0
## 290 0 0 1 1
## 488 0 0 1 1
## 722 0 0 1 0
## 1461 0 0 1 1
## 1462 0 0 1 0
## 2400 1 0 1 0
## 2833 0 1 1 1
## 3702 0 0 1 0
## 4136 0 0 1 0
## 4139 1 1 1 1
## 4403 1 0 0 0
## 4404 0 0 0 0
## 4764 0 0 1 0

data = na.omit(data) # deleting rows containing NAs


colSums(is.na(data)) # verifying that no NAs remain in any column

## Age..in.years. Experience..in.years. Income..in.K.month.


## 0 0 0
## Family.members CCAvg Education
## 0 0 0
## Mortgage Personal.Loan Securities.Account
## 0 0 0
## CD.Account Online CreditCard
## 0 0 0

All rows that had NAs have been removed completely


# Negative values - as observed earlier, years of experience has negative values,
# which is not possible. Also checking the other columns.

print(colSums(data<0))

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Age..in.years. Experience..in.years. Income..in.K.month.


## 0 52 0
## Family.members CCAvg Education
## NA 0 NA
## Mortgage Personal.Loan Securities.Account
## 0 NA NA
## CD.Account Online CreditCard
## NA NA NA

cat("Total Negative Values:",


(length(data$Experience..in.years.[data$Experience..in.years.<0])/nrow(data))
*100 , "%")

## Total Negative Values: 1.043758 %

Professional experience has 52 negative values (about 1.04% of the rows).
Since experience cannot be negative, we need to treat these values.
# Negative Value Treatment

data[data$Experience..in.years. < 0, "Experience..in.years."] =
  mean(data$Experience..in.years.[data$Experience..in.years. >= 0])

print(colSums(data < 0))

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Warning in Ops.factor(left, right): '<' not meaningful for factors

## Age..in.years. Experience..in.years. Income..in.K.month.


## 0 0 0
## Family.members CCAvg Education
## NA 0 NA
## Mortgage Personal.Loan Securities.Account
## 0 NA NA
## CD.Account Online CreditCard
## NA NA NA

Negative values have been replaced with the mean of the non-negative experience values
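A hedged alternative, shown only as a commented sketch, would have been to replace the negative values with the median of the valid experience values instead of the mean, since the median is less affected by skew:

# Hedged alternative (illustration only, not applied): median-based replacement,
# which would have been run in place of the mean-based treatment above
# med_exp = median(data$Experience..in.years.[data$Experience..in.years. >= 0])
# data$Experience..in.years.[data$Experience..in.years. < 0] = med_exp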


## Bivariate Analysis in EDA
# Finding correlation between the variables
# Correlation plot
plot_correlation(data)

Age and Experience are highly correlated. Income and average credit card spend (CCAvg) show a
moderate correlation. No other significant correlation is visible in the plot.
# Histogram

plot_histogram(data, binary_as_factor = FALSE,
               geom_histogram_args = list("fill" = "red"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Age and Experience are approximately normally distributed. Income and CCAvg are right-skewed.
Around 70% of customers have no mortgage.
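The no-mortgage share quoted above can be checked directly (a quick sketch; the ~70% figure is otherwise read off the histogram):

# Share of customers with zero mortgage
mean(data$Mortgage == 0) * 100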
# Finding Outliers using Boxplot

plot_boxplot(data, by = "Personal.Loan", binary_as_factor = FALSE,
             geom_boxplot_args = list("fill" = "Blue"))
Outliers are observed in CCAvg, Income and Mortgage
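As a rough cross-check of the boxplots, the sketch below counts the points flagged by the standard 1.5*IQR whisker rule in the three skewed columns (assuming the same default rule the boxplots use):

# Count points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR for the skewed variables
iqr_outliers = function(x) {
  q = quantile(x, c(0.25, 0.75))
  iqr = q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}
sapply(data[, c("Income..in.K.month.", "CCAvg", "Mortgage")], iqr_outliers)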
# Density Plot

plot_density(data, binary_as_factor = FALSE,
             geom_density_args = list("fill" = "Green"))

## Splitting data into Train and Test Data Sets

seed = 1000
set.seed(seed)
x = sample.split(data$Personal.Loan, SplitRatio = 0.7)
TrainDS = subset(data, x==TRUE)
TestDS = subset(data,x==FALSE)
TrainDS_RF = TrainDS
TestDS_RF = TestDS
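Since sample.split stratifies on the target variable, both partitions should retain roughly the 9.6% responder rate; a quick check (sketch):

# Responder proportion should be ~9.6% in both partitions
prop.table(table(TrainDS$Personal.Loan)) * 100
prop.table(table(TestDS$Personal.Loan)) * 100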

Modelling
# CART Modelling
# setting CART Parameters
cartParameters = rpart.control(minsplit = 15, cp =0.009,xval = 10)
cartModel = rpart(formula = TrainDS$Personal.Loan ~ ., data = TrainDS,
                  method = "class", control = cartParameters)
cartModel

## n= 3488
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3488 335 0 (0.903956422 0.096043578)
## 2) Income..in.K.month.< 119.5 2874 76 0 (0.973556019 0.026443981)
## 4) CCAvg< 2.95 2633 13 0 (0.995062666 0.004937334) *
## 5) CCAvg>=2.95 241 63 0 (0.738589212 0.261410788)
## 10) CD.Account=0 221 47 0 (0.787330317 0.212669683)
## 20) Education=1 113 9 0 (0.920353982 0.079646018) *
## 21) Education=2,3 108 38 0 (0.648148148 0.351851852)
## 42) Income..in.K.month.< 92.5 67 9 0 (0.865671642 0.134328358) *
## 43) Income..in.K.month.>=92.5 41 12 1 (0.292682927 0.707317073) *
## 11) CD.Account=1 20 4 1 (0.200000000 0.800000000) *
## 3) Income..in.K.month.>=119.5 614 259 0 (0.578175896 0.421824104)
## 6) Education=1 406 51 0 (0.874384236 0.125615764)
## 12) Family.members=1,2 355 0 0 (1.000000000 0.000000000) *
## 13) Family.members=3,4 51 0 1 (0.000000000 1.000000000) *
## 7) Education=2,3 208 0 1 (0.000000000 1.000000000) *

# plotting the model


fancyRpartPlot(cartModel)

printcp(cartModel)

##
## Classification tree:
## rpart(formula = TrainDS$Personal.Loan ~ ., data = TrainDS, method = "class",
##     control = cartParameters)
##
## Variables actually used in tree construction:
## [1] CCAvg CD.Account Education
## [4] Family.members Income..in.K.month.
##
## Root node error: 335/3488 = 0.096044
##
## n= 3488
##
## CP nsplit rel error xerror xstd
## 1 0.31045 0 1.00000 1.00000 0.051946
## 2 0.15224 2 0.37910 0.38806 0.033395
## 3 0.01791 3 0.22687 0.23582 0.026230
## 4 0.00900 7 0.14030 0.14925 0.020956

plotcp(cartModel)

The CART tree built above has scope for pruning, as we can see from the plot, by choosing
the CP with the lowest cross-validated error.
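Below, the CP with the minimum cross-validated error is used. A common alternative (a sketch only, not applied here) is the 1-SE rule, which picks the largest CP whose xerror lies within one standard error of the minimum:

# Hedged alternative: 1-SE rule for choosing CP (illustration only)
cpTab = cartModel$cptable
minRow = which.min(cpTab[, "xerror"])
thresh = cpTab[minRow, "xerror"] + cpTab[minRow, "xstd"]
cp_1se = max(cpTab[cpTab[, "xerror"] <= thresh, "CP"])
cp_1se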
# Finding the best CP

bestCP = cartModel$cptable[which.min(cartModel$cptable[,"xerror"]), "CP"]


bestCP

## [1] 0.009

## pruning Tree
pTree = prune(cartModel,cp = bestCP, "CP")
pTree

## n= 3488
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3488 335 0 (0.903956422 0.096043578)
## 2) Income..in.K.month.< 119.5 2874 76 0 (0.973556019 0.026443981)
## 4) CCAvg< 2.95 2633 13 0 (0.995062666 0.004937334) *
## 5) CCAvg>=2.95 241 63 0 (0.738589212 0.261410788)
## 10) CD.Account=0 221 47 0 (0.787330317 0.212669683)
## 20) Education=1 113 9 0 (0.920353982 0.079646018) *
## 21) Education=2,3 108 38 0 (0.648148148 0.351851852)
## 42) Income..in.K.month.< 92.5 67 9 0 (0.865671642 0.134328358) *
## 43) Income..in.K.month.>=92.5 41 12 1 (0.292682927 0.707317073) *
## 11) CD.Account=1 20 4 1 (0.200000000 0.800000000) *
## 3) Income..in.K.month.>=119.5 614 259 0 (0.578175896 0.421824104)
## 6) Education=1 406 51 0 (0.874384236 0.125615764)
## 12) Family.members=1,2 355 0 0 (1.000000000 0.000000000) *
## 13) Family.members=3,4 51 0 1 (0.000000000 1.000000000) *
## 7) Education=2,3 208 0 1 (0.000000000 1.000000000) *

#Plotting Pruned Tree


fancyRpartPlot(pTree, main = "Pruned Tree")
printcp(pTree)

##
## Classification tree:
## rpart(formula = TrainDS$Personal.Loan ~ ., data = TrainDS, method = "class",
##     control = cartParameters)
##
## Variables actually used in tree construction:
## [1] CCAvg CD.Account Education
## [4] Family.members Income..in.K.month.
##
## Root node error: 335/3488 = 0.096044
##
## n= 3488
##
## CP nsplit rel error xerror xstd
## 1 0.31045 0 1.00000 1.00000 0.051946
## 2 0.15224 2 0.37910 0.38806 0.033395
## 3 0.01791 3 0.22687 0.23582 0.026230
## 4 0.00900 7 0.14030 0.14925 0.020956

The final tree is built with lowest xerror and 7 splits


## CART model Performance
# Prediction
TrainDS$Prediction = predict(pTree, TrainDS, type = "class")
TrainDS$Probability = predict(pTree, TrainDS, type = "prob")[,"1"]
head(TrainDS) #Prediction and probability columns added to Training Data

## Age..in.years. Experience..in.years. Income..in.K.month. Family.members


## 1 25 1 49 4
## 3 39 15 11 1
## 4 35 9 100 1
## 5 35 8 45 4
## 6 37 13 29 4
## 8 50 24 22 1
## CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account
## 1 1.6 1 0 0 1 0
## 3 1.0 1 0 0 0 0
## 4 2.7 2 0 0 0 0
## 5 1.0 2 0 0 0 0
## 6 0.4 2 155 0 0 0
## 8 0.3 3 0 0 0 0
## Online CreditCard Prediction Probability
## 1 0 0 0 0.004937334
## 3 0 0 0 0.004937334
## 4 0 0 0 0.004937334
## 5 0 1 0 0.004937334
## 6 1 0 0 0.004937334
## 8 0 1 0 0.004937334

# Confusion Matrix

tb1_TrDS_CART = table(TrainDS$Prediction, TrainDS$Personal.Loan)


tb1_TrDS_CART

##
## 0 1
## 0 3137 31
## 1 16 304

# Classification Error Rate / Misclassification

CER_TrDS = (tb1_TrDS_CART[1,2]+tb1_TrDS_CART[2,1])/sum(tb1_TrDS_CART)
CER_TrDS

## [1] 0.01347477

The classification error rate is 1.3%


# Accuracy of the Model
# as we know that accuracy is 1-Error

Acc_TrDS = 1 - CER_TrDS
Acc_TrDS
## [1] 0.9865252

Accuracy of the model is 98.7%


# True positive Rate / Sensitivity
TPR_TrDS=tb1_TrDS_CART[2,2]/(tb1_TrDS_CART[1,2]+tb1_TrDS_CART[2,2]) # True positive rate or sensitivity (TP/(TP+FN))
TPR_TrDS

## [1] 0.9074627

TPR / Sensitivity is 0.9074627 / 90.7%


# True negative rate or specificity:
TNR_TrDS=tb1_TrDS_CART[1,1]/(tb1_TrDS_CART[1,1]+tb1_TrDS_CART[2,1]) # True negative rate or specificity (TN/(TN+FP))
TNR_TrDS

## [1] 0.9949255

TNR / Specificity is 0.9949255


# Creating Decile and chopping into buckets
prob_TrDS_CART = seq(0,1,length = 11)
qt_TrDS_CART = quantile(TrainDS$Probability,prob_TrDS_CART)
qt_TrDS_CART

## 0% 10% 20% 30% 40% 50%


## 0.000000000 0.000000000 0.004937334 0.004937334 0.004937334 0.004937334
## 60% 70% 80% 90% 100%
## 0.004937334 0.004937334 0.004937334 0.134328358 1.000000000

TrainDS$deciles = cut(TrainDS$Probability,
unique(qt_TrDS_CART),include.lowest = TRUE, right = TRUE)
table(TrainDS$deciles)

##
## [0,0.00494] (0.00494,0.134] (0.134,1]
## 2988 180 320

Only three distinct buckets were created: 0-0.00494 is one bucket, 0.00494-0.134 is the second,
and 0.134-1 is the last bucket.
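Only three buckets form because a CART tree can output at most one probability per terminal node, so most decile cut points coincide; this can be confirmed directly (sketch):

# Number of distinct scores is at most the number of terminal nodes in the tree
length(unique(TrainDS$Probability))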
# Rank ordering table (Model Performance 1)

TrainDS = data.table(TrainDS)
rankTbl_TrDS_CART = TrainDS[, list(
cnt = length(Personal.Loan),
cnt_tar1 = sum(Personal.Loan == 1),
cnt_tar0 = sum(Personal.Loan == 0)),
by=deciles][order(-deciles)]
rankTbl_TrDS_CART$resp_rate = round(rankTbl_TrDS_CART$cnt_tar1 /
rankTbl_TrDS_CART$cnt,4)*100;
rankTbl_TrDS_CART$cum_resp = cumsum(rankTbl_TrDS_CART$cnt_tar1)
rankTbl_TrDS_CART$cum_non_resp = cumsum(rankTbl_TrDS_CART$cnt_tar0)
rankTbl_TrDS_CART$cum_rel_resp = round(rankTbl_TrDS_CART$cum_resp /
sum(rankTbl_TrDS_CART$cnt_tar1),4)*100
rankTbl_TrDS_CART$cum_rel_non_resp = round(rankTbl_TrDS_CART$cum_non_resp /
sum(rankTbl_TrDS_CART$cnt_tar0),4)*100
rankTbl_TrDS_CART$ks = abs(rankTbl_TrDS_CART$cum_rel_resp -
rankTbl_TrDS_CART$cum_rel_non_resp) #ks
print(rankTbl_TrDS_CART)

## deciles cnt cnt_tar1 cnt_tar0 resp_rate cum_resp cum_non_resp


## 1: (0.134,1] 320 304 16 95.00 304 16
## 2: (0.00494,0.134] 180 18 162 10.00 322 178
## 3: [0,0.00494] 2988 13 2975 0.44 335 3153
## cum_rel_resp cum_rel_non_resp ks
## 1: 90.75 0.51 90.24
## 2: 96.12 5.65 90.47
## 3: 100.00 100.00 0.00

# auc,ks & gini computing methods


predObj_TrDS = prediction(TrainDS$Probability, TrainDS$Personal.Loan)
perf_TrDS = performance(predObj_TrDS, "tpr" , "fpr")
plot(perf_TrDS)

ROC curve has been plotted


ks_TrDS = max(perf_TrDS@y.values[[1]] - perf_TrDS@x.values[[1]])
auc_TrDS = performance(predObj_TrDS, "auc")
auc_TrDS = as.numeric(auc_TrDS@y.values)
gini_TrDS = ineq(TrainDS$Probability,type = "Gini")
cat("Ks=", ks_TrDS,
"auc=" , auc_TrDS,
"gini=" , gini_TrDS)

## Ks= 0.9108586 auc= 0.9799244 gini= 0.8676614

# concordance and Discordance


Concordance_TrDS = Concordance(actuals = TrainDS$Personal.Loan,
predictedScores = TrainDS$Probability)
Concordance_TrDS

## $Concordance
## [1] 0.9629162
##
## $Discordance
## [1] 0.03708385
##
## $Tied
## [1] -2.775558e-17
##
## $Pairs
## [1] 1056255

Concordance is 96%, which indicates very good model discrimination
# Root Mean Square Error (RMSE)
# computed considering the personal loan as a continuous variable or number
RMSE_TrDS = rmse(TrainDS$Personal.Loan,TrainDS$Prediction)
RMSE_TrDS

## [1] 0.1160809

Root Mean Square Error is 0.1160809 / 11.6%
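Since both columns are stored as 0/1 factors, a manual cross-check (a sketch, assuming rmse() coerces the factors in an equivalent way) converts them to numbers before computing the error:

# Manual RMSE cross-check treating the 0/1 factors as numbers
act = as.numeric(as.character(TrainDS$Personal.Loan))
pred = as.numeric(as.character(TrainDS$Prediction))
sqrt(mean((act - pred)^2))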


# Mean Absolute Error considering Personal Loan as a number
MAE_TrDS = mae(TrainDS$Personal.Loan, TrainDS$Prediction)
MAE_TrDS

## [1] 0.01347477

Mean Absolute Error is 0.0135 / 1.3%


## Model performance of Test Data on the built model

# Prediction

TestDS$Prediction=predict(pTree, TestDS, type = "class")


TestDS$Probability=predict(pTree, TestDS, type = "prob")[ ,'1']
# Confusion Matrix

tb1_TeDS=table(TestDS$Prediction, TestDS$Personal.Loan)
print(tb1_TeDS)

##
## 0 1
## 0 1343 14
## 1 8 129

# Classification Error Computation


CeR_TeDS=(tb1_TeDS[1,2]+tb1_TeDS[2,1])/ sum(tb1_TeDS)
CeR_TeDS

## [1] 0.01472557

Classification Error Rate on Test data is 0.01472557 / 1.5%


# Computing Accuracy
Accuracy_TeDS = 1-CeR_TeDS # Since Accuracy is 1-Error
Accuracy_TeDS

## [1] 0.9852744

Accuracy of the model on the test data is 98.5%, which is quite similar to that on the
training data.
# finding True positive rate / Sensitivity
TPR_TeDS=tb1_TeDS[2,2]/(tb1_TeDS[1,2]+tb1_TeDS[2,2])
TPR_TeDS

## [1] 0.9020979

# Finding True Negative Rate / Specificity


TNR_TeDS = tb1_TeDS[1,1]/ (tb1_TeDS[1,1]+tb1_TeDS[2,1])
TNR_TeDS

## [1] 0.9940785

# Creating Decile and chopping them to buckets


prob_TeDS_CART = seq(0,1,length = 11)
qt_TeDS_CART = quantile(TestDS$Probability, prob_TeDS_CART)
qt_TeDS_CART

## 0% 10% 20% 30% 40% 50%


## 0.000000000 0.000000000 0.004937334 0.004937334 0.004937334 0.004937334
## 60% 70% 80% 90% 100%
## 0.004937334 0.004937334 0.004937334 0.134328358 1.000000000

As the quantiles show, most predicted probabilities are very low: roughly 86% of the test
observations fall in the lowest bucket, while the top decile (probabilities above 0.134) captures the bulk of the responders.
TestDS$deciles = cut(TestDS$Probability, unique(qt_TeDS_CART),include.lowest
= TRUE, right = TRUE)
table(TestDS$deciles)

##
## [0,0.00494] (0.00494,0.134] (0.134,1]
## 1282 75 137

## Model performance measures on Test Data

# Rank Order Table

testDT = data.table(TestDS)
rankTbl_TeDS_CART = testDT[, list(
cnt = length(Personal.Loan),
cnt_tar1 = sum(Personal.Loan == 1),
cnt_tar0 = sum(Personal.Loan == 0)),
by=deciles][order(-deciles)]
rankTbl_TeDS_CART$resp_rate = round(rankTbl_TeDS_CART$cnt_tar1 /
rankTbl_TeDS_CART$cnt,4)*100
rankTbl_TeDS_CART$cum_resp = cumsum(rankTbl_TeDS_CART$cnt_tar1)
rankTbl_TeDS_CART$cum_non_resp = cumsum(rankTbl_TeDS_CART$cnt_tar0)
rankTbl_TeDS_CART$cum_rel_resp = round(rankTbl_TeDS_CART$cum_resp /
sum(rankTbl_TeDS_CART$cnt_tar1),4)*100
rankTbl_TeDS_CART$cum_rel_non_resp = round(rankTbl_TeDS_CART$cum_non_resp /
sum(rankTbl_TeDS_CART$cnt_tar0),4)*100
rankTbl_TeDS_CART$ks = abs(rankTbl_TeDS_CART$cum_rel_resp -
rankTbl_TeDS_CART$cum_rel_non_resp) #ks
rankTbl_TeDS_CART

## deciles cnt cnt_tar1 cnt_tar0 resp_rate cum_resp cum_non_resp


## 1: (0.134,1] 137 129 8 94.16 129 8
## 2: (0.00494,0.134] 75 5 70 6.67 134 78
## 3: [0,0.00494] 1282 9 1273 0.70 143 1351
## cum_rel_resp cum_rel_non_resp ks
## 1: 90.21 0.59 89.62
## 2: 93.71 5.77 87.94
## 3: 100.00 100.00 0.00

# Calculating auc,ks & gini computing methods on Test data

predObj_TeDS = prediction(TestDS$Probability, TestDS$Personal.Loan)


perf_TeDS = performance(predObj_TeDS, "tpr" , "fpr")
plot(perf_TeDS)
ks_TeDS = max(perf_TeDS@y.values[[1]] - perf_TeDS@x.values[[1]])
auc_TeDS = performance(predObj_TeDS, "auc")
auc_TeDS = as.numeric(auc_TeDS@y.values)
gini_TeDS = ineq(TestDS$Probability,type = "Gini")
cat("Ks=", ks_TeDS,
"auc=" , auc_TeDS,
"gini=" , gini_TeDS)

## Ks= 0.8961764 auc= 0.9686039 gini= 0.8680705

# Concordance and Discordance Ratio Computation

Concordance_TeDS = Concordance(actuals = TestDS$Personal.Loan,
                               predictedScores = TestDS$Probability)
Concordance_TeDS

## $Concordance
## [1] 0.9419803
##
## $Discordance
## [1] 0.0580197
##
## $Tied
## [1] -4.163336e-17
##
## $Pairs
## [1] 193193
# Root Mean Square Error (RMSE)
# computed considering the personal loan as a continuous variable or number

RMSE_TeDS = rmse(TestDS$Personal.Loan,TestDS$Prediction)
RMSE_TeDS

## [1] 0.121349

# Mean Absolute Error considering Personal Loan as a number

MAE_TeDS = mae(TestDS$Personal.Loan, TestDS$Prediction)


MAE_TeDS

## [1] 0.01472557

# CART model performance Table

Performance_KPI = c("Classification Error Rate",
"Accuracy",
"TPR",
"TNR",
"ks",
"auc",
"gini",
"Concordance",
"RMSE*",
"MAE*")

Training_CART = c(CER_TrDS,
Acc_TrDS,
TPR_TrDS,
TNR_TrDS,
ks_TrDS,
auc_TrDS,
gini_TrDS,
Concordance_TrDS$Concordance,
RMSE_TrDS,
MAE_TrDS)
Test_CART =c(CeR_TeDS,
Accuracy_TeDS,
TPR_TeDS,
TNR_TeDS,
ks_TeDS,
auc_TeDS,
gini_TeDS,
Concordance_TeDS$Concordance,
RMSE_TeDS,
MAE_TeDS)

x=cbind(Performance_KPI,Training_CART,Test_CART)
x=data.table(x)
x$Training_CART=as.numeric(x$Training_CART)
x$Test_CART=as.numeric(x$Test_CART)
print(x)

## Performance_KPI Training_CART Test_CART


## 1: Classification Error Rate 0.01347477 0.01472557
## 2: Accuracy 0.98652523 0.98527443
## 3: TPR 0.90746269 0.90209790
## 4: TNR 0.99492547 0.99407846
## 5: ks 0.91085865 0.89617636
## 6: auc 0.97992436 0.96860393
## 7: gini 0.86766141 0.86807045
## 8: Concordance 0.96291615 0.94198030
## 9: RMSE* 0.11608088 0.12134896
## 10: MAE* 0.01347477 0.01472557

• RMSE* and MAE* are computed considering Personal Loan as a number


RANDOM FOREST
## Random Forest

# Building Random Forest Model

TrainDS =TrainDS_RF
TestDS=TestDS_RF

rndForest=randomForest(Personal.Loan ~ ., data = TrainDS, ntree=501, mtry=5,
                       nodesize=10, importance=TRUE)
print(rndForest)

##
## Call:
## randomForest(formula = Personal.Loan ~ ., data = TrainDS, ntree = 501,
##     mtry = 5, nodesize = 10, importance = TRUE)
## Type of random forest: classification
## Number of trees: 501
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 1.32%
## Confusion matrix:
## 0 1 class.error
## 0 3147 6 0.00190295
## 1 40 295 0.11940299

# Minimum error rate across the OOB and class-wise error curves

min(rndForest$err.rate)

## [1] 0.001585791
# Plotting Error Rates for Random Forest

plot(rndForest, main = "")


legend("topright", c("OOB", "0", "1"), text.col = 1:6, lty = 1:3, col = 1:3)
title(main = "Error Rates Random Forest TrainDT")

After about 100 trees the error curves flatten out.

We will therefore tune using 101 trees; an odd number of trees is used so that majority voting across trees cannot produce ties.
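As a quick sanity check (a sketch, assuming the error curves really have flattened by around 100 trees), the forest can be refit with 101 trees and its OOB error compared with the 501-tree model above:

# Hedged sketch: refit with 101 trees and compare the out-of-bag error
set.seed(seed)
rndForest101 = randomForest(Personal.Loan ~ ., data = TrainDS, ntree = 101,
                            mtry = 5, nodesize = 10, importance = TRUE)
rndForest101$err.rate[101, "OOB"] # OOB error after the 101st tree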
# finding importance parameter

print(rndForest$importance)

## 0 1 MeanDecreaseAccuracy
## Age..in.years. 3.673304e-03 6.194839e-04 3.378982e-03
## Experience..in.years. 3.189203e-03 2.180676e-03 3.091802e-03
## Income..in.K.month. 1.285083e-01 4.589875e-01 1.600288e-01
## Family.members 5.306916e-02 7.721191e-02 5.536076e-02
## CCAvg 3.162860e-02 7.401166e-02 3.565070e-02
## Education 7.129464e-02 1.317559e-01 7.705097e-02
## Mortgage 1.186234e-03 -2.870626e-03 7.936276e-04
## Securities.Account 4.405486e-05 -5.149235e-05 3.663810e-05
## CD.Account 3.255770e-03 1.031084e-02 3.929585e-03
## Online 5.194141e-05 3.917111e-04 8.482881e-05
## CreditCard 6.970194e-04 5.825216e-04 6.873496e-04
## MeanDecreaseGini
## Age..in.years. 9.7745913
## Experience..in.years. 9.9801487
## Income..in.K.month. 182.4077177
## Family.members 83.4483778
## CCAvg 80.2558097
## Education 155.3222629
## Mortgage 8.3953196
## Securities.Account 0.7925112
## CD.Account 29.3202665
## Online 1.0615870
## CreditCard 2.0055241

# Tuning Random Forest

set.seed(1000)
set.seed(seed)
tRndForest = tuneRF(x = TrainDS[, -which(colnames(TrainDS) == "Personal.Loan")],
                    y = TrainDS$Personal.Loan,
                    mtryStart = 9,
                    ntreeTry = 101,
                    stepFactor = 1.2,
                    improve = 0.001,
                    trace = FALSE,
                    plot = TRUE,
                    doBest = TRUE,
                    nodesize = 10,
                    importance = TRUE)

## 0.1632653 0.001
## -0.1463415 0.001
## -0.1707317 0.001
# Finding important variables

importance(tRndForest)

## 0 1 MeanDecreaseAccuracy
## Age..in.years. 17.785873 -1.1339472 16.0236905
## Experience..in.years. 13.793447 -0.5375885 13.0188811
## Income..in.K.month. 235.480675 123.4019580 245.4670544
## Family.members 174.449089 68.6975275 178.4550860
## CCAvg 34.016640 52.1285949 41.2768729
## Education 226.777339 96.6939567 238.5661690
## Mortgage 4.066117 0.1846508 4.0537627
## Securities.Account -1.568248 1.5192923 -1.0841882
## CD.Account 13.230849 13.5666479 17.7969469
## Online 0.454509 0.6883306 0.7216765
## CreditCard 2.689135 -0.4054599 2.2414486
## MeanDecreaseGini
## Age..in.years. 7.9980541
## Experience..in.years. 6.7010570
## Income..in.K.month. 190.9673454
## Family.members 97.0932383
## CCAvg 58.7611727
## Education 189.1666498
## Mortgage 2.9831095
## Securities.Account 0.5013371
## CD.Account 18.3850897
## Online 0.8344523
## CreditCard 0.9173974

Income is the most important variable.

Education, Family members and CCAvg are also important variables.
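The same ranking can also be inspected visually with the package's built-in importance plot (a small sketch):

# Visualise variable importance for the tuned random forest
varImpPlot(tRndForest, main = "Variable Importance - Tuned Random Forest")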
# Random Forest Model Performance
# Performance on Training Data

# Prediction:

TrainDS$Prediction_RF=predict(tRndForest, TrainDS, type = "class")


TrainDS$Probability1_RF=predict(tRndForest, TrainDS, type = "prob")[,"1"]

# Confusion Matrix:

tbl_TrDS_RF=table(TrainDS$Prediction_RF, TrainDS$Personal.Loan)
tbl_TrDS_RF

##
## 0 1
## 0 3153 17
## 1 0 318

# Classification Error Rate:

CeR_TrDS_RF=(tbl_TrDS_RF[1,2]+tbl_TrDS_RF[2,1])/sum(tbl_TrDS_RF)

#classification Error Rate or error rate (FP+FN/TP+FP+TN+FN)

CeR_TrDS_RF

## [1] 0.004873853

# Accuracy:

Accuracy_TrDS_RF=1-(tbl_TrDS_RF[1,2]+tbl_TrDS_RF[2,1])/sum(tbl_TrDS_RF)

#accuracy (1-error rate)

Accuracy_TrDS_RF

## [1] 0.9951261

# True positive rate or sensitivity:

TPR_TrDS_RF=tbl_TrDS_RF[2,2]/(tbl_TrDS_RF[1,2]+tbl_TrDS_RF[2,2])

#True positive rate or sensitivity (TP/TP+FN)


TPR_TrDS_RF

## [1] 0.9492537

# True negative rate or specificity:

TNR_TrDS_RF=tbl_TrDS_RF[1,1]/(tbl_TrDS_RF[1,1]+tbl_TrDS_RF[2,1])

#True negative rate or specificity (TN/TN+FP)

TNR_TrDS_RF

## [1] 1

# Creating Decile and Chopping into unique buckets:

probs_TrDS_RF=seq(0,1,length=11)
qs_TrDS_RF=quantile(TrainDS$Probability1_RF, probs_TrDS_RF)
qs_TrDS_RF

## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.248 1.000

# Splitting the deciles

TrainDS$deciles_RF=cut(TrainDS$Probability1_RF, unique(qs_TrDS_RF),
include.lowest = TRUE, right=TRUE)
table(TrainDS$deciles_RF)

##
## [0,0.002] (0.002,0.248] (0.248,1]
## 2826 313 349

Three buckets were formed: [0, 0.002], (0.002, 0.248] and (0.248, 1]; the majority of
observations (2,826 of 3,488) fall in the lowest bucket.
# Rank ordering table computing

library(data.table)
trainDT = data.table(TrainDS)
rankTbl_TrDS_RF = trainDT[, list(
cnt = length(Personal.Loan),
cnt_tar1= sum(Personal.Loan == 1),
cnt_tar0 = sum(Personal.Loan == 0)),
by=deciles_RF][order(-deciles_RF)]
rankTbl_TrDS_RF$resp_rate = round(rankTbl_TrDS_RF$cnt_tar1 /
rankTbl_TrDS_RF$cnt,4)*100
rankTbl_TrDS_RF$cum_resp = cumsum(rankTbl_TrDS_RF$cnt_tar1)
rankTbl_TrDS_RF$cum_non_resp = cumsum(rankTbl_TrDS_RF$cnt_tar0)
rankTbl_TrDS_RF$cum_rel_resp = round(rankTbl_TrDS_RF$cum_resp /
sum(rankTbl_TrDS_RF$cnt_tar1),4)*100
rankTbl_TrDS_RF$cum_rel_non_resp = round(rankTbl_TrDS_RF$cum_non_resp /
sum(rankTbl_TrDS_RF$cnt_tar0),4)*100
rankTbl_TrDS_RF$ks = abs(rankTbl_TrDS_RF$cum_rel_resp -
rankTbl_TrDS_RF$cum_rel_non_resp) #ks
rankTbl_TrDS_RF

## deciles_RF cnt cnt_tar1 cnt_tar0 resp_rate cum_resp cum_non_resp


## 1: (0.248,1] 349 333 16 95.42 333 16
## 2: (0.002,0.248] 313 2 311 0.64 335 327
## 3: [0,0.002] 2826 0 2826 0.00 335 3153
## cum_rel_resp cum_rel_non_resp ks
## 1: 99.4 0.51 98.89
## 2: 100.0 10.37 89.63
## 3: 100.0 100.00 0.00

# auc,ks and gini Computing:

predObj_TrDS_RF = prediction(TrainDS$Probability1_RF, TrainDS$Personal.Loan)


perf_TrDS_RF = performance(predObj_TrDS_RF, "tpr", "fpr")
plot(perf_TrDS_RF) #ROC curve

# ks

ks_TrDS_RF = max(perf_TrDS_RF@y.values[[1]] - perf_TrDS_RF@x.values[[1]]) #ks

auc_TrDS_RF = performance(predObj_TrDS_RF,"auc");
auc_TrDS_RF = as.numeric(auc_TrDS_RF@y.values) #auc
gini_TrDS_RF= ineq(TrainDS$Probability1_RF, type="Gini") #gini
cat("ks=", ks_TrDS_RF,
"auc=", auc_TrDS_RF,
"gini=", gini_TrDS_RF)

## ks= 0.9942912 auc= 0.9998755 gini= 0.9027421

# Concordance and Discordance ratios: computing

Concordance_TrDS_RF=Concordance(actuals=TrainDS$Personal.Loan,
predictedScores=TrainDS$Probability1_RF)
Concordance_TrDS_RF

## $Concordance
## [1] 0.9998731
##
## $Discordance
## [1] 0.0001268633
##
## $Tied
## [1] -2.981556e-17
##
## $Pairs
## [1] 1056255

# Root-Mean Square Error (RMSE*):

RMSE_TrDS_RF=rmse(TrainDS$Personal.Loan, TrainDS$Prediction_RF)
RMSE_TrDS_RF

## [1] 0.06981299

# Mean absolute error (MAE*):


MAE_TrDS_RF=mae(TrainDS$Personal.Loan, TrainDS$Prediction_RF)
MAE_TrDS_RF

## [1] 0.004873853

# Test data Performance on built Model

# Prediction:

TestDS$Prediction_RF=predict(tRndForest, TestDS, type = "class")


TestDS$Probability1_RF=predict(tRndForest, TestDS, type = "prob")[,"1"]

# Confusion Matrix:

tbl_TeDS_RF=table(TestDS$Prediction_RF, TestDS$Personal.Loan)
tbl_TeDS_RF
##
## 0 1
## 0 1347 14
## 1 4 129

# Classification Error Rate:

CeR_TeDS_RF=(tbl_TeDS_RF[1,2]+tbl_TeDS_RF[2,1])/sum(tbl_TeDS_RF)
CeR_TeDS_RF

## [1] 0.01204819

# Accuracy:

Accuracy_TeDS_RF=1-CeR_TeDS_RF
Accuracy_TeDS_RF

## [1] 0.9879518

# True positive rate or sensitivity:

TPR_TeDS_RF=tbl_TeDS_RF[2,2]/(tbl_TeDS_RF[1,2]+tbl_TeDS_RF[2,2])
TPR_TeDS_RF

## [1] 0.9020979

# True negative rate or specificity:

TNR_TeDS_RF=tbl_TeDS_RF[1,1]/(tbl_TeDS_RF[1,1]+tbl_TeDS_RF[2,1])
TNR_TeDS_RF

## [1] 0.9970392

# Creating Decile and Chopping into unique buckets:

probs_TeDS_RF=seq(0,1,length=11)
qs_TeDS_RF=quantile(TestDS$Probability1_RF, probs_TeDS_RF)
qs_TeDS_RF

## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.004 0.205 1.000

# Splitting Deciles

TestDS$deciles_RF=cut(TestDS$Probability1_RF, unique(qs_TeDS_RF),
include.lowest = TRUE, right=TRUE)
table(TestDS$deciles_RF)

##
## [0,0.004] (0.004,0.205] (0.205,1]
## 1210 134 150
# Rank ordering table on RF Test Data

testDT = data.table(TestDS)
rankTbl_TeDS_RF = testDT[, list(
cnt = length(Personal.Loan),
cnt_tar1 = sum(Personal.Loan == 1),
cnt_tar0 = sum(Personal.Loan == 0)),
by=deciles_RF][order(-deciles_RF)]
rankTbl_TeDS_RF$resp_rate = round(rankTbl_TeDS_RF$cnt_tar1 /
rankTbl_TeDS_RF$cnt,4)*100
rankTbl_TeDS_RF$cum_resp = cumsum(rankTbl_TeDS_RF$cnt_tar1)
rankTbl_TeDS_RF$cum_non_resp = cumsum(rankTbl_TeDS_RF$cnt_tar0)
rankTbl_TeDS_RF$cum_rel_resp = round(rankTbl_TeDS_RF$cum_resp /
sum(rankTbl_TeDS_RF$cnt_tar1),4)*100
rankTbl_TeDS_RF$cum_rel_non_resp = round(rankTbl_TeDS_RF$cum_non_resp /
sum(rankTbl_TeDS_RF$cnt_tar0),4)*100
rankTbl_TeDS_RF$ks = abs(rankTbl_TeDS_RF$cum_rel_resp -
rankTbl_TeDS_RF$cum_rel_non_resp) #ks
rankTbl_TeDS_RF

## deciles_RF cnt cnt_tar1 cnt_tar0 resp_rate cum_resp cum_non_resp


## 1: (0.205,1] 150 133 17 88.67 133 17
## 2: (0.004,0.205] 134 10 124 7.46 143 141
## 3: [0,0.004] 1210 0 1210 0.00 143 1351
## cum_rel_resp cum_rel_non_resp ks
## 1: 93.01 1.26 91.75
## 2: 100.00 10.44 89.56
## 3: 100.00 100.00 0.00

# auc, ks and gini Computing:

predObj_TeDS_RF = prediction(TestDS$Probability1_RF, TestDS$Personal.Loan)


perf_TeDS_RF = performance(predObj_TeDS_RF, "tpr", "fpr")
plot(perf_TeDS_RF) #ROC curve
ks_TeDS_RF = max(perf_TeDS_RF@y.values[[1]] - perf_TeDS_RF@x.values[[1]]) #ks
auc_TeDS_RF = performance(predObj_TeDS_RF,"auc")
auc_TeDS_RF = as.numeric(auc_TeDS_RF@y.values) #auc
gini_TeDS_RF = ineq(TestDS$Probability1_RF, type="Gini") #gini
cat("ks_TeDS_RF=", ks_TeDS_RF,
"auc_TeDS_RF=", auc_TeDS_RF,
"gini_TeDS_RF=", gini_TeDS_RF)

## ks_TeDS_RF= 0.9328754 auc_TeDS_RF= 0.9968555 gini_TeDS_RF= 0.9035695

# Concordance and Discordance ratios:

Concordance_TeDS_RF=Concordance(actuals=TestDS$Personal.Loan,
predictedScores=TestDS$Probability1_RF)
Concordance_TeDS_RF

## $Concordance
## [1] 0.9968011
##
## $Discordance
## [1] 0.003198874
##
## $Tied
## [1] -1.864828e-17
##
## $Pairs
## [1] 193193

# Root-Mean Square Error(RMSE*):

RMSE_TeDS_RF=rmse(TestDS$Personal.Loan, TestDS$Prediction_RF)
RMSE_TeDS_RF

## [1] 0.1097643

# Mean absolute error (MAE*):

MAE_TeDS_RF=mae(TestDS$Personal.Loan, TestDS$Prediction_RF)
MAE_TeDS_RF

## [1] 0.01204819

# CART & Random Forest Model Summary

Performance_KPI = c("Classification Error Rate",
"Accuracy",
"TPR",
"TNR",
"ks",
"auc",
"gini",
"Concordance",
"RMSE*",
"MAE*")

Training_CART = c(CER_TrDS,
Acc_TrDS,
TPR_TrDS,
TNR_TrDS,
ks_TrDS,
auc_TrDS,
gini_TrDS,
Concordance_TrDS$Concordance,
RMSE_TrDS,
MAE_TrDS)

Test_CART = c(CeR_TeDS,
Accuracy_TeDS,
TPR_TeDS,
TNR_TeDS,
ks_TeDS,
auc_TeDS,
gini_TeDS,
Concordance_TeDS$Concordance,
RMSE_TeDS,
MAE_TeDS)

Training_RF = c(CeR_TrDS_RF,
Accuracy_TrDS_RF,
TPR_TrDS_RF,
TNR_TrDS_RF,
ks_TrDS_RF,
auc_TrDS_RF,
gini_TrDS_RF,
Concordance_TrDS_RF$Concordance,
RMSE_TrDS_RF,
MAE_TrDS_RF)

Test_RF = c(CeR_TeDS_RF,
Accuracy_TeDS_RF,
TPR_TeDS_RF,
TNR_TeDS_RF,
ks_TeDS_RF,
auc_TeDS_RF,
gini_TeDS_RF,
Concordance_TeDS_RF$Concordance,
RMSE_TeDS_RF,
MAE_TeDS_RF)

y=cbind(Performance_KPI, Training_CART, Test_CART, Training_RF, Test_RF)


library(data.table)
y=data.table(y)
y$Training_CART=as.numeric(y$Training_CART)
y$Test_CART=as.numeric(y$Test_CART)
y$Training_RF=as.numeric(y$Training_RF)
y$Test_RF=as.numeric(y$Test_RF)
print(y)

## Performance_KPI Training_CART Test_CART Training_RF


## 1: Classification Error Rate 0.01347477 0.01472557 0.004873853
## 2: Accuracy 0.98652523 0.98527443 0.995126147
## 3: TPR 0.90746269 0.90209790 0.949253731
## 4: TNR 0.99492547 0.99407846 1.000000000
## 5: ks 0.91085865 0.89617636 0.994291151
## 6: auc 0.97992436 0.96860393 0.999875504
## 7: gini 0.86766141 0.86807045 0.902742132
## 8: Concordance 0.96291615 0.94198030 0.999873137
## 9: RMSE* 0.11608088 0.12134896 0.069812987
## 10: MAE* 0.01347477 0.01472557 0.004873853
## Test_RF
## 1: 0.01204819
## 2: 0.98795181
## 3: 0.90209790
## 4: 0.99703923
## 5: 0.93287541
## 6: 0.99685548
## 7: 0.90356946
## 8: 0.99680113
## 9: 0.10976426
## 10: 0.01204819

Conclusion on model performance


CART and Random Forest Model Summary
All key performance indicators show that the CART model is very good, with strong
performance on both the train and test datasets. The Random Forest performs even better
than CART, as the table above suggests. Both models are very good; as a matter of
preference, the Random Forest would be selected for further business use because of its
better performance over the CART model.
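As a final, purely illustrative sketch of how the selected Random Forest could support campaign targeting, a hypothetical scoring file new_customers (assumed to have the same columns and factor levels as the training data, minus Personal.Loan) could be scored and ranked by predicted purchase probability:

# Hypothetical scoring sketch - 'new_customers' is an assumed data frame,
# not part of this assignment's data
# new_customers$Prob_Loan = predict(tRndForest, new_customers, type = "prob")[, "1"]
# target_list = new_customers[order(-new_customers$Prob_Loan), ]
# head(target_list, 100) # e.g. shortlist the top 100 prospects for the next campaign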
