
Mini Project-Data Mining

The document describes a predictive modeling mini project that aims to classify customers who are most likely to purchase a loan. It outlines steps for exploratory data analysis including importing data, identifying variables, univariate and bivariate analysis, handling missing values and outliers, feature creation, and applying models like linear regression, clustering, CART and random forest. It also discusses evaluating model performance and providing business insights.


Mini Project – Predictive Modeling
By - Stuti Prasad

Table of Contents

1 Project Objective
2 Assumptions
3 Exploratory Data Analysis – Step by step approach
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
3.1.2 Set up working Directory
3.1.3 Import and Read the Dataset
3.2 Variable Identification
3.2.1 Variable Identification – Inferences
3.3 Univariate Analysis
3.4 Bi-Variate Analysis
3.5 Missing Value Identification
3.6 Outlier Identification
3.7 Variable Transformation / Feature Creation
4 Simple Linear Regression
5 Clustering
6 CART
7 Random Forest
8 ROC Curve
9 Conclusion
10 Appendix A – Source Code

1 Project Objective
The objective of this report is to build the best model for classifying the customers who
have a higher probability of purchasing the loan. The report consists of the following:

• Importing the dataset in R
• EDA of the available data
• Understanding of attributes
• Visual inspection of data
• Appropriate clustering of data
• Missing value treatment
• Outlier treatment
• Addition of new variables, if required
• Application of unsupervised learning methods, if required
• Creation of appropriate models and their interpretation
• Checking performance of all models
• Model validation
• Interpretation of the best model
• Business insights

2 Assumptions
• The data has one dependent (response) variable; all other variables are independent (predictor) variables
• Variables that are highly correlated will converge to a common concept or factor

3 Exploratory Data Analysis – Step by Step approach


3.1 Environment Set Up and Data Import
3.1.1 Install Necessary Packages and Invoke Libraries

3.1.2 Set up working Directory

3.1.3 Import and read the dataset


3.2 Variable Identification
3.2.1 Variable identification – inferences
Number of rows and columns:
Number of Rows: 5000
Number of Columns: 14

Column Names & Types:

Data Description:

Variable            Description
ID                  Customer ID
Age                 Customer's age in years
Experience          Years of professional experience
Income              Annual income of the customer ($000)
ZIPCode             Home Address ZIP code.
Family              Family size of the customer
CCAvg               Avg. spending on credit cards per month ($000)
Education           Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage            Value of house mortgage if any. ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by the bank?

The first column is ID, which can be ignored.

$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Age (in years) : num 25 45 39 35 35 37 53 50 35 34 ...
$ Experience (in years): num 1 19 15 9 8 13 27 24 10 9 ...
$ Income (in K/month) : num 49 34 11 100 45 29 72 22 81 180 ...
$ ZIP Code : num 91107 90089 94720 94112 91330 ...
$ Family members : num 4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education : num 1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
$ Personal Loan : num 0 0 0 0 0 0 0 0 0 1 ...
$ Securities Account : num 1 1 0 0 0 0 0 0 0 0 ...
$ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
$ Online : num 0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard : num 0 0 0 0 1 0 0 1 0 0 ...
The str output shows that all the columns in the data set are stored as numeric (the ID column can be ignored).

3.3 Descriptive Analysis

We can make a new data frame excluding ID column for further analysis

Standard Deviation of the data

Age (in years) Experience (in years) Income (in K/month) ZIP Code
11.4631656 11.4679537 46.0337293 2121.8521973
Family members CCAvg Education Mortgage
1.1471604 1.7476590 0.8398691 101.7138021
Personal Loan Securities Account CD Account Online
0.2946207 0.3058093 0.2382503 0.4905893
CreditCard
0.455638
Variance of the data

Age (in years) Experience (in years) Income (in K/month) ZIP Code
1.314042e+02 1.315140e+02 2.119104e+03 4.502257e+06
Family members CCAvg Education Mortgage
1.315977e+00 3.054312e+00 7.053801e-01 1.034570e+04
Personal Loan Securities Account CD Account Online
8.680136e-02 9.351934e-02 5.676319e-02 2.406779e-01
CreditCard
0.207606

Summary of data

Age (in years)   Experience (in years)  Income (in K/month)  ZIP Code        Family members
Min.   :23.00    Min.   :-3.0           Min.   :  8.00       Min.   : 9307   Min.   :1.000
1st Qu.:35.00    1st Qu.:10.0           1st Qu.: 39.00       1st Qu.:91911   1st Qu.:1.000
Median :45.00    Median :20.0           Median : 64.00       Median :93437   Median :2.000
Mean   :45.34    Mean   :20.1           Mean   : 73.77       Mean   :93153   Mean   :2.397
3rd Qu.:55.00    3rd Qu.:30.0           3rd Qu.: 98.00       3rd Qu.:94608   3rd Qu.:3.000
Max.   :67.00    Max.   :43.0           Max.   :224.00       Max.   :96651   Max.   :4.000
                                                                             NA's   :18

CCAvg            Education       Mortgage        Personal Loan   Securities Account
Min.   : 0.000   Min.   :1.000   Min.   :  0.0   Min.   :0.000   Min.   :0.0000
1st Qu.: 0.700   1st Qu.:1.000   1st Qu.:  0.0   1st Qu.:0.000   1st Qu.:0.0000
Median : 1.500   Median :2.000   Median :  0.0   Median :0.000   Median :0.0000
Mean   : 1.938   Mean   :1.881   Mean   : 56.5   Mean   :0.096   Mean   :0.1044
3rd Qu.: 2.500   3rd Qu.:3.000   3rd Qu.:101.0   3rd Qu.:0.000   3rd Qu.:0.0000
Max.   :10.000   Max.   :3.000   Max.   :635.0   Max.   :1.000   Max.   :1.0000

CD Account       Online           CreditCard
Min.   :0.0000   Min.   :0.0000   Min.   :0.000
1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000
Median :0.0000   Median :1.0000   Median :0.000
Mean   :0.0604   Mean   :0.5968   Mean   :0.294
3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.000
Max.   :1.0000   Max.   :1.0000   Max.   :1.000

Visualisation using Histogram and Boxplot

Histogram of all Variables

Boxplot of Variables

3.4 Bi Variate Analysis

Correlation between variables:

3.5 Missing Value Identification

There are 18 missing values, all in Family members. These are handled using the KNN imputation
method; please refer to the R code in the appendix. The imputed dataset is named “Thera_Bank_imputed”.
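
The idea behind KNN imputation can be sketched in base R. The toy values, the column names and the helper knn_impute_column below are hypothetical, and the helper fills each gap with the median of the k nearest complete rows, which is a simplification and not necessarily what the knnImputation function used in the appendix does. It also assumes the predictor columns themselves have no missing values.

```r
# Toy data (hypothetical values) with two missing entries in `family`
toy <- data.frame(
  age    = c(25, 45, 39, 35, 37, 53, 50, 34),
  income = c(49, 34, 11, 100, 29, 72, 22, 180),
  family = c(4, 3, NA, 1, 4, 2, NA, 1)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Simplified k-nearest-neighbour imputation for one column (illustrative helper)
knn_impute_column <- function(df, target, k = 3) {
  # Scale the predictors so that no single column dominates the distance
  predictors <- scale(df[, setdiff(names(df), target), drop = FALSE])
  d <- as.matrix(dist(predictors))   # Euclidean distances between rows
  known <- which(!is.na(df[[target]]))
  for (i in which(is.na(df[[target]]))) {
    # Rows with a known target value, ordered by distance to row i
    nearest <- known[order(d[i, known])][seq_len(min(k, length(known)))]
    df[[target]][i] <- median(df[[target]][nearest])
  }
  df
}

toy_imputed <- knn_impute_column(toy, "family", k = 3)
sum(is.na(toy_imputed))   # no missing values remain
```

Only rows that were missing are changed; the observed values of family are left untouched.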

3.6 Outlier Identification

No Outliers

3.7 Variable Transformation / Feature Creation

The categorical variables coded as 0 and 1 were converted from numeric to factors. To do this,
the numeric and categorical columns were held in two separate datasets and then merged using cbind.
The new dataset was named “Thera_Bank_merged”.
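
A minimal sketch of this conversion on a toy data frame is given below; the values and the object names toy, toy_num, toy_cat and toy_merged are made up for illustration. It uses lapply rather than apply, because apply coerces factor results to a character array.

```r
# Toy data (hypothetical values): one numeric column and two 0/1 columns
toy <- data.frame(
  Income     = c(49, 34, 11, 100),
  Online     = c(0, 1, 1, 0),
  CreditCard = c(0, 0, 1, 0)
)

binary_cols <- c("Online", "CreditCard")

# Split into numeric and categorical parts
toy_num <- toy[, setdiff(names(toy), binary_cols), drop = FALSE]
toy_cat <- toy[, binary_cols, drop = FALSE]

# Convert every 0/1 column to a factor, then merge the two parts again
toy_cat <- data.frame(lapply(toy_cat, as.factor))
toy_merged <- cbind(toy_num, toy_cat)

str(toy_merged)
```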

4 Simple Linear Regression


Call:
lm(formula = `Personal Loan` ~ Education + CCAvg + Mortgage +
`CD Account` + `Family members` + `Income (in K/month)` +
CreditCard + Online + `Securities Account` + `Experience (in years)` +
`Age (in years)` + `ZIP Code`)

Residuals:
Min 1Q Median 3Q Max
-0.79891 -0.13417 -0.02883 0.07250 1.04525
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -3.283e-01  1.604e-01  -2.046   0.0408 *
Education                7.943e-02  4.119e-03  19.284  < 2e-16 ***
CCAvg                    1.197e-02  2.467e-03   4.853 1.26e-06 ***
Mortgage                 6.668e-05  3.301e-05   2.020   0.0434 *
`CD Account`             3.290e-01  1.587e-02  20.737  < 2e-16 ***
`Family members`         3.341e-02  2.903e-03  11.508  < 2e-16 ***
`Income (in K/month)`    2.999e-03  9.716e-05  30.872  < 2e-16 ***
CreditCard              -4.497e-02  7.562e-03  -5.947 2.92e-09 ***
Online                  -2.653e-02  6.804e-03  -3.900 9.77e-05 ***
`Securities Account`    -5.970e-02  1.143e-02  -5.223 1.83e-07 ***
`Experience (in years)`  6.135e-03  2.768e-03   2.216   0.0267 *
`Age (in years)`        -5.647e-03  2.771e-03  -2.038   0.0416 *
`ZIP Code`               1.017e-06  1.545e-06   0.658   0.5103
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2311 on 4969 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.3857, Adjusted R-squared: 0.3842
F-statistic: 260 on 12 and 4969 DF, p-value: < 2.2e-16
The percentage of total variation in Personal Loan (the dependent variable) explained by the
linear regression on all 12 predictors is 38.57% (the multiple R-squared value). All predictors
except ZIP Code are significant at the 5% level.

We can see that there is not much correlation between the variables.
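
As a quick consistency check, the adjusted R-squared and the F-statistic can be recomputed from the figures printed in the regression summary. The sketch below uses only the rounded values shown in that output, so small rounding differences are to be expected; the variable names are chosen for illustration.

```r
# Values read from the summary(lm) output in the report
r_squared <- 0.3857
resid_df  <- 4969                # residual degrees of freedom
p         <- 12                  # number of predictors
n         <- resid_df + p + 1    # observations actually used (5000 minus 18 with NAs)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# F-statistic for the overall significance of the regression
f_statistic <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

round(adj_r_squared, 4)   # agrees with the reported 0.3842
round(f_statistic)        # agrees with the reported 260
```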

5 Clustering

K-means clustering with 3 clusters of sizes 1573, 1564, 1863

Cluster means:
Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members CCAvg
1 45.42912 20.20280 73.11570 93590.01 2.391122 1.922479
2 44.92775 19.68926 72.77749 95083.30 2.440832 1.924041
3 45.60655 20.37037 75.16694 91162.19 2.363539 1.962657
Education Mortgage Securities.Account CD.Account Online CreditCard
1 1.907184 55.84870 0.1055308 0.06293706 0.5785124 0.3121424
2 1.860614 56.93159 0.1054987 0.06393862 0.6246803 0.2985934
3 1.876006 56.68438

Within cluster sum of squares by cluster:


[1] 674533593 371103612 8013120178
(between_SS / total_SS = 59.9 %)

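The data was divided into 3 clusters of sizes 1573, 1564 and 1863. The table above gives the
cluster means of the variables. The clusters differ mainly in their mean ZIP Code, while the means
of the other variables are almost identical across clusters. This is to be expected, because the
variables were not scaled and ZIP Code has by far the largest variance.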
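
One common way to stop a single large-valued variable from dominating k-means is to standardise the columns first. The sketch below is not part of the original analysis; it uses randomly generated toy data whose means and standard deviations only loosely imitate three of the report's variables, and the nstart value is an arbitrary choice.

```r
set.seed(42)

# Toy data (randomly generated) on very different scales
toy <- data.frame(
  income = rnorm(300, mean = 74, sd = 46),
  ccavg  = rnorm(300, mean = 1.9, sd = 1.7),
  zip    = rnorm(300, mean = 93000, sd = 2100)
)

# Raw variances: zip is several orders of magnitude larger than the others
raw_variances <- sapply(toy, var)

# Standardise every column to mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# k-means on the scaled data, with several random starts for stability
km_scaled <- kmeans(toy_scaled, centers = 3, nstart = 10)

km_scaled$size      # cluster sizes
km_scaled$centers   # cluster means, in standardised units
```

Because the cluster centres are then in standardised units, they have to be converted back (or the original columns averaged by cluster) before being interpreted.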

6 Building CART Model

The data is split so that 70% is used as train data and 30% as test data.

The table function shows that 4520 customers did not respond and 480 responded,
out of a total of 5000 observations.

table(Thera_Bank1$`Personal Loan`)

0 1
4520 480

The train dataset has 3500 observations, of which 3164 are
non-respondents and 336 are respondents.

table(train$`Personal.Loan`)

0 1
3164 336

The test dataset has 1500 observations, of which 1356 are
non-respondents and 144 are respondents.

table(test$`Personal.Loan`)

0 1
1356 144
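
The class counts above keep the 4520 : 480 ratio of the full data in both parts, which is what a stratified split produces. A base R sketch of such a split follows; the helper stratified_split is hypothetical and is not the caTools function used in the appendix, so it will not select the same rows, only the same counts per class.

```r
# Illustrative helper: sample the given ratio of rows separately within each class
stratified_split <- function(y, ratio = 0.7) {
  train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
    idx[sample.int(length(idx), size = round(ratio * length(idx)))]
  }))
  sort(unname(train_idx))
}

set.seed(3000)

# Response vector with the class counts reported for the full data
y <- rep(c(0, 1), times = c(4520, 480))

train_idx <- stratified_split(y, ratio = 0.7)

table(y[train_idx])    # class counts in the train part
table(y[-train_idx])   # class counts in the test part
```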

The variables CreditCard, Securities Account, CD Account, Education and Online were
used to build the decision tree.

m1= rpart(formula = `Personal.Loan` ~`CreditCard`+`Online`+`Securities.Account`+`CD.Account`+ Education, data=train,method="class",control=r.ctrl)
> m1
n= 3500

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 3500 336 0 (0.90400000 0.09600000)
   2) CD.Account=0 3297 239 0 (0.92750986 0.07249014) *
   3) CD.Account=1 203 97 0 (0.52216749 0.47783251)
     6) CreditCard=1 160 62 0 (0.61250000 0.38750000)
      12) Securities.Account=1 65 10 0 (0.84615385 0.15384615)
        24) Online=1 58 6 0 (0.89655172 0.10344828) *
        25) Online=0 7 3 1 (0.42857143 0.57142857)
          50) Education>=2.5 4 1 0 (0.75000000 0.25000000) *
          51) Education< 2.5 3 0 1 (0.00000000 1.00000000) *
      13) Securities.Account=0 95 43 1 (0.45263158 0.54736842)
        26) Education< 1.5 36 11 0 (0.69444444 0.30555556) *
        27) Education>=1.5 59 18 1 (0.30508475 0.69491525) *
     7) CreditCard=0 43 8 1 (0.18604651 0.81395349) *

The output and tree above show that CD Account is the variable used for the first split, at the root node.

On Checking the Head of prediction of the model following data is provided:

Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members
2 45 19 34 90089 3
5 35 8 45 91330 4
6 37 13 29 92121 4
8 50 24 22 93943 1
9 35 10 81 90089 3
10 34 9 180 93023 1
CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online
2 1.5 1 0 0 1 0 0
5 1.0 2 0 0 0 0 0
6 0.4 2 155 0 0 0 1
8 0.3 3 0 0 0 0 0
9 0.6 2 104 0 0 0 1
10 8.9 3 0 1 0 0 0
CreditCard predict.score.0 predict.score.1 predict.class
2 0 0.92750986 0.07249014 0
5 1 0.92750986 0.07249014 0
6 0 0.92750986 0.07249014 0
8 1 0.92750986 0.07249014 0
9 0 0.92750986 0.07249014 0
10 0 0.92750986 0.07249014 0

confusion.matrix.train
predict.class
Personal.Loan 0 1
0 3138 26
1 257 79
> Accuracy.train
[1] 0.9191429

The confusion matrix shows an accuracy of about 91.9% on the train data.

Following is the data for Test

Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members
2 45 19 34 90089 3
5 35 8 45 91330 4
6 37 13 29 92121 4
8 50 24 22 93943 1
9 35 10 81 90089 3
10 34 9 180 93023 1
CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online
2 1.5 1 0 0 1 0 0
5 1.0 2 0 0 0 0 0
6 0.4 2 155 0 0 0 1
8 0.3 3 0 0 0 0 0
9 0.6 2 104 0 0 0 1
10 8.9 3 0 1 0 0 0
CreditCard predict.score.0 predict.score.1 predict.class
2 0 0.92750986 0.07249014 0
5 1 0.92750986 0.07249014 0
6 0 0.92750986 0.07249014 0
8 1 0.92750986 0.07249014 0
9 0 0.92750986 0.07249014 0
10 0 0.92750986 0.07249014 0

confusion.matrix.test
predict.class
Personal Loan 0 1
0 1324 32
1 105 39
> Accuracy.test
[1] 0.9086667

On the test data the accuracy is about 90.9%.
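
Accuracy alone can hide how well the responders are identified, since about 90% of customers are non-respondents. The sketch below recomputes accuracy, recall and precision from the two CART confusion matrices reported above, together with the accuracy of always predicting class 0. The helper classification_metrics is introduced here for illustration only.

```r
# Illustrative helper: metrics from a 2 x 2 confusion matrix
# (rows = actual class 0/1, columns = predicted class 0/1)
classification_metrics <- function(cm) {
  tn <- cm[1, 1]; fp <- cm[1, 2]
  fn <- cm[2, 1]; tp <- cm[2, 2]
  c(
    accuracy  = (tp + tn) / sum(cm),
    recall    = tp / (tp + fn),       # share of actual responders found
    precision = tp / (tp + fp),       # share of predicted responders that are correct
    baseline  = (tn + fp) / sum(cm)   # accuracy of always predicting class 0
  )
}

labels <- list(actual = c("0", "1"), predicted = c("0", "1"))

# Confusion matrices as printed in the report
cm_train <- matrix(c(3138, 26, 257, 79), nrow = 2, byrow = TRUE, dimnames = labels)
cm_test  <- matrix(c(1324, 32, 105, 39), nrow = 2, byrow = TRUE, dimnames = labels)

metrics <- rbind(
  train = classification_metrics(cm_train),
  test  = classification_metrics(cm_test)
)
round(metrics, 4)
```

On these figures the recall is 79/336 on the train data and 39/144 on the test data, so most responders are missed even though accuracy is around 91%.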

7 Random Forest

Call:
randomForest(formula = Personal.Loan ~ CreditCard, data = trainR, keep.forest = TRUE, ntree = 30)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 1

OOB estimate of error rate: 9.68%


Confusion matrix:
0 1 class.error
0 3613 0 0
1 387 0 1

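The OOB error rate is 9.68%. Note, however, that the confusion matrix shows this model, which uses
CreditCard as its only predictor, classifying every customer as a non-respondent, so the error rate
simply equals the share of respondents in the train data (387 of 4000).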

8 ROC Curve

For the train data

For the test data

9 Conclusion

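After checking model accuracy, we can conclude that the CART model can be used, as it has the
highest prediction accuracy (about 90.9% on the test data, compared with an OOB accuracy of
about 90.3% for the random forest).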

10 Appendix A – Source Code

# Set up working directory
setwd("C:/Users/YOGA 520 M8IN/Desktop/work/R program")

# Load necessary packages
# (install any missing package first, e.g. install.packages("Hmisc"))
library(readr)
library(readxl)
library(caTools)
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
library(data.table)
library(ROCR)
library(ineq)
library(gplots)
library(InformationValue)
library(Hmisc)
library(lattice)
library(ggplot2)
library(plyr)
library(psych)
library(dplyr)
library(tidyverse)
library(car)
# install.packages("carData")  # run once if carData is not yet installed

# Import data set
Thera_Bank = read_excel("Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx")
names(Thera_Bank)       # Display column names
dim(Thera_Bank)         # Show dimensions of dataset
head(Thera_Bank, 10)    # Display first 10 rows of dataset
sum(is.na(Thera_Bank))  # Total number of missing values

# Remove the ID column; this must happen before Thera_Bank1 is inspected
Thera_Bank1 = Thera_Bank[, -1]
str(Thera_Bank1)
summary(Thera_Bank1)
describe(Thera_Bank1)

sapply(Thera_Bank1, sd, na.rm = TRUE)   # To check the Standard Deviation
sapply(Thera_Bank1, var, na.rm = TRUE)  # To check the Variance

par(mfrow = c(2, 2))  # Split plotting space to show 4 graphs
# Histogram matrix for all variables (ID excluded)
multi.hist(Thera_Bank1[, 1:13], density = FALSE, labels = T, col = "pink", cex.main = 2)
par(mfrow = c(1, 1))  # Reset plotting space

attach(Thera_Bank1)
# Boxplot for all variables to find outliers
boxplot(Thera_Bank1, las = 2, col = "Cyan", cex.axis = 1.5)

# Variable names (excluding ID)
title = c("Age (in years)", "Experience (in years)", "Income (in K/month)", "ZIP Code",
          "Family members", "CCAvg", "Education", "Mortgage",
          "Securities Account", "CD Account",
          "Online", "CreditCard", "Personal Loan")

# Linear regression of Personal Loan on all other variables
# (the column names are found through attach(Thera_Bank1) above)
slm = lm(`Personal Loan` ~ Education + CCAvg + Mortgage + `CD Account` + `Family members` +
           `Income (in K/month)` + CreditCard + Online + `Securities Account` +
           `Experience (in years)` + `Age (in years)` + `ZIP Code`)
summary(slm)
vif(slm)  # Variance inflation factors

# Correlation matrix and correlation plot
corbank = cor(Thera_Bank1[, 1:13])
print(corbank)
corrplot::corrplot(cor(Thera_Bank1[, 1:13]), method = "number", type = "lower")

# Simple linear regression on income only
lmmodel1 = lm(Thera_Bank1$`Personal Loan` ~ Thera_Bank1$`Income (in K/month)`)
summary(lmmodel1)

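# Clustering
# Note: Thera_Bank_imputed is created in the KNN Imputation section below; run that section first.

# Hierarchical clustering
dist1 = dist(x = Thera_Bank_imputed, method = "maximum")
# dist1  # printing the full distance matrix is impractical for 5000 rows
cluster = hclust(dist1, method = "complete")
cluster
plot(cluster, labels = as.character(Thera_Bank_imputed[, 1]))
rect.hclust(cluster, k = 3, border = "red")

groups = cutree(cluster, k = 3)  # Cut the dendrogram into 3 groups
data1 = cbind(Thera_Bank1, groups)
group1 = subset(data1, groups == 1)
group1
group2 = subset(data1, groups == 2)
group2
group3 = subset(data1, groups == 3)
group3
# cutree(k = 3) yields only groups 1 to 3, so subsets for groups 4 and 5 would be empty.

# K-means clustering
C = Thera_Bank_imputed[, -c(9)]  # Drop column 9 (Personal Loan)
head(C)
kmeans.cluster = kmeans(C, 3)
kmeans.cluster
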
# KNN Imputation
library(DMwR)  # Provides knnImputation()

Thera_Bank_num = Thera_Bank1[, c(1:8)]   # Numeric columns
Thera_Bank_cat = Thera_Bank1[, -c(1:8)]  # Categorical columns
str(Thera_Bank_num)
str(Thera_Bank_cat)

# Convert each 0/1 column to a factor
# (lapply keeps the factor type; apply() would return a character matrix)
Thera_Bank_cat = data.frame(lapply(Thera_Bank_cat, as.factor))
Thera_Bank_merged = cbind(Thera_Bank_num, Thera_Bank_cat)
sum(is.na(Thera_Bank_merged))

Thera_Bank_imputed = knnImputation(data = Thera_Bank_merged, k = 5)
sum(is.na(Thera_Bank_imputed))

## CART
table(Thera_Bank1$`Personal Loan`)

set.seed(3000)
attach(Thera_Bank_imputed)

# Split the data so that 70% is train data and 30% is test data
sample = sample.split(Thera_Bank_imputed$Personal.Loan, SplitRatio = 0.7)
sample

train = subset(Thera_Bank_imputed, sample == TRUE)
test = subset(Thera_Bank_imputed, sample == FALSE)
table(train$`Personal.Loan`)
table(test$`Personal.Loan`)

# Setting control parameters
# An earlier, more restrictive setting, which the next line overrides:
# r.ctrl = rpart.control(minsplit = 1000, minbucket = 100, cp = 0, xval = 10)
r.ctrl = rpart.control(minsplit = 1, minbucket = 1, cp = 0.001)

# Building the CART model
m1 = rpart(formula = `Personal.Loan` ~ `CreditCard` + `Online` + `Securities.Account` +
             `CD.Account` + Education,
           data = train, method = "class", control = r.ctrl)
m1

# Displaying Decision Tree
fancyRpartPlot(m1)

# Scoring the train data
train$predict.score = predict(m1, train)                  # Class probabilities
train$predict.class = predict(m1, train, type = "class")  # Predicted class
head(train)

# Checking model accuracy
confusion.matrix.train = with(train, table(`Personal.Loan`, predict.class))
confusion.matrix.train
Accuracy.train = sum(diag(confusion.matrix.train)) / sum(confusion.matrix.train)
Accuracy.train

# Test
test$predict.score = predict(m1, test)
test$predict.class = predict(m1, test, type = "class")
head(test)

# Checking model accuracy (column name matches the imputed data: Personal.Loan)
confusion.matrix.test = with(test, table(`Personal.Loan`, predict.class))
confusion.matrix.test
Accuracy.test = sum(diag(confusion.matrix.test)) / sum(confusion.matrix.test)
Accuracy.test

# Train/test split for the random forest
rows = seq(from = 1, to = nrow(Thera_Bank_imputed), by = 1)
set.seed(1)
trainrows = sample(x = rows, size = nrow(Thera_Bank_imputed) * 0.8)  # 80% rows
trainR = Thera_Bank_imputed[trainrows, ]
testR = Thera_Bank_imputed[-trainrows, ]

library(DMwR)

# Class proportions in the full, train and test data
prop.table(table(Thera_Bank_imputed$Personal.Loan))
prop.table(table(trainR$Personal.Loan))
prop.table(table(testR$Personal.Loan))

## Random Forest
library(randomForest)

# First model: five predictors
Therabank_rf = randomForest(Personal.Loan ~ Education + CreditCard + Online +
                              Securities.Account + CD.Account,
                            data = trainR, keep.forest = TRUE, ntree = 30)

# Second model: CreditCard only. This overwrites the first model and is the one
# whose output is shown in Section 7.
Therabank_rf = randomForest(Personal.Loan ~ CreditCard,
                            data = trainR, keep.forest = TRUE, ntree = 30)
print(Therabank_rf)
print(Therabank_rf$err.rate)
plot(Therabank_rf)
Therabank_rf$predicted
Therabank_rf$importance
varImpPlot(Therabank_rf)
View(trainR)

# Predictions on the train data (column 9, Personal.Loan, is dropped)
pred_model_train = predict(Therabank_rf, trainR[, -c(9)], type = "class")
trainR$Prediction = pred_model_train
names(trainR)
result_train = table("actual value" = trainR$Personal.Loan, trainR$Prediction)
result_train

# Predictions on the test data, with accuracy, recall and precision
pred_model_test = predict(Therabank_rf, testR[, -c(9)], type = "class")
result_test = table("actual values" = testR$Personal.Loan, pred_model_test); result_test
test_accuracy = sum(diag(result_test)) / sum(result_test); test_accuracy
test_recall = (result_test[2, 2]) / (result_test[2, 2] + result_test[2, 1]); test_recall
test_precision = (result_test[2, 2]) / (result_test[2, 2] + result_test[1, 2]); test_precision

# Building the ROC curve and lift charts (train data)
library(ROCR)
pred = prediction(train$predict.score[, 2], train$`Personal.Loan`)
perf = performance(pred, "tpr", "fpr")
plot(perf, main = "ROC curve")

# install.packages("ineq")
library(ineq)
# Gini coefficient of the predicted probability of class 1
gini = ineq(train$predict.score[, 2], type = "Gini")
gini

# Checking the classification error
# Confusion Matrix
with(train, table(`Personal.Loan`, predict.class))
nrow(train)

# AUC and KS (computed here; the original script printed them without defining them)
auc = performance(pred, "auc")@y.values[[1]]
KS = max(perf@y.values[[1]] - perf@x.values[[1]])
auc
KS

# Concordance/Discordance
library(InformationValue)
Concordance(actuals = train$`Personal.Loan`, predictedScores = train$predict.score[, 2])

# Scoring test sample and validating the same
test$predict.class <- predict(m1, test, type = "class")
test$predict.score <- predict(m1, test)
head(test)

library(ROCR)
pred = prediction(test$predict.score[, 2], test$`Personal.Loan`)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main = "ROC curve")

# install.packages("ineq")
library(ineq)
# Gini coefficient of the predicted probability of class 1
gini = ineq(test$predict.score[, 2], type = "Gini")
gini

# Checking the classification error
# Confusion Matrix
with(test, table(`Personal.Loan`, predict.class))
nrow(test)

# AUC and KS (computed here; the original script printed them without defining them)
auc = performance(pred, "auc")@y.values[[1]]
KS = max(perf@y.values[[1]] - perf@x.values[[1]])
auc
KS

Concordance(actuals = test$`Personal.Loan`, predictedScores = test$predict.score[, 2])

plot(pressure)  # Plots R's built-in pressure dataset; unrelated to the loan data

The End
