Mini Project-Data Mining
Modeling
By - Stuti Prasad
Table of Contents
1 Project Objective
2 Assumptions
5 Clustering
6 CART
7 Random Forest
8 ROC Curve
9 Conclusion
10 Appendix Source code
1 Project Objective
The objective of this report is to build the model that best identifies customers with a higher probability of purchasing a personal loan. The report covers the data exploration, clustering, CART and random forest steps listed in the table of contents.
2 Assumptions
The data has one dependent (response) variable, Personal Loan; the remaining variables are treated as predictors.
Variables that are highly correlated are expected to converge to a common concept or factor.
Number of Columns: 14
Data Description:
ID: Customer ID
Age: Customer's age in years
Experience: Years of professional experience
Income: Annual income of the customer ($000)
ZIP Code: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month ($000)
Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage: Value of house mortgage, if any ($000)
Personal Loan: Did this customer accept the personal loan offered in the last campaign?
Securities Account: Does the customer have a securities account with the bank?
CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
Online: Does the customer use internet banking facilities?
CreditCard: Does the customer use a credit card issued by the bank?
The first column is ID, which can be ignored; the structure (str) of the data is shown below.
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Age (in years) : num 25 45 39 35 35 37 53 50 35 34 ...
$ Experience (in years): num 1 19 15 9 8 13 27 24 10 9 ...
$ Income (in K/month) : num 49 34 11 100 45 29 72 22 81 180 ...
$ ZIP Code : num 91107 90089 94720 94112 91330 ...
$ Family members : num 4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education : num 1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
$ Personal Loan : num 0 0 0 0 0 0 0 0 0 1 ...
$ Securities Account : num 1 1 0 0 0 0 0 0 0 0 ...
$ CD Account : num 0 0 0 0 0 0 0 0 0 0 ...
$ Online : num 0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard : num 0 0 0 0 1 0 0 1 0 0 ...
The str() output indicates that all the columns in the data set are numeric (the ID column can be ignored).
We can create a new data frame that excludes the ID column for further analysis.
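A minimal sketch of this step, assuming the raw data is loaded as a data frame named Thera_Bank and the trimmed copy is named Thera_Bank1 (the names follow the appendix code):
Thera_Bank1 = Thera_Bank[ , -1]   # drop the first (ID) column
str(Thera_Bank1)                  # the remaining 13 columns are all numeric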
Standard deviation of the data
Age (in years) Experience (in years) Income (in K/month) ZIP Code
11.4631656 11.4679537 46.0337293 2121.8521973
Family members CCAvg Education Mortgage
1.1471604 1.7476590 0.8398691 101.7138021
Personal Loan Securities Account CD Account Online
0.2946207 0.3058093 0.2382503 0.4905893
CreditCard
0.455638
Variance of the data
Age (in years) Experience (in years) Income (in K/month) ZIP Code
1.314042e+02 1.315140e+02 2.119104e+03 4.502257e+06
Family members CCAvg Education Mortgage
1.315977e+00 3.054312e+00 7.053801e-01 1.034570e+04
Personal Loan Securities Account CD Account Online
8.680136e-02 9.351934e-02 5.676319e-02 2.406779e-01
CreditCard
0.207606
Summary of data
Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members
Min. :23.00 Min. :-3.0 Min. : 8.00 Min. : 9307 Min. :1.000
1st Qu.:35.00 1st Qu.:10.0 1st Qu.: 39.00 1st Qu.:91911 1st Qu.:1.000
Median :45.00 Median :20.0 Median : 64.00 Median :93437 Median :2.000
Mean :45.34 Mean :20.1 Mean : 73.77 Mean :93153 Mean :2.397
3rd Qu.:55.00 3rd Qu.:30.0 3rd Qu.: 98.00 3rd Qu.:94608 3rd Qu.:3.000
Max. :67.00 Max. :43.0 Max. :224.00 Max. :96651 Max. :4.000
NA's :18
CCAvg Education Mortgage Personal Loan Securities Account
Min. : 0.000 Min. :1.000 Min. : 0.0 Min. :0.000 Min. :0.0000
1st Qu.: 0.700 1st Qu.:1.000 1st Qu.: 0.0 1st Qu.:0.000 1st Qu.:0.0000
Median : 1.500 Median :2.000 Median : 0.0 Median :0.000 Median :0.0000
Mean : 1.938 Mean :1.881 Mean : 56.5 Mean :0.096 Mean :0.1044
3rd Qu.: 2.500 3rd Qu.:3.000 3rd Qu.:101.0 3rd Qu.:0.000 3rd Qu.:0.0000
Max. :10.000 Max. :3.000 Max. :635.0 Max. :1.000 Max. :1.0000
Boxplot of Variables
3.4 Missing Value Identification:
There are 18 missing values in Family members. These are handled with the KNN imputation method; please refer to the R code in the appendix. The imputed data set is named "Thera_Bank_imputed".
No outliers.
The categorical variables coded as 0 and 1 were converted from numeric to factors, so two data sets (numeric and categorical) were created and then merged using cbind. The merged data set is named "Thera_Bank_merged".
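A minimal sketch of this step, assuming Thera_Bank_cat holds the 0/1 coded columns and Thera_Bank_num the remaining numeric columns (the exact column split is not shown in the original):
library(DMwR)                                                       # provides knnImputation()
Thera_Bank_cat     = data.frame(lapply(Thera_Bank_cat, as.factor))  # 0/1 columns to factors
Thera_Bank_merged  = cbind(Thera_Bank_num, Thera_Bank_cat)          # recombine into one data frame
Thera_Bank_imputed = knnImputation(Thera_Bank_merged, k = 5)        # fills the 18 missing Family values
sum(is.na(Thera_Bank_imputed))                                      # 0 after imputation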
A linear regression model was fitted with Personal Loan as the dependent variable; its summary output is shown below.
Residuals:
Min 1Q Median 3Q Max
-0.79891 -0.13417 -0.02883 0.07250 1.04525
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.283e-01 1.604e-01 -2.046 0.0408 *
Education 7.943e-02 4.119e-03 19.284 < 2e-16 ***
CCAvg 1.197e-02 2.467e-03 4.853 1.26e-06 ***
Mortgage 6.668e-05 3.301e-05 2.020 0.0434 *
`CD Account` 3.290e-01 1.587e-02 20.737 < 2e-16 ***
`Family members` 3.341e-02 2.903e-03 11.508 < 2e-16 ***
`Income (in K/month)` 2.999e-03 9.716e-05 30.872 < 2e-16 ***
CreditCard -4.497e-02 7.562e-03 -5.947 2.92e-09 ***
Online -2.653e-02 6.804e-03 -3.900 9.77e-05 ***
`Securities Account` -5.970e-02 1.143e-02 -5.223 1.83e-07 ***
`Experience (in years)` 6.135e-03 2.768e-03 2.216 0.0267 *
`Age (in years)` -5.647e-03 2.771e-03 -2.038 0.0416 *
`ZIP Code` 1.017e-06 1.545e-06 0.658 0.5103
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2311 on 4969 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.3857, Adjusted R-squared: 0.3842
F-statistic: 260 on 12 and 4969 DF, p-value: < 2.2e-16
The percentage of total variation in Personal Loan (the dependent variable) explained by the model is about 38.6% (Multiple R-squared = 0.3857).
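A minimal sketch of the regression behind this summary, assuming the model object is named slm as in the appendix (the lm() call itself is not shown in the original):
# regress Personal Loan on all other variables; the 18 rows with missing
# Family members are dropped by lm(), matching the summary above
slm = lm(`Personal Loan` ~ ., data = Thera_Bank1)
summary(slm)
library(car)
vif(slm)   # variance inflation factors to check multicollinearity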
5 Clustering
Cluster means:
Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members CCAvg
1 45.42912 20.20280 73.11570 93590.01 2.391122 1.922479
2 44.92775 19.68926 72.77749 95083.30 2.440832 1.924041
3 45.60655 20.37037 75.16694 91162.19 2.363539 1.962657
Education Mortgage Securities.Account CD.Account Online CreditCard
1 1.907184 55.84870 0.1055308 0.06293706 0.5785124 0.3121424
2 1.860614 56.93159 0.1054987 0.06393862 0.6246803 0.2985934
3 1.876006 56.68438
The data was divided into three clusters of sizes 1573, 1564 and 1863. The table above gives the cluster means of the variables.
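A minimal sketch of the k-means step, assuming the response column Personal Loan (column 9 of the imputed data) is dropped first and the remaining columns are treated as numeric, as in the appendix; the seed is an assumption and is not set in the original:
C = Thera_Bank_imputed[ , -9]            # exclude Personal Loan before clustering
set.seed(123)                            # assumed seed, for reproducibility only
kmeans.cluster = kmeans(C, centers = 3)  # three clusters, as reported above
kmeans.cluster$size                      # cluster sizes (reported as 1573, 1564 and 1863)
kmeans.cluster$centers                   # cluster means shown in the table above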
6 CART
We split the data so that 70% is the train data and 30% is the test data.
The table() function shows that there are 4520 customers who did not respond and 480 who responded, out of the total 5000 observations.
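A minimal sketch of the split, using sample.split() from the caTools package loaded in the appendix (the exact splitting call is not shown in the original):
set.seed(3000)   # seed taken from the appendix
split = sample.split(Thera_Bank_imputed$Personal.Loan, SplitRatio = 0.70)
train = subset(Thera_Bank_imputed, split == TRUE)    # 3500 observations
test  = subset(Thera_Bank_imputed, split == FALSE)   # 1500 observations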
table(Thera_Bank1$`Personal Loan`)
0 1
4520 480
The train dataset has 3500 observations, of which 3164 are non-respondents and 336 are respondents.
table(train$`Personal.Loan`)
0 1
3164 336
The test dataset has 1500 observations, of which 1356 are non-respondents and 144 are respondents.
table(test$`Personal.Loan`)
0 1
1356 144
The variables CreditCard, Securities Account, CD Account, Education and Online were considered when building the decision tree.
The data and the tree plot show that CD Account is the variable at the root node.
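A minimal sketch of the tree fit, using the variables named above (the exact rpart() call is not shown in the original):
# classification tree on the train data; method = "class" for the 0/1 target
m1 = rpart(Personal.Loan ~ CD.Account + Education + CreditCard + Online + Securities.Account,
           data = train, method = "class")
fancyRpartPlot(m1)   # rattle plot of the tree; CD Account appears at the root node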
Predictions appended to the train data (head(train) output):
Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members
2 45 19 34 90089 3
5 35 8 45 91330 4
6 37 13 29 92121 4
8 50 24 22 93943 1
9 35 10 81 90089 3
10 34 9 180 93023 1
CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online
2 1.5 1 0 0 1 0 0
5 1.0 2 0 0 0 0 0
6 0.4 2 155 0 0 0 1
8 0.3 3 0 0 0 0 0
9 0.6 2 104 0 0 0 1
10 8.9 3 0 1 0 0 0
CreditCard predict.score.0 predict.score.1 predict.class
2 0 0.92750986 0.07249014 0
5 1 0.92750986 0.07249014 0
6 0 0.92750986 0.07249014 0
8 1 0.92750986 0.07249014 0
9 0 0.92750986 0.07249014 0
10 0 0.92750986 0.07249014 0
confusion.matrix.train
predict.class
Personal.Loan 0 1
0 3138 26
1 257 79
> Accuracy.train
[1] 0.9191429
The confusion matrix shows that the model has about 91.9% accuracy on the train data.
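A minimal sketch of how these numbers are produced (this mirrors the appendix code):
train$predict.score = predict(m1, train)                    # class probabilities
train$predict.class = predict(m1, train, type = "class")    # predicted 0/1 label
confusion.matrix.train = with(train, table(Personal.Loan, predict.class))
Accuracy.train = sum(diag(confusion.matrix.train)) / sum(confusion.matrix.train)
Accuracy.train   # about 0.919 on the train data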
Predictions appended to the test data (head(test) output):
Age (in years) Experience (in years) Income (in K/month) ZIP Code Family members
2 45 19 34 90089 3
5 35 8 45 91330 4
6 37 13 29 92121 4
8 50 24 22 93943 1
9 35 10 81 90089 3
10 34 9 180 93023 1
CCAvg Education Mortgage Personal.Loan Securities.Account CD.Account Online
2 1.5 1 0 0 1 0 0
5 1.0 2 0 0 0 0 0
6 0.4 2 155 0 0 0 1
8 0.3 3 0 0 0 0 0
9 0.6 2 104 0 0 0 1
10 8.9 3 0 1 0 0 0
CreditCard predict.score.0 predict.score.1 predict.class
2 0 0.92750986 0.07249014 0
5 1 0.92750986 0.07249014 0
6 0 0.92750986 0.07249014 0
8 1 0.92750986 0.07249014 0
9 0 0.92750986 0.07249014 0
10 0 0.92750986 0.07249014 0
confusion.matrix.test
predict.class
Personal Loan 0 1
0 1324 32
1 105 39
> Accuracy.test
[1] 0.9086667
The confusion matrix shows that the model has about 90.9% accuracy on the test data.
7 Random Forest
Call:
randomForest(formula = Personal.Loan ~ CreditCard, data = trainR, keep.forest = TRUE, ntree = 30)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 1
The OOB error rate is 9.68%, which is fairly decent.
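A minimal sketch of reading the OOB error from the fitted forest (Therabank_rf is the object name used in the appendix):
print(Therabank_rf)                        # the printout includes the OOB estimate of error rate
tail(Therabank_rf$err.rate, 1)[, "OOB"]    # OOB error after the last (30th) tree
varImpPlot(Therabank_rf)                   # variable importance plot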
8 ROC Curve
ROC curve for the test data:
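A minimal sketch of the ROC construction with ROCR, using the class-1 probabilities from the CART model; the object names pred.test, perf.test and auc.test are illustrative:
library(ROCR)
pred.test = prediction(test$predict.score[, 2], test$Personal.Loan)
perf.test = performance(pred.test, "tpr", "fpr")
plot(perf.test, main = "ROC curve - test data")
auc.test  = as.numeric(performance(pred.test, "auc")@y.values)   # area under the curve
auc.test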
9 Conclusion
After checking the accuracy of each model, we can conclude that the CART model should be used, as it has the highest prediction accuracy.
10 Appendix Source code
library(readr)
library(readxl)
library(caTools)
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
library(data.table)
library(ROCR)
library(ineq)
library(gplots)
library(InformationValue)
install.packages("Hmisc")
library(Hmisc)
library(lattice)
library(ggplot2)
library(plyr)
library(psych)
library(dplyr)
library(tidyverse)
library(car)
install.packages("carData")
# the data-loading step (a read_excel()/read.csv() call on the Thera Bank file) is omitted here;
# Thera_Bank is the raw data and Thera_Bank1 drops the ID column, as described in the report body
Thera_Bank1 = Thera_Bank[ , -1]
str(Thera_Bank1)
summary(Thera_Bank1)
sum(is.na(Thera_Bank))
describe(Thera_Bank1)
attach(Thera_Bank1)
# titles used for the boxplots; the vector is completed to cover all 13 columns (the original is truncated)
title = c("Age (in years)", "Experience (in years)", "Income (in K/month)", "ZIP Code",
          "Family members", "CCAvg", "Education", "Mortgage", "Personal Loan",
          "Securities Account", "CD Account", "Online", "CreditCard")
# slm: linear model with Personal Loan as the dependent variable; the lm() call is omitted in the
# original script and is reconstructed here from the regression summary shown in the report
slm = lm(`Personal Loan` ~ ., data = Thera_Bank1)
summary(slm)
vif(slm)
corbank=cor(Thera_Bank1[,1:13])
print(corbank)
summary(lmmodel1)   # lmmodel1 is not defined above; it appears to be the same linear model
#Clustering
dist1=dist(x=Thera_Bank_imputed,method="maximum")
dist1
cluster=hclust(dist1,method="complete")
cluster
plot(cluster,labels = as.character(Thera_Bank_imputed[,1]))
rect.hclust(cluster,k=3,border="red")
groups=cutree(cluster,k=3)
data1=cbind(Thera_Bank1,groups)
group1=subset(data1,groups==1)
group1
group2=subset(data1,groups==2)
group2
group3=subset(data1,groups==3)
group3
group4=subset(data1,groups==4)   # empty: cutree() was run with k = 3, so no group 4 exists
group4
group5=subset(data1,groups==5)   # empty: no group 5 exists with k = 3
group5
C=Thera_Bank_imputed[,-c(9)]   # drop column 9 (the Personal Loan response) before k-means
head(C)
kmeans.cluster=kmeans(C,3)     # k-means with 3 clusters
kmeans.cluster
# KNN Imputation
# (Thera_Bank_num holds the numeric columns and Thera_Bank_cat the 0/1 coded columns;
#  the step that splits the data into these two frames is omitted in the original script)
str(Thera_Bank_num)
str(Thera_Bank_cat)
Thera_Bank_cat=data.frame(apply(Thera_Bank_cat,2,function(x){as.factor(x)}))
Thera_Bank_merged=cbind(Thera_Bank_num,Thera_Bank_cat)
sum(is.na(Thera_Bank_merged))
library(DMwR)   # provides knnImputation()
Thera_Bank_imputed=knnImputation(data=Thera_Bank_merged,k=5)
sum(is.na(Thera_Bank_imputed))
##CART
table(Thera_Bank1$`Personal Loan`)
set.seed(3000)
attach(Thera_Bank_imputed)
# we are splitting the data such that 70% is the train data and 30% is the test data
# (the split call is omitted in the original script; reconstructed with caTools::sample.split)
split = sample.split(Thera_Bank_imputed$Personal.Loan, SplitRatio = 0.70)
train = subset(Thera_Bank_imputed, split == TRUE)
test  = subset(Thera_Bank_imputed, split == FALSE)
table(train$`Personal.Loan`)
table(test$`Personal.Loan`)
#####
# model building: classification tree (the rpart() call is omitted in the original script;
# reconstructed using the variables named in the report body)
m1 = rpart(Personal.Loan ~ CD.Account + Education + CreditCard + Online + Securities.Account,
           data = train, method = "class")
m1
fancyRpartPlot(m1)
train$predict.score=predict(m1,train)
train$predict.class=predict(m1,train,type="class")
head(train)
confusion.matrix.train=with(train,table(`Personal.Loan`,predict.class))
confusion.matrix.train
Accuracy.train=sum(diag(confusion.matrix.train))/sum(confusion.matrix.train)
Accuracy.train
# Test
test$predict.score=predict(m1,test)
test$predict.class=predict(m1,test,type="class")
head(test)
confusion.matrix.test=with(test,table(`Personal Loan`,predict.class))
confusion.matrix.test
Accuracy.test=sum(diag(confusion.matrix.test))/sum(confusion.matrix.test)
Accuracy.test
###
rows=seq(from=1,to=nrow(Thera_Bank_imputed),by=1)
set.seed(1)
# trainrows is not defined in the original script; reconstructed as a 70% sample of row indices
trainrows=sample(rows,size=round(0.7*nrow(Thera_Bank_imputed)))
trainR=Thera_Bank_imputed[trainrows,]
testR=Thera_Bank_imputed[-trainrows,]
library(DMwR)
prop.table(table(Thera_Bank_imputed$Personal.Loan))
prop.table(table(trainR$Personal.Loan))
prop.table(table(testR$Personal.Loan))
## Random Forest
library(randomForest)
Therabank_rf=randomForest(Personal.Loan ~ Education+CreditCard+Online+Securities.Account+CD.Account,
                          data=trainR, keep.forest=TRUE, ntree=30)
# refit using only CreditCard; this is the model whose summary is printed in section 7
Therabank_rf=randomForest(Personal.Loan ~ CreditCard, data=trainR, keep.forest=TRUE, ntree=30)
print(Therabank_rf)
print(Therabank_rf$err.rate)
plot(Therabank_rf)
Therabank_rf$predicted
Therabank_rf$importance
varImpPlot(Therabank_rf)
View(trainR)
pred_model_train=predict(Therabank_rf,trainR[,-c(9)],type="class")
trainR$Prediction=pred_model_train
names(trainR)
##
result_train=table("actual value"=trainR$Personal.Loan,trainR$Prediction)
result_train
##
pred_model_test=predict(Therabank_rf,testR[,-c(9)],type = "class")
result_test=table("actual values"=testR$Personal.Loan,pred_model_test);result_test
test_accuracy=sum(diag(result_test))/sum(result_test);test_accuracy
test_recall=(result_test[2,2])/(result_test[2,2]+result_test[2,1]);test_recall
test_precision=(result_test[2,2])/(result_test[2,2]+result_test[1,2]);test_precision
#Building the ROC curve and lift charts
library(ROCR)
pred = prediction(train$predict.score[,2],train$`Personal.Loan`)
perf = performance(pred, "tpr", "fpr")
plot(perf)   # ROC curve for the train data
#install.packages("ineq")
library(ineq)
# gini, auc and KS are not computed in the original script; reconstructed below
gini = ineq(train$predict.score[,2], type = "Gini")
gini
#Confusion Matrix
nrow(train)
#Accuracy and KS
auc = as.numeric(performance(pred, "auc")@y.values)
auc
KS = max(perf@y.values[[1]] - perf@x.values[[1]])
KS
#Concordance/Discordance
library(InformationValue)
Concordance(actuals=train$`Personal.Loan`, predictedScores=train$predict.score[,2])
head(test)
#nrow(p_test)
library(ROCR)
#install.packages("ineq")
library(ineq)
# the test-side gini, auc and KS are not computed in the original script; reconstructed below
pred.test = prediction(test$predict.score[,2],test$`Personal.Loan`)
perf.test = performance(pred.test, "tpr", "fpr")
plot(perf.test)   # ROC curve for the test data
gini.test = ineq(test$predict.score[,2], type = "Gini")
gini.test
#Confusion Matrix
nrow(test)
#Accuracy and KS
auc.test = as.numeric(performance(pred.test, "auc")@y.values)
auc.test
KS.test = max(perf.test@y.values[[1]] - perf.test@x.values[[1]])
KS.test
The End