TITLE: Bank Marketing Classification: Submitted To: Dr. Supriya Kumar de Professor XLRI, Jamshedpur
Submitted by:
Soumit Ghosh
PGCBA -2
(SID - EA19053, SMSID - 120755)
Course – Data Mining
XLRI, Jamshedpur
The objective is to target clients through telemarketing phone calls to sell long-term deposits of a
Portuguese bank. Within a campaign, human agents make phone calls to a list of clients to sell the
deposit; if a client meanwhile calls the contact center for any other reason, he or she is also asked
to subscribe to the deposit. The result of each contact is therefore binary: unsuccessful or successful.
For this statistical analysis, we analyze data from a single table. Its fields are described
below:
Fields         Description
Age            Client’s age.
Job            Client’s type of job.
Marital        Client’s marital status; ‘divorced’ means divorced or widowed.
Education      Client’s education.
Default        Client has previously defaulted.
Housing        Client has a housing loan.
Loan           Client has a personal loan.
Fields         Description
Contact        Contact communication type (telephone or cellular).
Month          Last contact month of year.
day_of_week    Last contact day of week.
duration       Last contact duration in seconds. If duration is 0s, then we never contacted the client to sign up for a term deposit account.
- Other Attributes
Fields         Description
Campaign       Number of contacts performed during this campaign and for this client.
Pdays          Number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted).
Previous       Number of contacts performed before this campaign and for this client (numeric).
Poutcome       Outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’).
Fields          Description
Emp.var.rate    Employment variation rate - quarterly indicator (numeric)
Cons.price.idx  Consumer price index - monthly indicator (numeric)
Cons.conf.idx   Consumer confidence index - monthly indicator (numeric)
Euribor3m       Euribor 3 month rate - daily indicator (numeric)
Nr.employed     Number of employees - quarterly indicator (numeric)
- Output Variable
Fields         Description
y              Has the client subscribed to a term deposit? (binary: yes, no)
4. > str(bank)
> summary(bank)
y
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1127
3rd Qu.:0.0000
Max. :1.0000
The dataset contains no missing values. Some fields contain the value ‘unknown’, but we treat these
as non-influential data points and ignore them.
From the age distribution, we can conclude that the bank mainly contacts persons between 20 and 60 years of age.
From the job distribution, we can conclude that the bank contacts persons with the job titles admin and
blue-collar more often than the rest.
Distribution of Education
| bank$y
bank$job | 0 | 1 | Row Total |
--------------|-----------|-----------|-----------|
admin. | 9070 | 1352 | 10422 |
| 3.423 | 26.961 | |
| 0.870 | 0.130 | 0.253 |
| 0.248 | 0.291 | |
| 0.220 | 0.033 | |
--------------|-----------|-----------|-----------|
blue-collar | 8616 | 638 | 9254 |
| 19.926 | 156.951 | |
| 0.931 | 0.069 | 0.225 |
| 0.236 | 0.138 | |
| 0.209 | 0.015 | |
--------------|-----------|-----------|-----------|
entrepreneur | 1332 | 124 | 1456 |
| 1.240 | 9.767 | |
| 0.915 | 0.085 | 0.035 |
| 0.036 | 0.027 | |
| 0.032 | 0.003 | |
--------------|-----------|-----------|-----------|
housemaid | 954 | 106 | 1060 |
| 0.191 | 1.507 | |
| 0.900 | 0.100 | 0.026 |
| 0.026 | 0.023 | |
| 0.023 | 0.003 | |
--------------|-----------|-----------|-----------|
management | 2596 | 328 | 2924 |
| 0.001 | 0.006 | |
| 0.888 | 0.112 | 0.071 |
| 0.071 | 0.071 | |
| 0.063 | 0.008 | |
--------------|-----------|-----------|-----------|
retired | 1286 | 434 | 1720 |
| 37.814 | 297.849 | |
| 0.748 | 0.252 | 0.042 |
| 0.035 | 0.094 | |
| 0.031 | 0.011 | |
--------------|-----------|-----------|-----------|
self-employed | 1272 | 149 | 1421 |
| 0.097 | 0.767 | |
| bank$y
bank$marital | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
divorced | 4136 | 476 | 4612 |
| 0.464 | 3.652 | |
| 0.897 | 0.103 | 0.112 |
| 0.113 | 0.103 | |
| 0.100 | 0.012 | |
-------------|-----------|-----------|-----------|
married | 22396 | 2532 | 24928 |
| 3.450 | 27.174 | |
| 0.898 | 0.102 | 0.605 |
| 0.613 | 0.546 | |
| 0.544 | 0.061 | |
-------------|-----------|-----------|-----------|
single | 9948 | 1620 | 11568 |
| 9.778 | 77.021 | |
| 0.860 | 0.140 | 0.281 |
| 0.272 | 0.349 | |
| 0.242 | 0.039 | |
-------------|-----------|-----------|-----------|
unknown | 68 | 12 | 80 |
| 0.126 | 0.990 | |
| 0.850 | 0.150 | 0.002 |
| 0.002 | 0.003 | |
| 0.002 | 0.000 | |
-------------|-----------|-----------|-----------|
Column Total | 36548 | 4640 | 41188 |
| 0.887 | 0.113 | |
-------------|-----------|-----------|-----------|
| bank$y
bank$education | 0 | 1 | Row Total |
--------------------|-----------|-----------|-----------|
basic.4y | 3748 | 428 | 4176 |
| 0.486 | 3.829 | |
| 0.898 | 0.102 | 0.101 |
| 0.103 | 0.092 | |
| 0.091 | 0.010 | |
--------------------|-----------|-----------|-----------|
basic.6y | 2104 | 188 | 2292 |
| 2.423 | 19.088 | |
| 0.918 | 0.082 | 0.056 |
| 0.058 | 0.041 | |
| 0.051 | 0.005 | |
--------------------|-----------|-----------|-----------|
basic.9y | 5572 | 473 | 6045 |
| 8.065 | 63.527 | |
| 0.922 | 0.078 | 0.147 |
| 0.152 | 0.102 | |
| 0.135 | 0.011 | |
--------------------|-----------|-----------|-----------|
high.school | 8484 | 1031 | 9515 |
| 0.198 | 1.561 | |
| 0.892 | 0.108 | 0.231 |
| 0.232 | 0.222 | |
| 0.206 | 0.025 | |
--------------------|-----------|-----------|-----------|
illiterate | 14 | 4 | 18 |
| 0.244 | 1.918 | |
| 0.778 | 0.222 | 0.000 |
| 0.000 | 0.001 | |
| 0.000 | 0.000 | |
--------------------|-----------|-----------|-----------|
professional.course | 4648 | 595 | 5243 |
| 0.004 | 0.032 | |
| 0.887 | 0.113 | 0.127 |
| 0.127 | 0.128 | |
| 0.113 | 0.014 | |
--------------------|-----------|-----------|-----------|
university.degree | 10498 | 1670 | 12168 |
| 8.292 | 65.317 | |
| 0.863 | 0.137 | 0.295 |
| 0.287 | 0.360 | |
| 0.255 | 0.041 | |
--------------------|-----------|-----------|-----------|
unknown | 1480 | 251 | 1731 |
| 2.041 | 16.079 | |
| 0.855 | 0.145 | 0.042 |
| 0.040 | 0.054 | |
| 0.036 | 0.006 | |
--------------------|-----------|-----------|-----------|
Column Total | 36548 | 4640 | 41188 |
| 0.887 | 0.113 | |
--------------------|-----------|-----------|-----------|
| bank$y
bank$default | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
no | 28391 | 4197 | 32588 |
| 9.562 | 75.315 | |
| 0.871 | 0.129 | 0.791 |
| 0.777 | 0.905 | |
| 0.689 | 0.102 | |
-------------|-----------|-----------|-----------|
unknown | 8154 | 443 | 8597 |
| 36.198 | 285.122 | |
| 0.948 | 0.052 | 0.209 |
| 0.223 | 0.095 | |
| 0.198 | 0.011 | |
-------------|-----------|-----------|-----------|
yes | 3 | 0 | 3 |
| 0.043 | 0.338 | |
| 1.000 | 0.000 | 0.000 |
| 0.000 | 0.000 | |
| 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|
Column Total | 36548 | 4640 | 41188 |
| 0.887 | 0.113 | |
-------------|-----------|-----------|-----------|
In default, the value ‘yes’ occurs in only 3 records, all with outcome 0, so the variable has almost
no variability and should be ignored while modelling.
The socio-economic attributes look similar and are therefore likely to be highly correlated. To test
for multicollinearity, we used the VIF (variance inflation factor) method. We took the nr.employed
variable as the reference and computed the VIF of the other four indicators:
> vif(mod2)
euribor3m cons.conf.idx cons.price.idx emp.var.rate
25.992676 1.215625 3.123787 32.470054
The VIF of three variables (euribor3m, cons.price.idx and emp.var.rate) is greater than 3. Hence, they were excluded from our final model.
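As a rough illustration of what the VIF measures, the sketch below computes it by hand in base R: each predictor is regressed on the remaining predictors, and the VIF is 1 / (1 - R^2) of that regression. The data frame `toy`, its columns `x1` to `x3` and the helper `vif_manual` are synthetic, hypothetical names introduced only for this example; this is not the `vif(mod2)` call shown above.

```r
# Manual VIF: regress each predictor on the others, then take 1 / (1 - R^2).
# `toy`, `x1`, `x2`, `x3` and `vif_manual` are illustrative names only.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                  # generated independently of x1 and x2
toy <- data.frame(x1, x2, x3)

vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

In this toy data the two collinear columns get a large VIF while the independent column stays close to 1, which is the same pattern that separates cons.conf.idx from the other three indicators in the output above.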
table(bank$y)
0 1
36548 4640
Here, we can see that the outcome variable is imbalanced: only 4640 of the 41188 clients (about 11.3%)
subscribed. This class imbalance can make performance metrics misleading. A model can show high
accuracy simply by predicting the majority class. To deal with class imbalance, we used over-sampling.
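The report does not say which over-sampling routine was used, so the following is only a minimal sketch of random over-sampling in base R. The data frame `toy` is a stand-in that reproduces the class counts from `table(bank$y)`; the feature `x` is made up for illustration.

```r
# Random over-sampling of the minority class (base R only).
# `toy` only reproduces the class counts from table(bank$y); `x` is a made-up feature.
set.seed(42)
toy <- data.frame(y = rep(c(0, 1), times = c(36548, 4640)))
toy$x <- rnorm(nrow(toy))

# Accuracy of a model that always predicts the majority class
baseline_acc <- max(table(toy$y)) / nrow(toy)

minority <- toy[toy$y == 1, ]
majority <- toy[toy$y == 0, ]

# Resample minority rows with replacement until both classes have equal size
idx <- sample(nrow(minority), size = nrow(majority), replace = TRUE)
balanced <- rbind(majority, minority[idx, ])

print(round(baseline_acc, 3))
print(table(balanced$y))
```

The baseline shows why accuracy alone is misleading here: always predicting 0 is already right for the majority share of the data. Over-sampling is normally applied to the training split only, so that the test set keeps the original class ratio.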
4. Modelling Technique
Data Splitting:
Algorithms Used:
- Logistic Regression
- Decision Tree
- Naïve Bayes
We tried out various parameter settings to find the best ones, using a greedy approach, i.e.
changing one parameter at a time, for each of the algorithms on each of the data sets
separately. We also used the chi-squared test between the categorical variables and the outcome
variable to find the significant variables.
1. Removed four variables: default (lack of variability), housing (lack of information), loan
(lack of information), and emp.var.rate (lack of significance).
2. Re-framed one variable: campaign, because it had outliers.
3. Removed due to correlation issues: euribor3m, cons.price.idx, emp.var.rate.
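The chi-squared screening described above can be illustrated with the marital-status counts from the `bank$marital` cross-table in this report. This is a sketch of a single test using base R's `chisq.test`, not a reproduction of the full variable-selection procedure; the object names `marital` and `res` are chosen for the example.

```r
# Chi-squared test of independence between marital status and the outcome y,
# using the counts from the bank$marital cross-table in this report.
marital <- matrix(
  c( 4136,  476,
    22396, 2532,
     9948, 1620,
       68,   12),
  ncol = 2, byrow = TRUE,
  dimnames = list(marital = c("divorced", "married", "single", "unknown"),
                  y = c("0", "1"))
)

res <- chisq.test(marital)
print(res$statistic)  # equals the sum of the per-cell chi-square contributions
print(res$parameter)  # degrees of freedom: (4 - 1) * (2 - 1)
print(res$p.value)
```

The test statistic is the sum of the per-cell chi-square contributions printed in the second row of each cell of the cross-table, so a variable whose cells contribute little (such as management in the job table) adds little evidence of association.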
Below are the best parameter settings for each of the algorithms on each dataset:
a) Logistic Regression
In Logistic Regression, we tuned the hyperparameters according to the area under the ROC
curve, the accuracy and the AIC score.
The following are the final parameter settings we used to maximize the accuracy:
b) Decision Tree
In Decision Tree, we tuned the hyperparameters according to the area under the ROC curve and the
accuracy. We used two R packages to implement the Decision Tree.
c) Naïve Bayes
In Naïve Bayes, we tuned the hyperparameters according to the area under the ROC curve and the
accuracy.
5. ROC Curves:
The ROC curves for the different algorithms are:
Logistic Regression:
Decision Tree:
Naïve Bayes:
6. Confusion Matrices
Logistic Regression:
Decision Tree:
Actual No 56 1104
Naïve Bayes:
7. Conclusion:
Performance Parameters Summary:
All our models have high predictive power, as indicated by their F1 scores. The F1 score for class 0
is almost the same for all algorithms, so the higher F1 score for class 1 is what makes the Decision Tree the better model.
The accuracy is similar across all models, but the AUC scores differ considerably. Here, too, the
Decision Tree has the highest AUC score.
Therefore, we conclude that the Decision Tree algorithm is the best for this dataset.
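For reference, the per-class F1 scores used in this comparison can be derived from a confusion matrix as in the sketch below. The counts in `cm` are hypothetical and are not the report's results, and `f1_by_class` is a helper name introduced only for this example.

```r
# Precision, recall and F1 for each class from a 2x2 confusion matrix.
# The counts are hypothetical; they are NOT the report's results.
cm <- matrix(c(900, 100,
                60, 140),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

f1_by_class <- function(cm) {
  precision <- diag(cm) / colSums(cm)  # correct / all predicted as the class
  recall    <- diag(cm) / rowSums(cm)  # correct / all actually in the class
  2 * precision * recall / (precision + recall)
}

accuracy <- sum(diag(cm)) / sum(cm)
f1 <- f1_by_class(cm)
print(round(accuracy, 3))
print(round(f1, 3))
```

With imbalanced data the F1 score for class 1 is usually much lower than for class 0 even when accuracy looks high, which is why the comparison above leans on the class 1 F1 score and the AUC rather than on accuracy.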