TITLE: Bank Marketing Classification 

Submitted to:

Dr. Supriya Kumar De


Professor
XLRI, Jamshedpur

Report Prepared By:

Soumit Ghosh
PGCBA -2
(SID - EA19053, SMSID - 120755)
Course – Data Mining
XLRI, Jamshedpur

October 29, 2019



Contents

1. Problem Context
2. Description of fields of dataset
3. Exploratory Data Analysis
4. Modelling Technique
5. ROC Curves
6. Confusion Matrices
7. Conclusion

1. Problem Context – Targeting Bank Clients

The task is to target clients of a Portuguese bank through telemarketing phone calls that sell
long-term deposits. Within a campaign, human agents call a list of clients to sell the deposit; if
a client instead calls the contact center for any other reason, the client is asked to subscribe to
the deposit. The result of each contact is therefore binary: unsuccessful or successful.

2. Description of fields of dataset

For this statistical analysis, we analyze data from one table. Its fields are described below,
grouped by type:

- Variables related to Bank Client Data

Fields      Description
Age         Client’s age.
Job         Client’s type of job.
Marital     Client’s marital status; divorced means divorced or widowed.
Education   Client’s education.
Default     Client has previously defaulted.
Housing     Client has a housing loan.
Loan        Client has a personal loan.

- Variables related to the last contact of the current campaign:

Fields        Description
Contact       Contact communication type (telephone or cellular).
Month         Last contact month of year.
day_of_week   Last contact day of week.
duration      Last contact duration in seconds. If duration is 0s, then we never
              contacted a client to sign up for a term deposit account.

- Other Attributes

Fields     Description
Campaign   number of contacts performed during this campaign and for this client
Pdays      number of days that passed by after the client was last contacted from a
           previous campaign (numeric; 999 means client was not previously contacted)
Previous   number of contacts performed before this campaign and for this client (numeric)
Poutcome   outcome of the previous marketing campaign (categorical,
           ‘failure’,‘nonexistent’,‘success’)

- Social and economic context attributes

Fields           Description
Emp.var.rate     employment variation rate - quarterly indicator (numeric)
Cons.price.idx   consumer price index - monthly indicator (numeric)
Cons.conf.idx    consumer confidence index - monthly indicator (numeric)
Euribor3m        euribor 3 month rate - daily indicator (numeric)
Nr.employed      number of employees - quarterly indicator (numeric)

- Output Variable

Fields   Description
y        has the client subscribed a term deposit? (binary, yes, no)

3. Exploratory Data Analysis

3.1 Initial Exploration of Data

> str(bank)
'data.frame': 41188 obs. of 21 variables:
 $ age           : int 56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int 261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int 1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int 999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int 0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num 94 94 94 94 94 ...
 $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num 4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num 5191 5191 5191 5191 5191 ...
 $ y             : num 0 0 0 0 0 0 0 0 0 0 ...

> summary(bank)

      age                 job            marital                   education
 Min.   :17.00   admin.     :10422   divorced: 4612   university.degree  :12168
 1st Qu.:32.00   blue-collar: 9254   married :24928   high.school        : 9515
 Median :38.00   technician : 6743   single  :11568   basic.9y           : 6045
 Mean   :40.02   services   : 3969   unknown :   80   professional.course: 5243
 3rd Qu.:47.00   management : 2924                    basic.4y           : 4176
 Max.   :98.00   retired    : 1720                    basic.6y           : 2292
                 (Other)    : 6156                    (Other)            : 1749

    default         housing           loan            contact
 no     :32588   no     :18622   no     :33950   cellular :26144
 unknown: 8597   unknown:  990   unknown:  990   telephone:15044
 yes    :    3   yes    :21576   yes    : 6248

     month       day_of_week     duration         campaign          pdays
 may    :13769   fri:7827     Min.   :   0.0   Min.   : 1.000   Min.   :  0.0
 jul    : 7174   mon:8514     1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0
 aug    : 6178   thu:8623     Median : 180.0   Median : 2.000   Median :999.0
 jun    : 5318   tue:8090     Mean   : 258.3   Mean   : 2.568   Mean   :962.5
 nov    : 4101   wed:8134     3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0
 apr    : 2632                Max.   :4918.0   Max.   :56.000   Max.   :999.0
 (Other): 2016

    previous            poutcome       emp.var.rate       cons.price.idx
 Min.   :0.000   failure    : 4252   Min.   :-3.40000   Min.   :92.20
 1st Qu.:0.000   nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08
 Median :0.000   success    : 1373   Median : 1.10000   Median :93.75
 Mean   :0.173                       Mean   : 0.08189   Mean   :93.58
 3rd Qu.:0.000                       3rd Qu.: 1.40000   3rd Qu.:93.99
 Max.   :7.000                       Max.   : 1.40000   Max.   :94.77

 cons.conf.idx     euribor3m      nr.employed          y
 Min.   :-50.8   Min.   :0.634   Min.   :4964   Min.   :0.0000
 1st Qu.:-42.7   1st Qu.:1.344   1st Qu.:5099   1st Qu.:0.0000
 Median :-41.8   Median :4.857   Median :5191   Median :0.0000
 Mean   :-40.5   Mean   :3.621   Mean   :5167   Mean   :0.1127
 3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228   3rd Qu.:0.0000
 Max.   :-26.9   Max.   :5.045   Max.   :5228   Max.   :1.0000

3.2 Checking for Missing Values

The dataset contains no missing (NA) values. Some categorical fields instead use the value
‘unknown’. We treated these as non-influential data points and ignored them.
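A check of this kind can be sketched in base R as follows. The data frame `toy` and its values are illustrative assumptions standing in for `bank`; the report does not show the commands it actually ran.

```r
# Toy data frame standing in for `bank` (values are illustrative only).
toy <- data.frame(
  job     = c("admin.", "unknown", "services", "admin."),
  marital = c("married", "single", "unknown", "unknown"),
  age     = c(56, 37, 40, 24),
  stringsAsFactors = TRUE
)

# Count true missing values (NA) in each column.
na_counts <- colSums(is.na(toy))

# Count how often the literal value "unknown" appears in each column.
unknown_counts <- sapply(toy, function(col) sum(as.character(col) == "unknown"))

print(na_counts)
print(unknown_counts)
```

Counting the two separately shows that a column can have zero NA values and still carry many ‘unknown’ entries.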

3.3 Univariate Analysis

Plot Distribution of Age

Here we can conclude that the bank mostly contacts people between the ages of 20 and 60, although the summary in Section 3.1 shows ages ranging from 17 to 98.

Plot Distribution of Jobs

Here, we can conclude that the bank contacts people with admin. and blue-collar jobs more often
than the rest.

Distribution of Marital Status

Married clients are the most likely to be contacted by the bank: they make up about 60% of all contacts (24928 of 41188).

Distribution of Education

3.4 Bivariate Analysis

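Bivariate analysis of Jobs with respect to the outcome variable y. In each table in this section, a cell lists, from top to bottom: the count, its chi-square contribution, the row proportion, the column proportion, and the proportion of the table total.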

Total Observations in Table: 41188

| bank$y
bank$job | 0 | 1 | Row Total |
--------------|-----------|-----------|-----------|
admin. | 9070 | 1352 | 10422 |
| 3.423 | 26.961 | |
| 0.870 | 0.130 | 0.253 |
| 0.248 | 0.291 | |
| 0.220 | 0.033 | |
--------------|-----------|-----------|-----------|
blue-collar | 8616 | 638 | 9254 |
| 19.926 | 156.951 | |
| 0.931 | 0.069 | 0.225 |
| 0.236 | 0.138 | |
| 0.209 | 0.015 | |
--------------|-----------|-----------|-----------|
entrepreneur | 1332 | 124 | 1456 |
| 1.240 | 9.767 | |
| 0.915 | 0.085 | 0.035 |
| 0.036 | 0.027 | |
| 0.032 | 0.003 | |
--------------|-----------|-----------|-----------|
housemaid | 954 | 106 | 1060 |
| 0.191 | 1.507 | |
| 0.900 | 0.100 | 0.026 |
| 0.026 | 0.023 | |
| 0.023 | 0.003 | |
--------------|-----------|-----------|-----------|
management | 2596 | 328 | 2924 |
| 0.001 | 0.006 | |
| 0.888 | 0.112 | 0.071 |
| 0.071 | 0.071 | |
| 0.063 | 0.008 | |
--------------|-----------|-----------|-----------|
retired | 1286 | 434 | 1720 |
| 37.814 | 297.849 | |
| 0.748 | 0.252 | 0.042 |
| 0.035 | 0.094 | |
| 0.031 | 0.011 | |
--------------|-----------|-----------|-----------|
self-employed | 1272 | 149 | 1421 |
| 0.097 | 0.767 | |
| 0.895 | 0.105 | 0.035 |
| 0.035 | 0.032 | |
| 0.031 | 0.004 | |
--------------|-----------|-----------|-----------|
services | 3646 | 323 | 3969 |
| 4.375 | 34.458 | |
| 0.919 | 0.081 | 0.096 |
| 0.100 | 0.070 | |
| 0.089 | 0.008 | |
--------------|-----------|-----------|-----------|
student | 600 | 275 | 875 |
| 40.090 | 315.775 | |
| 0.686 | 0.314 | 0.021 |
| 0.016 | 0.059 | |
| 0.015 | 0.007 | |
--------------|-----------|-----------|-----------|
technician | 6013 | 730 | 6743 |
| 0.147 | 1.156 | |
| 0.892 | 0.108 | 0.164 |
| 0.165 | 0.157 | |
| 0.146 | 0.018 | |
--------------|-----------|-----------|-----------|
unemployed | 870 | 144 | 1014 |
| 0.985 | 7.758 | |
| 0.858 | 0.142 | 0.025 |
| 0.024 | 0.031 | |
| 0.021 | 0.003 | |
--------------|-----------|-----------|-----------|
unknown | 293 | 37 | 330 |
| 0.000 | 0.001 | |
| 0.888 | 0.112 | 0.008 |
| 0.008 | 0.008 | |
| 0.007 | 0.001 | |
--------------|-----------|-----------|-----------|
Column Total | 36548 | 4640 | 41188 |
| 0.887 | 0.113 | |
--------------|-----------|-----------|-----------|
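The row proportions in the table above can be reproduced from the raw counts. The following is a minimal base R sketch using three rows copied from the table; the object names `counts` and `row_props` are illustrative, and the report's own tables were produced by a cross-tabulation function that is not shown.

```r
# Counts copied from the job-by-outcome table (three rows only).
counts <- matrix(
  c(8616, 638,
    1286, 434,
     600, 275),
  ncol = 2, byrow = TRUE,
  dimnames = list(job = c("blue-collar", "retired", "student"),
                  y   = c("0", "1"))
)

# Row proportions: the share of each job group with outcome 0 and outcome 1.
row_props <- round(prop.table(counts, margin = 1), 3)
print(row_props)
```

Reading the `1` column gives the subscription rate per job, which is why students and retired clients stand out despite their small group sizes.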

Bivariate analysis of Marital status and outcome variable yes:

| bank$y
bank$marital | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
divorced | 4136 | 476 | 4612 |
| 0.464 | 3.652 | |
| 0.897 | 0.103 | 0.112 |
| 0.113 | 0.103 | |
| 0.100 | 0.012 | |
-------------|-----------|-----------|-----------|
married | 22396 | 2532 | 24928 |
| 3.450 | 27.174 | |
| 0.898 | 0.102 | 0.605 |
| 0.613 | 0.546 | |
| 0.544 | 0.061 | |
-------------|-----------|-----------|-----------|
single | 9948 | 1620 | 11568 |
| 9.778 | 77.021 | |
| 0.860 | 0.140 | 0.281 |
| 0.272 | 0.349 | |
| 0.242 | 0.039 | |
-------------|-----------|-----------|-----------|
unknown | 68 | 12 | 80 |
| 0.126 | 0.990 | |
| 0.850 | 0.150 | 0.002 |
| 0.002 | 0.003 | |
| 0.002 | 0.000 | |
-------------|-----------|-----------|-----------|
Column Total | 36548 | 4640 | 41188 |
| 0.887 | 0.113 | |
-------------|-----------|-----------|-----------|

Bivariate analysis of Education with respect to the outcome variable:

| bank$y
bank$education | 0 | 1 | Row Total |
--------------------|-----------|-----------|-----------|
basic.4y | 3748 | 428 | 4176 |
| 0.486 | 3.829 | |
| 0.898 | 0.102 | 0.101 |
| 0.103 | 0.092 | |
| 0.091 | 0.010 | |
--------------------|-----------|-----------|-----------|
basic.6y | 2104 | 188 | 2292 |
| 2.423 | 19.088 | |
| 0.918 | 0.082 | 0.056 |
| 0.058 | 0.041 | |
| 0.051 | 0.005 | |
--------------------|-----------|-----------|-----------|
basic.9y | 5572 | 473 | 6045 |
| 8.065 | 63.527 | |
| 0.922 | 0.078 | 0.147 |
| 0.152 | 0.102 | |
| 0.135 | 0.011 | |
--------------------|-----------|-----------|-----------|
high.school | 8484 | 1031 | 9515 |
| 0.198 | 1.561 | |
| 0.892 | 0.108 | 0.231 |
| 0.232 | 0.222 | |
| 0.206 | 0.025 | |
--------------------|-----------|-----------|-----------|
illiterate | 14 | 4 | 18 |
| 0.244 | 1.918 | |
| 0.778 | 0.222 | 0.000 |
| 0.000 | 0.001 | |
| 0.000 | 0.000 | |
--------------------|-----------|-----------|-----------|
professional.course | 4648 | 595 | 5243 |
| 0.004 | 0.032 | |
| 0.887 | 0.113 | 0.127 |
| 0.127 | 0.128 | |
| 0.113 | 0.014 | |
--------------------|-----------|-----------|-----------|
university.degree | 10498 | 1670 | 12168 |
| 8.292 | 65.317 | |
| 0.863 | 0.137 | 0.295 |
| 0.287 | 0.360 | |
| 0.255 | 0.041 | |
--------------------|-----------|-----------|-----------|
unknown | 1480 | 251 | 1731 |
| 2.041 | 16.079 | |
| 0.855 | 0.145 | 0.042 |
| 0.040 | 0.054 | |
| 0.036 | 0.006 | |
--------------------|-----------|-----------|-----------|
Column Total | 36548 | 4640 | 41188 |
| 0.887 | 0.113 | |
--------------------|-----------|-----------|-----------|

Default with respect to outcome variable

| bank$y
bank$default | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
no | 28391 | 4197 | 32588 |
| 9.562 | 75.315 | |
| 0.871 | 0.129 | 0.791 |
| 0.777 | 0.905 | |
| 0.689 | 0.102 | |
-------------|-----------|-----------|-----------|
unknown | 8154 | 443 | 8597 |
| 36.198 | 285.122 | |
| 0.948 | 0.052 | 0.209 |
| 0.223 | 0.095 | |
| 0.198 | 0.011 | |
-------------|-----------|-----------|-----------|
yes | 3 | 0 | 3 |
| 0.043 | 0.338 | |
| 1.000 | 0.000 | 0.000 |
| 0.000 | 0.000 | |
| 0.000 | 0.000 | |
-------------|-----------|-----------|-----------|
Column Total | 36548 | 4640 | 41188 |
| 0.887 | 0.113 | |
-------------|-----------|-----------|-----------|

In default, the value ‘yes’ occurs for only 3 clients, all with outcome 0. The field therefore
carries almost no information and should be ignored while modelling.

3.5 Multicollinearity between the socio-economic attributes

The socio-economic attributes look similar and are likely to be highly correlated. To test
for multicollinearity, we used the variance inflation factor (VIF) method. We took the
nr.employed variable and compared it with the others:

> vif(mod2)
     euribor3m  cons.conf.idx cons.price.idx   emp.var.rate
     25.992676       1.215625       3.123787      32.470054

The VIF of three variables (euribor3m, cons.price.idx and emp.var.rate) is greater than 3. Hence,
they were ignored in our final model.
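To make the VIF figures easier to interpret, the sketch below computes them by hand in base R on synthetic data. The variables `x1`, `x2`, `x3` and the helper name `vif_manual` are assumptions for illustration only; the report's `vif()` presumably comes from an add-on package that is not named, and `mod2` itself is not shown.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # independent of the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the others.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

A predictor that the others explain almost perfectly gets a large VIF, while an unrelated predictor stays close to 1.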

3.6 Class Imbalance

> table(bank$y)

    0     1
36548  4640

The outcome variable is imbalanced: only 4640 of the 41188 clients (about 11%) subscribed. Such
class imbalance can make performance metrics misleading. For example, a model that always
predicts 0 would be about 89% accurate while identifying no subscribers. To deal with the class
imbalance, we used over-sampling.
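Random over-sampling can be sketched in base R as below. This is an illustration on toy data under the assumption of simple random duplication of minority rows; the report does not say which over-sampling function or ratio it used, and the names `toy`, `minority`, `majority` and `balanced` are hypothetical.

```r
set.seed(42)
# Toy imbalanced data: 90 rows of class 0 and 10 rows of class 1.
toy <- data.frame(x = rnorm(100), y = rep(c(0, 1), times = c(90, 10)))

minority <- toy[toy$y == 1, ]
majority <- toy[toy$y == 0, ]

# Draw minority rows with replacement until both classes have the same size.
idx      <- sample(nrow(minority), size = nrow(majority), replace = TRUE)
balanced <- rbind(majority, minority[idx, ])

print(table(balanced$y))
```

Over-sampling is usually applied to the training portion only, so that the test set keeps the original class distribution.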

4. Modelling Technique

Data Splitting:

We used a 75%/25% train-test split of the data.
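A split of this kind can be sketched in base R as follows. The toy data frame, the seed and the use of `sample()` are assumptions; the report does not show how `train` was created.

```r
set.seed(123)
# Toy data standing in for the full dataset.
toy <- data.frame(id = 1:100, y = rbinom(100, size = 1, prob = 0.1))

# Randomly pick 75% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```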

Algorithms Used:

- Logistic Regression
- Decision Tree
- Naïve Bayes Algorithm

Parameter Settings for R algorithms:

We tried out various parameter settings to find the best ones, using a greedy approach, i.e.
changing one parameter at a time, for each algorithm on each data set separately. We also used
the chi-squared test between each categorical variable and the outcome variable to find the
significant variables.
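As an illustration of the chi-squared screening, the sketch below applies base R's `chisq.test()` to the marital-by-outcome counts from Section 3.4. The object names `marital` and `res` and the 0.05 cutoff are assumptions; the report does not state its significance threshold.

```r
# Counts copied from the marital-by-outcome table in Section 3.4.
marital <- matrix(
  c( 4136,  476,
    22396, 2532,
     9948, 1620,
       68,   12),
  ncol = 2, byrow = TRUE,
  dimnames = list(marital = c("divorced", "married", "single", "unknown"),
                  y       = c("0", "1"))
)

# Pearson chi-squared test of independence between marital status and outcome.
res <- chisq.test(marital)

print(res$statistic)       # sum of the per-cell chi-square contributions
print(res$parameter)       # degrees of freedom: (4 - 1) * (2 - 1)
print(res$p.value < 0.05)  # TRUE means significant at the assumed 5% level
```

The test statistic is the sum of the chi-square contributions printed in the second line of each cell of the cross-tabulation.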

The following variables were removed from the model or changed:

1. Removed four variables: default (lack of variability), housing (lack of information), loan
(lack of information), and emp.var.rate (lack of significance).
2. Re-framed one variable: campaign, because it had outliers.
3. Removed due to correlation issues: euribor3m, cons.price.idx, emp.var.rate.

Below are the best parameter settings for each of the algorithms on each dataset:

a) Logistic Regression

In Logistic Regression, we tuned the model according to the area under the ROC curve, the
accuracy and the AIC score.

The following is the final model specification we used to maximize the accuracy:

glm(formula = y ~ job + contact + month + day_of_week + poutcome + nr.employed +
    cons.conf.idx, family = binomial, data = train)
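The fit-and-predict cycle behind such a model can be sketched on synthetic data as below. The single predictor `x`, the simulated coefficients and the 0.5 classification threshold are all assumptions for illustration; none of them come from the report.

```r
set.seed(7)
n   <- 500
toy <- data.frame(x = rnorm(n))
# Simulate a binary outcome whose log-odds increase with x.
toy$y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * toy$x))

# Fit a logistic regression, the same model family as in the report.
fit <- glm(y ~ x, family = binomial, data = toy)

# Predicted probabilities, then class labels at an assumed 0.5 threshold.
prob <- predict(fit, newdata = toy, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

print(coef(fit))
print(AIC(fit))  # lower AIC indicates a better trade-off of fit and complexity
```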

b) Decision Tree

In Decision Tree, we tuned the model according to the area under the ROC curve and the
accuracy. We used two R implementations of decision trees, rpart and C5.0.

We decided to keep all the variables to maximize the accuracy:

dt_mod1 = rpart(y ~ ., data = ndov, method = "class")
dt_mod3 = C5.0(as.factor(y) ~ ., data = ndov)

c) Naïve Bayes

In Naïve Bayes, we tuned the model according to the area under the ROC curve and the
accuracy.

We decided to keep all the variables to maximize the accuracy:

dt_mod3 = C5.0(as.factor(y) ~ .,data=ndov)



5. ROC Curves:
The ROC curves for the different algorithms are:

Logistic Regression:

Decision Tree:

Naïve Bayes:
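
For reference, the area under an ROC curve can be computed directly from predicted scores without plotting. The sketch below uses the rank-based (Mann-Whitney) formula in base R; the function name `auc_rank` and the toy scores are illustrative assumptions, and the report does not say how its AUC values were obtained.

```r
# AUC as the probability that a random positive scores higher than a random negative.
auc_rank <- function(scores, labels) {
  r  <- rank(scores)       # average ranks, so tied scores are handled
  n1 <- sum(labels == 1)   # number of positives
  n0 <- sum(labels == 0)   # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(0, 0, 1, 1)
auc <- auc_rank(scores, labels)
print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.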

6. Confusion Matrices

Rows give the actual class and columns the predicted class.

Logistic Regression:

             Predicted No   Predicted Yes
Actual No    7791           1346
Actual Yes   413            747

Decision Tree:

             Predicted No   Predicted Yes
Actual No    7310           1827
Actual Yes   56             1104

Naïve Bayes:

             Predicted No   Predicted Yes
Actual No    7563           1572
Actual Yes   408            885
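
The performance measures reported in the conclusion can be derived from a confusion matrix. The sketch below does this in base R for the Logistic Regression counts, taking the first row and column as class 0 (the majority "no" class), which is the reading that reproduces the reported accuracy. The object names are illustrative.

```r
# Logistic Regression counts; rows are actual classes, columns are predicted classes.
cm <- matrix(c(7791, 1346,
                413,  747),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / colSums(cm)  # correct predictions / all predictions of a class
recall    <- diag(cm) / rowSums(cm)  # correct predictions / all actual members of a class
f1        <- 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(round(rbind(precision, recall, f1), 3))
```

The low precision for class 1 alongside a much higher recall shows that the model flags many non-subscribers in order to catch most subscribers.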

7. Conclusion:
Performance Parameters Summary:

Models                Accuracy    Precision      Recall         F1 score       AUC
                                  (0 / 1)        (0 / 1)        (0 / 1)
Logistic Regression   0.8291735   0.94 / 0.35    0.85 / 0.64    0.89 / 0.45    0.79
Decision Tree         0.817       0.99 / 0.37    0.80 / 0.95    0.88 / 0.53    0.91
Naïve Bayes           0.82        0.96 / 0.36    0.82 / 0.77    0.89 / 0.49    0.87

The following are the analytical insights we get from the models:

- The bank should contact customers who are highly educated.
- The bank should contact customers who are married.
- The bank should contact customers during the months of April, September and December.
- The bank should always contact previous customers.

All our models have high predictive power for class 0, as indicated by F1 scores that are almost
the same (0.88 to 0.89) for all algorithms. The models differ on class 1, where the Decision Tree
has the highest F1 score (0.53), which makes it the better model. The accuracy is similar for all
models (about 0.82 to 0.83), but the AUC scores differ considerably, and here too the Decision
Tree has the highest score (0.91).
Therefore, we can conclude that the Decision Tree algorithm is the best for this dataset.
