Data Mining in Banking
Abstract
Classification, as one of the most popular data mining techniques, has been used
in the banking sector for different purposes, for example, for bank customer churn
prediction, credit approval, fraud detection, bank failure estimation, and bank
telemarketing prediction. However, traditional classification algorithms do not take
into account the class distribution, which results in poor performance on
imbalanced banking data. To solve this problem, this paper proposes an approach
which improves the decision jungle (DJ) method with a class-based weighting
mechanism. The experiments conducted on 17 real-world bank datasets show that
the proposed approach outperforms the decision jungle method when handling
imbalanced banking data.
1. Introduction
Data mining is the process of analyzing large volumes of data stored in data warehouses in
order to automatically extract hidden, previously unknown, valid, interesting, and
actionable knowledge such as patterns, anomalies, associations, and changes. It has
been commonly used in a wide range of areas, including marketing,
health care, military, environment, and education. Data mining is becoming
increasingly important for the banking sector as well, since the amount of
data collected by banks has grown remarkably and the need to discover hidden and
useful patterns in banking data has become widely recognized.
Banking systems collect huge amounts of data ever more rapidly as the number of
channels (i.e., Internet banking, telebanking, retail banking, mobile banking, ATM)
has increased. Banking data is currently generated from various sources,
including but not limited to bank account transactions, credit card details, loan
applications, and telex messages. Hence, data mining can be used to extract
meaningful information from the collected banking data, enabling banking institutions
to improve their decision-making processes. For example, classification, which is one of
the most popular data mining techniques, can be used to predict bank failures [1–3],
to estimate bank customer churn [4], to detect fraud [5], and to evaluate loan
approvals [6].
Data Mining - Methods, Applications and Systems
2. Related work
As a data-intensive sector, banking has been a popular application area for data
mining researchers since the information technology revolution. The continuous
developments in banking systems and the rapidly increasing availability of big
banking data make data mining one of the most essential tasks for the banking
industry.
The banking industry has used data mining techniques in various applications,
especially bank failure prediction [1–3], identification of possible bank customer
churn [4], fraudulent transaction detection [5], customer segmentation [8–10],
predictions on bank telemarketing [11–14], and sentiment analysis for bank
customers [15]. Some of the classification studies in the banking sector are
compared in Table 1. The objectives of the studies, the years they were conducted,
the algorithms and ensemble learning techniques they used, the country of the bank,
and the obtained results are shown in this table.
The main data mining tasks are classification (or categorical prediction),
regression (or numeric prediction), clustering, association rule mining, and anomaly
detection. Among these data mining tasks, classification is the most frequently used
one in the banking sector [16], followed by clustering. Some banking
applications [8, 10] have used more than one data mining technique; among these
combinations, clustering before classification has proven both popular and
applicable.
Apart from novel task-specific algorithms proposed by the authors, the most
commonly used classification algorithms in the banking sector are decision tree
(DT), neural network (NN), support vector machine (SVM), k-nearest neighbor
(KNN), Naive Bayes (NB), and logistic regression (LR), as shown in Table 1. Some
data mining studies in the banking sector [1, 2, 6, 11, 15] have used ensemble
learning methods to increase the classification performance. Bagging and boosting
are the most popular ensemble learning methods due to their theoretical
performance advantages. As the most well-known bagging algorithm, random forest
(RF) [2, 6, 11, 15] has been used in the banking sector, as have the boosting
algorithms AdaBoost (AB) [6] and extreme gradient boosting (XGB) [2, 15]. As shown in
Table 1, accuracy (ACC) and area under the ROC curve (AUC) are the commonly used
performance measures for classification.
DOI: http://dx.doi.org/10.5772/intechopen.91836
Data Mining in Banking Sector Using Weighted Decision Jungle Method
Ref | Year | Methods marked, among the columns DT, NN, SVM, KNN, NB, LR, Bagging (i.e., RF), Boosting (AB, XGB) | Description | Country of the bank | Result
Ilham et al. [11] | 2019 | √ √ √ √ √ √ √ | Long-term deposit prediction | Portugal | ACC 97.07%
Krishna et al. [15] | 2019 | √ √ √ √ √ √ √ √ | Sentiment analysis for bank customers | India | AUC 0.8268
Farooqi and Iqbal [12] | 2019 | √ √ √ √ √ | Prediction of bank telemarketing outcomes | Portugal | ACC 91.2%
Carmona et al. [2] | 2019 | √ √ √ | Bank failure prediction | USA | ACC 94.74%
Jing and Fang [3] | 2018 | √ √ √ | Bank failure prediction | USA | AUC 0.916
Marinakos and Daskalaki [8] | 2017 | √ √ √ √ √ | Customer classification for bank direct marketing | Portugal | AUC 0.9
Keramati et al. [4] | 2016 | √ | Bank customer churn prediction | - | AUC 0.929
Wan et al. [6] | 2016 | √ √ √ √ √ | Predicting nonperforming loans | China | AUC 0.965
Ogwueleka et al. [10] | 2015 | √ √ | Identifying bank customer behavior | Intercontinental | AUC 0.94
Moro et al. [14] | 2014 | √ √ √ √ | Prediction of bank telemarketing outcomes | Portugal | AUC 0.8
Table 1.
Classification applications in the banking sector.
To deal with the class imbalance problem, various solutions have been proposed in
the literature. Such methods can be mainly grouped under two different
approaches: (i) applying a data preprocessing step and (ii) modifying existing
methods. The first approach focuses on balancing the dataset, which may be done
either by increasing the number of minority class examples (over-sampling) or by
reducing the number of majority class examples (under-sampling). In the literature,
the synthetic minority over-sampling technique (SMOTE) [17] is commonly used as an
over-sampling technique. As an alternative approach, some studies (e.g., [18]) focus
on modifying the existing classification algorithms to make them more effective
when dealing with imbalanced data. Unlike these studies, this paper proposes a
novel approach (a class-based weighting approach) to solve the imbalanced data
problem.
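The two preprocessing options described above, over-sampling and under-sampling, can be illustrated with a minimal sketch. It uses plain random duplication and random removal, which is a simpler stand-in and not SMOTE [17] itself (SMOTE synthesizes new minority examples instead of duplicating existing ones). The function names and the toy churn data are hypothetical and do not come from the chapter.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class so that every class has as
    many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 "no" (majority) rows and 2 "yes" (minority) rows.
rows = [[i] for i in range(10)]
labels = ["no"] * 8 + ["yes"] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))   # both classes now have 8 rows
print(Counter(under_labels))  # both classes now have 2 rows
```

Both functions change the training data and leave the classifier untouched, which is what distinguishes the first group of approaches from the second.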
3. Methods
$$\frac{P(y_j \mid x)}{P(y_m \mid x)} > \text{threshold}, \quad \forall m \neq j \tag{1}$$
where W_c is the weight assigned to the class c, N is the total number of instances
in the dataset, N_c is the number of instances present in the class c, and k is the
number of class labels. In the proposed approach, Eq. (1) is updated as follows:
$$\frac{W_j \cdot P(y_j \mid x)}{W_m \cdot P(y_m \mid x)} > \text{threshold}, \quad \forall m \neq j \tag{3}$$
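The weighted decision rule in Eq. (3) can be sketched in a few lines of Python. The weight formula used here, W_c = N / (k · N_c), is an assumption: it is a common inverse-frequency weighting built from the symbols N, N_c, and k defined above, but the chapter's own weight equation is not reproduced in this text. The function names, the class counts, and the probabilities are likewise hypothetical.

```python
def class_weights(class_counts):
    """ASSUMED weighting, not taken from the source: W_c = N / (k * N_c),
    so rarer classes receive larger weights."""
    n_total = sum(class_counts)
    k = len(class_counts)
    return [n_total / (k * n_c) for n_c in class_counts]


def weighted_decision(probs, weights, threshold=1.0):
    """Return the class index j whose weighted probability exceeds every
    other class's weighted probability by the factor `threshold`, as in
    Eq. (3); return None if no class satisfies the rule.

    The ratio test W_j*P_j / (W_m*P_m) > threshold is written as
    W_j*P_j > threshold * W_m*P_m to avoid dividing by zero.
    A threshold of at least 1 is assumed, so at most one class qualifies."""
    scores = [w * p for w, p in zip(weights, probs)]
    for j, s_j in enumerate(scores):
        if all(s_j > threshold * s_m
               for m, s_m in enumerate(scores) if m != j):
            return j
    return None


# Hypothetical example: 900 majority and 100 minority training instances,
# and class probabilities P(y|x) estimated for one test instance.
counts = [900, 100]
probs = [0.7, 0.3]

weights = class_weights(counts)
unweighted = weighted_decision(probs, [1.0, 1.0])  # Eq. (1): plain ratio
weighted = weighted_decision(probs, weights)       # Eq. (3): weighted ratio
print(unweighted, weighted)
```

With all weights set to 1 the rule reduces to Eq. (1) and picks the majority class for this instance, whereas the assumed inverse-frequency weights shift the decision toward the minority class.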
Figure 1 shows the general structure of the proposed approach. In the first step,
various types of raw banking data are obtained from different sources such as
account transactions, credit card details, loan applications, and social media texts.
Next, the raw banking data is preprocessed by applying several techniques that
provide data integration, data selection, and data transformation. The prepared data
is then passed to the training step, where the weighted decision jungle algorithm is used
to build a model that accurately maps inputs to desired outputs. The
classification validation step provides feedback to the learning phase for adjustment
Figure 1.
General structure of proposed approach.
4. Experimental studies
Table 2.
The main characteristics of the banking datasets.
It can be deduced from the average precision and recall values that higher
classification rates can be achieved with the weighted DJ method for minority
classes, while more misclassified instances in the majority classes may also be
observed in the case of imbalanced data.
Figure 2 compares the classification performance of two
methods in terms of F-measure: decision jungle and class-based weighted decision
jungle (weighted DJ). F-measure is defined as F = (2 × Recall ×
Precision)/(Recall + Precision), which is the harmonic mean of recall and
precision. According to the results, the proposed method matched or improved
the F-measure value on every banking dataset.
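As a small worked example of the F-measure definition above, the following sketch computes the harmonic mean of precision and recall. The function name and the input values are hypothetical and are not results from the chapter. Returning 0 when both inputs are 0 is a convention assumed here to avoid division by zero.

```python
def f_measure(precision, recall):
    """F = (2 * Recall * Precision) / (Recall + Precision)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both values are zero
    return 2 * recall * precision / (recall + precision)


# Hypothetical values: perfect recall but only half of the predicted
# positives are correct.
precision, recall = 0.5, 1.0
arithmetic_mean = (precision + recall) / 2
f = f_measure(precision, recall)
print(arithmetic_mean, f)
```

Because the harmonic mean is pulled toward the smaller of the two values, the F-measure here is lower than the arithmetic mean, so a model cannot score well by maximizing recall alone.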
It can be concluded from the experiments that the minority and
majority ratios are not the only issues in constructing a good prediction model. For
example, the minority and majority class ratios of the first and last datasets are
almost the same, yet there is a significant difference between the classification
accuracy, precision, and recall values obtained on these two datasets, as can be seen
in Table 3. There is also a need
for appropriate training examples that have data characteristics consistent with the
class label assigned to them.
ID | Dataset | DJ accuracy (%) | DJ precision | DJ recall | Weighted DJ accuracy (%) | Weighted DJ precision | Weighted DJ recall
1 | Abstract dataset for credit card fraud detection | 99.09 | 0.9918 | 0.9715 | 99.19 | 0.9923 | 0.9749
6 | Bank customer churn prediction | 87.37 | 0.8514 | 0.7291 | 87.40 | 0.8394 | 0.7411
10 | Credit card fraud detection | 99.97 | 0.9915 | 0.9167 | 99.97 | 0.9861 | 0.9309
11 | Default of credit card clients | 83.05 | 0.7833 | 0.6695 | 83.16 | 0.7793 | 0.6785
15 | Loan data for dummy bank | 95.19 | 0.9753 | 0.6837 | 95.20 | 0.9753 | 0.6844
Table 3.
Comparison of unweighted and class-based weighted decision jungle methods in terms of accuracy,
macro-averaged precision, and macro-averaged recall.
Figure 2.
Comparison of unweighted and class-based weighted decision jungle methods in terms of F-measure.
Author details
Derya Birant
Department of Computer Engineering, Dokuz Eylul University, Izmir, Turkey
© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms
of the Creative Commons Attribution License (http://creativecommons.org/licenses/
by/3.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly cited.
References