Improve Profiling Bank Customer Behavior Using ML
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2934644, IEEE Access
ABSTRACT In the banking industry, the growth of credit cards is a notable development. Every banking system holds a huge dataset of its customers' credit card transactions, so banks need customer profiling. Profiling bank customers informs the issuer's decisions about whom to offer banking facilities and what credit limit to provide. It also helps issuers better understand their current and potential customers. In previous research, customer profiling depended mainly on either transaction data or demographic data; in this research, we merge both in order to obtain more accurate results and minimize risk. Finding the best technique improves accuracy and helps banks achieve higher profitability through customer satisfaction, by focusing on the valuable customers (companies) that are considered the main engine of the bank's profitability. This study uses k-means, improved k-means, fuzzy c-means, and neural networks. The dataset is labeled; the main aspect of this study is creating a new label as the target for neural network classification, which helps to reduce the clustering execution time and obtain the best accuracy. Finally, a comparison of the accuracy ratios shows that the neural network is the best of the four techniques.
INDEX TERMS Profiling, banking, machine learning, k-means, fuzzy c-means, neural network classifier.
I. INTRODUCTION
In the modern era of the banking sector, banks hold large datasets containing customers' information and transaction histories. Banks need to divide these large datasets into small clusters so that they can analyze customer behavior and use it to choose a suitable strategy that attains the highest benefit and customer satisfaction and thereby increases profitability. Customer profiling, or customer segmentation, is used for this purpose. Profiling produces customer profiles, which give the bank a full description of its customers based on a set of attributes. Customer segmentation refers to characterizing groups of customers based either on specific characteristics (e.g., region, age, and income for demographic segmentation) or on their behavior (for behavioral segmentation). 'Customer segmentation' and 'profiling' are thus considered two sides of the same coin.
Banks confront many challenges, such as default prediction, risk management, customer retention, and customer profiling for different purposes, in order to achieve higher profitability and reduce risk. Solving such challenges requires identifying customers well. Machine learning is the science of enabling computers to act without being explicitly programmed. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it [1]. Machine learning teaches the computer how to learn: how to find equations and functions that work not only on the examples it has seen but also on unknown future ones. Machine learning not only helps to improve relationships with current customers; it also plays an important role in predicting customer behavior from a group of occurrences or patterns, which informs future strategy and the planning of targeted credit products. It has shifted the focus to the customer and changed the role banks play. The four machine learning techniques used in this research (k-means, improved k-means, fuzzy c-means, and artificial neural networks) are applied to a real dataset from a bank in Taiwan, and their accuracy ratios are then compared. These techniques are used to profile customer behavior into clusters.

The rest of this paper is organized as follows. Section II presents the related work, which focuses on profiling customers using machine learning techniques. Section III explains the four machine learning techniques and the accuracy measures used in our research. Section IV describes the dataset and its attributes. Section V presents the proposed model and the application of the techniques to the dataset. Section VI shows the results of our experiment and compares them with the results of earlier research. Section VII presents the conclusion and future work.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.
II. LITERATURE SURVEY
Many researchers have worked on the problem of profiling bank customers using different techniques and different datasets. The following papers focus on bank customer profiling and the machine learning techniques used.

In 2015, Majid Sharahi [2] presented a classification model for the dataset of Sepah Bank branches in Tehran using the two-step and k-means clustering algorithms. The segmentation of 60 companies that were customers of Sepah Bank combined demographic and behavioral segmentation, and it helped to identify the loyal customers.

In 2016, M. Ayoubi [3] described a customer segmentation model based on the two-step algorithm and the Kohonen neural network. Customers were segmented according to the factors that affect Customer Lifetime Value (CLV). A dataset of about 56,000 customers of "Taavon bank" was used. First, the optimum number of clusters was determined with the two-step approach. Then the Kohonen neural network was applied. The value of each cluster was calculated with the WRFM (weighted Recency, Frequency, and Monetary) model.

In 2017, Shamala Palaniappan [4] presented a profiling model for the customers of a Portuguese retail bank over a period of five years (2008 to 2013). The paper focused on helping banks increase the accuracy of their customer profiling through classification, and on identifying a group of customers with a high probability of subscribing to a long-term deposit. Three classification algorithms were used: Naïve Bayes, Random Forest, and Decision Tree.

In 2017, Arpit Bansal [5] presented a modification of the k-means clustering algorithm based on normalization. The researcher used the Cancer Dataset to obtain the results. The original data were high-dimensional, but only five attributes were finally considered, based on requirements. The paper reported an accuracy rate of 57.14% for the existing algorithm and 92.86% for the improved algorithm.

In 2017, P. S. Patil and N. V. Dharwadkar [6] produced a prediction and classification model for two datasets of bank customer data. They used an Artificial Neural Network (ANN) in this model and then weighted the results. Their results show that the ANN algorithm works efficiently for both datasets, with an accuracy rate of 72% for dataset 1 and 98% for dataset 2.

In 2018, Shenghui Yang [7] presented a classification model for the credit card default dataset from a bank in Taiwan using five algorithms. 10-fold cross-validation was used to obtain the average area under the curve (AUC) and the correct rate of the model. LightGBM (a high-performance gradient boosting framework built by Microsoft) had the highest accuracy rate, achieving an F1-measure of 89.34%.

In 2018, N. H. Niloy [8] presented a classification model for the credit card default dataset of a bank in Taiwan. A Naïve Bayesian classifier and decision trees were used to classify whether a client is a default credit cardholder. The results showed that the Naïve Bayesian classifier achieved the best accuracy.

In 2019, Ali Arshad [9] presented a multi-class classification model for eighteen datasets from the UCI repository. Semi-Supervised Deep Fuzzy C-Mean (DFCM-MC) was used for clustering semi-supervised data. The authors introduced a new label for the unlabeled data using fuzzy c-means, and used the labeled (supervised) data together with the unlabeled (unsupervised) data and the new label to extract the discriminatory information used for classification. The accuracy rate of DFCM-MC was 80.82% and the F-measure was 78.16%.

This survey shows that many authors have used various machine learning algorithms for predicting and clustering different datasets. All of them clustered the original datasets with the existing label; in this work, we create a new label using unsupervised techniques and use it as the target for the neural network algorithm. A profiling model was built for the dataset of bank customers using a supervised machine learning algorithm that takes the result of the unsupervised techniques as its input.

2.1. The impact of the dollar crisis on credit cards in Egyptian banks:
Some customers switch from one bank to another because banks do not give them the best rating, so they are not satisfied. Recently, due to the high price of the dollar against the Egyptian pound (the dollar crisis), customers tend to use credit cards, which require a good rating so that the customer is satisfied, the bank gets the best profit, and risk is reduced.
Egypt's largest listed bank, Commercial International Bank (CIB), told customers in July 2016 that it was reducing the amount of foreign currency customers can spend and withdraw when using their debit and credit cards abroad. Egypt's central bank wrote to bank chiefs asking that they "ensure that debit cards, including pre-paid cards, issued in local currency by Egyptian banks are only used within the country." CIB did not specify which cards would be affected or give the new limits, but several bank staff told Reuters that the move would affect both credit and debit cards, with limits cut by about 50 percent. CIB cut Classic Card owners' maximum purchases outside of
Egypt to $2,500 a month from $5,000, and to $3,500 a month from $7,500 for Gold Card owners [10]. HSBC Egypt (The Hong Kong and Shanghai Banking Corporation) says that all credit and debit cards have a limit of $100 per month, though it does not specify whether this is for cash withdrawals or purchases, according to the bank's website [11, 12]. Other Egyptian banks have put limits on debit and credit card purchases and ATM withdrawals abroad. Ahmed Aboul Dahab, head of retail at SAIB Bank (Arab International Banking Company) [13], says that the bank registered a 70-percent drop in credit card usage in January and February compared to the same period a year earlier. Because of this crisis, many customers moved from their banks to others in search of a higher limit. Any bank may therefore lose a huge number of customers, so we suggest re-profiling the bank's customers to place them in suitable clusters, in order to increase customer retention and achieve high profitability.

$\lVert x_i - v_j \rVert$ is the Euclidean distance between a point $x_i$ and a centroid $v_j$, iterated over all c points in the ith cluster, for all n clusters.

B. IMPROVED K-MEANS CLUSTERING ALGORITHM
The improved k-means clustering algorithm was used because it can define the number of clusters automatically and assign the required cluster to un-clustered points. The proposed improvement achieves high accuracy and reduces the clustering time needed to assign members to clusters. The improved k-means clustering algorithm is based on dissimilarity: it selects the initial centroids using a Huffman tree built from the dissimilarity matrix. Many experiments confirm that the improved algorithm is efficient, with better clustering accuracy at the same time complexity [16].
a computational model for neural networks based on mathematics and algorithms called threshold logic. This model paved the way for neural network research to split into two approaches. One approach focused on biological processes in the brain, while the other focused on the application of neural networks to artificial intelligence. A common use of the phrase "ANN model" is really the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity) [18]. This methodology provides the opportunity of creating a large combination of different structures based on
• Number of layers.
• Selection of activation function.
• The number of perceptrons.
• Normalization layers.
• Dropout adjustments.

$$a_j^l = \sigma\left(\sum_k \omega_{jk}^l \, a_k^{l-1} + b_j^l\right) \quad (4)$$

where the activation $a_j^l$ of the jth neuron in the lth layer is related to the activations in the (l−1)th layer, $\omega^l$ is the weight matrix of layer l, whose entry in the jth row and kth column is $\omega_{jk}^l$, and $b_j^l$ is the bias of the jth neuron in the lth layer.

E. EVALUATION METRICS:
After building a machine learning profiling model, its performance should be measured with different accuracy measures. This paper uses different techniques (supervised and unsupervised), so their classification performance was evaluated with the measures in equations (5) to (11).

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$

$$Sensitivity = \frac{TP}{TP + FN} \quad (6)$$

$$Specificity = \frac{TN}{TN + FP} \quad (7)$$

$$Precision = \frac{TP}{TP + FP} \quad (8)$$

$$Recall = Sensitivity = \frac{TP}{TP + FN} \quad (9)$$

$$F\text{-}measure = 2 \cdot \frac{precision \cdot recall}{precision + recall} \quad (10)$$

$$G\text{-}mean = \sqrt{sensitivity \cdot specificity} \quad (11)$$

Where
• True Positive (TP): Observation is positive, and it is predicted to be positive.
• False Negative (FN): Observation is positive, but it is predicted negative.
• True Negative (TN): Observation is negative, and it is predicted to be negative.
• False Positive (FP): Observation is negative, but it is predicted positive.

IV. DATA SET:
The data set ("default of credit card clients") was obtained from the UCI (University of California, Irvine) Machine Learning Repository [19]. It is a recently published dataset (obtained in 2015). The attributes are described in Table 1. The data set contains 30000 observations and 23 variables, and it has no missing data. All explanatory variables were normalized. Standardizing data is a pre-processing step that scales the variables to a similar range. This research addresses customers' default payments in Taiwan and compares the accuracy of profiling customers among four machine learning techniques. Among the four techniques, the artificial neural network is the only one that can accurately profile the data set.

Table 1. Description of the attributes in the dataset

Attribute no.   Attribute name            Description
X1              Limit_BAL                 Amount of the given credit (NT dollar)
X2              Sex                       Gender (1 = male; 2 = female)
X3              Education                 Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
X4              Marital status            Marital status (1 = married; 2 = single; 3 = others)
X5              Age                       Age (year)
X6-X11          Pay_0 to Pay_6            April to September
X12-X17         Bill_AMT1 to BILL_AMT6    Amount of bill statement (NT dollar)
X18-X23         Pay_AMT1 to PAY_AMT6      Amount of previous payment (NT dollar)
X24             Y                         Default payment (Yes = 1, No = 0)

V. PROPOSED FRAMEWORK:
The main idea of our proposed model, shown in Figure 1, is to improve the profiling of bank customers' behavior using different machine learning techniques. The model starts with the data set, which was obtained from the UCI Machine Learning Repository. The data then goes through data preprocessing. After that, the machine learning techniques are applied to build the customer profile. In machine learning, the profiling phase recognizes the items in a group and places them
under target categories. In this paper, the accuracy of the unsupervised techniques is evaluated through the Gini coefficient; their results are then used as input for the supervised technique, the Artificial Neural Network (ANN), and the results are evaluated and compared to find the best technique.

FIGURE 1. The proposed model for profiling bank customers

1. Data preprocessing
Data preprocessing is the first important step in the data mining process. If the data contain much irrelevant and superfluous information, or are noisy and unreliable, analyzing them without careful checking can produce inaccurate results. Thus, the quality and representation of the data come first, before any analysis is applied. Data preprocessing was often the most important phase of our machine learning project. First, the data are normalized. In most problems, normalization begins by eliminating the units of measurement so that data from different sources can be compared easily. One of the most common ways to normalize data is re-scaling the values to lie between 0 and 1, usually called feature scaling. One possible formula to achieve this is [20]:

$$X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}} \quad (12)$$

2. Classification using machine learning algorithms:
The result of data preprocessing is the final training set. The four machine learning techniques are then applied to the final training set. The first technique applied is the k-means algorithm. The number of clusters is determined from the researcher's prior knowledge; in this paper, the researcher set the number of clusters to three.
The second technique, improved k-means, determines the number of clusters as five by the following steps [21]:
1. Use the intra-cluster distance measure, which is simply the distance between a point and its cluster center, averaged over all points, defined as

$$intra = \frac{1}{N} \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - z_i \rVert^2 \quad (13)$$

where N is the number of data points (pixels, in the image segmentation setting of [21]), K is the number of clusters, and $z_i$ is the cluster center of cluster $C_i$. We obviously want to minimize this measure.
2. Measure the inter-cluster distance, or the distance between clusters, which should be as large as possible. Calculate this as the distance between cluster centers and take the minimum of this value, defined as

$$inter = \min\left(\lVert z_i - z_j \rVert^2\right), \quad i = 1, 2, \ldots, K-1, \quad j = i+1, \ldots, K \quad (14)$$

where $z_i$ and $z_j$ are cluster centers and K is the number of clusters.
3. Only the minimum of this value is taken, because the smallest of these distances is the one to be maximized; the other, larger values are automatically bigger than it.
4. Finally, calculate the ratio of intra to inter, defined as the validity:

$$Validity = \frac{intra}{inter} \quad (15)$$

5. The clustering that gives the minimum value of the validity measure indicates the ideal value of K (the number of clusters).
The third technique is fuzzy c-means, which was applied to the data set with five clusters.

The next step is to calculate the Gini coefficient for each of the three unsupervised algorithms to find the one with the best accuracy for profiling the dataset.
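
The validity measure of equations (13) to (15) can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name `validity` and its arguments (`X` for the data, `labels` for the cluster index of each point, `centers` for the cluster centers) are hypothetical and do not come from the paper or from [21].

```python
import numpy as np

def validity(X, labels, centers):
    """Return (intra, inter, validity) following equations (13)-(15)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # Equation (13): mean squared distance of each point to its own center.
    intra = ((X - centers[labels]) ** 2).sum(axis=1).mean()
    # Equation (14): smallest squared distance between two distinct centers.
    k = len(centers)
    inter = min(
        ((centers[i] - centers[j]) ** 2).sum()
        for i in range(k - 1)
        for j in range(i + 1, k)
    )
    # Equation (15): smaller values mean compact, well-separated clusters.
    return intra, inter, intra / inter
```

To choose K in the way step 5 describes, one would cluster the data for several candidate values of K, evaluate `validity` for each result, and keep the K with the smallest ratio.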
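
The feature scaling of equation (12) in the data preprocessing step can be illustrated with a short Python sketch. The function name `min_max_scale` and the handling of constant columns are assumptions added for the example; the paper does not say how a column with $X_{max} = X_{min}$ is treated.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1] as in equation (12)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = x_max - x_min
    # Assumption: a constant column (span 0) is mapped to 0
    # instead of dividing by zero.
    span[span == 0] = 1.0
    return (X - x_min) / span
```

For example, a credit limit column with the made-up values 20000, 70000, and 120000 is mapped to 0, 0.5, and 1, so it becomes comparable with an age column scaled the same way.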
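
The evaluation measures of equations (5) to (11) follow directly from the four confusion-matrix counts. The sketch below, in plain Python, computes them; the function name `evaluation_metrics` is illustrative, and the counts in the usage note are hypothetical numbers, not results from the paper.

```python
import math

def evaluation_metrics(tp, tn, fp, fn):
    """Compute the measures of equations (5)-(11) from confusion counts."""
    sensitivity = tp / (tp + fn)   # equation (6)
    specificity = tn / (tn + fp)   # equation (7)
    precision = tp / (tp + fp)     # equation (8)
    recall = sensitivity           # equation (9)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),                  # (5)
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),  # (10)
        "g_mean": math.sqrt(sensitivity * specificity),              # (11)
    }
```

With hypothetical counts TP = 40, TN = 45, FP = 5, and FN = 10, the accuracy is 0.85, the sensitivity 0.8, and the specificity 0.9. The sketch assumes that no denominator is zero, which holds whenever both classes occur and at least one positive prediction is made.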
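
Equation (4), which relates the activations of one layer to those of the previous layer, can be written as a single matrix operation. The sketch below assumes that the activation function σ is the logistic sigmoid, which the text does not state; the names `sigmoid` and `layer_forward` are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as an assumed choice for sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Equation (4): a^l = sigma(W^l a^(l-1) + b^l).

    a_prev: activations of layer l-1, shape (n_prev,)
    W:      weight matrix of layer l, shape (n, n_prev); W[j, k] is omega_jk
    b:      bias vector of layer l, shape (n,)
    """
    a_prev = np.asarray(a_prev, dtype=float)
    W = np.asarray(W, dtype=float)
    b = np.asarray(b, dtype=float)
    return sigmoid(W @ a_prev + b)
```

Stacking several such calls, each with its own `W` and `b`, gives the forward pass of a multi-layer network; the number of layers and neurons are the structural choices listed for the ANN.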
[8] N. H. Niloy and M. A. I. Navid. "Naïve Bayesian Classifier and Classification Trees for the Predictive Accuracy of Probability of Default Credit Card Clients." American Journal of Data Mining and Knowledge Discovery 3.1 (2018): 1.
[9] A. Arshad, S. Riaz, and L. Jiao. "Semi-Supervised Deep Fuzzy C-Mean Clustering for Imbalanced Multi-class Classification." IEEE Access (2019).
[10] Ahram Online, "Egypt's Banque Misr conditionally suspends card usage abroad amid currency crisis," Economy - Business - Ahram Online. [Online]. Available: http://english.ahram.org.eg/News/246079.aspx. [Accessed: 10-Apr-2019].
[11] N. M. El Agroudy, F. A. Shafiq, and S. Mokhtar. "The effect of the rise in the dollar rate on the Egyptian economy." Sciences 5.02 (2015): 509-514.
[12] H. Hassan and A. Jreisat. "Does bank efficiency matter? A case of Egypt." International Journal of Economics and Financial Issues 6.2 (2016): 473-478.
[13] T. Hafez, "IN DEPTH-The ups and downs of the Egyptian pound," AmCham. [Online]. Available: https://www.amcham.org.eg/publications/business-monthly/issues/256/April-2017/3568/the-ups-and-downs-of-the-egyptian-pound. [Accessed: 09-Apr-2019].
[14] T. Perraju. "Artificial intelligence and decision support systems." International Journal of Advanced Research in IT and Engineering 2.4 (2013): 17-26.
[15] M. Kaur and N. Kaur. "Adaptive K-Means Clustering Techniques For Data Clustering." International Journal of Innovative Research in Science, Engineering, and Technology (2014).
[16] J. Wang and Su. Xiaolong. "An improved K-Means clustering algorithm." Communication Software and Networks (ICCSN), 2011 IEEE 3rd International Conference on. IEEE, 2011.
[17] F. Baser, S. Gokten, and P. O. Gokten. "Using fuzzy c-means clustering algorithm in financial health scoring." Audit Financiar 15.147 (2017): 385-394.
[18] S. Deb. "Application of Artificial Neural Networks (ANN)-In Designing SODEPUS (Study of Dynamic Earth Processes using Software)."
[19] Default of credit card clients Data Set, UCI Machine Learning Repository.
[20] B. K. Singh, K. Verma, and A. S. Thoke. "Investigations on impact of feature normalization techniques on classifier's performance in breast tumor classification." International Journal of Computer Applications 116.19 (2015).
[21] S. Ray and Rose H. Turi. "Determination of number of clusters in k-means clustering and application in colour image segmentation." Proceedings of the 4th international conference on advances in pattern recognition and digital techniques. 1999.
[22] S. Yang and H. Zhang. "Comparison of Several Data Mining Methods in Credit Card Default Prediction." Intelligent Information Management 10.05 (2018): 115.
[23] S. Imtiaz and A. J. Brimicombe. "A Better Comparison Summary of Credit Scoring Classification." International Journal of Advanced Computer Science and Applications 8.7 (2017): 1-4.
[24] M. Pasha, et al. "Performance comparison of data mining algorithms for the predictive accuracy of credit card defaulters." Int. J. Comput. Sci. Netw. Secur 17.3 (2017): 178-183.
[25] V. Pyzhov and S. Pyzhov. "Comparison of methods of data mining techniques for the predictive accuracy." (2017).

Emad Abd Elaziz Dawood was born in Sharkia, Egypt, in 1989. He received a bachelor's degree in information systems from the science valley academy in 2010. He is a teaching assistant in the higher valley institute of information systems. He is the Head of the Youth Welfare Authority in the science valley academy. He is currently pursuing a master's degree in information systems with the Arab Academy for Technology and Maritime (AASTMT), Cairo, Egypt. He is a Research Scholar with the Department of Computing and Information Technology, AASTMT. His main fields of research interests are data mining and machine learning.

Essamedean Elfakhrany received the B.S. and M.S. degrees from the Military Technical College (MTC), Cairo, Egypt, in 1986 and 1991, respectively, and the Ph.D. degree in System Engineering, The Ohio State University, Dec 1999. He is an Assoc. Professor in the Computer Science department at Arab Academy for Sciences, Technology, & Maritime Transport. His research interests include data science, ontological knowledge representation, semantic web, and IoT streaming data analytics. He is interested in teaching Artificial intelligence, knowledge management, Decision support systems and theory of computation.

FAHIMA A. MAGHRABY received the B.S. degree in Computer Science from AinShams University, Cairo, Egypt, in 2003, the M.S. degree in Computer Science from AinShams University, Cairo, Egypt, in 2008, and the Ph.D. degree in Computer Science from AinShams University, Cairo, Egypt, in 2014. From 2004 to 2014, she was a Lecturer Assistant in the Institute of Computer Science, Shorouk Academy, Cairo, Egypt. From 2014 till now, she is a lecturer in the Faculty of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport (AASTMT), Cairo, Egypt. Her research interest includes Bioinformatics, Imaging Processing, Artificial Intelligence, and Blockchain.