Customer Churn Prediction On Credit Card Services Using Random Forest Method
Proceedings of the 2022 7th International Conference on Financial Innovation and Economic Development (ICFIED 2022)
ABSTRACT
With the continuous development of the Internet, more and more people spend money online using credit cards; retaining customers in order to maintain profit margins has therefore become very important for many banks. This paper aims to predict credit card customer churn with machine learning methods and to provide feasible solutions to the customer churn issue based on the results. Three models, Random Forest, Logistic Regression and K-Nearest Neighbors (KNN), are applied to a dataset containing more than 10,000 records and 21 features. By tuning hyperparameters and evaluating the models with ROC & AUC and the confusion matrix, it is concluded that Random Forest performs best, with accuracy reaching 96.25%. Total transaction amount in the last 12 months, total transaction count in the last 12 months and total revolving balance are the three features with the most significant impact on the churn prediction. This shows that the more frequently customers use their credit cards, the less likely they are to leave, and with this model bank managers can proactively take action against customer churn.
learning approaches, it is possible to process and analyze large amounts of data. Some other papers, although they used models to predict the outcome, mainly focus on unsupervised learning, which in general is less reliable and has relatively poor interpretability [7].

In this paper, we obtain credit card holders' information from Kaggle: more than 10,000 records with 21 different features. We perform exploratory data analysis to examine distributions and visualize relationships between features. We then split the dataset into training and testing sets, followed by standardization. Three models are applied, Random Forest, Logistic Regression and KNN, to compare their performance, after which a method called grid search is used to tune hyperparameters. To select the optimal model, we evaluate with confusion matrices and ROC & AUC. We conclude that Random Forest performs best among the three models: it is approximately 5% higher in accuracy and much higher in other metrics such as precision and recall. We rank feature importance based on the Random Forest model and select the top three features that exert meaningful impacts on banks' decision making: the total transaction amount in the last 12 months, the total transaction count in the last 12 months and the total revolving balance on the credit card. We therefore learn that the more frequently a customer uses his or her credit card, the less likely he or she is to churn, which is intuitive and makes sense in daily life, and bank managers can adjust the card service accordingly.

In the rest of the paper, the methods are discussed first, followed by an introduction of the dataset, the preprocessing steps and the results of applying the models. Finally, three conclusions are presented, along with some innovations, a few shortcomings and directions for further improvement.

2. METHOD

2.1. Random Forest

Random forest is a supervised learning model. It was proposed by Breiman and Cutler in 2001, and is based on decision trees and ensemble learning [8]. A decision tree can describe complicated relationships between x and y rather than only simple linear relationships, and thus has stronger modeling power. However, a single tree is very sensitive to the training data and is therefore likely to overfit [9]. Ensemble learning solves this problem through a method called bagging: multiple learners are trained, each on a collection of bootstrap samples drawn randomly with replacement from the original dataset. This decreases variance by introducing randomness into the model framework, making the model more robust and the results more accurate and convincing. Specifically, each tree learns independently from a random sub-dataset and sub-features, and the final outcome is obtained by a deterministic averaging process, in other words the average (for classification, the majority vote) of the individual trees' predictions [10]. A simple example of a random forest is shown below as Figure 1.

Figure 1. Description of Random Forest. Notes: X1 to X4 are features and Y is the outcome to be predicted. The original dataset is split into several sub-sets, each containing fewer features and less data. Each sub-set is then used to train a tree that makes a prediction, and the deterministic averaging process determines the final result.

2.1.1. Random Forest Algorithm

Input: dataset D with N features; number of trees n.
Output: a random forest.
For i = 1 to n:
    Step 1: Draw a bootstrap sample from the original dataset D.
    Step 2: Grow a random forest tree on the bootstrap data by repeating the following steps until the minimum node size is reached:
        (1) Select a subset of √N features (variables).
        (2) Pick the best variable among the √N and split the node into left and right child nodes.

To make a binary classification prediction at a new point x, we can use the formula below, where Ĉ_m(x) denotes the class prediction of the m-th random forest tree:

    Ĉ_rf(x) = majority vote {Ĉ_m(x)}, m = 1, …, n    (1)

2.2. Logistic Regression

Logistic Regression is a linear model which connects X_1, …, X_p to the conditional probability P(Y = 1 | X_1, …, X_p) through this formula:

    P(Y = 1 | X_1, …, X_p) = exp(β_0 + β_1X_1 + … + β_pX_p) / (1 + exp(β_0 + β_1X_1 + … + β_pX_p))    (2)
Advances in Economics, Business and Management Research, volume 211
where Y stands for the binary outcome that we are interested in, and X_1, …, X_p are the features. β_0, β_1, …, β_p are regression coefficients, which are derived from the dataset by the maximum-likelihood method [11]. For a new instance, the βs are replaced by their estimated counterparts and the Xs by their realizations; if the probability P is greater than a threshold value, the new instance is assigned to the Y=1 class, and otherwise to Y=0. Usually the threshold is set to 0.5, which yields the so-called Bayes classifier.

2.3. K Nearest Neighbors

K Nearest Neighbors takes less time than the other two and is therefore a relatively simple model. It makes predictions directly from the training data by finding the k objects closest in distance to the input point, where k is a hyperparameter that can be adjusted to affect classifier performance; it then assigns the class by majority vote among these adjacent neighbors [12]. There are many ways to calculate the distance, for example the Euclidean distance and the Manhattan distance, with the former the most popular. The distance d between two points a and b can be calculated with the formula below:

    d(a, b) = √( Σ_{i=1}^{p} (a_i − b_i)² )    (3)

The picture below shows the principle of K Nearest Neighbors.

3. DATA AND EXPLORATORY DATA ANALYSIS

3.1. Basic Information about the Dataset

We found a relevant dataset on bank customer information on Kaggle, which consists of more than 10,000 records and includes 21 features such as age, income, marital status and credit card limit. 16 of them are numerical while 5 are categorical.

3.2. Exploratory Data Analysis

We conduct exploratory data analysis (EDA) to better understand the data: checking for missing and duplicated values, handling outliers, visualizing distributions and plotting graphs that relate the features to our target, whether the customer churned. Some important features are illustrated below.

3.2.1. Type of Card

As the table below shows, the card type held by the majority of customers is the Blue card, at 93.2%. In Figure 3, we split the data into two parts, so we can clearly visualize the relationship between card type and both currently existing and churned customers. The two groups follow the same pattern, with Blue card holders far outnumbering the others.

Table 1. Proportion of different card categories

Type       Percentage
Blue       93.2%
Silver     5.48%
Gold       1.15%
Platinum   0.197%

Figure 3. Relationship between customers and card type.
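Card-type proportions like those in Table 1 can be computed with a few lines of pandas. This is a minimal sketch on made-up counts matching the table; `Card_Category` is a hypothetical column name, not necessarily the one used in the actual Kaggle dataset.

```python
import pandas as pd

# Made-up sample standing in for the Kaggle credit card dataset;
# the real file and column names may differ.
df = pd.DataFrame({"Card_Category": ["Blue"] * 9320 + ["Silver"] * 548
                                    + ["Gold"] * 115 + ["Platinum"] * 17})

# Share of each card type in percent, as in Table 1.
shares = df["Card_Category"].value_counts(normalize=True).mul(100).round(2)
print(shares)  # Blue 93.2, Silver 5.48, Gold 1.15, Platinum 0.17
```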
The credit card limit is analyzed to see whether there are some extremely large values, i.e. outliers. For example, if some customers had a card limit of 1-2 million, significantly larger than the others, those records would need to be deleted. Fortunately, there are no outliers, and all the limits fall within a reasonable range.

Features            Notes
Education level     70.65% of customers gained a high school or higher education.
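The outlier screening described above can be sketched as follows, here with a common interquartile-range rule applied to made-up values; `Credit_Limit` is a hypothetical column name.

```python
import pandas as pd

# Made-up values standing in for the dataset's credit limit column.
df = pd.DataFrame({"Credit_Limit": [1438.3, 34516.0, 12691.0, 8256.0, 3418.0]})

# Flag values far outside the interquartile range (a common outlier rule).
q1, q3 = df["Credit_Limit"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Credit_Limit"] < q1 - 3 * iqr)
              | (df["Credit_Limit"] > q3 + 3 * iqr)]
print(len(outliers))  # 0 here: all limits fall in a reasonable range
```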
4. PREPROCESSING, APPLYING
MODELS AND RESULTS
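As a hedged sketch of the split-and-standardize preprocessing described in the introduction (scikit-learn on synthetic stand-in data; the paper does not state its exact split ratio, so an 80/20 split is assumed here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features and churn labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Split into training and testing sets, then standardize.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the training statistics
```

Fitting the scaler on the training set alone, then reusing its statistics on the test set, avoids leaking information from the test data into the model.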
Model             Result of 5-fold cross validation    After tuning hyperparameters
Random Forest     0.9610                               0.9568

The confusion matrix, also called the error matrix, is a standard format for accuracy evaluation and is represented by a matrix of 2 rows and 2 columns, as shown in Figure 7 [13].
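The 5-fold cross-validation, grid-search tuning and confusion-matrix evaluation used in this section can be reproduced in outline with scikit-learn. The sketch below runs on synthetic data with an illustrative hyperparameter grid; the paper's actual grid and dataset are not specified here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the churn dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation score of a baseline Random Forest.
rf = RandomForestClassifier(random_state=0)
cv_score = cross_val_score(rf, X_train, y_train, cv=5).mean()

# Grid search over a small, illustrative hyperparameter grid.
grid = GridSearchCV(rf, {"n_estimators": [50, 100], "max_depth": [5, None]}, cv=5)
grid.fit(X_train, y_train)

# 2x2 confusion matrix on the held-out test set.
cm = confusion_matrix(y_test, grid.predict(X_test))
print(cv_score, grid.best_params_, cm)
```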
the highest AUC, which means this model has the best prediction ability.

Table 5. AUC of ROC and recall ratio

                Random Forest    Logistic Regression    K-Nearest Neighbor
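AUC comparisons of the kind shown in Table 5 can be computed with scikit-learn's `roc_auc_score`; this is a minimal sketch on synthetic data, with the three classifiers named as in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the actual study uses the Kaggle churn dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbor": KNeighborsClassifier(),
}
aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC needs a score, not a hard label: use the positive-class probability.
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(aucs)
```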
Feature          Importance        Feature            Importance
Amount change    0.0633            Income level       0.0099
Card limit       0.0319            Marital status     0.0086
Age              0.0315            Gender             0.0079

As for total transaction amount and count, they are very similar: both reflect a customer's usage, since the bill could consist of either several big expenses or frequent small payments. It is quite intuitive that the more a customer uses his or her credit card, the less likely he or she is to leave the bank's services. In the process of using the card, customers may become more dependent on it or more satisfied with the services and products, and will therefore keep using it.

Identifying the factors affecting customer churn has always been a popular research topic, because it can help banks better retain existing customers and improve profits. Using logistic regression and decision trees, Abbas et al. found that customer relationship length, customer age, customer gender and the number of mobile banking transactions have an impact on customer churn [16]. Moreover, in the study of Mahdi et al., a Neural Network model showed that the loss of bank customers was related to their careers [17]. To determine the causes of customer churn in banking and e-banking services, Chiang et al. used association rules and analysis of customer transactions to find the most important churn patterns; their result shows that blind promotion is a major cause of customer loss [18].

Those findings differ from ours, and the reason is that the databases, methods and models used in each study are different. Abbas used a decision tree and Mahdi a Neural Network, whereas our study mainly uses Random Forest, which is a combination of several decision trees. Therefore, each study draws different conclusions.

5. CONCLUSION

This paper aims at predicting the loss of bank credit card customers. We obtained a dataset of 10,000 records containing age, salary, marital status, credit card limit, etc., and based our analysis and research on it. We first preprocess the dataset, then apply three rational classification models, specifically Random Forest, Logistic Regression and KNN, using 5-fold cross validation. We adjust the hyperparameters of each model to improve accuracy and use ROC & AUC and the confusion matrix to evaluate model performance. Both agree that Random Forest has the strongest predictive ability, and with it we find the three features that have the greatest impact on our prediction.

In total, three conclusions are obtained from the research. To begin with, the Random Forest model is the best of the three; although it has relatively low computational speed due to its complexity, its performance is approximately 5% higher in accuracy and 2% higher in recall. Secondly, using a better combination of hyperparameters can improve a model's performance. Finally, we check the feature importance of the dataset and find that the total transaction amount in the last 12 months, the total transaction count in the last 12 months and the total revolving balance on the credit card have significant impacts on the model's forecasting. This shows that the more frequently customers use their credit cards, the less likely they are to leave; therefore bank managers can adjust the credit card service accordingly to fight customer churn and increase the retention rate, and a higher retention rate brings greater profit growth. By using this model, they have plenty of lead time to take action to retain customers, e.g. by making promotions or offering coupons to encourage people to use their credit cards and cultivate the habit of using them.

There are some deficiencies as well. Firstly, machine learning offers numerous classification algorithms, such as neural networks, but we use only a few of them. Next, only one dataset, collected from a specific bank, is used, which might limit our model because it represents only part of the industry. Lastly, we use a single model to make predictions; combining multiple models' advantages through ensemble learning could give better performance.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS

Xinyu Miao conducted the research, Liguo Tang analyzed the data, and Haoran Wang wrote the paper. All authors approved the final version.

REFERENCES

[1] N. X. Hong, and L. Yi, "Standing at the crossroads - credit card," Reporters' Notes, vol. 5, pp. 41-43, 2020. (in Chinese)

[2] R. Rajamohamed, and J. Manokaran, "Improved credit card churn prediction based on rough clustering and supervised learning techniques," Cluster Computing, vol. 21, pp. 65-77, June 2017.

[3] G. L. Nie, W. Rowe, L. L. Zhang, Y. J. Tian, and Y. Shi, "Credit card churn forecasting by logistic regression and decision tree," Expert Systems with Applications, vol. 38, pp. 15273-15285, 2011.

[4] J. Liao, and Y. F. Ruan, "Research on APP Intelligence Promotion Decision Aiding System
Based on Python Data Analysis and AARRR Model," Journal of Physics: Conference Series, vol. 1856, pp. 1-7, 2021.

[5] M. Kehoe, H. B. Taylor, and D. Broderick, "Developing student social skills using restorative practices: a new framework called H.E.A.R.T," Social Psychology of Education, vol. 21, pp. 189-207, 2017.

[6] B. Roscher, B. Bohn, M. F. Duarte, and J. Garcke, "Explainable Machine Learning for Scientific Insights and Discoveries," IEEE Access, vol. 8, pp. 42200-42216, 2020.

[7] Q. F. Bi, K. E. Goodman, J. Kaminsky, and J. Lessler, "What is Machine Learning? A Primer for the Epidemiologist," American Journal of Epidemiology, vol. 188, pp. 2222-2239, October 2019.

[8] Y. A. Amrani, M. Lazaar, and K. E. E. Kadiri, "Random Forest and Support Vector Machine based Hybrid Approach to Sentiment Analysis," Procedia Computer Science, vol. 127, pp. 511-520, 2018.

[9] S. Y. Xuan, G. J. Liu, Z. C. Li, L. T. Zheng, S. Wang, and C. J. Jiang, "Random Forest for Credit Card Fraud Detection," 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), pp. 1-6, 2018.

[10] T. Hengl, M. Nussbaum, M. N. Wright, G. B. M. Heuvelink, and B. Gräler, "Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables," PeerJ, vol. 6, p. e5518, 2018.

[11] R. Couronné, P. Probst, and A. L. Boulesteix, "Random forest versus logistic regression: a large-scale benchmark experiment," BMC Bioinformatics, vol. 19, pp. 270-283, July 2018.

[12] A. Singh, M. N. Halgamuge, and R. Lakshmiganthan, "Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 8, pp. 1-10, 2017.

[13] N. Yang, Y. Qian, H. S. EL-Mesery, R. Zhang, A. Wang, and J. Tang, "Rapid detection of rice disease using microscopy image identification based on the synergistic judgment of texture and shape features and decision tree-confusion matrix method," Journal of the Science of Food and Agriculture, vol. 99, no. 14, pp. 6589-6600, 2019.

[14] J. H. Orallo, P. Flach, and C. Ferri, "ROC curves in cost space," Machine Learning, vol. 93, no. 1, pp. 71-91, 2013.

[15] P. Flach, J. H. Orallo, and C. Ferri, "A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance," ICML, pp. 657-664, June 2011.

[16] A. Keramati, H. Ghaneei, and S. M. Mirmohammadi, "Investigating factors affecting customer churn in electronic banking and developing solutions for retention," International Journal of Electronic Banking, vol. 2, no. 3, pp. 185-204, November 2020.

[17] S. H. Iranmanesh, M. Hamid, M. Bastan, G. H. Shakouri, and M. M. Nasiri, "Customer churn prediction using artificial neural network: An analytical CRM application," in Proceedings of the International Conference on Industrial Engineering and Operations Management, Pilsen, Czech Republic, pp. 23-26, July 2019.

[18] D. Chiang, Y. Wang, S. Lee, and C. Lin, "Goal-oriented sequential pattern for network banking churn analysis," Expert Systems with Applications, vol. 25, no. 3, pp. 293-302, 2003.