
Data Science and Management 7 (2024) 7–16

Contents lists available at ScienceDirect

Data Science and Management


journal homepage: www.keaipublishing.com/en/journals/data-science-and-management

Research article

Investigating customer churn in banking: a machine learning approach and visualization app for data science and management

Pahul Preet Singh a, Fahim Islam Anik b, Rahul Senapati a, Arnav Sinha a, Nazmus Sakib c,*, Eklas Hossain d

a Institute for Artificial Intelligence and Data Science, University at Buffalo, The State University of New York, Amherst, 14068, United States
b Department of Mechanical Engineering, Khulna University of Engineering & Technology, Khulna, 9203, Bangladesh
c Department of Information Technology, Kennesaw State University, Marietta, 30067, United States
d Department of Electrical and Computer Engineering, Boise State University, Boise, 83725, United States

ARTICLE INFO

Keywords:
Bank customer attrition
Churn prediction
Machine learning
XGBoost
Random forest

ABSTRACT

Customer attrition in the banking industry occurs when consumers quit using the goods and services offered by the bank for some time and, after that, end their connection with the bank. Customer retention is therefore essential in today's extremely competitive banking market. Additionally, a solid customer base helps attract new consumers by fostering confidence and referrals from the current clientele. These factors make reducing client attrition a crucial step that banks must pursue. In our research, we aim to examine bank data and forecast which users are most likely to discontinue using the bank's services and cease to be paying customers. We use various machine learning algorithms to analyze the data and present a comparative analysis across different evaluation metrics. In addition, we developed a Data Visualization RShiny app for data science and management of customer churn analysis. Analyzing these data will help the bank identify attrition trends and then try to retain customers on the verge of attrition.

Peer review under responsibility of Xi'an Jiaotong University.
* Corresponding author. E-mail address: [email protected] (N. Sakib).
https://doi.org/10.1016/j.dsm.2023.09.002
Received 16 January 2023; Received in revised form 12 September 2023; Accepted 18 September 2023; Available online 28 September 2023
2666-7649/© 2023 Xi'an Jiaotong University. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Customer attrition, also known as customer churn, is the phenomenon where customers terminate their relationship with a business or organization. In the context of banking, customer attrition occurs when customers close their accounts or discontinue utilizing the services of a particular bank. Effectively understanding and managing customer attrition are crucial for banks to maintain financial stability and safeguard their reputation. The financial impact of customer attrition on banks can be significant, resulting in potential revenue loss across various banking services. Consequently, establishing and nurturing long-term customer relationships is highly valuable for banks. By gaining insight into attrition patterns, banks can identify customers at risk of leaving and implement strategies to retain them. This approach enhances overall customer lifetime value and bolsters bank profitability.

Moreover, customer attrition has repercussions on a bank's reputation and brand perception. High churn rates often indicate underlying issues, such as poor customer experience, inefficient processes, or a lack of competitive products and features. Therefore, understanding and managing customer attrition are crucial for banks to address these challenges and enhance their overall customer experience. Within the competitive banking industry, monitoring and managing customer attrition can provide banks with valuable insights into customer preferences, needs, and pain points. This knowledge can help banks develop targeted strategies to differentiate themselves from their competitors and enhance customer retention.

The Data Visualization RShiny app, a web application built with the R programming language, plays an important role in analyzing customer churn. It empowers users to interact with churn-related data through interactive visualizations and dashboards, fostering a deeper understanding of the data and enabling the identification of patterns and trends related to customer attrition. The app offers real-time monitoring capabilities by connecting to live data sources, allowing banks to track attrition rates, customer behavior, and other relevant indicators in real time.

This feature enables prompt action and decision making. Additionally, the app supports comparative analysis, facilitating the comparison of different customer segments based on demographics, product usage, or behavior. This functionality provides valuable insights into which segments are more susceptible to churn and guides the development of targeted retention strategies. Predictive modeling is another crucial aspect of the application. It integrates machine-learning (ML) algorithms or statistical models to forecast customer churn. By generating churn predictions and visualizing the probability of churn for individual customers, the app assists banks in identifying high-risk customers and taking proactive measures to prevent churn. Furthermore, the app facilitates reporting and communication by allowing the creation of customized reports and presentations, enabling stakeholders to easily share churn insights and recommendations with management, marketing teams, or other relevant parties. The workflow of the Data Visualization RShiny app for data science and management involves the functions and processes listed in Table 1.

Table 1
Workflow of the data visualization RShiny app for data science and management.

Step | Functionalities
1. Data input | Allows users to input relevant data sources; supports various formats (CSV, Excel, and databases).
2. Data preprocessing | Cleans and prepares the data for analysis; handles missing values, normalization, and other features.
3. Interactive filtering | Enables users to filter and select variables; focuses on specific subsets of the data.
4. Visualizations | Generates a variety of visualizations, including scatter plots, bar charts, and line graphs.
5. Comparative analysis | Presents insights on customer churn patterns; allows comparison of metrics and customer segments.
6. Predictive modeling | Integrates ML for churn prediction; analyzes churn rates, customer behavior, and other parameters; visualizes the probability of churn for individual customers.
7. Real-time monitoring | Connects to live data sources; updates visualizations and metrics in real time.
8. Reporting and exporting | Generates customized reports; exports visualizations and analysis results.
9. Decision support | Provides interactive dashboards and visuals; supports data-driven decision making.
10. Outputs | Interactive visualizations and analysis results; comparative analysis outcomes; predictive churn models with real-time monitoring updates; customized reports and exported data.

The success of any business model relies on having a large customer base, which entails achieving two primary objectives: acquiring new customers and retaining existing ones. Winning new customers involves designing products and advertising them to the appropriate demographics. The second challenge, retaining customers, is essential for any business model to thrive, as lost customers are highly unlikely to return. Our problem statement primarily addresses the concern of maintaining customers and predicting their patterns, which eventually contributes to solving the customer attrition problem (De Caigny et al., 2020; Lee and Shin, 2020; Shirazi and Mohammadi, 2019). To address customer attrition, previous studies have discussed customer relationship management (CRM) systems and three approaches for retention (De Caigny et al., 2020; De Lima Lemos et al., 2022; Rahman and Kumar, 2020). These articles include post-purchase evaluations, periodic satisfaction surveys, and continuous satisfaction tracking, and they provide an excellent foundation for exploring the reasons for customer dissatisfaction.

This study aims to extend the scope of the aforementioned CRM systems, with a primary focus on identifying and predicting the likelihood of customer attrition (Amuda and Adeyemo, 2019; Domingos et al., 2021; Ho et al., 2019). The findings of this study can be applied in real-world scenarios to assist banks in detecting customer defection and taking preventive measures to retain such customers (Geiler et al., 2022; Machado and Karray, 2022). In a related study (Lemmens and Gupta, 2020), the authors discussed managing churn to maximize profits and investigated the profit-loss ratio concerning when customers stop using products. Apart from practical applications for predicting bank customer attrition, this study helps establish a starting point for conducting further research in this field (Al-Mashraie et al., 2020; Baghla and Gupta, 2022; Schaeffer and Sanchez, 2020).

Successful financial firms must provide useful customer assistance. By examining how consumers use goods, employing ML in the financial sector enables businesses to provide customized offers and services that cater to customer demands. The primary role of ML in customer retention is to monitor and forecast customer turnover by tracking behavioral changes. Research has revealed that acquiring new customers costs significantly more than retaining existing ones. ML enables firms to recognize clients who are on the verge of leaving and take prompt action to retain them. Additionally, it can help boost client trust and maintain or extend customer engagement, whether the customer has forgotten about the service or has had a bad experience. Having a model that provides insights to banks about customers likely to leave will assist them in taking the necessary steps and specifically targeting these customers rather than expending resources tracking all customers (De Lima Lemos et al., 2022; Guliyev and Tatoglu, 2021; He et al., 2014; Karvana et al., 2019; Patil and Dharwadkar, 2017). Practically, employing such a model to predict the likelihood of customer attrition allows banks to focus on this group of customers (Dias et al., 2020; Vo et al., 2021). However, most studies do not adequately compare various ML techniques to help banks make informed decisions based on a comparison of the results and domain knowledge. Moreover, the suitable adaptation of a particular application for this purpose, and its representation, have not been well demonstrated in the literature.

This study holds potential implications for stakeholders in the banking industry. Stronger customer retention methods will lead to personalized offers, improved customer service, and customized banking solutions, all of which will enhance customer experiences. By deploying resources and training programs targeted at enhancing customer service, employees may benefit from better work environments and greater job satisfaction. As customer churn decreases, and customer lifetime value and profitability increase, shareholders should anticipate improved financial performance. Additionally, implementing the research findings can boost a bank's image and brand impression, attract new clients, and encourage long-term company growth. Furthermore, the study emphasizes the significance of data-driven decision making, enabling stakeholders to make informed decisions based on churn analysis insights and encouraging an industry-wide culture of evidence-based decision making. These outcomes will support the overall expansion and achievements of banking institutions.

This study makes several important contributions to the body of knowledge regarding customer churn analysis and ML in the banking sector. First, the article presents a comprehensive preprocessing method that guarantees data correctness and consistency, addressing the critical issue of data preparation unique to customer churn analysis in banking. Second, the study thoroughly examines different ML methods and evaluates their effectiveness in anticipating customer attrition; this comparative analysis offers valuable insights into the performance of various churn prediction algorithms in the banking industry. Furthermore, by offering a user-friendly tool for displaying churn-related insights, the development of the Data Visualization RShiny app enhances the practical application of churn analysis. Finally, this study yields useful implications for banks, emphasizing the importance of understanding customer attrition and providing practical recommendations to improve client retention.

The unique contributions of this study are summarized as follows:

(1) It presents a comprehensive preprocessing approach that effectively unifies diverse data in a consistent format.


(2) It conducts a thorough investigation of different ML algorithms for the specific purpose of predicting bank customer attrition.
(3) An application is developed to provide stakeholders with extensive visualizations, empowering them to make informed decisions.

The remainder of the article comprises four sections: materials and exploratory data analysis; theory and approach; results and discussion; and conclusions. Fig. 1 summarizes the key contributions of the study.

Fig. 1. Visual summary of the scientific contribution of the study.

2. Materials and exploratory data analysis

This section provides an overview of the dataset used, as well as the different analyses and summaries, to better understand the different parameters contributing to the prediction modeling. Table 2 lists the notations and descriptions used in this study.

Table 2
Notations and descriptions.

Notation | Description
CRM | Customer relationship management
KYC | Know your customer
SVM | Support vector machine
XGBoost | eXtreme gradient boosting
GLM | Generalized linear model
CV | Cross-validation
AUC | Area under the ROC curve
ROC | Receiver operating characteristic curve
TP | True positives
FP | False positives
TN | True negatives
FN | False negatives
Accuracy | (TP + TN)/(TP + TN + FP + FN)
Sensitivity | TP/(TP + FN)
Specificity | TN/(TN + FP)
Precision | True positive/(true positive + false positive)
Recall | True positive/(true positive + false negative)
SMOTE | Synthetic minority oversampling technique

2.1. Description of dataset

The dataset (https://www.kaggle.com/code/kmalit/bank-customer-churn-prediction/data) used in this study comprises 10,000 rows of customers. Typically, each bank has an elaborate onboarding process, with a know your customer (KYC) assessment conducted for every new customer. Several critical processes or steps are involved during onboarding, which ensures that the bank data obtained can be considered complete and reliable. Consequently, all the necessary customer information acquired is accessible and legitimate. Each customer is identified by a unique customer ID and an associated surname. The dataset includes customer details, such as credit score, age, tenure, balance, number of products, and estimated salary. The data include Boolean measurements, coded as 0 or 1, and other fields with two or more classes. These can be classified as follows: country, gender, has a credit card, is an active member, and churned. The final column, "Exited", records the current state of the customer, and 1 implies that customer attrition occurred. We aimed to feed the bank data into a model and determine whether the outcome, the exit column, becomes 1. The captured data vary according to the customer's location, economic status, and gender. The number of products a user holds is proportional to how loyal and profitable the customer is to the bank. A mix of such wide-ranging data helps to draw factual and statistically accurate inferences. Categorical data were converted into numerical form to prevent information loss during modeling. "RowNumber", "CustomerID", and "Surname" were removed from our dataset, as they are not pertinent to our analysis. Table 3 summarizes the data.

Table 3
Variables, null count, and unique count of the dataset.

Variables | Null count | Unique count
RowNumber | 0 | 10,000
CustomerID | 0 | 10,000
Surname | 0 | 2,932
CreditScore | 0 | 460
Geography | 0 | 3
Gender | 0 | 2
Age | 0 | 70
Tenure | 0 | 11
Balance | 0 | 6,382
NumOfProducts | 0 | 4
HasCrCard | 0 | 2
IsActiveMember | 0 | 2
EstimatedSalary | 0 | 9,999
Exited | 0 | 2

2.2. Dataset analysis

We preprocessed the dataset to effectively unify and visualize the diverse input data parameters in a consistent format.

(a) Customer churn distribution. The pie chart in Fig. 2 depicts the distribution of our dependent variable (churned) in the dataset: 80% of the records are "not churned" customers, and 20% are "churned". Thus, roughly every fifth customer churned, and our dataset is highly imbalanced.


(b) Gender, active members, credit cards, and country-based analysis. Fig. 3 presents valuable insights regarding gender, active members, credit cards, and country-based analyses. We observed that out of 4,543 female customers, 1,139 churned (~25%), whereas for the male population, 898 out of 5,457 churned (~16%). In addition, about one-fifth of all customers churned irrespective of whether they owned credit cards. The likelihood of attrition for inactive customers is almost double that of active customers (27% inactive vs. 14% active). Germany has the highest customer churn rate (40%), followed by France (16%) and Spain (16%).

Fig. 2. Customer churn distribution.

Fig. 3. Histograms of the dataset. (a) Gender vs. churn; (b) customers having credit card vs. churn; (c) active member vs. churn; (d) country vs. churn.

(c) Balance, owned product quantity, credit score, and tenure-based analysis. Fig. 4 illustrates density plots of the balance, owned product quantity, credit score, and tenure. We found that customers who maintain a balance of more than 85,000 are more likely to churn; premium accounts and higher savings interest at other banks could be the root causes of this. Customers owning two of the bank's products were significantly less likely to leave. Numerical factors such as credit score and tenure do not, in general, impact the customer attrition rate. However, customers with a poor credit history (that is, a credit score below 400) will almost certainly leave the bank, which is visible in the scatterplot in Fig. 5: there are only blue dots (churned customers) below a credit score of 400, which could be due to the weak economic status of these customers.

It is evident from the box plot presented in Fig. 6 that older customers (above the age of 40) are more likely to churn than younger customers. The bee-swarm plot (Fig. 5) supports this observation, as more pink dots (churned customers) are present in the age range of 40–70. This could be due to better plans offered to seniors at other banks.

(d) Correlation matrix analysis of the dataset. It appears from Fig. 7 that no pair of variables is strongly correlated, thus helping to satisfy the fundamental assumption (absence of multicollinearity) of modeling (Alin, 2010; Kim, 2019; Mansfield and Helms, 2012). The only notable correlation observed was between the number of products and the balance.

Now that we have a comprehensive understanding of the different significant parameters of the dataset, the next section describes the theoretical and methodological approach for predicting bank customer attrition.

3. Theory, approach, and development

This section introduces different ML techniques and applies them to the dataset. We focused on core ML approaches, including logistic regression, support vector machine (SVM), random forest, and eXtreme Gradient Boosting (XGBoost). We then introduce a visualization tool for stakeholders, tailored explicitly to summarize the different data in a single mode.

3.1. ML techniques

We briefly discuss the theoretical underpinnings of the ML techniques used in this research. First, logistic regression is a fundamental classification technique that is crucial for predicting customer churn.


Fig. 4. Density plots of (a) observed balance, (b) owned product quantity, (c) credit score, and (d) tenure.

Fig. 5. Distribution of customers based on credit score and age.

The logistic curve equation is as follows:

P = \frac{e^{a + bX}}{1 + e^{a + bX}}    (1)

The rolling mean of the dependent variable, P(Ȳ), and the independent variable, X, are connected by the logistic curve. In Eq. (1), e is the base of the natural logarithm (approximately 2.718), a and b are the model parameters, and P is the probability. When X is zero, the value of a determines P, and the value of b regulates how rapidly the probability changes when X changes by a single unit (we can have standardized and unstandardized b weights in logistic regression, as in ordinary linear regression). The coefficient b is not as easily interpreted in this model as in a typical linear regression, because the relationship between X and P is not linear.

In the case of random forest, subsets of the original data must first be made with row sampling and feature sampling to create training datasets. Subsequently, an individual decision tree is created for each subset. Finally, considering the output of each decision tree, the majority vote is counted, as shown in Fig. 8. Random forest thus assigns each individual to a class for classification. The advantage of using random forest for classification is its high accuracy. However, ensuring the robustness of the model (or its generalizability) in predicting unknown data remains challenging.

An SVM provides the distance to the decision border, and several steps must be undertaken to convert it to a probability (Cervantes et al., 2020). When applied to specific issues, one technique may outperform another (Anton et al., 2019; Sothe et al., 2020).

The gradient-boosted tree algorithm, which is a supervised learning approach, is based on function approximation by optimizing certain loss functions and employing a number of regularization approaches. XGBoost is one of the best-known and most practical implementations of this algorithm. To obtain a predictive analysis, the following objective function (loss function plus regularization) must be minimized at iteration t, as shown in Eq. (2) (Chen and Guestrin, 2016):

\zeta^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(X_i)\big) + \Omega(f_t)    (2)

Here, \zeta^{(t)} represents the t-th iteration of the ensemble model. The summation on the right-hand side runs over all the data points in the dataset, where n is the number of data points. l\big(y_i, \hat{y}_i^{(t-1)} + f_t(X_i)\big) is the loss function, which measures the difference between the true target value and the predicted target value. \Omega(f_t), representing the regularization or complexity penalty, is applied to the t-th model to prevent overfitting.

3.2. Approach toward churn prediction

Four pipelines were constructed, one for each model: logistic regression, random forest, SVM, and XGBoost. The models were then fitted to a training dataset. We orchestrated model training by constructing a two-step workflow pipeline:

• Normalize the features to bring them onto the same scale;
• Instantiate the model to fit.

We then configured the grid-search parameters for each model, set up four grid-search cross-validation (CV) routines that used the pipeline and parameters as inputs, and finally fitted all grids to the training dataset. For the classification, we initially used a logistic regression model (Pearce and Ferrier, 2000). To compute the significant features, we used the "glm" function (Manning, 2007). We achieved an accuracy of 85%, a sensitivity of 38%, a specificity of 97%, and an AUC of 0.83 when we applied the model to our validation data (Statinfer, 2017), as shown in Fig. 9. The large discrepancy between sensitivity and specificity was most likely due to the large imbalance in the data, which caused bias in the model (He and Garcia, 2009; Mazumder, 2021).
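As an illustration of the two-step pipeline and grid-search workflow described in Section 3.2, the sketch below shows how it could be set up in Python with scikit-learn and the xgboost package. The parameter grids, random seeds, and scoring choice are illustrative assumptions rather than the exact settings used in the study.

```python
# Sketch of the Section 3.2 workflow: scale features, wrap each model in a
# pipeline, and tune it with grid-search cross-validation (illustrative grids).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

def build_grids(X_train, y_train):
    # One (estimator, parameter grid) pair per model, as described in the paper.
    configs = {
        "logistic_regression": (LogisticRegression(max_iter=1000),
                                {"model__C": [0.1, 1.0, 10.0]}),
        "svm": (SVC(probability=True),
                {"model__C": [0.1, 1.0, 10.0], "model__kernel": ["rbf", "linear"]}),
        "random_forest": (RandomForestClassifier(random_state=42),
                          {"model__n_estimators": [100, 300],
                           "model__max_depth": [None, 10]}),
        "xgboost": (XGBClassifier(eval_metric="logloss", random_state=42),
                    {"model__n_estimators": [100, 300],
                     "model__learning_rate": [0.05, 0.1]}),
    }
    grids = {}
    for name, (estimator, param_grid) in configs.items():
        pipe = Pipeline([("scale", StandardScaler()),   # step 1: normalize features
                         ("model", estimator)])          # step 2: instantiate the model
        grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
        grid.fit(X_train, y_train)                       # fit every grid on the training split
        grids[name] = grid
    return grids
```

The best pipeline for each model (for example, `grids["random_forest"].best_estimator_`) can then be evaluated on a held-out validation split.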


Fig. 6. Distribution of age of customers.

Fig. 7. Correlation matrix of the dataset.

As the data were not balanced (Kaur et al., 2019; Sun et al., 2009), we used the synthetic minority oversampling technique (SMOTE) (Torgo et al., 2013), and the models were fitted and evaluated both before and after this treatment of the data imbalance. SMOTE (Chawla et al., 2002; Fernandez et al., 2018; Han et al., 2005) is an improved method for managing imbalanced data in classification problems that performs data augmentation by creating synthetic data points based on the original data points. This choice was made because methods such as undersampling could cause a potential loss of information. Customer churn was then predicted by fitting and evaluating multiple models after addressing the imbalance using SMOTE. This answered two of the research questions posed, namely, the prediction of customer churn for an imbalanced dataset and the examination of multiple models to obtain the most reliable one. Some of the columns, such as "RowNumber" and "Surname", were irrelevant; as a result, we removed them. Along with these changes, three new variables were created, namely "TenureByAge", "BalanceSalaryRatio", and "CreditScoreGivenAge", to enhance the performance of the ML models. Removing irrelevant data and this feature engineering also addressed another research question, namely, selecting pertinent attributes for evaluating the model and removing outliers. After a thorough examination, the best model was chosen based on a particular combination of metrics that contributed to the implementation of an adaptable model capable of addressing the dynamic nature of data patterns. In addition, because the data do not contain any information that could be publicly traced back to a person, privacy has been maintained.

3.3. Application development for visualization and decision making

Since our research aimed to provide a practical implementation, we deployed an application using Plotly (Van Der Donckt et al., 2022) and Python to demonstrate our model calculations and the possible outcomes of customer analysis. The interactive application interface comprises three primary tabs: data analysis, model analysis, and prediction. The data analysis tab shows information on both categorical and numerical attributes, and the user can select attributes. The categorical representation, as shown in Fig. 10, includes a donut chart indicating the percentage of presence or absence of a particular attribute and a bar chart indicating how the presence or absence of a particular attribute varies with churn. The numerical representation in Fig. 11 comprises a density plot, a scatter plot against age, and a box plot of the selected attribute. Fig. 10 shows the data analysis tab for "Number of Products" (categorical) and "CreditScore" (numerical). The model analysis tab in Fig. 12 shows all evaluation metrics, including accuracy, sensitivity, specificity, AUC, F1 score, test-train data split percentage, and feature importance for all models. Finally, the prediction tab shown in Fig. 13 lets the user set the independent variables to values of their choice to obtain the dependent variable (which, in this case, is churn). Using this feature, we can obtain the churn prediction from all the models mentioned above; Fig. 13 shows this tab for an arbitrarily chosen set of inputs. These features help the application user draw inferences and make knowledgeable decisions.

4. Results and discussion

In this section, we discuss the findings of the aforementioned ML models. The evaluation metrics used include the accuracy, sensitivity, specificity, AUC, and F1 score (Sokolova et al., 2006). Table 4 presents the model performance on these evaluation metrics. Because the main objective is to predict churn, the combination of sensitivity and accuracy is a far more sensible choice than assigning equal weight to all the other metrics. It is evident from Table 4 that the sensitivity of all models was not sufficient: the maximum and minimum values are 44% and 17.3%, respectively. Since accuracy and sensitivity are the most relevant metrics, data treatment was required in order to draw meaningful inferences. After SMOTE was performed, as shown in Table 5, we observed that XGBoost performed the best in terms of accuracy at 83.9%, followed closely by random forest at 78.3%. A similar trend was observed for the F1 score (XGBoost leads at 0.613, followed by random forest at 0.577), specificity (XGBoost leads at 90.3%, followed by random forest at 80.7%), and AUC (XGBoost leads at 0.847, followed by random forest at 0.831). The highest sensitivity was observed for logistic regression (71.4%), followed by random forest (69.3%). Since sensitivity and accuracy are the most relevant metrics, it is best to choose random forest because it exhibits reasonably good performance in both metrics.
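A minimal sketch of the preprocessing and imbalance treatment described in Section 3.2 is given below, assuming the column names listed in Table 3 and the imbalanced-learn package. The exact formulas for the three engineered features are illustrative assumptions based only on the variable names given in the text.

```python
# Sketch: drop identifier columns, derive the three engineered features named in
# Section 3.2, encode categoricals, and oversample the minority class with SMOTE.
import pandas as pd
from imblearn.over_sampling import SMOTE

def prepare_churn_data(df: pd.DataFrame):
    df = df.drop(columns=["RowNumber", "CustomerID", "Surname"])

    # Engineered features named in the paper; these exact formulas are assumptions.
    df["TenureByAge"] = df["Tenure"] / df["Age"]
    df["BalanceSalaryRatio"] = df["Balance"] / df["EstimatedSalary"]
    df["CreditScoreGivenAge"] = df["CreditScore"] / df["Age"]

    # One-hot encode the categorical fields (Geography, Gender).
    df = pd.get_dummies(df, columns=["Geography", "Gender"], drop_first=True)

    X = df.drop(columns=["Exited"])
    y = df["Exited"]

    # SMOTE synthesizes minority-class ("churned") points instead of discarding data,
    # which is why the paper prefers it over undersampling.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    return X_res, y_res
```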


Fig. 8. Random forest classifier with dataset.
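To make the mechanism sketched in Fig. 8 concrete, the toy example below builds a hand-rolled ensemble with row sampling, feature sampling, and a majority vote over per-tree predictions. It is for illustration only; the study itself uses the standard random forest implementation. X and y are assumed to be NumPy arrays with binary labels.

```python
# Toy illustration of the random-forest idea from Section 3.1: each tree sees a
# row-sampled and feature-sampled subset, and the class with the most votes wins.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_random_forest(X, y, n_trees=25, feature_frac=0.6, seed=0):
    rng = np.random.default_rng(seed)
    n_rows, n_feats = X.shape
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)          # bootstrap row sample
        feats = rng.choice(n_feats, size=max(1, int(feature_frac * n_feats)),
                           replace=False)                    # random feature subset
        tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, feats], y[rows])
        trees.append((tree, feats))
    return trees

def toy_forest_predict(trees, X):
    votes = np.array([tree.predict(X[:, feats]) for tree, feats in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)           # majority vote over trees
```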

Fig. 9. Logistic regression model statistics.
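Statistics like those reported for the logistic regression model in Fig. 9 follow directly from the confusion-matrix definitions in Table 2. The helper below is a sketch that assumes a fitted binary classifier exposing predict and predict_proba, for example the best estimator from one of the grid searches sketched earlier.

```python
# Sketch: compute the Table 2 metrics (accuracy, sensitivity, specificity,
# precision, AUC) for a fitted binary classifier on a validation split.
from sklearn.metrics import confusion_matrix, roc_auc_score

def churn_metrics(model, X_val, y_val):
    y_pred = model.predict(X_val)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),          # recall on the churned class
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
        "auc":         roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]),
    }
```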

Fig. 10. Categorical attributes analysis.
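A rough sketch of how a categorical view like Fig. 10 could be produced with Plotly, the plotting library the authors mention, is shown below. The column names follow Table 3; the chart layout and the helper itself are assumptions, not the app's actual code.

```python
# Sketch: donut chart of a categorical attribute plus a bar chart of churn rate
# ("Exited") within each category, similar to the app's data analysis tab.
import pandas as pd
import plotly.express as px

def categorical_view(df: pd.DataFrame, column: str = "NumOfProducts"):
    counts = df[column].value_counts().reset_index()
    counts.columns = [column, "count"]
    donut = px.pie(counts, names=column, values="count", hole=0.5,
                   title=f"Share of customers by {column}")

    churn_rate = (df.groupby(column)["Exited"].mean().reset_index()
                    .rename(columns={"Exited": "churn_rate"}))
    bars = px.bar(churn_rate, x=column, y="churn_rate",
                  title=f"Churn rate by {column}")
    return donut, bars
```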


Fig. 11. Numerical attributes analysis.

Fig. 12. Machine learning (ML) model statistics.

Fig. 13. Prediction for a given set of inputs.
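The prediction tab in Fig. 13 essentially scores one user-specified customer with every fitted model. A minimal sketch of that step is shown below; the `grids` dictionary is the hypothetical output of the grid-search sketch above, and the example values are arbitrary.

```python
# Sketch: score a single user-specified customer with every fitted model, as the
# prediction tab does. The row must contain the same engineered/encoded columns
# that the models were trained on.
import pandas as pd

def predict_single_customer(grids: dict, feature_row: dict) -> dict:
    row = pd.DataFrame([feature_row])
    return {name: float(grid.predict_proba(row)[:, 1][0])   # churn probability
            for name, grid in grids.items()}
```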


Table 4
Evaluation metrics (without SMOTE) comparison for different approaches.

Metrics | Logistic regression | Support vector machine | Random forest | XGBoost
Accuracy | 0.793 | 0.802 | 0.844 | 0.852
Specificity | 0.961 | 0.987 | 0.977 | 0.963
Sensitivity | 0.173 | 0.119 | 0.353 | 0.440
AUC | 0.763 | 0.750 | 0.831 | 0.842
F1 score | 0.263 | 0.205 | 0.491 | 0.559

Note: synthetic minority oversampling technique (SMOTE); area under the receiver operating characteristic (ROC) curve (AUC).

Table 5
Evaluation metrics (with SMOTE) comparison for different approaches.

Metrics | Logistic regression | Support vector machine | Random forest | XGBoost
Accuracy | 0.691 | 0.719 | 0.783 | 0.839
Specificity | 0.685 | 0.731 | 0.807 | 0.903
Sensitivity | 0.714 | 0.672 | 0.693 | 0.601
AUC | 0.767 | 0.765 | 0.831 | 0.847
F1 score | 0.497 | 0.505 | 0.577 | 0.613

Note: synthetic minority oversampling technique (SMOTE); area under the receiver operating characteristic (ROC) curve (AUC).

Another noteworthy advantage is that the random forest technique can handle large datasets owing to its ability to work with many variables. In addition, logistic regression is very sensitive to outliers, as opposed to random forest, which further justifies the latter's choice. We can also draw valuable insights from the exploratory data analysis. The chance of churning increases by 0.93 when the customer is German, compared with French and Spanish customers (Table 5). Additionally, the odds of churning are reduced by 0.5 when the customer is male rather than female. The likelihood of a customer leaving the bank decreases by 1.5 if the customer owns two products, whereas the probability of leaving increases by 2.5 if the customer owns three bank products. Customers who maintain a balance of more than 85,000 are more likely to churn; premium accounts and higher savings interest at other banks could be root causes of this. Similarly, customers with two bank products are more likely to stay. Crucial inferences can be drawn from these results: XGBoost achieved the best accuracy, at 83.9%; the highest sensitivity, 71.4%, was observed for logistic regression; and random forest exhibited the best overall balance, with an accuracy of 78.3% and a sensitivity of 69.3%. Random forest also has the advantage of handling large datasets and accommodating many features.

The key findings of this research therefore indicate that the random forest model performed best in terms of the combination of accuracy (78.3%) and sensitivity (69.3%), while XGBoost achieved the highest accuracy, at 83.9%. The analysis revealed critical factors influencing customer churn, such as nationality, gender, number of products owned, and account balance. For example, German customers were found to have a higher likelihood of churning than French or Spanish customers. Male customers have a lower probability of churning than female customers. Additionally, customers with two bank products are more likely to stay, whereas those with three products are more likely to churn. Customers maintaining a balance above 85,000 are also more likely to churn, potentially because of attractive offerings from other banks.

Banks can leverage this information to improve customer retention through the implementation of targeted strategies. For instance, they can focus on retaining German customers through personalized or tailored services. They can also offer incentives or rewards to customers who own multiple bank products to increase loyalty. Moreover, banks can enhance their efforts to understand and cater to female customers' specific needs and preferences to reduce churn. By identifying customers with high account balances, banks can proactively offer premium account benefits or personalized financial solutions to mitigate the risk of attrition. Other ways in which banks might utilize the findings of the present study include the following:

• Proactive customer retention strategies. By using ML algorithms and the insights derived from this research, banks can identify the customers who are most likely to churn. This will enable them to implement proactive retention strategies such as personalized offers, targeted communication, or enhanced customer support for those at risk of attrition.
• Enhanced customer experience. Understanding the key factors contributing to customer attrition can help banks address pain points and improve the overall customer experience. By focusing on areas that drive dissatisfaction or disengagement, banks can make the necessary improvements and increase customer satisfaction, thereby reducing churn.
• Tailored marketing and product offerings. The findings can guide banks in tailoring their marketing campaigns and product offerings. By identifying the patterns or characteristics associated with customer attrition, banks can develop targeted marketing messages and introduce new products or features that cater to specific customer needs, thereby increasing their value propositions and reducing the likelihood of churn.
• Effective decision making. The Data Visualization RShiny app developed in this study provides stakeholders with comprehensive visualizations, enabling them to make informed decisions. By utilizing the app and the insights derived from this research, banks can gain a clearer understanding of churn trends, customer behavior, and the effectiveness of retention strategies. This will empower them to make data-driven decisions and effectively allocate resources to improve customer retention.

Furthermore, the findings of this study are relevant to numerous industries in addition to the banking sector. The examination of customer attrition and comprehension of the underlying variables are applicable to a variety of businesses, including insurance, telecommunications, subscription-based services, and e-commerce. Adopting a thorough preparation strategy and combining various pieces of data help guarantee the accuracy and consistency of data analyses across sectors. Similarly, using ML algorithms to forecast customer churn or other important business outcomes enables the optimization of proactive client retention and marketing strategies. Decisions in areas such as CRM, marketing initiatives, resource allocation, and product development are supported by the creation of the Data Visualization RShiny app. The research conclusions thus have useful ramifications that may be applied to a variety of sectors, directing data-driven decision making and improving client retention methods.

5. Conclusion

This study helps to predict churn among bank customers with relative success. However, there is scope for improvement in the future. Due to the sensitive nature of banking data, access to large datasets is restricted; access to more data points would enhance the generalizability of the predictions, and access to more granular data would contribute to improved forecasts. The current attributes describe a customer's profile rather than their recency (the metrics that record behavior immediately before churning). Prospective researchers can derive such metrics, which will help track a shift in customer behavior just before customers churn, and thus better identify churn patterns. There is also an opportunity for prospective researchers to improve our app by automating the model-training process, incorporating new features and data points, and generating updated models. This would help build a feedback loop into the models, ensuring greater veracity of model predictions as patterns change and the dataset grows. In addition, they can go further by incorporating additional prediction algorithms that can be integrated into the visualization app for comparative analysis and better churn management.


This would enable the deployment of the app in multiple businesses, where it can be used as a centralized churn management system. Another potential research topic involves developing localized prediction models that predict churn for only a subset of customers; for example, the given datasets could have different models for customers from different countries. The hypothesis is that individual models would deliver higher accuracy and, cumulatively, greater generalizability than a single model that is supposed to fit all situations. Additionally, the research indicates that different prediction algorithms are more suitable for different customer buckets, leading to more impactful predictions.

CRediT author statement

Pahul Preet Singh: Conceptualization, Methodology, Software. Fahim Islam Anik: Writing - Original draft preparation. Rahul Senapati: Visualization, Investigation. Arnav Sinha: Software, Validation. Nazmus Sakib: Supervision, Correspondence, Writing - Reviewing and Editing. Eklas Hossain: Writing - Reviewing and Editing.

Declaration of competing interest

The authors declare that there are no conflicts of interest.

References

Al-Mashraie, M., Chung, S.H., Jeon, H.W., 2020. Customer switching behavior analysis in the telecommunication industry via push-pull-mooring framework: a machine learning approach. Comput. Ind. Eng. 144 (Jun.), 106476.
Alin, A., 2010. Multicollinearity. Wiley Interdiscip. Rev. Comput. Stat. 2 (3), 370–374.
Amuda, K.A., Adeyemo, A.B., 2019. Customers churn prediction in financial institution using artificial neural network. Available at: https://arxiv.org/abs/1912.11346.
Anton, S.D.D., Sinha, S., Schotten, H.D., 2019. Anomaly-based intrusion detection in industrial data with SVM and random forests. In: 2019 International Conference on Software, Telecommunications and Computer Networks. IEEE, pp. 1–6.
Baghla, S., Gupta, G., 2022. Performance evaluation of various classification techniques for customer churn prediction in E-commerce. Microprocess. Microsyst. 94 (Oct.), 104680.
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., et al., 2020. A comprehensive survey on support vector machine classification: applications, challenges and trends. Neurocomputing 408 (Sep.), 189–215.
Chawla, N.V., Bowyer, K.W., Hall, L.O., et al., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 (Jun.), 321–357.
Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 785–794.
De Caigny, A., Coussement, K., De Bock, K.W., et al., 2020. Incorporating textual information in customer churn prediction models based on a convolutional neural network. Int. J. Forecast. 36 (4), 1563–1578.
De Lima Lemos, R.A., Silva, T.C., Tabak, B.M., 2022. Propension to customer churn in a financial institution: a machine learning approach. Neural Comput. Appl. 34 (14), 11751–11768.
Dias, J., Godinho, P., Torres, P., 2020. Machine learning for customer churn prediction in retail banking. In: International Conference on Computational Science and its Applications. Springer, Berlin, pp. 576–589.
Domingos, E., Ojeme, B., Daramola, O., 2021. Experimental analysis of hyperparameters for deep learning-based churn prediction in the banking sector. Computation 9 (3), 34.
Fernandez, A., Garcia, S., Herrera, F., et al., 2018. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61 (Apr.), 863–905.
Geiler, L., Affeldt, S., Nadif, M., 2022. An effective strategy for churn prediction and customer profiling. Data Knowl. Eng. 142 (Nov.), 102100.
Guliyev, H., Tatoglu, F.Y., 2021. Customer churn analysis in banking sector: evidence from explainable machine learning models. J. Appl. Mic. Econ. 1 (2), 85–99.
Han, H., Wang, W.Y., Mao, B.H., 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing. Springer, Berlin, pp. 878–887.
He, B., Shi, Y., Wan, Q., et al., 2014. Prediction of customer attrition of commercial banks based on SVM model. Procedia Comput. Sci. 31 (Jan.), 423–430.
He, H., Garcia, E.A., 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21 (9), 1263–1284.
Ho, S.C., Wong, K.C., Yau, Y.K., et al., 2019. A machine learning approach for predicting bank customer behavior in the banking industry. In: Machine Learning and Cognitive Science Applications in Cyber Security. IGI Global, pp. 57–83.
Karvana, K.G.M., Yazid, S., Syalim, A., et al., 2019. Customer churn analysis and prediction using data mining models in banking industry. In: 2019 International Workshop on Big Data and Information Security. IEEE, pp. 33–38.
Kaur, H., Pannu, H.S., Malhi, A.K., 2019. A systematic review on imbalanced data challenges in machine learning: applications and solutions. ACM Comput. Surv. 52 (4), 1–36.
Kim, J.H., 2019. Multicollinearity and misleading statistical results. Korean J. Anesthesiol. 72 (6), 558–569.
Lee, I., Shin, Y.J., 2020. Machine learning for enterprises: applications, algorithm selection, and challenges. Bus. Horiz. 63 (2), 157–170.
Lemmens, A., Gupta, S., 2020. Managing churn to maximize profits. Market. Sci. 39 (5), 956–973.
Machado, M.R., Karray, S., 2022. Applying hybrid machine learning algorithms to assess customer risk-adjusted revenue in the financial industry. Electron. Commer. Res. Appl. 56 (Nov.), 101202.
Manning, C., 2007. Generalized linear mixed models. Available at: https://nlp.stanford.edu/~manning/courses/ling289/GLMM.pdf.
Mansfield, E.R., Helms, B.P., 2012. Detecting multicollinearity. The American Statistician 36 (3a), 158–160.
Mazumder, S., 2021. 5 techniques to handle imbalanced data for a classification problem. Available at: https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/.
Patil, P.S., Dharwadkar, N.V., 2017. Analysis of banking data using machine learning. In: 2017 International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud). IEEE, pp. 876–881.
Pearce, J., Ferrier, S., 2000. Evaluating the predictive performance of habitat models developed using logistic regression. Ecol. Model. 133 (3), 225–245.
Rahman, M., Kumar, V., 2020. Machine learning based customer churn prediction in banking. In: 2020 4th International Conference on Electronics, Communication and Aerospace Technology. IEEE, pp. 1196–1201.
Schaeffer, S.E., Sanchez, S.V.R., 2020. Forecasting client retention—a machine-learning approach. J. Retailing Consum. Serv. 52 (Jan.), 101918.
Shirazi, F., Mohammadi, M., 2019. A big data analytics model for customer churn prediction in the retiree segment. Int. J. Inf. Manag. 48 (Oct.), 238–253.
Sokolova, M., Japkowicz, N., Szpakowicz, S., 2006. Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In: Australasian Joint Conference on Artificial Intelligence. Springer, Berlin, pp. 1015–1021.
Sothe, C., De Almeida, C.M., Schimalski, M.B., et al., 2020. Comparative performance of convolutional neural network, weighted and conventional support vector machine and random forest for classifying tree species using hyperspectral and photogrammetric data. GISci. Remote Sens. 57 (3), 369–394.
Statinfer, 2017. Calculating sensitivity and specificity in R. Available at: https://statinfer.com/203-4-2-calculating-sensitivity-and-specificity-in-r/.
Sun, Y., Wong, A.K., Kamel, M.S., 2009. Classification of imbalanced data: a review. Int. J. Pattern Recognit. Artif. Intell. 23 (4), 687–719.
Torgo, L., Ribeiro, R.P., Pfahringer, B., et al., 2013. SMOTE for regression. In: Portuguese Conference on Artificial Intelligence. Springer, Berlin Heidelberg, pp. 378–389.
Van Der Donckt, J., Van der Donckt, J., Deprost, E., et al., 2022. Plotly-resampler: effective visual analytics for large time series. In: 2022 IEEE Visualization and Visual Analytics (VIS). IEEE, pp. 21–25.
Vo, N.N., Liu, S., Li, X., et al., 2021. Leveraging unstructured call log data for customer churn prediction. Knowl. Base Syst. 212 (Jan.), 106586.
