
International Journal of Trend in Scientific Research and Development (IJTSRD)

Special Issue: International Conference on Advances in Engineering, Science and Technology – 2021
Organized by: Uttaranchal Institute of Technology, Uttaranchal University, Dehradun
Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470

A Review on Credit Card Default Modelling using Data Science


Harsh Nautiyal, Ayush Jyala, Dishank Bhandari
UIT, Uttaranchal University, Dehradun, Uttarakhand, India

How to cite this paper: Harsh Nautiyal | Ayush Jyala | Dishank Bhandari, "A Review on Credit Card Default Modelling using Data Science", Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Special Issue | International Conference on Advances in Engineering, Science and Technology – 2021, May 2021, pp.22-28, URL: www.ijtsrd.com/papers/ijtsrd42461.pdf

Copyright © 2021 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)

1. INTRODUCTION
In the last few years, credit cards have become one of the major consumer lending products in the U.S. as well as in several other developed nations, representing roughly 30% of total consumer lending (USD 3.6 tn in 2016). Credit cards issued by banks hold the majority of the market share, with approximately 70% of the total outstanding balance. Banks' credit card charge-offs have stabilized after the financial crisis at around 3% of the total outstanding balance. However, there are still differences in charge-off levels between competitors.

A credit card is a flexible tool by which you can use the bank's money for a short period of time. If you accept a credit card, you agree to pay your bills by the due date listed on your credit card statement; otherwise, the credit card is in default. When a customer is not able to pay back the loan by the due date and the bank is certain that it cannot collect the payment, it will usually try to sell the loan. If the bank then finds that it cannot sell the loan, it will write it off. This is called a charge-off. A charge-off results in significant financial losses to the bank on top of the damaged credit rating of the customer, making default an important problem to tackle in today's world of growing financial risk.

Accurately predicting which customers are most likely to default represents a significant business opportunity and strategy for all banks. Bank cards are the most common credit card type in the U.S., which emphasizes the impact of risk prediction on both consumers and banks. In a well-developed financial system, risk prediction is essential for predicting business performance or individual customers' credit risk and for reducing damage and uncertainty.

Our client, ITBCO Bank, has approached us to help them predict and prevent credit card defaults to improve their bottom line. The client has a screening process and has collected a rich data set on its customers, but it is unable to use this data properly due to a shortage of analytics capabilities.

The fundamental objective of the project is to implement a proactive default prevention guideline that helps the bank identify, and take action on, customers with a high probability of defaulting, improving the bank's bottom line. The challenge is to help the bank improve its credit card services for the mutual benefit of customers and the business itself. Creating a human-interpretable solution is emphasized in each stage of the project.

Plenty of solutions to default prediction using the full data set have been produced previously, but interpretability remains a problem, even in published papers. The scope of our project extends beyond that: our ultimate goal is to provide an easy-to-interpret default mitigation program to the client bank, which is achieved fairly easily by using the gradient boosting algorithm LightGBM for prediction.

In addition to default prevention, the case study includes a set of learning goals. The team must understand key considerations in selecting analytics and machine learning methods and how these methodologies can be used efficiently to create direct business value. McKinsey also sets the objective of learning how to communicate complex topics to people with different backgrounds.

The project should include a recommended set of actions to mitigate default and a clear explanation of the business implications. The interpretability and adaptability of our solution need to be emphasized when constructing it. The bank needs a solution that can be understood and applied by people with varying expertise, so that no further outside consultation is required to understand the business implications of the decisions.

2. RELATED WORK
Credit card lending is a widely researched subject. Many statistical methods have been applied to credit risk prediction, such as discriminant analysis, logistic regression, K-nearest neighbor classifiers, and probabilistic classifiers such as Bayes classifiers. Advanced machine learning methods, including decision trees and artificial neural networks, have also been applied. A short introduction to these techniques is provided here.

K-nearest Neighbor Classifiers
The K-nearest neighbor (KNN) classifier is one of the simplest supervised learning algorithms and is based on learning by analogy. When given an unknown data sample, the KNN classifier searches the pattern space for the k training samples that are closest to it, where closeness is defined by a distance metric. The unknown sample is then assigned to the most common class among its k nearest neighbors.
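To make the idea concrete, the following is a minimal, hypothetical sketch using scikit-learn's KNeighborsClassifier; it is not the code used in this study, and synthetic data stands in for the credit card dataset described later.

    # Hypothetical sketch, not the study's code: synthetic data stands in
    # for the real credit card dataset.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)  # closeness defined by Euclidean distance
    knn.fit(X_tr, y_tr)                        # KNN simply stores the training samples
    print(knn.score(X_te, y_te))               # majority vote among the 5 nearest neighbours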

Discriminant Analysis (DA)
The objective of discriminant analysis is to maximize the distance between different groups and to minimize the distance within each group. DA assumes that, for each given class, the explanatory variables are distributed as a multivariate normal distribution with a common variance–covariance matrix.

Logistic Regression (LR)
Logistic regression is often used for credit risk modeling and prediction in the finance and economics literature. Logistic regression analysis studies the association between a categorical dependent variable and a set of independent variables. A logistic regression model produces a probabilistic formula of classification. LR has problems dealing with non-linear effects of explanatory variables.

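A comparable sketch with scikit-learn's LogisticRegression, again on synthetic stand-in data, shows how the fitted model yields class probabilities:

    # Hypothetical sketch on synthetic stand-in data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict_proba(X[:5]))  # per-sample probabilities: P(class 0), P(class 1)
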
Classification Trees (CTs)
The classification tree structure is composed of nodes and leaves. Each internal node defines a test on a certain attribute, each branch represents an outcome of the test, and the leaf nodes represent classes. The root node is the top-most node in the tree. The segmentation process is generally carried out using only one explanatory variable at a time. Classification trees can result in simple classification rules and can also handle the nonlinear and interactive effects of explanatory variables. However, they may depend strongly on the observed data, so a small change in the data can affect the structure of the tree.

Artificial Neural Networks (ANNs)
Artificial neural networks are used to develop relationships between input and output variables through a learning process, by formulating non-linear mathematical equations to describe these relationships. A network can perform a number of classification tasks at once, although commonly each network performs only one. The best solution is usually to train separate networks for each output and then combine them into an ensemble so that they can be run as a unit. The back propagation algorithm is the best known neural network training algorithm and is applied here to classify data. In a back propagation neural network, the gradient vector of the error surface is computed. This vector points along the line of steepest descent from the current point, so we know that if we move along it a "short" distance, we will decrease the error. A sequence of such moves will eventually find a minimum of some sort. The difficult part is deciding how large the steps should be: large steps may converge more quickly, but may also overstep the solution or go off in the wrong direction.

Naïve Bayesian Classifier (NB)
The Bayesian classifier is a probabilistic classifier based on Bayes' theorem. It relies on the conditional independence assumption, i.e., that the effect of an attribute value on a given class is independent of the values of the other attributes. Computations are simplified by using this assumption; in practice, however, dependences can exist between variables.

Comparing the results of the six data mining techniques, classification trees and K-nearest neighbor classifiers have the lowest error rate on the training set. However, on the validation data, artificial neural networks have the best performance, with the highest area ratio and a relatively low error rate. As the validation data is the effective measure of the classification accuracy of a model, we can conclude that artificial neural networks are the best model among the six methods. Error rates alone, however, are not an appropriate criterion for measuring the performance of the models: the KNN classifier, for example, has the lowest error rate, yet it does not perform better than artificial neural networks and classification trees in terms of area ratio. Considering the area ratio on the validation data, the six techniques are ranked as follows: artificial neural networks, classification trees, Naïve Bayesian classifier, KNN classifier, logistic regression, and discriminant analysis.

3. PROBLEM FORMULATION
With the growth of e-commerce websites, people and financial companies rely on online services to carry out their transactions, which has led to an exponential increase in credit card fraud. Fraudulent credit card transactions lead to the loss of huge amounts of money for banks as well as various other sectors.

The design of an effective fraud detection system is necessary in order to reduce the losses incurred by customers and financial companies. Research has been done on many models and methods to prevent and detect credit card fraud, and some credit card fraud transaction datasets suffer from class imbalance. A good fraud detection system should be able to identify fraudulent transactions accurately and should make detection possible in real time. Fraud detection can be divided into two groups: anomaly detection and misuse detection. Anomaly detection systems are trained on normal transactions and use various techniques to determine novel frauds. Conversely, a misuse fraud detection system is trained on transactions labeled as normal or fraudulent in the database history. A misuse detection system therefore entails supervised learning, and an anomaly detection system unsupervised learning. Fraudsters mimic the normal behavior of customers and fraud patterns change rapidly, so a fraud detection system needs to constantly learn and update.

Background
Timely information on fraudulent activities is strategic to the banking industry, as banks have huge and varied databases from which valuable business information can be extracted. Credit card frauds can be broadly classified into three categories: traditional card-related frauds (application, stolen, account takeover, fake and counterfeit), merchant-related frauds (merchant collusion and triangulation), and Internet frauds (site cloning, credit card generators and false merchant sites).

Methodology
Basically, there are five basic steps in the data mining process once the problem is defined: 1) preparing the data, 2) exploring the data, 3) developing the model, 4) exploring and validating the model, and 5) deploying and updating the model. In this project, LightGBM is used as the data mining technique, and it follows the above-mentioned steps to produce accurate and reliable results. A neural network was also used, as it has the capability of adaptation and generalization. Moreover, Python [3] is a good option for experimentation: Jupyter is a notebook-style open source interface for Python, an interactive web-based environment that allows users to combine text, plots, mathematics, and executable code in a single document.
4. OBJECTIVES
1. Higher accuracy of fraud detection. Compared to rule-based solutions, machine learning tools have higher precision and return more relevant results, as they consider multiple additional factors. This is because ML technologies can consider many more data points, including the tiniest details of behavior patterns associated with a particular account.

2. Less manual work needed for additional verification. Enhanced accuracy reduces the burden on analysts. "People are unable to check all transactions manually, even if we are talking about a small bank," Alexander Konduforov, data science competence leader at AltexSoft, explains. "ML-driven systems filter out, roughly speaking, 99.9 percent of normal patterns, leaving only 0.1 percent of events to be verified by experts."

3. Fewer false declines. False declines or false positives happen when a system identifies a legitimate transaction as suspicious and wrongly cancels it.

4. Ability to identify new patterns and adapt to changes. Unlike rule-based systems, ML algorithms are aligned with a constantly changing environment and financial conditions. They enable analysts to identify new suspicious patterns.

5. METHODOLOGY
DBSCAN
The DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm is a well-known data clustering algorithm used for discovering clusters in a spatial data set. The algorithm requires two parameters. The first parameter is eps, the maximum distance between two points for them to be considered neighbors: if the distance between two points is smaller than or equal to eps, these points are treated as neighbors. The second is minPoints, the minimum number of points required to form a dense region. For instance, if we set the minPoints parameter to 5, then at least 5 points are required to form a dense region. Based on the parameters eps and minPoints and at least one point from the respective cluster, the algorithm groups together the points that are close to each other [6].

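As an illustration of the two parameters, a hedged sketch with scikit-learn's DBSCAN on synthetic latitude/longitude pairs (the coordinates and parameter values are placeholders, not the ones used in the study):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Synthetic latitude/longitude pairs stand in for the Location variable.
    rng = np.random.RandomState(0)
    coords = rng.uniform(low=[24.0, 121.0], high=[25.0, 122.0], size=(500, 2))

    # eps: largest distance at which two points still count as neighbours;
    # min_samples: minimum number of points needed to form a dense region.
    db = DBSCAN(eps=0.05, min_samples=5).fit(coords)
    print(np.unique(db.labels_))  # cluster labels; -1 marks noise points
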
Gradient boosting is a popular machine learning algorithm that combines multiple weak learners, such as trees, into one strong ensemble model. This is done by first fitting a model to the data. The first model is unlikely to fit the data points perfectly, so we are left with residuals. We can then fit another tree to the residuals to minimize a loss function, which can be the squared error, although gradient boosting allows the use of any differentiable loss function. This can be iterated for multiple steps, which leads to a stronger model, and with proper regularization overfitting can be avoided [7]. Gradient boosting has many parameters that need to be optimized to find the best performing model for a certain problem. These parameters include tree-specific parameters, such as size limitations for leaf nodes and tree depth, as well as parameters governing the boosting itself, for example how many models are fitted to obtain the final model and how much each individual tree impacts the end result. These parameters are usually optimized with a grid search that iterates through all possible parameter combinations. This is usually computationally expensive, since a large number of models have to be fitted: the number of parameter settings to be tested increases rapidly as more parameters are introduced.

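A sketch of such a grid search, here using scikit-learn's GridSearchCV with LightGBM's scikit-learn wrapper; the grid values are illustrative only, not the study's configuration:

    # Illustrative only: the grid values and the LGBMClassifier wrapper
    # are assumptions, not the study's actual setup.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from lightgbm import LGBMClassifier

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

    # Each extra parameter multiplies the number of fits: 2*2*2 = 8
    # combinations here, each fitted once per cross-validation fold.
    param_grid = {
        "num_leaves": [15, 31],      # size limitation for leaf nodes
        "max_depth": [4, 8],         # tree depth
        "n_estimators": [100, 200],  # how many models are fitted into the ensemble
    }
    search = GridSearchCV(LGBMClassifier(), param_grid, cv=3, scoring="roc_auc")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)
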
The self-organizing map (SOM), also known as a Kohonen network, is a type of artificial neural network that is used to produce low-dimensional discretized mappings of an input space [9]. Self-organizing maps produce a grid consisting of nodes arranged in a regular hexagonal or rectangular pattern. The training of a SOM works by assigning a model to each of the nodes in the output grid. The models are calculated by the SOM algorithm, and objects are mapped to the output node whose model is most similar to the object, in other words, the node with the smallest distance to the object under a chosen metric. For real-valued objects, the most commonly used distance metric is the Euclidean distance, although in this study the sum of squares was used. For categorical variables, the distance metric used in this study is the Tanimoto distance.

The models of nearby grid nodes are more similar to each other than to those located farther away. Since it is the nodes that are calculated to fit the data, the mapping aims to preserve the topology of the original space. The models are also known as codebook vectors, which is the term used in the R package 'kohonen' used to implement the algorithm [10]. The Tanimoto distance metric is described under the supersom function in the package documentation.

In this project, multiple unsupervised self-organizing maps were trained using the demographic variables to produce a two-dimensional mapping serving as a customer segmentation. Different parameters and map sizes were tested to find the optimal mapping that would maximize the quality of representation and the distance to neighbouring clusters within the map. The maps were also compared on their ability to produce clusters with varying financial impact and default risk, as measured by the financial model and the default prediction algorithm. The two primary measures used to compare different mappings in this study were the quality (mean distance of objects from the center of their node) and the U-matrix distances (mean distance of nodes to their neighbouring nodes). The name "quality" is used because of how it appears in the kohonen R package.
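The study itself used the R package 'kohonen'; purely as an illustration, a rough Python equivalent can be sketched with the third-party MiniSom library (note that MiniSom defaults to the Euclidean distance and offers no Tanimoto metric):

    # Rough Python analogue of the R 'kohonen' workflow; MiniSom is an
    # assumption here, and random data stands in for the scaled
    # demographic variables.
    import numpy as np
    from minisom import MiniSom

    data = np.random.RandomState(0).rand(500, 4)

    som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
    som.train_random(data, num_iteration=1000)  # fit the node models (codebook vectors)

    print(som.winner(data[0]))       # grid node whose model is closest to this object
    print(som.distance_map().shape)  # U-matrix: mean distance of each node to its neighbours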

Preliminary data analysis
Describing the data
The data consists of 30,000 customers and 26 columns of variables. Each sample corresponds to a single customer. The columns consist of the following variables:

Default (Yes or no): binary response variable
Balance limit: amount of credit in U.S. $
Sex (Male, Female)
Education (Graduate school, University, High school, Others)
Marital status (Married, Single, Others)
Age (Years)
Employer (Company name)
Location (Latitude, Longitude)
Payment status (last 6 months): indicates payment delay in months or whether payment was made duly
Bill amount (last 6 months): states the amount of the bill statement in U.S. $
Payment amount (last 6 months): amount paid by the customer in U.S. $

The variables Balance limit, Age, Sex, Education, Marital status, Employer, and Location are defined as demographic variables, since they describe a demography of customers and are available for new customers, unlike the historical payment data, which is only available for existing customers.

The total proportion of defaults in the data is 22.12%, which is 6,636 out of the total data set comprising 30,000 samples. This could be due to a large bias and therefore not a realistic representation of the bank's customer base. However, the data was collected during a debt crisis, which provides an argument for the assumption that the data represents a non-biased sample of the customer base. In any case, the high number of defaults should be taken into consideration when making generalizations about the results or methodology of this case study. The high number of defaults will especially affect estimates of the bank's financials.

Default
This variable indicates whether or not the customer defaulted on their credit card debt payment. For the purpose of this project, predicting default is the main focus of the data analysis. A value of 1 indicates default, and a value of 0 indicates no default. It is unclear how long after the collection of the data this variable is measured, meaning that default could have happened the following month or a longer time thereafter. Since this is unknown, no assumptions are based on the time of default. It is also not clear whether a value of 1 means the client missed only a single payment or multiple payments, or whether the length of the delay in payment was taken into account.

Balance limit
States the amount of given credit in U.S. $. This is the maximum amount a customer can spend with their credit card in a single month. The balance limit depends on the bank's own screening processes and other unknown factors.

Sex
This variable can take a value of 1 for male and 2 for female. In this study, sex and gender are used interchangeably to mean the same thing. It is unknown whether the difference between the two definitions was taken into account when the data was collected.

Education
The education level of a customer is represented as one of four values: 1 = Graduate school, 2 = University, 3 = High school, 4 = Other. For the purpose of analysing customer groups, this is assumed to indicate the highest level of education completed.

Marital status
Referred to as "married" in the analysis, this variable can take three values: 1 = Married, 2 = Single, 3 = Other, such as divorced or widowed.

Age
The age of the customer is stated in years.

Location
This variable is composed of two values for each customer: one for the latitude and one for the longitude. In order to benefit from this data in predictions using only the demographic variables, we applied the DBSCAN algorithm.

Payment status
Represented as 6 different columns, one for each month. The value of payment status for a month indicates whether the repayment of credit was delayed or paid duly. A value of -1 indicates payment made duly. Values from 1 to 8 indicate payment delay in months, with a value of 9 defined as a delay of 9 months or more. Data was collected over 6 months, April to September.

Bill amount
The amount of the bill statement in U.S. $ is recorded in this variable. It is represented in the data as 6 columns, one for each month. Data was collected over 6 months, April to September.

Payment amount
The amount of the previous payment in U.S. $, stored in 6 different columns, one for each month, similarly to payment status and bill amount. The payment amounts correspond to the same months as payment status and bill amount. For example, the payment amount for April indicates the amount paid in April.

Checking data unbalance:
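A minimal pandas sketch of this check, assuming the Kaggle CSV from reference [2] (the file and column names are those of the Kaggle copy and may differ from the bank's data):

    import pandas as pd

    # File and column names follow the Kaggle copy in reference [2].
    df = pd.read_csv("UCI_Credit_Card.csv")
    counts = df["default.payment.next.month"].value_counts()
    print(counts)            # roughly 23,364 non-defaults vs 6,636 defaults
    print(counts / len(df))  # ~77.88% vs ~22.12%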


Features correlation
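The correlation figure can be reproduced in outline with pandas (same assumed file as above):

    import pandas as pd

    df = pd.read_csv("UCI_Credit_Card.csv")  # same assumed file as above
    corr = df.corr(numeric_only=True)        # pairwise Pearson correlations
    print(corr["default.payment.next.month"].sort_values(ascending=False))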

Using the mainstream (LightGBM) algorithm:


Training the dataset:

    [LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006994 seconds.
    You can set `force_col_wise=true` to remove the overhead.
    Training until validation scores don't improve for 50 rounds
    [50]  train's auc: 0.778238  valid's auc: 0.771173
    [100] train's auc: 0.789346  valid's auc: 0.782605
    [150] train's auc: 0.794861  valid's auc: 0.784753
    Early stopping, best iteration is:
    [135] train's auc: 0.793452  valid's auc: 0.785154

Best validation score was obtained for round 135, for which AUC ~= 0.78.
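A hedged sketch of a training call that would produce a log of this shape, using LightGBM's early-stopping and logging callbacks on synthetic data with a similar class balance (parameter values are illustrative, not the study's exact configuration):

    # Hedged sketch: synthetic data with a similar class balance replaces
    # the real dataset, and parameter values are illustrative.
    import lightgbm as lgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=30000, weights=[0.78], random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

    train_set = lgb.Dataset(X_tr, label=y_tr)
    valid_set = lgb.Dataset(X_va, label=y_va, reference=train_set)

    booster = lgb.train(
        {"objective": "binary", "metric": "auc", "learning_rate": 0.05},
        train_set,
        num_boost_round=1000,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(50),   # stop 50 rounds after the last improvement
                   lgb.log_evaluation(50)],  # print the validation AUC every 50 rounds
    )
    print(booster.best_iteration, booster.best_score)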

Plotting the variable importance
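Continuing the previous sketch, the variable importance plot can be drawn with LightGBM's built-in helper:

    # Continues the previous sketch; 'booster' is the trained model.
    import matplotlib.pyplot as plt
    import lightgbm as lgb

    lgb.plot_importance(booster, max_num_features=15)  # feature split counts
    plt.tight_layout()
    plt.show()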

6. CONCLUSION
The results of the analysis and predictive modelling show that neither directly measuring nor using the predicted proportion of defaults of a customer group to predict default is accurate. This is most likely due to multiple reasons. One of them is the limited accuracy of any machine learning algorithm trained on a small number of variables or on data with missing values. Another reason is most likely the lack of specificity in the customer segments, which mixes actual high-risk customers with those of low risk. Comparing payment amounts, gender, and marital status between the training set and the test set also showed large variation. This is most likely due to the high losses that a single customer can produce by defaulting with a high amount of debt. Much of the variation in the data could not be represented, since customer segmentation was done using only the demographic variables. Further analysis should be done in order to fully justify and support business decisions based on the customer segmentation in this study.

When it comes to default prediction, we have a model that is able to predict the defaults of customers with high enough certainty that the bank can utilize it in its operations. Assuming that the bank continues to receive customers like those represented in our dataset, we could implement our model in the bank's preliminary screening process, and it would bring financial gain to the bank.

However, our solution is not viable as a standalone system in its current form, since it only considers part of the bank's actions. Many factors that were not covered in this case study should be taken into consideration before taking any business action. For example, young people could be preferable for the bank since they stay longer as customers, so it could be in the bank's interest to favor having them as customers even if our model would suggest otherwise.

Single customers should not be discriminated against, especially based on the customer segmentation, which relies on calculating averages over a group. A single customer defaulting with high debt can result in much higher losses than might be anticipated simply based on averages.

Similarly, the analysis does not go in-depth enough to justify assuming that the variables used in this study could explain or predict how reliable the customers are in the long run, especially considering that the data was collected during a debt crisis.

7. REFERENCES
[1] Wikipedia, https://www.8051projects.net/files/public/1259220442_20766_FT0_7380969-line-follower-using-at89c51.pdf
[2] Default of Credit Card Clients Dataset, https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/
[3] RandomForestClassifier, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[4] ROC-AUC characteristic, https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
[5] AdaBoostClassifier, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
[6] CatBoostClassifier, https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier-docpage/
[7] XGBoost Python API Reference, http://xgboost.readthedocs.io/en/latest/python/python_api.html

[8] LightGBM Python implementation, https://github.com/Microsoft/LightGBM/tree/master/python-package
[9] LightGBM algorithm, https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf
[10] Chauhan, N., Dhaundiyal, R., & Joshi, K. K-Means on Search Engine Dataset through WEKA. International Journal of Research Fellow for Engineering (IJRFE), Volume 4.
[11] Joshi, K., Rawat, S., & Chaudhary, S. Analysis of Different Optical Switching Techniques in NOC Router Architecture. International Journal of Research Fellow for Engineering (IJRFE), Volume 4.
[12] Joshi, K., Chaudhary, S., & Chauhan, N. Hybrid Clustering Algorithm using K-Means Clustering Algorithm. International Journal of Research Fellow for Engineering (IJRFE), Volume 4.
[13] Longkumer, M., Joshi, K. A Comprehensive Study on Recent Botnet. International Journal of Science and Research (IJSR), Volume 7.
[14] Joshi, K., Gupta, H., & Lamba, S. An Overview on Image Fusion Concept. Journal of Emerging Technologies and Innovative Research (JETIR), Volume 5, 873-879.
