DEVELOPMENT OF A MACHINE LEARNING-BASED FINANCIAL RISK CONTROL SYSTEM

by

Zhigang Hu

A thesis submitted in partial fulfillment of the requirements for the degree

of

MASTER OF SCIENCE

in

Computer Science

Approved:

2021
ABSTRACT
by Zhigang Hu
With the gradual end of the COVID-19 outbreak and the gradual recovery of the
economy, more and more individuals and businesses are in need of loans. This demand
brings business opportunities to various financial institutions, but also brings new risks. The
traditional loan application review is mostly manual and relies on the business experience
of the auditor, which has the disadvantages of not being able to process large quantities
and being inefficient. Since the traditional audit processing method is no longer suitable, some other method of reducing the rate of non-performing loans and detecting fraud in applications is needed.
This thesis develops a risk control system using real, desensitized loan applicant data
provided by financial institutions. The system uses a decision tree model to create a loan
user bad debt rate prediction system and a logistic regression model to create a loan user
score card system. The Borderline-SMOTE algorithm is used to solve the sample imbalance
problem in the dataset. A gradient boost decision tree method is employed to improve the
overall performance of the system. The final risk control system improves the approval
speed, reduces the risk, and lowers the rate of non-performing loans.
(69 pages)
PUBLIC ABSTRACT
Zhigang Hu
With the gradual end of the COVID-19 outbreak and the gradual recovery of the economy,
more and more individuals and businesses are in need of loans. This demand brings business
opportunities to various financial institutions, but also brings new risks. The traditional
loan application review is mostly manual and relies on the business experience of the au-
ditor, which has the disadvantages of not being able to process large quantities and being
inefficient. Since the traditional audit processing method is no longer suitable, some other method of reducing the rate of non-performing loans and detecting fraud in applications is needed.
In this project, a financial risk control model is built by using various machine learning
algorithms. The model is used to replace the traditional manual approach to review loan
applications. It improves the speed of review as well as the accuracy and approval rate
of the review. Machine learning algorithms were also used in this project to create a loan user scorecard system that better reflects changes in user information than the credit record systems used by financial institutions today. The data imbalance problem inherent in loan data is also addressed.
ACKNOWLEDGMENTS
This work lasted for almost 8 months. During this 8-month project, with the assistance of others, the system was built using daily business data from financial institutions and some data provided by third parties.
I give special thanks to my supervisor Dr. Curtis, family, friends, and colleagues for
their encouragement, moral support, and patience as I worked my way from the initial
proposal writing to this final document. I could not have done it without all of you.
Zhigang Hu
CONTENTS
Page
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
PUBLIC ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 THE PROCESS OF DEVELOPING A MACHINE LEARNING-BASED FINAN-
CIAL RISK CONTROL SYSTEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 A DECISION TREE-BASED LOAN REVIEW SYSTEM . . . . . . . . . . . . . . . . . . . . 9
3.1 The Data Set and Feature Engineering . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 Handle missing values . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Feature Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Going from a Data Table to a Decision Tree . . . . . . . . . . . . . . . . . . 13
3.3 Decision tree result output and related decisions . . . . . . . . . . . . . . . 16
3.3.1 Analysis and comparison of results . . . . . . . . . . . . . . . . . . . 17
4 LOGISTIC REGRESSION-BASED CREDIT SCORING MODEL . . . . . . . . . . . . . 21
4.1 Key approaches to credit score modeling . . . . . . . . . . . . . . . . . . . . 21
4.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.1 Loss function of binary logistic regression . . . . . . . . . . . . . . . 26
4.2.2 Regularization in logistic regression . . . . . . . . . . . . . . . . . . . 26
4.3 Datasets and Feature engineering . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 Descriptive statistics to handle outliers . . . . . . . . . . . . . . . . . 29
4.3.2 Sample imbalance problem . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 Creating bins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Modeling and Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.6 Making scorecards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5 DEALING WITH DATA IMBALANCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1 Handling Imbalance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Treatment of imbalanced data sets from an algorithmic perspective . . . . . 41
5.3 Treatment of imbalanced data sets from a data perspective . . . . . . . . . 42
5.4 Comparison of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 PERFORMANCE IMPROVEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.1 Ensemble Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Gradient Boost Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2.1 Comparison and analysis of results . . . . . . . . . . . . . . . . . . . 49
LIST OF FIGURES
Figure Page
CHAPTER 1
INTRODUCTION
The Internet has transformed traditional financial practices. Internet finance brings
convenience, speed and high returns to financial enterprises, but it also brings considerable
risks such as confusion in credit management, bad lending and malicious deception. The
scale of social financing is growing and private economic activities are becoming more and
more active, meanwhile, credit cards, housing loans, auto loans, consumer loans and other
personal credit businesses are also growing rapidly. As the total amount of loans is increas-
ing, the number of defaults by users is also rising, and some banks are experiencing an
increasing rate of non-performing loans. Therefore, it is very important for financial enterprises to reduce the risk of Internet finance by studying the risk of user default and assessing the credit of loan applicants.
The rapid development of Internet finance has brought serious challenges to the tra-
ditional financial model, especially in the risk control system. In the financial sector, the main types of risk are usually categorized as follows.
• Market risk refers to the uncertainties in the value of the company’s underlying assets,
liabilities, or income due to exposure to a highly dynamic financial market [2, 3].
Market risk includes risks such as interest rate risk, exchange rate risk, and commodity
price risk.
• Credit risk refers to the uncertainty involving debtors' ability to perform their contractual obligations (loan defaults or bankruptcy) [4]. This is applicable both to retail lenders (lenders who provide loans to individuals or retail customers) and to corporate lenders (lenders who provide loans to businesses).
• Insurance and demographic risk, which is more specific to the insurance industry, refers to the variance in insurance claim experience due to unpredictable events (e.g., catastrophes, car accidents) as well as uncertainties involved with the demographic profile of its policyholders (e.g., mortality). This risk can be further broken down into more specific subcategories.
• Operational risk refers to the risk of loss due to the unpredictability of business operations. This risk can be further broken down into business risk and event risk. Business risk includes uncertainty in ordinary operations (e.g., demand volatility, customer churn, faulty business operations), and event risk includes uncertainty in events that have an adverse effect on business operations (e.g., fraudulent behavior).
At present, risk control is mostly carried out by each financial institution's own credit risk control team in the form of a manual audit. The biggest problem of this audit mode is the slow audit speed: it usually takes more than a week from the audit to the release of funds, which cannot adapt to today's fast-changing financial model. Machine learning is a computational method that uses past information to make predictions about new data. By building a model using machine learning, we can quickly assess a user's credit rating, loan amount, and default risk. Machine learning has been applied in several areas of financial risk control, including the following.
• Loan Risk Assessment Modeling - This area is by far the most used area of machine learning in finance, with systems using various approaches to assign the risk level associated with a specific application [7]. Recently, machine learning algorithms have been widely used in credit card default prediction. By constructing several models, such systems can predict clients' payment behavior precisely and effectively, and can address the problem of large-scale data and operate promptly [8]. Several unsupervised learning models have also been used for default or bankruptcy prediction in credit risk.
• Credit Card User Score System - In the age of Internet finance, credit is everywhere.
In order to judge the quality of customers, the use of credit scoring and the imple-
mentation of customer scoring has become a difficult point in the current field of risk
control. So it is very important to establish a credit card user score system, through
which the lenders can determine risk involved in lending. Shukla built this kind of
system by using a parallel social spider algorithm [10]. Sudha used SVM and random forest methods to build a similar scoring system.
This thesis develops a machine learning-based financial risk control system. The system consists of several parts, described below.
1. The first part is an effective bad debt prediction system. Bad debt rate control is a
problem that financial institutions face. Current bad debt rate analysis and control
systems are based on statistics combined with finance to find out the main risk points
in user data; common risk points are monthly income and the asset/liability ratio. In
this thesis, the main focus is on predicting the bad debt rate of new users by using a
classification model. One of the problems in building such a classifier is that the financial data is highly imbalanced, which is addressed later in this thesis.
2. The second part is a user credit scoring system. In financial risk management (FRM), the user credit scoring system
is a core system. The system must fully reflect the user’s credit history, and it is also
an important basis for deciding whether to grant a loan to the user, the loan amount,
and the interest rate of the loan. The current credit history system that is widely
used by major financial institutions is the FICO system, but this system is only a
credit record system, not the scoring system that financial institutions need. It can
only score a user’s credit based on their age, income status, and whether they have
a history of defaults. The system only reflects the history of the user; for example, there are a large number of users in the system who have defaulted on their loans for various reasons but later paid them back in a timely manner. This situation will still result in a low FICO score, which will lead to a failed audit and the potential loss of a
quality customer. The advantage of machine learning lies in the use of historical data
to predict the future behavior of users, so machine learning algorithms can be applied
to the creation of this system. But the challenge is how to make the final scorecard
system interpretable and accurate. It has to be able to accurately rate users based
on their history and also explain the basis of the rating to business people. We build
a regression-based user credit scoring system. Common regression-based models perform binary or multi-class classification of users based on the feature data sets. But we need a system to perform
multiple classifications on the feature dataset based on the user classification results,
and the classification results must be optimized to reflect the user information as
much as possible. Therefore, in this system, we simply borrow the idea of a regression
algorithm and create a new algorithm to classify the feature set data based on the
result set data, and ensure the interpretability and accuracy of the algorithm.
3. The third part is solving the problem of data imbalance in financial risk control sys-
tem. A natural feature of financial data is data imbalance (fewer users actually have
defaulted than paid loans in full), so this reality should be fully considered when
dealing with such data. In this thesis several algorithms to deal with the data imbal-
ance problem are tried and the results are given for comparison. We experimentally
show that a small improvement in accuracy can be achieved through addressing data
imbalance.
The last part is improving the overall performance of the system. It has been illustrated in many papers and Kaggle competition practices that integrated (ensemble) algorithms usually outperform single models, and our experiments show that an integrated algorithm can indeed improve the overall performance of this system.

The main contributions of this work are as follows.

• New feature sets and a feature derivation formula are created to meet the project requirements and are
applied to the decision tree model with significant results, in that the classification
features finally adopted were all derived from the results of feature derivation.
• Based on the idea of the logistic regression algorithm, a new method of classifying feature data sets by working backward from the classification result set is proposed and applied to the construction of the user credit scoring system.
• Based on the equal-frequency data binning algorithm, a new data binning algorithm is proposed that uses the chi-square test combined with the calculation of information value. It makes up for the defect that the equal-frequency binning algorithm does not take the label distribution into account when forming bins.
CHAPTER 2
THE PROCESS OF DEVELOPING A MACHINE LEARNING-BASED FINANCIAL RISK CONTROL SYSTEM
The overall development process is illustrated in Figure 2.1. The application starts with getting data. The main sources of data in
this project are daily business data of financial institutions, data provided by qualified
third parties, and log data of websites. The data has been desensitized in accordance with the relevant privacy requirements.
The next step in the process is to clean the data. Data cleaning and feature engineer-
ing is an important procedure in machine learning. The quality of the data will directly
affect the learning effect of the model. There is a famous idiom in the field of machine
learning that “garbage in, garbage out.” This step is often neglected in papers because the
datasets are idealized data obtained after many cleanings. Real-world datasets are often
less than ideal, which leads to many algorithmic models failing to achieve the desired effect
in practice. The reason for this is not the algorithm itself, but the quality of the data.
Data cleaning and feature engineering methods are many and complex; common methods include data type conversion, missing value filling, one-hot coding, biased data correction, and outlier handling.
Model development comes after feature engineering. Model development is the most
central step in machine learning. We use four kinds of machine learning models in this project: regression, classification, clustering, and association models. Regression models are usually divided into univariate versus multivariate regression. The coefficients relating the independent variables to the dependent variable are calculated by fitting the data to the regression model during the training process. Regression problems are widely used; examples include stock
price prediction, house price prediction, and temperature prediction. The credit user scoring system in this thesis builds on regression. Classification models predict the category of an unknown entity. Classification can be supervised (the classes are known a priori)
or unsupervised (the classifier learns the classes). A machine learning model can also
output the probability of the classification result while processing the classification.
In this thesis, classification is used for loan risk prediction (into good or bad risk)
and credit card user prediction (rating the card risk from A to F).
Clustering models are used when the number of clusters is not known in the initial condition. Clustering groups similar samples into a cluster. Commonly used methods include density clustering and grid clustering, and sample distance is usually measured with the Euclidean or Minkowski distance. Cluster modeling usually uses K-Means, a simple and easy method that is widely used in pattern recognition and image processing. The fourth kind of model is the association rule, which studies the correlation between different data in an event.
After choosing the modeling method and building the model, the next step is testing
and evaluation. In this step the training results of the model are tested with the test dataset
and the parameters of the model are adjusted based on the test results to achieve the best
machine learning results. In addition, an evaluation team involving business people is often
needed to assess the feasibility of the modeling results in real business scenarios. When the
results of the model do not reach the expected results, we need to re-examine the quality
of the input data set when the problems of the model algorithm itself are excluded, so this
stage is often an iterative process. The main methods used in this phase are cross-validation and related model evaluation techniques.
Once the model has passed evaluation and testing, it is ready for live application.
However, in practice, we find that after the model has been online for a period of time, there is often a phenomenon of model degradation: the model works well when it first goes online, but after running for a while its performance becomes much worse than at launch. After analyzing the reasons, we found that the user data tends to be concentrated in certain types over a period of time; for example, during one period most of the loan applications are for home loans, and during another period most of the applications are for car loans. In such cases the original distribution of the data set changes, leading to degradation of the model. We then need to add the new data distribution to the training data and retrain the model. Therefore, the project does not end once the model goes online; we need to continuously monitor the performance of the online model, analyze the reasons once degradation is found, and retrain or update the model accordingly.
CHAPTER 3
A DECISION TREE-BASED LOAN REVIEW SYSTEM
An important part of a financial institution’s daily business is the review of loan appli-
cants. At present, with the increasing economic activity and lower interest rates on loans,
the number of applications for loans of all kinds has risen sharply. While these loans bring
lucrative profits to financial institutions, they also pose a relatively large risk. Traditional
loan applications are reviewed by auditors based on their work experience, and this method has the following problems:
1. Low speed - An auditor can review a limited number of applications per day. Generally,
an experienced auditor can only review about 30-50 applications a day. However, the number of applications received each day is often far larger than the number that can be reviewed in time. This can result in a large number of quality customers being lost to other financial institutions. The longer audit cycles also lead to a slowdown in the institution's business.
2. Low accuracy - Traditional auditors can only rely on limited information such as credit
score, income status, and marital status to analyze and make judgments with previous
work experience. For example, when most financial institutions review home loans,
they often set the monthly mortgage payments to be no more than 1/3 of the net
income. However, personal income is dynamic, and the income proof available
to financial institutions is only the applicant’s current income, which leads to a large
number of qualified applicants who cannot get the loan amount they need and miss out
on the desired home, while some applicants default on the loan due to lower income.
A decision tree algorithm learns rules from a series of data with features and labels, and presents these rules in a tree-like
graph structure to solve classification and regression problems. Decision tree algorithms
are easy to understand, applicable to all kinds of data, and perform well in solving various
problems, especially the various integrated algorithms with a tree model at their core, which are widely used across industries and fields.
This chapter describes the decision tree-based loan review system, starting with a description of the data set and the feature engineering applied to it.

We use user oil consumption data provided by a third party, desensitized and integrated. The details of the data set are shown in Figure 3.1. The data set has a total of 50609 records and 18 features, where 'bad_ind' is an expert-provided label of whether or not the user turned out to be a bad-debt customer.
Some of the features have missing values. The missing values are shown in Figure 3.2.
The missing values account for nearly 10% of the total data volume, so we cannot simply
delete records with missing values or fill them with 0. We need to adopt different filling strategies for different types of features.
Based on the nature of the features, we divide the feature set into three categories.

1. The first category contains basic information features such as 'uid', 'create_dt', 'oil_actv_dt', 'class_new', and 'bad_ind'.

2. agg_lst - Aggregate numeric values. This category contains features such as 'oil_amount', 'discount_amount', 'sale_amount', 'amount', 'pay_amount', 'coupon_amount', and 'payment_coupon_amount'.

3. dstc_lst - Count text variables. This category contains features such as 'channel_code'.
Inevitably, missing values appear in the raw data for various reasons. However, in
machine learning algorithms, if there are missing values in the dataset, the program cannot
be executed. Therefore, one of the important tasks in feature engineering is to fill in the
missing values. When the number of missing values is small, the easiest way is to remove
the missing values. Other methods include median/mean/zero value filling, etc.
From Figure 3.2, we can see that there are a total of 4944 data items with missing values. The number of missing values accounts for 9.8% of the total data volume, and deleting them directly would inevitably cause information loss and thus affect the final prediction results. Looking further, we can see that the missing values occur in the following features: 'create_dt', 'oil_amount', 'discount_amount', 'sale_amount', 'amount', 'pay_amount', 'coupon_amount', and 'payment_coupon_amount'. In terms of data types, the missing values include both date and numeric data, so we need to use different filling algorithms.
First, we fill the missing 'create_dt' values with 'oil_actv_dt' and keep only 6 months of data. When constructing the derived variables we cannot simply accumulate all of the historical data: we first filter out the records whose time interval exceeds 180 days, and then, for the numerical data, aggregate only the records whose time interval from the activation date is less than 180 days.
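As an illustration, the date filling and 180-day window described above might be implemented in pandas roughly as follows. The column names 'create_dt' and 'oil_actv_dt' come from the data set; the toy DataFrame and variable names are assumptions for illustration only, not the project's actual code.

```python
import pandas as pd

# Illustrative sketch only: fill missing 'create_dt' with 'oil_actv_dt' and keep
# a 180-day (about 6-month) window instead of accumulating all history.
df = pd.DataFrame({
    "uid": [1, 2, 3],
    "create_dt": ["2021-01-05", None, "2020-01-01"],
    "oil_actv_dt": ["2021-03-01", "2021-03-01", "2021-03-01"],
    "pay_amount": [100.0, 250.0, 80.0],
})
df["create_dt"] = pd.to_datetime(df["create_dt"])
df["oil_actv_dt"] = pd.to_datetime(df["oil_actv_dt"])

# Fill the missing application dates with the activation date.
df["create_dt"] = df["create_dt"].fillna(df["oil_actv_dt"])

# Keep only records whose gap to the activation date is at most 180 days.
df["day_gap"] = (df["oil_actv_dt"] - df["create_dt"]).dt.days
df = df[df["day_gap"] <= 180]
print(df)
```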
In the feature engineering approach of machine learning, we often compute the correla-
tion between features and labels to delete those feature sets that have little or no impact on
the final learning results but consume a lot of computing resources. The deletion of these feature sets, however, inevitably leads to the loss of data information, which affects the final machine learning results. After some transformation or combination, such features often carry strong information value and, given enough data sensitivity and machine learning experience, can be put to good use, so we need to do some derivation work on the basic features.
In this dataset, the set of features that can actually be used in the decision tree for
classification is agg lst, which has a total of seven features. The number of features is
on the low side to achieve a good classification result. Therefore, feature derivation can
be considered for this feature set. There are seven numerical features in this feature set, and most of the information in the whole data set should be contained in them. In order to further explore the hidden information in the data and to ensure the
classification effect of the decision tree model, feature derivation can be performed on this
feature set.
Feature derivation requires practitioners to have rich knowledge or practical experience in related fields, to observe and analyze the raw data based on actual business scenarios, to think about the potential form of the problem and data structure, and to find physically meaningful features in the raw data.
Kanter proposed the concept and method of Deep Feature Synthesis [12] and verified
the effectiveness of the method. Based on the ideas provided in the paper and previous work
experience, the feature set was derived as shown in Figure 3.3. After this feature engineering, the derived features are ready to be used by the decision tree model.
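The derived features referenced later in this chapter (for example pay_amount_tot and pay_amount_num) are aggregate statistics per user. A minimal sketch of this kind of derivation with pandas, under the assumption that the aggregates are simple count/sum/mean/max statistics, could look like this:

```python
import pandas as pd

# Illustrative sketch: derive per-user aggregate features from transaction-level
# records. 'uid' and 'pay_amount' follow the data set; the suffix naming scheme
# (_num, _tot, _avg, _max) is an assumption for illustration.
records = pd.DataFrame({
    "uid": [1, 1, 1, 2, 2],
    "pay_amount": [100.0, 220.0, 90.0, 40.0, 55.0],
})

derived = records.groupby("uid")["pay_amount"].agg(
    pay_amount_num="count",  # number of payment records
    pay_amount_tot="sum",    # total payment amount
    pay_amount_avg="mean",   # average payment amount
    pay_amount_max="max",    # largest single payment
).reset_index()
print(derived)
```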
There are an exponential number of decision trees that can be constructed from a table.
Decision-tree algorithms prune decision trees by measuring some property of the tree. The
method we chose uses impurity. Generally speaking, the lower the impurity, the better
the decision tree fits the training set. Many decision tree algorithms are centered around
Impurity is calculated per node: each node in the tree has an impurity value, and the impurity of a child node must be lower than that of its parent node. That is, the leaf nodes have the lowest impurity in a given decision tree.
In the decision tree algorithm, there are two ways to calculate impurity.
Entropy(t) = -\sum_{i=0}^{c-1} p(i|t) \log_2 p(i|t)

Gini(t) = 1 - \sum_{i=0}^{c-1} p(i|t)^2
In these equations, t represents the given node and i represents any class of the label.
Information entropy is more sensitive to impurity than the Gini coefficient and penal-
izes impurity the most. However, in practical use, the effects of information entropy and
Gini coefficient are similar. The calculation of information entropy is slower than the Gini
coefficient because the calculation of the Gini coefficient does not involve logarithms. In-
formation entropy is more sensitive to impurity, so the growth of the decision tree is more
refined when the information entropy is used as an indicator. For high-dimensional data
or data with a lot of noise, the information entropy can be easily over-fitted, and the Gini
coefficient tends to work better in this case. When the model is under-fitted, i.e., when the model does not perform well on either the training set or the test set, it is better to use information entropy.
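For concreteness, the two impurity measures can be computed directly from the class proportions at a node; the following is a small illustrative sketch, not code from the project.

```python
import numpy as np

# Illustrative implementations of the two impurity measures for a single node,
# given the class proportions p(i|t) at that node.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# A node with 95% good and 5% bad loans is already fairly pure:
print(entropy([0.95, 0.05]), gini([0.95, 0.05]))
```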
The basic process of decision trees can actually be briefly summarized as shown in
Figure 3.4:
Without restrictions, a decision tree grows until the measure of impurity is optimal,
or until no more features are available. Such a decision tree tends to over-fit, which means
that it will perform well on the training set and poorly on the test set. The sample data
we collect cannot be identical to the overall situation, so when a decision tree has an overly
good interpretation of the training data, the rules it finds must contain noise from the training samples and will generalize poorly to unseen data.
In order to have better generalization of the decision tree, we have to prune the decision
tree. The pruning strategy has a huge impact on the decision tree, and the right pruning strategy is at the core of optimizing the decision tree algorithm.
There are two common kinds of pruning parameters: max_depth and min_samples_leaf / min_samples_split.

1. max_depth - Limit the maximum depth of the tree, and cut off all branches that exceed the set depth. This is the most widely used pruning parameter and is very effective for high-dimensional data with a low sample size. Growing the decision tree by one more layer doubles the demand on the sample size, so limiting the tree depth is an effective way to limit over-fitting.

2. min_samples_leaf and min_samples_split - With min_samples_leaf, each child of a node after branching must contain at least min_samples_leaf training samples; otherwise branching will not happen, or the branching will happen in a direction that satisfies min_samples_leaf for each child node. It is usually used together with max_depth and has the effect of smoothing the model in regression trees. Setting this parameter too small will induce over-fitting, and setting it too large will prevent the model from learning the data. If the sample size contained in the leaf nodes varies greatly, it is suggested to supply a floating-point number, interpreted as a percentage of the total sample size. This parameter also guarantees a minimum size for each leaf, which in regression problems avoids low-variance, over-fitted leaf nodes. With min_samples_split, a node must contain at least min_samples_split training samples for it to be allowed to split.
We can find the optimal parameters for the decision tree by plotting the learning curves shown in Figures 3.5, 3.6 and 3.7. From the learning curves, it appears that the optimal combination of parameters is max_depth=2, min_samples_leaf=500, min_samples_split=5000.
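A sketch of how a tree with these parameters might be fit using scikit-learn is shown below. The synthetic data stands in for the engineered loan features, and the choice of criterion="entropy" is an assumption; either impurity measure could be used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data stands in for the engineered loan features.
X, y = make_classification(n_samples=50000, n_features=7, weights=[0.95],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Pruned tree with the parameter combination chosen from the learning curves.
clf = DecisionTreeClassifier(
    criterion="entropy",       # assumption: Gini gives similar results here
    max_depth=2,
    min_samples_leaf=500,
    min_samples_split=5000,
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```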
After training the decision tree model using the optimal parameters, the resulting tree is shown in Figure 3.8. We can see that the average bad debt rate is 4.7% when pay_amount_tot ≤ 240387.5 is used as the first-level screening condition, while the bad debt rate is 2.1% when pay_amount_num ≤ 3.5 is used as the second-level screening condition; both meet the project target requirement of a bad debt rate of no more than 5%.
Using the decision tree, users can be classified into A, B and C categories as illustrated
in Table 3.1.
[TODO: explain these three tables. The previous table had only three categories: A, B, and C.]
Through the results comparison table we can see that the number of applications approved increased from 1,901 to 5,052, and the approval rate increased from 17.1% to 45.5%, while the bad debt rate was still reduced. This fully demonstrates that the system can significantly improve the pass rate while also significantly reducing the bad debt rate, and that the accuracy and efficiency of the audit are greatly improved.
CHAPTER 4
LOGISTIC REGRESSION-BASED CREDIT SCORING MODEL
Credit scoring is a core tool of consumer credit risk management and is one of the most central management techniques for banks, credit card companies, consumer finance companies, insurance companies, and other business entities involved in consumer credit. It is
widely used in credit card life cycle management, auto loan management, housing loan man-
agement, personal loan management, and other consumer credit management. It plays a
very important role in marketing, credit approval, risk management, account management, and customer relationship management.
Credit scoring models use advanced data mining techniques and statistical analysis methods, drawing on consumers' demographic characteristics, credit history records, behavior records, and transaction records, to mine the behavior patterns and credit characteristics embedded in the data, capture the relationship between historical information and future credit performance, and develop a predictive model that comprehensively assesses a consumer's future credit performance with a credit score. For example, the probability of incurring a bad debt loss in the next year, the potential to generate revenue for the bank, or the probability that the consumer will accept credit card marketing can each serve as a basis for consumer credit management decisions. If
developed scientifically and applied correctly, credit scoring models can provide consumer
credit managers with a large amount of highly predictive information that can help them
develop effective management strategies to effectively develop markets, control risks, and
exploit revenues with a high degree of accuracy, thus achieving high efficiency in consumer
credit business.
Three main approaches are used in credit score modeling. The first is credit analysis with customer classification (profiling and segmentation)
as the core. This is to use some descriptive analysis (such as mean, variance, frequency,
probability distribution, etc.) and exploratory analysis (such as class clustering analysis,
factor analysis, correlation analysis, etc.) to conduct a relatively simple analysis of cus-
tomer credit information, to observe and refine customer behavior characteristics. We also divide customers into categories on this basis. For example, customers are classified according to their income level, or according to the amount of revolving credit usage and the activity of their transactions, etc. The characteristic of this approach is that it obtains insight as a basis for decision making through non-systematic (ad hoc) analysis. This method uses simple algorithms and small amounts of data, and gives only a very limited basis for decision making, so the method is gradually being replaced.
The second is the credit scoring model with predictive modeling as the core. This approach constructs derivative variables reflecting consumer behavior and creditworthiness through deep mining of external information (such as credit bureau data) and internal information (such as credit card transaction information and behavioral information) of consumer credit institutions, and uses advanced mathematical and statistical techniques to predict consumers' future credit performance in a certain area and rank them by a score as a basis for decision making. This method has been
applied earlier and has been used for more than 20 years, and major companies and orga-
nizations have accumulated a lot of relevant experience in using it, so it is still a relatively
mainstream modeling method. Typical methods include data mining techniques using em-
pirical research [15], using network-based models to improve credit score accuracy [16], and related approaches.
The third is a credit scoring model with decision modeling (DM) at its core. This is
a step beyond pure predictive models because it not only predicts the future credit perfor-
mance of a consumer from information about his or her behavioral characteristics, but also
predicts the impact that the credit decision itself has on future performance. The funda-
mental difference between decision models and predictive models is that predictive models
predict the future credit performance of consumers based on their historical behavioral
characteristics, and future credit performance is only a function of consumers’ own credit
characteristics, whereas decision models aim to quantify the impact that different decisions themselves have on future credit performance. The sensitivity to the decision varies from consumer to consumer, which is the so-called interaction effect. In recent years, decision modeling techniques have increasingly been used for comparing the impact of different decisions on business management objectives and then selecting the management decision that is most beneficial to achieving the objectives.
Machine learning techniques, which have become more and more widely used in recent
years, are also widely used in this field. Mhina [18] proposed collecting information related
to cell phone users, and using K-Means clustering method to score the credit of users who
apply for loans through cell phones. Zhang [19] uses deep learning to solve the problem that
the single classification method used in other methods is only good for classifying one type
of data, but poor for classifying other types of data. And the algorithm’s performance has
been significantly improved to enable real-time online analysis of data. More research and
applications use the classical algorithm of logistic regression and its improvements [20–23].
In these approaches, it is common to formulate the credit scoring model as a classification model with the customer's default status as the label.
Linear regression is the simplest regression algorithm in machine learning. Its equation
is expressed as follows.
z = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n
The θ values are collectively referred to as the parameters of the model, where θ_0 is known as the intercept and θ_1 through θ_n as the coefficients. Writing the features as a matrix, x and θ can both be viewed as column vectors, yielding the following.

z = [\theta_0, \theta_1, \ldots, \theta_n] \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x \quad (x_0 = 1)
The task of linear regression is to construct a prediction function that captures the linear relationship between the input feature matrix x and the label value y. The core of constructing the prediction function is to find the parameters of the model, θ^T and θ_0. Through the function z, linear regression uses the input feature matrix X to output a continuous set of predicted label values y_pred, accomplishing various tasks of predicting continuous variables. But what if our labels are discrete variables, in particular discrete variables that take only the values 0 and 1?
We transform the linear regression output z through a function g(z) whose values are distributed in (0, 1), and label the sample as category 0 when g(z) is close to 0 and as category 1 when g(z) is close to 1. This gives a classification model. The link function that accomplishes this is the sigmoid function:

g(z) = \frac{1}{1 + e^{-z}}
Substituting z = θ^T x from linear regression, we obtain the general form of the logistic regression model:

y(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
Linear regression has strict requirements on the data; for example, the labels must follow a normal distribution.
Logistic regression is a good fit for linear relationships. Data with extremely strong
linear relationships between features and labels, such as those for credit card fraud, score
card modeling, marketing prediction in e-commerce, and other related data are all strong
points of logistic regression. Although gradient boosting (GBDT) is now available, can be more effective than logistic regression, and is also used by many data consulting companies, logistic regression still dominates in the financial field, and especially in banking.

Computation of logistic regression is fast. For linear data, logistic regression is very quick to fit and compute, and its computational efficiency is better than that of SVM and random forest.
The classification results returned by logistic regression are not fixed 0 and 1, but class
probability numbers presented as decimals. We can therefore use the results of logistic
regression as continuous data. For example, when creating a scorecard, we not only need to
determine whether a customer will default, but also need to give a definite “credit score”.
The calculation of this credit score requires log odds computed from class probabilities; classifiers like decision trees and random forests can produce classification results, but they cannot help us calculate such a score. Logistic regression is essentially a classifier that returns log odds and performs well on linear data, and it is widely used in the field of finance. Its mathematical purpose is to solve for the values of the parameters θ that allow the model to fit the data best, to construct a prediction function, and then to input the feature matrix into the prediction function to calculate the class probability of each sample.
In machine learning modeling, we have to observe how the model performs on the
training set, and how it performs on the test set. We model for optimal model performance
on the test set, so the model evaluation metrics are often used to measure the performance
of the model on the test set. However, logistic regression must solve for the parameters θ based on the training data, and it wants the trained model to fit the training data as closely as possible; that is, the closer the model's prediction accuracy on the training set is to 100%, the better.
The loss function of logistic regression is derived from maximum likelihood estimation:

J(\theta) = -\sum_{i=1}^{m} \big( y_i \log(y_\theta(x_i)) + (1 - y_i) \log(1 - y_\theta(x_i)) \big)
where θ is the set of parameters being solved for, m is the number of samples, y_i is the true label of sample i, and y_θ(x_i) is the value returned by the logistic regression for sample i, computed from the parameters θ and the feature values x_i of that sample. Our goal is to solve for the θ that minimizes J(θ). Since we pursue the minimum of the loss function to make the model perform optimally on the training set, another problem may arise: if the model performs excellently on the training set but poorly on the test set, the model is over-fitting. Although logistic regression and linear regression
are inherently under-fitting models, we still need techniques to control over-fitting to help
us tune the models, and control of over-fitting in logistic regression is achieved through
regularization.
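Written in terms of the predicted probabilities, the loss above can be transcribed directly into code. The following numpy sketch is illustrative only:

```python
import numpy as np

# Direct numpy transcription of J(theta) above, in terms of the predicted
# probabilities y_hat = y_theta(x_i). Illustrative sketch only.
def log_loss(y_true, y_hat, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

print(log_loss([1, 0, 1], [0.9, 0.2, 0.7]))
```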
Regularization is the process used to prevent over-fitting of the model. Two options are commonly used, L1 regularization and L2 regularization, which add multiples of the L1 and L2 norms of the parameter vector θ to the loss function, respectively. This added norm is called the "regularization term" or "penalty term". We use it to adjust the degree of model fitting. The L1 norm is the sum of the absolute values of the parameters in the parameter vector, and the L2 norm is the square root of the sum of the squares of the parameters in the parameter vector.
J(\theta)_{L1} = C \times J(\theta) + \sum_{j=1}^{n} |\theta_j| \quad (j \ge 1)

J(\theta)_{L2} = C \times J(\theta) + \sqrt{\sum_{j=1}^{n} \theta_j^2} \quad (j \ge 1)
L1 and L2 regularization, however, do not have the same effect. As the regularization strength gradually increases (C gradually
becomes smaller), the parameter values will gradually become smaller, but L1 regularization
will compress the parameter to 0, and L2 regularization will only make the parameter as
small as possible and will not take it to 0. In the process of gradual strengthening of L1
regularization, the parameters of features that carry little information and contribute little
to the model will become zero faster than those of features that carry a large amount of information and contribute greatly to the model. L1 regularization is therefore essentially a feature selection process, which governs the "sparsity" of the parameters. The
stronger the L1 regularization, the more parameters in the parameter vector become 0,
the more sparse the parameters are, and the fewer features are selected to prevent over-
fitting. Thus, if the feature size is large and the data dimension is high, we tend to use L1 regularization. L2 regularization, in contrast, tends to make each feature contribute to the model in some small way during the strengthening
process, but features that carry little information and do not contribute much to the model
will have parameters very close to 0. Usually, if our main purpose is only to prevent over-fitting, choosing L2 regularization is sufficient. However, if the model still over-fits after L2 regularization is applied, L1 regularization can be considered.
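In scikit-learn, the choice between the two penalties and the strength C is made when constructing the logistic regression estimator. The parameter values below are illustrative placeholders, not the values tuned in this project:

```python
from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength: a smaller C means stronger
# regularization. The values here are illustrative placeholders only.
lr_l2 = LogisticRegression(penalty="l2", C=1.0, solver="lbfgs", max_iter=1000)
lr_l1 = LogisticRegression(penalty="l1", C=0.5, solver="liblinear", max_iter=1000)
# With L1, some coefficients are driven exactly to zero, which acts as a
# built-in feature selection; with L2 they only shrink toward zero.
```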
The data set is obtained from the daily business data of financial institutions, after
data desensitization and integration, and the data dictionary is as shown in Table 4.1.
One problem that can exist with realistic data, especially in the banking industry, is duplicated records; if present, these can simply be dropped. The second problem to face is the missing values. The features we need to fill in here are the number of dependents and the monthly income. For the number of dependents, only about 2.5% is missing, which can be considered to be removed directly or filled using the
mean value. We fill the missing dependents with the mean value. MonthlyIncome is missing
by almost 20%, and we know that MonthlyIncome is necessarily an important factor for
credit scores, so this feature has to be filled in. Here we can consider using Random Forest
to fill in the missing values of this feature. Random forest uses the idea that if A, B, and
C can predict Z, then A, C, and Z can be used to fill a missing value for B. For a data set with n features, where one feature T has missing values, we use the feature T as the label,
and the other n − 1 features and the original label form a new feature matrix. Then, for
T , the part of it without missing values is our y train, which has both labels and features,
and the part with the missing values, only features without labels, is the part we need to
predict. The value of feature T that is not missing corresponds to the other n − 1 features
plus the original label: X train. The value of feature T that is not missing: y train. The
value of feature T that is missing corresponds to the other n − 1 features plus the original
label: X test. The value of feature T that is missing and we need to predict is y test. This
approach works very well for cases where a certain feature is missing but the other features
are complete.
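A minimal sketch of this random-forest filling idea is shown below. The column names partly follow the data set (MonthlyIncome, age, dependents); the toy values and the generic 'label' column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Treat the feature with missing values (MonthlyIncome) as the label, and use
# the remaining features plus the original label to predict the missing values.
df = pd.DataFrame({
    "age": [25, 40, 31, 58, 47, 36],
    "dependents": [0, 2, 1, 3, 0, 1],
    "label": [0, 0, 1, 0, 1, 0],
    "MonthlyIncome": [3000.0, 8000.0, np.nan, 6500.0, np.nan, 5200.0],
})

features = df.drop(columns="MonthlyIncome")
target = df["MonthlyIncome"]
known, unknown = target.notna(), target.isna()

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(features[known], target[known])               # rows with known income
df.loc[unknown, "MonthlyIncome"] = rf.predict(features[unknown])
print(df)
```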
Real data will always have some outliers. We do not want to exclude all outliers, rather
many times, the outliers are our focus, for example, the brands with super high purchase
volume in a Black Friday sale, or the topics that excite many students in the classroom,
these are the ones on which to focus our research and observation.
To deal with outliers on a daily basis, we use box plots or the 3σ rule to find outliers. But
in the banking data, the outliers we want to exclude are not numbers that are too high
or too low, but numbers that do not make sense: for example, income cannot be negative,
but a very high level of income is reasonable and can exist. So in the banking industry, we
tend to just use ordinary descriptive statistics to see whether the data are abnormal. Note
that this approach can only be performed with a limited number of features. If there are hundreds of features and it is not possible to successfully reduce the dimensionality or to examine them individually, this approach becomes impractical.
As can be seen from the descriptive statistics table above, the age characteristic has
a minimum value of 0, which is clearly not in line with banking requirements, even for
children’s accounts, which must be at least 8 years old. If a person’s age is found to be 0,
it can be judged to be definitely caused by an entry error, and can be treated as a missing value and removed. The data set also contains three overdue-count features, which record "the number of times in the past two years that were 35-59 days past due but did not
develop worse”, “the number of times in the past two years that were 60-89 days past due
but did not develop worse”, and “the number of times in the past two years that were 90
days past due”, respectively. These three indicators are 2 in 99% of the distribution, but
the maximum value is 98, which seems very strange. For instance, how could a person be
overdue 35-59 days 98 times in the past two years? There are 225 samples where this is the
case, and we can basically determine that these samples are anomalies of some kind and should be removed from the data set.
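The descriptive-statistics check and the two filters described above can be expressed in pandas as in the following sketch; the frame contents and the overdue-count column name are stand-ins, since the real column names are not reproduced here.

```python
import pandas as pd

# Toy stand-in for the loan data; 'past_due_30_59' stands in for the real
# overdue-count column name, which is not reproduced here.
df = pd.DataFrame({
    "age": [0, 34, 52, 41],
    "MonthlyIncome": [5000, 7200, 3100, 9800],
    "past_due_30_59": [1, 98, 0, 2],
})

# Descriptive statistics with extra percentiles make the odd values stand out.
print(df.describe(percentiles=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]).T)

df = df[df["age"] > 0]              # an age of 0 can only be an entry error
df = df[df["past_due_30_59"] < 90]  # drop the anomalous rows with counts like 98
print(df)
```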
In the descriptive statistics, we can observe that the data are not uniform in magnitude,
and there is an extremely skewed distribution. Although logistic regression has no distri-
bution requirement for the data, we know that gradient descent can converge faster if the
data obey a normal distribution. However, here we do not standardize the data, nor do we unify the magnitudes. Why not?
Whatever the algorithm has to say, whatever the requirements in statistics, our ultimate
goal is to serve the business. Now we have to create scorecards. A scorecard is a card that
is used by business people to rate new customers based on various information they fill
in. We need to make "grades" for our data; for example, age 20-30 is one grade, age 30-50 is another grade, and the score for each grade is different. Once we standardize the data to a uniform magnitude, the data size and range will change, and it may be difficult for business people to understand what it means for a standardized age to lie between .00328 and .00467. We could indeed enter all the information, standardize it, and then feed it into the algorithm for computation, but ultimately the results must be judged and used by business people. Due to business requirements, when making the scorecard, we should try to
keep the data in its original form: age is a number from 8 to 110, income is a number ranging from 0 to some very large value, and even though the scales are not uniform, we do not standardize the data.
Some samples are imbalanced. For instance, while everyone is trying to prevent credit
risk, not many people are actually defaulting on their loans. And, banks don’t really refuse
all the people who will default. Many people will pay back the money; they just forget the due date. Many people are reluctant to owe money, but they are really struggling at the time and cannot free up the money, so overdue payments happen; once they have the money, they pay it back. A bank is willing to lend money as long as the borrower can pay it back in the end, because it earns income (interest) on the loan. So what banks really want to identify is the people who default in bad faith. But the number of such people is tiny, hence the sample is unbalanced. This has always been a pain point in banking modeling; we always want to identify the minority class correctly.
The most used balancing method in logistic regression is upsampling, which we use to balance the samples in this data set.
To create a score card is to grade the individual features so that the salesperson can score a new customer based on the information the customer fills out. Therefore, an important step in the process of creating a scorecard is to divide the bins. It can be said that dividing bins is the most difficult part and the core idea of the score card. The essence of dividing bins is to discretize continuous variables, so that people with different attributes can be divided into different categories (with different scores); in essence this is more
similar to clustering. We have to answer two questions in dividing bins: First, how many
bins are appropriate to divide? Second, what kind of effect should be achieved by splitting
the bin?
The number of bins is not known a priori, but since we are discretizing continuous
variables, the number of bins must not be too large, and it is better to keep it under
ten. For making score cards, it is better to have 4-5 bins. Making a continuous variable
discrete must be accompanied by loss of information, and the fewer the bins, the greater
the information loss. In order to measure the amount of information on the features and
the contribution of the features to the prediction function, the banking industry has defined the information value (IV):

IV = \sum_{i=1}^{N} (\text{good}\%_i - \text{bad}\%_i) \times WOE_i

In this equation N is the number of bins on the feature, i represents each bin, good% is the proportion of the good customers (those with label 0) in this bin relative to all good customers across the whole feature, bad% is the proportion of the bad customers (that is, those who will default, with label 1) in this bin relative to all bad customers across the feature, and WOE_i is given by the following equation.

WOE_i = \ln\left(\frac{\text{good}\%_i}{\text{bad}\%_i}\right)

Table 4.2: Relationship between IV values and the contribution of features to the model
WOE is an indicator used in the banking industry to measure the probability of default. In essence it is the logarithm of the ratio of good customers to bad customers. WOE is computed per bin: the larger the WOE, the larger the share of good customers in that bin. IV, by contrast, is computed for the whole feature; it represents the amount of information carried by the feature and the contribution of this feature to the model, as summarized in Table 4.2.
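Given a binned feature and its labels, WOE and IV follow directly from the formulas above. The short pandas sketch below uses toy data in which every bin contains both good and bad customers:

```python
import numpy as np
import pandas as pd

# 'binned' holds the bin of each sample and 'label' is 1 for bad customers.
data = pd.DataFrame({
    "binned": ["b1", "b1", "b1", "b2", "b2", "b2", "b3", "b3", "b3", "b3"],
    "label":  [0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
})

counts = data.groupby("binned")["label"].agg(bad="sum", total="count")
counts["good"] = counts["total"] - counts["bad"]
good_pct = counts["good"] / counts["good"].sum()   # share of all good customers
bad_pct = counts["bad"] / counts["bad"].sum()      # share of all bad customers

woe = np.log(good_pct / bad_pct)                   # WOE per bin
iv = ((good_pct - bad_pct) * woe).sum()            # IV of the whole feature
print(woe)
print("IV =", iv)
```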
As can be seen, a larger IV is not always better. We should find a balance between the value of the IV and the number of bins: the fewer bins there are, the smaller the IV is bound to be, because the information loss is greater. Therefore, we bin each feature, calculate the WOE value of the feature in every bin, and use the resulting IV values to choose a suitable binning.
We want people with different attributes to have different scores, so we want the at-
tributes of people in the same bin to be as similar as possible, and the attributes of people
in different bins to be as different as possible, that is, the industry often says “large differ-
ences between groups, small differences within groups". For the scorecard, this means that we want the probability of default to be similar for people within a bin, and the probability of default to vary widely across bins, which means that the WOE gap between bins should be large and the proportion of bad customers (bad%) should differ from bin to bin. We can then use the chi-square test to compare the similarity between two adjacent bins: if the p-value of the chi-square test between the two bins is very large, they are very similar and we can combine them into one bin.
Based on this idea, we summarize the steps we take to divide a feature into different
bins.
1. We first divide the continuous variable into a larger number of groups, for example, dividing tens of thousands of samples into 100 groups, or 50 groups.

2. Make sure that samples from both categories are included in each group; otherwise the WOE value cannot be computed.

3. We perform a chi-square test on adjacent groups, and groups with large p-values from the chi-square test are combined, until the number of groups is no larger than the preset number of bins.

4. We let a feature be divided into up to twenty bins, observe how the IV value changes under each number of bins, and find the most suitable number of bins.

5. After dividing the bins, we calculate the WOE and the bad% value of each bin and record them.

After all these steps are completed, we can pick features by binning each feature and examining its IV value.
To create equal-frequency bins we use a quantile-based binning function. The quantile-based function is capable of handling one-dimensional data only and returns the upper and lower bounds of each bin. In order to be able to calculate the IV values, we have to make sure that there are good and bad samples in every bin. Not all features can use this binning function; for example, features that take only a few distinct values can only be binned into separate groups, and the features that cannot be automatically binned in this way are handled manually.
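Putting the pieces together, the following sketch illustrates the general procedure of starting from equal-frequency (quantile) bins and merging the most similar adjacent bins with the chi-square test until the desired number of bins remains. It uses synthetic data and a simplified merging loop, so it should be read as an outline of the idea rather than the project's implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic feature and labels standing in for the real data; the label
# probability depends on the feature value so the bins differ in bad rate.
rng = np.random.default_rng(0)
values = rng.normal(size=1000)
prob_bad = 0.2 + 0.6 / (1 + np.exp(-values))
labels = (rng.random(1000) < prob_bad).astype(int)

# Step 1: start from a large number of equal-frequency (quantile) bins.
binned = pd.qcut(values, q=20, duplicates="drop")
table = pd.crosstab(binned, labels)       # good/bad counts per bin
table.index = table.index.astype(str)

# Steps 2-3: repeatedly merge the most similar pair of adjacent bins
# (largest chi-square p-value) until the target number of bins remains.
n_bins = 5
while len(table) > n_bins:
    pvals = [chi2_contingency(table.iloc[i:i + 2].values)[1]
             for i in range(len(table) - 1)]
    i = int(np.argmax(pvals))
    table.iloc[i] = table.iloc[i] + table.iloc[i + 1].values
    table = table.drop(table.index[i + 1])   # row labels keep the left bin's name

# Steps 4-5: WOE and IV for the resulting bins can then be computed as in the
# previous sketch.
print(table)
```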
After binning, the next thing we want to do is to calculate the WOE for each bin and replace the original data with the WOE values, since we will be modeling on the WOE-encoded data; in other words, every sample is represented by the WOE of the bin it falls into. Once the bins are fixed, the processing of the test set is very simple: we only need to map the already computed WOE values onto the test set.
After the modeling is completed, we validate the model and verify its predictive power using accuracy and the ROC curve [24]. The next step is to convert the logistic regression into a standard scorecard. The score is computed as
Score = A − B × log(odds)
where A and B are constants; A is called the "offset" (compensation) and B the "scale", and log(odds) represents the log odds of a person defaulting. The logistic regression model outputs exactly this quantity in the form θ^T x, the product of the parameters and the feature matrix, so log(odds) is obtained directly from our fitted parameters. The two constants can be found by substituting the scores of two known score/odds pairs into the expression above.

Assuming that the score assigned at odds of 1/60 is 600 and PDO = 20, then the score at odds of 1/30 is 620. Substituting these into the linear expression above gives:

600 = A − B × log(1/60)

620 = A − B × log(1/30)
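Since these are two linear equations in the two unknowns A and B, they can be solved directly; for example:

```python
import numpy as np

# Score = A - B * log(odds) at the two anchor points gives a 2x2 linear system.
coeffs = np.array([[1.0, -np.log(1 / 60)],
                   [1.0, -np.log(1 / 30)]])
scores = np.array([600.0, 620.0])
A, B = np.linalg.solve(coeffs, scores)
print(A, B)   # under this sign convention B comes out negative

# A new customer's score is then A - B * log_odds, where log_odds comes from
# the fitted logistic regression.
```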
Once the program is run, we can get the scoring grid shown in Table 4.3. Once this form is generated, it can be handed over to a salesperson or entered into an online scoring system.
CHAPTER 5
DEALING WITH DATA IMBALANCE
In many real-world scenarios, data has an unbalanced category distribution. That is, one category in the data contains much more data than the other categories [25]. In the loan scenario, we focus on the category imbalance problem in binary classification, since creditworthy customers contribute far more data than fraudulent customers.
Consider a simple example: 100,000 positive samples (normal customers, labeled 0) and 1,000 negative samples (fraudulent customers, labeled 1), so the ratio of positive to negative samples is 100:1. If this data is brought directly into the model to learn, with each gradient descent step using the full set of samples, the weight of the negative samples is less than 1/100; even if the information in the negative samples is not learned at all, the accuracy rate is still more than 99%. So obviously we must not measure the effectiveness of the model in terms of accuracy alone. In practice, even if we use KS or AUC to measure the performance of the model, there is still no guarantee that the model can learn the negative samples well. Instead, we need a classifier that keeps a high accuracy on the positive cases while still correctly identifying the negative (fraudulent) cases.
In financial risk control business scenarios there are three primary ways to handle sample imbalance.

1. Use human experience - Force data balancing based on business experience.

2. Use an algorithm - From the perspective of the algorithm, the algorithm is optimized considering the variability of the cost of different misclassification cases, mainly based on cost-sensitive learning.

3. Use the data - From the perspective of the data, resample the data set with statistical sampling methods so that the class distribution becomes more balanced.
Forcing data balancing mainly involves artificially modifying the sample distributions
in the data based on business experience. The advantage of this method is that it is simple
and easy to be understood by business staff. The disadvantage is also obvious, that is, it
lacks a theoretical basis for support and relies heavily on the business experience of data
processors. It tends to work well on some data sets but performs poorly on others, and the results are difficult to generalize.
Downward probing is the most direct solution to sample imbalance in the risk
control scenario. Downward probing means lending to lower-scoring applicants who would otherwise
be rejected, sacrificing part of the revenue to accumulate bad samples for subsequent
model learning. This is the most direct and effective of all the methods, but not every
company is willing to take on this additional bad debt. In addition, as mentioned before,
the samples used in subsequent model iterations become biased as the business develops.
Alternatively, rejected samples can be labeled with semi-supervised methods and then brought into the model for training; common variants include the following:
• Experience-based labeling - A fixed percentage of the lowest-scoring customers is labeled as bad based on experience; for example, the lowest-scoring fifty percent
are considered bad, or forty percent, and so on. The effect is not as good as downward probing, but it carries no additional
overhead.
• Trial relabeling - Change the way the samples are labeled, bring them into the model, and check whether the change helps the model.
• Model screening - The unlabeled samples are labeled by other, already-trained models, and the new model is then trained on them. Many companies use the current model
to make predictions on the rejected samples and then continue training on those predictions; this
is highly discouraged and the results are generally very poor. Unsupervised
algorithms, or models trained on much older samples, are better choices for producing such labels.
Of the three approaches above, this thesis mainly adopts the sampling-based approach, using the Borderline-SMOTE algorithm described later in this chapter; cost-sensitive learning is reviewed next for completeness.
Cost-sensitive learning can be divided into three categories according to how the cost information is incorporated into the solution.
1. The first category is direct construction. This class of approaches focuses on how to make the classifier itself cost-sensitive.
• Decision trees: Knoll [26] and Bradford [27] proposed cost-sensitive pruning
methods for decision trees. Bradford studied how to prune decision trees under
cost-sensitive conditions to minimize the loss, and showed that pruning based
on the Laplace method achieved the best results. Holte investigated cost-sensitive
node splitting for decision trees [28].
• Boosting: Fan and others [29] proposed the cost-sensitive Boosting algorithm AdaCost.
• Neural networks: Kukar and Kononenko [31], for example, proposed a new backpropagation
algorithm that enables neural networks to satisfy cost-sensitivity requirements.
2. The second category is based on adjusting the classification results. This class of methods learns a classification model
with traditional learning methods and then adjusts its classification results according to Bayesian risk theory in order to minimize losses. Compared to the first
class of cost-sensitive learning methods, this approach has the advantage that it does
not depend on the specific classifier used. Domingos [32] proposed a procedure called
MetaCost, which treats the underlying classifier as a black box without making any
assumptions about or changes to it; MetaCost can be applied to any number of classes and to arbitrary cost matrices.
3. The third category is based on traditional learning models and works by adjusting the distribution or weights of the training
data. Chan and Stolfo [33] proposed a stratification model that adjusts
unevenly distributed training data into data with a uniform distribution of positive
and negative examples. Zadrozny [34] proposed a cost-proportionate idea that adjusts the
weights of the training data; in practice this is similar to the Boosting algorithm
and can be implemented either by adjusting weights in the classification model or by
subsampling (a brief sketch of the weighting idea follows this list).
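As a small illustration of the weight-adjustment idea, most scikit-learn classifiers accept per-class costs directly; this is only a hedged sketch with a hypothetical 100:1 cost ratio and hypothetical X_train/y_train data, not the exact procedure of [33] or [34]:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Misclassifying a fraudulent customer (class 1) is treated as 100x more
# costly than misclassifying a normal customer (class 0).
cost_sensitive_lr = LogisticRegression(class_weight={0: 1, 1: 100}, max_iter=1000)
cost_sensitive_tree = DecisionTreeClassifier(class_weight={0: 1, 1: 100})

cost_sensitive_lr.fit(X_train, y_train)
cost_sensitive_tree.fit(X_train, y_train)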
The third, sampling-based class of methods addresses the imbalance by re-sampling the data set
with statistical methods. It does not require
much modification to existing algorithms, is simple to implement, and can achieve good
results in practice. Therefore, this type of method is the one mainly used in this project.
For imbalanced data, one of the easiest approaches is random sampling. There are two main variants:
random up-sampling, which randomly duplicates minority class samples, and random down-sampling,
which randomly selects a small number of samples from the majority class and merges them with
the original minority class samples into a new training data set.
The biggest advantage of random sampling is its simplicity, but the disadvantages are also
obvious: after up-sampling some samples are repeated in the data set, so the trained
model tends to over-fit; after down-sampling the final training set loses data, so the model learns only part of the overall pattern.
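Both variants are available off the shelf; a minimal sketch assuming the imbalanced-learn package and hypothetical X_train/y_train variables:

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Up-sampling: duplicate minority samples until the classes are balanced.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# Down-sampling: randomly drop majority samples instead.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)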
The basic idea of the SMOTE algorithm is to analyze the minority class samples and add new, synthetic samples to the data set based on them. The algorithm proceeds as follows:
1. For each sample xi ∈ Smin in the minority class, where Smin ⊂ S, calculate the distance
between that point and the other minority class sample points to obtain its
k nearest neighbors (that is, apply the KNN algorithm to the minority class points).
2. A sampling ratio is set according to the sample imbalance ratio to determine the
sampling multiplier, and for each minority class sample xi , a number of samples are
randomly selected from its k nearest neighbors, assuming that the selected nearest
neighbor is x′ .
3. For each randomly selected nearest neighbor x′, a new sample xnew is constructed as

xnew = xi + (x′ − xi) × δ

where δ ∈ [0, 1] is a random number (a NumPy version of this step is sketched below).
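The interpolation in step 3, written out for a single minority sample xi and one chosen neighbor (a sketch; variable names are illustrative):

import numpy as np

rng = np.random.default_rng(42)

def smote_interpolate(x_i, x_neighbor):
    # Create one synthetic sample on the segment between x_i and its neighbor.
    delta = rng.uniform(0.0, 1.0)   # the random number delta in [0, 1)
    return x_i + (x_neighbor - x_i) * delta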
The SMOTE method is an oversampling technique that overcomes some of the drawbacks of random oversampling and enriches the original data. However, the SMOTE algorithm has
obvious drawbacks: on the one hand, it increases the possibility of overlap between classes
(since a new sample is generated for every minority class sample, sample overlap is easily produced), and on the other hand, it generates some samples that provide no
useful information.

Table 5.1: Comparison of results before and after processing of imbalanced data sets

         train ks              evl ks
Before   0.4482453222991063    0.4198642457760936
After    0.4859866821876423    0.44085108654818894
The difference between Borderline-SMOTE and SMOTE is that the original SMOTE generates new samples for
all minority class samples, whereas the improved method first determines the
borderline samples of the minority class according to certain rules and then generates new samples only for those.
A simple rule for determining the borderline is that a minority class sample with more than
half majority class samples among its K nearest neighbors is a border sample. Intuitively,
new samples are generated only for those minority class samples that are surrounded by many majority class samples.
Suppose xi ∈ Smin is a sample in the minority class. Its K nearest neighbors place it into one of three categories (a usage sketch follows the list):
1. NOISE - all nearest neighbor samples of that minority sample are from other classes.
2. DANGER - at least half (but not all) of the nearest neighbor samples are from the majority class.
3. SAFE - all nearest neighbor samples are from the same class.
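A minimal sketch of how Borderline-SMOTE can be applied, assuming the imbalanced-learn implementation and hypothetical X_train/y_train variables (the parameters used in this project may differ):

from imblearn.over_sampling import BorderlineSMOTE

# kind="borderline-1" synthesizes new points only around the DANGER samples;
# m_neighbors is the neighbourhood used to label samples NOISE/DANGER/SAFE,
# k_neighbors the neighbours interpolated against when creating new samples.
sampler = BorderlineSMOTE(kind="borderline-1", k_neighbors=5, m_neighbors=10,
                          random_state=42)
X_balanced, y_balanced = sampler.fit_resample(X_train, y_train)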
Table 5.1 shows the results. As can be seen, there is an improvement of about 2 percentage
points in KS on the final cross-time validation set, while the improvement of about 4 percentage points on the
training set is more in line with expectations. As the ROC curves
in Figures 5.1 and 5.2 show, after oversampling with the SMOTE-based algorithm the performance of
the model increases on both the training and the test set, but the gain on the training
set is larger than on the test set. This may introduce a risk of overfitting, although the risk remains within a manageable
range.
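For reference, the KS statistic reported in Table 5.1 is the maximum gap between the true positive rate and the false positive rate over all score thresholds; a minimal sketch of its computation, assuming scikit-learn and hypothetical model/data names:

from sklearn.metrics import roc_curve

def ks_statistic(y_true, y_score):
    # KS is the largest vertical distance between the TPR and FPR curves.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return (tpr - fpr).max()

# e.g. train_ks = ks_statistic(y_train, model.predict_proba(X_train)[:, 1])
#      evl_ks   = ks_statistic(y_evl, model.predict_proba(X_evl)[:, 1])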
CHAPTER 6
PERFORMANCE IMPROVEMENT
In this chapter we present how the models were improved using two methods: building ensemble models and combining a GBDT with logistic regression (GBDT+LR). In ensemble learning, the predictions of several models are gathered to work as a whole [37]. There are several ways to create an ensemble. In
this section we consider several methods: simple Voting/Averaging (for classification and
regression respectively), Boosting, and Bagging.
Voting or averaging the results predicted by each different model directly, without
changing the models themselves, is a simple but effective way of building an ensemble. For example, for a classification problem, suppose there are three mutually independent models, each with a 70%
correct rate, and majority voting is used. Then the final correct rate will be
0.7 × 0.7 × 0.7 + 3 × 0.7 × 0.7 × 0.3 = 0.784.
That is, simply voting on the results yields an improvement of more than 8 percentage points in accuracy.
This is a simple probabilistic argument: if more models are voted on, the
results will obviously be better. But the prerequisite is that the models are independent of each
other and their results are uncorrelated; the more similar the models are, the smaller the benefit of combining them.
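The same calculation generalizes to any odd number of independent models; a tiny sketch that reproduces the 78.4% figure above:

from math import comb

def majority_vote_accuracy(n_models, p):
    # Probability that a majority of n independent models, each correct with
    # probability p, produces the correct answer.
    majority = n_models // 2 + 1
    return sum(comb(n_models, k) * p**k * (1 - p)**(n_models - k)
               for k in range(majority, n_models + 1))

print(majority_vote_accuracy(3, 0.7))   # 0.784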
Boosting is a method that concatenates various weak classifiers in series, where the training of each classifier
depends on the results of the previous classifier; the process is shown in Figure 6.1. As
with all fusion approaches, it does not consider the structure of the individual weak
classifier models themselves, but rather manipulates the training data (sample set) and
the way the classifiers are connected in order to obtain a smaller error. However, to keep the errors
of the final strong classifier balanced, the individual classifiers chosen are generally relatively weak
ones, because once a classifier is too strong it will dominate the subsequent results.
Bagging (also known as Bootstrap Aggregation) [41] also does not operate on the model
itself but acts on the sample set. It selects training data at random with replacement,
constructs a classifier on each bootstrap sample, and finally combines them. Unlike the
Boosting method, in which the classifiers are interdependent and run serially, in the Bagging
method the base learners do not have strong dependencies on each other and can be generated and run in parallel.
As an experiment, we combined a Decision Tree and an SVM using Voting, AdaBoost, and Random Forest. Table 6.1 shows the
result: each ensemble method improved the accuracy over that of the individual classifiers.
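A hedged sketch of the three ensembles compared in Table 6.1, assuming scikit-learn and hypothetical train/test splits (the exact base-model parameters used in the thesis are not shown here):

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Soft-voting ensemble of a decision tree and an SVM (probability=True is
# required so the SVM can contribute class probabilities).
voting = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=5)),
                ("svm", SVC(probability=True))],
    voting="soft",
)
boosting = AdaBoostClassifier(n_estimators=100)          # Boosting
bagging = RandomForestClassifier(n_estimators=100)       # Bagging-style baseline

for name, clf in [("voting", voting), ("adaboost", boosting), ("random_forest", bagging)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))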
Gradient Boosted Decision Tree (GBDT) is a commonly used nonlinear model [42, 43].
It is based on the boosting idea from ensemble learning: each iteration builds a new
decision tree in the direction that reduces the gradient of the residuals, so one tree is generated per iteration. A GBDT has the advantage of finding many
distinguishing features and feature combinations, and the paths of its decision trees can be
used directly as input features for Logistic Regression (LR), eliminating the need to search for
features and feature combinations manually. This way of generating LR features with a GBDT
was popularized by work on click-through-rate prediction [44].
In GBDT+LR, the GBDT is used to extract features from the training set to form new training
input data, and LR is used as the classifier on this new input, in the following
three steps (a code sketch follows the list).
1. The GBDT is first trained on the original training data to obtain a binary classifier;
grid search is used to find the best combination of parameters.
2. In contrast to the usual practice, when the trained GBDT makes predictions the
output is not the final binary probability value; instead, the leaf node that the sample
falls into in each tree of the model is recorded. When constructing the new training
data from the GBDT output, one-hot encoding is used: for a GBDT with n weak
classifiers and m leaf nodes in total, each training sample is transformed into an m-dimensional
sparse vector with n elements being 1 and the remaining m − n elements being 0.
3. After the new training data is constructed, the label (output) data from the original
training data is used together with it to train the Logistic Regression classifier.
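A minimal sketch of these three steps, assuming scikit-learn and hypothetical X_train/y_train/X_test variables; the results in Table 6.2 are labeled LGB+LR, suggesting a LightGBM booster was used in the project, but the flow is the same:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Step 1: fit the GBDT on the original features (parameters would normally
# come from a grid search, as described above).
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gbdt.fit(X_train, y_train)

# Step 2: record which leaf each sample falls into for every tree, then
# one-hot encode the leaf indices to obtain the new sparse feature matrix.
leaves_train = gbdt.apply(X_train)[:, :, 0]     # shape: (n_samples, n_trees)
leaves_test = gbdt.apply(X_test)[:, :, 0]
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_leaves = encoder.fit_transform(leaves_train)
X_test_leaves = encoder.transform(leaves_test)

# Step 3: train the logistic regression on the encoded leaf features.
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_leaves, y_train)
pred = lr.predict_proba(X_test_leaves)[:, 1]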
The procedure is diagrammed in Figure 6.2. Tree1 (the light blue nodes in the figure)
and Tree2 (the red nodes) are two trees learned by the GBDT model, and x (the black
node) is an input sample. After traversing the two trees, x falls on one leaf node
of each tree; each leaf node corresponds to a one-dimensional LR
feature, so by traversing the trees all LR features corresponding to the sample are obtained.
Since each path of a tree is a distinguishing path obtained by minimizing the mean squared
deviation or a similar criterion, the features and feature combinations derived from the paths
are relatively discriminative, and the effect is theoretically not inferior to manual, experience-based
feature engineering.

Table 6.2: Comparison of results before and after applying GBDT (LGB) + LR

         train ks              evl ks
Before   0.4482453222991063    0.4198642457760936
LGB+LR   0.4687230745337274    0.44510149222090417
Figures 6.3 and 6.4 and the comparison in Table 6.2 show that with the model ensemble,
the KS of the model improves by roughly 2 to 2.5 percentage points on both the training and
the test data, and the gap between training and test performance shrinks.
This shows that the model fusion technique also plays a role in improving the performance of the model.
Combined with the results achieved by using the Borderline-SMOTE algorithm on the
unbalanced data set in the previous chapter, model fusion is another technique worth considering
to improve the training of a financial risk control model.
Since financial risk control data sets are inherently imbalanced,
we suggest treating the imbalance first; if the results are still poor after that,
model fusion can then be considered for further performance
improvement.
At the same time, because GBDT is fast and suitable for online analysis scenarios,
GBDT+LR can be considered first when there is a demand for online analysis.
CHAPTER 7
CONCLUSIONS AND FUTURE WORK
This thesis presents a financial risk control system built using machine learning meth-
ods. The system uses daily business data from financial institutions and some data provided
by third parties. The system was deployed in the financial industry and achieved good re-
sults in internal testing. After a period of testing and optimization, the project will be
officially launched in early 2022, which will improve the efficiency and accuracy of the daily
loan business of the company. The system has the following features.
• Developed a decision tree based loan review system, which performs the review of loan
applications by predicting the bad debt rate of users. For the data set selection of this
model, we did not use traditional personal credit data, but used oil consumption data,
which is closely related to the applicants’ daily life, as the basis of modeling. Various
data pre-processing strategies were also used to clean the dataset. In response to the
actual situation that the original dataset had few effective features, feature derivation
was carried out, and the original 18 features were derived into 70 effective features,
which greatly improved the classification performance of the model. The bad debt rate was reduced while the speed of the review increased.
• Developed a credit score card system for individual users. The traditional credit
scoring system is mainly based on credit data provided by third-party credit bureaus
(FICO). However, this data can only be used for credit scoring based on the user’s
historical consumption data and income status, which has the disadvantage of slow
information update and cannot fully reflect the user’s credit status. This scoring
system combines personal information data and consumption data on top of credit
data, and uses a logistic regression algorithm to perform credit scoring. In the data set
processing, to suit the characteristics of the logistic regression algorithm, continuous
data is converted into discrete data by binning. The binning
method combines equal-frequency binning with the calculation of WOE values, which
achieves good results. A score card is also generated for the use of operational staff.
The model has the advantages of strong interpretability, timeliness and high accuracy.
• For data imbalance, Borderline-SMOTE, based on the SMOTE algorithm, is used to balance the data set.
• Model fusion (GBDT+LR) is used to improve the overall performance of the model.
In the future, the following improvements will be made to enhance performance and
extend functionality.
The current infrastructure of this system is built on the AWS public cloud, in which
AWS services such as EC2, S3, Aurora, DynamoDB, and RDS are mainly used to
obtain the required computing and storage capabilities. However, as the project progressed,
the amount of data acquired and required for daily processing became larger and larger.
Therefore, we plan to build our own OLAP eco-system due to the cost, performance, data
sensitivity, and data security considerations. The design is shown in Figure 7.1.
In this system, Hadoop distributed file system HDFS and Spark distributed computing
framework are used to solve the problem of large data sets and improve the security of the
data.
• Using the distributed file system HDFS to avoid data loss and provide reliable distributed storage.
• Using ZooKeeper to achieve high availability and stability of the distributed computing framework.
• Building a hybrid warehouse system using Hive and HBase to support faster data
queries (using Hive) and distributed storage of terabyte-scale data (using HBase).
• Building MongoDB clusters and implementing data migration from the relational database
to MongoDB.
• Building an Apache Flume-based system for multi-source data collection with high throughput and reliability.
Almost all financial institutions need anti-fraud capabilities, and fraud is becoming more
and more sophisticated, with each different type of fraud treated as a separate category. In
addition to the diverse and continuously changing fraud schemes, fraud detection generally
faces the following difficulties:
• In most cases the data is not labeled, so the various mature supervised learning
methods cannot be used.
• It is very difficult to distinguish between noise and anomalies, and doing so often requires domain expertise.
• When data from multiple scams are mixed together, it is even harder to distinguish between
the different scam types, because we do not have a clear definition of each
type of fraud.
• Even if we do have historical fraud data, that is, we can use supervised learning with
labels, there is a significant risk: a model learned from such historical data can only
detect scams similar to those seen before and is powerless against variants and
entirely new scams.
We plan to use unsupervised learning to build the model and to have domain experts
validate our predictions and provide feedback so that the model can be adjusted in a timely
manner.
56
Bibliography
financial risk management: A survey,” IEEE Access, vol. 8, pp. 203 203–203 223, 2020.
system for the actuarial profession,” British Actuarial Journal, vol. 18, no. 1, pp. 91–
121, 2013.
[3] M. Beyhaghi and J. P. Hawley, “Modern portfolio theory and risk management: as-
[4] W.-Y. Lin, Y.-H. Hu, and C.-F. Tsai, “Machine learning in financial crisis prediction: a
press, 2018.
based risk assessment analysis for smes loan grant,” in 2021 International Conference
pp. 1–5.
[8] Y. Yu, “The application of machine learning algorithms in credit card default pre-
[9] G. Kou, Y. Peng, and G. Wang, “Evaluation of clustering algorithms for financial risk
analysis using mcdm methods,” Information Sciences, vol. 275, pp. 1–12, 2014.
[10] U. P. Shukla and S. J. Nanda, “Designing of a risk assessment model for issuing credit
card using parallel social spider algorithm,” Applied Artificial Intelligence, vol. 33,
[11] C. Sudha and D. Akila, “Credit card fraud detection system based on operational &
transaction features using svm and random forest classifiers,” in 2021 2nd International
data science endeavors,” in 2015 IEEE international conference on data science and
[13] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, no. 1, pp. 81–106,
1986.
[14] G. De’ath and K. E. Fabricius, “Classification and regression trees: a powerful yet
simple technique for ecological data analysis,” Ecology, vol. 81, no. 11, pp. 3178–3192,
2000.
[15] W. Li and J. Liao, “An empirical study on credit scoring model for credit card by using
scoring accuracy,” in 2018 IEEE 5th International Conference on Data Science and
fuzzy decision support system for credit scoring,” Neural Computing and Applications,
[18] M. C. Mhina and F. Labeau, “Using machine learning algorithms to create a credit
scoring model for mobile money users,” in 2021 IEEE 15th International Symposium
[19] Z. Zhang, K. Niu, and Y. Liu, “A deep learning based online credit scoring model for
p2p lending,” IEEE Access, vol. 8, pp. 177 307–177 317, 2020.
[20] F. Shen, R. Wang, and Y. Shen, “A cost-sensitive logistic regression credit scoring
[21] W. H. Eko, M. Heti, and S. I. Teguh, “Improving credit scoring model of mortgage
financing with smote methods in sharia banking,” Russian Journal of Agricultural and
[22] H. Sutrisno and S. Halim, “Credit scoring refinement using optimized logistic regres-
[23] R. Y. Goh, L. S. Lee, H.-V. Seow, and K. Gopal, “Hybrid harmony search–artificial
intelligence models in credit scoring,” Entropy, vol. 22, no. 9, p. 989, 2020.
[24] W. J. Krzanowski and D. J. Hand, ROC curves for continuous data. Crc Press, 2009.
knowledge and data engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
induction,” Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 322–349,
2001.
[28] R. C. Holte and C. Drummond, “Cost-sensitive classifier evaluation using cost curves,”
pp. 26–29.
[29] Z. Fan and M. Ni, “Individualized boosting learning for classification,” Optik, vol. 126,
[30] P. Geibel, U. Brefeld, and F. Wysotzki, “Perceptron and svm learning with generalized
cost models,” Intelligent Data Analysis, vol. 8, no. 5, pp. 439–455, 2004.
[31] M. Kukar and I. Kononenko, “Reliable classifications with machine learning,” in Eu-
[33] P. K. Chan and S. J. Stolfo, “On the accuracy of meta-learning for scalable data
mining,” Journal of Intelligent Information Systems, vol. 8, no. 1, pp. 5–28, 1997.
[34] B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabil-
ities are both unknown,” in Proceedings of the seventh ACM SIGKDD international
[36] H. Han, W.-Y. Wang, and B.-H. Mao, “Borderline-smote: a new over-sampling method
the decisions of machine learning models: A case study,” IEEE Access, vol. 7, pp.
2013.
[40] D. Sarkar and V. Natarajan, Ensemble Machine Learning Cookbook: Over 35 prac-
tical recipes to explore ensemble machine learning techniques using Python. Packt
[41] H. Kim and Y. Lim, “Bootstrap aggregated classification for sparse functional data,”
[42] J. H. Friedman, “Stochastic gradient boosting,” Computational statistics & data anal-
[43] ——, “Greedy function approximation: a gradient boosting machine,” Annals of statis-
[44] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers
the Eighth International Workshop on Data Mining for Online Advertising, 2014, pp.
1–9.