Credit risk modelling and prediction:
Logistic regression versus machine learning boosting algorithms
Linnéa Machado and David Holmer

Bachelor’s thesis in Statistics

Advisor
Tatjana Pavlenko

2022
Abstract

The use of machine learning methods in credit risk modelling has been shown
to yield good results in terms of increasing the accuracy of the risk score
assigned to customers. In this thesis, the aim is to examine the performance of
the machine learning boosting algorithms XGBoost and CatBoost, with logistic
regression as a benchmark model, in terms of assessing credit risk. These
methods were applied to two different data sets, where grid search was used
for hyperparameter optimization of XGBoost and CatBoost. The evaluation
metrics used to examine the classification accuracy of the methods were model
accuracy, ROC curves, AUC and cross-validation. According to our results, the
machine learning boosting methods outperformed logistic regression on the test
data for both data sets, and CatBoost yielded the highest results in terms of both
accuracy and AUC.

Acknowledgements

We would like to express our gratitude towards our advisor Tatjana Pavlenko
for the guidance and support that you have provided for this thesis.

Contents
1 Introduction
1.1 Background

2 Theory
2.1 Binary classification
2.2 Logistic regression
2.3 Decision trees
2.4 Boosting
2.5 XGBoost
2.6 CatBoost
2.7 Model Evaluation Metrics
2.7.1 Model Accuracy
2.7.2 ROC curves and AUC
2.7.3 Cross-validation

3 Data and method
3.1 Data
3.2 Missing values
3.3 Handling of categorical features
3.3.1 Logistic regression
3.3.2 XGBoost
3.3.3 CatBoost
3.4 Method
3.4.1 Hyperparameter optimization
3.4.2 Model training

4 Results
4.1 Model accuracy
4.2 ROC curves and AUC

5 Discussion

6 Conclusion and further studies

7 Bibliography

8 Appendix

Abbreviations

AUC Area Under the Curve

CatBoost Categorical Boosting

CV Cross-validation

FN False Negatives

FP False Positives

FPR False Positive Rate

LR Logistic Regression

ROC Receiver Operating Characteristic

TN True Negatives

TP True Positives

TPR True Positive Rate

TS Target Statistics

XGBoost eXtreme Gradient Boosting

1 Introduction
This chapter provides a brief overview of the purpose of the thesis. It also con-
tains topics such as an introduction, previous studies and the scope of the study.

1.1 Background
When a bank or a financial institution lends out money, it faces the risk that
some customers will not fulfil their obligation to repay the money.
To minimize this risk, customers are assigned a class
when applying for a loan. Based on certain characteristics they are classified
into a category and given a credit risk score.

The financial crisis of 2007-2008 boosted the debate on how to properly assess and
manage credit risk in commercial banks (Mačerinskienė, Ivaškevičiūtė & Railienė,
2014). With the purpose of decreasing losses as much as possible, several ways
of determining credit risk more accurately have been developed and evaluated.
Algorithms and machine learning techniques such as logistic regression, neural
networks and boosting are widely used in this field to increase the accuracy of
the risk score given to customers.

Gradient boosting algorithms have proven themselves to be very useful in the
task of assessing credit risk. In recent years, several such algorithms have been
developed. They share many similarities but also differ in important ways. How they
handle different types of data and how much computational power they require are
only two examples of many. Considering this, which machine learning boosting
algorithm is the most effective and precise when it comes to predicting credit risk?

The implementation of machine learning algorithms is becoming more and
more widespread across the financial sectors (OECD, 2021). Modelling of credit
risk is no exception. Several methods and techniques have been developed and
deployed to create more robust and accurate assessments and predictions, to
assist decision makers at banks and other financial institutions. As a result,
several articles have been written on the topic of how to create the most
accurate model for assessing credit risk.

One article benchmarks a boosted decision tree model against two alternative
data mining techniques: support vector machines and multilayer perceptron
(João A. Bastos, 2008). He created several credit scoring models used to help
lenders decide whether to grant credit to applicants or not. He used two publicly
available datasets on credit card applications. In this study, the boosted
decision trees outperformed the support vector machine and multilayer perceptron
models in terms of accuracy.

A similar study was conducted by Zhenya Tian et al. (2019). They also com-
pared gradient boosting decision trees to support vector machines and multi-
layer perceptron models, but also included models based on methods such as
logistic regression, decision trees, adaptive boosting and random forest. They
used historical personal consumption loan data of a lending company. After the
data cleaning they used grid search for the adjustment of hyperparameters. To
evaluate the performance results of each model, they looked at accuracy and
Area Under the Curve (AUC). The gradient boosting decision tree model out-
performed all other models, with an AUC of 97%.

Anastasios Petropoulos et al. (2018) compared XGBoost to MXNET. XGBoost
is short for extreme gradient boosting, and is an implementation of gradient
boosted decision trees. MXNET is a deep learning framework used to train and deploy
neural networks. The models were used to classify obligor credit quality. They performed
the comparison on a three-year out-of-time sample of corporate loans.
Their experimental results were also benchmarked against traditional methods
such as linear discriminant analysis (LDA) and logistic regression. They used
AUC as their evaluation metric, where XGBoost received the highest score of
78%, compared to MXNET's 72%, with both significantly outperforming LDA
and the logistic regression model.

As gradient boosting methods have proven to be useful in classification tasks,
Essam Al Daoud (2019) conducted a study with the aim of comparing three
such methods: XGBoost, LightGBM and CatBoost. He used a home credit
dataset without any categorical variables. It is also worth mentioning that the
study was conducted before CatBoost's stable release. The comparison was made
in terms of CPU runtime and AUC, where LightGBM was significantly faster
than the other methods and had the highest AUC of 78.9%, compared to 78.8%
(XGBoost) and 78.8% (CatBoost).

To summarize the previous studies, all have used model accuracy and AUC as
the main performance metrics. According to these metrics, gradient boosting
decision tree models have repeatedly proven to outperform other commonly used
machine learning techniques in the classification task of assessing credit risk.

As mentioned previously, gradient boosting decision trees have been shown to
perform well on classification tasks. XGBoost has been formally tested on
numerous occasions, more times than in the mentioned studies. CatBoost, on
the other hand, has only been formally tested once, to the best of our
knowledge. Considering this, and the rise of machine learning within the financial
sector, this thesis will examine how well the relatively new boosting algorithm,
CatBoost, compares to a more recognized algorithm, XGBoost, and logistic
regression, in the classification task of assessing credit risk. More specifically,
this thesis aims to investigate which boosting algorithm performs best in
the assessment of credit risk, with regard to specific evaluation metrics.

This thesis will be restricted to investigating two boosting algorithms, XGBoost
and CatBoost, and their ability to predict credit risk. Logistic regression will be
used as a benchmark model. The model evaluation metrics used are restricted
to model accuracy, Receiver Operating Characteristic (ROC) curves and AUC. These
metrics will be further discussed in Chapter 2. The datasets are from southern
Germany and the USA, respectively. More details about these datasets will be
covered in Chapter 3.

The second Chapter of this thesis will cover the theory needed to
understand how logistic regression, the boosting algorithms, and the evaluation metrics
work and are calculated. In the third Chapter, the datasets will be more
thoroughly discussed, as well as the methods used to reach our conclusion. The
fourth Chapter contains all the results from the models, such as model accuracy,
ROC curves, and AUC. In the fifth and sixth Chapters, the results will be
discussed and a conclusion will be drawn.

2 Theory
To provide a better understanding of how the models work, this Chapter
explains the theory behind the three models used in this thesis. It
also presents how the models can be evaluated and compared by using model
accuracy, ROC/AUC and cross-validation.

2.1 Binary classification


Binary classification refers to predicting one of two possible class labels
for an input, using a model trained on labeled data. This is a type of
supervised learning, which means that the training data consists of both a vector
of feature variables, also known as features or predictors, and a response, which
in classification is the class variable. Both the features and the class variable
are used to train the model so that it performs well on new, unseen data. Unlike
supervised learning, unsupervised learning refers to the case where the features
are given but the response is unknown (Hastie, Tibshirani & Friedman, 2001).

In this thesis, the two class labels are defined as default and non-default. Each
customer is classified into either the default class or the non-default class.
The set of labels is referred to as the label space $Y \in \{0, 1\}$, where 0 refers
to the non-default class and 1 to the default class. The labeled target is
represented by the random variable $Y_i = y_i$, where $i = 1, 2, \ldots, n$ indexes
the $i$-th instance in a given data set.

The set of features, containing the remaining information about the customers,
is referred to as the feature space $X \in \mathbb{R}^m$, where $m$ denotes the total
number of features. These features can be either numerical or categorical. The
feature vector is given by $X_i$ and takes on a list of observed features
$x_i = [x_{i1}, \ldots, x_{im}]^T$. Given $n$ samples, the data set containing the
labeled targets and the feature vectors is defined by
$D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.

Further on, the three methods used for binary classification in this thesis will
be explained. These are Logistic Regression, XGBoost and CatBoost.

2.2 Logistic regression


Logistic regression is a commonly used classification method for modelling
independent binary response variables, $Y \in \{0, 1\}$. Given an $m$-dimensional set of
features, $X$, where $x_i$ is the feature vector belonging to the $i$-th observation,
it models the conditional probability of the observation belonging to a specific
class. This is modelled using the logistic function (Hastie, Tibshirani &
Friedman, 2001).

$$P(Y_i = 1 \mid X_i = x_i) = \frac{e^{\beta_0 + \beta^T x_i}}{1 + e^{\beta_0 + \beta^T x_i}} \tag{1}$$

where $\beta_0$ denotes the intercept and $\beta$ is the vector of coefficients associated with
the predictors, $\beta = [\beta_1, \beta_2, \ldots, \beta_m]^T$. Rearranging Eq(1) gives

$$\ln \frac{P(Y_i = 1 \mid X_i = x_i)}{1 - P(Y_i = 1 \mid X_i = x_i)} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_m x_{im} = \beta_0 + \beta^T x_i \tag{2}$$

with the natural logarithm of the odds, also known as the log-odds, on
the left side and a linear function of the predictors, $x_i$, on the right side. This
can also be written as

$$\operatorname{logit}(p_i) = \beta_0 + \beta^T x_i \tag{3}$$

where $p_i$ is the probability of observation $i$ belonging to class 1. Since the
logistic model returns a value between 0 and 1, a threshold value is chosen in
order to classify the observations. The selected boundaries are

$$P(Y_i = 1 \mid X_i = x_i) \geq 0.5: \text{class} = 1$$

$$P(Y_i = 1 \mid X_i = x_i) < 0.5: \text{class} = 0$$
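
As a small illustration of Eq(1) to Eq(3) and the 0.5 threshold, the following Python sketch computes the class-1 probability and the resulting class label. The coefficient values and the function names `predict_proba` and `classify` are made up for this example; they are not estimates from the thesis data.

```python
import math


def predict_proba(x, beta0, beta):
    """P(Y = 1 | x) from Eq(1), written in the equivalent form 1 / (1 + exp(-z))."""
    z = beta0 + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, beta0, beta, threshold=0.5):
    """Assign class 1 (default) if the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(x, beta0, beta) >= threshold else 0


# Hypothetical coefficients and one hypothetical customer with two features.
beta0, beta = -1.0, [0.8, -0.5]
customer = [2.0, 1.0]

p = predict_proba(customer, beta0, beta)  # linear predictor is -1 + 1.6 - 0.5 = 0.1
log_odds = math.log(p / (1.0 - p))        # Eq(2): recovers the linear predictor
label = classify(customer, beta0, beta)
```

The log-odds recovered from the probability equal the linear predictor, which is what Eq(2) states.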

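In order to fit the logistic model, we would like to maximize the likelihood
function. By letting $\theta = (\beta_0, \beta)$, with $x_i$ extended by a leading 1 so that
$\theta^T x_i = \beta_0 + \beta^T x_i$, and rewriting Eq(1), we obtain

$$p(x_i, \theta) \equiv P(Y_i = 1 \mid X_i = x_i; \theta) = \frac{1}{1 + e^{-\theta^T x_i}} \tag{4}$$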

In the case of binary classification, the Bernoulli distribution is widely used as
the likelihood function (Wasserman, 2004). This is because
$P(Y_i = 1 \mid X_i = x_i; \theta)$ completely specifies the conditional distribution, where
$Y_i \mid X_i = x_i \sim \text{Bernoulli}(p(x_i, \theta))$. The likelihood function is then expressed as

$$L(\theta) = \prod_{i=1}^{n} p(x_i, \theta)^{y_i} (1 - p(x_i, \theta))^{1 - y_i} \tag{5}$$

and the log-likelihood as

$$l(\theta) = \sum_{i=1}^{n} \{ y_i \ln p(x_i, \theta) + (1 - y_i) \ln(1 - p(x_i, \theta)) \} \tag{6}$$

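With some modification of Eq(6), we can obtain the normalized binary
cross-entropy (Loiseau, 2020)

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \{ y_i \ln p(1 \mid x_i; \theta) + (1 - y_i) \ln p(0 \mid x_i; \theta) \} \tag{7}$$

where $p(1 \mid x_i; \theta) = p(x_i, \theta)$ and $p(0 \mid x_i; \theta) = 1 - p(x_i, \theta)$.
We can also express a vectorized version of the normalized binary cross-entropy.
In this case, $y$ denotes the vector of response variables.

$$J(\theta) = -\frac{1}{n} \left( y^T \ln p(1 \mid x; \theta) + (1 - y)^T \ln p(0 \mid x; \theta) \right) \tag{8}$$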
Cross-entropy will be further discussed in Chapter 2.3. For now, this is the
function we want to minimize by finding the optimal coefficients, $\theta$. To do so, an
iterative approach called the Newton-Raphson method is commonly used. The
coefficients are updated in every iteration, and the update can be described by

$$\theta_{t+1} = \theta_t - H^{-1}(\theta_t) \nabla_\theta J(\theta_t) \tag{9}$$

Here, $\nabla_\theta J(\theta_t)$ is the vector of partial derivatives of the vectorized
cross-entropy function with respect to the components of $\theta$, evaluated at the $t$-th
iteration. $H^{-1}$ is the inverse of the Hessian matrix, which contains the second
derivatives of the vectorized binary cross-entropy function, also with respect to
the components of $\theta$. This process is repeated, with the estimate $\hat{\theta}$
converging towards the minimizer of $J$, until a stopping
criterion has been met.
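
The update in Eq(9) can be sketched in plain Python for the simplest case of an intercept and a single feature, where the 2x2 Hessian can be inverted by hand. This is only an illustrative sketch: the function name `newton_logistic`, the toy data and the stopping rule (a small step size) are assumptions made here, and the models in the thesis were fitted in R.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def newton_logistic(xs, ys, iterations=25, tol=1e-10):
    """Newton-Raphson for theta = (b0, b1), minimizing the cross-entropy J(theta)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of J: (1/n) * sum_i (p_i - y_i) * (1, x_i)
        g0 = sum(p - y for p, y in zip(ps, ys)) / n
        g1 = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / n
        # Hessian of J: (1/n) * sum_i p_i (1 - p_i) * (1, x_i)(1, x_i)^T
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws) / n
        h01 = sum(w * x for w, x in zip(ws, xs)) / n
        h11 = sum(w * x * x for w, x in zip(ws, xs)) / n
        det = h00 * h11 - h01 * h01
        # Newton step H^{-1} * gradient, using the closed-form 2x2 inverse
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (-h01 * g0 + h00 * g1) / det
        b0, b1 = b0 - s0, b1 - s1
        if abs(s0) + abs(s1) < tol:  # stopping criterion: negligible update
            break
    return b0, b1


# Toy data (not linearly separable, so the maximum likelihood estimate is finite).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = newton_logistic(xs, ys)
fitted = [sigmoid(b0 + b1 * x) for x in xs]
```

At convergence the gradient is zero, so the fitted probabilities sum to the number of observed defaults in the toy data.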

2.3 Decision trees


Decision trees are supervised learning algorithms used for both regression
and classification tasks. They are grown by applying decision rules, beginning
at the top of the tree, known as the root node, that split the data into
subsets through binary splits. Figure 2.3 shows a decision tree for regression
containing two internal nodes and three terminal nodes. The terminal nodes, also
known as leaves, present the predicted response. In this example, the root node
represents the feature Years and has the threshold 4.5. An instance with a value less
than the given threshold for this feature ends up in the left leaf, whose prediction is
the mean value of $y_i$ for all the instances in that leaf. If an instance instead has a
value greater than 4.5, it is assigned to the other internal node, represented by
the feature Hits. This splitting process continues for all of the instances until
they end up in one of the leaves (James et al. 2021).

Figure 2.3 Regression tree

The previous example presents a regression tree, where the response is
quantitative. Since this thesis handles a classification task, where the response instead
is qualitative, classification trees will be introduced. In classification trees, the
leaves represent the predicted class label. For example, the customers in this
thesis can end up in a leaf representing either the default or the non-default
class. Apart from the response being different, classification trees are grown in a
similar way to regression trees. However, they require a different splitting
criterion.

Since entropy is one of the measures considered preferable as a
splitting criterion for classification trees, it will be used in this thesis. Given
$K$ classes, the entropy is defined by

$$D = -\sum_{k=1}^{K} \hat{p}_{bk} \log \hat{p}_{bk} \tag{10}$$

where $0 \leq -\hat{p}_{bk} \log \hat{p}_{bk}$ due to $0 \leq \hat{p}_{bk} \leq 1$, and $\hat{p}_{bk}$ is given by

$$\hat{p}_{bk} = \frac{1}{n_b} \sum_{x_i \in R_b} I(y_i = k) \tag{11}$$

which is the proportion of class $k$ instances in the $b$-th node, with $R_b$ representing the
region with $n_b$ observations (Hastie, Tibshirani & Friedman, 2001).
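
To make Eq(10) and Eq(11) concrete, the sketch below computes the entropy of a single node from the class labels of the observations that fall into it. The function name `node_entropy` is hypothetical, and the natural logarithm is assumed since the base of the logarithm is not specified above.

```python
import math
from collections import Counter


def node_entropy(labels):
    """Entropy of one node: class proportions from Eq(11) plugged into Eq(10)."""
    n_b = len(labels)
    counts = Counter(labels)  # only classes present in the node, so log(0) never occurs
    return -sum((c / n_b) * math.log(c / n_b) for c in counts.values())


pure_node = [0, 0, 0, 0]    # all non-default: no uncertainty
mixed_node = [0, 1, 0, 1]   # evenly mixed: maximal uncertainty for two classes
skewed_node = [0, 0, 0, 1]

entropies = [node_entropy(node) for node in (pure_node, mixed_node, skewed_node)]
```

A pure node has entropy 0 and an evenly mixed two-class node has entropy ln 2, so a split that produces purer child nodes lowers the entropy.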

2.4 Boosting
Boosting is an ensemble technique that combines weak learners to
create a stronger one, with the purpose of improving the performance of the
model (Schapire & Freund, 2012). The reason why we have chosen to use boosting
methods is that these types of models have many advantages over
logistic regression. For example, boosting models often handle missing
data well and require less data preparation. Since there are many different
types of boosting models, they usually have different advantages over
each other.

In the following sections, the theory behind the two boosting models included in
this thesis is explained.

2.5 XGBoost
XGBoost (eXtreme gradient boosting) is an implementation of the gradient
boosting algorithm and has been recognized for its successful performance and
speed due to its algorithmic optimizations (Chen & Guestrin, 2016). To provide
a better understanding of how XGBoost works, a mathematical derivation of
XGBoost according to Chen & Guestrin (2016) follows.

Given a data set containing $n$ observations and $m$ features,
$D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \{0, 1\}$), a tree ensemble
model uses $H$ additive functions to predict the output:

$$p_i = \sum_{h=1}^{H} f_h(x_i), \quad f_h \in S \tag{12}$$

where $S$ is the classification tree space
$S = \{f(x) = w_{q(x)}\}$ ($q: \mathbb{R}^m \to T$, $w \in \mathbb{R}^T$). In the classification
tree space, $q$ denotes the tree structure with $T$ leaves. Each $f_h$ corresponds to
an independent tree structure $q$ and leaf weights $w$.

The XGBoost objective function consists of a training loss term $l$ and a
regularization term $\Omega$. The training loss term, also known as the loss function,
measures the model fit on the training data, and the regularization term
measures the complexity of the trees. For learning, we aim to minimize this function

$$L(p_i) = \sum_{i=1}^{n} l(y_i, p_i) + \sum_{h=1}^{H} \Omega(f_h) \tag{13}$$

where the regularization term is defined as

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{14}$$

As seen above, the regularization term consists of two parts: the number of
leaves, weighted by the hyperparameter $\gamma$, and the squared leaf weights,
penalized by the hyperparameter $\lambda$.

Given that $p_i^{(t)}$ is the prediction of the $i$-th instance at the $t$-th iteration,
$f_t$ needs to be added in order to minimize the objective function. The objective
can therefore be written as

$$L^{(t)} = \sum_{i=1}^{n} l(y_i, p_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) \tag{15}$$

Optimization of the objective function can be done by using gradient descent.
This is an iterative algorithm used to minimize a given function. In order to
use gradient descent for optimization, the first and second order gradients
with respect to the prediction $p_i$ at the $t$-th iteration are calculated. As we
do not have the derivative of the objective function, its second-order Taylor
approximation is calculated. This gives us

$$L^{(t)} \simeq \sum_{i=1}^{n} [l(y_i, p_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) \tag{16}$$

where $g_i$ and $h_i$ represent the first and second order gradients of the loss
function. They are defined as
$g_i = \partial l(y_i, p_i^{(t-1)}) / \partial p_i^{(t-1)}$ and
$h_i = \partial^2 l(y_i, p_i^{(t-1)}) / \partial (p_i^{(t-1)})^2$.

Removing the constant term $l(y_i, p_i^{(t-1)})$ and replacing $\Omega$ by
its expression given in Eq(14), the objective function can be simplified to

$$\bar{L}^{(t)} = \sum_{i=1}^{n} [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{17}$$

Further, we let the instance set of leaf $j$ be defined as $I_j = \{i \mid q(x_i) = j\}$. The
objective function can be rewritten as

$$\bar{L}^{(t)} = \sum_{j=1}^{T} [(\sum_{i \in I_j} g_i) w_j + \frac{1}{2} (\sum_{i \in I_j} h_i + \lambda) w_j^2] + \gamma T \tag{18}$$

From the previous equations, the optimal weight $w_j^*$ of leaf $j$ for a given tree
structure $q(x)$ can be derived as

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \tag{19}$$

Hence, the optimal value of the objective function corresponding to Eq(19) is

$$\bar{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{(\sum_{i \in I_j} g_i)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \tag{20}$$
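
Eq(19) and Eq(20) can be checked numerically for a single leaf. The sketch below assumes the logistic loss with the prediction $p_i$ interpreted as a raw score on the log-odds scale, in which case $g_i = \sigma(p_i) - y_i$ and $h_i = \sigma(p_i)(1 - \sigma(p_i))$; this choice of loss is an assumption made for the illustration, since the derivation above leaves $l$ generic. All function names and values are hypothetical.

```python
import math


def logloss_grad_hess(y, raw_score):
    """First and second derivative of the logistic loss w.r.t. the raw score."""
    prob = 1.0 / (1.0 + math.exp(-raw_score))
    return prob - y, prob * (1.0 - prob)


def optimal_leaf_weight(gs, hs, lam):
    """Eq(19): optimal weight of a leaf containing the given instances."""
    return -sum(gs) / (sum(hs) + lam)


def leaf_objective(gs, hs, lam):
    """Contribution of one leaf to Eq(20), excluding the gamma * T term."""
    return -0.5 * sum(gs) ** 2 / (sum(hs) + lam)


# Three instances in one leaf, all with previous raw score 0 (probability 0.5).
ys = [1, 1, 0]
grads_hess = [logloss_grad_hess(y, 0.0) for y in ys]
gs = [g for g, _ in grads_hess]
hs = [h for _, h in grads_hess]
lam = 1.0

w_star = optimal_leaf_weight(gs, hs, lam)
# Plugging w_star into the leaf term of Eq(18) should reproduce Eq(20).
direct = sum(gs) * w_star + 0.5 * (sum(hs) + lam) * w_star ** 2
```

Since two of the three instances are defaults, the optimal weight is positive, which moves the raw score for this leaf towards the default class.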

As explained in 2.3, trees are grown greedily by using binary splitting, where each
split results in two new nodes. To further explain how this works for XGBoost,
we define $I_L$ and $I_R$ as the sets of instances in the left and right nodes after a
split. By letting $I$ be the set of instances in the node before the split,
$I = I_L \cup I_R$, the loss reduction is given by

$$L_{split} = \frac{1}{2} \left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \tag{21}$$

The chosen split is the split with maximum loss reduction.
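
The loss reduction of a candidate split can be sketched as follows, with each term taking the same squared-gradient form as the leaf terms in Eq(20). The function names and the gradient values are hypothetical and chosen only to show the effect of $\gamma$.

```python
def structure_score(gs, hs, lam):
    """(sum of g)^2 / (sum of h + lambda) for one node."""
    return sum(gs) ** 2 / (sum(hs) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction from splitting a node into a left and a right child."""
    parent = structure_score(g_left + g_right, h_left + h_right, lam)
    children = structure_score(g_left, h_left, lam) + structure_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradients: the left child holds defaults, the right child non-defaults.
g_left, h_left = [-0.5, -0.5], [0.25, 0.25]
g_right, h_right = [0.5, 0.5], [0.25, 0.25]

gain_no_penalty = split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0)
gain_with_penalty = split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=1.0)
```

With $\gamma = 0$ the split is worthwhile, while a sufficiently large $\gamma$ makes the loss reduction negative, so the same split would not be made.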

In practical applications of XGBoost, $\lambda$ and $\gamma$ can be tuned to improve model
performance. In 3.4.1, a description of how the hyperparameters were tuned in
this thesis will be provided.

2.6 CatBoost
CatBoost (Categorical boosting) is also an implementation of the gradient
boosting algorithm. In addition to handling numerical features, the algorithm
automatically handles categorical features, which makes categorical data usable
with less manual pre-processing. For other boosting methods, the categorical
features are commonly converted into numbers before training. CatBoost is
also known for requiring less parameter tuning, for its training speed, and for
being built to prevent overfitting better than many of the previously released
boosting models (Dorogush, Ershov & Gulin, 2018).

In this thesis, the data sets include both numerical and categorical features.
Since both XGBoost and CatBoost can handle numerical features, the
mathematical explanation that follows will focus on what distinguishes CatBoost from
XGBoost: its ability to handle categorical features.

CatBoost handles categorical features by using ordered target statistics
(Hancock & Khoshgoftaar, 2020). This means that, given a data set of observations,
each category is replaced by an estimate of the expected target value in that category.
This can be mathematically explained by estimating

$$E(Y_i \mid x_i = x_{i,k}) \tag{22}$$

as the mean of the output value $y_i$, where $x_{i,k}$ is the value of the $i$-th categorical
variable on the $k$-th training observation of a given data set $D$. The estimator
for the target value is given by

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{i,j} = x_{i,k}] \cdot y_j + aP}{\sum_{j=1}^{n} [x_{i,j} = x_{i,k}] + a} \tag{23}$$

where $[\cdot]$ denotes the Iverson bracket, that is, $[x_{i,j} = x_{i,k}]$ is equal
to 1 when $x_{i,j} = x_{i,k}$ and 0 otherwise. In classification,
$P$ is a prior probability of encountering the default class, and $a > 0$
is the weight of $P$ (Dorogush, Ershov & Gulin, 2018).
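
The estimator in Eq(23) can be illustrated with a few lines of Python. The sketch implements the formula exactly as written above, summing over all $n$ observations; the ordered variant used inside CatBoost restricts the sums to observations that precede observation $k$ in a random permutation, which is not shown here. The function name, the category values and the choice of prior are hypothetical.

```python
def target_statistic(categories, targets, k, a, prior):
    """Eq(23): smoothed mean target among observations sharing the category of observation k."""
    matching = [y for c, y in zip(categories, targets) if c == categories[k]]
    return (sum(matching) + a * prior) / (len(matching) + a)


# Hypothetical categorical feature and default indicator for four customers.
housing = ["rent", "own", "rent", "rent"]
default = [1, 0, 0, 1]
prior = sum(default) / len(default)  # overall default rate, used as P

ts_rent = target_statistic(housing, default, k=0, a=1.0, prior=prior)
ts_own = target_statistic(housing, default, k=1, a=1.0, prior=prior)
```

The weight $a$ pulls the estimate for rare categories towards the prior, which is why the single "own" observation is not encoded as exactly 0.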

Another difference between CatBoost and XGBoost is that CatBoost uses
oblivious decision trees. This means that all nodes on the same level of the tree,
except the terminal nodes, have the same splitting criterion. Since these trees
are full binary trees, a tree with $n$ levels will have $2^n$ nodes. This means that the
length of the path from the root node to each leaf will be equal to the depth of the tree
(Hancock & Khoshgoftaar, 2020).

2.7 Model Evaluation Metrics


2.7.1 Model Accuracy
Model accuracy is an evaluation metric that measures the ratio of the number
of correctly classified observations to the total number of classified observations
(Mishra, 2018). The model accuracy is given by

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{24}$$

where $TP$ and $TN$ represent the numbers of observations that were correctly classified to
the default and the non-default class, respectively. Further, $FP$ and $FN$ represent the
numbers of observations that were incorrectly classified to the default and the
non-default class, respectively.
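
As a brief worked example of Eq(24), the sketch below counts the four outcomes from a vector of true labels and a vector of predicted labels, with 1 denoting the default class. The function names and the label vectors are made up for the illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with class 1 (default) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Eq(24): share of correctly classified observations."""
    return (tp + tn) / (tp + tn + fp + fn)


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)  # 4 of 6 observations are classified correctly
```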

2.7.2 ROC curves and AUC


Another way to evaluate the performance of a classifier is by observing a ROC
(Receiver Operating Characteristic) curve. Here, the y-axis shows the true
positive rate, also known as the recall, given by

$$TPR = \frac{TP}{TP + FN} \tag{25}$$

The x-axis shows the false positive rate, given by

$$FPR = \frac{FP}{FP + TN} \tag{26}$$

To summarize the performance of a classifier in a single number, the AUC (Area
Under the Curve) is observed. The AUC represents the model's capability to
distinguish between classes and takes on values between 0 and 1. An AUC close to 1
indicates that the model is better at distinguishing between classes than a model
with a lower AUC, where an AUC of 0.5 corresponds to random guessing
(Fawcett, 2006).
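
The construction of the ROC curve from Eq(25) and Eq(26) can be sketched by sweeping the classification threshold over the predicted scores and integrating the resulting curve with the trapezoidal rule. The function names and the four scores below are hypothetical; in practice a library routine would normally be used.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over the distinct scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing classified as default
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted default probabilities
curve = roc_points(y_true, scores)
area = auc(curve)
```

In this toy example one non-default customer is scored above one default customer, so three of the four default/non-default pairs are ranked correctly and the AUC is 0.75.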

2.7.3 Cross-validation
To develop models that perform well, in terms of predictions, on new data as
well as training data, the widespread ‘holdout method’ is often employed. To
avoid overfitting, we do not want to use the same information for training and
validating the models. Instead, some of the data is held out and the remaining data
is used for training. The data that has
been held out is often referred to as a validation data set. In the case where data
is limited, there is a risk that important information that is found only in the
validation set is missed when training the model exclusively on data from
the training set. This might cause bias. To solve this problem and improve on
the holdout method, a technique called k-fold cross-validation is used.

With k-fold CV, the training data set is split into k approximately equally sized
sets, called folds. The model is then trained using k-1 folds, while the remaining fold
serves as a validation data set. This process is repeated k times so that every fold
has served as a validation set once. For every iteration, the test error is computed,
and the performance of the model is based on the average test error among all
k trials (James et al. 2021).

This can be further explained by letting $\kappa: \{1, \ldots, n\} \mapsto \{1, \ldots, K\}$ be a
mapping function that indicates the part to which observation $i$ is randomly
allocated (Hastie, 2008). $\hat{f}^{-k}(x)$ is the fitted function calculated with the
$k$-th part of the training set removed. The loss function for each respective model is
denoted as $L(\cdot)$. The CV estimated prediction error is then

$$CV(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^{-\kappa(i)}(x_i)). \tag{27}$$

In the case of fitting a model containing one or several hyperparameters, let
$f(x, \alpha)$ be a set of models, indexed by the hyperparameters $\alpha$.
$\hat{f}^{-k}(x, \alpha)$ is then the $\alpha$-th fitted model where the $k$-th part of the
training data is not included. Then the CV estimated prediction error is defined as

$$CV(\hat{f}, \alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)). \tag{28}$$

An exhaustive search for the optimal combination of hyperparameters can be
performed. The optimal $\alpha$ minimizes the CV estimated prediction error, and
is denoted as $\hat{\alpha}$. The process of finding $\hat{\alpha}$ will be further discussed in Chapter
3.4.1.
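
The mapping $\kappa$ and the estimate in Eq(27) can be sketched in plain Python. Everything here is illustrative: `kfold_assignment` and `cv_error` are hypothetical helper names, folds are numbered from 0, and the "model" is a trivial majority-class predictor with 0-1 loss rather than any of the models used in the thesis.

```python
import random


def kfold_assignment(n, k, seed=0):
    """kappa: random allocation of n observations to k folds of near-equal size."""
    folds = [i % k for i in range(n)]
    random.Random(seed).shuffle(folds)
    return folds


def cv_error(xs, ys, k, fit, loss, seed=0):
    """Eq(27): average loss, each observation predicted by a model fitted without its fold."""
    kappa = kfold_assignment(len(xs), k, seed)
    total = 0.0
    for fold in range(k):
        train_x = [x for x, f in zip(xs, kappa) if f != fold]
        train_y = [y for y, f in zip(ys, kappa) if f != fold]
        model = fit(train_x, train_y)
        total += sum(loss(y, model(x)) for x, y, f in zip(xs, ys, kappa) if f == fold)
    return total / len(xs)


def fit_majority(train_x, train_y):
    """Toy learner: always predict the majority class of the training folds."""
    majority = 1 if 2 * sum(train_y) >= len(train_y) else 0
    return lambda x: majority


def zero_one_loss(y, prediction):
    return 0.0 if y == prediction else 1.0


xs = list(range(10))
kappa = kfold_assignment(len(xs), k=5)
error_all_zero = cv_error(xs, [0] * 10, 5, fit_majority, zero_one_loss)
error_mixed = cv_error(xs, [0, 0, 0, 0, 0, 0, 0, 1, 1, 1], 5, fit_majority, zero_one_loss)
```

For the hyperparameter version in Eq(28), the same function would simply be called once per candidate value of $\alpha$, with `fit` configured accordingly.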

3 Data and method
This chapter aims to describe the two different data sets that are used in the the-
sis and how missing values are handled. The chapter also consists of a method
part which explains how the data is modeled by using Logistic Regression, XG-
Boost and CatBoost in R.

3.1 Data
As previously mentioned, we will apply the different models to two different sets
of data. The first dataset contains 1000 observations with 20 predictor
variables. This data was collected in Germany between 1973 and 1975,
and is a stratified sample with oversampled bad credits. Of the 1000
observations, 200 defaulted on their loan. The data was retrieved from the UCI Machine
Learning Repository. Further on in the thesis this data will be referred to as data
1.

The second dataset contains characteristics of home equity loan applications.
This is a type of loan where the obligor uses the equity of their
home as collateral. The dataset is from the USA, and dates back to 2016.
It contains 5960 observations, of which 1189 defaulted on their loan or were seriously
delinquent. This data is collected from the book Credit Risk Analytics written
by B. Baesens, D. Roesch and H. Scheule. Further on in the thesis this data
will be referred to as data 2.

3.2 Missing values


The first data set does not contain any missing values, but the second one does.
In the second data set, missing values occur in several of the variables. Both
XGBoost and CatBoost can handle missing values. However, logistic regression
cannot. The variable DEBTINC has the highest proportion of missing values,
approximately 21%. Since this is the only variable in
the data that in some way describes the income and debt of the customer, which
can be seen as important, we do not want to delete the variable. Instead, we
have chosen to delete the individuals with missing values in our data, leaving
us with 3364 observations from the second data set.

3.3 Handling of categorical features


3.3.1 Logistic regression
Logistic regression handles categorical features as dummy variables. One value,
the reference level, is left out. The beta-coefficients for the new variable will in-
dicate the difference in relation to the reference level. This occurs automatically
after converting categorical variables using the as.factor()-function.

3.3.2 XGBoost
In the first data set, all of the variables are integers, implying that we do not
need to transform the data. Unlike the first data set, the second one has different
types of variables, including categorical features. To use the XGBoost algorithm
on this data, one-hot encoding is necessary.
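
One-hot encoding replaces a categorical column with one 0/1 indicator column per level. A minimal sketch in plain Python is given below; the function name `one_hot` and the example job categories are hypothetical and not taken from the data sets described above.

```python
def one_hot(values):
    """Return the sorted levels and one 0/1 indicator row per observation."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical categorical feature for four customers.
job = ["Mgr", "Office", "Mgr", "Other"]
levels, encoded = one_hot(job)
```

Each row contains exactly one 1. Dropping one of the indicator columns would give the dummy coding with a reference level that is described for logistic regression in 3.3.1.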

3.3.3 CatBoost
In R, CatBoost can handle factors. All categorical features are therefore converted
using the as.factor()-function. Categorical features need to be specified in order
for CatBoost to handle them as categorical features.

3.4 Method
3.4.1 Hyperparameter optimization
The values of hyperparameters are used to control the learning process. In
order to find the set of optimal hyperparameters for the machine learning
algorithms, grid search has been performed. Grid search is an exhaustive search
through all possible combinations of hyperparameters, given a set of values for
each hyperparameter (Hsu, Chang & Lin, 2003). The best combination of
hyperparameters, which will be used in the final models, is the combination that
minimizes the loss function, measured by cross-validation on the training set.
We will be using 10-fold cross-validation with 10 repeats for each combination
of hyperparameters. Below are tables describing the hyperparameters that are
to be tuned for both machine learning algorithms used in this thesis.

Table 3.1 Hyperparameters (XGBoost)

Name              Description
nrounds           Number of trees to be built
max_depth         Maximum depth of each tree
eta               Learning rate
gamma             Regularization parameter (see Chapter 2.5)
lambda            Regularization parameter (see Chapter 2.5)
min_child_weight  Stop further partitioning if the sum of observation weights
                  is less than the value of the hyperparameter

Table 3.2 Hyperparameters (CatBoost)

Name           Description
iterations     Number of trees to be built
depth          Depth of each tree
learning_rate  Learning rate
l2_leaf_reg    Regularization parameter, analogous to λ (see Chapter 2.5)
rsm            Percentage of features to use at split selections

For further explanation, the learning rate is the size of the step taken at each
iteration towards the minimum of the loss function. A high learning rate
requires fewer iterations, but also increases the risk of overshooting the minimum
of the loss function. Intuitively, a low min_child_weight implies that subsequent
learners focus more on misclassified observations. This can help to improve
the model accuracy, but may also cause overfitting. RSM is short for Random
Subspace Method. The value of this hyperparameter is the proportion of
explanatory variables available to the model at each split selection, when
explanatory variables are selected repeatedly at random. A lower RSM
increases the training speed, but an RSM below 1 adds another
stochastic element to the training process.
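
The trade-off for the learning rate can be illustrated with a toy example. The sketch below is a one-dimensional analogy in Python, not boosting itself: it minimizes f(x) = x² by plain gradient descent, and the function name gradient_descent, the starting point and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, start=10.0, steps=20):
    """Minimise f(x) = x**2 by repeated steps against the gradient 2*x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


slow = gradient_descent(0.01)     # small steps: still far from the minimum at 0
fast = gradient_descent(0.4)      # larger steps: very close to 0 after 20 steps
unstable = gradient_descent(1.1)  # steps too large: overshoots and moves away
```

With the same number of steps, the small learning rate has not yet reached the minimum, the moderate one has, and the largest one jumps past the minimum on every step and ends up further away than where it started.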

Unlike the machine learning algorithms, logistic regression does not contain
any hyperparameters to be tuned. The logistic regression models will
be fitted using different numbers of iterations, until no further improvement in
the evaluation metrics can be seen.

As discussed in Chapter 2.7.3, we will search for the optimal set of
hyperparameters α, denoted α̂. This is the combination of hyperparameters
that minimizes the cross-validation estimate of the prediction error when
conducting the grid search with 10-fold cross validation and 10 repeats for each
combination. This set of hyperparameters, α̂, will be used in our final models.

3.4.2 Model training


Given the optimal set of hyperparameters, the final models are trained using
10-fold cross validation. All models are trained using the 'caret' package in
R. We then use these models to predict the credit risk for the observations in
the test set, calculate the model accuracy and AUC, and plot the
ROC curves.
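
The two test-set metrics can be sketched in a few lines of Python. This is an illustration rather than the R code used for the thesis: the function names, the 0.5 classification threshold, and the example labels and probabilities are assumptions. The AUC is computed through its rank interpretation, the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of observations whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, for illustration only.
y_test = [1, 0, 1, 0, 1, 0]
p_test = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]
test_accuracy = accuracy(y_test, p_test)
test_auc = auc(y_test, p_test)
```

Accuracy depends on the chosen threshold, whereas the AUC only depends on how the predicted probabilities rank the observations.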

Table 4.1 Accuracy (Data 1)

Model      LR      XGBoost   CatBoost
Training   0.829   0.817     0.836
Test       0.790   0.813     0.817

Table 4.2 Accuracy (Data 2)

Model      LR      XGBoost   CatBoost
Training   0.931   0.991     0.999
Test       0.935   0.949     0.946

4 Results
This chapter presents the results of applying logistic regression, XGBoost and
CatBoost to the training and test data of the two data sets.

4.1 Model accuracy


Table 4.1 shows the accuracy of the models for data 1. On the test data,
CatBoost outperformed the other models in terms of accuracy. CatBoost correctly
predicted 81.7% of the observations in the test data set. XGBoost had an
accuracy of 81.3%, and the logistic regression model had an accuracy of 79.0%. This
means that both machine learning algorithms outperformed our benchmark
model, logistic regression. Logistic regression also had the largest difference in
accuracy between the training and test data sets. It had an accuracy of 82.9%
on the training data set, second to CatBoost's 83.6%. XGBoost had the lowest
accuracy on the training set, 81.7%.

Table 4.2 shows the accuracy of the models for data 2. All models had a higher
accuracy compared to the models for data 1. Logistic regression had a test
accuracy of 93.5%, and was outperformed by both machine learning algorithms. Unlike
the models for data 1, XGBoost had the highest test accuracy when using data 2.
XGBoost had an accuracy of 94.9% compared to CatBoost's 94.6%. Looking at
the training data, logistic regression scored lower than on the test data, with
an accuracy of 93.1%. Both machine learning algorithms almost achieved an
accuracy of 100% on the training data, with 99.1% and 99.9% for XGBoost and
CatBoost respectively.

Table 4.3 AUC (Data 1)

Model      LR      XGBoost   CatBoost
Training   0.829   0.817     0.863
Test       0.676   0.720     0.731

Table 4.4 AUC (Data 2)

Model      LR      XGBoost   CatBoost
Training   0.806   1         1
Test       0.790   0.905     0.925

4.2 ROC curves and AUC


Table 4.3 shows the models' AUC scores for data 1. We can see that the
logistic regression model received a test AUC of 0.676, which means that it was
outperformed by the machine learning algorithms in terms of AUC as well.
CatBoost received a higher AUC than XGBoost, 0.731 compared to 0.720. We can
also see that the logistic regression model once again had the greatest difference
between scores on the training and test data sets, having an AUC of 0.829 on the
training data. The machine learning algorithms also had considerably different
AUC scores between the data sets, where CatBoost had an AUC of 0.863 on the
training data, compared to XGBoost's AUC of 0.817.

Figure 4.3 ROC curves comparing the models for data 1

Figure 4.3 shows the ROC curves of the models, with the AUC for the test data
given in Table 4.3.

Table 4.4 shows the models' AUC scores for data 2. The machine learning
algorithms outperformed the benchmark model, logistic regression, in terms of
AUC for this set of data as well. The logistic regression model received a test
AUC of 0.790. CatBoost had the highest AUC, 0.925, while XGBoost had an
AUC of 0.905. On the training data, both machine learning algorithms had an
AUC of 1 or close to 1. The logistic regression model had an AUC of 0.806 on
the training data.

Figure 4.3 ROC curves comparing the models for data 2

5 Discussion
This chapter provides an analysis of the results from the previous chapter.

This thesis aimed to compare the machine learning boosting algorithms XGBoost
and CatBoost with logistic regression as a benchmark, using two different sets of
credit risk data. As in the study by Al Daoud (2019), mentioned in Section 1.2,
XGBoost and CatBoost yielded very similar results on the evaluation
metrics. The machine learning boosting algorithms outperformed
the benchmark model in both accuracy and AUC on the test data.

Looking at the results, the models fitted using data 2 achieved much higher
accuracy and AUC compared to those fitted using data 1. As previously
mentioned in Chapter 3.1, data 1 was collected between 1973 and 1975.
Since then it has been widely used in different studies regarding credit scores,
but it does not come with much background information. There are also severe
errors in the coding information. We can safely say that the data was not
collected with the intention of being used to construct machine learning models
with recently developed algorithms such as XGBoost and CatBoost. We
believe this is the primary reason why the models fitted using this
set of data performed worse in terms of accuracy and AUC than those of data 2.

Another noticeable aspect of the results is that the difference in accuracy
between training and test data is larger for the machine learning models fitted
on data 2 than for those fitted on data 1. This implies that the models fitted
on data 2 show some overfitting tendencies. The reason for this can be derived
from the results of our hyperparameter optimization. Although the models fitted
on data 2 had deeper trees, and were therefore more complex and contained more
weights, the regularization terms (described in Chapter 2.5) suggested by the
grid search were more or less the same across all machine learning models. In
other words, the more complex models fitted on data 2 were regularized no more
strongly than the simpler models fitted on data 1, which made them more prone
to overfitting. Data 2 yielded more complex models because it contained more
relevant features.
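
The size of these differences can be read directly from Tables 4.1 and 4.2. The short Python sketch below computes the gap between training and test accuracy for every model; the variable names are illustrative, and the numbers are copied from the two tables.

```python
# Accuracy values copied from Tables 4.1 and 4.2: (training, test).
accuracy_table = {
    ("data 1", "LR"): (0.829, 0.790),
    ("data 1", "XGBoost"): (0.817, 0.813),
    ("data 1", "CatBoost"): (0.836, 0.817),
    ("data 2", "LR"): (0.931, 0.935),
    ("data 2", "XGBoost"): (0.991, 0.949),
    ("data 2", "CatBoost"): (0.999, 0.946),
}

# A positive gap means the model scored higher on training than on test data.
gaps = {key: round(train - test, 3) for key, (train, test) in accuracy_table.items()}
```

For the boosting models the gap grows from 0.004 (XGBoost) and 0.019 (CatBoost) on data 1 to 0.042 and 0.053 on data 2, while logistic regression shows a gap of 0.039 on data 1 and -0.004 on data 2.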

In Chapter 2.6, when discussing the differences between CatBoost and
XGBoost, it was mentioned that CatBoost is known for requiring less parameter
tuning, training faster and being better at preventing overfitting than many
previously released models. Fitting two models with CatBoost is not sufficient
evidence to claim that this is not true, but in these two cases, none of the
above-mentioned benefits were realized. With regard to data 2, the grid search
suggested roughly four times the number of iterations, with a slightly lower
learning rate, for CatBoost compared to XGBoost. Therefore, it took longer to
tune the parameters and train the model. Looking at the results, the differences
between the training and test data in terms of accuracy were greater for the
CatBoost models than for XGBoost, implying that the CatBoost models were more
prone to overfitting. However, this could also be the result of not finding the
optimal set of hyperparameters for the models. Although both authors'
computers performed grid searches for weeks during the writing of this thesis,
we cannot be completely certain that the optimal set of hyperparameters was
found, but it was not for lack of trying.

When observing the results, it can also be seen that the accuracy for both data
sets was higher than the AUC. When using accuracy as an evaluation metric,
it is important to consider that imbalanced data tends to yield
higher accuracy. Although the data sets used in this thesis are not
highly imbalanced, they are still somewhat imbalanced, which needs to be taken
into account when analyzing the results. The ROC curves and AUC
can perhaps be seen as more reliable evaluation metrics in that case, since they
are not biased by the data being imbalanced. For future work, it could also be
interesting to add other evaluation metrics, such as a confusion matrix, to
evaluate the algorithms.
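
The effect of imbalance on accuracy can be illustrated with a deliberately extreme case. The Python sketch below builds a confusion matrix for a classifier that always predicts the majority class. The 80/20 class split, the function name confusion_matrix and the coding 1 = default are assumptions made for illustration and do not describe the data sets in this thesis.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives, with 1 as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


# Hypothetical sample: 80 non-defaults (0) and 20 defaults (1).
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100  # a classifier that always predicts the majority class

cm = confusion_matrix(y_true, y_pred)
accuracy = (cm["tp"] + cm["tn"]) / len(y_true)
sensitivity = cm["tp"] / (cm["tp"] + cm["fn"])  # share of defaults detected
```

This classifier reaches an accuracy of 0.8 without detecting a single default, which is the kind of inflated accuracy a confusion matrix makes visible.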

In terms of AUC, both machine learning algorithms outperformed the benchmark
model, logistic regression. This is in line with the results of the previous
studies discussed in this thesis. CatBoost also outperformed XGBoost on both
sets of data, which is the opposite of the findings in Al Daoud's (2019) study.
As mentioned when previously discussing this study, it used a data set that
contained no categorical variables, which could be a reason why our results
differ. However, looking at the results we can also see that CatBoost
outperformed XGBoost in terms of AUC by a smaller margin on data 1, the data set
that contained the most categorical variables. This is contrary to our
expectations, but could be explained by the limitations of data 1 discussed
earlier in this chapter: the lack of relevant features made it difficult for
any type of model to significantly outperform the others.

An aspect that should be considered regarding the results for the second data set
is that the observations with missing values were deleted. This was done for the
benchmark model to be comparable with the machine learning boosting
algorithms, despite their ability to handle missing values. Since the deletion of
data resulted in fewer observations, the outcome of using the whole data set for
the machine learning boosting algorithms could be expected to differ from the
one given in the results. Even though the second data set still contains many
observations after the deletion, a larger data set tends to reduce the risk of
overfitting. Possibly, the difference in accuracy between the training and the
test set could have been smaller if the whole data set were used.

6 Conclusion and further studies
To summarize, because model accuracy is biased by somewhat imbalanced
data, we expect the AUC to be a more reliable evaluation metric than model
accuracy. According to the AUC scores, CatBoost outperformed both XGBoost
and the benchmark model, logistic regression, on both sets of data. Largely due
to the age of the first data set, it is not believed to resemble any sets
of data used in the financial industry today. Because logistic regression is not
able to handle missing values, we also decided to remove all observations
containing missing values. This is a factor that could have an effect on our
results. Considering that missing values are common in the real world, it would
be interesting to see how the machine learning algorithms perform on such
data. We, the authors, also believe that we lacked the time and computational
power required to claim with certainty that we found the optimal
set of hyperparameters. However, according to our results and evaluation
metrics, CatBoost outperformed XGBoost in the task of assessing credit risk.

For further studies, it would be interesting to include other commonly used
machine learning algorithms such as LightGBM, and try these on various sets of
data, containing both numerical and categorical variables, and varying in size,
balance of the target variable, and amount of missing values. Retention and
exclusion of certain features would also be an interesting aspect. This would
also require more computational power to ensure that the optimal set of
hyperparameters is found.

7 Bibliography
Mishra, Aditya. 2018. Metrics to Evaluate your Machine Learning Algorithm,
Towards Data Science [Blog], 24 February. Available at:
https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234.

Al Daoud, Essam. 2019. Comparison between XGBoost, LightGBM and CatBoost
using a home credit dataset. International Journal of Computer and
Information Engineering, 13 (1), pp. 6-10.

Bastos, Joao. 2008. Credit scoring with boosted decision trees. MPRA paper,
(8156). Available at: https://mpra.ub.uni-muenchen.de/8156/

Chen, Tianqi & Guestrin, Carlos. 2016. XGBoost: A Scalable Tree Boosting
System. Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. KDD '16. New York, NY, USA:
ACM, pp. 785–794. Available at: http://doi.acm.org/10.1145/2939672.2939785.

Dorogush, Anna Veronika; Ershov, Vasily & Gulin, Andrey. 2018. CatBoost:
gradient boosting with categorical features. NIPS’18: Proceedings of the 32nd
International Conference on Neural Information Processing Systems, pp. 6639-
6649. Available at: arXiv:1810.11363.

Fawcett, Tom. 2006. An introduction to ROC analysis. Pattern Recognition
Letters, 27 (8), pp. 861-874. Available at:
https://doi.org/10.1016/j.patrec.2005.10.010.

Gao, Ge; Wang, Hongxin and Gao, Pengbin. 2021. Establishing a Credit Risk
Evaluation System for SMEs Using the Soft Voting Fusion Model. Risks, 9 (11),
p.202.

James, Gareth; Witten, Daniela; Hastie, Trevor & Tibshirani, Robert. 2021. An
Introduction to Statistical Learning: with applications in R. New York: Springer.

Hancock, John T. & Khoshgoftaar, Taghi M. 2020. CatBoost for big data: an
interdisciplinary review. Journal of Big Data, 7, 94. Available at:
https://doi.org/10.1186/s40537-020-00369-8.

Hastie, Trevor; Tibshirani, Robert & Friedman, Jerome. 2001. Elements of
Statistical Learning: data mining, inference and prediction. New York: Springer.

Hsu, Chih-Wei; Chang, Chih-Chung and Lin, Chih-Jen. 2003. A practical guide
to support vector classification.

Loiseau, Jean-Christophe B. 2020. Binary cross-entropy and logistic regression,
Towards Data Science [Blog], 1 June. Available at:
https://towardsdatascience.com/binary-cross-entropy-and-logistic-regression-bf7098e75559
[Retrieved 2022-01-07].

Mačerinskienė, Irena; Ivaškevičiūtė, Laura and Railienė, Ginta. 2014. The
financial crisis impact on credit risk management in commercial banks. KSI
transactions on knowledge society, 7 (1), pp. 5-16.

OECD. 2021. Artificial Intelligence, Machine Learning and Big Data in
Finance: Opportunities, Challenges, and Implications for Policy Makers.
Available at:
https://www.oecd.org/finance/artificial-intelligence-machine-learning-big-data-in-finance.htm.

Petropoulos, Anastasios; Siakoulis, Vasilis; Stavroulakis, Evaggelos and
Klamargias, Aristotelis. 2019. A robust machine learning approach for credit risk
analysis of large loan level datasets using deep learning and extreme gradient
boosting. IFC Bulletins chapters, 49.

Schapire, Robert E. & Freund, Yoav. 2012. Boosting: Foundations and
Algorithms. Cambridge, Massachusetts: The MIT Press.

Tian, Zhenya; Xiao, Jialiang; Feng, Haonan and Wei, Yutian. 2020. Credit
risk assessment based on gradient boosting decision tree. Procedia Computer
Science, 174, pp.150-160.

Wasserman, Larry. 2004. All of statistics: a concise course in statistical
inference (Vol. 26). Edited by Casella, George; Fienberg, Stephen & Olkin, Ingram.
New York: Springer.

8 Appendix
List A1: Explanation of variables in data 1.

• laufkont: Status of the debtor's checking account with the bank (categorical).

• laufzeit: Credit duration in months.

• moral: History of compliance with previous or concurrent credit contracts
(categorical).

• verw: Purpose for which the credit is needed (categorical).

• hoehe: Credit amount in Deutsche Mark.

• sparkont: Debtor's savings (categorical).

• beszeit: Duration of debtor's employment with current employer (ordinal;
discretized quantitative).

• rate: Credit installments as a percentage of debtor's disposable income
(ordinal; discretized quantitative).

• famges: Combined information on sex and marital status (categorical). Sex
cannot be recovered from the variable, because male singles and female
non-singles are coded with the same code (2); female widows cannot be easily
classified, because the code table does not list them in any of the female
categories.

• buerge: Is there another debtor or a guarantor for the credit? (categorical)

• wohnzeit: Length of time (in years) the debtor has lived in the present
residence (ordinal; discretized quantitative).

• verm: The debtor's most valuable property, i.e. the highest possible code
is used. Code 2 is used if codes 3 or 4 are not applicable and there is a car or
any other relevant property that does not fall under variable sparkont (ordinal).

• alter: Age in years (quantitative).

• weitkred: Installment plans from providers other than the credit-giving
bank (categorical).

• wohn: Type of housing the debtor lives in (categorical).

• bishkred: Number of credits, including the current one, the debtor has (or
had) at this bank (ordinal; discretized quantitative).

• beruf: Quality of debtor's job (ordinal).

• pers: Number of persons who financially depend on the debtor (i.e., are
entitled to maintenance) (binary; discretized quantitative).

• telef: Is there a telephone landline registered in the debtor's name? (binary)

• gastarb: Is the debtor a foreign worker? (binary)

• kredit: Has the credit contract been complied with? (binary)

List A2: Explanation of variables in data 2.

• BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant
paid loan

• LOAN: Amount of the loan request

• MORTDUE: Amount due on existing mortgage

• VALUE: Value of current property

• REASON: DebtCon = debt consolidation; HomeImp = home improvement

• JOB: Occupational categories

• YOJ: Years at present job

• DEROG: Number of major derogatory reports

• DELINQ: Number of delinquent credit lines

• CLAGE: Age of oldest credit line in months

• NINQ: Number of recent credit inquiries

• CLNO: Number of credit lines

• DEBTINC: Debt-to-income ratio
