Advisor
Tatjana Pavlenko
2022
Abstract
The use of machine learning methods in credit risk modelling has been proven to yield good results in terms of increasing the accuracy of the risk score assigned to customers. In this thesis, the aim is to examine the performance of the machine learning boosting algorithms XGBoost and CatBoost, with logistic regression as a benchmark model, in terms of assessing credit risk. These methods were applied to two different data sets, where grid search was used for hyperparameter optimization of XGBoost and CatBoost. The evaluation metrics used to examine the classification accuracy of the methods were model accuracy, ROC curves, AUC and cross-validation. According to our results, the machine learning boosting methods outperformed logistic regression on the test data for both data sets, and CatBoost yielded the highest results in terms of both accuracy and AUC.
Acknowledgements
We would like to express our gratitude towards our advisor Tatjana Pavlenko
for the guidance and support that you have provided for this thesis.
Contents

1 Introduction
1.1 Background
2 Theory
2.1 Binary classification
2.2 Logistic regression
2.3 Decision trees
2.4 Boosting
2.5 XGBoost
2.6 CatBoost
2.7 Model Evaluation Metrics
2.7.1 Model Accuracy
2.7.2 ROC curves and AUC
2.7.3 Cross-validation
3 Data and method
4 Results
4.1 Model accuracy
4.2 ROC curves and AUC
5 Discussion
6 Conclusion and further studies
7 Bibliography
8 Appendix
Abbreviations
CV Cross-validation
FN False Negatives
FP False Positives
LR Logistic Regression
TN True Negatives
TP True Positives
TS Target Statistics
1 Introduction
This chapter provides a brief overview of the purpose of the thesis. It also covers previous studies and the scope of the study.
1.1 Background
When a bank or a financial institution lends out money, it faces the risk that some customers will not fulfil their obligation to repay. To minimize this risk, customers are assigned a class when applying for a loan: based on a set of characteristics, they are classified into a certain category and given a credit risk score.
The financial crisis of 2007-2008 boosted the debate on how to properly assess and manage credit risk in commercial banks (Mačerinskienė, Ivaškevičiūtė & Railienė, 2014). With the purpose of decreasing losses as much as possible, several ways of determining the accuracy of credit risk scores have been developed and evaluated. Algorithms and machine learning techniques such as logistic regression, neural networks and boosting are widely used in this field to increase the accuracy of the risk score given to customers.
One article benchmarks a boosted decision tree model against two alternative data mining techniques: support vector machines and multilayer perceptrons (João A. Bastos, 2008). He created several credit scoring models used to help lenders decide whether or not to grant credit to applicants, using two publicly available datasets on credit card applications. In this study, the boosted decision trees outperformed the support vector machine and multilayer perceptron models in terms of accuracy.
A similar study was conducted by Zhenya Tian et al. (2019). They also compared gradient boosting decision trees to support vector machine and multilayer perceptron models, but additionally included models based on methods such as logistic regression, decision trees, adaptive boosting and random forests. They used historical personal consumption loan data from a lending company. After cleaning the data, they used grid search to tune the hyperparameters. To evaluate the performance of each model, they looked at accuracy and Area Under the Curve (AUC). The gradient boosting decision tree model outperformed all other models, with an AUC of 97%.
To summarize the previous studies, all have used model accuracy and AUC as
the main performance metrics. According to these metrics, gradient boosting
decision tree models have repeatedly proven to outperform other commonly used
machine learning techniques in the classification task of assessing credit risk.
This thesis will be restricted to investigating two boosting algorithms, XGBoost and CatBoost, and their ability to predict credit risk. Logistic regression will be used as a benchmark model. The model evaluation metrics used are restricted to model accuracy, Receiver Operating Characteristic (ROC) curves and the AUC score. These metrics will be further discussed in Chapter 2. The datasets are from southern Germany and the USA, respectively. More details about these datasets will be covered in Chapter 3.
The second chapter of this thesis covers the relevant theory needed to understand how logistic regression, the boosting algorithms, and the evaluation metrics work and how they are calculated. In the third chapter, the datasets are discussed more thoroughly, as well as the methods used to reach our conclusion. The fourth chapter contains the results from the models, such as model accuracy, ROC curves, and AUC. In the fifth chapter the results are discussed, and in the sixth and final chapter conclusions are drawn.
2 Theory
To provide a better understanding of how the models work, this chapter explains the theory behind the three models used in this thesis. It also presents how the models can be evaluated and compared using model accuracy, ROC/AUC and cross-validation.
In this thesis, the two class labels are defined as default and non-default. Each customer is classified either to the default class or the non-default class. The set of labels can be referred to as the label space Y = {0, 1}, where 0 refers to the non-default class and 1 to the default class. The labeled target is represented by the random variable Yi = yi, where i = 1, 2, ..., n indexes the instances in a given data set.

The set of features, containing the remaining information about the customers, can be referred to as the feature space X ⊆ R^m, where m denotes the total number of features. These features can be either numerical or categorical. The feature vector Xi takes on a list of observed features xi = [x_{i1}, ..., x_{im}]^T. Given n samples, the data set containing the labeled targets and the feature vectors can be defined by D = {(x1, y1), ..., (xn, yn)}.
Further on, the three methods used for binary classification in this thesis will
be explained. These are Logistic Regression, XGBoost and CatBoost.
2.2 Logistic regression

Logistic regression models the probability of default given the feature vector as

P(Yi = 1 | Xi = xi) = exp(β0 + β^T xi) / (1 + exp(β0 + β^T xi))   (1)
where β0 denotes the intercept and β = [β1, β2, ..., βm]^T is the vector of coefficients associated with the predictors. A transformation of Eq (1) can be derived and displayed as follows
ln[ P(Yi = 1 | Xi = xi) / (1 − P(Yi = 1 | Xi = xi)) ] = β0 + β1 x_{i1} + β2 x_{i2} + ... + βm x_{im} = β0 + β^T xi   (2)
representing the natural logarithm of the odds, also known as the log-odds, on
the left side and the transformation of our predictors, xi , on the right side. This
can also be written as
logit(pi ) = β0 + β T xi (3)
where pi is the probability of observation i belonging to class 1. Since the logistic model returns a value between 0 and 1, a threshold value is chosen in order to classify the observations: observations with a predicted probability above the threshold are assigned to class 1, and the rest to class 0.
In order to fit the logistic model, we maximize the likelihood function. By letting θ = (β0, β) and rewriting Eq (1), we obtain
p(xi; θ) ≡ P(Yi = 1 | Xi = xi; θ) = 1 / (1 + exp(−θ^T xi))   (4)
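As a small illustration, Eq (4) and the threshold classification rule can be sketched in a few lines of Python (the function names are ours, not part of any library):

```python
import math

def sigmoid(z):
    # Logistic function of Eq (4): maps a linear score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(beta0, beta, x, threshold=0.5):
    # Classify one observation as default (1) if P(Y=1|x) reaches the threshold.
    z = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return 1 if sigmoid(z) >= threshold else 0
```

For example, with β0 = 0 and β = [1.0], an observation x = [2.0] gets probability sigmoid(2) ≈ 0.88 and is classified as default.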
and the log-likelihood:

l(θ) = Σ_{i=1}^{n} { yi ln p(xi; θ) + (1 − yi) ln(1 − p(xi; θ)) }   (6)
With some modification of Eq(6), we can obtain the normalized binary cross-
entropy (Loiseau, 2020)
J(θ) = −(1/n) Σ_{i=1}^{n} [ yi ln p(1|xi; θ) + (1 − yi) ln p(0|xi; θ) ]   (7)
We can also express a vectorized version of the normalized binary cross-entropy, where y denotes the vector of response variables:

J(θ) = −(1/n) ( y^T ln p(1|x; θ) + (1 − y)^T ln p(0|x; θ) )   (8)
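A minimal Python sketch of the normalized binary cross-entropy, computed from the labels y and the predicted probabilities p_i = P(Yi = 1 | xi) (the function name is ours):

```python
import math

def binary_cross_entropy(y, p):
    # Average of -[y_i ln p_i + (1 - y_i) ln(1 - p_i)] over the sample.
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / n
```

An uninformative model that predicts p_i = 0.5 for every observation scores ln 2 ≈ 0.693, regardless of the labels.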
Cross-entropy will be further discussed in Chapter 2.3. For now, this is the function we want to minimize by finding the optimal coefficients θ. To do so, an iterative approach called the Newton-Raphson method is commonly used. The coefficients are updated at every iteration, and the update can be described by

θ^(t+1) = θ^(t) − H^{−1} ∇J(θ^(t))

where ∇J is the gradient and H the Hessian of the cost function evaluated at θ^(t).
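For a model with one predictor and an intercept, θ = (β0, β1), one Newton-Raphson update can be sketched in Python as follows; the gradient and Hessian expressions are the standard ones for logistic regression, and the function names are ours:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_step(theta, xs, ys):
    # One update theta <- theta - H^{-1} * gradient for a one-feature model.
    b0, b1 = theta
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        # Gradient of the negative log-likelihood: sum of (p - y) times (1, x).
        g0 += p - y
        g1 += (p - y) * x
        # Hessian entries: sum of p(1 - p) times the outer product of (1, x).
        w = p * (1.0 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    # Solve the 2x2 system H d = g and step towards the minimum.
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (b0 - d0, b1 - d1)
```

Starting from θ = (0, 0) on a sample where large x goes with y = 1, a single step already produces a positive slope coefficient.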
2.3 Decision trees

Figure 2.3 Regression tree
The previous example presents a regression tree, where the response is quantitative. Since this thesis handles a classification task, where the response instead is qualitative, classification trees will be introduced. In classification trees, the leaves represent the predicted class label. For example, the customers in this thesis can end up in a leaf represented by either the default or the non-default class. Besides the response being different, classification trees are grown in a similar way to regression trees. However, they require a different splitting criterion.
One such criterion is the cross-entropy, whose terms satisfy 0 ≤ −p̂_bk ln p̂_bk because 0 ≤ p̂_bk ≤ 1, where p̂_bk is given by

p̂_bk = (1/n_b) Σ_{xi ∈ Rb} I(yi = k)   (11)

which is the proportion of class k instances in the b-th node, with Rb representing the region containing n_b observations (Hastie, Tibshirani & Friedman, 2001).
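The class proportion of Eq (11) and the resulting cross-entropy splitting criterion can be sketched in Python (the helper names are our own):

```python
import math

def node_proportions(labels, classes=(0, 1)):
    # Eq (11): p_hat_bk, the share of each class among the n_b labels in node b.
    n = len(labels)
    return {k: sum(1 for y in labels if y == k) / n for k in classes}

def cross_entropy_impurity(labels):
    # Splitting criterion -sum_k p_hat_bk * ln(p_hat_bk); pure nodes score 0.
    return -sum(p * math.log(p)
                for p in node_proportions(labels).values() if p > 0)
```

A node containing only defaults has impurity 0, while a 50/50 node attains the maximum ln 2.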
2.4 Boosting
Boosting is an ensemble technique that combines weak learners to create a stronger one, with the purpose of improving the performance of the model (Schapire & Freund, 2012). The reason why we have chosen to use boosting methods is that these types of models have many advantages over logistic regression. For example, boosting models often handle missing data well and require less data preparation. Since there are many different types of boosting models, they also have different advantages relative to each other.
In the following sections, the theory behind the two boosting models included in this thesis is explained.
2.5 XGBoost
XGBoost (eXtreme gradient boosting) is an implementation of the gradient
boosting algorithm and has been recognized for its successful performance and
speed due to its algorithmic optimizations (Chen & Guestrin, 2016). To provide a better understanding of how XGBoost works, a mathematical derivation according to Chen & Guestrin (2016) follows.
The model prediction for instance i is an ensemble of H trees,

p_i = Σ_{h=1}^{H} f_h(xi),  f_h ∈ S   (12)

where S is the classification tree space S = {f(x) = w_{q(x)}} (q: R^m → T, w ∈ R^T). In the classification tree space, q denotes the tree structure, mapping an instance to one of its T leaves, and w is the vector of leaf weights. Each f_h corresponds to an independent tree structure q and leaf weights w.
The XGBoost objective function consists of a training loss term l and a regularization term Ω. The training loss term, also known as the loss function, measures the model fit on the training data, while the regularization term measures the complexity of the trees. For learning, we aim to minimize this function:
L = Σ_{i=1}^{n} l(yi, pi) + Σ_{h=1}^{H} Ω(f_h)   (13)
Given that p_i^(t) is the prediction of the i-th instance at the t-th iteration, a tree f_t is added at each iteration so as to minimize the objective function, which can therefore be written as

L^(t) = Σ_{i=1}^{n} l(yi, p_i^(t−1) + f_t(xi)) + Ω(f_t)   (15)
Further, we let the instance set of leaf j be defined as Ij = {i | q(xi ) = j}. The
objective function can be rewritten as
L̄^(t) = Σ_{j=1}^{T} [ (Σ_{i∈Ij} gi) wj + (1/2)(Σ_{i∈Ij} hi + λ) wj² ] + γT   (18)

where gi and hi are the first- and second-order gradients of the loss function with respect to the prediction.
From the previous equations, the optimal weight wj* of leaf j for a given tree structure q(x) can be derived as

wj* = − (Σ_{i∈Ij} gi) / (Σ_{i∈Ij} hi + λ)   (19)
As explained in Section 2.3, trees are grown greedily using binary splitting, where each split results in two new nodes. To further explain how this works for XGBoost, we define I_L and I_R as the sets of instances in the left and right nodes after a split. By letting I denote the node before the split, I = I_L ∪ I_R, the loss reduction is given by

L_split = (1/2) [ (Σ_{i∈I_L} gi)² / (Σ_{i∈I_L} hi + λ) + (Σ_{i∈I_R} gi)² / (Σ_{i∈I_R} hi + λ) − (Σ_{i∈I} gi)² / (Σ_{i∈I} hi + λ) ] − γ   (21)
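The optimal leaf weight and the split gain translate directly into code. The sketch below, in Python with names of our own choosing, computes both quantities from per-instance gradients g and Hessians h:

```python
def leaf_weight(g, h, lam):
    # Eq (19): w_j* = -sum(g_i) / (sum(h_i) + lambda) for the instances in leaf j.
    return -sum(g) / (sum(h) + lam)

def split_gain(gL, hL, gR, hR, lam, gamma):
    # Eq (21): improvement from splitting node I into I_L and I_R,
    # minus the complexity cost gamma of the extra leaf.
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    return 0.5 * (score(gL, hL) + score(gR, hR)
                  - score(gL + gR, hL + hR)) - gamma
```

A split is only worth making when split_gain is positive; otherwise the γ penalty outweighs the loss reduction.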
2.6 CatBoost
CatBoost (categorical boosting) is also an implementation of the gradient boosting algorithm. In addition to handling numerical features, the algorithm automatically handles categorical features, which makes categorical data usable with less manual pre-processing. For other boosting methods, categorical features are commonly converted into numbers before training. CatBoost is also known for requiring less parameter tuning, for its training speed, and for being built to prevent overfitting better than many previously released boosting models (Dorogush, Ershov & Gulin, 2018).
In this thesis, the data sets include both numerical and categorical features. Since both XGBoost and CatBoost can handle numerical features, the mathematical explanation that follows focuses on what distinguishes CatBoost from XGBoost: its ability to handle categorical features.
A common way to encode a categorical feature is to replace each category with a target statistic (TS), a smoothed average of the target over the instances sharing that category:

x̂_{i,k} = ( Σ_j [x_{j,k} = x_{i,k}] · yj + a · P ) / ( Σ_j [x_{j,k} = x_{i,k}] + a )

where [·] denotes the Iverson bracket, that is, [x_{j,k} = x_{i,k}] is equal to 1 in the case where x_{j,k} = x_{i,k} and 0 otherwise. In classification, P is a prior probability of encountering the default class, and a > 0 is the weight of the prior (Dorogush, Ershov & Gulin, 2018).
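As an illustration, a plain (greedy) target statistic of this form can be computed in a few lines of Python. Note that CatBoost itself uses an ordered variant based on a random permutation of the training rows, so this sketch with our own names only shows the smoothing idea:

```python
def target_statistic(category, xs, ys, prior, a=1.0):
    # Smoothed mean target for one categorical value:
    # (sum of y over rows where x equals the category + a * prior)
    # / (number of such rows + a). The equality check plays the role
    # of the Iverson bracket [x_j = category].
    matching = [y for x, y in zip(xs, ys) if x == category]
    return (sum(matching) + a * prior) / (len(matching) + a)
```

With xs = ['a', 'a', 'b'], ys = [1, 1, 0] and prior 0.5, category 'a' is encoded as (2 + 0.5)/(2 + 1) ≈ 0.83 rather than its raw mean 1.0, shrinking rare categories towards the prior.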
Another difference between CatBoost and XGBoost is that CatBoost uses oblivious decision trees. This means that all nodes on the same level of the tree, except the terminal nodes, use the same splitting criterion. Since these trees are full binary trees, a tree with n levels will have 2^n terminal nodes, and the path length from the root node to each leaf will be equal to the depth of the tree (Hancock & Khoshgoftaar, 2020).
To evaluate the performance of a classifier, the area under the ROC curve (AUC) is observed. The AUC represents the model's capability to distinguish between classes and takes on values between 0 and 1. An AUC value close to 1 indicates that the model is better at distinguishing between classes than a model receiving an AUC close to 0 (Fawcett, 2006).
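The AUC has an equivalent probabilistic reading: it is the probability that a randomly chosen default is scored above a randomly chosen non-default. A small Python sketch of this pairwise computation (the function name is ours, and it is practical only for modest sample sizes):

```python
def auc(labels, scores):
    # Compare every (positive, negative) pair; ties count as one half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every default above every non-default gets AUC 1; a model that scores everyone identically gets 0.5.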
2.7.3 Cross-validation
To develop models that perform well, in terms of predictions, on new data as well as training data, the widespread "holdout method" is often employed. To avoid overfitting, we do not want to use the same information for training and validating the models. Instead, some of the data is held out and the remaining data is used for training; the held-out data is often referred to as a validation data set. When data is limited, there is a risk that important information found only in the validation set is missed when the model is trained exclusively on the training set. This might cause a bias. To solve this problem and improve the holdout method, a technique called k-fold cross-validation is used.
With k-fold CV, the training data set is split into k approximately equal-sized folds. The model is then trained using k − 1 folds, where the remaining k-th fold serves as a validation data set. This process is repeated k times, so that every fold has served as a validation set once. For every iteration, the test error is computed, and the performance of the model is based on the average test error among all k trials (James et al., 2021).
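The fold construction can be sketched in Python (the helper name is ours); in practice a library routine would be used, but the mechanics are just index bookkeeping:

```python
def kfold_splits(n, k):
    # Split indices 0..n-1 into k approximately equal-sized folds, then let
    # each fold serve once as the validation set while the rest form the
    # training set.
    folds = [list(range(j, n, k)) for j in range(k)]
    for j in range(k):
        train = [i for f, fold in enumerate(folds) if f != j for i in fold]
        yield train, folds[j]
```

Averaging the k validation errors then gives the CV estimate used to compare models or hyperparameter combinations.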
3 Data and method
This chapter aims to describe the two data sets that are used in the thesis and how missing values are handled. The chapter also contains a method part, which explains how the data is modeled using Logistic Regression, XGBoost and CatBoost in R.
3.1 Data
As previously mentioned, we apply the different models to two different sets of data. The first dataset contains 1000 observations with 20 predictor variables. The data was collected in southern Germany between 1973 and 1975 and is a stratified sample with oversampled bad credits. Of the 1000 observations, 200 defaulted on their loan. The data was obtained from the UCI Machine Learning Repository. Further on in the thesis, this data will be referred to as data 1.
3.3.2 XGBoost
In the first data set, all of the variables are integers, implying that we do not need to transform the data. Unlike the first data set, the second one has different types of variables, including categorical features. To use the XGBoost algorithm on this data, one-hot encoding is necessary.
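One-hot encoding replaces a categorical column with one 0/1 indicator column per level. A minimal Python sketch of the idea (the thesis performs this step in R, and the function name here is ours):

```python
def one_hot_encode(values):
    # Return the sorted category levels and, for each observation,
    # a 0/1 indicator vector with a single 1 marking its level.
    levels = sorted(set(values))
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows
```

For example, the column ['A', 'B', 'A'] becomes the indicator columns A and B with rows [1, 0], [0, 1], [1, 0].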
3.3.3 CatBoost
In R, CatBoost can handle factors. All categorical features are therefore converted using the as.factor() function. The categorical features need to be specified in order for CatBoost to treat them as categorical.
3.4 Method
3.4.1 Hyperparameter optimization
The values of hyperparameters are used to control the learning process. In order to find the optimal set of hyperparameters for the machine learning algorithms, grid search has been performed. Grid search is an exhaustive search through all possible combinations of hyperparameters, given a set of candidate values for each hyperparameter (Hsu, Chang & Lin, 2003). The best combination of hyperparameters, which is used in the final models, is the combination that minimizes the loss function, measured by cross-validation on the training set. We use 10-fold cross-validation with 10 repeats for each combination of hyperparameters. Below are tables describing the hyperparameters that are tuned for both machine learning algorithms used in this thesis.
Table 3.1 Hyperparameters (XGBoost)

Name              Description
nrounds           Number of trees to be built
max_depth         Maximum depth of each tree
eta               Learning rate
gamma             Regularization parameter (see Chapter 2.5)
lambda            Regularization parameter (see Chapter 2.5)
min_child_weight  Stop further partitioning if the sum of observation weights is less than this value
Table 3.2 Hyperparameters (CatBoost)

Name           Description
iterations     Number of trees to be built
depth          Depth of each tree
learning_rate  Learning rate
l2_leaf_reg    Regularization parameter, analogous to λ (see Chapter 2.5)
rsm            Percentage of features to use at split selection
For further explanation, the learning rate is the size of each step at every iteration, moving towards the minimum of the loss function. A high learning rate requires fewer iterations, but also increases the risk of stepping past the minimum of the loss function. Intuitively, a low min_child_weight implies that subsequent learners focus more on misclassified observations. This can help to improve the model accuracy, but may also cause overfitting. RSM is short for Random Subspace Method. The value of this hyperparameter is the percentage of explanatory variables available to the model at each split selection, where the explanatory variables are selected repeatedly at random. A lower RSM increases the training speed, but an RSM below 1 also adds another stochastic element to the training process.
Unlike the machine learning algorithms, logistic regression does not contain any hyperparameters to be tuned. The logistic regression models are fitted using different numbers of iterations, until no further improvement in the evaluation metrics can be seen.
As discussed in Chapter 2.7.3, we search for the optimal set of hyperparameters α, denoted α̂. This is the combination of hyperparameters that minimizes the CV-estimated prediction error when conducting the grid search with 10-fold cross-validation and 10 repeats for each combination. This set of hyperparameters, α̂, is used in our final models.
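The grid search itself is simple to express. The Python sketch below enumerates all combinations; cv_error stands in for the (expensive) step of fitting the model with repeated 10-fold CV and is a placeholder of our own, not a real library call:

```python
from itertools import product

def grid_search(grid, cv_error):
    # grid: dict mapping hyperparameter name -> list of candidate values.
    # cv_error: callable returning the CV-estimated prediction error of a
    # model fitted with the given parameter combination.
    best_params, best_err = None, float("inf")
    names = sorted(grid)
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        err = cv_error(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

The cost grows as the product of the grid sizes times the number of folds and repeats, which is why the searches in this thesis ran for weeks.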
4 Results

This chapter presents the results of using the training and test data of the two different data sets with Logistic Regression, XGBoost and CatBoost.

Table 4.1 Accuracy (Data 1)
Table 4.2 shows the accuracy of the models for data 2. All models had a higher accuracy compared to the models for data 1. Logistic regression had an accuracy of 93.5% and was outperformed by both machine learning algorithms. Unlike the models for data 1, XGBoost had the highest accuracy when using data 2: 94.9% compared to CatBoost's 94.6%. Looking at the training data, logistic regression scored lower than on the test data, with an accuracy of 93.1%. Both machine learning algorithms almost achieved an accuracy of 100% on the training data, with 99.1% and 99.9% for XGBoost and CatBoost respectively.
Table 4.3 AUC (Data 1)
Figure 4.3 ROC curves comparing the models for data 1
Figure 4.3 shows the ROC curves of the models, with the AUC values for the test data given in Table 4.3.
Table 4.4 shows the models' AUC scores for data 2. The machine learning algorithms outperformed the benchmark model, logistic regression, in terms of AUC for this set of data as well. The logistic regression model received an AUC of 0.790. CatBoost had the highest AUC score of 0.925, while XGBoost had an AUC of 0.905. On the training data, both machine learning algorithms had an AUC of 1 or close to 1. The logistic regression model had an AUC of 0.806 on the training data.
Figure 4.4 ROC curves comparing the models for data 2
5 Discussion
This chapter provides an analysis of the results from the previous chapter.
This thesis aimed to compare the machine learning boosting algorithms, with logistic regression as a benchmark, using two different sets of credit risk data. As in the study by Essam Al Daoud (2019), mentioned in Section 1.2, XGBoost and CatBoost yielded results very similar to each other with respect to the evaluation metrics. The machine learning boosting algorithms outperformed the benchmark model in both accuracy and AUC on the test data.
Looking at the results, the models fitted using data 2 achieved much higher accuracy and AUC compared to those fitted using data 1. As previously mentioned in Chapter 3.1, data 1 was collected between 1973 and 1975. Since then it has been widely used in different studies regarding credit scores, but it does not come with much background information, and there are severe errors in the coding information. We can safely say that the data was not collected with the intention of being used to construct machine learning models with recently developed algorithms such as XGBoost and CatBoost. We believe this is the primary reason why the models fitted using this set of data performed worse in terms of accuracy and AUC than those of data 2.
Another noticeable thing about the results is that the difference in accuracy between training and test data is larger for the machine learning models fitted on data 2 than for the models fitted on data 1. This implies that some overfitting tendencies are present in the models fitted on data 2. The reason for this can be derived from the results of our hyperparameter optimization. Although the models fitted on data 2 had trees with greater depth, and were therefore more complex and contained more weights, the regularization terms (described in Chapter 2.5) suggested by the grid search were more or less the same across all machine learning models. In other words, since the regularization terms were more or less the same across all machine learning algorithms, while the models fitted on data 2 were more complex, the latter were also more prone to overfitting. The reason why data 2 yielded more complex models is that it contained more relevant features.
In Chapter 2.6, when discussing the differences between CatBoost and XGBoost, it was mentioned that CatBoost is known for requiring less parameter tuning, for better training speed, and for being built to prevent overfitting better than many previously released models. Fitting two models with CatBoost is not sufficient evidence to claim that this is untrue, but in these two cases, none of the above-mentioned benefits were realized. For data 2, the grid search suggested roughly four times the number of iterations, with a slightly lower learning rate, for CatBoost compared to XGBoost. Therefore, it took longer to tune the hyperparameters and train the model. Looking at the results, the differences between the training and test data in terms of accuracy were greater for the CatBoost models than for XGBoost, implying that the CatBoost models were more prone to overfitting. However, this could also be the result of not finding the optimal set of hyperparameters for the models. Although both authors' computers performed grid searches for weeks during the writing of this thesis, we cannot be completely certain that the optimal set of hyperparameters was found, but it was not for lack of trying.
It can also be seen that the accuracy for both data sets was higher than the corresponding AUC. When using accuracy as an evaluation metric, it is important to consider that imbalanced data tends to yield higher accuracy results. Although the data sets used in this thesis are not highly imbalanced, they are somewhat imbalanced, which needs to be taken into account when analyzing the results. The ROC curves and AUC can therefore be seen as more reliable evaluation metrics, since they are not biased by the data being imbalanced. For future work, it could also be interesting to add other evaluation metrics, such as a confusion matrix, to evaluate the algorithms.
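For reference, a confusion matrix and the accuracy derived from it can be sketched in Python using the TP/FP/TN/FN abbreviations from the front matter (the function names are ours):

```python
def confusion_matrix(y_true, y_pred):
    # Pair-wise counts for binary labels, with 1 denoting the default class.
    pairs = list(zip(y_true, y_pred))
    return {
        "TP": sum(1 for t, p in pairs if t == 1 and p == 1),
        "FP": sum(1 for t, p in pairs if t == 0 and p == 1),
        "TN": sum(1 for t, p in pairs if t == 0 and p == 0),
        "FN": sum(1 for t, p in pairs if t == 1 and p == 0),
    }

def accuracy(cm):
    # Share of correct predictions: (TP + TN) / all observations.
    return (cm["TP"] + cm["TN"]) / sum(cm.values())
```

Unlike accuracy alone, the four counts reveal whether errors are concentrated among defaults (FN) or non-defaults (FP), which matters on imbalanced data.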
An aspect that should be considered regarding the results for the second data set is that the observations with missing values were deleted. This was done so that the benchmark model would be comparable with the machine learning boosting algorithms, despite their ability to handle missing values. Since the deletion resulted in fewer observations, the outcome of using the whole data set for the machine learning boosting algorithms could be expected to differ from the one given in the results. Even though the second data set still contains many observations, a larger data set tends to reduce the risk of overfitting. Possibly, the difference in accuracy between the training and the test set could have been smaller if the whole data set had been used.
6 Conclusion and further studies
To summarize: since the model accuracy is biased by the somewhat imbalanced data, we consider the AUC a more reliable evaluation metric than model accuracy. According to the AUC scores, CatBoost outperformed both XGBoost and the benchmark model, logistic regression, on both sets of data. Largely due to its age, the first data set is not believed to resemble any sets of data used in the financial industry today. Because logistic regression cannot handle missing values, we also decided to remove all observations containing missing values. This is a factor that could have an effect on our results. Considering that missing values are common in the real world, it would be interesting to see how the machine learning algorithms would perform on such data. We, the authors, also believe that we lacked the time and computational power required to claim with certainty that we found the optimal set of hyperparameters. However, according to our results and evaluation metrics, CatBoost outperformed XGBoost in the task of assessing credit risk.
For further studies, it would be interesting to include other commonly used machine learning algorithms, such as LightGBM, and try them on various sets of data containing both numerical and categorical variables, varying in size, balance of the target variable, and amount of missing values. The retention and exclusion of certain features would also be an interesting aspect. This would also require more computational power to guarantee that the optimal set of hyperparameters is found.
7 Bibliography
Aditya Mishra. 2018. Metrics to Evaluate your Machine Learning Algorithm, Towards Data Science [Blog], 24 February. Available at: https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234.
Bastos, Joao. 2008. Credit scoring with boosted decision trees. MPRA paper,
(8156). Available at: https://mpra.ub.uni-muenchen.de/8156/
Chen, Tianqi & Guestrin, Carlos. 2016. XGBoost: A Scalable Tree Boosting
System. Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD '16. New York, NY, USA: ACM, pp. 785-794. Available at: http://doi.acm.org/10.1145/2939672.2939785.
Dorogush, Anna Veronika; Ershov, Vasily & Gulin, Andrey. 2018. CatBoost:
gradient boosting with categorical features. NIPS’18: Proceedings of the 32nd
International Conference on Neural Information Processing Systems, pp. 6639-
6649. Available at: arXiv:1810.11363.
Gao, Ge; Wang, Hongxin and Gao, Pengbin. 2021. Establishing a Credit Risk
Evaluation System for SMEs Using the Soft Voting Fusion Model. Risks, 9 (11),
p.202.
James, Gareth; Witten, Daniela; Hastie, Trevor & Tibshirani, Robert. 2021. An Introduction to Statistical Learning: with applications in R. New York: Springer.
Hancock, John T. & Khoshgoftaar, Taghi M. 2020. CatBoost for big data: an
interdisciplinary review. Journal of Big Data, 7, 94. Available at:
https://doi.org/10.1186/s40537-020-00369-8.
Hastie, Trevor; Tibshirani, Robert & Friedman, Jerome. 2001. Elements of Statistical Learning: data mining, inference and prediction. New York: Springer.
Hsu, Chih-Wei; Chang, Chih-Chung and Lin, Chih-Jen. 2003. A practical guide
to support vector classification.
Loiseau, Jean-Christophe B. 2020. Binary cross-entropy and logistic regression, Towards Data Science [Blog], 1 June. Available at: https://towardsdatascience.com/binary-cross-entropy-and-logistic-regression-bf7098e75559 [Retrieved 2022-01-07].
OECD. 2021. Artificial Intelligence, Machine Learning and Big Data in Finance: Opportunities, Challenges, and Implications for Policy Makers. Available at: https://www.oecd.org/finance/artificial-intelligence-machine-learningbig-data-in-finance.htm.
Schapire, Robert E. & Freund, Yoav. 2012. Boosting: Foundations and Algorithms. Cambridge, Massachusetts: The MIT Press.
Tian, Zhenya; Xiao, Jialiang; Feng, Haonan and Wei, Yutian. 2020. Credit
risk assessment based on gradient boosting decision tree. Procedia Computer
Science, 174, pp.150-160.
8 Appendix
List A1: Explanation of variables in data 1.
laufkont: status of the debtor's checking account with the bank (categorical).

wohnzeit: length of time (in years) the debtor has lived in the present residence (ordinal; discretized quantitative).

verm: the debtor's most valuable property, i.e. the highest applicable code is used. Code 2 is used if codes 3 or 4 are not applicable and there is a car or any other relevant property that does not fall under variable sparkont (ordinal).
bishkred: number of credits, including the current one, that the debtor has (or had) at this bank (ordinal; discretized quantitative).
beruf: quality of debtor’s job (ordinal).
pers: number of persons who financially depend on the debtor (i.e., are entitled to maintenance) (binary; discretized quantitative).
List A2: Explanation of variables in data 2.