Loan Default Prediction Using Supervised Machine Learning Algorithms

DARIA GRANSTRÖM
JOHAN ABRAHAMSSON
Acknowledgements

We would like to express our gratitude towards our supervisors: Tatjana Pavlenko, for professional academic guidance, and Aron Moberg and Lee MacKenzie Fischer, for enabling the research and for helpful mentoring at Nordea.
Contents

Acknowledgements

1 Introduction
1.1 Background
1.2 Purpose
1.3 Scope

2 Theory
2.1 Formulation of a Binary Classification Problem
2.2 Logistic Regression
2.3 Decision Trees
2.4 Random Forest
2.5 Boosting
2.5.1 AdaBoost
2.5.2 XGBoost
2.6 Artificial Neural Networks
2.7 Support Vector Machine
2.8 Feature Selection Methods
2.8.1 Correlation Analysis with Kendall’s Tau Coefficient
2.8.2 Recursive Feature Elimination
2.9 Treatment of Imbalanced Data with SMOTE Algorithm
2.10 Model Evaluation Techniques
2.10.1 Confusion Matrix
2.10.2 Area Under the Receiver Operator Characteristic Curve
2.11 Cross-validation
2.11.1 Implementation of Cross-validation with SMOTE

4 Results
4.1 Performance of the Methods
4.2 Results of the Best Performing Method

5 Discussion
5.1 Findings
5.1.1 Impact of SMOTE
5.1.2 Impact of Variable Selection Methods
5.2 Conclusion
5.3 Future Work

Bibliography

B Adam Algorithm
List of Figures

2.1 (a) Two dimensional feature space split into three subsets. (b) Corresponding tree to the split of the feature space. Source of figure: [16].
2.2 Structure of a single hidden layer, feed-forward neural network. Source of figure: [21].
2.3 An example of a result using the SMOTE algorithm. Source of figure: [24].
2.4 Confusion matrix
2.5 K-fold cross validation performed on a data set
2.6 Upper: Correct way of oversampling and implementing CV. Lower: Incorrect way of oversampling and implementing CV. Source of figure: [4]
3.1 Correlation of feature variables against the response variable with Kendall’s Tau
Abbreviations

CV Cross-validation
DC Dummy Classifier
DT Decision Tree
EL Expected Loss
FN False Negatives
FP False Positives
PD Probability of Default
RF Random Forest
TN True Negatives
TP True Positives
TPR True Positive Rate
Chapter 1

Introduction

In this chapter, an overview of the aim of the thesis is provided. The topics discussed within this chapter are the thesis' background, purpose and scope.
1.1 Background
Recent developments in machine learning and data mining have led to an interest in implementing these techniques in various fields [33]. The banking sector is no exception, and the increasing requirements on financial institutions to have robust risk management have led to an interest in developing the current methods of risk estimation. Potentially, the implementation of machine learning techniques could lead to a better quantification of the financial risks that banks are exposed to.
Within the credit risk area, there has been a continuous development of the Basel accords, which provide frameworks for supervisory standards and risk management techniques as guidelines for banks to manage and quantify their risks. Basel II presents two approaches for quantifying the minimum capital requirement: the standardized approach and the internal ratings-based approach (IRB) [3].
There are different risk measures banks consider in order to estimate the potential loss they may carry in the future. One of these measures is the expected loss (EL) a bank would carry in case of a defaulted customer. One of the components involved in EL estimation is the probability that a certain customer will default. Customers in default have not met their contractual obligations and potentially might not be able to repay their loans [43]. Thus, there is an interest in acquiring a model that can predict defaulted customers. A technique that is widely used for estimating the probability of client default is Logistic Regression [44]. In this thesis, a set of machine learning methods is investigated and studied in order to test whether they can challenge the traditionally applied techniques.
1.2 Purpose
The objective of this thesis is to investigate which method from a chosen set of
machine learning techniques performs the best default prediction. The research question is the following:
• For a chosen set of machine learning techniques, which technique exhibits the
best performance in default prediction with regards to a specific model evalua-
tion metric?
1.3 Scope
The scope of this paper is to implement and investigate how different supervised
binary classification methods impact default prediction. The model evaluation tech-
niques used in this project are limited to precision, sensitivity, F -score and AUC
score. The reasons for choosing these metrics will be explained in more detail in
section 2.10. The classifiers that will be implemented and studied are:
• Logistic Regression
• Decision Tree
• Random Forest
• XGBoost
• AdaBoost
The project will be performed at Nordea and thus internal data on Nordea's customers will be used in order to conduct the research. With regard to this fact, the results presented in this thesis will be biased towards the profile of Nordea's clients, specifically their location and other behavioural factors. The majority of Nordea's clients are situated in the Nordic countries, and thus the results will be impacted mainly by the behaviour of clients from these countries.
Chapter 2
Theory
In the theory section, the relevant theory behind the chosen classification methods is explained. The background needed for understanding the implemented variable selection techniques and the chosen model validation methods is also provided in this chapter.
2.1 Formulation of a Binary Classification Problem

Binary classification refers to the case when the input to a model is classified to
belong to one of two chosen categories. In this project, customers belong either to
the non-default category or to the default category. The categories can therefore be
modeled as a binary random variable Y ∈ {0, 1}, where 0 is defined as non-default,
while 1 corresponds to default. The random variable Yi is the target variable and
will take the value of yi , where i corresponds to the ith observation in the data set.
For some methods, the variable ȳi = 2yi −1 will be used, since these methods require
the response variable to take the values ȳi ∈ {−1, 1}.
The rest of the information about the customers, such as the products the customers possess, account balances and payments in arrears, can be modeled as the input variables. These variables are both real-valued and categorical and are often referred to as features or predictors. Let Xi ∈ Rp denote a real-valued random input vector
and an observed feature vector be represented by xi = [xi1 , xi2 , ..., xip ]> , where p is
the total number of features. Then the observation data set with N samples can be
expressed as D = {(x1 , y1 ), (x2 , y2 ), ..., (xN , yN )}.
This setup makes it feasible to fit a supervised machine learning model that
relates the response to the features, with the objective of accurately predicting the
response for future observations [14]. The main characteristic of supervised machine
learning is that the target variable is known and therefore an inference between the
target variable and the predictors can be made. In contrast, unsupervised machine
learning deals with the challenge where the predictors are measured but the target
variable is unknown.
The chosen classification methods in this project are Logistic Regression, Artificial
Neural Network, Decision Tree, Random Forest, XGBoost, AdaBoost and Support
Vector Machine. The theory for these classifiers will be explained in more detail in
the sections below.
2.2 Logistic Regression

Logistic Regression models the probability of default given a feature vector xi through the logistic function

\[
P(Y_i = 1 \mid X_i = \mathbf{x}_i) = \frac{e^{\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i}}{1 + e^{\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i}},
\tag{2.1}
\]

where β0 and β are the parameters of a linear model, with β0 denoting an intercept and β denoting a vector of coefficients, β = [β1, β2, ..., βp]^⊤. The logistic function in Equation (2.1) is derived from the relation between the log-odds of P(Yi = 1 | Xi = xi) and a linear transformation of xi, that is

\[
\log \frac{P(Y_i = 1 \mid X_i = \mathbf{x}_i)}{1 - P(Y_i = 1 \mid X_i = \mathbf{x}_i)} = \beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i.
\tag{2.2}
\]

A predicted class label is obtained by comparing the estimated probability with a chosen decision boundary c,

\[
\hat{y}_i =
\begin{cases}
1, & \text{if } P(Y_i = 1 \mid X_i = \mathbf{x}_i) \geq c \\
0, & \text{if } P(Y_i = 1 \mid X_i = \mathbf{x}_i) < c
\end{cases}
\tag{2.3}
\]

For the parameter estimation it is convenient to write

\[
p(\mathbf{x}_i; \beta_0, \boldsymbol{\beta}) = P(Y_i = 1 \mid X_i = \mathbf{x}_i; \beta_0, \boldsymbol{\beta}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i)}}.
\tag{2.4}
\]

The parameters are estimated by maximizing the log-likelihood, which for N observations is

\[
l(\beta_0, \boldsymbol{\beta}) = \sum_{n=1}^{N} \big[ y_n \log p(\mathbf{x}_n; \beta_0, \boldsymbol{\beta}) + (1 - y_n) \log\big(1 - p(\mathbf{x}_n; \beta_0, \boldsymbol{\beta})\big) \big]
= \sum_{n=1}^{N} \big[ y_n (\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_n) - \log\big(1 + e^{\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_n}\big) \big].
\tag{2.5}
\]

Let θ = {β0, β} and assume that xn includes the constant term 1 to accommodate β0. Then, in order to maximize the log-likelihood, take the derivative of l and set it to zero,

\[
\frac{\partial l(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum_{n=1}^{N} \mathbf{x}_n \big( y_n - p(\mathbf{x}_n; \boldsymbol{\theta}) \big) = 0.
\tag{2.6}
\]

Solving this score equation with the Newton–Raphson algorithm requires the second derivative, given by

\[
-\frac{\partial^2 l(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^\top} = \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top\, p(\mathbf{x}_n; \boldsymbol{\theta}) \big( 1 - p(\mathbf{x}_n; \boldsymbol{\theta}) \big).
\tag{2.7}
\]

Starting from an initial value of θ, a single Newton–Raphson update is then

\[
\boldsymbol{\theta}^{\text{new}} = \boldsymbol{\theta}^{\text{old}} - \left( \frac{\partial^2 l(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^\top} \right)^{-1} \frac{\partial l(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}},
\tag{2.8}
\]

where the derivatives are evaluated at θ^old.
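As an illustration of the estimation procedure, the following minimal sketch in Python iterates the Newton–Raphson step of Equations (2.6)–(2.8). It assumes the design matrix already contains a leading column of ones for the intercept; the function name and the fixed number of iterations are illustrative choices only.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit theta = (beta_0, beta) by iterating the Newton-Raphson step of Eq. (2.8)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))                 # p(x_n; theta), Eq. (2.4)
        gradient = X.T @ (y - p)                             # score, Eq. (2.6)
        W = p * (1.0 - p)                                    # p(1 - p) weights
        hessian = -(X * W[:, None]).T @ X                    # second derivative in Eq. (2.8)
        theta = theta - np.linalg.solve(hessian, gradient)   # Newton-Raphson update, Eq. (2.8)
    return theta
```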
2.3 Decision Trees

A decision tree algorithm splits the feature space into binary subsets in order to divide the samples into more homogeneous groups. This can be implemented as a tree structure, hence the name decision trees. An example of a two-dimensional split feature space and its corresponding tree can be seen in Figure 2.1.
The terminal nodes in the tree in Figure 2.1 are called leaves and are the predictive
outcomes. In this particular example, a regression tree which predicts quantita-
tive outcomes has been used. However, classification trees that predict qualitative
outcomes rather than quantitative will be used in this project.
Figure 2.1: (a) Two dimensional feature space split into three subsets. (b) Corre-
sponding tree to the split of the feature space. Source of figure: [16].
Let R_m denote the region of the feature space corresponding to node m, containing N_m observations. The proportion of observations of class k in node m is

\[
\hat{p}_{mk} = \frac{1}{N_m} \sum_{\mathbf{x}_i \in R_m} I(y_i = k),
\tag{2.9}
\]

and the Gini index of the node is defined as

\[
G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}),
\tag{2.10}
\]

where K is the number of classes. Since the Gini index is amenable to numerical optimization [20], it will be chosen as the criterion for binary splitting.
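As an illustration, the following sketch computes the Gini index of Equations (2.9)–(2.10) for the labels in a single node and sets up a scikit-learn classification tree that uses the same splitting criterion; the data names in the commented line are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini_index(y_node):
    """Gini index G = sum_k p_mk (1 - p_mk) for the labels falling in one node."""
    _, counts = np.unique(y_node, return_counts=True)
    p_mk = counts / counts.sum()                   # class proportions, Eq. (2.9)
    return float(np.sum(p_mk * (1.0 - p_mk)))      # Eq. (2.10)

print(gini_index(np.array([0, 0, 0, 1, 1])))       # impurity of a mixed node

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
# tree.fit(X_train, y_train)  # X_train, y_train are hypothetical training data
```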
2.4 Random Forest

Before describing the random forest classifier, the notions of bootstrapping and bagging will be introduced. Bootstrapping is a resampling-with-replacement method and is used for creating new synthetic samples in order to make inference about an analyzed population. An example could be to investigate the variability of the mean of a population. This is done by resampling the original sample B times with
replacement, then compute a sample mean for each of the B new samples, and lastly
compute the variance for the sample means.
Bagging combines the models fitted to the B bootstrapped data sets by averaging them,

\[
\hat{f}_{\text{bag}}(\mathbf{z}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(\mathbf{z}),
\tag{2.11}
\]
where B is the total amount of bootstrapped data sets and fˆ∗b (z) is a model used
for the bth bootstrapped data set. In a classification setting, instead of taking the
average of the models, a majority vote is implemented. When applying bagging to
decision trees, the following should be considered. If there is one strong predictor
in the data set along with moderately strong predictors, most of the top splits will
be done based on the strong predictor. This leads to fairly similar looking trees
that are highly correlated. Averaging highly correlated trees does not lead to a large
reduction of variance.
The random forest classifier has the same setup as bagging when building trees on bootstrapped data sets, but it overcomes the problem of highly correlated trees. It decorrelates the trees by taking a random sample of m predictors from the full set of p predictors at each split, and the split is only allowed to use one of those m predictors [17]. The classification procedure of the random forest classifier is summarized in Algorithm 1: for each split made by a submodel f̂_b(z*_b), m predictors are chosen randomly out of the p predictors and one of them is used for the split; each submodel then predicts an unseen observation x_test from the test set, such that f̂_1(x_test) ∈ {0, 1}, ..., f̂_B(x_test) ∈ {0, 1}; and the final model is f̂_final(x_test) = (1/B) Σ_{b=1}^{B} f̂_b(x_test), i.e., the combination of the submodel predictions.
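A minimal sketch of this procedure using scikit-learn: B trees are grown on bootstrapped samples and each split considers a random subset of m of the p predictors. The synthetic data set and all parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical, imbalanced toy data standing in for the default/non-default labels.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9], random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # B bootstrapped trees
    max_features="sqrt",   # m predictors considered at each split
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:5]))   # combined prediction over the B trees
```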
2.5 Boosting
Boosting works in a similar way to bagging in that it combines models to create a single predictive model, but it does not build the trees independently; it builds them sequentially [17]. Building trees sequentially means that information from the previously fitted tree is used for fitting the current tree. Rather than fitting separate trees on separate bootstrapped data sets, each tree is fit on a modified version of the original data set [17].
2.5.1 AdaBoost
AdaBoost stands for Adaptive Boosting and combines weak classifiers into a strong
classifier. A weak classifier refers to a model that is slightly better than classifying
by flipping a coin, i.e., its accuracy is slightly higher than 0.5. Combining the weak classifiers can be defined as the following linear combination

\[
C_{m-1}(\mathbf{x}_i) = \alpha_1 k_1(\mathbf{x}_i) + \dots + \alpha_{m-1} k_{m-1}(\mathbf{x}_i),
\tag{2.12}
\]

where xi is associated with the class ȳi ∈ {−1, 1} and kj(xi) ∈ {−1, 1} is a weak classifier with its weight αj [39]. At the mth iteration a weak classifier km with weight αm is added, which enhances the boosted classifier,

\[
C_m(\mathbf{x}_i) = C_{m-1}(\mathbf{x}_i) + \alpha_m k_m(\mathbf{x}_i).
\tag{2.13}
\]
In order to determine the best km and its weight αm, a loss function that defines the total error E of Cm is used, taking the following form

\[
E = \sum_{i=1}^{N} e^{-\bar{y}_i C_m(\mathbf{x}_i)},
\tag{2.14}
\]

where N is the total sample size. Letting w_i^{(1)} = 1 and w_i^{(m)} = e^{-ȳ_i C_{m-1}(x_i)} for m > 1, the following is obtained

\[
E = \sum_{i=1}^{N} w_i^{(m)} e^{-\bar{y}_i \alpha_m k_m(\mathbf{x}_i)}.
\tag{2.15}
\]
Correctly classified data points take the form ȳ_i k_m(x_i) = 1 and misclassified ones ȳ_i k_m(x_i) = −1. This can be used to split the summation into

\[
E = \sum_{\bar{y}_i = k_m(\mathbf{x}_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{\bar{y}_i \neq k_m(\mathbf{x}_i)} w_i^{(m)} e^{\alpha_m}
  = e^{-\alpha_m} \sum_{i=1}^{N} w_i^{(m)} + \big( e^{\alpha_m} - e^{-\alpha_m} \big) \sum_{\bar{y}_i \neq k_m(\mathbf{x}_i)} w_i^{(m)}.
\tag{2.16}
\]

The only part of Equation (2.16) that depends on k_m is Σ_{ȳ_i ≠ k_m(x_i)} w_i^{(m)}, and it can also be realized that the k_m that minimizes Σ_{ȳ_i ≠ k_m(x_i)} w_i^{(m)} also minimizes E. To determine the desired weight α_m that minimizes E with the k_m that was just determined, take the derivative of E with respect to the weight and set it to zero,

\[
\frac{\partial E}{\partial \alpha_m} = - e^{-\alpha_m} \sum_{\bar{y}_i = k_m(\mathbf{x}_i)} w_i^{(m)} + e^{\alpha_m} \sum_{\bar{y}_i \neq k_m(\mathbf{x}_i)} w_i^{(m)} = 0,
\tag{2.17}
\]

which gives

\[
\alpha_m = \frac{1}{2} \ln \left( \frac{\sum_{\bar{y}_i = k_m(\mathbf{x}_i)} w_i^{(m)}}{\sum_{\bar{y}_i \neq k_m(\mathbf{x}_i)} w_i^{(m)}} \right).
\tag{2.18}
\]
With the mth calculated weight αm and the weak classifier km, the mth combined classifier can be obtained as in Equation (2.13). The output of the algorithm is then

\[
\hat{y}_i =
\begin{cases}
1, & \text{if } C_m(\mathbf{x}_i) \geq 0 \\
0, & \text{if } C_m(\mathbf{x}_i) < 0
\end{cases}
\tag{2.19}
\]
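The following sketch shows how such a boosted classifier can be fitted with scikit-learn's AdaBoostClassifier, whose default weak classifier is a depth-one decision tree. Note that scikit-learn implements a closely related variant of the weight updates derived above, and the data and parameter values are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical, imbalanced toy data standing in for the default/non-default labels.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9], random_state=0)

ada = AdaBoostClassifier(n_estimators=200,   # number of weak classifiers k_m
                         learning_rate=1.0,  # scales the contribution of each alpha_m
                         random_state=0)
ada.fit(X, y)
print(ada.predict(X[:5]))                    # thresholded output, cf. Eq. (2.19)
```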
2.5.2 XGBoost
Let N be the number of samples in the data set with p features, D = {(x_i, y_i)}_{i=1}^{N} (|D| = N, x_i ∈ R^p and y_i ∈ {0, 1}). To predict the output, M additive functions are used,

\[
\phi(\mathbf{x}_i) = \sum_{k=1}^{M} f_k(\mathbf{x}_i), \qquad f_k \in S, \quad S = \{ f(\mathbf{x}) = w_{q(\mathbf{x})} \},
\tag{2.20}
\]

where each f_k corresponds to a tree with structure q, which assigns an observation to one of its T leaves, and w_{q(x)} is the weight of that leaf.
To learn the set of functions used in the model, the following regularized objective is minimized [7]

\[
L(\phi) = \sum_{i=1}^{N} l\big(y_i, \phi(\mathbf{x}_i)\big) + \sum_{k=1}^{M} \Omega(f_k),
\tag{2.21}
\]

where the regularization term is

\[
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \mathbf{w} \rVert^2.
\tag{2.22}
\]

The function Ω(f) penalizes the complexity of the model through the parameter γ, which penalizes the number of leaves T, and λ, which penalizes the leaf weights w. The loss function l measures the difference between the prediction φ(x_i) and the target y_i [7]. Further, let φ(x_i)^{(t)} be the prediction of the ith observation at the t-th iteration; then f_t needs to be added in order to minimize the following objective

\[
L^{(t)} = \sum_{i=1}^{N} l\big(y_i, \phi(\mathbf{x}_i)^{(t-1)} + f_t(\mathbf{x}_i)\big) + \Omega(f_t),
\tag{2.23}
\]
where ft is chosen greedily so that it improves the model the most. Second-order
approximation can be used to quickly optimize the objective in the general setting
[7]
\[
L^{(t)} \simeq \sum_{i=1}^{N} \Big[ l\big(y_i, \phi(\mathbf{x}_i)^{(t-1)}\big) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \Big] + \Omega(f_t),
\tag{2.24}
\]

where g_i and h_i are the first and second order derivatives of the loss function with respect to the prediction φ(x_i)^{(t−1)} [7]. Removing the constant terms and expanding Ω(f_t) gives

\[
\tilde{L}^{(t)} = \sum_{i=1}^{N} \Big[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \Big] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2.
\tag{2.25}
\]
Defining I_j = {i : q(x_i) = j} as the set of observations assigned to leaf j, the objective can be further simplified to

\[
\tilde{L}^{(t)} = \sum_{j=1}^{T} \Big[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \Big] + \gamma T.
\tag{2.26}
\]
Now, the expression for the optimal weight w_j^* of leaf j can be derived from Equation (2.26),

\[
w_j^{*} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
\tag{2.27}
\]

and the corresponding optimal value of the objective, which can be used to score a tree structure q, becomes

\[
\tilde{L}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T} \frac{\big( \sum_{i \in I_j} g_i \big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T.
\tag{2.28}
\]
The predicted class label is obtained as

\[
\hat{y}_i =
\begin{cases}
1, & \text{if } \phi(\mathbf{x}_i) \geq c \\
0, & \text{if } \phi(\mathbf{x}_i) < c
\end{cases}
\tag{2.29}
\]
where c is a chosen decision boundary and φ(xi ) ∈ (0, 1). Further, in order to find
the split the Exact Greedy Algorithm is used [7]. There are also other algorithms
that can be used as alternatives for split finding such as the Approximate Algorithm
and the Sparsity Aware Algorithm [7].
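As an illustration, a gradient-boosted model of this form can be fitted with the xgboost library; here gamma and reg_lambda correspond to the penalties γ and λ in the regularization term, and tree_method="exact" selects the Exact Greedy split-finding algorithm. The synthetic data and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Hypothetical, imbalanced toy data standing in for the default/non-default labels.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9], random_state=0)

xgb = XGBClassifier(
    n_estimators=300,      # M additive trees
    max_depth=4,
    learning_rate=0.1,
    gamma=1.0,             # penalty per leaf (gamma * T)
    reg_lambda=1.0,        # L2 penalty on the leaf weights (lambda)
    tree_method="exact",   # Exact Greedy split finding
)
xgb.fit(X, y)
p_default = xgb.predict_proba(X)[:, 1]   # phi(x_i), compared with the threshold c
```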
2.6 Artificial Neural Networks

Artificial neural networks (ANN) are originally inspired by how the human brain works and are intended to replicate its learning process [23]. A neural network consists of an input layer, an output layer and a number of hidden layers (see Figure 2.2).
Figure 2.2: Structure of a single hidden layer, feed-forward neural network. Source
of figure: [21].
In a single hidden layer, feed-forward network, the hidden units Z_m are created from linear combinations of the inputs,

\[
Z_m = f(\alpha_{0m} + \boldsymbol{\alpha}_m^\top \mathbf{x}_i), \qquad m = 1, \ldots, M,
\tag{2.30}
\]

where α_m = [α_{m1}, ..., α_{mp}]^⊤ and M is the total number of hidden units. The activation function f(·) in Equation (2.30) is a Rectified Linear Unit (ReLU) function f(v) = max(0, v + ε) with ε ∼ N(0, σ_l(v)) [34], where σ_l(v) is the variance of ε. Then, a linear transformation of Z is performed,

\[
H_k = \beta_{0k} + \boldsymbol{\beta}_k^\top \mathbf{Z}, \qquad k = 1, \ldots, K,
\tag{2.31}
\]

where Z = [Z_1, Z_2, ..., Z_M]^⊤ and β_k = [β_{k1}, ..., β_{kM}]^⊤. Further, the final transformation of the vector H = [H_1, H_2, ..., H_K]^⊤ is done by a sigmoid function σ_k(H),

\[
g_k(\mathbf{x}_i) = \sigma_k(\mathbf{H}) = \frac{1}{1 + e^{-H_k}}.
\tag{2.32}
\]
The parameters of the network are often called weights, collectively denoted θ, and the aim is to find such values for them that the model fits the training data well. For classification problems, the cross-entropy error function is defined as follows

\[
R(\boldsymbol{\theta}) = - \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log g_k(\mathbf{x}_i),
\tag{2.33}
\]
with the corresponding classifier G(x_i) = arg max_k g_k(x_i). In this project, K = 2, where k = 1 is defined as non-default and k = 2 corresponds to default. In order to find an optimal solution, the back-propagation algorithm is used, which in turn makes use of gradient descent for finding a minimum of the error function.
The back-propagation algorithm works in the following way. The derivatives of the error function with respect to the weights should be found. Let z_{mi} = f(α_{0m} + α_m^⊤ x_i), as in Equation (2.30), and z_i = [z_{1i}, z_{2i}, ..., z_{Mi}]^⊤. Then the cross-entropy error can be expressed as

\[
R(\boldsymbol{\theta}) \equiv \sum_{i=1}^{N} R_i = - \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log g_k(\mathbf{x}_i),
\tag{2.34}
\]
with derivatives

\[
\frac{\partial R_i}{\partial \beta_{km}} = - \frac{y_{ik}}{g_k(\mathbf{x}_i)} \, \sigma_k'(\boldsymbol{\beta}_k^\top \mathbf{z}_i) \, z_{mi},
\tag{2.35}
\]

\[
\frac{\partial R_i}{\partial \alpha_{ml}} = - \sum_{k=1}^{K} \frac{y_{ik}}{g_k(\mathbf{x}_i)} \, \sigma_k'(\boldsymbol{\beta}_k^\top \mathbf{z}_i) \, \beta_{km} \, f'(\boldsymbol{\alpha}_m^\top \mathbf{x}_i) \, x_{il}.
\tag{2.36}
\]
The gradient descent update consequently takes the following form at the (r + 1)th
iteration
\[
\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}},
\tag{2.37}
\]

\[
\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}},
\tag{2.38}
\]
where γ_r is the learning rate. In order to avoid overfitting, a stopping criterion should be introduced so that training stops before the global minimum is reached [21]. As mentioned above, at the start of training the model is close to linear due to the starting values of the weights, which means that stopping early shrinks the model towards a more linear fit. In this case, a penalty term is introduced into the error function,

\[
R(\boldsymbol{\theta}) + \lambda J(\boldsymbol{\theta}),
\tag{2.39}
\]

where λ ≥ 0 is a tuning parameter and

\[
J(\boldsymbol{\theta}) = \sum_{km} \beta_{km}^2 + \sum_{ml} \alpha_{ml}^2,
\tag{2.40}
\]

or

\[
J(\boldsymbol{\theta}) = \sum_{km} \frac{\beta_{km}^2}{1 + \beta_{km}^2} + \sum_{ml} \frac{\alpha_{ml}^2}{1 + \alpha_{ml}^2},
\tag{2.41}
\]
if the weight elimination penalty is used instead. The Adam algorithm will be implemented to perform stochastic optimization in order to find the optimal weights [28]. A more detailed description of the algorithm can be found in Algorithm 4 in Appendix B.
One of the criticisms the ANN algorithm has received is that it is relatively slow to train. Further, since it relies on the back-propagation algorithm, it can be quite unstable: the tuning parameters have to be adjusted so that the algorithm reaches the final solution without overstepping it [25].
2.7 Support Vector Machine

The Support Vector Machine (SVM) is an algorithm that involves creating a hyperplane for classification. In order to classify an object, a set of features is used. Thus, if there are p features, the hyperplane lies in p-dimensional space [41].
The hyperplane is created by the optimization performed by the SVM, which maximizes the distance to the closest points, also called support vectors. Let x_i = [x_{i1}, ..., x_{ip}]^⊤ be an arbitrary observation feature vector in the training set, ȳ_i ∈ {−1, 1} the corresponding label of x_i, w a weight vector w = [w_1, ..., w_p]^⊤ with ||w||_2 = 1, and b a threshold. Then the following constraints are defined for the classification problem [8]:

\[
\mathbf{w}^\top \mathbf{x}_i + b \geq M \quad \text{for } \bar{y}_i = +1,
\tag{2.42}
\]

\[
\mathbf{w}^\top \mathbf{x}_i + b \leq -M \quad \text{for } \bar{y}_i = -1,
\tag{2.43}
\]

where M is the margin to be maximized.
Let f(x_i) = w^⊤ x_i + b; then the output of the model ŷ_i is defined as follows

\[
\hat{y}_i =
\begin{cases}
1 & \text{for } f(\mathbf{x}_i) \geq 0 \\
0 & \text{for } f(\mathbf{x}_i) < 0
\end{cases}
\tag{2.44}
\]
For margin maximization, instead of using ||w||_2 = 1, the margin can be bounded from below and the optimization problem can be defined as the minimization of ||w||_2. The constraints for this optimization problem, derived from the inequalities (2.42) and (2.43), can be presented as follows

\[
\bar{y}_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1, \qquad i = 1, \ldots, N.
\tag{2.45}
\]
In some cases, it is relevant to implement a soft margin, which allows some points to lie on the wrong side of the hyperplane, or between the support vectors and the hyperplane, in order to provide a more robust model. A cost parameter C > 0 may then be introduced, which assigns a penalty to such errors. The objective function to minimize then takes the following form

\[
\lVert \mathbf{w} \rVert_2 + C \sum_i \xi_i,
\tag{2.46}
\]
where ξ_i is a slack variable. The constraints to the optimization problem now become [8]

\[
\bar{y}_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \qquad i = 1, \ldots, N.
\tag{2.47}
\]
The Lagrangian dual problem is used for solving this optimization problem, where the following objective is maximized with respect to the parameters α_i,

\[
L = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \bar{y}_i \bar{y}_j \big( \mathbf{x}_i^\top \mathbf{x}_j \big).
\tag{2.50}
\]

In the non-linear case, the observations are mapped into a higher-dimensional feature space by a function Φ(·), and the solution can be expressed in terms of the mapped support vectors,

\[
a = \sum_i \alpha_i \Phi(\mathbf{x}_i).
\tag{2.51}
\]
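A minimal sketch of a soft-margin SVM with scikit-learn: the parameter C is the cost assigned to the slack variables ξ_i, and a linear kernel corresponds to the inner products appearing in the dual of Equation (2.50). The data and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical, imbalanced toy data standing in for the default/non-default labels.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9], random_state=0)

svm = make_pipeline(StandardScaler(),          # SVMs are sensitive to feature scale
                    SVC(kernel="linear", C=1.0))  # C penalizes the slack variables
svm.fit(X, y)
print(svm.predict(X[:5]))
```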
2.8 Feature Selection Methods

In the data given by Nordea, the features are presented as both continuous and categorical variables. Thus, in order to understand how the features correlate with each other and with the response variable, the implemented feature selection methods should be able to process both continuous and categorical variables simultaneously. That is why it was decided to use the following methods for feature selection: correlation analysis with Kendall's Tau coefficient and Recursive Feature Elimination.
2.8.1 Correlation Analysis with Kendall's Tau Coefficient

The Kendall's Tau rank correlation coefficient measures the ordinal association between two measured variables and is thus a univariate method of correlation analysis [13]. The ordinal association between two variables is measured by calculating the proportion of concordant pairs minus the proportion of discordant pairs in a sample [11]. Any pair of observations (x_{ki}, y_i) and (x_{kj}, y_j), for i < j, is concordant if the product (x_{ki} − x_{kj})(y_i − y_j) is positive, and discordant if this product is negative.
The Kendall's Tau coefficient can then be defined as follows [35]

\[
\tau = \frac{2}{n(n-1)} \sum_{i < j} \operatorname{sgn}(x_{ki} - x_{kj}) \operatorname{sgn}(y_i - y_j).
\tag{2.52}
\]
The Kendall’s Tau rank correlation is a non-parametric test which means that it
does not rely on any assumptions of distributions between the analyzed variables
[42]. In a statistical hypothesis test, the null hypothesis implies an independence of
X and Y and for large data sets, the distribution of the test can be approximated
by a normal distribution with mean zero and variance [1]
\[
\sigma_\tau^2 = \frac{2(2n+5)}{9n(n-1)}.
\tag{2.53}
\]
Further, for samples where n > 10, τ can be transformed into a Z value for the null hypothesis test, such that the Z value approximately follows a normal distribution with zero mean and standard deviation 1,

\[
Z_\tau = \frac{\tau}{\sigma_\tau} = \frac{\tau}{\sqrt{\dfrac{2(2n+5)}{9n(n-1)}}}.
\tag{2.54}
\]
Once the Z value has been computed, it is compared with the chosen significance level α in order to decide whether the null hypothesis can be rejected. In this project, the α-level was chosen to be 0.1.
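As an illustration, the test can be carried out with scipy.stats.kendalltau, which returns the estimate of Equation (2.52) together with a p-value based on the normal approximation; the feature and response below are synthetic stand-ins.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
feature = rng.normal(size=500)                                  # hypothetical feature x_k
response = (feature + rng.normal(size=500) > 1).astype(int)     # hypothetical default flag

tau, p_value = kendalltau(feature, response)
reject_independence = p_value < 0.1        # compare with the alpha-level of 0.1
print(tau, p_value, reject_independence)
```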
2.8.2 Recursive Feature Elimination

Recursive Feature Elimination (RFE) repeatedly fits a chosen model, ranks the features by their importance to that model and eliminates the least important features, until a desired number of features remains [38]. It is relevant to highlight that an improvement may be seen for some models when applying RFE, while others exhibit no remarkable difference in performance. For example, random forest is one of the models that might benefit when RFE is applied [30]. One of the reasons lies in the nature of model ensembles: random forest tends not to exclude irrelevant predictors when a split is made, which makes a prior elimination of irrelevant features beneficial.

Thus, the aim is to test how RFE impacts the implemented models. When choosing the model used inside RFE, it is relevant to consider that logistic regression is sensitive to class imbalances [2], which in fact exist in the given data set. Further, the given data is not particularly linear, which is why linear regression has not been considered as a method for RFE either. Therefore, the intention is to build RFE based on a random forest classifier. In order to choose the optimal number of variables, the F-score will be used as the evaluation score for RFE.
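A sketch of this setup with scikit-learn's RFECV, which performs recursive feature elimination with a random forest and selects the number of features by cross-validated F-score; the synthetic data and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Hypothetical, imbalanced toy data with 30 candidate features.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           weights=[0.9], random_state=0)

selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=0),
                 step=1,          # remove one feature per iteration
                 cv=5,
                 scoring="f1")    # F-score as the selection criterion
selector.fit(X, y)
print(selector.n_features_, selector.support_)   # chosen number of features and mask
```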
2.9 Treatment of Imbalanced Data with SMOTE Algorithm

The data provided exhibits a heavy class imbalance. An imbalanced data set consists of observations where the classes of the response variable are not approximately equally represented. In fraud detection, for example, it is common that the minority class is in the order of 1 to 100 [36], and in many studies there have been cases with orders of 1 to 100,000. The imbalance causes a problem when training machine learning algorithms, since one of the categories is almost absent; hence, poor predictions of new observations of the minority class are expected. In order to increase the performance of the algorithms, different sampling techniques can be used. One of them, used in this project, is SMOTE (Synthetic Minority Over-sampling Technique), which creates new synthetic minority-class observations by interpolating between an existing minority observation and one of its nearest minority-class neighbours [36]. An example of the result of applying SMOTE is shown in Figure 2.3.
Figure 2.3: An example of a result using the SMOTE algorithm. Source of figure:
[24].
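As a sketch, the oversampling described above can be performed with the SMOTE implementation in the imbalanced-learn package; sampling_strategy=0.6 corresponds to oversampling the minority class until it reaches 60% of the size of the majority class, as done later in this project. The data are synthetic.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Hypothetical, heavily imbalanced toy data.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)

smote = SMOTE(sampling_strategy=0.6,  # minority/majority ratio after resampling
              k_neighbors=5,          # neighbours used for the interpolation
              random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(sum(y == 1), sum(y_res == 1))   # minority count before and after oversampling
```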
2.10 Model Evaluation Techniques

2.10.1 Confusion Matrix

One common way to evaluate the performance of a model with binary responses is to use a confusion matrix (see Figure 2.4). The observed cases of default are defined as positives and non-defaults as negatives [10]. The possible outcomes are then true positives (TP) if defaulted customers have been predicted to be defaulted by the model, true negatives (TN) if non-default customers have been predicted to be non-default, false positives (FP) if non-default customers have been predicted to default, and false negatives (FN) if defaulted customers have been predicted to be non-default.
From a confusion matrix there are certain metrics that can be taken into consider-
ation. The most common metric is accuracy which is defined as the fraction of the
total number of correct classifications and the total number of observations. It is
mathematically defined as
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}.
\tag{2.55}
\]
The issue with using accuracy as a metric arises when applying it to imbalanced data. If the data set contains 99% of one class, it is possible to obtain an accuracy of 99% by simply predicting the majority class for every observation. A metric that is more relevant in the context of this project is specificity, defined as [6]

\[
\text{Specificity} = \frac{TN}{FP + TN},
\tag{2.56}
\]
and will be used for explaining the theory behind receiver operator characteristic
curve and its area under the curve in section 2.10.2.
In terms of business sense, the aim is to achieve a trade-off between losing money on non-performing customers and the opportunity cost caused by declining a potentially performing customer. Thus, it is highly relevant to analyze how sensitivity and precision are affected by the various methods applied, as sensitivity is a measure of how many defaulted customers are captured by the model, while precision relates to the potential opportunity cost. Sensitivity and precision are defined in Equations (2.57) and (2.58) [12],

\[
\text{Sensitivity} = \frac{TP}{TP + FN},
\tag{2.57}
\]

\[
\text{Precision} = \frac{TP}{TP + FP}.
\tag{2.58}
\]
Since sensitivity and precision are of equal importance in this project, a trade-off between these metrics is considered. The F-score is the weighted harmonic average of precision and sensitivity [12] and can be expressed as

\[
F = (1 + \beta^2) \, \frac{\text{Precision} \cdot \text{Sensitivity}}{\beta^2 \cdot \text{Precision} + \text{Sensitivity}}
  = \frac{(1 + \beta^2)\, TP}{(1 + \beta^2)\, TP + \beta^2 FN + FP},
\tag{2.59}
\]

where β determines the relative weight given to sensitivity versus precision; in this project β = 1 is used, so that both metrics are weighted equally.
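The following sketch computes these confusion-matrix based metrics with scikit-learn for a pair of hypothetical label vectors.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical observed default flags
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)   # TP / (TP + FP), Eq. (2.58)
sensitivity = recall_score(y_true, y_pred)    # TP / (TP + FN), Eq. (2.57)
f1 = f1_score(y_true, y_pred)                 # Eq. (2.59) with beta = 1
print(tn, fp, fn, tp, precision, sensitivity, f1)
```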
2.10.2 Area Under the Receiver Operator Characteristic Curve

Another way to evaluate the results from the models is to analyze the Receiver Operator Characteristic (ROC) curve and its Area Under the Curve (AUC). In this section, the definition of the ROC curve is provided, followed by the explanation of the AUC.
Let V0 and V1 denote two independent random variables with cumulative distribution
functions F0 and F1 respectively. The random variables V0 and V1 describe the
outcomes predicted by a model if a customer has defaulted or not. Let c be a
threshold value for the default classification such that if the value from the model is
greater or equal to c, a customer is classified as default and non-default otherwise.
Further, in this setting, sensitivity and specificity are defined in the following way [37],

\[
\text{Sensitivity}(c) = P(V_1 \geq c) = 1 - F_1(c),
\tag{2.60}
\]

\[
\text{Specificity}(c) = P(V_0 < c) = F_0(c).
\tag{2.61}
\]
The ROC curve describes the trade-off between sensitivity and specificity by plotting the sensitivity against the false positive fraction (1 − specificity). Letting m denote 1 − F_0(c), the ROC curve can be written as

\[
\text{ROC}(m) = 1 - F_1\big(F_0^{-1}(1 - m)\big), \qquad m \in [0, 1],
\tag{2.62}
\]

and the AUC is then defined as

\[
AUC = \int_0^1 \text{ROC}(m)\, dm.
\tag{2.63}
\]
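As an illustration, the empirical ROC curve and AUC can be obtained by sweeping the threshold c over a model's predicted default probabilities; the model and data below are illustrative stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical, imbalanced toy data and a simple reference model.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]          # predicted default probabilities

fpr, tpr, thresholds = roc_curve(y, scores)    # (1 - specificity, sensitivity) per threshold c
auc = roc_auc_score(y, scores)                 # area under the ROC curve, Eq. (2.63)
print(auc)
```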
2.11 Cross-validation
In order to prevent using the same information in the training phase and the evalu-
ation phase of models, which makes the results less reliable, the data is divided into
training set, validation set and test set [40]. The training set and validation set are
used for finding the best model and the test set is only used for calculating the pre-
diction performance of the best model. The test data will therefore be held out until
the best model is obtained, this is called the holdout method [40]. Choosing the best
model from a set of models can be done by a method called K-fold cross-validation
(CV).
K-fold CV involves the procedure where the data set is divided in K roughly equal-
sized sets or folds. One of them is set to be a validation set and the rest are the
training set that a model is being fitted on [22]. The procedure is repeated K times
and the validation error is being estimated for each time. For example, in Figure
2.5, K has been assigned the numerical value of 5, which means that there have
been 5 iterations of the procedure.
Let κ : {1, ..., N} → {1, ..., K} be a mapping function that gives the index of the partition to which observation i is assigned by the randomization, and let k = 1, 2, ..., K.
For the kth fold, the model is fitted on the K − 1 parts and the prediction error
is calculated for the kth part. The fitted model is denoted as fˆ−k (x) with the kth
part of the data removed, then the CV error is defined as follows
\[
CV = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{-\kappa(i)}(\mathbf{x}_i)\big),
\tag{2.64}
\]
where L(·) is the loss function for the respective model. Given a set of competing
models f (x, α), with α denoting the index of the model, an exhaustive search for
the best αth model can be performed. Let the αth fitted model be fˆ−k (xi , α) with
the kth part of the data removed, then the CV error becomes
\[
CV(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{-\kappa(i)}(\mathbf{x}_i, \alpha)\big).
\tag{2.65}
\]
The objective is to find the α that minimizes the validation error, denoted as α̂.
This is also known as a hyperparameter search and will be implemented by looping
through every combination of a chosen set of hyperparameter values for each machine
learning method. When the final model f (x, α̂) is obtained, where α̂ represents the
best combination of hyperparameters, the performance of f (x, α̂) will be calculated
when predicting on the test set.

In this project, due to the class imbalance problem, stratified CV will be used in order to preserve a proportional representation of the two classes in each fold. In a binary classification problem, stratification in cross-validation is a technique that rearranges the data so that each fold contains roughly the same representation of the classes [27].
2.11.1 Implementation of Cross-validation with SMOTE

When using CV and applying an oversampling technique such as SMOTE, the following has to be considered. A simple oversampling technique duplicates observations of the minority class. If the oversampling is performed before CV, then the probability of getting duplicate observations in both the validation set and the training set is fairly high [4]. The point of the validation set is to measure the performance of the method on unseen observations; if oversampling is performed before CV, the model may already have "seen" some of the validation observations, which causes biased results. SMOTE is an oversampling technique that does not duplicate the
observations but rather synthetically creates new ones. Even if the synthetic new
data points generated by SMOTE are not duplicates of an observation, they are
still based on an original observation and will therefore cause biased results when
predicting on them. An example of a correct way of oversampling and use of CV
following with an example of an incorrect way is visualized in Figure 2.6.
Figure 2.6: Upper: Correct way of oversampling and implementing CV. Lower:
Incorrect way of oversampling and implementing CV. Source of figure: [4]
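A sketch of the correct ordering in Figure 2.6 using an imbalanced-learn Pipeline: because the sampler is part of the pipeline, SMOTE is re-fitted on the training folds of each CV split only, and the validation folds remain untouched. The model choice and all parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Hypothetical, heavily imbalanced toy data.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)

pipeline = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.6, random_state=0)),   # applied to training folds only
    ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
])
scores = cross_val_score(pipeline, X, y,
                         cv=StratifiedKFold(n_splits=5), scoring="f1")
print(scores.mean())   # CV estimate free of SMOTE leakage into validation folds
```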
Chapter 3

Data Processing and Variable Selection

The data provided by Nordea contains information on their small and medium enterprise (SME) clients. Because of the clients' geographical locations and different currencies, it was decided to segment them into their respective countries, see Figure 3.1.
The parameters in the data were the same regardless of geographical location and amounted to over 400 variables in total. Information derived from these variables could be, for example, what types of accounts a customer possessed and how these particular accounts behaved over a certain time span. From now on, these variables will be denoted as behavioral variables. The time span of the data was merged with a default flag indicating whether the customer was in default or not in the following year. For example, if the observation period of a customer was between December 31, 2015 and December 31, 2016, then the performance period of the customer was the year 2017. This means that the customers who defaulted at any time during 2017 were given a default flag.
As mentioned in section 3.1, the data contained different types of accounts
that the customers possessed. These accounts can be related to different financial
products such as mortgages, loans or credit. Thus, in order to decrease the number
of variables, all different types of financial products have been aggregated into one
category, see Figure 3.2.
Some of the created variables describe the same feature, but for different time spans.
This means that the variables, for example bf ag 1 1, bf ag 1 3, bf ag 1 6 and
bf ag 1 12, describe the same feature but represent different time periods.
The input data to all of the models were standardized according to section 3.2.1.
Missing values were treated by doing a complete case analysis.
In order to obtain invariant measures and a similar scale between variables, standardization was applied. The standard score was calculated for each x as follows [26]

\[
z_{ki} = \frac{x_{ki} - \bar{x}_k}{s_k},
\tag{3.1}
\]

where x̄_k is the mean and s_k is the standard deviation of the sample for a specific feature k [29]. The mean of a sample x_{k1}, x_{k2}, ..., x_{kn} is defined as

\[
\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_{ki},
\tag{3.2}
\]
and the standard deviation of the sample as

\[
s_k = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_{ki} - \bar{x}_k)^2}.
\tag{3.3}
\]
The standardization parameters were fit on the training data and applied on the
test data.
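As a sketch, this corresponds to fitting a scikit-learn StandardScaler on the training data and reusing its parameters on the test data; note that StandardScaler uses the biased (ddof = 0) standard deviation rather than the N − 1 version of Equation (3.3). The data below are synthetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(800, 4))   # hypothetical training features
X_test = rng.normal(5.0, 2.0, size=(200, 4))    # hypothetical test features

scaler = StandardScaler().fit(X_train)   # estimates x_bar_k and s_k per feature (ddof = 0)
Z_train = scaler.transform(X_train)      # z_ki = (x_ki - x_bar_k) / s_k, Eq. (3.1)
Z_test = scaler.transform(X_test)        # the training parameters are reused on the test set
```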
3.3 Variable Selection by Correlation Analysis with Kendall's Tau

The hypothesis was made that the feature variable groups that describe the same feature but represent different time periods have a high correlation within the group. This is evident in Figure 3.3, where the majority of the groups indeed have a high internal correlation.

Because of the high internal correlation in each group, the decision was made to choose one feature variable from each group. The aim is to pick the variable that correlates the most with the response variable. Thus, from each group the variable with the highest Kendall's Tau correlation with the response variable was chosen. Parts of the results are shown in Table 3.1. The whole table can be seen in Appendix A.
Table 3.1: Correlation of feature variables against the response variable with
Kendall’s Tau
Feature name Correlation with response
bf ag 41 0.382857
bf ag 71 0.381114
bf ag 11 0.360268
bf ag 10 3 0.356646
bf ag 43 0.351934
bf ag 73 0.346601
bf ag 63 0.340912
bf ag 13 0.340761
bf ag 23 0.340727
... ...
bf ag 3 3 -0.324351
From Table 3.1 it can be seen that the two behavioral variables bf ag 7 1 and bf ag 7 3 are from the same feature group. Since bf ag 7 1 has a higher correlation with the response variable, it will be chosen as the only remaining variable from this feature group. After the selection of variables from each group was made, the correlation was analyzed again in order to investigate how the chosen variables correlate with each other. The results can be found in Figure 3.4.

As can be seen in Figure 3.4, some variables have a strong correlation, which should be treated. The next step in this case is to choose a threshold for the correlation allowed in the model. The threshold was chosen to be 0.7, because pairwise correlations higher than 0.7 lead to unstable estimates and multicollinearity [5]. After the threshold was decided, the variable that correlates the most with the response variable was kept, see Figure 3.5. For example, the behavioral variable bf ag 6 3 is perfectly correlated with the behavioral variable bf ag 2 3. According to Table 3.1, the behavioral variable bf ag 6 3 correlates more with the response variable than the variable bf ag 2 3, and this is why the variable bf ag 6 3 should be retained from this pair.
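The pruning step described above could be sketched as follows, assuming a pandas DataFrame df of candidate features and a Series target with the default flag (both hypothetical): among feature pairs whose absolute Kendall correlation exceeds 0.7, the variable with the weaker correlation to the response is dropped.

```python
import pandas as pd

def prune_correlated(df: pd.DataFrame, target: pd.Series, threshold: float = 0.7):
    """Drop one variable from each pair whose pairwise |Kendall tau| exceeds the threshold."""
    corr = df.corr(method="kendall").abs()                      # pairwise |tau| between features
    with_target = df.corrwith(target, method="kendall").abs()   # |tau| against the response
    dropped = set()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # keep the variable more correlated with the response
                dropped.add(b if with_target[a] >= with_target[b] else a)
    return df.drop(columns=sorted(dropped))
```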
After this selection, the significance test described in section 2.8.1 was executed in order to see whether the null hypothesis could be rejected. For the final selection of variables, with an α-level of 0.1, all variables from Figure 3.5 were tested with regard to their independence from the response variable. All tests indicated a dependence between the feature variables and the response variable, and therefore the null hypothesis could be rejected. Thus, the variables shown in Figure 3.5 are included in the final model.
3.4 Variable Selection by RFE

For the sake of obtaining a multivariate variable selection, RFE was implemented. The variable selection by RFE was performed only on the behavioral variables, with a random forest classifier and the F-score as the evaluation score. Further, the shaded area in Figure 3.6 represents the variability of the cross-validation, with one standard deviation above and below the mean score drawn by the graph [9].

As seen in Figure 3.6, the highest score was obtained when the total number of
variables is 29. It can also be noticed that after around 19 variables the score only fluctuates slightly, and even after 7 variables the score does not vary much. Thus, experiments were made with 7, 19 and 29 variables selected by the RFE. From now on, RFE with 7, 19 and 29 variables respectively will be denoted RFE7, RFE19 and RFE29. The aim is, however, to obtain a model with low complexity, such that it is easy to interpret.
Figure 3.6: Graph over number of features selected with the corresponding F -score.
Chapter 4
Results
In this chapter, the results obtained from the different models are presented and discussed.
The results were obtained with the data set provided by Nordea. Four different
data sets were studied, where one of them was obtained by performing variable
selection with correlation analysis with Kendall’s Tau and in total contained 13
variables. The other three data sets were generated by RFE and included 7, 19 and
29 features respectively. Half of the models were implemented with oversampling of
the minority class with SMOTE. In these cases, the minority class was oversampled
until its magnitude corresponded to 60% of the majority class.
4.1 Performance of the Methods

In Tables 4.1 and 4.2, it can be seen that the highest obtained F -score was for
XGBoost without SMOTE and variable selection method of RFE7 and RFE29 . With
regards to precision and sensitivity, Artificial Neural Networks without SMOTE and
RFE29 generated the best precision, while AdaBoost with SMOTE and RFE7 had
the best sensitivity.
Further, in Table 4.3, it is clearly indicated that in terms of variable selection meth-
ods, RFE performed better than correlation analysis with Kendall’s Tau. Potential
causes for this will be discussed in chapter 5.
In Table 4.4, the relationship between the number of variables and mean value of
F -score has been studied. The results indicate that on average there is a direct
relation between increase in the number of variables and increase in F -score when
RFE was used as a variable selection method. It is also relevant to highlight that
the increase was marginal, which is one of the arguments to consider when trying
to achieve a trade-off between the complexity of the model and a higher F -score.
Another thing to consider is that according to Figure 3.6 in section 3.4, with the
increasing number of variables the F -score will plateau. However, in order to draw a conclusion about the relation between the number of variables used in the model and the F -score, more tests would have to be executed and analyzed in order to reach statistical significance.
Further, the conclusion can be made that tree-based methods showed on average
better performance than Artificial Neural Networks. The average F -score has been
computed for tree-based methods (AdaBoost, XGBoost, Decision Tree and Random
Forest) and Artificial Neural Networks and is shown in Table 4.5.
4.2 Results of the Best Performing Method

In this section, the results of the best performing method, XGBoost without SMOTE and with RFE7, are presented. All model evaluation parameters can be studied in Figure 4.1.
In Figure 4.2, ROC curves of XGBoost without SMOTE with different feature se-
lection methods are analyzed. From the figure it can be concluded that there is no
remarkable difference between different applied feature selection methods in terms
of ROC curves and AUC scores for XGBoost without SMOTE.
In Figure 4.3, ROC curves of different classification methods, when RFE7 was ap-
plied as a variable selection method, are presented. The yellow line in Figure 4.3
is a dummy classifier (DC) and represents a model that classifies every observation
with a 50% probability to be a default. It can be clearly seen that even though the
same variable selection method was applied, ROC curves differed for the studied
classifiers.
Figure 4.1: Performance metrics for XGBoost without SMOTE and RFE7
Figure 4.2: ROC curve for XGBoost without SMOTE and all of the different variable
selection approaches
Figure 4.3: Comparison of the method’s ROC curves when RFE7 was applied as a
feature selection method
Chapter 5
Discussion
In this chapter, the results from chapter 4 will be analyzed and discussed. From
chapter 4, it can be seen that the overall best performance was obtained with the
machine learning technique XGBoost. The best precision was shown with ANN
and the best sensitivity was obtained with AdaBoost. In this project, the F -score was chosen as the main indicator of how well a model performed, because sensitivity and precision were of equal importance and the F -score captures both measures. Thus, if the main performance indicator were to be changed (to, for example, sensitivity, accuracy or precision), other conclusions might be drawn and further analysis should be performed.
5.1 Findings
Generally, results indicated that tree-based models were more stable and showed
better performance on average than Artificial Neural Networks. This is aligned
with results from another study, which had the same conclusion when comparing
performance between ANN and tree-based methods [32]. Other insights from chapter
4 is the impact of SMOTE on the model performance as well as how the number
of variables included in the models will be discussed in section 5.1.1 and 5.1.2.
However, in order to test if this relation exhibits generally, more tests should be
done and different sets of classifiers should be studied.
On average, SMOTE did not have a significant positive impact on the F -score, but
it had an impact on sensitivity and precision. It can be noted that implementation
of SMOTE led to an increase in sensitivity and a decrease in precision. The models
with SMOTE showed the following trend. The number of T P and F P increased,
while F N and T N decreased. Thus, sensitivity was higher when using SMOTE.
5.1.2 Impact of Variable Selection Methods

After analyzing the results, the conclusion can be made that RFE performed better
than correlation analysis with Kendall’s Tau. That can be partially explained by
the nature of variable selection methods. RFE is a multivariate selection method,
which allows the analysis of a set of features simultaneously, while Kendall’s Tau
provides a pair-wise analysis of variables. In this case, it can be concluded that for
this type of project, the multivariate feature selection method is more suitable than
the univariate one.
Results indicated also that the number of features included in the model had an
impact on the model performance. When using RFE as a variable selection method,
there is a direct relation between an increase in F -score and an increase of number
of variables used in the model. This is shown in Table 4.1. On the other hand,
the increase is marginal. Considering the trade-off between the complexity of the model and obtaining the highest F -score, a solution in this case could be to define the highest number of variables allowed in the model or a threshold for the lowest acceptable F -score.
5.2 Conclusion
The research question posed in section 1.2 was the following:

• For a chosen set of machine learning techniques, which technique exhibits the best performance in default prediction with regards to a specific model evaluation metric?
The overall results showed that XGBoost without SMOTE, executed with RFE7 and RFE29, gave the best performance with regards to the F -score. However, because of model complexity, the data set containing 7 variables is preferable and therefore recommended in this context.
The results also showed that using RFE as a feature selection method led to better performance than the correlation analysis with Kendall's Tau, which could partially be explained by the multivariate nature of RFE.
It can be concluded that SMOTE did not enhance the performance of the models
remarkably in terms of F -score. When SMOTE was applied, sensitivity increased
and precision decreased. On average, it was also concluded that an increased number of variables used in the models chosen by RFE had a direct relation with an increase in F -score. However, this increase was marginal, which should be taken into consideration, as it is preferable to have a trade-off between the complexity of the model and the F -score. Further, it could also be concluded that tree-based methods showed on average better performance than ANN.
5.3 Future Work

Potential future work for this project would be a further development of the model by deepening the analysis of the variables used in the models, as well as creating new variables in order to make better predictions. The data available for the scope of this thesis has constraints in terms of how many years are covered as well as the geographical breadth of Nordea's clients. The majority of Nordea's clients are from the Nordic countries, and thus it should be considered that the behaviour of Nordic customers influences the results of this research. This means that the behaviour of clients outside the Nordics may or may not follow the same pattern, and one should therefore perform additional analysis and obtain a geographically broader data set if the objective is to have a model unbiased of geographical location.
It can also be assumed that if data were available for a longer time span as well as for a broader geography of clients, it would be of interest to implement macroeconomic variables, which in turn might open up new insights about the factors impacting the default of a customer, as well as about which machine learning methods are more suitable for this type of problem. Further, a large part of this project was to make
a grounded feature selection such that variables included in the models were valuable
for prediction. Variable selection was made by RFE and correlation analysis with
Kendall’s Tau, but it would be interesting to apply other variable selection methods.
An alternative for dimensionality reduction could be Principal Component Analysis
(PCA).
It would also be interesting to study which metrics are the most relevant for this type of problem. As mentioned previously, the main metric all evaluations were analyzed by in this project was the F -score, because the aim was to achieve a trade-off between sensitivity and precision. If a deeper analysis could be performed regarding the most relevant metric for this type of problem, then a weighting could be introduced if one of the metrics explored turned out to be of greater importance. An example of such a weighting is the weighted F -score, where β is set not to 1 but to a value of interest.
Bibliography
[13] Xavier Benoit Gust Lucile D'journo. The use of correlation functions in thoracic surgery research. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4387406/. Accessed: 2019-05-17.
[14] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. An Introduction to
Statistical Learning. eng. Springer Series in Statistics. New York, NY: Springer
New York, 2013, p. 26. isbn: 978-1-4614-7138-7.
[15] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. An Introduction to
Statistical Learning. eng. Springer Series in Statistics, New York, NY: Springer
New York, 2013, p. 132. isbn: 978-1-4614-7138-7.
[16] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. An Introduction to
Statistical Learning. eng. Springer Series in Statistics. New York, NY: Springer
New York, 2013, pp. 304–305. isbn: 978-1-4614-7138-7.
[17] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. An Introduction to
Statistical Learning. eng. Springer Series in Statistics, New York, NY: Springer
New York, 2013, pp. 319–320. isbn: 978-1-4614-7138-7.
[18] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of
Statistical Learning: Data Mining, Inference and Prediction. eng. Springer
Series in Statistics. New York, NY: Springer New York, 2001, p. 98. isbn:
9780387216065.
[19] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of
Statistical Learning: Data Mining, Inference and Prediction. eng. Springer
Series in Statistics. New York, NY: Springer New York, 2001, p. 270. isbn:
9780387216065.
[20] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of
Statistical Learning: Data Mining, Inference and Prediction. eng. Springer
Series in Statistics. New York, NY: Springer New York, 2001, p. 271. isbn:
9780387216065.
[21] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Sta-
tistical Learning: Data Mining, Inference and Prediction. eng. Springer Series
in Statistics. New York, NY: Springer New York, 2001, pp. 350–355. isbn:
9780387216065.
[22] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Sta-
tistical Learning: Data Mining, Inference and Prediction. eng. Springer Series
in Statistics. New York, NY: Springer New York, 2001, pp. 214–217. isbn:
9780387216065.
[23] Simon S Haykin et al. Neural networks and learning machines. Vol. 3. Pearson
Upper Saddle River, 2009.
[24] Feng Hu and Hang Li. “A Novel Boundary Oversampling Algorithm Based on Neighborhood Rough Set Model: NRSBoundary-SMOTE”. In: Mathematical Problems in Engineering (2013). url: https://www.researchgate.net/publication/287601878_A_Novel_Boundary_Oversampling_Algorithm_Based_on_Neighborhood_Rough_Set_Model_NRSBoundary-SMOTE.
[25] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. “Extreme Learn-
ing Machine: A New Learning Scheme of Feedforward Neural Networks”. In:
(2004).
[26] Simon James. An Introduction to Data Analysis using Aggregation Functions
in R. eng. 2016. isbn: 3-319-46762-X.
[27] K fold and other cross-validation techniques. https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e. Accessed: 2019-05-15.
[28] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimiza-
tion”. In: arXiv preprint arXiv:1412.6980 (2014).
[29] Erwin Kreyszig. Advanced Engineering Mathematics. eng. 2011. isbn: 9780470646137.
[30] Max Kuhn and Kjell Johnson. Feature Engineering and Selection: A Practical Approach for Predictive Models. eng. 2019. isbn: 9781138079229.
[31] A.E. Lazzaretti and D.M.J. Tax. “An adaptive radial basis function kernel for
support vector data description”. In: vol. 9370. Springer Verlag, 2015, pp. 103–
116. isbn: 9783319242606.
[32] Peter Martey Addo, Dominique Guegan, and Bertrand Hassani. “Credit Risk
Analysis Using Machine and Deep Learning Models”. In: Risks 6 (2018).
[33] Tom M Mitchell. “Machine learning and data mining”. In: Communications
of the ACM 42.11 (1999).
[34] Vinod Nair and Geoffrey E Hinton. “Rectified linear units improve restricted
boltzmann machines”. In: Proceedings of the 27th international conference on
machine learning (ICML-10). 2010, pp. 807–814.
[35] R.B. Nelsen. Kendall tau metric. eng. Encyclopedia of Mathematics. 2001.
isbn: 978-1-55608-010-4.
[36] Chawla Nitesh V. et al. “SMOTE: Synthetic Minority Over-sampling Tech-
nique”. In: Journal of Artificial Intelligence Research 16 (2002), pp. 321–357.
[37] Nonparametric Bayesian Inference in Biostatistics. eng. 1st ed. 2015.. Frontiers
in Probability and the Statistical Sciences. 2015. isbn: 3-319-19518-2.
[38] Recursive Feature Elimination (RFE). https://www.brainvoyager.com/bv/doc/UsersGuide/MVPA/RecursiveFeatureElimination.html. Accessed: 2019-05-17.
[39] Raúl Rojas. “AdaBoost and the super bowl of classifiers a tutorial introduction
to adaptive boosting”. In: (2009).
[40] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
eng. Upper Saddle River, New Jersey 07458: Pearson Education, Inc., 2010.
[41] Bernhard Schölkopf, Chris Burges, and Vladimir Vapnik. “Incorporating in-
variances in support vector learning machines”. In: International Conference
on Artificial Neural Networks. Springer. 1996, pp. 47–52.
[42] Peter Sprent and Nigel C Smeeton. Applied nonparametric statistical methods.
Chapman and Hall/CRC, 2000.
Appendix A

Kendall's Tau Correlation Analysis
Table A.1: Positive correlation of Feature Variables against the Response Variable
with Kendall’s tau
Feature name Correlation with response
bf ag 41 0.382857
bf ag 71 0.381114
bf ag 11 0.360268
bf ag 10 3 0.356646
bf ag 43 0.351934
bf ag 73 0.346601
bf ag 63 0.340912
bf ag 13 0.340761
bf ag 23 0.340727
bf ag 10 6 0.332280
bf ag 46 0.330372
bf ag 66 0.319739
bf ag 76 0.319038
bf ag 16 0.318593
bf ag 26 0.318559
bf ag 8 0.315214
bf ag 10 12 0.313506
bf ag 4 12 0.311391
bf ag 96 0.306812
bf ag 7 12 0.300638
bf ag 6 12 0.300206
bf ag 93 0.298221
bf ag 2 12 0.296260
bf ag 1 12 0.296115
bf ag 9 12 0.289003
bf ag 56 0.193858
bf ag 5 12 0.193014
bf ag 53 0.192234
bf ag 20 12 0.053963
bf ag 20 6 0.048757
bf ag 20 3 0.045284
bf ag 20 1 0.037427
bf ag 11 0.003460
Table A.2: Negative correlation of Feature Variables against the Response Variable
with Kendall’s tau
Feature name Correlation with response
bf ag 12 -0.030496
bf ag 14 3 -0.044130
bf ag 14 12 -0.044615
bf ag 14 6 -0.050902
bf ag 13 3 -0.070456
bf ag 13 12 -0.073104
bf ag 13 6 -0.080682
bf ag 17 -0.092303
bf ag 3 12 -0.275970
bf ag 36 -0.298557
bf ag 33 -0.324351
Appendix B
Adam Algorithm