0% found this document useful (0 votes)

32 views31 pages

Bench Marking Paper

The document benchmarks 24 regression techniques for modelling and predicting loss given default (LGD) using 5 real-life lending datasets. It finds that nonlinear techniques like artificial neural networks and support vector machines consistently outperform traditional linear techniques. Two-stage models combining linear and nonlinear techniques also perform well while retaining interpretability. However, models still only explain 4-43% of LGD variance on average, indicating much remains unexplained. Improving LGD estimates can significantly impact banks' capital requirements.

Uploaded by

Gert Loterman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

32 views31 pages

Bench Marking Paper

Uploaded by

Gert Loterman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

Benchmarking State-of-the-Art Regression Algorithms for Loss Given

Default Modelling

Gert Lotermana , Iain Brownb , David Martensa,c , Christophe Muesb , Bart Baesensb,c
a
Department of Business Administration and Public Management
University College Ghent, Ghent University
Voskenslaan 270, B-9000 Ghent, Belgium
{gert.loterman, david.martens} @hogent.be
b
School of Management, University of Southampton
University Road, Southampton, S017 1BJ, United Kingdom
{i.brown, c.mues, b.m.m.baesens} @soton.ac.uk
c
Department of Decision Sciences & Information Management
Catholic University of Leuven
Naamsestraat 69, B-3000 Leuven, Belgium
{david.martens, bart.baesens} @econ.kuleuven.be

Abstract

The recent introduction of the Basel II framework has had a huge impact on financial institu-
tions, allowing them to build credit risk models for three key risk parameters: PD (Probability of
Default), LGD (Loss Given Default) and EAD (Exposure at Default). Current credit risk research
is largely focused on the estimation and validation of the PD parameter. However, changes in LGD
directly affect the capital of a financial institution in a linear way, unlike PD, which therefore has
less of an effect on minimal capital requirements. The use of models that estimate LGD as accu-
rately as possible is thus of crucial importance as these can translate into significant future savings.

In this first large scale LGD benchmarking study, various state-of-the-art regression techniques
to model and predict LGD are studied. These include one-stage models, such as those built by
ordinary least squares, beta regression, robust regression, ridge regression, regression splines, arti-
ficial neural networks, support vector machines and regression trees, as well as two-stage models
which attempt to combine the benefits of multiple techniques. In total 24 regression techniques are
evaluated and compared using 5 real-life retail lending datasets from major international banking
institutions.

It is found that much of the variance of LGD remains unexplained as the average predictive
performance of the models in terms of R2 range from 4 % to 43 %. Nonetheless, a clear trend can
1
be observed that, non-linear techniques and in particular artiﬁcial neural networks and support
vector machines yield consistently higher predictive performances over all datasets than more tra-
ditional linear techniques. Also, two-stage models built by a combination of linear and non-linear
techniques are shown to have similarly good predictive power, while they oﬀer the added advantage
of having a comprehensible linear model component.

Key words: Basel II, Credit risk, LGD, Data mining, Prediction

1. Introduction

Given the recent turmoil on credit markets, the topic of credit risk modelling has now become
more important than ever before. Also, to comply with the recently introduced Basel II accord,
financial institutions are investing heavily in developing improved credit risk models. The Basel
II Capital Accord aims at quantifying the minimum amount of regulatory buffer capital so as to
provide a safety cushion against unexpected credit-, market- and/or operational losses. From a
credit risk perspective, the accord encourages financial institutions to build risk models hereby
using three key risk parameters: Probability of Default (PD), Loss Given Default (LGD), and
Exposure at Default (EAD).

Nowadays, credit risk research is largely focused on the estimation and validation of the PD
parameter. The LGD parameter measures the economic loss, expressed as percentage of the ex-
posure, in case of default. This parameter is a crucial input to the Basel II capital as it enters
the capital requirement formula in a linear way (unlike PD, which therefore has less of an effect
on minimal capital requirements). Hence, changes in LGD directly affect the capital of a financial
institution and as such also its long-term strategy. It is thus of crucial importance to have models
that estimate LGD as accurately as possible. This is however not straightforward, as industry
models typically show low R2 values. Such models are often built using ordinary least squares
regression or regression trees [3] [4] [10] [18]. This first large scale LGD benchmarking study in
terms of techniques and datasets, investigates whether other approaches can improve the predictive
performance which, given the impact of LGD on capital requirements, can yield large benefits.

Preprint submitted to Int. J. of Forecasting, Special Issue on Credit Risk Forecasting February 26, 2010
The remainder of this paper is organised as follows. Section 2 gives a short overview of the
examined regression techniques. Section 3 details several performance metrics for the evaluation
and comparison of the regression models discussed in the previous section. Section 4 details the
datasets used and the experimental set-up implemented in this study. The penultimate section 5
displays the experimental results from this study. Finally section 6 concludes this paper.

2. Regression techniques

This study comprises both one-stage and two-stage techniques. One stage techniques can be
divided into linear and non-linear techniques. Linear techniques model the dependent variable as
a linear function of the independent variables while non-linear techniques ﬁt a non-linear model
to a dataset. Two stage models are a strategic combination of the aforementioned one-stage models.

In this paper, the following mathematical notations are employed. A scalar x is denoted in
normal script. A vector x is represented in boldface and is assumed to be a column vector. The
corresponding row vector xT is obtained using the transpose T . Bold capital notation is used for
a matrix X. The number of independent variables is given by n and the number of observations is
given by l. The observation i is denoted as xi whereas variable j is indicated as xj . The value of
variable j for observation i is represented as xi (j) and the independent variable y for observation
i is represented as yi . P is used to denote a probability.

A regression technique ﬁts a dataset to a model y = f (x) + e where y is the dependent variable,
x are the independent variables and e is the residual.

2.1. Ordinary Least Squares (OLS)

Ordinary least squares regression [13] is the most common technique to ﬁnd optimal parameters
bT = [b0 b1 b2 ... bn ] to ﬁt a linear model to a dataset as

y = bT x

where xT = [1 x1 x2 ... xn ]. OLS approaches this problem by minimising the sum of squared
residuals:
∑
l ∑
l
2
(ei )2 = (yi − bT xi )
i=1 i=1
3
By taking the derivative of this expression and subsequently setting the derivative equal to zero

∑
l
(yi − bT xi )xTi = 0
i=1

the model parameters b can be retrieved as

b = (XT X)−1 XT y

with XT = [x1 x2 ... xl ] and y = [y1 y2 ... yl ]T .

2.2. Ridge Regression (RiR)

Ridge regression [21] is a linear regression variant that is less sensitive to correlated independent
variables than OLS. When independent variables are strongly correlated with each other, inverting
the XT X matrix leads to large and unreliable parameter estimates. Ridge regression reduces these
undesirable symptoms by minimising

∑
l ∑
l
λbT b + (ei )2 = λbT b + (yi − bT xi )2
i=1 i=1

where λ is deﬁned as the ridge parameter which controls a trade-oﬀ between bias and variance.
With small values of λ, the model parameters are slightly biased but can be estimated more reliably
as
b = (XT X + λI)−1 XT y

where I is the identity matrix.

2.3. Robust Regression (RoR)

Robust regression [22] is another linear regression variant that is less sensitive to outliers as
OLS. When the dataset contains outliers, the model parameters can become unreliable. Therefore,
the most common method for robust regression called M-estimation [24] minimises

∑
l ∑
l
ρ(ei ) = ρ(yi − bT xi )
i=1 i=1

where the objective function ρ(e) should be less sensitive for outliers than the function used by
OLS, i.e. ρ(e) = e2 . By taking the derivative of the objective function and subsequently setting
the derivative equal to zero
∑
l
wi (yi − bT xi )xTi = 0
i=1
4
∂ρ
where w(e) = ∂e is defined as the weight function and wi = w(ei ) are the resulting weights.
e
Because the weights depend upon the residuals, the residuals depend upon the estimated coefficients
and the estimated coefficients depend upon the weights, the solution requires an iterative procedure
(Iteratively Reweighted Least Squares or IRLS). To start, the initial model parameters b(0) are
estimated by setting wi = 1 as in OLS. At each iteration t, the model parameters b(t) are estimated
(t−1) (t−1)
using the residuals ei and associated weights wi from the previous iteration. The new
estimates are given by
b(t) = (XT W(t−1) X)−1 XT W(t−1) y
{ }
(t−1)
where W(t−1) = diag wi . This procedure stops when the estimated model parameters b
satisfy a convergence criterion [25].

2.4. Ordinary Least Squares with Beta transformation (B-OLS)

Whereas OLS regression tests generally assume normality of the dependent variable y, the
empirical distribution of LGD can often be approximated more accurately by a Beta distribution
[18]. Assuming that y is constrained to the open interval (0, 1), the cumulative distribution function
(CDF) of a Beta distribution is given by:
∫ y
Γ(a + b)
β(y; a, b) = v a−1 (1 − v)b−1 dv
Γ(a)Γ(b) 0

where Γ() denotes the well-known Gamma function, and a and b are two shape parameters, which
can be estimated from the sample mean µ and variance σ 2 using the method of the moments, i.e.:

µ2 (1 − µ) 1
a= 2
− µ ; b = a( − 1)
σ µ

A potential solution to improve model ﬁt therefore is to estimate an OLS model for a transformed
dependent variable yi∗ = N −1 (β(yi ; a, b)) (i = 1, ..., l), in which N −1 () denotes the inverse of the
standard normal CDF. The predictions by the OLS model are then transformed back through the
standard normal CDF and the inverse of the ﬁtted Beta CDF to get the actual LGD estimates.

2.5. Beta Regression (BR)

Instead of performing a Beta transformation prior to fitting an OLS model, an alternative Beta
regression approach is outlined in [30]. Their preferred model for estimating a dependent variable
bounded between 0 and 1 is closely related to the class of generalised linear models and allows for
5
a dependent variable that is Beta-distributed conditional on the covariates. Instead of the usual
parametrisation though of the Beta distribution, with shape parameters a and b, they propose
an alternative parametrisation involving a location parameter µ and a precision parameter φ, by
letting:
a
µ= ; φ=a+b
a+b
It can be easily shown that the first parameter is indeed the mean of a β(a, b)-distributed variable,
µ(1−µ)
whereas σ 2 = (φ+1) , so for fixed µ, the variance (dispersion) increases with smaller φ.

Two link functions mapping the unbounded input space of the linear predictor into the required
value range for both parameters are then chosen, viz. the logit link function for the location
parameter (as its value must be squeezed into the open unit interval) and a log function for the
precision parameter (which must be strictly positive), resulting in the following sub models:
T
eb xi
µi = E(yi |xi ) = T
1 + eb xi
T
φi = e−d xi

This particular parametrisation offers the advantage of producing more intuitive variable coef-
ficients (as the two rows of coefficients, bT and dT , provide an indication of the effect on the
estimate itself and its precision, respectively). By further selecting which variables to include in
(or exclude from) the second submodel, one can explicitly model heteroskedasticity. The result-
ing log-likelihood function is then used to compute maximum-likelihood estimators for all model
parameters.

2.6. Ordinary Least Squares with Box-Cox transformation (BC-OLS)

The aim of the family of Box-Cox transformations [8] is to make the residuals of the regression
model more homoskedastic and closer to a normal distribution. The Box-Cox transformation on
the dependent variable yi takes the form

 ((yi + c) − 1)
λ

if λ 6= 0
λ

 log(y + c)
i if λ = 0
with power parameter λ and parameter c. If needed, the value of c can be set to a non-zero value to
rescale y so that it becomes strictly positive. After a model is built on the transformed dependent
variable using OLS, the predicted values can be transformed back to their original value range.
6
2.7. Regression trees (RT)

Classiﬁcation and regression trees are decision tree models, for a categorical or continuous
dependent variable, respectively, that recursively partition the original learning sample into smaller
subsamples, so that some impurity criterion i() for the resulting node segments is reduced [9]. To
grow the tree, one typically uses a greedy algorithm that, at each node t, evaluates a large set of
candidate variable splits so as to ﬁnd the ’best’ split, i.e. the split s that maximises the weighted
decrease in impurity:
∆i(s, t) = i(t) − pL i(tL ) − pR i(tR )

where pL and pR denote the proportions of observations associated with node t that are sent to the
left child node tL or right child node tR , respectively. A commonly applied impurity measure i(t)
for regression trees is the mean squared error or variance for the subset of observations falling into
node t. Alternatively, a split may be chosen based on the p-value of an ANOVA F-test comparing
between-sample variances against within-sample variances for the subsamples associated with its
respective child nodes (ProbF criterion).

2.8. Multivariate Adaptive Regression Splines (MARS)

MARS [16] is a technique that uses piecewise linear functions to capture non-linearities and
interactions between variables. The method is based on a ‘divide and conquer’ strategy where the
input space is divided in partitions and each partition holds its own regression equation. MARS
ﬁts a dataset to a model of the form
∑
K
y= bk Bk (x) + e
k=1

where B(x) is a basis function and K refers to the number of basis functions. A basis function
can either take the value 1 or a single hinge function h(xj ) that takes the form of max(0, xj − a)
or max(0, a − xj ) with a a so-called knot, or a product of 2 or more hinge functions to model
interactions. MARS builds a model in 2 phases: a forward and a backward pass. The forward pass
builds an over ﬁtted model by adding a number of Hinge functions, typically twice the number of
Hinge functions with the lowest mean squared error. Both variables and knots are selected via a
partition scheme and a subsequent exhaustive search. The backward procedure prunes the model
by removing those Hinge functions that are associated with the smallest increase in the so-called

7
GCV (Generalised Cross Validation) error, deﬁned as

∑
l
(yi − f (xi ))2
i=1
GCV =
C 2
(1 − )
l
where C = 1 + c · d, c is a penalty for adding a Hinge function and d is the number of independent
Hinge functions.

2.9. Least Squares Support Vector Machines (LSSVM)

In this study an SVM [36] variant, called LSSVM [31], is used because of its higher eﬃciency
for solving large scale problems [37]. The basic idea behind regression with LSSVM is to map the
independent variables to a high dimensional feature space with a non-linear function ϕ so the data
becomes more appropriate for linear regression:

y = bT ϕ(x) + e

with ϕT (x) = [1 ϕ(x1 ) ϕ(x2 ) ... ϕ(xn )]. However, the model is never evaluated in this form.
Instead, LSSVM regression ﬁts a model to a dataset by minimising

1 T 1 ∑l
1 1 ∑l
b b+ γ (ei )2 = bT b + γ (yi − bT ϕ(xi ))2
2 2 i=1 2 2 i=1

where γ is deﬁned as the regularisation parameter. The primal optimisation problem indicates
that each data point has to be mapped to a high dimensional (possibly inﬁnite) feature space.
This mapping however becomes quite fast computationally infeasible. To bypass this problem, the
kernel trick is used. In order to be able to do the kernel trick, the optimisation problem has to
be reformulated in its dual form by applying the method of Lagrange multipliers that leads to the
following equation:
∑
l
y= αi ϕ(x)T ϕ(xi ) + e
i=1
At this point the kernel trick can be performed. The kernel K is a function that calculates the dot
products of the input vectors in feature space without implicitly doing the mapping to the feature
space. The kernel trick is supported by Mercer’s theorem and replaces every dot product in high
dimensional feature space by a simple kernel function:

K(x, xi ) = ϕ(x)T ϕ(xi )

8
2.10. Artiﬁcial Neural Networks (ANN)

Artificial neural networks (ANNs) are mathematical representations inspired by the functioning
of the human brain [7]. The benefit of an ANN is its flexibility in modelling virtually any (non-
linear) dependency between independent variables and the dependent variable. Although various
architectures have been proposed, our study focuses on probably the most widely used type of ANN,
i.e. the Multilayer Perceptron (MLP). A MLP is typically composed of an input layer (consisting
of neurons for all input variables), a hidden layer (consisting of any number of hidden neurons),
and an output layer (in our case, one neuron). Each neuron processes its inputs and transmits
its output value to the neurons in the subsequent layer. Each such connection between neurons is
assigned a weight during training. The output of hidden neuron i is then computed by applying
(1)
an activation function f (1) to the weighted inputs and its bias term bi (having a similar role to
the intercept of a regression model) as follows:

(1) ∑
n
hi = f (1) (bi + Wij xj )
j=1

W is the weight matrix whereby Wij denotes the weight connecting input j to hidden neuron i.
Similarly, the output of the output layer is computed as follows:
∑
nh
y = f (2) (b(2) + vj hj )
j=1

with nh the number of hidden neurons and v the weight vector whereby vj represents the weight
connecting hidden neuron j to the output neuron. Examples of transfer functions that are com-
1 ex −e−x
monly used are the sigmoid function f (x) = 1+e−x , the hyperbolic tangent f (x) = ex +e−x and the
linear transfer function f (x) = x.

During model estimation, the weights of the network are ﬁrst randomly initialised and then
iteratively adjusted so as to minimise an objective function, typically the sum of squared errors
(possibly accompanied by a regularisation term to prevent over ﬁtting). This iterative procedure
can be based on simple gradient descent learning or more sophisticated optimisation methods such
as Levenberg-Marquardt or Quasi-Newton. The number of hidden neurons can be determined
through a grid search based on validation set performance.

9
2.11. Linear regression + non-linear regression
The purpose of this two-stage technique is to combine the good comprehensibility of OLS with
the predictive power of a non-linear regression technique [34]. In a ﬁrst stage, a linear model

y = bT x + e

is built with OLS. In a second stage, the residuals e of this linear model

e = f (x) + e∗

are estimated with a non-linear regression model f in order to further improve the predictive ability
of the model. Doing so, the model takes the following form:

y = bT x + f (x) + e∗

where e∗ are the new residuals of estimating e. A combination of OLS with RT, MARS, LSSVM
and ANN is assessed in this study.

2.12. Logistic regression + (non)linear regression

The LGD distribution is often characterised by a large peak around LGD = 0. This non-normal
distribution can lead to inaccurate regression models. This proposed two-stage technique attempts
to resolve this issue by modelling the peak separately from the rest. Therefore, the first stage of
this two-stage model consists of a logistic regression to estimate whether LGD ≤ 0 or LGD > 0.
In a second stage the mean of the observed values of the peak is used as prediction in the first case
and a one-stage (non)linear regression technique is used as prediction in the second case. More
specifically, a logistic regression [23] results in an estimate of the probability P of being in the peak
1
P = T
1 + e−(b x)
with (1 − P ) as the probability of not being in the peak. This two-stage model is built using the
following equation:
y = P · y peak + (1 − P ) · f (x) + e

where y peak is the mean of the values of y ≤ 0, which practically equals to 0, and f (x) is a one-stage
(non)linear regression model, build on those observations only that are not in the peak. Whereas
y peak is determined using only the values of y ≤ 0, the one-stage model is built using only the
values of y > 0. A combination of logistic regression with all aforementioned one-stage techniques
as described above, is assessed is this study.
10
Metric Worst Best Comparability Evaluation
RMSE +∞ 0 Relative Calibration
MAE +∞ 0 Relative Calibration
AUC 0.5 1 Absolute Discrimination
AOC +∞ 0 Relative Calibration
R2 −∞ 1 Absolute Calibration
r 0 1 Absolute Discrimination
ρ 0 1 Absolute Discrimination
τ 0 1 Absolute Discrimination
Table 1: Performance metrics

3. Performance metrics

Performance metrics evaluate evaluate to which degree the predictions f (xi ) differ from the
observations yi of the dependent variable. Each of the following metrics, listed in Table 1, has its
own method to express the predictive performance of a model into a quantitative value. The second
and third column of the table show the metric values for respectively the worst and best possible
prediction performance. The fourth column indicates whether the metric is relatively or absolutely
comparable: relative metric values depend on the distribution of the dependent variable, whereas
absolute metric values do not. This implies that relative metrics can only be used to compare
predictive performance amongst models built on the same dataset. Absolute metrics can also be
used to compare predictive performance between models built on different datasets. The final
column shows whether the metric measures calibration or discrimination. Calibration indicates
how close the predictive values are with the observed values whereas discrimination refers to the
ability to provide an ordinal ranking of the dependent variable considered. A good ranking does
not necessarily imply a good calibration.

3.1. Root Mean Squared Error (RMSE)

RMSE is deﬁned as the square root of the average of the squared diﬀerence between predictions
and observations: v
u
u1 ∑
l
RM SE = t (f (xi ) − yi )2
l i=1

RMSE has the same units as the independent variable being predicted. Since residuals are squared,
this metric heavily weights outliers. The metric is bound between the maximum squared error and
0 (perfect prediction).

11
3.2. Mean Absolute Error (MAE)
MAE is given by the averaged absolute diﬀerences of predicted and observed values:

1∑ l
M AE = |f (xi ) − yi |
l i=1
Just like RMSE, MAE has the same unit scale as the dependent variable being predicted. Unlike
RMSE, MAE is not that sensitive to outliers. The metric is bound between the maximum absolute
error and 0 (perfect prediction).

3.3. Area under the Receiver Operating Characteristic Curves (AUC)

ROC curves are normally used for the assessment of binary classification techniques [14]. It is
however used in this context to measure how good the regression technique is in distinguishing high
values from low values of the dependent variable. To build the ROC curve, the observed values are
first classified into high and low classes using the mean y of the training set as reference. The area
under the ROC curve (AUC) is an estimate for the discriminatory power of the technique. The
metric varies from 0.5 (random classification) to 1 (perfect classification).

3.4. Area over the Regression Error Characteristic curves (AOC)

REC curves [6] generalise ROC curves for regression. The AOC curve plots the error tolerance
on the x-axis versus the percentage of points predicted within the tolerance (or accuracy) on the
y-axis. The resulting curve estimates the cumulative distribution function of the squared error.
The area over the REC curve (AOC) is an estimate of the predictive power of the technique. The
metric is bound between 0 (perfect prediction) and the maximum squared error.

3.5. Coeﬃcient of Determination (R2 )

The Coefficient of Determination R2 [13] can be defined as 1 minus the fraction of the residual
sum of squares to the total sum of squares:
SSerr
R2 = 1 −
SStot
∑
l ∑
l
where SSerr = (yi − f (xi ))2 , SStot = (yi − y)2 and y is the mean of the observed values.
i=1 i=1
Since the second term in the formula can be seen as the fraction of unexplained variance, the R2
can be interpreted as the fraction of explained variance. Although R2 is usually expressed as a
number on a scale from 0 to 1, R2 can yield negative values when the model predictions are worse
than using the mean y from the training set as prediction.
12
3.6. Pearson’s Correlation Coefficient (r)

Pearson’s r [11] is deﬁned as the sum of the products of the standard scores of the observed
and predicted values divided by the degrees of freedom:
( )( )
1 ∑ l
yi − y f (xi ) − f )
r=
l − 1 i=1 sy sf

with y and f the mean and sy and sf the standard deviation of respectively the observations and
predictions. Pearson’s r can take values between -1 (perfect negative correlation) and +1 (perfect
positive correlation) with 0 meaning no correlation at all.

3.7. Spearman’s Correlation Coeﬃcient (ρ)

Spearman’s ρ [11] is deﬁned as Pearson’s r applied to the rankings of predicted and observed
values. If there are no or few tied ranks however, it is more usual to use the equivalent formula

∑
l
6 d2i
i=1
ρ=1−
l(l2 − 1)

where dk is the diﬀerence between the ranks of observed and predicted values. Spearman’s ρ can
take values between -1 (perfect negative correlation) and +1 (perfect positive correlation) with 0
meaning no correlation at all.

3.8. Kendall’s Correlation Coeﬃcient (τ )

Kendall’s τ [11] measures the degree of correspondence between observed and predicted values.
In other words, it measures the association of cross tabulations:

nc − nd
τ=
2 l(l − 1)
1

where nc is the number of concordant pairs and nd is the number of discordant pairs. A pair of
observations {i, k} is said to be concordant when there is no tie in either observed or predicted LGD
(i.e. yi 6= yk , f (xi ) 6= f (xk )), and if sgn(f (xk ) − f (xi )) = sgn(yk − yi ), where i, k = 1, ..., l (i 6= k).
Similarly, it is said to be discordant if there is no tie and if sgn(f (xk ) − f (xi )) = −sgn(yk − yi ).
Kendall’s τ can take values between -1 (perfect negative correlation) and +1 (perfect positive
correlation) with 0 meaning no correlation at all.

13
4. Datasets and experimental set-up

In this section the characteristics of the datasets are described as well as the experimental
benchmarking framework to assess the predictive performance of the regression techniques. Further,
a description of a technique’s parameter setting and tuning is given where required.

4.1. Dataset characteristics

Table 2 displays the characteristics of 5 real-life retail lending LGD datasets from major inter-
national banking institutions. These include personal loans, revolving credit and mortgage loans.
The corresponding histograms of the LGD are shown in Figure 1. Note that the LGD distribution
for retail lending often contain one or two spikes around LGD = 0 (in which case there was a full
recovery) and/or LGD = 1 (no recovery). The number of dataset entries are as low as 3351 and
as high as 119210. The number of input variables vary from 12 to 44. The datasets are employed
to evaluate the predictive performance of 24 diﬀerent regression techniques.

Dataset Type Inputs Total size Training size Test size

BANK1 Personal loans 44 47853 31905 15948
BANK2 Mortgage loans 18 119211 79479 39732
BANK3 Mortgage loans 14 3351 2232 1119
BANK4 Revolving credit 12 7889 5260 2629
BANK5 Mortgage loans 35 4097 2733 1364
Table 2: Dataset characteristics of real-life LGD datasets

4.2. Experimental set-up

First, each dataset is randomly shuﬄed and divided into two thirds training set and one third
test set. The training set is used to build the models while the test set is solely used to assess
the predictive performance of these models. Where required, continuous independent variables are
standardised with the sample mean and standard deviation of the training set, nominal indepen-
dent variables are encoded with dummy variables and ordinal independent variables are encoded
with thermo variables.

An input selection method is used to remove irrelevant and redundant variables from the
dataset as this might improve the performance of regression techniques. For this, a stepwise selec-
tion method is applied for building the linear models. For computational eﬃciency reasons, an R2
14
BANK1 x 10
4 BANK2
15000 10

8
10000
6

4
5000
2

0 0
0 0.5 1 0 0.5 1

BANK3 BANK4
4000 4000

3000 3000

2000 2000

1000 1000

0 0
0 0.5 1 0 0.5 1

BANK5
400

300

200

100

0
0 0.5 1

Figure 1: LGD distributions of real-life LGD datasets

15
based ﬁlter method [15] is applied prior to building the non-linear models.

After building the models, the predictive performance of each dataset is measured on the test
set by comparing the predictions and observations according to several performance metrics. Next,
an average ranking of techniques over all datasets is generated per performance metric as well as
a meta-ranking of techniques over all datasets and all performance metrics.

Finally, the regression techniques are statistically compared with each other [12]. A Friedman
test [17] is performed to test the null hypothesis that all regression techniques perform alike ac-
cording to a speciﬁc performance metric, i.e., performance diﬀerences would just be due to random
chance. Friedman’s test is based on the ranked performances rather than the actual performance
estimates and is therefore less susceptible to outliers. The test statistic of the Friedman test is
calculated as: [ ]
12D ∑E
E(E + 1)2
χ2F = ARe2 −
E(E + 1) e=1 4

1 ∑D
where D is the number of datasets, E is the number of techniques, ARe = re is the average
D d=1 d
rank of technique e over all datasets and rde is the rank of technique e for dataset d. If the value
of the test statistic χ2F is large enough to reject the null hypothesis, it may be concluded that
performance differences among regression techniques are nonrandom. In this case, a post-hoc
Nemenyi test [29] can be applied to test the null hypothesis that two techniques perform alike.
This test states that the performances of two techniques are significantly different if the average
ranks differ by at least the critical difference (CD):
√
E(E + 1)
CD = q(α, ∞, E)
12D
where q(α, ∞, E) is the studentised range statistic and α is the significance level.

4.3. Parameter settings and tuning

During model building, several techniques require parameters to be set or tuned. This section
describes how these are set or tuned where appropriate.

4.3.1. Ridge Regression (RiR)

The ridge parameter λ is tuned by minimising the mean squared error with 10-fold cross
validation on the training set. Values are varied from 0 to 1 in steps of 0.01.
16
4.3.2. Robust Regression (RoR)
Though several objective functions ρ(e) are appropriate for robust regression, the commonly
used bisquare function is employed here:
 { }

 k 2 e 2 3

 1 − [1 − ( ) ] f or |e| ≤ k
ρ(e) = 6 k

 k2

 f or |e| > k
6
where k is set to 4.685 · s [19] and s is the standard deviation of the residual e.

4.3.3. Ordinary Least Squares with Box-Cox transformation (BC-OLS)

The value of parameter c is set to zero. The value of the power parameter λ is varied over a
chosen range (e.g. from −3 to 3 in 0.25 increments) and an optimal value is chosen based on a
maximum likelihood criterion.

4.3.4. Regression Trees (RT)

For the regression tree model, the training set is further split into a training and validation
set. The validation set is used to select the criterion for evaluating candidate splitting rules (i.e.
variance reduction or ProbF), the depth of the tree, and the threshold p-value for the ProbF
criterion. The choice of tree depth, the threshold p-value for the ProbF criterion and criterion
method was selected based on the mean squared error on the validation set.

4.3.5. Multivariate Adaptive Regression Splines (MARS)

The penalty c for adding a Hinge function is set to 2.5 [20] and the number of interactions d are
varied from 0 to 5 in steps of 1 and are tuned by minimising the mean squared error with 10-fold
cross validation on the training set.

4.3.6. Least Squares Support Vector Machines (LSSVM)

Although several kernels can be used, the radial basis function (RBF) kernel

kx − xi k2
−
K(x, xi ) = e 2σ 2

with kernel parameter σ is used here because of its good overall performance for LSSVM classiﬁers
[2]. The hyperparameters γ and σ for LSSVM regression are tuned with 10-fold cross validation
on the training dataset. A gridsearch evaluates all possible combinations of parameters within
the search space in order to ﬁnd a possible optimal combination that minimises the mean squared
17
[ √ √]
error. The limits of the grid for the kernel parameter σ are set to 0.5 · l, 500 · l and the
[ ]
0.01 1000
limits of the grid for the regularisation parameter γ are set to , [35]. Estimating the
n n
LSSVM hyperparameters this way can be a computational burden. To tune the hyperparameters,
a sample from the complete training dataset is chosen as follows. First, 100 random subsets of 4000
observations are chosen. Next, the LGD distribution histogram of each subset is compared with the
LGD distribution histogram of the complete training set, and the subset that best approximates
the original set based on the mean squared error, is chosen.

4.3.7. Artiﬁcial Neural Networks (ANN)

For the ANN model, the training set is further split into a training and validation set. The
validation is used to evaluate the target layer activation functions (logistic, linear, exponential,
reciprocal, square, sine, cosine, tanh and arcTan) and number of hidden neurons (1-20) used in the
model. The weights of the network are ﬁrst randomly initialised and then iteratively adjusted so
as to minimise the mean squared error. The choice of activation function and number of hidden
neurons is selected based on the mean squared error on the validation set. The hidden layer
activation function is set as logistic.

5. Experimental results and discussion

Tables 3 to 7 display the values of several prediction performance metrics per dataset of the
24 regression models. Top performances for each metric are underlined. Figure 2 illustrates the
predictive performance of the regression techniques in and across all datasets in terms of the ab-
solute metrics AU C, R2 , r, ρ and τ . Similar trends can be observed across these metrics. Note
that diﬀerences in type of portfolio and available input variables cause a variation of the predictive
performance across the banks.

Although all absolute metrics can be used, it is adviced to use the R2 metric for the comparison
of regression models across datasets as it measures the calibration and is easily interpretable. As
indicated in Figure 2, the average predictive performance of these regression models in terms of
R2 varies from 4 % to 43 %. This means that the variance in the LGD that can be explained by
the independent variables is consistently below 50 %. Although most of the models do a better job
than a model that uses the mean of the training LGD, still most of the variance in the LGD can

18
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.3257 0.3716 0.6570 0.1380 0.0972 0.3112 0.3084 0.2145
B-OLS 0.3474 0.4294 0.6580 0.1843 -0.2060 0.2954 0.2991 0.2071
BC-OLS 0.3835 0.4579 0.5180 0.2096 -0.3747 0.2403 0.2312 0.1602
BR 0.3356 0.3693 0.5690 0.1363 0.0546 0.2601 0.2641 0.1844
RiR 0.3267 0.3723 0.6561 0.1385 0.0933 0.3056 0.3033 0.2106
RoR 0.3262 0.3723 0.6565 0.1385 0.0935 0.3061 0.3034 0.2107
RT 0.3228 0.3732 0.5990 0.1392 0.0892 0.2997 0.2913 0.2095
ANN 0.3118 0.3648 0.6840 0.1331 0.1295 0.3603 0.3559 0.2524
LSSVM 0.3184 0.3669 0.6723 0.1346 0.1194 0.3466 0.3442 0.2444
MARS 0.3214 0.3704 0.6657 0.1372 0.1027 0.3205 0.3122 0.2187
LOG+OLS 0.3202 0.3700 0.6210 0.1366 0.1063 0.3262 0.3143 0.2214
LOG+B-OLS 0.3163 0.3750 0.6020 0.1406 0.1002 0.3166 0.3103 0.2185
LOG+BC-OLS 0.4308 0.5090 0.5040 0.2590 -0.6946 0.2125 0.2440 0.1731
LOG+BR 0.3560 0.4142 0.5270 0.1715 0.0782 0.2797 0.2591 0.1794
LOG+RiR 0.3193 0.3693 0.6655 0.1363 0.1081 0.3289 0.3167 0.2234
LOG+RoR 0.3171 0.3700 0.6554 0.1369 0.1045 0.3264 0.3205 0.2270
LOG+RT 0.3219 0.3693 0.6160 0.1363 0.1081 0.3301 0.3212 0.2263
LOG+ANN 0.3174 0.3664 0.6320 0.1342 0.1221 0.3502 0.3406 0.2395
LOG+LSSVM 0.3191 0.3679 0.6664 0.1353 0.1150 0.3401 0.3336 0.2371
LOG+MARS 0.3205 0.3689 0.6658 0.1360 0.1099 0.3320 0.3248 0.2286
OLS+MARS 0.3177 0.3679 0.6799 0.1353 0.1150 0.3394 0.3363 0.2352
OLS+LSSVM 0.3115 0.3631 0.6929 0.1317 0.1379 0.3714 0.3666 0.2596
OLS+RT 0.3170 0.3681 0.6730 0.1354 0.1137 0.3382 0.3342 0.2348
OLS+ANN 0.3079 0.3633 0.6960 0.1318 0.1367 0.3716 0.3638 0.2581
Table 3: BANK1 performances

19
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.1187 0.1613 0.8100 0.0259 0.2353 0.4851 0.4890 0.3823
B-OLS 0.1058 0.1621 0.8000 0.0262 0.2273 0.4768 0.4967 0.3881
BC-OLS 0.1056 0.1623 0.7450 0.0262 0.2226 0.4718 0.4990 0.3900
BR 0.1020 0.1661 0.7300 0.0275 0.2120 0.4635 0.4857 0.3861
RiR 0.1187 0.1606 0.8074 0.0258 0.2415 0.4915 0.4855 0.3792
RoR 0.1075 0.1663 0.8063 0.0277 0.1866 0.4770 0.4824 0.3751
RT 0.0978 0.1499 0.7710 0.0224 0.3390 0.5823 0.5452 0.4357
ANN 0.0956 0.1472 0.8530 0.0216 0.3632 0.6029 0.5549 0.4366
LSSVM 0.1047 0.1518 0.8365 0.0230 0.3229 0.5690 0.5301 0.4160
MARS 0.1068 0.1531 0.8397 0.0234 0.3113 0.5579 0.5321 0.4168
LOG+OLS 0.1060 0.1622 0.7590 0.0255 0.2268 0.4838 0.5206 0.4084
LOG+B-OLS 0.1040 0.1567 0.8320 0.0245 0.2779 0.5286 0.5202 0.4083
LOG+BC-OLS 0.1034 0.1655 0.7320 0.0273 0.2124 0.4628 0.4870 0.3820
LOG+BR 0.1015 0.1688 0.7250 0.0285 0.2024 0.4529 0.4732 0.3876
LOG+RiR 0.1049 0.1554 0.8312 0.0240 0.2901 0.5386 0.5209 0.4091
LOG+RoR 0.1043 0.1558 0.8307 0.0242 0.2859 0.5350 0.5200 0.4084
LOG+RT 0.1041 0.1538 0.8360 0.0236 0.3049 0.5545 0.5254 0.4126
LOG+ANN 0.1011 0.1531 0.8430 0.0234 0.3109 0.5585 0.5380 0.4240
LOG+LSSVM 0.1031 0.1530 0.8334 0.0234 0.3121 0.5587 0.5243 0.4128
LOG+MARS 0.1031 0.1537 0.8355 0.0236 0.3059 0.5531 0.5268 0.4149
OLS+MARS 0.1081 0.1526 0.8379 0.0233 0.3150 0.5615 0.5300 0.4156
OLS+LSSVM 0.1029 0.1520 0.8428 0.0230 0.3208 0.5665 0.5398 0.4241
OLS+RT 0.1015 0.1506 0.8410 0.0227 0.3331 0.5786 0.5344 0.4188
OLS+ANN 0.0999 0.1474 0.8560 0.0217 0.3612 0.6010 0.5585 0.4398
Table 4: BANK2 performances

20
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.0549 0.1411 0.6460 0.0178 0.0124 0.1168 0.0965 0.0718
B-OLS 0.0348 0.1449 0.6610 0.0188 -0.0419 0.0767 0.1754 0.1361
BC-OLS 0.0340 0.1456 0.6380 0.0190 -0.0529 0.1373 0.2312 0.1765
BR 0.0883 0.1315 0.6530 0.0169 -0.1128 0.1567 0.1719 0.1323
RiR 0.0550 0.1405 0.6499 0.0177 0.0210 0.1460 0.1270 0.0936
RoR 0.0347 0.1453 0.6438 0.0189 -0.0464 0.1733 0.1991 0.1501
RT 0.0482 0.1311 0.6990 0.0154 0.1477 0.3869 0.2007 0.1673
ANN 0.0458 0.1318 0.6000 0.0152 0.1386 0.3776 0.1482 0.1105
LSSVM 0.0473 0.1270 0.7441 0.0140 0.1998 0.4526 0.2085 0.1520
MARS 0.0478 0.1229 0.7345 0.0131 0.2506 0.5016 0.1344 0.0974
LOG+OLS 0.0553 0.1417 0.6010 0.0179 0.0043 0.0759 0.0701 0.0510
LOG+B-OLS 0.0392 0.1429 0.6330 0.0182 -0.0127 0.1214 0.1252 0.0923
LOG+BC-OLS 0.0349 0.1448 0.6330 0.0188 -0.0395 0.1665 0.1918 0.1426
LOG+BR 0.0569 0.1417 0.5790 0.0180 0.0043 0.0742 0.1710 0.1265
LOG+RiR 0.0545 0.1408 0.6404 0.0177 0.0169 0.1319 0.1511 0.1094
LOG+RoR 0.0366 0.1440 0.6510 0.0185 -0.0277 0.1510 0.2057 0.1504
LOG+RT 0.0434 0.1297 0.7210 0.0146 0.1663 0.4553 0.1571 0.1170
LOG+ANN 0.0452 0.1219 0.6190 0.0133 0.2634 0.5381 0.1671 0.1242
LOG+LSSVM 0.0460 0.1312 0.7485 0.0151 0.1471 0.4152 0.2272 0.1676
LOG+MARS 0.0467 0.1264 0.7365 0.0139 0.2082 0.4884 0.1381 0.0998
OLS+MARS 0.0471 0.1229 0.7189 0.0131 0.2512 0.5018 0.1231 0.0879
OLS+LSSVM 0.0483 0.1258 0.7416 0.0137 0.2148 0.4648 0.1869 0.1354
OLS+ANN 0.0570 0.1388 0.6730 0.0171 0.0442 0.2605 0.1369 0.1005
OLS+RT 0.0540 0.1372 0.7050 0.0168 0.0660 0.2578 0.1748 0.1285
Table 5: BANK3 performances

21
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.2712 0.3479 0.8520 0.1208 0.4412 0.6643 0.5835 0.4331
B-OLS 0.2214 0.3743 0.8500 0.1396 0.3530 0.6510 0.5822 0.4321
BC-OLS 0.3185 0.4292 0.6750 0.1839 0.1478 0.5726 0.5820 0.4316
BR 0.3208 0.3777 0.8480 0.1425 0.3405 0.6527 0.5908 0.4452
RiR 0.2707 0.3473 0.8541 0.1204 0.4429 0.6657 0.5972 0.4495
RoR 0.2576 0.3607 0.8483 0.1299 0.3992 0.6527 0.5857 0.4402
RT 0.2476 0.3362 0.8480 0.1128 0.4782 0.6916 0.5919 0.4762
ANN 0.2393 0.3299 0.8670 0.1086 0.4974 0.7053 0.6109 0.4555
LSSVM 0.2428 0.3315 0.8655 0.1097 0.4924 0.7017 0.6203 0.4692
MARS 0.2617 0.3361 0.8636 0.1128 0.4783 0.6917 0.6162 0.4631
LOG+OLS 0.2577 0.3465 0.8520 0.1199 0.4455 0.6678 0.5840 0.4338
LOG+B-OLS 0.2399 0.3551 0.8500 0.1259 0.4176 0.6651 0.5801 0.4301
LOG+BC-OLS 0.2502 0.3489 0.8510 0.1215 0.4379 0.6659 0.5819 0.4322
LOG+BR 0.2738 0.3560 0.8520 0.1265 0.4147 0.6680 0.5868 0.4342
LOG+RiR 0.2538 0.3432 0.8572 0.1176 0.4559 0.6755 0.6026 0.4543
LOG+RoR 0.2354 0.3521 0.8534 0.1238 0.4275 0.6728 0.5960 0.4477
LOG+RT 0.2679 0.3621 0.8570 0.1309 0.3945 0.6656 0.5899 0.4364
LOG+ANN 0.2558 0.3457 0.8540 0.1184 0.4480 0.6698 0.5852 0.4348
LOG+LSSVM 0.2534 0.3425 0.8590 0.1172 0.4581 0.6771 0.6024 0.4541
LOG+MARS 0.2536 0.3433 0.8572 0.1177 0.4558 0.6754 0.6027 0.4544
OLS+MARS 0.2617 0.3362 0.8620 0.1128 0.4781 0.6915 0.6117 0.4582
OLS+LSSVM 0.2439 0.3322 0.8656 0.1102 0.4904 0.7003 0.6211 0.4698
OLS+RT 0.2628 0.3425 0.8590 0.1171 0.4582 0.6776 0.6017 0.4498
OLS+ANN 0.2404 0.3300 0.8710 0.1087 0.4971 0.7053 0.6195 0.4635
Table 6: BANK4 performances

22
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.1875 0.2375 0.7480 0.0555 0.2218 0.4740 0.5192 0.3651
B-OLS 0.1861 0.2368 0.7410 0.0561 0.2263 0.5073 0.5168 0.3636
BC-OLS 0.1848 0.2373 0.7390 0.0560 0.2228 0.5014 0.5155 0.3632
BR 0.1957 0.2402 0.7240 0.0575 0.2038 0.4557 0.4811 0.3359
RiR 0.1864 0.2373 0.7467 0.0555 0.2233 0.4775 0.5238 0.3704
RoR 0.1892 0.2430 0.7406 0.0579 0.1852 0.4543 0.5121 0.3612
RT 0.1851 0.2324 0.7370 0.0538 0.2546 0.5056 0.4957 0.3888
ANN 0.1678 0.2173 0.7830 0.0470 0.3486 0.5964 0.5765 0.4148
LSSVM 0.1707 0.2198 0.7847 0.0479 0.3331 0.5794 0.5801 0.4167
MARS 0.1733 0.2222 0.7709 0.0488 0.3187 0.5666 0.5565 0.3980
LOG+OLS 0.1851 0.2336 0.7500 0.0542 0.2468 0.4975 0.5246 0.3704
LOG+B-OLS 0.1852 0.2347 0.7480 0.0548 0.2397 0.5117 0.5192 0.3658
LOG+BC-OLS 0.1833 0.2349 0.7470 0.0549 0.2388 0.5099 0.5238 0.3699
LOG+BR 0.1939 0.2395 0.7250 0.0572 0.2083 0.4568 0.4820 0.3364
LOG+RiR 0.1854 0.2347 0.7492 0.0547 0.2400 0.4922 0.5274 0.3730
LOG+RoR 0.1877 0.2390 0.7451 0.0567 0.2118 0.4744 0.5190 0.3665
LOG+RT 0.1846 0.2344 0.7380 0.0547 0.2420 0.5000 0.4903 0.3445
LOG+ANN 0.1689 0.2188 0.7810 0.0476 0.3396 0.5845 0.5737 0.4135
LOG+LSSVM 0.1708 0.2197 0.7835 0.0479 0.3340 0.5797 0.5795 0.4163
LOG+MARS 0.1738 0.2217 0.7726 0.0486 0.3215 0.5687 0.5597 0.3985
OLS+MARS 0.1713 0.2215 0.7740 0.0484 0.3231 0.5769 0.5707 0.4082
OLS+LSSVM 0.1695 0.2216 0.7882 0.0485 0.3223 0.5755 0.5933 0.4279
OLS+RT 0.1779 0.2320 0.7660 0.0530 0.2572 0.5357 0.5554 0.3963
OLS+ANN 0.1747 0.2277 0.7730 0.0510 0.2844 0.5567 0.5706 0.4086
Table 7: BANK5 performances

23
not be explained even with the best models.

The pure linear models built by OLS, RiR and RoR do not show consistent diﬀerences in per-
formance. RiR leads almost to an identical model as OLS. This indicates that the independent
variables in the datasets are not heavily correlated with each other. For all datasets, RoR yields
models with a somewhat lower or equal prediction performance. This indicates an absence of out-
liers and causes the technique to be less eﬃcient compared to OLS.

The empirical distribution of LGD often causes the OLS assumption of normally distributed
error terms to be violated. The techniques B-OLS, BR, BC-OLS are designed to cope with certain
types of non-normal error distributions. Nonetheless, these techniques are shown to perform worse
than OLS, suggesting that, unlike for corporate LGD models [18], they are not better at coping
with the pronounced point densities observed in retail lending LGD datasets, while they may be
less eﬃcient than OLS or could introduce model bias if a transformation is performed prior to OLS
estimation (as with B-OLS and BC-OLS).

Non-linear techniques such as RT, MARS, LSSVM and ANN perform consistently better than
linear techniques. This implies that the relation between the LGD and the diﬀerent independent
variables in the datasets is non-linear as is most noticeable in BANK3. In general, LSSVM and
especially ANN perform better than RT and MARS. However, LSSVM and ANN result in black
box models while RT and MARS result in comprehensible white box models. In contrast with
some prior benchmarking studies on classiﬁcation models for PD (e.g. [1]), non-linear models seem
to outperform linear models for the prediction of LGD.

The evaluation of the performance of the two-stage models where a logistic regression model to
choose between LGD ≤ 0 and LGD > 0 is combined with a second-stage model for LGD > 0, is not
that straightforward. Although a weak trend is noticeable that logistic regression combined with
linear models increases the performance of the latter, it seems that logistic regression combined
with non-linear models slightly diminishes the strong performance of the latter.

24
AUC R2

0.4
0.8
0.2
0
0.7
−0.2
0.6 −0.4
−0.6
0.5
1 2 3 4 5 1 2 3 4 5
r ρ

0.6 0.6
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
1 2 3 4 5 1 2 3 4 5
τ

0.4

0.3

0.2

0.1

1 2 3 4 5

Figure 2: Comparison of predictive performances accross 5 real-life retail lending datasets

25
In contrast to the previous two-stage model, a clear trend can be observed with the combination
of a linear and a non-linear model. By estimating the error of OLS with a non-linear technique
the predictive performance increases to the level of the respective one-stage non-linear technique.
These two-stage models combine the comprehensibility of linear regression with the high predic-
tion performance of non-linear regression. Note that this methodology has also been successfully
applied in a classiﬁcation credit scoring context [32] [33] [34].

The average ranking over all datasets according to each performance metric is listed in columns
2 to 9 of Table 8. The best performing technique for each metric is underlined and techniques that
signiﬁcantly perform worse according to the Nemenyi’s post-hoc test (α = 0.5) are in italic. The last
column illustrates the meta-ranking (MR) as the average ranking (AR) over all datasets and over
all metrics. The techniques in the table are sorted according to their meta-ranking. Additionally,
columns 10 to 14 covers the meta-ranking only including respectively calibration, discrimination,
relative and absolute metrics. The best performing techniques are consistently ranked in the top
according to each metric, no matter whether they measure calibration or discrimination or are
absolute or relative.

The results of the Friedman test and subsequent Nemenyi’s post-hoc test with significance
level α = 0.05 can be intuitively visualised using Demsar’s significance diagram [12]. Although the
Demsar diagram can be displayed for all metric ranks, it is only illustrated in Figure 3 for R2 based
ranks. The diagram displays the performance rank of each technique along with a line segment
representing its corresponding critical difference (CD = 16.27). The right end of the line indicates
from which mean rank onward another technique is outperformed significantly by that technique.
The diagram is constructed with the regression techniques listed in ascending order of performance
on the y-axis, and the technique’s mean rank across all datasets displayed on the x-axis. A vertical
dashed line has been inserted to clearly identify the end of the best performing technique’s tail and
the start of the next significantly different technique.

Despite clear and consistent differences between regression techniques in terms of R2 , most
techniques do not differ significantly according to the Nemenyi test. Nonetheless, failing to reject

26
Method M AE RM SE AU C AOC R2 r ρ τ MRcal MRdis MRrel MRabs MR
OLS+LS-SVM 7.2 4.2 2.6 4.1 4.2 4.6 3 3.4 4.9 3.7 5.2 3.56 4.1
ANN 3.4 3.4 6.8 3 3.2 3.3 6.2 6.2 3.3 4.9 3.3 5.14 4.4
LSSVM 9.4 4.6 4.4 4.6 4.6 4.8 3.8 4.2 5.8 4.4 6.2 4.36 4.8
OLS+ANN 8.2 5.6 4 8.2 5.4 4.9 6.2 6 6.9 5.4 7.3 5.3 5.9
LOG+LS-SVM 8.9 7 5.9 7.2 7.1 6.8 6.8 6.4 7.6 6.7 7.7 6.6 7.0
LOG+ANN 6.8 5.7 11.6 6 5.8 5.8 9.2 9 6.1 7.8 6.2 8.28 7.3
OLS+MARS 12.9 5.5 6 5.2 5.5 5.6 9.6 10.2 7.3 7.1 7.9 7.38 7.6
OLS+RT 11.1 8.5 7.1 5.6 8.2 8.4 8.6 9.2 8.4 8.4 8.4 8.3 8.5
LOG+MARS 10.5 8.6 7.9 8.7 8.6 8.6 10.2 10.6 9.1 9.1 9.3 9.18 9.3
MARS 14.3 8 6.8 7.9 7.8 8 10.6 10.8 9.5 8.7 10.1 8.8 9.3
RT 11.1 9.5 18.5 9.8 9.4 10.2 12.4 7.4 10.0 11.3 10.1 11.58 10.3
LOG+RiR 14.8 12.7 12.3 12.4 12.3 14.2 11.8 12.4 13.1 12.7 13.3 12.6 12.8

27
LOG+RT 13.2 12.8 13 12.8 12.7 12.2 14.4 15 12.9 13.4 12.9 13.46 13.4
LOG+RoR 9.6 17.1 14.6 17.2 16.8 14.8 12 11.7 15.2 14.6 14.6 13.98 14.2
LOG+OLS 16.3 15 17.2 14.2 14.5 17.2 16.4 16.8 15.0 16.3 15.2 16.42 16.0
LOG+B-OLS 8.4 17.3 16.7 17.4 16.2 16 18.1 18.6 14.8 17.3 14.4 17.12 16.6
RiR 20.3 16 14.4 16.1 15.8 17.4 16.9 17.3 17.1 16.3 17.5 16.36 16.8
OLS 20.1 16.8 14.3 16.5 16.6 19 18.7 19.6 17.5 17.5 17.8 17.64 18.0
LOG+BC-OLS 11.8 19.6 19.7 19.7 19.4 17.8 17.3 17.6 17.6 18.6 17.0 18.36 18.0
B-OLS 12.2 20.2 15.5 21 20 19.6 17 17.4 18.4 18.3 17.8 17.9 18.3
RoR 15.4 21.5 17.2 21.5 21.4 18.9 16.6 16.6 20.0 18.7 19.5 18.14 18.7
BR 19.8 17.8 20.5 18.2 22.6 20.7 18.2 17.8 19.6 18.8 18.6 19.96 19.5
BC-OLS 15.4 21.9 21.2 21.9 21.8 20.2 16.6 17 20.3 19.8 19.7 19.36 19.3
LOG+BR 18.9 20.7 21.8 20.8 20.1 21 19.4 18.8 20.1 20.4 20.1 20.22 20.0

Table 8: Average ranking (AR) and meta-ranking (MR) accross all metrics and datasets
the null hypothesis that two techniques have equal performances does not guarantee that it is
true. For example, Nemenyi’s test is unable to reject the null hypothesis that ANN and OLS have
equal performances although ANN consistently performs better than OLS. This can mean that the
performance differences between these two are just due to chance. But the result could also be
a Type II error. Possibly the Nemenyi test does not have sufficient power to detect a significant
difference, given a significance level of α = 0.05, 5 datasets and 24 techniques [26]. The insufficient
power of the test can be explained by the use of a large number of techniques in contrast with a
relatively small number of datasets.

BR
BC−OLS
RoR
LOG+BR
B−OLS
LOG+BC−OLS
LOG+RoR
OLS
LOG+B−OLS
RiR
LOG+OLS
LOG+RT
LOG+RiR
RT
LOG+MARS
OLS+RT
MARS
LOG+LSSVM
LOG+ANN
OLS+MARS
OLS+ANN
LSSVM
OLS+LSSVM
ANN

0 5 10 15 20 25 30 35 40

Figure 3: Demsar’s significance diagram for R2 based ranks

28
6. Conclusion

This first large scale LGD benchmarking study evaluates 24 regression techniques on 5 real-
life retail lending datasets from major international banking institutions. The average predictive
performance of the models in terms of R2 ranges from 4 % to 43 %, which indicates that most
resulting models models do not have satisfactory explanatory power. Nonetheless, a clear trend
can be seen that non-linear techniques and artificial neural networks and support vector machines
in particular give higher performances than more traditional linear techniques. This indicates the
presence of non-linear interactions between the independent variables and the LGD, contrary to
some studies in PD modelling [1] where the difference between linear and non-linear techniques is
not that explicit. Given the fact that LGD has a bigger impact on the minimal capital require-
ments than PD, we demonstrated the potential and importance of applying non-linear techniques,
preferably in a two-stage context to obtain comprehensibility as well, for LGD modelling.

There is considerable evidence that the macro-economy aﬀects the client’s credit risk behaviour
so it might be an interesting topic of further research to examine the inﬂuence of macro-economic
variables [5], both in the context of improving LGD models as for stress testing. Finally, one
could also try to add comprehensibility to well-performing black box models with rule extraction
techniques to gain more insight [27] [28].

7. Acknowledgements

We would like to thank the Flemish Research Fund for the post-doctoral research grant to
David Martens and the Odysseus grant B.0915.09 to Bart Baesens. We would also like to thank
the EPSRC and SAS UK for their ﬁnancial support to Iain Brown.

References
[1] Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking
state of the art classification algorithms for credit scoring. Journal of the Operational Research Society, 54,
627–635.
[2] Baesens, B., Viaene, S., Van Gestel, T., Suykens, J., Dedene, G., De Moor, B., & Vanthienen, J. (2000). An
empirical assessment of kernel type performance for least squares support vector machine classifiers. International
Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 1, 313–316.
[3] Bastos, J. (2009). Forecasting bank loans for loss-given-default. CEMAPRE Working Papers 0901, Centre for
Applied Mathematics and Economics (CEMAPRE), School of Economics and Management (ISEG), Technical
University of Lisbon.

29
[4] Bellotti, T. & Crook, J. (2007). Modelling and predicting loss given default for credit cards. In: Credit Scoring
and Credit Control XI conference.
[5] Bellotti, T. & Crook, J. (2009). Macroeconomic conditions in models of lgd for retail credit. In: Credit Scoring
and Credit Control XI conference.
[6] Bi, J. & Bennet, K. P. (2003). Regression error characteristic curves. In: Twentieth International Conference
on Machine Learning.
[7] Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
[8] Box, G. & Cox, D. (1964). An analysis of transformations. Journal of Royal Statistics Society, 26, 211–252.
[9] Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and Regression Trees. Chapman &
Hall/CRC.
[10] Caselli, S. & Querci, F. (2009). The sensitivity of the loss given default rate to systematic risk: New empirical
evidence on bank loans. Journal of Financial Services Research, 34, 1–34.
[11] Cohen, P., Cohen, J., West, S., & Aiken, L. (2002). Applied multiple regression/correlation analysis for the
behavioral sciences. Lawrence Erlbaum.
[12] Demsar, J. (2006). Statistical comparison of classifiers over multiple data sets. Journal of Machine Learning
Research, 7, 1–30.
[13] Draper, N. & Smith, H. (1998). Applied Regression Analysis. Wiley.
[14] Fawcett, T. (2006). An introduction to roc analysis. Pattern Recognition Letters, 27, 861–874.
[15] Freund, R. & Littell, R. (2000). SAS System for Regression. Wiley.
[16] Friedman, J. F. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–141.
[17] Friedman, M. (1940). A comparison of alternative tests of significance for the problems of m rankings. Annals
of Mathematical Statistics, 11, 86–92.
[18] Gupton, G. & Stein, M. (2002). Losscalc: Model for predicting loss given default (lgd). Tech. rep., Moody’s.
[19] Hampel, F., Ronchetti, R., Rousseeuw, P., & Stahel, W. (1986). Robust statistics : the approach based on
influence functions. Wiley.
[20] Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer Series in Statistics.
[21] Hoerl, A. E. & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12, 55–67.
[22] Holland, P. & Welsch, R. (1977). Robust regression using iteratively reweighted least squares. Communications
in Statistics: Theory and Methods, 6, 813 – 827.
[23] Hosmer, D. & Stanley, L. (2000). Applied Logistic Regression. Wiley, 2nd edn.
[24] Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101.
[25] Huber, P. & Ronchetti, E. (2009). Robust statistics. Wiley.
[26] Lessmann, S., Baesens, B., Mues, C., & Pietsch, S. (2008). Benchmarking classification models for software
defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34,
485–496.
[27] Martens, D., Baesens, B., & Van Gestel, T. (2009). Decompositional rule extraction from support vector ma-
chines by active learning. IEEE Transactions on Knowledge and Data Engineering, 21, 178–191.
[28] Martens, D., Baesens, B., Van Gestel, T., & Vanthienen, J. (2007). Comprehensible credit scoring models using
rule extraction from support vector machines. European Journal of Operational Research, 183, 1466–1476.
[29] Nemenyi, P. (1963). Distribution-free multiple comparisons. Ph.D. thesis, Princeton University.
[30] Smithson, M. & Verkuilen, J. (2006). A better lemon squeezer? maximum-likelihood regression with beta-
distributed dependent variables. Psychological Methods, 11, 54–71.
[31] Suykens, J., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2003). Least Squares Support
Vector Machines. World Scientific Publishing Company.
[32] Van Gestel, T., Baesens, B., Van Dijcke, P., Garcia, J., Suykens, J., & Vanthienen, J. (2006). A process model
to develop an internal rating system: Sovereign credit ratings. Decision Support Systems, 2, 1131–1151.
[33] Van Gestel, T., Baesens, B., Van Dijcke, P., Suykens, J., Garcia, J., & Alderweireld, T. (2005). Linear and
non-linear credit scoring by combining logistic regression and support vector machines. Journal of Credit Risk,
1.
[34] Van Gestel, T., Martens, D., Feremans, D., Baesens, B., Huysmans, J., & Vanthienen, J. (2007). Forecasting
and analyzing insurance companies’ ratings. International Journal of Forecasting, 23, 513–529.
[35] Van Gestel, T., Suykens, J., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., & Vandewalle,
J. (2003). Benchmarking least squares support vector machine classifiers. Machine Learning, 54, 5–32.
[36] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.
30
[37] Wang, H. & Hu, D. (2005). Comparison of svm and ls-svm for regression. International Conference on Neural
Networks and Brain, 1, 279–283.

Quick Reference Guide To Unique Pet Species
100% (1)
Quick Reference Guide To Unique Pet Species
624 pages
Education Media List
No ratings yet
Education Media List
114 pages
Sub Rosa Issue 15
100% (1)
Sub Rosa Issue 15
80 pages
Bay Leaf
No ratings yet
Bay Leaf
8 pages
Marketing Management Notes For All 5 Units PDF
No ratings yet
Marketing Management Notes For All 5 Units PDF
133 pages
Vocabulary For Poetry Analysis
100% (3)
Vocabulary For Poetry Analysis
2 pages
In Other Words
100% (1)
In Other Words
60 pages
Praying Like Jesus - 10 Brief Studies in Prayer Author Pastor Rick Ezell
No ratings yet
Praying Like Jesus - 10 Brief Studies in Prayer Author Pastor Rick Ezell
11 pages
DA Unit-3
No ratings yet
DA Unit-3
11 pages
Filipino Thesis Sample Kabanata 2
100% (3)
Filipino Thesis Sample Kabanata 2
5 pages
The Stoic Doctrine of Providence A Study of Its Development and of Some of Its Major Issues 9781138125162 9781032049083 9781315647678
No ratings yet
The Stoic Doctrine of Providence A Study of Its Development and of Some of Its Major Issues 9781138125162 9781032049083 9781315647678
391 pages
LGD Modelling Compilation 1742809595
No ratings yet
LGD Modelling Compilation 1742809595
295 pages
Secured Borrowing and A Sale of Receivables
100% (2)
Secured Borrowing and A Sale of Receivables
1 page
Course Outline Leadership
100% (3)
Course Outline Leadership
4 pages
Lecture+Notes+-+Advanced+Regression
No ratings yet
Lecture+Notes+-+Advanced+Regression
12 pages
ML Unit-2
No ratings yet
ML Unit-2
123 pages
Unit-2 Machine Learning
No ratings yet
Unit-2 Machine Learning
110 pages
Unit 2 - ML - SRM
No ratings yet
Unit 2 - ML - SRM
89 pages
Seeing The Beat Generation Entering The Literature Through Film Raj Chandarlapaty Instant Download
No ratings yet
Seeing The Beat Generation Entering The Literature Through Film Raj Chandarlapaty Instant Download
80 pages
What Are Linear Models in Machine Learning (1) .Docx (Unit3 ML)
No ratings yet
What Are Linear Models in Machine Learning (1) .Docx (Unit3 ML)
60 pages
2EL1730 ML Lecture02 Linear and Logistic Regression
No ratings yet
2EL1730 ML Lecture02 Linear and Logistic Regression
65 pages
Unit 2 - ML - SRM
No ratings yet
Unit 2 - ML - SRM
66 pages
Belleville
No ratings yet
Belleville
73 pages
Artemis 1st Edition Stephanie Lynn Budin PDF Download
No ratings yet
Artemis 1st Edition Stephanie Lynn Budin PDF Download
46 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
54 pages
DAV 2201079 Exp 2 2-1
No ratings yet
DAV 2201079 Exp 2 2-1
35 pages
Lecture 10 - 04.09.2024 - Regression-02 Lecture Slides
No ratings yet
Lecture 10 - 04.09.2024 - Regression-02 Lecture Slides
61 pages
EC501 Lecture 04
No ratings yet
EC501 Lecture 04
30 pages
Lecture 09 - 02.09.2024 - Regression-01
No ratings yet
Lecture 09 - 02.09.2024 - Regression-01
62 pages
Module 5.2
No ratings yet
Module 5.2
51 pages
Performance Measures of LGD Models Katarzyna Bijak and Lyn Thomas 1
No ratings yet
Performance Measures of LGD Models Katarzyna Bijak and Lyn Thomas 1
32 pages
Gavriella Michael Thesis
No ratings yet
Gavriella Michael Thesis
85 pages
Models PDF
No ratings yet
Models PDF
86 pages
Modeling Bank Loan LGD of Corporate and
No ratings yet
Modeling Bank Loan LGD of Corporate and
23 pages
Regression
No ratings yet
Regression
25 pages
Unit - II - DA
No ratings yet
Unit - II - DA
22 pages
Optimal Bandwidth Choice For The Regression Discontinuity Estimator
No ratings yet
Optimal Bandwidth Choice For The Regression Discontinuity Estimator
27 pages
ML Unit-2 Final
No ratings yet
ML Unit-2 Final
32 pages
Mathematical Programming For Piecewise Linear Regression Analysis
No ratings yet
Mathematical Programming For Piecewise Linear Regression Analysis
43 pages
A Mathematical Programming Approach For Improving The Robustness of LAD Regression
No ratings yet
A Mathematical Programming Approach For Improving The Robustness of LAD Regression
22 pages
Logistic Regression
No ratings yet
Logistic Regression
19 pages
Teit ML2
No ratings yet
Teit ML2
11 pages
Linear Regression
No ratings yet
Linear Regression
9 pages
Da Sem Unit 3-1
No ratings yet
Da Sem Unit 3-1
13 pages
Everything You Need To Know About Linear Regression
No ratings yet
Everything You Need To Know About Linear Regression
19 pages
Bio 101 Hereditary Notes-Dr Anifowoshe
No ratings yet
Bio 101 Hereditary Notes-Dr Anifowoshe
10 pages
DA Unit-3
No ratings yet
DA Unit-3
14 pages
Lesson 3 - The Global Economy
No ratings yet
Lesson 3 - The Global Economy
6 pages
De Cuong On Tap HKI Tieng Anh 8 Global
No ratings yet
De Cuong On Tap HKI Tieng Anh 8 Global
5 pages
A Support Vector Machine Approach To Credit Scoring
No ratings yet
A Support Vector Machine Approach To Credit Scoring
15 pages
Advance Machine Learning
No ratings yet
Advance Machine Learning
16 pages
Unit1 - Data Science - SPPU
No ratings yet
Unit1 - Data Science - SPPU
15 pages
Ridge Regression and Ill-Conditioning
No ratings yet
Ridge Regression and Ill-Conditioning
10 pages
I-Flange SAE
No ratings yet
I-Flange SAE
59 pages
Unit - Iii
No ratings yet
Unit - Iii
9 pages
Transpo Law Reviewer 3D Tesoro
No ratings yet
Transpo Law Reviewer 3D Tesoro
57 pages
Notes MSM
No ratings yet
Notes MSM
66 pages
A) The Least-Squares Method
No ratings yet
A) The Least-Squares Method
19 pages
Unit III
No ratings yet
Unit III
18 pages
Mungadze Linear
No ratings yet
Mungadze Linear
21 pages
Advanced Statistical Techniques Using R: Outliers and Missing Data
No ratings yet
Advanced Statistical Techniques Using R: Outliers and Missing Data
28 pages
Aum Namah Shivaya Vedanta Darshanam Sep 09
No ratings yet
Aum Namah Shivaya Vedanta Darshanam Sep 09
39 pages
Stephen and Senthamarai Kannan (2017) - Detection of Outliers in Regression Model For Medical Data
No ratings yet
Stephen and Senthamarai Kannan (2017) - Detection of Outliers in Regression Model For Medical Data
7 pages
Lab 1
No ratings yet
Lab 1
6 pages
Merchandise20182 2 1
No ratings yet
Merchandise20182 2 1
2 pages
Matecconf Icpcm2023 01046
No ratings yet
Matecconf Icpcm2023 01046
6 pages
The Prediction of Default With Outliers - Robust Logistic Regression
No ratings yet
The Prediction of Default With Outliers - Robust Logistic Regression
21 pages
Outliers Detection in Regression Analysis Using Partial Least Square Approach
No ratings yet
Outliers Detection in Regression Analysis Using Partial Least Square Approach
3 pages
Application of Ordinary Least Square Method in Nonlinear
No ratings yet
Application of Ordinary Least Square Method in Nonlinear
4 pages
The Kite Runner Essay Good
No ratings yet
The Kite Runner Essay Good
8 pages
Curve Fitting and Interpolation
No ratings yet
Curve Fitting and Interpolation
14 pages
Song - The Hall of Fame
No ratings yet
Song - The Hall of Fame
1 page
Assignment of Econometrics
No ratings yet
Assignment of Econometrics
12 pages
There But For The Grace WIP
No ratings yet
There But For The Grace WIP
18 pages
Block 5 ST3189
No ratings yet
Block 5 ST3189
6 pages
Linear Regression
No ratings yet
Linear Regression
11 pages
D Business Combinations - IFRS 3 (Revised)
No ratings yet
D Business Combinations - IFRS 3 (Revised)
10 pages
RTD Tea - Indonesia
No ratings yet
RTD Tea - Indonesia
9 pages
National University of Modern Languages Lahore Campus Topic
No ratings yet
National University of Modern Languages Lahore Campus Topic
5 pages
South African in London: BBC Learning English London Life
No ratings yet
South African in London: BBC Learning English London Life
4 pages
Skrip Project 2
No ratings yet
Skrip Project 2
8 pages
Development of A LGD Model Basel2 Compliant: A Case Study: Stefano Bonini
No ratings yet
Development of A LGD Model Basel2 Compliant: A Case Study: Stefano Bonini
18 pages
The Comparison of Adaptive Neuro-Fuzzy Inference System (ANFIS) With Nonlinear Regression For Estimation and Prediction
No ratings yet
The Comparison of Adaptive Neuro-Fuzzy Inference System (ANFIS) With Nonlinear Regression For Estimation and Prediction
7 pages
Comparison of Modeling Methods For Loss Given Default
No ratings yet
Comparison of Modeling Methods For Loss Given Default
14 pages
ML Summary PDF
No ratings yet
ML Summary PDF
5 pages
Example Rules For Preventive Ethics
No ratings yet
Example Rules For Preventive Ethics
7 pages
Random Optimization: Fundamentals and Applications
From Everand
Random Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet
Multiple Models Approach in Automation: Takagi-Sugeno Fuzzy Systems
From Everand
Multiple Models Approach in Automation: Takagi-Sugeno Fuzzy Systems
Mohammed Chadli
No ratings yet
Digital Signal Processing (DSP) with Python Programming
From Everand
Digital Signal Processing (DSP) with Python Programming
Maurice Charbit
No ratings yet
Introduction to Applied Econometrics Analysis Using Stata
From Everand
Introduction to Applied Econometrics Analysis Using Stata
Justin Doran
5/5 (3)