Benchmarking Regression Algorithms for Loss Given Default Modelling
Gert Loterman$^a$, Iain Brown$^b$, David Martens$^{a,c}$, Christophe Mues$^b$, Bart Baesens$^{b,c}$

$^a$ Department of Business Administration and Public Management, University College Ghent, Ghent University, Voskenslaan 270, B-9000 Ghent, Belgium, {gert.loterman, david.martens}@hogent.be

$^b$ School of Management, University of Southampton, University Road, Southampton, SO17 1BJ, United Kingdom, {i.brown, c.mues, b.m.m.baesens}@soton.ac.uk

$^c$ Department of Decision Sciences & Information Management, Catholic University of Leuven, Naamsestraat 69, B-3000 Leuven, Belgium, {david.martens, bart.baesens}@econ.kuleuven.be
Abstract
The recent introduction of the Basel II framework has had a huge impact on financial institu-
tions, allowing them to build credit risk models for three key risk parameters: PD (Probability of
Default), LGD (Loss Given Default) and EAD (Exposure at Default). Current credit risk research
is largely focused on the estimation and validation of the PD parameter. However, changes in LGD
directly affect the capital of a financial institution in a linear way, unlike PD, which therefore has
less of an effect on minimal capital requirements. The use of models that estimate LGD as accu-
rately as possible is thus of crucial importance as these can translate into significant future savings.
In this first large scale LGD benchmarking study, various state-of-the-art regression techniques
to model and predict LGD are studied. These include one-stage models, such as those built by
ordinary least squares, beta regression, robust regression, ridge regression, regression splines, arti-
ficial neural networks, support vector machines and regression trees, as well as two-stage models
which attempt to combine the benefits of multiple techniques. In total 24 regression techniques are
evaluated and compared using 5 real-life retail lending datasets from major international banking
institutions.
It is found that much of the variance in LGD remains unexplained, as the average predictive
performance of the models in terms of R² ranges from 4% to 43%. Nonetheless, a clear trend can
be observed that non-linear techniques, and in particular artificial neural networks and support
vector machines, yield consistently higher predictive performance across all datasets than more tra-
ditional linear techniques. Also, two-stage models built by a combination of linear and non-linear
techniques are shown to have similarly good predictive power, while they offer the added advantage
of having a comprehensible linear model component.
Key words: Basel II, Credit risk, LGD, Data mining, Prediction
1. Introduction
Given the recent turmoil on credit markets, the topic of credit risk modelling has now become
more important than ever before. Also, to comply with the recently introduced Basel II accord,
financial institutions are investing heavily in developing improved credit risk models. The Basel
II Capital Accord aims at quantifying the minimum amount of regulatory buffer capital so as to
provide a safety cushion against unexpected credit-, market- and/or operational losses. From a
credit risk perspective, the accord encourages financial institutions to build risk models hereby
using three key risk parameters: Probability of Default (PD), Loss Given Default (LGD), and
Exposure at Default (EAD).
Nowadays, credit risk research is largely focused on the estimation and validation of the PD
parameter. The LGD parameter measures the economic loss, expressed as a percentage of the
exposure, in case of default. This parameter is a crucial input to the Basel II capital calculation,
as it enters the capital requirement formula in a linear way (unlike PD, which therefore has less of
an effect on minimal capital requirements). Hence, changes in LGD directly affect the capital of a
financial institution and, as such, also its long-term strategy. It is thus of crucial importance to have
models that estimate LGD as accurately as possible. This is, however, not straightforward, as
industry models typically show low R² values. Such models are often built using ordinary least
squares regression or regression trees [3] [4] [10] [18]. This study, the first large-scale LGD
benchmarking study in terms of both techniques and datasets, investigates whether other approaches
can improve predictive performance, which, given the impact of LGD on capital requirements, could
yield large benefits.
The remainder of this paper is organised as follows. Section 2 gives a short overview of the
examined regression techniques. Section 3 details several performance metrics for the evaluation
and comparison of the regression models discussed in the previous section. Section 4 describes the
datasets used and the experimental set-up implemented in this study. Section 5 presents the
experimental results. Finally, Section 6 concludes the paper.
2. Regression techniques
This study comprises both one-stage and two-stage techniques. One stage techniques can be
divided into linear and non-linear techniques. Linear techniques model the dependent variable as
a linear function of the independent variables while non-linear techniques fit a non-linear model
to a dataset. Two stage models are a strategic combination of the aforementioned one-stage models.
In this paper, the following mathematical notation is employed. A scalar $x$ is denoted in
normal script. A vector $\mathbf{x}$ is represented in boldface and is assumed to be a column vector. The
corresponding row vector $\mathbf{x}^T$ is obtained using the transpose $T$. Bold capital notation is used for
a matrix $\mathbf{X}$. The number of independent variables is given by $n$ and the number of observations
by $l$. Observation $i$ is denoted as $\mathbf{x}_i$ whereas variable $j$ is indicated as $x_j$. The value of
variable $j$ for observation $i$ is represented as $x_i(j)$ and the dependent variable $y$ for observation
$i$ is represented as $y_i$. $P$ is used to denote a probability.
A regression technique fits a dataset to a model $y = f(\mathbf{x}) + e$, where $y$ is the dependent variable,
$\mathbf{x}$ are the independent variables and $e$ is the residual.

Ordinary least squares regression [13] is the most common technique to find optimal parameters
$\mathbf{b}^T = [b_0 \; b_1 \; b_2 \; \ldots \; b_n]$ to fit a linear model to a dataset as
$$y = \mathbf{b}^T \mathbf{x}$$
where $\mathbf{x}^T = [1 \; x_1 \; x_2 \; \ldots \; x_n]$. OLS approaches this problem by minimising the sum of squared
residuals:
$$\sum_{i=1}^{l} e_i^2 = \sum_{i=1}^{l} (y_i - \mathbf{b}^T \mathbf{x}_i)^2$$
By taking the derivative of this expression and subsequently setting the derivative equal to zero,
$$\sum_{i=1}^{l} (y_i - \mathbf{b}^T \mathbf{x}_i)\,\mathbf{x}_i^T = \mathbf{0}$$
the parameter estimates are obtained as
$$\mathbf{b} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
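The closed-form estimator above can be computed directly; the following is a minimal sketch in Python/NumPy under the assumption that an intercept column is prepended to the design matrix (the names X_raw and y are illustrative, not from the study).

```python
import numpy as np

def ols_fit(X_raw, y):
    """Fit an OLS model by solving the least-squares problem X b = y."""
    # Prepend a column of ones so that b[0] plays the role of the intercept.
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    # np.linalg.lstsq is numerically safer than forming (X'X)^{-1} explicitly.
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

def ols_predict(b, X_raw):
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    return X @ b
```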
Ridge regression [21] is a linear regression variant that is less sensitive to correlated independent
variables than OLS. When independent variables are strongly correlated with each other, inverting
the $\mathbf{X}^T\mathbf{X}$ matrix leads to large and unreliable parameter estimates. Ridge regression reduces these
undesirable symptoms by minimising
$$\lambda\,\mathbf{b}^T\mathbf{b} + \sum_{i=1}^{l} e_i^2 = \lambda\,\mathbf{b}^T\mathbf{b} + \sum_{i=1}^{l} (y_i - \mathbf{b}^T\mathbf{x}_i)^2$$
where $\lambda$ is defined as the ridge parameter, which controls a trade-off between bias and variance.
With small values of $\lambda$, the model parameters are slightly biased but can be estimated more reliably
as
$$\mathbf{b} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
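A direct implementation of this closed-form ridge estimator is sketched below; the regularisation strength lam is an illustrative value, which in practice would be tuned on validation data, and for simplicity the penalty is also applied to the intercept.

```python
import numpy as np

def ridge_fit(X_raw, y, lam=1.0):
    """Ridge regression via the regularised normal equations."""
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    n_cols = X.shape[1]
    # Solve (X'X + lambda*I) b = X'y. Many implementations exclude the
    # intercept from the penalty; this sketch keeps it simple.
    b = np.linalg.solve(X.T @ X + lam * np.eye(n_cols), X.T @ y)
    return b
```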
Robust regression [22] is another linear regression variant, one that is less sensitive to outliers than
OLS. When the dataset contains outliers, the model parameters can become unreliable. Therefore,
the most common method for robust regression, called M-estimation [24], minimises
$$\sum_{i=1}^{l} \rho(e_i) = \sum_{i=1}^{l} \rho(y_i - \mathbf{b}^T\mathbf{x}_i)$$
where the objective function $\rho(e)$ should be less sensitive to outliers than the function used by
OLS, i.e. $\rho(e) = e^2$. By taking the derivative of the objective function and subsequently setting
the derivative equal to zero,
$$\sum_{i=1}^{l} w_i\,(y_i - \mathbf{b}^T\mathbf{x}_i)\,\mathbf{x}_i^T = \mathbf{0}$$
where $w(e) = \frac{\partial\rho/\partial e}{e}$ is defined as the weight function and $w_i = w(e_i)$ are the resulting weights.
Because the weights depend upon the residuals, the residuals depend upon the estimated coefficients,
and the estimated coefficients depend upon the weights, the solution requires an iterative procedure
(Iteratively Reweighted Least Squares or IRLS). To start, the initial model parameters $\mathbf{b}^{(0)}$ are
estimated by setting $w_i = 1$ as in OLS. At each iteration $t$, the model parameters $\mathbf{b}^{(t)}$ are estimated
using the residuals $e_i^{(t-1)}$ and associated weights $w_i^{(t-1)}$ from the previous iteration. The new
estimates are given by
$$\mathbf{b}^{(t)} = (\mathbf{X}^T\mathbf{W}^{(t-1)}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}^{(t-1)}\mathbf{y}$$
where $\mathbf{W}^{(t-1)} = \mathrm{diag}\{w_i^{(t-1)}\}$. This procedure stops when the estimated model parameters $\mathbf{b}$
satisfy a convergence criterion [25].
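The IRLS scheme above can be sketched in a few lines. This illustrative version uses Huber's weight function, a common choice for M-estimation; the tuning constant c = 1.345, the MAD-based residual scaling and the convergence tolerance are assumptions of the sketch rather than settings reported in the study.

```python
import numpy as np

def huber_weights(e, c=1.345):
    """Huber weight function: 1 inside [-c, c], c/|e| outside."""
    a = np.abs(e)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

def irls_fit(X_raw, y, n_iter=50, tol=1e-8):
    """M-estimation via Iteratively Reweighted Least Squares (IRLS)."""
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    w = np.ones(len(y))                      # start from OLS weights (w_i = 1)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        W = np.diag(w)
        b_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        if np.max(np.abs(b_new - b)) < tol:  # convergence criterion
            return b_new
        b = b_new
        # Standardise residuals by a robust scale estimate (MAD) before weighting.
        e = y - X @ b
        scale = np.median(np.abs(e - np.median(e))) / 0.6745 + 1e-12
        w = huber_weights(e / scale)
    return b
```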
Whereas OLS regression tests generally assume normality of the dependent variable $y$, the
empirical distribution of LGD can often be approximated more accurately by a Beta distribution
[18]. Assuming that $y$ is constrained to the open interval $(0, 1)$, the cumulative distribution function
(CDF) of a Beta distribution is given by:
$$\beta(y; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^y v^{a-1}(1-v)^{b-1}\,dv$$
where $\Gamma(\cdot)$ denotes the well-known Gamma function, and $a$ and $b$ are two shape parameters, which
can be estimated from the sample mean $\mu$ and variance $\sigma^2$ using the method of moments, i.e.:
$$a = \frac{\mu^2(1-\mu)}{\sigma^2} - \mu \;; \qquad b = a\left(\frac{1}{\mu} - 1\right)$$
A potential solution to improve model fit therefore is to estimate an OLS model for a transformed
dependent variable $y_i^* = N^{-1}(\beta(y_i; a, b))$ $(i = 1, \ldots, l)$, in which $N^{-1}(\cdot)$ denotes the inverse of the
standard normal CDF. The predictions of the OLS model are then transformed back through the
standard normal CDF and the inverse of the fitted Beta CDF to get the actual LGD estimates.
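A sketch of this Beta-transformation approach (B-OLS) is given below, under the assumption that the LGD values lie strictly inside (0, 1) so that the transformed target stays finite; scipy's beta and norm distributions supply the CDFs and their inverses.

```python
import numpy as np
from scipy.stats import beta, norm

def bols_fit_predict(X_train, y_train, X_test):
    """Beta-transform the target, fit OLS, and back-transform the predictions."""
    # Method-of-moments estimates of the Beta shape parameters a and b.
    mu, var = y_train.mean(), y_train.var()
    a = mu**2 * (1 - mu) / var - mu
    b = a * (1 / mu - 1)
    # Transform y through the fitted Beta CDF and the inverse standard normal CDF.
    y_star = norm.ppf(beta.cdf(y_train, a, b))
    Xtr = np.column_stack([np.ones(len(X_train)), X_train])
    Xte = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(Xtr, y_star, rcond=None)   # OLS on transformed target
    # Back-transform: standard normal CDF followed by the inverse Beta CDF.
    return beta.ppf(norm.cdf(Xte @ coef), a, b)
```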
Instead of performing a Beta transformation prior to fitting an OLS model, an alternative Beta
regression approach is outlined in [30]. Their preferred model for estimating a dependent variable
bounded between 0 and 1 is closely related to the class of generalised linear models and allows for
a dependent variable that is Beta-distributed conditional on the covariates. Instead of the usual
parametrisation of the Beta distribution with shape parameters $a$ and $b$, they propose
an alternative parametrisation involving a location parameter $\mu$ and a precision parameter $\phi$, by
letting:
$$\mu = \frac{a}{a+b} \;; \qquad \phi = a + b$$
It can easily be shown that the first parameter is indeed the mean of a $\beta(a, b)$-distributed variable,
whereas $\sigma^2 = \frac{\mu(1-\mu)}{\phi+1}$, so for fixed $\mu$, the variance (dispersion) increases with smaller $\phi$.

Two link functions mapping the unbounded input space of the linear predictor onto the required
value range of each parameter are then chosen, viz. the logit link function for the location
parameter (as its value must be squeezed into the open unit interval) and a log function for the
precision parameter (which must be strictly positive), resulting in the following sub-models:
$$\mu_i = E(y_i|\mathbf{x}_i) = \frac{e^{\mathbf{b}^T\mathbf{x}_i}}{1 + e^{\mathbf{b}^T\mathbf{x}_i}}$$
$$\phi_i = e^{-\mathbf{d}^T\mathbf{x}_i}$$
This particular parametrisation offers the advantage of producing more intuitive variable coef-
ficients (as the two rows of coefficients, $\mathbf{b}^T$ and $\mathbf{d}^T$, provide an indication of the effect on the
estimate itself and its precision, respectively). By further selecting which variables to include in
(or exclude from) the second sub-model, one can explicitly model heteroskedasticity. The result-
ing log-likelihood function is then used to compute maximum-likelihood estimators for all model
parameters.
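A compact maximum-likelihood sketch of this Beta regression model is shown below, with a logit link for the mean and a log link with the sign convention $\phi_i = e^{-\mathbf{d}^T\mathbf{x}_i}$ used above; it again assumes targets strictly inside (0, 1), and the BFGS optimiser settings are illustrative rather than those of the study.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

def beta_reg_negloglik(params, X, y):
    """Negative log-likelihood of the Beta regression model of [30]."""
    k = X.shape[1]
    b, d = params[:k], params[k:]
    mu = expit(X @ b)                    # logit link for the location parameter
    phi = np.exp(-X @ d)                 # log link for the precision parameter
    a_par = mu * phi                     # back to the (a, b) shape parametrisation
    b_par = (1 - mu) * phi
    ll = (gammaln(phi) - gammaln(a_par) - gammaln(b_par)
          + (a_par - 1) * np.log(y) + (b_par - 1) * np.log(1 - y))
    return -ll.sum()

def beta_reg_fit(X_raw, y):
    X = np.column_stack([np.ones(len(X_raw)), X_raw])
    x0 = np.zeros(2 * X.shape[1])        # start from mu = 0.5, phi = 1
    res = minimize(beta_reg_negloglik, x0, args=(X, y), method="BFGS")
    k = X.shape[1]
    return res.x[:k], res.x[k:]          # coefficient vectors (b, d)
```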
Classification and regression trees are decision tree models, for a categorical or continuous
dependent variable, respectively, that recursively partition the original learning sample into smaller
subsamples, so that some impurity criterion i() for the resulting node segments is reduced [9]. To
grow the tree, one typically uses a greedy algorithm that, at each node t, evaluates a large set of
candidate variable splits so as to find the ’best’ split, i.e. the split s that maximises the weighted
decrease in impurity:
$$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$$
where pL and pR denote the proportions of observations associated with node t that are sent to the
left child node tL or right child node tR , respectively. A commonly applied impurity measure i(t)
for regression trees is the mean squared error or variance for the subset of observations falling into
node t. Alternatively, a split may be chosen based on the p-value of an ANOVA F-test comparing
between-sample variances against within-sample variances for the subsamples associated with its
respective child nodes (ProbF criterion).
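The following sketch fits such a regression tree with scikit-learn's CART implementation (assuming a recent scikit-learn version); the variance-based impurity corresponds to criterion="squared_error", and the depth and leaf-size limits shown are illustrative tuning choices, not values from the study.

```python
from sklearn.tree import DecisionTreeRegressor

# Grow a CART regression tree; splits maximise the weighted decrease in
# node variance (mean squared error), as described above.
tree = DecisionTreeRegressor(criterion="squared_error",
                             max_depth=6,          # illustrative limits
                             min_samples_leaf=50)
# tree.fit(X_train, y_train)
# lgd_pred = tree.predict(X_test)
```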
MARS [16] is a technique that uses piecewise linear functions to capture non-linearities and
interactions between variables. The method is based on a 'divide and conquer' strategy where the
input space is divided into partitions and each partition holds its own regression equation. MARS
fits a dataset to a model of the form
$$y = \sum_{k=1}^{K} b_k B_k(\mathbf{x}) + e$$
where $B_k(\mathbf{x})$ is a basis function and $K$ refers to the number of basis functions. A basis function
can either take the value 1, a single hinge function $h(x_j)$ of the form $\max(0, x_j - a)$
or $\max(0, a - x_j)$ with $a$ a so-called knot, or a product of two or more hinge functions to model
interactions. MARS builds a model in two phases: a forward and a backward pass. The forward pass
builds an overfitted model by repeatedly adding hinge functions, typically in mirrored pairs, each
time selecting the basis functions with the lowest mean squared error. Both variables and knots are
selected via a partitioning scheme and a subsequent exhaustive search. The backward procedure
prunes the model by removing those hinge functions that are associated with the smallest increase
in the so-called GCV (Generalised Cross Validation) error, defined as
$$GCV = \frac{\sum_{i=1}^{l} (y_i - f(\mathbf{x}_i))^2}{\left(1 - \frac{C}{l}\right)^2}$$
where $C = 1 + c \cdot d$, $c$ is a penalty for adding a hinge function and $d$ is the number of independent
hinge functions.
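As a minimal illustration of this basis expansion (not the full forward/backward MARS search), the sketch below builds mirrored hinge functions at given knots for a single predictor, fits the coefficients by least squares and evaluates the GCV criterion as defined above; the knot locations and the penalty c = 3 are illustrative assumptions.

```python
import numpy as np

def hinge_basis(x, knots):
    """Mirrored hinge functions max(0, x-a) and max(0, a-x) for each knot a."""
    cols = [np.ones_like(x)]                      # constant basis function
    for a in knots:
        cols.append(np.maximum(0.0, x - a))
        cols.append(np.maximum(0.0, a - x))
    return np.column_stack(cols)

def fit_hinge_model(x, y, knots, c=3.0):
    B = hinge_basis(x, knots)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)  # least-squares coefficients
    resid = y - B @ coef
    d = 2 * len(knots)                            # number of hinge functions
    C = 1 + c * d
    l = len(y)
    gcv = np.sum(resid**2) / (1 - C / l) ** 2     # GCV as defined above
    return coef, gcv
```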
In this study an SVM [36] variant, called LSSVM [31], is used because of its higher efficiency
for solving large-scale problems [37]. The basic idea behind regression with LSSVM is to map the
independent variables to a high-dimensional feature space with a non-linear function $\varphi$ so the data
becomes more appropriate for linear regression:
$$y = \mathbf{b}^T \varphi(\mathbf{x}) + e$$
with $\varphi^T(\mathbf{x}) = [1 \; \varphi(x_1) \; \varphi(x_2) \; \ldots \; \varphi(x_n)]$. However, the model is never evaluated in this form.
Instead, LSSVM regression fits a model to a dataset by minimising
$$\frac{1}{2}\mathbf{b}^T\mathbf{b} + \frac{1}{2}\gamma\sum_{i=1}^{l} e_i^2 = \frac{1}{2}\mathbf{b}^T\mathbf{b} + \frac{1}{2}\gamma\sum_{i=1}^{l} (y_i - \mathbf{b}^T\varphi(\mathbf{x}_i))^2$$
where $\gamma$ is defined as the regularisation parameter. The primal optimisation problem indicates
that each data point has to be mapped to a high-dimensional (possibly infinite-dimensional) feature
space. This mapping quickly becomes computationally infeasible. To bypass this problem, the
kernel trick is used. In order to apply the kernel trick, the optimisation problem has to be
reformulated in its dual form by applying the method of Lagrange multipliers, which leads to the
following equation:
$$y = \sum_{i=1}^{l} \alpha_i\, \varphi(\mathbf{x})^T\varphi(\mathbf{x}_i) + e$$
At this point the kernel trick can be performed. The kernel $K$ is a function that calculates the dot
products of the input vectors in feature space without explicitly doing the mapping to the feature
space. The kernel trick is supported by Mercer's theorem and replaces every dot product in the
high-dimensional feature space by a simple kernel function, so that the model becomes
$$y = \sum_{i=1}^{l} \alpha_i K(\mathbf{x}, \mathbf{x}_i) + e$$
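A minimal LSSVM regression sketch with an RBF kernel is shown below; it uses the common formulation with an explicit bias term b (a slight variant of the intercept-in-$\varphi$ formulation above), in which the dual solution follows from one linear system, and the hyperparameter values gamma and sigma are placeholders for the cross-validated values the study tunes later.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

def lssvm_fit(X, y, gamma=1.0, sigma=1.0):
    """Solve the LSSVM dual linear system (variant with explicit bias b)."""
    l = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(l) / gamma    # kernel matrix plus ridge-like term
    rhs = np.concatenate([[0.0], y])
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return alpha, b

def lssvm_predict(alpha, b, X_train, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```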
Artificial neural networks (ANNs) are mathematical representations inspired by the functioning
of the human brain [7]. The benefit of an ANN is its flexibility in modelling virtually any (non-
linear) dependency between independent variables and the dependent variable. Although various
architectures have been proposed, our study focuses on probably the most widely used type of ANN,
i.e. the Multilayer Perceptron (MLP). A MLP is typically composed of an input layer (consisting
of neurons for all input variables), a hidden layer (consisting of any number of hidden neurons),
and an output layer (in our case, one neuron). Each neuron processes its inputs and transmits
its output value to the neurons in the subsequent layer. Each such connection between neurons is
assigned a weight during training. The output of hidden neuron $i$ is then computed by applying
an activation function $f^{(1)}$ to the weighted inputs and its bias term $b_i^{(1)}$ (which has a similar role
to the intercept of a regression model) as follows:
$$h_i = f^{(1)}\Big(b_i^{(1)} + \sum_{j=1}^{n} W_{ij}\, x_j\Big)$$
$\mathbf{W}$ is the weight matrix, whereby $W_{ij}$ denotes the weight connecting input $j$ to hidden neuron $i$.
Similarly, the output of the output layer is computed as follows:
$$y = f^{(2)}\Big(b^{(2)} + \sum_{j=1}^{n_h} v_j h_j\Big)$$
with $n_h$ the number of hidden neurons and $\mathbf{v}$ the weight vector, whereby $v_j$ represents the weight
connecting hidden neuron $j$ to the output neuron. Examples of commonly used transfer functions
are the sigmoid function $f(x) = \frac{1}{1+e^{-x}}$, the hyperbolic tangent $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ and the
linear transfer function $f(x) = x$.
During model estimation, the weights of the network are first randomly initialised and then
iteratively adjusted so as to minimise an objective function, typically the sum of squared errors
(possibly accompanied by a regularisation term to prevent overfitting). This iterative procedure
can be based on simple gradient descent learning or on more sophisticated optimisation methods
such as Levenberg-Marquardt or Quasi-Newton. The number of hidden neurons can be determined
through a grid search based on validation set performance.
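The forward pass described above can be written directly in NumPy; this sketch assumes one hidden layer with a hyperbolic tangent activation and a linear output neuron, and takes the weights as given (in practice they are fitted by the iterative procedures just mentioned).

```python
import numpy as np

def mlp_forward(x, W, b1, v, b2):
    """One-hidden-layer MLP: tanh hidden activation, linear output neuron.

    x  : input vector of length n
    W  : (n_hidden, n) weight matrix; W[i, j] connects input j to hidden neuron i
    b1 : hidden biases (length n_hidden); b2 : output bias (scalar)
    v  : output weights (length n_hidden)
    """
    h = np.tanh(b1 + W @ x)      # hidden layer outputs h_i
    return b2 + v @ h            # linear output layer
```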
2.11. Linear regression + non-linear regression
The purpose of this two-stage technique is to combine the good comprehensibility of OLS with
the predictive power of a non-linear regression technique [34]. In a first stage, a linear model
$$y = \mathbf{b}^T\mathbf{x} + e$$
is built with OLS. In a second stage, the residuals $e$ of this linear model,
$$e = f(\mathbf{x}) + e^*$$
are estimated with a non-linear regression model $f$ in order to further improve the predictive ability
of the model. In doing so, the model takes the following form:
$$y = \mathbf{b}^T\mathbf{x} + f(\mathbf{x}) + e^*$$
where $e^*$ are the new residuals of estimating $e$. A combination of OLS with RT, MARS, LSSVM
and ANN is assessed in this study.
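A sketch of this OLS-plus-residual-model approach is given below; any of the non-linear learners discussed above can take the role of the residual model, and a scikit-learn regression tree is used here purely as an illustrative stand-in.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

class TwoStageOLSPlusNonlinear:
    """Stage 1: linear model for y. Stage 2: non-linear model for its residuals."""

    def __init__(self, residual_model=None):
        self.linear = LinearRegression()
        self.residual_model = residual_model or DecisionTreeRegressor(max_depth=6)

    def fit(self, X, y):
        self.linear.fit(X, y)
        resid = y - self.linear.predict(X)     # e = y - b'x
        self.residual_model.fit(X, resid)      # e = f(x) + e*
        return self

    def predict(self, X):
        return self.linear.predict(X) + self.residual_model.predict(X)
```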
2.12. Logistic regression + (non)linear regression
This second two-stage technique deals explicitly with the pronounced peak of the LGD distribution
around zero. In a first stage, a logistic regression model estimates whether an observation belongs
to that peak ($y \leq 0$) or not ($y > 0$). The resulting predictions take the form
$$\hat{y} = \begin{cases} \bar{y}_{peak} & \text{if the observation is classified into the peak} \\ f(\mathbf{x}) & \text{otherwise} \end{cases}$$
where $\bar{y}_{peak}$ is the mean of the values of $y \leq 0$, which practically equals 0, and $f(\mathbf{x})$ is a one-stage
(non-)linear regression model, built on those observations only that are not in the peak. Whereas
$\bar{y}_{peak}$ is determined using only the values of $y \leq 0$, the one-stage model is built using only the
values of $y > 0$. A combination of logistic regression with all aforementioned one-stage techniques,
as described above, is assessed in this study.
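A sketch of this logistic-plus-regression combination follows; the 0.5 classification threshold and the regression tree used as the second-stage model are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor

class TwoStageLogisticPlusRegression:
    """Stage 1: logistic model for P(y <= 0). Stage 2: regression built on y > 0 only."""

    def __init__(self, regressor=None, threshold=0.5):
        self.clf = LogisticRegression(max_iter=1000)
        self.reg = regressor or DecisionTreeRegressor(max_depth=6)
        self.threshold = threshold

    def fit(self, X, y):
        peak = (y <= 0).astype(int)
        self.clf.fit(X, peak)
        self.y_peak = y[y <= 0].mean() if np.any(y <= 0) else 0.0
        self.reg.fit(X[y > 0], y[y > 0])       # second stage on non-peak data only
        return self

    def predict(self, X):
        p_peak = self.clf.predict_proba(X)[:, 1]
        pred = self.reg.predict(X)
        pred[p_peak >= self.threshold] = self.y_peak
        return pred
```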
Metric Worst Best Comparability Evaluation
RMSE +∞ 0 Relative Calibration
MAE +∞ 0 Relative Calibration
AUC 0.5 1 Absolute Discrimination
AOC +∞ 0 Relative Calibration
R2 −∞ 1 Absolute Calibration
r 0 1 Absolute Discrimination
ρ 0 1 Absolute Discrimination
τ 0 1 Absolute Discrimination
Table 1: Performance metrics
3. Performance metrics
Performance metrics evaluate the degree to which the predictions $f(\mathbf{x}_i)$ differ from the
observations $y_i$ of the dependent variable. Each of the following metrics, listed in Table 1, has its
own way of expressing the predictive performance of a model as a quantitative value. The second
and third columns of the table show the metric values for the worst and best possible prediction
performance, respectively. The fourth column indicates whether the metric is relatively or absolutely
comparable: relative metric values depend on the distribution of the dependent variable, whereas
absolute metric values do not. This implies that relative metrics can only be used to compare
predictive performance amongst models built on the same dataset. Absolute metrics can also be
used to compare predictive performance between models built on different datasets. The final
column shows whether the metric measures calibration or discrimination. Calibration indicates
how close the predicted values are to the observed values, whereas discrimination refers to the
ability to provide an ordinal ranking of the dependent variable considered. A good ranking does
not necessarily imply a good calibration.
3.1. Root Mean Squared Error (RMSE)
RMSE is defined as the square root of the average squared difference between predictions and
observations:
$$RMSE = \sqrt{\frac{1}{l}\sum_{i=1}^{l}\big(f(\mathbf{x}_i) - y_i\big)^2}$$
RMSE has the same units as the dependent variable being predicted. Since residuals are squared,
this metric heavily weights outliers. The metric is bounded between the maximum squared error and
0 (perfect prediction).
3.2. Mean Absolute Error (MAE)
MAE is given by the average absolute difference between predicted and observed values:
$$MAE = \frac{1}{l}\sum_{i=1}^{l}\big|f(\mathbf{x}_i) - y_i\big|$$
Just like RMSE, MAE has the same unit scale as the dependent variable being predicted. Unlike
RMSE, MAE is not as sensitive to outliers. The metric is bounded between the maximum absolute
error and 0 (perfect prediction).
Pearson's $r$ [11] is defined as the sum of the products of the standard scores of the observed
and predicted values, divided by the degrees of freedom:
$$r = \frac{1}{l-1}\sum_{i=1}^{l}\left(\frac{y_i - \bar{y}}{s_y}\right)\left(\frac{f(\mathbf{x}_i) - \bar{f}}{s_f}\right)$$
with $\bar{y}$ and $\bar{f}$ the means and $s_y$ and $s_f$ the standard deviations of the observations and
predictions, respectively. Pearson's $r$ can take values between -1 (perfect negative correlation) and +1 (perfect
positive correlation), with 0 meaning no correlation at all.
Spearman's $\rho$ [11] is defined as Pearson's $r$ applied to the rankings of the predicted and observed
values. If there are no or few tied ranks, however, it is more usual to use the equivalent formula
$$\rho = 1 - \frac{6\sum_{i=1}^{l} d_i^2}{l(l^2 - 1)}$$
where $d_i$ is the difference between the ranks of the observed and predicted values. Spearman's $\rho$ can
take values between -1 (perfect negative correlation) and +1 (perfect positive correlation), with 0
meaning no correlation at all.
Kendall's $\tau$ [11] measures the degree of correspondence between observed and predicted values.
In other words, it measures the association of cross tabulations:
$$\tau = \frac{n_c - n_d}{\frac{1}{2}\,l(l-1)}$$
where $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs. A pair of
observations $\{i, k\}$ is said to be concordant when there is no tie in either observed or predicted LGD
(i.e. $y_i \neq y_k$, $f(\mathbf{x}_i) \neq f(\mathbf{x}_k)$), and if $\mathrm{sgn}(f(\mathbf{x}_k) - f(\mathbf{x}_i)) = \mathrm{sgn}(y_k - y_i)$, where $i, k = 1, \ldots, l$ $(i \neq k)$.
Similarly, it is said to be discordant if there is no tie and if $\mathrm{sgn}(f(\mathbf{x}_k) - f(\mathbf{x}_i)) = -\mathrm{sgn}(y_k - y_i)$.
Kendall's $\tau$ can take values between -1 (perfect negative correlation) and +1 (perfect positive
correlation), with 0 meaning no correlation at all.
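Most of the metrics above can be computed with NumPy and SciPy as sketched below (the AUC and AOC metrics of Table 1 are omitted from this sketch); y_true and y_pred are illustrative array names.

```python
import numpy as np
from scipy import stats

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, R^2 and the linear and rank correlation coefficients."""
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid**2))
    mae = np.mean(np.abs(resid))
    # R^2 relative to the mean-only benchmark; can be negative for poor models.
    r2 = 1 - np.sum(resid**2) / np.sum((y_true - y_true.mean())**2)
    r, _ = stats.pearsonr(y_true, y_pred)
    rho, _ = stats.spearmanr(y_true, y_pred)
    tau, _ = stats.kendalltau(y_true, y_pred)
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "r": r, "rho": rho, "tau": tau}
```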
4. Datasets and experimental set-up
In this section the characteristics of the datasets are described as well as the experimental
benchmarking framework to assess the predictive performance of the regression techniques. Further,
a description of a technique’s parameter setting and tuning is given where required.
Table 2 displays the characteristics of the 5 real-life retail lending LGD datasets from major inter-
national banking institutions. These include personal loans, revolving credit and mortgage loans.
The corresponding histograms of the LGD values are shown in Figure 1. Note that the LGD
distribution for retail lending often contains one or two spikes around LGD = 0 (in which case there
was a full recovery) and/or LGD = 1 (no recovery). The number of observations per dataset ranges
from 3351 to 119210, and the number of input variables varies from 12 to 44. The datasets are
employed to evaluate the predictive performance of 24 different regression techniques.
First, each dataset is randomly shuffled and divided into a two-thirds training set and a one-third
test set. The training set is used to build the models, while the test set is used solely to assess
the predictive performance of these models. Where required, continuous independent variables are
standardised with the sample mean and standard deviation of the training set, nominal indepen-
dent variables are encoded with dummy variables and ordinal independent variables are encoded
with thermometer variables.
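A sketch of this preprocessing step is shown below, assuming a pandas DataFrame and lists of continuous, nominal and numerically coded ordinal column names; implementing the thermometer encoding as cumulative indicator columns is an assumption of the sketch, not a detail specified in the text.

```python
import pandas as pd

def preprocess(df, continuous, nominal, ordinal, seed=0):
    """Shuffle, split 2/3 - 1/3, standardise with training statistics, encode."""
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(df) * 2 / 3)
    train, test = df.iloc[:cut].copy(), df.iloc[cut:].copy()

    # Standardise continuous variables using training mean and std only.
    mu, sd = train[continuous].mean(), train[continuous].std()
    for part in (train, test):
        part[continuous] = (part[continuous] - mu) / sd

    # Dummy-encode nominal variables (test aligned to the training columns).
    train = pd.get_dummies(train, columns=nominal)
    test = pd.get_dummies(test, columns=nominal).reindex(columns=train.columns,
                                                         fill_value=0)

    # Thermometer-encode ordinal variables: level j maps to j cumulative 1's.
    for col in ordinal:
        levels = sorted(train[col].unique())
        for part in (train, test):
            for lvl in levels[1:]:
                part[f"{col}_ge_{lvl}"] = (part[col] >= lvl).astype(int)
            part.drop(columns=col, inplace=True)
    return train, test
```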
An input selection method is used to remove irrelevant and redundant variables from the
dataset, as this might improve the performance of the regression techniques. For this, a stepwise
selection method is applied for building the linear models. For computational efficiency reasons, an
R²-based filter method [15] is applied prior to building the non-linear models.

[Figure 1: Histograms of LGD for each of the five datasets (BANK1 to BANK5), with LGD on the horizontal axis (0 to 1) and frequency on the vertical axis.]
After building the models, the predictive performance of each model is measured on the test
set by comparing its predictions with the observations according to several performance metrics. Next,
an average ranking of the techniques over all datasets is generated per performance metric, as well as
a meta-ranking of the techniques over all datasets and all performance metrics.
Finally, the regression techniques are statistically compared with each other [12]. A Friedman
test [17] is performed to test the null hypothesis that all regression techniques perform alike ac-
cording to a specific performance metric, i.e., that performance differences are just due to random
chance. Friedman's test is based on the ranked performances rather than the actual performance
estimates and is therefore less susceptible to outliers. The test statistic of the Friedman test is
calculated as:
$$\chi_F^2 = \frac{12D}{E(E+1)}\left[\sum_{e=1}^{E} AR_e^2 - \frac{E(E+1)^2}{4}\right]$$
where $D$ is the number of datasets, $E$ is the number of techniques, $AR_e = \frac{1}{D}\sum_{d=1}^{D} r_d^e$ is the average
rank of technique $e$ over all datasets and $r_d^e$ is the rank of technique $e$ for dataset $d$. If the value
of the test statistic $\chi_F^2$ is large enough to reject the null hypothesis, it may be concluded that the
performance differences among the regression techniques are non-random. In this case, a post-hoc
Nemenyi test [29] can be applied to test the null hypothesis that two techniques perform alike.
This test states that the performances of two techniques are significantly different if their average
ranks differ by at least the critical difference (CD):
$$CD = q(\alpha, \infty, E)\sqrt{\frac{E(E+1)}{12D}}$$
where $q(\alpha, \infty, E)$ is the studentised range statistic and $\alpha$ is the significance level.
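A sketch of this comparison procedure is given below; rank_matrix is assumed to hold, per dataset (rows) and technique (columns), the ranks according to one metric, and the studentised range quantile q_alpha is taken from a table (the default value of roughly 5.14 for alpha = 0.05 and 24 techniques is an assumption of the sketch).

```python
import numpy as np
from scipy.stats import friedmanchisquare

def friedman_nemenyi(rank_matrix, q_alpha=5.14):
    """Friedman test plus Nemenyi critical difference for average ranks.

    rank_matrix : (D datasets) x (E techniques) array of per-dataset ranks.
    """
    D, E = rank_matrix.shape
    # scipy's Friedman test takes one sample (values across datasets) per technique;
    # it ranks within each dataset internally, so passing ranks is also valid.
    stat, pvalue = friedmanchisquare(*[rank_matrix[:, e] for e in range(E)])
    avg_ranks = rank_matrix.mean(axis=0)
    cd = q_alpha * np.sqrt(E * (E + 1) / (12 * D))   # Nemenyi critical difference
    return stat, pvalue, avg_ranks, cd
```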
An RBF kernel
$$K(\mathbf{x}, \mathbf{x}_i) = e^{-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2}}$$
with kernel parameter $\sigma$ is used here because of its good overall performance for LSSVM classifiers
[2]. The hyperparameters $\gamma$ and $\sigma$ for LSSVM regression are tuned with 10-fold cross-validation
on the training dataset. A grid search evaluates all combinations of parameters within
the search space in order to find the combination that minimises the mean squared
error. The limits of the grid for the kernel parameter $\sigma$ are set to $[0.5\sqrt{l},\; 500\sqrt{l}]$ and the
limits of the grid for the regularisation parameter $\gamma$ are set to $[\frac{0.01}{n},\; \frac{1000}{n}]$ [35]. Estimating the
LSSVM hyperparameters this way can be a computational burden. To tune the hyperparameters,
a sample from the complete training dataset is therefore chosen as follows. First, 100 random subsets of 4000
observations are drawn. Next, the LGD distribution histogram of each subset is compared with the
LGD distribution histogram of the complete training set, and the subset that best approximates
the original set, based on the mean squared error between the histograms, is chosen.
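The representative-subsample selection can be sketched as follows; the number of histogram bins and the use of normalised bin counts are illustrative choices not specified in the text, and the training set is assumed to contain at least 4000 observations.

```python
import numpy as np

def select_representative_subset(y_train, n_subsets=100, size=4000,
                                 bins=20, seed=0):
    """Pick the random subset whose LGD histogram is closest (in MSE) to the full set."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(y_train.min(), y_train.max(), bins + 1)
    full_hist, _ = np.histogram(y_train, bins=edges, density=True)
    best_idx, best_mse = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(len(y_train), size=size, replace=False)
        sub_hist, _ = np.histogram(y_train[idx], bins=edges, density=True)
        mse = np.mean((sub_hist - full_hist) ** 2)
        if mse < best_mse:
            best_idx, best_mse = idx, mse
    return best_idx
```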
5. Results
Tables 3 to 7 display the values of several prediction performance metrics per dataset for the
24 regression models. Top performances for each metric are underlined. Figure 2 illustrates the
predictive performance of the regression techniques in and across all datasets in terms of the ab-
solute metrics AUC, R², r, ρ and τ. Similar trends can be observed across these metrics. Note
that differences in the type of portfolio and the available input variables cause a variation of the
predictive performance across the banks.

Although all absolute metrics can be used, it is advised to use the R² metric for the comparison
of regression models across datasets, as it measures calibration and is easily interpretable. As
indicated in Figure 2, the average predictive performance of these regression models in terms of
R² varies from 4% to 43%. This means that the variance in LGD that can be explained by
the independent variables is consistently below 50%. Although most of the models do a better job
than a model that simply uses the mean of the training LGD as its prediction, most of the variance
in LGD cannot be explained, even with the best models.
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.3257 0.3716 0.6570 0.1380 0.0972 0.3112 0.3084 0.2145
B-OLS 0.3474 0.4294 0.6580 0.1843 -0.2060 0.2954 0.2991 0.2071
BC-OLS 0.3835 0.4579 0.5180 0.2096 -0.3747 0.2403 0.2312 0.1602
BR 0.3356 0.3693 0.5690 0.1363 0.0546 0.2601 0.2641 0.1844
RiR 0.3267 0.3723 0.6561 0.1385 0.0933 0.3056 0.3033 0.2106
RoR 0.3262 0.3723 0.6565 0.1385 0.0935 0.3061 0.3034 0.2107
RT 0.3228 0.3732 0.5990 0.1392 0.0892 0.2997 0.2913 0.2095
ANN 0.3118 0.3648 0.6840 0.1331 0.1295 0.3603 0.3559 0.2524
LSSVM 0.3184 0.3669 0.6723 0.1346 0.1194 0.3466 0.3442 0.2444
MARS 0.3214 0.3704 0.6657 0.1372 0.1027 0.3205 0.3122 0.2187
LOG+OLS 0.3202 0.3700 0.6210 0.1366 0.1063 0.3262 0.3143 0.2214
LOG+B-OLS 0.3163 0.3750 0.6020 0.1406 0.1002 0.3166 0.3103 0.2185
LOG+BC-OLS 0.4308 0.5090 0.5040 0.2590 -0.6946 0.2125 0.2440 0.1731
LOG+BR 0.3560 0.4142 0.5270 0.1715 0.0782 0.2797 0.2591 0.1794
LOG+RiR 0.3193 0.3693 0.6655 0.1363 0.1081 0.3289 0.3167 0.2234
LOG+RoR 0.3171 0.3700 0.6554 0.1369 0.1045 0.3264 0.3205 0.2270
LOG+RT 0.3219 0.3693 0.6160 0.1363 0.1081 0.3301 0.3212 0.2263
LOG+ANN 0.3174 0.3664 0.6320 0.1342 0.1221 0.3502 0.3406 0.2395
LOG+LSSVM 0.3191 0.3679 0.6664 0.1353 0.1150 0.3401 0.3336 0.2371
LOG+MARS 0.3205 0.3689 0.6658 0.1360 0.1099 0.3320 0.3248 0.2286
OLS+MARS 0.3177 0.3679 0.6799 0.1353 0.1150 0.3394 0.3363 0.2352
OLS+LSSVM 0.3115 0.3631 0.6929 0.1317 0.1379 0.3714 0.3666 0.2596
OLS+RT 0.3170 0.3681 0.6730 0.1354 0.1137 0.3382 0.3342 0.2348
OLS+ANN 0.3079 0.3633 0.6960 0.1318 0.1367 0.3716 0.3638 0.2581
Table 3: BANK1 performances
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.1187 0.1613 0.8100 0.0259 0.2353 0.4851 0.4890 0.3823
B-OLS 0.1058 0.1621 0.8000 0.0262 0.2273 0.4768 0.4967 0.3881
BC-OLS 0.1056 0.1623 0.7450 0.0262 0.2226 0.4718 0.4990 0.3900
BR 0.1020 0.1661 0.7300 0.0275 0.2120 0.4635 0.4857 0.3861
RiR 0.1187 0.1606 0.8074 0.0258 0.2415 0.4915 0.4855 0.3792
RoR 0.1075 0.1663 0.8063 0.0277 0.1866 0.4770 0.4824 0.3751
RT 0.0978 0.1499 0.7710 0.0224 0.3390 0.5823 0.5452 0.4357
ANN 0.0956 0.1472 0.8530 0.0216 0.3632 0.6029 0.5549 0.4366
LSSVM 0.1047 0.1518 0.8365 0.0230 0.3229 0.5690 0.5301 0.4160
MARS 0.1068 0.1531 0.8397 0.0234 0.3113 0.5579 0.5321 0.4168
LOG+OLS 0.1060 0.1622 0.7590 0.0255 0.2268 0.4838 0.5206 0.4084
LOG+B-OLS 0.1040 0.1567 0.8320 0.0245 0.2779 0.5286 0.5202 0.4083
LOG+BC-OLS 0.1034 0.1655 0.7320 0.0273 0.2124 0.4628 0.4870 0.3820
LOG+BR 0.1015 0.1688 0.7250 0.0285 0.2024 0.4529 0.4732 0.3876
LOG+RiR 0.1049 0.1554 0.8312 0.0240 0.2901 0.5386 0.5209 0.4091
LOG+RoR 0.1043 0.1558 0.8307 0.0242 0.2859 0.5350 0.5200 0.4084
LOG+RT 0.1041 0.1538 0.8360 0.0236 0.3049 0.5545 0.5254 0.4126
LOG+ANN 0.1011 0.1531 0.8430 0.0234 0.3109 0.5585 0.5380 0.4240
LOG+LSSVM 0.1031 0.1530 0.8334 0.0234 0.3121 0.5587 0.5243 0.4128
LOG+MARS 0.1031 0.1537 0.8355 0.0236 0.3059 0.5531 0.5268 0.4149
OLS+MARS 0.1081 0.1526 0.8379 0.0233 0.3150 0.5615 0.5300 0.4156
OLS+LSSVM 0.1029 0.1520 0.8428 0.0230 0.3208 0.5665 0.5398 0.4241
OLS+RT 0.1015 0.1506 0.8410 0.0227 0.3331 0.5786 0.5344 0.4188
OLS+ANN 0.0999 0.1474 0.8560 0.0217 0.3612 0.6010 0.5585 0.4398
Table 4: BANK2 performances
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.0549 0.1411 0.6460 0.0178 0.0124 0.1168 0.0965 0.0718
B-OLS 0.0348 0.1449 0.6610 0.0188 -0.0419 0.0767 0.1754 0.1361
BC-OLS 0.0340 0.1456 0.6380 0.0190 -0.0529 0.1373 0.2312 0.1765
BR 0.0883 0.1315 0.6530 0.0169 -0.1128 0.1567 0.1719 0.1323
RiR 0.0550 0.1405 0.6499 0.0177 0.0210 0.1460 0.1270 0.0936
RoR 0.0347 0.1453 0.6438 0.0189 -0.0464 0.1733 0.1991 0.1501
RT 0.0482 0.1311 0.6990 0.0154 0.1477 0.3869 0.2007 0.1673
ANN 0.0458 0.1318 0.6000 0.0152 0.1386 0.3776 0.1482 0.1105
LSSVM 0.0473 0.1270 0.7441 0.0140 0.1998 0.4526 0.2085 0.1520
MARS 0.0478 0.1229 0.7345 0.0131 0.2506 0.5016 0.1344 0.0974
LOG+OLS 0.0553 0.1417 0.6010 0.0179 0.0043 0.0759 0.0701 0.0510
LOG+B-OLS 0.0392 0.1429 0.6330 0.0182 -0.0127 0.1214 0.1252 0.0923
LOG+BC-OLS 0.0349 0.1448 0.6330 0.0188 -0.0395 0.1665 0.1918 0.1426
LOG+BR 0.0569 0.1417 0.5790 0.0180 0.0043 0.0742 0.1710 0.1265
LOG+RiR 0.0545 0.1408 0.6404 0.0177 0.0169 0.1319 0.1511 0.1094
LOG+RoR 0.0366 0.1440 0.6510 0.0185 -0.0277 0.1510 0.2057 0.1504
LOG+RT 0.0434 0.1297 0.7210 0.0146 0.1663 0.4553 0.1571 0.1170
LOG+ANN 0.0452 0.1219 0.6190 0.0133 0.2634 0.5381 0.1671 0.1242
LOG+LSSVM 0.0460 0.1312 0.7485 0.0151 0.1471 0.4152 0.2272 0.1676
LOG+MARS 0.0467 0.1264 0.7365 0.0139 0.2082 0.4884 0.1381 0.0998
OLS+MARS 0.0471 0.1229 0.7189 0.0131 0.2512 0.5018 0.1231 0.0879
OLS+LSSVM 0.0483 0.1258 0.7416 0.0137 0.2148 0.4648 0.1869 0.1354
OLS+ANN 0.0570 0.1388 0.6730 0.0171 0.0442 0.2605 0.1369 0.1005
OLS+RT 0.0540 0.1372 0.7050 0.0168 0.0660 0.2578 0.1748 0.1285
Table 5: BANK3 performances
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.2712 0.3479 0.8520 0.1208 0.4412 0.6643 0.5835 0.4331
B-OLS 0.2214 0.3743 0.8500 0.1396 0.3530 0.6510 0.5822 0.4321
BC-OLS 0.3185 0.4292 0.6750 0.1839 0.1478 0.5726 0.5820 0.4316
BR 0.3208 0.3777 0.8480 0.1425 0.3405 0.6527 0.5908 0.4452
RiR 0.2707 0.3473 0.8541 0.1204 0.4429 0.6657 0.5972 0.4495
RoR 0.2576 0.3607 0.8483 0.1299 0.3992 0.6527 0.5857 0.4402
RT 0.2476 0.3362 0.8480 0.1128 0.4782 0.6916 0.5919 0.4762
ANN 0.2393 0.3299 0.8670 0.1086 0.4974 0.7053 0.6109 0.4555
LSSVM 0.2428 0.3315 0.8655 0.1097 0.4924 0.7017 0.6203 0.4692
MARS 0.2617 0.3361 0.8636 0.1128 0.4783 0.6917 0.6162 0.4631
LOG+OLS 0.2577 0.3465 0.8520 0.1199 0.4455 0.6678 0.5840 0.4338
LOG+B-OLS 0.2399 0.3551 0.8500 0.1259 0.4176 0.6651 0.5801 0.4301
LOG+BC-OLS 0.2502 0.3489 0.8510 0.1215 0.4379 0.6659 0.5819 0.4322
LOG+BR 0.2738 0.3560 0.8520 0.1265 0.4147 0.6680 0.5868 0.4342
LOG+RiR 0.2538 0.3432 0.8572 0.1176 0.4559 0.6755 0.6026 0.4543
LOG+RoR 0.2354 0.3521 0.8534 0.1238 0.4275 0.6728 0.5960 0.4477
LOG+RT 0.2679 0.3621 0.8570 0.1309 0.3945 0.6656 0.5899 0.4364
LOG+ANN 0.2558 0.3457 0.8540 0.1184 0.4480 0.6698 0.5852 0.4348
LOG+LSSVM 0.2534 0.3425 0.8590 0.1172 0.4581 0.6771 0.6024 0.4541
LOG+MARS 0.2536 0.3433 0.8572 0.1177 0.4558 0.6754 0.6027 0.4544
OLS+MARS 0.2617 0.3362 0.8620 0.1128 0.4781 0.6915 0.6117 0.4582
OLS+LSSVM 0.2439 0.3322 0.8656 0.1102 0.4904 0.7003 0.6211 0.4698
OLS+RT 0.2628 0.3425 0.8590 0.1171 0.4582 0.6776 0.6017 0.4498
OLS+ANN 0.2404 0.3300 0.8710 0.1087 0.4971 0.7053 0.6195 0.4635
Table 6: BANK4 performances
Method MAE RMSE AUC AOC R2 r ρ τ
OLS 0.1875 0.2375 0.7480 0.0555 0.2218 0.4740 0.5192 0.3651
B-OLS 0.1861 0.2368 0.7410 0.0561 0.2263 0.5073 0.5168 0.3636
BC-OLS 0.1848 0.2373 0.7390 0.0560 0.2228 0.5014 0.5155 0.3632
BR 0.1957 0.2402 0.7240 0.0575 0.2038 0.4557 0.4811 0.3359
RiR 0.1864 0.2373 0.7467 0.0555 0.2233 0.4775 0.5238 0.3704
RoR 0.1892 0.2430 0.7406 0.0579 0.1852 0.4543 0.5121 0.3612
RT 0.1851 0.2324 0.7370 0.0538 0.2546 0.5056 0.4957 0.3888
ANN 0.1678 0.2173 0.7830 0.0470 0.3486 0.5964 0.5765 0.4148
LSSVM 0.1707 0.2198 0.7847 0.0479 0.3331 0.5794 0.5801 0.4167
MARS 0.1733 0.2222 0.7709 0.0488 0.3187 0.5666 0.5565 0.3980
LOG+OLS 0.1851 0.2336 0.7500 0.0542 0.2468 0.4975 0.5246 0.3704
LOG+B-OLS 0.1852 0.2347 0.7480 0.0548 0.2397 0.5117 0.5192 0.3658
LOG+BC-OLS 0.1833 0.2349 0.7470 0.0549 0.2388 0.5099 0.5238 0.3699
LOG+BR 0.1939 0.2395 0.7250 0.0572 0.2083 0.4568 0.4820 0.3364
LOG+RiR 0.1854 0.2347 0.7492 0.0547 0.2400 0.4922 0.5274 0.3730
LOG+RoR 0.1877 0.2390 0.7451 0.0567 0.2118 0.4744 0.5190 0.3665
LOG+RT 0.1846 0.2344 0.7380 0.0547 0.2420 0.5000 0.4903 0.3445
LOG+ANN 0.1689 0.2188 0.7810 0.0476 0.3396 0.5845 0.5737 0.4135
LOG+LSSVM 0.1708 0.2197 0.7835 0.0479 0.3340 0.5797 0.5795 0.4163
LOG+MARS 0.1738 0.2217 0.7726 0.0486 0.3215 0.5687 0.5597 0.3985
OLS+MARS 0.1713 0.2215 0.7740 0.0484 0.3231 0.5769 0.5707 0.4082
OLS+LSSVM 0.1695 0.2216 0.7882 0.0485 0.3223 0.5755 0.5933 0.4279
OLS+RT 0.1779 0.2320 0.7660 0.0530 0.2572 0.5357 0.5554 0.3963
OLS+ANN 0.1747 0.2277 0.7730 0.0510 0.2844 0.5567 0.5706 0.4086
Table 7: BANK5 performances
The purely linear models built by OLS, RiR and RoR do not show consistent differences in per-
formance. RiR leads to an almost identical model to OLS. This indicates that the independent
variables in the datasets are not heavily correlated with each other. For all datasets, RoR yields
models with a somewhat lower or equal prediction performance. This points to an absence of severe
outliers, in which case the technique is less efficient than OLS.
The empirical distribution of LGD often causes the OLS assumption of normally distributed
error terms to be violated. The techniques B-OLS, BR, BC-OLS are designed to cope with certain
types of non-normal error distributions. Nonetheless, these techniques are shown to perform worse
than OLS, suggesting that, unlike for corporate LGD models [18], they are not better at coping
with the pronounced point densities observed in retail lending LGD datasets, while they may be
less efficient than OLS or could introduce model bias if a transformation is performed prior to OLS
estimation (as with B-OLS and BC-OLS).
Non-linear techniques such as RT, MARS, LSSVM and ANN perform consistently better than
linear techniques. This implies that the relation between the LGD and the different independent
variables in the datasets is non-linear as is most noticeable in BANK3. In general, LSSVM and
especially ANN perform better than RT and MARS. However, LSSVM and ANN result in black
box models while RT and MARS result in comprehensible white box models. In contrast with
some prior benchmarking studies on classification models for PD (e.g. [1]), non-linear models seem
to outperform linear models for the prediction of LGD.
Evaluating the performance of the two-stage models in which a logistic regression model choosing
between LGD ≤ 0 and LGD > 0 is combined with a second-stage model for LGD > 0 is less
straightforward. Although there is a weak trend that combining logistic regression with linear
models increases the performance of the latter, combining logistic regression with non-linear models
appears to slightly diminish their strong performance.
[Figure 2: Predictive performance of the regression techniques across the five datasets (horizontal axis: 1 = BANK1 to 5 = BANK5) in terms of the absolute metrics AUC, R², r, ρ and τ.]
In contrast to the previous two-stage model, a clear trend can be observed for the combination
of a linear and a non-linear model. By estimating the residuals of the OLS model with a non-linear
technique, the predictive performance increases to the level of the corresponding one-stage non-linear
technique. These two-stage models thus combine the comprehensibility of linear regression with the
high prediction performance of non-linear regression. Note that this methodology has also been
successfully applied in a classification (credit scoring) context [32] [33] [34].
The average ranking over all datasets according to each performance metric is listed in columns
2 to 9 of Table 8. The best performing technique for each metric is underlined and techniques that
perform significantly worse according to Nemenyi's post-hoc test (α = 0.05) are shown in italics. The
last column shows the meta-ranking (MR), i.e. the average ranking (AR) over all datasets and over
all metrics. The techniques in the table are sorted according to their meta-ranking. Additionally,
columns 10 to 13 give the meta-ranking when only the calibration, discrimination, relative and
absolute metrics, respectively, are included. The best performing techniques are consistently ranked
at the top according to each metric, no matter whether the metrics measure calibration or
discrimination or are absolute or relative.
The results of the Friedman test and subsequent Nemenyi’s post-hoc test with significance
level α = 0.05 can be intuitively visualised using Demsar’s significance diagram [12]. Although the
Demsar diagram can be displayed for all metric ranks, it is only illustrated in Figure 3 for R2 based
ranks. The diagram displays the performance rank of each technique along with a line segment
representing its corresponding critical difference (CD = 16.27). The right end of the line indicates
from which mean rank onward another technique is outperformed significantly by that technique.
The diagram is constructed with the regression techniques listed in ascending order of performance
on the y-axis, and the technique’s mean rank across all datasets displayed on the x-axis. A vertical
dashed line has been inserted to clearly identify the end of the best performing technique’s tail and
the start of the next significantly different technique.
Despite clear and consistent differences between regression techniques in terms of R², most
techniques do not differ significantly according to the Nemenyi test.
Method MAE RMSE AUC AOC R2 r ρ τ MRcal MRdis MRrel MRabs MR
OLS+LS-SVM 7.2 4.2 2.6 4.1 4.2 4.6 3 3.4 4.9 3.7 5.2 3.56 4.1
ANN 3.4 3.4 6.8 3 3.2 3.3 6.2 6.2 3.3 4.9 3.3 5.14 4.4
LSSVM 9.4 4.6 4.4 4.6 4.6 4.8 3.8 4.2 5.8 4.4 6.2 4.36 4.8
OLS+ANN 8.2 5.6 4 8.2 5.4 4.9 6.2 6 6.9 5.4 7.3 5.3 5.9
LOG+LS-SVM 8.9 7 5.9 7.2 7.1 6.8 6.8 6.4 7.6 6.7 7.7 6.6 7.0
LOG+ANN 6.8 5.7 11.6 6 5.8 5.8 9.2 9 6.1 7.8 6.2 8.28 7.3
OLS+MARS 12.9 5.5 6 5.2 5.5 5.6 9.6 10.2 7.3 7.1 7.9 7.38 7.6
OLS+RT 11.1 8.5 7.1 5.6 8.2 8.4 8.6 9.2 8.4 8.4 8.4 8.3 8.5
LOG+MARS 10.5 8.6 7.9 8.7 8.6 8.6 10.2 10.6 9.1 9.1 9.3 9.18 9.3
MARS 14.3 8 6.8 7.9 7.8 8 10.6 10.8 9.5 8.7 10.1 8.8 9.3
RT 11.1 9.5 18.5 9.8 9.4 10.2 12.4 7.4 10.0 11.3 10.1 11.58 10.3
LOG+RiR 14.8 12.7 12.3 12.4 12.3 14.2 11.8 12.4 13.1 12.7 13.3 12.6 12.8
LOG+RT 13.2 12.8 13 12.8 12.7 12.2 14.4 15 12.9 13.4 12.9 13.46 13.4
LOG+RoR 9.6 17.1 14.6 17.2 16.8 14.8 12 11.7 15.2 14.6 14.6 13.98 14.2
LOG+OLS 16.3 15 17.2 14.2 14.5 17.2 16.4 16.8 15.0 16.3 15.2 16.42 16.0
LOG+B-OLS 8.4 17.3 16.7 17.4 16.2 16 18.1 18.6 14.8 17.3 14.4 17.12 16.6
RiR 20.3 16 14.4 16.1 15.8 17.4 16.9 17.3 17.1 16.3 17.5 16.36 16.8
OLS 20.1 16.8 14.3 16.5 16.6 19 18.7 19.6 17.5 17.5 17.8 17.64 18.0
LOG+BC-OLS 11.8 19.6 19.7 19.7 19.4 17.8 17.3 17.6 17.6 18.6 17.0 18.36 18.0
B-OLS 12.2 20.2 15.5 21 20 19.6 17 17.4 18.4 18.3 17.8 17.9 18.3
RoR 15.4 21.5 17.2 21.5 21.4 18.9 16.6 16.6 20.0 18.7 19.5 18.14 18.7
BR 19.8 17.8 20.5 18.2 22.6 20.7 18.2 17.8 19.6 18.8 18.6 19.96 19.5
BC-OLS 15.4 21.9 21.2 21.9 21.8 20.2 16.6 17 20.3 19.8 19.7 19.36 19.3
LOG+BR 18.9 20.7 21.8 20.8 20.1 21 19.4 18.8 20.1 20.4 20.1 20.22 20.0
Table 8: Average ranking (AR) and meta-ranking (MR) across all metrics and datasets
Nonetheless, failing to reject the null hypothesis that two techniques have equal performances does not guarantee that it is
true. For example, Nemenyi’s test is unable to reject the null hypothesis that ANN and OLS have
equal performances although ANN consistently performs better than OLS. This can mean that the
performance differences between these two are just due to chance. But the result could also be
a Type II error. Possibly the Nemenyi test does not have sufficient power to detect a significant
difference, given a significance level of α = 0.05, 5 datasets and 24 techniques [26]. The insufficient
power of the test can be explained by the use of a large number of techniques in contrast with a
relatively small number of datasets.
[Figure 3: Demsar significance diagram of the 24 regression techniques based on their R² ranks, listed in ascending order of performance, with each technique's mean rank and critical difference tail (CD = 16.27) shown on the horizontal axis.]
6. Conclusion
This first large-scale LGD benchmarking study evaluates 24 regression techniques on 5 real-
life retail lending datasets from major international banking institutions. The average predictive
performance of the models in terms of R² ranges from 4% to 43%, which indicates that most of the
resulting models do not have satisfactory explanatory power. Nonetheless, a clear trend
can be seen that non-linear techniques, and artificial neural networks and support vector machines
in particular, give higher performances than more traditional linear techniques. This indicates the
presence of non-linear relations between the independent variables and LGD, in contrast with
some studies in PD modelling [1] where the difference between linear and non-linear techniques is
not that explicit. Given that LGD has a bigger impact on the minimal capital requirements
than PD, we have demonstrated the potential and importance of applying non-linear techniques for
LGD modelling, preferably in a two-stage context so as to obtain comprehensibility as well.

There is considerable evidence that the macro-economy affects clients' credit risk behaviour,
so an interesting topic for further research would be to examine the influence of macro-economic
variables [5], both for improving LGD models and for stress testing. Finally, one
could also try to add comprehensibility to well-performing black box models with rule extraction
techniques in order to gain more insight [27] [28].
7. Acknowledgements
We would like to thank the Flemish Research Fund for the post-doctoral research grant to
David Martens and the Odysseus grant B.0915.09 to Bart Baesens. We would also like to thank
the EPSRC and SAS UK for their financial support to Iain Brown.
References
[1] Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking
state of the art classification algorithms for credit scoring. Journal of the Operational Research Society, 54,
627–635.
[2] Baesens, B., Viaene, S., Van Gestel, T., Suykens, J., Dedene, G., De Moor, B., & Vanthienen, J. (2000). An
empirical assessment of kernel type performance for least squares support vector machine classifiers. International
Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 1, 313–316.
[3] Bastos, J. (2009). Forecasting bank loans for loss-given-default. CEMAPRE Working Papers 0901, Centre for
Applied Mathematics and Economics (CEMAPRE), School of Economics and Management (ISEG), Technical
University of Lisbon.
[4] Bellotti, T. & Crook, J. (2007). Modelling and predicting loss given default for credit cards. In: Credit Scoring
and Credit Control XI conference.
[5] Bellotti, T. & Crook, J. (2009). Macroeconomic conditions in models of LGD for retail credit. In: Credit Scoring
and Credit Control XI conference.
[6] Bi, J. & Bennet, K. P. (2003). Regression error characteristic curves. In: Twentieth International Conference
on Machine Learning.
[7] Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
[8] Box, G. & Cox, D. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–252.
[9] Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and Regression Trees. Chapman &
Hall/CRC.
[10] Caselli, S. & Querci, F. (2009). The sensitivity of the loss given default rate to systematic risk: New empirical
evidence on bank loans. Journal of Financial Services Research, 34, 1–34.
[11] Cohen, P., Cohen, J., West, S., & Aiken, L. (2002). Applied multiple regression/correlation analysis for the
behavioral sciences. Lawrence Erlbaum.
[12] Demsar, J. (2006). Statistical comparison of classifiers over multiple data sets. Journal of Machine Learning
Research, 7, 1–30.
[13] Draper, N. & Smith, H. (1998). Applied Regression Analysis. Wiley.
[14] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.
[15] Freund, R. & Littell, R. (2000). SAS System for Regression. Wiley.
[16] Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–141.
[17] Friedman, M. (1940). A comparison of alternative tests of significance for the problems of m rankings. Annals
of Mathematical Statistics, 11, 86–92.
[18] Gupton, G. & Stein, M. (2002). LossCalc: Model for predicting loss given default (LGD). Tech. rep., Moody's.
[19] Hampel, F., Ronchetti, R., Rousseeuw, P., & Stahel, W. (1986). Robust statistics : the approach based on
influence functions. Wiley.
[20] Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer Series in Statistics.
[21] Hoerl, A. E. & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics, 12, 55–67.
[22] Holland, P. & Welsch, R. (1977). Robust regression using iteratively reweighted least squares. Communications
in Statistics: Theory and Methods, 6, 813 – 827.
[23] Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression. Wiley, 2nd edn.
[24] Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101.
[25] Huber, P. & Ronchetti, E. (2009). Robust statistics. Wiley.
[26] Lessmann, S., Baesens, B., Mues, C., & Pietsch, S. (2008). Benchmarking classification models for software
defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34,
485–496.
[27] Martens, D., Baesens, B., & Van Gestel, T. (2009). Decompositional rule extraction from support vector ma-
chines by active learning. IEEE Transactions on Knowledge and Data Engineering, 21, 178–191.
[28] Martens, D., Baesens, B., Van Gestel, T., & Vanthienen, J. (2007). Comprehensible credit scoring models using
rule extraction from support vector machines. European Journal of Operational Research, 183, 1466–1476.
[29] Nemenyi, P. (1963). Distribution-free multiple comparisons. Ph.D. thesis, Princeton University.
[30] Smithson, M. & Verkuilen, J. (2006). A better lemon squeezer? maximum-likelihood regression with beta-
distributed dependent variables. Psychological Methods, 11, 54–71.
[31] Suykens, J., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2003). Least Squares Support
Vector Machines. World Scientific Publishing Company.
[32] Van Gestel, T., Baesens, B., Van Dijcke, P., Garcia, J., Suykens, J., & Vanthienen, J. (2006). A process model
to develop an internal rating system: Sovereign credit ratings. Decision Support Systems, 2, 1131–1151.
[33] Van Gestel, T., Baesens, B., Van Dijcke, P., Suykens, J., Garcia, J., & Alderweireld, T. (2005). Linear and
non-linear credit scoring by combining logistic regression and support vector machines. Journal of Credit Risk,
1.
[34] Van Gestel, T., Martens, D., Feremans, D., Baesens, B., Huysmans, J., & Vanthienen, J. (2007). Forecasting
and analyzing insurance companies’ ratings. International Journal of Forecasting, 23, 513–529.
[35] Van Gestel, T., Suykens, J., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., & Vandewalle,
J. (2003). Benchmarking least squares support vector machine classifiers. Machine Learning, 54, 5–32.
[36] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.
[37] Wang, H. & Hu, D. (2005). Comparison of SVM and LS-SVM for regression. International Conference on Neural Networks and Brain, 1, 279–283.