UNIVERSITY OF SOUTHAMPTON
FACULTY OF LAW, ARTS & SOCIAL SCIENCES
SCHOOL OF MANAGEMENT
Model development for imbalanced credit scoring data sets, Loss Given
Default (LGD) and Exposure At Default (EAD)
Supervising committee
ABSTRACT
The purpose of this thesis is to determine, and to better inform industry practitioners about, the most
appropriate classification and regression techniques for modelling the three key credit risk
components of the Basel II minimum capital requirement: probability of default (PD), loss given
default (LGD), and exposure at default (EAD). The Basel II accord regulates risk and capital
management requirements to ensure that a bank holds enough capital in proportion to the risk
of its lending practices. Under the advanced internal ratings based (IRB) approach, Basel II
allows banks to develop their own empirical models based on historical data for each of PD, LGD
and EAD.
In this thesis, the issue of imbalanced credit scoring data sets (a special case of PD modelling
where the number of defaulting observations in a data set is much lower than the number of
observations that do not default) is first identified, and the suitability of various classification
techniques is analysed and presented. As well as using traditional classification techniques, this
thesis also explores the suitability of gradient boosting, least squares support vector machines and
random forests as classification techniques. The second part of this thesis focuses on the prediction
of LGD, which measures the economic loss, expressed as a percentage of the exposure, in case of
default. Various state-of-the-art regression techniques to model LGD are considered.
In the final part of this thesis, we investigate models for predicting the exposure at default (EAD).
For off-balance-sheet items (for example, credit cards), calculating the EAD requires an estimate of the
committed but unused loan amount multiplied by a credit conversion factor (CCF). Ordinary least squares
(OLS), logistic and cumulative logistic regression models are analysed, as well as an OLS with
Beta transformation model, with the main aim of finding the most robust and comprehensible
model for the prediction of the CCF. A direct estimation of EAD, using an OLS model, is also
analysed. All the models built and presented in this thesis have been applied to real-life data
sets from major global banking institutions.
Contents
1 Introduction .....................................................................................................................1
1.1 The Basel II Capital Accord ..................................................................................... 2
1.2 The imbalanced credit scoring data set problem (a special case of probability of
default (PD) modelling) .................................................................................................. 6
1.3 The estimation of Loss Given Default (LGD) .......................................................... 9
1.4 Model development for Exposure At Default (EAD) ............................................. 11
1.5 Contributions........................................................................................................... 13
1.5.1 Building default prediction models for imbalanced credit scoring data sets ... 13
1.5.2 Estimation of Loss Given Default (LGD) ........................................................ 14
1.5.3 Regression model development for Credit Card Exposure At Default (EAD) 14
1.6 Notation................................................................................................................... 16
4 Building default prediction models for imbalanced credit scoring data sets ..........65
4.1 Introduction ............................................................................................................. 67
4.2 Overview of classification techniques .................................................................... 68
4.3 Experimental set-up and data sets ........................................................................... 69
4.3.1 Data set characteristics ..................................................................................... 69
4.3.2 Re-sampling setup and performance metrics ................................................... 70
4.3.3 k-fold cross validation...................................................................................... 71
4.3.4 Parameter tuning and input selection ............................................................... 73
4.3.5 Statistical comparison of classifiers ................................................................. 74
4.4 Results and discussion ............................................................................................ 76
4.5 Conclusions and recommendations for further work .............................................. 82
6 Regression Model Development for Credit Card Exposure At Default (EAD) ....121
6.1 Introduction ........................................................................................................... 123
6.2 Overview of techniques ........................................................................................ 125
6.2.1 Ordinary Least Squares (OLS)....................................................................... 125
6.2.2 Binary and Cumulative Logit models (LOGIT & CLOGIT)......................... 125
6.2.3 Ordinary Least Squares with Beta Transformation (B-OLS) ........................ 125
6.3 Empirical set-up and data sets............................................................................... 126
6.3.1 Coefficient of Determination (R²) ................................................................ 131
6.3.2 Pearson's Correlation Coefficient (r) ........................................................... 132
6.3.3 Spearman's Correlation Coefficient (ρ) ....................................................... 132
6.3.4 Root Mean Squared Error (RMSE)................................................................ 132
6.4 Results and discussion .......................................................................................... 133
6.5 Conclusions and recommendations for further work ............................................ 147
7 Conclusions ..................................................................................................................151
7.1 Thesis Summary and Conclusions ........................................................................ 151
7.2 Issues for further research ..................................................................................... 154
7.2.1 The imbalanced data set problem .................................................................. 154
7.2.2 Loss Given Default ........................................................................................ 155
7.2.3 Exposure at Default........................................................................................ 155
Appendices ......................................................................................................................157
A1: Data sets used in Chapter 4 .................................................................................. 157
A1.1 Australian Credit ............................................................................................ 157
A1.2 Bene1 ............................................................................................................. 158
A1.3 Bene2 ............................................................................................................. 159
A1.4 Behav ............................................................................................................. 160
A1.5 German Credit ................................................................................................ 160
A2: Residual plots for Chapter 4 ................................................................................ 161
A2.1 Australian Credit: Gradient Boosting ............................................................ 161
A2.2 Bene2: Gradient Boosting .............................................................................. 163
A3: Stepwise variable selection for Linear models used in Chapter 5 ....................... 164
A3.1 BANK1 .......................................................................................................... 164
A3.2 BANK2 .......................................................................................................... 166
A3.3 BANK3 .......................................................................................................... 167
A3.4 BANK4 .......................................................................................................... 167
A3.5 BANK5 .......................................................................................................... 168
A3.6 BANK6 .......................................................................................................... 168
A4: R-square based variable selection for Non-linear models used in Chapter 5 ........... 169
References .......................................................................................................................181
I. Tables
TABLE 3.1: Regression techniques used for LGD and EAD modelling ......................... 56
TABLE 6.1: Characteristics of Cohorts for EAD data set .............................................. 126
TABLE 6.2: Information Values of constructed variables ............................................. 134
TABLE 6.3: Parameter estimates and P-values for CCF estimation on the COHORT2
data set ............................................................................................................................ 136
TABLE 6.4: EAD estimates based on conservative and mean estimate for CCF .......... 138
TABLE 6.5: EAD estimates based on CCF predictions against actual EAD amounts .. 138
TABLE 6.6: Direct Estimation of EAD .......................................................................... 139
II. Figures
FIGURE 6.1: Raw CCF distribution (x-axis displays a snapshot of the CCF values from
the period of -9 to 10) ..................................................................................................... 129
FIGURE 6.2: CCF distribution winsorised (between 0 and 1) ....................................... 130
FIGURE 6.3: Distribution of direct estimation of EAD (the actual EAD amount present is
indicated by the overlaid black line) ............................................................................... 139
FIGURE 6.4: OLS base model predicted Exposure at Default (EAD) distribution (the
actual EAD amount present is indicated by the overlaid black line) .............................. 140
FIGURE 6.5: Binary LOGIT model predicted Exposure at Default (EAD) distribution
(the actual EAD amount present is indicated by the overlaid black line) ....................... 140
FIGURE 6.6: Cumulative LOGIT model predicted Exposure at Default (EAD)
distribution (the actual EAD amount present is indicated by the overlaid black line) ... 141
FIGURE 6.7: OLS with Beta Transformation model predicted Exposure at Default
(EAD) distribution (the actual EAD amount present is indicated by the overlaid black
line) ................................................................................................................................. 141
FIGURE 6.8: OLS base model plot for the Actual Mean EAD against Predicted Mean
EAD across ten bins (R2=0.9968) ................................................................................... 142
FIGURE 6.9: Binary LOGIT model plot for the Actual Mean EAD against the Predicted
Mean EAD across ten bins (R2=0.9944) ......................................................................... 142
FIGURE 6.10: Cumulative LOGIT model plot for the Actual Mean EAD against the
Predicted Mean EAD across ten bins (R2=0.9954)......................................................... 143
FIGURE 6.11: OLS with Beta Transformation model plot for the Actual Mean EAD
against the Predicted Mean EAD across ten bins (R2=0.9957) ...................................... 143
FIGURE 6.12: OLS base model plot for the Actual Mean CCF against the Predicted
Mean CCF across ten bins (R2=0.7061) ......................................................................... 144
FIGURE 6.13: Binary LOGIT model plot for the Actual Mean CCF against the Predicted
Mean CCF across ten bins (R2=0.2867) ......................................................................... 145
FIGURE 6.14: Cumulative LOGIT base model plot for the Actual Mean CCF against the
Predicted Mean CCF across ten bins (R2=0.9063) ......................................................... 145
FIGURE 6.15: OLS with Beta Transformation model plot for the Actual Mean CCF
against the Predicted Mean CCF across ten bins (R2=0.9154) ....................................... 146
DECLARATION OF AUTHORSHIP
I declare that this thesis and the work presented in it are both my own, and have been generated
by me as the result of my own original research. I confirm that:
this work was done wholly or mainly while in candidature for a research
degree at this University;
where any part of this thesis has previously been submitted for a degree or
any other qualification at this University or any other institution, this has
been clearly stated;
where I have consulted the published work of others, this is always clearly
attributed;
where I have quoted from the work of others, the source is always given.
With the exception of such quotations, this thesis is entirely my own work;
where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed
myself;
Signed: ………………………………………………………………………..
Date:…………………………………………………………………………….
Acknowledgements
This thesis would not have been possible without the support and guidance of a number
of important people, whom I would like to take this opportunity to acknowledge and
thank.
First of all I would like to thank my supervisor, Dr Christophe Mues, who has been of
unwavering support throughout my time at the University of Southampton. Without his
tutelage and expert knowledge in the field of credit risk modelling I could not have
achieved the work conducted in this thesis. I would also like to thank my senior
supervisor, Prof. Lyn Thomas, whose guidance and advice were invaluable in the
development of the research topics presented in this thesis.
I have also had the great pleasure of working alongside a number of well established and
flourishing academics in the field of credit risk and credit scoring. I would like to thank
the team I worked alongside on the LGD project, Gert Loterman, Dr David Martens, Dr
Bart Baesens and Dr Christophe Mues. It was a pleasure to work alongside these fine
minds in the formulation of our LGD benchmarking paper, which I am happy to say, has
now been successfully accepted for publication with the International Journal of
Forecasting (IJF). A special thanks also goes to my fellow PhD colleague Gert Loterman
for his commitment to the project and presenting our findings together at the Credit
Scoring and Credit Control XI Conference in Edinburgh. I wish him all the best in his
research and hope to work alongside him again in the future. I would also like to thank
everyone who has formed the credit team at the University of Southampton both past and
present (Meko, Ed, Mindy, Kasia, Bob, Jie, Ross, Madhur, Angela and Anna). It has been
a pleasure to work alongside such friendly, helpful and like-minded people.
I would like to thank the EPSRC whose financial support over the 3.5 years of this
research study allowed me to focus solely on the work at hand. I would also like to take
this opportunity to thank SAS UK for their sponsorship and everyone at SAS who I have
met and has helped me over my time as a PhD student. In particular I would like to show
my thanks to Geoffrey Taylor who enabled me to partake in the training programmes at
the SAS headquarters in Marlow, and for his advice on submitting an application to the
SAS Student Ambassador programme which, as a result, I was accepted for. I would also
like to thank Dr Laurie Miles for his support and interest in the work and papers that have
been completed as part of this thesis.
I have also worked alongside and come into contact with a number of amazing people
during my time at the University of Southampton. In particular I would like to thank
Gillian Groom not only for her help in enabling me to teach SAS but also for the part she
played in facilitating my future career, for this I am indebted to her. A special thanks goes
to my colleague and flatmate Mindy (and her husband Nic) for their support and Wii
parties throughout my time in Southampton. The friendship of both of them made life a
lot more interesting and enjoyable. I would also like to thank Shivam and Joe, not only
for being fellow PhD colleagues in the School of Management but for their football skills
on the pitch and being part of the mighty Lazy Town FC.
Finally, I would like to express my greatest thanks to my whole family, without whose
support I would not have achieved any of the goals I set out to attain. I owe a huge
debt (both monetarily and emotionally) to my parents and sister who have provided me
with the love and means necessary to succeed and achieve where I am today. I would also
like to thank my fiancée, who has supported me throughout my PhD; whenever I have
felt disheartened with my work, her love, encouragement and incredible cooking have put
me back on track.
Chapter 1
1 Introduction
With the recent financial instabilities in the credit markets, the area of credit risk
modelling has become ever more important, leading to the need for more accurate and
robust models. Further to this, the introduction of the Basel II Capital Accord (Basel
Committee on Banking Supervision, 2004) now allows financial institutions to derive
their own internal credit risk models under the advanced internal ratings based approach
(AIRB). The Basel II Capital Accord prescribes the minimum amount of regulatory
capital an institution must hold so as to provide a safety cushion against unexpected
losses. From a credit risk perspective, and under the advanced internal ratings based
approach (AIRB), the accord allows financial institutions to build risk models for three
key risk parameters: Probability of Default (PD), Loss Given Default (LGD), and
Exposure at Default (EAD). The Probability of Default (PD) is defined as the likelihood
that a loan will not be repaid and will therefore fall into default. Loss Given Default
(LGD) is the estimated economic loss, expressed as a percentage of exposure, which will
be incurred if an obligor goes into default. Exposure at Default (EAD) is a measure of the
monetary exposure should an obligor go into default.
In this thesis, we study the use of classification and regression techniques to develop
models for the prediction of all three components of expected loss, Probability of Default
(PD), Loss Given Default (LGD) and Exposure At Default (EAD). These particular topics
have been chosen in part because of the increased scrutiny of the financial sector and the
pressure placed on it by financial regulators to move to an advanced internal ratings based
approach. The financial sector is therefore looking for the best possible models to
determine its minimum capital requirements through the estimation
of PD, LGD and EAD. On the issue of PD estimation a great deal of work has already
been conducted in both academia and the industry; therefore in Chapter 4 we will tackle a
special case of PD modelling, i.e. building default prediction models for imbalanced
credit scoring data sets. In a credit scoring context imbalanced data sets frequently occur
when the number of defaulting loans in a data set is much lower than the number of
observations that do not default. Subsequently, in Chapters 5 and 6, we then turn our
attention to the other two much less researched risk components of LGD and EAD. It is
our aim to validate novel approaches, evaluate their effectiveness for all three
components of expected loss and obtain an improved understanding of the risk drivers in
the prediction of EAD.
1.1 The Basel II Capital Accord
The banking/financial sector is one of the most closely scrutinised and regulated
industries and as such is subject to stringent controls. The reason for this is that banks can
only lend out money in the form of loans if depositors trust that the bank and the banking
system are stable enough and that their money will be there when they wish to withdraw
it. However, in order to provide loans and mortgages, the banking sector must leverage
depositors' savings, meaning that only with this trust can it continue to function. It is
therefore imperative to prevent a loss of confidence and trust in the banking sector from
occurring, as it can have serious implications for the wider economy as a whole.
The job of the regulatory bodies therefore is to help ensure the necessary trust and
stability by limiting the level of risk that banks are allowed to take. For this to work
effectively, the maximum risk level banks can take needs to be set in relation to the
bank's own capital. From the bank's perspective, the high cost of acquiring and holding
capital makes it prohibitively expensive and unfeasible to have it fully cover all of a
bank's risks. As a compromise, the major regulatory body of the banking industry, the
Basel Committee on Banking Supervision, proposed guidelines in 1988, known as Basel
I, whereby a solvability coefficient of eight percent was introduced, i.e. the bank's own
capital must amount to at least eight percent of its total assets, weighted for their risk
(SAS Institute, 2002).
The figure of eight percent assigned by the Basel Committee is somewhat arbitrary and
has, since its conception, been the subject of much debate. After the introduction of the
Basel I accord, more than one hundred countries worldwide adopted the guidelines,
making it a major milestone in the history of global banking regulation. However, a
number of the accord's inadequacies, in particular with regard to the way that credit risk
was measured, became apparent over time (SAS Institute, 2002). To address these issues
a revised accord, Basel II, was conceived.
Pillar 1 aligns the minimum capital requirements to a bank's actual risk of economic loss.
Various approaches to calculating this are prescribed in the accord (including more risk-sensitive ones):
1. Standardised Approach
2. Internal Ratings Based (IRB) Approach
a. Foundation Approach
b. Advanced Approach
Under the standardised approach banks are required to use ratings from external credit
rating agencies to quantify required capital. The main purpose and strategy of the Basel
committee is to offer capital incentives to banks that move from a supervisory approach
to a best practice advanced internal ratings based one. The two versions of the internal
ratings based (IRB) approach permit banks to develop and use their own internal risk
ratings, to varying degrees. The IRB approach is based on the following four key
parameters:
1. Probability of Default (PD): the likelihood that a loan will not be repaid and will
therefore fall into default;
2. Loss Given Default (LGD): the estimated economic loss, expressed as a
percentage of exposure, which will be incurred if an obligor goes into default, in
other words, LGD equals 1 minus the recovery rate;
3. Exposure At Default (EAD): a measure of the monetary exposure should an
obligor go into default;
4. Maturity (M): the length of time to the final payment date of a loan or other
financial instrument.
From the parameters PD, LGD and EAD, expected loss (EL) can be derived as follows:

EL = PD × LGD × EAD

For example, if PD = 2%, LGD = 40% and EAD = £10,000, then EL = 0.02 × 0.40 × £10,000 = £80. Expected loss
can also be measured as a percentage of EAD:

EL% = PD × LGD
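For illustration, the expected-loss arithmetic can be written as a short Python sketch; the figures are those of the worked example above, not results from the thesis data sets:

```python
# Expected loss from the three Basel II risk parameters: EL = PD x LGD x EAD.
def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """Return the expected loss in currency units."""
    return pd_ * lgd * ead

pd_, lgd, ead = 0.02, 0.40, 10_000.0   # PD = 2%, LGD = 40%, EAD = GBP 10,000
el = expected_loss(pd_, lgd, ead)       # 0.02 * 0.40 * 10,000 = 80.0
el_pct_of_ead = pd_ * lgd               # EL as a percentage of EAD = 0.8%
print(el, el_pct_of_ead)
```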
FIGURE 1.1: Illustration of foundation and advanced Internal Ratings-Based (IRB) approach
The difference between these two approaches is the degree to which the four parameters
can be measured internally. For the foundation approach, only PD may be calculated
internally, subject to supervisory review (Pillar 2). The values for LGD and EAD are
fixed and based on supervisory values. For the final parameter, M, a single average
maturity of 2.5 years is assumed for the portfolio. In the advanced IRB approach all four
parameters are to be calculated by the bank and are subject to supervisory review
(Schuermann, 2004).
Under the AIRB approach, financial institutions are also recommended to estimate a 'Downturn
LGD', which 'cannot be less than the long-run default-weighted average LGD calculated
based on the average economic loss of all observed defaults within the data source for that
type of facility' (Basel, 2004).
We will now look at and identify some of the problems faced by financial institutions
wishing to implement the advanced IRB approach.
1.2 The imbalanced credit scoring data set problem (a special case of probability of default (PD) modelling)
Commonly, the first stage of PD estimation involves building a scoring model that can be
used to distinguish between different risk classes. In the development of credit scoring
models, several statistical methods are traditionally used, such as linear probability
models, logit models and discriminant analysis models. These statistical techniques can
be used to estimate the probability of default of a borrower based on factors such as loan
performance and the borrower's characteristics. Based on this information, credit
scorecards can be built to determine whether to accept or decline a borrower (application
scoring) or to provide an up-to-date assessment of the credit risk of existing borrowers
(behavioural scoring). The aim of credit scoring is therefore essentially to classify loan
applicants into two classes: good payers (those who are likely to keep up with their
repayments) and bad payers (those who are likely to default on their loans)
(Thomas, 2000). In the current financial climate, and with the recent introduction of the
Basel II Accord, financial institutions have even more incentives to select and implement
the most appropriate credit scoring techniques for their credit data sets. It is stated in
Henley and Hand (1997) that companies could make significant future savings if an
improvement of only a fraction of a percent could be made in the accuracy of the credit
scoring techniques implemented. However, in the literature, data sets that can be
considered as very low risk, or imbalanced data sets, have had relatively little attention
paid to them, in particular with regard to which techniques are most appropriate for
scoring them (Benjamin et al, 2006). The underlying problem with imbalanced data sets
is that they contain a much smaller number of observations in the class of defaulters than
in that of the good payers. A large class imbalance is therefore present which some
techniques may not be able to successfully handle (Benjamin et al, 2006). In a recent
FSA publication regarding conservative estimation of imbalanced data sets, regulatory
concerns were raised about whether firms can adequately assess the risk of imbalanced
credit scoring data sets (Benjamin et al, 2006).
A wide range of classification techniques have already been proposed in the credit
scoring literature, including statistical techniques, such as linear discriminant analysis and
logistic regression, and non-parametric models, such as k-nearest neighbour and decision
trees. But it is currently unclear from the literature which technique is the most
appropriate for improving discrimination for imbalanced credit scoring data sets. TABLE
1.1 in Section 2.1 provides a selection of techniques currently applied in a credit scoring
context, not specifically for imbalanced data sets, along with references showing some of
their reported applications in the literature.
Hence, the aim of the first project, reported in Chapter 4, is to conduct a study of various
classification techniques based on five real-life credit scoring data sets. These data sets
will then have the size of their minority class of defaulters further reduced by decrements
of 5% (from an original 70/30 good/bad split) to see how the performance of the various
classification techniques is affected by increasing class imbalance.
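As an illustration of this set-up, the sketch below shows one way the bad class could be progressively under-sampled from an initial 70/30 good/bad split; the function and column names, and the use of pandas, are assumptions of this example rather than the exact procedure used in Chapter 4.

```python
import numpy as np
import pandas as pd

def undersample_bads(df: pd.DataFrame, target: str, bad_pct: float, seed: int = 42) -> pd.DataFrame:
    """Return a sample in which the bad class makes up roughly `bad_pct` of observations.

    Assumes `target` is coded 1 = bad (defaulter), 0 = good. Only bads are removed;
    all goods are kept, mimicking a set-up that starts from a 70/30 good/bad split.
    """
    goods = df[df[target] == 0]
    bads = df[df[target] == 1]
    n_bads = int(round(len(goods) * bad_pct / (1.0 - bad_pct)))
    rng = np.random.default_rng(seed)
    keep = bads.iloc[rng.choice(len(bads), size=min(n_bads, len(bads)), replace=False)]
    return pd.concat([goods, keep]).sample(frac=1.0, random_state=seed)  # shuffle rows

# Bad-class proportions decreasing in 5% steps from the original 30%.
imbalance_levels = [0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
# samples = {p: undersample_bads(credit_df, "default", p) for p in imbalance_levels}
```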
The five real-life credit scoring data sets to be used in this empirical study include two
data sets from Benelux (Belgium, Netherlands and Luxembourg) institutions, the German
Credit and Australian Credit data sets which are publicly available at the UCI repository
(http://kdd.ics.uci.edu/), and the fifth data set is a behavioural scoring data set, which was
also obtained from a Benelux institution.
The techniques that will be evaluated in this chapter are traditional, well-reported
classification techniques (Baesens et al., 2003): logistic regression (LOG), linear and
quadratic discriminant analysis (LDA, QDA), nearest-neighbour classifiers (k-NN10, k-NN100)
and decision trees (C4.5), as well as machine learning techniques: least squares
support vector machines (LS-SVM), neural networks (NN), a gradient boosting algorithm
and random forests. These machine learning techniques are selected because of their
potential applications in a credit scoring context (Baesens et al., 2003) and the interest in
whether they can perform better than traditional techniques given a large class
imbalance. We are especially interested in the power and usefulness of the gradient
boosting and random forest classifiers which have yet to be thoroughly investigated in a
credit scoring context.
All techniques will be evaluated in terms of their Area Under the Receiver Operating
Characteristic Curve (AUC). This is a measure of the discrimination power of a classifier
without regard to class distribution or misclassification cost (Baesens et al., 2003).
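For reference, a minimal sketch of computing the AUC from a classifier's scores is given below; the use of scikit-learn and the small illustrative arrays are choices of this example, not the software or data of the thesis.

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 = defaulter, 0 = good payer; y_score: higher means riskier.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_score = [0.10, 0.70, 0.80, 0.20, 0.65, 0.15, 0.40, 0.55]

# AUC = probability that a randomly chosen defaulter is scored above a randomly
# chosen good payer; 0.5 corresponds to random ranking, 1.0 to perfect ranking.
auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))
```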
To make statistical inferences from the observed differences in AUC, we will follow the
recommendations given in a recent article (Demšar, 2006), which looked at the problem of
benchmarking classifiers on multiple data sets and recommended a set of simple robust
non-parametric tests for the statistical comparison of the classifiers. The AUC measures
will therefore be compared using Friedman's average rank test, and Nemenyi's post-hoc
test will be used to test the significance of the differences in rank between individual
classifiers. Finally, a variant of Demšar's significance diagrams will be plotted to
visualise their results.
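A minimal sketch of this comparison procedure, assuming the AUCs of k classifiers on N data sets are arranged in a table, is shown below; the illustrative AUC values and the tabulated studentised-range constant are assumptions of this example (the constants are tabulated in, e.g., Demšar, 2006).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows = data sets, columns = classifiers; entries are AUCs (illustrative numbers only).
auc = np.array([
    [0.78, 0.80, 0.79, 0.81],
    [0.71, 0.74, 0.73, 0.75],
    [0.85, 0.86, 0.84, 0.88],
    [0.66, 0.69, 0.68, 0.70],
    [0.90, 0.91, 0.89, 0.92],
])
n_datasets, k = auc.shape

# Friedman test on the per-data-set performance; average ranks (rank 1 = best AUC).
stat, p_value = friedmanchisquare(*[auc[:, j] for j in range(k)])
avg_ranks = rankdata(-auc, axis=1).mean(axis=0)

# Nemenyi post-hoc test: two classifiers differ significantly if their average
# ranks differ by more than the critical difference CD = q_alpha * sqrt(k(k+1)/(6N)).
q_alpha = 2.569  # studentised-range constant for k = 4 classifiers at alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
print(p_value, avg_ranks, cd)
```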
Having introduced the topic of research which will be conducted in Chapter 4 we will
now identify the major motivations for this thesis chapter.
Fundamentally from a regulatory perspective the issue is whether firms can adequately
build loan-level scoring models on imbalanced data sets as not all techniques may be able
to cope well with class imbalances; as a result, discrimination performance may suffer.
Without an adequate scoring model, it becomes difficult to segment exposures into
different rating grades or pools. So the key question becomes not whether we can assess
the risk, but whether we can still build a decent model that distinguishes between different
levels of (low) risk. The topic of this research has thus been chosen so as to assess the
capabilities of credit scoring techniques when a large class imbalance is present. The
motivation behind this particular research topic is to identify the capabilities of traditional
techniques such as logistic regression and linear discriminant analysis when a class
imbalance is present and compare these to techniques yet to be analysed in this field, i.e.
gradient boosting and random forests. If, for example, logistic regression can perform
comparably to the more advanced techniques when a large class imbalance is
present, this will provide confidence to practitioners wishing to implement such a
technique.
The experimental design has been chosen so that a variety of available data sets can be
compared at varying levels of class imbalance (through under-sampling the bad
observations). A process of 10-fold cross-validation is applied to retain statistical and
empirical inference where small numbers of bad observations are present in the
imbalanced samples. Further motivations behind the experimental design of this
particular research area are identified and assessed in the literature review section of this
thesis.
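A minimal sketch of such a stratified 10-fold cross-validation loop is shown below, with logistic regression as a stand-in classifier; scikit-learn and the specific function names are assumptions of this example, not the implementation used in the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_auc(X: np.ndarray, y: np.ndarray, n_splits: int = 10, seed: int = 1) -> float:
    """Average test-fold AUC under stratified k-fold cross-validation.

    Stratification keeps the (small) proportion of bads roughly constant in every
    fold, so even heavily imbalanced samples contribute bads to each test set.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```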
1.3 The estimation of Loss Given Default (LGD)
The LGD parameter measures the economic loss, expressed as a percentage of the
exposure, in case of default. This parameter is a crucial input to the Basel II capital
calculation as it enters the capital requirement formula in a linear way (unlike PD, which
comparatively has a smaller effect on minimal capital requirements). Hence, changes in
LGD directly affect the capital of a financial institution and as such also its long-term
strategy. It is thus of crucial importance to have models that estimate LGD as accurately
as possible. This is however not straightforward, as industry models typically show low
R 2 values. Such models are often built using ordinary least squares regression or
10 | B a s e l I I C o m p l i a n t C r e d i t R i s k M o d e l l i n g
regression trees, even though prior research has shown that LGD typically displays a non-
linear bi-modal distribution with spikes around 0 and 1 (Bellotti & Crook 2007). In the
literature the majority of work to date has focused on the issues related to PD estimation
whereas only more recently, academic work has been conducted into the estimation of
LGD (e.g. Bellotti and Crook, 2009; Loterman et al., 2009; Thomas et al., 2010).
In Chapter 5, a large set of state-of-the-art regression algorithms will be applied to 6 real-
life LGD data sets with the aim of achieving a better understanding of which techniques
perform the best in the prediction of LGD. The regression models employed will include
one-stage models, such as those built by ordinary least squares, beta regression, artificial
neural networks, support vector machines and regression trees, as well as two-stage
models which attempt to combine the benefits of multiple techniques. Their performances
will be determined through the calculation of several performance metrics which will in
turn be meta-ranked to determine the most predictive regression algorithm. The
performance metrics will again be compared using Friedman's average rank test and
Nemenyi's post-hoc test will be employed to test the significance of the differences in
rank between individual regression algorithms. Finally, a variant of Demšar's significance
diagrams will be plotted for each performance metric to visualise their results.
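By way of illustration, the sketch below shows one simple two-stage combination, a linear model whose residuals are corrected by a regression tree; this is only an example of the general idea and not necessarily one of the specific two-stage models benchmarked in Chapter 5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

class TwoStageLGD:
    """Stage 1: OLS gives a comprehensible linear baseline.
    Stage 2: a regression tree fitted to the stage-1 residuals picks up non-linear
    structure. Predictions are the sum of both stages, clipped to the [0, 1] LGD range."""

    def __init__(self, max_depth: int = 4):
        self.ols = LinearRegression()
        self.tree = DecisionTreeRegressor(max_depth=max_depth)

    def fit(self, X: np.ndarray, y: np.ndarray) -> "TwoStageLGD":
        self.ols.fit(X, y)
        residuals = y - self.ols.predict(X)
        self.tree.fit(X, residuals)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.clip(self.ols.predict(X) + self.tree.predict(X), 0.0, 1.0)
```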
This first large-scale LGD benchmarking study, in terms of both techniques and data sets,
investigates whether other approaches can improve predictive performance, which, given
the impact of LGD on capital requirements, can yield large benefits.
Having introduced the topic of research which will be conducted in Chapter 5 we will
now identify the major motivations for this thesis chapter.
There has been much industry debate as to the best techniques to apply in the estimation
of LGD, given its bi-modal distribution. The motivations for this particular research topic
are to determine the predictive power of commonly used techniques such as linear
regression with transformations and compare them to more advanced machine learning
techniques such as neural networks and support vector machines. The aim in doing this is
to better inform industry practitioners as to the comparable ability of potential techniques
and to add to the current literature on both the topics of loss given default and
applications of domain specific regression algorithms.
1.4 Model development for Exposure At Default (EAD)
Over the last few decades, credit risk research has largely focused on the estimation
and validation of probability of default (PD) models in credit scoring. However, to date
very little model development and validation has been reported on the estimation of
EAD, particularly for retail lending (credit cards). As with LGD, EAD enters the capital
requirement formulas in a linear way and therefore changes to EAD estimations have a
crucial impact on regulatory capital. Hence, as with LGD, it is important to develop
robust models that estimate EAD as accurately as possible.
In defining EAD for on-balance sheet items, EAD is typically taken to be the nominal
outstanding balance net of any specific provisions (Financial Supervision Authority, UK,
2004a, 2004b). For off-balance sheet items (for example, credit cards), EAD is estimated
as the current drawn amount, E(t_r), plus the current undrawn amount (i.e. credit limit
minus drawn amount), L(t_r) - E(t_r), multiplied by a credit conversion factor, CCF, or
loan equivalency factor (LEQ):

EAD = E(t_r) + CCF × (L(t_r) - E(t_r))
The credit conversion factor can be defined as the percentage rate of undrawn credit lines
(UCL) that have yet to be paid out but will be utilised by the borrower by the time the
default occurs (Gruber and Parchert, 2006). The calculation of the CCF is very important
for off-balance sheet items as the current exposure is generally not a good indication of
the final EAD, the reason being that, as an exposure moves towards default, the
likelihood is that more will be drawn down on the account. In other words, the source of
variability of the exposure is the possibility of additional withdrawals when the limit
allows this (Moral, 2006).
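The EAD calculation implied by the formula above can be illustrated with a short sketch (the variable names and example figures are this example's own):

```python
def ead_from_ccf(drawn: float, limit: float, ccf: float) -> float:
    """EAD = E(t_r) + CCF * (L(t_r) - E(t_r)): drawn balance plus the
    credit-converted share of the undrawn amount (limit minus drawn)."""
    undrawn = max(limit - drawn, 0.0)
    return drawn + ccf * undrawn

# Example: GBP 1,200 drawn on a GBP 5,000 limit with an estimated CCF of 0.45.
print(ead_from_ccf(drawn=1200.0, limit=5000.0, ccf=0.45))  # 1200 + 0.45 * 3800 = 2910.0
```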
The purpose of this chapter will therefore be to look at the estimation and validation of
this CCF in order to correctly estimate the off-balance sheet EAD. A real-life data set
with monthly balance amounts for clients over the period 2001-2004 will be used in the
building and testing of the regression models. We also aim to gain a better understanding
of the variables that drive the prediction of the CCF for consumer credit. To achieve this,
predictive variables that have previously been suggested in the literature (Moral, 2006)
will be constructed, along with a combination of new and potentially significant
variables. We also aim to identify whether an improvement in predictive power can be
achieved over ordinary least squares regression (OLS) by the use of binary logit and
cumulative logit regression models and an OLS with Beta transformation model. The
reason why we propose these models is that recent studies (e.g. Jacobs, 2008) have
shown that the CCF exhibits a bi-modal distribution with two peaks around 0 and 1, and a
relatively flat distribution between those peaks. This non-normal distribution is therefore
less suitable for modelling with traditional ordinary least squares (OLS) regression. The
motivation for using an OLS with Beta transformation model is that it accounts for a
range of distributions including a U-shaped distribution. We will also trial a direct OLS
estimation of the EAD and use it as a comparison to estimating a CCF and applying it to
the EAD formulation.
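To make the OLS with Beta transformation idea concrete, a minimal sketch is given below; it assumes the common formulation in which the bounded target is mapped through a fitted beta CDF and then the standard normal quantile function before an OLS fit, and the exact transformation used in Chapter 6 may differ in its details.

```python
import numpy as np
from scipy.stats import beta, norm
from sklearn.linear_model import LinearRegression

class BetaTransformOLS:
    """OLS on a beta/normal-transformed target bounded in (0, 1), e.g. a winsorised CCF."""

    def __init__(self, eps: float = 1e-4):
        self.eps = eps  # keeps the CDF away from exactly 0 or 1 before the normal quantile

    def fit(self, X: np.ndarray, y: np.ndarray) -> "BetaTransformOLS":
        y = np.clip(y, self.eps, 1.0 - self.eps)
        self.a_, self.b_, _, _ = beta.fit(y, floc=0.0, fscale=1.0)  # fit shape parameters only
        z = norm.ppf(np.clip(beta.cdf(y, self.a_, self.b_), self.eps, 1.0 - self.eps))
        self.ols_ = LinearRegression().fit(X, z)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Invert the transformation: normal CDF back to (0, 1), then the beta quantile.
        z_hat = self.ols_.predict(X)
        return beta.ppf(norm.cdf(z_hat), self.a_, self.b_)
```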
Having introduced the topic of research which will be conducted in Chapter 6 we will
now identify the major motivations for this thesis chapter.
The correct calculation of credit conversion factors for off-balance sheet items is of
pertinent interest and importance to the financial sector. The main motivation for
choosing this research topic therefore is to provide insight to the industry as to the
potential techniques at their disposal for calculating their CCFs. The estimation of the CCF is
also a similar problem to that of estimating LGD, given that it displays a bi-modal
distribution.
1.5 Contributions
Having identified the need for a greater understanding of the appropriate credit risk
modelling techniques available to practitioners, we will now identify the major research
topics and contributions of this thesis.
1.5.1 Building default prediction models for imbalanced credit scoring data sets
The contributions of the research set out in Chapter 4 of this thesis are as follows.
In Chapter 4, we will address the issue of estimating probability of default
for imbalanced data sets. Whereas other studies have benchmarked several scoring
techniques, in our study, we have explicitly looked at the problem of having to build
models on potentially highly imbalanced data sets. Two techniques that have yet to be
fully researched in the context of credit scoring, i.e. Gradient Boosting and Random
Forests, will be chosen, alongside traditional credit scoring techniques, to give a broader
review of the techniques available.
The results of these experiments will show that the Gradient Boosting and Random
Forest classifiers perform well in dealing with samples where a large class imbalance is
present. The findings will also suggest that the use of a linear kernel LS-SVM is not
beneficial in the scoring of data sets where a very large class imbalance exists.
1.5.2 Estimation of Loss Given Default (LGD)
In Chapter 5, a large-scale Loss Given Default (LGD) benchmarking study will be
undertaken, with the aim of comparing various state-of-the-art regression techniques to
undertaken, with the aim of comparing various state-of-the-art regression techniques to
model and predict LGD. The findings displayed in Chapter 5 will indicate that the
average predictive performance of the models in terms of R² ranges from 4% to 43%,
indicating that most resulting models have limited explanatory power. Nonetheless, a
clear trend will be displayed showing that non-linear techniques, and artificial neural
networks and support vector machines in particular, give higher performances than more
traditional linear techniques. This indicates the presence of non-linear interactions
between the independent variables and the LGD, contrary to some studies in PD
modelling where the difference between linear and non-linear techniques is not that
explicit. Given the fact that LGD has a bigger impact on the minimal capital requirements
than PD, we will demonstrate the potential importance of applying non-linear techniques,
preferably in a two-stage context to obtain comprehensibility as well, for LGD modelling.
To the best of our knowledge, such an LGD study has not been conducted before in
the literature.
1.5.3 Regression model development for Credit Card Exposure At Default (EAD)
In Chapter 6, we will propose several models for predicting the Exposure At Default
(EAD) and estimating the credit conversion factor (CCF). Ordinary least squares, binary
logit and cumulative logit regression models will be estimated and compared for the
prediction of the CCF; to date, these have not been thoroughly evaluated for this purpose. A
variety of new variables of interest will also be calculated and used in the prediction of
the CCF. An in-depth analysis of the predictive variables used in the modelling of the
CCF will be given, and will show that previously acknowledged variables are significant.
The results from this chapter will also show that a marginal improvement in the
coefficient of determination can be achieved with the use of a binary logit model over a
traditional OLS model. Interestingly the use of a cumulative logit model is shown to
perform worse than both the binary logit and OLS models.
With regard to the additional variables proposed in the prediction of the CCF, only one,
i.e. the average number of days delinquent in the last six months, gives an adequate p-value
when a stepwise procedure is used.
1.6 Notation
In this thesis, the following mathematical notation is used. A scalar x is denoted in
normal script. A vector x is represented in boldface and is assumed to be a column
vector, x = (x_1, x_2, ..., x_n)^T. The corresponding row vector, x^T = (x_1, x_2, ..., x_n),
is obtained using the transpose operator T. Bold capital notation is used for a matrix, X. The
number of independent variables is given by n and the number of observations is given
by l. The observation i is denoted as x_i, whereas variable j is indicated as x_j.
Chapter 2
2 Literature Review
In this section a review of the literature topics related to this PhD thesis will be given.
This section is formulated as follows. We begin by looking at the current applications of
data mining techniques in credit risk modelling and go on to look at the current work and
issues in the modelling of the three parameters of the minimum capital requirements
(probability of default, loss given default and exposure at default). To date a considerable
amount of work has been done on the estimation of the probability of default. Further to
this, the issue of imbalanced credit scoring data sets, which has been highlighted by the
Basel Committee on Banking Supervision (2005) as a potential problem for probability of
default modelling, is also looked at and reviewed. Finally, a summary of the literature
review chapter will be given.
In this section, a review of the current applications of data mining techniques in a credit
risk modelling environment will be given. The ideas already present in the literature will
be explored with the aim of highlighting potential gaps that further research could
fill. TABLE 1.1 provides a selection of techniques currently applied in a credit scoring
context, not specifically for imbalanced data sets, along with references showing some of
their reported applications in the literature.
In the development of credit risk modelling and scorecard building, discriminant analysis
and linear or logistic regression have traditionally been the most widely applied
techniques (Hand & Henley, 1997). This is partly due to their ability to be easily
understood and ease of application. The first major work in the application of machine-
learning algorithms in a credit risk modelling context was conducted by Davis et al
(1992). In this paper a number of algorithms, including Bayesian inference and neural
networks, are applied to credit-card assessment data from the Bank of Scotland. Their
findings suggest that overall all the algorithms analysed perform at the same level of
accuracy, with the neural network algorithms taking the longest to train. Their research
was limited however by the number of data observations in both the training and test sets,
and by the computational power of the period. Further research has, however, since been
conducted into the applications of data mining techniques over a larger selection of data
sets, and the findings from these studies will be discussed before conclusions about
potential gaps are drawn.
To date, a variety of data mining models have been used in the estimation of default for
both consumer and commercial credit. In Rosenberg and Gleit (1994), a survey of the use
of discriminant analysis, decision trees, expert systems for static decisions, dynamic and
linear programming and Markov chains is undertaken for credit management. They
surmised that although, up until that period, sophisticated techniques such as linear or
dynamic programming were unused in practice, there was a potential future use for them
in this context. This signified the potential for other practitioners to further their study
and apply techniques such as linear programming in the estimation of credit risk.
Hand and Henley (1997) examined the problems that have arisen in the credit scoring
context as well as giving a detailed review of the statistical methods used. They state that
although the main focus of statistical methods for credit scoring has so far been to simply
discriminate between good and bad risk classes, there is a much larger scope for the
application of these techniques. This leads to the application of data mining techniques in
a credit risk modelling environment, such as modelling risk parameters for Basel. Further
discussion of the implications and practicalities of applying these methods in a credit risk
modelling domain will follow the discussion of additional credit scoring techniques and
research. Similarly to Hand and Henley (1997), Lee et al (2002) compare widely used
techniques (such as logistic regression and linear discriminant analysis) and explore the
integration of back-propagation neural networks with traditional discriminant analysis,
with the aim of improving credit scoring performance. Their
findings indicate that not only can convergence be achieved quicker than with neural
networks on their own, but in terms of accuracy an improvement over logistic regression
and discriminant analysis can be made.
Expanding on the work conducted by Davis et al (1992), Giudici (2001) shows how
Bayesian methods, coupled with Markov Chain Monte Carlo (MCMC) computational
techniques, can be successfully employed in the analysis of the highly dimensional,
complex data sets that are common in data mining. This study shows the potential of
MCMC for credit scoring. Through the use of a reversible jump MCMC and graphical
models, one can extract valuable information from data sets, in the form of conditional
dependence relationships. Applications of MCMC in the specific context of modelling
LGD (discussed in section 2.2.2) can also be found in Luo & Shevchenko (2010).
More recently a comparison of a variety of data mining techniques is given in Yeh and
Lien (2009). In their paper the predictive accuracy of six data mining methods are
compared (K-nearest neighbours, Logistic Regression, Discriminant Analysis, Naïve
Bayesian classifiers, Artificial Neural Networks and Classification Trees) on customers'
default payments in Taiwan. For this paper the predictive accuracy of the estimated
probability of default is analysed as opposed to a traditional classification analysis. The
findings indicate that the forecasting model produced by artificial neural networks has the
highest coefficient of determination (R-Square) in estimating the real probability of
default. This goes some way towards agreeing with the findings shown in Baesens et al. (2003)
and hence strengthens the need to identify how well these techniques can still perform given
varying levels of class imbalance in a credit scoring context.
It must be noted that this literature overview for current data mining techniques applied in
a credit risk modelling context is by no means exhaustive. Other techniques used for
credit scoring and risk modelling include, for example, genetic algorithms (Bedingfield
and Smith, 2001; Fogarty et al., 1992) and mathematical programming (Freed and Glover,
1981; Hand and Jacka, 1981; Kolesar and Showers, 1985).
The majority of the studies reviewed here display the same limitations in the number of
real-world credit data sets used and the number/variety of techniques compared. Another
consideration is the fact that the area under the receiver operating characteristic curve (AUC) statistic is
not widely reported in these studies, whereas in industry practice this is a well understood
and well used statistical measure. This thesis therefore will attempt to incorporate a wide
variety of techniques and real world credit data sets and provide performance metrics that
are of use within industry practice (i.e. R-square for regression models, AUC for
classification models and correlation metrics).
For a more detailed review paper of the statistical classification methods used in
consumer credit scoring please see Hand and Henley (1997).
2.2 Components
This section details the literature studies on the three contributing components to the
calculation of the minimum capital requirements. The current understanding and
implementations of these in the literature will be discussed.
Over the last few decades, the credit risk modelling literature has mainly focused on the
estimation of the probability of default (PD) on individual loans or pools of transactions,
with less literature available on the estimation of the loss given default (LGD) and
the correlation between defaults (Crouhy et al, 2000; Duffie & Singleton, 2003). Work
has also been developed on exposure at default modelling, but to a far lesser extent (cf.
section 2.2.3).
Probability of default (PD) can be defined as the likelihood that a loan will not be repaid
and will therefore fall into default. A default is considered to have occurred with regard
to a particular obligor (i.e. customer) when either or both of the two following events
have taken place:
1. The bank considers that the obligor is unlikely to pay its credit obligations to the
banking group in full (e.g. if an obligor declares bankruptcy), without recourse by
the bank to actions such as realising security (if held) (i.e. taking ownership of the
obligor's house, if they were to default on a mortgage).
2. The obligor is past due, i.e. missed payments, for more than 90 days on any
material credit obligation to the banking group. (Basel, 2004)
This section gives a non-exhaustive overview of the key literature to date in the field of PD
modelling. A clear distinction can be made between those models developed for retail
credit and corporate credit facilities in the estimation of PD. As such this section has been
sub-divided into three categories distinguishing the literature for retail credit (cf. 2.2.1.a),
corporate credit (cf. 2.2.1.b) and calibration (cf. 2.2.1.c).
The main benefits of credit scoring models are their relative ease of implementation and
the fact that they do not suffer from the opaqueness of some of the other proposed “black-
box” techniques such as Neural Networks and Least Square Support Vector Machines
proposed in Baesens et al (2003).
Since the advent of the new capital accord (Basel Committee on Banking Supervision,
2004), a renewed interest has been seen in credit risk modelling. With the allowance
under the internal ratings based approach of the capital accord for organisations to create
their own internal ratings models, the use of appropriate modelling techniques is ever
more prevalent. Banks must now weigh up the issue of holding enough capital to limit
insolvency risks and not holding excessive capital due to its cost and limits to efficiency
(Bonfim, 2009).
Further recent work on the discussion of PD estimation from a regulatory perspective for
retail credit can be found in Chatterjee et al (2007), where the consequences of changes in
regulation of bankruptcy are analysed and advisory pointers given.
and to also test the capabilities of these techniques when class imbalances are present.
Other recent work on PD estimation for corporate credit can be found in Fernandes
(2005), Carling et al (2007), Tarashev (2008), Miyake and Inoue (2009) and Kiefer
(2010).
2.2.1.c PD calibration
The purpose of PD calibration is the assignment of a default probability to each possible
score or rating grade value. The important information required for calibrating PD
models includes:
- The PD forecasts over a rating class and the credit portfolio for a specific forecasting
period.
- The number of obligors assigned to the respective rating class by the model.
- The default status of the debtors at the end of the forecasting period.
(Guettler and Liedtke, 2007)
It has been found (Guettler and Liedtke, 2007) that realised default rates are actually
subject to relatively large fluctuations, making it necessary to develop indicators to show
how well a rating model estimates the PDs. It is recommended in Tasche (2003) that
traffic light indicators could be used to show whether there is any significance in the
deviations of the realised and forecasted default rates. The three traffic light indicators,
green, yellow and red identify the following potential issues. A green traffic light
indicates that the true default rate is equal to, or lower than, the upper bound default rate
at a low confidence level. A yellow traffic light indicates the true default rate is higher
than the upper bound default rate at a low confidence level and equal to, or lower than,
the upper bound default rate at a high confidence level. Finally a red traffic light indicates
the true default rate is higher than the upper bound default rate at a high confidence level.
(Tasche, 2003 via Guettler and Liedtke, 2007)
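To make the traffic light idea concrete, the following minimal Python sketch assigns a traffic light to a rating grade by comparing the realised number of defaults with binomial upper bounds implied by the forecast PD; the confidence levels, function name and binomial assumption are illustrative choices rather than the exact scheme of Tasche (2003).

```python
from scipy.stats import binom

def traffic_light(n_obligors, n_defaults, forecast_pd,
                  low_conf=0.95, high_conf=0.999):
    """Illustrative traffic light check of realised vs. forecast default rates.

    Compares the realised number of defaults in a rating grade with the
    upper-bound default counts implied by the forecast PD at a low and a
    high confidence level (binomial assumption).
    """
    upper_low = binom.ppf(low_conf, n_obligors, forecast_pd)
    upper_high = binom.ppf(high_conf, n_obligors, forecast_pd)

    if n_defaults <= upper_low:
        return "green"      # realised rate within the low-confidence bound
    elif n_defaults <= upper_high:
        return "yellow"     # exceeds the low- but not the high-confidence bound
    return "red"            # exceeds the high-confidence bound

# Illustrative call: 1,000 obligors, 18 realised defaults, 1% forecast PD
print(traffic_light(n_obligors=1000, n_defaults=18, forecast_pd=0.01))
```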
Although a non-exhaustive list, substantial work has previously been conducted in the
estimation of probability of default. This section of literature is included to inform the
reader of the current modelling research to date with regards to PD, which will form a
precursor to the analysis of credit scoring for imbalanced data sets. As the topic of
research of this thesis is focused towards estimating PD in imbalanced data sets, a more exhaustive review of the current literature on probability of default modelling can be found in the following review papers: Altman and Sironi (2004) and Erdem (2008). However, as we will see in the next section, an interesting finding is that little work has been conducted on imbalanced data sets, where the class of defaulters contains far fewer observations than the class of payers but a PD estimate must still be produced. Therefore, in the following section, the issue of imbalanced credit scoring data sets will be examined with the aim of identifying the current approaches in the literature and any potential gaps.
In 2005, The Basel Committee on Banking Supervision (2005) highlighted the fact that
calculations based on historical data made for very safe assets may “not be sufficiently
reliable” for estimating the probability of default. The reason for this is that as there are
so few defaulted observations, the resulting estimations are likely to be inaccurate.
Therefore a need is present for a better understanding of the appropriate modelling
techniques for data sets which display a limited number of defaulted observations.
This section has been further sub-divided into the problems that imbalanced credit scoring data sets pose to modelling (cf. 2.2.1.1.a) and the issue of calibration (cf. 2.2.1.1.b), i.e. how a statistically conservative long-run average can be achieved.
Previous benchmarking studies, however, did not focus specifically on how these techniques compare on heavily imbalanced
samples, or to what extent any such comparison is affected by the issue of class
imbalance. For example, in Baesens et al. (2003), seventeen techniques including both
well known techniques such as logistic regression and discriminant analysis and more
advanced techniques such as least square support vector machines were compared on
eight real-life credit scoring data sets. Although more complicated techniques such as
radial basis function least square support vector machines (RBF LS-SVM) and neural
networks (NN) yielded good performances in terms of AUC, simpler linear classifiers
such as linear discriminant analysis (LDA) and logistic regression (LOG) also gave very
good performances. However, there are often conflicting opinions when comparing the
conclusions of studies promoting differing techniques. For example, in Yobas et al. (2000), the authors found that linear discriminant analysis (LDA) outperformed neural networks in the prediction of loan default, whereas in Desai et al. (1996), neural networks were reported to perform significantly better than LDA. Furthermore, many
empirical studies only evaluate a small number of classification techniques on a single
credit scoring data set. The data sets used in these empirical studies are also often far
smaller and less imbalanced than those data sets used in practice. Hence, the issue of
which classification technique to use for credit scoring, particularly with a small number
of bad observations, remains a challenging problem (Baesens et al., 2003). In more recent
work on the effects of class distribution on the prediction of PD, Crone and Finlay (2011) found that under-sampled data sets are inferior to unbalanced and over-sampled data sets. However, it was also found that the larger the sample size used, the less significant the differences between the balancing methods became. Their study also incorporated a variety of data mining techniques, including logistic regression, classification and regression trees, linear discriminant analysis and neural networks. From the application of these techniques over a variety of class balances it was found that logistic regression was the least sensitive to balancing. This piece of work is thorough in its empirical design; however, it does not assess more novel machine learning techniques in the estimation of
default. In the study presented in this thesis, additional techniques such as Gradient
Boosting and Random Forests will be adopted to contribute additional value to the
literature.
In Yao (2009), hybrid SVM-based credit scoring models are constructed to evaluate an applicant's score from the applicant's input features. This paper shows the implications
of using machine learning based techniques (SVMs) in a credit scoring context on two
widely used credit scoring datasets (Australian credit and German credit) and compares
the accuracy of this model against other techniques (LDA, logistic regression and NN).
Their findings suggest that the SVM hybrid classifier has the best scoring capability
when compared to traditional techniques. Although this is a non-exhaustive study with a bias towards the use of RBF-SVMs, it gives a clear basis for the hypothetical use of SVMs in a credit scoring context. The use of the Australian and German credit data sets is also of interest, as the same data sets will be utilised in Chapter 4 of this study. A lot can be learned from the empirical set-up of this work, and it will be built upon in this thesis.
In Kennedy (2011), the suitability of one-class and supervised two-class classification algorithms as a solution to the low-default portfolio problem is evaluated. This study
compares a variety of well established credit scoring techniques (e.g. LDA, LOG and k-
NN) against the use of a linear kernel SVM. Nine banking datasets are utilised and class
imbalance is artificially created by removing 10% of the defaulting observations from the
training set after each run. The only issue with this process is that the data sets are comparatively small in size (ranging from 125 to 5,397 observations), which leads this author to believe that a process of k-fold cross validation would have been more appropriate given the size of the data sets once a training, validation and test set split is made. However, a merit of this paper is that the findings, at least at the 70:30 class split, are comparable to other studies in the area (e.g. Baesens et al., 2003), with no statistical difference between the techniques at this level. As more class imbalance is induced, it is shown
that logistic regression performs significantly better than Lin-SVM, QDC (Quadratic
Bayes Normal) and k-NN. It is also shown that oversampling produces no overall
improvement to the best performing two-class classifiers. The findings in this paper lead
into the work that will be conducted in this thesis, as several similar techniques and
datasets will be employed, alongside the determination of classifier performance on
imbalanced data sets.
The topic of which good/bad distribution is the most appropriate in classifying a data set
has been discussed in some detail in the machine learning and data mining literature. In
Weiss & Provost (2003) it was found that the naturally occurring class distributions in the twenty-five data sets examined often did not produce the best-performing classifiers.
More specifically, based on the AUC measure (which was preferred over the use of the
error rate), it was shown that the optimal class distribution should contain between 50%
and 90% minority class examples within the training set. Alternatively, a progressive
adaptive sampling strategy for selecting the optimal class distribution is proposed in
Provost et al (1999). Whilst this method of class adjustment can be very effective for
large data sets, with an adequate number of observations in the minority class of
defaulters, in some imbalanced data sets there are only a very small number of loan
defaults to begin with.
Various kinds of techniques have been compared in the literature to try and ascertain the
most effective way of overcoming a large class imbalance. Chawla et al (2002) proposed
a Synthetic Minority Over-sampling technique (SMOTE) which was applied to example
data sets in fraud, telecommunications management, and detection of oil spills in satellite
images. In Japkowicz (2000) over-sampling and downsizing were compared to the
author's own method of “learning by recognition” in order to determine the most effective
technique. The findings, however, were inconclusive but demonstrated that both over-
sampling the minority class and downsizing the majority class can be very effective.
Subsequently Batista (2004) identified ten alternative techniques to deal with class
imbalances and trialled them on thirteen data sets. The techniques chosen included a
variety of under-sampling and over-sampling methods. Findings suggested that generally
over-sampling methods provide more accurate results than under-sampling methods. Also, a combination of either SMOTE (Chawla et al., 2002) and Tomek links, or SMOTE and ENN (a nearest-neighbour cleaning rule), was proposed.
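As an illustration of the over-sampling idea behind SMOTE (Chawla et al., 2002), the following minimal Python sketch generates synthetic minority (defaulter) observations by interpolating between a minority observation and one of its nearest minority neighbours; the function name and parameter defaults are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def smote_like_oversample(X_min, n_synthetic, k=5, seed=0):
    """Minimal SMOTE-style oversampling sketch.

    For each synthetic example, pick a random minority observation,
    find one of its k nearest minority neighbours and interpolate at a
    random point on the line segment between the two observations.
    """
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # distances from observation i to all other minority observations
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Illustrative use: double a minority class of 50 defaulters with 10 features
X_defaulters = np.random.rand(50, 10)
X_augmented = np.vstack([X_defaulters, smote_like_oversample(X_defaulters, 50)])
```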
In Wilde and Jackson (2006), it is shown that probability of default for low-default
portfolios can be calculated based on a re-calibration of the CreditRisk+ (cf. Chapter
2.2.1) model to a model of default behaviour similar to that of a Merton model. The
challenge of data issues, such as scarcity of defaults, in probability default models is
further explored by Dwyer (2006) through the use of Bayesian model validation. A
posterior distribution is derived for PD, providing a framework for finding the upper
bound for a PD in relation to imbalanced credit scoring data sets. The proposed method
allows the determination of when a calibration needs to be recomputed even when a
default rate is within the 95% confidence level. Burgt (2007) looks at the issue of back-testing.
In summary, although work has been conducted into the area of imbalanced credit
scoring data sets there is still potential for more detailed work to be conducted as gaps
still exist e.g. on the modelling level. There is also scope for techniques and
methodologies to be used from the Machine Learning literature and applied in a credit
scoring context where imbalances in data are present.
Loss given default (LGD) is the estimated economic loss, expressed as a percentage of exposure, which will be incurred if an obligor goes into default (in other words, one minus the recovery rate reported in the literature). Producing robust and accurate estimates of potential losses is essential for the efficient allocation of capital within financial organisations and for the pricing of credit derivatives and debt instruments (Jankowitsch et al., 2008). Banks are also in a position to gain a competitive advantage if an improvement can be made to their internally produced loss given default forecasts.
Whilst the modelling of probability of default (PD) (cf. Chapter 2.2.1) has been the
subject of many studies during the past few decades, literature detailing recovery rates
has only emerged more recently. This increase in literature on recovery rates is due to the
advent of the new Basel Capital Accord. A detailed review of how credit risk models
have developed over the last thirty years on corporate bonds can be found in Altman
(2006).
A clear distinction can be made between those models developed for retail credit and
corporate credit facilities. As such this section has been sub-divided into four categories
distinguishing the literature for retail credit (cf. 2.2.2.a), corporate credit (cf. 2.2.2.b),
economic variables (cf. 2.2.2.c) and Downturn LGD (cf. 2.2.2.d).
In the more recent literature on corporate credit, Acharya et al (2007) use an extended set
of data on U.S. defaulted firms between 1982 and 1999 to show that creditors of
defaulted firms recover significantly lower amounts, in present-value terms, when their
particular industry is in distress. They find that not only an economic-downturn effect is
present but also a fire-sales effect, also identified by Shleifer and Vishny (1992). This
fire-sales effect means that creditors recover less if the surviving firms are illiquid. The
main finding of this study is that industry conditions at the time of default are robust and
economically important determinants of creditor recoveries.
An interesting study by Qi and Zhao (2011) compares six statistical approaches to estimating LGD (including regression trees, neural networks and OLS with and without transformations). Their findings suggest that non-parametric methods such as neural networks outperform parametric methods such as OLS in terms of model fit and
predictive accuracy. It is also shown that the observed values for LGD in the corporate
default data set display a bi-modal distribution with focal points around 0 and 1. This
paper is limited however by the use of a single corporate defaults data set of a relatively
small size (3,751 observations). An extension of this study over multiple data sets and
including a variety of additional techniques would therefore add to the validity of the
results.
In Hu and Perraudin (2002), evidence that aggregate quarterly default rates and recovery
rates are negatively correlated is presented. This is achieved through the use of Moody's historical bond market data for the period 1971-2000. Their conclusions suggest that recoveries tend to be low when default rates are high. It is also concluded that typical correlations for post-1982 quarters are -22%, whereas, over the full 1971-2000 period, correlations are typically weaker, i.e. -19%.
Caselli et al (2008) verify the existence of a relation between the loss given default rate
(LGDR) and macroeconomic variables. Using a sizeable number of bank loans (11,649)
concerning the Italian market, several models are tested in which LGD is expressed as a
linear combination of the explanatory variables. They find that for households, LGDR is
more sensitive to the default-to-loan-ratio, the unemployment rate and household
consumption. For small to medium enterprises (SMEs) however, LGDR is influenced to a
great extent by the GDP growth rate and total number of people employed. The
estimation of the model coefficients in this analysis was achieved using a simple OLS
regression model.
In an extension to their prior work, Bellotti and Crook (2009), add macroeconomic
variables to their regression analysis for retail credit cards. The conclusions drawn
indicate that although the data used has limitations in terms of the business cycle, adding
bank interest rates and unemployment levels as macroeconomic variables into an LGD
model yields a better model fit and that these variables are statistically significant
explanatory variables.
Further to the papers discussed in this section, the following additional papers provide
information on other areas of loss given default modelling; Benzschawel et al. (2011);
Jacobs and Karagozoglu (2011); Sigrist and Stahel (2010); Luo and Shevchenko (2010);
Bastos (2010); Hlawatsch and Ostrowski (2010); Li (2010); Chalupka and Kopecsni
(2009).
In this section, a literature review of the current work conducted in the area of EAD is
given. To date, the literature has focused mainly on corporate lending as opposed to retail lending (i.e. consumer credit, e.g. through credit cards), with only more recent studies taking into account the implications for retail lending. We will begin by
identifying these corporate lending studies and go on to look at the current retail lending
literature. Note that, in this thesis, the term Loan Equivalency Factor (LEQ) is used
interchangeably with the term credit conversion factor (CCF) as CCF is referred to as
LEQ in U.S. publications.
More recently, Araten and Jacobs (2001) used six years of data between 1994 and 2000
from Chase bank to calculate values for the LEQ factor. It was found that the estimated LEQs increased the closer the period of time to default and for better risk categories. It was also found that the distribution of the LEQ value had significant
concentrations around the 0 and 1 values, giving a two-peaked distribution.
In the most recent work on corporate lending, Cespedes et al (2010) look at the issue of
modelling wrong way risk in the estimation of an alpha multiplier (the definition of a
portfolio's alpha is: total economic capital divided by the economic capital when
counterparty exposures are equal to expected positive exposure (EPE)). The alpha value
typically ranges from 1.1 for large global portfolios to over 2.5 for new users with
concentrated exposures. Wrong way risk is defined here as the correlation between
exposures and defaults in a given credit portfolio. Their paper gives a computationally
efficient and robust approach to the modelling of the alpha multiplier and stress-testing
wrong way risk. This is achieved through leveraging underlying counterparty potential
future exposure (PFE) simulations that are also used for credit limits and risk
management. An application of the methodology is provided on a realistic bank trading
portfolio with the results indicating that the alpha remains at or below 1.2 for
conservative correlation assumptions. Prior to this, Sufi (2009) looked at the use of credit
lines to corporations. The conclusions drawn show that the flexibility given to firms by
the use of unfunded commitments leads to a moral hazard problem. To tackle this, banks
tend to impose strict agreements, and only lend to borrowers with historically high
profitability.
They do, however, warn that, in general, a single-parameter model, be it a CCF or an uplift factor, is too simplistic. Strong arguments are given throughout this study
indicating that practitioners should take care and apply common sense to their models for
estimating EAD and take advantage of the flexibility offered by the Basel II Accord.
In Qi (2009) borrower and account information for unsecured credit card defaults from a
large US lender are used to calculate and model CCF (referred to as LEQ in the US). The
findings suggest that borrowers' attributes such as credit score, aggregate bankcard
balance, aggregate bankcard credit line utilization rate, number of recent credit inquiries,
and number of open retail accounts are significant drivers of CCF for accounts current
one year prior to default. It is also found that borrowers are more likely to draw down
additional funds as they approach default.
In Valvonis (2008) the issues related to the estimation of EAD and CCF are discussed in
detail as well as the EAD risk drivers (EADRDs). The findings suggest that many issues
pertaining to EAD modelling remain unanswered such as the issue of the stringent
supervisory requirements banks are under in their calculations of EAD. It is also shown
that point densities for the majority of realised CCFs occur around 0 and 1, and it is
suggested that logit or probit regression models could indeed be appropriate here.
In the academic and regulatory literature, on the other hand, there has been little work
done on the estimation of EAD and the appropriate models required. The majority of
work to date has been done on modelling exposure at default (EAD) for defaulted firms.
Most notably, in Jacobs (2008), a variety of explanatory variables are investigated with
various measures of EAD risk derived and compared. Also, a multiple regression model
in the generalised linear class is built. The findings suggest that there is weak evidence
for counter-cyclicality in EAD and utilization has a strong inverse relationship to EAD.
As with Asarnow and Marker (1995), the risk rating is found to have some explanatory
power in the estimation of the EAD. Similarly to Jacobs (2008), other academic work that
has been conducted in this area has also focused on corporate lending as opposed to retail
lending. In Jiménez et al (2009), LEQ factors for revolving commitments in the Spanish
credit register are studied over the period of the last two decades for corporate lending.
The conclusions drawn from this work are that the firms that go into default have much
higher credit line usage rates and EAD values up to five years prior to their default than
non-defaulting facilities. Variations in EAD are also identified due to collateralisation,
maturity and credit line size.
In summary, although more recent studies on EAD modelling have become available for
retail lending, there is a clear need to further develop our understanding of the risk drivers
and appropriate EAD modelling techniques for consumer credit. Hence, in this thesis, we will investigate both using a real-world data set of credit card defaults.
In summary, as shown in this chapter a wide range of modelling work has been
conducted in the field of credit risk modelling, with particular attention paid to that of
probability of default (PD) modelling. Since the advent of the Basel II Capital Accord, however, there is an even greater need for the development of suitable and robust estimation techniques for loss given default (LGD) and exposure at default (EAD), as well as a more comprehensive review of the appropriate techniques to use when a scarcity of defaults is present (imbalanced data sets). It is therefore the focus of this thesis
to provide a better understanding of the classification and regression techniques required
for the prediction of imbalanced credit scoring data sets, LGD and EAD as well as
providing robust statistical results.
In the next chapter, a detailed explanation of each of the techniques applied in this thesis
will be presented.
Chapter 3: Classification and Regression Techniques
This thesis analyses a variety of established and novel classification and regression
techniques in the estimation of the three components of the minimum capital
requirement, PD, LGD and EAD.
Classification is defined as the process of assigning a given piece of input data into one of
a given number of categories. In terms of Probability of Default (PD) modelling,
classification techniques are applied because the purpose of PD modelling is to estimate the likelihood that a loan will not be repaid and will fall into default; this requires the classification of loan applicants into two classes: good payers (those who are likely to keep up with their repayments) and bad payers (those who are likely to
default on their loans). Regression analysis estimates the conditional expectation of a
dependent variable given a linear or non-linear combination of a set of independent
variables. This is therefore appropriate for use in the estimation of Loss Given Default
(LGD) and Exposure at Default (EAD) where the goal is to determine their conditional
expectations given a set of independent variables.
The literature review in this thesis (cf. Chapter 2.1) identified classification and regression techniques that are currently used in, or potentially applicable to, the field of credit risk modelling. This thesis therefore applies the most prevalent of these techniques with the aim of finding the most appropriate ones for the estimation of PD, LGD and EAD.
In this chapter a brief explanation of each of the techniques applied in this thesis is
presented with citations given to their full derivation. (N.B. some of the techniques
described have applications in both classification and regression analysis. Where this is
the case the technique is only described in the classification section.)
where $b_0$ is the intercept parameter and $\mathbf{b}^T$ contains the variable coefficients (Hosmer and Lemeshow, 2000).
The cumulative logit model (see e.g. Walker and Duncan, 1967) is simply an extension of the binary two-class logit model which allows for an ordered discrete outcome with more than 2 levels ($k > 2$):

$$P(\text{class} \leq j) = \frac{1}{1 + e^{-(d_j + b_1 x_1 + b_2 x_2 + \dots + b_n x_n)}}, \qquad j = 1, 2, \dots, k-1, \qquad (3.2)$$

which gives the cumulative probabilities for the occurrence of response levels up to and including the $j$th level of $y$.
The main advantage of logistic regression is that it makes no prior assumptions with regard to the probability distribution of the given attributes.
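To illustrate how the cumulative logit model of equation (3.2) produces class probabilities for given coefficients, a small NumPy sketch is shown below; the intercepts and coefficients are purely illustrative assumptions, and sign conventions differ between software packages.

```python
import numpy as np

def cumulative_logit_probs(x, d, b):
    """Evaluate the cumulative logit model of equation (3.2).

    d : array of k-1 ordered intercepts d_1 <= ... <= d_{k-1}
    b : coefficient vector (b_1, ..., b_n)
    Returns P(class <= j) for j = 1, ..., k-1 and the implied
    individual class probabilities.
    """
    linear = np.asarray(d) + x @ np.asarray(b)             # one linear predictor per threshold
    cum = 1.0 / (1.0 + np.exp(-linear))                    # P(class <= j)
    probs = np.diff(np.concatenate(([0.0], cum, [1.0])))   # P(class = j)
    return cum, probs

# Illustrative example with k = 4 ordered classes and 3 inputs
x = np.array([0.2, -1.0, 0.5])
cum, probs = cumulative_logit_probs(x, d=[-1.0, 0.0, 1.5], b=[0.8, -0.3, 0.4])
```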
the reverse is true. According to Bayes' theorem, these posterior probabilities are given
by:
$$p(y \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y)\, p(y)}{p(\mathbf{x})}. \qquad (3.3)$$

Assuming multivariate normal class-conditional densities with means $\mu_0, \mu_1$ and covariance matrices $\Sigma_0, \Sigma_1$, an observation $\mathbf{x}$ is assigned to class $y = 0$ if:

$$(\mathbf{x}-\mu_0)^T \Sigma_0^{-1} (\mathbf{x}-\mu_0) - (\mathbf{x}-\mu_1)^T \Sigma_1^{-1} (\mathbf{x}-\mu_1) \leq 2\left[\log P(y=0) - \log P(y=1)\right] + \log|\Sigma_1| - \log|\Sigma_0|. \qquad (3.4)$$
Linear discriminant analysis is then obtained if the simplifying assumption is made that
both covariance matrices are equal, i.e. 0 1 , which has the effect of cancelling
out the quadratic terms in the expression above.
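A compact NumPy sketch of the quadratic discriminant rule in equation (3.4) is given below; the function and variable names are illustrative. Setting both covariance matrices equal reduces the rule to the linear discriminant case described above.

```python
import numpy as np

def qda_assign_to_class0(x, mu0, mu1, cov0, cov1, p0, p1):
    """Apply the quadratic discriminant rule of equation (3.4).

    Returns True if observation x is assigned to class y = 0, assuming
    multivariate normal class densities with the given means, covariance
    matrices and prior probabilities p0 and p1.
    """
    inv0, inv1 = np.linalg.inv(cov0), np.linalg.inv(cov1)
    lhs = (x - mu0) @ inv0 @ (x - mu0) - (x - mu1) @ inv1 @ (x - mu1)
    rhs = (2 * (np.log(p0) - np.log(p1))
           + np.log(np.linalg.det(cov1)) - np.log(np.linalg.det(cov0)))
    return lhs <= rhs
```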
Each hidden neuron $i$ applies an activation function $f^{(1)}$ (for example the logistic function) to the weighted inputs and its bias term $b_i^{(1)}$:

$$h_i = f^{(1)}\!\left(b_i^{(1)} + \sum_{j=1}^{n} W_{ij} x_j\right), \qquad (3.5)$$

where $W$ represents a weight matrix in which $W_{ij}$ denotes the weight connecting input $j$ to hidden neuron $i$. For the analysis conducted in this thesis, a binary prediction will be made; hence, for the activation function in the output layer, we will be using the logistic (sigmoid) activation function, $f^{(2)}(x) = \frac{1}{1 + e^{-x}}$, to obtain a response probability:

$$z = f^{(2)}\!\left(b^{(2)} + \sum_{j=1}^{n_h} v_j h_j\right), \qquad (3.6)$$

with $n_h$ the number of hidden neurons and $v$ the weight vector, where $v_j$ represents the weight connecting hidden neuron $j$ to the output neuron. Examples of other transfer functions that are commonly used are the hyperbolic tangent $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ and the linear transfer function $f(x) = x$.
During model estimation, the weights of the network are first randomly initialised and
then iteratively adjusted so as to minimise an objective function, e.g. the sum of squared
errors (possibly accompanied by a regularisation term to prevent over-fitting). This
iterative procedure can be based on simple gradient descent learning or more
sophisticated optimisation methods such as Levenberg-Marquardt or Quasi-Newton. The
number of hidden neurons can be determined through a grid search based on validation
set performance.
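The forward pass of equations (3.5)-(3.6) can be sketched in a few lines of NumPy, as shown below; the random weights merely stand in for weights that would be fitted by one of the optimisation methods mentioned above, and the dimensions chosen are illustrative.

```python
import numpy as np

def mlp_forward(x, W, b1, v, b2):
    """Forward pass of a one-hidden-layer MLP as in (3.5)-(3.6).

    W  : (n_hidden, n_inputs) weight matrix, b1 : hidden-layer biases
    v  : output weights, b2 : output bias
    Both layers use the logistic (sigmoid) activation, returning a
    response probability for the binary target.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W @ x + b1)        # hidden neuron activations, eq. (3.5)
    return sigmoid(v @ h + b2)     # output probability, eq. (3.6)

# Illustrative call with 4 inputs and 3 hidden neurons
rng = np.random.default_rng(1)
p = mlp_forward(rng.random(4), rng.normal(size=(3, 4)), rng.normal(size=3),
                rng.normal(size=3), 0.1)
```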
Support vector machines (SVMs) are a set of powerful supervised learning techniques
used for classification and regression. Their basic principle, when applied as a classifier,
is to construct a maximum-margin separating hyperplane in some transformed feature
space. Rather than requiring one to specify the exact transformation though, they use the
principle of kernel substitution to turn them into a general (non-linear) model. The least
square support vector machine (LS-SVM) proposed by Suykens, et al (2002) is a further
adaptation of Vapnik's original SVM formulation which leads to solving linear KKT
(Karush-Kuhn-Tucker) systems (rather than a more complex quadratic programming
problem). The optimisation problem for the LS-SVM is defined as:
$$\min_{w,b,e}\; J(w, b, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2, \qquad (3.7)$$

subject to the equality constraints $y_i\left[w^T \varphi(\mathbf{x}_i) + b\right] = 1 - e_i$, $i = 1, \dots, N$, where $w$ is the weight vector in the primal space, $\gamma$ is the regularisation parameter, $e_i$ are the error variables and $\varphi(\cdot)$ is the (implicit) feature map associated with a kernel function satisfying the Mercer theorem. The hyper-parameters for the LS-SVM classification technique could, for example, be tuned using 10-fold cross validation.
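For illustration, the following NumPy sketch solves the dual LS-SVM KKT linear system for a linear kernel under the standard Suykens et al. (2002) formulation; in the experiments reported later the LS-SVMlab toolbox is used instead, so this is not the implementation employed in this thesis.

```python
import numpy as np

def lssvm_train_linear(X, y, gamma=1.0):
    """Solve the LS-SVM KKT linear system for a linear kernel (sketch).

    X : (N, n) inputs, y : labels in {-1, +1}, gamma : regularisation
    parameter. Returns the dual coefficients alpha and the bias b.
    """
    N = len(y)
    K = X @ X.T                               # linear kernel matrix
    Omega = np.outer(y, y) * K
    A = np.zeros((N + 1, N + 1))
    A[0, 1:], A[1:, 0] = y, y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return alpha, b

def lssvm_predict(X_train, y_train, alpha, b, X_new):
    """Sign of the dual decision function for new observations."""
    return np.sign((alpha * y_train) @ (X_train @ X_new.T) + b)
```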
Classification and regression trees are decision tree models, for a categorical or
continuous dependent variable, respectively, that recursively partition the original
learning sample into smaller subsamples, so that some impurity criterion $i(\cdot)$ for the resulting node segments is reduced (Breiman et al., 1984). To grow the tree, one typically uses a greedy algorithm that, at each node $t$, evaluates a large set of candidate variable splits so as to find the 'best' split, i.e. the split $s$ that maximises the weighted decrease in impurity:

$$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R), \qquad (3.10)$$

where $p_L$ and $p_R$ denote the proportions of observations associated with node $t$ that are sent to the left child node $t_L$ or right child node $t_R$, respectively. A decision tree consists
of internal nodes that specify tests on individual input variables or attributes that split the
data into smaller subsets, and a series of leaf nodes assigning a class to each of the
observations in the resulting segments. For Chapter 4, we chose the popular decision tree
classifier C4.5, which builds decision trees using the concept of information entropy
(Quinlan, 1993). The entropy of a sample $S$ of classified observations is given by:

$$\mathrm{Entropy}(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0), \qquad (3.11)$$

where $p_1$ and $p_0$ are the proportions of the class values 1 and 0 in the sample $S$,
respectively. C4.5 examines the normalised information gain (entropy difference) that
results from choosing an attribute for splitting the data. The attribute with the highest
normalised information gain is the one used to make the decision. The algorithm then
recurs on the smaller subsets.
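A small Python sketch of the entropy and information gain calculations is given below; note that it computes the plain information gain, whereas C4.5 additionally normalises this gain by the intrinsic information of the split, and the example data are invented.

```python
import numpy as np

def entropy(labels):
    """Entropy of a sample of binary class labels (0/1), as in (3.11)."""
    p1 = np.mean(labels)
    p = np.array([p1, 1 - p1])
    p = p[p > 0]                              # avoid log2(0) for pure samples
    return -np.sum(p * np.log2(p))

def information_gain(labels, split_mask):
    """Entropy reduction achieved by splitting a node with a boolean mask."""
    n, n_left = len(labels), split_mask.sum()
    child = (n_left / n) * entropy(labels[split_mask]) \
          + ((n - n_left) / n) * entropy(labels[~split_mask])
    return entropy(labels) - child

# Illustrative: gain from splitting 10 loans on an arbitrary attribute threshold
y = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
x = np.array([5, 2, 1, 3, 6, 2, 1, 7, 3, 2])
print(information_gain(y, x > 3))
```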
The k-nearest neighbours algorithm (k-NN) classifies a data point by taking a majority
vote of its $k$ most similar data points (Hastie et al., 2001). The similarity measure used in this thesis is the Euclidean distance between two points $\mathbf{x}_i$ and $\mathbf{x}_j$:

$$d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{m=1}^{n} (x_{im} - x_{jm})^2}. \qquad (3.12)$$
One of the major disadvantages of the k-nearest neighbour classifier is its large computational requirement: to classify an object, the distance between it and every object in the training set has to be calculated. Furthermore, when many irrelevant attributes are present, the classification performance may degrade when observations have distant values for these attributes (Baesens, 2003a).
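For completeness, a minimal NumPy sketch of the k-NN majority-vote rule with Euclidean distance is shown below; the tie-breaking behaviour is an illustrative simplification.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=10):
    """Classify x_new by majority vote of its k nearest training points
    under Euclidean distance (sketch of the k-NN rule, binary labels 0/1)."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(d)[:k]
    # majority class among the k neighbours (ties resolved towards class 0 here)
    return int(np.round(np.mean(y_train[nearest])))
```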
Gradient boosting (Friedman, 2001, 2002) is an ensemble algorithm that improves the
accuracy of a predictive function through incremental minimisation of the error term.
After the initial base learner (most commonly a tree) is grown, each tree in the series is fit
to the so-called “pseudo residuals” of the prediction from the earlier trees with the
purpose of reducing the error. The estimated probabilities are adjusted by weight
estimates, and the weight estimates are increased when the previous model misclassified
a response. This leads to the following model:

$$G(\mathbf{x}) = G_0 + \lambda_1 T_1(\mathbf{x}) + \lambda_2 T_2(\mathbf{x}) + \dots + \lambda_u T_u(\mathbf{x}), \qquad (3.13)$$

where $G_0$ equals the first value for the series, $T_1, \dots, T_u$ are the trees fitted to the pseudo-residuals, and $\lambda_i$ are coefficients for the respective tree nodes computed by the Gradient Boosting algorithm. A more detailed explanation of gradient boosting can be found in
Friedman (2001) and Friedman (2002). The meta-parameters which require tuning for a
Gradient Boosting classifier are the number of iterations and the maximum branch used
in the splitting rule. The number of iterations specifies the number of terms in the boosting series; for a binary target, the number of iterations determines the number of trees. The maximum branch parameter determines the maximum number of branches that the splitting rule produces from one node; a suitable value for this parameter is 2, i.e. a binary split.
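The sketch below illustrates the pseudo-residual fitting loop in the spirit of equation (3.13), using scikit-learn regression stumps and a squared-error loss with a fixed shrinkage factor; the SAS Enterprise Miner gradient boosting node used later differs in its loss function and implementation details.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_iterations=100, learning_rate=0.1):
    """Least-squares gradient boosting sketch in the spirit of (3.13).

    Starts from the mean prediction G0 and repeatedly fits a small
    regression tree (a stump, i.e. one binary split) to the
    pseudo-residuals of the current model.
    """
    G0 = np.mean(y)
    pred = np.full(len(y), G0)
    trees = []
    for _ in range(n_iterations):
        residuals = y - pred                       # pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=1)  # one binary split per iteration
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return G0, trees

def gradient_boost_predict(G0, trees, X, learning_rate=0.1):
    return G0 + learning_rate * sum(t.predict(X) for t in trees)
```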
Whereas in the previous section we looked at the proposed classification techniques for
PD modelling, in this section we will detail the proposed regression techniques to be
implemented in the modelling of LGD and EAD. The experiments comprise a selection
of one-stage and two-stage techniques. One-stage techniques can be divided into linear
and non-linear techniques. The linear techniques included in Chapter 5 and 6, model the
(original or transformed) dependent variable as a linear function of the independent
variables whereas non-linear techniques fit a non-linear model to the data set. Two-stage
models are a combination of the aforementioned one-stage models. These either combine
the comprehensibility of an OLS model with the added predictive power of a non-linear
technique, or they use one model to first discriminate between zero- and higher LGDs
and a second model to estimate LGD for the subpopulation of non-zero LGDs.
A regression technique fits a model $y = f(\mathbf{x}) + e$ onto a data set, where $y$ is the dependent variable, $\mathbf{x}$ the vector of independent variables and $e$ the residual error term. The one-stage and two-stage techniques considered for LGD and EAD modelling are listed in TABLE 3.1 below.
Regression Techniques

LGD
  Linear:
    - Ordinary Least Squares (OLS)
    - Ordinary Least Squares with Beta Transformation (B-OLS)
    - Beta Regression (BR)
    - Ordinary Least Squares with Box-Cox Transformation (BC-OLS)
  Non-linear:
    - Regression Trees (RT)
    - Least Square Support Vector Machines (LS-SVM)
    - Neural Networks (NN)
  Log + (non-)linear:
    - Logistic regression + OLS, B-OLS, BR, BC-OLS, RT, LS-SVM or NN
  Linear + non-linear:
    - Ordinary Least Squares + Regression Trees (OLS+RT)
    - Ordinary Least Squares + Least Square Support Vector Machines (OLS+LSSVM)
    - Ordinary Least Squares + Neural Networks (OLS+NN)

EAD
  - Ordinary Least Squares (OLS)
  - Ordinary Least Squares with Beta Transformation (B-OLS)
  - Binary Logistic Regression (LOGIT)
  - Cumulative Logistic Regression (CLOGIT)

TABLE 3.1: Regression techniques used for LGD and EAD modelling
Ordinary least squares regression (Draper & Smith, 1998) is the most common technique to find the optimal parameters $\mathbf{b}^T = (b_0, b_1, b_2, \dots, b_n)$ to fit a linear model to a data set:

$$y = \mathbf{b}^T \mathbf{x}, \qquad (3.14)$$

by minimising the sum of squared residuals:

$$\sum_{i=1}^{l} e_i^2 = \sum_{i=1}^{l} \left(y_i - \mathbf{b}^T \mathbf{x}_i\right)^2. \qquad (3.15)$$

By taking the derivative of this expression and subsequently setting the derivative equal to zero:

$$\sum_{i=1}^{l} \left(y_i - \mathbf{b}^T \mathbf{x}_i\right) \mathbf{x}_i^T = 0, \qquad (3.16)$$

the familiar closed-form least squares estimator is obtained:

$$\mathbf{b} = \left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}, \qquad (3.17)$$

with $\mathbf{X}^T = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_l)$ and $\mathbf{y} = (y_1, y_2, \dots, y_l)^T$.
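The closed-form estimator of equation (3.17) can be computed directly in NumPy, as the short sketch below shows; the simulated data are purely illustrative.

```python
import numpy as np

def ols_fit(X, y):
    """Closed-form OLS estimate of equation (3.17), with an intercept column."""
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend intercept term b0
    b = np.linalg.solve(X1.T @ X1, X1.T @ y)     # solves (X'X) b = X'y
    return b

# Illustrative fit on simulated data with 3 explanatory variables
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + 0.1 * rng.normal(size=100)
print(ols_fit(X, y))
```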
Whereas OLS regression generally assumes normality of the dependent variable $y$, the empirical distribution of LGD can often be approximated more accurately by a Beta distribution (Gupton & Stein, 2002). Assuming that $y$ is constrained to the open interval $(0, 1)$, the cumulative distribution function (CDF) of a Beta distribution is given by:

$$B(y; a, b) = \int_{0}^{y} \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, v^{a-1} (1-v)^{b-1}\, dv, \qquad (3.18)$$

where $\Gamma(\cdot)$ denotes the well-known Gamma function, and $a$ and $b$ are two shape parameters, which can be estimated from the sample mean $\mu$ and variance $\sigma^2$ using the method of moments, i.e.:

$$a = \mu\left(\frac{\mu(1-\mu)}{\sigma^2} - 1\right); \qquad b = a\left(\frac{1}{\mu} - 1\right). \qquad (3.19)$$

A potential solution to improve model fit therefore is to estimate an OLS model for a transformed dependent variable $y_i^{*} = N^{-1}\!\left(B(y_i; a, b)\right)$ $(i = 1, \dots, l)$, in which $N^{-1}(\cdot)$ denotes the inverse of the standard normal CDF. The predictions of the OLS model are then transformed back through the standard normal CDF and the inverse of the fitted Beta CDF to get the actual LGD estimates.
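A minimal Python sketch of the B-OLS transformation pipeline is given below, using SciPy's Beta and normal distributions; it assumes the observed LGD values lie strictly inside (0, 1), so in practice boundary values would first be nudged into the open interval.

```python
import numpy as np
from scipy.stats import beta, norm

def beta_params_mom(y):
    """Method-of-moments shape parameters of equation (3.19)."""
    m, v = np.mean(y), np.var(y)
    a = m * (m * (1 - m) / v - 1)
    b = a * (1 / m - 1)
    return a, b

def beta_transform(y, a, b):
    """y* = N^{-1}(B(y; a, b)), the transformed target for B-OLS."""
    return norm.ppf(beta.cdf(y, a, b))

def beta_back_transform(y_star, a, b):
    """Map OLS predictions on the transformed scale back to LGD estimates."""
    return beta.ppf(norm.cdf(y_star), a, b)
```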
It can be easily shown that the first parameter, $\mu$, is indeed the mean of a $B(a, b)$-distributed variable, whereas $\sigma^2 = \frac{\mu(1-\mu)}{1+\phi}$, so for a fixed $\mu$, the variance (dispersion) increases with smaller $\phi$.
Two link functions mapping the unbounded input space of the linear predictor into the
required value range for both parameters are then chosen, viz. the logit link function for
the location parameter (as its value must be squeezed into the open unit interval) and a
log function for the precision parameter (which must be strictly positive), resulting in the
following sub models:
$$\mu_i = E(y_i \mid \mathbf{x}_i) = \frac{e^{\mathbf{b}^T \mathbf{x}_i}}{1 + e^{\mathbf{b}^T \mathbf{x}_i}}, \qquad \phi_i = e^{\mathbf{d}^T \mathbf{x}_i}. \qquad (3.21)$$
This particular parameterisation offers the advantage of producing more intuitive variable
coefficients (as the two rows of coefficients, bT and dT , provide an indication of the
effect on the estimate itself and its precision, respectively). By further selecting which
variables to include in (or exclude from) the second sub model, one can explicitly model
heteroskedasticity. The resulting log-likelihood function is then used to compute
maximum-likelihood estimators for all model parameters.
The aim of the family of Box-Cox transformations (Box & Cox, 1964) is to make the
residuals of the regression model more homoskedastic and closer to a normal distribution.
The Box-Cox transformation on the dependent variable $y_i$ takes the form

$$y_i^{(\lambda)} = \begin{cases} \dfrac{(y_i + c)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\[1ex] \log(y_i + c) & \text{if } \lambda = 0, \end{cases} \qquad (3.22)$$

with power parameter $\lambda$ and shift parameter $c$. If needed, the value of $c$ can be set to a non-
zero value to rescale y so that it becomes strictly positive. After a model is built on the
transformed dependent variable using OLS, the predicted values can be transformed back
to their original value range.
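A short Python sketch of the transformation in equation (3.22) and its inverse is shown below; the function names are illustrative and, in practice, the power parameter would be chosen by maximum likelihood or a grid search.

```python
import numpy as np

def boxcox_transform(y, lam, c=0.0):
    """Box-Cox transformation of equation (3.22) with shift parameter c."""
    if lam == 0:
        return np.log(y + c)
    return ((y + c) ** lam - 1) / lam

def boxcox_inverse(z, lam, c=0.0):
    """Map predictions on the transformed scale back to the original range."""
    if lam == 0:
        return np.exp(z) - c
    return (lam * z + 1) ** (1 / lam) - c
```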
In Section 3.1.5 we looked at the application of decision trees for classification problems.
Decision trees can also be used for regression analysis where they are designed to
approximate real-valued functions as opposed to performing a classification task. A commonly applied impurity measure $i(t)$ for regression trees is the mean squared error or variance for the subset of observations falling into node $t$. Alternatively, a split may be chosen
based on the p-value of an ANOVA F-test comparing between-sample variances against
within-sample variances for the subsamples associated with its respective child nodes
(ProbF criterion).
Section 3.1.3 details the implementation of Neural Networks for classification problems.
In terms of regression, Neural Networks produce an output value by feeding inputs
through a network whose subsequent nodes apply some chosen activation function to a
weighted sum of incoming values. The type of NN used in Chapter 5 of this thesis is the
popular multilayer perceptron (MLP).
Section 3.1.4 details the implementation of Least Square Support Vector Machines for
classification problems. In terms of regression, Least Square Support Vector Machines
implicitly map the input space to a kernel-induced high-dimensional feature space in
which a linear relationship is fitted.
Techniques such as Neural Networks and Support Vector Machines are often seen as
“black box” techniques meaning that the model obtained is not understandable in terms
of physical parameters. This is an obvious issue when applying these techniques to a
credit risk modelling scenario where physical parameters are required. To solve this issue
we propose the use of a two-stage approach to combine the good comprehensibility of
OLS with the predictive power of a non-linear regression technique (Van Gestel, et al
2005). In the first stage, an ordinary least squares regression model is built:
$$y = \mathbf{b}^T \mathbf{x} + e. \qquad (3.23)$$

In the second stage, the residuals of this linear model,

$$e = f(\mathbf{x}) + e^{*}, \qquad (3.24)$$

are estimated with a non-linear regression model $f(\mathbf{x})$ in order to further improve the predictive ability of the model. Doing so, the model takes the following form:

$$y = \mathbf{b}^T \mathbf{x} + f(\mathbf{x}) + e^{*}, \qquad (3.25)$$

where $e^{*}$ are the new residuals from estimating $e$. A combination of OLS with RT, LS-SVM and NN is assessed in this thesis.
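As a sketch of this two-stage idea, the Python code below combines an OLS first stage with a regression tree fitted to the OLS residuals, following equations (3.23)-(3.25); a regression tree is used here purely for illustration, whereas the thesis also considers LS-SVM and NN second stages.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def two_stage_fit(X, y, max_depth=4):
    """Two-stage OLS + regression tree sketch of equations (3.23)-(3.25).

    Stage 1 fits a linear OLS model; stage 2 fits a regression tree to
    the OLS residuals, so the final prediction is b'x + f(x).
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    b = np.linalg.lstsq(X1, y, rcond=None)[0]                # stage 1: OLS coefficients
    residuals = y - X1 @ b                                   # e = y - b'x
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)  # stage 2
    return b, tree

def two_stage_predict(b, tree, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ b + tree.predict(X)                          # eq. (3.25)
```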
The LGD distribution is often characterised by a large peak around $LGD = 0$. This non-normal distribution can lead to inaccurate regression models. This proposed two-stage technique attempts to resolve this issue by modelling the peak separately from the rest. Therefore, the first stage of this two-stage model consists of a logistic regression to estimate whether $LGD = 0$ or $LGD > 0$.
In a second stage the mean of the observed values of the peak is used as prediction in the
first case and a one-stage (non)linear regression model is used to provide a prediction in
the second case. The latter is trained on part of the data set, i.e. those observations that
have an $LGD > 0$. More specifically, a logistic regression results in an estimate of the probability $P$ of being in the peak:

$$P = \frac{1}{1 + e^{-\mathbf{b}^T \mathbf{x}}}, \qquad (3.26)$$

with $1 - P$ as the probability of not being in the peak. An estimate for LGD is then obtained by:

$$y = P \cdot \bar{y}_{peak} + (1 - P) \cdot f(\mathbf{x}) + e, \qquad (3.27)$$

where $\bar{y}_{peak}$ is the mean of the observed values in the peak and $f(\mathbf{x})$ is the estimate produced by a one-stage (non)linear regression model, built only on those observations that are not in
the peak. A combination of logistic regression with all aforementioned one-stage techniques, as described above, is assessed in this thesis (Matuszyk et al., 2010).
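The following Python sketch illustrates the two-stage peak model of equations (3.26)-(3.27), with a logistic regression first stage and, purely for illustration, a regression tree second stage fitted on the non-peak observations; the peak is defined here as LGD exactly equal to zero, which is a simplifying assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor

def peak_two_stage_fit(X, lgd):
    """Two-stage 'peak' model sketch for LGD, following (3.26)-(3.27).

    Stage 1: logistic regression for the probability of being in the
    peak (LGD = 0). Stage 2: a regression model fitted on the non-peak
    observations only (a regression tree is used here for illustration).
    """
    in_peak = (lgd == 0).astype(int)
    stage1 = LogisticRegression(max_iter=1000).fit(X, in_peak)
    y_peak = lgd[in_peak == 1].mean()        # mean of the peak (zero by construction here)
    stage2 = DecisionTreeRegressor(max_depth=4).fit(X[in_peak == 0], lgd[in_peak == 0])
    return stage1, y_peak, stage2

def peak_two_stage_predict(stage1, y_peak, stage2, X):
    p_peak = stage1.predict_proba(X)[:, 1]                        # P of being in the peak
    return p_peak * y_peak + (1 - p_peak) * stage2.predict(X)     # eq. (3.27)
```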
Chapter 4: Building default prediction models for imbalanced credit scoring data sets
In this chapter, we set out to compare several techniques that can be used in the analysis
of imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets
occur as the number of defaulting loans in a data set is usually much lower than the
number of observations that do not default.
However, some techniques may not be able to cope adequately with these imbalanced data sets; therefore, the objective is to compare the performance of a variety of techniques over differing class distributions. As well as evaluating traditional classification
techniques such as logistic regression, neural networks and decision trees, this chapter
will also explore the suitability of gradient boosting, least square support vector machines
and random forests for loan default prediction. These particular techniques have been
selected due to either their proven ability within the credit scoring domain (cf. TABLE 1.1) or their similar applications in other fields which can be transferred to a credit scoring context (cf. the literature review in Chapter 2). The purpose of this study is to compare
widely used credit scoring techniques against novel machine learning techniques to
identify whether any improvement can be made over traditional techniques when a class
imbalance is present.
Five real-world credit scoring data sets have been adapted to mimic imbalanced data sets
and are used to build classifiers and test their performance. In our experiments, we
progressively increase class imbalance in each of these data sets by randomly under-
sampling the minority class of defaulters, so as to identify to what extent the predictive
power of the respective techniques is adversely affected.
The performance criterion chosen to measure this effect is the area under the receiver
operating characteristic curve (AUC); Friedman's statistic and Nemenyi post-hoc tests are
used to test for significance of AUC differences between techniques.
The results from this empirical study indicate that the Random Forest and Gradient
Boosting classifiers perform very well in a credit scoring context and are able to cope
comparatively well with pronounced class imbalances in these data sets. We also find
that, when faced with a large class imbalance, the support vector machines and quadratic
discriminant analysis perform significantly worse than the best performing classifiers.
The remainder of this chapter is organised as follows. Section 4.2 gives an overview of the examined classification techniques (a more detailed explanation of each of the techniques used in this chapter can be found in Chapter 3). Section 4.3 details the empirical set-up, the data sets used and the criteria used for comparing classification performance. In Section 4.4, the results of our experiments are presented and discussed. Finally, Section 4.5 gives the conclusions that can be drawn from the study and outlines recommendations for further research.
4.1 Introduction
The characteristics of the data sets used in evaluating the performance of the
aforementioned classification techniques are given below in TABLE 4.2. (The
independent variables available in each data set are presented in APPENDIX A1 at the
end of this thesis). The Bene1 and Bene2 data sets were obtained from two major
financial institutions in the Benelux region. For these two data sets, a bad customer was
defined as someone who had missed three consecutive months of payments. The German
credit data set and the Australian Credit data set are publicly available at the UCI
repository (http://kdd.ics.uci.edu/). The Behav data set was also acquired from a Benelux
institution. As the data sets used vary in size from 547 to 7,190 observations, and will be further reduced by under-sampling the bad observations to create larger class imbalances, a process of 10-fold cross validation will be applied to the full data set.
* Altered data set class distribution, Bene1 original distribution was 66.6% good observations, 33.3% bad
observations, Austr original distribution was 55.5% good observations, 44.5% bad observations and the
Behav original distribution was 80% good observations, 20% bad observations.
In order for the percentage reduction in the bad observations in each data set to be compared on a like-for-like basis, the Bene1 set, the Australian credit set and the Behavioural Scoring set have first been altered to give a 70/30 class distribution. This was done by either under-
sampling the bad observations (from a total of 1041 bad observations in the Bene1 data
set, only 892 observations have been used; and from a total of 307 bad observations in
the Australian credit data set, only 164 observations have been used) or under-sampling
the good observations in the behavioural scoring data set, (from a total of 1436 good
observations, only 838 observations have been used).
For this empirical study, the class of defaulters in each of the data sets was artificially
reduced, by a factor of 5% up to 95%, so as to create a larger difference in class
distribution. As a result of this reduction, six data sets were created from each of the five
original data sets. For this empirical study our focus is on the performance of
classification techniques on data sets with a large class imbalance. Therefore detailed
results will only be presented for the data set with the original 70/30 split, as a
benchmark, and data sets with 85%, 90% and 95% splits. By doing so, it is possible to
identify whether techniques are adversely affected in the prediction of the target variable
when there is a substantially lower number of observations in one of the classes. The
performance criterion chosen to measure this effect is the area under the receiver operating characteristic curve (AUC) statistic, as proposed by Baesens et al. (2003).
The diagonal line represents the trade-off between the sensitivity and (1-specificity) for a
random model, and has an AUC of 0.5. For a well performing classifier the ROC curve
needs to be as far to the top left-hand corner as possible. In the example shown in
FIGURE 4.1, the classifier that performs the best is that corresponding to the ROC1
curve.
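As a small illustration of the AUC measure, the snippet below scores an invented set of predicted default probabilities against their labels using scikit-learn; a random ranking would score around 0.5 and a perfect ranking 1.0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative AUC calculation for predicted default probabilities;
# labels: 1 = bad (defaulter), 0 = good.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.15, 0.9, 0.05, 0.35])
print(roc_auc_score(y_true, y_score))
```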
For each of the techniques applied in this study a 10-fold cross validation (CV) method
was applied during the modelling stage to add validity to the techniques built on the
imbalanced data sets. The number of folds was selected as 10 due to the computational
time for each of the different techniques over each of the data set splits. Although we
would prefer a larger number of folds to reduce the bias of the true error rate estimator, 10 was deemed sufficiently large for this empirical study. For the following techniques:
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)
The data transformation node is required to create a random segmentation ID for the data
for the k-fold groups to be used as cross validation indicators in the group processing
loop. The formulation used to compute this is displayed in FIGURE 4.3:
(For the Random Forests and C4.5 techniques, a 10-fold cross validation approach was also applied using the cross-validation option in Weka. A 10-fold cross validation approach was likewise applied for the LS-SVM classifier using the LS-SVMlab toolbox in Matlab.)
Each classifier is then trained k times (k=10), using nine folds for training purposes and the remaining fold for evaluation (validation). A performance estimate for the classifier can then be determined by averaging the 10 validation estimates obtained from the 10 runs of the cross validation. As mentioned in Kohavi (1995), common values used for k are 5 and 10 (we select 10 in this study). Cross validation is often used to assess the performance of classification techniques on small data sets, due to the loss of potential data in the modelling process with a training/test set split; hence, cross validation has been chosen in this instance (Baesens, 2003a).
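The fold-assignment and averaging procedure can be sketched as follows, with logistic regression standing in as an illustrative classifier; the random fold ID plays the role of the segmentation ID described above, and for heavily imbalanced data a stratified assignment would be preferable to guarantee defaulters in every fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, k=10, seed=0):
    """10-fold cross-validated AUC sketch (X, y are NumPy arrays).

    A random fold ID per observation plays the role of the segmentation
    ID created in the data transformation step; the classifier is trained
    k times and the k validation AUC estimates are averaged.
    """
    rng = np.random.default_rng(seed)
    fold_id = rng.integers(0, k, size=len(y))      # random segmentation ID per observation
    aucs = []
    for fold in range(k):
        train, valid = fold_id != fold, fold_id == fold
        model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        aucs.append(roc_auc_score(y[valid], model.predict_proba(X[valid])[:, 1]))
    return np.mean(aucs)                            # average of the k validation estimates
```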
The linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and
logistic regression (LOG) classification techniques require no parameter tuning. The
LOG model was built in SAS using proc logistic and using a stepwise variable selection
method. Both the LDA and QDA techniques were run in SAS using proc discrim. Before
all the techniques were run, dummy variables were created for the categorical variables.
The AUC statistic was computed using the ROC macro by De Long et al (1988), which is
available from the SAS website (http://support.sas.com/kb/25/017.html).
For the LS-SVM classifier, a linear kernel was chosen and a grid search mechanism was
used to tune the hyper-parameters. For the LS-SVM, the LS-SVMlab Matlab toolbox
developed by Suykens et al (2002) was used.
The NN classifiers were trained after selecting the best performing number of hidden
neurons based on a validation set. The neural networks were trained in SAS Enterprise
Miner using a logistic hidden and target layer activation function with the remaining EM
default architecture in place (i.e. Weight Decay equal to 0, Normal randomisation
distribution for random initial weights and perturbations).
The confidence level for the pruning strategy of C4.5 was varied from 0.01 to 0.5, and the
most appropriate value was selected for each data set based on validation set
performance. The tree was built using the Weka (Witten & Frank, 2005) package.
Two parameters have to be set for the Random Forests technique: these are the number of
trees and the number of attributes used to grow each tree. A range of [10, 50, 100, 250,
500, 1000] trees has been assessed, as well as three different settings for the number of randomly selected attributes per tree, [0.5, 1, 2]·√n, whereby n denotes the number of attributes within the respective data set (Breiman, 2001). As with the C4.5 algorithm,
Random Forests were also trained in Weka (Witten & Frank, 2005), using 10-fold cross-
validation for tuning the parameters.
The k-Nearest Neighbours technique was applied for both k=10 and k=100, using the
Weka (Witten & Frank, 2005) IBk classifier. These values of k have been selected due to
their previous use in the literature (e.g. Baesens et al 2003, Chatterjee & Barcun, 1970,
West, 2000). For the Gradient Boosting classifier a partitioning algorithm was used as
proposed by Friedman (2001). The number of iterations was varied in the range [10, 50,
100, 250, 500, 1000], with a maximum branch size of two selected for the splitting rule
(Friedman, 2001). The gradient boosting node in SAS Enterprise Miner was used to run
this technique.
We used Friedman's test (Friedman, 1940) to compare the AUCs of the different
classifiers. The Friedman test statistic is based on the average ranked (AR) performances
of the classification techniques on each data set, and is calculated as follows:
$$\chi_F^2 = \frac{12D}{K(K+1)} \left[\sum_{j=1}^{K} AR_j^2 - \frac{K(K+1)^2}{4}\right], \qquad \text{where } AR_j = \frac{1}{D} \sum_{i=1}^{D} r_i^j. \qquad (4.1)$$
In (4.1), D denotes the number of data sets used in the study, K is the total number of classifiers and $r_i^j$ is the rank of classifier j on data set i. $\chi_F^2$ is distributed according to the chi-squared distribution with K - 1 degrees of freedom.
The post-hoc Nemenyi test (Nemenyi, 1963) is applied to report any significant
differences between individual classifiers. The Nemenyi post-hoc test states that the
performances of two or more classifiers are significantly different if their average ranks
differ by at least the critical difference (CD), given by:
$$CD = q_{\alpha, \infty, K} \sqrt{\frac{K(K+1)}{12D}}. \qquad (4.2)$$

In this formula, the value $q_{\alpha, \infty, K}$ is based on the studentized range statistic (Nemenyi, 1963).
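A compact Python sketch of equations (4.1) and (4.2) is given below; it ranks classifiers by AUC within each data set (ignoring ties, which the full procedure would average) and takes the studentized range value q as a tabulated input.

```python
import numpy as np
from scipy.stats import chi2

def friedman_statistic(auc):
    """Friedman test statistic of equation (4.1).

    auc : (D, K) array of AUCs for D data sets and K classifiers.
    Higher AUC is better, so ranks are computed on -auc (rank 1 = best);
    ties are not averaged in this sketch.
    """
    D, K = auc.shape
    ranks = np.argsort(np.argsort(-auc, axis=1), axis=1) + 1
    AR = ranks.mean(axis=0)                               # average rank per classifier
    stat = 12 * D / (K * (K + 1)) * (np.sum(AR ** 2) - K * (K + 1) ** 2 / 4)
    p_value = chi2.sf(stat, K - 1)
    return stat, p_value, AR

def nemenyi_cd(q_alpha, K, D):
    """Critical difference of equation (4.2); q_alpha is the tabulated
    studentized range value q_{alpha, inf, K}."""
    return q_alpha * np.sqrt(K * (K + 1) / (12 * D))
```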
Finally, the results from Friedman's statistic and the Nemenyi post-hoc tests are displayed
using a modified version of Demšar's (Demšar, 2006) significance diagrams (Lessmann
et al., 2008). These diagrams display the ranked performances of the classification
techniques along with the critical difference to clearly show any techniques which are
significantly different to the best performing classifiers.
The table on the following page (TABLE 4.3) reports the AUCs of all ten classifiers on the five credit scoring data sets at varying degrees of class imbalance (calculated by averaging the 10 validation estimates obtained from the 10 runs of the cross validation for each classifier). For each level of imbalance, the Friedman test statistic and
corresponding p-value is shown. As these were all significant (p<0.005), the post-hoc
Nemenyi test procedure was then applied to each class distribution. The technique
achieving the highest AUC on each data set is underlined as well as the overall highest
ranked technique. TABLE 4.3 shows that the gradient boosting algorithm has the highest
Friedman score (average rank (AR)) on two of the five different percentage class splits.
However at the extreme class split (95% good, 5% bad) Random Forests provides the
best average ranking across the five data sets. (N.B. example residual plots for the
Gradient Boosting and Random Forest classifiers are located in Appendix A2 of this
thesis).
In the majority of the class splits, the AR of the QDA and Lin LS-SVM classifiers is statistically worse than the AR of the Random Forests classifier at the 5% critical difference level (α = 0.05), as shown in the significance diagrams included next. Note that, even though the differences between the classifiers are small, in a credit scoring context an increase in discrimination ability of even a fraction of a percent may translate into significant future savings (Henley & Hand, 1997).
The following significance diagrams display the AUC performance ranks of the
classifiers, along with Nemenyi's critical difference (CD) tail. The CD value for all the
following diagrams is equal to 6.06. Each diagram shows the classification techniques
listed in ascending order of ranked performance on the y-axis, and the classifier's mean
rank across all five data sets displayed on the x-axis. Two vertical dashed lines have been
inserted to clearly identify the end of the best performing classifier‟s tail and the start of
the next significantly different classifier.
The first significance diagram (see FIGURE 4.4) displays the average rank of the
classifiers at the original class distribution of a 70% good, 30% bad split:
FIGURE 4.4: Significance diagram of the classifiers' mean AUC ranks across the five data sets at the original 70% good, 30% bad class split.
At this original 70/30 percentage split, the Linear LS-SVM is the best performing
classification technique with an AR value of 1.4. This diagram clearly shows that the k-
NN10, QDA and C4.5 techniques perform significantly worse than the best performing
classifier with values of 7.5, 8.4 and 9.0 respectively.
The following significance diagram displays the average rank of the classifiers at an 85%
good, 15% bad class split:
FIGURE 4.5: Significance diagram of the classifiers' mean AUC ranks across the five data sets at the 85% good, 15% bad class split.
At the level where only 15% of the observations in the data sets are bad, the significance diagram shows that Gradient Boosting becomes the best performing classifier (see FIGURE 4.5). The Gradient Boosting classifier performs significantly better than the quadratic discriminant analysis (QDA) classifier. From these findings we can make a preliminary assumption that, when a larger class imbalance is present, the QDA classifier remains significantly worse than the Gradient Boosting classifier. None of the other techniques are significantly different from the best performing classifier.
FIGURE 4.6: Significance diagram of the classifiers' mean AUC ranks across the five data sets at the 90% good, 10% bad class split.
At a 90% good, 10% bad class split, the significance diagram shown in FIGURE 4.6 indicates that the Lin LS-SVM and QDA algorithms are significantly worse than the Random Forests classifier. It can be noted that the Linear LS-SVM classifier progressively becomes less powerful as the class imbalance increases.
FIGURE 4.7: Significance diagram of the classifiers' mean AUC ranks across the five data sets at the 95% good, 5% bad class split.
At a 95% good, 5% bad class split, the significance diagram shown in FIGURE 4.7 indicates that the Linear LS-SVM and QDA classifiers are now significantly worse than the Random Forests classifier. This indicates that, as with the previous class split (FIGURE 4.6), the Linear LS-SVM classifier progressively becomes less powerful as the class imbalance increases.
In summary, when considering the AUC performance measures, it can be concluded that
the gradient boosting and random forest classifiers yield a very good performance at
extreme levels of class imbalance, whereas the Lin LS-SVM sees a reduction in
performance as a larger class imbalance is introduced. However, the simpler, linear
classification techniques such as LDA and LOG also give a relatively good performance,
which is not significantly different from that of the gradient boosting and random forest
classifiers. This finding seems to confirm the suggestion made in Baesens et al. (2003) that most credit scoring data sets are only weakly non-linear. The findings presented in this study show that, whereas in Yao (2009) SVMs were shown to be the best performing classifier, at large class imbalances SVMs lose their predictive capabilities. Therefore the findings presented in this chapter agree with past analyses (Baesens et al., 2003; Yao, 2009; Kennedy et al., 2011), but with the caveat that, as a larger class imbalance is present, some techniques, in particular SVMs, do not perform as well. It is
also shown here that techniques such as QDA, C4.5 and k-NN10 perform significantly worse than the best performing classifiers at the various percentage class splits. The majority
of classification techniques yielded classification performances that are quite competitive
with each other.
In this comparative study we have looked at a number of credit scoring techniques, and
studied their performance over various class distributions in five real-life credit data sets.
Two techniques that have yet to be fully researched in the context of credit scoring, i.e.
Gradient Boosting and Random Forests, were also chosen to give a broader review of the
techniques available. The classification power of these techniques was assessed based on
the area under the receiver operating characteristic curve (AUC). Friedman's test and
Nemenyi's post-hoc tests were then applied to determine whether the differences between
the average ranked performances of the AUCs were statistically significant. Finally, these
significance results were visualised using significance diagrams for each of the various
class distributions analysed.
The results of these experiments show that the Gradient Boosting and Random Forest
classifiers performed well in dealing with samples where a large class imbalance was
present. It does appear that in extreme cases the ability of random forests and gradient boosting to concentrate on 'local' features in the imbalanced data is useful. The most
commonly used credit scoring techniques, linear discriminant analysis (LDA) and logistic
regression (LOG), gave results that were reasonably competitive with the more complex
techniques and this competitive performance continued even when the samples became
much more imbalanced. This would suggest that the currently most popular approaches
are fairly robust to imbalanced class sizes. On the other hand, techniques such as QDA
and C4.5 were significantly worse than the best performing classifiers. It can also be
concluded that the use of a linear kernel LS-SVM would not be beneficial in the scoring
of data sets where a very large class imbalance exists.
Further work that could be conducted, as a result of these findings, would be to firstly
consider a stacking approach to classification through the combination of multiple
techniques. Such an approach would allow a meta-learner to pick the best model to
classify an observation. Secondly, another interesting extension to the research would be
to apply these techniques on much larger data sets which display a wider variety of class
distributions. It would also be of interest to look into the effect not only of the percentage class distribution but also of the actual number of observations in a data set.
Finally, as stated in the literature review chapter (cf. Chapter 2) of this thesis, several oversampling approaches have already been researched to deal with large class imbalances. Further research into these techniques and their effect on credit scoring model performance would be beneficial.
Chapter 5: Estimation of Loss Given Default (LGD)
As stated in Chapter 1, the recent introduction of the Basel II framework has had a huge
impact on financial institutions, allowing them to build credit risk models for three key
risk parameters: PD (Probability of Default), LGD (Loss Given Default) and EAD
(Exposure at Default). To date, credit risk research has largely focused on the estimation and validation of the PD parameter. However, changes in LGD directly affect the capital of a financial institution in a linear way, unlike changes in PD, which have less of an effect on minimum capital requirements. The use of models that estimate LGD as accurately as possible is thus of crucial importance, as accurate estimates can translate into significant future savings.
In this chapter the estimation of LGD is analysed through the implementation of various
state-of-the-art regression techniques to model and predict LGD. These include one-stage
models, such as those built by ordinary least squares, beta regression, artificial neural
networks, support vector machines and regression trees, as well as two-stage models
which attempt to combine the benefits of multiple techniques. In total 17 regression
techniques are evaluated and compared using 6 real-life retail lending data sets from
major international banking institutions. These particular techniques have been selected due to either their proven ability to model LGD (e.g. OLS) or their similar applications in other fields which can be transferred to a credit risk modelling context (cf. the literature review in Chapter 2). The purpose of this study is to compare the widely used OLS model with a range of alternative, state-of-the-art regression techniques.
It is found that much of the variance in LGD remains unexplained, as the average predictive performance of the models in terms of $R^2$ ranges from 4% to 43%. Nonetheless, a trend can be observed that non-linear techniques, in particular artificial neural networks and support vector machines, yield consistently higher predictive performance over all data sets than more traditional linear techniques. Also, two-stage models built by a combination of linear and non-linear techniques are shown to have similarly good predictive power, while they offer the added advantage of having a comprehensible linear model component.¹
The remainder of this chapter is organised as follows. Section 5.2 gives an overview of the examined regression techniques (a more detailed explanation of each of the techniques used in this chapter can be found in Chapter 3). Section 5.3 details several performance metrics for the evaluation and comparison of the regression models listed in the previous section. Section 5.4 details the data sets used and the experimental set-up implemented in this study. The penultimate section, 5.5, presents the experimental results from this study, and finally section 5.6 concludes this chapter.
¹ Nota bene: a larger version of this study was conducted as a collaborative study with University College Ghent. Only the work contributed by the author of this thesis is presented in this chapter, except for the LS-SVM calculations, which were conducted by my colleague Gert Loterman.
5.1 Introduction
A detailed background and introduction to the topic of Loss Given Default (LGD) along
with motivations for the work can be found in Chapter 1 of this thesis.
This study comprises both one-stage and two-stage techniques. One stage techniques can
be divided into linear and non-linear techniques. Linear techniques model the dependent
variable as a linear function of the independent variables while non-linear techniques fit a
non-linear model to a data set. Two stage models are a combination of the
aforementioned one-stage models.
The regression techniques used in this chapter comprise both linear and non-linear
techniques, and combinations of the two. A full description of these techniques can be
found in Chapter 3.
Regression Techniques
Linear
Ordinary Least Squares (OLS)
Ordinary least squares regression (Draper & Smith, 1998) is the most common technique
to find optimal parameters to fit a linear model to a data set. OLS estimation produces a
linear regression model that minimises the sum of squared residuals for the data set.
Beta Regression (BR)
Beta regression techniques produce a generalised linear model variant that allows for a dependent variable that is beta-distributed conditional on the input variables.
Non-linear
Regression trees (RT)
Regression tree algorithms, sometimes referred to as classification and regression trees (CART) (Breiman et al., 1984), produce a decision tree for the dependent variable by recursively partitioning the input space based on a splitting criterion, e.g. the weighted reduction in within-node variance.
Log + (non-)linear
LOG+OLS, B-OLS, BC-OLS & BR
This class of two-stage (mixture) modelling approaches (Matuszyk et al., 2010) uses logistic regression (see e.g. Hosmer & Lemeshow, 2000) to first estimate the probability of LGD ending up in the peak at 0 (i.e. LGD = 0) or to the right of it (i.e. LGD > 0). A second-stage (non-)linear regression model is built using only the observations for which LGD > 0. An LGD estimate is then produced by weighting the average LGD in the peak and the estimate produced by the second-stage model by their respective probabilities.
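A minimal Python sketch of this mixture idea is given below, using scikit-learn with OLS as the second stage; the helper names and the use of LGD = 0 to define the peak are assumptions for illustration and do not reproduce the exact implementation of this study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def fit_log_ols(X, lgd):
    """Stage 1: P(LGD in the peak at 0); stage 2: OLS on the observations with LGD > 0."""
    in_peak = lgd <= 0
    stage1 = LogisticRegression(max_iter=1000).fit(X, in_peak)
    stage2 = LinearRegression().fit(X[~in_peak], lgd[~in_peak])
    peak_mean = float(lgd[in_peak].mean())   # average LGD in the peak
    return stage1, stage2, peak_mean

def predict_log_ols(stage1, stage2, peak_mean, X):
    # weight the peak average and the second-stage estimate by their probabilities
    p_peak = stage1.predict_proba(X)[:, list(stage1.classes_).index(True)]
    return p_peak * peak_mean + (1.0 - p_peak) * stage2.predict(X)
```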
Linear + non-linear
OLS+RT, OLS+LSSVM & OLS+ANN
The purpose of this two-stage technique (Van Gestel et al., 2005) is to combine the good comprehensibility of OLS with the predictive power of a non-linear regression technique. In a first stage, a linear model is built using OLS. In a second stage, the residuals of this linear model are estimated with a non-linear regression model. This estimate for the residual is then added to the OLS estimate to obtain a more accurate prediction for LGD. Although the concept of a two-stage approach has been used before (Van Gestel et al., 2005), it was only applied with an SVM model. This study therefore also contributes the findings of RT and ANN two-stage applications.
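The residual-correction idea can be sketched as follows, here with a regression tree as the non-linear component; scikit-learn is used and the tree depth is an assumption for illustration only.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def fit_ols_plus_rt(X, lgd, max_depth=4):
    """Stage 1: comprehensible OLS model; stage 2: tree fitted on the OLS residuals."""
    ols = LinearRegression().fit(X, lgd)
    residuals = lgd - ols.predict(X)
    rt = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    return ols, rt

def predict_ols_plus_rt(ols, rt, X):
    # final LGD estimate = linear estimate + non-linear correction of its residual
    return ols.predict(X) + rt.predict(X)
```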
TABLE 5.1: List of regression techniques
Performance metrics evaluate to which degree the predictions $f(x_i)$ differ from the observations $y_i$ of the dependent variable, LGD. Each of the following metrics, listed in TABLE 5.2, has its own method to express the predictive performance of a model as a quantitative value. The second and third columns of the table show the metric values for respectively the worst and best possible prediction performance². The final column shows whether the metric measures calibration or discrimination (Van Gestel & Baesens, 2009). Calibration indicates how close the predicted values are to the observed values, whereas discrimination refers to the ability to provide an ordinal ranking of the dependent variable considered. A good ranking does not necessarily imply a good calibration.
² Note that the $R^2$ measure defined here could possibly lie outside the [0,1] interval when applied to non-OLS models. Although alternative generalised goodness-of-fit measures have been put forward for evaluating various non-linear models (see e.g. Nagelkerke, 1991), the measure defined in TABLE 5.2 has the advantage that it is widely used and can be calculated for all techniques.
RMSE (see for example, Draper & Smith, 1998) is defined as the square root of the average of the squared differences between predictions and observations:
$$\mathrm{RMSE} = \sqrt{\frac{1}{l}\sum_{i=1}^{l}\left(f(x_i) - y_i\right)^2} \qquad (5.1)$$
RMSE has the same units as the dependent variable being predicted. Since residuals are squared, this metric heavily weights outliers. The smaller the value of RMSE the better the prediction, with 0 being a perfect prediction.
MAE (see for example, Draper & Smith, 1998) is given by the averaged absolute differences of predicted and observed values:
$$\mathrm{MAE} = \frac{1}{l}\sum_{i=1}^{l}\left|\,f(x_i) - y_i\,\right| \qquad (5.2)$$
Just like RMSE, MAE has the same unit scale as the dependent variable being predicted.
Unlike RMSE, MAE is not that sensitive to outliers. The metric is bound between the
maximum absolute error and 0 (perfect prediction).
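For reference, a direct NumPy transcription of (5.1) and (5.2) is sketched below; the function and argument names are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, equation (5.1)."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, equation (5.2)."""
    return float(np.mean(np.abs(y_pred - y_true)))
```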
ROC curves are normally used for the assessment of binary classification techniques (see
for example, Fawcett, 2006). It is however used in this context to measure how good the
regression technique is in distinguishing high values from low values of the dependent
variable. To build the ROC curve, the observed values are first classified into high and
low classes using the mean y of the training set as reference.
REC curves (Bi & Bennett, 2003) generalise ROC curves for regression. The REC curve plots the error tolerance on the x-axis versus the percentage of points predicted within the tolerance (or accuracy) on the y-axis (FIGURE 5.1). The resulting curve estimates the cumulative distribution function of the squared error. The area over the REC curve (AOC) is an estimate of the predictive power of the technique. The metric is bound between 0 (perfect prediction) and the maximum squared error.
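One simple way to obtain the AOC numerically is sketched below: build the empirical REC curve from the sorted squared errors and integrate the area above it. This is a sketch under the squared-error loss used here; the function names are illustrative.

```python
import numpy as np

def rec_curve(y_true, y_pred):
    """Tolerance values and accuracies of the empirical squared-error REC curve."""
    errors = np.sort((y_true - y_pred) ** 2)
    tolerances = np.concatenate(([0.0], errors))
    accuracy = np.arange(len(errors) + 1) / len(errors)
    return tolerances, accuracy

def aoc(y_true, y_pred):
    """Area over the REC curve: 0 for a perfect prediction, bounded by the max squared error."""
    tolerances, accuracy = rec_curve(y_true, y_pred)
    return float(np.trapz(1.0 - accuracy, tolerances))
```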
The coefficient of determination $R^2$ (see for example, Draper & Smith, 1998) can be defined as 1 minus the fraction of the residual sum of squares to the total sum of squares:
$$R^2 = 1 - \frac{SS_{err}}{SS_{tot}} \qquad (5.3)$$
where $SS_{err} = \sum_{i=1}^{l}\left(y_i - f(x_i)\right)^2$, $SS_{tot} = \sum_{i=1}^{l}\left(y_i - \bar{y}\right)^2$ and $\bar{y}$ is the mean of the observed values. Since the second term in the formula can be seen as the fraction of unexplained variance, $R^2$ can be interpreted as the fraction of explained variance. Although $R^2$ is usually expressed as a number on a scale from 0 to 1, $R^2$ can yield negative values when the model predictions are worse than using the mean $\bar{y}$ from the training set as the prediction. Although alternative generalised goodness-of-fit measures have been put forward for evaluating various non-linear models (see e.g. Nagelkerke, 1991), $R^2$ has the advantage that it is widely used and can be calculated for all techniques.
Pearson's r (see e.g. Cohen et al., 2002) is defined as the sum of the products of the standard scores of the observed and predicted values divided by the degrees of freedom:
$$r = \frac{1}{l-1}\sum_{i=1}^{l}\left(\frac{y_i - \bar{y}}{s_y}\right)\left(\frac{f(x_i) - \bar{f}}{s_f}\right) \qquad (5.4)$$
with $\bar{y}$ and $\bar{f}$ the means and $s_y$ and $s_f$ the standard deviations of respectively the observations and predictions. Pearson's r can take values between -1 (perfect negative correlation) and +1 (perfect positive correlation), with 0 meaning no correlation at all.
Spearman's $\rho$ (see e.g. Cohen et al., 2002) is defined as Pearson's r applied to the rankings of predicted and observed values. If there are no (or very few) tied ranks, however, it is common to use the equivalent formula:
$$\rho = 1 - \frac{6\sum_{i=1}^{l} d_i^2}{l\left(l^2 - 1\right)} \qquad (5.5)$$
where $d_i$ is the difference between the ranks of observed and predicted values. Spearman's $\rho$ can take values between -1 (perfect negative correlation) and +1 (perfect positive correlation), with 0 meaning no correlation at all.
Kendall's $\tau$ (see e.g. Cohen et al., 2002) measures the degree of correspondence between observed and predicted values. In other words, it measures the association of cross tabulations:
$$\tau = \frac{n_c - n_d}{\tfrac{1}{2}\,l\,(l-1)} \qquad (5.6)$$
where $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs. A pair of observations $(i, k)$ is said to be concordant when there is no tie in either the observed or the predicted values and the ordering of the observed values agrees with that of the predicted values; otherwise the pair is discordant. Kendall's $\tau$ can take values between -1 (perfect negative correlation) and +1 (perfect positive correlation), with 0 meaning no correlation is present.
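In practice all three correlation measures, (5.4)-(5.6), can be obtained directly from SciPy, as in the sketch below (function name is illustrative).

```python
from scipy import stats

def correlation_metrics(y_true, y_pred):
    """Pearson's r (5.4), Spearman's rho (5.5) and Kendall's tau (5.6)."""
    r, _ = stats.pearsonr(y_true, y_pred)
    rho, _ = stats.spearmanr(y_true, y_pred)
    tau, _ = stats.kendalltau(y_true, y_pred)
    return r, rho, tau
```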
In this section the characteristics of the data sets are described as well as the experimental
benchmarking framework to assess the predictive performance of the regression
techniques. Further, a description of a technique's parameter setting and tuning is given
where required.
TABLE 5.3 displays the characteristics of 6 real-life lending LGD data sets from a series
of financial institutions, each of which contains loan-level data about defaulted loans and
their resulting losses. The number of data set entries varies from a few thousand to just
under 120,000 observations. The number of available input variables ranges from 12 to
44. The types of loan data set included are personal loans, corporate loans, revolving
credit and mortgage loans. The empirical distribution of LGD values observed in each of
the data sets is displayed in FIGURE 5.2. Note that the LGD distribution in consumer
lending often contains one or two spikes around LGD = 0 (in which case there was a full recovery) and/or LGD = 1 (no recovery). Also, a number of data sets include some LGD
values that are negative (e.g., because of penalties paid, gains in collateral sales, etc.) or
larger than 1 (e.g., due to additional collection costs incurred); in other data sets, values
outside the unit interval were truncated to 0 or 1 by the banks themselves. Importantly
LGD does not display a normal distribution in any of these data sets.
First, each data set is randomly shuffled and divided into a two-thirds training set and a one-third test set. The training set is used to build the models while the test set is solely used to assess the predictive performance of these models. Where required, continuous independent variables are standardised with the sample mean and standard deviation of the training set, nominal independent variables are encoded with dummy variables and ordinal independent variables are encoded with thermometer variables.
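A minimal pandas sketch of this preprocessing step is given below; the column lists, the ordinal level ordering and the function name are assumptions passed in by the caller for illustration only.

```python
import pandas as pd

def split_and_encode(data, continuous, nominal, ordinal_levels, seed=0):
    """Shuffle, split 2/3-1/3, standardise with training statistics, dummy-code
    nominal inputs and thermometer-code ordinal inputs (one indicator per level reached)."""
    data = data.sample(frac=1, random_state=seed).reset_index(drop=True)
    cut = 2 * len(data) // 3
    train, test = data.iloc[:cut].copy(), data.iloc[cut:].copy()

    # standardise continuous inputs with the *training* mean and standard deviation
    mu, sd = train[continuous].mean(), train[continuous].std()
    train[continuous] = (train[continuous] - mu) / sd
    test[continuous] = (test[continuous] - mu) / sd

    # dummy-code nominal inputs, aligning test columns to the training layout
    train = pd.get_dummies(train, columns=nominal)
    test = pd.get_dummies(test, columns=nominal).reindex(columns=train.columns, fill_value=0)

    # thermometer-code ordinal inputs
    for col, levels in ordinal_levels.items():
        for df in (train, test):
            codes = df[col].map({lvl: i for i, lvl in enumerate(levels)})
            for i, lvl in enumerate(levels[1:], start=1):
                df[f"{col}_ge_{lvl}"] = (codes >= i).astype(int)
        train, test = train.drop(columns=[col]), test.drop(columns=[col])
    return train, test
```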
An input selection method is used to remove irrelevant and redundant variables from the
data set, with the aim of improving the performance of regression techniques. For this, a
stepwise selection method is applied for building the linear models (APPENDIX A3). For
computational efficiency reasons, an $R^2$-based filter method (Freund & Littell, 2000) is
applied prior to building the non-linear models (APPENDIX A4).
After building the models, the predictive performance on each data set is measured on the test set by comparing the predictions and observations according to several performance
metrics. Next, an average ranking of techniques over all data sets is generated per
performance metric as well as a meta-ranking of techniques over all data sets and all
performance metrics.
Finally, the regression techniques are statistically compared with each other (Demsar,
2006). A Friedman test (Friedman, 1940) is performed to test the null hypothesis that all
regression techniques perform alike according to a specific performance metric, i.e.,
performance differences would just be due to random chance. A more detailed summary
and the applied formulas can be found in the previous chapter (cf. Chapter 4.3.4).
During model building, several techniques require parameters to be set or tuned. This
section describes how these are set or tuned where appropriate.
For the regression trees, the combination of splitting criterion and pruning criterion was selected based on the mean squared error on the validation set.
An RBF kernel
$$K(x, x_i) = e^{-\frac{\left\| x - x_i \right\|^2}{2\sigma^2}} \qquad (5.7)$$
with kernel parameter $\sigma$ is used here because of its good overall performance for LSSVM classifiers (Baesens, et al. 2000). The hyperparameters $\sigma$ and $\gamma$ for LSSVM
regression are tuned with 10-fold cross validation on the training data set. A grid search
evaluates all possible combinations of parameters within the search space in order to find
a possible optimal combination that minimises the mean squared error. The limits of the grid for the kernel parameter $\sigma$ are set to $\left[0.5\,l,\ 500\,l\right]$ and the limits of the grid for the regularisation parameter $\gamma$ are set to $\left[\frac{0.01}{n},\ \frac{1000}{n}\right]$ (Van Gestel, et al. 2003).
Estimating the LSSVM hyper parameters this way can be a computational burden. To
tune the hyper parameters, a sample from the complete training data set is chosen as
follows. First, 100 random subsets of 4000 observations are chosen. Next, the LGD
distribution histogram of each subset is compared with the LGD distribution histogram of
the complete training set, and the subset that best approximates the original set based on
the mean squared error, is chosen.
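This tuning-sample selection can be sketched in NumPy as follows; the number of histogram bins is an assumption, as the chapter does not specify it, and the input is assumed to be a NumPy array of training LGD values.

```python
import numpy as np

def tuning_subset(lgd_train, n_subsets=100, subset_size=4000, bins=25, seed=0):
    """Return the indices of the random subset whose LGD histogram is closest,
    in mean squared error, to the histogram of the complete training set."""
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(lgd_train, bins=bins)
    full_hist, _ = np.histogram(lgd_train, bins=edges, density=True)
    best_idx, best_mse = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(len(lgd_train), size=subset_size, replace=False)
        hist, _ = np.histogram(lgd_train[idx], bins=edges, density=True)
        mse = float(np.mean((hist - full_hist) ** 2))
        if mse < best_mse:
            best_idx, best_mse = idx, mse
    return best_idx
```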
For the artificial neural networks, the network weights are randomly initialised and then iteratively adjusted so as to minimise the mean squared error. The
choice of activation function and number of hidden neurons is selected based on the mean
squared error on the validation set. The hidden layer activation function is set as logistic.
TABLES 5.4 to 5.9 contain the performance results obtained for all techniques on the 6
respective data sets. The best performing model according to each metric is underlined.
FIGURE 5.3 displays a series of box plots for the observed distributions of performance values for the metrics AUC, $R^2$, r and $\rho$. Similar trends can be observed across all
metrics. Note that differences in type of data set, number of observations and available
independent variables are the likely causes of the observed variability of actual
performance levels between the 6 different data sets.
Although all performance metrics listed above are useful measures in their own right, it is common to use the coefficient of determination $R^2$ to compare model performance, since $R^2$ measures calibration and can be compared meaningfully across different data sets. As shown in FIGURE 5.3, the average $R^2$ of the models varies from about 4% to 43%. In other words, the variance in LGD that can be explained by the independent variables is consistently below 50%, implying that most of the variance cannot be explained even with the best models. Note that although $R^2$ usually is a number on a scale from 0 to 1, $R^2$ can yield negative values for non-OLS models when the model predictions are worse than always using the mean from the training set as prediction.
FIGURE 5.3: Comparison of predictive performances across 6 real-life retail lending data sets
The linear models that incorporate some form of transformation to the dependent variable
(i.e. B-OLS, BR, BC-OLS) are shown to perform consistently worse than OLS, despite
the fact that these approaches are specifically designed to cope with the violation of the
OLS normality assumption. This suggests that they too have difficulties dealing with the pronounced point densities observed in LGD data sets, while they may be less efficient than OLS, or they could introduce model bias if a transformation is performed prior to OLS estimation (as is the case for B-OLS and BC-OLS).
Perhaps the most striking result is that, in contrast with prior benchmarking studies on
classification models for PD (Baesens, et al. 2003), non-linear models such as LSSVM
and ANN significantly outperform most linear models in the prediction of LGD. This
implies that the relation between LGD and the independent variables in the data sets is
non-linear (as is most apparent on data set BANK3, see TABLE 5.6). Also, LSSVM and
ANN generally perform better than RT. However, LSSVM and ANN result in black-box
models while RT have the ability to produce comprehensible white-box models. To
circumvent this disadvantage, one could try to obtain an interpretation for a well-
performing black-box model by applying rule extraction techniques (Martens, et al. 2007,
Martens, et al. 2009).
In contrast with the previous two-stage model, a clear trend can be observed for the combination of a linear and a non-linear model (the OLS+RT, OLS+LSSVM and OLS+ANN models). By estimating the error residual
of an OLS model using a non-linear technique, the prediction performance tends to
increase to somewhere very near the level of the corresponding one-stage non-linear
technique. What makes these two-stage models attractive is that they have the advantage
of combing the high prediction performance of non-linear regression with the
comprehensibility of a linear regression component. Note that this modelling method has
also been successfully applied in a PD modelling context (Van Gestel et al., 2005; Van Gestel et al., 2006; Van Gestel et al., 2007).
The average ranking over all data sets according to each performance metric is listed in
columns 2 to 9 of TABLE 5.10. The best performing technique for each metric is
underlined, and techniques that perform significantly worse than the best performing technique for that metric according to Nemenyi's post-hoc test ($\alpha = 0.05$) are in italics.
The last column illustrates the meta-ranking (MR) as the average ranking (AR) over all
data sets and over all metrics. The techniques in the table are sorted according to their
meta-ranking. Additionally, columns 10 and 11 cover the meta-ranking only including
respectively calibration and discrimination metrics. The best performing techniques are
consistently ranked in the top according to each metric, no matter whether they measure
calibration or discrimination.
The results of the Friedman test and subsequent Nemenyi post-hoc test with significance level $\alpha = 0.05$ can be intuitively visualised using Demšar's significance diagrams (Demšar, 2006). FIGURES 5.4-5.11 display the Demšar significance diagrams
for all metric ranks across all 6 data sets. The diagrams display the performance rank of
each technique along with a line segment representing its corresponding critical
difference (CD = 10.08).
A detailed description of the diagrammatic setup can be found in the previous chapter (cf.
Chapter 4.5).
| Rank | Technique | MAE | RMSE | AUC | AOC | R² | r | ρ | τ | MR (calibration) | MR (discrimination) | MR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LSSVM | 7.5 | 3.5 | 3.3 | 3.5 | 3.5 | 3.7 | 3.3 | 4.1 | 4.5 | 3.6 | 4.1 |
| 2 | ANN | 3.2 | 2.8 | 5.0 | 2.5 | 2.7 | 3.1 | 7.0 | 7.1 | 2.8 | 5.5 | 4.2 |
| 3 | OLS+LS-SVM | 7.5 | 4.2 | 3.5 | 3.9 | 4.2 | 4.5 | 4.3 | 4.7 | 4.9 | 4.3 | 4.6 |
| 4 | LOG+ANN | 6.0 | 4.2 | 6.8 | 4.1 | 4.2 | 4.2 | 6.3 | 6.5 | 4.6 | 6.0 | 5.3 |
| 5 | OLS+ANN | 9.0 | 6.5 | 3.6 | 4.7 | 4.3 | 4.3 | 6.2 | 6.3 | 6.1 | 5.1 | 5.6 |
| 6 | LOG+LS-SVM | 7.5 | 6.4 | 4.6 | 6.4 | 6.5 | 5.2 | 5.2 | 4.9 | 6.7 | 5.0 | 5.8 |
| 7 | OLS+RT | 6.8 | 4.3 | 5.3 | 6.0 | 6.0 | 6.2 | 7.0 | 7.7 | 5.8 | 6.5 | 6.2 |
| 8 | RT | 8.6 | 7.0 | 12.9 | 7.4 | 7.0 | 7.8 | 7.3 | 4.7 | 7.5 | 8.2 | 7.8 |
| 9 | LOG+RT | 9.7 | 9.4 | 10.0 | 9.4 | 9.3 | 9.3 | 9.3 | 9.2 | 9.5 | 9.5 | 9.5 |
| 10 | LOG+OLS | 12.6 | 10.3 | 10.7 | 9.8 | 9.9 | 11.7 | 11.7 | 11.8 | 10.6 | 11.5 | 11.1 |
| 11 | LOG+B-OLS | 6.5 | 12.0 | 11.8 | 12.0 | 12.0 | 11.2 | 13.1 | 13.3 | 10.6 | 12.3 | 11.5 |
| 12 | OLS | 13.9 | 10.6 | 9.3 | 10.5 | 10.5 | 11.8 | 12.8 | 13.3 | 11.4 | 11.8 | 11.6 |
| 13 | B-OLS | 7.8 | 14.3 | 10.3 | 14.8 | 14.0 | 13.8 | 11.0 | 11.3 | 12.7 | 11.6 | 12.2 |
| 14 | LOG+BC-OLS | 8.2 | 14.1 | 13.0 | 14.1 | 13.7 | 13.0 | 11.8 | 11.8 | 12.5 | 12.4 | 12.5 |
| 15 | BC-OLS | 9.8 | 15.7 | 13.8 | 15.6 | 15.3 | 14.7 | 10.7 | 11.0 | 14.1 | 12.5 | 13.3 |
| 16 | BR | 14.5 | 12.9 | 14.1 | 13.3 | 15.3 | 14.5 | 11.8 | 11.5 | 14.0 | 13.0 | 13.5 |
| 17 | LOG+BR | 13.9 | 14.9 | 14.9 | 15.0 | 14.6 | 14.2 | 14.2 | 13.8 | 14.6 | 14.3 | 14.4 |
TABLE 5.10: Average rankings (AR) and meta-rankings (MR) across all metrics and data sets
FIGURE 5.4: Demsar's significance diagram for MAE based ranks across 6 data sets
FIGURE 5.5: Demsar's significance diagram for RMSE based ranks across 6 data sets
FIGURE 5.6: Demsar's significance diagram for AUC based ranks across 6 data sets
FIGURE 5.7: Demsar's significance diagram for AOC based ranks across 6 data sets
FIGURE 5.8: Demsar's significance diagram for $R^2$-based ranks across 6 data sets
FIGURE 5.9: Demsar's significance diagram for r based ranks across 6 data sets
FIGURE 5.10: Demsar's significance diagram for $\rho$-based ranks across 6 data sets
FIGURE 5.11: Demsar's significance diagram for $\tau$-based ranks across 6 data sets
This chapter evaluates the estimation of LGD through the use of 17 regression techniques on 6 real-life retail lending data sets from major international banking institutions. The average predictive performance of the models in terms of $R^2$ ranges from 4% to 43%, which indicates that most resulting models do not have satisfactory explanatory power.
Nonetheless, a clear trend can be seen that non-linear techniques such as artificial neural
networks and support vector machines in particular give higher performances than more
traditional linear techniques. This indicates the presence of non-linear interactions
between the independent variables and the LGD, contrary to some studies in PD
modelling (Baesens, et al. 2003) where the difference between linear and non-linear
techniques is not that explicit. Given the fact that LGD has a bigger impact on the
minimal capital requirements than PD, we demonstrated the potential and importance of
applying non-linear techniques, preferably in a two-stage context to obtain
comprehensibility as well, for LGD modelling. The findings presented in this chapter also
go some way in agreeing with the findings presented in Qi and Zhao (2011), where it was
shown that non-parametric techniques such as regression trees and neural networks gave
improved model fit and predictive accuracy over parametric methods.
There is considerable evidence that the macro-economy affects the client‟s credit risk
behaviour so it might be an interesting topic of further research to examine the influence
of macro-economic variables (Bellotti & Crook, 2009), both in the context of improving LGD models and for stress testing. Finally, one could also try to add comprehensibility to
well-performing black box models with rule extraction techniques to gain more insight
(Martens, et al. 2007, Martens, et al. 2009).
Chapter 6: Regression Model Development for Credit Card Exposure At Default (EAD)
Under the Basel II requirements for the advanced internal ratings based approach (AIRB)
banks must estimate and empirically validate their own models for Probability of Default
(PD), Loss Given Default (LGD) and Exposure at Default (EAD). However, to date, the
majority of academic literature has focused on the estimation and validation of PD and
LGD models, with little work conducted on EAD modelling. In this chapter, we develop
and compute a series of models for predicting Exposure At Default (EAD). For off-
balance-sheet items, for example credit cards, to calculate the EAD one requires the
committed but unused loan amount times a credit conversion factor (CCF). Ordinary least
squares (OLS), binary logit and cumulative logit regression models, as well as an OLS
with Beta transformation model, are estimated and compared with the main aim of
finding the most robust and comprehensible model for the prediction of the CCF. Finally
a direct estimation of EAD, using an OLS model, will be analysed.
A real-life data set with monthly balance amounts for clients over the period 2001-2004
will be used in the building and testing of the regression models. Parameter estimates and
comparative statistics will be given to determine the best overall model. The findings
from this study indicate that a marginal improvement in the coefficient of determination
can be achieved with the use of a binary logit model over a traditional OLS model in the
estimation of the CCF. It is also concluded that although the predictive power of the CCF
is relatively weak across all of the models employed, when this predicted value is applied
to the EAD formulation to predict the actual EAD value, the predictive power is fairly
strong. Interestingly the use of a direct estimation of EAD shows an increase in predictive
power over first estimating a CCF and applying the CCF to the formulation.
6.1 Introduction
A detailed background and introduction to the topic of Exposure at Default (EAD) along
with motivations for the work can be found in Chapter 1 of this thesis.
The purpose of this chapter will be to look at the estimation and validation of this credit
conversion factor (CCF) in order to correctly estimate the off-balance sheet EAD. We
also aim to gain a better understanding of the variables that drive the prediction of the
CCF for consumer credit. To achieve this, predictive variables that have previously been
suggested in the literature (Moral, 2006) will be constructed, along with a combination of
new and potentially significant variables. We also aim to identify whether an
improvement in predictive power can be achieved over ordinary least squares regression
by the use of binary logit and cumulative logit regression models, as well as an OLS with
Beta transformation model. The reason why we propose these two logit models is that
recent studies (e.g. Jacobs, 2008) have shown that the CCF exhibits a bi-modal
distribution with two peaks around 0 and 1, and a relatively flat distribution between
those peaks. This non-normal distribution is therefore less suitable for modelling with
traditional ordinary least squares (OLS) regression. The motivation for using an OLS
with Beta transformation model is that it accounts for a range of distributions including a
U-shaped distribution. We will also trial a direct OLS estimation of the EAD and use it as
a comparison to estimating a CCF and applying it to the EAD formulation.
The purpose of this experimental setup is to extend the current literature and to better
inform practitioners as to the potential techniques that can be applied in the estimation of
CCF and the resulting EAD. It also aims to explore the practicalities of using OLS
models for estimating the bi-modal distribution displayed by CCF and the potential of
binning this distribution for the use of logistic and cumulative logistic regression models.
The remainder of this chapter is organised as follows. Section 6.2 outlines the proposed
regression techniques that will be used in the estimation of the CCF. Section 6.3 details
the empirical set up and data set used. Section 6.4 highlights the results of the regression
techniques in the estimation of the CCF. Finally, section 6.5 details the conclusions and
recommendations that can be drawn from the results of the empirical study.
In describing the techniques implemented for the estimation of the CCF value, the dependent variable y (i.e. the value of the CCF) for observation i is represented as $y_i$.
The data set used was obtained from a major financial institution in the UK and contains monthly data on credit card usage over the period January 2001 – December 2004. Here, we define a default to have occurred on a credit card when a charge off has
been made on that account (a charge off in this case is defined as the declaration by the
creditor that an amount of debt is unlikely to be collected, declared at the point of 180
days or 6 months without payment). In order to calculate the CCF value, the original data
set has been split into two twelve-month cohorts, with the first cohort running from
November 2002 to October 2003 and the second cohort from November 2003 to October
2004. The cohort approach groups defaulted facilities into discrete calendar periods, in
this case 12-month periods, according to the date of default. Information is then collected
regarding risk factors and drawn/undrawn amounts at the beginning of the calendar
period and drawn amount at the date of default. We have chosen the cohorts to begin in
November and end in October as we wanted to reduce the effects of any seasonality on
the calculation of the CCF.
The characteristics of the cohorts used in evaluating the performance of the regression
models are given below in TABLE 6.1:
COHORT1 will be used to train the regression models, while COHORT2 will be used to
test the performance of the model (out-of-time testing).
Both data sets contain variables detailing the type of defaulted credit card product and the
following monthly variables: advised credit limit, current balance, the number of days
delinquent and the behavioural score.
The following variables suggested in Moral (2006) were then computed based on the monthly data found in each of the cohorts, where $t_d$ is the default date and $t_r$ is the reference date (i.e. the start of the cohort):
- Committed amount, $L(t_r)$: the advised credit limit at the start of the cohort;
- Undrawn amount, $L(t_r) - E(t_r)$: the advised limit minus the exposure at the start of the cohort;
- Credit percentage usage, $E(t_r)/L(t_r)$: the exposure at the start of the cohort divided by the advised credit limit at the start of the cohort;
- Time to default, $(t_d - t_r)$: the default date minus the reference date (in months);
- Rating class, $R(t_r)$: the behavioural score at the start of the cohort, binned into four discrete categories (1: AAA-A; 2: BBB-B; 3: C; 4: UR (unrated));
- Credit conversion factor, $CCF_i$: calculated as the actual EAD minus the drawn amount at the start of the cohort, divided by the advised credit limit at the start of the cohort minus the drawn amount at the start of the cohort, i.e.:

$$CCF_i = \frac{E(t_d) - E(t_r)}{L(t_r) - E(t_r)}. \qquad (6.1)$$
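For a single facility, the construction of these variables can be sketched as follows; the argument names mirror the notation above and are illustrative, not the field names of the original data set.

```python
def reference_date_variables(E_tr, L_tr, E_td, t_r, t_d):
    """Reference-date variables from the list above and the CCF of (6.1)."""
    committed = L_tr                      # advised credit limit at the reference date
    undrawn = L_tr - E_tr                 # committed but unused amount
    credit_usage = E_tr / L_tr            # credit percentage usage
    time_to_default = t_d - t_r           # in months
    ccf = (E_td - E_tr) / (L_tr - E_tr)   # credit conversion factor (6.1)
    return committed, undrawn, credit_usage, time_to_default, ccf
```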
The potential predictiveness of all the variables proposed in this chapter will be evaluated by calculating the information value (IV) based on their ability to separate the CCF value into either of two classes, $0: CCF \le \overline{CCF}$ (non-event) and $1: CCF > \overline{CCF}$ (event). After binning input variables using an entropy-based procedure, implemented in SAS Enterprise Miner, the information value of a variable with k bins is given by:
$$IV = \sum_{i=1}^{k}\left(\frac{n_1(i)}{N_1} - \frac{n_0(i)}{N_0}\right)\ln\left(\frac{n_1(i)/N_1}{n_0(i)/N_0}\right), \qquad (6.2)$$
where $n_0(i)$, $n_1(i)$ denote the number of non-events and events in bin i, and $N_0$, $N_1$ are the total number of non-events and events in the data set, respectively.
This measure allows us to do a preliminary screening of the relative potential
contribution of each variable in the prediction of the CCF.
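A pandas sketch of (6.2) for an already-binned variable is given below. The event definition (CCF above its mean) mirrors the dichotomisation described above and is an assumption; bins with zero events or non-events would need smoothing in practice.

```python
import numpy as np
import pandas as pd

def information_value(binned_variable, ccf):
    """Information value (6.2) of a binned input; the event is CCF above the mean CCF."""
    event = (np.asarray(ccf) > np.mean(ccf)).astype(int)
    df = pd.DataFrame({"bin": binned_variable, "event": event})
    counts = df.groupby("bin")["event"].agg(n1="sum", n="count")
    p1 = counts["n1"] / counts["n1"].sum()                        # events per bin / total events
    p0 = (counts["n"] - counts["n1"]) / (counts["n"] - counts["n1"]).sum()
    return float(np.sum((p1 - p0) * np.log(p1 / p0)))
```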
The distribution of the raw CCF for the first Cohort (COHORT1) is shown below in
FIGURE 6.1:
FIGURE 6.1: Raw CCF distribution (x-axis displays a snapshot of the CCF values from the period of
-9 to 10)
The raw CCF displays a substantial peak around 0 and a slight peak at 1, with substantial tails either side of these points. (FIGURE 6.1 displays a snapshot of CCF values in the period -9 to 10. This snapshot boundary has been selected to allow for the visualisation of the CCF distribution.) Values of CCF > 1 can occur when the actual EAD is greater than the advised credit limit, whereas values of CCF < 0 can occur when both the drawn amount and the EAD exceed the advised credit limit or where the EAD is smaller than the drawn amount. In practice this occurs because the advised credit limit and drawn amount are measured at a time period, $t_r$, prior to default, and therefore at $t_d$ the advised credit limit may be higher or lower than at $t_r$. Extremely large positive and negative
values of the CCF can also occur if the drawn amount is only slightly above or below the advised credit limit, so that the undrawn amount in the denominator of (6.1) is close to zero. As in Jacobs (2008) and Qi (2009), it therefore seems reasonable to winsorise the data so that the CCF can only fall between values of 0 and 1. FIGURE 6.2 displays the same CCF values winsorised at 0 and 1:
FIGURE 6.2: CCF distribution winsorised at 0 and 1
The winsorised CCF (FIGURE 6.2) yields a bimodal distribution with peaks at 0 and 1,
and a relatively flat distribution between the two peaks. This bears a strong resemblance
to the distributions identified in loss given default modelling (LGD) (Thomas et al,
2010). In our estimation of the CCF we will be using this limited CCF between 0 and 1,
similarly to Jacobs (2008).
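In code, winsorising at fixed bounds amounts to clipping the raw values, e.g. (a sketch; ccf_raw is a hypothetical array of raw CCF values):

```python
import numpy as np

ccf_raw = np.array([-3.2, 0.0, 0.45, 1.0, 7.8])   # hypothetical raw CCF values
ccf_winsorised = np.clip(ccf_raw, 0.0, 1.0)       # limit the CCF to the [0, 1] interval
```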
The OLS, B-OLS, LOGIT and CLOGIT models were estimated using SAS. Each model
was built on the first Cohort data set (COHORT1) and then tested on the second Cohort
data set (COHORT2).
A stepwise variable selection method was used in the construction of all of the regression models, with the aim of selecting only the most predictive input variables for the
estimation of the CCF. The threshold level for the variables to enter and remain in the
model using the stepwise procedure was a p-value of 0.01. For the LOGIT and CLOGIT
models the resulting predicted probabilities were taken as the values for the CCF.
The following performance metrics were used to compare the regression techniques:
The coefficient of determination, $R^2 = 1 - \sum_{i=1}^{l}\left(y_i - f(x_i)\right)^2 \big/ \sum_{i=1}^{l}\left(y_i - \bar{y}\right)^2$, where $y_i$ is the observed and $f(x_i)$ the predicted CCF value. Although $R^2$ is usually a number from 0 to 1, $R^2$ could also yield negative values when the model prediction is worse than using the mean $\bar{y}$ from the training set as a prediction.
In order to calculate the performance metrics on the categorical predictions made by the
LOGIT and CLOGIT models, first a continuous prediction value must be obtained. This
is achieved by multiplying the probability of being in each of the bins by the average
CCF value for each of those respective bins and summing the result, thus obtaining an
expected value of CCF. After this value has been computed, the resulting value is then
used in the calculation of the performance metrics.
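This conversion is simply a probability-weighted average of the bins' mean CCFs, as in the sketch below (array names are illustrative).

```python
import numpy as np

def expected_ccf(class_probabilities, bin_mean_ccf):
    """Expected CCF from a (cumulative) logit model: sum over bins of
    P(bin) times the average training-set CCF of that bin.
    class_probabilities: (n_obs, n_bins); bin_mean_ccf: (n_bins,)."""
    return np.asarray(class_probabilities) @ np.asarray(bin_mean_ccf)
```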
Spearman's $\rho$ (see e.g. Cohen et al., 2002) is defined as Pearson's r applied to the rankings of the predicted and observed values. The root mean squared error (RMSE) is defined as in Chapter 5:
$$\mathrm{RMSE} = \sqrt{\frac{1}{l}\sum_{i=1}^{l}\left(f(x_i) - y_i\right)^2} \qquad (6.6)$$
In this section we will begin by analysing the input variables and their relationship to the dichotomised CCF value ($0: CCF \le \overline{CCF}$; $1: CCF > \overline{CCF}$). The following table displays the resulting information value for each variable, ranked from most to least predictive:
Typically, input variables which display an information value greater than 0.1 are deemed
to have a significant contribution in the prediction of the target variable. From this
analysis, we can see that the majority of the relative and absolute changes in drawn,
undrawn and committed amounts do not possess the same ability to discriminate between
low and high CCFs as the original variable measures at reference time only. It is also
clear from the results that the undrawn amount could be an important variable in the
discrimination of the CCF value. It must be taken into consideration however that
although the variables may display a good ability to discriminate between the low and
high CCFs, the variables themselves are highly correlated with each other (see Table A6
in the APPENDIX).
From TABLE 6.3, we can see that the best performing regression algorithm for all three performance measures is the binary logit model, with an $R^2$ value of 0.1028. Although this $R^2$ value is low, it is comparable to the range of performance results previously reported in other work on LGD modelling (cf. Chapter 5). This result also re-affirms the proposed usefulness of a logit model for estimating CCFs in Valvonis (2008). It can also be seen that all five models are quite similar in terms of variable significance levels and positive/negative signs. There does however seem to be some discrepancy for the Rating class variable, where the medium-range behavioural score band appears to be associated with the highest CCFs.
Variables | Coefficient sign reported in Jacobs (2008) | Parameter estimate and p-value for each of: OLS model (using only suggested variables in Moral, 2006); OLS model (additional variables); OLS with Beta transformation (B-OLS); binary logit model (LOGIT); cumulative logit model (CLOGIT)
Intercept 1 0.1830 <.0001 0.1365 <.001 -0.5573 <.0001 -1.5701 <.0001 0.6493 <.0001
Intercept 2 -0.5491 <.001
Credit percentage usage – -0.1220 <.001 -0.1260 <.001 -0.5737 <.001 -1.3220 <.0001
Committed amount + 1.73E-05 <.0001 1.76E-05 <.0001 2.2E-05 <.0001 9.0E-05 <.0001 8.8E-05 <.0001
Undrawn + -8.68E-05 <.0001 -8.88E-05 <.0001 -1.1E-04 <.0001 -4.7E-04 <.0001 -3.6E-04 <.0001
Time-to-Default + 0.0334 <.0001 0.0326 <.0001 0.0358 <.0001 0.1538 <.0001 0.1009 <.0001
Rating class –
Rating 1 (AAA-A) vs. 4 (UR) 0.1735 <.0001 0.2304 <.0001 0.2223 <.0001 0.4000 0.0069 -0.0772 0.5472
Rating 2 (BBB-B) vs. 4 (UR) 0.2483 <.0001 0.2977 <.0001 0.3894 <.0001 0.5885 <.0001 0.6922 <.0001
Rating 3 (C) vs. 4 (UR) 0.0944 <.0001 0.1201 <.0001 0.1664 <.0001 -0.2121 0.0043 -0.0157 0.8098
Average number of days N/A 0.0048 <.0001 0.0062 <.0001 0.0216 <.0001 0.0218 <.0001
delinquent in the last 6
months
Undrawn percentage N/A 0.2784 <.0001
Of the additional variables we tested (e.g. absolute or relative change in the drawn amount, credit limit and undrawn amount), only 'Average number of days delinquent in the last 6 months' and 'Undrawn percentage' were retained by the stepwise selection
procedures. This is most likely due to the fact that their relation to the CCF is already
largely accounted for by the base model variables. Further to this, Table A6 in the
APPENDIX details a correlation matrix for the inputs, indicating that for example the
Drawn amount has a high positive correlation with the Committed Amount (0.782). It is
also of interest to note that although one additional variable is selected in the stepwise
procedure for the second OLS model, there is no increase in predictive power over the
original OLS model.
A direct estimation of the un-winsorised CCF with the use of an OLS model was also
undertaken. The results from this experimentation indicate that it is even harder to predict
the un-winsorised CCF than the CCF winsorised between 0 and 1 with a predictive
performance far weaker than the winsorised model. (When these results are applied to the
estimation of the actual EAD an inferior result is also achieved).
With the predicted values for the CCF obtained from the five models, it is then possible to estimate the actual EAD value for each observation i in the COHORT2 data set, as follows:
$$\widehat{EAD}_i = E(t_r) + \widehat{CCF}_i \left( L(t_r) - E(t_r) \right)$$
This gives us an estimated “monetary EAD” value which can be compared to the actual
EAD value found in the data set. For comparison purposes, a conservative estimate for the EAD assuming CCF = 1 is also calculated, as well as an estimate for EAD where the mean of the CCF in the first cohort is used (TABLE 6.4). The following table
(TABLE 6.5) displays the predictive performance of this estimated EAD amount against
the actual EAD amount:
| Metric | OLS model (using only previously suggested variables) | OLS model (including average number of days delinquent in the last 6 months) | OLS with Beta transformation (B-OLS) | Binary logit model (LOGIT) | Cumulative logit model (CLOGIT) |
|---|---|---|---|---|---|
| Coefficient of determination (R²) | 0.6450 | 0.6431 | 0.8365 | 0.6344 | 0.6498 |
| Pearson's correlation coefficient (r) | 0.8049 | 0.8038 | 0.8000 | 0.8016 | 0.8068 |
| Spearman's correlation coefficient (ρ) | 0.7421 | 0.7405 | 0.7270 | 0.7387 | 0.7381 |
TABLE 6.5: EAD estimates based on CCF predictions against actual EAD amounts
It is quite clear from these results that, although the predicted CCF value gave a relatively weak performance, a significant improvement over the conservative model can be made when this value is applied in the EAD formulation. It can also be noted that the application of the OLS with Beta transformation model gives a significantly higher value for the coefficient of determination (0.8365), although its correlation values are comparable to those of the other models. A possible reason for this is that, even though the CCF has been winsorised prior to estimation, the B-OLS model's predictions are much closer to the real CCF values before winsorisation. Thus the B-OLS model produces a better actual estimate of the EAD. However, by simply applying the mean of the CCF, a similar result to the other predicted models can be achieved.
The direct estimation of the EAD through the use of an OLS model, without first estimating and applying a CCF, has also been taken into consideration. The results from
this direct estimation of EAD are shown in TABLE 6.6, with the distribution for the
direct estimation of EAD given in FIGURE 6.3: (The legend for FIGURES 6.3-6.7
details the frequency of values along the y-axis and the estimated EAD value along the x-
axis)
FIGURE 6.3: Distribution of direct estimation of EAD (the actual EAD amount present is indicated
by the overlaid black line)
It is self-evident from the performance metrics and the produced distribution that a direct estimation of EAD, without first estimating and applying a CCF, can indeed produce reasonable estimations of the actual EAD. This goes some way towards confirming the findings shown by Taplin et al. (2007).
The following figures (FIGURES 6.4-6.7) display the distribution for the actual EAD
amount present in COHORT2 and the estimated EAD values for the regression models. It
is apparent from the predicted distributions that all five models approximate the actual
EAD distribution very well. All three models do however somewhat underestimate the
140 | B a s e l I I C o m p l i a n t C r e d i t R i s k M o d e l l i n g
FIGURE 6.4: OLS base model predicted Exposure at Default (EAD) distribution (the actual EAD
amount present is indicated by the overlaid black line)
FIGURE 6.5: Binary LOGIT model predicted Exposure at Default (EAD) distribution (the actual
EAD amount present is indicated by the overlaid black line)
FIGURE 6.6: Cumulative LOGIT model predicted Exposure at Default (EAD) distribution (the
actual EAD amount present is indicated by the overlaid black line)
FIGURE 6.7: OLS with Beta Transformation model predicted Exposure at Default (EAD)
distribution (the actual EAD amount present is indicated by the overlaid black line)
FIGURE 6.8: OLS base model plot for the Actual Mean EAD against Predicted Mean EAD across
ten bins (R2=0.9968)
FIGURE 6.9: Binary LOGIT model plot for the Actual Mean EAD against the Predicted Mean EAD
across ten bins (R2=0.9944)
FIGURE 6.10: Cumulative LOGIT model plot for the Actual Mean EAD against the Predicted Mean
EAD across ten bins (R2=0.9954)
FIGURE 6.11: OLS with Beta Transformation model plot for the Actual Mean EAD against the
Predicted Mean EAD across ten bins (R2=0.9957)
FIGURES 6.8-6.11 display plots of the actual mean EAD against the predicted mean EAD across ten bins (the legend for FIGURES 6.8-6.11 details the mean actual EAD along the y-axis and the mean predicted EAD along the x-axis across the 10 bins). The bins are created by splitting the distribution of the predicted EAD into ten bins of equal size. The plots show that the means for the actual and predicted EAD in bins one to ten are close to the diagonal for all of the models, indicating that the predicted EAD closely approximates the actual EAD. The points that deviate slightly from the diagonal again occur at the left and right ends of the EAD range.
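The binning procedure behind FIGURES 6.8-6.11 can be reproduced along the following lines (Python; the simulated actual and predicted EAD values are placeholders): the observations are split into ten equal-size bins of the predicted EAD, the actual and predicted means are computed per bin, and the R2 of the binned means is reported.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def binned_calibration(actual, predicted, n_bins=10):
    """Mean actual vs mean predicted EAD across equal-size bins of the
    predicted value, as plotted in FIGURES 6.8-6.11."""
    df = pd.DataFrame({"actual": actual, "predicted": predicted})
    df["bin"] = pd.qcut(df["predicted"], q=n_bins, labels=False,
                        duplicates="drop")
    means = df.groupby("bin")[["actual", "predicted"]].mean()
    r2 = np.corrcoef(means["predicted"], means["actual"])[0, 1] ** 2
    return means, r2

# Placeholder data standing in for the actual and model-predicted EAD.
rng = np.random.default_rng(1)
ead_actual = rng.gamma(2.0, 3000.0, 2000)
ead_pred = ead_actual * rng.normal(1.0, 0.15, 2000)

means, r2 = binned_calibration(ead_actual, ead_pred)
ax = means.plot.scatter(x="predicted", y="actual")
lims = [0.0, float(means.to_numpy().max()) * 1.05]
ax.plot(lims, lims, color="black")   # diagonal reference line
ax.set_title(f"Actual vs predicted mean EAD across ten bins (R2={r2:.4f})")
plt.show()
```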
Similarly to FIGURES 6.8-6.11, FIGURES 6.12-6.15 display plots of the actual mean CCF against the predicted mean CCF across ten bins (the legend for FIGURES 6.12-6.15 details the mean actual CCF along the y-axis and the mean predicted CCF along the x-axis across the 10 bins). From these plots it is clear that the regression models struggle to predict the values of the CCF closely.
FIGURE 6.12: OLS base model plot for the Actual Mean CCF against the Predicted Mean CCF
across ten bins (R2=0.7061)
FIGURE 6.13: Binary LOGIT model plot for the Actual Mean CCF against the Predicted Mean CCF
across ten bins (R2=0.2867)
FIGURE 6.14: Cumulative LOGIT base model plot for the Actual Mean CCF against the Predicted
Mean CCF across ten bins (R2=0.9063)
FIGURE 6.15: OLS with Beta Transformation model plot for the Actual Mean CCF against the
Predicted Mean CCF across ten bins (R2=0.9154)
In summary, this chapter has set out to develop comprehensible and robust regression
models for the estimation of Exposure at Default (EAD) for consumer credit through
the prediction of the credit conversion factor (CCF). An in-depth analysis of the
predictive variables used in the modelling of the CCF has also been given, showing
that previously acknowledged variables are significant and identifying a series of
additional variables.
As the results show, a marginal improvement in the coefficient of determination can
be achieved with the use of a binary logit model over a traditional OLS model.
Interestingly, the cumulative logit model performs worse than both the binary logit and OLS models. The probable cause of this is the size of the peaks around 0 and 1 relative to the number of observations found in the interval between the two peaks, which allows for more error in the prediction of the CCF via a cumulative three-class model.
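For illustration, the sketch below gives one plausible construction of a cumulative three-class logit for the CCF, in which the winsorised CCF is grouped into ordered classes corresponding to the two peaks and the interval between them; it assumes statsmodels (version 0.13 or later for OrderedModel) and synthetic data, and is not necessarily the exact specification estimated in this chapter.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
n = 1000

# Hypothetical CCF drivers observed at the start of the cohort.
X = pd.DataFrame({
    "credit_pct": rng.uniform(0.0, 1.0, n),
    "time_to_default": rng.uniform(1.0, 12.0, n),
})

# Synthetic winsorised CCF with the characteristic peaks at 0 and 1.
mass = rng.uniform(0.0, 1.0, n)
ccf = np.where(mass < 0.30, 0.0,
               np.where(mass > 0.75, 1.0, rng.beta(2.0, 2.0, n)))

# Three ordered classes: lower peak, interior, upper peak.
ccf_class = pd.Series(pd.cut(ccf, bins=[-0.01, 0.0, 0.999, 1.01],
                             labels=["zero", "mid", "one"], ordered=True))

# Cumulative (proportional odds) logit fitted on the three classes.
cum_logit = OrderedModel(ccf_class, X, distr="logit").fit(method="bfgs",
                                                          disp=False)
class_probs = np.asarray(cum_logit.predict(X))   # P(zero), P(mid), P(one)

# One way to map class probabilities back to a CCF prediction: weight the
# observed class means by the predicted class probabilities.
class_means = pd.Series(ccf).groupby(ccf_class, observed=True).mean().to_numpy()
ccf_pred = class_probs @ class_means
```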
Another interesting finding is that although the predictive power for the CCF itself is weak, when the predicted value is applied in the EAD formulation to predict the actual EAD, the predictive power is fairly strong. In particular, when the predicted values
obtained through the application of the OLS with Beta transformation model were
applied to the EAD formulation an improvement in the coefficient of determination
was seen. Nonetheless, similar performance, in terms of correlations, could be
achieved by a simple model that takes the average CCF of the previous cohort,
showing that much of the explanatory power of EAD modelling derives from the
current exposure.
With regard to the additional variables proposed for the prediction of the CCF, only one, the average number of days delinquent in the last 6 months, gave an adequate p-value, whilst undrawn percentage, potentially an alternative to credit percentage, was significant for the OLS with Beta transformation model. Even though the relative changes in the undrawn amount give reasonable information value scores, these variables do not prove to be significant in the regression models, probably due to their high correlation with the undrawn variable. This shows that the actual values at the start of the cohort already provide a sufficient representation of previous activity for predicting the CCF.
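For reference, information value scores of the kind referred to above are commonly computed as in the sketch below (Python; the binary target used here, whether the winsorised CCF exceeds its median, and the variable values are illustrative assumptions rather than the definition used in this chapter).

```python
import numpy as np
import pandas as pd

def information_value(x, event, n_bins=10):
    """Information value of a candidate variable x with respect to a binary
    event flag, using equal-frequency bins of x."""
    df = pd.DataFrame({"x": x, "event": event})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["event"].agg(["sum", "count"])
    events = grouped["sum"]
    non_events = grouped["count"] - grouped["sum"]
    pct_event = (events + 0.5) / events.sum()     # small offset avoids log(0)
    pct_non = (non_events + 0.5) / non_events.sum()
    return float(((pct_event - pct_non) * np.log(pct_event / pct_non)).sum())

# Illustrative example: relative change in the undrawn amount over 6 months
# against an indicator of whether the CCF exceeds its median.
rng = np.random.default_rng(3)
rel_change_undrawn_6 = rng.normal(0.0, 1.0, 2000)
ccf_above_median = (rng.uniform(0.0, 1.0, 2000) < 0.5).astype(int)
print(information_value(rel_change_undrawn_6, ccf_above_median))
```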
There is an obvious need for further research into the prediction of the exposure at
default (EAD) value as this chapter can only go so far in its estimations. A more
extensive study with multiple data sets over a longer timescale would be able to give
more reliable results in the prediction of the EAD. A variation of the time period used
prior to default other than the cohort method would also be an interesting extension.
Also, previous work stated in the literature review section has already looked at some
alternative techniques, such as a generalised beta link model. A benchmarking study
including this and the techniques mentioned in this chapter may give a better
understanding of any improvements that could be made over an ordinary least squares
regression model or the logistic regression models suggested in this chapter. The
availability of application data in the modelling process may also provide some
additional predictive variables in the modelling of the CCF.
Chapter 7
7 Conclusions
In this PhD thesis, we addressed three issues relating to the implementation of the
advanced internal ratings based approach (AIRB) by financial institutions. The issues
raised in this thesis included that of building classification models for the estimation
of probability of default (PD) for imbalanced credit scoring data sets; the accurate
prediction of loss given default (LGD); and the construction of a robust and
comprehensible model for exposure at default (EAD).
In this chapter we present the conclusions that can be drawn from the research undertaken in this thesis. After highlighting the conclusions from each project, directions for further research are also given.
In the literature review of this thesis (cf. Chapter 2), we identified issues pertaining to
the estimation of probability of default (PD) in imbalanced credit scoring data sets.
Although a great deal of work has been undertaken to date in the field of PD estimation, the issue of imbalanced data sets has not yet been fully addressed.
In Chapter 4 of this thesis, we addressed this issue of estimating probability of default
for imbalanced data sets. We achieved this by looking at a number of credit scoring
techniques, and studying their performance over various class distributions on five
real-life credit data sets. Two techniques that have yet to be fully researched in the
context of credit scoring, i.e. Gradient Boosting and Random Forests, were also
chosen to give a broader review of the techniques available. The classification power
of these techniques was assessed based on the area under the receiver operating
characteristic curve (AUC). Friedman's test and Nemenyi's post-hoc tests were then
applied to determine whether the differences between the average ranked
performances of the AUCs were statistically significant. Finally, these significance
results were visualised using significance diagrams for each of the various class
distributions analysed.
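As a hedged illustration of this evaluation pipeline, the sketch below applies Friedman's test to a matrix of AUCs and computes the Nemenyi critical difference used to build significance diagrams; the AUC values and the tabulated studentised-range constant are illustrative assumptions, not results from Chapter 4.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical AUCs: rows = data sets / samples, columns = classifiers.
auc = np.array([
    # LOG   LDA   RF    GB    C4.5
    [0.77, 0.76, 0.80, 0.79, 0.71],
    [0.81, 0.80, 0.84, 0.83, 0.74],
    [0.74, 0.73, 0.78, 0.77, 0.69],
    [0.79, 0.78, 0.82, 0.81, 0.72],
    [0.76, 0.75, 0.79, 0.80, 0.70],
])
n_datasets, k = auc.shape

# Friedman's test on the k classifiers' AUCs across the data sets.
stat, p_value = friedmanchisquare(*[auc[:, j] for j in range(k)])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

# Average ranks (rank 1 = best AUC; simple argsort ranking, no tied AUCs here)
# and the Nemenyi critical difference: two classifiers differ significantly if
# their average ranks differ by more than CD = q_alpha * sqrt(k(k+1) / (6N)).
ranks = (-auc).argsort(axis=1).argsort(axis=1) + 1
avg_ranks = ranks.mean(axis=0)
q_alpha = 2.728   # commonly tabulated value for k = 5 at alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
print("average ranks:", avg_ranks, "critical difference:", round(cd, 3))
```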
The results of these experiments showed that the Gradient Boosting and Random
Forest classifiers performed well in dealing with samples where a large class
imbalance was present. It does appear that in extreme cases the ability of random
forests and gradient boosting to concentrate on 'local' features in the imbalanced data
is useful. The most commonly used credit scoring techniques, linear discriminant
analysis (LDA) and logistic regression (LOG), gave results that were reasonably
competitive with the more complex techniques and this competitive performance
continued even when the samples became much more imbalanced. This would
suggest that the currently most popular approaches are fairly robust to imbalanced
class sizes. On the other hand, techniques such as QDA and C4.5 were significantly
worse than the best performing classifiers. It can also be concluded that the use of a
linear kernel LS-SVM would not be beneficial in the scoring of data sets where a very
large class imbalance exists.
Finally the issue of regression model development for credit card exposure at default
(EAD) is dealt with in Chapter 6 of this thesis. This chapter sets out with the aim of
developing a comprehensible and robust regression model for the estimation of
Exposure at Default (EAD) for consumer credit cards through the prediction of the
credit conversion factor (CCF). An in-depth analysis of the predictive variables used
in the modelling of the CCF is also given, showing that previously acknowledged
variables are significant and identifying a series of additional variables.
The results from this chapter show that a marginal improvement in the coefficient of
determination can be achieved with the use of a binary logit model over a traditional
OLS model. Interestingly, the cumulative logit model performs worse than both the binary logit and OLS models. The probable cause of this is the size of the peaks around 0 and 1 relative to the number of observations found in the interval between the two peaks, which allows for more error in the prediction of the CCF via a cumulative three-class model.
Another interesting finding is that although the predictive power for the CCF itself is weak, when the predicted value is applied in the EAD formulation to predict the actual EAD, the predictive power is fairly strong. In particular, when the predicted values
obtained through the application of the OLS with Beta transformation model were
applied to the EAD formulation an improvement in the coefficient of determination
was seen. Nonetheless, similar performance, in terms of correlations, could be
achieved by a simple model that takes the average CCF of the previous cohort,
showing that much of the explanatory power of EAD modelling derives from the
current exposure.
With regard to the additional variables proposed for the prediction of the CCF, only one, the average number of days delinquent in the last 6 months, gave an adequate p-value, whilst undrawn percentage, potentially an alternative to credit percentage, was significant for the OLS with Beta transformation model. Even though the relative changes in the undrawn amount give reasonable information value scores, these variables do not prove to be significant in the regression models, probably due to their high correlation with the undrawn variable. This shows that the actual values at the start of the cohort already provide a sufficient representation of previous activity for predicting the CCF.
In summary, this thesis has identified and presented detailed results and findings for
three main issues facing financial institutions wishing to implement an AIRB
approach. An extensive review of the current literature and findings has also been presented and built upon, with the aim of giving financial institutions a better understanding of appropriate techniques and methodologies in the modelling process.
Further to the conclusions presented in this thesis, there still remain many challenging
issues for further research. This section will highlight the issues for further research
identified by each of the major works conducted in this thesis.
With regard to probability of default (PD) modelling for imbalanced data sets, further work that could be conducted, as a result of the findings presented in this thesis, would be, firstly, to consider a stacking approach to classification through the combination of multiple techniques; such an approach would allow a meta-learner to pick the best model to classify an observation (a minimal sketch is given after this paragraph). Secondly, another interesting extension
to the research would be to apply these techniques on much larger data sets which
display a wider variety of class distributions. It would also be of interest to look into
the effect of not only the percentage class distribution but also the effect of the actual
number of observations in a data set.
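A minimal sketch of such a stacking approach is given below, assuming scikit-learn and a synthetic imbalanced data set; the base learners and the logistic regression meta-learner are illustrative choices rather than a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced credit scoring data (roughly 5% defaulters).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# The meta-learner combines the base classifiers' out-of-fold probabilities.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0)),
                ("log", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
print("stacked AUC:", roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]))
```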
Finally, as stated in the literature review chapter of this thesis, several oversampling techniques for dealing with large class imbalances have already been researched in the machine learning literature. Further research into these techniques and their effect on credit scoring model performance would be beneficial.
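One such oversampling technique is SMOTE (Chawla et al., cited in the references); the hedged sketch below shows how it could be applied before model training, assuming the imbalanced-learn package and a synthetic data set.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, heavily imbalanced credit scoring sample (about 2% defaulters).
X, y = make_classification(n_samples=10000, n_features=15,
                           weights=[0.98, 0.02], random_state=0)
print("before resampling:", Counter(y))

# SMOTE creates synthetic minority-class (defaulter) observations by
# interpolating between existing minority cases and their nearest neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after resampling: ", Counter(y_res))

# The rebalanced sample can then be used to train any scoring model.
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```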
In the literature to date there has been considerable evidence that macroeconomic factors affect a client's credit risk behaviour. To further the research presented in this thesis, it may be a worthwhile endeavour to investigate the influence of macroeconomic variables, both in the context of improving LGD models and for stress testing.
A variety of LGD data sets have been analysed and reported in Chapter 5 of this
thesis. To further this work separate studies on corporate and retail credit LGD data
sets could be made, to determine whether separate risk drivers are present in the
prediction of each. Finally, one could also try to add comprehensibility to well-
performing black box models with rule extraction techniques to gain more insight.
There is an obvious need for further research into the prediction of the exposure at
default (EAD) value as this thesis can only go so far in its estimations. A more
extensive study with multiple data sets over a longer timescale would be able to give
more reliable results in the prediction of the EAD. A variation of the time period used
prior to default other than the cohort method would also be an interesting extension.
Also, previous work stated in the literature review section has already looked at some
alternative techniques, such as a generalised beta link model. A benchmarking study
including this and the techniques mentioned in this thesis may give a better
understanding of any improvements that could be made over an ordinary least squares
regression model or the logistic regression models suggested in this thesis. The
availability of application data in the modelling process may also provide some
additional predictive variables in the modelling of the CCF.
Appendices
A8 Nominal
A9 Nominal
A10 Continuous
A11 Nominal
A12 Nominal
A13 Continuous
A14 Continuous
A15 Binary Target
A1.2 Bene1
Variable name Type
Identification number continuous
Amount of loan continuous
Amount on purchase invoice continuous
Percentage of financial burden continuous
Term continuous
Personal loan nominal
Purpose nominal
Private or professional loan nominal
Monthly payment continuous
Savings account continuous
Other loan expenses continuous
Income continuous
Profession nominal
A1.3 Bene2
The variable names for the Bene2 dataset cannot be displayed for confidentiality purposes. The dataset includes:
28 input variables:
Continuous variables: 18
Nominal variables: 10
1 Binary class variable
A1.4 Behav
The variable names for the Behav dataset cannot be displayed for confidentiality purposes. The dataset includes:
1 ID variable
60 Input variables:
Nominal variables: 10
Ordinal variables: 1
Continuous variables: 49
1 Binary class variable (0 = “good account”; 1 = “bad account”)
Age Continuous
Other_Payment_Plans Nominal
Housing Nominal
Existing Credits Continuous
Job Nominal
Number of Dependents Continuous
Own_Telephone Nominal
Foreign_worker Nominal
A3.1 BANK1
A3.2 BANK2
A3.3 BANK3
A3.4 BANK4
A3.5 BANK5
A3.6 BANK6
The variable names for the BANK6 dataset cannot be displayed for confidentiality purposes.
A4.1 BANK1
A4.2 BANK2
A4.3 BANK3
A4.4 BANK4
A4.5 BANK5
A4.6 BANK6
The variable names for the BANK6 dataset cannot be displayed for confidentiality purposes.
display diagonal normal probability. It therefore seems that the normality assumption is not satisfied for these data sets, suggesting that the OLS model fit is relatively poor.
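A normal probability plot of the kind referred to here can be reproduced with the short sketch below (scipy and matplotlib assumed); the residuals are synthetic placeholders whose heavy tails merely stand in for the departures from normality described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder residuals from an OLS model; a heavy-tailed t distribution is
# used so that the plot visibly departs from the diagonal.
rng = np.random.default_rng(4)
residuals = rng.standard_t(df=3, size=1000)

# Normal probability plot: points far from the diagonal suggest that the
# normality assumption behind the OLS error term is not satisfied.
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of OLS residuals (illustrative)")
plt.show()
```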
A6: Pearson’s correlation coefficients matrix for input variables used in Chapter 6
CCF EAD Commit_Amt Drawn_Amt Undrawn_Amt Credit% Time_default Rating_grade_1 Rating_grade_2 Rating_grade_3 Rating_grade_4 Av_No_days_del_3 Av_No_days_del_6 Av_No_days_del_9 Av_No_days_del_12 Incr_commit_Amt
EAD 0.323 1.000
Commit_Amt 0.030 0.712 1.000
Drawn_Amt 0.089 0.755 0.782 1.000
Undrawn_Amt -0.083 0.012 0.421 -0.236 1.000
Credit% 0.039 0.212 -0.067 0.458 -0.771 1.000
Time_default 0.229 0.078 0.046 -0.021 0.102 -0.111 1.000
Rating_grade_1 -0.041 -0.050 0.077 -0.136 0.318 -0.319 -0.004 1.000
Rating_grade_2 0.231 0.121 0.177 0.095 0.138 -0.114 0.280 -0.214 1.000
Rating_grade_3 -0.093 -0.038 -0.106 0.007 -0.176 0.185 -0.106 -0.118 -0.654 1.000
Rating_grade_4 -0.184 -0.094 -0.154 -0.068 -0.141 0.098 -0.253 -0.084 -0.466 -0.258 1.000
Av_No_days_del_3 -0.067 -0.072 -0.099 -0.038 -0.099 0.112 -0.134 -0.097 -0.404 0.132 0.447 1.000
Av_No_days_del_6 -0.053 -0.072 -0.096 -0.039 -0.093 0.103 -0.103 -0.107 -0.397 0.160 0.407 0.844 1.000
Av_No_days_del_9 -0.056 -0.084 -0.105 -0.054 -0.085 0.087 -0.089 -0.111 -0.391 0.175 0.382 0.757 0.925 1.000
Av_No_days_del_12 -0.050 -0.085 -0.109 -0.060 -0.083 0.080 -0.076 -0.113 -0.386 0.178 0.373 0.719 0.874 0.961 1.000
Incr_commit_Amt 0.070 0.293 0.349 0.286 0.128 -0.021 0.080 -0.020 0.223 -0.089 -0.188 -0.127 -0.160 -0.195 -0.211 1.000
Undrawn% -0.039 -0.212 0.067 -0.458 0.771 -1.000 0.111 0.319 0.114 -0.185 -0.098 -0.112 -0.103 -0.087 -0.080 0.021
Abs_change_drawn_3 0.036 0.260 0.222 0.414 -0.256 0.263 -0.072 -0.055 0.013 0.020 -0.014 -0.090 -0.086 -0.084 -0.077 0.142
Abs_change_drawn_6 0.019 0.326 0.277 0.482 -0.270 0.308 -0.078 -0.071 0.023 0.015 -0.013 -0.060 -0.099 -0.106 -0.105 0.239
Abs_change_drawn_12 0.053 0.439 0.374 0.595 -0.281 0.359 -0.068 -0.105 0.031 0.022 -0.014 -0.051 -0.074 -0.094 -0.105 0.297
Abs_change_undrawn_3 -0.008 -0.167 -0.082 -0.328 0.349 -0.308 0.117 0.056 0.075 -0.076 -0.040 0.033 0.030 0.027 0.020 -0.001
Abs_change_undrawn_6 0.008 -0.183 -0.078 -0.357 0.398 -0.368 0.119 0.075 0.097 -0.087 -0.066 -0.022 0.012 0.017 0.018 0.029
Abs_change_undrawn_12 -0.017 -0.236 -0.091 -0.411 0.456 -0.430 0.124 0.113 0.110 -0.100 -0.088 -0.028 -0.016 -0.009 -0.002 0.128
Abs_change_commit_3 0.061 0.202 0.302 0.185 0.202 -0.095 0.097 0.001 0.190 -0.120 -0.115 -0.125 -0.122 -0.125 -0.123 0.305
Abs_change_commit_6 0.051 0.291 0.390 0.269 0.216 -0.086 0.070 0.002 0.223 -0.131 -0.148 -0.156 -0.168 -0.173 -0.170 0.515
Abs_change_commit_12 0.061 0.364 0.486 0.345 0.255 -0.083 0.084 0.003 0.228 -0.125 -0.163 -0.130 -0.151 -0.174 -0.182 0.706
Rel_change_drawn_3 0.013 -0.038 -0.035 -0.049 0.016 -0.022 0.015 0.006 -0.018 0.009 0.011 0.007 0.008 0.009 -0.005 -0.020
Rel_change_drawn_6 -0.024 -0.026 -0.035 -0.048 0.015 -0.056 0.015 0.014 -0.016 0.016 -0.004 -0.012 -0.009 -0.002 0.002 -0.028
Rel_change_drawn_12 -0.002 0.007 0.010 0.014 -0.004 -0.010 -0.027 0.004 -0.038 -0.007 0.059 0.015 0.006 0.009 0.005 0.001
Rel_change_undrawn_3 -0.013 -0.002 -0.005 0.003 -0.012 0.025 -0.001 0.002 0.019 -0.038 0.019 0.013 0.008 0.008 0.009 -0.016
Rel_change_undrawn_6 -0.004 -0.008 0.005 -0.013 0.026 -0.013 -0.009 0.000 0.008 -0.007 -0.002 0.002 -0.001 0.007 0.006 0.000
Rel_change_undrawn_12 -0.006 -0.015 -0.014 -0.030 0.022 -0.026 -0.010 0.028 0.012 -0.025 -0.001 0.008 -0.002 -0.002 0.000 0.012
Rel_change_commit_3 0.074 0.007 0.021 -0.028 0.073 -0.104 0.096 -0.020 0.144 -0.064 -0.109 -0.119 -0.117 -0.115 -0.115 0.257
Rel_change_commit_6 0.061 0.017 0.007 -0.030 0.055 -0.088 0.080 -0.018 0.166 -0.079 -0.122 -0.138 -0.157 -0.155 -0.153 0.448
Rel_change_commit_12 0.062 0.029 0.013 -0.018 0.047 -0.076 0.075 -0.017 0.172 -0.077 -0.134 -0.112 -0.140 -0.156 -0.163 0.600
Undrawn% Abs_change_drawn_3 Abs_change_drawn_6 Abs_change_drawn_12 Abs_change_undrawn_3 Abs_change_undrawn_6 Abs_change_undrawn_12 Abs_change_commit_3 Abs_change_commit_6 Abs_change_commit_12 Rel_change_drawn_3 Rel_change_drawn_6 Rel_change_drawn_12 Rel_change_undrawn_3
Undrawn% 1.000
References
Allen L, De Long G and Saunders A (2004). Issues in the credit risk modeling of
retail markets, Journal of Banking & Finance, 28(4), 727–752.
Altman E and Sironi A (2004). Default recovery rates in credit risk modelling: A
review of the literature and empirical evidence. Economic Notes, 33, 183–208.
Altman E and Suggitt H J (2000). Default rates in the syndicated bank loan market: A
mortality analysis, Journal of Banking & Finance, 24(1-2), 229-253.
Araten M and Jacobs M (2001). Loan Equivalents for Revolving Credits and Advised
Lines, The RMA Journal, May, 34–39.
Asarnow E and Marker J (1995). Historical Performance of the U.S. Corporate Loan
Market: 1988–1993. Journal of Commercial Lending, 10(2), 13–32.
Baesens B (2003a). Developing intelligent systems for credit scoring using machine
learning techniques, PhD Thesis, Faculty of Economics, KU Leuven.
Baesens B, Setiono R, Mues C and Vanthienen J (2003). Using neural network rule
extraction and decision tables for credit-risk evaluation, Management Science, 49(3),
312–329.
Basel Committee on Banking Supervision (2001a). The New Basel Capital Accord,
Jan. Available at: http://www.bis.org/publ/bcbsca03.pdf.
Bastos J (2010). Forecasting bank loans for loss-given-default. Journal of Banking &
Finance, 34(10), 2510-2517.
Bastos J (2010). Predicting bank loan recovery rates with neural networks,
CEMAPRE Working Papers 1003, Centre for Applied Mathematics and Economics
(CEMAPRE), School of Economics and Management (ISEG), Technical University
of Lisbon.
Batista G (2004). A Study of the Behavior of Several Methods for Balancing Machine
Learning Training Data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.
Bedingfield S and Smith KA (2001). Evolutionary rule generation and its application
to credit scoring. In L. Reznik and V. Kreinovich, editors, Soft Computing in
Measurement and Information Acquisition, Heidelberg, 2001. Physica-Verlag.
Bellotti T and Crook J (2007). Modelling and predicting loss given default for credit
cards. In: Credit Scoring and Credit Control XI conference.
Benjamin N, Cathcart A and Ryan K (2006). Low Default Portfolios: A Proposal for
Conservative Estimation of Default Probabilities. Discussion Paper, Financial
Services Authority.
Berger J and Berliner L (1986). Robust Bayes and empirical Bayes analysis with
contaminated priors, The Annals of Statistics, 14(2), 461–486.
Berry M and Linoff S (2000). Mastering data mining: The art and science of customer
relationship management, John Wiley & Sons, Inc, New York.
Bonfim D (2009). Credit risk drivers: Evaluating the contribution of firm level
information and of macroeconomic dynamics, Journal of Banking & Finance, 33(2),
281-299.
Box G and Tiao G (1992), Bayesian Inference in Statistical Analysis, John Wiley &
Sons, New York.
Caselli S and Querci F (2009). The sensitivity of the loss given default rate to
systematic risk: New empirical evidence on bank loans. Journal of Financial Services
Research, 34, 1-34.
Cespedes JCG, de Juan Herrero JA, Rosen D, and Saunders D (2010). Effective
Modeling of Wrong Way Risk, Counterparty Credit Risk Capital and Alpha in Basel
II, Journal of Risk Model Validation, 4(1), 71-98.
Chalupka R and Kopecsni J (2009). Modeling Bank Loan LGD of Corporate and
SME Segments: A Case Study, Czech Journal of Economics and Finance, 59(4), 360-
382
Chawla NV, Bowyer KW, Hall LO and Kegelmeyer WP (2002). SMOTE: Synthetic
Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16,
321–357.
Chou M (2006). Cash and credit card crisis in Taiwan, Business Weekly, 24–27.
Crouhy M, Galai D and Mark R (2000). A comparative analysis of current credit risk
models, Journal of Banking & Finance, 24, 57–117.
Davis RH, Edelman DB and Gammerman AJ (1992). Machine learning algorithms for
credit card applications, IMA Journal of Mathematics Applied Business & Industry, 4,
43-51.
DeLong ER, DeLong DM, and Clarke-Pearson DL (1988). Comparing the Areas
Under Two or More Correlated Receiver Operating Characteristic Curves: A
Nonparametric Approach. Biometrics, 44(3), 837–845.
Erdem C (2008). Factors affecting the probability of credit card default and the
intention of card use in Turkey. International Research Journal of Finance and
Economics, 18, 1450-2887.
Fernandes JE (2005). Corporate credit risk modeling: Quantitative rating system and
probability of default estimation, mimeo.
Crone S and Finlay S (2011). Instance Sampling in Credit Scoring: an empirical study
of sample size and balancing, International Journal of Forecasting, forthcoming;
Originally in: Big or Balanced? An empirical study of the effects of sample size and
balancing on model performance, In: Conference on Risk Management in the
Personal Financial Services Sector, 22-23 Jan 2009, Imperial College, London
Fogarty TC, Ireson NS, and Battles SA (1992). Developing rule based systems for
credit card applications from data with genetic algorithms. IMA Journal of
Mathematics Applied In Business and Industry, 4, 53-59.
Freed N and Glover F (1981). Simple but powerful goal programming models for
discriminant problems. European Journal of Operational Research, 7, 44-60.
Giambona F and Iancono VL (2008). Survival models and credit scoring: some evidence from the Italian banking system. 8th International Business Research Conference, Dubai, 27-28 March 2008.
Giudici P (2001). Bayesian data mining, with application to benchmarking and credit
scoring, Applied Stochastic Models in Business and Society, 17, 69–81.
Giudici P (2003). Applied data mining: Statistical methods for business and industry,
John Wiley & Sons, Inc, New York.
Guettler A and Liedtke HG (2007). Calibration of Internal Rating Systems: The Case
of Dependent Default Events, Kredit und Kapital, 40, 527-552
Gupton G and Stein M (2002). LossCalc: Model for predicting loss given default (LGD). Technical report, Moody's.
Gupton G and Stein M (2005). LossCalc V2: Dynamic prediction of LGD. Moody's Investors Service.
Han J and Kamber M (2001). Data mining: Concepts and techniques, Morgan
Kaufmann, San Fransisco.
Holland P and Welsch R (1977). Robust regression using iteratively reweighted least
squares. Communications in Statistics: Theory and Methods, 6, 813-827.
Hosmer DW and Lemeshow S (2000). Applied Logistic Regression, 2nd ed. New York;
Chichester, Wiley.
Huang CL, Chen MC and Weng CJ (2007). Credit scoring with a data mining
approach based on support vector machines, Expert Systems with Applications, 33(4),
847-856.
Jagielska I and Jaworski J (1996). Neural network for predicting the performance of
credit card accounts, Computational Economics, 9(1), 77–82.
Jiménez G, Lopez J A, and Saurina J (2009). EAD Calibration for Corporate Credit
Lines. Journal of Risk Management at Financial Institutions, 2, 121–29.
Kadane JB, Dickey JM, Winkler RL, Smith WS and Peters SC (1980). Interactive
elicitation of opinion for a normal linear model, Journal of the American Statistical
Association 75(372), 845–854 Dec.
Koh HC and Chan KLG (2002). Data mining and customer relationship marketing in
the banking industry, Singapore Management Review, 24(2), 1–27.
Kolesar P and Showers JL (1985). A robust credit screening model using categorical
data. Management Science, 31, 123-133.
Lee YS et al. (2002). Credit scoring using the hybrid neural discriminant
technique, Expert Systems with Applications, 23(3), 245–254.
Luo X and Shevchenko PV (2010). LGD credit risk model: estimation of capital with
parameter uncertainty using MCMC, Quantitative Finance Papers
Mays E (1998). Credit Risk Modeling: Design and Application, New York: Glenlake
Merton RC (1974). On the Pricing of Corporate Debt: The Risk Structure of Interest
Rates, Journal of Finance, 29(2), 449–470.
Mester LJ (1997). What's the Point of Credit Scoring? Business Review, 5, 3-16.
Miu P and Ozdemir B (2006). Basel requirements of downturn loss given default
modelling and estimating probability of default and loss given default correlations.
The Journal of Credit Risk, 2(2), 43-68
Moral G (2006). EAD Estimates for Facilities with Explicit Limits. in: Engelmann B,
Rauhmeier R (Eds), The Basel II Risk Parameters: Estimation, Validation and Stress
Testing, Springer, Berlin, 197-242.
OCC (2006). Validation of credit rating and scoring models: a workshop for managers
and practitioners, Office of the Comptroller of the Currency (2006).
Qi M and Zhao X (2011). Comparison of Modeling Methods for Loss Given Default,
Journal of Banking & Finance, 35(11), 2842-2855.
Quinlan JR (1993). C4.5 Programs for Machine Learning. Morgan Kaufmann: San
Mateo, CA.
SAS Institute (2002). Comply and Exceed: Credit Risk Management for Basel II and
Beyond. A SAS White Paper.
Schuermann T (2004). What do we know about loss given default? Working Paper
No. 04-01, Wharton Financial Institutions Center, Feb.
Shleifer A and Vishny R (1992). Liquidation values and debt capacity: A market
equilibrium approach, Journal of Finance, 47, 1343-1366.
Sigrist F and Stahel WA (2010). Using The Censored Gamma Distribution for
Modeling Fractional Response Variables with an Application to Loss Given Default,
Quantitative Finance Papers
Steenackers A and Goovaerts MJ (1989). A credit scoring model for personal loans.
Insurance: Mathematics and Economics, 8(1), 31-34.
Valvonis V (2008). Estimating EAD for retail exposures for Basel II purposes.
Journal of Credit Risk, 4(1), 79-109
Van Der Burgt M (2007). Calibrating Low-Default Portfolios, using the Cumulative
Accuracy Profile. ABN AMRO.
Van Gestel T and Baesens B (2009). Credit Risk Management. Oxford University
Press.
Van Gestel T, Baesens B, Van Dijcke P, Garcia J, Suykens J and Vanthienen J (2006).
A process model to develop an internal rating system: Sovereign credit ratings.
Decision Support Systems, 2, 1131-1151.
Viganò L (1993). “A Credit Scoring Model for Development Banks: An African Case
Study”, Savings and Development, 17(4), 441-482.
Weiss GM and Provost FJ (2003). Learning When Training Data are Costly: The
Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence
Research (JAIR), 19, 315-354.
West D (2000). Neural network credit scoring models. Computers & Operations Research, 27(11-12), 1131–1152.
Wilde T and Jackson L (2006). Low-default portfolios without simulation, Risk, 60–
63.
Witten IH and Frank E (2005). Data Mining: Practical machine learning tools and
techniques, 2nd Edition, Morgan Kaufmann, San Francisco.
Yang Y (2007). Adaptive credit scoring with kernel learning methods. European Journal of Operational Research, 183(3), 1521-1536.
Yao P (2009). Hybrid Classifier Using Neighborhood Rough Set and SVM for Credit
Scoring, International Conference on Business Intelligence and Financial
Engineering, 138-142
Yeh IC and Lien CH (2009). The comparisons of data mining techniques for the
predictive accuracy of probability of default of credit card clients, Expert Systems
with Applications, 36(2), 2473-2480
Yobas MB, Crook JN and Ross P (2000). Credit scoring using neural and
evolutionary techniques. IMA Journal of Management Mathematics, 11(2), 111-125.