Credit Risk Analysis - Project Report
Master Thesis

Author: D. Gorter
Supervisors: B. Roorda, M. van Keulen, F. Reuter, M. Westra

May 4, 2017
“One problem in the field of statistics has been that everyone wants to be a theorist. Part of
this is envy - the real sciences are based on mathematical theory. In the universities for this
century, the glamor and prestige has been in mathematical models and theorems, no matter
how irrelevant. ”
L. Breiman
University of Twente
BMS - IEBIS
Master of Science

Abstract
This thesis aims to pinpoint the added value of machine learning in the domain of retail credit risk, where (logistic) regression approaches are most commonly used. Credit data from the online peer-to-peer lending platform Lending Club is used to create retail credit risk models with Logistic Regression, Random Forests, Neural Networks and Support Vector Machines. A level playing field is created for the models by means of a single data transformation, so that all models receive the same input: a Weight of Evidence transformation produces a scaled data set without outliers or missing values. The resulting retail credit risk models are evaluated both in terms of modeling approach and in terms of model performance. The research shows that the machine learning approach does add value over the traditional (logistic) regression approach. Where the machine learning algorithms can handle all variables and decide for themselves how to model the relationships between them, the (logistic) regression approaches require careful selection of subsets of independent variables. This can become valuable when the amount of information available about loan applicants grows faster than the time available to address data issues such as correlated variables. The research also finds added value of machine learning in terms of model performance: the Neural Networks and Random Forests produce more accurate results than (logistic) regression. The Support Vector Machines, however, are not suitable for retail credit risk predictions, because the best predictions are made when models are trained on large amounts of data, which proved problematic for the Support Vector Machines.
The results of this research depend on the Weight of Evidence transformation, which is shown to be suboptimal for the Random Forests and possibly for the other machine learning models. Although this transformation is well suited to Logistic Regression, that method is still outperformed by the Random Forests and Neural Networks.
Acknowledgements
Writing this thesis at the Financial Risk Management department of Deloitte has been a great experience, during which I had the opportunity to learn a lot from the team members. They have shown great interest in my work and have helped me understand difficult topics. In particular I would like to thank Florian Reuter and Martijn Westra, who monitored my weekly progress and contributed to my thesis by discussing their views on the subject with me. I would also like to thank my teachers Berend Roorda and Maurice van Keulen, who guided me during the research and the writing of the thesis; their comments have helped me a lot, and discussing my work at the university with them was always enjoyable and insightful.
During my time at the University of Twente I have had the tools to develop myself academically, up to the point of obtaining a master's degree with this research. I would like to thank the teachers as well as the students with whom I have worked on projects for their involvement in this journey.
Contents

Abstract
Acknowledgements
1 Introduction
    1.1 Research background
    1.2 Research Questions
    1.3 Methodology
    1.4 Outline
2 Theoretical Framework
    2.1 Current industry developments
    2.2 Retail Credit Risk
        2.2.1 Expected Loss
            Probability of Default
            Exposure At Default
            Loss Given Default
        2.2.2 Charged Off loans
    2.3 Machine Learning
        2.3.1 Algorithm selection
        2.3.2 Random Forests
            Classification
            Probability Estimation
            Regression
            Variable importance classification/probability estimation forests
            Variable importance regression forests
        2.3.3 Neural Networks
            Backpropagation
        2.3.4 Support Vector Machines
            Classification
            Kernels
            Probability estimation
            Regression
    2.4 Data transformation
    2.5 Evaluation Methods
        2.5.1 Receiver Operating Characteristic
        2.5.2 Mean Squared Error
        2.5.3 R squared
        2.5.4 Loss capture ratio
5 Model Analysis
    5.1 Modeling approach
    5.2 Model performance
        5.2.1 Probability of Charge Off model
        5.2.2 EAC and LGC models
            Exposure At Charge off
            Loss Given Charge off
            Weight of Evidence use in EAC LGC models
        5.2.3 Expected Loss models
            Loss capture evaluation
            Calibration
A Variable description
C Descriptive statistics
E Regression models
    E.1 Exposure at charge off model
    E.2 Loss given charge off model
    E.3 Expected loss model
Bibliography
List of Figures

4.1 Random Forest initial ROC performance on train and validation data
4.2 Random Forest ROC performance of final model using original and WOE transformed data
4.3 1% Data
4.4 10% Data
4.5 100% Data
4.6 Neural Network train and validation AUC with different data set sizes and numbers of hidden nodes
4.7 Neural Network EAC MSE on train and validation data vs number of hidden nodes
4.8 Neural Network LGC MSE on train and validation data vs number of hidden nodes
4.9 Neural Network EL MSE on train and validation data vs number of hidden nodes

List of Tables

4.1 Random Forest probability of charge off top ten most important variables
4.2 Random Forest train and validation performance on different values of minimal node size
4.3 Random Forest train and validation performance on number of variables as split candidates (mTry)
4.4 Random Forest train and validation performance on number of trees in the forest
4.5 Train and validation AUC with different set sizes
4.6 Performance of Support Vector Machines trained with 2800 observations
4.7 Random Forest exposure at charge off top ten most important variables
4.8 Random Forest EAC model parameter search sorted on validation performance
4.9 Neural Network EAC train and validation performance
4.10 Support Vector Machine EAC performance using different kernels
4.11 Random Forest Loss Given Charge off top ten most important variables
4.12 Random Forest Loss Given Charge off model parameter search sorted on validation MSE performance
4.13 Neural Network LGC train and validation performance
4.14 Support Vector Machine LGC performance using different kernels
4.15 Random Forest Expected Loss top ten most important variables
4.16 Random Forest Expected Loss model parameter search sorted on validation MSE performance
4.17 Neural Network EL train and validation performance
Chapter 1
Introduction
The first chapter introduces the research by describing the background and motiva-
tion for the research. The research questions are presented and the methods to find
answers to these questions are introduced.
It is interesting to find out how well some of these algorithms perform in constructing credit risk models on a large data set; however, using these "black boxes" might be far from accepted from a regulatory point of view.
Not only fintech but also big tech companies (Google, Facebook) are potential entrants to the financial industry. These companies have access to personal information that could enable them to make highly accurate credit assessments. The financial institutions still have valuable payment data on their clients, but to remain relevant in their industry they need to explore the methods used by data scientists to provide highly personalized and low-cost financial services.
Where is the added value of machine learning in retail credit risk modeling?
To find added value, this research will focus on modeling approach and model
performance of machine learning and (logistic) regression. The following sub-
questions have been formulated to be able to answer the main question in a
structured way.
1.3 Methodology
The research methods will be structured as follows:
• Literature research
To develop a theoretical framework that will answer subquestions a and b. The
literature research will discuss the following concepts.
• Data
Publicly available credit data from LendingClub.com will be stored in a SQL
database. To create a level playing field between models, a single data trans-
formation is used to create a data set that all algorithms can work with.
1.4 Outline
In Chapter 2, the theoretical framework needed for the research will be described. In Chapter 3 the data set will be introduced, to help the reader understand what data is available for our machine learning retail credit risk research, and descriptive statistics will be provided. In Chapter 4 we apply the algorithms discussed in Chapter 2 and develop machine learning credit scoring models. These models will be analyzed and compared in Chapter 5. In the last chapter, Chapter 6, the conclusions drawn from the research will be presented and suggestions for further research will be discussed.
Chapter 2
Theoretical Framework
The theoretical framework will first discuss the current developments in retail credit risk and briefly explain more about retail credit risk. Then machine learning and the specific algorithms that will be used in the research are explained in more detail. In the last part of the framework, data transformation and model evaluation are addressed.
• Big data
An often heard buzzword is big data, but is it relevant for retail lending credit risk? Technological advances open doors to gather and store data in ways that would have been impossible not too long ago. To assess the creditworthiness of customers, models are created from historical data of consumer and loan characteristics; the model can then translate information about a loan applicant into a credit score. Having more and alternative data sources will give quantitative financial specialists more information about sources of risk in loan applications. However, when dealing with truly big data, traditional retail credit risk modeling might become problematic, for instance if there are more features than we have time to evaluate individually.
• Consumer expectations
The consumer of today has high expectations: waiting for products or services is considered a thing of the past, and everything should be accessible online at any time.
• Fintech
Financial technology startups pop up everywhere, which is problematic for the traditional financial institutions. The fintech companies are able to start their business with state-of-the-art technology and need fewer people to provide the same service more efficiently. These companies thrive on new technology and are passionate about using technological developments to create better or cheaper financial products. An example of such a company is Advice Robo, which provides machine learning credit risk solutions to financial institutions.
The unexpected loss is a measure of what the potential losses can be in a very adverse scenario, in which many more borrowers fail to meet their obligations than they would on average. In a regulatory context the unexpected loss is of interest in order to make sure a financial institution can survive a crisis.
The expected loss of a loan combines the probability that the borrower defaults, the exposure at the moment of default, and the amount that cannot be recovered from the borrower. These three quantities are called probability of default (PD), exposure at default (EAD) and loss given default (LGD):

\[ EL = PD \times EAD \times LGD \]

For example, a loan with a 19% probability of default, 75% of principal outstanding at default, and 92% of that exposure lost would have an expected loss of 0.19 × 0.75 × 0.92 ≈ 13% of principal.
Probability of Default
The foundation for the method to estimate this quantity was laid more than half a century ago. The method is called Logistic Regression and was first introduced in 1958 by David Cox, who studied events with a binary outcome dependent on multiple independent variables (Cox, 1958). The technique is still widely applied in the field of retail credit risk modeling today. The method estimates the coefficients β of the following formula:
\[ \log \frac{F(x)}{1 - F(x)} = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \]
The left-hand side is called the logit function, where F(x) is the probability of default as a function of the risk-driving variables x. The right-hand side of the equation is a linear combination of the different risk-driving variables and their weights. In order for such a model to qualify as a good model, several model assumptions have to be satisfied.
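To make this concrete, below is a minimal sketch in R of estimating such a PD model with the built-in glm function; the data frame and the risk drivers term, dti and annual_inc are hypothetical placeholders, not the final model of this thesis.

set.seed(1)
# Hypothetical loan data: three risk drivers and a binary default indicator
loans <- data.frame(term       = sample(c(36, 60), 1000, replace = TRUE),
                    dti        = runif(1000, 0, 40),
                    annual_inc = rlnorm(1000, meanlog = 11))
loans$default <- rbinom(1000, 1, 0.2)

# Binomial family with logit link: estimates the coefficients beta of the
# linear combination on the right-hand side of the formula above
pd_model <- glm(default ~ term + dti + annual_inc,
                family = binomial(link = "logit"), data = loans)
summary(pd_model)

# F(x): predicted probabilities of default for (new) applicants
pd_hat <- predict(pd_model, newdata = loans, type = "response")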
Exposure At Default
For many products, the amount to which the bank or an investor is exposed in case of a default is known with certainty. However, when dealing with unsecured amortizing loans such as those of Lending Club, the exposure is not known in advance. When the exposure at the time of default is uncertain, there is a need for a model that estimates this quantity. Compared to PD and LGD, the EAD is lagging behind in both industry practice and academic research (Qi, 2009). The most obvious model for amortizing loans would be a linear combination of loan amount and time until maturity; this method cannot be used when the exposure at default must be estimated at loan application time. In that case a regression model is the standard approach.
\[ EL = PC \times EAC \times LGC \]
• Random Forests
An ensemble approach combining many weak models into a strong model.
• Neural Networks
A structure resembling the brain, capable of learning complex relations in data.
These will be compared with the algorithm/technique that is currently most domi-
nant in retail credit risk modeling.
• (Logistic) Regression
Make predictions through a linear combination of input variables.
The Random Forest algorithm (Breiman, 2001a) is an ensemble method where multiple predictors are combined into a single strong predictor. During the construction of a single tree in the forest, nodes are created to split observations; see Figure 2.3 for an example. At each node a preset number of candidate features is chosen from all features. The feature that can realize the best possible split is chosen from this subset, and the process is repeated until the desired terminal node size is reached. Evaluating which feature realizes the best possible split can be performed with various methods, depending on the purpose of the tree (classification or regression); these will be discussed in the next subsections.
• Every tree is grown with a bootstrapped sample from the training data; a bootstrapped sample is drawn with replacement from the original data and contains the same number of observations as the original data set.
• Split decisions in the trees are chosen by evaluating the available features on their split performance. In terms of credit risk, a good split would separate all defaults from non-defaults. The random component is introduced by drawing random subsets of the available feature set.
When the number of features to choose from (mTry) is the same as the total number of features available, thus disregarding the second element of randomness, we are essentially bagging. This technique significantly underperforms Random Forests due to individual tree similarity. On the other hand, when there are 100 features and mTry is one, individual trees become very complex and out-of-sample model performance will be poor.
It is important to realize that during the creation of a tree we start with all observations, and after a split we have two sets of observations. Eventually we end up with n sets containing fewer than m observations, where m is the maximum size of the terminal nodes and n is the number of terminal nodes. When the tree is fully grown, the response is determined in every terminal node, which can be a value, a probability or a class. The prediction that a Random Forest makes is the aggregate of all tree predictions in the forest; all trees in a Random Forest have equal weight.
2. Bootstrap a sample the size of the training set; the observations are drawn with replacement.
6. Repeat steps 2 to 5 until the desired forest size is reached (number of trees in
the forest).
7. For out of sample testing or predicting on new data, drop an observation down
all trees in the forest, and aggregate the response values of the trees in the
forest.
Classification
The Random Forest algorithm for classification predicts the class that is best associated with an observation. An overview of a simple classification tree is given below. As described above, such trees are combined into a forest and the predictions are aggregated. In terms of credit risk, a binary classification tree would use historical data to create trees that separate defaulted observations from non-defaulted observations.
[Example classification tree with splits x1 < 10, x1 < 5 and xi < ai]
Probability Estimation
Probability estimation trees, so-called PETs, are almost identical to classification trees. The only difference is that the response value is a set of probabilities of an observation belonging to each class. Individual trees contain probabilities in their terminal nodes: if, for example, half of the observations in a terminal node belong to class 0 and the other half to class 1, then the class probabilities in that terminal node are 50% and 50%. The class probabilities predicted by a probability estimation forest are the average class probabilities of all trees in the forest. According to Malley et al. (2012), PETs are consistent probability estimators when their classification counterparts are consistent class predictors.
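A minimal sketch in R of a probability estimation forest with the randomForest package; the synthetic data stands in for the loan data, and the parameter values are the package defaults mentioned later in this thesis.

library(randomForest)

set.seed(1)
# Synthetic stand-in for loan data: two risk drivers, binary charge-off label
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$default <- as.factor(rbinom(n, 1, plogis(df$x1 - df$x2)))
train <- df[1:700, ]; validation <- df[701:1000, ]

# A factor response grows classification trees; terminal nodes then hold
# class frequencies, which the forest averages over all trees
rf <- randomForest(default ~ ., data = train,
                   ntree = 500, nodesize = 10)

# type = "prob" returns the averaged class probabilities per observation
p_hat <- predict(rf, newdata = validation, type = "prob")[, "1"]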
Regression
Like a classification tree, a regression tree splits data into subsets until a certain depth of the tree is reached. In this case the terminal node value is the average of the response values in the terminal node. Figure 2.4 shows how this works: the figure represents a regression problem with two independent variables and one response variable. Every split is represented by a line in the scatter plot, and the rectangles are the terminal nodes of the regression tree. Splits are chosen in such a way that the variance inside the child nodes is smallest. The values at the end of the tree correspond to the values inside the rectangles; they are the average of the response variables. The prediction of a regression forest is, like that of a probability estimation forest, the average of all tree predictions.
For a simple Neural Network without hidden layers and a logistic activation func-
tion, we can calculate the output with the following formula:
\[ y = f\left(w_0 + \sum_{i=1}^{n} w_i X_i \right) \]
In the formula, f represents the logistic transformation. As we can see, this is the same as a Logistic Regression; however, Neural Networks become more interesting when hidden layers are introduced. These layers allow for nonlinear combinations and more complex relations in the data to be captured by the Neural Network. In this research we will use "vanilla" Neural Networks (Friedman, Hastie, and Tibshirani, 2001), which are networks with one hidden layer and a logistic activation function.
The 'extra' nodes at the top of the network, which can be seen in Figure 2.5, contain a constant input of one and are called bias nodes. These nodes can be seen as the intercept term in Generalized Linear Models (GLMs), the family to which Logistic Regression belongs.
This simple network has four input neurons, one hidden layer with four neurons and one output neuron. The values on the edges of the network are the weights that the algorithm learns during the training of the network. Unlike in GLMs, it is not easy to observe how a change in one input variable is related to a change in the output, because the change in output can also be related to interaction terms in the network.
Backpropagation
At the beginning of the training of a Neural Network, the weights are chosen at random. Then, using the training data, a forward pass through the network is executed and the prediction error is calculated. This error should be minimized, which is done over a number of iterations of forward and backward passes through the network. A backward pass calculates the derivative of the error with respect to every weight in the network. Before a new iteration, all weights are updated according to a pre-specified learning rate and the derivatives. The training of the network is finished when the weights have converged and the error no longer decreases. A mathematical explanation of the backpropagation algorithm can be found in the book The Elements of Statistical Learning (Friedman, Hastie, and Tibshirani, 2001).
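A minimal sketch in R of training such a "vanilla" single-hidden-layer network with the nnet package; note that nnet optimizes the weights with a quasi-Newton method rather than plain gradient-descent backpropagation, and all data and settings here are illustrative.

library(nnet)

set.seed(1)
# Synthetic binary-response data with an interaction the hidden layer can learn
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(df$x1 * df$x2))

# size = number of hidden nodes; the activation is logistic by default, and
# decay shrinks the weights to counter overfitting
nn <- nnet(y ~ x1 + x2, data = df,
           size = 9, decay = 0.01, maxit = 500, trace = FALSE)

p_hat <- predict(nn, newdata = df)   # outputs in (0, 1) via logistic activation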
Classification
The data in Figure 2.6 (A) can be separated in different ways: all of the green lines separate the data perfectly, yet intuitively none of these hyperplanes feels like the best classifier. In Figure 2.6 (B) we see the optimal classifier; in this case the margins between the observations and the hyperplane are largest. The observations that lie on the margins, and thereby determine the position of the hyperplane, are called the support vectors, hence the name Support Vector Machines. In this simple example the data can be separated; in larger and more complex data sets this will be rare. To solve this problem, two steps can be taken, individually or both at the same time: a transformation can be applied to attempt separating the data in a non-linear feature space, and soft margins can be applied. The non-linear approach is called the kernel trick; different kernels are available to find a feature space where the data can be separated. The second option, a soft margin, allows the model to leave observations in the margin or even on the wrong side of the hyperplane. Through the introduction of slack variables, a Support Vector Machine can have a soft margin.
\[ \max_{\beta_0,\ldots,\beta_p,\;\epsilon_1,\ldots,\epsilon_n} M \]
\[ \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \]
\[ y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \geq M(1 - \epsilon_i) \quad \text{for all } i, \]
\[ \epsilon_i \geq 0, \qquad \sum_{i=1}^{n} \epsilon_i \leq C \]
The slack variables indicate where an observation lies: ε_i = 0 when it is on the correct side of the margin, 0 < ε_i ≤ 1 when it violates the margin, and ε_i > 1 when it is on the wrong side of the hyperplane. The parameter C is a constraint on how big the slack variables can become in total. The problem formulation also multiplies M by (1 − ε_i), which lowers the value of M when observations are on the wrong side. Here y_i represents the class label, which can take a value of −1 or 1. A more detailed explanation of the Support Vector Machine mathematics can be found in the book An Introduction to Statistical Learning (James et al., 2013).
Kernels
The examples in Figures 2.6 and 2.8 are cases of Support Vector Machines with a linear kernel. Other kernels can be used to map the input to a higher-dimensional space in which a linear solution can be found. Kernels have to satisfy Mercer's theorem (Korotkov, 2011); we will not go into this theorem, because it goes too deep for the high-level analysis of machine learning in this research. Figure 2.7 shows what the goal of a kernel is.
The SVM optimization problem can be solved by using only the inner products of all training observations. The inner product of two vectors a and b in an n-dimensional space, denoted \( \langle a, b \rangle \), is obtained as \( \langle a, b \rangle = \sum_{i=1}^{n} a_i b_i \). Kernels are functions that transform these inner products.
• Linear: \( K(a, b) = \langle a, b \rangle \)
• Polynomial: \( K(a, b) = (1 + \langle a, b \rangle)^d \)
• Sigmoid: \( K(a, b) = \tanh(\gamma \langle a, b \rangle + 1) \)
• Radial: \( K(a, b) = e^{-\gamma \lVert a - b \rVert^2} \)
The influence regions of the sigmoid and radial kernels are controlled by the parameter γ; the polynomial kernel can be of different degrees d; and the linear kernel is computed with only the inner product of the vectors belonging to training observations.
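A minimal sketch in R of fitting Support Vector Machines with these kernels through the e1071 package (the R interface to libsvm cited later as Meyer et al., 2017); the data is synthetic and the cost and gamma values are placeholders.

library(e1071)

set.seed(1)
# Synthetic two-class data that is not linearly separable
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- as.factor(ifelse(df$x1^2 + df$x2^2 > 1, 1, -1))

# The same soft-margin problem under the four kernels discussed above;
# cost penalizes slack (related to the budget C), gamma is the kernel parameter
m_lin <- svm(y ~ ., data = df, kernel = "linear",     cost = 1)
m_pol <- svm(y ~ ., data = df, kernel = "polynomial", degree = 2)
m_sig <- svm(y ~ ., data = df, kernel = "sigmoid",    gamma = 0.5)
m_rbf <- svm(y ~ ., data = df, kernel = "radial",     gamma = 0.5,
             probability = TRUE)  # also fit a logistic model on decision values

pred  <- predict(m_rbf, df, probability = TRUE)
p_hat <- attr(pred, "probabilities")[, "1"]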
Probability estimation
To obtain a probability estimate from the binary classification Support Vector Ma-
chine, a logistic model is fitted to the decision values (Meyer et al., 2017). With the
logistic model, the decision values can be transformed to probabilities between zero
and one.
Regression
Similar to classification, the Support Vector Regression algorithm finds margins. The big difference is that the observations should now lie inside the margins of the support vectors. Figure 2.8 is an example of Support Vector Regression. The non-support vectors do not contribute anything to the model; the regression line results from the choice of margin and the support vectors.
A big difference with ordinary least squares regression is that the support vector method uses the shortest distance to the regression line/margins, whereas ordinary least squares regression uses the vertical distance to find the line with the lowest sum of squared errors.
• Missing values
• Outliers
To overcome these problems, we have a few standard options: replacing missing values with the average of the non-missing values, creating binary dummy variables for a categorical field, and disregarding observations that lie more than a number of standard deviations from the mean, to make sure there are no unusually low or high values that have a large impact on the outcome of a model.
Generalized linear models (GLMs) are linear regressions where a link function is used to transform the output so that it has the right characteristics (Nelder and Wedderburn, 1972). When a probability is estimated, the output must be in the interval [0, 1]; to achieve this, a logistic transformation is often used. Another option in this case would be the probit transformation, which uses the cumulative density function of a standard normal distribution and has some advantages when a normal prior distribution is assumed.
When the logistic transformation is chosen, the data can be transformed beforehand by means of Weight of Evidence (WOE). This method uses binned data and the following formula to assign each bin B_i a weight in terms of the log odds of the binary response variable:

\[ WOE_i = \log \frac{P(B_i \mid Y = 1)}{P(B_i \mid Y = 0)} \]
After this transformation, all variables have the property that a higher WOE bin value corresponds to a higher probability of Y = 1. The WOE method treats each level of a categorical field as its own bin, and missing values are handled as a separate level. The latter has the advantage that a field left blank by a consumer can possibly hold more information than it would if the blanks were replaced with the average value for that field.
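A minimal sketch in R of computing the WOE value per bin by hand; the bins and response are synthetic, and "missing" is simply treated as one more bin level, as described above.

set.seed(1)
# Synthetic binned risk driver (with a 'missing' level) and binary response
df <- data.frame(bin = sample(c("low", "mid", "high", "missing"), 1000,
                              replace = TRUE))
df$y <- rbinom(1000, 1,
               c(low = 0.10, mid = 0.20, high = 0.35, missing = 0.25)[df$bin])

# WOE_i = log( P(B_i | Y = 1) / P(B_i | Y = 0) ), one value per bin
tab <- table(df$bin, df$y)
woe <- log((tab[, "1"] / sum(tab[, "1"])) / (tab[, "0"] / sum(tab[, "0"])))

# Replace each bin label by its WOE score
df$bin_woe <- woe[df$bin]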
• Probabilities
Binary classification probability estimates will be evaluated with the receiver
operating characteristic.
• Continuous quantities
These will be evaluated with Mean Squared Error and R squared. Expected
Loss predictions, which are continuous, are also evaluated in terms of ability
to capture observed losses.
Figure 2.9 shows three ROC 'curves': the diagonal line starting from the origin is the performance of a random model, and the other two curves show the performance of a rating model and of a perfect model. Any rating model should operate between the performance of a random model, where AUC = 0.5, and a perfect model, where AUC = 1.
2.5.3 R squared
The R² is the proportion of variance in the response variable that can be explained by the model, calculated with the following formula:
\[ R^2 = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (\bar{y} - y_i)^2} \]
The numerator of the fraction is the sum of squared residuals of predicted minus observed values. The denominator is the sum of squared deviations of the observed values from their average. A perfect model accounts for all variance in the response variable and achieves an R² of 1; a model which accounts for none of the variance achieves an R² of 0.
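For concreteness, a small sketch in R of both error measures on a validation set; the observed and predicted vectors are placeholders.

# Placeholder observed values and model predictions on a validation set
y     <- c(0.10, 0.40, 0.25, 0.80)
y_hat <- c(0.15, 0.35, 0.30, 0.70)

mse <- mean((y_hat - y)^2)                              # Mean Squared Error
r2  <- 1 - sum((y_hat - y)^2) / sum((mean(y) - y)^2)    # R squared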
Chapter 3
Lending Club data set
This chapter describes the data set that is used in the research. First we introduce Lending Club, the company that has made the data available. We then discuss what data is available and how we prepare and create the different data sets that will be used to create and analyze the credit risk models.
The figure shows that more than half of the loans issued by Lending Club are related to other debt that the customers already have. To give an overview of the purposes that borrowers can choose from, the purposes are listed below.
The investors on the other side of the platform are people that want to have better
returns than they can find in more traditional investments like the stock market or
on deposit accounts.
"Don’t take our word for it. See for yourself. Our entire loan database is available to
download. Help yourself to our data, and slice and dice it anyway you want. Try that at
your favorite banking institution!"
-Lendingclub.com
They are encouraged by Lending Club to explore the data in order to find characteristics of borrowers that suit their investment strategy. To help investors who do not want to go through the extensive amount of data available, Lending Club provides risk grades per loan, which are also translated into interest rates on the loans.
Figure 3.2: Lending Club grade mix over time (LendingClub, 2016)
Figure 3.3: Lending Club annualized net returns per risk grade (LendingClub, 2016)
There are seven risk grades and each grade has five sub grades, resulting in 35 distinct risk levels that can be assigned to a loan. Figure 3.2 provides the distribution of grades assigned to loans over the past years; notice that around 70% of all loans are in the top three risk grades, indicating that Lending Club accepts relatively safe borrowers more often than riskier borrowers. Figure 3.3 gives an overview of the adjusted net annualized returns per sub grade assigned by Lending Club, where the adjustment is for expected future losses.
File                       Loans
2007-2011_LoanStats3a       39,786
2012-2013_LoanStats3b      188,181
2014_LoanStats3c           235,629
2015_LoanStats3d           421,095
2016Q1_LoanStats_2016Q1    133,887
2016Q2_LoanStats_2016Q2     97,854
Total                    1,116,432
The sizes of the different Lending Club files show the growth that the platform has gone through since its origination. Currently more than 400 thousand loans are funded every year, making increasingly more data available for credit risk research. Combining these files results in a database of 1,116,432 loans with 111 columns that contain information about the loans. In this data set we observe loans with different statuses, shown in Table 3.2:
There are almost no loans labeled 'Default' in the data, and more than half of the loans have status 'Current'. We will proceed with loans that are either fully paid or charged off: we are interested in supervised learning, and therefore only keep loans that have matured. Loans that are in a grace period, late or in default reside in non-absorbing states and are not taken into account. Had there been a more significant number of defaults, we could have taken them into account in predictive modeling; their near absence is a result of Lending Club's policy to charge off loans very quickly instead of managing a portfolio of defaulted loans for its investors.
The data set after removal of loans not in absorbing states is shown in Table 3.3:
The payment history data set contains a valuable piece of American credit information called the FICO score, a credit rating for consumers. Lending Club currently does not provide the FICO score directly with the other loan information; they have changed their open and transparent strategy by making it harder for investors and researchers to obtain all data. On the other hand, Lending Club has also added data fields over the past years, some at the request of the community. Coming back to the FICO score, this data is added to the matured loan information by joining on loan id in an SQL database.
The table shows a few interesting things. The first is that the amount funded is not always the same as the amount funded by investors; there are even loans with an amount invested by investors of zero. Lending Club sometimes invests in loans itself to get them funded, resulting in this difference. Another curious thing is that the data set includes loans provided to individuals that have no income, and one individual with an annual income of 8.9 million. With a maximum loan amount of 40 thousand it seems highly unlikely that someone earning multiple millions per year takes out a loan for a few thousand dollars. These peculiarities are smoothed away by the WOE transformation described in Section 2.4.
To provide an overview of the credit risk associated with investing in the different Lending Club sub grades, we have calculated the historical averages of interest received, charge-off rate, exposure at charge-off (EAC), loss given charge-off (LGC) and percentage of principal lost. These can be found in Table 3.5.
Sub grade  Interest  Charge-off rate  EAC  LGC  % of principal lost
A1 6.19% 2.89% 84.97% 91.90% 2.25%
A2 6.98% 4.52% 75.73% 91.96% 3.15%
A3 8.08% 5.54% 68.81% 92.45% 3.52%
A4 8.58% 7.01% 73.80% 92.22% 4.77%
A5 9.32% 8.54% 74.86% 92.64% 5.92%
B1 10.55% 9.56% 81.57% 92.41% 7.20%
B2 11.85% 10.50% 83.54% 92.19% 8.09%
B3 13.21% 12.21% 66.70% 92.59% 7.54%
B4 13.73% 13.52% 76.99% 92.18% 9.59%
B5 13.59% 15.52% 76.09% 92.26% 10.90%
C1 14.22% 17.35% 79.30% 92.48% 12.72%
C2 14.77% 18.95% 81.47% 92.57% 14.29%
C3 15.02% 21.13% 72.01% 92.33% 14.05%
C4 15.41% 22.96% 70.32% 92.15% 14.88%
C5 16.22% 24.13% 78.15% 92.29% 17.40%
D1 16.66% 25.72% 62.04% 92.81% 14.81%
D2 17.84% 26.95% 75.39% 92.17% 18.72%
D3 18.11% 27.31% 64.57% 92.31% 16.28%
D4 18.51% 30.07% 84.22% 92.09% 23.32%
D5 19.10% 30.95% 85.24% 91.80% 24.21%
E1 18.57% 33.40% 84.10% 91.76% 25.78%
E2 19.66% 35.26% 65.36% 92.13% 21.23%
E3 19.75% 36.52% 87.13% 92.63% 29.48%
E4 20.91% 37.98% 64.14% 92.24% 22.47%
E5 21.09% 39.15% 86.63% 91.44% 31.01%
F1 22.45% 38.42% 63.30% 94.19% 22.91%
F2 22.65% 41.53% 63.23% 92.40% 24.27%
F3 23.56% 42.97% 64.60% 92.80% 25.76%
F4 23.11% 45.33% 83.71% 92.43% 35.08%
F5 23.08% 46.98% 86.53% 92.10% 37.43%
G1 23.71% 46.02% 62.27% 92.88% 26.62%
G3 23.73% 48.75% 90.81% 91.95% 40.71%
G4 24.45% 40.37% 89.06% 92.53% 33.27%
G5 21.12% 49.69% 88.12% 92.07% 40.31%
Table 3.5: Overview of credit risk per Lending Club sub grade
The grades below C have a high average percentage of principal lost; however, in Figure 3.3 Lending Club reports a return on investment of about 7% for all grades, except for the highest grade where the return on investment is 5.12% on average. These high losses can be covered by the interest rates, because the interest rates in the table are annualized, so the total average interest earned is higher than the average percentage of principal lost. Furthermore, we see in Table 3.5 that the charge-off rates increase with the sub grades, that the average exposure at charge-off fluctuates between 60% and 90%, and that the loss given charge-off is quite stable across the different sub grades with an average of 92.3%. We can already see that the impact of a loss given charge-off model over simply taking the average will be limited, because most charged-off loans have a loss given charge-off close to 92.3%.
To handle categorical data we could create dummy variables; unfortunately there are a lot of categories, and transforming all categorical variables to dummy variables blows up the number of variables in the data set. Alternatively, the Weight of Evidence method described in Chapter 2 is used. This method is also applied to numerical data by binning these variables. The R "Information" package (Kim, 2016) calculates the WOE scores. The transformation is applied after the data is split into different sets for training a model, validating the model and later testing the model; this will be motivated in Section 3.6. When the data is eventually transformed into WOE scores, we have to be careful with direct interpretation of the scores, because a high score on a variable can be the result of correlation with another variable. In short, conditional independence should be satisfied before drawing conclusions about the WOE score of a bin or category.
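A sketch of how the package can be used, assuming its create_infotables interface; the data frame, the 0/1 response column charged_off and the variable annual_inc are placeholders.

library(Information)

set.seed(1)
# Placeholder for the loan data: one numeric driver and a 0/1 response
train <- data.frame(annual_inc = rlnorm(5000, meanlog = 11))
train$charged_off <- rbinom(5000, 1, 0.2)

# create_infotables bins each variable and reports WOE and information value
it <- create_infotables(data = train, y = "charged_off", bins = 10)
it$Tables$annual_inc   # bins with their WOE scores

# WOE values derived from the train set can then be used to recode the
# train, validation and test sets consistently, as motivated in Section 3.6.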
Like the original features in the data set, they seem to have little predictive power on their own. However, the added features could prove valuable in nonlinear models.
Chapter 4
Machine Learning Retail Credit Risk models
In this chapter the train and validation data sets will be used to create models for the probability of charge off, exposure at charge off and loss given charge off. The expected loss will also be modeled directly, besides modeling the separate quantities of which the expected loss consists. The algorithms discussed in Chapter 2 will be used with the train and validation data described in Chapter 3. Optimal models per quantity and algorithm are found by building models on the train data under different parameter settings and selecting the model with the best performance on the validation set. This approach prevents overfitting to the training data by looking at performance on new data.
The baseline that we set for the models is to predict the average charge-off rate observed in the entire train set, which is 19.149%, as the probability of charge off for the test observations. The AUC of this model is 0.5, which corresponds to having a model with zero predictive power.
First we deal with finding a reasonable number of variables to include in the model, by trying different thresholds on the correlation with the target variable. When the threshold is set at 0.07, we have 20 variables left from which we can fit models. All possible models with one, two, three, four and five variables are fitted to the data, 21,699 models in total.
The AIC measures how good a model is relative to another model created in the same environment. Models are punished for unnecessary complexity, using the following formula:

\[ AIC = 2k - 2\ln(L) \]

where k represents the number of estimated parameters and L represents the maximized value of the model likelihood. The value of the AIC itself has no meaning, but when we have models that are created in the same environment, we can select the best model by picking the one with the lowest AIC.
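In R this comparison is immediate, as the sketch below shows for two placeholder logistic regressions fitted on the same data.

set.seed(1)
# Placeholder training data with three candidate risk drivers
train <- data.frame(term       = sample(c(36, 60), 500, replace = TRUE),
                    annual_inc = rlnorm(500, meanlog = 11),
                    dti        = runif(500, 0, 40))
train$default <- rbinom(500, 1, 0.2)

m1 <- glm(default ~ term + annual_inc,       family = binomial, data = train)
m2 <- glm(default ~ term + annual_inc + dti, family = binomial, data = train)

AIC(m1, m2)   # models from the same environment: the lower AIC is preferred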
Model 1 summary:

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6710  -0.6811  -0.5370  -0.3555   2.9054

Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -1.444171    0.005093  -283.58    <2e-16 ***
term             1.062204    0.011838    89.72    <2e-16 ***
APPL_FICO_BAND   0.939969    0.014545    64.63    <2e-16 ***
dti              0.673021    0.016208    41.52    <2e-16 ***
zip_code         0.851267    0.021584    39.44    <2e-16 ***
annual_inc       0.948192    0.024528    38.66    <2e-16 ***

Number of Fisher Scoring iterations: 5
AUC: 0.6802947
The selected model has five significant variables, meaning that for each variable the hypothesis that its coefficient is zero is rejected.
Model 2 summary:

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6425  -0.6818  -0.5272  -0.3404   2.9740

Coefficients:
                       Estimate  Std. Error   z value  Pr(>|z|)
(Intercept)           -1.444828    0.005126  -281.846   < 2e-16 ***
term                   1.028194    0.012296    83.617   < 2e-16 ***
APPL_FICO_BAND         0.702271    0.017536    40.047   < 2e-16 ***
dti                    0.475499    0.017889    26.580   < 2e-16 ***
bc_open_to_buy         0.814874    0.054250    15.021   < 2e-16 ***
percent_bc_gt_75       0.137116    0.039355     3.484  0.000494 ***
avg_cur_bal            0.389631    0.046667     8.349   < 2e-16 ***
zip_code               0.848744    0.021732    39.055   < 2e-16 ***
bc_util               -0.516950    0.050701   -10.196   < 2e-16 ***
tot_cur_bal           -0.147237    0.048624    -3.028  0.002461 **
annual_inc             0.941332    0.028286    33.279   < 2e-16 ***
verification_status    0.486283    0.024300    20.011   < 2e-16 ***
acc_open_past_24mths   0.504892    0.032761    15.412   < 2e-16 ***
revol_util             0.305937    0.036711     8.334   < 2e-16 ***
mort_acc               0.539045    0.032901    16.384   < 2e-16 ***
total_bc_limit        -0.689281    0.058266   -11.830   < 2e-16 ***
mo_sin_rcnt_tl         0.328387    0.036420     9.017   < 2e-16 ***
num_actv_rev_tl       -0.229393    0.037312    -6.148  7.85e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of Fisher Scoring iterations: 5
AUC: 0.6910932
The second model has a lower AIC and a higher AUC, and is therefore, from an econometric perspective, the preferable model. The models were fitted with 280,000 observations, and the AUC was calculated on a validation set of 120,000 observations. This model is sensitive to the amount of data used: when the model is fitted with 28,000 loans, the AUC on the validation set drops to 0.67 and four variable coefficients become statistically insignificant.
Table 4.1: Random Forest probability of charge off top ten most important variables
We can see that there is a clear winner among the available variables: the zip code. It is almost twice as important as the second most important variable. We will keep it in the model in order to achieve high performance; from an ethical point of view, however, the use of this variable is of course questionable.
A standard Random Forest creates five hundred trees, randomly selects the rounded square root of the number of available variables as candidates for each split, and sets a minimal node size of 10 in case of probability estimation. These settings, and the amount of data used to train the model, have to be optimized in terms of the area under the receiver operating characteristic curve.
Adding more data to the model will, to a certain extent, improve the model, but is also expensive in computing time. Therefore we start with a reasonable amount of data to get a sense of the behavior of the forest under different parameter settings; when we have found good parameters, we will improve performance by adding data. Ten thousand observations is a reasonable amount to create models; this corresponds to 4,286 observations in the validation set, keeping the 70/30 percent fractions intact. The standard Random Forest model achieves a performance on the validation data of 0.65614 AUC, while the performance on the training data is 0.99999 AUC. These AUCs correspond to the ROC plots in the figure below.
Figure 4.1: Random Forest initial ROC performance on train and validation data
Now we have to test different values of the training set size, the number of variables randomly selected as split candidates, the terminal node size and the number of trees to put in the forest.
From Table 4.2 we conclude that 2% of the train set size is a good setting for the depth parameter. The number of variables to draw as split candidates will be tested on a range close to the default setting, the rounded square root of the number of variables.
In Table 4.3 we observe that lowering the number of candidate variables increases the AUC on the validation set. We stated earlier that a higher value allows for a less complex forest and should reduce the difference between the performance on train and validation data. The higher performance with a lower mTry can be explained by the fact that a large number of candidates will make the model choose the best variable, in terms of train data, to make the split. Having less freedom in picking the split variable makes the forest better able to generalize, which is reflected in the performance on the validation data.
Table 4.4 shows that the optimal size of the forest is 700 trees.
Final forest
Running the Random Forest on the entire train data set, with the parameters found on the smaller set, an AUC of 0.7129 is achieved. Adding data to the forest produces a substantial performance gain, because the larger set is a better representation of the validation data. By repeating the parameter search that was performed with less data on the larger data set, other parameters are found that produce an even higher AUC: when a Random Forest is trained on 280,000 observations with 1,000 trees, mTry 7 and node size 1,000, we achieve the highest AUC, which is 0.7231. From this we can conclude that training a forest with more data requires re-evaluation of the parameters.
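As a sketch, the final forest described above corresponds to a randomForest call along the following lines; the data frame here is a small synthetic placeholder for the 280,000-observation WOE-transformed train set.

library(randomForest)

set.seed(1)
# Small synthetic placeholder for the WOE-transformed train set
train <- data.frame(matrix(rnorm(5000 * 10), ncol = 10))
train$charged_off <- as.factor(rbinom(5000, 1, 0.19))

# The re-evaluated parameters reported above: 1,000 trees, mTry 7 and a
# minimal node size of 1,000
rf_final <- randomForest(charged_off ~ ., data = train,
                         ntree = 1000, mtry = 7, nodesize = 1000)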
For comparison, we also run a Random Forest without the Weight of Evidence transformation, using the same model parameters from the re-evaluation on the larger data set. This model achieves an area under the receiver operating characteristic of 0.7247 on the validation data.
Figure 4.2 shows that on the validation set the model using the original data performs similarly to the model using the Weight of Evidence data; however, the better performance on the train set might indicate that there is more room for improvement for the model using the original data. The highest AUC achieved by the WOE-based forest is 0.7232, while the original data model achieved a higher AUC of 0.7244.
When we include all features and increase the number of neurons in the hidden layer, we gradually introduce more complexity into the model. With a small amount of data, an overfit will be produced quickly, without adding many neurons in the hidden layer. This gradual addition of complexity will be carried out with 1, 10 and 100 percent of the data; the following plots and Table 4.5 present the results of this process.
Figure 4.6: Neural Network train and validation AUC with different data set sizes and numbers of hidden nodes
Table 4.5: Train and validation AUC with different set sizes
The amount of complexity needed in the model for optimal performance on the validation set increases with the size of the data set. In Table 4.5 we see that the smallest train/validation set requires four hidden nodes, achieving 0.62 AUC; with six hidden nodes an optimal validation AUC of 0.69 is achieved for the medium data set size; and the Neural Network with 100 percent of the train data is able to achieve an AUC of 0.72 with 9 hidden nodes.
The probability of charge off Neural Network is very sensitive to the amount of data and the number of hidden nodes in the network. The best model is a Neural Network with 87 input nodes, 9 hidden nodes and one output node, resulting in a validation set AUC of 0.7208.
The parameter search is performed on a small data set. Optimizing on more data proves to be very time consuming: because the Support Vector Machine needs to perform calculations on every pair of observations, the number of calculations grows very fast when data is added.
Support Vector Machines are said to be an excellent model choice when the data set suffers from the curse of dimensionality (having a lot of variables and relatively few observations). Clearly, with 87 variables and 400k observations, the Lending Club data set does not suffer from this curse, leading us to expect inferior behavior of Support Vector Machines; however, they might be able to show good results when trained with a small number of observations.
The Support Vector Machine with the radial basis function kernel performs best under standard model settings; finding the optimal settings for this kernel is therefore most promising.
Gamma controls how much influence the support vectors have: when gamma is low, data points have a large influence region, which can lead to an underfit; a high gamma, on the other hand, limits the region of influence of the support vectors and leads to overfitting on the training data. A gamma of 1/18 is found to be optimal for the validation data set.
Training a Support Vector Machine with radial kernel and gamma 1/18 on 28,000 observations results in 0.9947 and 0.6506 AUC on the train and validation data respectively.
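A sketch of how such a gamma search can be organized with e1071's tuning helper; the data set and the candidate values around 1/18 are placeholders.

library(e1071)

set.seed(1)
# Synthetic placeholder training set
df <- data.frame(matrix(rnorm(1000 * 18), ncol = 18))
df$y <- as.factor(rbinom(1000, 1, 0.19))

# Cross-validated grid search over gamma; the radial basis function kernel
# is svm()'s default kernel
tuned <- tune.svm(y ~ ., data = df, gamma = c(1/36, 1/18, 1/9))
tuned$best.parameters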
Training the Support Vector Machine with 2,800 observations takes 15 seconds; training the same model with 28,000 observations costs 5,644 seconds. The training time of an SVM with radial basis function kernel on 280,000 observations is estimated to be larger than 5644/15 = 376.27 times 1.5 hours, under the assumption that training time grows linearly with the data set size. This assumption gives us a lower bound of roughly 564 hours, because in reality the training time grows much faster than linearly with the data set size.
As a baseline for this model, we take the average of the train data and predict this value for the validation data, resulting in a mean squared error on the validation set of:

base MSE = 0.03955
The exposure at charge off must be modeled with data from charged-off loans. The same initial split of 280,000/120,000 is made; after splitting the data, the charged-off loans are selected, resulting in 53,618/22,825 train/validation observations, which is still approximately a 70/30 percent split.
4.2.1 Regression
Using stepwise regression with backwards elimination that minimizes the AIC, we find a model with 28 variables. Intercorrelated variables were removed; of each correlated pair, the variable with the highest correlation to the response variable was kept in the initial regression model. The model mean squared error is 0.02778 with an R² of 0.29770, both calculated on the validation data. The model summary can be found in Appendix E.1.
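A sketch of this stepwise backwards elimination with R's built-in step function, which drops terms as long as the AIC decreases; the data is a synthetic placeholder.

set.seed(1)
# Synthetic placeholder: six candidate drivers, two of which matter
df <- data.frame(matrix(rnorm(500 * 6), ncol = 6))
df$eac <- 0.5 * df$X1 - 0.3 * df$X2 + rnorm(500, sd = 0.1)

full <- lm(eac ~ ., data = df)              # start from all candidate variables
best <- step(full, direction = "backward")  # eliminate while the AIC improves
summary(best)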
Table 4.7: Random Forest exposure at charge off top ten most important variables
Through a parameter search, presented in Table 4.8, we find the best parameters of the forest to be: 20 randomly drawn variables as split candidates for each split, splitting of nodes that contain more than 25 observations, and a forest of 1,200 regression trees.
mTry Train MSE Val MSE nTree Train MSE Val MSE Node Size Train MSE Val MSE
20 0.004904 0.026622 1200 0.006075 0.026822 25 0.015001 0.026787
28 0.004725 0.026636 1400 0.006095 0.026841 40 0.018066 0.026790
30 0.004697 0.026642 1000 0.006077 0.026848 44 0.018649 0.026800
22 0.004846 0.026648 1100 0.006094 0.026849 22 0.014108 0.026806
24 0.004798 0.026653 1500 0.006073 0.026851 27 0.015521 0.026809
29 0.004714 0.026654 900 0.006085 0.026851 31 0.016440 0.026811
26 0.004759 0.026656 1000 0.006082 0.026855 43 0.018500 0.026813
23 0.004821 0.026664 1300 0.006069 0.026865 29 0.015998 0.026815
18 0.004975 0.026669 800 0.006079 0.026865 41 0.018214 0.026816
21 0.004874 0.026670 600 0.006088 0.026877 33 0.016858 0.026818
Some parameters are missing from the table, which only shows the top ten sorted on validation MSE; Table D.5 in the appendix gives a complete overview of the performance for all parameters that were tested.
The exposure at charge off Random Forest model achieves 0.012972 train MSE and 0.026565 validation MSE; the R² on the validation set is 0.328295. A Random Forest with the unprocessed data results in a validation MSE of 0.026509 with an R² of 0.329690.
Figure 4.7: Neural Network EAC MSE on train and validation data vs number of hidden nodes
Nodes Train MSE Val MSE Nodes Train MSE Val MSE
1 0.027943 0.027614 11 0.027008 0.027109
2 0.027397 0.027208 12 0.026857 0.026999
3 0.027738 0.027580 13 0.026748 0.026850
4 0.027359 0.027240 14 0.027145 0.027145
5 0.027013 0.026903 15 0.026815 0.026929
6 0.027156 0.027089 16 0.026794 0.026889
7 0.027234 0.027179 17 0.026784 0.026888
8 0.027166 0.027157 18 0.026973 0.027186
9 0.026979 0.027070 19 0.027247 0.027481
10 0.026988 0.027084 20 0.026878 0.026999
The Neural Network with 13 nodes in the hidden layer results in the lowest mean squared error on the validation set: the error is 0.02685 with an R² of 0.3188852.
The results in Table 4.10 were obtained with the standard parameters of the models. In case of the polynomial kernel (degree 2), increasing or decreasing the degree makes the performance of the model worse; the linear kernel has no parameter to improve; and the sigmoid kernel shows very poor performance on this data set. We continue with the most promising model, the Support Vector Machine with the radial basis function kernel. The best gamma parameter in terms of MSE on the validation data set is 1/350.
On the large data set, the RBF Support Vector Machine achieves 0.02694668 train MSE, 0.0271566 validation MSE and an R² of 0.3133397 on the validation set.
base MSE = 0.00870
For modeling the loss given charge off we use only charged-off loans, as in the previous section.
4.3.1 Regression
From the 87 variables we keep 37 that are not highly correlated with each other. The variables that were removed were less correlated with the response variable than the variables that were kept in the model. From these 37 variables we remove the insignificant ones one by one, minimizing the AIC. The model with the lowest AIC has 25 variables; the model summary is added in Appendix E.2. This results in an MSE of 0.008262 and an R² of 0.049975 on the validation set.
Table 4.11: Random Forest Loss Given Charge off top ten most important variables
In Table 4.12 the top parameters of a parameter search are presented. The entire
parameter search is specified in Table D.6 in the appendix.
mTry Train MSE Val MSE nTree Train MSE Val MSE Node Size Train MSE Val MSE
9 0.002287 0.008203 1400 0.002290 0.008198 49 0.006367 0.008176
8 0.002406 0.008203 1200 0.002284 0.008199 47 0.006301 0.008177
11 0.002110 0.008208 600 0.002301 0.008200 48 0.006333 0.008177
10 0.002192 0.008209 1100 0.002289 0.008201 37 0.005884 0.008179
5 0.003042 0.008209 1300 0.002286 0.008201 44 0.006185 0.008179
7 0.002597 0.008209 900 0.002298 0.008202 42 0.006105 0.008180
14 0.001975 0.008210 1500 0.002289 0.008202 50 0.006401 0.008180
6 0.002802 0.008211 1000 0.002283 0.008206 39 0.005973 0.008180
13 0.001997 0.008212 800 0.002292 0.008206 36 0.005820 0.008180
3 0.004258 0.008215 500 0.002296 0.008207 41 0.006071 0.008181
TABLE 4.12: Random Forest Loss Given Charge off model parameter
search sorted on validation MSE performance
A Random Forest with minimal node size 49, mTry 9 and 1400 trees is chosen as the final model. It achieves a validation MSE of 0.00817 with an R2 of 0.05989. When we use the unprocessed data, the Random Forest achieves the same validation MSE of 0.00817 with an R2 of 0.06015.
FIGURE 4.8: Neural Network LGC MSE on train and validation data vs number of hidden nodes
Nodes Train MSE Test MSE Nodes Train MSE Test MSE
1 0.008831 0.008282 16 0.008747 0.008288
2 0.008913 0.008369 17 0.008738 0.008266
3 0.009082 0.008574 18 0.008795 0.008343
4 0.008943 0.008470 19 0.008819 0.008342
5 0.009075 0.008584 20 0.008711 0.008277
6 0.008850 0.008356 21 0.008768 0.008271
7 0.008845 0.008371 22 0.008807 0.008317
8 0.008809 0.008319 23 0.008854 0.008350
9 0.008872 0.008376 24 0.008755 0.008284
10 0.008742 0.008283 25 0.008783 0.008337
11 0.008869 0.008387 26 0.008728 0.008281
12 0.008841 0.008398 27 0.008767 0.008277
13 0.008793 0.008315 28 0.008738 0.008240
14 0.008837 0.008382 29 0.008729 0.008288
15 0.008751 0.008285 30 0.008739 0.008265
A network with 28 nodes in the hidden layer gives the best performance: an MSE of 0.00824 and an R2 of 0.04550 are achieved with the single-layer Neural Network.
The values in the table are calculated with the standard parameters of the models. In the case of the polynomial kernel, decreasing the degree to two makes the performance of the model slightly better. However, this is still not close to the performance of the radial basis function kernel.
The best gamma for the RBF kernel is 1/600. A gamma this low leads to large influence regions for the potential support vectors, which would in general result in underfitting; however, it is the best parameter on the validation data. Running this model on the large data set results in a validation MSE of 0.00898 and an R2 of -0.03267.
This model shows good performance in terms of MSE, yet the negative R2 contradicts good performance: a negative R2 means that this model does worse than the baseline average prediction, which has an MSE of 0.00870.
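This follows directly from the definition of R2 on the validation set, with the baseline average prediction written as \(\bar{y}\):

\[
R^2 \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
     \;=\; 1 - \frac{\text{MSE}_{\text{model}}}{\text{MSE}_{\text{baseline}}}
\]

With the reported values, \(1 - 0.00898/0.00870 \approx -0.032\), matching the -0.03267 above up to rounding of the MSE values.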
The average loss of principal across the train data is 13.219%. If we predict this value for all validation observations, the resulting mean squared error is 0.08149.
4.4.1 Regression
After removing correlated variables, a stepwise regression model with backward elimination is created. The resulting model with the lowest AIC, summarized in Table E.3, consists of 33 independent variables; its performance on the validation data is an MSE of 0.07276 with an R2 of 0.10712.
TABLE 4.15: Random Forest Expected Loss top ten most important
variables
Because the size of the train set is larger, we take steps of 10 in the search for the best
minimal node size. Memory limitations prevented the training of forests larger than
900 trees.
mTry Train MSE Val MSE nTree Train MSE Val MSE Node Size Train MSE Val MSE
8 0.017405 0.071639 800 0.016767 0.071558 40 0.047228 0.071421
12 0.015657 0.071672 700 0.016786 0.071606 60 0.052848 0.071429
11 0.015946 0.071673 900 0.016749 0.071618 50 0.050439 0.071432
10 0.016312 0.071674 500 0.016789 0.071632 80 0.056276 0.071457
9 0.016790 0.071681 600 0.016771 0.071650 70 0.054765 0.071470
13 0.015410 0.071683 400 0.016831 0.071712 30 0.042813 0.071472
7 0.018226 0.071705 300 0.016831 0.071720 90 0.057576 0.071481
14 0.015239 0.071718 200 0.016878 0.071881 20 0.036165 0.071483
6 0.019430 0.071727 100 0.017119 0.072140 100 0.058661 0.071497
15 0.015056 0.071764 10 0.025125 0.071579
The best performing parameters, shown in Table 4.16, are a minimal node size of 40, mTry 8 and 800 trees. This Random Forest achieves a validation MSE of 0.071451 and an R2 of 0.123163. Using the original data, a Random Forest with the same parameters obtains an MSE of 0.071419 and an R2 of 0.123556.
Nodes Train MSE Val MSE Nodes Train MSE Val MSE
1 0.071397 0.071959 16 0.069788 0.070959
2 0.071133 0.071794 17 0.069737 0.071009
3 0.070748 0.071341 18 0.069742 0.071008
4 0.070489 0.071187 19 0.069582 0.070872
5 0.070384 0.071204 20 0.069674 0.071034
6 0.070296 0.071132 21 0.069667 0.070947
7 0.070175 0.071010 22 0.069703 0.071117
8 0.070202 0.071183 23 0.069656 0.071012
9 0.069980 0.070991 24 0.069569 0.070945
10 0.070030 0.071140 25 0.069857 0.071243
11 0.069844 0.070947 26 0.069604 0.070958
12 0.069846 0.071031 27 0.069574 0.070870
13 0.069855 0.071017 28 0.069517 0.070989
14 0.069898 0.071084 29 0.070074 0.071402
15 0.069701 0.070968 30 0.069569 0.071056
The best Neural Network for expected loss has 27 nodes in the hidden layer. This model achieves a validation MSE of 0.070939 and an R2 of 0.129425.
Chapter 5
Model Analysis
Chapter 5 evaluates and compares both the modeling approach and the model performance of (logistic) regression and the machine learning algorithms Random Forests, Neural Networks and Support Vector Machines. The differences in modeling approach are addressed first; this contributes to answering subquestion c. The next and last subquestion, about the added value of machine learning in terms of model performance, is analyzed on prediction accuracy and on calibration.
• (Logistic) Regression
During the modeling of the probability of charge off with Logistic Regression and of the continuous quantities with regression, we had to carefully select variables to include in the models, the main concern being that correlated variables produce unreliable models due to the independence assumption of regression models. The next step was to choose which of the correlated variables to keep in the model. Trying all possible combinations and subsets of combinations of uncorrelated variables was not possible because of the number of available variables. Our approach was to keep the variables with the highest correlation to the quantity that the model should predict. This step has led to a loss of potentially informative variables. Our following step was to use the AIC to select the best model from all created regression models. This is an often used statistical method to select the best regression model, but it can easily be misused when it is used to compare models that are created in different data environments. When the AIC is used in a correct manner, it shows which model has the best trade off between complexity and performance. The model with the lowest AIC value is then selected as the final model.
• Machine Learning
During the parameter searches for the machine learning models we found that parameters that work well on a small data set are not the best parameters when more data is used. For the Support Vector Machines, however, we still had to apply the initial approach, because finding optimal parameters on large data sets is too time consuming. Parameters for the other models created during this research were searched using the largest available data sets. The optimal parameters are the parameters that perform best when the models make predictions on new data. We have seen in Chapter 4 that machine learning algorithms are in most cases able to create models that fit very well to the training data set but that the same model can perform very poorly on new data. When the parameters that produce optimal out of sample results are found, we are finished and have a final model.
The machine learning approach saves time in the early stages of modeling, because there is no independence assumption that requires carefully disregarding parts of the data set. During the selection of the best parameters for a machine learning model, however, a lot of time is consumed when large data sets are used to train the models. Stepwise regression is a faster approach to finding the optimal model. On the other hand, a thorough approach to finding the optimal regression model would be to try all possible combinations of model variables, which becomes problematic with the number of variables used in this research. Another difference between regression and machine learning is that model selection based on AIC uses the model likelihood on training data, whereas the machine learning models are selected by looking at performance on data that was not used during the training of the model.
A big advantage of the Random Forest algorithm is that it did not require the transformation of the data to a scaled and numeric data set. The algorithm is even insensitive to outliers present in the data, whereas the regression and other machine learning approaches can suffer from the presence of such outliers in data sets. Another strong point of the Random Forest models is the ability to show which variables are the most important during the training of the Random Forest. When scaled data is used, regression model coefficients also show which variable contributes the most to the prediction of the model. Neural Networks and Support Vector Machines do not have such simple and intuitive ways of showing which variables are important for the credit risk predictions.
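As an illustration, variable importance can be read directly from a fitted forest. A minimal sketch with the ranger package (Wright and Ziegler, 2015); the frame and response names are assumptions rather than the exact identifiers used in this research:

library(ranger)

# Fit a charge off forest and record impurity-based variable importance.
rf <- ranger(charged_off ~ ., data = lending_df.train,
             importance = "impurity", num.trees = 500)

# The ten variables that contributed most during training.
head(sort(rf$variable.importance, decreasing = TRUE), 10)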
Model AUC
Baseline 0.5
Logistic Regression 0.68366
Random Forest 0.71782
Neural Network 0.71747
Support Vector Machine* 0.64942
First of all we notice that the Random Forest and the Neural Network clearly outperform the other models; secondly, these two models are very close both in terms of ROC curve shape and AUC value. Of these two models, the probability that the Random Forest assigns a higher probability of charge off to an observation from the charged off population than to an observation from the fully paid population is 0.035 percentage points higher.
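This pairwise interpretation of the AUC can be checked directly from the predictions. A minimal sketch, where pred.pc and charged.off are assumed names for the test set predicted probabilities and the observed 0/1 outcomes:

# AUC as the probability that a randomly drawn charged off loan receives a
# higher predicted probability of charge off than a fully paid one.
pos <- pred.pc[charged.off == 1]
neg <- pred.pc[charged.off == 0]
# Ties count for one half; for large test sets a rank-based version is faster.
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))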
The Random Forest on the original data achieves 0.72053 AUC on the test data, with the settings that were optimal for the Weight of Evidence forest on the validation data. In Section 4.1.2 we have seen that the gap between train and validation ROC is larger for the untransformed data than for the transformed data. The performance loss of the WOE forest comes from restricting the Random Forest in its own decision making through binned data.
*Due to its computational expense, the Support Vector Machine has not been run with the same data set size as was used for the other models.
When the probability of charge off predictions of all models for the 11471 test observations are ordered on predicted probability of charge off, we can evaluate how many of the loans the models consider safe or risky actually turned out to be a good or bad investment. Tables 5.2 and 5.3 show the number of charged off loans in the thousand riskiest and safest loans respectively.
TABLE 5.2: Number of charged off loans in the 10% riskiest predicted
loans
TABLE 5.3: Number of charged off loans in the 10% safest predicted
loans
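A sketch of how such counts can be produced, assuming pred.pc holds a model's predicted probabilities of charge off for the test observations and charged.off the observed 0/1 outcomes:

# Count charge offs among the riskiest and safest predicted loans.
ord <- order(pred.pc, decreasing = TRUE)   # riskiest first
n10 <- floor(0.10 * length(pred.pc))       # size of a 10% bucket
sum(charged.off[ord][1:n10])               # charge offs among the riskiest
sum(charged.off[rev(ord)][1:n10])          # charge offs among the safest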
These tables show in a more intuitive way how well the models concentrate possibly bad loans in the high probabilities of charge off and possibly good loans in the low probabilities of charge off. We already know that the Random Forest and Neural Network outperform Logistic Regression and the Support Vector Machine. These tables show, however, that the Neural Network is slightly better at concentrating actually charged off loans in the high probability region and that the Random Forest is slightly better at concentrating loans that were fully repaid in the low probability of charge off region.
Model MSE R2
Baseline 0.03894
Regression 0.02746 0.29447
Random Forest 0.02644 0.32084
Neural Network 0.02666 0.31522
Support Vector Machine 0.02674 0.31312
The mean squared errors of the machine learning models are all lower than the mean squared error of the regression model. This is also reflected in the R2 that the models achieve. The lowest mean squared error and the highest R2 are achieved by the Random Forest model. When the untransformed data is used, the Random Forest model achieves an MSE of 0.02628 and an R2 of 0.32485.
Model MSE R2
Baseline 0.00980
Regression 0.00933 0.04751
Random Forest 0.00931 0.04895
Neural Network 0.00968 0.01183
Support Vector Machine 0.01028 -0.04989
The models perform slightly better than the baseline average prediction, except for the Support Vector Machine, which performs worse than the average prediction and achieves a negative R2. The Random Forest again performs best; however, in this case the runner up is not the Neural Network but regression. The forest with the original data achieves a mean squared error of 0.00927 and an R2 of 0.05258.
From the R2 values we can conclude that the models for this quantity hardly have any predictive power, because the amount of variance in the test data that can be explained by the models is very small. Recall that Table 3.5 in Chapter 3 shows that across all Lending Club sub grades the loss given charge off values are between 91.44% and 94.19%. The expectation that a predictive model would have very little power is confirmed by the low R2 values and the mean squared errors close to the baseline mean squared error, which is small in an absolute sense.
gain of the probability of charge off Random Forests that predict the same response
as used for the Weight of Evidence transformation.
For every algorithm we have multiplied its predictions of the probability of charge off, the exposure at charge off and the loss given charge off. Together with the observed losses the model performance is evaluated, as presented in Table 5.6.
Model MSE R2
Baseline 0.08145
(Logistic) Regression 0.07351 0.09751
Random Forest 0.07151 0.12200
Neural Network 0.07119 0.12601
Support Vector Machine 0.07527 0.07589
TABLE 5.6: EL from separate PC, EAC and LGC model predictions
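In symbols, the separate model prediction of the expected loss of loan \(i\) is the product of the three component predictions, the retail analogue of the familiar EL = PD x EAD x LGD decomposition:

\[
\widehat{EL}_i \;=\; \widehat{PC}_i \times \widehat{EAC}_i \times \widehat{LGC}_i
\]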
When the models are used to predict the expected losses without first modeling
the separate components of expected loss, the performance reported in Table 5.7 is
achieved.
Model MSE R2
Baseline 0.08145
Regression 0.07279 0.10626
Random Forest 0.07147 0.12255
Neural Network 0.07029 0.13697
Support Vector Machine 0.07929 0.02648
If we compare the two expected loss approaches, we see that, except for the Support Vector Machines, the models perform better when expected loss is modeled directly. The reason that the Support Vector Machines perform better in the separate model setting is that the support vector models for exposure at charge off and loss given charge off were trained with the same amount of data as the other exposure and charge off models. The Support Vector Machine could not be trained with all data for the probability of charge off and the expected loss models. So in the separate model case the SVM is at a disadvantage only when modeling the probability of charge off, whereas for the expected loss in one model the Support Vector Machine is at a disadvantage for the entire loss prediction. In both approaches the Neural Network models outperform the other models and the Random Forest comes close to the Neural Network performance.
Figure 5.2 shows the ability to capture losses of the separate model EL approach and Figure 5.3 shows the same for the individual EL models.
The areas under the curves in Figures 5.2 and 5.3 indicate how well a model can capture losses; an area of 0.5 indicates that a model does nothing. An optimal model (OPT in the figures), ordering all the losses from largest to smallest, would look like the aquamarine line. This line shows that roughly 19% of the loans in the test set have been charged off and resulted in a loss; after that point the loss captured by the perfect model is 100% and the line becomes a straight line parallel to the x axis.
The Neural Network performs best at concentrating the actual losses in the loans that have higher expected losses. The single model approach slightly outperforms the expected loss performance obtained by creating separate models for the probability of charge off, exposure at charge off and loss given charge off. In the case of the regression model, the individual model also slightly outperforms the loss capture abilities of the separate model approach. In the case of the Random Forests and Support Vector Machines, the predictions from separate components are better at capturing losses with high expected loss predictions. The baseline model in Figures 5.2 and 5.3 represents a ranking of the loans on loan amount; it is able to achieve an area under the loss capture curve a lot higher than 0.5. This means that the loan amount can be seen as an important risk driver in the Lending Club data.
Calibration
In the previous sections we have analyzed how accurate the models are at the individual loan level and how good the models are at assigning a high loss expectation to loans that have actually been charged off. In this section we let the models put the loans in five buckets ranging from highest expected loss (bucket 1) to lowest expected loss (bucket 5). In these buckets the total predicted loss is compared with the actual loss of all loans in the bucket. Figures 5.4 and 5.5 show the difference between the two as a percentage of the actual loss in the bucket; a positive bar in the figures indicates that the model expected more loss than was actually observed and a negative bar indicates the opposite. When bars are small and close to zero, the model has predicted a loss in the bucket close to the actual sum of losses observed in the bucket.
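A minimal sketch of this bucket comparison, assuming pred.el and actual.loss are per loan predicted and observed losses on the test set:

# Split the loans into five equally sized buckets, bucket 1 holding the
# highest expected losses, and compare predicted with observed totals.
bucket <- cut(rank(-pred.el, ties.method = "first"), breaks = 5, labels = 1:5)
pred.sum   <- tapply(pred.el, bucket, sum)
actual.sum <- tapply(actual.loss, bucket, sum)
# Positive values: the model expected more loss than was observed.
(pred.sum - actual.sum) / actual.sum * 100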
In the separate model approach, Logistic Regression combined with standard regression and the Support Vector Machine models predict bucket losses closest to their actual values. Furthermore, the Neural Network models underestimate the losses in all buckets, and the Random Forest models underestimate the losses in the most risky bucket while overestimating the losses in the other buckets.
The more direct single model approach for expected loss has enabled the Neural Network to make better calibrated expected loss predictions. The Random Forest shows the same pattern of under- and overestimation of losses as the separate model Random Forests, with a less severe underestimation of the losses in the most risky bucket. We can also see that the regression approach is not able to make better calibrated predictions when the expected loss is directly modeled, and that the Support Vector Machine model has very poor calibration; note that its predicted loss for the least risky bucket is more than 75% less than the actual losses of the loans in that bucket.
Chapter 6
Conclusions and Recommendations
This research investigated two possible sources of the added value of machine learning in retail credit risk. The first source is the modeling approach of traditional credit risk modeling versus the approach of machine learning algorithms. The second source of possible added value was evaluated by comparing model performance. After the conclusions about the added value of machine learning in retail credit risk, the recommendations regarding future research are discussed.
Conclusions
To put these performance results into perspective we have to address the fact that
the Weight of Evidence data transformation has not been beneficial for the Random
Forests and possibly also for the other machine learning algorithms. Because the
Random Forests did not require a data transformation we were able to also show
the results of the model with original data in Chapter 4. These results show that the
Random Forests without transformed data consistently outperform the WOE Ran-
dom Forests. We suspect that the main reason for this is that binning the numerical
data has given the models less freedom in making their own decisions.
The Weight of Evidence method was used to create a level playing field. In Chapter 2 we discussed why this method is most suitable for Logistic Regression; yet Logistic Regression is outperformed on WOE transformed data by the Random Forests and Neural Networks, which strengthens our conclusion about the presence of added value in terms of performance in retail credit risk.
• The next step in machine learning research on retail credit risk data such as the
Lending Club data would be to evaluate the added value of online learning
algorithms. These algorithms are constantly updated when new data becomes
available. This would be valuable because when models adjust themselves,
there is no need to invest in creating new models when a lot of new data is
available or macro economic circumstances have changed.
• In this research we have used Neural Networks with one hidden layer. When more hidden layers are added, the algorithm is called Deep Learning. It will be interesting to see whether this method can perform better because of its ability to create more complex models.
• It would be interesting to evaluate model performance when data transformations with other link functions, such as the probit instead of the logit from Weight of Evidence, are used.
Appendix A
Variable description
Name Description
acc_now_delinq The number of accounts on which the borrower is
now delinquent.
acc_open_past_24mths Number of trades opened in past 24 months.
addr_state The state provided by the borrower in the loan ap-
plication
all_util Balance to credit limit on all trades
annual_inc The self-reported annual income provided by the
borrower during registration.
annual_inc_joint The combined self-reported annual income pro-
vided by the co-borrowers during registration
application_type Indicates whether the loan is an individual appli-
cation or a joint application with two co-borrowers
avg_cur_bal Average current balance of all accounts
bc_open_to_buy Total open to buy on revolving bankcards.
bc_util Ratio of total current balance to high credit/credit
limit for all bankcard accounts.
chargeoff_within_12_mths Number of charge-offs within 12 months
collection_recovery_fee post charge off collection fee
collections_12_mths_ex_med Number of collections in 12 months excluding
medical collections
delinq_2yrs The number of 30+ days past-due incidences of
delinquency in the borrower’s credit file for the
past 2 years
delinq_amnt The past-due amount owed for the accounts on
which the borrower is now delinquent.
desc Loan description provided by the borrower
dti A ratio calculated using the borrower’s total
monthly debt payments on the total debt obli-
gations, excluding mortgage and the requested
LC loan, divided by the borrower’s self-reported
monthly income.
dti_joint A ratio calculated using the co-borrowers’ to-
tal monthly payments on the total debt obliga-
tions, excluding mortgages and the requested LC
loan, divided by the co-borrowers’ combined self-
reported monthly income
earliest_cr_line The month the borrower’s earliest reported credit
line was opened
emp_length Employment length in years. Possible values are
between 0 and 10 where 0 means less than one year
and 10 means ten or more years.
emp_title The job title supplied by the Borrower when ap-
plying for the loan.
fico_range_high The upper boundary range the borrower’s FICO at
loan origination belongs to.
fico_range_low The lower boundary range the borrower’s FICO at
loan origination belongs to.
funded_amnt The total amount committed to that loan at that
point in time.
funded_amnt_inv The total amount committed by investors for that
loan at that point in time.
grade LC assigned loan grade
home_ownership The home ownership status provided by the bor-
rower during registration. Our values are: RENT,
OWN, MORTGAGE, OTHER.
id A unique LC assigned ID for the loan listing.
il_util Ratio of total current balance to high credit/credit
limit on all install acct
initial_list_status The initial listing status of the loan. Possible values
are – W, F
inq_fi Number of personal finance inquiries
inq_last_12m Number of credit inquiries in past 12 months
inq_last_6mths The number of inquiries in past 6 months (exclud-
ing auto and mortgage inquiries)
installment The monthly payment owed by the borrower if the
loan originates.
int_rate Interest Rate on the loan
issue_d The month which the loan was funded
last_credit_pull_d The most recent month LC pulled credit for this
loan
last_fico_range_high The upper boundary range the borrower’s last
FICO pulled belongs to.
last_fico_range_low The lower boundary range the borrower’s last
FICO pulled belongs to.
last_pymnt_amnt Last total payment amount received
last_pymnt_d Last month payment was received
loan_amnt The listed amount of the loan applied for by the
borrower. If at some point in time, the credit de-
partment reduces the loan amount, then it will be
reflected in this value.
loan_status Current status of the loan
max_bal_bc Maximum current balance owed on all revolving
accounts
member_id A unique LC assigned Id for the borrower mem-
ber.
mo_sin_old_il_acct Months since oldest bank installment account
opened
mo_sin_old_rev_tl_op Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op Months since most recent revolving account
opened
mo_sin_rcnt_tl Months since most recent account opened
mort_acc Number of mortgage accounts.
mths_since_last_delinq The number of months since the borrower’s last
delinquency.
mths_since_last_major_derog Months since most recent 90-day or worse rating
mths_since_last_record The number of months since the last public record.
mths_since_rcnt_il Months since most recent installment accounts
opened
mths_since_recent_bc Months since most recent bankcard account
opened.
mths_since_recent_bc_dlq Months since most recent bankcard delinquency
mths_since_recent_inq Months since most recent inquiry.
mths_since_recent_revol_delinq Months since most recent revolving delinquency.
next_pymnt_d Next scheduled payment date
num_accts_ever_120_pd Number of accounts ever 120 or more days past
due
num_actv_bc_tl Number of currently active bankcard accounts
num_actv_rev_tl Number of currently active revolving trades
num_bc_sats Number of satisfactory bankcard accounts
num_bc_tl Number of bankcard accounts
num_il_tl Number of installment accounts
num_op_rev_tl Number of open revolving accounts
num_rev_accts Number of revolving accounts
num_rev_tl_bal_gt_0 Number of revolving trades with balance >0
num_sats Number of satisfactory accounts
num_tl_120dpd_2m Number of accounts currently 120 days past due
(updated in past 2 months)
num_tl_30dpd Number of accounts currently 30 days past due
(updated in past 2 months)
num_tl_90g_dpd_24m Number of accounts 90 or more days past due in
last 24 months
num_tl_op_past_12m Number of accounts opened in past 12 months
open_acc The number of open credit lines in the borrower’s
credit file.
open_acc_6m Number of open trades in last 6 months
open_il_12m Number of installment accounts opened in past 12
months
open_il_24m Number of installment accounts opened in past 24
months
open_il_6m Number of currently active installment trades
open_rv_12m Number of revolving trades opened in past 12
months
open_rv_24m Number of revolving trades opened in past 24
months
out_prncp Remaining outstanding principal for total amount
funded
out_prncp_inv Remaining outstanding principal for portion of to-
tal amount funded by investors
pct_tl_nvr_dlq Percent of trades never delinquent
percent_bc_gt_75 Percentage of all bankcard accounts >75% of limit.
policy_code publicly available policy_code=1 new products
not publicly available policy_code=2
pub_rec Number of derogatory public records
pub_rec_bankruptcies Number of public record bankruptcies
purpose A category provided by the borrower for the loan
request.
pymnt_plan Indicates if a payment plan has been put in place
for the loan
recoveries post charge off gross recovery
revol_bal Total credit revolving balance
revol_util Revolving line utilization rate, or the amount of
credit the borrower is using relative to all available
revolving credit.
sub_grade LC assigned loan subgrade
tax_liens Number of tax liens
term The number of payments on the loan. Values are
in months and can be either 36 or 60.
title The loan title provided by the borrower
tot_coll_amt Total collection amounts ever owed
tot_cur_bal Total current balance of all accounts
tot_hi_cred_lim Total high credit/credit limit
total_acc The total number of credit lines currently in the
borrower’s credit file
total_bal_ex_mort Total credit balance excluding mortgage
total_bal_il Total current balance of all installment accounts
total_bc_limit Total bankcard high credit/credit limit
total_cu_tl Number of finance trades
total_il_high_credit_limit Total installment high credit/credit limit
total_pymnt Payments received to date for total amount funded
total_pymnt_inv Payments received to date for portion of total
amount funded by investors
total_rec_int Interest received to date
total_rec_late_fee Late fees received to date
total_rec_prncp Principal received to date
total_rev_hi_lim Total revolving high credit/credit limit
url URL for the LC page with listing data.
verification_status Indicates if income was verified by LC, not veri-
fied, or if the income source was verified
verified_status_joint Indicates if the co-borrowers’ joint income was
verified by LC, not verified, or if the income source
was verified
zip_code The first 3 numbers of the zip code provided by
the borrower in the loan application.
Appendix B
Appendix C
Descriptive statistics
Appendix D
mTry Train MSE Test MSE nTree Train MSE Test MSE Node Size Train MSE Test MSE
20 0.004904 0.026622 1200 0.006075 0.026822 25 0.015001 0.026787
28 0.004725 0.026636 1400 0.006095 0.026841 40 0.018066 0.02679
30 0.004697 0.026642 1000 0.006077 0.026848 44 0.018649 0.0268
22 0.004846 0.026648 1100 0.006094 0.026849 22 0.014108 0.026806
24 0.004798 0.026653 1500 0.006073 0.026851 27 0.015521 0.026809
29 0.004714 0.026654 900 0.006085 0.026851 31 0.01644 0.026811
26 0.004759 0.026656 1000 0.006082 0.026855 43 0.0185 0.026813
23 0.004821 0.026664 1300 0.006069 0.026865 29 0.015998 0.026815
18 0.004975 0.026669 800 0.006079 0.026865 41 0.018214 0.026816
21 0.004874 0.02667 600 0.006088 0.026877 33 0.016858 0.026818
27 0.004739 0.02667 500 0.006091 0.026882 38 0.017738 0.026818
25 0.004782 0.02667 400 0.006086 0.026882 46 0.018916 0.02682
15 0.00513 0.026701 700 0.006094 0.026891 34 0.017053 0.02682
19 0.004942 0.026704 300 0.006151 0.026957 35 0.017218 0.02682
17 0.005029 0.026706 200 0.006145 0.02696 28 0.015766 0.026822
16 0.005071 0.026721 100 0.006225 0.027043 24 0.01473 0.026825
14 0.00521 0.026745 49 0.019277 0.026827
12 0.005425 0.026757 23 0.014437 0.026829
13 0.005292 0.026781 36 0.017393 0.026829
11 0.005598 0.026819 12 0.010114 0.026832
10 0.005798 0.02684 30 0.016247 0.026833
9 0.006115 0.026842 42 0.018351 0.026834
8 0.006507 0.026931 16 0.011951 0.026834
7 0.007112 0.027056 21 0.013816 0.026835
6 0.007947 0.027193 32 0.016666 0.026835
5 0.00911 0.027388 39 0.017933 0.026837
4 0.01091 0.027638 48 0.019152 0.026838
3 0.014203 0.028092 50 0.019386 0.026838
2 0.021247 0.029222 37 0.017593 0.026839
1 0.029936 0.031233 19 0.013112 0.02684
47 0.019034 0.026842
26 0.015272 0.026845
13 0.010625 0.026846
45 0.018783 0.026849
20 0.013484 0.026852
14 0.011076 0.026854
18 0.012752 0.026857
7 0.007291 0.026859
8 0.007874 0.02686
15 0.01156 0.026863
17 0.012376 0.026864
10 0.009058 0.026864
6 0.006735 0.026868
11 0.009585 0.026868
3 0.004983 0.026873
5 0.006088 0.026874
9 0.008493 0.026876
2 0.004548 0.026885
4 0.005504 0.026886
1 0.004291 0.026909
mTry Train MSE Test MSE nTree Train MSE Test MSE Node Size Train MSE Test MSE
9 0.002287 0.008203 1400 0.00229 0.008198 49 0.006367 0.008176
8 0.002406 0.008203 1200 0.002284 0.008199 47 0.006301 0.008177
11 0.00211 0.008208 600 0.002301 0.0082 48 0.006333 0.008177
10 0.002192 0.008209 1100 0.002289 0.008201 37 0.005884 0.008179
5 0.003042 0.008209 1300 0.002286 0.008201 44 0.006185 0.008179
7 0.002597 0.008209 900 0.002298 0.008202 42 0.006105 0.00818
14 0.001975 0.00821 1500 0.002289 0.008202 50 0.006401 0.00818
6 0.002802 0.008211 1000 0.002283 0.008206 39 0.005973 0.00818
13 0.001997 0.008212 800 0.002292 0.008206 36 0.00582 0.00818
3 0.004258 0.008215 500 0.002296 0.008207 41 0.006071 0.008181
15 0.001951 0.008215 700 0.002299 0.008208 46 0.006267 0.008183
4 0.003439 0.008215 400 0.002295 0.008216 35 0.005788 0.008183
12 0.00206 0.008217 300 0.002299 0.008228 43 0.006145 0.008183
19 0.001876 0.00822 200 0.002323 0.008245 45 0.006223 0.008184
16 0.001922 0.008221 100 0.002369 0.008276 38 0.005922 0.008184
18 0.001885 0.008223 34 0.005721 0.008185
20 0.001861 0.008224 31 0.005553 0.008185
23 0.001832 0.008225 40 0.006012 0.008186
22 0.001832 0.008225 23 0.004955 0.008186
2 0.006026 0.008226 24 0.005043 0.008187
27 0.001791 0.008226 28 0.005342 0.008187
30 0.001785 0.008227 33 0.005665 0.008188
25 0.001809 0.008231 32 0.00562 0.008188
24 0.001827 0.008232 30 0.005485 0.008189
21 0.001848 0.008232 27 0.005274 0.008189
28 0.001797 0.008233 26 0.005204 0.008189
17 0.001902 0.008235 19 0.004576 0.008189
29 0.001782 0.008237 20 0.004673 0.00819
26 0.001802 0.008237 25 0.005127 0.00819
1 0.008248 0.008268 16 0.004216 0.008192
22 0.004868 0.008192
29 0.00542 0.008193
15 0.004095 0.008194
21 0.004777 0.008195
13 0.003796 0.008195
17 0.004338 0.008196
14 0.003956 0.008197
18 0.004459 0.008197
12 0.003653 0.008198
3 0.001809 0.008201
5 0.002309 0.008202
11 0.003494 0.008202
8 0.002946 0.008202
9 0.003162 0.008203
2 0.001608 0.008206
10 0.003316 0.008208
6 0.002515 0.008214
1 0.001448 0.008214
4 0.00207 0.008215
7 0.002739 0.008217
mTry Train MSE Test MSE nTree Train MSE Test MSE Node Size Train MSE Test MSE
8 0.017405 0.071639 800 0.016767 0.071558 40 0.047228 0.071421
12 0.015657 0.071672 700 0.016786 0.071606 60 0.052848 0.071429
11 0.015946 0.071673 900 0.016749 0.071618 50 0.050439 0.071432
10 0.016312 0.071674 500 0.016789 0.071632 80 0.056276 0.071457
9 0.01679 0.071681 600 0.016771 0.07165 70 0.054765 0.07147
13 0.01541 0.071683 400 0.016831 0.071712 30 0.042813 0.071472
7 0.018226 0.071705 300 0.016831 0.07172 90 0.057576 0.071481
14 0.015239 0.071718 200 0.016878 0.071881 20 0.036165 0.071483
6 0.01943 0.071727 100 0.017119 0.07214 100 0.058661 0.071497
15 0.015056 0.071764 10 0.025125 0.071579
16 0.014928 0.071769 1 0.010428 0.071679
17 0.014787 0.07179
18 0.014678 0.07182
19 0.014579 0.07183
20 0.014482 0.071863
21 0.014416 0.071869
5 0.021052 0.071875
23 0.014271 0.071876
22 0.014327 0.071896
26 0.01408 0.071919
24 0.014211 0.071923
25 0.014124 0.071927
30 0.013887 0.07199
29 0.013953 0.071995
27 0.01404 0.071997
28 0.013984 0.072023
4 0.023801 0.072085
3 0.029893 0.072432
2 0.045883 0.073427
1 0.071821 0.076389
Appendix E
Regression models
Call:
lm(formula = model.formula, data = lending_df.train)

Residuals:
     Min       1Q   Median       3Q      Max
-0.94687 -0.07897  0.01446  0.10844  0.48947

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 0.7242840  0.0007728 937.279  < 2e-16 ***
annual_inc_joint            0.8359876  0.0129599  64.506  < 2e-16 ***
term                        0.1577588  0.0017261  91.397  < 2e-16 ***
provide_description         0.1850767  0.0071985  25.710  < 2e-16 ***
inq_fi                     -0.0871134  0.0072945 -11.942  < 2e-16 ***
num_tl_op_past_12m          0.0705302  0.0049749  14.177  < 2e-16 ***
pct_tl_nvr_dlq              0.0726385  0.0125420   5.792 7.01e-09 ***
verification_status         0.0207171  0.0036557   5.667 1.46e-08 ***
dti                         0.0201118  0.0024972   8.054 8.20e-16 ***
mo_sin_old_rev_tl_op        0.0222098  0.0067232   3.303 0.000956 ***
bc_util                     0.0280603  0.0048157   5.827 5.68e-09 ***
APPL_FICO_BAND              0.0253607  0.0027022   9.385  < 2e-16 ***
open_acc                   -0.1914179  0.0154775 -12.368  < 2e-16 ***
diff_fundinvfund            0.0825648  0.0377456   2.187 0.028718 *
total_rev_hi_lim           -0.0253267  0.0065202  -3.884 0.000103 ***
purpose                     0.0507640  0.0046483  10.921  < 2e-16 ***
mort_acc                   -0.0123231  0.0051692  -2.384 0.017131 *
pub_rec                    -0.0330247  0.0208108  -1.587 0.112540
collections_12_mths_ex_med  0.0347500  0.0189359   1.835 0.066490 .
delinq_2yrs                 0.0633387  0.0190398   3.327 0.000880 ***
acc_now_delinq              0.1066618  0.0398983   2.673 0.007512 **
inq_last_6mths              0.0579230  0.0052041  11.130  < 2e-16 ***
issue_month                -0.0380450  0.0192951  -1.972 0.048644 *
mths_since_last_delinq     -0.1129974  0.0380246  -2.972 0.002963 **
home_ownership              0.0312624  0.0055828   5.600 2.16e-08 ***
application_type           -0.9537059  0.0585578 -16.287  < 2e-16 ***
revol_util                 -0.0253932  0.0052688  -4.820 1.44e-06 ***
revol_bal                  -0.1362097  0.0179903  -7.571 3.75e-14 ***
tax_liens                   0.0283604  0.0191226   1.483 0.138060
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = model.formula, data = lending_df.train)

Residuals:
     Min       1Q   Median       3Q      Max
-1.14511 -0.05507  0.02669  0.07467  0.12208

Coefficients:
                      Estimate Std. Error  t value Pr(>|t|)
(Intercept)          0.9214995  0.0004348 2119.129  < 2e-16 ***
annual_inc_joint     0.2786326  0.0071892   38.757  < 2e-16 ***
inq_fi              -0.0372416  0.0040751   -9.139  < 2e-16 ***
provide_description  0.0181347  0.0039330    4.611 4.02e-06 ***
emp_length           0.0223883  0.0034351    6.517 7.22e-11 ***
annual_inc           0.0268224  0.0023547   11.391  < 2e-16 ***
pub_rec              0.0598971  0.0108261    5.533 3.17e-08 ***
num_actv_rev_tl      0.0116188  0.0031861    3.647 0.000266 ***
Call:
lm(formula = model.formula, data = lending_df.train)

Residuals:
     Min       1Q   Median       3Q      Max
-0.47666 -0.15750 -0.08716  0.00406  1.07571

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 0.1536446  0.0005219 294.392  < 2e-16 ***
term                        0.1407186  0.0014662  95.972  < 2e-16 ***
APPL_FICO_BAND              0.0420352  0.0017083  24.606  < 2e-16 ***
dti                         0.0681909  0.0019808  34.427  < 2e-16 ***
acc_open_past_24mths        0.0959395  0.0032484  29.535  < 2e-16 ***
bc_open_to_buy              0.0374914  0.0027229  13.769  < 2e-16 ***
verification_status         0.0251846  0.0024169  10.420  < 2e-16 ***
avg_cur_bal                 0.0299705  0.0029582  10.131  < 2e-16 ***
percent_bc_gt_75            0.0218862  0.0031988   6.842 7.83e-12 ***
zip_code                    0.0839619  0.0021642  38.795  < 2e-16 ***
open_il_6m                  0.0810967  0.0025149  32.246  < 2e-16 ***
annual_inc_joint            0.1148578  0.0073231  15.684  < 2e-16 ***
funded_amnt_inv             0.0940734  0.0042193  22.296  < 2e-16 ***
mort_acc                    0.0199904  0.0037314   5.357 8.45e-08 ***
provide_description         0.0545386  0.0047399  11.506  < 2e-16 ***
revol_util                  0.0134155  0.0035458   3.784 0.000155 ***
annual_inc                  0.0746671  0.0031185  23.943  < 2e-16 ***
mths_since_recent_inq      -0.0398438  0.0054887  -7.259 3.90e-13 ***
total_il_high_credit_limit -0.0818550  0.0064209 -12.748  < 2e-16 ***
purpose                     0.0735547  0.0032650  22.528  < 2e-16 ***
home_ownership              0.0723832  0.0040628  17.816  < 2e-16 ***
inq_last_6mths              0.1196836  0.0043318  27.629  < 2e-16 ***
emp_length                  0.1044092  0.0049609  21.046  < 2e-16 ***
open_acc                    0.0457575  0.0130370   3.510 0.000448 ***
delinq_2yrs                 0.1623778  0.0140729  11.538  < 2e-16 ***
pub_rec                    -0.0887399  0.0151692  -5.850 4.92e-09 ***
tax_liens                   0.1120335  0.0147313   7.605 2.85e-14 ***
collections_12_mths_ex_med  0.0696793  0.0152081   4.582 4.61e-06 ***
issue_month                 0.0221908  0.0134287   1.652 0.098435 .
total_acc                   0.2693868  0.0117234  22.978  < 2e-16 ***
diff_fundinvfund           -0.0921930  0.0263684  -3.496 0.000472 ***
revol_bal                  -0.1524011  0.0122638 -12.427  < 2e-16 ***
acc_now_delinq              0.0607842  0.0311270   1.953 0.050847 .
mths_since_last_delinq     -0.1473465  0.0279382  -5.274 1.34e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Bibliography
BCBS, Basel Committee on Banking Supervision (2005). “Working Paper 14: Studies on the Validation of Internal Rating Systems”.
Breiman, L. (2001a). “Random forests”. In: Machine Learning 45.1, pp. 5–32.
Breiman, Leo (2001b). “Statistical Modeling: The Two Cultures”. In: Statistical Science
16.3, pp. 199–215.
Breiman, Leo et al. (1984). Classification and regression trees. CRC press.
Brownlee, Jason (2013). A Tour of Machine Learning Algorithms. [Online; accessed April 13, 2017]. URL: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/.
Charpentier, Arthur (2013). Regression tree using Gini’s index. [Online; accessed April
13, 2017]. URL: https://freakonometrics.hypotheses.org/1279.
Cox, David R (1958). “The regression analysis of binary sequences”. In: Journal of the
Royal Statistical Society. Series B (Methodological), pp. 215–242.
Emekter, R. et al. (2015). “Evaluating credit risk and loan performance in online Peer-
to-Peer (P2P) lending”. In: Applied Economics 47.1, pp. 54–70.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2001). The elements of statis-
tical learning. Vol. 1. Springer series in statistics Springer, Berlin.
Gilardi, Nicolas and Samy Bengio (2000). “Local machine learning models for spatial
data analysis”. In: Journal of Geographic Information and Decision Analysis 4.EPFL-
ARTICLE-82651, pp. 11–28.
Gini, Corrado (1912). “Variabilità e mutabilità”. In: Reprinted in Memorie di metodolog-
ica statistica (Ed. Pizetti E, Salvemini, T). Rome: Libreria Eredi Virgilio Veschi 1.
Hu, Shuhua (2007). “Akaike information criterion”. In: Center for Research in Scientific
Computation.
James, Gareth et al. (2013). An introduction to statistical learning. Vol. 6. Springer.
Larsen, Kim (2016). Information: Data Exploration with Information Theory (Weight-of-Evidence and Information Value). R package version 0.0.9. URL: https://CRAN.R-project.org/package=Information.
Korotkov, V.B. (2011). Mercer theorem, Encyclopedia of Mathematics. URL: http://www.encyclopediaofmath.org/index.php?title=Mercer_theorem&oldid=11889.
LendingClub (2016). Lending Club Corporation website. [Online; accessed November
15, 2016]. URL: https://www.lendingclub.com/.
Malley, J.D. et al. (2012). “Probability Machines: Consistent probability estimation
using nonparametric learning machines”. In: Methods of Information in Medicine
51.1, pp. 74–81.
Mateescu, A (2015). “Peer-to-Peer Lending”. In: Data & Society [online].
Meyer, David et al. (2017). e1071: Misc Functions of the Department of Statistics, Prob-
ability Theory Group (Formerly: E1071), TU Wien. R package version 1.6-8. URL:
https://CRAN.R-project.org/package=e1071.
Nelder, J.A. and R.W.M. Wedderburn (1972). “Generalized linear models”. In: Ency-
clopedia of statistical sciences.
OpenCV (2017). Introduction to Support Vector Machines. [Online; accessed April 13, 2017]. URL: http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html.
Qi, Min (2009). Exposure at default of unsecured credit cards. Office of the Comptroller
of the Currency.
Raghava, G.P.S. (2006). A SVM-based method for rice blast prediction. [Online; accessed April 13, 2017]. URL: http://www.imtech.res.in/raghava/rbpred/svm.jpg.
riskarticles.com (2017). Credit Risk: How to Calculate Expected Loss & Unexpected Loss.
[Online; accessed April 13, 2017]. URL: http://riskarticles.com/credit-
risk-how-to-calculate-expected-loss-unexpected-loss/.
Tsai, Kevin, Sivagami Ramiah, and Sudhanshu Singh (2014). “Peer Lending Risk
Predictor”. In: CS229 Autumn 2014.
Wright, Marvin N and Andreas Ziegler (2015). “ranger: A fast implementation of
random forests for high dimensional data in C++ and R”. In: arXiv preprint arXiv:1508.04409.