

Machine Learning approach for
Credit Scoring
A. R. Provenzano∗, D. Trifirò, A. Datteo, L. Giada, N. Jean, A. Riciputi,
G. Le Pera, M. Spadaccino, L. Massaron and C. Nordio

August 5, 2020
arXiv:2008.01687v1 [q-fin.ST] 20 Jul 2020

Working paper1

Abstract
In this work we build a stack of machine learning models aimed at composing a state-of-the-art
credit rating and default prediction system, obtaining excellent out-of-sample performances.
Our approach is an excursion through the most recent ML / AI concepts: starting from
natural language processing (NLP) applied to economic sectors' (textual) descriptions using
embedding and autoencoders (AE), going through the classification of defaultable firms on
the basis of a wide range of economic features using gradient boosting machines (GBM), and
calibrating their probabilities while paying due attention to the treatment of unbalanced samples.
Finally, we assign credit ratings through genetic algorithms (differential evolution, DE). Model
interpretability is achieved by implementing recent techniques such as SHAP and LIME,
which explain predictions locally in feature space.

JEL Classification codes: C45, C55, G24, G32, G33
AMS Classification codes: 62M45, 68T01, 68T50, 91G40

Keywords: Artificial Intelligence, Machine Learning, Explainable AI, Autoencoders, Embedding, LightGBM, Differential Evolution, SHAP, LIME, Credit Risk, Rating Model, Default, Probability of Default, Classification

Introduction
In the aftermath of the economic crisis, the probability of default (PD) has become a topical
theme in the field of financial research. Indeed, given its usage in risk management, in the
valuation of credit derivatives, in the estimation of the creditworthiness of a borrower and
in the calculation of economic or regulatory capital for banking institutions (under Basel II),
an incorrect PD prediction can lead to a false valuation of risk, unreasonable ratings and incorrect
pricing of financial instruments. In the last decades, a growing number of approaches have been

∗ Corresponding author: [email protected]
1 This paper reflects the authors' opinions and not necessarily those of their employers.
developed to model the credit quality of a company by exploring statistical techniques. Several
works have employed probit models [1] or linear and logistic regression to estimate company ratings
using the main financial indicators as model inputs. However, these models suffer from a clear
inability to capture non-linear dynamics, which are prevalent in financial ratio data [2].
New statistical techniques, especially from the field of machine learning, have gained a worldwide
reputation thanks to their ability to efficiently capture information from big datasets by
recognizing non-linear patterns and temporal dependencies among data. Zhao et al. (2015) [3]
employed feed-forward neural networks for corporate credit rating determination. Petropoulos
et al. [4] explore two state-of-the-art techniques, namely Extreme Gradient Boosting (XGBoost)
and deep learning neural networks, in order to estimate loan PD and calibrate an internal rating
system, useful both for internal usage and for regulatory purposes. Addo et al. (2018) [5] built binary
classifiers based on machine and deep learning models on real data to predict loan probability of
default. They observed that tree-based models are more stable than those based on multilayer
artificial neural networks.
Starting from these studies, we propose a sophisticated framework of machine learning models
which, on the basis of company annual (end-of-year) financial statements coupled with relevant
macroeconomic indicators, attempts to classify the status of a company (performing, i.e. "in bonis",
or defaulted) and to build a robust rating system in which each rating class is matched to
an internally calibrated default probability. In this regard, the target variable here differs
from a previous work by some of the authors [6], where the goal was to predict the credit rating
that Moody's would assign, according to an approach commonly called "shadow rating". The
novelty of our approach lies in the combination of data preprocessing algorithms, responsible for
feature engineering and feature selection, and a core model architecture made of a concatenation
of a Boosted Tree default classifier, a probability calibrator and a rating attribution system
based on a genetic algorithm. Great attention is then given to model interpretability, as we
propose two intuitive approaches to interpret the model output by exploring the property of
local explainability. In detail, the article is organized as follows: Section 1 describes
the input dataset and the preprocessing phase; Section 2 explains the core model architecture;
Section 3 collects results from the core model structure (i.e. default classifier, PD calibrator
and rating clustering); finally, Section 4 is devoted to model explainability.

1 Dataset description
Data used for model training have been collected from the Credit Research Database (CRD)
provided by Moody's, and consist of 919,636 annual (end-of-year) financial statements of 157,986
Italian companies belonging to different sectors (e.g. automotive, construction, consumer goods
durable and non-durable, energy, high-tech industries, media, etc., to the exclusion of the FIRE
sector, i.e. finance, insurance, and real estate). The dependent variable in our dataset,
i.e. the target of the proposed default prediction model, is a binary indicator with the value of
1 flagging a default event (i.e. a bankruptcy occurrence over a one-year horizon) and 0 otherwise.
In accordance with the above-defined target variable, the input variables of our model have been
selected to be consistent with factors that can affect a company's capacity to service external
debt (a full explanation of the model's input features is reported in Appendix A). In particular,
they consist of balance-sheet indexes and ratios, and Key Performance Indicators (KPI)
calculated from the CRD's financial reports [7]. The latter include indicators for efficiency (i.e. measures
of operating performance), liquidity (i.e. ratios used to determine how quickly a company
can turn its assets into cash if it is experiencing financial distress or impending bankruptcy),
solvency (i.e. ratios that depict how much a company relies upon its debt to fund operations)
and profitability (i.e. measures that demonstrate how profitable a company is). Since business
cycles can have a great impact on a firm's profitability and influence its risk profile, we joined the
original information with more general macro variables (2-year lagged historical data) addressing
the surrounding climate in which companies operate. Among the wide range of macroeconomic
indicators provided by Oxford Economics [8], a subset of the most influential ones has been
selected as explanatory variables2. Some of them are country-specific, others are common to the
whole Eurozone3. The combined dataset of balance-sheet indexes, financial ratios and macro
variables, along with data transformations and feature selection (described in Section 1.1),
led to a set of 179 features and covers the period 2011-2017.
As fully described hereafter, the obtained dataset was split into three parts: an out-of-time
dataset which includes the data referred to year 2017 (marked in light blue in Figure 1), used
to test model performance; and a stratified pair of train/test datasets (constituting 80%
and 20% of the input dataset, respectively) which covers the period 2011-2016, employed for model
development and calibration.

Figure 1: Number of balance sheets per financial statement year and the corresponding 1-year
default rate. The default rate shows an increasing trend over the 2011-2012 period and then
decreases until 2017.

1.1 Feature engineering and feature selection


A preliminary step in building a machine learning model consists in generating a set of features
suitable for model training. This task involves data manipulation processes such as transformation
of categorical features, missing values treatment, infinite values handling, outliers detection
and data leakage avoidance. In particular, categorical, non-ordinal variables are one of the main
issues that must be tackled in order to feed any machine learning model [9].
Different encoding techniques can be used to make categorical data legible for a machine learning
algorithm. The most common way to deal with categories is to simply map each category to
2 The list of selected indicators is reported in Section A.3.
3 The regional aggregate Eurozone includes the following countries: Austria, Belgium, Cyprus, Estonia, Finland, France, Germany, Greece, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Portugal, Slovenia, Slovakia and Spain.
a number. By applying such a transformation, known as Label Encoding, a model would treat
categories as ordered integers, which would imply non-existent ordinal relationships between
data and could be misleading for model training. Another simple way to handle categorical
data is the One-Hot Encoding technique, which consists in transforming each categorical feature
into a fixed-size sparse vector of all zeros but a 1 in the cell that uniquely identifies a specific
realization of that variable. The main drawback of this technique lies in the fact that categories
with a high number of possible realizations would generate large-dimension datasets, which
makes it a memory-inefficient encoder. Moreover, this sparse representation does not preserve
similarity between feature values. An alternative approach to overcome these issues is represented
by Categorical Embedding, which consists in mapping, via a Deep Neural Network (DNN),
each possible discrete value of a given categorical variable to a low-dimensional, learned, continuous
vector representation. This method allows each categorical feature to be placed in a Euclidean
space, keeping coherent relationships with other realizations of the same variable. The extension
of the categorical embedding approach to word and document representation is known as Word
Embedding [10]. In particular, Sentence Embedding is an application of word embedding aimed
at representing a full sentence in a vector space. In this study, we applied sentence embedding
to represent the industry sector descriptions associated with each "NACE code"4. In order to
preserve the "semantics" of the original data and work in a low-dimensional space, we propose
a framework of embedding with autoencoder regularization, in which the original data are
embedded into low-dimension vectors. The obtained embeddings maintain local similarity and
can be easily reverted to their original forms. The encoding of the NACE is a novel way to
overcome the NACE1-NACE2 mapping conundrum: in our dataset both NACE versions are used and, as
already stated in many papers, the two encoding systems are not fully compatible [11]. Moreover,
the NACE encoding allows for a proper industry segment description of multi-sector firms that
cannot be easily described by a single NACE code, further extending the predictive power of
the economic sector category.
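To make the idea of categorical embedding concrete, the following is a minimal sketch (not the architecture used in this work) of an entity-embedding layer in Keras: a single categorical feature with `n_categories` levels is mapped to a 5-dimensional learned vector. The layer sizes and the downstream classification head are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_categories, embedding_dim = 1000, 5   # illustrative sizes

# integer-encoded categorical input -> learned dense vector
cat_in = tf.keras.Input(shape=(1,), dtype="int32")
emb = layers.Embedding(input_dim=n_categories, output_dim=embedding_dim)(cat_in)
emb = layers.Flatten()(emb)

# small classification head, only to give the embedding a training signal
hidden = layers.Dense(16, activation="relu")(emb)
out = layers.Dense(1, activation="sigmoid")(hidden)

model = tf.keras.Model(cat_in, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# after model.fit(...), the learned category vectors live in the Embedding layer weights:
# vectors = model.layers[1].get_weights()[0]   # shape (n_categories, embedding_dim)
```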
A different encoding method is Target Encoding, in which categorical features are replaced
with the mean target value of the samples having that category. This allows encoding an arbitrary
number of features without increasing data dimensionality. However, as a drawback, a
naive application of this type of encoding can allow data leakage, leading to model overfitting
and poor predictive performance. A target encoding algorithm developed for preventing data
leakage is known as the James-Stein estimator, and is the one used in our model. In more detail,
it transforms each categorical feature with a weighted average of the mean target value for the
observed feature value and the mean target value computed regardless of the feature realization.
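A minimal sketch of James-Stein target encoding using the `category_encoders` package; the package choice, the toy column name and the data are assumptions, since the paper only states that a James-Stein encoder is used.

```python
import category_encoders as ce
import pandas as pd

# toy data: one categorical feature and a binary default flag
X = pd.DataFrame({"region": ["north", "south", "north", "center", "south", "north"]})
y = pd.Series([0, 1, 0, 1, 0, 0])

# each category is replaced by a weighted average of its own mean target
# and the global mean target, which limits leakage from rare categories
encoder = ce.JamesSteinEncoder(cols=["region"])
X_encoded = encoder.fit_transform(X, y)
print(X_encoded.head())
```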
As described above, some feature transformations can result in a general increase of input
data dimensionality, which makes it necessary to implement a robust and independent feature
selection framework. In fact, training a machine learning model on a huge number of independent
variables is doomed to suffer from the so-called curse of dimensionality [12], i.e. the problem
of the exponential increase in volume associated with adding extra dimensions to a vector space.
We employed a voting ensemble of models to independently assign importance to the available
features and efficiently select those features which will contribute most to model prediction.
Hereafter in this section we look into the implementation of the satellite models aimed
at: performing sentence embedding of the industry sector descriptions; reducing embedding
dimensionality via a stacked autoencoder; and selecting relevant features via a voting approach.

4 The "Statistical Classification of Economic Activities in the European Community", commonly referred to as NACE, is the industry-standard classification system used in the European Union.
Sentence Embedding of sector descriptions A common practice in Natural Language
Processing (NLP) is the use of pre-trained embeddings to represent words or sentences in a document.
Following this practice, we use the pre-trained models built into the SpaCy NLP
library to embed the NACE sector textual descriptions. In particular, we performed
sentence embedding, i.e. we transformed each description into a 300-dimensional real-valued
vector. Each sentence embedding is automatically constructed by SpaCy by averaging the 300-dimensional
real-valued pre-trained vectors which map each word in that sentence.
Here is a glimpse at how SpaCy processes textual data. It first segments text into words, punctuation,
symbols and other tokens by applying language-specific rules (i.e. it tokenizes the
text). Then it performs Part-of-Speech (POS) tagging to understand the grammatical properties
of each word by means of a built-in statistical model. A model consists of binary data trained on a
dataset large enough to allow the system to make predictions that generalize across the language.
A key assumption of the word embedding approach is the idea of using, for each word, a dense
distributed representation learned from the usage of words [13]. This allows words that
are used in similar ways to have similar representations, naturally capturing their meaning [14].
Given the high importance the industry sector has in the financial literature as a default prediction driver,
embedding NACE industry descriptions improves the overall model performance in application
by helping the model to generalize better and to smoothly handle unseen elements.
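As an illustration, the snippet below sketches the sentence-embedding step with SpaCy; the specific pre-trained pipeline (`en_core_web_md`) and the example NACE description are assumptions, since the paper only states that pre-trained SpaCy models are used.

```python
import spacy

# a pre-trained pipeline that ships with 300-dimensional word vectors
nlp = spacy.load("en_core_web_md")

description = "Manufacture of motor vehicles, trailers and semi-trailers"
doc = nlp(description)

# doc.vector is the average of the token vectors, i.e. the sentence embedding
sentence_embedding = doc.vector
print(sentence_embedding.shape)   # (300,)
```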

Dimensionality reduction via stacked autoencoder The aforementioned word embedding
models are a powerful way to represent categorical variables while preserving relationships
between data, but at the cost of an increase in dimensionality. In order to reduce the number of
dimensions of the output embeddings from 300 to 5, a stacked autoencoder (SAE) of 6 layers5
has been developed via tensorflow [15].
In detail, autoencoders (AE) are a family of neural networks in which input and output coincide.
They work by compressing the input into a latent-space representation and then reconstructing the
output by means of this representation. They consist of two principal components: the encoder,
which takes the input and compresses it into a representation with fewer dimensions, and the
decoder, which tries to reconstruct the input. Among AEs, stacked autoencoders are deep neural
networks in which the output of each hidden layer is connected to the input of the successive
hidden layer. All hidden layers are trained by an unsupervised algorithm and then fine-tuned
by a supervised method aimed at minimizing the cost function. Since they can learn even non-linear
transformations, unlike PCA, by using a non-linear activation function and a multiple-layer
structure, autoencoders are efficient tools for dimensionality reduction. Moreover, in our
application, the SAE exhibited a low reconstruction loss6 (around 6% MSE), in contrast to the low
fraction of variance explained by PCA.
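A minimal sketch of such a stacked autoencoder in tensorflow/Keras, compressing the 300-dimensional sentence embeddings to 5 dimensions with a 3-layer encoder and a 3-layer decoder; the intermediate layer widths and training settings are illustrative assumptions, not the paper's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(300,))
# encoder: 300 -> 128 -> 32 -> 5
x = layers.Dense(128, activation="relu")(inputs)
x = layers.Dense(32, activation="relu")(x)
code = layers.Dense(5, name="code")(x)
# decoder: 5 -> 32 -> 128 -> 300
x = layers.Dense(32, activation="relu")(code)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(300)(x)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, code)
autoencoder.compile(optimizer="adam", loss="mse")   # MSE reconstruction loss

# `embeddings`: array of shape (n_sectors, 300) produced by the sentence-embedding step
# autoencoder.fit(embeddings, embeddings, epochs=100, batch_size=64)
# reduced = encoder.predict(embeddings)             # shape (n_sectors, 5)
```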

Voting approach for feature selection Feature selection is a key component when building
machine learning models. We can either leave this task to the main model or use a set of
lighter models in a preparatory step, so that the effort required for further feature selection is
reduced when training the main model. This is particularly useful for multi-parameter models
like Light-GBM, where the training phase also involves the calibration of a set of hyperparameters
usually spanning very wide ranges. Neglecting the expert-based component, algorithmic feature
selection methods are usually divided into three classes: filter methods, wrapper methods and
5 A 3-layer encoder and a 3-layer decoder.
6 The reconstruction loss is the loss function (usually either the mean squared error or the cross-entropy between the reconstructed output and the input) which penalizes the network for creating outputs different from the original input.
embedded methods. Filter-based methods apply a statistical measure to assign a score to each
feature; the variables of the starting dataset are then ranked according to their scores and either
kept or removed. Wrapper-based methods consider the selection of a set of features
as a "search problem", where different combinations are prepared, evaluated and compared to
other combinations. In detail, a predictive model is used to evaluate a combination of features
and assign a score based on model accuracy. Embedded methods learn which features
best contribute to the accuracy of the model while the model is being created.
We combined a set of 6 different models for feature selection, stacking each algorithm into a
hard-voting framework where the features which receive the highest number of votes among all the
models are selected (a minimal sketch of this voting scheme is given after the list below). In particular,
after having transformed categorical features via target encoding (by means of the James-Stein encoder),
each feature in the dataset has been ranked on the basis of the following models:

• Pearson criterion. A filter-based method which consists in checking the absolute value of the Pearson correlation between each feature and the target in the input dataset and keeping the top n features based on this score.

• Chi-squared criterion. Another filter-based method in which we calculate the chi-squared statistic between each feature and the target and select the desired number of features which exhibit the best chi-squared scores. The underlying intuition is that if a feature is independent of the target it is uninformative for classification.

• Recursive Feature Elimination (RFE). A wrapper-based method whose goal is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is computed; in our specific case the estimator used is a Logistic Regression. Then, the least important features are pruned from the current set of features. This procedure is recursively repeated on the pruned set until the desired number of features is eventually reached.

• Random Forest Classifier (RF)7. A wrapper-based method that uses a built-in algorithm for feature selection. In particular, variables are selected according to feature importance, obtained by averaging the feature importances of all decision trees.

• Logistic Lasso Regression. An embedded method which uses the feature selection naturally performed by a Logistic Regression with L1 regularization.

• Light-GBM [16] (LGBM)8. A wrapper-based method analogous to the above-mentioned RF classifier.
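The following is a minimal sketch of the hard-voting scheme, under the assumption that `X` is a numeric feature DataFrame and `y` the binary target; the thresholds, hyper-parameters and number of selected features `k` are illustrative, not the values used in the paper.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

def vote_features(X: pd.DataFrame, y: pd.Series, k: int = 50) -> pd.Series:
    votes = pd.Series(0, index=X.columns)

    # 1) Pearson criterion: top-k absolute correlations with the target
    pearson = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    votes[pearson.nlargest(k).index] += 1

    # 2) Chi-squared criterion (chi2 needs non-negative inputs, hence the shift)
    chi = SelectKBest(chi2, k=k).fit(X - X.min(), y)
    votes[X.columns[chi.get_support()]] += 1

    # 3) Recursive Feature Elimination with a Logistic Regression estimator
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
    votes[X.columns[rfe.support_]] += 1

    # 4) Random Forest built-in importances
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    votes[pd.Series(rf.feature_importances_, index=X.columns).nlargest(k).index] += 1

    # 5) Logistic Lasso (L1) regression: keep features with non-zero coefficients
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    votes[X.columns[lasso.coef_[0] != 0]] += 1

    # 6) Light-GBM built-in importances
    lgbm = LGBMClassifier(n_estimators=200, random_state=0).fit(X, y)
    votes[pd.Series(lgbm.feature_importances_, index=X.columns).nlargest(k).index] += 1

    return votes.sort_values(ascending=False)   # features with most votes first
```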

2 Model architecture
Moving beyond the satellite models described in Section 1.1 and used in the preprocessing phase,
in this section we present the core model architecture. It consists of a concatenation of three
7 RF is an ensemble of Decision Trees generally trained via the bagging method: this approach consists in using the same training algorithm for every predictor, but training each of them on a different random subset of the train set. Once all predictors are trained, the ensemble makes a prediction for a new instance by simply aggregating the predictions of all predictors.
8 LGBM is a fast, high-performance gradient boosting framework based on decision tree algorithms.
machine learning models aimed at building a robust and reliable framework whose purpose is
not only to classify the company status (as in bonis or defaulted), but also to construct an
internally calibrated rating system in which each rating class corresponds to a self-consistent
default probability.
In detail, we developed a Boosted Tree algorithm in order to classify the status of a company
over a one-year horizon. This is a classical binary classification problem whose target variable
is 1 in case of a default event and 0 otherwise. Hyper-parameter tuning has been performed via
an extension of cross-validation for time series that is further described hereafter.
The purpose of the second model is to fit the output score of the binary classifier to the
actual default rate. We built a calibrator which consists of a Logistic Regression trained on the
leaf assignments of the Boosted Tree.
Finally, we calibrated our own rating system by splitting the refitted default probability into 9
clusters via a genetic algorithm.

2.1 Light-GBM for default classification


In order to leverage the availability of a large-scale dataset, enriched with a high number of
features, we developed a robust machine learning approach based on Gradient Boosting decision
trees known as Light-GBM.
The Gradient Boosting trees model [17] is a method of combining a group of "weak learners" (specifically
decision trees) to form a "strong predictor", reducing both variance and bias. Differently from
other tree methods like Random Forest, Boosted Trees work by sequentially adding predictors
to an ensemble, each new predictor correcting its predecessor by being fitted to the residuals of
the previous one. These residuals are the gradient of the loss functional being minimized, with
respect to the model values at each training data point evaluated at the current step. Specifically,
at each iteration a sub-sample of the training data is drawn at random (without replacement) from
the full training dataset. This randomly selected sub-sample is then used in place of the full sample
to fit the "weak learner" and compute the model update for the current iteration.
In particular, Light-GBM (LGBM) is a fast, high-performance gradient boosting framework
based on decision tree algorithms, which has proved to be highly effective in classification and
regression models when applied to tabular, structured data such as ours. The model
hyper-parameters have been tuned via an out-of-time cross-validation procedure based on a custom
extension of the Fβ-measure, where the balance of specificity9 (also called the "true negative rate")
and recall10 (also called the "true positive rate") in the calculation of the harmonic mean is
controlled by a coefficient β as follows:

F_\beta = (1 + \beta^2) \, \frac{specificity \cdot recall}{\beta^2 \cdot specificity + recall}    (1)
In this procedure, each test set consists of a single year of future observations, while the
corresponding training set is made up of the observations that occurred prior to those forming
the test set. In this way, the model is optimized for predicting what will happen in the future
using only information available up to the present day. The objective function used for the
classification problem was the log-loss, which measures the distance between each predicted
probability and the actual class output value by means of a logarithmic penalty. Due
9 The specificity is defined as the number of true negatives over the number of true negatives plus the number of false positives.
10 The recall is defined as the number of true positives over the number of true positives plus the number of false negatives.
to the high imbalance between 0 and 1 target flags, we have used a modified, unbalanced log-loss,
by setting the scale_pos_weight parameter of Light-GBM equal to the ratio between
the number of 0s and the number of 1s. Other objective functions we have tried, like the Focal
Loss [18] and custom weighted log-losses, have not given any specific advantage compared to the
unbalanced log-loss.
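A minimal sketch (not the authors' exact configuration) of this set-up: `scale_pos_weight` implements the unbalanced log-loss, and the helper reproduces the custom Fβ of Equation (1) from specificity and recall. The hyper-parameters and the data splits (`X_train`, `y_train` as a 0/1 integer array, `X_valid`, `y_valid`) are assumptions.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import recall_score

def f_beta_spec_recall(y_true, proba, beta=2.0, threshold=0.5):
    """Custom F_beta of Eq. (1): harmonic mean of specificity and recall weighted by beta."""
    y_hat = (proba >= threshold).astype(int)
    recall = recall_score(y_true, y_hat)                    # true positive rate
    specificity = recall_score(y_true, y_hat, pos_label=0)  # true negative rate
    denom = beta ** 2 * specificity + recall
    return (1 + beta ** 2) * specificity * recall / denom if denom > 0 else 0.0

n_neg, n_pos = np.bincount(y_train)      # counts of 0s and 1s in the training target
clf = LGBMClassifier(
    objective="binary",
    scale_pos_weight=n_neg / n_pos,      # ratio of 0s to 1s, as described in the text
    n_estimators=500,
    learning_rate=0.05,                  # illustrative hyper-parameters
)
clf.fit(X_train, y_train)
print(f_beta_spec_recall(y_valid, clf.predict_proba(X_valid)[:, 1]))
```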

2.2 Probability calibration for tree-based models


A natural extension of the corporate default classification problem consists in predicting
the probability of default. Complex non-linear machine learning algorithms can provide poor
estimates of the class probabilities, especially when the target variable is highly unbalanced, so
that the distribution and behaviour of the probabilities may not reflect the true underlying
probability of the sample. The unbalanced log-loss objective chosen for the classification
task creates a custom metric in the default probability space that is reflected in distorted class
probabilities. The perfect classifier would output only 0 and 1 probabilities, but these would not
be able to match historical default rates: they simply represent the probability of belonging to
a class with the switch threshold at 0.5, they are not predicted default rates. Fortunately, it is
possible to adjust the probability distribution to better match the actual distribution
observed in the data, without losing predictive power. This adjustment is referred to as calibration.
In particular, calibrating a classifier consists in fitting a regressor (known as the calibrator)
which maps the output of the classifier f_i to a calibrated probability p(y_i = 1|f_i) in [0, 1].
Taking inspiration from the hybrid model structure proposed by Xinran He et al. [19], based on
the concatenation of boosted decision trees and a probabilistic sparse classifier, we calibrated
the output probabilities of the LGBM classifier by fitting an L2-regularized Logistic Regression
(LR) (where the inverse of the regularization strength, i.e. the C parameter, has been optimized on
the out-of-time sample of the training set) on its one-hot encoded leaf assignments.
We first applied the fitted LGBM to the stratified test set we left aside for the classification task,
since the sample used to train the calibrator should not be the one used to train the target classifier. We
treated the output of each individual tree of the LGBM classifier as a categorical feature that
takes as value the index of the leaf an instance ends up falling in. We applied one-hot encoding
to obtain dummies indicating leaf assignments, on which the LR model is fitted. Finally, we tested
the calibrator on the train set previously used for the LGBM training phase. This methodology of
taking an intermediate result and changing the output from classification to regression is analogous
to what is currently known as Transfer Learning in the Deep Neural Network world, where the
final neural-net layer is removed and substituted with a novel output. The main advantage of
this method is that it preserves all the complex internal feature engineering that the system learned
in the original training task and transfers it to a different problem, in our specific case the
prediction of actual default rates.
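A minimal sketch of the leaf-based calibration step, assuming `clf` is the fitted LGBMClassifier from Section 2.1 and `(X_cal, y_cal)` is the sample reserved for the calibrator; the names and the regularization strength are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# leaf index reached by each sample in every tree of the boosted ensemble
leaves_cal = clf.predict(X_cal, pred_leaf=True)     # shape (n_samples, n_trees)

# one-hot encode the leaf assignments and fit the L2-regularized logistic calibrator
ohe = OneHotEncoder(handle_unknown="ignore")
calibrator = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
calibrator.fit(ohe.fit_transform(leaves_cal), y_cal)

# calibrated probability of default for new observations
def calibrated_pd(X_new):
    leaves_new = clf.predict(X_new, pred_leaf=True)
    return calibrator.predict_proba(ohe.transform(leaves_new))[:, 1]
```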

2.3 Rating attribution via genetic algorithm


A robust default classification system, able to meet both supervisory requirements and internal
banking usage, provides a way to map the internally calibrated probability of default to a
rating system in which each PD bucket is matched to a rating grade. In order to calibrate our
own rating system, the refitted default probability has been split into 9 groups (corresponding
to 9 different rating classes) by means of a genetic algorithm known as Differential Evolution [20].
The algorithmic task of calibrating a rating system can be stated as an optimization problem, as
it tries to: minimize the Brier score; maximise the similarity among elements of the same group
(the so-called cohesion, i.e. the items in a cluster should be as similar as possible); minimise the
similarity between different groups (the so-called separation, i.e. any two clusters should be
as distinct as possible in terms of similarity of items); ensure PD monotonicity (i.e. lower
default rates have to correspond to better rating grades and vice versa); and obtain an acceptable
cluster size (i.e. each cluster has to include a fraction of the total population that is
roughly homogeneous across clusters). Among the partitioning clustering algorithms, Genetic
Algorithms (GA) are stochastic search heuristics inspired by the concepts of Darwinian evolution
and genetics. They are based on the idea of creating a population of candidate solutions to
an optimization problem, which is iteratively refined by alteration (mutation) and selection of
good solutions for the next iteration. Candidate solutions are selected according to a so-called
fitness function, which evaluates their quality with respect to the optimization problem. In the
case of Differential Evolution (DE) algorithms the candidate solutions are linear combinations of
existing solutions. In the end, the best individual of the population is returned, representing the
best solution discovered by the algorithm.
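A minimal sketch of the rating attribution step using scipy's `differential_evolution`; the fitness function below only encodes the Brier score and a cluster-size penalty, so it is a simplified stand-in for the full objective (cohesion, separation, monotonicity) described above. `pd_hat` (calibrated PDs) and `y` (observed 0/1 defaults) are assumptions.

```python
import numpy as np
from scipy.optimize import differential_evolution

N_CLASSES = 9

def fitness(cuts, pd_hat, y, min_frac=0.02):
    """Score a candidate set of 8 interior PD cut-offs defining 9 rating buckets."""
    edges = np.sort(cuts)
    bucket = np.digitize(pd_hat, edges)          # bucket index 0..8 for every obligor
    brier, penalty = 0.0, 0.0
    for k in range(N_CLASSES):
        mask = bucket == k
        if mask.mean() < min_frac:               # penalize too-small rating classes
            penalty += 1.0
            continue
        class_pd = pd_hat[mask].mean()           # PD assigned to rating class k
        brier += ((class_pd - y[mask]) ** 2).sum()
    return brier / len(y) + penalty

result = differential_evolution(
    fitness, bounds=[(0.0, 1.0)] * (N_CLASSES - 1), args=(pd_hat, y), seed=0, maxiter=200
)
class_edges = np.sort(result.x)                  # PD boundaries of the 9 rating classes
```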

3 Results
The metric used to evaluate model performance is the AUROC or AUC-ROC score (Area
Under the Receiver Operating Characteristic curve). The ROC is a probability curve and
the AUC represents the degree of separability. This measure tells how capable the model is of
distinguishing between classes: it is close to 1 for an excellent model and close to 0.5 for a model
with no discriminative power. The ROC curve is constructed by evaluating the fraction of "true
positives" (tpr, or True Positive Rate) and "false positives" (fpr, or False Positive Rate) for
different threshold values. In detail, the tpr, also known as Recall or Sensitivity, is defined in
Equation (2) as the number of items correctly identified as positive out of the total number of
actual positives:

tpr = \frac{TP}{TP + FN}    (2)

where TP is the number of true positives and FN is the number of false negatives. The fpr,
also known as the Type I error rate, is defined in Equation (3) as the number of items wrongly
identified as positive out of the total number of actual negatives:

fpr = \frac{FP}{FP + TN}    (3)
where FP is the number of false positives and TN is the number of true negatives.
Prediction results are then summarized in a confusion matrix which counts the number
of correct and incorrect predictions made by the classifier. A threshold is applied as the cut-off
point in probability between the positive and negative classes, which for the default classifier
has been set at 0.5. However, a trade-off exists between tpr and fpr, such that changing the
classification threshold will shift the balance of predictions towards improving the True
Positive Rate at the expense of the False Positive Rate, or vice versa.
The metric used to evaluate the performance of the internally calibrated PD prediction is the
Brier score (BS), i.e. a way to verify the accuracy of a probability forecast in terms of the distance
between predicted probabilities and actual outcomes. The most common formulation of the Brier
score is the mean squared error:

BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2    (4)

in which f_t is the forecast probability, o_t the actual outcome of the event at instance t and
N is the number of forecasting instances. The best possible Brier score is 0, for total accuracy;
the worst possible score is 1, which means the forecast was wholly inaccurate.
Note that all the metrics described so far have been calculated on the test set obtained by
splitting the dataset along the financial statement year11.
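For reference, a minimal sketch of how these metrics can be computed with scikit-learn, assuming `y_test` holds the 0/1 default flags of the out-of-time set and `p_hat` the predicted probabilities (both names are illustrative):

```python
from sklearn.metrics import roc_auc_score, brier_score_loss, confusion_matrix

auroc = roc_auc_score(y_test, p_hat)
brier = brier_score_loss(y_test, p_hat)                              # Eq. (4)
tn, fp, fn, tp = confusion_matrix(y_test, (p_hat >= 0.5).astype(int)).ravel()
tpr, fpr = tp / (tp + fn), fp / (fp + tn)                            # Eqs. (2) and (3)
```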

3.1 Default Classification


We obtained a high performance corresponding to an AUROC = 95.0% (see Figure 2), summarized
in the normalized confusion matrix of Figure 3. In highly unbalanced datasets the confusion
matrix is usually skewed towards predicting well only the majority class, producing unsatisfactory
performance on the minority class, even though misclassification of the minority class is usually
the event businesses most want to minimize. The distortion in the default probability space
and an accurate choice of feature selection and hyperparameters created a system able to
effectively discriminate events in the minority class, reducing the occurrence of false negatives.

Figure 2: ROC curve for Light-GBM classifier

11 The train set spans from 2011 to 2016; the test set covers 2017.
Figure 3: Normalized confusion matrix with 50% threshold for Light-GBM classifier

3.2 PD refitting
Default probability forecasts before and after the refitting procedure are summarized in the
calibration plots (also called reliability curves) of Figure 4, which allow checking whether the
predicted probabilities produced by the model are well calibrated. Specifically, a calibration plot
consists of a line plot of the relative observed frequency (y-axis) versus the predicted probabilities
(x-axis)12. A perfect classifier would produce only 0 and 1 predictions but would not be able
to forecast actual default rates; a perfect actual-default-rate model would produce reliability
diagrams as close as possible to the main diagonal from the bottom left to the top right of the
plot. The refitting procedure maps the classifier output to a reliable default rate predictor.
The refitting procedure left the AUROC score of the model unchanged (AUROC = 95.0%); the
calibration performance is evaluated with the Brier score (BS = 1.2%13); classification results
are summarized in the normalized confusion matrix reported in Figure 6, where the threshold at
the cut-off point between the positive and negative classes has been optimized on the ROC curve of Figure 5.

12 In detail, the predicted probabilities are divided into a fixed number of buckets along the x-axis. The number of target events (i.e. occurrences of 1-year default) is then counted for each bin (i.e. the relative observed frequency). Finally, the counts are normalized and the results are plotted as a line plot.
13 The closer the Brier score is to zero, the better the forecast of default probabilities.
Figure 4: Calibration plots and log-scaled histograms of forecast probability before (Figure 4a)
and after (Figure 4b) refitting. The accuracy of predicted probabilities is expressed in terms of the
log-loss measure.

Figure 5: ROC curve for the calibrated classifier

Figure 6: Normalized confusion matrix with optimized threshold for the calibrated classifier

3.3 PD clustering
Among the several common statistical tests that can be performed to validate the assignment of
a probability of default to a certain rating grade, two approaches have been used: the one-sided
Binomial test and the Extended Traffic Light Approach.
The Binomial test is one of the most popular single-grade, single-period14 tests performed
for rating system validation. For a certain rating grade k ∈ {1, . . . , K}, where K is the number of
rating classes15, we made the assumption that default events are independent within grade
k and can be modelled as a binomially distributed random variable X with size parameter
N_k and "success" probability PD_k. Thus, we can assess the correctness of the PD forecast by
testing the null hypothesis H0, where:

• H0 : the actual default rate is less than or equal to the forecast default rate given by the
PD;

The null hypothesis H0 is rejected at a confidence level α if the number of observed defaults d
in rating grade k is greater than or equal to the critical value reported in Equation (5):

d_\alpha = \min\left\{ d : \sum_{j=d}^{N_k} \binom{N_k}{j} PD_k^{\,j} (1 - PD_k)^{N_k - j} \leq 1 - \alpha \right\}    (5)

The Extended Traffic Light Approach is a novel technique for default probability validation,
first adopted by Tasche (2003) [21]. The implementation used in this section follows a heuristic
approach proposed by Blochwitz et al. (2005) [22], which is based on the estimation of a relative
distance between observed default rates and forecast probabilities of default, under the key
assumption of binomially distributed default events. Four coloured zones, Green, Yellow, Orange
and Red, are established to analyse the deviation between forecasts and actual realizations. In detail: if
the result of the validation assessment lies in the Green zone, there is no obvious contradiction
between the forecast and the realized default rate; the Yellow and Orange lights indicate that the
realized default rate is not compatible with the PD forecast, although the difference between the realized
rate and the forecast is still in the range of usual statistical fluctuations; finally, the Red light
indicates a wrong forecast of the default probability. The boundaries between the aforementioned
light zones are summarized in Equation (6):

Green:   p_k < PD_k
Yellow:  PD_k \leq p_k < PD_k + K^y \, \sigma(PD_k, N_k)
Orange:  PD_k + K^y \, \sigma(PD_k, N_k) \leq p_k < PD_k + K^0 \, \sigma(PD_k, N_k)
Red:     PD_k + K^0 \, \sigma(PD_k, N_k) \leq p_k        (6)

where \sigma(PD_k, N_k) = \sqrt{PD_k (1 - PD_k)/N_k} and p_k is the realized default rate of grade k.
The parameters K^y and K^0 play a major role in the validation assessment, so they have to be tuned
carefully. A proper choice based on practical considerations is setting K^y = 0.84 and K^0 = 1.44,
which corresponds to probabilities of observing Green, Yellow, Orange and Red of 0.5, 0.3, 0.15 and
0.05, respectively.
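A minimal sketch of both validation checks for a single rating grade, under the binomial assumption; the inputs (forecast PD `pd_k`, number of obligors `n_k`, observed defaults `d_k`) and the confidence level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom

def binomial_critical_value(pd_k, n_k, alpha=0.99):
    """Eq. (5): smallest d with P(X >= d) <= 1 - alpha, where X ~ Binomial(n_k, pd_k)."""
    d = np.arange(n_k + 1)
    tail = binom.sf(d - 1, n_k, pd_k)     # P(X >= d); sf(-1) = 1 covers d = 0
    return int(d[tail <= 1 - alpha][0])

def one_sided_binomial_test(pd_k, n_k, d_k, alpha=0.99):
    # H0 (actual default rate <= forecast PD) is rejected when d_k reaches the critical value
    return d_k < binomial_critical_value(pd_k, n_k, alpha)   # True means "Passed"

def traffic_light(pd_k, n_k, d_k, k_y=0.84, k_0=1.44):
    p_k = d_k / n_k                                          # realized default rate
    sigma = np.sqrt(pd_k * (1 - pd_k) / n_k)
    if p_k < pd_k:
        return "Green"
    if p_k < pd_k + k_y * sigma:
        return "Yellow"
    if p_k < pd_k + k_0 * sigma:
        return "Orange"
    return "Red"
```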
The results of the application of the above-mentioned statistical tests are summarized in
Table 1.

14 Usually one year.
15 9 in our case.
| Rating class | PD Bins (%) | Rating Class PD (%) | Out-of-sample Default Rate (%) | One-sided Binomial Test | Extended Traffic Light Approach |
| AAA | [0.00, 0.05)  | 0.03  | 0.00  | Passed | Green |
| AA  | [0.05, 0.42)  | 0.24  | 0.03  | Passed | Green |
| A   | [0.42, 0.55)  | 0.48  | 0.08  | Passed | Green |
| BBB | [0.55, 0.74)  | 0.64  | 0.21  | Passed | Green |
| BB  | [0.74, 1.00)  | 0.87  | 0.40  | Passed | Green |
| B   | [1.00, 1.42)  | 1.21  | 0.83  | Passed | Green |
| CCC | [1.42, 2.12)  | 1.77  | 1.29  | Passed | Green |
| CC  | [2.12, 9.03)  | 5.57  | 5.06  | Passed | Green |
| C   | [9.03, 100)   | 54.52 | 33.77 | Passed | Green |

Table 1: Internally calibrated PD clustering into 9 rating classes. Despite being borrowed from
the S&P rating scale, the labels are assigned to a PD calibrated on an internal dataset (the one
used during the training phase) and do not correspond to any rating agency's PD.

4 Model explainability
Machine learning models which operate in higher dimensions than can be directly visualized
by the human mind are often referred to as "black boxes", in the sense that high model performance
is often achieved to the detriment of output explainability, leaving users unable to understand the
logic behind model predictions. Ever greater attention to model interpretability has led to the
development of several methods that provide explanations of machine learning outputs, both in
terms of global and local interpretability. In the first case, the goal is to explain and
understand model decisions based on conditional interactions between the dependent variable
(i.e. the target) and the independent features over the entire dataset. In the second case, the aim is to
understand the model output for a single prediction by looking at a local subregion of the feature
space around that instance.
Two popular approaches, described hereafter, are SHAP and LIME, which explore
and leverage the property of local explainability to build surrogate models able to
interpret the output of any machine learning model. The technique upon which these algorithms
are based consists in slightly perturbing the input and modelling the changes in prediction by means of
model-agnostic surrogates. In particular, SHAP measures how much each feature in our model
contributes, either positively or negatively, to each prediction, in terms of the difference between
the actual prediction and its expected value. LIME builds sparse linear models around each
prediction to explain how the black-box model works in that local vicinity.

4.1 SHAP
SHAP, which stands for SHapley Additive exPlanations [23], is a novel approach to model
explainability which exploits the idea of the Shapley regression value16 to score feature influence.
SHAP values quantify the magnitude and direction (positive or negative) of a feature's
effect on a prediction via an additive feature attribution method. In simple words, SHAP builds
model explanations by asking, for each prediction i and feature j, how prediction i changes when j is
removed from the model. Since SHAP considers all possible predictions for an instance using all
16 The technical definition of the Shapley value is the average marginal contribution of a feature value over all possible coalitions.
possible combinations of feature inputs, it can guarantee both consistency and local accuracy.
In more detail, the SHAP method computes Shapley values from coalitional game theory. The
feature values of a data instance act as players17 in a coalition: Shapley values suggest how to
fairly distribute the payout (i.e. the prediction) among the features.
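A minimal sketch of how SHAP values can be obtained for a tree ensemble with the `shap` package, assuming `clf` is the fitted LGBM classifier and `X` a DataFrame of features; the plotted feature name is taken from Figure 8 and is only illustrative.

```python
import shap

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)
# some shap versions return one array per class for binary classifiers; keep the positive class
if isinstance(shap_values, list):
    shap_values = shap_values[1]

shap.summary_plot(shap_values, X)                       # global importance and effects (cf. Figure 7)
shap.dependence_plot("DEBT EQUITY", shap_values, X)     # single-feature effect (cf. Figure 8)
```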

SHAP summary plot As reported in Figure 7, it combines feature importance with feature
effects to measure the global impact of features on the model. For each feature shown on the
y-axis, ordered according to importance, each point on the plot represents the Shapley
value (reported along the x-axis) for a given prediction. The colour of each point represents the
value of the feature, from low (i.e. blue) to high (i.e. red). Overlapping points
are jittered in the y-axis direction, so we get a sense of the distribution of the Shapley values per
feature.

Figure 7: SHAP summary plot for the Light-GBM classifier. The details of the model's feature
descriptions are reported in Appendix A.

17 Note that a player can be an individual feature value or a group of feature values.
SHAP dependence plot This is a scatter plot that shows the effect a single feature has on the
model predictions. In particular, each dot represents a single prediction, with the feature value
on the x-axis and its SHAP value, representing how much knowing that feature's value changes
the output of the model for that sample's prediction, on the y-axis. The colour corresponds to
a second feature that may have an interaction effect with the plotted feature. If an interaction
effect is present, it shows up as a distinct vertical pattern of colouring.

SHAP waterfall plot The waterfall plot reported in Figure 9 is designed to display how
the SHAP values of each feature move the model output from our prior expectation under
the background data distribution, E[f(X)], to the final model prediction f(X) given the
evidence of all the features. Features are sorted by the magnitude of their SHAP values, with
the smallest-magnitude features grouped together at the bottom of the plot. The colour of each
row indicates whether the feature pushes the model output lower (i.e. blue) or higher (i.e. red).


Figure 8: SHAP dependence plots for ACTIVITY (Figure 8a), cashAndMarketableSecurities
(Figure 8b), DEBT EQUITY (Figure 8c), EBITDA RATIO (Figure 8d), netIncome (Figure 8e),
ROI (Figure 8f) and totalInterestExpense (Figure 8g). The details of the model's feature
descriptions are reported in Appendix A.
Figure 9: SHAP waterfall plot. The details of the model's feature descriptions are reported in
Appendix A.

4.2 LIME

LIME (Local Interpretable Model-agnostic Explanations) is a novel technique that explains the
predictions of any classifier in an interpretable and faithful manner, by learning an interpretable
model locally around the prediction [24]. Behind the workings of LIME lies the assumption that
every complex model is linear on a local scale, so it is possible to fit a simple model around a single
observation that mimics how the global model behaves at that locality. The output of LIME
is a list of explanations reflecting the contribution of each feature to the prediction for a data
sample, which makes it possible to determine which feature changes will have the most impact on the
prediction. Note that LIME has the desirable property of additivity, i.e. the sum of the individual
impacts equals the total impact. Results for one prediction are summarized in Figure 10.

Figure 10: LIME local explanation for a prediction from the Light-GBM classifier. The details of
the model's feature descriptions are reported in Appendix A.

5 Conclusions

Starting from Moody's dataset of historical balance sheets, bankruptcy statuses and macroeconomic
variables, we have built three models: a classifier, a default probability model and a rating
system. By leveraging modern techniques in both data processing and parameter calibration
we have reached state-of-the-art results. The three models show excellent out-of-sample
performance, allowing for intensive usage in risk-averse businesses where the occurrence of false
negatives can dramatically harm the firm itself. The explainability layers via SHAP and LIME
provide a set of extra tools to increase confidence in the model and help in understanding the
main features determining a specific result. This information can be leveraged by the analyst
to understand how to reduce the bankruptcy probability of a specific firm or to gain insight into
which balance-sheet fields need to be improved to increase the rating, therefore providing a
business instrument to actively manage clients and structured finance deals.

Acknowledgements
We are grateful to Corrado Passera for encouraging our research.

References
[1] P. Mizen and S. Tsoukas, “Forecasting us bond default ratings allowing for previous and
initial state dependence in an ordered probit model,” International Journal of Forecasting,
vol. 28, no. 1, pp. 273–287, 2012.

[2] P. Gurnỳ and M. Gurnỳ, “Comparison of credit scoring models on probability of default
estimation for us banks,” 2013.

[3] Z. Zhao, S. Xu, B. H. Kang, M. M. J. Kabir, Y. Liu, and R. Wasinger, “Investigation and
improvement of multi-layer perceptron neural networks for credit scoring,” Expert Systems
with Applications, vol. 42, no. 7, pp. 3508–3516, 2015.

[4] A. Petropoulos, V. Siakoulis, E. Stavroulakis, A. Klamargias, et al., “A robust machine


learning approach for credit risk analysis of large loan level datasets using deep learning
and extreme gradient boosting,” Are Post-crisis Statistical Initiatives Completed, vol. 49,
pp. 49–49, 2019.

[5] P. M. Addo, D. Guegan, and B. Hassani, “Credit risk analysis using machine and deep
learning models,” Risks, vol. 6, no. 2, p. 38, 2018.

[6] A. R. Provenzano, D. Trifirò, N. Jean, G. Le Pera, M. Spadaccino, L. Massaron, and C. Nordio,
“An artificial intelligence approach to shadow rating,” 2019.

[7] Moody’s Analytics, “Credit research database.” https://www.moodysanalytics.com/product-list/credit-research-database-crd.

[8] Oxford Economics, “Global economic databank.” https://www.oxfordeconomics.com/Global-Economic-Databank.

[9] A. Zheng and A. Casari, Feature engineering for machine learning: principles and techniques
for data scientists. O’Reilly Media, Inc., 2018.

[10] C. Guo and F. Berkhahn, “Entity embeddings of categorical variables,” arXiv preprint
arXiv:1604.06737, 2016.

[11] G. Perani, V. Cirillo, et al., “Matching industry classifications. a method for converting
nace rev. 2 to nace rev. 1,” tech. rep., 2015.

[12] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.

[13] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language


model,” Journal of machine learning research, vol. 3, no. Feb, pp. 1137–1155, 2003.

[14] Y. Goldberg, “Neural network methods for natural language processing,” Synthesis Lectures
on Human Language Technologies, vol. 10, no. 1, pp. 1–309, 2017.

[15] A. Géron, Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools,
and techniques to build intelligent systems. O’Reilly Media, Inc., 2017.

[16] https://github.com/microsoft/LightGBM.

[17] J. H. Friedman, “Stochastic gradient boosting,” Computational statistics & data analysis,
vol. 38, no. 4, pp. 367–378, 2002.

[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detec-
tion,” in Proceedings of the IEEE international conference on computer vision, pp. 2980–
2988, 2017.

[19] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers,
et al., “Practical lessons from predicting clicks on ads at facebook,” in Proceedings of the
Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9, 2014.

[20] R. Storn and K. Price, “Differential evolution - a simple and efficient heuristic for global
optimization over continuous spaces,” Journal of Global Optimization, vol. 11, pp. 341–359,
01 1997.

[21] D. Tasche, “A traffic lights approach to pd validation,” arXiv preprint cond-mat/0305038,


2003.

[22] S. Blochwitz, S. Hohl, and C. Wehn, “Reconsidering ratings,” Wilmott Magazine, vol. 5,
pp. 60–69, 2005.

[23] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in
Advances in neural information processing systems, pp. 4765–4774, 2017.

[24] M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should I trust you?” Explaining the predictions
of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference
on knowledge discovery and data mining, pp. 1135–1144, 2016.

Appendices
A Model’s feature descriptions
In this section, the details of the selected features on which the model has been trained are
reported.

A.1 Balance-sheet index descriptions

| Code | Definition |
| cashAndMarketableSecurities | Cash and marketable securities. |
| depreciationExpense | The depreciation expense for current period. |
| ebitda | Earnings before interest, taxes, depreciation and amortization before extraordinary items. |
| entityConsolidationType | For companies with subsidiaries, i |
| incorporationRegion | The entity’s incorporation region. |
| incorporationState | The entity’s incorporation province or administrative division where the entity has a legal representation. |
| longTermDebtCurrentMaturities | The current maturities of long-term debt, principal payments due within 12 months. |
| netIncome | Net income is the total period-end earnings. |
| netWorth | Net worth is the sum of all equity items, including retained earnings and other equity. |
| payableToTrade | Accounts payable to regular trade accounts. |
| receivableFromTrade | Accounts receivable from trade. |
| retainedEarnings | Retained earnings. |
| tangibleNetWorth | Defined as the difference between netWorth and totalIntangibleAssets. |
| totalAccountsPayable | The sum of accounts payable. |
| totalAccountsReceivable | The total accounts receivable, net of any provision from loss. |
| totalAmortizationAndDepreciaton | The sum of amortization and depreciation expense for current period. |
| totalAssets | The total assets of the borrower which is the sum of the current assets and non-current assets. |
| totalCapital | Total subscribed and share capital. |
| totalCapital and totalLiabilities | Defined as the sum of totalLiabilities and totalCapital. |
| totalCurrentAssets | The sum of all current assets. |
| totalCurrentLiabilities | The sum of all current liabilities. |
| totalFixedAssets | Total fixed assets are the Gross Fixed Assets less Accumulated Depreciation. |
| totalIntangibleAssets | Total intangible assets. |
| totalInterestExpense | The total interest expense is any gross interest expense generated from short-term, long-term, subordinated or related debt. |
| totalInventory | The sum of all the inventories. |
| totalLiabilities | The sum of Total Current Liabilities and Total non-current liabilities. |
| totalLongTermDebt | The amount due to financial and other institutions after 12 months. |
| totalOperatingExpense | The sum of all operating expenses. |
| totalOperatingProfit | The Gross Profit less Total Operating Expense. |
| totalProvisions | Total provisions for pensions, taxes, etc. |
| totalSales | Total sales. |
| totalWageExpense | The total wage expense. |
| workingCapital | Defined as the sum of receivableFromTrade, totalAccountsReceivable and totalInventory minus payableToTrade and totalAccountsPayable. |

A.2 KPI descriptions


ACID = \frac{cashAndMarketableSecurities + totalAccountsReceivable}{totalCurrentLiabilities}    (7)

ACTIVITY = \frac{totalCurrentLiabilities}{totalSales}    (8)

AGE = \frac{financialStatementDate - incorporationDate}{365.0}    (9)

where "financialStatementDate" is the date of the financial statement and "incorporationDate" the date on which the entity was incorporated.

ASSET TURNOVER = \frac{totalSales}{totalAssets}    (10)

CURRENT RATIO = \frac{totalCurrentAssets}{totalCurrentLiabilities}    (11)

DEBT COVERAGE = \frac{ebitda}{totalInterestExpense}    (12)

DEBT EQUITY = \frac{totalLiabilities}{netWorth}    (13)

EBITDA RATIO = \frac{ebitda}{totalSales}    (14)

FFO = netIncome + totalAmortizationAndDepreciaton + depreciationExpense    (15)

IND ROTA = \frac{workingCapital}{totalSales}    (16)

IND STRUTT = \frac{PFN}{netWorth}    (17)

INVENTORY TURNOVER = \frac{totalInventory}{totalSales}    (18)

LEVERAGE 1 = \frac{PFN}{ebitda}    (19)

LEVERAGE 2 = \frac{FFO}{PFN}    (20)

LONG-TERM-DEBT EQUITY = \frac{totalLongTermDebt}{netWorth}    (21)

NETINCOME RATIO = \frac{netIncome}{totalSales}    (22)

PFN = totalLongTermDebt + longTermDebtCurrentMaturities - cashAndMarketableSecurities    (23)

ROA = \frac{netIncome}{totalAssets}    (24)

ROE = \frac{netIncome}{netWorth}    (25)

ROI = \frac{totalOperatingProfit}{totalAssets}    (26)

SHORT-TERM-DEBT EQUITY = \frac{totalCurrentLiabilities}{netWorth}    (27)

A.3 Macro-economic factor descriptions

Country specific indicators

| Code | Name | Definition |
| C | Consumption, private, real | The volume of goods and services consumed by households and non-profit institutions serving households. |
| CD | Durable goods | The volume of real personal consumption expenditures. |
| CREDR | Credit rating, average | The sovereign risk rating, based on the average of the sovereign ratings provided by Moodys, S&P and Fitch. |
| CU | Capacity utilisation | A measure of the extent to which the productive capacity of a business is being used. |
| DOMD | Domestic demand, real | The volume of consumption, investment, stockbuilding and government consumption expressed in local currency and at prices of the country’s base year. |
| EE | Employees in employment | Employees in employment. |
| ET | Employment, total | Employment, total. |
| GC | Consumption, government, real | The volume of government spending on goods and services. |
| GDP | GDP, real | The volume of all final goods and services produced within a country in a given period of time. |
| GDPHEAD | GDP per capita, real, US$, constant prices | GDP per capita, real, US$, constant prices. |
| IF | Investment, total fixed investment, real | Investment, total fixed investment, real. |
| IP | Industrial production index | The volume of investment in tangible and intangible capital goods, including machinery and equipment, software, and construction. |
| IPNR | Investment, private sector business, real | The volume of investment in private sector business. |
| IPRD | Investment, private dwellings, real | The volume of investment in private dwellings. |
| IS | Stockbuilding, real | The volume of stocks of outputs that are still held by the units that produced them and stocks of products acquired from other units that are intended to be used for intermediate consumption or for resale. |
| M | Imports, goods & services, real | The volume of goods and services imports. |
| MG | Imports, goods, real | The volume of goods imports. |
| MS | Imports, services, real | The volume of services imports. |
| PEWFP | GDP, compensation of employees, total, nominal | The values of wages and salaries of employees as a component of GDP. |
| PH | House price index | Index of house prices. |
| POIL$ | Oil price US$ per toe | Oil price US$ per toe. |
| RCB | Interest rate, central bank policy | The rate that is used by central bank to implement or signal its monetary policy stance (expressed as an average). |
| RCORP SPREADEOP | Credit spreads, end of period | The difference in yield between two bonds of similar maturity but different credit quality, expressed as end of period value. |
| RLG | Interest rate, long-term government bond yields | Interest rate, long-term government bond yields. |
| RS | Retail Sales volume index, excluding automotive | Volume index for retail sales excluding automotive. |
| RSH | Interest rate, short-term | The 3-month interbank rate. |
| RSHEOP | Interest rate, short-term, end of period | The 3-month interbank rate for the end of period. |
| SMEPS | Stockmarket earnings per share | Stockmarket earnings per share calculated as Stockmarket earnings, LCU * 1000 / Stockmarket shares outstanding. |
| SMP TR | Share price total return index | Share price total return index. |
| TFE | Total final expenditure, real | The sum of volumes of consumption, investment, stockbuilding, government consumption and exports. |
| U | Unemployment | The total number of people without a job, but actively searching for one. |
| UP | Unemployment rate | The percentage of the labour force that is unemployed at a given date. |
| X | Exports, goods & services, real | The volume of goods and services exports expressed in local currency and at the country’s base year. |
| XG | Exports, goods, real | The volume of goods exports expressed in local currency. |
| XS | Exports, services, real | The volume of services exports expressed in local currency. |

Eurozone indicators

| Code | Name | Definition |
| PH | House price index | Index of house prices. |
| RCB | Interest rate, central bank policy | The rate that is used by central bank to implement or signal its monetary policy stance (expressed as an average). |
| RCBEOP | Interest rate, central bank policy, end of period | The rate that is used by central bank to implement or signal its monetary policy stance (expressed as an end of period value). |
| REONIA | Interest rate, EONIA | 1-day interbank interest rate for the Eurozone. |
| RLG | Interest rate, long-term government bond yields | Interest rate, long-term government bond yields. |
| RSH | Interest rate, short-term | The 3-month interbank rate. |
| RSH6M | Interest rate, 6-month | The 6-month interbank rate. |
| RSHEOP | Interest rate, short-term, end of period | The 3-month interbank rate for the end of period. |
| RSWAP2YR | Interest Rate Swap, 2-year | Swap par-rate for 2 years tenor. |
| RSWAP5YR | Interest Rate Swap, 5-year | Swap par-rate for 5 years tenor. |
| RSWAP10YR | Interest Rate Swap, 10-year | Swap par-rate for 10 years tenor. |
| RSWAP30YR | Interest Rate Swap, 30-year | Swap par-rate for 30 years tenor. |
