Machine Learning Approach For Credit Scoring
August 5, 2020
arXiv:2008.01687v1 [q-fin.ST] 20 Jul 2020
Working paper¹

∗ Corresponding author: [email protected]
¹ This paper reflects the authors' opinions and not necessarily those of their employers.
Abstract
In this work we build a stack of machine learning models aimed at composing a state-of-the-art
credit rating and default prediction system, obtaining excellent out-of-sample performances.
Our approach is an excursion through the most recent ML/AI concepts, starting from natural
language processing (NLP) applied to economic sectors' (textual) descriptions using embeddings
and autoencoders (AE), going through the classification of defaultable firms on the basis of a
wide range of economic features using gradient boosting machines (GBM), and calibrating their
probabilities while paying due attention to the treatment of unbalanced samples. Finally, we
assign credit ratings through genetic algorithms (differential evolution, DE). Model interpretability
is achieved by implementing recent techniques such as SHAP and LIME, which explain predictions
locally in feature space.
JEL Classification codes: C45, C55, G24, G32, G33. AMS Classification codes: 62M45, 68T01,
68T50, 91G40.
Introduction
In the aftermath of the economic crisis, the probability of default (PD) has become a topical
theme in the field of financial research. Indeed, given its usage in risk management, in the
valuation of credit derivatives, in the estimation of the creditworthiness of a borrower and
in the calculation of economic or regulatory capital for banking institutions (under Basel II),
an incorrect PD prediction can lead to a false valuation of risk, unreasonable ratings and incorrect
pricing of financial instruments. In the last decades, a growing number of approaches has been
developed to model the credit quality of a company by exploiting statistical techniques. Several
works have employed probit models [1] or linear and logistic regression to estimate company
ratings using the main financial indicators as model inputs. However, these models suffer from
their clear inability to capture non-linear dynamics, which are prevalent in financial ratio data [2].
New statistical techniques, especially from the field of machine learning, have gained a worldwide
reputation thanks to their ability to efficiently capture information from big datasets by
recognizing non-linear patterns and temporal dependencies among data. Zhao et al. (2015) [3]
employed feed-forward neural networks in corporate credit rating determination. Petropoulos
et al. [4] explore two state-of-the-art techniques, namely Extreme Gradient Boosting (XGBoost)
and deep learning neural networks, in order to estimate loan PD and calibrate an internal rating
system, useful both for internal usage and for regulatory scope. Addo et al. (2018) [5] built binary
classifiers based on machine and deep learning models on real data to predict the loan probability
of default. They observed that tree-based models are more stable than those based on multilayer
artificial neural networks.
Starting from these studies, we propose a sophisticated framework of machine learning models
which, on the basis of companies' annual (end-of-year) financial statements coupled with relevant
macroeconomic indicators, attempts to classify the status of a company (performing, i.e. "in
bonis", or defaulted) and to build a robust rating system in which each rating class is matched to
an internally calibrated default probability. In this regard, here the target variable is different
from a previous work by some of the authors [6], where the goal was to predict the credit rating
that Moody's would assign, according to an approach commonly called "shadow rating". The
novelty of our approach lies in the combination of data preprocessing algorithms, responsible for
feature engineering and feature selection, and a core model architecture made of a concatenation
of a Boosted Tree default classifier, a probability calibrator and a rating attribution system
based on a genetic algorithm. Great attention is then given to model interpretability, as we
propose two intuitive approaches to interpret the model output by exploiting the property of
local explainability. In detail, the article is organized as follows: Section 1 is devoted to describing
the input dataset and the preprocessing phase; Section 2 explains the core model architecture;
Section 3 collects results from the core model structure (i.e. default classifier, PD calibrator and
rating clustering); finally, Section 4 is devoted to model explainability.
1 Dataset description
Data used for model training have been collected from the Credit Research Database (CRD)
provided by Moody's, and consist of 919,636 annual (end-of-year) financial statements of 157,986
Italian companies belonging to different sectors (e.g. automotive, construction, consumer goods
durable and non-durable, energy, high-tech industries, media, etc., to the exclusion of the FIRE
sector, i.e. finance, insurance and real estate). The dependent variable in our dataset,
i.e. the target of the proposed default prediction model, is a binary indicator with the value of
1 flagging a default event (i.e. a bankruptcy occurrence over a one-year horizon), 0 otherwise.
In accordance with the above-defined target variable, the input variables of our model have been
selected to be consistent with factors that can affect a company's capacity to service external
debt (a full explanation of the model's input features is reported in Appendix A). In particular,
they consist of balance-sheet indexes and ratios, and Key Performance Indicators (KPI)
calculated from the CRD's financial reports [7]. The latter include indicators for efficiency (i.e.
measures of operating performance), liquidity (i.e. ratios used to determine how quickly a company
can turn its assets into cash if it is experiencing financial distress or impending bankruptcy),
solvency (i.e. ratios that depict how much a company relies upon its debt to fund operations)
and profitability (i.e. measures that demonstrate how profitable a company is). Since business
cycles can have a great impact on a firm's profitability and influence its risk profile, we joined the
original information with more general macro variables (2-year lagged historical data) addressing
the surrounding climate in which companies operate. Among the wide range of macroeconomic
indicators provided by Oxford Economics [8], a subset of the most influential ones has been
selected as explanatory variables. Some of them are country-specific, others are common to the
whole Eurozone. The combined dataset of balance-sheet indexes, financial ratios and macro
variables, along with data transformations and feature selection (better described hereafter in
Section 1.1), led to a set of 179 features and covers the period 2011-2017.
As fully described hereafter, the obtained dataset was split into three parts: an out-of-time
dataset which includes the data referred to year 2017 (marked in light blue in Figure 1), used
to test model performance; and a stratified pair of train/test datasets (constituting 80% and 20%
of the input dataset, respectively) which covers the period 2011-2016, employed for model
development and calibration.
Figure 1: Number of balance sheets per financial statement year and the corresponding 1-year
default rate. The default rate shows an increasing trend in the 2011-2012 period, and then
decreases until 2017.
a number. By applying such a transformation, known as Label Encoding, a model would treat
categories as ordered integers, which would imply non-existent ordinal relationships between
data and could be misleading for model training. Another simple way to handle categorical
data is the One-Hot Encoding technique, which consists in transforming each categorical feature
into a fixed-size sparse vector of all zeros except for a 1 in the cell that uniquely identifies the
specific realization of that variable. The main drawback of this technique is that categories
with a high number of possible realizations generate high-dimensional datasets, which makes it
a memory-inefficient encoder. Moreover, this sparse representation does not preserve similarity
between feature values. An alternative approach to overcome these issues is represented by
Categorical Embedding, which consists in mapping, via a Deep Neural Network (DNN), each
possible discrete value of a given categorical variable into a low-dimensional, learned, continuous
vector representation. This method places each categorical feature in a Euclidean space, keeping
a coherent relationship with the other realizations of the same variable. The extension
of the categorical embedding approach to word and document representation is known as Word
Embedding [10]. In particular, Sentence Embedding is an application of word embedding aiming
at representing a full sentence in a vector space. In this study, we applied sentence embedding
to represent the industry sector descriptions associated to each NACE code (the "Statistical
Classification of Economic Activities in the European Community", commonly referred to as
NACE, is the industry-standard classification system used in the European Union). In order to
guarantee the "semantics" of the original data and work in a low-dimensional space, we propose
a framework of embedding with autoencoder regularization, in which the original data are
embedded into low-dimension vectors. The obtained embeddings maintain local similarity and
can be easily reverted to their original forms. The encoding of the NACE is a novel way to
overcome the NACE1-NACE2 mapping conundrum: in our dataset both NACE versions are used
and, as already stated in many papers, the two encoding systems are not fully compatible [11].
Moreover, the NACE encoding allows for a proper industry-segment description of multi-sector
firms that cannot be easily described by a single NACE code, further extending the predictive
power of the economic sector category.
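To make the embedding-with-autoencoder step concrete, the sketch below compresses the 300-dimensional spaCy sentence vectors of the NACE descriptions with a stacked autoencoder. It is a minimal sketch: the Keras API, the layer sizes and the 16-dimensional code are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch: compress 300-d sentence embeddings with a stacked autoencoder.
# Layer sizes and the code dimension are assumptions for illustration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

embedding_dim = 300   # size of the spaCy sentence vectors
code_dim = 16         # assumed size of the compressed NACE representation

inputs = keras.Input(shape=(embedding_dim,))
x = layers.Dense(128, activation="relu")(inputs)            # encoder
x = layers.Dense(64, activation="relu")(x)
code = layers.Dense(code_dim, activation="linear", name="code")(x)
x = layers.Dense(64, activation="relu")(code)               # decoder
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(embedding_dim, activation="linear")(x)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)
# Reconstruction loss: mean-squared error between input and reconstructed output.
autoencoder.compile(optimizer="adam", loss="mse")

# sector_vectors stands in for the (n_sectors, 300) matrix of spaCy embeddings.
sector_vectors = np.random.rand(500, embedding_dim).astype("float32")
autoencoder.fit(sector_vectors, sector_vectors, epochs=50, batch_size=32, verbose=0)
compressed = encoder.predict(sector_vectors)   # low-dimensional NACE representation
```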
A different encoding method is Target Encoding, in which categorical features are replaced
with the mean target value of the samples having that category. This allows an arbitrary
number of features to be encoded without increasing data dimensionality. However, as a drawback,
a naive application of this type of encoding can allow data leakage, leading to model overfitting
and poor predictive performance. A target encoding algorithm developed to prevent data
leakage is known as the James-Stein estimator, and it is the one used in our model. In more
detail, it transforms each categorical feature into a weighted average of the mean target value for
the observed feature value and the mean target value computed regardless of the feature realization.
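As an illustration of this encoder, the sketch below uses the James-Stein encoder from the open-source category_encoders package; the package choice, the column name and the toy data are assumptions, not the authors' implementation.

```python
# Minimal sketch of James-Stein target encoding on a toy categorical column.
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    "region": ["north", "south", "south", "centre", "north", "south"],
    "default_flag": [0, 1, 0, 0, 1, 1],
})

# Each category is replaced by a weighted average of its own mean target value
# and the global mean, which shrinks rarely observed categories toward the prior.
encoder = ce.JamesSteinEncoder(cols=["region"])
encoded = encoder.fit_transform(df[["region"]], df["default_flag"])
print(encoded)
```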
As described above, some feature transformations could result in a general increase of the input
data dimensionality, which makes it important to implement a robust and independent feature
selection framework. In fact, training a machine learning model on a huge number of independent
variables is doomed to suffer from the so-called curse of dimensionality [12], i.e. the problem
of the exponential increase in volume associated with adding extra dimensions to a vector space.
We employed a voting ensemble of models to independently assign importance to the available
features and efficiently select those features which will contribute most to model prediction.
Hereafter in this section we will look into the implementation of satellite models aimed at:
performing sentence embedding of the industry sector descriptions; reducing the embedding
dimensionality via a stacked autoencoder (a 3-layer encoder and 3-layer decoder trained to
minimize a reconstruction loss, i.e. the mean-squared error or cross-entropy between the
reconstructed output and the input, which penalizes the network for producing outputs different
from the original input); and selecting relevant features via a voting approach.
Sentence Embedding of sector descriptions A common practice in Natural Language
Processing (NLP) is the use of pre-trained embeddings to represent words or sentences in a
document. Following this common practice, we use the pre-trained models built into the SpaCy
NLP library for embedding the NACE sector textual descriptions. In particular, we performed
sentence embedding, i.e. we transformed each description into a 300-dimensional real-valued
vector. Each sentence embedding is automatically constructed by SpaCy by averaging the
300-dimensional real-valued pre-trained vectors which map each word in that sentence.
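A minimal sketch of this step is shown below; the specific pre-trained pipeline (en_core_web_md) is an assumption, since the exact spaCy model is not named in the text.

```python
# Minimal sketch: sentence embedding of a NACE-style sector description with spaCy.
# Requires a pipeline shipping 300-d word vectors, e.g. "en_core_web_md".
import spacy

nlp = spacy.load("en_core_web_md")
description = "Manufacture of motor vehicles, trailers and semi-trailers"

doc = nlp(description)
sentence_vector = doc.vector     # average of the 300-d token vectors
print(sentence_vector.shape)     # (300,)
```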
Here is a glimpse of how spaCy processes textual data. It first segments text into words,
punctuation, symbols and others by applying language-specific rules (i.e. it tokenizes the text).
Then it performs Part-of-Speech (POS) tagging to understand the grammatical properties of each
word by means of a built-in statistical model. Such a model consists of binary data trained on a
dataset large enough to allow the system to make predictions that generalize across the language.
A key assumption of the word embedding approach is the idea of using, for each word, a dense
distributed representation learned from the usage of words [13]. This allows words that
are used in similar ways to have similar representations, naturally capturing their meaning [14].
Given the high importance the industry sector has in the financial literature as a default prediction
driver, embedding the NACE industry descriptions improves the overall model performance in
application, by helping the model to generalize better and to smoothly handle unseen elements.
Voting approach for feature selection Feature selection is a key component when building
machine learning models. We can either delegate this task to the main model or use a set of
lighter models in a preparatory step, so that the effort required for further feature selection is
reduced when training the main model. This is particularly useful for multi-parameter models
like Light-GBM (a fast, high-performance gradient boosting framework based on decision tree
algorithms), where the training phase also involves the calibration of a set of hyperparameters
usually spanning very wide ranges. Neglecting the expert-based component, algorithmic feature
selection methods are usually divided into three classes: filter methods, wrapper methods and
embedded methods. Filter-based methods apply a statistical measure to assign a score to each
feature; the variables of the starting dataset are then ranked according to their scores and either
kept or removed. Wrapper-based methods consider the selection of a set of features as a "search
problem", where different combinations are prepared, evaluated and compared to other
combinations; in detail, a predictive model is used to evaluate a combination of features and
assign a score based on model accuracy. Embedded methods learn which features best contribute
to the accuracy of the model while the model is being created.
We combined a set of 6 different models for feature selection, stacking each algorithm into a
hard-voting framework where the features which receive the highest number of votes among all
the models are selected. In particular, after having transformed categorical features via target
encoding (by means of the James-Stein encoder), each feature in the dataset has been ranked on
the basis of the following models:
• Random Forest Classifier (RF), an ensemble of decision trees generally trained via bagging:
every tree is trained with the same algorithm but on a different random subset of the training
set, and the ensemble predicts a new instance by aggregating the predictions of all trees. This
is a wrapper-based method that uses a built-in algorithm for feature selection: variables are
selected according to the feature importance obtained by averaging the feature importances
of all decision trees.
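The sketch below illustrates the hard-voting mechanism with three example selectors from scikit-learn; the paper combines six models, so the selectors, their settings and the synthetic data here are assumptions used only to show how votes are collected.

```python
# Minimal sketch of hard-voting feature selection across several ranking models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)
k = 10                                  # number of features each selector votes for
votes = np.zeros(X.shape[1], dtype=int)

# 1) Random Forest impurity-based importance
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
votes[np.argsort(rf.feature_importances_)[-k:]] += 1

# 2) Mutual information between each feature and the default flag
mi = mutual_info_classif(X, y, random_state=0)
votes[np.argsort(mi)[-k:]] += 1

# 3) Absolute coefficients of an L1-penalised logistic regression
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
votes[np.argsort(np.abs(lr.coef_[0]))[-k:]] += 1

# Keep the features that collect the most votes across the selectors.
selected = np.sort(np.argsort(votes)[-k:])
print(selected)
```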
2 Model architecture
Moving beyond the satellite models described in Section 1.1 and used in the preprocessing phase,
in this section we present the core model architecture. It consists of a concatenation of three
machine learning models aimed at building a robust and reliable framework whose purpose is
not only to classify the company status (as in bonis or defaulted), but also to construct an
internally calibrated rating system in which each rating class corresponds to a self-consistent
default probability.
In detail, we developed a Boosted Tree algorithm in order to classify the status of a company
over a one-year horizon. This is a classical binary classification problem whose target variable
is 1 in case of a default event and 0 otherwise. Hyper-parameter tuning has been performed via
an extension of cross-validation for time series that will be further described hereafter.
The purpose of the second model is to fit the output score of the binary classifier to the
actual default rate. We built a calibrator which consists of a Logistic Regression trained on the
leaf assignments of the Boosted Tree.
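A minimal sketch of this leaf-based calibration, in the spirit of the GBDT-plus-logistic-regression scheme of [19], is given below; the hyper-parameters and the synthetic, unbalanced data are assumptions.

```python
# Minimal sketch: logistic regression fitted on one-hot encoded leaf assignments
# of a LightGBM classifier, used to turn raw scores into calibrated PDs.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

gbm = LGBMClassifier(n_estimators=200, learning_rate=0.05).fit(X_tr, y_tr)

# Leaf index of every tree for every sample: shape (n_samples, n_trees).
leaves_cal = gbm.predict(X_cal, pred_leaf=True)
onehot = OneHotEncoder(handle_unknown="ignore")
calibrator = LogisticRegression(max_iter=1000).fit(onehot.fit_transform(leaves_cal), y_cal)

def calibrated_pd(X_new):
    """Calibrated default probabilities for new observations."""
    leaves = gbm.predict(X_new, pred_leaf=True)
    return calibrator.predict_proba(onehot.transform(leaves))[:, 1]

print(calibrated_pd(X_cal[:5]))
```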
Finally, we calibrated our own rating system by splitting the refitted default probability into 9
clusters via a genetic algorithm.
$$F_\beta = (1+\beta^2)\,\frac{\mathrm{specificity}\cdot\mathrm{recall}}{\beta^2\cdot\mathrm{specificity}+\mathrm{recall}} \qquad (1)$$
where specificity is the number of true negatives over the number of true negatives plus the
number of false positives, and recall is the number of true positives over the number of true
positives plus the number of false negatives.
In this procedure, each test set consists of a single year of future observations, while the
corresponding training sets are made up of the observations that occurred prior to the observations
that form the test set. In this way, the model is optimized to predict what will happen in
the future using only information available up to the present day. The objective function
used for the classification problem was the log-loss, which measures the distance between each
predicted probability and the actual class output value by means of a logarithmic penalty. Due
to the high imbalance between 0 and 1 target flags, we used a modified unbalanced log-loss,
by setting the scale_pos_weight parameter of Light-GBM equal to the ratio between the number
of 0s and the number of 1s. Other objective functions we tried, such as the Focal Loss [18] and
custom weighted log-losses, did not give any specific advantage compared to the unbalanced
log-loss.
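The sketch below illustrates both the expanding-window validation by financial statement year and the scale_pos_weight setting; the column names, hyper-parameters and synthetic data are assumptions, not the authors' configuration.

```python
# Minimal sketch: expanding-window (time-series) validation with an unbalanced
# log-loss via LightGBM's scale_pos_weight parameter.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6000, 10)), columns=[f"f{i}" for i in range(10)])
df["year"] = rng.integers(2011, 2017, size=len(df))
df["default"] = rng.binomial(1, 0.03, size=len(df))         # ~3% default rate
features = [c for c in df.columns if c.startswith("f")]

for test_year in range(2012, 2017):
    train = df[df["year"] < test_year]                       # only past information
    test = df[df["year"] == test_year]                       # one future year
    ratio = (train["default"] == 0).sum() / max((train["default"] == 1).sum(), 1)
    clf = LGBMClassifier(objective="binary", scale_pos_weight=ratio,
                         n_estimators=300, learning_rate=0.05)
    clf.fit(train[features], train["default"])
    auc = roc_auc_score(test["default"], clf.predict_proba(test[features])[:, 1])
    print(test_year, round(auc, 3))
```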
dissimilarities between different groups (the so-called separation, i.e. any two clusters should be
as distinct as possible in terms of similarity of items), ensure PD monotonicity (i.e. lower default
rates have to correspond to lower rating grades and vice versa) and guarantee an acceptable
cluster size (i.e. each cluster has to include a fraction of the total population that is roughly
homogeneous among the clusters). Among the partitioning clustering algorithms, Genetic
Algorithms (GA) are stochastic search heuristics inspired by the concepts of Darwinian evolution
and genetics. They are based on the idea of creating a population of candidate solutions to
an optimization problem, which is iteratively refined by alteration (mutation) and selection of
good solutions for the next iteration. Candidate solutions are selected according to a so-called
fitness function, which evaluates their quality with respect to the optimization problem. In the
case of Differential Evolution (DE) algorithms the candidate solutions are linear combinations of
existing solutions. In the end, the best individual of the population is returned: this individual
represents the best solution discovered by the algorithm.
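As a rough illustration, the sketch below uses SciPy's differential evolution to search for 8 PD cut-offs defining 9 rating classes; the fitness function (within-bucket variance plus a penalty on undersized buckets) is an illustrative assumption, not the authors' exact objective.

```python
# Minimal sketch: differential evolution searching for PD cut-offs that define
# 9 rating buckets with cohesive PDs and a minimum population share.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
pds = rng.beta(0.5, 20, size=20000)      # placeholder calibrated PDs in (0, 1)
n_classes = 9
min_share = 0.02                         # assumed minimum bucket size

def fitness(cuts):
    edges = np.concatenate(([0.0], np.sort(cuts), [1.0]))   # sorted cuts keep buckets monotone
    labels = np.digitize(pds, edges[1:-1])
    cost = 0.0
    for k in range(n_classes):
        bucket = pds[labels == k]
        if bucket.size / pds.size < min_share:
            cost += 10.0                                     # penalise undersized buckets
        if bucket.size > 1:
            cost += bucket.var()                             # reward tight, homogeneous buckets
    return cost

result = differential_evolution(fitness, bounds=[(0.0, 1.0)] * (n_classes - 1),
                                seed=0, maxiter=200)
print(np.sort(result.x))                 # the 8 cut-offs defining the 9 rating classes
```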
3 Results
The metric used to evaluate the model performance is the AUROC or AUC-ROC score (Area
Under the Receiver Operating Characteristic curve). In particular, the ROC is a probability
curve and the AUC represents the degree of separability. This measure tells how capable a model
is of distinguishing between classes: for an excellent model it is near 1, while for a model with no
discriminating power it is near 0.5. The ROC curve is constructed by evaluating the fraction of
"true positives" (tpr, or True Positive Rate) and "false positives" (fpr, or False Positive Rate) for
different threshold values. In detail, the tpr, also known as Recall or Sensitivity, is defined in
Equation (2) as the number of items correctly identified as positive out of the total true positives:
$$\mathrm{tpr} = \frac{TP}{TP + FN} \qquad (2)$$
where TP is the number of true positives and FN is the number of false negatives. The fpr,
also known as the Type I error rate, is defined in Equation (3) as the number of items wrongly
identified as positive out of the total true negatives:
$$\mathrm{fpr} = \frac{FP}{FP + TN} \qquad (3)$$
where FP is the number of false positives and TN is the number of true negatives.
Prediction results are then summarized in a confusion matrix, which counts the number of
correct and incorrect predictions made by the classifier. A threshold on the predicted probability
defines the cut-off point between the positive and negative classes; for the default classifier it has
been set at 0.5. However, a trade-off exists between tpr and fpr, such that changing the
classification threshold will shift the balance of predictions towards improving the True Positive
Rate at the expense of the False Positive Rate, or vice versa.
The metric used to evaluate the performance of the internally calibrated PD predictions is the
Brier score (BS), i.e. a way to verify the accuracy of a probability forecast in terms of its distance
from the actual outcomes. The most common formulation of the Brier score is the mean squared
error:
$$BS = \frac{1}{N}\sum_{t=1}^{N} (f_t - o_t)^2 \qquad (4)$$
in which $f_t$ is the forecast probability, $o_t$ the actual outcome of the event at instance $t$ and
$N$ is the number of forecasting instances. The best possible Brier score is 0, for total accuracy;
the worst possible score is 1, which means the forecast was wholly inaccurate.
Note that all the metrics described so far have been calculated on the Test set obtained by
splitting the dataset along the financial statement year (the Train set spans 2011 to 2016, the
Test set covers 2017).
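For reference, a minimal sketch of how these metrics can be computed with scikit-learn; the observed flags and predicted PDs below are placeholders.

```python
# Minimal sketch: AUROC, Brier score and normalized confusion matrix on placeholder data.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.03, size=10_000)                       # observed default flags
p_hat = np.clip(rng.normal(0.03 + 0.4 * y_true, 0.05), 0, 1)      # predicted PDs

print("AUROC:", roc_auc_score(y_true, p_hat))
print("Brier score:", brier_score_loss(y_true, p_hat))
print(confusion_matrix(y_true, p_hat >= 0.5, normalize="true"))   # 50% threshold
```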
Figure 3: Normalized confusion matrix with 50% threshold for Light-GBM classifier
3.2 PD refitting
Default probability forecasts before and after the refitting procedure are summarized in the
calibration plots (also called reliability curves) of Figure 4, which allow checking whether the
predicted probabilities produced by the model are well calibrated. Specifically, a calibration plot
consists of a line plot of the relative observed frequency (y-axis) versus the predicted probabilities
(x-axis): the predicted probabilities are divided into a fixed number of buckets along the x-axis,
the number of target events (i.e. the occurrence of a 1-year default) is counted for each bin and
normalized to give the relative observed frequency, and the results are plotted as a line. A perfect
classifier would produce only 0 and 1 predictions, but would not be able to forecast actual default
rates. A perfect actual-default-rate model would produce reliability diagrams as close as possible
to the main diagonal from the bottom left to the top right of the plot. The refitting procedure
maps the perfect classifier to a reliable default rate predictor.
The refitting procedure left the AUROC score of the model unchanged (AUROC = 95.0%); the
calibration performance is evaluated with the Brier score (BS = 1.2%; the closer the Brier score
is to zero, the better the forecast of default probabilities); classification results are summarized
in the normalized confusion matrix reported in Figure 6, where the threshold between the positive
and negative classes has been optimized on the ROC curve of Figure 5.
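A minimal sketch of the reliability-curve check of Figure 4, assuming 10 probability buckets and placeholder data:

```python
# Minimal sketch: calibration (reliability) curve of predicted PDs vs. observed frequencies.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.03, size=10_000)                     # placeholder default flags
p_hat = np.clip(rng.normal(0.03 + 0.4 * y_true, 0.05), 0, 1)    # placeholder predicted PDs

frac_pos, mean_pred = calibration_curve(y_true, p_hat, n_bins=10)
plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Predicted probability")
plt.ylabel("Observed default frequency")
plt.legend()
plt.show()
```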
Figure 4: Calibration plots and log-scaled histograms of the forecast probabilities before (panel a)
and after (panel b) refitting. The accuracy of the predicted probabilities is expressed in terms of
the log-loss measure.
Figure 5: ROC curve for the calibrated classifier
Figure 6: Normalized confusion matrix with optimized threshold for the calibrated classifier
3.3 PD clustering
Among the several common statistical tests that can be performed to validate the assignment of
a probability of default to a certain rating grade, two approaches have been used: the one-sided
Binomial test and the Extended Traffic Light Approach.
The Binomial test is one of the most popular single-grade, single-period (usually one year) tests
performed for rating system validation. For a certain rating grade $k \in \{1, \dots, K\}$, with
$K$ the number of rating classes (9 in our case), we make the assumption that default events are
independent within grade $k$ and can be modelled as a binomially distributed random variable
$X$ with size parameter $N_k$ and "success" probability $PD_k$. Thus, we can assess the
correctness of the PD forecast by testing the null hypothesis $H_0$:
• $H_0$: the actual default rate is less than or equal to the forecast default rate given by the PD.
The null hypothesis $H_0$ is rejected at a confidence level $\alpha$ if the number of observed
defaults $d$ per rating grade is greater than or equal to the critical value reported in Equation (5):
$$d_\alpha = \min\left\{ d \;:\; \sum_{j=d}^{N_k} \binom{N_k}{j}\, PD_k^{\,j}\, (1 - PD_k)^{N_k - j} \le 1 - \alpha \right\} \qquad (5)$$
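A minimal sketch of the critical value in Equation (5) for a single rating grade, with placeholder inputs:

```python
# Minimal sketch: one-sided binomial test for a rating grade via the survival function.
import numpy as np
from scipy.stats import binom

N_k = 5000                 # obligors in rating grade k (placeholder)
PD_k = 0.0121              # forecast PD of the grade (placeholder)
alpha = 0.99               # confidence level
observed_defaults = 52     # placeholder realization

d_grid = np.arange(N_k + 1)
tail = binom.sf(d_grid - 1, N_k, PD_k)          # P(X >= d) with X ~ Binomial(N_k, PD_k)
d_alpha = d_grid[tail <= 1 - alpha].min()       # smallest d with tail probability <= 1 - alpha

print("critical value:", d_alpha)
print("H0 rejected" if observed_defaults >= d_alpha else "H0 not rejected")
```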
The Extended Traffic Light Approach is a novel technique for default probability validation,
first adopted by Tasche (2003) [21]. The implementation used in this section refers to the heuristic
approach proposed by Blochwitz et al. (2005) [22], which is based on the estimation of a relative
distance between observed default rates and forecast probabilities of default, under the key
assumption of binomially distributed default events. Four coloured zones (Green, Yellow, Orange,
Red) are established to analyse the deviation of forecasts from actual realizations. In detail: if
the result of the validation assessment lies in the Green zone, there is no obvious contradiction
between the forecast and the realized default rate; the Yellow and Orange lights indicate that
the realized default rate is not compatible with the PD forecast, although the difference between
the realized rate and the forecast is still in the range of usual statistical fluctuations; and the
Red traffic light indicates a wrong forecast of the default probability. The boundaries between
the aforementioned light zones are summarized in Equation (6):
$$\begin{aligned}
\text{Green:} &\quad p_k < PD_k \\
\text{Yellow:} &\quad PD_k \le p_k < PD_k + K^y\,\sigma(PD_k, N_k) \\
\text{Orange:} &\quad PD_k + K^y\,\sigma(PD_k, N_k) \le p_k < PD_k + K^0\,\sigma(PD_k, N_k) \\
\text{Red:} &\quad PD_k + K^0\,\sigma(PD_k, N_k) \le p_k
\end{aligned} \qquad (6)$$
where $p_k$ is the realized default rate of grade $k$ and $\sigma(PD_k, N_k) = \sqrt{PD_k (1 - PD_k)/N_k}$.
The parameters $K^y$ and $K^0$ play a major role in the validation assessment, so they have to
be tuned carefully. A proper choice based on practical considerations is setting $K^y = 0.84$ and
$K^0 = 1.44$, which corresponds to probabilities of 0.5 of observing green, 0.3 of observing yellow,
0.15 of observing orange and 0.05 of observing red.
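A minimal sketch of the zone assignment in Equation (6), using the $K^y$ and $K^0$ values quoted above and placeholder grade-level inputs:

```python
# Minimal sketch: Extended Traffic Light zone for one rating grade.
import math

def traffic_light(pd_forecast, realized_rate, n_obligors, k_y=0.84, k_0=1.44):
    sigma = math.sqrt(pd_forecast * (1 - pd_forecast) / n_obligors)
    if realized_rate < pd_forecast:
        return "Green"
    if realized_rate < pd_forecast + k_y * sigma:
        return "Yellow"
    if realized_rate < pd_forecast + k_0 * sigma:
        return "Orange"
    return "Red"

print(traffic_light(pd_forecast=0.0121, realized_rate=0.0083, n_obligors=5000))
```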
The results of the application of the above-mentioned statistical tests are summarized in
Table 1.
Rating class | PD Bins (%) | Rating Class PD (%) | Out-of-sample Default Rate (%) | One-sided Binomial Test | Extended Traffic Light Approach
AAA | [0.00, 0.05) | 0.03 | 0.00 | Passed | Green
AA | [0.05, 0.42) | 0.24 | 0.03 | Passed | Green
A | [0.42, 0.55) | 0.48 | 0.08 | Passed | Green
BBB | [0.55, 0.74) | 0.64 | 0.21 | Passed | Green
BB | [0.74, 1.00) | 0.87 | 0.40 | Passed | Green
B | [1.00, 1.42) | 1.21 | 0.83 | Passed | Green
CCC | [1.42, 2.12) | 1.77 | 1.29 | Passed | Green
CC | [2.12, 9.03) | 5.57 | 5.06 | Passed | Green
C | [9.03, 100) | 54.52 | 33.77 | Passed | Green
Table 1: Internally calibrated PD clustering into 9 rating classes. Despite being borrowed from
the S&P rating scale, the labels are assigned to a PD calibrated on an internal dataset (the one
used during the training phase) and do not correspond to any rating agency's PD.
4 Model explainability
Machine learning models which operate in more dimensions than can be directly visualized by
the human mind are often referred to as "black boxes", in the sense that high model performance
is often achieved to the detriment of output explainability, leaving users unable to understand the
logic behind model predictions. Growing attention to model interpretability has led to the
development of several methods that provide explanations of machine learning outputs, both in
terms of global and local interpretability. In the former case, the goal is to explain and understand
model decisions based on conditional interactions between the dependent variable (i.e. the target)
and the independent features over the entire dataset. In the latter case, the aim is to understand
the model output for a single prediction by looking at a local subregion of the feature space
around that instance.
Two popular approaches, described hereafter in this section, are SHAP and LIME, which explore
and leverage the property of local explainability to build surrogate models able to interpret the
output of any machine learning model. The technique upon which these algorithms are based is
to slightly perturb the input and model the changes in the prediction by means of model-agnostic
surrogates. In particular, SHAP measures how much each feature in our model contributes, either
positively or negatively, to each prediction, in terms of the difference between the actual prediction
and its expected value. LIME builds sparse linear models around each prediction to explain how
the black-box model works in that local vicinity.
4.1 SHAP
SHAP (SHapley Additive exPlanations) [23] is a novel approach for model explainability which
exploits the idea of Shapley regression values (i.e. the average marginal contribution of a feature
value over all possible coalitions) to model feature influence scoring. SHAP values quantify the
magnitude and direction (positive or negative) of a feature's effect on a prediction via an additive
feature attribution method. In simple words, SHAP builds model explanations by asking, for
each prediction i and feature j, how i changes when j is removed from the model. Since SHAP
considers all possible predictions for an instance using all
possible combinations of feature inputs, it can guarantee both consistency and local accuracy.
In more detail, the SHAP method computes Shapley values from coalitional game theory: the
feature values of a data instance act as players in a coalition (a player can be an individual feature
value or a group of feature values), and Shapley values suggest how to fairly distribute the payout
(i.e. the prediction) among the features.
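A minimal sketch of how SHAP values can be obtained for a tree-based classifier such as the one used here; the fitted model and the data below are placeholders.

```python
# Minimal sketch: SHAP values and summary plot for a LightGBM classifier.
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = LGBMClassifier(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)       # per-feature contribution to each prediction
# Depending on the shap version, binary classifiers may return a list [class 0, class 1].
values = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(values, X)                 # beeswarm plot, as in Figure 7
```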
SHAP summary plot As reported in Figure 7, it combines feature importance with feature
effects to measure the global impact of features on the model. For each feature, shown on the
y-axis and ordered according to importance, each point on the plot represents the Shapley value
(reported along the x-axis) for a given prediction. The colour of each point represents the value
of the feature, from low (blue) to high (red). Overlapping points are jittered in the y-axis
direction, so we get a sense of the distribution of the Shapley values per feature.
Figure 7: SHAP summary plot for the Light-GBM classifier. The details of the model's feature
descriptions are reported in Appendix A.
SHAP dependence plot It is a scatter plot that shows the effect a single feature has on the
model predictions. In particular, each dot represents a single prediction, with the feature value
on the x-axis and its SHAP value, representing how much knowing that feature's value changes
the output of the model for that sample's prediction, on the y-axis. The colour corresponds to
a second feature that may have an interaction effect with the plotted feature. If an interaction
effect is present, it shows up as a distinct vertical pattern of colouring.
SHAP waterfall plot The waterfall plot reported in Figure 9 is designed to display how the
SHAP values of each feature move the model output from our prior expectation under the
background data distribution, $E[f(X)]$, to the final model prediction, $f(X)$, given the evidence
of all the features. Features are sorted by the magnitude of their SHAP values, with the
smallest-magnitude features grouped together at the bottom of the plot. The colour of each row
represents the impact of the feature on the model output, from low (blue) to high (red).
4.2 LIME
LIME (Local Interpretable Model-agnostic Explanations) is a novel technique that explains the
predictions of any classifier in an interpretable and faithful manner, by learning an interpretable
model locally around the prediction [24]. Behind the workings of LIME lies the assumption that
every complex model is linear on a local scale, so it is possible to fit a simple model around a single
observation that mimics how the global model behaves at that locality. The output of LIME is a
list of explanations reflecting the contribution of each feature to the prediction for a data sample,
allowing one to determine which feature changes will have the most impact on the prediction.
Note that LIME has the desirable property of additivity, i.e. the sum of the individual impacts
is equal to the total impact. Results for a prediction are summarized in Figure 10.
Figure 10: LIME local explanation for a prediction from the Light-GBM classifier. The details of
the model's feature descriptions are reported in Appendix A.
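A minimal sketch of a LIME explanation for a single prediction; the model, feature names and class names below are placeholders.

```python
# Minimal sketch: local LIME explanation of one prediction of a tabular classifier.
from lime.lime_tabular import LimeTabularExplainer
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = LGBMClassifier(n_estimators=100).fit(X, y)

explainer = LimeTabularExplainer(
    training_data=X,
    feature_names=[f"f{i}" for i in range(X.shape[1])],
    class_names=["in bonis", "default"],
    mode="classification",
)
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())   # (feature condition, signed local contribution) pairs
```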
5 Conclusions
Starting from Moody's dataset of historical balance sheets, bankruptcy statuses and macroeconomic
variables, we have built three models: a classifier, a default probability model and a rating
system. By leveraging modern techniques in both data processing and parameter calibration we
have reached state-of-the-art results. The three models show excellent out-of-sample performances,
allowing for intensive usage in risk-averse businesses where the occurrence of false negatives can
dramatically harm the firm itself. The explainability layers via SHAP and LIME give a set of
extra tools to increase confidence in the model and help in understanding the main features
determining a specific result. This information can be leveraged by the analyst to understand
how to reduce the bankruptcy probability of a specific firm, or to gain insight into which
balance-sheet fields need to be improved to increase the rating, therefore providing a business
instrument to actively manage clients and structured finance deals.
Acknowledgements
We are grateful to Corrado Passera for encouraging our research.
References
[1] P. Mizen and S. Tsoukas, "Forecasting US bond default ratings allowing for previous and
initial state dependence in an ordered probit model," International Journal of Forecasting,
vol. 28, no. 1, pp. 273–287, 2012.
[2] P. Gurný and M. Gurný, "Comparison of credit scoring models on probability of default
estimation for US banks," 2013.
[3] Z. Zhao, S. Xu, B. H. Kang, M. M. J. Kabir, Y. Liu, and R. Wasinger, “Investigation and
improvement of multi-layer perceptron neural networks for credit scoring,” Expert Systems
with Applications, vol. 42, no. 7, pp. 3508–3516, 2015.
[5] P. M. Addo, D. Guegan, and B. Hassani, “Credit risk analysis using machine and deep
learning models,” Risks, vol. 6, no. 2, p. 38, 2018.
[9] A. Zheng and A. Casari, Feature Engineering for Machine Learning: Principles and Techniques
for Data Scientists. O'Reilly Media, Inc., 2018.
[10] C. Guo and F. Berkhahn, “Entity embeddings of categorical variables,” arXiv preprint
arXiv:1604.06737, 2016.
[11] G. Perani, V. Cirillo, et al., "Matching industry classifications. A method for converting
NACE Rev. 2 to NACE Rev. 1," tech. rep., 2015.
[12] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[14] Y. Goldberg, “Neural network methods for natural language processing,” Synthesis Lectures
on Human Language Technologies, vol. 10, no. 1, pp. 1–309, 2017.
[15] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools,
and Techniques to Build Intelligent Systems. O'Reilly Media, Inc., 2017.
[16] https://github.com/microsoft/LightGBM.
[17] J. H. Friedman, “Stochastic gradient boosting,” Computational statistics & data analysis,
vol. 38, no. 4, pp. 367–378, 2002.
[18] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection,"
in Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988, 2017.
[19] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers,
et al., “Practical lessons from predicting clicks on ads at facebook,” in Proceedings of the
Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9, 2014.
[20] R. Storn and K. Price, "Differential evolution - a simple and efficient heuristic for global
optimization over continuous spaces," Journal of Global Optimization, vol. 11, pp. 341–359, 1997.
[22] S. Blochwitz, S. Hohl, and C. Wehn, “Reconsidering ratings,” Wilmott Magazine, vol. 5,
pp. 60–69, 2005.
[23] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in
Advances in neural information processing systems, pp. 4765–4774, 2017.
[24] M. T. Ribeiro, S. Singh, and C. Guestrin, ""Why should I trust you?": Explaining the
predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
Appendices
A Model's feature descriptions
In this section the details of the selected features, upon which the model has been trained, are
reported.
Code | Definition
cashAndMarketableSecurities | Cash and marketable securities.
depreciationExpense | The depreciation expense for the current period.
ebitda | Earnings before interest, taxes, depreciation and amortization before extraordinary items.
entityConsolidationType | For companies with subsidiaries, i
incorporationRegion | The entity's incorporation region.
incorporationState | The entity's incorporation province or administrative division where the entity has a legal representation.
longTermDebtCurrentMaturities | The current maturities of long-term debt, principal payments due within 12 months.
netIncome | Net income is the total period-end earnings.
netWorth | Net worth is the sum of all equity items, including retained earnings and other equity.
payableToTrade | Accounts payable to regular trade accounts.
receivableFromTrade | Accounts receivable from trade.
retainedEarnings | Retained earnings.
tangibleNetWorth | Defined as the difference between netWorth and totalIntangibleAssets.
totalAccountsPayable | The sum of accounts payable.
totalAccountsReceivable | The total accounts receivable, net of any provision from loss.
totalAmortizationAndDepreciaton | The sum of amortization and depreciation expense for the current period.
totalAssets | The total assets of the borrower, which is the sum of the current assets and non-current assets.
totalCapital | Total subscribed and share capital.
totalCapital and totalLiabilities | Defined as the sum of totalLiabilities and totalCapital.
totalCurrentAssets | The sum of all current assets.
totalCurrentLiabilities | The sum of all current liabilities.
totalFixedAssets | Total fixed assets are the Gross Fixed Assets less Accumulated Depreciation.
totalIntangibleAssets | Total intangible assets.
totalInterestExpense | The total interest expense is any gross interest expense generated from short-term, long-term, subordinated or related debt.
totalInventory | The sum of all the inventories.
totalLiabilities | The sum of Total Current Liabilities and Total non-current liabilities.
totalLongTermDebt | The amount due to financial and other institutions after 12 months.
totalOperatingExpense | The sum of all operating expenses.
totalOperatingProfit | The Gross Profit less Total Operating Expense.
totalProvisions | Total provisions for pensions, taxes, etc.
totalSales | Total sales.
totalWageExpense | The total wage expense.
workingCapital | Defined as the sum of receivableFromTrade, totalAccountsReceivable and totalInventory minus payableToTrade and totalAccountsPayable.
The following Key Performance Indicators (KPIs) are computed from the fields above:

ASSET TURNOVER = totalSales / totalAssets (10)
CURRENT RATIO = totalCurrentAssets / totalCurrentLiabilities (11)
DEBT COVERAGE = ebitda / totalInterestExpense (12)
DEBT EQUITY = totalLiabilities / netWorth (13)
EBITDA RATIO = ebitda / totalSales (14)
IND ROTA = workingCapital / totalSales (16)
IND STRUTT = PFN / netWorth (17)
INVENTORY TURNOVER = totalInventory / totalSales (18)
LEVERAGE 1 = PFN / ebitda (19)
LEVERAGE 2 = FFO / PFN (20)
LONG-TERM-DEBT EQUITY = totalLongTermDebt / netWorth (21)
NETINCOME RATIO = netIncome / totalSales (22)
ROA = netIncome / totalAssets (24)
ROE = netIncome / netWorth (25)
ROI = totalOperatingProfit / totalAssets (26)
SHORT-TERM-DEBT EQUITY = totalCurrentLiabilities / netWorth (27)
Country specific indicators

Code | Name | Definition
DOMD | Domestic demand, real | The volume of consumption, investment, stockbuilding and government consumption expressed in local currency and at prices of the country's base year.
EE | Employees in employment | Employees in employment.
ET | Employment, total | Employment, total.
GC | Consumption, government, real | The volume of government spending on goods and services.
GDP | GDP, real | The volume of all final goods and services produced within a country in a given period of time.
GDPHEAD | GDP per capita, real, US$, constant prices | GDP per capita, real, US$, constant prices.
IF | Investment, total fixed investment, real | Investment, total fixed investment, real.
IP | Industrial production index | The volume of investment in tangible and intangible capital goods, including machinery and equipment, software, and construction.
IPNR | Investment, private sector business, real | The volume of investment in private sector business.
IPRD | Investment, private dwellings, real | The volume of investment in private dwellings.
IS | Stockbuilding, real | The volume of stocks of outputs that are still held by the units that produced them and stocks of products acquired from other units that are intended to be used for intermediate consumption or for resale.
M | Imports, goods & services, real | The volume of goods and services imports.
MG | Imports, goods, real | The volume of goods imports.
MS | Imports, services, real | The volume of services imports.
PEWFP | GDP, compensation of employees, total, nominal | The values of wages and salaries of employees as a component of GDP.
PH | House price index | Index of house prices.
POIL$ | Oil price US$ per toe | Oil price US$ per toe.
RCB | Interest rate, central bank policy | The rate that is used by the central bank to implement or signal its monetary policy stance (expressed as an average).
RCORP SPREADEOP | Credit spreads, end of period | The difference in yield between two bonds of similar maturity but different credit quality, expressed as an end of period value.
RLG | Interest rate, long-term government bond yields | Interest rate, long-term government bond yields.
RS | Retail Sales volume index, excluding automotive | Volume index for retail sales excluding automotive.
RSH | Interest rate, short-term | The 3-month interbank rate.
RSHEOP | Interest rate, short-term, end of period | The 3-month interbank rate for the end of period.
SMEPS | Stockmarket earnings per share | Stockmarket earnings per share calculated as Stockmarket earnings, LCU * 1000 / Stockmarket shares outstanding.
SMP TR | Share price total return index | Share price total return index.
TFE | Total final expenditure, real | The sum of volumes of consumption, investment, stockbuilding, government consumption and exports.
U | Unemployment | The total number of people without a job, but actively searching for one.
UP | Unemployment rate | The percentage of the labour force that is unemployed at a given date.
X | Exports, goods & services, real | The volume of goods and services exports expressed in local currency and at the country's base year.
XG | Exports, goods, real | The volume of goods exports expressed in local currency.
XS | Exports, services, real | The volume of services exports expressed in local currency.
Eurozone indicators

Code | Name | Definition
PH | House price index | Index of house prices.
RCB | Interest rate, central bank policy | The rate that is used by the central bank to implement or signal its monetary policy stance (expressed as an average).
RCBEOP | Interest rate, central bank policy, end of period | The rate that is used by the central bank to implement or signal its monetary policy stance (expressed as an end of period value).
REONIA | Interest rate, EONIA | The 1-day interbank interest rate for the Eurozone.
RLG | Interest rate, long-term government bond yields | Interest rate, long-term government bond yields.
RSH | Interest rate, short-term | The 3-month interbank rate.
RSH6M | Interest rate, 6-month | The 6-month interbank rate.
RSHEOP | Interest rate, short-term, end of period | The 3-month interbank rate for the end of period.
RSWAP2YR | Interest Rate Swap, 2-year | Swap par-rate for 2 years tenor.
RSWAP5YR | Interest Rate Swap, 5-year | Swap par-rate for 5 years tenor.
RSWAP10YR | Interest Rate Swap, 10-year | Swap par-rate for 10 years tenor.
RSWAP30YR | Interest Rate Swap, 30-year | Swap par-rate for 30 years tenor.