Performance Evaluation of Credit Risk Models
Keywords: machine learning, credit risk prediction, artificial intelligence, Peer-to-Peer lending, stacked
ensemble classifiers
Introduction
The global market for alternative finance has grown remarkably. Between 2015 and 2018 its
volume almost doubled, reaching USD 305 billion (Dimitrov et al., 2020). Crowdlending
platforms make up nearly 97% of this sector, and Peer-to-Peer consumer lending is the most
common alternative finance business model, with a share of around 64% of all models in the
sector (Ziegler & Shneor, 2020).
Peer-to-Peer lending brings undeniable advantages for investors, online platforms,
businesses, and individual consumers. However, this business model also poses risks arising from
the specifics of the sector and its dynamic environment. One of the main risks is an increase in
the share of default loans. Bad loans are a serious threat to investors entering the market as well
as to online P2P lending platforms and borrowers. This makes assessing borrowers and predicting
the risk of loan non-repayment crucially important.
At the same time, advances in artificial intelligence and machine learning technologies
in recent years have led to their ubiquitous application in all spheres of the economy and
public life (Turygina et al., 2019; Petrov et al., 2021). Successful examples of the use of machine
learning to predict and prevent serious threats and unintended consequences are numerous and
continue to demonstrate the capabilities of these methods for predicting credit risk.
1. Research methodology
Model evaluation and comparison are essential for choosing the best model for
implementation. Several metrics are widely used to assess a model's performance and predictive
power, depending on the model type. The most important evaluation measures for binomial
classification models are accuracy, balanced accuracy, sensitivity, specificity, area under the ROC
curve, F1 score, Kappa coefficient, etc. Note that the chosen threshold directly influences the
measures derived from the classification matrix. Changing the threshold level results
It can be concluded from (1) that the highest lift is achieved by a model with a minimum
number of false negative cases and a maximum number of true negative cases. If the absolute value
of the product of the number of true negative cases and their respective misclassification costs is
greater than the product of the number of false negative cases and the average profit, the losses
avoided outweigh the profits forgone, the financial result increases, and the application of machine
learning yields a profit. Otherwise, the model results in a loss for the online lending company.
Online P2P lending companies should strive to maximize profits while keeping the default
loan ratio within acceptable limits. When a machine learning model is applied, the share of bad
loans is calculated as in (2).
The next step is to determine the optimal threshold for generating the predictions. A
threshold of 0.5 is usually set for classification models, but this value does not always lead to
optimal performance metric values. Lowering the threshold below 0.5 increases the number of
positive predictions, but also the number of false positives. Conversely, raising the threshold
above 0.5 increases the number of negative predictions, which can increase the number of
false negative cases. The appropriate threshold for the best false positive rate can also be
determined from the ROC curve. In this case, however, we are looking not only for an accurate
model, but also for one whose implementation achieves optimal financial results while keeping
the default credit ratio and the number of false positive cases acceptable.
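The effect of the threshold on the classification matrix can be illustrated with a minimal sketch. It assumes that the positive class is a good (repaid) loan and that a loan is predicted positive when its estimated probability of being good is at least the threshold. The function name, the scores, and the labels below are hypothetical and are not taken from the study's data.

```python
from typing import NamedTuple, Sequence


class Confusion(NamedTuple):
    tp: int  # good loans predicted good (approved)
    fp: int  # bad loans predicted good (approved in error)
    tn: int  # bad loans predicted bad (correctly refused)
    fn: int  # good loans predicted bad (refused in error)


def confusion_at(threshold: float,
                 p_good: Sequence[float],
                 is_good: Sequence[bool]) -> Confusion:
    """Count the classification matrix cells at a given probability threshold."""
    tp = fp = tn = fn = 0
    for p, good in zip(p_good, is_good):
        predicted_good = p >= threshold
        if predicted_good and good:
            tp += 1
        elif predicted_good and not good:
            fp += 1
        elif not predicted_good and not good:
            tn += 1
        else:
            fn += 1
    return Confusion(tp, fp, tn, fn)


# Hypothetical scores and outcomes for eight loans (illustration only).
P_GOOD = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.10]
IS_GOOD = [True, True, True, False, True, False, False, False]

for t in (0.3, 0.5, 0.7):
    print(t, confusion_at(t, P_GOOD, IS_GOOD))
```

On this toy data the lower threshold approves more loans and admits more false positives, while the higher threshold refuses more loans and removes the false positives at the price of a false negative.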
To determine the optimal threshold, we propose the following sequence of steps:
1. The trained binary classification model is applied to the data set and predictions are
generated in the form of a probability of belonging to each of the two classes.
2. The data set is sorted in increasing order of the estimated probability of belonging to
the positive class (p0).
3. The accumulated financial result for the entire data set is calculated. The revenue is
the interest paid on the good loans, and the costs are the outstanding part of the
principal for bad loans.
4. A threshold should be determined where the financial result is optimal. To support
this choice, we recommend plotting the cumulative financial result at
the different thresholds (see Figure 1).
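The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the scores, and the per-loan results are hypothetical, and each loan's result is assumed to be already known (interest earned for a good loan, outstanding principal with a negative sign for a bad loan).

```python
from typing import List, Sequence, Tuple


def cumulative_result_curve(p_good: Sequence[float],
                            result: Sequence[int]) -> List[Tuple[float, int]]:
    """Return (threshold, cumulative result) pairs in increasing threshold order.

    The cumulative result at a threshold is the sum of the results of all
    loans whose probability of being good is >= that threshold, i.e. the
    loans that would still be approved.
    """
    pairs = sorted(zip(p_good, result))       # step 2: sort by increasing probability
    remaining = sum(r for _, r in pairs)      # step 3: result when every loan is approved
    curve = []
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        curve.append((threshold, remaining))
        # Drop every loan scored exactly at this threshold before moving on.
        while i < len(pairs) and pairs[i][0] == threshold:
            remaining -= pairs[i][1]
            i += 1
    return curve


def best_threshold(curve: Sequence[Tuple[float, int]]) -> Tuple[float, int]:
    """Step 4: pick the threshold with the highest cumulative result."""
    return max(curve, key=lambda point: point[1])


# Hypothetical data in USD (illustration only).
P_GOOD = [0.95, 0.85, 0.70, 0.60, 0.45, 0.40, 0.30, 0.10]
RESULT = [300, 250, 200, -900, 150, -700, -800, -1000]

curve = cumulative_result_curve(P_GOOD, RESULT)
print(curve[0])               # lowest threshold: every loan approved
print(best_threshold(curve))  # threshold with the best cumulative result
```

The curve returned here stops at the highest observed score. The end point where every loan is refused and the result is 0 is not included, so a caller should compare the best value with 0 before approving anything.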
The chart presented in Figure 1 makes it possible to visually explore the advantages
of applying the trained model for predicting the status of loans. The cumulative result for
probability threshold p0_j is calculated using the following equation (3):
Result_i is the financial result (profit or loss) for the i-th case, and the cases from 1 to i are
those that belong to the positive class with probability greater than or equal to p0_j, the
probability at threshold j. The financial result at point A (see Figure 1) is equal to the sum of all
profits and costs associated with the loans granted and shows the financial result without machine
learning for credit status prediction. Since the data set is pre-sorted by increasing p0 values,
negative cases (default loans) prevail at the beginning if the trained model has any predictive
power. The cumulative financial result sums the profits and losses of all loans for which p0 is
greater than or equal to the current threshold, and therefore profit increases from point A to
point B as default loans are eliminated. This increase in profit between point A and point B shows
the result of applying machine learning methods to predict default loans. However, as the p0
threshold increases further, more loans are classified as bad and fewer are approved, which
reduces profits because of lost income from solvent borrowers. At the end point C, p0 is
approximately equal to 1, which means that all loans are classified as bad and are not approved
for financing, and therefore the financial result is 0: no loss, but no profit.
After determining the value of the p0 threshold at which there is an optimal profit, the
different performance metrics such as accuracy, specificity, sensitivity, F1 score, Kappa coefficient,
etc. should be calculated. It is also important to analyze the default credit ratio after applying the
machine learning model at the selected threshold.
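As an illustration of this step, the sketch below computes the listed measures from the four cells of the classification matrix using their standard definitions. The function name and the counts in the example are hypothetical. The code assumes that both classes occur in the data and that both are predicted at least once, so that no denominator is zero.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification measures from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # share of actual positives recognized
    specificity = tn / (tn + fp)   # share of actual negatives recognized
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for 1000 loans, 800 good and 200 bad (illustration only).
metrics = classification_metrics(tp=700, fp=100, tn=100, fn=100)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```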
3. Model fitting
The machine learning models are fitted in the H2O environment. H2O is an
open-source, scalable, distributed, fast, in-memory machine learning platform. It enables
machine learning models to be built on big data and makes it easy to deploy models in a
working environment. The platform is built and provided by the company H2O.ai, whose corporate
mission is the democratization of artificial intelligence. In its 2020 research, the
consulting company Gartner lists H2O.ai as a visionary in the field of data science and machine
learning platforms (Krensky et al., 2020) and cloud services for artificial intelligence (Baker et al.,
2021).
For model training we consider homogeneous ensembles based on decision trees (XGBoost,
extreme gradient boosted decision trees, gradient boosting machine, distributed random forest),
deep learning networks, and stacked ensembles, all supported by H2O. A stacked ensemble is a
heterogeneous ensemble algorithm that finds the optimal combination of a set of predictive models
using a process called “stacking” (H2O.ai, 2020). These ensembles support regression, binary, and
multinomial classification. The concept of combining models and stacking them into an ensemble
model was published in 2007 (Van der Laan et al., 2007) and further developed in 2010 Polley &
4. Model comparison
When assessing the performance of the models, the cost matrix was taken into account. The
cost of misclassifying a good loan as bad is the profit lost by refusing the loan request.
This cost amounts to USD 2616, the average profit on a repaid loan. The cost of
misclassifying a bad loan as good is USD 5952, the average loss on
a default loan.
The performance of the ten best models on the test set is shown in Table II. All models were
applied to a test set of 406 368 cases with a total actual profit of USD 400 814 860. Using the
average profit and loss values, we calculated the estimated profit that Lending Club would have
received by applying each model. It is assumed that, when a trained model is used,
all loans with a negative prediction are refused and all loans with a positive prediction are
approved. The estimated profit is equal to the profit from the correctly classified actual
positive cases plus the losses from the incorrectly classified actual negative cases, with losses
recorded with a negative sign. The "Lift" column gives the percentage change in estimated profit
relative to the actual profit Lending Club received from these loans. The last column, "Default
ratio", gives the share of bad loans that would result if the P2P lending platform applied the
relevant model and approved or refused loan requests based on its predictions.
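The three quantities described above can be sketched as follows. The two average amounts come from the text; the function names and the counts are hypothetical. The baseline profit in the example is also built from the averages, which is an assumption made only for illustration, since the study uses the actual profit of the test set.

```python
AVG_PROFIT_GOOD = 2616  # USD, average profit on a repaid loan (from the text)
AVG_LOSS_BAD = 5952     # USD, average loss on a default loan (from the text)


def estimated_profit(tp: int, fp: int) -> int:
    """Profit on approved good loans minus losses on approved bad loans."""
    return tp * AVG_PROFIT_GOOD - fp * AVG_LOSS_BAD


def lift_percent(estimated: float, actual: float) -> float:
    """Percentage change of the estimated profit relative to the actual profit."""
    return 100.0 * (estimated - actual) / actual


def default_ratio(tp: int, fp: int) -> float:
    """Share of bad loans among the loans the model approves."""
    return fp / (tp + fp)


# Hypothetical portfolio of 1000 loans: 800 good, 200 bad (illustration only).
# Assumed baseline: every loan is approved and valued at the average amounts.
baseline = 800 * AVG_PROFIT_GOOD - 200 * AVG_LOSS_BAD

# Hypothetical model outcome: 700 good and 100 bad loans approved.
profit = estimated_profit(tp=700, fp=100)
print(baseline, profit)
print(round(lift_percent(profit, baseline), 2))
print(default_ratio(tp=700, fp=100))
```

In this toy example the default ratio falls from 20% to 12.5%. The lift value depends entirely on the assumed counts and baseline and should not be read as a result of the study.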
The ratio between the misclassification costs is approximately 1:2.3 in favor of the
negative class, i.e., the loss from misclassifying a negative case as positive (USD 5952) is about
2.3 times higher than the profit lost by misclassifying a positive case as negative (USD 2616). At
the same time, the expected distribution between the two classes is 1:4 in favor of the positive
class. The evaluation measures showed that the models with better accuracy and Kappa values were
those that recognized the positive cases better.
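As a worked illustration of what this cost asymmetry implies, consider a single loan with estimated probability p of being good, and assume (a simplification introduced here, not a result of the study) that every good loan earns the average profit and every bad loan incurs the average loss. Approving the loan then has a positive expected result only above a break-even probability:

```latex
\frac{C_{\text{bad approved}}}{C_{\text{good refused}}} = \frac{5952}{2616} \approx 2.28

p \cdot 2616 - (1 - p) \cdot 5952 > 0
\;\Longleftrightarrow\;
p > \frac{5952}{2616 + 5952} = \frac{5952}{8568} \approx 0.695
```

Under this simplification the break-even point lies well above the default threshold of 0.5, which is consistent with searching for a profit-based threshold. The threshold found from the cumulative financial result can differ, because it uses the actual result of each loan rather than the averages.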
In all models, the default ratio after model application is significantly lower than the default
ratio in the original set. For the ten models presented in Table II, this ratio ranges from 12% to 13%,
while in the data set this ratio is about 20%. This strongly supports the advantages of machine
learning models for predicting credit risk and reducing the share of bad loans.
The highest profit lift (9%) was observed for the heterogeneous ensemble model
StackedEnsemble AllModels. It had the highest Kappa value (0.2356) and sensitivity (0.7561).
Therefore, this model was chosen to demonstrate profit maximization by setting the optimal
threshold.
The structure of the loan portfolio by category before and after machine learning
implementation is shown in Figure 4. As the figure shows, the structure of the loan portfolio
by category is generally maintained. If the model is used for credit approval, the
relative shares of loans in categories A, B, and C increase, with the largest increase in
category B. The relative shares of loans in the riskier categories D, E, F, and G decrease, with the
largest reduction, 2.2%, in category F. These changes confirm the hypothesis that the use of
machine learning models helps diversify the loan portfolio while improving its structure by
Figure 4. Credit portfolio structure before and after applying machine learning models
After applying machine learning to the whole data set, the share of bad loans falls
significantly. The default ratio for the initial data set is 20%, and predicting with a
trained classification model could lower it to 12%-13%.
Conclusion
Analyzing the impact of machine learning models by examining structural changes
and changes in the default ratio before and after their application shows that P2P lending
companies can gain important advantages by implementing machine learning models trained
according to the approach we propose. First, more accurate predictions can reduce the share of
bad loans, both overall and by category. Second, applying machine learning can improve the
credit portfolio structure by increasing the share of loans from the better categories, which are
more likely to be repaid, and reducing the share of the riskier loans. Last, but not
least, calculating the optimal threshold for generating predictions with our proposed
approach can maximize profit for the online Peer-to-Peer lending platform and investors. In view
of these results, we recommend using the cost matrix and the cumulative profit chart to determine
the threshold, while also considering the traditional measures for evaluating binomial
classification models.
References
1. Ariza-Garzon, M. et al. (2020) Explainability of a machine learning granting scoring model in
Peer-to-Peer lending. IEEE Access. vol. 8, pp. 64873-64890.
doi:10.1109/ACCESS.2020.2984412.
2. Baker, V. et al. (2021) Magic quadrant for cloud AI developer services. [Online] Available from
https://www.gartner.com/doc/reprints?id=1-1YGJKJ5P&ct=200224&st=sb. [Accessed
03/10/2021].
3. Carmichael, D. (2014) Modeling default for Peer-to-Peer loans. SSRN's eLibrary.
doi:10.2139/ssrn.2529240.
4. Dimitrov, G., Petrov, P., Dimitrova, I., Panayotova, G., Garvanov, I., Bychko, O. and Petrova,
P. (2020) Increasing the classification accuracy of EEG based brain-computer interface signals.
10th International Conference on Advanced Computer Information Technologies, ACIT 2020 -
Proceedings, pp. 386-390. doi:10.1109/ACIT49673.2020.9208944.
5. Federal Reserve Bank of St. Louis (2021) Effective federal funds rate. FRED Economic Data,
[Online] Available from https://fred.stlouisfed.org/series/FEDFUNDS. [Accessed 17/02/2021].
6. Federal Reserve Bank of St. Louis (2021) Unemployment rate. FRED Economic Data, [Online]