
MATEC Web of Conferences 54, 05004 (2016)  DOI: 10.1051/matecconf/20165405004
MIMT 2016

Credit scoring with a feature selection approach based deep learning


Van-Sang Ha¹, Ha-Nam Nguyen²

¹Department of Economic Information System, Academy of Finance, Hanoi, Viet Nam
²Department of Information Technology, University of Engineering and Technology, Hanoi, Viet Nam

Abstract. In financial risk management, credit risk management is one of the most important issues in financial decision-making. Reliable credit scoring models are crucial for financial agencies to evaluate credit applications, and they have been widely studied in the fields of machine learning and statistics. Deep learning is a powerful classification tool, currently an active research area, which successfully solves classification problems in many domains. Deep learning provides training stability, generalization, and scalability with big data, and it is quickly becoming the algorithm of choice for the highest predictive accuracy. Feature selection is the process of selecting a subset of relevant features, which can decrease the dimensionality, reduce the running time, and improve the accuracy of classifiers. In this study, we constructed a credit scoring model based on deep learning and feature selection to evaluate an applicant's credit score from the applicant's input features. Two public datasets, the Australian and German credit datasets, have been used to test our method. The experimental results on real-world data show that the proposed method achieves a higher prediction rate than a baseline method on certain datasets, and also shows comparable and sometimes better performance than the feature selection methods widely used in credit scoring.

1 Introduction

The main purpose of credit risk analysis is to classify customers into two sets, good and bad ones [1]. Over the last decades, many classification models and algorithms have been applied to analyse credit risk, for example decision trees [2], k-nearest neighbours (k-NN), support vector machines (SVM) and neural networks [3-7]. One important goal in credit risk prediction is to build the best classification model for a specific dataset.

Financial data in general, and credit data in particular, usually contain irrelevant and redundant features. Redundancy and deficiency in the data can reduce classification accuracy and lead to incorrect decisions [8-9]. In that case, a feature selection strategy is needed in order to filter out the redundant features. Indeed, feature selection is the process of selecting a subset of relevant features that is sufficient to describe the problem with high precision. Feature selection thus decreases the dimensionality of the problem and shortens the running time.

Credit scoring and internal customer rating is the process of assessing a customer's ability to meet financial obligations to a bank, such as paying interest or repaying the principal of a loan on the due date, or other credit conditions, in order to evaluate and identify risks in the credit activities of the bank. The degree of credit risk varies across individual customers and is identified through the evaluation process, based on the customer's existing financial and non-financial data at the time of credit scoring and customer rating.

Credit scoring is a technique that uses statistical analysis of data and activities to evaluate the credit risk of customers. A credit score is a figure determined by the bank based on the statistical analysis of credit experts, credit teams or credit bureaus. In Vietnam, some commercial banks have started to perform credit scoring of their customers, but it is not widely applied, is still in a testing phase, and needs gradual improvement. For completeness, all information presented in this paper comes from credit scoring experience in Australia, Germany and other countries.

Many methods have been investigated in the last decade to pursue even small improvements in credit scoring accuracy. Artificial Neural Networks (ANNs) [10-13] and Support Vector Machines (SVMs) [14-19] are the two soft computing methods most commonly used in credit scoring modelling. Recently, other methods such as evolutionary algorithms [20] and stochastic optimization techniques combined with support vector machines [21] have shown promising results in terms of prediction accuracy.

In this study, a new method for feature selection based on various criteria is proposed and integrated with a deep learning classifier for credit scoring tasks.

The rest of the paper is organized as follows: Section 2 presents the background of credit scoring, deep learning and feature selection. Section 3 describes the details of the proposed model. Experimental results are discussed in Section 4, while concluding remarks and future work are presented in Section 5.

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

2 Materials

2.1 Feature selection

Feature selection is the most basic step in data pre-processing, as it reduces the dimensionality of the data. Feature selection can be built into a method that focuses only on related features, such as the PCA method or a modelling algorithm; usually, however, feature selection is a separate step in the overall data mining process.

There are two categories of feature selection methods, i.e. the filter approach and the wrapper approach. The filter approach treats feature selection as a precursor stage to the learning algorithm. The filter model uses evaluation functions to assess the classification performance of subsets of features. There are many such evaluation functions, for example feature importance, the Gini index, information gain and the information gain ratio. A disadvantage of this approach is that there is no relationship between the feature selection process and the performance of the learning algorithm.

The wrapper approach uses a machine-learning algorithm to measure the goodness of the set of selected features. The measurement relies on the performance of the learning algorithm, such as its accuracy, recall and precision values. The wrapper model uses the learning accuracy for evaluation. In methods using the wrapper model, all samples are divided into two sets, i.e. a training set and a testing set. The algorithm runs on the training set, and then applies the learning result to the testing set to measure the prediction accuracy. The disadvantage of this approach is its high computational cost; some researchers have proposed methods that speed up the evaluation process to decrease this cost. Common wrapper strategies are Sequential Forward Selection (SFS) and Sequential Backward Elimination (SBE); a sketch of SFS is given below. The optimal feature set is found by searching the feature space. In this space, each state represents a feature subset, and the size of the search space for n features is O(2^n), so it is impractical to search the whole space exhaustively unless n is small.
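To illustrate the wrapper idea, the sketch below implements greedy Sequential Forward Selection in plain R. The evaluate argument is a placeholder for any learner-based scoring function (for example cross-validated accuracy); it is an assumption of this sketch, not a function from the paper.

# Greedy Sequential Forward Selection (SFS) sketch.
# 'evaluate' is a placeholder: it should train a classifier on the given
# feature subset and return an accuracy estimate (e.g. via cross validation).
sfs <- function(features, evaluate) {
  selected   <- character(0)
  best_score <- -Inf
  repeat {
    remaining <- setdiff(features, selected)
    if (length(remaining) == 0) break
    # Score every one-feature extension of the current subset
    scores <- sapply(remaining, function(f) evaluate(c(selected, f)))
    if (max(scores) <= best_score) break   # no candidate improves the score
    best_score <- max(scores)
    selected   <- c(selected, remaining[which.max(scores)])
  }
  list(features = selected, score = best_score)
}

Each round fits O(n) models, so a full SFS pass costs O(n^2) model fits instead of the O(2^n) fits of the exhaustive search mentioned above.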
2.2 Deep Learning

Deep learning (deep machine learning, or deep structured learning) attempts to model high-level abstractions in data by using multiple processing layers with complex structures, composed of multiple non-linear transformations. There are several theoretical frameworks for Deep Learning, but this research focuses primarily on the feed-forward architecture used by H2O. The basic unit in the model is the neuron, a biologically inspired model of the human neuron. In humans, the varying strengths of the neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination α = Σ_{i=1}^{n} w_i x_i + b of input signals is aggregated, and an output signal f(α) is then transmitted by the connected neuron. The function f represents the nonlinear activation function used throughout the network, and the bias b represents the neuron's activation threshold. Multi-layer, feed-forward neural networks consist of many layers of inter-connected neuron units, starting with an input layer to match the feature space, followed by multiple layers of nonlinearity, and ending with a linear regression or classification layer to match the output space. Multi-layer neural networks can be used to accomplish Deep Learning tasks. Deep Learning architectures are models of hierarchical feature extraction, typically involving multiple levels of nonlinearity. Deep Learning models are able to learn useful representations of raw data and have exhibited high performance on complex data such as images, speech, and text. The procedure to minimize the loss function L(W,B | j) is a parallelized version of stochastic gradient descent (SGD). Standard SGD can be summarized as follows, with the gradient ∇L(W,B | j) computed via back-propagation:

Parallel distributed and multi-threaded training with SGD in H2O Deep Learning
  Initialize global model parameters W, B
  Distribute training data T across nodes
  Iterate until convergence criterion reached:
    For nodes n with training subset T_n, do in parallel:
      Obtain copy of the global model parameters W_n, B_n
      Select active subset T_na ⊂ T_n (user-given number of samples per iteration)
      Partition T_na into T_nac by cores n_c
      For cores n_c on node n, do in parallel:
        Get training example i ∈ T_nac
        Update all weights w_jk ∈ W_n, biases b_jk ∈ B_n:
          w_jk := w_jk − α ∂L(W,B | j)/∂w_jk
          b_jk := b_jk − α ∂L(W,B | j)/∂b_jk
    Set W, B := Avg_n(W_n), Avg_n(B_n)
    Optionally score the model on train/validation scoring sets
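The per-example update above is plain SGD. As a minimal, self-contained illustration of the same update rule (not the H2O implementation), consider a single logistic neuron trained in R, where alpha plays the role of the learning rate:

# Plain SGD for a single logistic neuron: each step applies
# w := w - alpha * dL/dw for one training example.
sigmoid <- function(z) 1 / (1 + exp(-z))
sgd_logistic <- function(X, y, alpha = 0.1, epochs = 10) {
  w <- rep(0, ncol(X)); b <- 0
  for (e in seq_len(epochs)) {
    for (i in sample(nrow(X))) {            # visit examples in random order
      p    <- sigmoid(sum(w * X[i, ]) + b)  # forward pass: f(alpha_i)
      grad <- p - y[i]                      # dL/dz for the log-loss
      w <- w - alpha * grad * X[i, ]        # weight update
      b <- b - alpha * grad                 # bias (threshold) update
    }
  }
  list(w = w, b = b)
}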
3 The Proposed Method

Our method uses Deep Learning to estimate the performance, consisting of the cross-validation accuracy and the importance of each feature in the training data set. In a multi-node system this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster; each node operates in parallel on its local data. After that, we determine the best feature set by choosing the best Average score + Median score and the lowest standard deviation (SD). To deal with the over-fitting problem, we apply the n-fold cross-validation technique to minimize the generalization error. The algorithm proceeds as follows (an illustrative R sketch is given after the steps):

Step 1: Train the data with Random Forest over 20 trials; calculate and sort the median importance of the variables.
Step 2: Add each feature in order of best variable importance and train the data with Deep Learning under cross validation.
Step 3: Calculate the score F_i^score of each feature, where i = 1..n (n is the number of features in the current loop).
Step 4: Select the best features using the selection rules.
Step 5: Return to Step 1 until the desired criteria are reached.
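To make Steps 1 and 2 concrete, the following R sketch shows one possible implementation using the h2o package. It is an illustration only: the file name train.csv and the response column class are hypothetical placeholders, and the cross-validated AUC is used as a stand-in for the score criterion (1) defined below.

# Illustrative sketch of Steps 1-2 (not the authors' exact code).
library(h2o)
h2o.init(nthreads = -1)                  # local H2O cloud on all cores

train <- h2o.importFile("train.csv")     # hypothetical path
train$class <- as.factor(train$class)
predictors <- setdiff(colnames(train), "class")

# Step 1: rank features by Random Forest variable importance
rf <- h2o.randomForest(x = predictors, y = "class",
                       training_frame = train, ntrees = 50)
ranked <- as.data.frame(h2o.varimp(rf))$variable   # most important first

# Step 2: grow the feature set in importance order and score each subset
# with a 5-fold cross-validated deep learning model.
cv_scores <- sapply(seq_along(ranked), function(k) {
  dl <- h2o.deeplearning(x = ranked[1:k], y = "class",
                         training_frame = train,
                         nfolds = 5, epochs = 10)
  h2o.auc(dl, xval = TRUE)   # stand-in for the paper's score criterion (1)
})
best_k <- which.max(cv_scores)

In the actual method, the per-fold feature importances and accuracies (F_j, A_j^learn, A_j^validation) feed the score criterion (1), and the loop repeats until the stopping criteria of Step 5 are met.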


In step 2, we use deep learning with n-fold cross validation to train the classifier. In the j-th cross validation, we obtain a tuple (F_j, A_j^learn, A_j^validation): the feature importance, the learning accuracy and the validation accuracy, respectively. We use those values to compute the score criterion in step 3.

In step 3 we use the results from steps 1 and 2 to build the score criterion that will be used in step 4. The score of the i-th feature is calculated by:

(1)

The main part of our algorithm is step 4. In this step, we select the best features using the following rules, seeking the best Average + Median score and the lowest standard deviation (SD):

Rule 1: select features with the best median score
Rule 2: select features with the best average score
Rule 3: select features with the lowest SD

These rules guarantee that we obtain the highest accuracy together with the lowest standard deviation; a short sketch of how they can be applied is given below. The proposed method tends to find the smallest optimal set of features in order to reduce the number of output features as much as possible. Machine-learning algorithms are then used to calculate the relevance of each feature. Based on the calculated conformity level, we find a subset with fewer features while maintaining the objective of the problem.
maintaining the objective of the problem. 76.74 75.72 75.10 76.54 73.76
Linear SVM 77.18
CART 74.28 73.52 73.66 75.72 74.16 74.30
k-NN 71.82 71.86 72.62 72.24 71.60 70.86
4 Experiment and results 72.40 70.88 71.44 71.56 74.16
Naïve Bayes 70.52
MLP 73.28 73.44 73.42 74.03 72.54 71.76
Our proposed algorithm was coded using R language
(http://www.r-project.org), using H2O Deep Learning RandomForests 73.40
package. This package is optimized for doing “in Our method 74.68
memory” processing of distributed, parallel machine Comparing the performances of various methods in
learning algorithms on clusters. A “cluster” is a software Table 1, we saw that the ac-curacy of deep learning on
construct that can be can be fired up on your lap-top, on a the subset of newly selected features obviously in-creases,
server, or across the multiple nodes of a cluster of real and the number of features has been reduced by 21%.
machines, including computers that form a Hadoop The average accuracy is 73.4% on the original data. After
cluster. We tested the proposed algorithm with several applying the feature selection, the aver-age accuracy
datasets including two public datasets, German and increases to 74.68%.
Australian credit approval, to validate our approach. Moreover, relying on a parallel processing strategy,
In this paper, we used Random Forest with the original time to run 20 trails with 5-fold cross validate taken by
dataset as the base-line method. The proposed method our method is only 5286 seconds (~88 minutes) while
and the base-line method were executed on the same other methods must run several hours. This result
training and testing datasets to compare their efficiency. highlights the efficiency in terms of running time of our
Those implementations were repeatedly done 20 times to method when filtering the redundant features.
test the consistency of obtained results.
4.2 Australian credit approval dataset
4.1. German credit approval dataset The Australian credit dataset is composed of 690
The German credit approval dataset consists of 1000 loan applicants, with 383 credit worthy and 307 default
applications, with 700 accepted and 300 rejected. Each examples. Each instance contains eight numerical
applicant is described by 20 attributes. Our final results features, six categorical features, and one discriminant
were averaged over these 20 independent trials (Fig. 1). feature, with sensitive information being transferred to
In our experiments, we use the default value for the symbolic data for confidentiality reasons. The averages of
hidden parameter and the number of epoch parameter was classification results are depicted in Fig. 2.
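These running-time figures rely on H2O's parallel execution. For reference, a typical way to start such a cluster from R is sketched below; the thread count and memory size are illustrative choices, not the authors' reported settings.

library(h2o)
# Start (or connect to) an H2O cloud; nthreads = -1 uses all local cores.
# In a multi-node setup, an H2O instance on each machine joins the same cloud.
h2o.init(nthreads = -1, max_mem_size = "4g")
h2o.clusterInfo()   # prints nodes, cores and memory available to the cluster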


4.2 Australian credit approval dataset

The Australian credit dataset is composed of 690 loan applicants, with 383 creditworthy and 307 default examples. Each instance contains eight numerical features, six categorical features, and one discriminant feature, with sensitive information transformed into symbolic data for confidentiality reasons. The averages of the classification results are depicted in Fig. 2.

Figure 2. Accuracy in the case of the Australian dataset.

Table 2 shows that the accuracy of deep learning on a subset of 7 selected features clearly increases. The average accuracy is 85.82% on the original data; after applying the feature selection, the average accuracy increases to 86.24%. Relying on parallel processing, the time to run 20 trials with 5-fold cross validation is only 2769 seconds (~46 minutes) with our method.

Table 2. Performances (accuracy, %) of different classifiers over the Australian credit dataset.

Classifier       | Filter methods         | Wrapper methods | Baseline
                 | t-test   LDA    LR     | GA      PSO     |
Linear SVM       | 85.52    85.52  85.52  | 85.52   85.52   | 85.52
CART             | 85.25    85.46  85.11  | 84.85   84.82   | 85.20
k-NN             | 86.06    85.31  84.81  | 84.69   84.64   | 84.58
Naïve Bayes      | 68.52    67.09  66.74  | 86.09   85.86   | 68.55
MLP              | 85.60    86.00  85.89  | 85.57   85.49   | 84.15
Random Forests   |                        |                 | 85.82
Our method       |                        |                 | 86.24
5 Conclusion

In this paper, we focused on feature selection combined with a Deep Learning method. Feature selection involves either determining the subset with the highest classifier accuracy or seeking the smallest subset with acceptable accuracy. We have introduced a new feature selection approach based on feature scoring. The accuracy of the classifier using the selected features is better than that of the other methods considered. Fewer features allow a credit department to concentrate on collecting relevant and essential variables. The parallel processing procedure leads to a significant reduction in runtime. As a result, the workload of credit evaluation personnel can be reduced, as they do not have to take a large number of features into account during the evaluation procedure, which also becomes less computationally intensive. The experimental results show that our method is effective in credit risk analysis: it makes the evaluation faster and increases the accuracy of the classification.

References

1. E. I. Altman and A. Saunders, "Credit risk measurement: developments over the last 20 years," Journal of Banking and Finance, vol. 21, no. 11-12, pp. 1721-1742, (1997)
2. Z. Davoodabadi and A. Moeini, "Building Customers' Credit Scoring Models with Combination of Feature Selection and Decision Tree Algorithms," vol. 4, no. 2, pp. 97-103, (2015)
3. A. Khashman, "A neural network model for credit risk evaluation," International Journal of Neural Systems, vol. 19, no. 4, pp. 285-294, (2009)
4. T. Bellotti and J. Crook, "Support vector machines for credit scoring and discovery of significant features," Expert Systems with Applications, vol. 36, no. 2, pp. 3302-3308, (2009)
5. F. Wen and X. Yang, "Skewness of return distribution and coefficient of risk premium," Journal of Systems Science and Complexity, vol. 22, no. 3, pp. 360-371, (2009)
6. X. Zhou, W. Jiang, Y. Shi, and Y. Tian, "Credit risk evaluation with kernel-based affine subspace nearest points learning method," Expert Systems with Applications, vol. 38, no. 4, pp. 4272-4279, (2011)
7. G. Kim, C. Wu, S. Lim, and J. Kim, "Modified matrix splitting method for the support vector machine and its application to the credit classification of companies in Korea," Expert Systems with Applications, vol. 39, no. 10, pp. 8824-8834, (2012)
8. H. Liu and H. Motoda, "Feature Selection for Knowledge Discovery and Data Mining," Kluwer Academic Publishers, (1998)
9. I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," Journal of Machine Learning Research, pp. 1157-1182, (2003)
10. S. Oreski, D. Oreski, and G. Oreski, "Hybrid system with genetic algorithm and artificial neural networks and its application to retail credit risk assessment," Expert Systems with Applications, vol. 39, no. 16, pp. 12605-12617, (2012)
11. M. Saberi, M. S. Mirtalaie, F. K. Hussain, A. Azadeh, O. K. Hussain, and B. Ashjari, "A granular computing-based approach to credit scoring modeling," Neurocomputing, vol. 122, pp. 100-115, (2013)
12. S. Lee and W. S. Choi, "A multi-industry bankruptcy prediction model using back-propagation neural network and multivariate discriminant analysis," Expert Systems with Applications, vol. 40, no. 8, pp. 2941-2946, (2013)
13. A. R. Ghatge and Halkarnikar, "Ensemble Neural Network Strategy for Predicting Credit Default Evaluation," vol. 2, no. 7, pp. 223-225, (2013)
14. A. Chaudhuri and K. De, "Fuzzy Support Vector Machine for bankruptcy prediction," Applied Soft Computing Journal, vol. 11, no. 2, pp. 2472-2486, (2011)
15. A. Ghodselahi, "A Hybrid Support Vector Machine Ensemble Model for Credit Scoring," International Journal of Computer Applications, vol. 17, no. 5, pp. 1-5, (2011)
16. L. Huang, C. Chen, and J. Wang, "Credit Scoring with a Data Mining Approach Based on Support Vector Machines," Computer Journal of Expert Systems with Applications, vol. 33, no. 4, pp. 847-856, (2007)
17. G. Eason, B. Li T., Shiue W., and Huang H., "The Evaluation of Consumer Loans Using Support Vector Machines," Computer Journal of Expert Systems with Applications, vol. 30, no. 4, pp. 772-782, (2006)
18. D. Martens, B. Baesens, T. Gestel, and J. Vanthienen, "Comprehensible Credit Scoring Models Using Rule Extraction from Support Vector Machines," European Computer Journal of Operational Research, vol. 183, no. 3, pp. 1466-1476, (2007)
19. Y. Wang, S. Wang, and K. Lai, "A New Fuzzy Support Vector Machine to Evaluate Credit Risk," Computer Journal of IEEE Transactions on Fuzzy Systems, vol. 13, no. 6, pp. 25-29, (2005)
20. S. Oreski and G. Oreski, "Genetic algorithm-based heuristic for feature selection in credit risk assessment," Expert Systems with Applications, vol. 41, no. 4, pp. 2052-2064, (2014)
21. Y. Ling, Q. Y. Cao, and H. Zhang, "Application of the PSO-SVM model for credit scoring," Proc. 2011 7th Int. Conf. Comput. Intell. Secur. (CIS 2011), pp. 47-51, (2011)
22. D. Liang, C.-F. Tsai, and H.-T. Wu, "The effect of feature selection on financial distress prediction," Knowledge-Based Systems, vol. 73, pp. 289-297, (2015)
