Genetic Algorithm-Based Feature Selection Method For Credit Risk Analysis

Abstract—Credit risk assessment of financial intermediaries is an important problem in finance. The key is to find accurate predictors of individual risk in the credit portfolios of institutions. However, assessing credit risk is very challenging, as many factors may contribute to the risk and their relationships are complicated to capture. Recent years have witnessed a growing trend of applying statistical and machine learning modeling methods, such as the SVM classifier, to credit risk analysis, which is effective in capturing nonlinear relationships in the data. However, high dimensional training data not only results in time-consuming computation but also affects the performance of the classifier. In this paper, we propose a wrapper feature selection method based on a genetic algorithm to select a subset of essential features that will contribute to good performance in credit risk classification. We test our method on a real-world credit risk prediction task, and our empirical results demonstrate the advantage of our method over other competing ones.

Keywords—feature selection; machine learning; genetic algorithm; credit risk analysis

I. INTRODUCTION

Credit risk management plays a key role in the financial and banking industry. Generally, credit risk management includes credit risk analysis, assessment (measurement) of enterprise credit risk, and how to manage the risk efficiently, while credit risk assessment is the basic and critical factor in credit risk management. The main purpose of credit risk assessment is to measure the default possibility of borrowers and provide the lender a decision aid by conducting qualitative analysis and quantitative computation of the possible factors that will cause credit risk. At present, classification methods from machine learning are the most popular methods used in credit risk assessment. According to the financial status of the borrowers, we can use a credit scoring system to estimate the corresponding risk rate so that the status can be classified as normal or default.

Support vector machine (SVM) [1] is a relatively new machine learning technique for training a powerful classifier, and it is well suited to the credit assessment problem because of its better explanatory power. The structure of SVM has many computational advantages, such as being specially designed for finite samples and having a complexity that does not depend on the sample dimension. However, the high dimensional data of the credit assessment problem may bring difficulty to classifier training as well as reduce classification accuracy. How to reduce the feature size of the data and select a group of effective features for the SVM is therefore very important. Feature selection methods are used to solve this problem before the classifier is trained.

Feature selection has become the focus of many research areas in recent years. With the rapid advance of computer science and information technologies, large-scale datasets with thousands of attributes are now ubiquitous in the fields of data mining, pattern recognition, and machine learning [2-3]. It is a challenging task to process such huge datasets because most machine learning techniques usually work well only on small datasets [4]. The task of feature subset selection is to address this problem, mainly by identifying and eliminating the irrelevant and redundant features so that the dimensionality of the dataset drops. It aims to find a small feature subset that can describe the data for a learning task as well as or better than the original dataset, in order to reduce the computational cost, provide better understanding of the dataset, and achieve high classification accuracy [5].

Algorithms for feature selection or attribute reduction can be classified into two main categories, depending on whether the approach uses feedback from the subsequent performance of the machine learning algorithm (e.g., SVM for a classification task).

A filter method is a no-feedback, pre-selection method that does not involve the machine learning algorithm to be applied later. The data is first analyzed using statistical techniques in order to determine which features describing the data records are relevant for the class attribute. Afterwards, the relevant feature subset is used to train a classifier for prediction. In a filter method, no feedback from the subsequent performance of the induction algorithm is used. Typical filter methods are the ReliefF algorithm, chi-squared (χ2) feature selection, information gain (IG) based feature selection, gain ratio (GR) based feature selection, symmetrical uncertainty (SU) based feature selection, etc. [6,7]

In contrast, a wrapper method is a feedback method that incorporates the machine learning algorithm in the feature selection process.
C. Genetic algorithm-based feature selection

The feature selection problem can be considered a combinatorial optimization problem: searching the feature space for the subset that can be used to train a classifier with the maximal classification accuracy. Since the genetic algorithm performs a randomized search and is not very susceptible to getting stuck in local minima, it can be used to search for relevant features. In the feature selection problem, the GA population (the chromosomes) is coded as simple vectors of binary genes, where 1s represent relevant features. A GA chromosome is shown in Figure 3.

Figure 3. GA chromosome
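To make this encoding concrete, the following is a minimal sketch of how a binary chromosome maps to a feature subset; the random initialization, seed, and dummy data matrix are illustrative assumptions, not part of the original method.

import numpy as np

rng = np.random.default_rng(seed=42)

n_features = 22          # number of candidate features in our data
population_size = 20     # population size used in our experiments

# Each chromosome is a binary mask over the features:
# gene i == 1 means feature i is selected, 0 means it is dropped.
population = rng.integers(0, 2, size=(population_size, n_features))

def decode(chromosome, X):
    """Return the columns of X corresponding to genes set to 1."""
    selected = np.flatnonzero(chromosome)
    return X[:, selected], selected

# Example: decode the first chromosome against a dummy data matrix.
X = rng.normal(size=(100, n_features))
X_subset, selected = decode(population[0], X)
print(f"{selected.size} of {n_features} features selected:", selected)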
The fitness of a solution is mainly evaluated by training a classifier on the training data using only the features corresponding to 1s in the chromosome, and returning the classification accuracy as the fitness. Besides, the size of the feature subset is also a factor affecting the fitness of a solution. We use the following equation as the fitness formula:

fitness = w_a × accuracy + w_f × (1 / Σ_i m_i)

where w_a represents the weight value for classification accuracy and w_f the weight for the number of features; m_i is the mask value of the i-th feature: '1' represents that the feature is selected, '0' represents that the feature is not selected. It can be inferred that a high fitness value is determined by high classification accuracy and a small feature number [13].
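A minimal sketch of this fitness computation follows; the weight values and the use of scikit-learn's SVC with 5-fold cross-validation on the training split are our illustrative assumptions, since the paper does not fix them.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(chromosome, X_train, y_train, w_a=0.8, w_f=0.2):
    """fitness = w_a * accuracy + w_f * (1 / number of selected features)."""
    selected = np.flatnonzero(chromosome)
    if selected.size == 0:
        return 0.0  # an empty subset cannot train a classifier
    # Accuracy of an SVM trained only on the selected features.
    accuracy = cross_val_score(
        SVC(kernel="rbf"), X_train[:, selected], y_train, cv=5
    ).mean()
    return w_a * accuracy + w_f * (1.0 / selected.size)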
Figure 4 illustrates the principle of GA-based feature selection.

Figure 4. GA-based feature selection algorithm

III. EXPERIMENTS AND ANALYSIS

A. Dataset

The dataset used in this paper comes from the private-label credit card operation of a major Brazilian retail chain. There are 50,000 instances in the original dataset, each labeled as positive (good) or negative (bad). In this experiment, we use a subset of the data which contains 5,000 instances with balanced positive and negative labels. Each instance has 32 features, including client ID, sex, age, education, shopping history, monthly income, etc. In our experiments, we use the 22 features which we believe are most relevant to credit risk.
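A balanced 5,000-instance subsample like the one described above can be drawn as in the sketch below; the file name and the label column are hypothetical placeholders (the original dataset is proprietary), and we assume each class has at least 2,500 instances.

import pandas as pd

# Hypothetical file and column names for illustration only.
df = pd.read_csv("credit_data.csv")

# Draw 2,500 positives and 2,500 negatives for a balanced subset.
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=2500, random_state=0))
      .reset_index(drop=True)
)
print(balanced["label"].value_counts())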
B. Experimental setup

We randomly split the data into a training set (60% of points), a validation set (20% of points) and a testing set (20% of points). We use the training set to do feature selection and record the optimal feature subset, and finally use this small subset of features to train the classifier for credit prediction. The SVM classification [12] accuracy with 10-fold cross validation on the whole dataset is used to measure the effect of the feature subset we select. The parameters of the genetic algorithm are set according to our empirical experience: population size: 20; number of generations: 20; probability of crossover: 0.6; probability of mutation: 0.03.
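The split and the evaluation of a selected subset can be sketched as follows, assuming the arrays X and y and the GA's best_chromosome from the earlier sketches are in scope; the stratification and random seeds are our assumptions.

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# 60/20/20 split: carve out 40%, then halve it into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Measure the selected subset by 10-fold cross-validated SVM accuracy
# on the whole dataset, as described above.
selected = np.flatnonzero(best_chromosome)
score = cross_val_score(SVC(), X[:, selected], y, cv=10).mean()
print(f"10-fold CV accuracy with {selected.size} features: {score:.3f}")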
We compare several feature selection methods:

ReliefF is an instance-based learning method that samples instances randomly from the training set and checks the neighboring records of the same and different classes ("near hits" and "near misses"). If a near hit has a different value for a certain attribute, that feature seems to be irrelevant and its weight should be decreased.
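The sketch below is our simplified two-class, single-neighbor reading of this weight-update rule; full ReliefF with k neighbors and multi-class handling is more involved.

import numpy as np

def relief_weights(X, y, n_samples=100, rng=None):
    """Simplified Relief: reward features that differ on near misses,
    penalize features that differ on near hits."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance
        dists[i] = np.inf                     # never pick the instance itself
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(dists[same])]    # nearest record of same class
        miss = diff[np.argmin(dists[diff])]   # nearest record of other class
        # A differing value on a near hit suggests irrelevance (weight down);
        # a differing value on a near miss suggests relevance (weight up).
        w -= np.abs(X[i] - X[hit]) / n_samples
        w += np.abs(X[i] - X[miss]) / n_samples
    return w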
Information gain (IG) based feature selection scores each feature A by the expected reduction in entropy:

IG(S, A) = H(S) − Σ_{v ∈ V(A)} (|S_v| / |S|) · H(S_v),

H(S) = − Σ_{c ∈ C} (|S_c| / |S|) · log₂(|S_c| / |S|),

where S is the item collection and |S| its cardinality; V(A) is the set of all possible values for feature A; S_v is the subset of S for which A has value v; C is the class collection; S_c is the subset of S containing items belonging to class c. In the process of IG-based feature selection, the features are ranked by their information gains, and the non-significant features are then filtered out by setting an appropriate threshold on the ranking.
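These two formulas translate directly into code; the sketch below assumes numpy arrays of nominal feature values and class labels.

import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c (|S_c|/|S|) * log2(|S_c|/|S|)."""
    n = len(labels)
    return -sum((k / n) * np.log2(k / n) for k in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(S, A) = H(S) - sum_v (|S_v|/|S|) * H(S_v) for a nominal feature A."""
    n = len(labels)
    remainder = sum(
        (np.sum(feature_values == v) / n) * entropy(labels[feature_values == v])
        for v in np.unique(feature_values)
    )
    return entropy(labels) - remainder

# Example: a feature that perfectly predicts the class gains the full H(S).
y = np.array([1, 1, 0, 0, 1])
a = np.array(["x", "x", "y", "y", "x"])
print(information_gain(a, y))  # 0.971 bits, i.e. H(y) itself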
Gain ratio (GR) evaluates the worth of an attribute by measuring the gain ratio with respect to the class by the following formulation:

GainR(Class, Attribute) = (H(Class) − H(Class | Attribute)) / H(Attribute).
Symmetrical uncertainty (SU) is a method of eliminating redundant features as well as irrelevant ones by selecting a subset of features that individually correlate well with the class but have little intercorrelation. The correlation between two nominal features X and Y can be measured using the symmetrical uncertainty criterion, which also compensates for the inherent bias of information gain by dividing it by the sum of the entropies of X and Y:

SU(X, Y) = 2 · IG / (H(X) + H(Y)) = 2 · (H(X) + H(Y) − H(X, Y)) / (H(X) + H(Y)),

where H is the entropy function. The entropies are based on the probability associated with each feature value; H(A, B), the joint entropy of A and B, is calculated from the joint probabilities of all combinations of values of A and B. Owing to the correction factor 2, SU takes values normalized to the range [0, 1]. A value of SU = 0 indicates that X and Y are uncorrelated, and SU = 1 means that the knowledge of one feature completely predicts the other. Similarly to GR, the SU is biased toward features with fewer values [11].
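Both GR and SU reduce to entropy bookkeeping, so the following self-contained sketch covers the two formulas above; it is our illustration of the definitions, not code from the paper, and it assumes H(X) and H(X) + H(Y) are nonzero.

import numpy as np
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((k / n) * np.log2(k / n) for k in Counter(values).values())

def joint_entropy(x, y):
    return entropy(list(zip(x, y)))  # H(X, Y) over pairs of values

def gain_ratio(attribute, labels):
    """GainR = (H(Class) - H(Class | Attribute)) / H(Attribute),
    using H(Class | Attribute) = H(Attribute, Class) - H(Attribute)."""
    ig = entropy(attribute) + entropy(labels) - joint_entropy(attribute, labels)
    return ig / entropy(attribute)

def symmetrical_uncertainty(x, y):
    """SU = 2 * IG / (H(X) + H(Y)), normalized to [0, 1]."""
    ig = entropy(x) + entropy(y) - joint_entropy(x, y)
    return 2.0 * ig / (entropy(x) + entropy(y))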
C. Result

The results of the experiments in Table I indicate that the feature selection algorithm based on the genetic algorithm can be used to deal with the feature selection problem: it reduces the number of selected features significantly and produces an obvious improvement in classification accuracy. It uses only about half of the original features to obtain higher classification accuracy.

TABLE I. RESULT OF EXPERIMENTS

Method                        | # of features | SVM Accuracy (%)
raw                           | 22            | 71.140
ReliefF                       | 10            | 58.214
Gain Ratio                    | 15            | 75.808
Information Gain              | 11            | 78.416
Symmetrical Uncertainty       | 13            | 79.440
SVM wrapper + Genetic search  | 12            | 80.212

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we address the credit risk analysis problem, which is a crucial task in finance and management. Our work is based on the machine learning method SVM. We have shown how to select a small subset of features for the SVM with our feature selection method based on a genetic search algorithm. Our empirical study shows that the selected group of features can significantly improve SVM classification accuracy, compared to several competing methods.

REFERENCES

[1] C. Cortes and V. N. Vapnik, "Support-vector networks," Machine Learning, vol. 20, 1995.
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, 1990.
[3] L. Yu and H. Liu, "Efficiently handling feature redundancy in high-dimensional data," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-03), Washington, DC, August 2003, pp. 685-690.
[4] P. M. Lewis, "The characteristic selection problem in recognition systems," IRE Transactions on Information Theory, vol. 8, 1962, pp. 171-178.
[5] J. Kittler, "Feature set search algorithms," Pattern Recognition and Signal Processing, 1978, pp. 41-60.
[6] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, no. 3, 1997, pp. 131-156.
[7] Z. Zhu, Y.-S. Ong, and M. Dash, "Wrapper-filter feature selection algorithm using a memetic framework," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 37, no. 1, 2007, pp. 70-76.
[8] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1-2, 1997, pp. 273-324.
[9] K. De Jong, "Learning with genetic algorithms: An overview," Machine Learning, vol. 3, Kluwer Academic Publishers, 1988.
[10] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, May 15-18, 2000, pp. 1-12.
[11] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley Longman, 2006.
[12] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, A Practical Guide to Support Vector Classification. Department of Computer Science and Information Engineering, National Taiwan University, 2003.
[13] A. E. Eiben et al., "Genetic algorithms with multi-parent recombination," in PPSN III: Proceedings of the Third Conference on Parallel Problem Solving from Nature, 1994, pp. 78-87.