
Expert Systems with Applications 30 (2006) 489–497

www.elsevier.com/locate/eswa

On preprocessing data for financial credit risk evaluation


Selwyn Piramuthu
Decision and Information Sciences, University of Florida, 351 Stuzin Hall, P.O. Box 117169, Gainesville, FL 32611-7169, USA
Tel.: +1 352 392 8882; fax: +1 352 392 5438. E-mail: [email protected]

Abstract

Financial credit-risk evaluation is among a class of problems known to be semi-structured, where not all of the variables used for decision-making are known or captured without error. Machine learning has been successfully used for credit-evaluation decisions. However, blindly applying machine learning methods to financial credit risk evaluation data with minimal knowledge of the data may not always lead to the expected results. We present and evaluate some data and methodological considerations to be taken into account when using machine learning methods for these decisions. Specifically, we consider the effects of preprocessing the credit-risk evaluation data used as input for machine learning methods.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: Feature selection; Feature construction; Financial credit-risk evaluation; Decision tables

1. Introduction

According to a recent survey from Risk Waters Group and SAS Institute, Inc., as cited in Computerworld (25 August 2004), organizations expect to benefit from significant rewards (e.g. a 10% reduction in economic capital and a 14% reduction in the cost of credit losses) through improved credit risk management. Results from this survey also indicate that data management is the biggest obstacle to successfully implementing credit risk management systems. These findings are not surprising given the stakes that are involved when dealing with significant chunks of resources. They are especially significant when put in the perspective of even a typical mid-sized financial institution, where the losses that could be prevented with improved credit risk management systems run in the hundreds of millions of dollars.

Financial credit and the losses associated with it are not isolated problems. These problems are common to most financial institutions, although only huge losses get enough publicity to attract the attention of the general public. Recent financial crises include the US S&L crisis, with an estimated cost in the hundreds of billions of dollars; the Nordic countries' injection of around $16 billion into their financial systems to keep them from bankruptcy; Japan's bad loans, which were estimated to be in the $160–240 billion range in October 1993; and, later, Mexico's spending of at least $20 billion to keep its financial system from collapsing.

Although it is really difficult, if not impossible, to be completely risk-free when dealing with financial credit, it is possible to reduce some of these problems. Better estimation of credit risk, and implementing ways to completely or even partially avoid some of these risks, translates to effective utilization of resources in these situations. Managing credit risk and reducing loan losses directly affect the bottom line of a financial institution. The identification and quantification of credit risk is thus important in improving the efficiency, accuracy, and consistency of financial credit risk management initiatives.

One of the ways to approach this problem is to utilize better decision support tools when evaluating credit risk. The tools for such decision support have been developed over the years, and range from simple 'eye-balling' of data to complex data analyses. These tools by themselves are only a part of the solution. A complete system for financial risk management includes such tools along with the means of interpreting the results generated by these tools and the resources necessary to execute appropriate decisions in a timely manner, among others. In this paper, we consider the decision support tool component, specifically from a machine learning perspective. Several previous studies have addressed some of the issues in applying machine learning tools to credit-risk evaluation (e.g. Baesens, Setiono, Mues, & Vanthienen, 2003; Piramuthu, 1999; Shaw & Gentry, 1990). We discuss a few means of improving the performance of these tools through data preprocessing, specifically through feature selection and construction.

This paper is organized as follows: the next section provides a brief overview of some of the issues that relate to machine learning, and of some of the issues that relate to applying these methods to financial credit risk data. Section 3 discusses a few means of addressing some of the issues raised in Section 2. Section 4 provides some illustrations, and Section 5 ends the paper with a brief discussion of the implications.

2. Relevant machine learning issues

Schaffer's (1994) work and, later, Wolpert and Macready's (1995, 1997) well-known no free lunch (NFL) theorems for search and optimization state that the performance of search, optimization, or learning algorithms is equal when averaged over all possible problems. A corollary of the no free lunch theorem is that if an algorithm performs better than average on a given set of functions, it must perform worse than average on the complementary set of those functions. In other words, an algorithm performs well on a subset of functions at the expense of poor performance on the complementary set. A consequence of this is that all algorithms are equally specialized (Schumacher, Vose, & Whitley, 2001). Since the performance of all algorithms is similar, there can be no algorithm that is more robust than the rest. NFL applies to cases where each function has the same probability of being the target function. This was later extended (e.g. Igel & Toussaint, 2003; Schumacher et al., 2001) to provide necessary and sufficient conditions for subsets of functions as well as for arbitrary non-uniform distributions of target functions.
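Stated formally, in Wolpert and Macready's (1997) own notation (reproduced here for concreteness; the display does not appear in this paper): for any two algorithms a_1 and a_2, any number of function evaluations m, and any histogram d_m^y of observed objective values,

\[ \sum_{f} P(d^{y}_{m} \mid f, m, a_{1}) = \sum_{f} P(d^{y}_{m} \mid f, m, a_{2}), \]

i.e. summed uniformly over all objective functions f, every algorithm produces every performance profile with the same probability.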
The NFL theorems and the related work that followed raise serious questions about blindly applying an algorithm (e.g. a neural network or a genetic algorithm) to data (e.g. Culberson, 1998). However, a cursory look at the published literature reveals a plethora of articles that compare across methods and conclude that a given method is better than a few other methods. For example, Giplin et al. (1990) compared stepwise linear discriminant analysis, stepwise logistic regression, and CART to three senior cardiologists for predicting whether a patient would die within a year of being discharged after an acute myocardial infarction. Their results showed that there was no difference between the physicians and the computers in terms of prediction accuracy. Kors and van Bemmel (1990) compared statistical multivariate methods with heuristic decision tree methods in the domain of electrocardiogram (ECG) analysis. Their comparisons show that decision tree classifiers are more comprehensible and more flexible in incorporating or changing existing categories.

Comparisons of CART to multiple linear regression and discriminant analysis can be found in Callahan and Sorensen (1991), where it is argued that CART is more suitable than the other methods for very noisy domains with many missing values. Feng, Sutherland, King, Muggleton, and Henery (1993) present a comparison of several machine learning methods (including decision trees, neural networks, and statistical classifiers) as part of the European Statlog project. Their main conclusions are that (1) no method seems uniformly superior to the others, (2) machine learning methods seem to be superior for multi-modal distributions, and (3) statistical methods are computationally the most efficient. Curram and Mingers (1994) compare decision trees, neural networks, and discriminant analysis on several real-world data sets. Their comparisons reveal that linear discriminant analysis is the fastest of the methods when its underlying assumptions are met, and that decision tree methods overfit in the presence of noise.

It should be noted that most of these methods (e.g. neural networks, genetic algorithms, decision trees) involve specific learning strategies, and each application of these algorithms involves further specifying the strategy by tweaking various parameters (e.g. topology, learning rate, and weight-update mechanism in neural networks; encoding of the genotype, mutation rate, replication mechanism, and selection mechanism in genetic algorithms; the mechanism for splitting decisions at each node, and when to prune, in decision trees). Without appropriate tweaking, the resulting performance of these algorithms may not necessarily prove to be as reported in the literature. Moreover, tweaking tailors the different parameters of a given algorithm to the data sets of interest.
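As a minimal sketch of what such tweaking looks like in practice (an illustration only; the scikit-learn library and the synthetic data below are not part of the studies cited here), the same decision tree inducer can be tuned over its splitting criterion and depth, and the chosen settings are precisely what gets tailored to the data set at hand:

# Sketch: the same algorithm, tweaked differently, performs differently.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Search over two of the 'tweakable' parameters mentioned in the text:
# how splits are chosen and when to stop growing the tree.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"criterion": ["gini", "entropy"],
                     "max_depth": [2, 4, 8, None]},
                    cv=5)
grid.fit(X_tr, y_tr)
print("best parameters:", grid.best_params_)
print("holdout accuracy:", grid.score(X_te, y_te))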
There have also been studies that compared different algorithms, only to conclude that no one algorithm dominates the rest across several data sets. For example, Duin (1996) compared six methods (including neural networks, decision trees, and nearest mean) on seven data sets and concluded that there is no such thing as a best classifier. He goes on to state that for any classifier, a problem or data set can be selected on which it can be shown to perform well. Lim and Shih (2000) studied 33 classification methods using 32 data sets and report that the mean error rates of these algorithms are sufficiently similar that their differences are statistically insignificant.

What do all these studies have to do with using machine learning methods for financial credit risk classification? For one, it is clear that we cannot select an algorithm and claim its superiority over competing algorithms without regard to data and/or problem characteristics, as well as the suitability of the algorithm to such data and/or problem characteristics. However, given the results from previous studies, one can possibly claim the superiority of an algorithm for a specific data set or problem. The lesson learned here is simply that one needs to take data and/or problem characteristics, as well as the suitability of a given algorithm, into account to obtain better performance. Performance, in this context, depends on at least two different entities: the algorithm and the data set. The results mentioned in the last few paragraphs, as well as the NFL theorems, deal with the performance of algorithms. These studies, however, fail to deal with data characteristics and their appropriateness for a given algorithm.

Data characteristics (noise, missing values, complexity of the distribution of the data, instance selection, etc.) can and do significantly affect the resulting performance of most, if not all, algorithms. Having selected an appropriate algorithm for a given data set, performance can be further improved by ensuring appropriate data characteristics.

3. Reducing data complexity for learning

Assessing a firm's financial risk is an important decision for investors, companies that extend credit, and financial institutions. An incorrect valuation of potential risks can result in serious financial loss. Three aspects of financial risk classification are critical but difficult: the development of a compact model, the use and refinement of the classification model for evaluation, and the identification of relevant financial features. For typical classification problems, values for a set of independent variables are given in a set of (training) examples, upon which a model is developed to categorize future observations into appropriate classes. Classification problems arise in credit or loan evaluation (Carter & Cartlett, 1987), bond rating (Ang & Patel, 1975), market surveys (Currim, Meyer, & Le, 1988), tax planning (Michaelsen, 1984), and bankruptcy prediction of firms (Messier & Hansen, 1988; Shaw & Gentry, 1990; Tam & Kiang, 1992), among other applications.

A concept is an expression that identifies a subset of some universe (Rendell & Seshu, 1990). The concept learning problem can be represented by an instance space with the features used in the training examples as the axes. When there are multiple regions (peaks) in the instance space, the learning problem is characterized as 'hard concept learning' (Rendell & Seshu, 1990) for its inherent learning difficulty. In most hard learning problems, using the appropriate set of features is critical to the success of the learning process and is therefore, by itself, an important decision. In the game of checkers, for example, detailed features such as the content of each board position may not be as helpful for learning good strategies as higher-level information, such as piece advantage and mobility. It is therefore reasonable to hypothesize that learning checker strategies based on observing the content of board positions is more difficult than learning based on training examples described by piece advantage and mobility.

The same phenomenon with respect to the relationship between learning difficulty and proper representation of the training examples is especially pronounced in the financial risk evaluation domain. In determining companies' creditworthiness, for example, the features used in training determine the learning complexity to a great extent, and sometimes even the degree of eventual success of the learning process itself. The creditworthiness of companies would be more difficult for a learning system to learn from raw accounting data (e.g. items from income statements and balance sheets) than from higher-level financial concepts such as liquidity, leverage level, profit growth, and operating cash flow. Successful learning hinges on the proper representation of the training examples.

3.1. Feature construction

Consider the XOR example in Fig. 1(a). This problem requires at least two hyperplanes (straight lines, in this space) to separate the examples belonging to the two (+, −) classes. The addition of a new feature X3 (X3 = X1 ∨ X2) decreases the learning difficulty by requiring just one hyperplane (abcd in Fig. 1(b)) to separate the examples belonging to the two classes. Although the addition of the new feature increases the number of effective features used, the resulting space simplifies the classification process.

Fig. 1. A new feature (i.e. X3) makes learning easier.
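The point of Fig. 1 can be checked numerically. The short Python sketch below (an illustration, not part of the original study) brute-forces linear thresholds on (X1, X2) to show that none classifies XOR, and then exhibits a single separating hyperplane once X3 = X1 ∨ X2 is added (a conjunctive feature works equally well):

# XOR is not linearly separable in (X1, X2), but becomes separable once
# a constructed feature X3 = X1 or X2 is added (cf. Fig. 1).
from itertools import product

examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

def separable(points, weights, bias):
    # A linear threshold unit is correct iff its sign matches every label.
    return all((sum(w * x for w, x in zip(weights, p)) + bias > 0) == bool(y)
               for p, y in points)

# Brute-force a grid of weights and biases: nothing separates XOR in 2-D.
grid = [v / 2 for v in range(-4, 5)]
print(any(separable(examples, (w1, w2), b)
          for w1, w2, b in product(grid, repeat=3)))       # False

# Add the constructed feature X3 = X1 or X2: one hyperplane now suffices.
lifted = [((x1, x2, x1 | x2), y) for (x1, x2), y in examples]
print(separable(lifted, (-1, -1, 2), -0.5))                # True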



Feature construction can be defined in terms of concept learning as follows: feature construction is the process of applying a set of constructive operators {o1, o2, ..., on} to a set of existing features {f1, f2, ..., fm}, resulting in the construction of one or more new features {f1', f2', ..., fN'} intended for use in describing the target concept (Matheus & Rendell, 1989). A separate learning method (e.g. neural network learning) can then make use of the constructed features in attempting to describe the target concept.

Examples of feature construction systems include BACON (Langley, Zytkow, Simon, & Bradshaw, 1986), FRINGE (Pagallo, 1989), CITRE (Matheus & Rendell, 1989), MIDOS (Wrobel, 1997), Explora (Klösgen, 1996), and Tertius (Flach & Lachiche, 2001).

BACON (Langley et al., 1986), a program that discovers relationships among real-valued features of instances in data, uses two operators [multiply(_,_) and divide(_,_)]. The strong bias restricting the constructive operators allowed leads to a manageable feature construction process, although concept learning is severely restricted by these chosen operators.

FRINGE (Pagallo, 1989) is a decision-tree (e.g. Quinlan, 1986) based feature construction algorithm. New features are constructed by conjoining pairs of features at the fringe of each of the positive branches in the decision tree. During each iteration, the newly constructed features and the existing features are used as the input space for the algorithm. This process of tree-building and feature construction continues until no new features are found. Splitting continues to purity, i.e. no pruning (Breiman et al., 1984) is used.
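A toy sketch of this fringe idea follows (hypothetical Python, not the FRINGE implementation; the attribute names are invented for illustration): the two tests nearest each positive leaf are conjoined into a new Boolean feature, which can then be added to the input space for the next round of tree building.

# Fringe-style feature construction: conjoin the parent and grandparent
# tests of each positive leaf into a new Boolean feature.

# Each positive path lists the tests from the root to a positive leaf,
# encoded as (attribute, required_value) pairs (names are hypothetical).
positive_paths = [
    [("checking_status", 1), ("credit_history", 0), ("employed", 1)],
    [("savings", 1), ("employed", 1)],
]

def fringe_features(paths):
    new_features = set()
    for path in paths:
        if len(path) >= 2:
            grandparent, parent = path[-2], path[-1]
            new_features.add((grandparent, parent))   # their conjunction
    return new_features

for gp, p in fringe_features(positive_paths):
    print("new feature: (%s == %s) AND (%s == %s)" % (gp[0], gp[1], p[0], p[1]))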
MIDOS (Wrobel, 1997), which stands for multi-relational discovery of subgroups, finds statistically unusual subgroups in a database. It uses optimistic estimate and minimal support pruning, and an optimal refinement operator. MIDOS takes the generality of a hypothesis (i.e. the size of the subgroup) into account in addition to the proportion of positive examples in the subgroup.

Explora (Klösgen, 1996) is an interactive system for the discovery of interesting patterns in databases. The number of patterns presented to the user is reduced by organizing the search hierarchically, beginning with the strongest, most general, hypotheses. An additional refinement strategy selects the most interesting statements and eliminates overlapping findings. The efficiency of discovery is improved by inverting the record-oriented data structure and storing all values of the same variable together, allowing efficient computation of aggregate measures. Different data subsets are represented as bit-vectors, making the computation of logical combinations of conditions very efficient.

Tertius (Flach & Lachiche, 2001) uses a first-order logic representation and implements a top-down rule discovery mechanism. It deals with extensional knowledge with explicit negation or under the closed-world assumption. It employs a confirmation measure, where only substitutions explicitly satisfying the body of rules are taken into account.
In decision-tree based feature construction algorithms, as feature construction proceeds iteratively, the addition of new features to the previous set of features can lead to a large number of features being used as input to the decision tree construction algorithm. Thus, pruning of features is done during each iteration. The most desirable features are kept, to be carried over to the next iteration as well as to form newer features, whereas the least desirable features are discarded. This is done by the decision tree algorithm (e.g. ID3) through pruning, as well as through the features that were not used in the formation of the decision tree.

FC (Ragavan, Rendell, Shaw, & Tessmer, 1993) constructs features iteratively from decision trees. It forms new features by conjoining as well as disjoining two nodes at the fringe of the tree: the parent and grandparent nodes of positive leaves are conjoined or disjoined to give a new feature. New features are added to the set of original features, and a new decision tree is constructed using the maximum information-gain criterion (Quinlan, 1986). This feature selection phase thus chooses from both the newly constructed features and the original features for rebuilding the decision tree. The iterative process is repeated until no new features are constructed. Detailed steps for constructing the inductive decision tree can be found in Quinlan (1986). FC basically resolves the interactions among features by conjoining and disjoining features that appear close to the leaf nodes in a decision tree generated by an inductive learning program such as ID3 (Quinlan, 1986); new feature sets constructed through FC have been shown to make learning easier.

CITRE (Matheus & Rendell, 1989) and DC Fringe (Yang, Rendell, & Blix, 1991) are also decision-tree based feature construction algorithms. They use a variety of operands, such as root (selects the first two features of each positive branch), fringe (similar to FRINGE), root-fringe (a combination of both root and fringe), adjacent (selects all adjacent pairs along each branch), and all (all of the above). All of these operands use conjunction as the operator. In DC Fringe, both conjunction and disjunction are utilized as operators.
3.2. Feature selection

Consider the data given in Fig. 2. There are two independent variables, V1 and V2, and a dependent variable Y. The examples from the two classes (Y = + and Y = −) are clearly separable. Given a choice between the variables V1 and V2, we would choose V1 as the variable that distinguishes examples from the two classes.

Fig. 2. An example.
Intuitively, if we project the data points on the V1 axis, the four data points belonging to the class '−' would be at V1 = 1 and 2, and the four points belonging to the class '+' would be at V1 = 4 and 5. Here, given just the V1 axis, we can separate the examples belonging to the '−' class from those belonging to the '+' class. Similarly, if we project the data points on the V2 axis, the four data points belonging to the class '−' would be at V2 = 1 and 2; the four points belonging to the class '+' would also be at V2 = 1 and 2. Here, given just the V2 axis, we cannot separate the examples belonging to the '−' and '+' classes: we have perfect overlap of the examples of the two classes in dimension V2, whereas our goal is to separate the examples belonging to the two classes.
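The choice can be made mechanical, as the following sketch shows (illustrative Python; the eight points are a hypothetical reconstruction consistent with the description above). Scoring each variable by the information gain of its best threshold split gives V1 a full bit and V2 nothing:

# The Fig. 2 example: V1 separates the classes, V2 does not.
import math

points = [((1, 1), 0), ((1, 2), 0), ((2, 1), 0), ((2, 2), 0),
          ((4, 1), 1), ((4, 2), 1), ((5, 1), 1), ((5, 2), 1)]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(0), labels.count(1)) if c)

def best_split_gain(axis):
    labels = [y for _, y in points]
    values = sorted({p[axis] for p, _ in points})
    gains = []
    for t in values[:-1]:            # candidate thresholds between values
        left = [y for p, y in points if p[axis] <= t]
        right = [y for p, y in points if p[axis] > t]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(points)
        gains.append(entropy(labels) - remainder)
    return max(gains)

print("gain from V1:", best_split_gain(0))   # 1.0 bit: perfect separation
print("gain from V2:", best_split_gain(1))   # 0.0 bits: no information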
Invariably, and unknowingly for the most part, irrelevant as well as redundant variables are introduced along with relevant variables to better represent the domain in credit risk evaluation applications. A relevant variable is neither irrelevant nor redundant to the target concept of interest (John et al., 1994). Whereas an irrelevant feature (variable) does not affect the description of the target concept in any way, a redundant feature does not add anything new to the description of the target concept, while possibly adding more noise than useful information to concept learning.

Feature selection is the problem of choosing a small subset of features that ideally is necessary and sufficient to describe the target concept (Kira & Rendell, 1992). Feature selection is of paramount importance for any learning algorithm: when it is done poorly (i.e. a poor set of features is selected), it may lead to problems associated with incomplete information, noisy or irrelevant features, and not having the best set/mix of features, among others. The learning algorithm is slowed down unnecessarily by the higher dimensionality of the feature space, while also experiencing lower classification accuracy due to learning irrelevant information. The ultimate objective of feature selection is to obtain a feature space with (1) low dimensionality, (2) retention of sufficient information, (3) enhanced separability, in the feature space, of examples in different categories, achieved by removing the effects due to noisy features, and (4) comparability of features among examples in the same category (Meisel, 1972).

Although seemingly trivial, the importance of feature selection cannot be overstated. Consider, for example, a data mining situation where the concept to be learned is to classify good and bad creditworthy customers. The data for this application could possibly include several variables, including social security number, assets, liabilities, past credit history, number of years with current employer, salary, and frequency of credit evaluation requests. Here, regardless of the other variables included in the data, the social security number can uniquely determine a customer's creditworthiness. The knowledge learned using only the social security number as a predictor has extremely poor generalizability when applied to new customers. Clearly, in this case, we can avoid such a problem by excluding social security numbers from the input data. It is not always clear-cut, however, which of the variables could result in such spurious patterns, and a similar problem could exist among one or more of the other variables in the data. Feature selection methods can be used in such situations to cull out problematic features before the data enters the pattern extraction stage of a data mining system.

A goal of feature selection is to avoid selecting too many or too few features. If too few features are selected, there is a good chance that the information content in this set of features is low. On the other hand, if too many (irrelevant) features are selected, the effects due to the noise present in (most real-world) data may overshadow the information present. Hence, this is a tradeoff that must be addressed by any feature selection method.

The marginal benefit resulting from the presence of a feature in a given set plays an important role. A given feature might provide more information when present with certain other feature(s) than when considered by itself. Cover (1974), Elashoff, Elashoff, and Goldman (1967), and Toussaint (1971), among others, have shown the importance of selecting features as a set, rather than selecting the best individual features to form the (supposedly) best set. They have shown that the best individual features do not necessarily constitute the best set of features. However, in most real-world situations, it is not known what the best set of features is, nor the number (n) of features in such a set. Currently, there is no means of obtaining the value of n, which depends partially on the objective of interest. Even assuming that n is known, it is extremely difficult to obtain the best set of n features, since not all n of these features may be present in the data comprising the available set of features.

There exists a vast amount of literature on feature selection. Researchers have attempted feature selection through varied means, such as statistical (e.g. Kittler, 1975), geometrical (e.g. Elomaa & Ukkonen, 1994), information-theoretic measures (e.g. Chambless & Scarborough, 2001), neuro-fuzzy (e.g. Benitez, Castro, Mantas, & Rojas, 2001), receiver operating characteristic (ROC) curves (Coetzee, Glover, Lawrence, & Giles, 2001), discretization (Liu & Setiono, 1997), and mathematical programming (e.g. Bradley et al., 1998), among others.
In statistical analyses, forward and backward stepwise multiple regression (SMR) are widely used to select features, with forward SMR being used more often due to the smaller amount of calculation involved. The output here is the smallest subset of features resulting in an R² (coefficient of determination) value that explains a significantly large amount of the variance. In forward SMR, the analysis proceeds by adding features to a subset until the addition of a new feature no longer results in a significant (usually at the 0.05 level) increment in explained variance (the R² value). In backward SMR, the full set of features is used to start with, while seeking to eliminate the features with the smallest contribution to R².
Malki and Moghaddamjoo (1991) apply the Karhunen–Loève (K-L) transform to the training examples to obtain the initial training vectors. Training is started in the direction of the major eigenvectors of the correlation matrix of the training examples. The remaining components are gradually included in their order of significance. The authors generated training examples from a synthetic noisy image and compared the results obtained using the proposed method to those of the standard backpropagation algorithm. The proposed method converged faster than standard backpropagation, with comparable classification performance.
Siedlecki and Sklansky (1989) use genetic algorithms for feature selection by encoding the initial set of n features as an n-element bit string, with 1 and 0 representing the presence and absence, respectively, of features in the set. They used classification accuracy as the fitness function (for the genetic algorithm while selecting features) and obtained good neural network results compared to branch-and-bound and sequential search (Stearns, 1976) algorithms. They used synthetic data as well as digitized infrared imagery of real scenes, with classification accuracy as the objective function. Yang and Honavar (1997) report a similar study. However, Hopkins, Routen, and Watson (1994) later showed that classification accuracy may be a poor fitness function measure when searching to reduce the dimension of the feature set.
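A sketch of the bit-string scheme (illustrative Python; scikit-learn stands in for the neural network classifiers used in the cited study): a chromosome is a 0/1 mask over the features and the fitness is cross-validated classification accuracy.

# Bit-string genetic algorithm for feature selection (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)

def fitness(mask):
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))     # random initial masks
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]         # keep the fittest half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, X.shape[1])           # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05        # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))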
Using rough sets theory (Pawlak, 1982), PRESET (Modrzejewski, 1993) determines the degree of dependency (γ) of sets of attributes for selecting binary features. Features leading to a minimal preset decision tree, which is the one with the minimal length of all paths from root to leaves, are selected. Kohavi and Frasca (1994) use best-first search, stopping after a predetermined number of nonimproving node expansions. They suggest that it may be beneficial to use a feature subset that is not a reduct (a set with the property that a feature cannot be removed from it without changing the independence property of the features). A table-majority inducer was used, with good results.
The wrapper method (Kohavi, 1995a) searches for a good feature subset using the induction algorithm as a black box. The feature selection algorithm exists as a wrapper around the induction algorithm. The induction algorithm is run on data sets with subsets of features, and the subset of features with the highest estimated value of a performance criterion is chosen. The induction algorithm is then used to evaluate the data set with the chosen features on an independent test set. Yuan, Tseng, Gangshan, and Fuyan (1999) develop a two-phase method combining the wrapper and filter approaches.
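A minimal sketch of the wrapper idea (illustrative Python; the inducer and data are stand-ins): subsets are grown greedily, and each candidate subset is scored only by the accuracy the black-box induction algorithm achieves on it.

# Wrapper feature selection: the induction algorithm is a black box.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=1)

def wrapper_score(feature_subset):
    # Any inducer could be plugged in; the wrapper never looks inside it.
    inducer = DecisionTreeClassifier(random_state=0)
    return cross_val_score(inducer, X[:, feature_subset], y, cv=5).mean()

selected, best_score = [], 0.0
while True:
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    if not candidates:
        break
    best_j = max(candidates, key=lambda j: wrapper_score(selected + [j]))
    score = wrapper_score(selected + [best_j])
    if score <= best_score:                  # no addition improves: stop
        break
    selected, best_score = selected + [best_j], score

print("chosen subset:", selected, "estimated accuracy:", round(best_score, 3))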
Almuallim and Dietterich (1991) introduce the MIN-FEATURES bias (if two functions are consistent with the training examples, prefer the function that involves fewer input features) to select features in the FOCUS algorithm. They used synthetic data to study the performance of the FOCUS, ID3, and FRINGE algorithms using sample complexity, coverage, and classification accuracy as performance criteria. They increased the number of irrelevant features and showed that FOCUS performed consistently better.
The IDG algorithm (Elomaa & Ukkonen, 1994) uses the positions of examples in the instance space to select features for decision trees. The authors limit their attention to boundaries separating examples belonging to different classes, while rewarding (penalizing) rules that separate examples from different (same) classes. Eight data sets are used to compare the performance (% accuracy, number of nodes in the decision tree, time) of decision trees constructed using the proposed algorithm against ID3 (Quinlan, 1987). Decision trees generated using the proposed algorithm had better accuracy, whereas those with ID3 had fewer nodes and took more than an order of magnitude less time.

Based on the positions of instances in the instance space, the relief algorithm (Kira & Rendell, 1992) selects features that are statistically relevant to the target concept, using a relevancy threshold that is selected by the user. relief is noise-tolerant and is unaffected by feature interaction. The complexity of relief is O(pn), where n and p are the number of instances and the number of features, respectively. relief was studied using two 2-class problems, with good results compared to FOCUS (Almuallim & Dietterich, 1991) and heuristic search (Devijver & Kittler, 1982). Kononenko (1994) extended relief to deal with noisy, incomplete, and multi-class data sets.
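The weight update at the heart of relief is compact enough to sketch (a two-class numeric version, omitting the normalization details of the published algorithm):

# relief (sketch): each instance pulls feature weights up where it differs
# from its nearest miss and down where it differs from its nearest hit.
import numpy as np

def relief(X, y, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    w = np.zeros(p)
    for i in rng.permutation(n):
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf                     # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], dists, np.inf))
        miss = np.argmin(np.where(y != y[i], dists, np.inf))
        w += (X[i] - X[miss]) ** 2 - (X[i] - X[hit]) ** 2
    return w / n

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
relevant = y + 0.1 * rng.normal(size=200)     # tracks the class
noise = rng.normal(size=200)                  # pure noise
X = np.column_stack([relevant, noise])
print(relief(X, y))     # the relevant feature receives the larger weight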
Milne (1995) used neural networks to measure the contribution of individual input features to the output of the neural network. A new measure of input features' contribution to the output is proposed and evaluated using data mapping species occurrence in a forest. Using a scatter plot of contribution to output, subsets of features were removed, and the remaining feature sets were used as input to neural networks. Setiono and Liu (1997) present a similar study using neural networks to select features.

Battiti (1994) developed MIFS, which uses mutual information for evaluating the information content of each individual feature with respect to the output class. The features thus selected were used as input to neural networks. The author shows that the proposed method is better than feature selection methods that use linear dependence measures (e.g. correlations, as in principal components analysis). Al-Ani and Deriche (2001) extend this work by considering trade-offs between computational costs and combined feature selection. Koller and Sahami (1996) use cross-entropy to minimize the amount of predictive information lost during feature selection. Piramuthu and Shaw (1994) use C4.5 (Quinlan, 1990) to select the features used as input to neural networks. Their results showed improvements over plain backpropagation, both in terms of classification accuracy and in the time taken by the neural networks to converge.
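The quantity MIFS ranks on is easy to state concretely. The sketch below (illustrative Python for discrete features; the redundancy-penalty step of MIFS, which also subtracts mutual information between a candidate and already-selected features, is omitted) scores features by their mutual information with the class:

# Mutual information between a discrete feature and the class labels.
import math
from collections import Counter

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

labels      = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]   # mirrors the class exactly
irrelevant  = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of the class
print(mutual_information(informative, labels))   # 1.0 bit
print(mutual_information(irrelevant, labels))    # 0.0 bits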
The most popular feature selection methods in the machine learning literature are variations of sequential forward search (SFS) and sequential backward search (SBS), as described in Devijver and Kittler (1982), and their variants (e.g. Pudil, Ferri, Novovicova, & Kittler, 1994). SFS (SBS) obtains a chain of nested subsets of features by adding (subtracting) the locally best (worst) feature in the set. These methods are particular cases of the more general 'plus l-take away r' method (Stearns, 1976). Results from previous studies indicate that the performance of forward and backward searches is comparable. In terms of computing resources, forward search has the advantage, since fewer features are evaluated at each iteration, compared to backward search, where the process begins using all the features.

4. Financial risk classification applications
As it is important for companies, investors, and financial institutions to assess firms' financial health or riskiness, numerous empirical models have been developed that use annual financial information to distinguish between firms that are healthy and those that are risky (e.g. Abdel-Khalik & El-Sheshai, 1980). Although the financial credit-risk analysis literature is extensive, research interest continues in the development of a theoretical foundation that would capture the many dimensions of financial distress and failure. Likewise, numerous lenders and investors are interested in improving their ability to interpret, explain, and predict credit risk. This type of financial risk analysis presents a challenge to the development of appropriate classification models because of the lack of linear relationships among the features, the inherent level of noise in the training data, and the high degree of interaction among the features. We use four real-world financial credit-risk evaluation data sets to illustrate the effects of data preprocessing on learning performance.

4.1. Tam and Kiang (1992) data

This data set was used in the Tam and Kiang (1992) study. Texas banks that failed during 1985–1987 were the primary source of data. Data from one year and two years prior to failure were used. Data from 59 failed banks were matched with data from 59 non-failed banks that were comparable in terms of asset size, number of branches, age, and charter status. Tam and Kiang also used holdout samples for both the 1- and 2-year prior cases. The 1-year prior case consists of 44 banks, 22 of which belong to failed and the other 22 to non-failed banks. The 2-year prior case consists of 40 banks, 20 of which belong to failed and 20 to non-failed banks. The data describe each of these banks in terms of 19 financial ratios. For a detailed overview of the data set, the reader is referred to Tam and Kiang (1992).

4.2. Abdel-Khalik and El-Sheshai (1980) data

This data set was used in the Abdel-Khalik and El-Sheshai (1980) study, among others. The data were used to classify a set of firms into those that would default and those that would not default on loan payments. Of the 32 examples for training, 16 belong to the default case and the other 16 to the non-default case. All 16 holdout examples belong to the non-default case. The 18 variables in the data are: (1) net income/total assets, (2) net income/sales, (3) total debt/total assets, (4) cash flow/total debt, (5) long-term debt/net worth, (6) current assets/current liabilities, (7) quick assets/sales, (8) quick assets/current liabilities, (9) working capital/sales, (10) cash at year-end/total debt, (11) earnings trend, (12) sales trend, (13) current ratio trend, (14) trend of L.T.D./N.W., (15) trend of W.C./sales, (16) trend of N.I./T.A., (17) trend of N.I./sales, and (18) trend of cash flow/T.D. For a detailed description of this data, the reader is referred to Abdel-Khalik and El-Sheshai (1980).

4.3. German credit data

This data set (available at ftp.ics.uci.edu/pub/machine-learning-databases/statlog/) contains 1000 observations on 20 attributes. The class attribute describes people as either good (about 700 observations) or bad (about 300 observations) credits. Other attributes include status of existing checking account, credit history, credit purpose, credit amount, savings account/bonds, duration of present employment, installment rate as a percentage of disposable income, marital status and gender, other debtors/guarantors, duration in current residence, property, age, number of existing credits at this bank, job, telephone ownership, whether the applicant is a foreign worker, and number of dependents.

4.4. Australian credit approval data

This credit card applications data set (available at ftp.ics.uci.edu/pub/machine-learning/databases/credit-screening/) was used in Quinlan (1990), and has 690 observations with 15 attributes. Of the attributes, five are real-valued and the remaining are nominal. There are 307 positive examples and 383 negative examples in this data set.
4.5. Results

We use the decision table (e.g. Vanthienen & Wets, 1994) as an example of a tool for credit-risk evaluation decisions. Specifically, we used a simple decision table majority classifier, as given in Kohavi (1995b). We use several feature selection algorithms for preprocessing the data used as input to the decision tables. Specifically, we use relief (Kira & Rendell, 1992), gainratio (Quinlan, 1990), and chi-square (Chan & Wong, 1991). Table 1 provides results using these feature selection algorithms on the different data sets. Here, AE represents the Abdel-Khalik and El-Sheshai (1980) data, KT represents the Tam and Kiang (1992) data, GC represents the German credit data, and AC represents the Australian credit approval data. To facilitate ease of comparison, we used the same sets of training and testing data as in the initially reported studies where these data sets were used.

In Table 1, 'All' represents the case where no feature selection was used to select the input data for the decision tables. The numbers outside the parentheses are the percentages correctly classified by the decision tables, and the numbers inside the parentheses are the numbers of attributes used in the final decision tables. As can be seen in Table 1, feature selection leads to a reduction in the number of variables with comparable, or sometimes even better, classification performance.

Table 1
% Correctly classified by decision table (number of attributes used in parentheses)

Feature selection / data set    AE            KT           GC           AC
All                             68.75 (18)    83.43 (19)   76.7 (20)    87.9 (15)
Relief                          77.083 (5)    83.43 (1)    76.7 (3)     88.21 (14)
Gainratio                       77.083 (5)    83.43 (5)    87.6 (4)     87.9 (10)
Chi-square                      77.083 (5)    83.43 (5)    82.7 (11)    87.9 (12)
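A minimal sketch of the decision table majority idea used here (illustrative Python, not Kohavi's implementation; the attributes are invented): a test case is matched on the selected features, the majority class of the matching training cases is returned, and the overall majority class is the fallback for empty cells.

# Decision table majority (DTM) classifier, sketched.
from collections import Counter, defaultdict

def train_decision_table(rows, labels, features):
    table = defaultdict(list)
    for row, label in zip(rows, labels):
        table[tuple(row[f] for f in features)].append(label)
    default = Counter(labels).most_common(1)[0][0]  # global majority class
    return table, default

def classify(table, default, features, row):
    cell = table.get(tuple(row[f] for f in features))
    return Counter(cell).most_common(1)[0][0] if cell else default

# Hypothetical credit examples projected onto two selected features.
rows = [{"history": "good", "employed": 1}, {"history": "good", "employed": 0},
        {"history": "bad", "employed": 1}, {"history": "bad", "employed": 0}]
labels = ["accept", "accept", "accept", "reject"]
table, default = train_decision_table(rows, labels, ["history", "employed"])
print(classify(table, default, ["history", "employed"],
               {"history": "bad", "employed": 0}))    # reject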
Preprocessing data can be beneficial for learning algorithms, by reducing the complexity of the instance space used as input to these algorithms.
We can also improve learning performance through appropriate sampling of the input data to achieve better instance selection.

5. Discussion

Financial credit-risk evaluation data are replete with noise, and the available information itself is prone to incompleteness. In spite of all these constraints, one should be able to efficiently obtain information from the available data so as to compensate for its inadequacies.

Financial credit-risk evaluation is done thousands of times every day in most financial institutions, among others, and involves huge amounts of capital. Any improvement in currently available methods would certainly benefit these institutions in a tangible way. This paper considered one facet of financial credit-risk evaluation: decision-making tools. Although there are several tools that can be successfully used for this purpose, it is better to select a tool that is tailored to the purpose, taking the characteristics of financial credit-evaluation data into consideration. Another means of improving the performance of these tools is through proper preprocessing of the data used in these decision support tools. Given the ready availability of tools for decision-making, as well as of those used for preprocessing input data, there really is no excuse not to utilize them to get the most benefit from risk analysis.
Acknowledgements

I thank the referee for a thorough review, and for highlighting important issues that have helped in improving the clarity of presentation of this paper.
References

Abdel-Khalik, A. R., & El-Sheshai, K. M. (1980). Information choice and utilization in an experiment on default prediction. Journal of Accounting Research, Autumn, 325–342.
Al-Ani, A., & Deriche, M. (2001). An optimal feature selection technique using the concept of mutual information. Proceedings of the International Symposium on Signal Processing and its Applications (ISSPA) (pp. 477–480). Kuala Lumpur.
Almuallim, H. M., & Dietterich, T. G. (1991). Learning with many irrelevant features. Proceedings of the Ninth National Conference on Artificial Intelligence (pp. 547–552).
Ang, J., & Patel, K. (1975). Bond rating methods: Comparison and validation. Journal of Finance, 30(2), 631–640.
Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (2003). Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science, 49(3), 312–329.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.
Benitez, J. M., Castro, J. L., Mantas, C. J., & Rojas, F. (2001). A neuro-fuzzy approach for feature selection. Proceedings of the IFSA World Congress and 20th NAFIPS International Conference (Vol. 2, pp. 1003–1008).
Bradley, P. S., Mangasarian, O. L., & Street, W. N. (1998). Feature selection via mathematical programming. INFORMS Journal on Computing, 10(2), 209–217.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Callahan, J. D., & Sorensen, S. W. (1991). Rule induction for group decisions with statistical data - an example. Journal of the Operational Research Society, 42(3), 227–234.
Carter, C., & Cartlett, J. (1987). Assessing credit card applications using machine learning. IEEE Expert, Fall, 71–79.
Chambless, B., & Scarborough, D. (2001). Information-theoretic feature selection for a neural behavioral model. Proceedings of the International Joint Conference on Neural Networks (IJCNN-01) (Vol. 2, pp. 1443–1448).
Chan, K. C., & Wong, A. K. (1991). A statistical technique for extracting classificatory knowledge from databases. In G. Piatetsky-Shapiro & W. Frawley (Eds.), Knowledge discovery in databases. Cambridge, MA: AAAI Press.
Coetzee, F. M., Glover, E., Lawrence, S., & Giles, C. L. (2001). Feature selection in web applications by ROC inflections and powerset pruning. Proceedings of the Symposium on Applications and the Internet (pp. 5–14).
Cover, T. M. (1974). The best two independent measurements are not the two best. IEEE Transactions on Systems, Man, and Cybernetics, SMC-4(1), 116–117.
Culberson, J. C. (1998). On the futility of blind search: An algorithmic view of 'No Free Lunch'. Evolutionary Computation, 6, 109–127.
Curram, S. P., & Mingers, J. (1994). Neural networks, decision tree induction and discriminant analysis: An empirical comparison. Journal of the Operational Research Society, 45(4), 440–450.
Currim, I. S., Meyer, R. J., & Le, N. T. (1988). Disaggregate tree-structured modeling of consumer choice data. Journal of Marketing Research, August, 253–265.
Devijver, P. A., & Kittler, J. (1982). Pattern recognition: A statistical approach. Englewood Cliffs, NJ: Prentice-Hall.
Duin, R. P. W. (1996). A note on comparing classifiers. Pattern Recognition Letters, 17, 529–536.
Elashoff, J. D., Elashoff, R. M., & Goldman, G. E. (1967). On the choice of variables in classification problems with dichotomous variables. Biometrika, 54, 668–670.
Elomaa, T., & Ukkonen, E. (1994). A geometric approach to feature selection. Proceedings of the European Conference on Machine Learning (pp. 351–354).
Feng, C., Sutherland, A., King, R., Muggleton, S., & Henery, R. (1993). Comparison of machine learning classifiers to statistics and neural networks. AI & Statistics-93, 41–52.
Flach, P. A., & Lachiche, N. (2001). Confirmation-guided discovery of first-order rules with Tertius. Machine Learning, 42, 61–95.
Giplin, E. A., Olshen, R. A., Chatterjee, K., Kjekshus, J., Moss, A. J., Henning, H., et al. (1990). Predicting 1-year outcome following acute myocardial infarction. Computers and Biomedical Research, 23(1), 46–63.
Hopkins, C., Routen, T., & Watson, T. (1994). Problems with using genetic algorithms for neural network feature selection. 11th European Conference on Artificial Intelligence (pp. 221–225).
Igel, C., & Toussaint, M. (2003). On classes of functions for which no free lunch results hold. Information Processing Letters, 86(6), 317–321.
John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In W. W. Cohen & H. Hirsh (Eds.), Machine Learning: Proceedings of the Eleventh International Conference (pp. 121–129). San Francisco, CA: Morgan Kaufmann.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. Proceedings of the Ninth International Conference on Machine Learning (pp. 249–256).
Kittler, J. (1975). Mathematical methods of feature selection in pattern recognition. International Journal of Man–Machine Studies, 7, 609–637.
Klösgen, W. (1996). Explora: A multipattern and multistrategy discovery assistant. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 249–271). Menlo Park, CA: AAAI Press.
Kohavi, R. (1995a). Wrappers for performance enhancement and oblivious decision graphs. PhD dissertation, Computer Science Department, Stanford University.
Kohavi, R. (1995b). The power of decision tables. Proceedings of the Eighth European Conference on Machine Learning (pp. 174–189).
Kohavi, R., & Frasca, B. (1994). Useful feature subsets and rough set reducts. Third International Workshop on Rough Sets and Soft Computing (RSSC 94).
Koller, D., & Sahami, M. (1996). Toward optimal feature selection. Machine Learning: Proceedings of the Thirteenth International Conference.
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. Proceedings of the European Conference on Machine Learning (pp. 171–182).
Kors, J. A., & van Bemmel, J. H. (1990). Classification methods for computerized interpretation of the electrocardiogram. Methods of Information in Medicine, 29(4), 330–336.
Langley, P., Zytkow, J. M., Simon, H. A., & Bradshaw, G. L. (1986). The search for regularity: Four aspects of scientific discovery. In Machine learning: An artificial intelligence approach (Vol. 2, pp. 425–470). Los Altos, CA: Morgan Kaufmann.
Lim, T.-S., & Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40, 203–229.
Liu, H., & Setiono, R. (1997). Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9(4), 642–645.
Malki, H. A., & Moghaddamjoo, A. (1991). Using the Karhunen–Loève transformation in the back-propagation training algorithm. IEEE Transactions on Neural Networks, 2(1), 162–165.
Matheus, C. J., & Rendell, L. (1989). Constructive induction on decision trees. Proceedings of the Eleventh IJCAI (pp. 645–650).
Meisel, W. S. (1972). Computer-oriented approaches to pattern recognition. New York: Academic Press.
Messier, W. F., & Hansen, J. V. (1988). Inducing rules for expert system development: An example using default and bankruptcy data. Management Science, 34(12), 1403–1415.
Michaelsen, R. H. (1984). An expert system for tax planning. Expert Systems, October, 149–167.
Milne, L. (1995). Feature selection using neural networks with contribution measures. AI'95, Canberra.
Modrzejewski, M. (1993). Feature selection using rough sets theory. European Conference on Machine Learning (pp. 213–226).
Pagallo, G. (1989). Learning DNF by decision trees. Proceedings of the Eleventh IJCAI (pp. 639–644).
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11(5), 341–356.
Piramuthu, S. (1999). Financial credit-risk evaluation with neural and neurofuzzy systems. European Journal of Operational Research, 112, 310–321.
Piramuthu, S., & Shaw, M. J. (1994). On using decision tree as feature selector for feed-forward neural networks. International Symposium on Integrating Knowledge and Neural Heuristics (pp. 67–74).
Pudil, P., Ferri, F. J., Novovicova, J., & Kittler, J. (1994). Floating search methods for feature selection with nonmonotonic criterion functions. Proceedings of the IEEE 12th International Conference on Pattern Recognition (Vol. II, pp. 279–283).
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Quinlan, J. R. (1990). Decision trees and decision making. IEEE Transactions on Systems, Man and Cybernetics, 20(2), 339–346.
Ragavan, H., Rendell, L., Shaw, M., & Tessmer, A. (1993). Complex concept acquisition through directed search and feature caching, and practical results in a financial domain. Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (pp. 946–951).
Rendell, L., & Seshu, R. (1990). Learning hard concepts through constructive induction: Framework and rationale. Computational Intelligence, 6(4), 247–270.
Schaffer, C. (1994). A conservation law for generalization performance. Proceedings of the 1994 International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Schumacher, C., Vose, M. D., & Whitley, L. D. (2001). The no free lunch and description length. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001) (pp. 565–570).
Setiono, R., & Liu, H. (1997). Neural network feature selector. IEEE Transactions on Neural Networks, 8(3), 654–662.
Shaw, M. J., & Gentry, J. (1990). Inductive learning for risk classification. IEEE Expert, February, 47–53.
Siedlecki, W., & Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10(5), 335–347.
Stearns, S. D. (1976). On selecting features for pattern classifiers. Third International Conference on Pattern Recognition (pp. 71–75).
Tam, K. Y., & Kiang, M. Y. (1992). Managerial applications of neural networks: The case of bank failure predictions. Management Science, 38(7), 926–947.
Toussaint, G. T. (1971). Note on optimal selection of independent binary-valued features for pattern recognition. IEEE Transactions on Information Theory, IT-17, 618.
Vanthienen, J., & Wets, G. (1994). From decision tables to expert system shells. Data and Knowledge Engineering, 13(3), 265–282.
Wolpert, D. H., & Macready, W. G. (1995). No free lunch theorems for search. Technical Report SFI-TR-05-010, Santa Fe Institute, Santa Fe, New Mexico.
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.
Wrobel, S. (1997). An algorithm for multi-relational discovery of subgroups. Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery. Berlin: Springer.
Yang, D.-S., Rendell, L., & Blix, G. (1991). A scheme for feature construction and a comparison of empirical methods. Proceedings of the Twelfth IJCAI (pp. 699–704).
Yang, J., & Honavar, V. (1997). Feature subset selection using a genetic algorithm. Proceedings of the Genetic Programming Conference, GP'97 (pp. 380–385).
Yuan, H., Tseng, S.-S., Gangshan, W., & Fuyan, Z. (1999). A two-phase feature selection method using both filter and wrapper. Proceedings of the IEEE Conference on Systems, Man, and Cybernetics (Vol. 2, pp. 132–136).
