SVM - Report
Abstract
The assessment of risk of default on credit is important for financial institutions.
Logistic regression and discriminant analysis are techniques traditionally used in
credit scoring for determining likelihood to default based on consumer application
and credit reference agency data. We test support vector machines against these
traditional methods on a large credit card database. We find that they are competitive
and can be used as the basis of a feature selection method to discover those features
that are most significant in determining risk of default.
1. Introduction
Credit scoring is the set of decision models and techniques that aid lenders in granting
consumer credit by assessing the risk of lending to different consumers. It is an
important area of research that enables financial institutions to develop lending
strategies to optimise profit. Additionally, bad debt is a growing social problem that
could be tackled partly by better informed lending enabled by more accurate credit
scoring models. A range of different data mining and statistical techniques have been
used since the 1930s, when numerical score cards were first introduced by mail-order
companies (Thomas et al. 2002, Section 1.3). It is now common for financial
institutions to use statistical methods such as logistic regression (LR) and linear
discriminant analysis (LDA) to build credit scoring models. Potential borrowers are classified according to their probability of default on a loan, based on application and credit reference agency data collected about them. Such models are used by setting a threshold on the estimated probability of default and rejecting loan applications that exceed it.
In this paper, our general framework is to compare the performance of SVM against
several other well-known algorithms: LR, LDA and k-nearest neighbours (kNN). We
extend the work on assessing SVM for credit scoring in several ways.
1. SVM is tested against a much larger database of credit card customers than
has been considered in the literature so far. We restrict our attention to those accounts opened in the same three-month period. Hand (2006) points out that
for many classification problems, the data suffers from population drift, in that
the class distributions shift over time. This is particularly true of credit data
with customer behaviour changing over time due to economic circumstances
or changes in product development and marketing. For this reason a clearer
model can be developed if it is based on data taken from a narrow time period
within which there is likely to be less variability in these circumstances.
Typically, credit data is not easily separable by any decision surface. This is natural
since the data at time of application cannot capture the complexities in each individual
customer’s life that may lead to default. The application data can at best only provide
an indication of default. Consequently, it is usual for the rates of misclassification on
credit data to be between around 20% and 30% (eg see Baesens et al, 2003). This
would be considered a poor result for many other classification problems but is
typical of credit data. The poor separability of the credit data is illustrated in Figure 1.
The good cases tend to cluster towards the bottom-right and the bad towards the top-
left, but this is only a very general trend and there is no clear separation.
Figure 1. Partial least squares was used to transform data for 100 good cases (black) and 100 bad cases (white), selected randomly from the data, into two factors given as the x and y axes of the graph.
3. Methods
The SVM is a relatively new learning algorithm that can be used for classification.
We compare its performance against three older statistical classification methods: LR,
LDA and kNN. All algorithms are described briefly below for a sequence of n training examples (x_1, y_1), …, (x_n, y_n) with feature vectors x_i and class labels y_i. For credit scoring, the class label is either bad or good.
3.1. Support Vector Machine (SVM) classifier
SVM separates binary classified data by a hyperplane such that the margin width
between the hyperplane and the examples is maximized. Statistical learning theory
shows that maximizing the margin width reduces the complexity of the model, consequently reducing the expected generalization error. For problems where data is
not separable by a hyperplane, typical of most real-world classification problems, a
soft margin is used. In this case, training examples are allowed some slack to be on
the wrong side of the margin. However, they accrue a penalty proportional to how far
they are on the wrong side. The sum of the penalties is minimized whilst maximizing
the margin width. A parameter C controls the relative cost of each goal in the overall
optimization problem. The SVM optimization problem can be expressed
algebraically as a dual form quadratic programming problem.
Let y_i ∈ {−1, +1} for all i = 1, …, n. Then the SVM optimization problem is

    max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j k(x_i, x_j)

subject to constraints

    0 ≤ α_i ≤ C for all i = 1, …, n   and   Σ_{i=1}^n y_i α_i = 0

where α_i is the Lagrange multiplier for training example i. The kernel function k can be used to implement non-linear models of the data. For this paper, we consider three commonly used kernels.
Linear model: k(x_i, x_j) = x_i · x_j
Polynomial model: k(x_i, x_j) = (x_i · x_j + 1)^d, for a chosen degree d
Gaussian RBF model: k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)), for a chosen width σ
Using non-linear kernels means that it is not feasible to extract an explicit scorecard, although predictions can still be made with them.
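As a concrete illustration, the three kernel functions can be written in a few lines of NumPy. The feature vectors and the degree and width settings below are invented for the sketch; they are not values used in the paper's experiments.

```python
import numpy as np

# Two hypothetical standardized feature vectors (invented for illustration).
xi = np.array([1.0, 0.5, -0.2])
xj = np.array([0.3, -1.0, 0.8])

def linear_kernel(a, b):
    # k(x_i, x_j) = x_i . x_j
    return a @ b

def poly_kernel(a, b, d=3):
    # k(x_i, x_j) = (x_i . x_j + 1)^d; degree d is an assumed setting.
    return (a @ b + 1) ** d

def rbf_kernel(a, b, sigma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)); sigma is assumed.
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

print(linear_kernel(xi, xj), poly_kernel(xi, xj), rbf_kernel(xi, xj))
```

Note that the RBF kernel always lies in (0, 1], equalling 1 only when the two vectors coincide, whereas the linear kernel is unbounded.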
The vector of Lagrange multipliers α is sufficient to define the output decision rule. A classification prediction is made on a new example x as

    ŷ = sgn( Σ_{i=1}^n α_i y_i k(x_i, x) + b )

where b is a threshold term computed as

    b = y_j − Σ_{i=1}^n α_i y_i k(x_i, x_j)   for any j ∈ {1, …, n} such that 0 < α_j < C.
Training examples are called “support vectors” (SVs) if they are on the margin or are
on the wrong side of the margin. This is because together they are sufficient to
“support” the optimal separating hyperplane, since only SVs have α_i > 0. It
follows that the decision rule can be expressed simply in terms of SVs. See Vapnik
(1998) and Cristianini and Shawe-Taylor (2000) for details about SVMs.
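The decision rule can be checked numerically. The sketch below uses scikit-learn (not the SVMlight implementation used in the paper) on invented toy data: `SVC` exposes the products α_i y_i as `dual_coef_` and the threshold b as `intercept_`, so summing α_i y_i k(x_i, x) over the support vectors and adding b should reproduce the fitted classifier's decision values.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Invented toy data standing in for standardized application features.
X = np.vstack([rng.normal(-1.0, 1.0, (40, 2)), rng.normal(1.0, 1.0, (40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# f(x) = sum_i alpha_i y_i k(x_i, x) + b, summed over support vectors only.
x_new = np.array([[0.2, -0.1]])
f = clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_
print(np.sign(f.ravel()), clf.predict(x_new))
```

Only the support vectors enter the sum, which is why the decision rule can be stored and evaluated without the rest of the training set.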
For LDA, examples are projected onto a direction w chosen to separate the projected class means relative to the within-class spread, where for each class y the projected mean and variance are

    m_y = (1/|C_y|) Σ_{i∈C_y} w · x_i,   s_y² = (1/|C_y|) Σ_{i∈C_y} (w · x_i − m_y)²,   where C_y = {i = 1, …, n | y_i = y}.
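The role of the projected class statistics m_y and s_y² can be illustrated with Fisher's criterion, which scores a candidate direction w by the separation of the projected class means relative to the projected class variances. The data and direction below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented two-class data: label -1 for bad cases, +1 for good cases.
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

w = np.array([1.0, 1.0])          # a candidate projection direction
proj = X @ w                      # w . x_i for every example

m, s2 = {}, {}
for label in (-1, +1):
    p = proj[y == label]          # projections of the examples in class C_y
    m[label], s2[label] = p.mean(), p.var()   # m_y and s_y^2

# Fisher criterion: between-class separation over within-class spread.
J = (m[+1] - m[-1]) ** 2 / (s2[+1] + s2[-1])
print(J)
```

A larger J means the two classes are better separated along w; LDA chooses the w that maximizes this ratio.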
For kNN, a new example is classified by the majority class among its k nearest neighbours taken from the training set. The probability of the example belonging to class y is estimated as p̂ = k_y / k, where k_y is the number of those neighbours belonging to class y (Hand 1981, Section 2.4). We use the usual Euclidean distance measure to determine the neighbourhood of an example. Henley and Hand (1997) use kNN for credit scoring and compare it with other methods including LR.
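The estimate p̂ = k_y / k can be sketched directly in NumPy. The training data, query point and value of k below are invented for illustration and are far smaller than the paper's data set and its k = 2000.

```python
import numpy as np

rng = np.random.default_rng(2)
# Invented training data: class 0 = bad, class 1 = good.
X_train = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
                     rng.normal(1.0, 1.0, (100, 2))])
y_train = np.array([0] * 100 + [1] * 100)

def knn_prob(x, k=15):
    """Estimate P(good | x) as k_y / k over the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of k neighbours
    return y_train[nearest].mean()                # fraction in class "good"

p_good = knn_prob(np.array([1.0, 1.0]))
print(p_good)
```

For a query point deep inside the "good" cluster, nearly all of the k neighbours belong to that class, so the estimate approaches 1.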
Error rates are reported on the test set as the proportion of test examples wrongly
classified. It is possible to set a threshold term for the decision rule output by each
algorithm to control the distribution of cases classified as good or bad. For example,
in LR, we set a threshold t and classify all examples x with P(y = 1 | x) < t as good (ŷ = 0). Otherwise x is classified as a bad case. The threshold setting depends on a
prior assumption of the relative cost of misclassifying good or bad cases. For
example, we expect that a bad case misclassified as good, and so given a loan, would
yield a greater loss – ie the loss of a substantial part of the loan value – than a good
case misclassified as bad, leading to a loan not being made and the subsequent loss of
interest payments. However, it is not reasonable to assume this relative cost for
assessment. Also, using error rates makes it difficult to compare algorithms with
different threshold terms that would lead to different distributions of misclassification
of good and bad cases. Therefore it is usual to measure performance with a receiver
operating characteristic (ROC) curve, which plots sensitivity (true positive rate) against 1 − specificity (false positive rate) for the full range of possible threshold
values. This is a typical performance measure for credit scoring (eg Engelmann et al.
2003, Baesens et al. 2003). The area under the ROC curve (AUC) is used as a single
summary statistic for measuring performance and comparing algorithms (DeLong et
al. 1988). Note that a ROC curve is constructed for SVM by varying the threshold
term b. Reducing this threshold will increase the number of cases classified as bad (ŷ = +1).
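A short sketch of the ROC and AUC computation, using scikit-learn on invented labels and scores: for the eight cases below, 15 of the 16 (good, bad) score pairs are ranked correctly, so the AUC comes out as 15/16.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Invented scores: higher score means the model leans towards "bad" (class 1).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.3, 0.8, 0.7, 0.4, 0.9])

# One (fpr, tpr) point per candidate threshold, swept over all score values.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(auc)
```

The AUC equals the probability that a randomly chosen bad case receives a higher score than a randomly chosen good case, which is what makes it threshold-free.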
Guyon et al. (2002) propose using the square of the weights from the hyperplane
generated by SVM as a feature selection criterion. They show that this criterion minimizes generalization risk and apply the technique to cancer classification. They use a recursive procedure, removing a few features at a time. However, since for the credit
scoring problem, there are relatively few features to begin with, we do not need to
apply a recursive procedure. We simply use the magnitude of weights on features as a
feature selection criterion. We set a threshold of 0.1 and all features with weights
greater than this will be selected as significant features. This threshold level is chosen
since we found it yields approximately the same number of features as the LR method
described above. Since the data is standardized, it is reasonable to directly compare
the magnitudes of weights on different features.
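The criterion can be sketched as follows, using scikit-learn on invented toy data with one informative feature and one pure-noise feature; the 0.1 cut-off is the threshold described above, while everything else is an assumption of the sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Invented data: feature 0 drives the class label, feature 1 is pure noise.
informative = rng.normal(0.0, 1.0, 400)
noise = rng.normal(0.0, 1.0, 400)
X = StandardScaler().fit_transform(np.column_stack([informative, noise]))
y = (informative > 0).astype(int)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
weights = np.abs(clf.coef_.ravel())   # |w| per feature, data standardized

# Keep features whose standardized weight magnitude exceeds the threshold.
selected = np.where(weights > 0.1)[0]
print(weights, selected)
```

Because the inputs are standardized, the weight magnitudes are on a comparable scale, and the informative feature receives a much larger weight than the noise feature.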
4. Results
Results are given in this section for pre-classification parameter tuning, algorithm
comparison and significant feature discovery using LR and SVM.
It is interesting that the non-linear kernels do not perform better than the simple linear
model. In particular, the polynomial kernel performs poorly. This may be because this
non-linear model is over-fitting the data. This is evident in the difference between the
relatively high training AUC and low test AUC. However, the results are not
sufficient to assert this conclusively.
The best results are achieved when large numbers of SVs are extracted. Over 50% of
training examples are SVs. This is due to the fact that credit data is not easily
separable by any decision surface as explained in Section 2, so many of the training
examples remain misclassified.
For kNN, test AUC was stable at over 0.760 for values of k between 500 and 4000. It
is usual for performance to be stable across a wide range of values of k for large
training sets (Olsson 2006). We choose the mid-range figure k=2000 as an optimal
value for the following comparative experiments.
The standard deviations are relatively low (less than 1% of mean AUC) indicating
that the measured performance is stable.
SVM with a linear or Gaussian model performs best yielding the highest AUC.
However, the differences in performance are small and are not significant. Schebesch
and Stecking (2005) reach a similar conclusion with their experiments.
The only algorithms that stand out as particularly poor are SVM with polynomial
kernel and kNN. As mentioned in Section 4.1, we suspect the polynomial kernel over-
fits the training data. The poor result with kNN corroborates the results given by
Baesens et al (2003).
Neither LR with interaction variables nor SVM with non-linear kernels give an
improvement over the simpler models. This indicates that the data is broadly linearly
separable. Gayler (2006) has argued that interaction variables are less stable than the
main effects and they would usually only be included in a model if the modeller has
prior belief in their relevance to credit scoring. Our results tend to support this view.
Figure 2 shows typical ROC curves taken from one experiment. It is clear that the
ROC curve for SVM, LR and LDA are all very similar. The only algorithm which
gives a distinctly poor ROC curve is kNN which is outperformed by SVM across the
whole range of the graph.
Figure 2. ROC curves for performance on test data, comparing the performance of linear SVM with LR, LDA and kNN. Panels: SVM (unbroken line) against LR, LDA and kNN (broken lines), respectively.
Error rates can be derived by setting a cut-off threshold for each model and predicting
those test cases with a score computed from the model below the cut-off as bads and
those above as goods. Error rates are given by comparing predicted against actual
classifications across the test set. However, error rates are not comparative since each
classifier will yield a different distribution of errors on good and bad cases. This is
why using AUC is a better comparative measure, since it measures predictive
performance across all possible chosen cut-off thresholds. Nevertheless, it is
interesting to review error rates on good and bad cases for SVM to ensure that they
represent an acceptable performance. The natural cut-off for SVM is the threshold
term b described in Section 3.1. For linear SVM the error rates for good and bad
cases in the test set are 27.4% and 29.6% respectively and for SVM with a Gaussian
RBF kernel the error rates for good and bad cases are 27.2% and 30.1%. These
outcomes are within the range of error rates we would expect from predicting default
in credit data, as we discussed in Section 2.
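For concreteness, per-class error rates at a fixed cut-off can be computed as in the sketch below. The scores, labels and cut-off are invented, and the resulting rates merely illustrate the calculation rather than reproduce the paper's figures.

```python
import numpy as np

# Invented model scores and true labels (1 = bad, 0 = good).
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([-1.2, -0.8, -0.3, 0.2, -0.5, 0.9, 0.4, -0.1, 1.1, 0.6])

cutoff = 0.0                        # stands in for a model's natural cut-off
y_pred = (scores > cutoff).astype(int)

good_err = np.mean(y_pred[y_true == 0] == 1)   # goods misclassified as bad
bad_err = np.mean(y_pred[y_true == 1] == 0)    # bads misclassified as good
print(good_err, bad_err)
```

Moving the cut-off trades one error rate against the other, which is exactly the trade-off the ROC curve summarizes across all cut-offs.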
The direction of SVM weights and LR coefficient estimates is the same and indicates
how each feature contributes to the risk of default. A positive value indicates higher
risk and a negative value a lower risk. For example, an applicant who has already
applied for credit several times (F6) will be more likely to default and a home owner
(F1) is less likely to default.
These results show that the two methods agree strongly on the most significant
features. The fact that two very different methods give the same results provides
further confidence that these features can be taken forward for use in credit scorecards
to determine the risk of default for individual applicants for credit. It shows that SVM
can be used successfully for feature selection in credit scoring.
5. Conclusions
SVMs are a relatively new technique for application to credit scoring. We test them
on a much larger credit data set than has been used in previous studies. We find that
SVMs are successful in comparison to established approaches to classifying credit
card customers who default. This corroborates the findings of previous researchers. In
addition, we find that, unlike many other learning tasks, a large number of support
vectors are required to achieve the best performance. This is due to the nature of the
credit data for which the available application data can only be broadly indicative of
default. Finally, we show that SVM can be used successfully as a feature selection
method to determine those application variables that can be used to most significantly
indicate the likelihood of default.
There are several further lines of investigation. Firstly, we discovered that the type of
product (F9 in Table 3) is an important indicator of default. It would be interesting to
build separate models for each product to determine how performance and significant
features vary between them. Secondly, we took data from just one three-month
period to avoid the problem of population drift (Hand 2006). It would be interesting to
see how models and performance change across time and how robust simple and
complex models are when tested against test sets drawn from later dates.
Acknowledgements
We used SVM light for this project which is available at
http://svmlight.joachims.org and is documented by Joachims (1999). This
research is funded through EPSRC grant EP/D505380/1.
References
Baesens B, van Gestel T, Viaene S, Stepanova M, Suykens J and Vanthienen J
(2003). Benchmarking state-of-the-art classification algorithms for credit scoring.
Journal of the Operational Research Society 54: 1082-1088.
Cristianini N and Shawe-Taylor J (2000). Support vector machines and other kernel-
based learning methods. Cambridge University Press.
Huang C-L, Chen M-C, Wang C-J (2007). Credit scoring with a data mining
approach based on support vector machines. Expert Systems with Applications
33(4):847-856
Huang Z, Chen H, Hsu C-J, Chen W-H, Wu S (2004). Credit rating analysis with
support vector machines and neural networks: a market comparative study. Decision
Support Systems (Special issue: Data mining for financial decision making)
37(4):543-558.
Lee Y-C (2007). Application of support vector machines to corporate credit rating
prediction. Expert Systems with Applications 33(1):67-74.
Li ST, Shiue W, Huang MH (2006). The evaluation of consumer loans using support
vector machines. Expert Systems with Applications 30(4):772-782.
Schebesch K B and Stecking R (2005). Support vector machines for classifying and
describing credit applicants: detecting typical and critical regions. Journal of the
Operational Research Society 56:1082-1088.
Thomas LC, Edelman DB and Crook JN (2002). Credit Scoring and its Applications.
SIAM Monographs on Mathematical Modeling and Computation. SIAM: Philadelphia, USA.
Van Gestel T, Baesens B, Suykens JAK, Van den Poel D, Baestaens D, Willekens M
(2006). Bayesian kernel based classification for financial distress detection. European
Journal of Operational Research 172: 979-1003.