Article history: Received 2 July 2008; Accepted 25 November 2008; Available online 6 December 2008.

Keywords: Support vector machines; Near infrared spectroscopy; Classification; Model visualization; Interpretation.

Abstract

Support vector machines (SVM) are learning algorithms that present good generalization performance and can model complex non-linear boundaries through the use of adapted kernel functions. They have been introduced recently in chemometrics and have proven to be powerful for NIR spectra classification. One of the major drawbacks of SVM, however, is that training the model requires optimization of the regularization and kernel meta-parameters in order to control the risk of overfitting and the complexity of the boundary. Furthermore, the interpretation of SVM models remains difficult, and these tools are therefore often considered as black box techniques. We propose a methodological approach to guide the choice of the SVM parameters, based on a grid search that minimizes the classification error rate but also relies on the visualization of the number of support vectors (SVs). We also demonstrate the interest of visualizing the SVs in principal component subspaces to go deeper into the interpretation of the trained SVM. The proposed methods are applied to two NIR datasets: the first is a slightly non-linear 2-class problem and the second a more complex 3-class task. The optimized SVM models are quite parsimonious, relying on 8 and 35 support vectors respectively, and good classification performance is obtained (classification rates of 98.9% and 91% on the respective test sets).

© 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2008.11.005

☆ Dedicated to Prof. Jean-Pierre Huvenne on the occasion of his retirement.
⁎ Corresponding author. Tel.: +33 320434748; fax: +33 320436755. E-mail address: [email protected] (O. Devos).
1. Introduction

NIR spectroscopy is widely used in the food [1] and pharmaceutical [2] industries for analysis and quality control. From the NIR spectrum, quantitative information can be obtained with regression models, or qualitative information with classification models. Many methods exist for sample classification from spectroscopic data. Very often the choice of the classification algorithm depends first on the structure of the data under study and is guided by the prediction performance obtained with the model. Support vector machines (SVM) are part of a new generation of learning algorithms used for classification and regression tasks [3,4]. SVM have been introduced in chemometrics only recently [5] and have been successfully applied to mid and near infrared classification tasks, such as material identification [6,7] and food discrimination [8–11]. In the case of classification, SVM simultaneously minimize the empirical classification error and maximize the inter-class geometric margin [12,13], leading to a unique solution. One of the major features of SVM models is that they can operate in a kernel-induced feature space, allowing non-linear modeling. Furthermore, good generalization performance is obtained even with relatively small datasets [3]. It has been reported that the classification models obtained are robust and less subject to the curse of dimensionality and over-fitting [14], properties that should be emphasized when dealing with spectroscopic data. The major difficulties with SVM are the optimization of the parameters and the lack of model interpretability. Tuning the SVM parameters is a critical step, and they are classically optimized using an exhaustive search algorithm. Furthermore, SVM models are quite difficult to interpret and are often used as "black box" methods.

We propose here an approach where the choice of the SVM meta-parameters is based on two simultaneous one-criterion grid search optimizations: the cross validation classification rate and the number of support vectors (SVs). To further interpret the classification model, it may be interesting to focus on these SVs, which are particular data points for the SVM model, in terms of their number and repartition. We exemplify these two points on two near infrared (NIR) spectroscopy datasets, the first presenting overlapping classes and the second a multi-class task.

2. Support vector machines

The theory of SVM has been extensively described in the literature [13,15]. Therefore, only a brief description of the concept of SVM in the framework of classification is given here.
Considering a binary classification problem, the objective is to predict, for all objects, their belonging to a class y ∈ {−1, +1} from m-dimensional input data represented by a vector written x = (x1, x2, …, xm), with xi denoting the i-th object of the training set. In the case of spectra, m represents the number of wavelengths. The class prediction first requires training on a data set containing the spectra corresponding to n objects or samples with known class, that is to say n {x, y} values.

2.1. Separable data

The idea of linear SVM is to search for the hyperplane which correctly separates the data while maximizing the shortest distances from this hyperplane to the closest training samples of each class (d+ for class {+1}, d− for class {−1}). The distance (d+ + d−) defines the margin associated with the separating hyperplane. This hyperplane is unique and is called the optimal hyperplane. Moreover, maximizing the margin ensures good generalization performance. For linear SVM, the classification rule is thus that all the training objects must lie on the good side of the margin borders, as written in Eq. (1).

y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \qquad (1)

where w and b are respectively the normal vector and the bias of the hyperplane. In order to define the margin hyperplanes, the points to consider are the ones which respect the equality in Eq. (1), as these points lie exactly on one of the margin borders. The distance from these points to the optimal hyperplane (w · x + b = 0) is therefore d_+ = d_- = 1/\|\mathbf{w}\|, and the margin width is thus equal to 2/\|\mathbf{w}\|. The optimization of the margin width is obtained by solving the constrained quadratic optimization problem presented in Eq. (2).

\min_{\mathbf{w},\,b} \ \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0 \qquad (2)

According to optimization theory [16], the Lagrangian dual formulation, which expresses the importance of each example in the training set, can be used to solve this problem. The implementation of the dual form is more efficient and leads to expressing the optimal hyperplane as a linear combination of the training observations (Eq. (3)).

f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = \sum_{i=1}^{n} y_i \alpha_i \, \mathbf{x}_i \cdot \mathbf{x} + b \quad \text{with} \quad 0 \leq \alpha_i \qquad (3)

The Lagrangian multipliers αi obtained after training are associated with two types of constraints: the constraints for which αi > 0 are said to be active and the constraints for which αi = 0 are inactive. Active constraints correspond to objects whose distance to the optimal hyperplane is exactly equal to half the margin. These objects are called support vectors (SVs).
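As an illustration of Eq. (3) and of the support vector concept, the short sketch below fits a linear SVM on toy two-class data and lists the objects with αi > 0. It uses Python with scikit-learn, whose SVC classifier is built on LIBSVM [21]; this is an illustrative sketch only, not the code used in this work, and the toy data merely stand in for spectra.

```python
# Illustrative sketch (not the authors' code): fit a linear SVM and list
# the training objects that become support vectors (alpha_i > 0, Eq. (3)).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy 2-class data standing in for NIR spectra (n objects x m variables).
X = np.vstack([rng.normal(-1.0, 0.5, size=(20, 2)),
               rng.normal(+1.0, 0.5, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

model = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of SVs:", len(model.support_))          # objects with alpha_i > 0
print("indices of SVs:", model.support_)
print("y_i * alpha_i of the SVs:", model.dual_coef_)  # signed Lagrange multipliers
```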
2.2. Non separable data

In the case of non separable data, the linear SVM defined previously must be adapted to tolerate errors for some objects i. Their total amount is accounted for by the introduction of slack variables ξi [17] and should be minimized. Each ξi corresponds to the distance between the object i and the corresponding margin hyperplane; it is equal to 0 when sample i is correctly assigned. Eq. (2) can be generalized to the non separable case, as presented in Eq. (4).

\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad \xi_i + y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0 \ \text{and} \ \xi_i \geq 0 \qquad (4)

The parameter C is a regularization meta-parameter that balances the penalization of errors and should be optimized beforehand. More precisely, it controls the tradeoff between two conflicting objectives: when C is small, margin maximization is emphasized, whereas when C is large, error minimization is predominant. The classifier function in the dual form is close to the one obtained in the linearly separable case (Eq. (3)) and is given in Eq. (5).

f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b = \sum_{i=1}^{n} y_i \alpha_i \, \mathbf{x}_i \cdot \mathbf{x} + b \quad \text{with} \quad 0 \leq \alpha_i \leq C \qquad (5)

It should be noticed that a new constraint is added, as the αi values are now limited by the upper bound C, and the SVs with αi = C are the ones on the bad side of the margin border.

2.3. Non-linear classification

The SVM classification methodology can be extended to non-linear classification. For this purpose the data are first projected into a high dimensional feature space by means of a mapping function ϕ. A linear SVM is then applied in the feature space (Eq. (6)), where ideally the data can be linearly separated, which corresponds to an implicit non-linear boundary in the input space.

f(\mathbf{x}) = f(\phi(\mathbf{x})) = \sum_{i=1}^{n} y_i \alpha_i \, \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}) + b \quad \text{with} \quad 0 \leq \alpha_i \leq C \qquad (6)

The dot product ϕ(xi) · ϕ(x) is often replaced by a so-called kernel function K [15]. Among existing kernel functions, the radial basis function (RBF) kernel is the most widely used, as almost any boundary shape can be obtained with this kernel and good performance is generally obtained. The RBF kernel function is given in Eq. (7), where G is related to the kernel width meta-parameter.

K(\mathbf{x}_1, \mathbf{x}_2) = \phi(\mathbf{x}_1) \cdot \phi(\mathbf{x}_2) = \exp\!\left( \frac{-\|\mathbf{x}_1 - \mathbf{x}_2\|^2}{2\sigma^2} \right) = \exp\left( -G \, \|\mathbf{x}_1 - \mathbf{x}_2\|^2 \right) \qquad (7)
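Eq. (7) is simple to reproduce numerically. The sketch below evaluates the RBF kernel for two arbitrary vectors, assuming the usual correspondence G = 1/(2σ²) between the two forms of Eq. (7); it is an illustrative sketch only.

```python
# Minimal sketch of the RBF kernel of Eq. (7); assumes G = 1 / (2 * sigma**2).
import numpy as np

def rbf_kernel(x1, x2, G):
    """K(x1, x2) = exp(-G * ||x1 - x2||^2)."""
    return np.exp(-G * np.sum((x1 - x2) ** 2))

x1 = np.array([0.2, 0.4, 0.1])
x2 = np.array([0.3, 0.1, 0.0])
sigma = 0.5
G = 1.0 / (2.0 * sigma ** 2)
print(rbf_kernel(x1, x2, G))  # close to 1 for nearby points, tends to 0 for distant ones
```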
2.4. SVM parameters optimization

Optimization of the meta-parameters C (regularization parameter) and G (RBF kernel parameter) is the key step in SVM, as their combined values determine the boundary complexity and thus the classification performance. In order to perform this optimization, different methods such as grid search [18] or gradient descent algorithms [19,20] can be used. All these methods are usually based on the cross validation classification rate to evaluate the performance of the model and to minimize the risk of overfitting.

In this study an exhaustive grid search is used, as this method is easy to use and performs well (fast) when only two parameters require tuning. Furthermore, this approach makes it possible to directly visualize the effect of both parameters and provides useful information. To illustrate the effects of the SVM parameters, we propose in Fig. 1 a simple 2-dimensional classification task where direct visualization of the margin borders and the boundary is possible. From Fig. 1, it can be noticed that the data are almost linearly separable, which is confirmed by the satisfying classification performance of a linear discriminant analysis (LDA) (classification rate = 91%). Sixteen different SVM models are built on the basis of a small optimization grid with 16 points (C = 1, 10, 100, 1000 and G = 0.01, 0.1, 1, 10). The support vectors (SVs), boundaries and margins are presented in Fig. 1. The number of SVs and the leave one out cross validation classification rate are given in Table 1. Globally, when C is small the margin maximization is emphasized, leading to a large margin and a smooth boundary. The number of support vectors is therefore larger (remember that the SVs correspond to data points on the margin borders, inside the margin, or misclassified). When C is large the error minimization is emphasized, leading to a more complex boundary and a smaller margin. In this case the misclassified objects are more important for the SVM model and the boundary seems to be "attracted" by these objects, as seen clearly for C = 1000 and G = 0.01 for example (Fig. 1). For small values of G (large kernel bandwidth) the boundary is almost linear. It can be noticed that when both parameters are small (C = 1 and G = 0.01) the boundary is comparable to the LDA classification. As the value of G increases, the boundary becomes more complex. Regarding the number of SVs, it presents an optimum for intermediate G values, and when G is large (G = 10) the number of support vectors is very high and the value of C has very little influence on the margin and the boundary. When focusing on the results in Table 1, it can be seen that many parameter combinations lead to almost identical classification rates, and therefore the choice of the parameters might be difficult when the optimization is based only on this criterion.

Fig. 1. Boundaries (black lines) and class margin borders (gray lines) of SVM models with a RBF kernel and different values of the kernel (G) and regularization (C) meta-parameters. The support vectors (SVs) are marked in bold.

Table 1. Number of support vectors and leave one out cross validation classification rate (in %) for different SVM parameter settings.
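The grid search behind Fig. 1 and Table 1 can be sketched as follows. The snippet uses scikit-learn and a synthetic two-class dataset as stand-ins (an assumption for illustration; the authors cite LIBSVM [21] and the STPR toolbox [22]): for every (C, G) combination it records both the leave one out cross validation classification rate and the number of SVs of the model refit on all samples, the two criteria considered here.

```python
# Hedged sketch of the two-criterion grid inspection: for each (C, G) pair,
# record the leave one out CV classification rate and the number of SVs.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.datasets import make_moons

# Synthetic 2-D two-class data, a stand-in for the example of Fig. 1.
X, y = make_moons(n_samples=100, noise=0.25, random_state=1)

C_grid = [1, 10, 100, 1000]
G_grid = [0.01, 0.1, 1, 10]

for C in C_grid:
    for G in G_grid:
        svc = SVC(kernel="rbf", C=C, gamma=G)
        rate = cross_val_score(svc, X, y, cv=LeaveOneOut()).mean()  # CV classification rate
        n_sv = svc.fit(X, y).support_vectors_.shape[0]              # number of SVs
        print(f"C={C:<5} G={G:<5} CV rate={100 * rate:5.1f}%  SVs={n_sv}")
```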
3. Experimental

Two NIR datasets have been used in this study. The first dataset contains 288 NIR spectra of manufactured samples. The classification task here is to assign the products to two classes depending on their composition. This dataset has been split randomly into two subsets: 192 samples in the calibration set and 96 in the validation set. The NIR spectra were measured on a FT-NIR BOMEM MB160 DiffusIR instrument (ABB) in reflectance mode from 3800 to 7500 cm−1 at 7.7 cm−1 apparent resolution (480 data points). The raw spectra are presented in Fig. 2.

The second dataset contains 221 NIR spectra of different materials: 130 for the calibration subset and 91 for the validation subset. The purpose here is to determine a discrete physical property which can …
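The random calibration/validation split described above for the first dataset can be written, for instance, as below. This is an illustrative sketch with placeholder arrays; the actual spectra matrix (288 × 480) and class labels are those of dataset 1 and are not reproduced here.

```python
# Illustrative sketch of a random 192/96 calibration/validation split.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(288, 480)            # placeholder for the 288 NIR spectra (480 points each)
y = np.random.randint(0, 2, size=288)   # placeholder for the two composition classes

X_cal, X_val, y_cal, y_val = train_test_split(X, y, test_size=96, random_state=0)
print(X_cal.shape, X_val.shape)         # (192, 480) (96, 480)
```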
3.2. Software
4.1. Dataset 1
Table 2. Performance of LDA and SVM classifiers for dataset 1.
Fig. 5. Dataset 1 projected on the PC1/PC2 subspace with the two classes indicated by 'o' and 'x'. The SVs are marked in bold and correspond to the SVM model with RBF kernel and A) G = 0.55 and C = 3.31, B) G = 0.0078 and C = 2637.

4.2. Dataset 2

The classification problem for dataset 2 is more complex, as can be observed from the visualization of the data scores on different PC subspaces (Fig. 6), where overlapping classes are observed and no clear (linear) separation can be drawn. Among the different PC subspaces, the projection on PC2/PC3 will be used for interpretation purposes because better class separations are observed compared to …

Fig. 7. 20 × 20 optimization grid results for dataset 2 in terms of A) classification rate obtained for full cross validation, B) number of support vectors.
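The kind of visualization shown in Figs. 5 and 6 (projecting the samples onto principal component subspaces and highlighting the SVs of the trained model) can be sketched as follows. This is an assumed workflow using scikit-learn and matplotlib with placeholder data, not the authors' code; only the PC2/PC3 plane mentioned above is drawn.

```python
# Sketch of the SV visualization: PCA scores of the calibration spectra,
# with the support vectors of the trained SVM highlighted (cf. Figs. 5-6).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X = np.random.rand(130, 480)            # placeholder calibration spectra (dataset 2 size)
y = np.random.randint(0, 3, size=130)   # placeholder 3-class labels

svm = SVC(kernel="rbf", C=10, gamma=0.1).fit(X, y)   # C and gamma are arbitrary here
scores = PCA(n_components=3).fit_transform(X)        # PC scores of the spectra

# Scores on PC2/PC3, all samples colored by class, SVs circled in bold.
plt.scatter(scores[:, 1], scores[:, 2], c=y, marker="x")
plt.scatter(scores[svm.support_, 1], scores[svm.support_, 2],
            facecolors="none", edgecolors="k", s=90, label="SVs")
plt.xlabel("PC2"); plt.ylabel("PC3"); plt.legend(); plt.show()
```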
Table 3. Performance of SVM classifier for dataset 2.

Table 4. Lagrangian multipliers for the SVs marked in Fig. 8.
[16] M.S. Bazaraa, H.D. Sherali, C.M. Shetty, Nonlinear Programming: Theory and Algorithms, 2nd ed., Wiley, New York, 1992.
[17] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[18] C.-W. Hsu, C.-C. Chang, C.-J. Lin, A practical guide to support vector classification, 2007, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[19] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[20] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (2002) 131–159.
[21] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[22] V. Franc, V. Hlavac, Statistical Pattern Recognition Toolbox for Matlab, 2004. Software available at http://cmp.felk.cvut.cz/cmp/software/stprtool/index.html.
[23] C.-W. Hsu, C.-J. Lin, A comparison of methods for multi-class support vector machines, IEEE Trans. Neural Netw. 13 (2002) 415–425.