
Statistics and Computing 14: 199–222, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

A tutorial on support vector regression∗

ALEX J. SMOLA and BERNHARD SCHÖLKOPF
RSISE, Australian National University, Canberra 0200, Australia
[email protected]
Max-Planck-Institut für biologische Kybernetik, 72076 Tübingen, Germany
[email protected]

Received July 2002 and accepted November 2003

In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines
for function estimation. Furthermore, we include a summary of currently used algorithms for
training SV machines, covering both the quadratic (or convex) programming part and advanced
methods for dealing with large datasets. Finally, we mention some modifications and extensions
that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a
SV perspective.

Keywords: machine learning, support vector machines, regression estimation

1. Introduction

The purpose of this paper is twofold. It should serve as a self-contained introduction to Support Vector regression for readers new to this rapidly developing field of research.1 On the other hand, it attempts to give an overview of recent developments in the field.

To this end, we decided to organize the essay as follows. We start by giving a brief overview of the basic techniques in Sections 1, 2 and 3, plus a short summary with a number of figures and diagrams in Section 4. Section 5 reviews current algorithmic techniques used for actually implementing SV machines. This may be of most interest for practitioners. The following section covers more advanced topics such as extensions of the basic SV algorithm, connections between SV machines and regularization, and briefly mentions methods for carrying out model selection. We conclude with a discussion of open questions and problems and current directions of SV research. Most of the results presented in this review paper already have been published elsewhere, but the comprehensive presentation and some details are new.

1.1. Historic background

The SV algorithm is a nonlinear generalization of the Generalized Portrait algorithm developed in Russia in the sixties2 (Vapnik and Lerner 1963, Vapnik and Chervonenkis 1964). As such, it is firmly grounded in the framework of statistical learning theory, or VC theory, which has been developed over the last three decades by Vapnik and Chervonenkis (1974) and Vapnik (1982, 1995). In a nutshell, VC theory characterizes properties of learning machines which enable them to generalize well to unseen data.

In its present form, the SV machine was largely developed at AT&T Bell Laboratories by Vapnik and co-workers (Boser, Guyon and Vapnik 1992, Guyon, Boser and Vapnik 1993, Cortes and Vapnik 1995, Schölkopf, Burges and Vapnik 1995, 1996, Vapnik, Golowich and Smola 1997). Due to this industrial context, SV research has up to date had a sound orientation towards real-world applications. Initial work focused on OCR (optical character recognition). Within a short period of time, SV classifiers became competitive with the best available systems for both OCR and object recognition tasks (Schölkopf, Burges and Vapnik 1996, 1998a, Blanz et al. 1996, Schölkopf 1997). A comprehensive tutorial on SV classifiers has been published by Burges (1998). But also in regression and time series prediction applications, excellent performances were soon obtained (Müller et al. 1997, Drucker et al. 1997, Stitson et al. 1999, Mattera and Haykin 1999). A snapshot of the state of the art in SV learning was recently taken at the annual Neural Information Processing Systems conference (Schölkopf, Burges and Smola 1999a). SV learning has now evolved into an active area of research. Moreover, it is in the process of entering the standard methods toolbox of machine learning (Haykin 1998, Cherkassky and Mulier 1998, Hearst et al. 1998). Schölkopf and Smola (2002) contains a more in-depth overview of SVM regression. Additionally, Cristianini and Shawe-Taylor (2000) and Herbrich (2002) provide further details on kernels in the context of classification.

∗An extended version of this paper is available as NeuroCOLT Technical Report TR-98-030.

1.2. The basic idea

Suppose we are given training data {(x1, y1), ..., (xℓ, yℓ)} ⊂ X × R, where X denotes the space of the input patterns (e.g. X = R^d). These might be, for instance, exchange rates for some currency measured at subsequent days together with corresponding econometric indicators. In ε-SV regression (Vapnik 1995), our goal is to find a function f(x) that has at most ε deviation from the actually obtained targets yi for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but will not accept any deviation larger than this. This may be important if you want to be sure not to lose more than ε money when dealing with exchange rates, for instance.

For pedagogical reasons, we begin by describing the case of linear functions f, taking the form

    f(x) = ⟨w, x⟩ + b  with  w ∈ X, b ∈ R        (1)

where ⟨·, ·⟩ denotes the dot product in X. Flatness in the case of (1) means that one seeks a small w. One way to ensure this is to minimize the norm,3 i.e. ‖w‖² = ⟨w, w⟩. We can write this problem as a convex optimization problem:

    minimize    (1/2) ‖w‖²
    subject to  yi − ⟨w, xi⟩ − b ≤ ε        (2)
                ⟨w, xi⟩ + b − yi ≤ ε

The tacit assumption in (2) was that such a function f actually exists that approximates all pairs (xi, yi) with ε precision, or in other words, that the convex optimization problem is feasible. Sometimes, however, this may not be the case, or we also may want to allow for some errors. Analogously to the "soft margin" loss function (Bennett and Mangasarian 1992) which was used in SV machines by Cortes and Vapnik (1995), one can introduce slack variables ξi, ξi* to cope with otherwise infeasible constraints of the optimization problem (2). Hence we arrive at the formulation stated in Vapnik (1995):

    minimize    (1/2) ‖w‖² + C Σ_{i=1}^{ℓ} (ξi + ξi*)
    subject to  yi − ⟨w, xi⟩ − b ≤ ε + ξi        (3)
                ⟨w, xi⟩ + b − yi ≤ ε + ξi*
                ξi, ξi* ≥ 0

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. This corresponds to dealing with a so called ε-insensitive loss function |ξ|ε described by

    |ξ|ε := 0           if |ξ| ≤ ε
            |ξ| − ε     otherwise.        (4)

Fig. 1. The soft margin loss setting for a linear SVM (from Schölkopf and Smola, 2002)

Figure 1 depicts the situation graphically. Only the points outside the shaded region contribute to the cost insofar as the deviations are penalized in a linear fashion. It turns out that in most cases the optimization problem (3) can be solved more easily in its dual formulation.4 Moreover, as we will see in Section 2, the dual formulation provides the key for extending the SV machine to nonlinear functions. Hence we will use a standard dualization method utilizing Lagrange multipliers, as described in e.g. Fletcher (1989).
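For readers who prefer code, the ε-insensitive loss (4) can be written in a few lines of NumPy. This is only an illustrative sketch; the function name and toy values below are ours and do not appear in the original paper.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """eps-insensitive loss of Eq. (4): zero inside the eps-tube, linear outside."""
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - eps)

# Residuals of 0.05 and 0.3 with eps = 0.1 give losses 0.0 and 0.2.
print(eps_insensitive_loss(np.array([1.0, 1.0]), np.array([1.05, 1.3]), eps=0.1))
```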
1.3. Dual problem and quadratic programs

The key idea is to construct a Lagrange function from the objective function (it will be called the primal objective function in the rest of this article) and the corresponding constraints, by introducing a dual set of variables. It can be shown that this function has a saddle point with respect to the primal and dual variables at the solution. For details see e.g. Mangasarian (1969), McCormick (1983), and Vanderbei (1997) and the explanations in Section 5.2. We proceed as follows:

    L := (1/2) ‖w‖² + C Σ_{i=1}^{ℓ} (ξi + ξi*) − Σ_{i=1}^{ℓ} (ηi ξi + ηi* ξi*)
         − Σ_{i=1}^{ℓ} αi (ε + ξi − yi + ⟨w, xi⟩ + b)
         − Σ_{i=1}^{ℓ} αi* (ε + ξi* + yi − ⟨w, xi⟩ − b)        (5)

Here L is the Lagrangian and ηi, ηi*, αi, αi* are Lagrange multipliers. Hence the dual variables in (5) have to satisfy positivity constraints, i.e.

    αi^(*), ηi^(*) ≥ 0.        (6)

Note that by αi^(*), we refer to αi and αi*.

It follows from the saddle point condition that the partial derivatives of L with respect to the primal variables (w, b, ξi, ξi*) have to vanish for optimality:

    ∂_b L = Σ_{i=1}^{ℓ} (αi* − αi) = 0        (7)
    ∂_w L = w − Σ_{i=1}^{ℓ} (αi − αi*) xi = 0        (8)
    ∂_{ξi^(*)} L = C − αi^(*) − ηi^(*) = 0        (9)
Substituting (7), (8), and (9) into (5) yields the dual
optimization problem. In conjunction with an analogous analysis on αi∗ we
have max{−ε + yi − (w, xi ⟩| αi < C or αi∗ > 0} (16)
≤b≤
, min{−ε + yi − (w, xi ⟩| αi > 0 or αi∗ < C }
4
1 Σ
maximize ,,
− (αi − αi∗)(α j − α j )(xi , x j i
2 ⟩ If some α(∗) ∈ (0, C ) the inequalities become equalities. See
Σ
i, j =1 Σ
4 4 also Keerthi et al. (2001) for further means of choosing b.
, ∗
,
, (αi + αi ) yi (αi − (10)

Another way of computing b will be discussed in the
−ε i
+ i αi ) context of interior point optimization (cf. Section 5). There b
=1 =1
Σ4 turns out to be a by-product of the optimization process.
∗ ∗ Further consid-
subject to ](αi − αi ) = 0 and αi , αi ∈ [0, C
i erations shall be deferred to the corresponding section. See
=1
also Keerthi et al. (1999) for further methods to compute the
constant
In∗deriving (10) we already eliminated the dual variables ηi , offset.
ηi
A final note has to be made regarding the sparsity of the SV
through condition (9) which can be reformulated as ηi (∗) = C − expansion. From (12) it follows that only for | f (xi )− yi| ≥ ε
αi(∗). Equation (8) can be rewritten as follows the Lagrange multipliers may be nonzero, or in other words,
Σ4 4
Σ for all samples inside the ε–tube (i.e. the shaded region in
∗ ∗
w= (αi −αi )xi , thus f (x ) = (αi −αi )(xi , x ⟩+ b. Fig. 1) ∗
the α , α vanish: for | f (x ) − y | < ε the second factor in
(11)
i =1 i =1 i i i i

This is the so-called Support Vector expansion, i.e. w can be (12) is nonzero, hence αi , αi∗ has to be zero such that the KKT
completely described as a linear combination of the training conditions are satisfied. Therefore we have a sparse expansion
patterns xi . In a sense, the complexity of a function’s of w in terms of xi (i.e. we do not need all xi to describe w).
represen- tation by SVs is independent of the dimensionality The examples that come with nonvanishing coefficients are
of the input space X , and depends only on the number of SVs. called Support Vectors.
Moreover, note that the complete algorithm can be
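As an illustration of (11) and (16), the following sketch (ours, not from the paper) forms the linear SV expansion and brackets b with NumPy, assuming dual solutions alpha and alpha_star are already available from some QP solver.

```python
import numpy as np

def sv_expansion_and_b(X, y, alpha, alpha_star, C, eps):
    """Given dual solutions, form w via Eq. (11) and bracket b via Eq. (16)."""
    coef = alpha - alpha_star              # alpha_i - alpha_i^*
    w = X.T @ coef                         # linear case: w = sum_i coef_i * x_i
    wx = X @ w                             # <w, x_i> for all i
    lower = np.max((-eps + y - wx)[(alpha < C) | (alpha_star > 0)])
    upper = np.min((-eps + y - wx)[(alpha > 0) | (alpha_star < C)])
    b = 0.5 * (lower + upper)              # any value in [lower, upper] is admissible
    support = np.abs(coef) > 1e-8          # nonzero coefficients mark the SVs
    return w, b, support
```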
2. Kernels

2.1. Nonlinearity by preprocessing

The next step is to make the SV algorithm nonlinear. This, for instance, could be achieved by simply preprocessing the training patterns xi by a map Φ : X → F into some feature space F, as described in Aizerman, Braverman and Rozonoér (1964) and Nilsson (1965), and then applying the standard SV regression algorithm. Let us have a brief look at an example given in Vapnik (1995).

Example 1 (Quadratic features in R²). Consider the map Φ : R² → R³ with Φ(x1, x2) = (x1², √2 x1 x2, x2²). It is understood that the subscripts in this case refer to the components of x ∈ R². Training a linear SV machine on the preprocessed features would yield a quadratic function.

While this approach seems reasonable in the particular example above, it can easily become computationally infeasible for both polynomial features of higher order and higher dimensionality, as the number of different monomial features of degree p is (d+p−1 choose p), where d = dim(X). Typical values for OCR tasks (with good performance) (Schölkopf, Burges and Vapnik 1995, Schölkopf et al. 1997, Vapnik 1995) are p = 7, d = 28 · 28 = 784, corresponding to approximately 3.7 · 10^16 features.
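To make the combinatorial growth concrete, this one-line check (ours, not part of the paper) reproduces the feature count quoted above.

```python
from math import comb

d, p = 784, 7                      # 28 x 28 pixel images, degree-7 monomials
print(comb(d + p - 1, p))          # approximately 3.7e16 distinct monomial features
```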
2.2. Implicit mapping via kernels

Clearly this approach is not feasible and we have to find a computationally cheaper way. The key observation (Boser, Guyon and Vapnik 1992) is that for the feature map of example 2.1 we have

    ⟨(x1², √2 x1 x2, x2²), (x1'², √2 x1' x2', x2'²)⟩ = ⟨x, x'⟩².        (17)

As noted in the previous section, the SV algorithm only depends on dot products between patterns xi. Hence it suffices to know k(x, x') := ⟨Φ(x), Φ(x')⟩ rather than Φ explicitly, which allows us to restate the SV optimization problem:

    maximize    −(1/2) Σ_{i,j=1}^{ℓ} (αi − αi*)(αj − αj*) k(xi, xj)
                − ε Σ_{i=1}^{ℓ} (αi + αi*) + Σ_{i=1}^{ℓ} yi (αi − αi*)        (18)
    subject to  Σ_{i=1}^{ℓ} (αi − αi*) = 0  and  αi, αi* ∈ [0, C]

Likewise the expansion of f (11) may be written as

    w = Σ_{i=1}^{ℓ} (αi − αi*) Φ(xi)  and  f(x) = Σ_{i=1}^{ℓ} (αi − αi*) k(xi, x) + b.        (19)

The difference to the linear case is that w is no longer given explicitly. Also note that in the nonlinear setting, the optimization problem corresponds to finding the flattest function in feature space, not in input space.
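A quick numerical sanity check of (17) (our illustration, not from the paper): the explicit quadratic feature map of Example 1 and the kernel k(x, x') = ⟨x, x'⟩² agree.

```python
import numpy as np

def phi(x):
    """Explicit feature map of Example 1: R^2 -> R^3."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), np.dot(x, xp) ** 2)   # both give 1.0
```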
2.3. Conditions for kernels

The question that arises now is, which functions k(x, x') correspond to a dot product in some feature space F. The following theorem characterizes these functions (defined on X).

Theorem 2 (Mercer 1909). Suppose k ∈ L∞(X²) such that the integral operator T_k : L2(X) → L2(X),

    T_k f(·) := ∫_X k(·, x) f(x) dµ(x)        (20)

is positive (here µ denotes a measure on X with µ(X) finite and supp(µ) = X). Let ψ_j ∈ L2(X) be the eigenfunction of T_k associated with the eigenvalue λ_j ≠ 0 and normalized such that ‖ψ_j‖_{L2} = 1 and let ψ̄_j denote its complex conjugate. Then

1. (λ_j(T))_j ∈ ℓ1.
2. k(x, x') = Σ_{j∈N} λ_j ψ_j(x) ψ̄_j(x') holds for almost all (x, x'), where the series converges absolutely and uniformly for almost all (x, x').

Less formally speaking, this theorem means that if

    ∫_{X×X} k(x, x') f(x) f(x') dx dx' ≥ 0  for all f ∈ L2(X)        (21)

holds we can write k(x, x') as a dot product in some feature space. From this condition we can conclude some simple rules for compositions of kernels, which then also satisfy Mercer's condition (Schölkopf, Burges and Smola 1999a). In the following we will call such functions k admissible SV kernels.

Corollary 3 (Positive linear combinations of kernels). Denote by k1, k2 admissible SV kernels and c1, c2 ≥ 0, then

    k(x, x') := c1 k1(x, x') + c2 k2(x, x')        (22)

is an admissible kernel. This follows directly from (21) by virtue of the linearity of integrals.

More generally, one can show that the set of admissible kernels forms a convex cone, closed in the topology of pointwise convergence (Berg, Christensen and Ressel 1984).

Corollary 4 (Integrals of kernels). Let s(x, x') be a function on X × X such that

    k(x, x') := ∫_X s(x, z) s(x', z) dz        (23)

exists. Then k is an admissible SV kernel.

This can be shown directly from (21) and (23) by rearranging the order of integration. We now state a necessary and sufficient condition for translation invariant kernels, i.e. k(x, x') := k(x − x'), as derived in Smola, Schölkopf and Müller (1998c).

Theorem 5 (Products of kernels). Denote by k1 and k2 admissible SV kernels, then

    k(x, x') := k1(x, x') k2(x, x')        (24)

is an admissible kernel.

This can be seen by an application of the "expansion part" of Mercer's theorem to the kernels k1 and k2 and observing that each term in the double sum Σ_{i,j} λ_i^1 λ_j^2 ψ_i^1(x) ψ_i^1(x') ψ_j^2(x) ψ_j^2(x') gives rise to a positive coefficient when checking (21).

Theorem 6 (Smola, Schölkopf and Müller 1998c). A translation invariant kernel k(x, x') = k(x − x') is an admissible SV kernel if and only if the Fourier transform

    F[k](ω) = (2π)^{−d/2} ∫_X e^{−i⟨ω, x⟩} k(x) dx        (25)

is nonnegative.

We will give a proof and some additional explanations to this theorem in Section 7. It follows from interpolation theory (Micchelli 1986) and the theory of regularization networks (Girosi, Jones and Poggio 1993). For kernels of the dot-product type, i.e. k(x, x') = k(⟨x, x'⟩), there exist sufficient conditions for being admissible.

Theorem 7 (Burges 1999). Any kernel of dot-product type k(x, x') = k(⟨x, x'⟩) has to satisfy

    k(ξ) ≥ 0,  ∂_ξ k(ξ) ≥ 0  and  ∂_ξ k(ξ) + ξ ∂²_ξ k(ξ) ≥ 0        (26)

for any ξ ≥ 0 in order to be an admissible SV kernel.
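The Fourier criterion of Theorem 6 can be illustrated numerically (a sketch of ours, with parameters chosen arbitrarily): sampling the Gaussian kernel and taking a discrete Fourier transform yields a spectrum that is nonnegative up to floating-point noise.

```python
import numpy as np

# Illustration of Theorem 6: the Fourier transform of the Gaussian kernel
# k(x) = exp(-x^2 / (2 sigma^2)) is nonnegative, so the kernel is admissible.
sigma = 1.0
x = np.linspace(-20, 20, 4001)
k = np.exp(-x**2 / (2 * sigma**2))
spectrum = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(k))).real
print(spectrum.min() > -1e-8)      # True up to numerical round-off
```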
Note that the conditions in Theorem 7 are only necessary but not sufficient. The rules stated above can be useful tools for practitioners, both for checking whether a kernel is an admissible SV kernel and for actually constructing new kernels. The general case is given by the following theorem.

Theorem 8 (Schoenberg 1942). A kernel of dot-product type k(x, x') = k(⟨x, x'⟩) defined on an infinite dimensional Hilbert space, with a power series expansion

    k(t) = Σ_{n=0}^{∞} a_n t^n        (27)

is admissible if and only if all a_n ≥ 0.

A slightly weaker condition applies for finite dimensional spaces. For further details see Berg, Christensen and Ressel (1984) and Smola, Óvári and Williamson (2001).

2.4. Examples

In Schölkopf, Smola and Müller (1998b) it has been shown, by explicitly computing the mapping, that homogeneous polynomial kernels k with p ∈ N and

    k(x, x') = ⟨x, x'⟩^p        (28)

are suitable SV kernels (cf. Poggio 1975). From this observation one can conclude immediately (Boser, Guyon and Vapnik 1992, Vapnik 1995) that kernels of the type

    k(x, x') = (⟨x, x'⟩ + c)^p        (29)

i.e. inhomogeneous polynomial kernels with p ∈ N, c ≥ 0, are admissible, too: rewrite k as a sum of homogeneous kernels and apply Corollary 3. Another kernel, that might seem appealing due to its resemblance to Neural Networks, is the hyperbolic tangent kernel

    k(x, x') = tanh(ϑ + κ⟨x, x'⟩).        (30)

By applying Theorem 8 one can check that this kernel does not actually satisfy Mercer's condition (Ovari 2000). Curiously, the kernel has been successfully used in practice; cf. Schölkopf (1997) for a discussion of the reasons.

Translation invariant kernels k(x, x') = k(x − x') are quite widespread. It was shown in Aizerman, Braverman and Rozonoér (1964), Micchelli (1986) and Boser, Guyon and Vapnik (1992) that

    k(x, x') = e^{−‖x − x'‖² / (2σ²)}        (31)

is an admissible SV kernel. Moreover one can show (Smola 1996, Vapnik, Golowich and Smola 1997) that (1_X denotes the indicator function on the set X and ⊗ the convolution operation)

    k(x, x') = B_{2n+1}(‖x − x'‖)  with  B_k := ⊗_{i=1}^{k} 1_{[−1/2, 1/2]}        (32)

B-splines of order 2n + 1, defined by the (2n + 1)-fold convolution of the unit interval, are also admissible. We shall postpone further considerations to Section 7 where the connection to regularization operators will be pointed out in more detail.
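For reference, the kernels (28)–(31) look as follows in code (a minimal sketch of ours; the hyperparameter names and default values are assumptions, not part of the paper).

```python
import numpy as np

def poly_kernel(x, xp, p=3):                          # Eq. (28): homogeneous polynomial
    return np.dot(x, xp) ** p

def inhomogeneous_poly_kernel(x, xp, p=3, c=1.0):     # Eq. (29)
    return (np.dot(x, xp) + c) ** p

def tanh_kernel(x, xp, theta=0.0, kappa=1.0):         # Eq. (30): not Mercer-admissible
    return np.tanh(theta + kappa * np.dot(x, xp))

def gaussian_kernel(x, xp, sigma=1.0):                # Eq. (31): translation invariant
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
```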
3. Cost functions

So far the SV algorithm for regression may seem rather strange and hardly related to other existing methods of function estimation (e.g. Huber 1981, Stone 1985, Härdle 1990, Hastie and Tibshirani 1990, Wahba 1990). However, once cast into a more standard mathematical notation, we will observe the connections to previous work. For the sake of simplicity we will, again, only consider the linear case, as extensions to the nonlinear one are straightforward by using the kernel method described in the previous chapter.

3.1. The risk functional

Let us for a moment go back to the case of Section 1.2. There, we had some training data X := {(x1, y1), ..., (xℓ, yℓ)} ⊂ X × R. We will assume now that this training set has been drawn iid (independent and identically distributed) from some probability distribution P(x, y). Our goal will be to find a function f minimizing the expected risk (cf. Vapnik 1982)

    R[f] = ∫ c(x, y, f(x)) dP(x, y)        (33)

(c(x, y, f(x)) denotes a cost function determining how we will penalize estimation errors) based on the empirical data X. Given that we do not know the distribution P(x, y) we can only use X for estimating a function f that minimizes R[f]. A possible approximation consists in replacing the integration by the empirical estimate, to get the so called empirical risk functional

    Remp[f] := (1/ℓ) Σ_{i=1}^{ℓ} c(xi, yi, f(xi)).        (34)

A first attempt would be to find the empirical risk minimizer f0 := argmin_{f∈H} Remp[f] for some function class H. However, if H is very rich, i.e. its "capacity" is very high, as for instance when dealing with few data in very high-dimensional spaces, this may not be a good idea, as it will lead to overfitting and thus bad generalization properties. Hence one should add a capacity control term, which in the SV case is ‖w‖², which leads to the regularized risk functional (Tikhonov and Arsenin 1977, Morozov 1984, Vapnik 1982)

    Rreg[f] := Remp[f] + (λ/2) ‖w‖²        (35)

where λ > 0 is a so called regularization constant. Many algorithms like regularization networks (Girosi, Jones and Poggio 1993) or neural networks with weight decay (e.g. Bishop 1995) minimize an expression similar to (35).
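To connect (34) and (35) with code: for a linear model f(x) = ⟨w, x⟩ + b the two functionals can be written as follows (a minimal sketch under the ε-insensitive loss; names and default parameter values are ours).

```python
import numpy as np

def empirical_risk(w, b, X, y, eps=0.1):
    """R_emp[f] of Eq. (34) with the eps-insensitive cost."""
    residual = np.abs(y - (X @ w + b))
    return np.mean(np.maximum(0.0, residual - eps))

def regularized_risk(w, b, X, y, lam=1e-2, eps=0.1):
    """R_reg[f] of Eq. (35): empirical risk plus the capacity term (lam/2)*||w||^2."""
    return empirical_risk(w, b, X, y, eps) + 0.5 * lam * np.dot(w, w)
```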
3.2. Maximum likelihood and density models

The standard setting in the SV case is, as already mentioned in Section 1.2, the ε-insensitive loss

    c(x, y, f(x)) = |y − f(x)|ε.        (36)

It is straightforward to show that minimizing (35) with the particular loss function of (36) is equivalent to minimizing (3), the only difference being that C = 1/(λℓ).

Loss functions such as |y − f(x)|ε^p with p > 1 may not be desirable, as the superlinear increase leads to a loss of the robustness properties of the estimator (Huber 1981): in those cases the derivative of the cost function grows without bound. For p < 1, on the other hand, c becomes nonconvex.

For the case of c(x, y, f(x)) = (y − f(x))² we recover the least mean squares fit approach, which, unlike the standard SV loss function, leads to a matrix inversion instead of a quadratic programming problem.

The question is which cost function should be used in (35). On the one hand we will want to avoid a very complicated function c as this may lead to difficult optimization problems. On the other hand one should use that particular cost function that suits the problem best. Moreover, under the assumption that the samples were generated by an underlying functional dependency plus additive noise, i.e. yi = ftrue(xi) + ξi with density p(ξ), the optimal cost function in a maximum likelihood sense is

    c(x, y, f(x)) = − log p(y − f(x)).        (37)

This can be seen as follows. The likelihood of an estimate

    Xf := {(x1, f(x1)), ..., (xℓ, f(xℓ))}        (38)

for additive noise and iid data is

    p(Xf | X) = Π_{i=1}^{ℓ} p(f(xi) | (xi, yi)) = Π_{i=1}^{ℓ} p(yi − f(xi)).        (39)

Maximizing P(Xf | X) is equivalent to minimizing − log P(Xf | X). By using (37) we get

    − log P(Xf | X) = Σ_{i=1}^{ℓ} c(xi, yi, f(xi)).        (40)

However, the cost function resulting from this reasoning might be nonconvex. In this case one would have to find a convex proxy in order to deal with the situation efficiently (i.e. to find an efficient implementation of the corresponding optimization problem).

If, on the other hand, we are given a specific cost function from a real world problem, one should try to find as close a proxy to this cost function as possible, as it is the performance wrt. this particular cost function that matters ultimately.
Table 1 contains an overview over some common density models and the corresponding loss functions as defined by (37).

Table 1. Common loss functions and corresponding density models

    Loss function            c(ξ)                                     Density model p(ξ)
    ε-insensitive            |ξ|ε                                     (1/(2(1+ε))) exp(−|ξ|ε)
    Laplacian                |ξ|                                      (1/2) exp(−|ξ|)
    Gaussian                 (1/2) ξ²                                 (1/√(2π)) exp(−ξ²/2)
    Huber's robust loss      (1/(2σ)) ξ²          if |ξ| ≤ σ          ∝ exp(−ξ²/(2σ))           if |ξ| ≤ σ
                             |ξ| − σ/2            otherwise           ∝ exp(σ/2 − |ξ|)          otherwise
    Polynomial               (1/p) |ξ|^p                              (p/(2Γ(1/p))) exp(−|ξ|^p)
    Piecewise polynomial     (1/(p σ^{p−1})) ξ^p  if |ξ| ≤ σ          ∝ exp(−ξ^p/(p σ^{p−1}))   if |ξ| ≤ σ
                             |ξ| − σ(p−1)/p       otherwise           ∝ exp(σ(p−1)/p − |ξ|)     otherwise

The only requirement we will impose on c(x, y, f(x)) in the following is that for fixed x and y we have convexity in f(x). This requirement is made, as we want to ensure the existence and uniqueness (for strict convexity) of a minimum of optimization problems (Fletcher 1989).

3.3. Solving the equations

For the sake of simplicity we will additionally assume c to be symmetric and to have (at most) two (for symmetry) discontinuities at ±ε, ε ≥ 0, in the first derivative, and to be zero in the interval [−ε, ε]. All loss functions from Table 1 belong to this class. Hence c will take on the following form:

    c(x, y, f(x)) = c̃(|y − f(x)|ε)        (41)

Note the similarity to Vapnik's ε-insensitive loss. It is rather straightforward to extend this special choice to more general convex cost functions. For nonzero cost functions in the interval [−ε, ε] use an additional pair of slack variables. Moreover we might choose different cost functions c̃i, c̃i* and different values of εi, εi* for each sample. At the expense of additional Lagrange multipliers in the dual formulation, additional discontinuities also can be taken care of. Analogously to (3) we arrive at a convex minimization problem (Smola and Schölkopf 1998a). To simplify notation we will stick to the one of (3) and use C instead of normalizing by λ and ℓ.
    minimize    (1/2) ‖w‖² + C Σ_{i=1}^{ℓ} (c̃(ξi) + c̃(ξi*))
    subject to  yi − ⟨w, xi⟩ − b ≤ ε + ξi        (42)
                ⟨w, xi⟩ + b − yi ≤ ε + ξi*
                ξi, ξi* ≥ 0

Again, by standard Lagrange multiplier techniques, exactly in the same manner as in the |·|ε case, one can compute the dual optimization problem (the main difference is that the slack variable terms c̃(ξi^(*)) now have nonvanishing derivatives). We will omit the indices i and *, where applicable, to avoid tedious notation. This yields

    maximize    −(1/2) Σ_{i,j=1}^{ℓ} (αi − αi*)(αj − αj*) ⟨xi, xj⟩
                + Σ_{i=1}^{ℓ} yi (αi − αi*) − ε Σ_{i=1}^{ℓ} (αi + αi*)
                + C Σ_{i=1}^{ℓ} (T(ξi) + T(ξi*))
    where       w = Σ_{i=1}^{ℓ} (αi − αi*) xi        (43)
                T(ξ) := c̃(ξ) − ξ ∂ξ c̃(ξ)
    subject to  Σ_{i=1}^{ℓ} (αi − αi*) = 0
                α ≤ C ∂ξ c̃(ξ)
                ξ = inf{ξ | C ∂ξ c̃ ≥ α}
                α, ξ ≥ 0

3.4. Examples

Let us consider the examples of Table 1. We will show explicitly for two examples how (43) can be further simplified to bring it into a form that is practically useful. In the ε-insensitive case, i.e. c̃(ξ) = |ξ|, we get

    T(ξ) = ξ − ξ · 1 = 0.        (44)

Moreover one can conclude from ∂ξ c̃(ξ) = 1 that

    ξ = inf{ξ | C ≥ α} = 0  and  α ∈ [0, C].        (45)

For the case of piecewise polynomial loss we have to distinguish two different cases: ξ ≤ σ and ξ > σ. In the first case we get

    T(ξ) = (1/(p σ^{p−1})) ξ^p − (1/σ^{p−1}) ξ^p = −((p−1)/p) σ^{1−p} ξ^p        (46)

and ξ = inf{ξ | C σ^{1−p} ξ^{p−1} ≥ α} = σ C^{−1/(p−1)} α^{1/(p−1)}, and thus

    T(ξ) = −((p−1)/p) σ C^{−p/(p−1)} α^{p/(p−1)}.        (47)

In the second case (ξ ≥ σ) we have

    T(ξ) = ξ − σ (p−1)/p − ξ = −σ (p−1)/p        (48)

and ξ = inf{ξ | C ≥ α} = σ, which, in turn, yields α ∈ [0, C]. Combining both cases we have

    α ∈ [0, C]  and  T(α) = −((p−1)/p) σ C^{−p/(p−1)} α^{p/(p−1)}.        (49)

Table 2 contains a summary of the various conditions on α and formulas for T(α) (strictly speaking T(ξ(α))) for different cost functions.5

Table 2. Terms of the convex optimization problem depending on the choice of the loss function

    Loss function           ε        α                CT(α)
    ε-insensitive           ε ≠ 0    α ∈ [0, C]       0
    Laplacian               ε = 0    α ∈ [0, C]       0
    Gaussian                ε = 0    α ∈ [0, ∞)       −(1/2) C^{−1} α²
    Huber's robust loss     ε = 0    α ∈ [0, C]       −(1/2) σ C^{−1} α²
    Polynomial              ε = 0    α ∈ [0, ∞)       −((p−1)/p) C^{−1/(p−1)} α^{p/(p−1)}
    Piecewise polynomial    ε = 0    α ∈ [0, C]       −((p−1)/p) σ C^{−1/(p−1)} α^{p/(p−1)}

Note that the maximum slope of c̃ determines the region of feasibility of α, i.e. s := sup_{ξ∈R+} ∂ξ c̃(ξ) < ∞ leads to compact intervals [0, Cs] for α. This means that the influence of a single pattern is bounded, leading to robust estimators (Huber 1972). One can also observe experimentally that the performance of a SV machine depends significantly on the cost function used (Müller et al. 1997, Smola, Schölkopf and Müller 1998b).

A cautionary remark is necessary regarding the use of cost functions other than the ε-insensitive one. Unless ε ≠ 0 we will lose the advantage of a sparse decomposition. This may be acceptable in the case of few data, but will render the prediction step extremely slow otherwise. Hence one will have to trade off a potential loss in prediction accuracy with faster predictions. Note, however, that also a reduced set algorithm like in Burges (1996), Burges and Schölkopf (1997) and Schölkopf et al. (1999b) or sparse decomposition techniques (Smola and Schölkopf 2000) could be applied to address this issue. In a Bayesian setting, Tipping (2000) has recently shown how an L2 cost function can be used without sacrificing sparsity.
4. The bigger picture

Before delving into algorithmic details of the implementation let us briefly review the basic properties of the SV algorithm for regression as described so far. Figure 2 contains a graphical overview over the different steps in the regression stage.

Fig. 2. Architecture of a regression machine constructed by the SV algorithm

The input pattern (for which a prediction is to be made) is mapped into feature space by a map Φ. Then dot products are computed with the images of the training patterns under the map Φ. This corresponds to evaluating kernel functions k(xi, x). Finally the dot products are added up using the weights νi = αi − αi*. This, plus the constant term b, yields the final prediction output. The process described here is very similar to regression in a neural network, with the difference that in the SV case the weights in the input layer are a subset of the training patterns.

Figure 3 demonstrates how the SV algorithm chooses the flattest function among those approximating the original data with a given precision. Although requiring flatness only in feature space, one can observe that the functions also are very flat in input space. This is due to the fact that kernels can be associated with flatness properties via regularization operators. This will be explained in more detail in Section 7.

Fig. 3. Left to right: approximation of the function sinc x with precisions ε = 0.1, 0.2, and 0.5. The solid top and bottom lines indicate the size of the ε-tube, the dotted line in between is the regression

Finally Fig. 4 shows the relation between approximation quality and sparsity of representation in the SV case. The lower the precision required for approximating the original data, the fewer SVs are needed to encode that. The non-SVs are redundant, i.e. even without these patterns in the training set, the SV machine would have constructed exactly the same function f. One might think that this could be an efficient way of data compression, namely by storing only the support patterns, from which the estimate can be reconstructed completely. However, this simple analogy turns out to fail in the case of high-dimensional data, and even more drastically in the presence of noise. In Vapnik, Golowich and Smola (1997) one can see that even for moderate approximation quality, the number of SVs can be considerably high, yielding rates worse than the Nyquist rate (Nyquist 1928, Shannon 1948).

Fig. 4. Left to right: regression (solid line), datapoints (small dots) and SVs (big dots) for an approximation with ε = 0.1, 0.2, and 0.5. Note the decrease in the number of SVs

5. Optimization algorithms

While there has been a large number of implementations of SV algorithms in the past years, we focus on a few algorithms which will be presented in greater detail. This selection is somewhat biased, as it contains these algorithms the authors are most familiar with. However, we think that this overview contains some of the most effective ones and will be useful for practitioners who would like to actually code a SV machine by themselves. But before doing so we will briefly cover major optimization packages and strategies.
5.1. Implementations

Most commercially available packages for quadratic programming can also be used to train SV machines. These are usually numerically very stable general purpose codes, with special enhancements for large sparse systems. While the latter is a feature that is not needed at all in SV problems (there the dot product matrix is dense and huge) they still can be used with good success.6

OSL: This package was written by IBM-Corporation (1992). It uses a two phase algorithm. The first step consists of solving a linear approximation of the QP problem by the simplex algorithm (Dantzig 1962). Next a related very simple QP problem is dealt with. When successive approximations are close enough together, the second subalgorithm, which permits a quadratic objective and converges very rapidly from a good starting value, is used. Recently an interior point algorithm was added to the software suite.

CPLEX by CPLEX-Optimization-Inc. (1994) instead uses a primal-dual logarithmic barrier algorithm (Megiddo 1989) with a predictor-corrector step (see e.g. Lustig, Marsten and Shanno 1992, Mehrotra and Sun 1992).

MINOS by the Stanford Optimization Laboratory (Murtagh and Saunders 1983) uses a reduced gradient algorithm in conjunction with a quasi-Newton algorithm. The constraints are handled by an active set strategy. Feasibility is maintained throughout the process. On the active constraint manifold, a quasi-Newton approximation is used.

MATLAB: Until recently the MATLAB QP optimizer delivered only agreeable, although below average, performance on classification tasks and was not all too useful for regression tasks (for problems much larger than 100 samples) due to the fact that one is effectively dealing with an optimization problem of size 2ℓ where at least half of the eigenvalues of the Hessian vanish. These problems seem to have been addressed in version 5.3 / R11. MATLAB now uses interior point codes.

LOQO by Vanderbei (1994) is another example of an interior point code. Section 5.3 discusses the underlying strategies in detail and shows how they can be adapted to SV algorithms.

Maximum margin perceptron by Kowalczyk (2000) is an algorithm specifically tailored to SVs. Unlike most other techniques it works directly in primal space and thus does not have to take the equality constraint on the Lagrange multipliers into account explicitly.

Iterative free set methods: The algorithm by Kaufman (Bunch, Kaufman and Parlett 1976, Bunch and Kaufman 1977, 1980, Drucker et al. 1997, Kaufman 1999) uses such a technique, starting with all variables on the boundary and adding them as the Karush–Kuhn–Tucker conditions become more violated. This approach has the advantage of not having to compute the full dot product matrix from the beginning. Instead it is evaluated on the fly, yielding a performance improvement in comparison to tackling the whole optimization problem at once. However, also other algorithms can be modified by subset selection techniques (see Section 5.5) to address this problem.
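As a concrete illustration of using a general-purpose QP package, the following sketch (ours; it assumes the cvxopt library, which is not mentioned in the paper) casts the dual (10) into the standard form (1/2) x'Px + q'x subject to Gx ≤ h, Ax = b and solves it.

```python
import numpy as np
from cvxopt import matrix, solvers

def svr_dual_qp(K, y, C=1.0, eps=0.1):
    """Solve the dual (10) with a generic QP solver.

    Variables are stacked as beta = [alpha, alpha_star]; (10) is a maximization,
    so its negative is minimized in the standard form 1/2 beta'P beta + q'beta.
    """
    ell = len(y)
    K = K + 1e-10 * np.eye(ell)                      # tiny ridge for numerical stability
    P = np.block([[K, -K], [-K, K]])
    q = eps * np.ones(2 * ell) + np.concatenate([-y, y])
    G = np.vstack([np.eye(2 * ell), -np.eye(2 * ell)])            # 0 <= beta <= C
    h = np.concatenate([C * np.ones(2 * ell), np.zeros(2 * ell)])
    A = np.concatenate([np.ones(ell), -np.ones(ell)]).reshape(1, -1)  # sum(alpha - alpha*) = 0
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(np.zeros(1)))
    beta = np.array(sol['x']).ravel()
    return beta[:ell], beta[ell:]                    # alpha, alpha_star
```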
5.2. Basic notions

Most algorithms rely on results from the duality theory in convex optimization. Although we already happened to mention some basic ideas in Section 1.2 we will, for the sake of convenience, briefly review without proof the core results. These are needed in particular to derive an interior point algorithm. For details and proofs see e.g. Fletcher (1989).

Uniqueness: Every local minimum of a convex constrained optimization problem is a global minimum. If the problem is strictly convex then the solution is unique. This means that SVs are not plagued with the problem of local minima as Neural Networks are.7

Lagrange function: The Lagrange function is given by the primal objective function minus the sum of all products between constraints and corresponding Lagrange multipliers (cf. e.g. Fletcher 1989, Bertsekas 1995). Optimization can be seen as minimization of the Lagrangian wrt. the primal variables and simultaneous maximization wrt. the Lagrange multipliers, i.e. dual variables. It has a saddle point at the solution. Usually the Lagrange function is only a theoretical device to derive the dual objective function (cf. Section 1.2).

Dual objective function: It is derived by minimizing the Lagrange function with respect to the primal variables and subsequent elimination of the latter. Hence it can be written solely in terms of the dual variables.

Duality gap: For both feasible primal and dual variables the primal objective function (of a convex minimization problem) is always greater or equal than the dual objective function. Since SVMs have only linear constraints the constraint qualifications of the strong duality theorem (Bazaraa, Sherali and Shetty 1993, Theorem 6.2.4) are satisfied and it follows that the gap vanishes at optimality. Thus the duality gap is a measure how close (in terms of the objective function) the current set of variables is to the solution.

Karush–Kuhn–Tucker (KKT) conditions: A set of primal and dual variables that is both feasible and satisfies the KKT conditions is the solution (i.e. constraint · dual variable = 0). The sum of the violated KKT terms determines exactly the size of the duality gap (that is, we simply compute the constraint · Lagrange multiplier part as done in (55)). This allows us to compute the latter quite easily.

A simple intuition is that for violated constraints the dual variable could be increased arbitrarily, thus rendering the Lagrange function arbitrarily large. This, however, is in contradiction to the saddlepoint property.
5.3. Interior point algorithms

In a nutshell the idea of an interior point algorithm is to compute the dual of the optimization problem (in our case the dual of Rreg[f]) and solve both primal and dual simultaneously. This is done by only gradually enforcing the KKT conditions to iteratively find a feasible solution, and to use the duality gap between primal and dual objective function to determine the quality of the current set of variables. The special flavour of algorithm we will describe is primal-dual path-following (Vanderbei 1994).

In order to avoid tedious notation we will consider the slightly more general problem and specialize the result to the SVM later. It is understood that unless stated otherwise, variables like α denote vectors and αi denotes its i-th component.

    minimize    (1/2) q(α) + ⟨c, α⟩        (50)
    subject to  Aα = b  and  l ≤ α ≤ u

with c, α, l, u ∈ R^n, A ∈ R^{n×m}, b ∈ R^m, the inequalities between vectors holding componentwise and q(α) being a convex function of α. Now we will add slack variables to get rid of all inequalities but the positivity constraints. This yields:

    minimize    (1/2) q(α) + ⟨c, α⟩        (51)
    subject to  Aα = b, α − g = l, α + t = u, g, t ≥ 0, α free

The dual of (51) is

    maximize    (1/2) (q(α) − ⟨∂q(α), α⟩) + ⟨b, y⟩ + ⟨l, z⟩ − ⟨u, s⟩        (52)
    subject to  (1/2) ∂q(α) + c − (Ay)ᵀ + s = z,  s, z ≥ 0, y free

Moreover we get the KKT conditions, namely

    gi zi = 0  and  si ti = 0  for all i ∈ [1 ... n].        (53)

A necessary and sufficient condition for the optimal solution is that the primal/dual variables satisfy both the feasibility conditions of (51) and (52) and the KKT conditions (53). We proceed to solve (51)–(53) iteratively. The details can be found in Appendix A.

5.4. Useful tricks

Before proceeding to further algorithms for quadratic optimization let us briefly mention some useful tricks that can be applied to all algorithms described subsequently and may have significant impact despite their simplicity. They are in part derived from ideas of the interior-point approach.

Training with different regularization parameters: For several reasons (model selection, controlling the number of support vectors, etc.) it may happen that one has to train a SV machine with different regularization parameters C, but otherwise rather identical settings. If the parameter Cnew = τ Cold is not too different it is advantageous to use the rescaled values of the Lagrange multipliers (i.e. αi, αi*) as a starting point for the new optimization problem. Rescaling is necessary to satisfy the modified constraints. One gets

    αnew = τ αold  and likewise  bnew = τ bold.        (54)

Assuming that the (dominant) convex part q(α) of the primal objective is quadratic, q scales with τ² whereas the linear part scales with τ. However, since the linear term dominates the objective function, the rescaled values are still a better starting point than α = 0. In practice a speedup of approximately 95% of the overall training time can be observed when using the sequential minimization algorithm, cf. (Smola 1998). A similar reasoning can be applied when retraining with the same regularization parameter but different (yet similar) width parameters of the kernel function. See Cristianini, Campbell and Shawe-Taylor (1998) for details thereon in a different context.

Monitoring convergence via the feasibility gap: In the case of both primal and dual feasible variables the following connection between primal and dual objective function holds:

    Dual Obj. = Primal Obj. − Σ_i (gi zi + si ti)        (55)

This can be seen immediately by the construction of the Lagrange function. In Regression Estimation (with the ε-insensitive loss function) one obtains for Σ_i (gi zi + si ti)

    Σ_i [ max(0, f(xi) − (yi + εi)) (C − αi*) − min(0, f(xi) − (yi + εi)) αi*
          + max(0, (yi − εi*) − f(xi)) (C − αi) − min(0, (yi − εi*) − f(xi)) αi ]        (56)

Thus convergence with respect to the point of the solution can be expressed in terms of the duality gap. An effective stopping rule is to require

    Σ_i (gi zi + si ti) / (|Primal Objective| + 1) ≤ εtol        (57)

for some precision εtol. This condition is much in the spirit of primal dual interior point path following algorithms, where convergence is measured in terms of the number of significant figures (which would be the decimal logarithm of (57)), a convention that will also be adopted in the subsequent parts of this exposition.
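For concreteness, here is a small NumPy rendering (ours) of the duality-gap terms (56) and the stopping rule (57); f_x denotes the current predictions f(xi) and a single ε is used for all samples.

```python
import numpy as np

def stopping_criterion(y, f_x, alpha, alpha_star, C, eps, primal_obj, eps_tol=1e-3):
    """Duality-gap terms of Eq. (56) and the stopping rule of Eq. (57)."""
    r = f_x - y
    gap = np.sum(np.maximum(0.0, r - eps) * (C - alpha_star)
                 - np.minimum(0.0, r - eps) * alpha_star
                 + np.maximum(0.0, -r - eps) * (C - alpha)
                 - np.minimum(0.0, -r - eps) * alpha)
    return gap / (abs(primal_obj) + 1.0) <= eps_tol
```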
5.5. Subset selection algorithms

The convex programming algorithms described so far can be used directly on moderately sized (up to 3000 samples) datasets without any further modifications. On large datasets, however, it is difficult, due to memory and cpu limitations, to compute the dot product matrix k(xi, xj) and keep it in memory. A simple calculation shows that for instance storing the dot product matrix of the NIST OCR database (60,000 samples) at single precision would consume 0.7 GBytes. A Cholesky decomposition thereof, which would additionally require roughly the same amount of memory and 64 Teraflops (counting multiplies and adds separately), seems unrealistic, at least at current processor speeds.

A first solution, which was introduced in Vapnik (1982), relies on the observation that the solution can be reconstructed from the SVs alone. Hence, if we knew the SV set beforehand, and it fitted into memory, then we could directly solve the reduced problem. The catch is that we do not know the SV set before solving the problem. The solution is to start with an arbitrary subset, a first chunk that fits into memory, train the SV algorithm on it, keep the SVs and fill the chunk up with data the current estimator would make errors on (i.e. data lying outside the ε-tube of the current regression). Then retrain the system and keep on iterating until after training all KKT-conditions are satisfied.

The basic chunking algorithm just postponed the underlying problem of dealing with large datasets whose dot-product matrix cannot be kept in memory: it will occur for larger training set sizes than originally, but it is not completely avoided. Hence the solution (Osuna, Freund and Girosi 1997) is to use only a subset of the variables as a working set and to optimize the problem with respect to them while freezing the other variables. This method is described in detail in Osuna, Freund and Girosi (1997), Joachims (1999) and Saunders et al. (1998) for the case of pattern recognition.8

An adaptation of these techniques to the case of regression with convex cost functions can be found in Appendix B. The basic structure of the method is described by Algorithm 1; a code sketch of the same loop is given below.

Algorithm 1: Basic structure of a working set algorithm

    Initialize αi, αi* = 0
    Choose arbitrary working set Sw
    repeat
        Compute coupling terms (linear and constant) for Sw (see Appendix A.3)
        Solve reduced optimization problem
        Choose new Sw from variables αi, αi* not satisfying the KKT conditions
    until working set Sw = ∅
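The following Python skeleton (ours, and deliberately simplified) mirrors the loop of Algorithm 1: variables outside the working set are kept frozen at zero, the working set only grows, the offset b is ignored, and being outside the ε-tube is used as a rough proxy for KKT violation. The routine solve_reduced is assumed to solve the dual restricted to the working set, e.g. a generic QP solver such as the one sketched in Section 5.1.

```python
import numpy as np

def working_set_svr(K, y, C, eps, solve_reduced, chunk=100, max_iter=50, tol=1e-3):
    """Simplified sketch of Algorithm 1 (working set / chunking)."""
    ell = len(y)
    alpha = np.zeros(ell)
    alpha_star = np.zeros(ell)
    work = np.arange(min(chunk, ell))               # arbitrary first chunk
    for _ in range(max_iter):
        a_s, as_s = solve_reduced(K[np.ix_(work, work)], y[work], C, eps)
        alpha[work], alpha_star[work] = a_s, as_s
        f = K @ (alpha - alpha_star)                # predictions without offset b
        violation = np.maximum(np.abs(y - f) - eps, 0.0)   # crude KKT proxy
        if violation.max() < tol:
            break
        worst = np.argsort(-violation)[:chunk]      # refill with the worst offenders
        work = np.unique(np.concatenate([work, worst]))
    return alpha, alpha_star
```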
5.6. Sequential minimal optimization

Recently an algorithm—Sequential Minimal Optimization (SMO)—was proposed (Platt 1999) that puts chunking to the extreme by iteratively selecting subsets only of size 2 and optimizing the target function with respect to them. It has been reported to have good convergence properties and it is easily implemented. The key point is that for a working set of 2 the optimization subproblem can be solved analytically without explicitly invoking a quadratic optimizer.

While readily derived for pattern recognition by Platt (1999), one simply has to mimic the original reasoning to obtain an extension to Regression Estimation. This is what will be done in Appendix C (the pseudocode can be found in Smola and Schölkopf (1998b)). The modifications consist of a pattern dependent regularization, convergence control via the number of significant figures, and a modified system of equations to solve the optimization problem in two variables for regression analytically.

Note that the reasoning only applies to SV regression with the ε-insensitive loss function—for most other convex cost functions an explicit solution of the restricted quadratic programming problem is impossible. Yet, one could derive an analogous non-quadratic convex optimization problem for general cost functions, but at the expense of having to solve it numerically.

The exposition proceeds as follows: first one has to derive the (modified) boundary conditions for the constrained 2-index (i, j) subproblem in regression, next one can proceed to solve the optimization problem analytically, and finally one has to check which parts of the selection rules have to be modified to make the approach work for regression. Since most of the content is fairly technical it has been relegated to Appendix C.

The main difference in implementations of SMO for regression can be found in the way the constant offset b is determined (Keerthi et al. 1999) and which criterion is used to select a new set of variables. We present one such strategy in Appendix C.3. However, since selection strategies are the focus of current research we recommend that readers interested in implementing the algorithm make sure they are aware of the most recent developments in this area.

Finally, we note that just as we presently describe a generalization of SMO to regression estimation, other learning problems can also benefit from the underlying ideas. Recently, a SMO algorithm for training novelty detection systems (i.e. one-class classification) has been proposed (Schölkopf et al. 2001).

6. Variations on a theme

There exists a large number of algorithmic modifications of the SV algorithm, to make it suitable for specific settings (inverse problems, semiparametric settings), different ways of measuring capacity and reductions to linear programming (convex combinations) and different ways of controlling capacity. We will mention some of the more popular ones.

6.1. Convex combinations and ℓ1-norms

All the algorithms presented so far involved convex, and at best, quadratic programming. Yet one might think of reducing the problem to a case where linear programming techniques can be applied. This can be done in a straightforward fashion (Mangasarian 1965, 1968, Weston et al. 1999, Smola, Schölkopf and Rätsch 1999) for both SV pattern recognition and regression. The key is to replace (35) by

    Rreg[f] := Remp[f] + λ ‖α‖₁        (58)

where ‖α‖₁ denotes the ℓ1 norm in coefficient space. Hence one uses the SV kernel expansion (11)

    f(x) = Σ_{i=1}^{ℓ} αi k(xi, x) + b

with a different way of controlling capacity by minimizing

    Rreg[f] = (1/ℓ) Σ_{i=1}^{ℓ} c(xi, yi, f(xi)) + λ Σ_{i=1}^{ℓ} |αi|.        (59)
For the ε-insensitive loss function this leads to a linear programming problem. In the other cases, however, the problem still stays a quadratic or general convex one, and therefore may not yield the desired computational advantage. Therefore we will limit ourselves to the derivation of the linear programming problem in the case of the |·|ε cost function. Reformulating (59) yields

    minimize    Σ_{i=1}^{ℓ} (αi + αi*) + C Σ_{i=1}^{ℓ} (ξi + ξi*)
    subject to  yi − Σ_{j=1}^{ℓ} (αj − αj*) k(xj, xi) − b ≤ ε + ξi
                Σ_{j=1}^{ℓ} (αj − αj*) k(xj, xi) + b − yi ≤ ε + ξi*
                αi, αi*, ξi, ξi* ≥ 0

Unlike in the classical SV case, the transformation into its dual does not give any improvement in the structure of the optimization problem. Hence it is best to minimize Rreg[f] directly, which can be achieved by a linear optimizer (e.g. Dantzig 1962, Lustig, Marsten and Shanno 1990, Vanderbei 1997).

In Weston et al. (1999) a similar variant of the linear SV approach is used to estimate densities on a line. One can show (Smola et al. 2000) that one may obtain bounds on the generalization error which exhibit even better rates (in terms of the entropy numbers) than the classical SV case (Williamson, Smola and Schölkopf 1998).
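As an illustration of the linear program above, the following sketch (ours, not from the paper) sets it up for SciPy's generic LP solver; the variable names and the choice of solver are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def lp_svr(K, y, C=1.0, eps=0.1):
    """Sketch of the linear-programming SV regression above.

    Variable vector: [alpha (l), alpha_star (l), xi (l), xi_star (l), b (1)].
    """
    l = len(y)
    c = np.concatenate([np.ones(2 * l), C * np.ones(2 * l), [0.0]])
    # y_i - sum_j (a_j - a*_j) K_ij - b <= eps + xi_i
    A1 = np.hstack([-K, K, -np.eye(l), np.zeros((l, l)), -np.ones((l, 1))])
    b1 = eps - y
    # sum_j (a_j - a*_j) K_ij + b - y_i <= eps + xi*_i
    A2 = np.hstack([K, -K, np.zeros((l, l)), -np.eye(l), np.ones((l, 1))])
    b2 = eps + y
    bounds = [(0, None)] * (4 * l) + [(None, None)]          # b is free
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.concatenate([b1, b2]), bounds=bounds)
    x = res.x
    return x[:l], x[l:2 * l], x[-1]                          # alpha, alpha_star, b
```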
6.2. Automatic tuning of the insensitivity tube

Besides standard model selection issues, i.e. how to specify the trade-off between empirical error and model capacity, there also exists the problem of an optimal choice of a cost function. In particular, for the ε-insensitive cost function we still have the problem of choosing an adequate parameter ε in order to achieve good performance with the SV machine.

Smola et al. (1998a) show the existence of a linear dependency between the noise level and the optimal ε-parameter for SV regression. However, this would require that we know something about the noise model. This knowledge is not available in general. Therefore, albeit providing theoretical insight, this finding by itself is not particularly useful in practice. Moreover, if we really knew the noise model, we most likely would not choose the ε-insensitive cost function but the corresponding maximum likelihood loss function instead.

There exists, however, a method to construct SV machines that automatically adjust ε and moreover also, at least asymptotically, have a predetermined fraction of sampling points as SVs (Schölkopf et al. 2000). We modify (35) such that ε becomes a variable of the optimization problem, including an extra term in the primal objective function which attempts to minimize ε. In other words

    minimize  Rν[f] := Remp[f] + (λ/2) ‖w‖² + ν ε        (60)

for some ν > 0. Hence (42) becomes (again carrying out the usual transformation between λ, ℓ and C)

    minimize    (1/2) ‖w‖² + C (Σ_{i=1}^{ℓ} (c̃(ξi) + c̃(ξi*)) + ℓ ν ε)        (61)
    subject to  yi − ⟨w, xi⟩ − b ≤ ε + ξi
                ⟨w, xi⟩ + b − yi ≤ ε + ξi*
                ξi, ξi* ≥ 0

We consider the standard |·|ε loss function. Computing the dual of (61) yields

    maximize    −(1/2) Σ_{i,j=1}^{ℓ} (αi − αi*)(αj − αj*) k(xi, xj) + Σ_{i=1}^{ℓ} yi (αi − αi*)        (62)
    subject to  Σ_{i=1}^{ℓ} (αi − αi*) = 0
                Σ_{i=1}^{ℓ} (αi + αi*) ≤ C ν ℓ
                αi, αi* ∈ [0, C]

Note that the optimization problem is thus very similar to the ε-SV one: the target function is even simpler (it is homogeneous), but there is an additional constraint. For information on how this affects the implementation see Chang and Lin (2001).

Besides having the advantage of being able to automatically determine ε, (62) also has another advantage. It can be used to pre-specify the number of SVs:

Theorem 9 (Schölkopf et al. 2000).
1. ν is an upper bound on the fraction of errors.
2. ν is a lower bound on the fraction of SVs.
3. Suppose the data has been generated iid from a distribution p(x, y) = p(x) p(y | x) with a continuous conditional distribution p(y | x). With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors.

Essentially, ν-SV regression improves upon ε-SV regression by allowing the tube width to adapt automatically to the data. What is kept fixed up to this point, however, is the shape of the tube. One can, however, go one step further and use parametric tube models with non-constant width, leading to almost identical optimization problems (Schölkopf et al. 2000).
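In practice, both the ε-SVR of Section 1 and the ν-SVR variant above are available in standard libraries. The following sketch uses scikit-learn (an assumption on our part; the library postdates the methods described here and is not discussed in the paper) to fit both on a noisy sinc function, mirroring the setting of Figs. 3 and 4.

```python
import numpy as np
from sklearn.svm import SVR, NuSVR

rng = np.random.RandomState(0)
X = np.linspace(-3, 3, 200)[:, None]
y = np.sinc(X).ravel() + 0.1 * rng.randn(200)

# eps-SVR: the tube width eps is fixed a priori.
eps_svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, y)
# nu-SVR: nu bounds the fraction of errors/SVs and eps adapts to the data.
nu_svr = NuSVR(kernel='rbf', C=1.0, nu=0.5).fit(X, y)

print(len(eps_svr.support_), len(nu_svr.support_))   # number of support vectors
```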
that the data were drawn iid from p(x, y) = p(x) p(y − f(x)) with p(y − f(x)) continuous. Then under the assumption of uniform convergence, the asymptotically optimal value of ν is

    ν = 1 − ∫_{−ε}^{ε} p(t) dt
    where ε := argmin_τ (p(−τ) + p(τ))^{−2} (1 − ∫_{−τ}^{τ} p(t) dt)    (63)

For polynomial noise models, i.e. densities of type exp(−|ξ|^p), one may compute the corresponding (asymptotically) optimal values of ν. They are given in Fig. 5. For further details see (Schölkopf et al. 2000, Smola 1998); an experimental validation has been given by Chalimourda, Schölkopf and Smola (2000).

Fig. 5. Optimal ν and ε for various degrees of polynomial additive noise.

We conclude this section by noting that ν-SV regression is related to the idea of trimmed estimators. One can show that the regression is not influenced if we perturb points lying outside the tube. Thus, the regression is essentially computed by discarding a certain fraction of outliers, specified by ν, and computing the regression estimate from the remaining points (Schölkopf et al. 2000).
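As a small numerical illustration of Remark 10 and Theorem 9 (our own addition, not part of the original text), the sketch below evaluates (63) for a unit-variance Gaussian noise density and feeds the resulting ν, roughly 0.54, into a ν-SVR. It assumes scipy and scikit-learn; the toy data and the helper name asymptotically_optimal_nu are purely illustrative.

```python
# Sketch only: numerically evaluate (63) for Gaussian noise and use the
# resulting nu in a nu-SVR.  Helper and data are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar
from sklearn.svm import NuSVR

def asymptotically_optimal_nu(pdf=norm.pdf, cdf=norm.cdf):
    # eps := argmin_tau (p(-tau)+p(tau))^(-2) * (1 - int_{-tau}^{tau} p(t) dt)
    def objective(tau):
        return (pdf(-tau) + pdf(tau)) ** (-2) * (1.0 - (cdf(tau) - cdf(-tau)))
    eps = minimize_scalar(objective, bounds=(1e-3, 10.0), method="bounded").x
    # nu = 1 - int_{-eps}^{eps} p(t) dt
    return 1.0 - (cdf(eps) - cdf(-eps)), eps

nu_opt, eps_opt = asymptotically_optimal_nu()
print(f"Gaussian noise: optimal nu ~ {nu_opt:.2f} (eps ~ {eps_opt:.2f})")

# Toy regression problem with Gaussian additive noise.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sinc(X).ravel() + rng.normal(scale=0.3, size=200)

model = NuSVR(nu=nu_opt, C=10.0, kernel="rbf", gamma=0.5).fit(X, y)
frac_sv = len(model.support_) / len(y)
print(f"fraction of SVs = {frac_sv:.2f} (lower-bounded by nu, cf. Theorem 9)")
```

The fraction of support vectors reported at the end should not fall below the chosen ν, in accordance with Theorem 9.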
7. Regularization

So far we were not concerned about the specific properties of the map Φ into feature space and used it only as a convenient trick to construct nonlinear regression functions. In some cases the map was just given implicitly by the kernel, hence the map itself and many of its properties have been neglected. A deeper understanding of the kernel map would also be useful to choose appropriate kernels for a specific task (e.g. by incorporating prior knowledge (Schölkopf et al. 1998a)). Finally the feature map seems to defy the curse of dimensionality (Bellman 1961) by making problems seemingly easier yet reliable via a map into some even higher dimensional space.

In this section we focus on the connections between SV methods and previous techniques like Regularization Networks (Girosi, Jones and Poggio 1993).9 In particular we will show that SV machines are essentially Regularization Networks (RN) with a clever choice of cost functions and that the kernels are Green's functions of the corresponding regularization operators. For a full exposition of the subject the reader is referred to Smola, Schölkopf and Müller (1998c).

7.1. Regularization networks

Let us briefly review the basic concepts of RNs. As in (35) we minimize a regularized risk functional. However, rather than enforcing flatness in feature space we try to optimize some smoothness criterion for the function in input space. Thus we get

    R_reg[f] := R_emp[f] + (λ/2) ‖P f‖².    (64)

Here P denotes a regularization operator in the sense of Tikhonov and Arsenin (1977), i.e. P is a positive semidefinite operator mapping from the Hilbert space H of functions f under consideration to a dot product space D such that the expression ⟨P f · P g⟩ is well defined for f, g ∈ H. For instance by choosing a suitable operator that penalizes large variations of f one can reduce the well-known overfitting effect. Another possible setting also might be an operator P mapping from L₂(Rⁿ) into some Reproducing Kernel Hilbert Space (RKHS) (Aronszajn 1950, Kimeldorf and Wahba 1971, Saitoh 1988, Schölkopf 1997, Girosi 1998).

Using an expansion of f in terms of some symmetric function k(x_i, x_j) (note here, that k need not fulfill Mercer's condition and can be chosen arbitrarily since it is not used to define a regularization term),

    f(x) = Σ_{i=1}^ℓ α_i k(x_i, x) + b,    (65)

and the ε-insensitive cost function, this leads to a quadratic programming problem similar to the one for SVs. Using

    D_ij := ⟨(P k)(x_i, ·) · (P k)(x_j, ·)⟩    (66)

we get α = D⁻¹ K (β − β*), with β, β* being the solution of

    minimize   (1/2) (β − β*)ᵀ K D⁻¹ K (β − β*) − (β − β*)ᵀ y + ε Σ_{i=1}^ℓ (β_i + β_i*)    (67)
    subject to  Σ_{i=1}^ℓ (β_i − β_i*) = 0  and  β_i, β_i* ∈ [0, C].
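For orientation, here is a minimal sketch (our own, not from the paper) of the regularization network idea in its simplest setting: if the ε-insensitive loss in (67) is replaced by a squared loss and one takes D = K (the Green's function case discussed below), the expansion coefficients of (65) follow from a single linear system rather than a quadratic program. The Gaussian kernel, the value of λ, and the omission of the bias b are arbitrary illustrative choices.

```python
# Sketch: regularization network with squared loss and D = K.
# Minimizing sum_i (y_i - f(x_i))^2 + (lambda/2) * alpha^T K alpha
# with f(x) = sum_i alpha_i k(x_i, x) leads (for full-rank K) to
# (K + (lambda/2) I) alpha = y.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.1, size=50)

lam = 1.0
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + 0.5 * lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = gaussian_kernel(X_test, X) @ alpha
print(np.round(f_test, 3))
```

Note that essentially all coefficients α_i come out nonzero here, in contrast to the sparse expansions produced by the ε-insensitive loss; this is the sparsity issue taken up next.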
Unfortunately, this setting of the problem does not preserve sparsity in terms of the coefficients, as a potentially sparse decomposition in terms of β_i and β_i* is spoiled by D⁻¹K, which is not in general diagonal.

7.2. Green's functions

Comparing (10) with (67) leads to the question whether and under which condition the two methods might be equivalent and therefore also under which conditions regularization networks might lead to sparse decompositions, i.e. only a few of the expansion coefficients α_i in f would differ from zero. A sufficient condition is D = K and thus K D⁻¹ K = K (if K does not have full rank we only need that K D⁻¹ K = K holds on the image of K):

    k(x_i, x_j) = ⟨(P k)(x_i, ·) · (P k)(x_j, ·)⟩    (68)

Our goal now is to solve the following two problems:

1. Given a regularization operator P, find a kernel k such that a SV machine using k will not only enforce flatness in feature space, but also correspond to minimizing a regularized risk functional with P as regularizer.
2. Given an SV kernel k, find a regularization operator P such that a SV machine using this kernel can be viewed as a Regularization Network using P.

These two problems can be solved by employing the concept of Green's functions as described in Girosi, Jones and Poggio (1993). These functions were introduced for the purpose of solving differential equations. In our context it is sufficient that the Green's functions G_{x_i}(x) of P*P satisfy

    (P*P G_{x_i})(x) = δ_{x_i}(x).    (69)

Here, δ_{x_i}(x) is the δ-distribution (not to be confused with the Kronecker symbol δ_ij) which has the property that ⟨f · δ_{x_i}⟩ = f(x_i). The relationship between kernels and regularization operators is formalized in the following proposition:

Proposition 1 (Smola, Schölkopf and Müller 1998b). Let P be a regularization operator, and G be the Green's function of P*P. Then G is a Mercer kernel such that D = K. SV machines using G minimize the risk functional (64) with P as regularization operator.

In the following we will exploit this relationship in both ways: to compute Green's functions for a given regularization operator P and to infer the regularizer, given a kernel k.

7.3. Translation invariant kernels

Let us now more specifically consider regularization operators P̂ that may be written as multiplications in Fourier space (Smola 1998)

    ⟨P f · P g⟩ = (2π)^{−n/2} ∫_Ω f̃(ω) g̃(ω) / P(ω) dω    (70)

with f̃(ω) denoting the Fourier transform of f(x), and P(ω) = P(−ω) real valued, nonnegative and converging to 0 for |ω| → ∞, and Ω := supp[P(ω)]. Small values of P(ω) correspond to a strong attenuation of the corresponding frequencies. Hence small values of P(ω) for large ω are desirable since high frequency components of f̃ correspond to rapid changes in f. P(ω) describes the filter properties of P*P. Note that no attenuation takes place for P(ω) = 0 as these frequencies have been excluded from the integration domain.

For regularization operators defined in Fourier space by (70) one can show by exploiting P(ω) = P(−ω) = P̄(ω) that

    G(x_i, x) = (2π)^{−n/2} ∫_{Rⁿ} e^{iω(x_i − x)} P(ω) dω    (71)

is a corresponding Green's function satisfying translational invariance, i.e.

    G(x_i, x_j) = G(x_i − x_j)  and  G̃(ω) = P(ω).    (72)

This provides us with an efficient tool for analyzing SV kernels and the types of capacity control they exhibit. In fact the above is a special case of Bochner's theorem (Bochner 1959) stating that the Fourier transform of a positive measure constitutes a positive Hilbert Schmidt kernel.

Example 2 (Gaussian kernels). Following the exposition of Yuille and Grzywacz (1988) as described in Girosi, Jones and Poggio (1993), one can see that for

    ‖P f‖² = ∫ dx Σ_m (σ^{2m} / (m! 2^m)) (Ô^m f(x))²    (73)

with Ô^{2m} = Δ^m and Ô^{2m+1} = ∇Δ^m, Δ being the Laplacian and ∇ the gradient operator, we get Gaussian kernels (31). Moreover, we can provide an equivalent representation of P in terms of its Fourier properties, i.e. P(ω) = e^{−σ²‖ω‖²/2} up to a multiplicative constant.

Training an SV machine with Gaussian RBF kernels (Schölkopf et al. 1997) corresponds to minimizing the specific cost function with a regularization operator of type (73). Recall that (73) means that all derivatives of f are penalized (we have a pseudodifferential operator) to obtain a very smooth estimate. This also explains the good performance of SV machines in this case, as it is by no means obvious that choosing a flat function in some high dimensional space will correspond to a simple function in low dimensional space, as shown in Smola, Schölkopf and Müller (1998c) for Dirichlet kernels.
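To make (71)-(73) concrete, here is a small numerical sketch (our own illustration, not from the paper) that samples a one-dimensional Gaussian RBF kernel, takes its discrete Fourier transform with numpy, and compares the result against the spectrum e^{−σ²ω²/2} predicted for the regularization operator of Example 2, up to a multiplicative constant. Grid size and σ are arbitrary choices.

```python
# Sketch: verify numerically that the Fourier transform of a 1-d Gaussian
# RBF kernel k(x) = exp(-x^2 / (2 sigma^2)) is again a Gaussian in omega,
# matching P(omega) ~ exp(-sigma^2 omega^2 / 2) from Example 2.
import numpy as np

sigma = 0.7
x = np.linspace(-20.0, 20.0, 4096)           # fine grid, wide support
dx = x[1] - x[0]
k = np.exp(-x**2 / (2.0 * sigma**2))          # translation invariant kernel G(x)

# Discrete approximation of the continuous Fourier transform of k.
omega = 2.0 * np.pi * np.fft.fftshift(np.fft.fftfreq(x.size, d=dx))
k_hat = np.abs(np.fft.fftshift(np.fft.fft(k))) * dx

predicted = np.exp(-(sigma**2) * omega**2 / 2.0)
# Compare shapes after normalizing both to peak value 1 (constants differ).
err = np.max(np.abs(k_hat / k_hat.max() - predicted / predicted.max()))
print(f"max deviation between empirical and predicted spectrum: {err:.2e}")
```

The deviation is tiny, illustrating that the Gaussian kernel indeed corresponds to a regularizer that damps high frequencies at a Gaussian rate.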
The question that arises now is which kernel to choose. Let us think about two extreme situations.

1. Suppose we already knew the shape of the power spectrum Pow(ω) of the function we would like to estimate. In this case we choose k such that k̃ matches the power spectrum (Smola 1998).
2. If we happen to know very little about the given data a general smoothness assumption is a reasonable choice. Hence
we might want to choose a Gaussian kernel. If computing time is important one might moreover consider kernels with compact support, e.g. using the B_q-spline kernels (cf. (32)). This choice will cause many matrix elements k_ij = k(x_i − x_j) to vanish.

The usual scenario will be in between the two extreme cases and we will have some limited prior knowledge available. For more information on using prior knowledge for choosing kernels see Schölkopf et al. (1998a).

7.4. Capacity control

All the reasoning so far was based on the assumption that there exist ways to determine model parameters like the regularization constant λ or length scales σ of rbf-kernels. The model selection issue itself would easily double the length of this review and moreover it is an area of active and rapidly moving research. Therefore we limit ourselves to a presentation of the basic concepts and refer the interested reader to the original publications.

It is important to keep in mind that there exist several fundamentally different approaches such as Minimum Description Length (cf. e.g. Rissanen 1978, Li and Vitányi 1993), which is based on the idea that the simplicity of an estimate, and therefore also its plausibility, is based on the information (number of bits) needed to encode it such that it can be reconstructed.

Bayesian estimation, on the other hand, considers the posterior probability of an estimate, given the observations X = {(x_1, y_1), . . . , (x_ℓ, y_ℓ)}, an observation noise model, and a prior probability distribution p(f) over the space of estimates (parameters). It is given by Bayes Rule p(f | X) p(X) = p(X | f) p(f). Since p(X) does not depend on f, one can maximize p(X | f) p(f) to obtain the so-called MAP estimate.10 As a rule of thumb, to translate regularized risk functionals into Bayesian MAP estimation schemes, all one has to do is to consider exp(−R_reg[f]) = p(f | X). For a more detailed discussion see e.g. Kimeldorf and Wahba (1970), MacKay (1991), Neal (1996), Rasmussen (1996), Williams (1998).

A simple yet powerful way of model selection is cross validation. This is based on the idea that the expectation of the error on a subset of the training sample not used during training is identical to the expected error itself. There exist several strategies such as 10-fold crossvalidation, leave-one-out error (ℓ-fold crossvalidation), bootstrap and derived algorithms to estimate the crossvalidation error itself; see e.g. Stone (1974), Wahba (1980), Efron (1982), Efron and Tibshirani (1994), Wahba (1999), Jaakkola and Haussler (1999) for further details.
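As a practical illustration of cross validation for SV regression (our own sketch, not part of the original text), one can select C, ε and the RBF width jointly by k-fold cross validation with scikit-learn. The parameter grid and the toy data below are arbitrary illustrative assumptions.

```python
# Sketch: 5-fold cross validation over (C, epsilon, gamma) for an RBF SVR.
# The grid values below are arbitrary illustrative choices.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.1, size=300)

param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "epsilon": [0.01, 0.05, 0.1, 0.2],
    "gamma": [0.1, 0.5, 1.0, 2.0],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated MSE:", -search.best_score_)
```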
Finally, one may also use uniform convergence bounds such as the ones introduced by Vapnik and Chervonenkis (1971). The basic idea is that one may bound with probability 1 − η (with η > 0) the expected risk R[f] by R_emp[f] + Φ(F, η), where Φ is a confidence term depending on the class of functions F. Several criteria for measuring the capacity of F exist, such as the VC-Dimension which, in pattern recognition problems, is given by the maximum number of points that can be separated by the function class in all possible ways, the Covering Number which is the number of elements from F that are needed to cover F with accuracy of at least ε, Entropy Numbers which are the functional inverse of Covering Numbers, and many more variants thereof (see e.g. Vapnik 1982, 1998, Devroye, Györfi and Lugosi 1996, Williamson, Smola and Schölkopf 1998, Shawe-Taylor et al. 1998).

8. Conclusion

Due to the already quite large body of work done in the field of SV research it is impossible to write a tutorial on SV regression which includes all contributions to this field. This also would be quite out of the scope of a tutorial and rather be relegated to textbooks on the matter (see Schölkopf and Smola (2002) for a comprehensive overview, Schölkopf, Burges and Smola (1999a) for a snapshot of the current state of the art, Vapnik (1998) for an overview on statistical learning theory, or Cristianini and Shawe-Taylor (2000) for an introductory textbook). Still the authors hope that this work provides a not overly biased view of the state of the art in SV regression research. We deliberately omitted (among others) the following topics.

8.1. Missing topics

Mathematical programming: Starting from a completely different perspective, algorithms have been developed that are similar in their ideas to SV machines. A good primer might be (Bradley, Fayyad and Mangasarian 1998). (Also see Mangasarian 1965, 1969, Street and Mangasarian 1995.) A comprehensive discussion of connections between mathematical programming and SV machines has been given by Bennett (1999).

Density estimation: with SV machines (Weston et al. 1999, Vapnik 1999). There one makes use of the fact that the cumulative distribution function is monotonically increasing, and that its values can be predicted with variable confidence which is adjusted by selecting different values of ε in the loss function.

Dictionaries: were originally introduced in the context of wavelets by Chen, Donoho and Saunders (1999) to allow for a large class of basis functions to be considered simultaneously, e.g. kernels with different widths. In the standard SV case this is hardly possible except by defining new kernels as linear combinations of differently scaled ones: choosing the regularization operator already determines the kernel completely (Kimeldorf and Wahba 1971, Cox and O'Sullivan 1990, Schölkopf et al. 2000). Hence one has to resort to linear programming (Weston et al. 1999).

Applications: The focus of this review was on methods and theory rather than on applications. This was done to limit the size of the exposition. State of the art, or even record performance was reported in Müller et al. (1997), Drucker et al. (1997), Stitson et al. (1999) and Mattera and Haykin (1999).
In many cases, it may be possible to achieve similar performance with neural network methods, however, only if many parameters are optimally tuned by hand, thus depending largely on the skill of the experimenter. Certainly, SV machines are not a "silver bullet." However, as they have only few critical parameters (e.g. regularization and kernel width), state-of-the-art results can be achieved with relatively little effort.

8.2. Open issues

Being a very active field there exist still a number of open issues that have to be addressed by future research. After that the algorithmic development seems to have found a more stable stage, one of the most important ones seems to be to find tight error bounds derived from the specific properties of kernel functions. It will be of interest in this context, whether SV machines, or similar approaches stemming from a linear programming regularizer, will lead to more satisfactory results.

Moreover some sort of "luckiness framework" (Shawe-Taylor et al. 1998) for multiple model selection parameters, similar to multiple hyperparameters and automatic relevance detection in Bayesian statistics (MacKay 1991, Bishop 1995), will have to be devised to make SV machines less dependent on the skill of the experimenter.

It is also worthwhile to exploit the bridge between regularization operators, Gaussian processes and priors (see e.g. Williams (1998)) to state Bayesian risk bounds for SV machines in order to compare the predictions with the ones from VC theory. Optimization techniques developed in the context of SV machines also could be used to deal with large datasets in the Gaussian process settings.

Prior knowledge appears to be another important question in SV regression. Whilst invariances could be included in pattern recognition in a principled way via the virtual SV mechanism and restriction of the feature space (Burges and Schölkopf 1997, Schölkopf et al. 1998a), it is still not clear how (probably) more subtle properties, as required for regression, could be dealt with efficiently.

Reduced set methods also should be considered for speeding up the prediction (and possibly also training) phase for large datasets (Burges and Schölkopf 1997, Osuna and Girosi 1999, Schölkopf et al. 1999b, Smola and Schölkopf 2000). This topic is of great importance as data mining applications require algorithms that are able to deal with databases that are often at least one order of magnitude larger (1 million samples) than the current practical size for SV regression.

Many more aspects such as more data dependent generalization bounds, efficient training algorithms, automatic kernel selection procedures, and many techniques that already have made their way into the standard neural networks toolkit, will have to be considered in the future.

Readers who are tempted to embark upon a more detailed exploration of these topics, and to contribute their own ideas to this exciting field, may find it useful to consult the web page www.kernel-machines.org.

Appendix A: Solving the interior-point equations

A.1. Path following

Rather than trying to satisfy (53) directly we will solve a modified version thereof for some µ > 0 substituted on the rhs in the first place and decrease µ while iterating.

    g_i z_i = µ,  s_i t_i = µ  for all i ∈ [1 . . . n].    (74)

Still it is rather difficult to solve the nonlinear system of equations (51), (52), and (74) exactly. However we are not interested in obtaining the exact solution to the approximation (74). Instead, we seek a somewhat more feasible solution for a given µ, then decrease µ and repeat. This can be done by linearizing the above system and solving the resulting equations by a predictor-corrector approach until the duality gap is small enough. The advantage is that we will get approximately equal performance as by trying to solve the quadratic system directly, provided that the terms in ∆² are small enough.

    A(α + ∆α) = b
    α + ∆α − g − ∆g = l
    α + ∆α + t + ∆t = u
    c + (1/2) ∂_α q(α) + (1/2) ∂²_α q(α) ∆α − (A(y + ∆y))ᵀ + s + ∆s = z + ∆z
    (g_i + ∆g_i)(z_i + ∆z_i) = µ
    (s_i + ∆s_i)(t_i + ∆t_i) = µ

Solving for the variables in ∆ we get

    A ∆α = b − A α =: ρ
    ∆α − ∆g = l − α + g =: ν
    ∆α + ∆t = u − α − t =: τ
    (A ∆y)ᵀ + ∆z − ∆s − (1/2) ∂²_α q(α) ∆α = c − (A y)ᵀ + s − z + (1/2) ∂_α q(α) =: σ
    g⁻¹ z ∆g + ∆z = µ g⁻¹ − z − g⁻¹ ∆g ∆z =: γ_z
    t⁻¹ s ∆t + ∆s = µ t⁻¹ − s − t⁻¹ ∆t ∆s =: γ_s

where g⁻¹ denotes the vector (1/g_1, . . . , 1/g_n), and t⁻¹ analogously. Moreover denote g⁻¹ z and t⁻¹ s the vector generated by the componentwise product of the two vectors. Solving for
∆g, ∆t, ∆z, ∆s we get

    ∆g = z⁻¹ g (γ_z − ∆z)        ∆z = g⁻¹ z (ν̂ − ∆α)
    ∆t = s⁻¹ t (γ_s − ∆s)        ∆s = t⁻¹ s (∆α − τ̂)
    where ν̂ := ν − z⁻¹ g γ_z
          τ̂ := τ − s⁻¹ t γ_s    (75)

Now we can formulate the reduced KKT-system (see Vanderbei (1994) for the quadratic case):

    [ −H   Aᵀ ] [ ∆α ]   [ σ − g⁻¹ z ν̂ − t⁻¹ s τ̂ ]
    [  A    0 ] [ ∆y ] = [ ρ                      ]    (76)

where H := ((1/2) ∂²_α q(α) + g⁻¹ z + t⁻¹ s).

A.2. Iteration strategies

For the predictor-corrector method we proceed as follows. In the predictor step solve the system of (75) and (76) with µ = 0 and all ∆-terms on the rhs set to 0, i.e. γ_z = z, γ_s = s. The values in ∆ are substituted back into the definitions for γ_z and γ_s and (75) and (76) are solved again in the corrector step. As the quadratic part in (76) is not affected by the predictor-corrector steps, we only need to invert the quadratic matrix once. This is done best by manually pivoting for the H part, as it is positive definite.

Next the values in ∆ obtained by such an iteration step are used to update the corresponding values in α, s, t, z, y. To ensure that the variables meet the positivity constraints, the steplength ξ is chosen such that the variables move at most 1 − ε of their initial distance to the boundaries of the positive orthant. Usually (Vanderbei 1994) one sets ε = 0.05.

Another heuristic is used for computing µ, the parameter determining how much the KKT-conditions should be enforced. Obviously it is our aim to reduce µ as fast as possible, however if we happen to choose it too small, the condition of the equations will worsen drastically. A setting that has proven to work robustly is

    µ = ((⟨g, z⟩ + ⟨s, t⟩) / (2n)) · ((ξ − 1) / (ξ + 10))².    (77)

The rationale behind (77) is to use the average of the satisfaction of the KKT conditions (74) as point of reference and then decrease µ rapidly if we are far enough away from the boundaries of the positive orthant, to which all variables (except y) are constrained to.

Finally one has to come up with good initial values. Analogously to Vanderbei (1994) we choose a regularized version of (76) in order to determine the initial conditions. One solves

    [ −((1/2) ∂²_α q(α) + 1)   Aᵀ ] [ α ]   [ c ]
    [  A                        1 ] [ y ] = [ b ]    (78)

and subsequently restricts the solution to a feasible set

    x = max(x, u/100)
    g = min(α − l, u)
    t = min(u − α, u)
    z = min( Θ((1/2) ∂_α q(α) + c − (A y)ᵀ) ((1/2) ∂_α q(α) + c − (A y)ᵀ) + u/100, u )
    s = min( Θ(−(1/2) ∂_α q(α) − c + (A y)ᵀ) (−(1/2) ∂_α q(α) − c + (A y)ᵀ) + u/100, u )    (79)

Θ(·) denotes the Heaviside function, i.e. Θ(x) = 1 for x > 0 and Θ(x) = 0 otherwise.

A.3. Special considerations for SV regression

The algorithm described so far can be applied to both SV pattern recognition and regression estimation. For the standard setting in pattern recognition we have

    q(α) = Σ_{i,j=1}^ℓ α_i α_j y_i y_j k(x_i, x_j)    (80)

and consequently ∂_{α_i} q(α) = 0, ∂²_{α_i α_j} q(α) = y_i y_j k(x_i, x_j), i.e. the Hessian is dense and the only thing we can do is compute its Cholesky factorization to compute (76). In the case of SV regression, however, we have (with α := (α_1, . . . , α_ℓ, α_1*, . . . , α_ℓ*))

    q(α) = Σ_{i,j=1}^ℓ (α_i − α_i*)(α_j − α_j*) k(x_i, x_j) + 2C Σ_{i=1}^ℓ (T(α_i) + T(α_i*))    (81)

and therefore

    ∂_{α_i} q(α) = (d/dα_i) T(α_i)
    ∂²_{α_i α_j} q(α) = k(x_i, x_j) + δ_ij (d²/dα_i²) T(α_i)    (82)
    ∂²_{α_i α_j*} q(α) = −k(x_i, x_j)

and ∂²_{α_i* α_j*} q(α) analogously. Hence we are dealing with
a matrix of type

    M := [ K + D    −K     ]
         [ −K       K + D′ ]

where D, D′ are diagonal matrices. By applying an orthogonal transformation M can be inverted essentially by inverting an ℓ × ℓ matrix instead of a 2ℓ × 2ℓ system. This is exactly the additional advantage one can gain from implementing the optimization algorithm directly instead of using a general purpose optimizer. One can show that for practical implementations (Smola, Schölkopf and Müller 1998b) one can solve optimization problems using nearly arbitrary convex cost functions as efficiently as the special case of ε-insensitive loss functions.

Finally note that due to the fact that we are solving the primal and dual optimization problem simultaneously we are also
computing parameters corresponding to the initial SV optimization problem. This observation is useful as it allows us to obtain the constant term b directly, namely by setting b = y (see Smola (1998) for details).
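To connect the above to practice, the following sketch (our own illustration, not part of the original text) sets up the standard ε-SVR dual in the stacked variable (α, α*) and hands it to the interior-point QP solver of the cvxopt package, rather than exploiting the special block structure of M discussed above. All data and parameter values are arbitrary assumptions.

```python
# Sketch: solve the epsilon-SVR dual
#   minimize 1/2 (a - a*)^T K (a - a*) + eps * sum(a + a*) - y^T (a - a*)
#   s.t. sum(a - a*) = 0,  0 <= a, a* <= C
# with a generic interior-point QP solver (cvxopt), using z = [a; a*].
import numpy as np
from cvxopt import matrix, solvers

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sinc(X).ravel() + rng.normal(scale=0.1, size=40)
ell, C, eps, gamma = len(y), 10.0, 0.1, 0.5

K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
P = np.block([[K, -K], [-K, K]])
q = np.concatenate([eps - y, eps + y])
G = np.vstack([np.eye(2 * ell), -np.eye(2 * ell)])       # box constraints
h = np.concatenate([C * np.ones(2 * ell), np.zeros(2 * ell)])
A = np.concatenate([np.ones(ell), -np.ones(ell)])[None, :]
b = np.zeros(1)

solvers.options["show_progress"] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b))
z = np.array(sol["x"]).ravel()
beta = z[:ell] - z[ell:]                                   # alpha - alpha*
print("number of (near-)support vectors:", int((np.abs(beta) > 1e-5).sum()))
```

A dedicated implementation that exploits the [K + D, −K; −K, K + D′] structure would, as noted above, only have to factorize an ℓ × ℓ rather than a 2ℓ × 2ℓ system.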
Appendix B: Solving the subset selection problem

B.1. Subset optimization problem

We will adapt the exposition of Joachims (1999) to the case of regression with convex cost functions. Without loss of generality we will assume ε ≠ 0 and α ∈ [0, C] (the other situations can be treated as a special case). First we will extract a reduced optimization problem for the working set when all other variables are kept fixed. Denote Sw ⊂ {1, . . . , ℓ} the working set and Sf := {1, . . . , ℓ}\Sw the fixed set. Writing (43) as an optimization problem only in terms of Sw yields

    maximize   −(1/2) Σ_{i,j∈Sw} (α_i − α_i*)(α_j − α_j*) ⟨x_i, x_j⟩
               + Σ_{i∈Sw} (α_i − α_i*) ( y_i − Σ_{j∈Sf} (α_j − α_j*) ⟨x_i, x_j⟩ )
               + Σ_{i∈Sw} ( −ε (α_i + α_i*) + C (T(α_i) + T(α_i*)) )
    subject to  Σ_{i∈Sw} (α_i − α_i*) = −Σ_{i∈Sf} (α_i − α_i*)
                α_i ∈ [0, C]    (83)

Hence we only have to update the linear term by the coupling with the fixed set −Σ_{i∈Sw} (α_i − α_i*) Σ_{j∈Sf} (α_j − α_j*) ⟨x_i, x_j⟩ and the equality constraint by −Σ_{i∈Sf} (α_i − α_i*). It is easy to see that maximizing (83) also decreases (43) by exactly the same amount. If we choose variables for which the KKT conditions are not satisfied the overall objective function tends to decrease whilst still keeping all variables feasible. Finally it is bounded from below.

Even though this does not prove convergence (contrary to the statement in Osuna, Freund and Girosi (1997)) this algorithm proves very useful in practice. It is one of the few methods (besides (Kaufman 1999, Platt 1999)) that can deal with problems whose quadratic part does not completely fit into memory. Still in practice one has to take special precautions to avoid stalling of convergence (recent results of Chang, Hsu and Lin (1999) indicate that under certain conditions a proof of convergence is possible). The crucial part is the selection of Sw.

B.2. A note on optimality

For convenience the KKT conditions are repeated in a slightly modified form. Denote ϕ_i the error made by the current estimate at sample x_i, i.e.

    ϕ_i := y_i − f(x_i) = y_i − [ Σ_{j=1}^m k(x_i, x_j)(α_j − α_j*) + b ].    (84)

Rewriting the feasibility conditions (52) in terms of α yields

    2 ∂_{α_i} T(α_i) + ε − ϕ_i + s_i − z_i = 0
    2 ∂_{α_i*} T(α_i*) + ε + ϕ_i + s_i* − z_i* = 0    (85)

for all i ∈ {1, . . . , m} with z_i, z_i*, s_i, s_i* ≥ 0. A set of dual feasible variables z, s is given by

    z_i  = max(2 ∂_{α_i} T(α_i) + ε − ϕ_i, 0)
    s_i  = −min(2 ∂_{α_i} T(α_i) + ε − ϕ_i, 0)
    z_i* = max(2 ∂_{α_i*} T(α_i*) + ε + ϕ_i, 0)    (86)
    s_i* = −min(2 ∂_{α_i*} T(α_i*) + ε + ϕ_i, 0)

Consequently the KKT conditions (53) can be translated into

    α_i z_i = 0    and   (C − α_i) s_i = 0
    α_i* z_i* = 0  and   (C − α_i*) s_i* = 0    (87)

All variables α_i, α_i* violating some of the conditions of (87) may be selected for further optimization. In most cases, especially in the initial stage of the optimization algorithm, this set of patterns is much larger than any practical size of Sw. Unfortunately Osuna, Freund and Girosi (1997) contains little information on how to select Sw. The heuristics presented here are an adaptation of Joachims (1999) to regression. See also Lin (2001) for details on the optimization for SVR.

B.3. Selection rules

Similarly to a merit function approach (El-Bakry et al. 1996) the idea is to select those variables that violate (85) and (87) most, thus contribute most to the feasibility gap. Hence one defines a score variable ζ_i by

    ζ_i := g_i z_i + s_i t_i
         = α_i z_i + α_i* z_i* + (C − α_i) s_i + (C − α_i*) s_i*    (88)

By construction, Σ_i ζ_i is the size of the feasibility gap (cf. (56) for the case of ε-insensitive loss). By decreasing this gap, one
approaches the solution (upper bounded by the primal objective and lower bounded by the dual objective function). Hence, the selection rule is to choose those patterns for which ζ_i is
largest. Some algorithms use

    ζ_i′ := α_i Θ(z_i) + α_i* Θ(z_i*) + (C − α_i) Θ(s_i) + (C − α_i*) Θ(s_i*)    (89)
    or ζ_i″ := Θ(α_i) z_i + Θ(α_i*) z_i* + Θ(C − α_i) s_i + Θ(C − α_i*) s_i*.

One can see that ζ_i = 0, ζ_i′ = 0, and ζ_i″ = 0 mutually imply each other. However, only ζ_i gives a measure for the contribution of the variable i to the size of the feasibility gap.

Finally, note that heuristics like assigning sticky-flags (cf. Burges 1998) to variables at the boundaries, thus effectively solving smaller subproblems, or completely removing the corresponding patterns from the training set while accounting for their couplings (Joachims 1999) can significantly decrease the size of the problem one has to solve and thus result in a noticeable speedup. Also caching (Joachims 1999, Kowalczyk 2000) of already computed entries of the dot product matrix may have a significant impact on the performance.
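As an illustration of the selection heuristic above (our own sketch, not code from the paper), the following function computes the dual feasible variables of (86) and the scores ζ_i of (88) for the plain ε-insensitive loss, where T ≡ 0, and returns the indices with the largest scores as the next working set. The arrays, the working-set size q, and the toy values are assumptions for the example.

```python
# Sketch: score-based working set selection for epsilon-insensitive SVR,
# following (86)-(88) with T(alpha) = 0.  phi[i] = y[i] - f(x[i]) are the
# current prediction errors, alpha/alpha_star the current dual variables.
import numpy as np

def select_working_set(alpha, alpha_star, phi, C, eps, q=10):
    z      = np.maximum(eps - phi, 0.0)       # (86) with dT/dalpha = 0
    s      = -np.minimum(eps - phi, 0.0)
    z_star = np.maximum(eps + phi, 0.0)
    s_star = -np.minimum(eps + phi, 0.0)
    # (88): contribution of each pattern to the feasibility gap
    zeta = (alpha * z + alpha_star * z_star
            + (C - alpha) * s + (C - alpha_star) * s_star)
    return np.argsort(zeta)[::-1][:q], zeta

# Toy usage with random current iterates (illustration only).
rng = np.random.RandomState(0)
ell, C, eps = 100, 1.0, 0.1
alpha, alpha_star = rng.uniform(0, C, ell), rng.uniform(0, C, ell)
phi = rng.normal(scale=0.3, size=ell)
Sw, zeta = select_working_set(alpha, alpha_star, phi, C, eps)
print("selected working set:", Sw)
```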
Appendix C: Solving the SMO equations

C.1. Pattern dependent regularization

Consider the constrained optimization problem (83) for two indices, say (i, j). Pattern dependent regularization means that C_i may be different for every pattern (possibly even different for α_i and α_i*). Since at most two variables may become nonzero at the same time and moreover we are dealing with a constrained optimization problem we may express everything in terms of just one variable. From the summation constraint we obtain

    (α_i − α_i*) + (α_j − α_j*) = (α_i^old − α_i*^old) + (α_j^old − α_j*^old) := γ    (90)

for regression. Exploiting α_j^(*) ∈ [0, C_j^(*)] yields α_i^(*) ∈ [L, H]. This is taking account of the fact that there may be only four different pairs of nonzero variables: (α_i, α_j), (α_i*, α_j), (α_i, α_j*), and (α_i*, α_j*). For convenience define an auxiliary variable s such that s = 1 in the first and the last case and s = −1 otherwise.

                   α_j                         α_j*
    α_i    L   max(0, γ − C_j)            max(0, γ)
           H   min(C_i, γ)                min(C_i, C_j* + γ)
    α_i*   L   max(0, −γ)                 max(0, −γ − C_j*)
           H   min(C_i*, −γ + C_j)        min(C_i*, −γ)

C.2. Analytic solution for regression

Next one has to solve the optimization problem analytically. We make use of (84) and substitute the values of ϕ_i into the reduced optimization problem (83). In particular we use

    y_i − Σ_{j∉Sw} (α_j − α_j*) K_ij = ϕ_i + b + Σ_{j∈Sw} (α_j^old − α_j*^old) K_ij.    (91)

Moreover with the auxiliary variables γ = α_i − α_i* + α_j − α_j* and η := (K_ii + K_jj − 2 K_ij) one obtains the following constrained optimization problem in α_i (after eliminating j, ignoring terms independent of α_j, α_j*, and noting that this only holds for α_i α_i* = α_j α_j* = 0):

    maximize   −(1/2) (α_i − α_i*)² η − ε (α_i + α_i*)(1 − s) + (α_i − α_i*) (ϕ_i − ϕ_j + η (α_i^old − α_i*^old))    (92)
    subject to  α_i^(*) ∈ [L^(*), H^(*)].

The unconstrained maximum of (92) with respect to α_i or α_i* can be found below.

    (I)    α_i, α_j       α_i^old + η⁻¹ (ϕ_i − ϕ_j)
    (II)   α_i, α_j*      α_i^old + η⁻¹ (ϕ_i − ϕ_j − 2ε)
    (III)  α_i*, α_j      α_i*^old − η⁻¹ (ϕ_i − ϕ_j + 2ε)
    (IV)   α_i*, α_j*     α_i*^old − η⁻¹ (ϕ_i − ϕ_j)

The problem is that we do not know beforehand which of the four quadrants (I)-(IV) contains the solution. However, by considering the sign of γ we can distinguish two cases: for γ > 0 only (I)-(III) are possible, for γ < 0 the coefficients satisfy one of the cases (II)-(IV). In case of γ = 0 only (II) and (III) have to be considered. See also the diagram below.

[Diagram of the four quadrants (I)-(IV) omitted.]

For γ > 0 it is best to start with quadrant (I), test whether the unconstrained solution hits one of the boundaries L, H and if so, probe the corresponding adjacent quadrant (II) or (III). γ < 0 can be dealt with analogously.

Due to numerical instabilities, it may happen that η < 0. In that case η should be set to 0 and one has to solve (92) in a linear fashion directly.11
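The case distinctions of C.1 and C.2 are compact enough to spell out directly. The sketch below (our own, with a pattern-independent C for brevity and T ≡ 0) returns, for each of the four variable pairs, the feasible interval [L, H] from the table in C.1 together with the clipped unconstrained update of (I)-(IV); the quadrant-probing logic that decides which pair finally becomes active is omitted. All names and the example values are illustrative assumptions.

```python
# Sketch: bounds and clipped unconstrained updates for one SMO pair (i, j)
# in SVR, following the tables in C.1 and C.2 (common C for all points).
import numpy as np

def smo_pair_candidates(gamma, phi_i, phi_j, a_i_old, a_i_star_old, eta, C, eps):
    """Return dict: pair name -> clipped candidate value of the updated variable."""
    if eta <= 0:            # cf. footnote 11: numerically eta may dip below 0
        return {}
    bounds = {                               # [L, H] for the variable being updated
        ("a_i", "a_j"):    (max(0.0, gamma - C),  min(C, gamma)),
        ("a_i", "a_j*"):   (max(0.0, gamma),      min(C, C + gamma)),
        ("a_i*", "a_j"):   (max(0.0, -gamma),     min(C, -gamma + C)),
        ("a_i*", "a_j*"):  (max(0.0, -gamma - C), min(C, -gamma)),
    }
    unconstrained = {                        # cases (I)-(IV)
        ("a_i", "a_j"):    a_i_old      + (phi_i - phi_j) / eta,
        ("a_i", "a_j*"):   a_i_old      + (phi_i - phi_j - 2 * eps) / eta,
        ("a_i*", "a_j"):   a_i_star_old - (phi_i - phi_j + 2 * eps) / eta,
        ("a_i*", "a_j*"):  a_i_star_old - (phi_i - phi_j) / eta,
    }
    return {pair: float(np.clip(unconstrained[pair], *bounds[pair]))
            for pair in bounds if bounds[pair][0] <= bounds[pair][1]}

print(smo_pair_candidates(gamma=0.3, phi_i=0.2, phi_j=-0.1,
                          a_i_old=0.4, a_i_star_old=0.0, eta=1.5, C=1.0, eps=0.1))
```

Infeasible quadrants (empty [L, H]) are dropped automatically, which reproduces the observation that for γ > 0 only (I)-(III) can occur.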
C.3. Selection rule for regression

Finally, one has to pick indices (i, j) such that the objective function is maximized. Again, the reasoning of SMO (Platt 1999, Section 12.2.2) for classification will be mimicked. This means that a two loop approach is chosen to maximize the objective function. The outer loop iterates over all patterns violating the KKT conditions, first only over those with Lagrange multipliers neither on the upper nor lower boundary, and once all of them are satisfied, over all patterns violating the KKT conditions, to ensure self consistency on the complete dataset.12 This solves the problem of choosing i.

Now for j: To make a large step towards the minimum, one looks for large steps in α_i. As it is computationally expensive to compute η for all possible pairs (i, j), one chooses the heuristic to maximize the absolute value of the numerator in the expressions for α_i and α_i*, i.e. |ϕ_i − ϕ_j| and |ϕ_i − ϕ_j ± 2ε|. The index j corresponding to the maximum absolute value is chosen for this purpose.

If this heuristic happens to fail, in other words if little progress is made by this choice, all other indices j are looked at (this is what is called "second choice hierarchy" in Platt (1999)) in the following way:

1. All indices j corresponding to non-bound examples are looked at, searching for an example to make progress on.
2. In the case that the first heuristic was unsuccessful, all other samples are analyzed until an example is found where progress can be made.
3. If both previous steps fail proceed to the next i.

For a more detailed discussion see Platt (1999). Unlike interior point algorithms SMO does not automatically provide a value for b. However this can be chosen like in Section 1.4 by having a close look at the Lagrange multipliers α_i^(*) obtained.

C.4. Stopping criteria

By essentially minimizing a constrained primal optimization problem one cannot ensure that the dual objective function increases with every iteration step.13 Nevertheless one knows that the minimum value of the objective function lies in the interval [dual objective_i, primal objective_i] for all steps i, hence also in the interval [(max_{j≤i} dual objective_j), primal objective_i]. One uses the latter to determine the quality of the current solution.

The calculation of the primal objective function from the prediction errors is straightforward. One uses

    Σ_{i,j} (α_i − α_i*)(α_j − α_j*) k_ij = −Σ_i (α_i − α_i*)(ϕ_i − y_i + b),    (93)

i.e. the definition of ϕ_i to avoid the matrix-vector multiplication with the dot product matrix.

Acknowledgments

This work has been supported in part by a grant of the DFG (Ja 379/71, Sm 62/1). The authors thank Peter Bartlett, Chris Burges, Stefan Harmeling, Olvi Mangasarian, Klaus-Robert Müller, Vladimir Vapnik, Jason Weston, Robert Williamson, and Andreas Ziehe for helpful discussions and comments.

Notes

1. Our use of the term 'regression' is somewhat loose in that it also includes cases of function estimation where one minimizes errors other than the mean square loss. This is done mainly for historical reasons (Vapnik, Golowich and Smola 1997).
2. A similar approach, however using linear instead of quadratic programming, was taken at the same time in the USA, mainly by Mangasarian (1965, 1968, 1969).
3. See Smola (1998) for an overview over other ways of specifying flatness of such functions.
4. This is true as long as the dimensionality of w is much higher than the number of observations. If this is not the case, specialized methods can offer considerable computational savings (Lee and Mangasarian 2001).
5. The table displays CT(α) instead of T(α) since the former can be plugged directly into the corresponding optimization equations.
6. The high price tag usually is the major deterrent for not using them. Moreover one has to bear in mind that in SV regression, one may speed up the solution considerably by exploiting the fact that the quadratic form has a special structure or that there may exist rank degeneracies in the kernel matrix itself.
7. For large and noisy problems (e.g. 100,000 patterns and more with a substantial fraction of nonbound Lagrange multipliers) it is impossible to solve the problem exactly: due to the size one has to use subset selection algorithms, hence joint optimization over the training set is impossible. However, unlike in Neural Networks, we can determine the closeness to the optimum. Note that this reasoning only holds for convex cost functions.
8. A similar technique was employed by Bradley and Mangasarian (1998) in the context of linear programming in order to deal with large datasets.
9. Due to length constraints we will not deal with the connection between Gaussian Processes and SVMs. See Williams (1998) for an excellent overview.
10. Strictly speaking, in Bayesian estimation one is not so much concerned about the maximizer f̂ of p(f | X) but rather about the posterior distribution of f.
11. Negative values of η are theoretically impossible since k satisfies Mercer's condition: 0 ≤ ‖Φ(x_i) − Φ(x_j)‖² = K_ii + K_jj − 2 K_ij = η.
12. It is sometimes useful, especially when dealing with noisy data, to iterate over the complete KKT violating dataset already before complete self consistency on the subset has been achieved. Otherwise much computational resources are spent on making subsets self consistent that are not globally self consistent. This is the reason why in the pseudo code a global loop is initiated already when only less than 10% of the non bound variables changed.
13. It is still an open question how a subset selection optimization algorithm could be devised that decreases both primal and dual objective function at the same time. The problem is that this usually involves a number of dual variables of the order of the sample size, which makes this attempt unpractical.

References

Aizerman M.A., Braverman É.M., and Rozonoér L.I. 1964. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25: 821–837.
Aronszajn N. 1950. Theory of reproducing kernels. Transactions of the American Mathematical Society 68: 337–404.
Bazaraa M.S., Sherali H.D., and Shetty C.M. 1993. Nonlinear Burges C.J.C. and Scho¨lkopf B. 1997. Improving the accuracy and
Program- ming: Theory and Algorithms, 2nd edition, Wiley. speed of support vector learning machines. In Mozer M.C.,
Bellman R.E. 1961. Adaptive Control Processes. Princeton Jordan M.I., and Petsche T., (Eds.), Advances in Neural
University Press, Princeton, NJ. Information Processing Systems 9, MIT Press, Cambridge, MA,
Bennett K. 1999. Combining support vector and mathematical pp. 375–381.
program- ming methods for induction. In: Scho¨lkopf B., Burges Chalimourda A., Scho¨lkopf B., and Smola A.J. 2004.
C.J.C., and Smola A.J., (Eds.), Advances in Kernel Methods— Experimentally optimal ν in support vector regression for
SV Learning, MIT Press, Cambridge, MA, pp. 307–326.
different noise models and parameter settings. Neural Networks
Bennett K.P. and Mangasarian O.L. 1992. Robust linear program-
17(1): 127–141.
ming discrimination of two linearly inseparable sets.
Chang C.-C., Hsu C.-W., and Lin C.-J. 1999. The analysis of decom-
Optimization Methods and Software 1: 23–34.
position methods for support vector machines. In Proceeding of
Berg C., Christensen J.P.R., and Ressel P. 1984. Harmonic Analysis
IJCAI99, SVM Workshop.
on Semigroups. Springer, New York.
Chang C.C. and Lin C.J. 2001. Training ν-support vector classi-
Bertsekas D.P. 1995. Nonlinear Programming. Athena Scientific,
fiers: Theory and algorithms. Neural Computation 13(9): 2119–
Belmont, MA.
2147.
Bishop C.M. 1995. Neural Networks for Pattern Recognition.
Chen S., Donoho D., and Saunders M. 1999. Atomic decomposition by
Clarendon Press, Oxford.
basis pursuit. Siam Journal of Scientific Computing 20(1): 33–61.
Blanz V., Scho¨lkopf B., Bu¨lthoff H., Burges C., Vapnik V., and
Cherkassky V. and Mulier F. 1998. Learning from Data. John Wiley and
Vetter
Sons, New York.
T. 1996. Comparison of view-based object recognition algorithms
Cortes C. and Vapnik V. 1995. Support vector networks. Machine
using realistic 3D models. In: von der Malsburg C., von Seelen
Learn- ing 20: 273–297.
W., Vorbru¨ggen J.C., and Sendhoff B. (Eds.), Artificial Neural
Cox D. and O’Sullivan F. 1990. Asymptotic analysis of penalized like-
Networks ICANN’96, Berlin. Springer Lecture Notes in
lihood and related estimators. Annals of Statistics 18: 1676–1695.
Computer Science, Vol. 1112, pp. 251–256.
CPLEX Optimization Inc. Using the CPLEX callable library. Manual,
Bochner S. 1959. Lectures on Fourier integral. Princeton Univ. Press,
1994.
Princeton, New Jersey.
Cristianini N. and Shawe-Taylor J. 2000. An Introduction to Support
Boser B.E., Guyon I.M., and Vapnik V.N. 1992. A training algorithm
Vector Machines. Cambridge University Press, Cambridge, UK.
for optimal margin classifiers. In: Haussler D. (Ed.),
Cristianini N., Campbell C., and Shawe-Taylor J. 1998.
Proceedings of the Annual Conference on Computational
Multiplicative updatings for support vector learning. NeuroCOLT
Learning Theory. ACM Press, Pittsburgh, PA, pp. 144–152.
Technical Re- port NC-TR-98-016, Royal Holloway College.
Bradley P.S., Fayyad U.M., and Mangasarian O.L. 1998. Data min-
Dantzig G.B. 1962. Linear Programming and Extensions. Princeton
ing: Overview and optimization opportunities. Technical Re-
Univ. Press, Princeton, NJ.
port 98–01, University of Wisconsin, Computer Sciences
Devroye L., Gyo¨rfi L., and Lugosi G. 1996. A Probabilistic Theory
Depart- ment, Madison, January. INFORMS Journal on
of Pattern Recognition. Number 31 in Applications of
Computing, to appear.
mathematics. Springer, New York.
Bradley P.S. and Mangasarian O.L. 1998. Feature selection via con-
Drucker H., Burges C.J.C., Kaufman L., Smola A., and Vapnik V.
cave minimization and support vector machines. In: Shavlik J.
1997. Support vector regression machines. In: Mozer M.C.,
(Ed.), Proceedings of the International Conference on Machine
Jordan M.I., and Petsche T. (Eds.), Advances in Neural
Learning, Morgan Kaufmann Publishers, San Francisco,
Information Processing Systems 9, MIT Press, Cambridge, MA,
Califor- nia, pp. 82–90. ftp://ftp.cs.wisc.edu/math-prog/tech-
pp. 155–161.
reports/98- 03.ps.Z.
Efron B. 1982. The jacknife, the bootstrap, and other resampling plans.
Bunch J.R. and Kaufman L. 1977. Some stable methods for calculat-
SIAM, Philadelphia.
ing inertia and solving symmetric linear systems. Mathematics
Efron B. and Tibshirani R.J. 1994. An Introduction to the Bootstrap.
of Computation 31: 163–179.
Chapman and Hall, New York.
Bunch J.R. and Kaufman L. 1980. A computational method for the
El-Bakry A., Tapia R., Tsuchiya R., and Zhang Y. 1996. On the
indefinite quadratic programming problem. Linear Algebra and
formula- tion and theory of the Newton interior-point method for
Its Applications, pp. 341–370, December.
nonlinear programming. J. Optimization Theory and
Bunch J.R., Kaufman L., and Parlett B. 1976. Decomposition of a
Applications 89: 507– 541.
sym- metric matrix. Numerische Mathematik 27: 95–109.
Fletcher R. 1989. Practical Methods of Optimization. John Wiley and
Burges C.J.C. 1996. Simplified support vector decision rules. In
Sons, New York.
L. Saitta (Ed.), Proceedings of the International Conference on
Girosi F. 1998. An equivalence between sparse approximation and
Machine Learning, Morgan Kaufmann Publishers, San Mateo,
sup- port vector machines. Neural Computation 10(6): 1455–
CA, pp. 71–77.
1480.
Burges C.J.C. 1998. A tutorial on support vector machines for pattern
Girosi F., Jones M., and Poggio T. 1993. Priors, stabilizers and ba-
recognition. Data Mining and Knowledge Discovery 2(2): 121–
sis functions: From regularization to radial, tensor and additive
167.
splines. A.I. Memo No. 1430, Artificial Intelligence Laboratory,
Burges C.J.C. 1999. Geometry and invariance in kernel based
Massachusetts Institute of Technology.
methods. In Scho¨lkopf B., Burges C.J.C., and Smola A.J., (Eds.),
Guyon I., Boser B., and Vapnik V. 1993. Automatic capacity tuning
Advances in Kernel Methods—Support Vector Learning, MIT
of very large VC-dimension classifiers. In: Hanson S.J., Cowan
Press, Cam- bridge, MA, pp. 89–116.
J.D., and Giles C.L. (Eds.), Advances in Neural Information
Processing Systems 5. Morgan Kaufmann Publishers, pp. 147–
155.
Ha¨rdle W. 1990. Applied nonparametric regression, volume 19 of
Econometric Society Monographs. Cambridge University Press.
Hastie T.J. and Tibshirani R.J. 1990. Generalized Additive Models, gramming. Princeton Technical Report SOR 90–03., Dept. of
volume 43 of Monographs on Statistics and Applied Probability. Civil Engineering and Operations Research, Princeton
Chapman and Hall, London. University.
Haykin S. 1998. Neural Networks: A Comprehensive Foundation. Lustig I.J., Marsten R.E., and Shanno D.F. 1992. On implement-
2nd edition. Macmillan, New York. ing Mehrotra’s predictor-corrector interior point method for lin-
Hearst M.A., Scho¨lkopf B., Dumais S., Osuna E., and Platt J. 1998. ear programming. SIAM Journal on Optimization 2(3): 435–
Trends and controversies—support vector machines. IEEE Intel- 449.
ligent Systems 13: 18–28. MacKay D.J.C. 1991. Bayesian Methods for Adaptive Models. PhD
Herbrich R. 2002. Learning Kernel Classifiers: Theory and Algorithms.
thesis, Computation and Neural Systems, California Institute of
MIT Press. Technology, Pasadena, CA.
Huber P.J. 1972. Robust statistics: A review. Annals of Statistics Mangasarian O.L. 1965. Linear and nonlinear separation of patterns by
43: 1041. linear programming. Operations Research 13: 444–452.
Huber P.J. 1981. Robust Statistics. John Wiley and Sons, New York. Mangasarian O.L. 1968. Multi-surface method of pattern separation.
IBM Corporation. 1992. IBM optimization subroutine library guide IEEE Transactions on Information Theory IT-14: 801–807.
and reference. IBM Systems Journal, 31, SC23-0519. Mangasarian O.L. 1969. Nonlinear Programming. McGraw-Hill,
Jaakkola T.S. and Haussler D. 1999. Probabilistic kernel regression New York.
models. In: Proceedings of the 1999 Conference on AI and Mattera D. and Haykin S. 1999. Support vector machines for dy-
Statis- tics. namic reconstruction of a chaotic system. In: Scho¨lkopf B., Burges
Joachims T. 1999. Making large-scale SVM learning practical. C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—
In: Scho¨lkopf B., Burges C.J.C., and Smola A.J. (Eds.), Ad- Support Vector Learning, MIT Press, Cambridge, MA, pp. 211–
vances in Kernel Methods—Support Vector Learning, MIT Press, 242.
Cambridge, MA, pp. 169–184. McCormick G.P. 1983. Nonlinear Programming: Theory,
Karush W. 1939. Minima of functions of several variables with Algorithms, and Applications. John Wiley and Sons, New York.
inequal- ities as side constraints. Master’s thesis, Dept. of Megiddo N. 1989. Progressin Mathematical Programming, chapter
Mathematics, Univ. of Chicago. Pathways to the optimal set in linear programming, Springer,
Kaufman L. 1999. Solving the quadratic programming problem arising New York, NY, pp. 131–158.
in support vector classification. In: Scho¨lkopf B., Burges Mehrotra S. and Sun J. 1992. On the implementation of a (primal-
C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods— dual) interior point method. SIAM Journal on Optimization
Support Vector Learning, MIT Press, Cambridge, MA, pp. 147– 2(4): 575– 601.
168 Mercer J. 1909. Functions of positive and negative type and their
Keerthi S.S., Shevade S.K., Bhattacharyya C., and Murthy K.R.K. 1999. con- nection with the theory of integral equations. Philosophical
Improvements to Platt’s SMO algorithm for SVM classifier design. Trans- actions of the Royal Society, London A 209: 415–446.
Technical Report CD-99-14, Dept. of Mechanical and Micchelli C.A. 1986. Algebraic aspects of interpolation. Proceedings
Production Engineering, Natl. Univ. Singapore, Singapore. of Symposia in Applied Mathematics 36: 81–102.
Keerthi S.S., Shevade S.K., Bhattacharyya C., and Murty K.R.K. Morozov V.A. 1984. Methods for Solving Incorrectly Posed Problems.
2001. Improvements to platt’s SMO algorithm for SVM classifier Springer.
design. Neural Computation 13: 637–649. Mu¨ller K.-R., Smola A., Ra¨tsch G., Scho¨lkopf B., Kohlmorgen J., and
Kimeldorf G.S. and Wahba G. 1970. A correspondence between Vapnik V. 1997. Predicting time series with support vector ma-
Bayesian estimation on stochastic processes and smoothing by chines. In: Gerstner W., Germond A., Hasler M., and Nicoud J.-
splines. Annals of Mathematical Statistics 41: 495–502. D. (Eds.), Artificial Neural Networks ICANN’97, Berlin.
Kimeldorf G.S. and Wahba G. 1971. Some results on Tchebycheffian Springer Lecture Notes in Computer Science Vol. 1327 pp.
spline functions. J. Math. Anal. Applic. 33: 82–95. 999–1004.
Kowalczyk A. 2000. Maximal margin perceptron. In: Smola A.J., Murtagh B.A. and Saunders M.A. 1983. MINOS 5.1 user’s guide.
Bartlett P.L., Scho¨lkopf B., and Schuurmans D. (Eds.), Tech- nical Report SOL 83-20R, Stanford University, CA, USA,
Advances in Large Margin Classifiers, MIT Press, Cambridge, Revised 1987.
MA, pp. 75– 113. Neal R. 1996. Bayesian Learning in Neural Networks. Springer.
Kuhn H.W. and Tucker A.W. 1951. Nonlinear programming. In: Nilsson N.J. 1965. Learning machines: Foundations of Trainable
Proc. 2nd Berkeley Symposium on Mathematical Statistics and Pattern
Proba- bilistics, Berkeley. University of California Press, pp. Classifying Systems. McGraw-Hill.
481–492. Nyquist. H. 1928. Certain topics in telegraph transmission theory.
Lee Y.J. and Mangasarian O.L. 2001. SSVM: A smooth support
Trans. A.I.E.E., pp. 617–644.
vector machine for classification. Computational optimization
Osuna E., Freund R., and Girosi F. 1997. An improved training algo-
and Ap- plications 20(1): 5–22.
rithm for support vector machines. In Principe J., Gile L.,
Li M. and Vita´nyi P. 1993. An introduction to Kolmogorov Complexity
Morgan N., and Wilson E. (Eds.), Neural Networks for Signal
and its applications. Texts and Monographs in Computer
Processing VII—Proceedings of the 1997 IEEE Workshop, pp.
Science. Springer, New York.
276–285, New York, IEEE.
Lin C.J. 2001. On the convergence of the decomposition method for
Osuna E. and Girosi F. 1999. Reducing the run-time complexity in
support vector machines. IEEE Transactions on Neural
support vector regression. In: Scho¨lkopf B., Burges C.J.C., and
Networks 12(6): 1288–1298.
Smola A. J. (Eds.), Advances in Kernel Methods—Support
Lustig I.J., Marsten R.E., and Shanno D.F. 1990. On implementing
Vector Learning, pp. 271–284, Cambridge, MA, MIT Press.
Mehrotra’s predictor-corrector interior point method for linear
Ovari Z. 2000. Kernels, eigenvalues and support vector machines.
pro-
Hon- ours thesis, Australian National University, Canberra.
Platt J. 1999. Fast training of support vector machines using sequen- Gaussian kernels to radial basis function classifiers. IEEE
tial minimal optimization. In: Scho¨lkopf B., Burges C.J.C., and Trans- actions on Signal Processing, 45: 2758–2765.
Smola A.J. (Eds.) Advances in Kernel Methods—Support Shannon C.E. 1948. A mathematical theory of communication. Bell
Vector Learning, pp. 185–208, Cambridge, MA, MIT Press. System Technical Journal, 27: 379–423, 623–656.
Poggio T. 1975. On optimal nonlinear associative recall. Biological Shawe-Taylor J., Bartlett P.L., Williamson R.C., and Anthony M.
Cybernetics, 19: 201–209. 1998. Structural risk minimization over data-dependent hierar-
Rasmussen C. 1996. Evaluation of Gaussian Processes and chies. IEEE Transactions on Information Theory, 44(5): 1926–
Other Methods for Non-Linear Regression. PhD thesis, 1940.
Department of Computer Science, University of Toronto, Smola A., Murata N., Scho¨lkopf B., and Mu¨ller K.-R. 1998a.
ftp://ftp.cs.toronto.edu/pub/carl/thesis.ps.gz. Asymp- totically optimal choice of s-loss for support vector
Rissanen J. 1978. Modeling by shortest data description. Automatica,
machines. In: Niklasson L., Bode´n M., and Ziemke T. (Eds.)
14: 465–471.
Proceed- ings of the International Conference on Artificial
Saitoh S. 1988. Theory of Reproducing Kernels and its Applications.
Neural Net- works, Perspectives in Neural Computing, pp. 105–
Longman Scientific & Technical, Harlow, England.
110, Berlin, Springer.
Saunders C., Stitson M.O., Weston J., Bottou L., Scho¨lkopf B., and
Smola A., Scho¨lkopf B., and Mu¨ller K.-R. 1998b. The connection
Smola A. 1998. Support vector machine—reference manual. Tech-
be- tween regularization operators and support vector kernels.
nical Report CSD-TR-98-03, Department of Computer Science,
Neural Networks, 11: 637–649.
Royal Holloway, University of London, Egham, UK. SVM
Smola A., Scho¨lkopf B., and Mu¨ller K.-R. 1998c. General cost
avail- able at http://svm.dcs.rhbnc.ac.uk/.
func- tions for support vector regression. In: Downs T.,
Schoenberg I. 1942. Positive definite functions on spheres. Duke
Frean M., and Gallagher M. (Eds.) Proc. of the Ninth Australian
Math. J., 9: 96–108.
Conf. on Neural Networks, pp. 79–83, Brisbane, Australia.
Scho¨lkopf B. 1997. Support Vector Learning. R. Oldenbourg
University of Queensland.
Verlag, Mu¨nchen. Doktorarbeit, TU Berlin. Download:
Smola A., Scho¨lkopf B., and Ra¨tsch G. 1999. Linear programs for
http://www.kernel-machines.org.
automatic accuracy control in regression. In: Ninth International
Scho¨lkopf B., Burges C., and Vapnik V. 1995. Extracting support
Conference on Artificial Neural Networks, Conference Publica-
data for a given task. In: Fayyad U.M. and Uthurusamy R.
tions No. 470, pp. 575–580, London. IEE.
(Eds.), Pro- ceedings, First International Conference on
Smola. A.J. 1996. Regression estimation with support vector learning
Knowledge Discovery & Data Mining, Menlo Park, AAAI
machines. Diplomarbeit, Technische Universita¨t Mu¨nchen.
Press.
Smola A.J. 1998. Learning with Kernels. PhD thesis, Technische
Scho¨lkopf B., Burges C., and Vapnik V. 1996. Incorporating
Uni- versita¨t Berlin. GMD Research Series No. 25.
invariances in support vector learning machines. In: von der
Smola A.J., Elisseeff A., Scho¨lkopf B., and Williamson R.C. 2000.
Malsburg C., von Seelen W., Vorbru¨ggen J. C., and Sendhoff
Entropy numbers for convex combinations and MLPs. In Smola
B. (Eds.), Artificial Neural Networks ICANN’96, pp. 47–52,
A.J., Bartlett P.L., Scho¨lkopf B., and Schuurmans D. (Eds.)
Berlin, Springer Lecture Notes in Computer Science, Vol. 1112.
Ad- vances in Large Margin Classifiers, MIT Press, Cambridge,
Scho¨lkopf B., Burges C.J.C., and Smola A.J. 1999a. (Eds.) Ad-
MA,
vances in Kernel Methods—Support Vector Learning. MIT Press, pp. 369–387.
Cambridge, MA. Smola A.J., O´ va´ri Z.L., and Williamson R.C. 2001. Regularization
Scho¨lkopf B., Herbrich R., Smola A.J., and Williamson R.C. 2001. with dot-product kernels. In: Leen T.K., Dietterich T.G., and
A generalized representer theorem. Technical Report 2000-81, Tresp V. (Eds.) Advances in Neural Information Processing
Neu- roCOLT, 2000. To appear in Proceedings of the Annual
Systems 13, MIT Press, pp. 308–314.
Conference on Learning Theory, Springer (2001).
Smola A.J. and Scho¨lkopf B. 1998a. On a kernel-based method for
Scho¨lkopf B., Mika S., Burges C., Knirsch P., Mu¨ller K.-R., Ra¨tsch
pattern recognition, regression, approximation and operator in-
G., and Smola A. 1999b. Input space vs. feature space in kernel-
version. Algorithmica, 22: 211–231.
based methods. IEEE Transactions on Neural Networks, 10(5):
Smola A.J. and Scho¨lkopf B. 1998b. A tutorial on support vector re-
1000– 1017.
gression. NeuroCOLT Technical Report NC-TR-98-030, Royal
Scho¨lkopf B., Platt J., Shawe-Taylor J., Smola A.J. , and Williamson
Holloway College, University of London, UK.
R.C. 2001. Estimating the support of a high-dimensional
Smola A.J. and Scho¨lkopf B. 2000. Sparse greedy matrix
distribution. Neural Computation, 13(7): 1443–1471.
approximation for machine learning. In: Langley P. (Ed.),
Scho¨lkopf B., Simard P., Smola A., and Vapnik V. 1998a. Prior
Proceedings of the In- ternational Conference on Machine
knowl- edge in support vector kernels. In: Jordan M.I., Kearns
Learning, Morgan Kaufmann Publishers, San Francisco, pp.
M.J., and Solla S.A. (Eds.) Advances in Neural Information
911–918.
Processing Sys- tems 10, MIT Press. Cambridge, MA, pp. 640–
Stitson M., Gammerman A., Vapnik V., Vovk V., Watkins C., and
646.
Weston J. 1999. Support vector regression with ANOVA
Scho¨lkopf B., Smola A., and Mu¨ller K.-R. 1998b. Nonlinear
decom- position kernels. In: Scho¨lkopf B., Burges C.J.C., and
compo- nent analysis as a kernel eigenvalue problem. Neural
Smola A.J. (Eds.), Advances in Kernel Methods—Support
Computation, 10: 1299–1319.
Vector Learning, MIT Press Cambridge, MA, pp. 285–292.
Scho¨lkopf B., Smola A., Williamson R.C., and Bartlett P.L. 2000. Stone C.J. 1985. Additive regression and other nonparametric models.
New support vector algorithms. Neural Computation, 12: 1207– Annals of Statistics, 13: 689–705.
1245.
Stone M. 1974. Cross-validatory choice and assessment of statistical
Scho¨lkopf B. and Smola A.J. 2002. Learning with Kernels. MIT
predictors (with discussion). Journal of the Royal Statistical
Press. Scho¨lkopf B., Sung K., Burges C., Girosi F., Niyogi P.,
Soci- ety, B36: 111–147.
Poggio T., and Vapnik V. 1997. Comparing support vector
machines with
Street W.N. and Mangasarian O.L. 1995. Improved generalization Vapnik V.N. 1982. Estimation of Dependences Based on Empirical
via tolerant training. Technical Report MP-TR-95-11, Data. Springer, Berlin.
University of Wisconsin, Madison. Vapnik V.N. and Chervonenkis A.Y. 1971. On the uniform convergence
Tikhonov A.N. and Arsenin V.Y. 1977. Solution of Ill-posed problems.
of relative frequencies of events to their probabilities. Theory of
V. H. Winston and Sons. Probability and its Applications, 16(2): 264–281.
Tipping M.E. 2000. The relevance vector machine. In: Solla S.A., Wahba G. 1980. Spline bases, regularization, and generalized
Leen T.K., and Mu¨ller K.-R. (Eds.), Advances in Neural Information cross-validation for solving approximation problems with large
Processing Systems 12, MIT Press, Cambridge, MA, pp. 652–658. quantities of noisy data. In: Ward J. and Cheney E. (Eds.), Proceed-
Vanderbei R.J. 1994. LOQO: An interior point code for quadratic ings of the International Conference on Approximation theory in
pro- gramming. TR SOR-94-15, Statistics and Operations Research, honour of George Lorenz, Academic Press, Austin, TX, pp. 8–10.
Princeton Univ., NJ.
Wahba G. 1990. Spline Models for Observational Data, volume 59 of
Vanderbei R.J. 1997. LOQO user’s manual—version 3.10. Technical CBMS-NSF Regional Conference Series in Applied
Report SOR-97-08, Princeton University, Statistics and Oper- Mathematics. SIAM, Philadelphia.
ations Research, Code available at http://www.princeton.edu/ Wahba G. 1999. Support vector machines, reproducing kernel Hilbert
˜rvdb/. spaces and the randomized GACV. In: Scho¨lkopf B., Burges
Vapnik V. 1995. The Nature of Statistical Learning Theory. Springer, C.J.C., and Smola A.J. (Eds.), Advances in Kernel Methods—
New York. Support Vector Learning, MIT Press, Cambridge, MA. pp. 69–88.
Vapnik V. 1998. Statistical Learning Theory. John Wiley and Sons, Weston J., Gammerman A., Stitson M., Vapnik V., Vovk V., and
New York. Watkins
Vapnik. V. 1999. Three remarks on the support vector method of C. 1999. Support vector density estimation. In: Scho¨lkopf B.,
function estimation. In: Scho¨lkopf B., Burges C.J.C., and Smola Burges C.J.C., and Smola A.J. (Eds.) Advances in Kernel
A.J. (Eds.), Advances in Kernel Methods—Support Vector Methods—Support Vector Learning, MIT Press, Cambridge,
Learning, MIT Press, Cambridge, MA, pp. 25–42. MA. pp. 293–306.
Vapnik V. and Chervonenkis A. 1964. A note on one class of Williams C.K.I. 1998. Prediction with Gaussian processes: From linear
perceptrons. Automation and Remote Control, 25. regression to linear prediction and beyond. In: Jordan M.I. (Ed.),
Vapnik V. and Chervonenkis A. 1974. Theory of Pattern Recognition Learning and Inference in Graphical Models, Kluwer Academic,
[in Russian]. Nauka, Moscow. (German Translation: Wapnik pp. 599–621.
W. & Tscherwonenkis A., Theorie der Zeichenerkennung, Williamson R.C., Smola A.J., and Scho¨lkopf B. 1998.
Akademie-Verlag, Berlin, 1979). Generalization performance of regularization networks and
Vapnik V., Golowich S., and Smola A. 1997. Support vector method support vector machines via entropy numbers of compact
for function approximation, regression estimation, and signal operators. Technical Report 19, NeuroCOLT,
processing. In: Mozer M.C., Jordan M.I., and Petsche T. (Eds.) http://www.neurocolt.com. Published in IEEE Transactions on
Advances in Neural Information Processing Systems 9, MA, Information Theory, 47(6): 2516–2532 (2001).
MIT Press, Cambridge. pp. 281–287. Yuille A. and Grzywacz N. 1988. The motion coherence theory.
Vapnik V. and Lerner A. 1963. Pattern recognition using generalized In: Proceedings of the International Conference on Computer
portrait method. Automation and Remote Control, 24: 774–780. Vision, IEEE Computer Society Press, Washington, DC, pp. 344–
354.