Introduction to Support Vector Machines
• Mathematical Background and Key Features
Overview
• Introduction
• Mathematical background
- Maximum margin classifier (hard margin)
- Maximum margin classifier (soft margin)
- Kernel trick for nonlinear classifiers
- Multiclass classifications
- Extension to regression (SVR)
• Optimization of parameters
• Practical application
Introduction to SVMs
History:
- Duda and Hart (1973) discuss large margin hyperplanes in the input space
- SVMs can be said to have started when statistical learning theory was developed
further with Vapnik (1979) (in Russian)
- Poggio and Girosi (1990) and Wahba (1990) discuss the use of kernels
- SVMs close to their current form were first introduced with a conference paper (Boser,
Guyon and Vapnik 1992)
- The soft margin classifier was introduced by Cortes and Vapnik (1995) and extended to
the case of regression by Vapnik (1995) in The Nature of Statistical Learning Theory
Key Features
- “Maximize Margins”
- “Penalize Noisy Points”
- “Construct Kernels”
- “Sparseness of Solution”
Maximum margin classifier (hard margin)

Formulation of hyperplane:
Consider first the 2-class problem and assume the data from the two classes can be separated by a hyperplane decision boundary. The separating hyperplane (Fig. 8.3) is given by the equation
    y(x) = w^T x + w_0 = 0.    (8.23)
Any two points x_1 and x_2 lying on the hyperplane y = 0 satisfy
    w^T (x_1 − x_2) = 0,    (8.24)
hence the unit vector
    ŵ = w / ||w||    (8.25)
is normal to the y = 0 hyperplane surface. Any point x_0 on the y = 0 surface satisfies w^T x_0 = −w_0, and the component of the vector x − x_0 projected onto the ŵ direction is
    ŵ^T (x − x_0) = (w^T x + w_0) / ||w|| = y(x) / ||w||.    (8.26)
Thus y(x) is proportional to the normal distance between x and the hyperplane y = 0, and the sign of y(x) indicates on which side of the hyperplane x lies.
[Figure 8.3: The hyperplane y = 0, with the vector w normal to it; the component of x − x_0 projected onto the ŵ direction is shown.]

Let the training dataset be composed of predictors x_n and target data y_dn (n = 1, ..., N). Since there are 2 classes, y_dn takes on the value −1 or +1. When a new predictor vector x becomes available, it is classified according to the sign of the function
    y(x) = w^T x + w_0.    (8.27)

Distance to maximize:
The normal distance from a point x_n to the decision boundary y = 0 is, according to (8.26), given by
    y_dn y(x_n) / ||w|| = y_dn (w^T x_n + w_0) / ||w||,    (8.28)
where y_dn contributes the correct sign to ensure that the distance is non-negative. The margin (Fig. 8.4a) is given by the distance of the closest point(s) x_n in the dataset. A maximum margin classifier determines the decision boundary y = 0 by maximizing the margin l, through searching for the optimal values of w and w_0:
    max_{w, w_0} l,   subject to   y_dn (w^T x_n + w_0) / ||w|| ≥ l,   (n = 1, ..., N),    (8.29)
where the constraint simply ensures that no data points lie inside the margins. Obviously the margin is determined by relatively few points in the dataset; these points, circled in Fig. 8.4a, are called support vectors.
[Figure 8.4: (a) A dataset containing two classes (shown by round dots and squares) separable by a hyperplane decision boundary y = 0. The margin is maximized, and the support vectors, i.e. the data points used in determining the margins y = ±1, are circled. (b) A dataset not separable by a hyperplane boundary.]

Primal form of optimization problem:
Since the distance (8.28) is unchanged if we multiply w and w_0 by an arbitrary scale factor s, we are free to choose ||w||. If we choose ||w|| = l⁻¹, the constraint in (8.29) becomes
    y_dn (w^T x_n + w_0) ≥ 1,   (n = 1, ..., N),    (8.30)
and maximizing the margin l becomes equivalent to minimizing ||w||, i.e. to solving
    min_{w, w_0}  (1/2) ||w||²    (8.31)
subject to the constraint (8.30). This optimization of a quadratic function subject to inequality constraints is referred to as a quadratic programming problem (Gill et al., 1981).

Dual form of optimization problem:
Optimization with inequality constraints is handled by a generalization of the Lagrange multiplier method (Karush, 1939; Kuhn and Tucker, 1951). With a Lagrange multiplier λ_n ≥ 0 introduced for each of the N constraints in (8.30) (see Appendix B), the Lagrange function takes the form
    L(w, w_0, λ) = (1/2) ||w||² − Σ_n λ_n [ y_dn (w^T x_n + w_0) − 1 ],    (8.32)
with λ = (λ_1, ..., λ_N)^T. Setting the derivatives of L with respect to w and w_0 to zero yields, respectively,
    w = Σ_n λ_n y_dn x_n,    (8.33)
    0 = Σ_n λ_n y_dn.    (8.34)
Substituting these into (8.32) allows L to be expressed solely in terms of the Lagrange multipliers, i.e.
    L_D(λ) = Σ_n λ_n − (1/2) Σ_n Σ_j λ_n λ_j y_dn y_dj x_n^T x_j,    (8.35)
to be maximized subject to λ_n ≥ 0 and (8.34), with L_D referred to as the dual Lagrangian. If m is the dimension of x and w, the primal problem (8.31) is a quadratic programming problem with about m variables, whereas the dual problem (8.35) is a quadratic programming problem with N variables from λ. The main reason for working with the dual is that, in the next stage, to generalize the linear SVM classifier to a nonlinear classifier, SVM will invoke kernels (see Chapter 7): x will be replaced by φ(x) in a feature space of dimension M, hence m will be replaced by M, which is usually much larger or even infinite.

The constrained optimization problem satisfies the Karush-Kuhn-Tucker (KKT) conditions (Appendix B):
    λ_n ≥ 0,    (8.36)
    y_dn (w^T x_n + w_0) − 1 ≥ 0,    (8.37)
    λ_n [ y_dn (w^T x_n + w_0) − 1 ] = 0.    (8.38)

Classification of new data point:
Combining (8.27) and (8.33) gives the formula for classifying a new data point x,
    y(x) = Σ_n λ_n y_dn x_n^T x + w_0,    (8.39)
where the class (+1 or −1) is decided by the sign of y(x). The KKT condition (8.38) implies that for every data point x_n, either λ_n = 0 or y_dn y(x_n) = 1, so only the support vectors (with λ_n > 0, lying exactly on the margins) contribute to the sum in (8.39), giving a sparse solution.
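A minimal numerical sketch of the hard-margin machinery above (the toy dataset and the use of SciPy's SLSQP solver are assumptions for illustration, not from the slides): maximize the dual L_D in (8.35) subject to (8.34) and λ_n ≥ 0, recover w from (8.33), and obtain w_0 from y_dn (w^T x_n + w_0) = 1 on the support vectors.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (assumed): predictors x_n and targets y_dn = +/-1.
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, -1.0], [-1.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                                   # Gram matrix of inner products x_n^T x_j

def neg_dual(lam):                            # negative dual Lagrangian -L_D(lambda), eq. (8.35)
    v = lam * y
    return -(lam.sum() - 0.5 * v @ K @ v)

res = minimize(neg_dual, np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                             # lambda_n >= 0, (8.36)
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])  # sum_n lambda_n y_dn = 0, (8.34)

lam = res.x
w = (lam * y) @ X                             # (8.33): w = sum_n lambda_n y_dn x_n
sv = lam > 1e-6                               # support vectors have lambda_n > 0
w0 = np.mean(y[sv] - X[sv] @ w)               # from y_dn (w^T x_n + w_0) = 1 on the support vectors
print("w =", w.round(3), " w0 =", round(w0, 3), " lambda =", lam.round(3))
```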
Maximum margin classifier (soft margin)

Introduction of slack variables to penalize misclassifications:
For a dataset not separable by a hyperplane (Fig. 8.4b), slack variables ξ_n ≥ 0 are introduced, with ξ_n = 0 for data points lying on or within the correct margin and ξ_n > 0 for points protruding beyond it. A data point lying right on the decision boundary (y(x_n) = 0) has ξ_n = 1, while ξ_n > 1 corresponds to lying on the wrong side of the decision boundary, i.e. misclassified. Points with 0 < ξ_n ≤ 1 protrude beyond the correct margin, but not enough to cross the decision boundary and be misclassified. The constraint (8.30) is modified to allow for data points extending beyond the correct margin, i.e.
    y_dn (w^T x_n + w_0) ≥ 1 − ξ_n,   (n = 1, ..., N).    (8.42)

Primal form of optimization problem:
The optimization problem (8.31) is modified to
    min_{w, w_0}  (1/2) ||w||² + C Σ_n ξ_n.    (8.43)
If we divide this expression by C, the second term can be viewed as an error term and the first term as a weight penalty term, with C⁻¹ as the weight penalty parameter (analogous to the regularization of NN models in Section 6.5, where the objective function is a mean square error (MSE) term plus a weight penalty term). The effect of misclassification on the objective function is only linearly related to ξ_n in (8.43), in contrast to the MSE, which is quadratic. Since any misclassified point has ξ_n > 1, Σ_n ξ_n can be viewed as providing an upper bound on the number of misclassified points.

To optimize (8.43) subject to the constraints (8.42) and ξ_n ≥ 0, we again turn to the method of Lagrange multipliers (Appendix B), where the Lagrange function is now
    L = (1/2) ||w||² + C Σ_n ξ_n − Σ_n λ_n [ y_dn (w^T x_n + w_0) − 1 + ξ_n ] − Σ_n μ_n ξ_n,    (8.44)
with λ_n ≥ 0 and μ_n ≥ 0 (n = 1, ..., N) the Lagrange multipliers. Setting the derivatives of L with respect to w, w_0 and ξ_n to zero yields, respectively,
    w = Σ_n λ_n y_dn x_n,    (8.45)
    0 = Σ_n λ_n y_dn,    (8.46)
    λ_n = C − μ_n.    (8.47)

Dual form of optimization problem:
Substituting these into (8.44) again allows L to be expressed solely in terms of the Lagrange multipliers λ, i.e.
    L_D(λ) = Σ_n λ_n − (1/2) Σ_n Σ_j λ_n λ_j y_dn y_dj x_n^T x_j,    (8.48)
with L_D the dual Lagrangian. L_D has the same form as in the separable case, but the constraints are somewhat changed: since λ_n ≥ 0 and μ_n ≥ 0, (8.47) implies
    0 ≤ λ_n ≤ C.
Furthermore, the Karush-Kuhn-Tucker conditions apply:
    λ_n ≥ 0,    y_dn (w^T x_n + w_0) − 1 + ξ_n ≥ 0,    λ_n [ y_dn (w^T x_n + w_0) − 1 + ξ_n ] = 0,
    μ_n ≥ 0,    ξ_n ≥ 0,    μ_n ξ_n = 0   (i.e. (C − λ_n) ξ_n = 0),
for n = 1, ..., N.
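A hedged sketch of the role of C in (8.43), using scikit-learn's SVC (which wraps LIBSVM, the code cited later in these slides); the overlapping toy dataset and the C values are assumptions. A small C (i.e. a large weight penalty C⁻¹) tolerates more margin violations, while a large C penalizes the slack variables heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes (assumed toy data), so slack variables are unavoidable.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):          # C^-1 plays the role of the weight penalty parameter
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>6}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```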
Kernel trick for nonlinear classifiers

Kernels (Section 7.3):
In the kernel method, a feature map φ(x) maps from the input space X to the feature space F. For instance, if x ∈ X = R^m, then φ(x) ∈ F ⊆ R^M, with
    φ^T(x) = [φ_1(x), ..., φ_M(x)].    (7.22)
(Where it is convenient to incorporate a constant element x_0 ≡ 1, e.g. to deal with the constant weight w_0, one also adds a constant element φ_0(x) ≡ 1 to φ(x).) With an appropriate choice of the nonlinear mapping functions φ, the relation in the high dimensional space F becomes linear.

In Section 7.2 we had G = XX^T and k_i = x_i^T x̃. Performing regression in the feature space instead, we need to work with
    G_ij = φ^T(x_i) φ(x_j),    (7.23)
    k_i = φ^T(x_i) φ(x̃).    (7.24)
If evaluating φ(x) requires O(M) operations, then φ^T(x_i)φ(x_j) is still of O(M), but (7.23) has to be done n² times as G is an n × n matrix, thus requiring a total of O(n²M) operations. To get α from (7.18), computing the inverse of the n × n matrix (G + pI) takes O(n³) operations, thus a total of O(n²M + n³) operations is needed. Linear regression performed in the feature space with a new datum x̃, (7.20), now becomes
    ỹ = Σ_i α_i φ^T(x_i) φ(x̃),    (7.25)
which requires O(nM) operations.

To save computation costs, instead of evaluating φ explicitly we introduce a kernel function K to evaluate the inner product (i.e. dot product) in the feature space,
    K(x, z) ≡ φ^T(x) φ(z) = Σ_l φ_l(x) φ_l(z),    (7.26)
for x, z in the input space. Since K(x, z) = K(z, x), K is a symmetric function. The key to the kernel trick: if an algorithm in the input space can be formulated involving only inner products, then the algorithm can be solved in the feature space with the kernel function evaluating the inner products. Although the algorithm may only be solving a linear problem in the feature space, it is equivalent to solving a nonlinear problem in the input space.

Example: for x = (x_1, x_2) ∈ R², consider the feature map
    φ(x) = (x_1², √2 x_1 x_2, x_2²)^T ∈ R³.    (7.27)
Regression performed in the feature space is then of the form
    y = a_0 + a_1 x_1² + a_2 √2 x_1 x_2 + a_3 x_2²,    (7.28)
so quadratic relations with the inputs x become linear relations in the feature space. For this φ,
    K(x, z) = φ^T(x) φ(z) = x_1² z_1² + 2 x_1 x_2 z_1 z_2 + x_2² z_2² = (x^T z)².    (7.29)
Hence the feature-space inner product can be computed as (x^T z)² directly in the input space, without evaluating φ explicitly.

Nonlinear SVM classification (Section 8.4):
To go from a linear to a nonlinear classifier, x is replaced by φ(x), giving
    y(x) = w^T φ(x) + w_0,    (8.53)
and the dual Lagrangian becomes
    L_D(λ) = Σ_n λ_n − (1/2) Σ_n Σ_j λ_n λ_j y_dn y_dj φ^T(x_n) φ(x_j),    (8.54)
with classification of a new data point x based on the sign of
    y(x) = Σ_n λ_n y_dn φ^T(x_n) φ(x) + w_0.    (8.55)
Since the dimension of the feature space can be very high or even infinite, computations involving the inner product φ^T(x)φ(x′) are only practical because of the kernel trick (Section 7.3), where a kernel function
    K(x, x′) ≡ φ^T(x) φ(x′)    (8.56)
is introduced to obviate the direct computation of the inner product. Commonly used kernel functions include the polynomial kernel of degree p,
    K(x, x′) = (1 + x^T x′)^p,    (8.57)
and the Gaussian or radial basis function (RBF) kernel,
    K(x, x′) = exp( −||x − x′||² / (2σ²) ).    (8.58)
Under the kernel approach, (8.55) becomes
    y(x) = Σ_n λ_n y_dn K(x_n, x) + w_0,    (8.59)
and the dual problem
    L_D(λ) = Σ_n λ_n − (1/2) Σ_n Σ_j λ_n λ_j y_dn y_dj K(x_n, x_j)    (8.60)
is again a quadratic programming problem. Classification is based on the sign of y(x), with y(x) modified from (8.39) to (8.59). A nonlinear kernel can produce convoluted, discontiguous hyperplanes in the input space.

Since the objective function is only quadratic and the constraints are linear, there is no local minima problem, unlike NN models: the minimum found by SVM is the global minimum. The parameter C and the parameter σ (assuming the RBF kernel (8.58) is used) are not obtained from the optimization. To determine these two hyperparameters, one usually trains multiple models with various values of C and σ and, from their classification performance over validation data, determines the best values of C and σ. Fig. 8.5 illustrates the classification of noiseless and moderately noisy data by SVM, with the SVM code from LIBSVM (Chang and Lin, 2001). SVM with the Gaussian kernel is similar in structure to the RBF NN (Section 4.6), but SVM appears to perform better (Schölkopf et al., 1997).
[Figure 8.5: The SVM classifier applied to (a) the noiseless dataset of Fig. 8.1 and (b) the moderately noisy dataset of Fig. 8.2.]
Multiclass classifications
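The usual reductions of a multiclass problem to binary SVMs are one-versus-rest and one-versus-one; as a minimal hedged sketch (the iris data and the scikit-learn calls are assumptions, not from the slides), SVC trains the k(k−1)/2 one-versus-one classifiers internally and can expose one-versus-rest shaped decision values:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)       # 3 classes, so a multiclass strategy is needed

# One-vs-one internally: k(k-1)/2 = 3 binary SVMs for k = 3 classes;
# decision_function_shape="ovr" reports one-vs-rest shaped scores per class.
clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovr").fit(X, y)
print(clf.predict(X[:5]))
print(clf.decision_function(X[:5]).shape)   # (5, 3): one score per class
```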
Extension to regression (SVR)

Epsilon-insensitive error norm:
Support vector regression (SVR) uses an error function that ignores errors smaller than ε, the ε-insensitive error function
    E_ε(z) = |z| − ε   if |z| > ε,   and   0   otherwise.    (9.2)
[Figure 9.1: The ε-insensitive error function E_ε(z); the dashed line shows the mean absolute error (MAE) function.]

Slack variables are introduced as in SVM classification (Section 8.4), except that for each data point x_n there are two slack variables ξ_n ≥ 0 and ξ′_n ≥ 0. We assign ξ_n > 0 only to data points lying above the ε-tube in Fig. 9.2, i.e. y_dn > y(x_n) + ε, and ξ′_n > 0 only to data points lying below the ε-tube, i.e. y_dn < y(x_n) − ε. The conditions for a data point to lie within the ε-tube can then be extended via the slack variables to allow for data points lying outside the tube, yielding the conditions
    y_dn ≤ y(x_n) + ε + ξ_n,    (9.3)
    y_dn ≥ y(x_n) − ε − ξ′_n.    (9.4)
[Figure 9.2: A schematic diagram of SVR. Data points lying within the "ε tube" (between y − ε and y + ε) are ignored by the ε-insensitive error function; the commonly used term "ε tube" is actually a misnomer, since with a multi-dimensional predictor the "tube" is a slab of thickness 2ε. For data points lying above and below the tube, the distances from the tube are given by the slack variables ξ and ξ′, respectively. Points inside (or right on) the tube have ξ = 0 = ξ′; those above the tube have ξ > 0 and ξ′ = 0, while those below have ξ = 0 and ξ′ > 0.]

Primal form of optimization problem:
The objective function to be minimized, subject to the constraints ξ_n ≥ 0, ξ′_n ≥ 0, (9.3) and (9.4), is
    J = C Σ_n (ξ_n + ξ′_n) + (1/2) ||w||²,    (9.5)
where C is the inverse weight penalty parameter. To handle the constraints, Lagrange multipliers λ_n ≥ 0, λ′_n ≥ 0, μ_n ≥ 0 and μ′_n ≥ 0 are introduced, and the Lagrangian function is
    L = C Σ_n (ξ_n + ξ′_n) + (1/2) ||w||² − Σ_n (μ_n ξ_n + μ′_n ξ′_n)
        − Σ_n λ_n [ y(x_n) + ε + ξ_n − y_dn ] − Σ_n λ′_n [ y_dn − y(x_n) + ε + ξ′_n ].    (9.6)
Similar to SVM classification (Section 8.4), the regression is performed in the feature space, i.e.
    y(x) = w^T φ(x) + w_0,    (9.7)
where φ is the feature map. Substituting this for y(x_n) in (9.6), then setting the derivatives of L with respect to w, w_0, ξ_n and ξ′_n to zero yields, respectively,
    w = Σ_n (λ_n − λ′_n) φ(x_n),    (9.8)
    Σ_n (λ_n − λ′_n) = 0,    (9.9)
    λ_n + μ_n = C,    (9.10)
    λ′_n + μ′_n = C.    (9.11)

Dual form of optimization problem:
Substituting these into (9.6) again allows L to be expressed solely in terms of the Lagrange multipliers λ and λ′, i.e. the dual Lagrangian L_D, in which the inner products appear only through the kernel function K(x, x′) = φ^T(x)φ(x′). The optimization is now to maximize L_D subject to constraints: since all the Lagrange multipliers are ≥ 0 (Appendix B), (9.10) and (9.11) yield
    0 ≤ λ_n ≤ C,    0 ≤ λ′_n ≤ C.
The Karush-Kuhn-Tucker (KKT) conditions require the product of each Lagrange multiplier and its associated constraint to vanish:
    λ_n [ y(x_n) + ε + ξ_n − y_dn ] = 0,    λ′_n [ y_dn − y(x_n) + ε + ξ′_n ] = 0,
    (C − λ_n) ξ_n = 0,    (C − λ′_n) ξ′_n = 0,
for n = 1, ..., N. For λ_n to be nonzero, y(x_n) + ε + ξ_n − y_dn = 0 must be satisfied, i.e. the data point must lie on or above the upper boundary of the tube; similarly, for λ′_n to be nonzero the point must lie on or below the lower boundary.

Solution is:
Combining (9.7) and (9.8) and using the kernel function, the regression prediction is
    y(x) = Σ_n (λ_n − λ′_n) K(x_n, x) + w_0,
so only the data points with λ_n ≠ 0 or λ′_n ≠ 0, the support vectors (points lying on or outside the ε-tube), contribute, which retains the sparseness property of the solution.
Extension to regression
• (a) Optimal hyperparameters C, ε and σ obtained from validation
• (b) Overfitting due to both a larger C (less weight penalty) and a smaller σ (narrower Gaussian kernel)
• (c) Underfitting due to a smaller C (more weight penalty)
[Figure 9.3: SVR applied to a test problem with (a) the optimal, (b) the overfitting and (c) the underfitting hyperparameter choices described above.]
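A hedged illustration of these hyperparameters (the noisy sine data and the specific parameter values are assumptions, not taken from Figure 9.3): in scikit-learn's SVR, C and epsilon enter directly and the RBF width σ enters through gamma = 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 80)[:, None]
y = np.sin(x).ravel() + rng.normal(0.0, 0.2, size=80)        # noisy test signal (assumed)

def rbf_svr(C, sigma, epsilon=0.2):
    # gamma = 1 / (2 sigma^2) maps the RBF width sigma of (8.58) onto scikit-learn's parameter.
    return SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=1.0 / (2.0 * sigma**2)).fit(x, y)

moderate = rbf_svr(C=1.0, sigma=0.5)          # reasonable weight penalty and kernel width
overfit = rbf_svr(C=1e3, sigma=0.05)          # larger C and narrower kernel: fits the noise

for name, m in (("moderate C, sigma", moderate), ("large C, small sigma", overfit)):
    rmse = np.sqrt(np.mean((m.predict(x) - y) ** 2))
    print(f"{name}: {len(m.support_)} support vectors, training RMSE {rmse:.3f}")
```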
Optimization of parameters
• Mostly recommended: grid search (coarse grid followed by fine grid); see the sketch below
• Analytically found parameters (Cherkassky and Ma, 2004), for regression:
  C = max( |ȳ + 3σ_y|, |ȳ − 3σ_y| ),    ε = τ σ √(ln n / n)  with τ = 3,
  where ȳ and σ_y are the mean and standard deviation of the training responses, σ the (estimated) noise level and n the number of training samples;
  plus a 1-d search over σ = 2^(-10) ... 2^4 for the kernel width (Lin and Lin, 2003)
• or use of a genetic algorithm: sample from chromosome populations in shares of their fitness values
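A minimal sketch of the coarse-then-fine grid search (the toy data, the parameter ranges and the use of scikit-learn's GridSearchCV with 5-fold cross-validation are assumptions); for the RBF kernel, the width σ corresponds to gamma = 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)       # assumed nonlinear toy problem

# Coarse grid over C and gamma (powers of 2, cf. the 2^-10 ... 2^4 range quoted above).
coarse = GridSearchCV(SVC(kernel="rbf"),
                      {"C": 2.0 ** np.arange(-5, 16, 4),
                       "gamma": 2.0 ** np.arange(-10, 5, 4)},
                      cv=5).fit(X, y)
C0, g0 = coarse.best_params_["C"], coarse.best_params_["gamma"]

# Fine grid centred on the coarse optimum.
fine = GridSearchCV(SVC(kernel="rbf"),
                    {"C": C0 * 2.0 ** np.linspace(-2.0, 2.0, 9),
                     "gamma": g0 * 2.0 ** np.linspace(-2.0, 2.0, 9)},
                    cv=5).fit(X, y)
print("coarse:", coarse.best_params_)
print("fine:  ", fine.best_params_, " cv accuracy:", round(fine.best_score_, 3))
```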
&':+7!&%(27)(8!
/0(*(*1+20'0K!('*27+)%5!)&+',',/!5+)+ 1 % ( 4 ! ( 34 1 ( = ;4888 * 3 !#$%&%!%B6+72%(!+&%!(+*07%5!
,'.-&*7<B(0+:%5! /&'5! ',! )$%! ',02)! (0+:%4! +,5! !B6+72%(! +&%! /%,%&+)%5! +::-&5',/! 12
Practical Example
[Figure panels: morphologically dilated low-level RADAR Doppler (azimuthal shear) and RADAR Reflectivity fields, with manually sketched circles marking tornado areas, and the derived fuzzy Tornado Possibility field.]
• Training using 33 storm days, 20 thereof with tornadoes
• Verification: maximum lead time is 30 minutes (used to issue warnings)
Spatiotemporal Predictions of Tornado Possibilities
Comparison of SVM, Neural Network (NN) and Linear Discriminant Analysis (LDA)
[Figure 12 (bar chart): skill scores POD, FAR, CSI, Bias and HSS for the SVM, NN and LDA classifiers.]
SVMs vs. Neural Networks
“SVM have been developed in the reverse order to the development of neural
networks (NNs). SVMs evolved from the sound theory to the implementation and
experiments, while the NNs followed more heuristic path, from applications and
extensive experimentation to the theory” (Wang, 2005)
"In problems when linear decision hyperplanes are no longer feasible, an input
space is mapped into a feature space (the hidden layer in neural network
models), resulting in a nonlinear classifier.” (Kecman, 2001)
"In contrast to neural networks, SVMs automatically select their model size by
selecting the support vectors” (Rychetsky, 2001)
“SVM training always finds a global minimum, and their simple geometric
interpretation provides fertile ground for further investigation” (Burges, 1998)
“SVM with the Gaussian kernel is similar in structure to the radial basis function
neural network, but SVM appears to perform better” (Schölkopf, 1997)
References
• Vojislav Kecman, 2001: Support Vector Machines, Neural Networks and Fuzzy Logic Models, MIT Press
• Lipo Wang, 2005: Support Vector Machines: Theory and Applications, Springer
• Bernhard Schölkopf & Alex Smola, 2002: Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, MIT Press
• Nello Cristianini & John Shawe-Taylor, 2000: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press