10 SVM
The slides are from Raymond J. Mooney (ML Research Group @ Univ. of Texas)
f(x) = sign(wTx + b)
Ch. 15
Support vectors
◼ SVMs maximize the margin around the separating hyperplane.
◼ A.k.a. large margin classifiers
Geometric Margin
◼ Distance from an example to the separator is r = y(wTx + b)/|w|
◼ Examples closest to the hyperplane are support vectors.
◼ Margin ρ of the separator is the width of separation between the support vectors of the two classes.
Derivation of finding r:
◼ The dotted line x′ − x is perpendicular to the decision boundary, so it is parallel to w.
◼ The unit vector is w/|w|, so the dotted line is rw/|w|.
◼ x′ = x − yrw/|w|
◼ x′ satisfies wTx′ + b = 0.
◼ So wT(x − yrw/|w|) + b = 0
◼ Recall that |w| = sqrt(wTw).
◼ So wTx − yr|w| + b = 0
◼ Solving for r gives: r = y(wTx + b)/|w|
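As a quick check of this formula, here is a minimal NumPy sketch (not from the slides); it computes r = y(wTx + b)/|w| for the separator and point that also appear in the worked example below.

    import numpy as np

    def geometric_distance(w, b, x, y):
        # Signed distance r = y * (w.x + b) / ||w|| of a labeled point from the separator
        return y * (w @ x + b) / np.linalg.norm(w)

    w = np.array([1.0, 2.0])    # separator x1 + 2*x2 - 5.5 = 0 (see the worked example below)
    b = -5.5
    x = np.array([1.0, 1.0])    # training point with label y = -1
    print(geometric_distance(w, b, x, y=-1))   # ~1.118 = sqrt(5)/2, positive: correctly classified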
Sec. 15.1
Assume that the functional margin of each data item is at least 1; then the following two constraints follow for a training set {(xi, yi)}:
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
For support vectors, the inequality becomes an equality.
Then, since each example's distance from the hyperplane is r = y(wTx + b)/|w|, take support vectors xa and xb on the two margin hyperplanes:
wTxa + b = 1
wTxb + b = −1
(the separating hyperplane itself is wTx + b = 0)
This implies:
wT(xa − xb) = 2
ρ = ‖xa − xb‖2 = 2/‖w‖2   (taking xa − xb along the direction of w)
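A small numerical sanity check of this implication (my own sketch, with made-up w and b): construct xa on the +1 hyperplane, shift it by 2w/‖w‖2 to land on the −1 hyperplane, and confirm both identities.

    import numpy as np

    w = np.array([2.0, 1.0])      # assumed toy weight vector
    b = -3.0                      # assumed toy bias
    xa = np.array([1.0, 2.0])     # w.xa + b = 1, so xa lies on the +1 hyperplane
    xb = xa - 2 * w / (w @ w)     # shifted along w onto the -1 hyperplane

    print(w @ xa + b, w @ xb + b)                           # 1.0 -1.0
    print(w @ (xa - xb))                                    # 2.0
    print(np.linalg.norm(xa - xb), 2 / np.linalg.norm(w))   # equal: rho = 2/||w||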
Worked example: Geometric margin
◼ The maximum-margin weight vector is parallel to the shortest line connecting the two classes, here the line from (1, 1) to (2, 3).
◼ The decision boundary is normal (“perpendicular”) to it and crosses it halfway between. It passes through (1.5, 2).
◼ So y = x1 + 2x2 − 5.5
◼ Geometric margin is √5
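These numbers can be checked directly; a minimal NumPy sketch (mine, not from the slides) verifies the midpoint, the direction of w, and the margin.

    import numpy as np

    w = np.array([1.0, 2.0])                      # normal of x1 + 2*x2 - 5.5 = 0
    b = -5.5
    p_neg, p_pos = np.array([1.0, 1.0]), np.array([2.0, 3.0])
    d = p_pos - p_neg

    print(w @ np.array([1.5, 2.0]) + b)           # 0.0: the boundary passes through the midpoint
    print(w[0] * d[1] - w[1] * d[0])              # 0.0: w is parallel to (2,3) - (1,1)
    dist = lambda p: abs(w @ p + b) / np.linalg.norm(w)
    print(dist(p_neg) + dist(p_pos), np.sqrt(5))  # both ~2.236: the geometric margin is sqrt(5)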
Worked example: Functional margin
◼ Let's minimize ‖w‖ given that yi(wTxi + b) ≥ 1
◼ The constraint holds with equality at the support vectors.
◼ w = (a, 2a) for some a, so for the points (1, 1) and (2, 3):
a + 2a + b = −1
2a + 6a + b = 1
◼ So, a = 2/5 and b = −11/5
◼ The optimal hyperplane is w = (2/5, 4/5) and b = −11/5
◼ Margin ρ is 2/|w| = 2/√(4/25 + 16/25) = 2/(2√5/5) = √5
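The algebra above reduces to a 2×2 linear system; this small sketch (mine, not from the slides) solves it and recomputes the margin.

    import numpy as np

    # Support-vector equalities: 3a + b = -1 for point (1, 1), 8a + b = 1 for point (2, 3)
    A = np.array([[3.0, 1.0], [8.0, 1.0]])
    rhs = np.array([-1.0, 1.0])
    a, b = np.linalg.solve(A, rhs)
    w = np.array([a, 2 * a])

    print(a, b)                    # 0.4 -2.2, i.e. a = 2/5 and b = -11/5
    print(w)                       # [0.4 0.8]
    print(2 / np.linalg.norm(w))   # 2.236... = sqrt(5), the margin rho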
Sec. 15.1
The classifying function is: f(x) = ΣαiyixiTx + b
◼ Notice that it relies on an inner product between the test point x and the support vectors xi. We will return to this later.
◼ Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
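To make the inner-product form concrete, here is a short scikit-learn sketch (not from the slides): sklearn's dual_coef_ stores the products αiyi for the support vectors, so the decision value can be rebuilt from dot products with the support vectors alone.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 0.0], [2.0, 3.0], [3.0, 4.0]])
    y = np.array([-1, -1, 1, 1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)        # very large C ~ (nearly) hard margin

    x_test = np.array([1.0, 3.0])
    # f(x) = sum_i (alpha_i * y_i) <x_i, x> + b, summed over support vectors only
    f_manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + clf.intercept_[0]
    print(f_manual, clf.decision_function([x_test])[0])   # the two values agree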
Sec. 15.2.1
Soft margin classification
◼ Slack variables ξi allow some examples to fall inside the margin or on the wrong side of it, at a cost controlled by a penalty parameter C.
Neither slack variables ξi nor their Lagrange multipliers appear in the dual
problem!
Again, xi with non-zero αi will be support vectors.
Solution to the dual problem is:
w = Σαiyixi
b = yk(1 − ξk) − wTxk, where k = argmaxk′ αk′
w is not needed explicitly for classification:
f(x) = ΣαiyixiTx + b
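As an illustration (my own sketch, not part of the slides), the primal w can be rebuilt from a fitted soft-margin SVC; sklearn's dual_coef_ already holds αiyi, and here b is simply read from intercept_ rather than computed from the formula above.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([-1] * 20 + [1] * 20)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)              # soft margin with C = 1

    w = clf.dual_coef_[0] @ clf.support_vectors_             # w = sum_i alpha_i y_i x_i
    b = clf.intercept_[0]
    print(np.allclose(w, clf.coef_[0]))                      # True: matches sklearn's own w
    print(np.allclose(X @ w + b, clf.decision_function(X)))  # True: f(x) = w.x + b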
Sec. 15.1
◼ The most “important” training points are the support vectors; they define the hyperplane.
◼ Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
Find α1…αN such that Σαi − ½ΣΣαiαjyiyjxiTxj is maximized, with Σαiyi = 0 and 0 ≤ αi ≤ C for all αi
f(x) = ΣαiyixiTx + b
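Because only the inner products xiTxj are needed, training can be driven entirely by the Gram matrix. A small sketch (mine, using sklearn's precomputed-kernel option) makes this explicit.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-2, 1, (15, 2)), rng.normal(2, 1, (15, 2))])
    y = np.array([-1] * 15 + [1] * 15)

    gram = X @ X.T                                  # all pairwise inner products x_i . x_j
    clf = SVC(kernel="precomputed", C=1.0).fit(gram, y)

    X_new = np.array([[1.5, 2.0], [-1.0, -3.0]])
    gram_new = X_new @ X.T                          # inner products of test points with training points
    print(clf.predict(gram_new))                    # typically [ 1 -1 ] for these well-separated clusters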
Non-linear SVMs
Datasets that are linearly separable (with some noise) work out great:
[Figure: 1-D example datasets plotted along the x axis.]
Sec. 15.2.3
◼ General idea: map the data into a higher-dimensional feature space where it becomes linearly separable:
Φ: x → φ(x)
Non-linear SVMs: Feature spaces
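As an illustrative sketch (mine, not from the slides): a 1-D dataset with one class sitting between the two groups of the other class is not linearly separable, but the assumed feature map φ(x) = (x, x²) makes it separable in 2-D.

    import numpy as np
    from sklearn.svm import SVC

    # 1-D data: class -1 in the middle, class +1 on both sides -> no single threshold separates them
    x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
    y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])

    phi = np.column_stack([x, x ** 2])       # feature map phi(x) = (x, x^2)
    clf = SVC(kernel="linear", C=1e3).fit(phi, y)
    print((clf.predict(phi) == y).all())     # True: linearly separable in the mapped space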
Sec. 15.2.3
Kernels
Why use kernels?
◼ Make non-separable problems separable.
◼ Map data into a better representational space.
Common kernels
◼ Linear
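The slide's kernel list is cut off after the linear kernel; polynomial and RBF (Gaussian) kernels are the other standard choices. A short sketch (my own) implements all three and checks, for the degree-2 polynomial kernel, that it equals an ordinary inner product in an explicit feature space, which is exactly the point of the kernel trick.

    import numpy as np

    def linear_kernel(x, z):
        return x @ z                                   # K(x, z) = x.z

    def poly_kernel(x, z, d=2):
        return (1 + x @ z) ** d                        # K(x, z) = (1 + x.z)^d

    def rbf_kernel(x, z, gamma=0.5):
        return np.exp(-gamma * np.sum((x - z) ** 2))   # K(x, z) = exp(-gamma ||x - z||^2)

    def phi_quadratic(x):
        # Explicit feature map (for 2-D x) whose inner product equals the d = 2 polynomial kernel
        x1, x2 = x
        return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(poly_kernel(x, z), phi_quadratic(x) @ phi_quadratic(z))   # both 4.0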