Class 0420
(Machine Learning)
Lecture 10: Support Vector Machine (1)
Hsuan-Tien Lin (林軒田)
[email protected]
Roadmap
1 When Can Machines Learn?
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
5 Embedding Numerous Features: Kernel Models
PLA/pocket (linearly separable)
[figure: perceptron-style diagram, inputs x0, x1, x2, . . . , xd combined into the score s, with hypothesis h(x) = sign(s)]
informal argument
if (Gaussian-like) noise on future x ≈ xn :
xn further from hyperplane ⇐⇒ tolerate more noise ⇐⇒ more robust to overfitting
distance to closest xn ⇐⇒ amount of noise tolerance ⇐⇒ robustness of hyperplane
Fat Hyperplane

max_w fatness(w)
subject to w classifies every (xn , yn ) correctly
fatness(w) = min_{n=1,...,N} distance(xn , w)

fatness: formally called margin; correctness: yn wT xn > 0

max_w margin(w)
subject to every yn wT xn > 0
margin(w) = min_{n=1,...,N} distance(xn , w)

goal: find the largest-margin separating hyperplane
‘shorten’ x and w
distance needs w0 and (w1 , . . . , wd ) differently (to be derived)
b = w0 (the bias, pulled out of w); w = (w1 , . . . , wd )T ; x = (x1 , . . . , xd )T (no more x0 = 1)
for the rest of the lecture: h(x) = sign(wT x + b)
Distance to Hyperplane
want: distance(x, b, w), with hyperplane wT x′ + b = 0

consider x′ , x′′ on the hyperplane:
1 wT x′ = −b, wT x′′ = −b
2 w ⊥ hyperplane: wT (x′′ − x′ ) = 0, where (x′′ − x′ ) is a vector on the hyperplane

distance = project (x − x′ ) onto w:
distance(x, b, w) = |wT (x − x′ )| / ∥w∥ = |wT x + b| / ∥w∥
separating hyperplane: every yn (wT xn + b) > 0
• distance to separating hyperplane (the absolute value can be dropped):
distance(xn , b, w) = (1/∥w∥) yn (wT xn + b)
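A quick numeric check of this distance formula (a sketch in NumPy; the toy data xs, ys and the hyperplane (b, w) below are made up, not from the lecture):

    import numpy as np

    # made-up, linearly separable toy data (not from the lecture)
    xs = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
    ys = np.array([1, 1, -1, -1])

    # a candidate separating hyperplane wT x + b = 0
    w = np.array([1.0, 1.0])
    b = -1.0

    # signed distances yn (wT xn + b) / ||w||: all positive iff (b, w) separates the data
    dist = ys * (xs @ w + b) / np.linalg.norm(w)
    margin = dist.min()   # margin(b, w) = min_n distance(xn, b, w)
    print(dist, margin)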
max_{b,w} margin(b, w)
subject to every yn (wT xn + b) > 0
margin(b, w) = min_{n=1,...,N} (1/∥w∥) yn (wT xn + b)

rescaling (b, w) does not change the hyperplane, so fix the scale by requiring min_n yn (wT xn + b) = 1; then margin(b, w) = 1/∥w∥

max_{b,w} 1/∥w∥
subject to min_{n=1,...,N} yn (wT xn + b) = 1

relax the equality to yn (wT xn + b) ≥ 1 for all n (the optimum still attains it) and turn maximizing 1/∥w∥ into minimizing:

min_{b,w} (1/2) wT w
subject to yn (wT xn + b) ≥ 1 for all n
[figure: worked example of the fattest separating hyperplane; here margin(b, w) = 1/∥w∥ = 1/√2 ≈ 0.707]
• examples on the boundary ‘locate’ the fattest hyperplane; other examples: not needed
• call a boundary example a support vector (candidate)
Quadratic Programming

(SVM)        min_{b,w} (1/2) wT w, subject to yn (wT xn + b) ≥ 1 for n = 1, . . . , N
(general QP) min_u (1/2) uT Q u + pT u, subject to aTm u ≥ cm for m = 1, . . . , M

objective function: u = [b ; w]; Q = [[0, 0Td ], [0d , Id ]]; p = 0d+1
constraints: aTn = yn [1, xTn ]; cn = 1; M = N
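As a concrete sketch of feeding (Q, p, an, cn) to a generic QP solver, here is the hard-margin primal on toy data. It assumes the cvxopt package (my choice, not mentioned in the lecture); cvxopt's qp solves min (1/2) uT P u + qT u subject to G u ≤ h, so the ≥ constraints are negated:

    import numpy as np
    from cvxopt import matrix, solvers  # assumption: cvxopt is installed

    # made-up, linearly separable toy data
    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    N, d = X.shape

    # u = [b; w], Q = blkdiag(0, I_d), p = 0
    Q = np.zeros((d + 1, d + 1)); Q[1:, 1:] = np.eye(d)
    p = np.zeros(d + 1)

    # constraints aTn u >= cn with aTn = yn [1, xTn], cn = 1
    A = y[:, None] * np.hstack([np.ones((N, 1)), X])

    # cvxopt wants G u <= h, so pass G = -A, h = -1
    sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-np.ones(N)))
    u = np.array(sol['x']).ravel()
    b, w = u[0], u[1:]
    print(b, w)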
want non-linear?
zn = Φ(xn )—remember? :-)
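For instance, with a 2nd-order polynomial transform (a sketch; the exact Φ used in earlier lectures may order or scale the terms differently):

    import numpy as np

    def phi(x):
        """2nd-order polynomial transform of a 2-d input x = (x1, x2)."""
        x1, x2 = x
        return np.array([x1, x2, x1 * x1, x1 * x2, x2 * x2])

    # run the same linear hard-margin SVM on zn = phi(xn)
    # to get a boundary that is non-linear in x-space
    Z = np.array([phi(x) for x in X])   # X as in the sketch above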
min_{b,w} (1/2) wT w

                    minimize    constraint
  regularization    Ein         wT w ≤ C
  SVM               wT w        Ein = 0 [and more]
Claim
SVM ≡ min_{b,w} ( max_{all αn ≥ 0} L(b, w, α) ) = min_{b,w} ( ∞ if (b, w) violates the constraints ; (1/2) wT w if (b, w) feasible )

with □ = (1/2) wT w:
• any ‘violating’ (b, w): max_{all αn ≥ 0} □ + Σn αn (some positive) → ∞
• any ‘feasible’ (b, w): max_{all αn ≥ 0} □ + Σn αn (all non-positive) = □
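The Lagrange function referenced in the claim, written out in its standard hard-margin form on transformed inputs zn = Φ(xn) (its first term is the □ above); in LaTeX:

    \mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha})
      = \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w}
      + \sum_{n=1}^{N} \alpha_n \left( 1 - y_n(\mathbf{w}^T\mathbf{z}_n + b) \right)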
but wait!

max_{all αn ≥ 0, Σ yn αn = 0, w = Σ αn yn zn}   min_{b,w} ( (1/2) wT w + Σ_{n=1}^{N} αn − wT w )

⇐⇒ max_{all αn ≥ 0, Σ yn αn = 0, w = Σ αn yn zn}   ( −(1/2) ∥ Σ_{n=1}^{N} αn yn zn ∥² + Σ_{n=1}^{N} αn )

at the optimal solution (complementary slackness): αn (1 − yn (wT zn + b)) = 0
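Collecting the conditions used above gives the KKT conditions that an optimal (b, w, α) must satisfy (standard statement; the summary refers to them as linking primal and dual); in LaTeX:

    \begin{aligned}
    &\text{primal feasible:}      && y_n(\mathbf{w}^T\mathbf{z}_n + b) \ge 1 \\
    &\text{dual feasible:}        && \alpha_n \ge 0 \\
    &\text{dual-inner optimal:}   && \textstyle\sum_n y_n\alpha_n = 0,\quad \mathbf{w} = \textstyle\sum_n \alpha_n y_n \mathbf{z}_n \\
    &\text{primal-inner optimal:} && \alpha_n\bigl(1 - y_n(\mathbf{w}^T\mathbf{z}_n + b)\bigr) = 0
    \end{aligned}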
min_α (1/2) αT QD α + pT α
subject to special equality and bound constraints

comp. slackness: αn > 0 ⇒ on fat boundary (SV!)
[figure: the fattest hyperplane with boundary examples marked; support vectors (candidates) lie on the fat boundary]
• SV (positive αn ) ⊆ SV candidates (on boundary)
• only SV needed to compute w: w = Σ_{n=1}^{N} αn yn zn = Σ_{SV} αn yn zn
• only SV needed to compute b: b = yn − wT zn with any SV (zn , yn )
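A minimal sketch of this recovery step in NumPy (alpha, Z, y are my names for the dual solution, transformed inputs, and labels; they are not defined in the slides):

    import numpy as np

    def recover_primal(alpha, Z, y, tol=1e-8):
        """Recover (b, w) from the dual solution alpha; only support vectors matter."""
        sv = alpha > tol                     # support vectors: alpha_n > 0
        w = (alpha[sv] * y[sv]) @ Z[sv]      # w = sum over SV of alpha_n yn zn
        b = y[sv][0] - w @ Z[sv][0]          # b = yn - wT zn for any single SV
        return b, w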
(primal) min_{b,w} (1/2) wT w        (dual) min_α (1/2) αT QD α − 1T α

min_α (1/2) αT QD α − 1T α
subject to yT α = 0;
           αn ≥ 0, for n = 1, 2, . . . , N

• N variables, N + 1 constraints: no dependence on d̃?
• qn,m = yn ym zTn zm : inner product in R^d̃ —O(d̃) via naïve computation!

no dependence only if avoiding naïve computation (next lecture :-))
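For completeness, a sketch of setting up and solving this dual with the same generic QP solver as before (again assuming cvxopt; note that QD below is still formed with the naïve O(d̃) inner products, so this sketch does not yet remove the d̃ dependence):

    import numpy as np
    from cvxopt import matrix, solvers  # assumption: cvxopt is installed

    def dual_svm(Z, y):
        """Hard-margin dual: min (1/2) aT QD a - 1T a, s.t. yT a = 0, a >= 0."""
        N = len(y)
        QD = (y[:, None] * y[None, :]) * (Z @ Z.T)   # qn,m = yn ym znT zm
        p = -np.ones(N)
        G, h = -np.eye(N), np.zeros(N)               # -alpha_n <= 0
        A, c = y.reshape(1, N).astype(float), np.zeros(1)   # yT alpha = 0
        sol = solvers.qp(matrix(QD), matrix(p), matrix(G), matrix(h), matrix(A), matrix(c))
        return np.array(sol['x']).ravel()

    # alpha = dual_svm(Z, y); b, w = recover_primal(alpha, Z, y)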
Summary
1 Embedding Numerous Features: Kernel Models

Lecture 10: Support Vector Machine (1)
• Large-Margin Separating Hyperplane: intuitively more robust against noise
• Standard Large-Margin Problem: minimize ‘length of w’ at special separating scale
• Support Vector Machine: ‘easy’ via quadratic programming
• Motivation of Dual SVM: want to remove dependence on d̃
• Lagrange Dual SVM: KKT conditions link primal/dual
• Solving Dual SVM: another QP, better solved with special solver
• Messages behind Dual SVM: SVs represent fattest hyperplane