Lecture 5: SVM
DSA5103 Lecture 5
Yangjing Zhang
12-Sep-2023
NUS
Today’s content
1. SVM
2. Lagrange duality and KKT
3. Dual of SVM and kernels
4. SVM with soft constraints
SVM
Idea of support vector machine (SVM)
Maximum margin separating hyperplane
Normal cone of a hyperplane

Hyperplane H = H_{β,β0} = {x ∈ R^p | β^T x + β_0 = 0}
• a linear decision boundary
• a (p − 1)-dimensional affine subspace, closed, convex
• Margin γ = γ(β, β_0) = min_{i=1,…,n} |β^T x_i + β_0| / ‖β‖
• All data points must lie on the correct side:
  β^T x_i + β_0 ≥ 0 when y_i = 1,   β^T x_i + β_0 ≤ 0 when y_i = −1
  ⟺ y_i(β^T x_i + β_0) ≥ 0, ∀ i ∈ [n] = {1, …, n}

The maximum margin separating hyperplane solves

  max_{β,β0}  min_{i=1,…,n} |β^T x_i + β_0| / ‖β‖
  s.t.  y_i(β^T x_i + β_0) ≥ 0, ∀ i ∈ [n]
Simplify the optimization problem

  max_{β,β0}  (1/‖β‖) min_{i=1,…,n} |β^T x_i + β_0|
  s.t.  y_i(β^T x_i + β_0) ≥ 0, ∀ i

is equivalent to

  max_{β,β0}  1/‖β‖
  s.t.  y_i(β^T x_i + β_0) ≥ 0, ∀ i
        min_{i=1,…,n} |β^T x_i + β_0| = 1

(Scaling (β, β_0) by any t > 0 changes neither the hyperplane nor the objective, so we may normalize min_{i} |β^T x_i + β_0| = 1.)
Simplify the optimization problem

  max_{β,β0}  1/‖β‖
  s.t.  y_i(β^T x_i + β_0) ≥ 0, ∀ i
        min_{i=1,…,n} |β^T x_i + β_0| = 1

is equivalent to

  min_{β,β0}  ‖β‖²
  s.t.  y_i(β^T x_i + β_0) ≥ 1, ∀ i
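As a quick numerical illustration of this quadratic program (a minimal sketch, not part of the lecture: the toy data and the use of scipy.optimize.minimize with SLSQP are my own choices), the snippet below minimizes ‖β‖² subject to y_i(β^T x_i + β_0) ≥ 1 and reports the resulting margin 1/‖β‖:

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(w):                    # w = (beta_1, ..., beta_p, beta_0)
    return np.sum(w[:-1] ** 2)       # ||beta||^2

def margin_constraints(w):           # y_i (beta^T x_i + beta_0) - 1 >= 0
    return y * (X @ w[:-1] + w[-1]) - 1.0

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
beta, beta0 = res.x[:-1], res.x[-1]
print("beta:", beta, "beta0:", beta0)
print("geometric margin 1/||beta||:", 1.0 / np.linalg.norm(beta))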
SVM
• Later, we will discuss Lagrangian duality and derive the dual problem of the above
• The dual problem will play a key role in allowing us to use kernels (introduced later)
• The dual problem will also allow us to derive an efficient algorithm better than generic QP solvers (especially when n ≪ p)
Support vectors
Data points lying on the margin boundary, i.e., satisfying y_i(β^T x_i + β_0) = 1, are called support vectors.
Lagrange duality and KKT
Lagrangian

Consider the primal problem (P):  min_x f(x)  s.t.  g_i(x) = 0, i ∈ [m],  h_j(x) ≤ 0, j ∈ [l].
The Lagrangian is

  L(x, v, u) = f(x) + Σ_{i=1}^m v_i g_i(x) + Σ_{j=1}^l u_j h_j(x),   v ∈ R^m, u ∈ R^l_+,

and the dual function is θ(v, u) = min_x L(x, v, u).

• To compute θ we may set ∂L/∂x = 0 if f, g_i, h_j are convex and differentiable
• The dual function θ is concave even when (P) is not convex — verify that
  θ(λv + (1 − λ)v′, λu + (1 − λ)u′) ≥ λθ(v, u) + (1 − λ)θ(v′, u′)
• For any primal feasible x and any u ∈ R^l_+:  θ(v, u) ≤ L(x, v, u) ≤ f(x)
Lagrangian dual problem

• The dual function is a lower bound of the primal function:  θ(v, u) ≤ f(x)  whenever (v, u) is dual feasible (v ∈ R^m, u ∈ R^l_+) and x is primal feasible

The Lagrangian dual problem (D):

  max_{v,u}  θ(v, u)
  s.t.  v ∈ R^m, u ∈ R^l_+
Primal and dual
Example

  θ(u_1) = 4u_1 − u_1²/2   (attained at x_1 = u_1/2, x_2 = u_1/2)

The dual problem is

  max  4u_1 − u_1²/2
  s.t.  u_1 ≥ 0
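To complete the example: the dual objective is concave, so setting its derivative to zero, d/du_1 (4u_1 − u_1²/2) = 4 − u_1 = 0, gives the dual optimal solution u_1* = 4 (which satisfies u_1 ≥ 0) with optimal value 4·4 − 4²/2 = 8.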
Example: LP

Dual problem (of the standard-form LP  min_x c^T x  s.t.  Ax = b, x ≥ 0):

  max_{v,u}  b^T v
  s.t.  A^T v + u = c
        u ≥ 0

i.e.,

  max_v  b^T v
  s.t.  A^T v ≤ c
Example: LP

Consider the linear program

  min_x  c^T x
  s.t.  Ax ≤ b

where A ∈ R^{m×n}, b ∈ R^m, c ∈ R^n, and the inequality in the constraint Ax ≤ b is interpreted component-wise. Find the dual function and dual problem.

Solution. Let u ∈ R^m_+. The dual function is

  θ(u) = min_x { c^T x + u^T(Ax − b) } =  −b^T u   if A^T u + c = 0,
                                          −∞       otherwise.

Dual problem:

  max_u  −b^T u
  s.t.  A^T u + c = 0
        u ≥ 0
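As a quick numerical check of this primal–dual pair (a sketch with made-up data A, b, c; SciPy's linprog is used for both problems), the two optimal values should coincide when strong duality holds:

import numpy as np
from scipy.optimize import linprog

# Toy data (illustrative only): min c^T x s.t. Ax <= b, x free
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])      # together these encode 0 <= x <= 1
c = np.array([-1.0, -2.0])

# Primal: min c^T x s.t. Ax <= b
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual: max -b^T u s.t. A^T u + c = 0, u >= 0
# (written as minimization of b^T u for linprog)
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 4)

print("primal optimal value:", primal.fun)   # expected -3
print("dual optimal value:  ", -dual.fun)    # expected -3 (strong duality)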
Example: Lasso

Recall the Lasso problem²  min_β (1/2)‖Y − Xβ‖² + λ‖β‖₁, which can be written as  min_{β,z} (1/2)‖z‖² + λ‖β‖₁  s.t.  z = Y − Xβ. With multiplier y for the equality constraint, the dual function is

  θ(y) = min_{β,z} { (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨X^T y, β⟩ } + ⟨y, Y⟩
       = min_β { λ‖β‖₁ − ⟨X^T y, β⟩ } + min_z { (1/2)‖z‖² − ⟨y, z⟩ } + ⟨y, Y⟩

• Set ∇_z = z − y = 0 ⇒ z = y, so  min_z { (1/2)‖z‖² − ⟨y, z⟩ } = −(1/2)‖y‖²
• min_β { λ‖β‖₁ − ⟨X^T y, β⟩ } = −λ max_β { ⟨X^T y/λ, β⟩ − ‖β‖₁ } = −λ h*(X^T y/λ) = −δ_{B₁}(X^T y/λ),
  where h(β) = ‖β‖₁ and B₁ = {β ∈ R^p | ‖β‖_∞ ≤ 1}

² Lecture 4, page 34
Example: Lasso

The dual problem is

  max_y  θ(y) = − min_y { δ_{B₁}(X^T y/λ) + (1/2)‖y‖² − ⟨y, Y⟩ }
              = − min_y { (1/2)‖y‖² − ⟨y, Y⟩  s.t.  ‖X^T y‖_∞ ≤ λ }
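A small numerical sanity check of this dual (a sketch; the data and λ below are made up, and scikit-learn's Lasso is used for the primal — note it scales the quadratic term by 1/(2n), so alpha = λ/n matches the formulation above). Taking the dual variable to be the primal residual y = Y − Xβ*, which is optimal by the stationarity condition z = y, the dual value θ(y) should match the primal optimal value and ‖X^T y‖_∞ should not exceed λ:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 5.0
X = rng.standard_normal((n, p))
Y = X @ np.array([2.0, -1.0] + [0.0] * (p - 2)) + 0.1 * rng.standard_normal(n)

# Primal: min_beta 0.5*||Y - X beta||^2 + lam*||beta||_1
# (sklearn's Lasso uses 1/(2n) on the quadratic term, hence alpha = lam/n)
model = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X, Y)
beta = model.coef_
primal_val = 0.5 * np.sum((Y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

# Dual variable: y = Y - X beta* (the residual)
y = Y - X @ beta
dual_val = y @ Y - 0.5 * np.sum(y ** 2)      # theta(y) on the feasible set

print("primal value:", primal_val)
print("dual value:  ", dual_val)             # should be close to the primal value
print("||X^T y||_inf <= lam ?", np.max(np.abs(X.T @ y)) <= lam + 1e-6)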
KKT

Assumptions: f and h_j are convex and differentiable, g_i are affine, and Slater's condition holds (there exists a strictly feasible point).

Under the above assumptions, strong duality holds, and there exist a solution x* to (P) and a solution (u*, v*) to (D) satisfying the Karush-Kuhn-Tucker (KKT) conditions:

  ∂/∂x L(x*, u*, v*) = ∇f(x*) + Σ_{i=1}^m v_i* ∇g_i(x*) + Σ_{j=1}^l u_j* ∇h_j(x*) = 0    (stationarity)
  g_i(x*) = 0, i ∈ [m],   h_j(x*) ≤ 0, j ∈ [l]                                           (primal feasibility)
  u_j* ≥ 0, j ∈ [l]                                                                      (dual feasibility)
  u_j* h_j(x*) = 0, j ∈ [l]                                                              (complementary slackness)
Dual of SVM

Primal:

  min_{β,β0}  (1/2)‖β‖²
  s.t.  y_i(β^T x_i + β_0) ≥ 1, i ∈ [n]

Dual:

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
  s.t.  Σ_{i=1}^n α_i y_i = 0
        α_i ≥ 0, i ∈ [n]

Verify the assumptions above for strong duality and the existence of KKT points: (Slater's condition) there exist β̂, β̂_0 such that y_i(β̂^T x_i + β̂_0) > 1, i ∈ [n]. This requires that the two classes are strictly separable.

Figure 2: Left: strictly separable. Right: separable but not strictly separable
KKT of SVM

KKT conditions:

  Σ_{i=1}^n α_i* y_i x_i = β*,    Σ_{i=1}^n α_i* y_i = 0
  y_i((β*)^T x_i + β_0*) ≥ 1,  i ∈ [n]
  α_i* ≥ 0,  i ∈ [n]
  α_i*(1 − y_i((β*)^T x_i + β_0*)) = 0,  i ∈ [n]

Decision boundary:

  0 = (β*)^T x + β_0* = Σ_{i=1}^n α_i* y_i ⟨x_i, x⟩ + β_0* = Σ_{i: α_i* > 0} α_i* y_i ⟨x_i, x⟩ + β_0*

For a new test point x, the prediction only depends on ⟨x_i, x⟩ where α_i* > 0, namely, the inner products between x and support vectors.
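To make the dual problem and the KKT conditions concrete, here is a minimal numerical sketch (the toy data and the use of scipy.optimize.minimize are my own illustration, not the course's recommended solver): it maximizes the dual objective subject to Σ_i α_i y_i = 0 and α ≥ 0, then recovers β* = Σ_i α_i* y_i x_i and β_0* from a support vector via complementary slackness:

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(a):                                   # minimize the negative dual objective
    return -(a.sum() - 0.5 * a @ Q @ a)

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

beta = (alpha * y) @ X                             # beta* = sum_i alpha_i* y_i x_i
sv = np.argmax(alpha)                              # index of a support vector (alpha_i* > 0)
beta0 = y[sv] - X[sv] @ beta                       # from y_sv (beta^T x_sv + beta0) = 1

print("alpha*:", np.round(alpha, 4))               # nonzero only for support vectors
print("beta*:", beta, "beta0*:", beta0)
print("y_i (beta^T x_i + beta0):", np.round(y * (X @ beta + beta0), 4))  # = 1 on support vectors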
Primal vs Dual

Primal:

  min_{β,β0}  (1/2)‖β‖²
  s.t.  y_i(β^T x_i + β_0) ≥ 1, i ∈ [n]

Classifier:  f(x) = sign(β^T x + β_0)

Dual:

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
  s.t.  Σ_{i=1}^n α_i y_i = 0
        α_i ≥ 0, i ∈ [n]

Classifier:  f(x) = sign( Σ_{i=1}^n α_i y_i ⟨x_i, x⟩ + β_0 )

• Instead of using the original feature vectors x_i, i ∈ [n], we may apply SVM using new features φ(x_i), i ∈ [n]
• The new feature space can be very high dimensional
Kernel

Primal:

  min_{β,β0}  (1/2)‖β‖²
  s.t.  y_i(β^T x_i + β_0) ≥ 1, i ∈ [n]

Dual:

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
  s.t.  Σ_{i=1}^n α_i y_i = 0
        α_i ≥ 0, i ∈ [n]

Using the dual:
• For feature expansion, simply replace ⟨x_i, x_j⟩ with ⟨φ(x_i), φ(x_j)⟩
• Given a feature mapping φ, we define the corresponding kernel K(a, b) = ⟨φ(a), φ(b)⟩, a, b ∈ R^p
• Computing K(a, b) is often very cheap, even though computing φ(a), φ(b) (high-dimensional vectors) may be expensive
• The dual of SVM only requires the computation of kernels K(x_i, x_j). Explicitly calculating φ(x_i) is not necessary
Example: (homogeneous) polynomial kernel

For a, b ∈ R^p, consider

  K(a, b) = (a^T b)²

It can be written as

  K(a, b) = (Σ_{i=1}^p a_i b_i)(Σ_{j=1}^p a_j b_j) = Σ_{i,j=1}^p a_i a_j b_i b_j = Σ_{i,j=1}^p (a_i a_j)(b_i b_j)

Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : R^p → R^{p²} is given by

  φ(a) = φ((a_1, …, a_p)^T) = (a_1 a_1, a_1 a_2, a_1 a_3, …, a_p a_p)^T

Computing φ(a): O(p²) operations; computing K(a, b): O(p) operations.
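A quick numerical check of this identity (a small sketch; the explicit feature map below lists the products a_i a_j in the same order as above):

import numpy as np

def phi(a):
    """Explicit feature map for K(a, b) = (a^T b)^2: all products a_i * a_j."""
    return np.outer(a, a).ravel()            # vector of length p^2

rng = np.random.default_rng(1)
a, b = rng.standard_normal(5), rng.standard_normal(5)

lhs = (a @ b) ** 2                           # kernel evaluation: O(p) work
rhs = phi(a) @ phi(b)                        # explicit inner product in R^{p^2}: O(p^2) work
print(np.isclose(lhs, rhs))                  # True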
Example: (inhomogeneous) polynomial kernel

For a, b ∈ R^p and c > 0, consider the kernel K(a, b) = (a^T b + c)². We see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : R^p → R^{p²+p+1} is given by

  φ(a) = ( a_1 a_1, a_1 a_2, a_1 a_3, …, a_p a_p,  √(2c) a_1, √(2c) a_2, …, √(2c) a_p,  c )^T

(the first p² entries are the second order terms, the next p entries are the first order terms, and the last entry is the constant).

Parameter c controls the relative weighting between first order and second order terms.
Common kernels

• Polynomials of degree d:  K(a, b) = (a^T b)^d
• Polynomials up to degree d:  K(a, b) = (a^T b + c)^d,  c > 0
• Gaussian (radial basis function) kernel³:

    K(a, b) = exp( −‖a − b‖² / (2σ²) ),   σ > 0

³ e^x = Σ_{n=0}^∞ x^n/n!
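As an illustration of using such a kernel in practice (a sketch with made-up data; note that scikit-learn's SVC parameterizes the Gaussian kernel as exp(−γ‖a − b‖²), so γ = 1/(2σ²) translates the formula above):

import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable (points inside vs. outside a circle)
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.5, 1, -1)

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])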
SVM with soft constraints
SVM with soft constraints

When the two classes are not separable, no feasible separating hyperplane exists. We allow the constraints to be violated slightly (C > 0 is given):

  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n ε_i
  s.t.  y_i(β^T x_i + β_0) ≥ 1 − ε_i, ∀ i ∈ [n]
        ε_i ≥ 0, i ∈ [n]

At an optimal solution,

  ε_i = 1 − y_i(β^T x_i + β_0)   if y_i(β^T x_i + β_0) < 1,
  ε_i = 0                        if y_i(β^T x_i + β_0) ≥ 1,

that is, ε_i = max{1 − y_i(β^T x_i + β_0), 0}. Hence SVM with soft constraints solves

  min_{β,β0}  (1/2)‖β‖² + C Σ_{i=1}^n max{1 − y_i(β^T x_i + β_0), 0}

where the first term is a ridge regularization and the sum is the hinge-loss function.
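A brief sketch of the soft-margin behaviour on made-up overlapping data (scikit-learn's SVC solves a soft-constraint formulation of this form; the choice C = 1 is arbitrary): after fitting, the slacks ε_i = max{1 − y_i(β^T x_i + β_0), 0} are positive exactly for the points inside the margin or misclassified:

import numpy as np
from sklearn.svm import SVC

# Overlapping (non-separable) toy data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.2, size=(100, 2)),
               rng.normal(+1.0, 1.2, size=(100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
beta, beta0 = clf.coef_.ravel(), clf.intercept_[0]

# Slacks eps_i = max{1 - y_i (beta^T x_i + beta0), 0}
eps = np.maximum(1.0 - y * (X @ beta + beta0), 0.0)
obj = 0.5 * beta @ beta + C * eps.sum()          # hinge-loss form of the objective
print("points with positive slack:", int((eps > 1e-8).sum()))
print("objective value (ridge + C * hinge):", obj)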
Logistic regression

Recall that⁴

  logistic-loss = Σ_{i=1}^n log(1 + e^{β_0 + β^T x_i}) − y_i(β_0 + β^T x_i)

Equivalently,

  logistic-loss = log(1 + e^{−(β^T x_i + β_0)})   if y_i = 1
  logistic-loss = log(1 + e^{β^T x_i + β_0})      if y_i = 0

Hinge-loss:     hinge-loss = max{1 − z, 0},       z = y_i(β^T x_i + β_0),   hope z ≥ 1
Logistic-loss:  logistic-loss = log(1 + e^{−z}),  z = y_i(β^T x_i + β_0),   hope z ≫ 0

⁴ Lecture 3, Page 19
SVM vs. logistic regression

The logistic loss is a "smoothed version" of the hinge loss.

[Figure: hinge-loss max{1 − z, 0} and logistic-loss log(1 + e^{−z}) plotted over z ∈ [−3, 3]]
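A tiny numerical illustration of the "smoothed version" remark (the grid of margin values z is arbitrary):

import numpy as np

z = np.linspace(-3, 3, 7)
hinge = np.maximum(1 - z, 0)                 # hinge-loss: max{1 - z, 0}
logistic = np.log(1 + np.exp(-z))            # logistic-loss: log(1 + e^{-z})

for zi, h, l in zip(z, hinge, logistic):
    print(f"z = {zi:+.1f}   hinge = {h:.3f}   logistic = {l:.3f}")
# Both losses are small for large positive z and grow roughly linearly for negative z;
# the logistic loss is a smooth approximation of the hinge loss.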
SVM with soft constraints: dual

With multipliers α_i ≥ 0 (for the margin constraints) and r_i ≥ 0 (for ε_i ≥ 0), the dual function is

  θ(α, r) = min_{β,β0,ε}  (1/2)‖β‖² − Σ_{i=1}^n α_i y_i x_i^T β − Σ_{i=1}^n α_i y_i β_0 + Σ_{i=1}^n (C − α_i − r_i) ε_i + Σ_{i=1}^n α_i

Setting

  ∂L/∂β = β − Σ_{i=1}^n α_i y_i x_i = 0,    ∂L/∂β_0 = −Σ_{i=1}^n α_i y_i = 0,    ∂L/∂ε_i = C − α_i − r_i = 0,

we obtain

  θ(α, r) =  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j x_i^T x_j    if Σ_{i=1}^n α_i y_i = 0 and α_i + r_i = C, i ∈ [n],
  θ(α, r) =  −∞                                                            otherwise.
SVM with soft constraints: dual

The dual problem max_{α ≥ 0, r ≥ 0} θ(α, r) is equivalent to

  max_{α,r}  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
  s.t.  Σ_{i=1}^n α_i y_i = 0
        α_i + r_i = C, i ∈ [n]
        α ∈ R^n_+, r ∈ R^n_+

⟺

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j ⟨x_i, x_j⟩
  s.t.  Σ_{i=1}^n α_i y_i = 0
        0 ≤ α_i ≤ C, i ∈ [n]
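One way to see the box constraint 0 ≤ α_i ≤ C in practice (a sketch on made-up data; to my understanding, scikit-learn's SVC exposes y_i α_i for the support vectors in dual_coef_, so its absolute values are the α_i):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.0, 1.2, size=(80, 2)),
               rng.normal(+1.0, 1.2, size=(80, 2))])
y = np.hstack([-np.ones(80), np.ones(80)])

C = 0.5
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())     # |y_i * alpha_i| = alpha_i for support vectors
print("max alpha_i:", alpha.max(), "<= C:", alpha.max() <= C + 1e-8)
print("support vectors at the box bound alpha_i = C:", int(np.isclose(alpha, C).sum()))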
In practice
You are encouraged to learn two popular open source machine learning
libraries:
LIBLINEAR https://www.csie.ntu.edu.tw/~cjlin/liblinear/
LIBSVM https://www.csie.ntu.edu.tw/~cjlin/libsvm/
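If you work in Python, scikit-learn provides convenient wrappers (to my knowledge, SVC is built on LIBSVM and LinearSVC on LIBLINEAR), so a sketch like the following is a quick way to try both; the synthetic data set is only for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# SVC wraps LIBSVM (kernelized, solves the dual); LinearSVC wraps LIBLINEAR (linear models)
for name, clf in [("LIBSVM / SVC(rbf)", SVC(kernel="rbf", C=1.0)),
                  ("LIBLINEAR / LinearSVC", LinearSVC(C=1.0, max_iter=5000))]:
    clf.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(clf.score(X_te, y_te), 3))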