
Support Vector Machine (SVM)

DSA5103 Lecture 5

Yangjing Zhang
12-Sep-2023
NUS
Today’s content

1. SVM
2. Lagrange duality and KKT
3. Dual of SVM and kernels
4. SVM with soft constraints

lecture5 1/43
SVM
Idea of support vector machine (SVM)

• Data: xi ∈ Rp, yi ∈ {−1, 1} (instead of {0, 1} in logistic regression), i = 1, . . . , n
• The two classes are assumed to be linearly separable
• Aim: learn a linear classifier f(x) = sign(βᵀx + β0)

• Question: what is the “best” separating hyperplane?
• SVM answer: the hyperplane with maximum margin.
• Margin = the distance to the closest data points.

lecture5 2/43
Maximum margin separating hyperplane

For the separating hyperplane with maximum margin,


distance to points in positive class = distance to points in negative class

lecture5 3/43
Normal cone of a hyperplane

Hyperplane H = Hβ,β0 = {x ∈ Rp | βᵀx + β0 = 0}
• a linear decision boundary
• a (p − 1)-dimensional affine subspace, closed, convex

Figure 1: Left: p = 2, H is a line. Right: p = 3, H is a plane.

• For any x̄ ∈ H, the normal cone is NH(x̄) = {λβ | λ ∈ R}; it is 1-dimensional. We can show that β ∈ NH(x̄), i.e., ⟨β, z − x̄⟩ ≤ 0 ∀ z ∈ H.
  This is true since z, x̄ ∈ H ⇒ βᵀz + β0 = 0 and βᵀx̄ + β0 = 0 ⇒ βᵀ(z − x̄) = 0. Obviously, we also have −β ∈ NH(x̄).
lecture5 4/43
Distance of a point to a hyperplane

Compute the distance of a point x to a hyperplane H = {x ∈ Rp | βᵀx + β0 = 0}.

1. x̄ = ΠH(x) ⇐⇒ x − x̄ ∈ NH(x̄) ⇐⇒ x − x̄ = λβ for some λ ∈ R (projection onto a convex set, Lecture 4, page 24)
2. x̄ ∈ H ⇒ βᵀx̄ + β0 = 0 ⇒ βᵀ(x − λβ) + β0 = 0 ⇒ λ = (βᵀx + β0) / (βᵀβ)
3. x − x̄ = ((βᵀx + β0) / (βᵀβ)) β, so ‖x − x̄‖ = |βᵀx + β0| / ‖β‖

The distance of a point x to the hyperplane H is |βᵀx + β0| / ‖β‖; it is invariant to scaling of the parameters β, β0.
lecture5 5/43
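As a quick numerical check (a sketch added here, not from the slides; the numbers are arbitrary), the distance formula can be compared against an explicit projection onto H:

# Check |beta^T x + beta0| / ||beta|| against the explicit projection of x onto H.
import numpy as np

beta = np.array([3.0, 4.0])
beta0 = -5.0
x = np.array([2.0, 7.0])

lam = (beta @ x + beta0) / (beta @ beta)       # lambda from step 2
x_bar = x - lam * beta                         # projection of x onto H
dist = abs(beta @ x + beta0) / np.linalg.norm(beta)

print(np.isclose(beta @ x_bar + beta0, 0.0))           # x_bar lies on H
print(np.isclose(np.linalg.norm(x - x_bar), dist))     # distances agree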
Maximize margin

• Margin: γ = γ(β, β0) = min_{i=1,...,n} |βᵀxi + β0| / ‖β‖
• All data points must lie on the correct side:
  βᵀxi + β0 ≥ 0 when yi = 1,   βᵀxi + β0 ≤ 0 when yi = −1
  ⇐⇒ yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n] = {1, . . . , n}
• Therefore, the optimization problem is
  max_{β,β0}  min_{i=1,...,n} |βᵀxi + β0| / ‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n]

lecture5 6/43
Simplify the optimization problem

  max_{β,β0}  min_{i=1,...,n} |βᵀxi + β0| / ‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0 ∀ i

⇐⇒

  max_{β,β0}  1/‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0 ∀ i
        min_{i=1,...,n} |βᵀxi + β0| = 1

• The hyperplane and the margin are scale invariant: (β, β0) → (cβ, cβ0) for any c ≠ 0
• If xk is the closest point to H, i.e., k = arg min_{i=1,...,n} |βᵀxi + β0|, we can scale β, β0 such that |βᵀxk + β0| = 1
lecture5 7/43
Simplify the optimization problem

  max_{β,β0}  1/‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0 ∀ i
        min_{i=1,...,n} |βᵀxi + β0| = 1

⇐⇒

  min_{β,β0}  ‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1 ∀ i

• “⇒” Note that yi ∈ {−1, 1}
• “⇐” Note that we minimize ‖β‖

lecture5 8/43
SVM

SVM is a quadratic programming (QP) problem — it can be solved by generic QP solvers:
  min_{β,β0}  (1/2)‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1, ∀ i ∈ [n]

• Later, we will discuss the Lagrangian duality and derive the dual
problem of the above
• The dual problem will play a key role in allowing us to use kernels
(introduced later)
• The dual problem will also allow us to derive an efficient algorithm better than generic QP solvers (especially when n ≪ p)

lecture5 9/43
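As a concrete illustration (my own sketch, not part of the lecture), the hard-margin QP above can be handed directly to a generic solver such as cvxpy; the toy data below are assumed to be linearly separable:

# Hard-margin SVM primal solved as a generic QP with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),    # class -1
               rng.normal(+2, 0.5, (20, 2))])   # class +1
y = np.hstack([-np.ones(20), np.ones(20)])

beta = cp.Variable(2)
beta0 = cp.Variable()
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]   # y_i (beta^T x_i + beta0) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
prob.solve()

print(beta.value, beta0.value)    # maximum-margin hyperplane parameters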
Support vectors

Support vectors are the data points xi whose constraints are tight (active):
  yi(βᵀxi + β0) = 1

• Support vectors must exist
• Number of support vectors ≪ sample size n
• The resulting hyperplane may change if some support vectors are removed

lecture5 10/43
Lagrange duality and KKT
Lagrangian

Consider a general nonlinear programming problem (NLP), which is known as a primal problem:
  (P)  min_{x∈Rp}  f(x)
       s.t.  gi(x) = 0, i ∈ [m]
             hj(x) ≤ 0, j ∈ [l]

• Define the Lagrangian
    L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
  for v = [v1; . . . ; vm] ∈ Rm, u = [u1; . . . ; ul] ∈ Rl+.

• Define the Lagrange dual function (always concave)
    θ(v, u) = min_x L(x, v, u)

lecture5 11/43
Lagrangian

• In evaluating θ(v, u) for each v, u, we must solve
    min_x L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
  We may set ∂L/∂x = 0 if f, gi, hj are convex and differentiable
• The dual function θ is concave even when (P) is not convex — verify that
    θ(λv + (1 − λ)v′, λu + (1 − λ)u′) ≥ λθ(v, u) + (1 − λ)θ(v′, u′)
• Suppose x is a feasible point of (P). Then for any v ∈ Rm, u ∈ Rl+,
    θ(v, u) = min_x L(x, v, u) ≤ L(x, v, u)
            = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
            ≤ f(x)
  (the last inequality holds since gi(x) = 0 and uj hj(x) ≤ 0)

lecture5 12/43
Lagrangian dual problem

• The dual function is a lower bound of the primal function:
    θ(v, u) ≤ f(x)  for any dual feasible (v, u) (v ∈ Rm, u ∈ Rl+) and any primal feasible x
• We want to search for the largest lower bound — leading to the Lagrangian dual problem
    (D)  max_{v,u}  θ(v, u)
         s.t.  v ∈ Rm, u ∈ Rl+
  Here vi, uj are called dual variables or Lagrange multipliers.

lecture5 13/43
Primal and dual

Definition (Lagrangian dual problem)


For a primal nonlinear programming problem (P)
  (P)  min_{x∈Rp}  f(x)
       s.t.  gi(x) = 0, i ∈ [m]
             hj(x) ≤ 0, j ∈ [l]
the Lagrangian dual problem (D) is the following nonlinear programming problem
  (D)  max_{v,u}  θ(v, u) = min_x { f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x) }
       s.t.  v ∈ Rm, u ∈ Rl+

• Weak duality: optimal value of (D) ≤ optimal value of (P)
• Under certain assumptions (see page 21), strong duality: optimal value of (D) = optimal value of (P)
lecture5 14/43
Example

Find the dual problem of the convex program
  min  x1² + x2²               f(x) = x1² + x2²
  s.t.  x1 + x2 ≥ 4            h1(x) = 4 − x1 − x2 ≤ 0  ← u1 ≥ 0

Solution. For u1 ≥ 0, the Lagrangian is
  L(x1, x2, u1) = f(x) + u1 h1(x) = x1² + x2² + u1(4 − x1 − x2)
The dual function is
  θ(u1) = min_{x1,x2}  x1² + x2² + u1(4 − x1 − x2)
        = 4u1 + min_{x1} {x1² − u1 x1} + min_{x2} {x2² − u1 x2}
        = 4u1 − u1²/2    (attained at x1 = u1/2, x2 = u1/2)
The dual problem is
  max  4u1 − u1²/2
  s.t.  u1 ≥ 0
Its optimum is u1 = 4 with value 8, which matches the primal optimum x1 = x2 = 2 — an instance of strong duality.
lecture5 15/43
Example: LP

Consider the linear programming (LP) problem in standard form
  min_x  cᵀx
  s.t.   Ax = b
         x ≥ 0
where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and x ≥ 0 means xi ≥ 0, i ∈ [n]. Find the dual function and dual problem.

Solution. Let v ∈ Rm and u ∈ Rn+. The dual function is
  θ(v, u) = min_x {cᵀx + vᵀ(b − Ax) − uᵀx} = vᵀb + min_x {xᵀ(c − Aᵀv − u)}
          = vᵀb   if c − Aᵀv − u = 0
          = −∞    otherwise
Dual problem:
  max_{v,u}  bᵀv                          max_v  bᵀv
  s.t.  Aᵀv + u = c          i.e.,        s.t.  Aᵀv ≤ c
        u ≥ 0
lecture5 16/43
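A small numerical illustration of this primal/dual pair (my own sketch with made-up data) using scipy's linprog:

# Standard-form LP and its dual, both solved with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 3.0])
c = np.array([2.0, 3.0, 1.0])

# Primal: min c^T x  s.t.  Ax = b, x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)

# Dual: max b^T v  s.t.  A^T v <= c  (solved as min -b^T v)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

print(primal.fun, -dual.fun)    # equal optimal values: strong duality holds for LP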
Example: LP

Consider the LP in standard inequality form
  min_x  cᵀx
  s.t.   Ax ≤ b
where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and the inequality in the constraint Ax ≤ b is interpreted component-wise. Find the dual function and dual problem.

Solution. Let u ∈ Rm+. The dual function is
  θ(u) = min_{x∈Rn} {cᵀx + uᵀ(Ax − b)} = −uᵀb + min_{x∈Rn} {xᵀ(c + Aᵀu)}
       = −uᵀb   if c + Aᵀu = 0
       = −∞     otherwise
Dual problem:
  max_u  −bᵀu
  s.t.   Aᵀu + c = 0
         u ≥ 0
lecture5 17/43
Example: Lasso

Consider the problem
  min_{β,z}  (1/2)‖z‖² + λ‖β‖₁
  s.t.  z + Xβ = Y
where X ∈ Rn×p, Y ∈ Rn, λ > 0. Find the dual function and dual problem.

Solution.
1. Let y ∈ Rn. The Lagrangian is
   L(β, z, y) = (1/2)‖z‖² + λ‖β‖₁ + ⟨y, Y − z − Xβ⟩
              = (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩
lecture5 18/43
Example: Lasso

2. The dual function is
   θ(y) = min_{β,z} L(β, z, y)
        = min_{β,z} { (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩ }
        = min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } + min_z { (1/2)‖z‖² − ⟨y, z⟩ } + ⟨y, Y⟩

• Set ∇z = z − y = 0 ⇒ z = y, so min_z { (1/2)‖z‖² − ⟨y, z⟩ } = −(1/2)‖y‖²
• min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } = −λ max_β { ⟨Xᵀy/λ, β⟩ − h(β) } = −λ h∗(Xᵀy/λ) = −δB1(Xᵀy/λ),
  where h(β) = ‖β‖₁, h∗ is its conjugate (Lecture 4, page 34), and B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}
lecture5 19/43
Example: Lasso

2. The dual function is
   θ(y) = −δB1(Xᵀy/λ) − (1/2)‖y‖² + ⟨y, Y⟩,   B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}

3. The dual problem is
   max_y  θ(y) = − min_y { δB1(Xᵀy/λ) + (1/2)‖y‖² − ⟨y, Y⟩ }
               = − min_y { (1/2)‖y‖² − ⟨y, Y⟩  s.t.  ‖Xᵀy‖∞ ≤ λ }
lecture5 20/43
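A numerical sanity check (my own sketch with random data, not part of the slides) that the primal and dual optimal values coincide:

# Lasso primal and the dual derived above, both solved with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, lam = 30, 10, 0.5
X, Y = rng.normal(size=(n, p)), rng.normal(size=n)

# Primal: min 0.5*||z||^2 + lam*||beta||_1  s.t.  z + X beta = Y
beta, z = cp.Variable(p), cp.Variable(n)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(z) + lam * cp.norm1(beta)),
                    [z + X @ beta == Y])
primal.solve()

# Dual: max -0.5*||y||^2 + <y, Y>  s.t.  ||X^T y||_inf <= lam
yv = cp.Variable(n)
dual = cp.Problem(cp.Maximize(-0.5 * cp.sum_squares(yv) + Y @ yv),
                  [cp.norm(X.T @ yv, "inf") <= lam])
dual.solve()

print(primal.value, dual.value)    # equal up to solver tolerance: strong duality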
KKT

Assumptions:
1. f, hj : Rp → R differentiable and convex
2. gi : Rp → R affine (gi(x) = aiᵀx + bi)
3. Slater’s condition holds, i.e., there exists x̂ such that
   gi(x̂) = 0 ∀ i,   hj(x̂) < 0 ∀ j

Under the above assumptions, strong duality holds, and there exist a solution x∗ to (P) and a solution (v∗, u∗) to (D) satisfying the Karush-Kuhn-Tucker (KKT) conditions:
  (∂/∂x) L(x∗, v∗, u∗) = ∇f(x∗) + Σ_{i=1}^m vi∗ ∇gi(x∗) + Σ_{j=1}^l uj∗ ∇hj(x∗) = 0
  gi(x∗) = 0,  hj(x∗) ≤ 0,  uj∗ ≥ 0,  uj∗ hj(x∗) = 0,   ∀ i ∈ [m], j ∈ [l]

lecture5 21/43
KKT

• We say (x∗, v∗, u∗) (or simply x∗) is a KKT point or a KKT solution if (x∗, v∗, u∗) satisfies the KKT conditions
• Under the above assumptions, (x∗, v∗, u∗) is a KKT solution ⇐⇒ x∗ is an optimal solution to (P) and (v∗, u∗) is an optimal solution to (D)
• We call
    uj∗ hj(x∗) = 0, ∀ j ∈ [l]
  the complementary slackness condition. It implies
    uj∗ = 0 if hj(x∗) < 0,   hj(x∗) = 0 if uj∗ > 0
• For each pair hj(x∗) ≤ 0, uj∗ ≥ 0 with uj∗ hj(x∗) = 0:
  • If the constraint hj(x∗) ≤ 0 is slack (hj(x∗) < 0), then the constraint uj∗ ≥ 0 is active (uj∗ = 0)
  • If the constraint uj∗ ≥ 0 is slack (uj∗ > 0), then the constraint hj(x∗) ≤ 0 is active (hj(x∗) = 0)
lecture5 22/43
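As a concrete check (my own, reusing the earlier example min x1² + x2² s.t. x1 + x2 ≥ 4, whose solution is x∗ = (2, 2) with multiplier u1∗ = 4):

# Numerically verify the KKT conditions for the earlier example.
import numpy as np

x_star = np.array([2.0, 2.0])
u_star = 4.0

grad_f = 2 * x_star                    # gradient of f(x) = x1^2 + x2^2
grad_h = np.array([-1.0, -1.0])        # gradient of h1(x) = 4 - x1 - x2
h_val = 4.0 - x_star.sum()             # h1(x*) = 0: the constraint is active

print(np.allclose(grad_f + u_star * grad_h, 0.0))         # stationarity
print(h_val <= 0, u_star >= 0, u_star * h_val == 0.0)     # feasibility and complementary slackness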
Dual of SVM
Dual of SVM

Derive the dual of the following SVM problem
  min_{β,β0}  (1/2)‖β‖²
  s.t.  1 − yi(βᵀxi + β0) ≤ 0, ∀ i ∈ [n]

1. For α ∈ Rn+, the Lagrangian is
   L(β, β0, α) = (1/2)‖β‖² + Σ_{i=1}^n αi (1 − yi(βᵀxi + β0))

2. The dual function is
   θ(α) = min_{β,β0} L(β, β0, α)
        = min_{β,β0}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n αi
lecture5 23/43
Dual of SVM

We need to solve the optimization problem
  min_{β,β0}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n αi
Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0 and ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0, we obtain
  θ(α) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0
       = −∞                                                   otherwise

3. The dual problem is
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
lecture5 24/43
KKT of SVM

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1, i ∈ [n]

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]

Verify the assumptions (on page 21) for strong duality and the existence of KKT points: (Slater’s condition) there exist β̂, β̂0 such that yi(β̂ᵀxi + β̂0) > 1, i ∈ [n]. This requires that the two classes are strictly separable.

Figure 2: Left: strictly separable. Right: separable but not strictly separable
lecture5 25/43
KKT of SVM

KKT conditions:
  Σ_{i=1}^n αi∗ yi xi = β∗,   Σ_{i=1}^n αi∗ yi = 0
  yi((β∗)ᵀxi + β0∗) ≥ 1, i ∈ [n]
  αi∗ ≥ 0, i ∈ [n]
  αi∗ (1 − yi((β∗)ᵀxi + β0∗)) = 0, i ∈ [n]

1. If we obtain a dual solution α∗ (by solving the SVM dual problem), then we can construct a primal solution (β∗, β0∗) from the KKT conditions:
   β∗ = Σ_{i=1}^n αi∗ yi xi
   β0∗ = yk − Σ_{i=1}^n αi∗ yi ⟨xi, xk⟩,   for some k satisfying αk∗ > 0

lecture5 26/43
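The sketch below (my own, not from the slides; it assumes linearly separable toy data) solves the dual with cvxpy and then recovers (β∗, β0∗) exactly as in point 1:

# Solve the SVM dual, then recover the primal solution from the KKT conditions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),    # class -1
               rng.normal(+2, 0.5, (20, 2))])   # class +1
y = np.hstack([-np.ones(20), np.ones(20)])

alpha = cp.Variable(len(y))
# sum_i alpha_i - 0.5 * || sum_i alpha_i y_i x_i ||^2  (equals the double sum above)
objective = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
prob = cp.Problem(cp.Maximize(objective), [alpha >= 0, alpha @ y == 0])
prob.solve()

a = alpha.value
beta = (a * y) @ X                       # beta* = sum_i alpha_i* y_i x_i
k = int(np.argmax(a))                    # an index with alpha_k* > 0 (a support vector)
beta0 = y[k] - (a * y) @ (X @ X[k])      # beta0* = y_k - sum_i alpha_i* y_i <x_i, x_k>
print(beta, beta0)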
KKT of SVM

With the KKT conditions above:

2. If αi∗ > 0, then xi is a support vector, by the complementary slackness condition:
   αi∗ > 0 ⇒ yi((β∗)ᵀxi + β0∗) = 1
3. |{i | αi∗ > 0}| ≤ the number of support vectors ≪ n
   The dual solution is sparse (many αi∗ = 0)

lecture5 27/43
KKT of SVM

With the KKT conditions above:

4. Decision boundary:
   0 = (β∗)ᵀx + β0∗ = Σ_{i=1}^n αi∗ yi ⟨xi, x⟩ + β0∗ = Σ_{i: αi∗>0} αi∗ yi ⟨xi, x⟩ + β0∗
For a new test point x, the prediction only depends on ⟨xi, x⟩ for i with αi∗ > 0, namely the inner products between x and the support vectors.
lecture5 28/43
Primal vs Dual

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1, i ∈ [n]
  Classifier: f(x) = sign(βᵀx + β0)

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
  Classifier: f(x) = sign( Σ_{i=1}^n αi yi ⟨xi, x⟩ + β0 );  many αi’s are zero (sparse solution)

• Optimize p + 1 variables for the primal, n variables for the dual
• When n ≪ p, it might be more efficient to solve the dual
• The dual problem only involves ⟨xi, xj⟩ — allowing the use of kernels
lecture5 29/43
Feature mapping

• Recall feature expansion, for example, the i-th feature vector
    xi = (xi1, xi2)   → feature expansion →   (xi1, xi2, xi1², xi2², xi1 xi2)
• Let φ denote the feature mapping, which maps from original features to new features. For example,
    φ((z1, z2)) = (z1, z2, z1², z2², z1 z2)
• Instead of using the original feature vectors xi, i ∈ [n], we may apply SVM using the new features φ(xi), i ∈ [n]
• The new feature space can be very high dimensional
lecture5 30/43
Kernel

Recall the SVM primal and dual from the previous page. Using the dual:
• For feature expansion, simply replace ⟨xi, xj⟩ with ⟨φ(xi), φ(xj)⟩
• Given a feature mapping φ, we define the corresponding kernel
    K(a, b) = ⟨φ(a), φ(b)⟩,   a, b ∈ Rp
• Usually computing K(a, b) is very cheap, even though computing φ(a), φ(b) (high-dimensional vectors) may be expensive
• The dual of SVM only requires the computation of the kernel values K(xi, xj); explicitly calculating φ(xi) is not necessary
lecture5 31/43
Example: (homogeneous) polynomial kernel

For a, b ∈ Rp, consider
  K(a, b) = (aᵀb)²
It can be written as
  K(a, b) = (Σ_{i=1}^p ai bi)(Σ_{j=1}^p aj bj) = Σ_{i,j=1}^p ai aj bi bj = Σ_{i,j=1}^p (ai aj)(bi bj)
Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → R^{p²} is given by
  φ(a) = φ((a1, . . . , ap)) = (a1 a1, a1 a2, a1 a3, . . . , ap ap)
Computing φ(a): O(p²) operations; computing K(a, b): O(p) operations
lecture5 32/43
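A small numpy check (my own illustration) that the kernel value equals the inner product in the expanded feature space:

# Check that (a^T b)^2 equals <phi(a), phi(b)> with phi as above.
import numpy as np

def phi(a):
    # all p^2 pairwise products a_i a_j
    return np.outer(a, a).ravel()

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

print(np.isclose((a @ b) ** 2, phi(a) @ phi(b)))   # True; K costs O(p), phi costs O(p^2)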
Example: (inhomogeneous) polynomial kernel

Given c ≥ 0. For a, b ∈ Rp, consider
  K(a, b) = (aᵀb + c)² = Σ_{i,j=1}^p (ai aj)(bi bj) + Σ_{i=1}^p (√(2c) ai)(√(2c) bi) + c²
Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → R^{p²+p+1} is given by
  φ(a) = ( a1 a1, a1 a2, a1 a3, . . . , ap ap,   √(2c) a1, √(2c) a2, . . . , √(2c) ap,   c )ᵀ
           (second-order terms)                  (first-order terms)
The parameter c controls the relative weighting between the first-order and second-order terms.

lecture5 33/43
Common kernels

• Polynomials of degree d:   K(a, b) = (aᵀb)^d
• Polynomials up to degree d:   K(a, b) = (aᵀb + 1)^d
• Gaussian kernel — polynomials of all orders (recall e^x = Σ_{n=0}^∞ xⁿ/n!):
    K(a, b) = exp( −‖a − b‖² / (2σ²) ),   σ > 0

lecture5 34/43
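For reference, a short sketch (my own, not from the slides) computing kernel matrices for the kernels above:

# Kernel matrices K[i, j] = K(a_i, b_j) for the polynomial and Gaussian kernels.
import numpy as np

def polynomial_kernel(A, B, d=2, c=1.0):
    # (a^T b + c)^d; c = 0 gives the homogeneous polynomial kernel
    return (A @ B.T + c) ** d

def gaussian_kernel(A, B, sigma=1.0):
    # exp(-||a - b||^2 / (2 sigma^2)), via the expanded squared distance
    sq_dist = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-sq_dist / (2.0 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))
print(polynomial_kernel(X, X).shape, gaussian_kernel(X, X).shape)   # (5, 5) (5, 5)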
Kernel

• SVM can be applied in high-dimensional feature spaces, without explicitly applying the feature mapping
• The two classes might be separable in the high-dimensional space, but not separable in the original feature space
• Kernels can be used efficiently in the dual problem of SVM because the dual only involves inner products

lecture5 35/43
SVM with soft constraints
SVM with soft constraints

When the two classes are not separable, no feasible separating hyperplane exists. We allow the constraints to be violated slightly (C > 0 is given):
  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.  yi(βᵀxi + β0) ≥ 1 − εi, ∀ i ∈ [n]
        εi ≥ 0, i ∈ [n]
At the optimum,
  εi = 1 − yi(βᵀxi + β0)   if yi(βᵀxi + β0) < 1
     = 0                    if yi(βᵀxi + β0) ≥ 1
     = max{1 − yi(βᵀxi + β0), 0}
So SVM with soft constraints solves
  min_{β,β0}  (1/2)‖β‖²  +  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0}
             (ridge regularization)    (hinge-loss function)
lecture5 37/43
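The unconstrained hinge-loss form can be written directly as a function; a minimal numpy sketch (mine, not from the slides):

# 0.5*||beta||^2 + C * sum_i max{1 - y_i (beta^T x_i + beta0), 0}
import numpy as np

def soft_margin_objective(beta, beta0, X, y, C=1.0):
    margins = y * (X @ beta + beta0)          # y_i (beta^T x_i + beta0)
    eps = np.maximum(1.0 - margins, 0.0)      # the optimal slack eps_i from above
    return 0.5 * beta @ beta + C * eps.sum()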
Logistic regression

Recall that (Lecture 3, page 19)
  logistic-loss = Σ_{i=1}^n log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi)
Equivalently, for each data point,
  logistic-loss = log(1 + e^{−(βᵀxi + β0)})   if yi = 1
                = log(1 + e^{βᵀxi + β0})       if yi = 0

Change the label yi = 0 → yi = −1:
  logistic-loss = log(1 + e^{−yi(βᵀxi + β0)}),   yi ∈ {−1, 1}

Logistic regression with ridge regularization:
  min_{β,β0}  Σ_{i=1}^n log(1 + e^{−yi(βᵀxi + β0)}) + λ‖β‖²
lecture5 38/43
SVM vs. logistic regression

SVM with soft constraints:
  min_{β,β0}  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0} + (1/2)‖β‖²
Logistic regression with ridge regularization:
  min_{β,β0}  Σ_{i=1}^n log(1 + e^{−yi(βᵀxi + β0)}) + λ‖β‖²

Hinge-loss: max{1 − z, 0}, with z = yi(βᵀxi + β0); hope z ≥ 1.
Logistic-loss: log(1 + e^{−z}), with z = yi(βᵀxi + β0); hope z ≫ 0.

Figure: hinge loss and logistic loss as functions of z — the logistic loss is a “smoothed version” of the hinge loss.
lecture5 39/43
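A tiny sketch (my own) evaluating both losses on a grid of margins z:

# Compare hinge loss and logistic loss at a few values of z = y_i (beta^T x_i + beta0).
import numpy as np

z = np.linspace(-2, 2, 5)
hinge = np.maximum(1.0 - z, 0.0)       # max{1 - z, 0}
logistic = np.log1p(np.exp(-z))        # log(1 + e^{-z})
print(np.column_stack([z, hinge, logistic]))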
SVM with soft constraints: dual

SVM with soft constraints:
  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.  1 − εi − yi(βᵀxi + β0) ≤ 0, ∀ i ∈ [n]
        −εi ≤ 0, i ∈ [n]

Find the dual problem.

1. For α ∈ Rn+, r ∈ Rn+, the Lagrangian is
   L(β, β0, ε, α, r) = (1/2)‖β‖² + C Σ_{i=1}^n εi + Σ_{i=1}^n αi (1 − εi − yi(βᵀxi + β0)) − Σ_{i=1}^n ri εi
                     = (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi

lecture5 40/43
SVM with soft constraints: dual

2. The dual function is
   θ(α, r) = min_{β,β0,ε} L(β, β0, ε, α, r)
           = min_{β,β0,ε}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi
Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0,  ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0,  ∂L/∂εi = C − αi − ri = 0,
we obtain
   θ(α, r) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0 and αi + ri = C ∀ i
           = −∞                                                   otherwise

lecture5 41/43
SVM with soft constraints: dual

3. The dual problem is  max_{α,r} { θ(α, r) | α ∈ Rn+, r ∈ Rn+ }

⇐⇒  max_{α,r}  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
     s.t.  Σ_{i=1}^n αi yi = 0
           αi + ri = C, i ∈ [n]
           α ∈ Rn+, r ∈ Rn+

⇐⇒  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
     s.t.  Σ_{i=1}^n αi yi = 0
           0 ≤ αi ≤ C, i ∈ [n]
lecture5 42/43
In practice

You are encouraged to learn two popular open source machine learning
libraries:
LIBLINEAR https://www.csie.ntu.edu.tw/~cjlin/liblinear/
LIBSVM https://www.csie.ntu.edu.tw/~cjlin/libsvm/

lecture5 43/43
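If you work in Python, scikit-learn exposes both libraries: SVC wraps LIBSVM (kernels supported) and LinearSVC wraps LIBLINEAR. A minimal usage sketch (mine, with made-up data):

# Soft-margin SVMs in scikit-learn; C is the soft-constraint parameter from the slides.
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])     # overlapping classes: not separable

kernel_svm = SVC(kernel="rbf", C=1.0).fit(X, y)    # Gaussian-kernel SVM (LIBSVM)
linear_svm = LinearSVC(C=1.0).fit(X, y)            # linear SVM (LIBLINEAR)

print(kernel_svm.n_support_)                       # support vectors per class
print(linear_svm.coef_, linear_svm.intercept_)     # linear decision boundary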
