Support Vector Machine (3)
Section 1: Introduction
$(\alpha): \; w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = 0$   (1)
Where:
$x = [x_1, x_2, \ldots, x_n]^\top$ represents the coordinates of a point on the hyperplane.
$w = [w_1, w_2, \ldots, w_n]^\top$ is a normal vector of $(\alpha)$.
$b$ is a constant.
where
$$\|w\|_2 = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2} = \sqrt{w^\top w}$$
is the $\ell_2$-norm of $w$.
Definition:
The Lagrange dual function of (P1) is derived from its Lagrangian function. For any pair of inputs $(\lambda, \nu)$, it is the infimum of the Lagrangian over all $x$ in the domain $D$:
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) = \inf_{x \in D} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x) \right)$$
Key Properties:
The dual function g (λ, ν) is always concave, even if f0 (x) is not
convex.
The dual function g (λ, ν) provides a lower bound for the optimal
value f0 (x∗ ).
Since, for any feasible $\tilde{x}$ and any $\lambda \succeq 0$,
$$L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{j=1}^{p} \nu_j h_j(\tilde{x}) \le f_0(\tilde{x}),$$
it follows that:
$$g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) \le L(\tilde{x}, \lambda, \nu) \le f_0(\tilde{x}), \quad \text{in particular } g(\lambda, \nu) \le f_0(x^*).$$
Key Concept:
Each pair (λ, ν) provides a lower bound g (λ, ν) for the optimal value
f0 (x∗ ).
The pair (λ∗ , ν ∗ ) that gives the highest lower bound, g (λ∗ , ν ∗ ), is
called the optimal Lagrange multipliers.
Dual Problem:
The Lagrange dual problem of (P1) is to find the best such lower bound:
$$\text{(P2)}: \quad \max_{\lambda, \nu} \; g(\lambda, \nu) \quad \text{subject to } \lambda \succeq 0.$$
Additional Notes:
(P2) is always a convex optimization problem, regardless of the convexity of (P1).
The difference $f_0(x^*) - g(\lambda^*, \nu^*)$ is called the optimal duality gap (see the worked example below).
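For concreteness, a small worked example (a toy problem added here for illustration; it is not from the original slides): take $f_0(x) = x^2$ with a single constraint $f_1(x) = 1 - x \le 0$. Then
$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad g(\lambda) = \inf_{x} L(x, \lambda) = \lambda - \frac{\lambda^2}{4} \quad (\text{attained at } x = \lambda/2).$$
Maximizing $g$ over $\lambda \ge 0$ gives $\lambda^* = 2$ and $g(\lambda^*) = 1$, which equals the primal optimum $f_0(x^*) = 1$ at $x^* = 1$, so the optimal duality gap is zero.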
Strong Duality and Optimal Duality Gap
Strong Duality:
If the optimal duality gap is zero, we say that strong duality occurs.
Significance:
Solving the dual problem (P2) allows us to find the exact optimal
value of the primal problem (P1).
Constraint Qualifications:
For a convex optimization problem (P1), certain conditions called
constraint qualifications ensure strong duality.
A fundamental example of such a qualification is Slater’s condition.
Strictly Feasible Point (Definition):
A point x is strictly feasible if it satisfies:
$$f_i(x) < 0, \;\forall i = 1, \ldots, m, \qquad h_j(x) = 0, \;\forall j = 1, \ldots, p.$$
This means the inequality constraints are strictly satisfied, and the equality constraints hold.
Slater’s Theorem:
If a strictly feasible point exists and (P1) is convex, then strong
duality holds.
KKT Conditions for Convex Problems
1 Primal feasibility:
$$f_i(x^*) \le 0, \;\forall i = 1, \ldots, m, \qquad h_j(x^*) = 0, \;\forall j = 1, \ldots, p.$$
2 Dual feasibility:
λ∗i ≥ 0, ∀i = 1, . . . , m.
3 Complementary slackness:
λ∗i fi (x∗ ) = 0, ∀i = 1, . . . , m.
4 Stationarity:
$$\nabla_x f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla_x f_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla_x h_j(x^*) = 0.$$
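Checking these conditions on the toy example above ($f_0(x) = x^2$, $f_1(x) = 1 - x$, $x^* = 1$, $\lambda^* = 2$):
$$f_1(x^*) = 0 \le 0, \qquad \lambda^* = 2 \ge 0, \qquad \lambda^* f_1(x^*) = 2 \cdot 0 = 0, \qquad \nabla f_0(x^*) + \lambda^* \nabla f_1(x^*) = 2 - 2 = 0.$$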
Given:
A dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, where:
$x^{(i)} \in \mathbb{R}^n$ (feature vectors),
$y^{(i)} \in \{-1, 1\}$ (class labels), for $i = 1, \ldots, m$.
Assume the two classes of data points (y (i) = 1 and y (i) = −1) are
linearly separable.
Question:
How can we find the best hyperplane to separate these two classes?
Definition:
Support Vector Machine (SVM) is a supervised learning algorithm
designed for classification problems.
It identifies the optimal hyperplane that separates two classes of
data points.
The hyperplane is selected to maximize the margin, which is the
distance to the nearest data points (support vectors) from both
classes.
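As a quick illustration of this definition, the sketch below fits a linear SVM with scikit-learn on a tiny hand-made dataset and inspects the resulting hyperplane and support vectors. The library choice and the data are my own assumptions; the slides do not prescribe an implementation.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (hypothetical, for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A linear SVM with a very large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0])           # normal vector of the separating hyperplane
print("b =", clf.intercept_[0])      # bias term
print("support vectors:\n", clf.support_vectors_)
```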
Definition:
In the n-dimensional space, the separating hyperplane (α) has the
form:
w1 x1 + w2 x2 + · · · + wn xn + b = w⊤ x + b = 0,
where:
x = [x1 , x2 , . . . , xn ]⊤ : coordinates of a point on the hyperplane.
w = [w1 , w2 , . . . , wn ]⊤ : normal vector to the hyperplane α.
b: a constant (bias term).
Conditions (C1):
For all pairs (x(i) , y (i) ) ∈ D, the following must hold:
$$w^\top x^{(i)} + b \ge 0 \;\text{ if } y^{(i)} = 1, \qquad w^\top x^{(i)} + b \le 0 \;\text{ if } y^{(i)} = -1;$$
equivalently, $y^{(i)}(w^\top x^{(i)} + b) \ge 0$ for all $i = 1, \ldots, m$.
Definition:
The distance d from each data point (x(i) , y (i) ) to the hyperplane α
is given by:
$$d = \frac{\left| w_1 x_1^{(i)} + w_2 x_2^{(i)} + \cdots + w_n x_n^{(i)} + b \right|}{\sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}} = \frac{\left| w^\top x^{(i)} + b \right|}{\|w\|_2}.$$
Scaling Observation:
The coefficients (w, b) are not unique. Scaling them by any positive
constant k ∈ R+ still satisfies (C1).
The distance d from any point in D to the hyperplane remains
unchanged under this scaling.
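A small numeric check of the distance formula and the scaling observation (the toy numbers and the use of NumPy are my own choices, for illustration only):

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 1.0], [-2.0, 0.5], [0.0, 3.0]])  # arbitrary points

def distance(w, b, X):
    # d = |w^T x + b| / ||w||_2 for each row x of X
    return np.abs(X @ w + b) / np.linalg.norm(w)

d1 = distance(w, b, X)
d2 = distance(5.0 * w, 5.0 * b, X)   # scale (w, b) by k = 5
print(d1)
print(d2)                            # identical: distances are unchanged by scaling
```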
Simplified Assumption:
To remove this redundancy, we can assume that the data points closest to the hyperplane satisfy:
$$\min_{i = 1, \ldots, m} y^{(i)}\left( w^\top x^{(i)} + b \right) = 1. \quad \text{(Eq.1)}$$
Implication of Equation 1:
Since (Eq.1) holds, we can conclude that for every i = 1, . . . , m, the
following holds:
y (i) (w⊤ x(i) + b) ≥ 1.
Margin Size:
With the normalization (Eq.1), the margin size (the distance from the hyperplane to the closest data points) is:
$$\text{margin} = \min_{i} \frac{y^{(i)}\left( w^\top x^{(i)} + b \right)}{\|w\|_2} = \frac{1}{\|w\|_2}.$$
Objective:
The goal of SVM is to maximize the margin size, which is
equivalent to solving for the pair of optimal values (w∗ , b ∗ ) of the
following optimization problem:
$$(w^*, b^*) = \arg\max_{w, b} \frac{1}{\|w\|_2},$$
subject to:
y (i) (w⊤ x(i) + b) ≥ 1, ∀i = 1, . . . , m.
Reformulated Problem:
The above problem is equivalent to minimizing the squared norm of
w:
$$(w^*, b^*) = \arg\min_{w, b} \frac{1}{2}\|w\|_2^2,$$
subject to:
$$1 - y^{(i)}\left( w^\top x^{(i)} + b \right) \le 0, \quad \forall i = 1, \ldots, m. \quad \text{(P3)}$$
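To make (P3) concrete, here is a minimal sketch that solves it numerically with scipy.optimize on a tiny hand-made dataset. The data, the SLSQP method, and the use of scipy are my own choices for illustration; the slides do not prescribe an implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
n = X.shape[1]

def objective(params):
    # params = [w_1, ..., w_n, b]; minimize (1/2)||w||^2
    w = params[:n]
    return 0.5 * np.dot(w, w)

# One inequality constraint per point: y_i (w^T x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq",
     "fun": (lambda p, xi=xi, yi=yi: yi * (np.dot(p[:n], xi) + p[n]) - 1.0)}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(n + 1), method="SLSQP", constraints=constraints)
w_star, b_star = res.x[:n], res.x[n]
print("w* =", w_star, "b* =", b_star)
```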
Lagrangian:
The Lagrangian for the optimization problem (P3) is defined as:
$$L(w, b, \lambda) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^{m} \lambda_i \left( 1 - y^{(i)}\left( w^\top x^{(i)} + b \right) \right),$$
where:
λ = [λ1 , λ2 , . . . , λm ]⊤ are the Lagrange multipliers.
λi ≥ 0, ∀i = 1, . . . , m.
Key Insights:
It can be proven that (P3) is a convex optimization problem.
Slater’s condition is satisfied for (P3), ensuring strong duality holds.
Conclusion:
The optimal solutions w∗ , b ∗ , λ∗ for the dual problem can be obtained
by solving the Karush-Kuhn-Tucker (KKT) conditions of (P3).
KKT Conditions:
Primal feasibility:
$$1 - y^{(i)}\left( (w^*)^\top x^{(i)} + b^* \right) \le 0, \quad \forall i = 1, \ldots, m. \quad \text{(C2.1)}$$
Dual feasibility:
λ∗i ≥ 0, ∀i = 1, . . . , m. (C2.2)
Complementary slackness:
$$\lambda_i^* \left( 1 - y^{(i)}\left( (w^*)^\top x^{(i)} + b^* \right) \right) = 0, \quad \forall i = 1, \ldots, m. \quad \text{(C2.3)}$$
Stationarity with respect to w∗ :
$$\frac{\partial L}{\partial w^*} = w^* - \sum_{i=1}^{m} \lambda_i^* y^{(i)} x^{(i)} = 0. \quad \text{(C2.4)}$$
Stationarity with respect to b ∗ :
$$\frac{\partial L}{\partial b^*} = -\sum_{i=1}^{m} \lambda_i^* y^{(i)} = 0. \quad \text{(C2.5)}$$
Solving (P3) Using the Dual Problem
Motivation:
Directly solving for w∗ , b ∗ , λ∗ using the KKT conditions can be
computationally intensive.
Instead, solving for λ in the Lagrange dual problem of (P3) is more
efficient and commonly done.
Lagrange Dual Function:
The dual function g(λ) is defined as:
$$g(\lambda) = \inf_{w, b} L(w, b, \lambda).$$
Key Insight:
Since strong duality holds for (P3), maximizing g(λ) subject to λ ⪰ 0 yields the same optimal value as the primal problem, and the optimal multipliers λ* can be recovered from this dual problem.
Finding $\inf_{w,b} L(w, b, \lambda)$
Key Steps:
To find $\inf_{w,b} L$, set the partial derivatives of L with respect to w and b to zero.
Partial Derivatives:
With respect to w:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \lambda_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \lambda_i y^{(i)} x^{(i)}. \quad \text{(Eq.2)}$$
With respect to b:
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \lambda_i y^{(i)} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \lambda_i y^{(i)} = 0. \quad \text{(Eq.3)}$$
Substituting (Eq.2) and (Eq.3) into g (λ):
$$g(\lambda) = \sum_{i=1}^{m} \lambda_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j y^{(i)} y^{(j)} (x^{(i)})^\top x^{(j)}. \quad \text{(Eq.4)}$$
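The substitution step behind (Eq.4), spelled out (a short derivation added for completeness):
$$L(w, b, \lambda) = \frac{1}{2} w^\top w + \sum_{i=1}^{m} \lambda_i - \sum_{i=1}^{m} \lambda_i y^{(i)} w^\top x^{(i)} - b \sum_{i=1}^{m} \lambda_i y^{(i)}.$$
By (Eq.3) the last term vanishes, and with (Eq.2) both $w^\top w$ and $\sum_i \lambda_i y^{(i)} w^\top x^{(i)}$ equal $\sum_{i}\sum_{j} \lambda_i \lambda_j y^{(i)} y^{(j)} (x^{(i)})^\top x^{(j)}$, which gives (Eq.4).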
Observation:
From (C2.3), we conclude that λ∗i can be greater than 0 only if:
$$y^{(i)}\left( (w^*)^\top x^{(i)} + b^* \right) = 1,$$
i.e. only for the data points that lie exactly on the margin (the support vectors).
Step 1: Determine the support set:
$$S = \{ i \mid \lambda_i^* \ne 0 \}.$$
Step 2: Calculate w∗ :
Using (C2.4):
$$w^* = \sum_{i \in S} \lambda_i^* y^{(i)} x^{(i)}.$$
Step 3: Calculate b∗:
Since $x^{(i)}$ is a support vector for every $i \in S$, we have:
$$y^{(i)}\left( (w^*)^\top x^{(i)} + b^* \right) = 1 \;\Rightarrow\; b^* = y^{(i)} - (w^*)^\top x^{(i)},$$
and averaging over all support vectors gives $b^* = \frac{1}{|S|} \sum_{i \in S} \left( y^{(i)} - (w^*)^\top x^{(i)} \right)$.
Separating Hyperplane:
The separating hyperplane α is defined as:
$$\alpha: \; (w^*)^\top x + b^* = \sum_{i \in S} \lambda_i^* y^{(i)} (x^{(i)})^\top x + \frac{1}{|S|} \sum_{i \in S} \left( y^{(i)} - (w^*)^\top x^{(i)} \right) = 0.$$
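A minimal sketch of Steps 1–3 end to end: maximize g(λ) from (Eq.4) subject to λi ≥ 0 and Σ λi y(i) = 0, then recover w* and b* from the support set. The toy data and the use of scipy.optimize are my own choices; in practice a dedicated QP solver would normally be used.

```python
import numpy as np
from scipy.optimize import minimize

# Same kind of toy data as before (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
K = (X * y[:, None]) @ (X * y[:, None]).T   # K_ij = y_i y_j x_i^T x_j

def neg_dual(lmbda):
    # negative of g(lambda) from (Eq.4), so that minimizing it maximizes g
    return 0.5 * lmbda @ K @ lmbda - lmbda.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(m),
    method="SLSQP",
    bounds=[(0, None)] * m,                               # lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda l: l @ y}],  # sum_i lambda_i y_i = 0
)
lmbda = res.x
S = lmbda > 1e-6                       # Step 1: support set
w_star = (lmbda[S] * y[S]) @ X[S]      # Step 2: w* from (C2.4)
b_star = np.mean(y[S] - X[S] @ w_star) # Step 3: b* averaged over support vectors
print("support vectors:\n", X[S])
print("w* =", w_star, "b* =", b_star)
```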
Role of ξ (the slack variables):
ξi measures the degree of margin violation or misclassification for each data point.
ξi = 0: the point is correctly classified and lies on or outside the margin.
ξi > 0: the point either lies within the margin or is misclassified.
Constraints:
⟨w · xi ⟩ + b ≥ 1 − ξi , if yi = 1
⟨w · xi ⟩ + b ≤ −1 + ξi , if yi = −1
Section 3: Types of SVM - Soft Margin
Optimization Objective:
$$\arg\min_{w, b, \xi} \left( \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \right) \quad \text{(Eq.1)}$$
where:
C > 0 is the penalty constant.
The larger C is, the more heavily margin violations (misclassifications) are penalized.
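The effect of C can be seen by fitting a linear soft-margin SVM for several values of C. The sketch below uses scikit-learn and synthetic overlapping data; both are my own choices for illustration, not part of the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping 2-D toy data (hypothetical), to show the role of C.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-1, -1], size=(50, 2)),
               rng.normal(loc=[1, 1], size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.1, 1.0, 10000.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy={clf.score(X, y):.2f}")
```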
Section 3: Types of SVM – Linear SVM – Soft Margin
$$\nabla_{\xi_n} L = 0 \;\Rightarrow\; \lambda_n = C - \mu_n \quad \text{(Eq.5)}$$
Interpretation of Conditions:
This relationship shows that λi is bounded above by C, since µi ≥ 0.
If µi = 0, then λi = C: the multiplier sits at the upper boundary of the box constraint 0 ≤ λi ≤ C, and by (Eq.10) ξi may then be positive, i.e. the point may violate the margin.
Section 3: Types of SVM - Linear SVM – Soft Margin
The dual objective of the soft-margin problem has the same form as (Eq.4), written here in the inner-product notation of this section:
$$L(\lambda) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j y_i y_j \langle x_i, x_j \rangle.$$
Notes:
This dual form depends only on λi and the inner products ⟨xi , xj⟩, making it computationally efficient.
The goal is to maximize L(λ) with respect to λ, which controls the influence of each support vector.
Subject to the following constraints:
$$0 \le \lambda_i \le C, \quad \forall i = 1, \ldots, N$$
$$\sum_{i=1}^{N} \lambda_i y_i = 0$$
In addition, the KKT conditions of the soft-margin problem require:
ξi ≥ 0 (Eq.7)
λi ≥ 0 (Eq.8)
µi ≥ 0 (Eq.9)
µi ξi = 0 (Eq.10)
yi ((w · xi ) + b) − 1 + ξi ≥ 0, ∀i = 1 . . . N (Eq.11)
λi (yi ((w · xi ) + b) − 1 + ξi ) = 0 (Eq.12)
[Figure: data points relative to the margin — points on the margin satisfy $y_n(w^\top x_n + b) = 1$; points with slack satisfy $y_n(w^\top x_n + b) = 1 - \xi_n \le 1$.]
[Figure: decision boundaries and margins for C = 0.1, C = 1, and C = 10000.]
The feature map $\phi: x \mapsto \phi(x)$ sends the input data into a higher-dimensional feature space.
In the feature space, the dual problem is to maximize
$$\sum_{n=1}^{N} \lambda_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m y_n y_m \, k(x_n, x_m)$$
subject to:
$$\sum_{n=1}^{N} \lambda_n y_n = 0, \qquad 0 \le \lambda_n \le C, \quad \forall n.$$
Here:
S is the support set with λm > 0.
M is the set of support vectors on the margin where 0 < λm < C .
k(xn , xm ) = Φ(xn )T Φ(xm ) is the kernel function.
For two points x and z in the original space, the dot product in the
feature space is:
$$\Phi(x)^\top \Phi(z) = [1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, \sqrt{2}x_1x_2, x_2^2]^\top [1, \sqrt{2}z_1, \sqrt{2}z_2, z_1^2, \sqrt{2}z_1z_2, z_2^2]$$
$$= 1 + 2x_1z_1 + 2x_2z_2 + x_1^2z_1^2 + 2x_1z_1x_2z_2 + x_2^2z_2^2 = (1 + x^\top z)^2 = k(x, z).$$
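A quick numeric sanity check of this identity (NumPy and the specific test points are my own choices):

```python
import numpy as np

def phi(v):
    # Explicit degree-2 polynomial feature map for 2-D input, as written above.
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])
lhs = phi(x) @ phi(z)      # dot product in the feature space
rhs = (1 + x @ z) ** 2     # kernel evaluated in the original space
print(lhs, rhs)            # both ≈ 2.6896: the two values agree
```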
Polynomial:
K (x, z) = (γ(x · z) + r )^d , γ, r ∈ R, d ∈ N
Sigmoid:
K (x, z) = tanh(γ(x · z) + r ), γ, r ∈ R
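For reference, these kernels map onto scikit-learn's SVC parameters roughly as gamma ↔ γ, coef0 ↔ r, degree ↔ d. The toy XOR-style data below is hypothetical and only meant to show the parameters in use.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like toy labels: not linearly separable in the original space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])

poly_svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)
sigmoid_svm = SVC(kernel="sigmoid", gamma=0.5, coef0=0.0, C=1.0).fit(X, y)
print(poly_svm.score(X, y), sigmoid_svm.score(X, y))
```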
Pros
Works well when there is a clear margin of separation between classes
Effective in high-dimensional spaces
Effective when the number of dimensions exceeds the number of samples
Memory-efficient
Drawbacks
Not suitable for large datasets
Performs poorly when classes overlap
Underperforms when the number of features greatly exceeds the number of training samples
Lack of a probabilistic interpretation for classification
Computationally expensive for large datasets
Sensitive to the choice of kernel and parameters
Memory-intensive due to storing the kernel matrix
Limited to two-class problems
Not suitable for datasets with missing values