Pattern Recognition (Pattern Classification): Support Vector Machines

This document provides an overview of support vector machines (SVMs) for pattern recognition and classification. It discusses SVMs for the separable binary classification case where the hypothesis set contains the target concept. It presents the SVM optimization problem and regularization-based algorithms for finding the optimal hyperplane that maximizes the margin between the two classes while minimizing empirical risk. The document outlines the contents to include binary SVMs for separable and non-separable cases, geometric representations, kernel methods, and multiclass SVMs.


Pattern Recognition

(Pattern Classification)
Support Vector Machine (SVM)
Hypothesis set and Algorithm

Second Edition
Recall from Chapter 1
True error bound and linear hypotheses
• Linear in input space: $h(x) = w \cdot x + b$
• Linear in feature space: $h(x) = w \cdot \Phi(x) + b$

$\Lambda \ge \|h\|_{\mathbb H} = \mathcal R(W) = \|W\|_2 = \sqrt{w_1^2 + w_2^2 + w_3^2 + \dots}$
The learning algorithm tries to minimize an upper bound of the true error by finding $W$ for a given $\lambda$; $L(W)$ is an upper bound of the true risk (loss):

$\operatorname*{argmin}_{h \in H}\big(\hat R_S(h) + \lambda\,\mathcal R(h)\big) = \operatorname*{argmin}_{h \in H} L(W)$

(empirical risk $\hat R_S(h)$ plus complexity (regularizing) term $\lambda\,\mathcal R(h)$)

$L(W) = \frac{1}{m}\sum_{i=1}^{m} L_i\big(h(x_i, W), y_i\big) + \lambda\,\mathcal R(W), \qquad \mathcal R(W) = \|W\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$
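To make the regularized objective above concrete, here is a minimal numpy sketch, assuming a linear hypothesis h(x) = W·x + b and the hinge loss as the per-example loss L_i; the toy data is made up for illustration.

```python
import numpy as np

def regularized_hinge_objective(W, b, X, y, lam):
    """L(W) = (1/m) * sum_i max(0, 1 - y_i * h(x_i)) + lam * ||W||_2^2,
    with the linear hypothesis h(x) = W . x + b."""
    scores = y * (X @ W + b)                   # y_i * h(x_i) for every example
    hinge = np.maximum(0.0, 1.0 - scores)      # per-example margin loss L_i
    return hinge.mean() + lam * np.dot(W, W)   # empirical risk + complexity term

# Toy usage: two 2-D points with labels in {-1, +1}
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([+1.0, -1.0])
print(regularized_hinge_objective(np.array([0.5, -0.2]), 0.1, X, y, lam=0.01))
```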
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

This chapter is mostly based on:


Foundations of Machine Learning, 2nd Ed., by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, MIT Press, 2018

1- Binary support vector machine
Binary Support Vector Machines
• SVM is one of the most theoretically well-motivated and practically most effective classification algorithms in modern machine learning.
• We first introduce the algorithm for the consistent case (H contains the concept to learn), then present its general version designed for the inconsistent case, and finally provide a theoretical foundation for SVMs based on the notion of margin

SVM: a small generalization error learning
machine
• Consider an input space $X \subseteq \mathbb R^N$ with $N \ge 1$, an output or target space $Y = \{-1, +1\}$, and let $f: X \to Y$ be the target function (concept)
• Given a hypothesis set H of functions mapping $X$ to $Y$, the binary classification task is formulated as follows
• The learner receives a training sample $S$ of size $m$ drawn i.i.d. from $X$ according to some unknown distribution $\mathcal D$:
$S = ((x_1, y_1), \dots, (x_m, y_m))$, with $y_i = f(x_i)$ for all $i$

• The problem consists of determining a hypothesis $h \in$ H, a binary classifier, with small generalization error: $R_{\mathcal D}(h) = \Pr_{x \sim \mathcal D}[h(x) \ne f(x)]$

H: ρ-margin linear hyperplane set
• Different hypothesis sets H can be selected for this task. In view of
Occam's razor principle, hypothesis sets with smaller complexity and
smaller VC-dimension provide better learning guarantees, when
everything else is equal
• A natural hypothesis set with relatively small complexity is that of linear classifiers, or hyperplanes, which can be defined as follows:

$H = \{x \mapsto \operatorname{sign}(w \cdot x + b) : w \in \mathbb R^N,\ b \in \mathbb R\}$   (5.2)

• Learning problem is then referred to as a linear classification problem

H: ρ-margin linear hyperplane set
• The general equation of a hyperplane in $\mathbb R^N$ is $w \cdot x + b = 0$, where $w \in \mathbb R^N$ is a non-zero vector normal to the hyperplane and $b \in \mathbb R$ is a scalar.

• A hypothesis of the form $x \mapsto \operatorname{sign}(w \cdot x + b)$ thus labels positively all points falling on one side of the hyperplane $w \cdot x + b = 0$ and negatively all others

• The definition of the SVM solution is based on the notion of margin

Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

2- Binary SVM: consistent case
(H contains the concept to learn)
(Separable case)
Binary SVM - Consistent Case
• Concept class (positive class) :
• Negative class :

• Margin loss variable:

• Consistent case: can be learned such that for all training examples

Binary SVM Margin Loss functions:
• Score: $s_i = y_i h(x_i)$

• Hinge loss function: $\Phi_\rho\big(y_i h(x_i)\big) = \max\big(0,\ 1 - y_i h(x_i)\big)$

[Figure: the hinge loss $1 - y_i h(x_i)$ plotted against the score $y_i h(x_i) = s_i$; it is zero for $s_i \ge 1$.]

Regularization-based algorithm1
• Upper bound of the true risk:

$\hat R_{S,\rho=1}(h) = \frac{1}{m}\sum_{i=1}^m \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big), \qquad \mathcal R(W) = \|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$

$R(h) \le \mathcal L(w, b) = \frac{1}{m}\sum_{i=1}^m \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$   (5.48.1)

($\lambda\,\|w\|_2^2$ is the weighted regularizer; $\mathcal L(w, b)$ is an upper bound of the true risk; $\lambda \ge 0$ is the regularization parameter)

• The solution $w$ and $b$ of the optimization problem $\min_{w,b}\mathcal L(w, b)$ gives the SVM hypothesis

Regularization-based algorithm1.
• Regularization-based algorithms, recall from Chapter 1:

$\operatorname*{argmin}_{h \in H} L(W) = \frac{1}{m}\sum_{i=1}^m L_i\big(h(x_i, W), y_i\big) + \lambda\,\mathcal R(W)$

With the hinge loss $L_i = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big)$ this becomes

$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$

Regularization-based algorithm1..
• Finding $(w, b)$ that has minimum regularizer and no empirical loss:

Minimizing the regularizer with no empirical loss: $\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
Keeping scores 1 or more — subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

• (5.7) and (5.48.1) are convex optimization problems and specific instances of quadratic programming (QP)

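As a sanity check of (5.7) as a QP, here is a small sketch that solves the hard-margin primal directly with a generic convex QP solver; it assumes the cvxopt package and stacks the variables as z = [w, b]. The toy data is invented for illustration.

```python
import numpy as np
from cvxopt import matrix, solvers  # generic convex QP solver (assumed available)

def hard_margin_svm_primal(X, y):
    """Solve (5.7): min_{w,b} 0.5*||w||_2^2  s.t.  y_i*(w.x_i + b) >= 1,
    with the QP variables stacked as z = [w_1, ..., w_N, b]."""
    m, N = X.shape
    P = np.zeros((N + 1, N + 1)); P[:N, :N] = np.eye(N)   # penalize w only, not b
    q = np.zeros(N + 1)
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])     # -y_i*[x_i, 1].z <= -1
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.ravel(sol['x'])
    return z[:N], z[N]

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm_primal(X, y)
print(w, b, 1.0 / np.linalg.norm(w))   # the geometric margin is rho_h = 1/||w||_2
```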
Regularization-based algorithm2
• Regularization-based algorithms, with $C = 1/\lambda \ge 0$ and $\alpha_i \equiv 1/(m\lambda) \ge 0$:

$\operatorname*{argmin}_{h \in H} L(W) = \mathcal R(W) + \sum_{i=1}^m \alpha_i\, L_i\big(h(x_i, W), y_i\big)$

$\min_{w,b,\boldsymbol\alpha}\ L(W) = \|w\|_2^2 + \sum_{i=1}^m \alpha_i \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big), \qquad \boldsymbol\alpha = [\alpha_1 \dots \alpha_i \dots \alpha_m]$

hinge loss: $L_i = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big)$

Dual Problem for Algorithm2: Lagrangian
function
• The Lagrangian function associated to problem (5.7):

$\mathcal L(w, b, \boldsymbol\alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^m \alpha_i\,\big[1 - y_i(w \cdot x_i + b)\big]$   (5.8)

(the weighted sum on the right is an upper bound of the true risk; the $\alpha_i \ge 0$ are the Lagrange variables)

• The solution $w$ and $b$ at the saddle point $\min_{w,b}\max_{\boldsymbol\alpha \ge 0}\mathcal L(w, b, \boldsymbol\alpha)$ is the solution of the primal (5.7)

Algorithm- QP solvers
• A variety of commercial and open-source solvers are available for
solving convex QP problems.
• Specialized algorithms have been developed to more efficiently solve
this particular convex QP problem, see appendix
A QP solver proceeds by:
• setting the gradient of the Lagrangian with respect to the primal variables $w$ and $b$ to zero, and
• setting $\alpha_i\,\big[1 - y_i(w \cdot x_i + b)\big] = 0$ for all $i$ (complementary slackness conditions)

Derivatives of Lagrangian function
$h(x_i) = w \cdot x_i + b$

$\nabla_w \mathcal L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i$   (5.9)

$\nabla_b \mathcal L = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0$   (5.10)

no empirical loss: $\forall i,\ \alpha_i = 0 \ \vee\ y_i(w \cdot x_i + b) = 1$   (5.11)

$x_i$ is called a support vector (support example) when $\alpha_i \ne 0$

$w$ is unique
• The solution $w$ of the SVM problem is unique, but the support vectors are not

• In dimension $N$, $N + 1$ points are sufficient to define a hyperplane. When more than $N + 1$ points lie on a marginal hyperplane, different choices are possible for the support vectors

Dual optimization problem
• Plugging (5.9) and (5.10) into the Lagrangian function (5.8) yields:

$\mathcal L(w, b, \boldsymbol\alpha) = \mathcal L_{dual}(\boldsymbol\alpha) = \frac{1}{2}\Big\|\sum_{i=1}^m \alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \underbrace{\sum_{i=1}^m \alpha_i y_i b}_{=\,0} + \sum_{i=1}^m \alpha_i$   (5.12)

• which simplifies to

$\mathcal L_{dual}(\boldsymbol\alpha) = -\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^m \alpha_i$   (5.13)

Dual optimization solution
• This leads to the following dual optimization problem for SVMs in the separable case:

$\max_{\boldsymbol\alpha}\ \mathcal L_{dual}(\boldsymbol\alpha) = -\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^m \alpha_i$   (5.14)

subject to: $\alpha_i \ge 0 \ \wedge\ \sum_{i=1}^m \alpha_i y_i = 0,\ i \in [m]$

• The dual objective function is concave and differentiable. The dual optimization problem is a QP problem; general-purpose and specialized QP solvers can be used
• The SMO (Sequential Minimal Optimization) algorithm is used to solve the dual form of the SVM problem in the more general non-separable setting

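The dual (5.14) is also a small QP; the sketch below, again assuming cvxopt, solves it and recovers w via (5.9) and b via (5.16). The small diagonal ridge and the 1e-6 threshold on α are numerical conveniences, not part of the algorithm.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed QP backend, as in the primal sketch

def hard_margin_svm_dual(X, y):
    """Solve (5.14) as: min_a 0.5*a'Qa - 1'a  s.t.  a >= 0 and y'a = 0,
    where Q_ij = y_i*y_j*(x_i . x_j).  y must be a float array."""
    m = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T + 1e-8 * np.eye(m)              # tiny ridge for numerical stability
    sol = solvers.qp(matrix(Q), matrix(-np.ones(m)),
                     matrix(-np.eye(m)), matrix(np.zeros(m)),
                     matrix(y.reshape(1, -1)), matrix(0.0))
    alpha = np.ravel(sol['x'])
    w = (alpha * y) @ X                            # (5.9)
    j = int(np.argmax(alpha))                      # index of a support vector
    b = y[j] - (alpha * y) @ (X @ X[j])            # (5.16)
    return alpha, w, b
```

On the toy data from the primal sketch this should return the same (w, b) up to numerical precision, with non-zero α only for the support vectors.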
Primal and dual problems are equivalent
• The solution of the dual problem (5.14) can be used directly to determine the hypothesis returned by SVMs, using equation (5.9):

$h(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\Big(\sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + b\Big)$   (5.15)

$w = \sum_{i=1}^m \alpha_i y_i x_i$   (5.9)

• Since the support vectors lie on the marginal hyperplanes, for any support vector $x_j$ we have $w \cdot x_j + b = y_j$, and thus $b$ can be obtained via

$b = y_j - \sum_{i=1}^m \alpha_i y_i (x_i \cdot x_j)$   (5.16)

Inner products between vectors
• The dual optimization problem (5.14) and the expressions (5.15) and (5.16) reveal an important property of SVMs:
• the hypothesis solution depends only on inner products between vectors and not directly on the vectors themselves
• This observation is key and its importance will become clear when kernel methods are introduced
• We can also derive the following expression (see page 85 of the text for details):

$\|w\|_2^2 = \sum_{i=1}^m \alpha_i = \|\boldsymbol\alpha\|_1$   (5.19)

Theorem 5.4
• Let $S$ be a linearly separable sample of size $m + 1$ drawn i.i.d. according to $\mathcal D$.
• Let $h_S$ be the hypothesis returned by SVMs for a sample $S$, and let $N_{SV}(S)$ be the number of support vectors that define $h_S$. Then the average generalization error is bounded by the average fraction of support vectors:

$\mathbb E_{S \sim \mathcal D^{m+1}}\big[R(h_S)\big] \le \mathbb E_{S \sim \mathcal D^{m+1}}\Big[\frac{N_{SV}(S)}{m+1}\Big]$   (5.4)

Leave-one-out error: $\hat R_{LOO}(\text{SVM}) \le \frac{N_{SV}(S)}{m+1}$

• where $\mathcal D$ denotes the distribution according to which points are drawn


Theorem 5.4
Theorem 5.4 gives a sparsity argument in favor of SVMs:
• Average error of algorithm is upper bounded by average fraction of support
vectors
• One may hope that for many distributions seen in practice, a relatively small
number of training points be the support vectors
• Solution will then be sparse in sense that a small fraction of dual variables will be
non-zero
• Theorem 5.4 is a relatively weak bound since it applies only to the average generalization error of the algorithm over all samples of size $m$; it provides no information about the variance of the generalization error
• We present stronger high-probability bounds based on notion of margin
(Theorem 5.10).
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

A Geometric Representation of SVM
• There are infinitely many such separating hyperplanes $w \cdot x + b = 0$
• Linear hypotheses of the form $h(x) = \operatorname{sgn}(w \cdot x + b)$
• Consistent: every training example is classified correctly
• There are infinitely many $(w, b)$ that separate the training set

[Figure: a linearly separable training set with several consistent separating hyperplanes.]
Which one is the best ?
• Answer: to keep the upper bound of the true risk as low as possible, for a given $S$ and H, we are looking for the maximum ρ-margin in the loss function while the empirical error is zero

A Geometric Representation of SVM
• This is equivalent to the existence of $(w, b)$ such that $y_i h(x_i) \ge 0$ for all $i$:

[Figure: hyperplane $h(x) = w \cdot x + b = 0$; on the $y = +1$ side $h(x_i) \ge 0$ and $y_i h(x_i) \ge 0$; on the $y = -1$ side $h(x_i) \le 0$ and $y_i h(x_i) \ge 0$. Score of $x_i$: $s_i = y_i h(x_i)$.]

Definition 5.1 – margin of examples
• Geometric margin at a point $x_i$ = distance from $x_i$ to the hyperplane $w \cdot x + b = 0$:

$\rho_i = \frac{y_i h(x_i)}{\|w\|_2}, \qquad \|w\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_N^2}, \qquad y_i h(x_i) \ge 0$

(the numerator is the score of $x_i$: $s_i = y_i h(x_i)$)

Geometric margin
• The geometric margin $\rho_h$ of a linear classifier $h$ for a sample $S$ is the minimum geometric margin over the points in the sample, that is, the distance of the hyperplane defining $h$ to the closest sample points.

[Figure: separating hyperplane $h(x) = 0$ with a band of total width $2\rho_h$ around it.]

Geometric margin
Marginal hyperplanes $h(x) - 1 = 0$ and $h(x) + 1 = 0$ around $h(x) = 0$; for a point $x_i$ with $h(x_i) \ge 0$, $y_i = +1$:

$\rho_{i,+1} = \frac{y_i\,(h(x_i) - 1)}{\|w\|_2} = \rho_i - \frac{1}{\|w\|_2}, \qquad \rho_{i,-1} = \frac{y_i\,(h(x_i) + 1)}{\|w\|_2} = \rho_i + \frac{1}{\|w\|_2}$

$\rho_{i,-1} - \rho_{i,+1} = \frac{2}{\|w\|_2} = 2\rho_h$
Margin of $x_i$ based on $\rho_h$:

$\frac{\rho_i}{\rho_h} = y_i h(x_i) = s_i$

[Figure: point $x_i$ at distance $\rho_i$ from the hyperplane $h(x) = 0$; the band around the hyperplane has width $2\rho_h = \frac{2}{\|w\|_2}$.]

SVM: maximum ρ-margin and no empirical error

• To keep the upper bound of the true risk as low as possible, we are looking for the maximum ρ-margin in the loss function while the empirical error is zero

• It means: maximize $\rho_h$

• What is the maximum $\rho_h$ possible?

SVM: maximum ρ-margin and no empirical error

[Figure: separable training set with the margin band $2\rho_h = \frac{2}{\|w\|_2}$ around $h(x) = 0$, and the zero-one, hinge ($1 - \frac{\rho_i}{\rho_h}$), and quadratic hinge ($(1 - \frac{\rho_i}{\rho_h})^2$) losses plotted against $\frac{\rho_i}{\rho_h}$.]

For a $-1$ example $x_j$: $\rho_j \ge \rho_h \Rightarrow \frac{\rho_j}{\rho_h} \ge 1$ — no training margin loss on $-1$ examples
For a $+1$ example $x_i$: $\rho_i \ge \rho_h \Rightarrow \frac{\rho_i}{\rho_h} \ge 1$ — no training margin loss on $+1$ examples

SVM margin loss functions: $L_i = \max\big(0,\ 1 - y_i h(x_i)\big)$

zero-one loss: $1_{y_i h(x_i) \le 0}$
hinge loss: $\max\big(0,\ 1 - y_i h(x_i)\big)$
quadratic hinge: $\max\big(0,\ 1 - y_i h(x_i)\big)^2$

plotted against $\frac{\rho_i}{\rho_h} = y_i h(x_i) = s_i$

Figure 5.5: Both the hinge loss and the quadratic hinge loss provide convex upper bounds on the binary zero-one loss.
Dual of Algorithm1: Lagrangian function
• The Lagrangian function associated to problem (5.48.1) is an upper bound of the true risk $R(h)$, with $\mathcal R(W) = \|w\|_2^2 = w_1^2 + w_2^2 + w_3^2 + \dots + w_N^2$:

$R(h) \le \mathcal L(w, b) = \frac{1}{m}\sum_{i=1}^m \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2$   (5.48.1)

($\lambda \ge 0$ is the regularization parameter / Lagrange variable)

• The solution $w$ and $b$ of $\min_{w,b}\mathcal L(w, b)$ is the solution of the primal
SVM Primal Algorithm2
• Finding $(w, b)$ that has maximum geometric margin and no empirical loss:

$\max_{w,b}\ \rho_h = \frac{1}{\|w\|_2}$   (5.7.1)

subject to no empirical loss: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$. Note that $\rho_h = 1/\|w\|_2$.
• Or, equivalently, minimizing $\frac{1}{2}\|w\|_2^2$ (a convex optimization problem and a specific instance of quadratic programming (QP)):

Minimizing the regularizer with no empirical loss: $\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
Keeping scores 1 or more — subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

SVM Primal Algorithm2.
• Given $S$, find $w$ and $b$ that maximize the geometric margin while there is no training error:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$

• The resulting algorithm precisely coincides with (5.48.1)

Dual of Algorithm2
• The Lagrangian function associated to problem (5.7):

$\mathcal L(w, b, \boldsymbol\alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^m \alpha_i\,\big[1 - y_i(w \cdot x_i + b)\big]$

(Lagrange variables $\alpha_i \ge 0$)

• The solution $w$ and $b$ of the dual problem is the solution of the primal

• Note that: $\rho_h^2 = \frac{1}{\|w\|_2^2} = \frac{1}{\sum_{i=1}^m \alpha_i} = \frac{1}{\|\boldsymbol\alpha\|_1}$   (5.19)

VC-dimension of the ρ-margin hyperplane (linear) set H
• VC-dimension $d$ of the ρ-margin loss function and linear set H:
• $N$ is the dimension of the input space
• Let the vectors in $X$ belong to a sphere of radius $R$:

$d \le \min\!\Big(\Big\lceil \frac{R^2}{\rho^2} \Big\rceil,\ N\Big) + 1$

• Using a large ρ, the generalization ability of the constructed hyperplane is high:
maximizing ρ minimizes the upper bound on $d$ (and hence on the true error)
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. A Geometric Representation of SVM
4. Binary SVM: Non-separable case (Inconsistent case)
5. Kernel Methods
6. Multiclass SVM

3- Binary SVM: Non-separable
case
Inconsistent case (non-separable case), H
• In most practical settings, the training data is not linearly separable: for any hyperplane $w \cdot x + b = 0$, there exists $x_i \in S$ such that

$y_i(w \cdot x_i + b) < 1$   (5.22)

• The constraints imposed in the linearly separable case cannot all hold simultaneously:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)
subject to: $y_i(w \cdot x_i + b) \ge 1,\ i \in [m]$
Inconsistent case (non-separable case)
• We introduce new slack variables $\xi_i \ge 0$ in the consistent SVM algorithm to measure the empirical loss.

Figure 5.4: A separating hyperplane with a point $x_i$ classified incorrectly and a point $x_j$ correctly classified but with margin less than 1; $x_i$ and $x_j$ are outliers ($\xi_i > 0$ and $\xi_j > 0$).
Loss of Inconsistent case
• $\xi_i$ represents the (hinge) loss for $x_i$:

$L\big(y_i h(x_i)\big) = 1 - y_i h(x_i) = \xi_i, \qquad\text{i.e.}\qquad y_i h(x_i) = 1 - \xi_i$

Total empirical loss $= \sum_{i=1}^m \xi_i$

[Figure: the hinge loss as a function of $y_i h(x_i)$; the outliers $x_i$ and $x_j$ lie at distances $\xi'_i$ and $\xi'_j$ from their marginal hyperplanes $h(x) - 1 = 0$ and $h(x) + 1 = 0$.]
Error of outliers:
$0 < \xi_i < 1$: $x_i$ is on the correct side of the separating hyperplane (but inside the margin)

$1 < \xi_i$: $x_i$ is on the incorrect side of the separating hyperplane

Relaxed constraints
• A relaxed version of these constraints can indeed hold: for each $i$, there exists $\xi_i \ge 0$ such that

Subject to: $y_i(w \cdot x_i + b) \ge 1$   relaxed to   Subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$

• And therefore the loss function becomes:

$L_i\big(y_i h(x_i)\big) = \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big)$

• The slack variable $\xi_i$ measures the quantity by which the vector $x_i$ violates the desired inequality $y_i(w \cdot x_i + b) \ge 1$
Soft margin – Hard margin
• A vector $x_i$ with $\xi_i > 1$ can be viewed as a misclassified example
• A vector $x_i$ with $0 < \xi_i \le 1$ is correctly classified by the hyperplane but is considered to be an outlier, that is, $\xi_i > 0$

• If we omit misclassified examples and outliers, the training data is correctly separated with a margin that we refer to as a soft margin, as opposed to the hard margin in the separable case

Empirical loss, large-margin, loss function
• One idea consists of selecting hyperplane that minimizes empirical
loss (that is ERM)
• But, that solution will not benefit from large-margin guarantees

• The problem of determining a hyperplane with the smallest zero-one loss, that is, the smallest number of misclassifications, is NP-hard as a function of the dimension of the space. Using the hinge or quadratic hinge loss functions is computationally feasible

Loss functions

$\frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i^p$   (slack / error terms)

• There are many possible choices for $p$ leading to more or less aggressive penalizations of the slack terms
• The choices $p = 1$ and $p = 2$ lead to the most straightforward solutions. The loss functions associated with $p = 1$ and $p = 2$ are called the hinge loss and the quadratic hinge loss, respectively.

zero-one loss: $1_{yh(x) \le 0}$
hinge loss: $\max\big(0,\ 1 - yh(x)\big)$
quadratic hinge: $\max\big(0,\ 1 - yh(x)\big)^2$

Figure 5.5: Both the hinge loss and the quadratic hinge loss provide convex upper bounds on the binary zero-one loss (plotted against $yh(x)$).
Two conflicting objectives: loss and margin
• On one hand, we wish to limit the total amount of empirical loss (slack penalty) due to misclassified examples and outliers, which can be measured by $\sum_{i=1}^m \xi_i$ or, more generally, by $\sum_{i=1}^m \xi_i^p$ for some $p \ge 1$.

• On the other hand, we seek a hyperplane with a large margin, though a larger margin can lead to more misclassified examples and outliers and thus larger amounts of loss

Primal optimization problem
• This leads to the following general optimization problem defining SVMs in the non-separable case, where the parameter $C$ determines the trade-off between margin-maximization (or minimization of $\|w\|^2$) and minimization of the slack penalty $\sum_i \xi_i^p$. A small $C$ allows a large empirical loss (regularization using $C$):

$\min_{w,b,\boldsymbol\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i^p$   (5.24)

subject to: $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ (relaxed score constraints) and $\xi_i \ge 0$, $i \in [m]$ (non-negativity constraints of the slack (error) variables).
Not all examples need to satisfy the score constraint.
• (5.24) is a convex optimization problem
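For p = 1, problem (5.24) can be handed to a QP solver directly by stacking z = [w, b, ξ]; a sketch under the same cvxopt assumption as before:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed QP backend

def soft_margin_svm_primal(X, y, C):
    """Solve (5.24) with p = 1 over z = [w, b, xi]:
    min 0.5*||w||^2 + C*sum(xi)  s.t.  y_i*(w.x_i + b) >= 1 - xi_i,  xi_i >= 0."""
    m, N = X.shape
    n = N + 1 + m
    P = np.zeros((n, n)); P[:N, :N] = np.eye(N)                      # regularizer on w only
    q = np.hstack([np.zeros(N + 1), C * np.ones(m)])                 # slack penalty C*sum(xi)
    G_score = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(m)])  # relaxed score constraints
    G_slack = np.hstack([np.zeros((m, N + 1)), -np.eye(m)])          # non-negativity of xi
    h = np.hstack([-np.ones(m), np.zeros(m)])
    sol = solvers.qp(matrix(P), matrix(q), matrix(np.vstack([G_score, G_slack])), matrix(h))
    z = np.ravel(sol['x'])
    return z[:N], z[N], z[N + 1:]                                    # w, b, slack variables xi
```

A large C drives the slacks toward zero (hard-margin behaviour); a small C tolerates more margin violations in exchange for a larger margin.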
A regularization view
• The optimization problem (5.24) presents a regularization-based solution
• Higher $C$ means lower training error and a smaller margin
• Lower $C$ means higher training error and a larger margin

$\operatorname*{argmin}_{h \in H}\big(\hat R_S(h) + \lambda\,\mathcal R(h)\big) \quad\Longleftrightarrow\quad \min_{w,b,\boldsymbol\xi}\ \frac{1}{m}\sum_{i=1}^m \xi_i^p + \lambda\,\|w\|_2^2$

$\operatorname*{argmin}_{h \in H}\big(1/\rho_h + C\,\hat R(h)\big) \quad\Longleftrightarrow\quad \min_{w,b,\boldsymbol\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i^p, \qquad C \propto 1/\lambda$

A regularization view
• Back to the results in Chapter 1 for regularization-based algorithms:

$\operatorname*{argmin}_{h \in H}\big(\hat R_S(h) + \lambda\,\mathcal R(h)\big)$

$\operatorname*{argmin}_{h \in H} L(w) = \frac{1}{m}\sum_{i=1}^m L_i\big(h(x_i, W), y_i\big) + \lambda\,\mathcal R(w)$

$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \max\big(0,\ 1 - y_i(w \cdot x_i + b)\big) + \lambda\,\|w\|_2^2, \qquad L\big(y_i h(x_i)\big) = \xi_i$

Lagrangian function
Primal: $\min_{w,b,\boldsymbol\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i^p$, subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.

• The analysis is presented in the case of the hinge loss ($p = 1$), which is the most widely used loss function for SVMs.
• We introduce Lagrange variables $\alpha_i \ge 0$, associated to the score constraints, and $\beta_i \ge 0$, associated to the non-negativity constraints of the slack variables.
• We denote by $\boldsymbol\alpha$ the vector $(\alpha_1, \dots, \alpha_m)^\top$ and by $\boldsymbol\beta$ the vector $(\beta_1, \dots, \beta_m)^\top$.
• The Lagrangian (dual variables $\boldsymbol\alpha \ge 0$, $\boldsymbol\beta \ge 0$) can then be defined by

$\mathcal L(w, b, \boldsymbol\xi, \boldsymbol\alpha, \boldsymbol\beta) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i - \sum_{i=1}^m \alpha_i\,\big[y_i(w \cdot x_i + b) - 1 + \xi_i\big] - \sum_{i=1}^m \beta_i\,\xi_i$   (5.25)
Derivatives of Lagrangian function
• A vector $x_i$ appears in the solution iff $\alpha_i \ne 0$. Such vectors are called support vectors.
Setting the gradient of the Lagrangian (5.25) to zero and writing the complementary slackness conditions:

$\nabla_w \mathcal L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i$   (5.26)

$\nabla_b \mathcal L = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0$   (5.27)

$\nabla_{\xi_i} \mathcal L = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \alpha_i + \beta_i = C$   (5.28)

$\forall i,\ \alpha_i\,\big[y_i(w \cdot x_i + b) - 1 + \xi_i\big] = 0$   (5.29)

$\forall i,\ \beta_i\,\xi_i = 0$   (5.30)

Two types of support vectors
• By the complementarity condition (5.29), if $\alpha_i \ne 0$, then $y_i(w \cdot x_i + b) = 1 - \xi_i$

• If $\xi_i = 0$, then $y_i(w \cdot x_i + b) = 1$ and $x_i$ lies on a marginal hyperplane, as in the separable case (and $0 < \alpha_i \le C$)

• Otherwise, $\xi_i \ne 0$ and $x_i$ is an outlier. In this case, (5.30) implies $\beta_i = 0$ and (5.28) requires $\alpha_i = C$

• As in the separable case, the weight vector solution $w$ is unique; the support vectors are not.

Two types of support vectors
• Support vectors are either outliers, in which case $\alpha_i = C$, or vectors lying on the marginal hyperplanes

Support vectors: the points with $\alpha_i > 0$

Dual optimization problem (5.24)
• Plugging the expression of $w$ in terms of the dual variables (5.26) into the Lagrangian and applying constraint (5.27) yields:

$\mathcal L_{dual}(\boldsymbol\alpha) = \frac{1}{2}\Big\|\sum_{i=1}^m \alpha_i y_i x_i\Big\|^2 - \sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) - \underbrace{\sum_{i=1}^m \alpha_i y_i b}_{=\,0} + \sum_{i=1}^m \alpha_i$   (5.31)

• Remarkably, we find that the objective function is no different than in the separable case:

$\mathcal L_{dual}(\boldsymbol\alpha) = -\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^m \alpha_i$   (5.32)

Dual optimization problem: non-separable
• The dual problem only differs from that of the separable case (5.14) by the constraints $0 \le \alpha_i \le C$:

$\max_{\boldsymbol\alpha}\ \mathcal L_{dual}(\boldsymbol\alpha) = -\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^m \alpha_i$   (5.33)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^m \alpha_i y_i = 0,\ i \in [m]$

• The objective function is concave and differentiable and (5.33) is equivalent to a convex QP. The problem is equivalent to the primal problem (5.24).

Hypothesis
• The solution of the dual problem (5.33) can be used directly to determine the hypothesis returned by SVMs, using equation (5.26):

$h(x) = \operatorname{sgn}(w \cdot x + b) = \operatorname{sgn}\Big(\sum_{i=1}^m \alpha_i y_i (x_i \cdot x) + b\Big)$   (5.34), with $w = \sum_{i=1}^m \alpha_i y_i x_i$

• $b$ can be obtained from any support vector $x_j$ lying on a marginal hyperplane ($0 < \alpha_j < C$):

$b = y_j - \sum_{i=1}^m \alpha_i y_i (x_i \cdot x_j)$   (5.35)

• An important property of SVMs: the hypothesis solution depends only on inner products between vectors and not directly on the vectors themselves

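In practice the dual (5.33) is solved by SMO-style solvers such as libsvm; the sketch below assumes scikit-learn (whose SVC wraps libsvm) and only inspects the quantities named above: the support vectors and the signed dual coefficients α_i y_i. The data is made up.

```python
import numpy as np
from sklearn.svm import SVC  # libsvm/SMO-based solver (assumed installed)

X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [0.2, 0.1]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=10.0).fit(X, y)   # solves the dual (5.33) internally
print(clf.support_)                # indices of the support vectors
print(clf.dual_coef_)              # alpha_i * y_i for the support vectors only
print(clf.coef_, clf.intercept_)   # w (linear kernel only) and b, as in (5.34)
```

Support vectors with |α_i| = C are outliers; those with |α_i| < C lie on the marginal hyperplanes.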
Generalization bounds using margin
theory
• Generalization bounds provide a strong theoretical justification for the SVM algorithm.
• Confidence margin: the confidence margin of a real-valued function $h$ at a point $x$ labeled with $y$ is the quantity $yh(x)$

• When $yh(x) > 0$, $h$ classifies $x$ correctly

• $|h(x)|$ is the confidence of the prediction made by $h$
Margin loss function
• For any parameter $\rho > 0$, we have a ρ-margin loss function $\Phi_\rho$ that penalizes $h$ with a cost of 1 when it misclassifies the point $x$ ($yh(x) \le 0$), and penalizes $h$ linearly when it correctly classifies $x$ with confidence less than or equal to ρ ($0 \le yh(x) \le \rho$).

• The parameter ρ can be interpreted as the confidence margin demanded from a hypothesis

$y_i h(x_i) = \rho_i / \rho_h$

Figure 5.6: the ρ-margin loss function.

Empirical margin loss
• Definition 5.6: Given a sample $S$ and a hypothesis $h$, the empirical margin loss is defined by

$\hat R_{S,\rho}(h) = \frac{1}{m}\sum_{i=1}^m \Phi_\rho\big(y_i h(x_i)\big)$   (5.37)

Generalization bound for linear
hypotheses
• Corollary 5.11: Let $H = \{x \mapsto w \cdot x : \|w\|_2 \le \Lambda\}$ and assume that $\|x\|_2 \le r$ for all $x \in X$. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample $S$ of size $m$, the following holds for any $h \in H$:

$R(h) \le \hat R_{S,\rho}(h) + 2\,\frac{r\Lambda}{\rho\sqrt m} + \sqrt{\frac{\log(1/\delta)}{2m}}$   (5.44)

• In the separable case, for a linear hypothesis with geometric margin $\rho_h$ and the choice of confidence margin parameter $\rho = \rho_h$, the empirical margin loss is 0: $\hat R_{S,\rho}(h) = 0$

Generalization bound for linear
hypotheses
• (5.44) ⇒ a small generalization error can be achieved when:
• $r\Lambda/(\rho\sqrt m)$ is small, and
• the empirical margin loss is relatively small

• For a given problem, a larger margin ρ means a smaller upper bound on the generalization error

• This is a strong justification for margin-maximization algorithms such as SVMs

From Bound to Optimization problem
• An algorithm based on this theoretical guarantee consists of minimizing the right-hand side of (5.44), that is, minimizing an objective function with a term corresponding to the sum of the slack variables, $\sum_{i=1}^m \xi_i$, and another term minimizing $\|w\|_2^2$:

$R(h) \le \hat R_{S,\rho}(h) + 2\,\frac{r\Lambda}{\rho\sqrt m} + \sqrt{\frac{\log(1/\delta)}{2m}}$   (5.44)
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. Binary SVM: Non-separable case (Inconsistent case)
4. Kernel Methods
5. Multiclass SVM

4- Kernel Methods
Kernel Methods and non-separable case, H
• When $S$ is not linearly separable, the target function $c$ is nonlinear
• Q: How can we use a linear hypothesis set H to learn a nonlinear $c$?
• A: Kernel method:
 • use a nonlinear mapping $\Phi$ from the input space to a higher-dimensional feature space $\mathbb H$;
 • H is linear in the feature space;
 • H is nonlinear in the input space.
• Now we train a nonlinear hypothesis. In some applications it is possible to find a consistent hypothesis in the feature space for some $\Phi$, so the problem becomes consistent: no training error (consistent case)

Example: degree-2 Polynomial Kernel
• Suppose the input space is $\mathbb R^2$ and the kernel degree is $d = 2$
• Nonlinear mapping using $\Phi$
• Then $h(x)$ is nonlinear in the input space
• And $h(\Phi(x))$ is linear in the feature space

• $h$ is nonlinear in the input space and linear in the feature space

• The complexity of a linear hypothesis in the feature space is twice that of a linear hypothesis in the input space

Figure 6.1: Non-linearly separable case. The classification task consists of discriminating between blue and red points. (a) No hyperplane can separate the two populations — linear in the input space, $\hat R > 0$. (b) A non-linear mapping can be used instead — linear in the feature space and nonlinear in the input space, $\hat R = 0$.

Polynomial Kernel and complexity of
linear H
• Complexity of a linear H in the feature space grows rapidly with the polynomial degree
• For example: input space in $\mathbb R^N$ and polynomial kernel of degree $d$ — the feature-space dimension grows combinatorially in $N$ and $d$
• The VC-dimension is huge and the classifier can easily overfit the training data

SVMs with kernels
• Replacing each instance of an inner product $(x_i \cdot x_j)$ in (5.33) with a kernel value $K(x_i, x_j)$:

$\max_{\boldsymbol\alpha}\ \mathcal L_{dual}(\boldsymbol\alpha) = -\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^m \alpha_i$   (6.13)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^m \alpha_i y_i = 0,\ i \in [m]$

• The solution can be written as:

$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^m \alpha_i y_i K(x_i, x) + b\Big)$   (6.14)

with $b = y_j - \sum_{i=1}^m \alpha_i y_i K(x_i, x_j)$ for any support vector $x_j$ with $0 < \alpha_j < C$

Definition 6.1 (Kernels)
• A function $K: X \times X \to \mathbb R$ is called a kernel over $X$
• The idea is to define a kernel $K$ such that, for any two examples $x, x' \in X$, $K(x, x')$ is equal to an inner product of the vectors $\Phi(x)$ and $\Phi(x')$:

$\forall x, x' \in X,\quad K(x, x') = \langle \Phi(x), \Phi(x') \rangle$   (6.1)

• Since an inner product is a measure of the similarity of two vectors, $K(x, x')$ is often interpreted as a similarity measure between elements of the input space

Polynomial kernels
• For any constant $c > 0$, a polynomial kernel of degree $d$ is the kernel defined over $\mathbb R^N$ by:

$\forall x, x' \in \mathbb R^N,\quad K(x, x') = (x \cdot x' + c)^d$   (6.3)

• Example ($N = 2$, $d = 2$): $K(x, x') = (x_1 x'_1 + x_2 x'_2 + c)^2 = \langle \Phi(x), \Phi(x') \rangle$ with

$\Phi(x) = \big[\,x_1^2,\ x_2^2,\ \sqrt2\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c\,\big]$
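The expansion above can be checked numerically: the sketch below compares K(x, x') = (x·x' + c)² with the explicit inner product ⟨Φ(x), Φ(x')⟩ for N = 2 (the two test points are arbitrary).

```python
import numpy as np

def phi(x, c):
    """Explicit degree-2 feature map for N = 2 matching the expansion above."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

def poly_kernel(x, xp, c, d=2):
    return (np.dot(x, xp) + c) ** d            # (6.3)

x, xp, c = np.array([1.0, 2.0]), np.array([0.5, -1.0]), 1.0
print(poly_kernel(x, xp, c))                   # kernel value in the input space
print(phi(x, c) @ phi(xp, c))                  # same value as an inner product in feature space
```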
Example: XOR and the 2nd-degree polynomial kernel, with feature map $\Phi(x) = [\,x_1^2,\ x_2^2,\ \sqrt2\,x_1 x_2,\ \sqrt2\,x_1,\ \sqrt2\,x_2,\ 1\,]$. The four XOR points map to $(1,1,+\sqrt2,-\sqrt2,-\sqrt2,1)$, $(1,1,+\sqrt2,+\sqrt2,+\sqrt2,1)$, $(1,1,-\sqrt2,-\sqrt2,+\sqrt2,1)$ and $(1,1,-\sqrt2,+\sqrt2,-\sqrt2,1)$; the SVM solution $h(\Phi(x))$ separates them along the $\sqrt2\,x_1 x_2$ coordinate.

Figure 6.3: Illustration of the XOR classification problem and the use of polynomial kernels. (a) XOR problem linearly non-separable in the input space. (b) Linearly separable using a second-degree polynomial kernel.

Gaussian kernels or radial basis function
(RBF)
• Gaussian kernels are among the most frequently used kernels in applications:

$\forall x, x' \in \mathbb R^N,\quad K(x, x') = \exp\!\Big(\!-\frac{\|x - x'\|_2^2}{2\sigma^2}\Big)$   (6.5)

• For $x' = x$, $K(x, x') = 1$ (maximum similarity)

• $K(x, x') \to 0$ when $\|x - x'\| \to \infty$ (maximum dissimilarity)

• What is the nonlinear mapping $\Phi$?

• What is the complexity of a linear H in the feature space?
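A minimal numpy sketch of (6.5), computing the Gaussian kernel matrix of a small set of made-up points (note that many libraries parametrize it with γ = 1/(2σ²) instead of σ):

```python
import numpy as np

def rbf_kernel(X, Xp, sigma):
    """Gaussian kernel (6.5): K(x, x') = exp(-||x - x'||^2 / (2*sigma^2))."""
    sq_dists = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
print(np.round(rbf_kernel(X, X, sigma=1.0), 4))  # diagonal = 1 (maximum similarity);
                                                 # far-apart points give values near 0
```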
Sigmoid kernels
• For any real constants $a, b \ge 0$, a sigmoid kernel is defined over $\mathbb R^N$ by:

$\forall x, x' \in \mathbb R^N,\quad K(x, x') = \tanh\big(a\,(x \cdot x') + b\big)$   (6.6)

• Using sigmoid kernels with SVMs leads to an algorithm that is closely related to learning algorithms based on simple neural networks with a sigmoid activation function.

Example-1D
• Suppose we have five 1-D data points as the training set:
• $(x_1=1, y_1=+1),\ (x_2=2, y_2=+1),\ (x_3=4, y_3=-1),\ (x_4=5, y_4=-1),\ (x_5=6, y_5=+1)$

data point:   1    2    3    4    5
input x:      1    2    4    5    6
class label: +1   +1   -1   -1   +1

$\Phi(x_i) = \big[\,x_i^2,\ \sqrt2\,x_i,\ 1\,\big], \qquad K(x, x_i) = (x\,x_i + 1)^2$
Example-1D, feature space: $\Phi(x_i) = [\,x_i^2,\ \sqrt2\,x_i,\ 1\,]$, e.g. $x_2 = 2 \Rightarrow \Phi(x_2) = [4,\ 2\sqrt2,\ 1]$ and $x_5 = 6 \Rightarrow \Phi(x_5) = [36,\ 6\sqrt2,\ 1]$.

[Figure: the five mapped points $\Phi(x_1), \dots, \Phi(x_5)$; $\Phi(x_2)$, $\Phi(x_4)$ and $\Phi(x_5)$ are the support vectors. Find the widest strip just by looking at the data!]

Example-1D, Applying SVM algorithm
• Polynomial kernel of degree 2: $K(x, x_i) = (x \cdot x_i + 1)^2$
• $C$ is set to 100 (using a large $C$ means we want to emphasize minimizing the training error)
• We first find $\boldsymbol\alpha$ by solving the dual:

$\max_{\boldsymbol\alpha}\ \mathcal L_{dual}(\boldsymbol\alpha) = \sum_{i=1}^5 \alpha_i - \frac{1}{2}\sum_{i=1}^5\sum_{j=1}^5 \alpha_i\alpha_j y_i y_j (x_i x_j + 1)^2$

subject to: $0 \le \alpha_i \le 100 \ \wedge\ \sum_{i=1}^5 \alpha_i y_i = 0$

Example-1D: $K(x, x_i) = (x\,x_i + 1)^2$ for the points $x = 1, 2, 4, 5, 6$; e.g. $K(x_5, x_5) = (6\times6+1)^2$ and $K(x_1, x_5) = (6\times1+1)^2$.

$\mathcal L_{dual}(\boldsymbol\alpha) = \sum_{i=1}^5 \alpha_i - \frac{1}{2}\sum_{i=1}^5\sum_{j=1}^5 \alpha_i\alpha_j y_i y_j (x_i x_j + 1)^2$

(expanding the double sum gives an explicit quadratic function of $\alpha_1, \dots, \alpha_5$)

Example-1D
• Finding α to maximize $\mathcal L_{dual}$:
• by solving the five stationarity equations $\partial\mathcal L/\partial\alpha_i = 0$ together with $\sum_i y_i\alpha_i = 0$, or by using a QP solver; for example,

$\frac{\partial\mathcal L}{\partial\alpha_1} = 1 - 0.5\,(2\times4\,\alpha_1 + 2\times9\,\alpha_2 - 2\times25\,\alpha_3 - 2\times36\,\alpha_4 + 2\times49\,\alpha_5) = 0$

Example-1D
• We get $\alpha_1 = 0$, $\alpha_2 = 2.5$, $\alpha_3 = 0$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$ (note $2.5 + 4.833 - 7.333 = 0$, so $\sum_i y_i\alpha_i = 0$)
• Note that $\alpha_i < C = 100$ for all $i$, so there is no training error
• The support vectors are $\{x_2 = 2,\ x_4 = 5,\ x_5 = 6\}$, with $\alpha_2 = 2.5$, $\alpha_4 = 7.333$, $\alpha_5 = 4.833$
Example-1D (support vectors $x_2$, $x_4$, $x_5$):

$w = \alpha_2 y_2 \varphi(x_2) + \alpha_4 y_4 \varphi(x_4) + \alpha_5 y_5 \varphi(x_5) = [\,+0.663,\ -3.77,\ 0\,], \qquad h(\Phi(x)) = 0.663\,z_1 - 3.77\,z_2 + 9$

$\rho_h = \frac{1}{\|w\|} = 0.261 = \frac{1}{\sqrt{\alpha_2 + \alpha_4 + \alpha_5}}$

[Figure: the mapped points in the $(z_1, z_2)$ feature coordinates with the separating hyperplane and its normal vector $w$.]
Example-1D: $K(x, x_i) = (x\,x_i + 1)^2$, support vectors $x_2$, $x_4$, $x_5$.

• Testing, for example: $h(x) = \sum_{x_i \in SV} \alpha_i y_i K(x, x_i) + b$

$h(x) = 2.5\,(1)\,(2x+1)^2 + 7.333\,(-1)\,(5x+1)^2 + 4.833\,(1)\,(6x+1)^2 + b = 0.6667\,x^2 - 5.333\,x + b$

• $b$ can also be recovered by solving $h(2) = +1$ or $h(6) = +1$ (as $x_2$ and $x_5$ lie on $h(x) = +1$) or $h(5) = -1$ (as $x_4$ lies on $h(x) = -1$)

• All three give $b = 9$:

$h(x) = 0.6667\,x^2 - 5.333\,x + 9, \qquad h(\Phi(x)) = 0.663\,z_1 - 3.77\,z_2 + 9$

Example-1D:

$h(x) = w^\top\Phi(x) + b = 0:\quad 0.6667\,x^2 - 5.333\,x + 9 = 0$
$h(x) = +1:\quad 0.6667\,x^2 - 5.333\,x + 9 = +1$
$h(x) = -1:\quad 0.6667\,x^2 - 5.333\,x + 9 = -1$

[Figure: along the 1-D input axis, the decision boundary $h(x) = 0$ crosses at $x \approx 2.42$ and $x \approx 5.58$; the region between them is labeled $-1$ and the regions outside are labeled $+1$, consistent with the training labels.]

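The whole 1-D example can be reproduced with an off-the-shelf solver; the sketch below assumes scikit-learn, whose polynomial kernel (γ x·x' + coef0)^degree reduces to (x·x_i + 1)² for γ = 1, coef0 = 1, degree = 2. The reported support vectors and |α| values should match the hand computation above (the ordering of the coefficients may differ).

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=100.0).fit(X, y)
print(X[clf.support_].ravel())              # expected support vectors: 2, 5, 6
print(np.abs(clf.dual_coef_).ravel())       # expected |alpha|: about 2.5, 7.333, 4.833

xs = np.array([[2.42], [3.0], [5.58]])
print(clf.decision_function(xs))            # approx. 0, negative, 0 (boundary near 2.42 and 5.58)
```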
Example-XOR

Input vector, label y:
$x_1 = [-1, -1]$, $y_1 = -1$
$x_2 = [-1, +1]$, $y_2 = +1$
$x_3 = [+1, -1]$, $y_3 = +1$
$x_4 = [+1, +1]$, $y_4 = -1$

With the degree-2 polynomial kernel $k(x_i, x_j) = (x_i \cdot x_j + 1)^2$, the kernel matrix is

K =
[ 9  1  1  1 ]
[ 1  9  1  1 ]
[ 1  1  9  1 ]
[ 1  1  1  9 ]
Example

Note that H is a linear hyperplane set.


Non-linearity is due to kernel function.

RBF kernel
$C = 0.01$ (lower $C$): higher training error and a larger margin

$C = 100$ (higher $C$): lower training error and a smaller margin
RBF kernel

$\gamma = 10 \gg 1$: decreasing radius of influence of the support vectors (no amount of regularization with $C$ will be able to prevent overfitting) — too complex a model

RBF-Kernel
• $\gamma \gg 1$: the radius of the area of influence of the support vectors only includes the support vector itself, and no amount of regularization with $C$ will be able to prevent overfitting.
• $\gamma \ll 1$: the model is too simple and cannot capture the complexity of the target function $c$; the region of influence of any selected support vector would include the whole training set.
• Intermediate values of $\gamma$: good models can be found on a diagonal of $C$ and $\gamma$.

RBF kernel

[Figure: grid of decision boundaries over $C$ (decreasing $C$ increases the margin) and $\gamma$. $\gamma \ll 1$: too simple a model — the region of influence of any support vector includes the whole training set; $\gamma \gg 1$: too complex a model — decreasing influence of the support vectors.]

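The "diagonal of C and γ" observation is usually exploited with a grid search; a sketch assuming scikit-learn, on made-up data with a non-linear concept:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)      # points inside a circle -> class 1

param_grid = {'C': [0.01, 1.0, 100.0], 'gamma': [0.01, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)
print(search.best_params_)   # well-performing (C, gamma) pairs tend to lie on a diagonal
```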
Contents
1. Binary support vector machine
2. Binary SVM: Separable case (Consistent case)
3. Binary SVM: Non-separable case (Inconsistent case)
4. Kernel Methods
5. Multiclass SVM

5- Multiclass SVM

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research, 2, 2002.
Multiclass SVM
• Let $X$ denote the input space and $Y$ denote the output space, and let $\mathcal D$ be an unknown distribution over $X$ according to which input points are drawn. We will distinguish between two cases:
 • the mono-label case, where $Y$ is a finite set of $k$ classes that we mark with numbers $\{1, \dots, k\}$ for convenience — learning: given a dataset $S = ((x_1, y_1), \dots, (x_m, y_m))$;
 • the multi-label case, where $Y = \{-1, +1\}^k$.
• In the mono-label case, each example is labeled with a single class, while in the multi-label case it can be labeled with several. Text documents can be labeled with several different relevant topics, e.g., sports, business, and society. The positive components of a vector in $\{-1, +1\}^k$ indicate the classes associated with an example.

Multi-class SVM, Mono-label case
• In a risk minimization framework:
• each label $l \in \{1, \dots, k\}$ has a different weight vector $w_l$ (and bias $b_l$)

• Learning (training): maximizing the multiclass margin

• Equivalently, minimize the total norm of the weight vectors such that the true label is scored at least 1 more than the second best one
• Training results in $(w_1, b_1), \dots, (w_k, b_k)$
• Testing (inference): select the label with the highest score, $h(x) = \operatorname*{argmax}_{l}\ (w_l \cdot x + b_l)$

Multiclass Margin loss
• Suppose a 5-class task.
• For a pattern $x$ the scores are $s_l = w_l^\top x + b_l$, $l = 1, \dots, 5$.
• The margin loss $\max\big(0,\ 1 - (s_y - \max_{l \ne y} s_l)\big)$ is shown for 3 different possibilities:
 • the true-label score exceeds the runner-up by at least 1: margin loss = 0
 • true-label score 3.1, runner-up 2.8: margin loss = $1 - (3.1 - 2.8) = 0.7$
 • true-label score 2.2, runner-up 2.8: margin loss = $1 - (2.2 - 2.8) = 1.6$
(a short numeric check of these values follows below)

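A tiny numeric check of the three cases above (the scores other than the true-label and runner-up values are invented filler):

```python
import numpy as np

def multiclass_margin_loss(scores, y):
    """Margin loss max(0, 1 - (s_y - max_{l != y} s_l)) for one example."""
    s_true = scores[y]
    s_runner_up = np.max(np.delete(scores, y))
    return max(0.0, 1.0 - (s_true - s_runner_up))

print(multiclass_margin_loss(np.array([0.5, 3.1, 1.2, 0.1, 1.9]), y=1))  # gap >= 1 -> 0.0
print(multiclass_margin_loss(np.array([0.5, 3.1, 1.2, 2.8, 1.9]), y=1))  # 1 - (3.1-2.8) = 0.7
print(multiclass_margin_loss(np.array([0.5, 2.2, 1.2, 2.8, 1.9]), y=1))  # 1 - (2.2-2.8) = 1.6
```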
Linear hard multiclass SVM (no empirical error)

• Recall the hard binary linear SVM:

$\min_{w,b}\ \frac{1}{2}\|w\|_2^2$   (5.7)   (regularizer)
subject to the score constraint $y_i(w \cdot x_i + b) \ge 1$

• Single-task hard multiclass linear SVM:

$\min_{(w_1;b_1),\dots,(w_k;b_k)}\ \frac{1}{2}\sum_{l=1}^k \|w_l\|_2^2$   (regularizer)

subject to: $s_{y_i} - s_l \ge 1 \ \equiv\ \big(1 - (s_{y_i} - s_l)\big) \le 0$ for all $i$ and all $l \ne y_i$ — the score for the true label is higher than the score for any other label by at least 1

Linear soft multiclass SVM
• Recall the soft binary linear SVM:

$\min_{w,b,\boldsymbol\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^m \xi_i^p$
subject to the relaxed score constraints $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and the non-negativity constraints $\xi_i \ge 0$

• Single-task soft multiclass linear SVM (see the sketch after this slide):

$\min_{(w_1;b_1),\dots,(w_k;b_k),\,\boldsymbol\xi}\ \frac{1}{2}\sum_{l=1}^k \|w_l\|_2^2 + C\sum_{i=1}^m \xi_i^p$

subject to: $s_{y_i} - s_l \ge 1 - \xi_i$ for all $i$ and all $l \ne y_i$, and $\xi_i \ge 0$

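This single-task (Crammer–Singer) formulation is available in common libraries; a sketch assuming scikit-learn's LinearSVC, which trains one weight vector per class jointly with the multiclass margin:

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC  # assumed; exposes the Crammer-Singer formulation

X, y = load_iris(return_X_y=True)
clf = LinearSVC(multi_class='crammer_singer', C=1.0, max_iter=10000).fit(X, y)
print(clf.coef_.shape, clf.intercept_.shape)   # (k, N) weight vectors w_l and (k,) biases b_l
print(clf.predict(X[:5]))                      # inference: argmax_l  w_l . x + b_l
```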
Lagrangian of the optimization problem
• To solve the optimization problem we use the Karush–Kuhn–Tucker theorem. We add a dual set of variables, one for each constraint, and get the Lagrangian of the optimization problem.
• Recall the single-task soft binary linear SVM Lagrangian (5.25).

• Single-task soft multiclass linear SVM:

$\mathcal L\big(\{w_l, b_l\}_{l=1}^k, \boldsymbol\xi, \boldsymbol\alpha, \boldsymbol\beta\big) = \frac{1}{2}\sum_{l=1}^k \|w_l\|^2 + C\sum_{i=1}^m \xi_i^p - \sum_{i=1}^m\sum_{l=1}^k \alpha_{i,l}\,\big[s_{y_i} - s_l - 1 + \xi_i\big] - \sum_{i=1}^m \beta_i\,\xi_i$

subject to: $\alpha_{i,l} \ge 0$, $\beta_i \ge 0$

Dual Problem
• Recall the binary dual:

$\mathcal L_{dual}(\boldsymbol\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_i y_j (x_i \cdot x_j)$   (5.32)

subject to: $0 \le \alpha_i \le C \ \wedge\ \sum_{i=1}^m \alpha_i y_i = 0$

• We can rewrite the multiclass dual program in the following vector form:

$\max_{A}\ \mathcal L_{dual} = \sum_{i=1}^m A_i \cdot \mathbf 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^m (A_i \cdot A_j)\,(x_i \cdot x_j)$

subject to: $A_i \le \mathbf 1_{y_i}$ and $A_i \cdot \mathbf 1 = 0$, for $i = 1, \dots, m$

• where $A_i = \mathbf 1_{y_i} - \alpha_i$ and $\alpha_i = [\alpha_{i,1}, \dots, \alpha_{i,k}]$,
• $\mathbf 1_{y_i}$ is the vector whose components are all zero except for the $y_i$-th component, which is equal to 1,
• $\mathbf 1$ is the vector whose components are all 1.

Dual problem (continued):

$\max_{A}\ \mathcal L_{dual} = \sum_{i=1}^m A_i \cdot \mathbf 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^m (A_i \cdot A_j)\,(x_i \cdot x_j)$

subject to: $A_i \le \mathbf 1_{y_i}$ and $A_i \cdot \mathbf 1 = 0$

$\alpha_i = \{\alpha_{i,1}, \alpha_{i,2}, \dots, \alpha_{i,k}\}, \qquad \mathbf 1_{y_i} = [\,0 \dots 0\ 1\ 0 \dots 0\,]$ (1 in position $y_i$)

$A_i = \mathbf 1_{y_i} - \alpha_i = [\,-\alpha_{i,1},\ -\alpha_{i,2},\ \dots,\ 1 - \alpha_{i,y_i},\ \dots,\ -\alpha_{i,k}\,]$

$A_i \cdot \mathbf 1_{y_i} = 1 - \alpha_{i,y_i}, \qquad A_i \cdot \mathbf 1 = 1 - \sum_{l=1}^k \alpha_{i,l} = 0$

Applying a kernel function $K(\cdot,\cdot)$
(Recall the binary case: $h(x) = \operatorname{sgn}\big(\sum_{i=1}^m \alpha_i y_i K(x_i, x) + b\big)$.)

• Replacing the inner products with a kernel function $K(\cdot,\cdot)$ that satisfies Mercer's conditions, the general dual program using kernel functions is therefore:

$\max_{A}\ \mathcal L = \sum_{i=1}^m A_i \cdot \mathbf 1_{y_i} - \frac{C}{2}\sum_{i,j=1}^m (A_i \cdot A_j)\,K(x_i, x_j)$

subject to: $A_i \le \mathbf 1_{y_i}$ and $A_i \cdot \mathbf 1 = 0$

• The classification function becomes:

$h(x) = \operatorname*{argmax}_{l=1,\dots,k}\ \Big\{\sum_{i=1}^m A_{i,l}\,K(x, x_i) + b_l\Big\}$

Support Vectors
• The first sum is over all patterns that belong to class $l$. Hence, an example labeled $l$ is a support pattern only if $\alpha_{i,l} < 1$.
• The second sum is over the rest of the patterns, whose labels are different from $l$. In this case, an example is a support pattern only if $\alpha_{i,l} > 0$.

$w_l = \beta\Big[\sum_{i:\,y_i = l} (1 - \alpha_{i,l})\,\Phi(x_i) + \sum_{i:\,y_i \ne l} (-\alpha_{i,l})\,\Phi(x_i)\Big]$

Probabilistic interpretation of the vector $\alpha_i$
• For each pattern (example) $x_i$ the vector $\alpha_i$ satisfies the constraints

$\alpha_{i,l} \ge 0 \ \wedge\ \sum_{l=1}^k \alpha_{i,l} = 1$

• Each set $\{\alpha_{i,1}, \dots, \alpha_{i,k}\}$ can be viewed as a probability distribution over the labels
• $x_i$ is a support pattern if and only if its corresponding distribution is not concentrated on the correct label $y_i$, that is: $\alpha_{i,y_i} < 1$ and $\alpha_{i,l} > 0$ for some $l \ne y_i$
• Therefore, the classifier is constructed using patterns whose labels are uncertain; the rest of the input patterns are ignored.

Example
• Suppose , , and then

• example does not support Example scaled by


• example supports Example scaled by
• example supports Example scaled by
• example supports Example scaled by
• example supports Example scaled by

Quadratic Programming
• Both the primal and dual problems are simple QPs generalizing those of the standard SVM algorithm.
• However, the size of the solution and the number of constraints for both problems is on the order of $mk$, which, for a large number of classes $k$, can make them difficult to solve.
• However, there exist specific optimization solutions designed for this problem based on a decomposition of the problem into disjoint sets of constraints.

Concluding Remarks
• Generalizes binary SVM algorithm
• If we have only two classes, this reduces to the binary SVM (up to scale)

• Comes with similar generalization guarantees as the binary SVM

• Can be trained using different optimization methods


• Stochastic sub-gradient descent can be generalized

Generalization bound
• In multi-class classification, a kernel-based hypothesis is based on a matrix $W = (w_1, \dots, w_k)^\top$ of $k$ weight vectors (prototypes); vector $w_l$ is the $l$-th row of $W$.
• Each weight vector $w_l$ defines a scoring function $x \mapsto w_l \cdot \Phi(x)$
• A family of kernel-based hypotheses we will consider is

$\mathrm H_K = \big\{(x, y) \in X \times \{1, \dots, k\} \mapsto w_y \cdot \Phi(x)\ :\ \|W\|_2^2 \le \Lambda^2\big\}$

• in which $\|W\|_2^2 = \sum_{l=1}^k \|w_l\|_2^2$

Generalization bound
• Assume that there exists $r > 0$ such that $K(x, x) \le r^2$ for all $x \in X$.
• For any $\delta > 0$, with probability at least $1 - \delta$, for all $h \in \mathrm H_K$:

$R(h) \le \frac{1}{m}\sum_{i=1}^m \xi_i + 4k\,\frac{r\Lambda}{\sqrt m} + \sqrt{\frac{\log(1/\delta)}{2m}}$   (9.12)

$h \in \mathrm H_K = \Big\{(x, y) \mapsto w_y \cdot \Phi(x)\ :\ \sum_{l=1}^k \|w_l\|_2^2 \le \Lambda^2\Big\}$

• where $\xi_i = \max\big(0,\ 1 - \big(w_{y_i} \cdot \Phi(x_i) - \max_{l \ne y_i} w_l \cdot \Phi(x_i)\big)\big)$ for all $i \in [m]$

Appendix
SVM solvers
SVM solvers, Exact SVM solvers
• LIBSVM
• LIBLINEAR
• liquidSVM
• Pegasos
• LASVM
• SVMLight

SVM solvers, Hierarchical solvers
• ThunderSVM
• cuML SVM
• LPSVM

SVM solvers, Approximate SVM solvers
• DC-SVM
• EnsembleSVM
• BudgetedSVM

SVM solvers run on GPU
• GTSVM
• OHD-SVM

SVM solvers, Multiclass
• Crammer-Singer SVM
• MSVMpack
• BSVM
• LaRank
• GaLa

