L5 SVMs

The document provides an overview of Support Vector Machines (SVMs), a classification method developed by V. Vapnik in the 1970s, which defines hyperplanes to separate classes of data. It discusses linear and non-linear SVMs, the concept of soft margin SVMs for handling non-separable data, and the use of kernel functions for non-linear classification. Additionally, it covers the optimization problems involved in maximizing the margin and determining the classification using support vectors.

Support Vector Machines

(SVMs)

Lương Thái Lê
Outline
1. Introduction
2. Separating hyperplane
3. Linear SVM with separable data
4. Linear SVM with non-separable data – Soft-margin SVM
5. Non-linear SVM
   • Kernel functions
6. Conclusion
SVMs – Introduction (1)
• SVMs were proposed by V. Vapnik and colleagues in the 1970s in Russia
• An SVM is a linear classifier that defines a hyperplane separating two classes of
  data (a binary classifier); SVMs have since been extended to multi-class classification
• Kernel functions, also called transformation functions, are used for non-linear
  classification
• SVM is a good (suitable) method for classification problems with large attribute
  (feature) representation spaces
• SVM is known as one of the best classification methods for text classification
  problems
SVMs – Introduction (2)
• Represent the set of r training examples as:
  {(x1, y1), (x2, y2), …, (xr, yr)}
  • xi ∈ R^n is the input vector representing the i-th training example
  • yi is the output label, yi ∈ {1, −1} = {positive class, negative class}
• SVMs define a linear decision (discriminant) function:
  f(x) = <w·x> + b
  • w is the weight vector corresponding to the attributes
  • b is a real number (the bias)
• For each xi: xi is assigned to the positive class if f(xi) ≥ 0 and to the negative
  class otherwise (a small sketch of this rule follows below)
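As an illustration of the decision function just defined, here is a minimal Python sketch (not part of the original lecture). The weight vector w, the bias b and the example x_new are arbitrary placeholder values, not learned parameters.

```python
import numpy as np

# Hypothetical weight vector and bias; in practice these are learned by the SVM
w = np.array([0.5, -1.0])
b = 0.25

def f(x):
    """Linear decision function f(x) = <w.x> + b."""
    return np.dot(w, x) + b

# Sign rule: positive class if f(x) >= 0, negative class otherwise
x_new = np.array([2.0, 1.0])
y_pred = 1 if f(x_new) >= 0 else -1
print(f(x_new), y_pred)
```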
Separating hyperplane
• The separating hyperplane separates the positive-class training examples from the
  negative-class training examples:
  <w·x> + b = 0
• There are many possible separating hyperplanes. Which one should be chosen?
Hyperplane with maximum margin
• SVM selects the separating hyperplane with the largest margin
• Machine learning theory shows that the maximum-margin separating hyperplane
  minimizes the classification error
Linearly separable data (1)
• Assume that the data set is linearly separable
Linear SVM with separable data
• Consider the positive example (x+, +1) and the negative example (x−, −1) that are
  nearest to the hyperplane H0: <w·x> + b = 0
• Define two margin hyperplanes parallel to H0:
  • H+ passes through x+ and is parallel to H0
  • H− passes through x− and is parallel to H0
  H+: <w·x+> + b = 1
  H−: <w·x−> + b = −1
so that:
  <w·xi> + b ≥ 1  if yi = 1
  <w·xi> + b ≤ −1 if yi = −1
Calculate Margin Level (1)
• The margin is the distance between the two margin hyperplanes H+ and H−
  • d+ is the distance between H+ and H0
  • d− is the distance between H− and H0
  • (d+ + d−) is the margin
• The distance from a point xi to the hyperplane H0 is:
  |<w·xi> + b| / ||w||
  where ||w|| = √(w1² + w2² + … + wn²) (the length of w)


Calculate Margin Level (2)
• Find d+: the distance from x+ to <w·x> + b = 0:
  d+ = |<w·x+> + b| / ||w|| = 1 / ||w||
• Find d−: the distance from x− to <w·x> + b = 0:
  d− = |<w·x−> + b| / ||w|| = 1 / ||w||
• Find the margin:
  margin = d+ + d− = 2 / ||w||
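These distance and margin formulas are easy to check numerically. The sketch below uses hypothetical values of w and b, so the printed numbers are only illustrative.

```python
import numpy as np

# Hypothetical hyperplane parameters; normally w and b come from training
w = np.array([2.0, 1.0])
b = -3.0

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane <w.x> + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Margin between the hyperplanes <w.x> + b = +1 and <w.x> + b = -1
margin = 2.0 / np.linalg.norm(w)
print(distance_to_hyperplane(np.array([1.0, 2.0]), w, b), margin)
```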
SVM learning – Maximize Margin
Definition (Linear SVM, the separable case):
• Given a set of r linearly separable training examples:
  D = {(x1, y1), (x2, y2), …, (xr, yr)}
• SVM learns a classifier that maximizes the margin
• This is equivalent to solving the following optimization problem:
  • Find w and b such that margin = 2 / ||w|| is maximized
  • subject to the conditions:
    <w·xi> + b ≥ 1  if yi = 1
    <w·xi> + b ≤ −1 if yi = −1
Maximize Margin – Optimization problem
• SVM learning is equivalent to:
  • Minimize: <w·w> / 2
  • subject to:
    <w·xi> + b ≥ 1  if yi = 1
    <w·xi> + b ≤ −1 if yi = −1
• … which is equivalent to:
  • Minimize: <w·w> / 2
  • subject to: yi(<w·xi> + b) ≥ 1, ∀i = 1, …, r (a code sketch follows below)
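For readers who want to see this primal problem in code, here is a hedged sketch that passes the formulation above to the cvxpy modelling library. The toy data set is made up and linearly separable by construction; it is not from the lecture.

```python
import numpy as np
import cvxpy as cp

# Tiny, made-up, linearly separable data set: rows of X are examples, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# Primal hard-margin SVM: minimize <w.w>/2 subject to yi(<w.xi> + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, " b* =", b.value)
print("margin =", 2 / np.linalg.norm(w.value))
```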


Constrained optimization theory (1)
• Minimization problem with an equality constraint:
  Minimize f(x) subject to g(x) = 0
• Necessary condition for x0 to be a solution:
  ∂/∂x [f(x) + α g(x)] |_(x = x0) = 0  and  g(x0) = 0,
  where α is a Lagrange multiplier
• In the case of multiple constraints gi(x) = 0:
  ∂/∂x [f(x) + Σ_(i=1..r) αi gi(x)] |_(x = x0) = 0  and  gi(x0) = 0,
  where the αi are Lagrange multipliers
Constrained optimization theory (2)
• Minimization problem with inequality constraints:
  Minimize f(x) subject to gi(x) ≤ 0
• Necessary condition for x0 to be a solution:
  ∂/∂x [f(x) + Σ_(i=1..r) αi gi(x)] |_(x = x0) = 0  and  gi(x0) ≤ 0,
  where αi ≥ 0 are Lagrange multipliers
• The Lagrange function:
  L = f(x) + Σ_(i=1..r) αi gi(x)
• Applied to the SVM problem:
  Lp(w, b, α) = <w·w> / 2 − Σ_(i=1..r) αi [yi(<w·xi> + b) − 1],
  where αi ≥ 0 are the Lagrange multipliers
Solve the constrained minimization problem (1)
• Optimization theory: an optimal solution of the problem above must satisfy the
  Karush-Kuhn-Tucker (KKT) conditions, which are necessary (but not sufficient)
  conditions:
  (1) ∂Lp/∂w = w − Σ_(i=1..r) αi yi xi = 0
  (2) ∂Lp/∂b = −Σ_(i=1..r) αi yi = 0
  (3) yi(<w·xi> + b) − 1 ≥ 0, ∀i = 1, …, r
  (4) αi ≥ 0, ∀i
  (5) αi [yi(<w·xi> + b) − 1] = 0, ∀i
• (5) shows that only the xi lying on the margin hyperplanes (H+, H−) can have αi > 0
  (because yi(<w·xi> + b) = 1 in that case) => those xi are called support vectors
• αi = 0 for all other xi
Solve the constrained minimization problem (2)
• For the minimization problem under consideration, which has a convex objective
  function and linear constraints, the KKT conditions are both necessary and
  sufficient for an optimal solution.
• Inequality constraints are difficult to handle directly, so the Lagrangian method
  is used to derive a dual formulation of the optimization problem.
=> The solution is found by minimizing Lp or, equivalently, maximizing LD


The dual optimization problem
Maximize:
  LD(α) = Σ_(i=1..r) αi − (1/2) Σ_(i=1..r) Σ_(j=1..r) αi αj yi yj <xi·xj>
Conditions:
  Σ_(i=1..r) αi yi = 0,  αi ≥ 0, ∀i = 1, …, r
• For a convex objective function and linear constraints, the maximum of LD occurs
  at the same values of w, b and αi that give the minimum of Lp
• Solving this dual problem yields the αi (beyond the scope of this lecture), which
  in turn give w* and b* (a code sketch follows below)
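The dual can also be written out directly in code. The sketch below is an illustrative assumption for the linear kernel only, where the double sum Σij αi αj yi yj <xi·xj> collapses to ||Σi αi yi xi||²; the toy data is the same made-up set as before.

```python
import numpy as np
import cvxpy as cp

# Same made-up, linearly separable toy data as in the primal sketch
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
r = len(y)

alpha = cp.Variable(r)

# L_D = sum_i alpha_i - 1/2 * || sum_i alpha_i y_i x_i ||^2   (linear kernel only)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

# Recover w* and b* from the support vectors (the examples with alpha_i > 0)
a = alpha.value
w_star = X.T @ (a * y)
sv = np.where(a > 1e-6)[0]
b_star = y[sv[0]] - X[sv[0]] @ w_star
print(a.round(4), w_star, b_star)
```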
Find w* and b*

• Let SV be the set of support vectors
  • SV is a subset of the r initial training examples
  • αi > 0 for the support vectors xi
  • αi = 0 for the vectors xi that are not support vectors
• Using KKT condition (1), we have:
  w* = Σ_(i=1..r) αi yi xi = Σ_(xi ∈ SV) αi yi xi
• Using KKT condition (5) with any support vector xk, we have:
  yk(<w*·xk> + b*) − 1 = 0
  ⇔ yk(<w*·xk> + b*) = 1
  ⇔ b* = yk − <w*·xk>   (because yk² = 1)
Determining the classification
• Classification is determined by the hyperplane:
  f(x) = <w*·x> + b* = Σ_(xi ∈ SV) αi yi <xi·x> + b* = 0   (I)
• For a new example z, we compute:
  sign(<w*·z> + b*) = sign(Σ_(xi ∈ SV) αi yi <xi·z> + b*)   (II)
  If (II) returns 1, example z is classified into the positive class; otherwise it is
  classified into the negative class
• Remarks:
  • The classification depends only on the support vectors
  • Only the value of the dot product of two vectors is needed (the vectors themselves
    are not required)
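In practice these quantities are usually obtained from a library rather than by hand. The hedged sketch below uses scikit-learn's SVC with a large C to approximate the hard-margin case, reads off the support vectors and the products αi·yi, rebuilds w* and b*, and classifies a new example z with the sign rule (II). The data and z are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

# kernel='linear' keeps <xi.x> as the ordinary dot product; large C ~ hard margin
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only
alpha_times_y = clf.dual_coef_[0]
support_vectors = clf.support_vectors_

# w* = sum over support vectors of alpha_i * y_i * x_i ;  b* = intercept_
w_star = alpha_times_y @ support_vectors
b_star = clf.intercept_[0]

# Classify a new example z with sign(<w*.z> + b*); compare with the library's prediction
z = np.array([1.0, 0.5])
print(int(np.sign(w_star @ z + b_star)), clf.predict([z])[0])
```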
Linear SVM with non-separable data
• The case of linearly separable data is the ideal case (it rarely happens in practice)
• The data set may contain noise and errors (e.g. some examples are mislabeled)
• If the data set contains noise, the separability conditions may not be satisfiable:
  no w* and b* can be found that satisfy yi(<w·xi> + b) ≥ 1 for all i
• To work with noisy data, the margin constraints are relaxed by using slack
  variables ξi
• The new conditions:
  yi(<w·xi> + b) ≥ 1 − ξi, ∀i = 1, …, r
  ξi ≥ 0
=> This gives a new optimization problem: the soft-margin SVM
Soft Margin SVM
Minimize: <w·w> / 2 + C Σ_(i=1..r) ξi   (C is called the error cost)
Conditions:
  yi(<w·xi> + b) ≥ 1 − ξi, ∀i = 1, …, r
  ξi ≥ 0
• The Lagrangian of this problem:
  Lp = <w·w> / 2 + C Σ_(i=1..r) ξi − Σ_(i=1..r) αi [yi(<w·xi> + b) − 1 + ξi] − Σ_(i=1..r) μi ξi
  where αi ≥ 0 and μi ≥ 0 are the Lagrange multipliers
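A soft-margin version of the earlier primal sketch only needs the slack variables ξi and the C·Σξi term in the objective. The data below is made up and deliberately contains one margin-violating positive example; C = 1.0 is an arbitrary choice.

```python
import numpy as np
import cvxpy as cp

# Made-up noisy data: the last example is a positive point sitting among the negatives
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.5, -1.0], [-2.0, -2.0], [-1.0, -1.2]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
r, n = X.shape
C = 1.0  # error cost: larger C penalizes margin violations more heavily

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(r)  # slack variables

# Minimize <w.w>/2 + C*sum(xi)  s.t.  yi(<w.xi> + b) >= 1 - xi_i  and  xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value, xi.value.round(3))
```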


The dual optimization problem
Maximize:
  LD(α) = Σ_(i=1..r) αi − (1/2) Σ_(i=1..r) Σ_(j=1..r) αi αj yi yj <xi·xj>
Conditions:
  Σ_(i=1..r) αi yi = 0,  0 ≤ αi ≤ C, ∀i = 1, …, r
=> The objective function is the same as in the separable linear case
=> The only difference is the additional constraint αi ≤ C
Important characteristics arising from sparsity
• The solution is determined by very few (sparse) non-zero αi values
• Many training examples lie outside the margin area and have αi = 0
• Examples lying exactly on the margin, yi(<w·xi> + b) = 1 (the support vectors),
  have αi ≠ 0 (0 < αi < C)
• Examples that fall inside the margin, yi(<w·xi> + b) < 1 (noisy/erroneous examples),
  have αi ≠ 0 (αi = C)
• Without this sparsity property, the SVM method could not be effective for large
  data sets.
The Classifier hyperplane
• Classification is determined by the hyperplane:
  f(x) = <w*·x> + b* = Σ_(xi ∈ SV) αi yi <xi·x> + b* = 0
  Many xi have αi = 0 (this is the sparsity characteristic of SVMs)
• The class of a new example z is determined by:
  sign(<w*·z> + b*)
• Finding w* and b* requires choosing C (in the objective function)
  • A validation set is often used for this (see the sketch below)
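One common way to choose C is a grid search. The sketch below uses k-fold cross-validation, a close relative of the single validation set mentioned above, via scikit-learn's GridSearchCV; the synthetic data set and the candidate C values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Try several error costs C and keep the one with the best cross-validated accuracy
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```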
Linear SVM - Conclusions
• Classification is based on the separating hyperplane
• The separating hyperplane is defined by the set of support vectors
  • Only the support vectors have a non-zero Lagrange multiplier
  • All other training examples (which are not support vectors) have a Lagrange
    multiplier equal to 0
• Determining the support vectors (among the training examples) requires solving
  a quadratic optimization problem
• In the dual expression (LD) and in the separating hyperplane expression, the
  training examples appear only inside inner (dot) products of vectors.
Non-Linear SVM
• The SVM formulas above require the data set to be linearly separable
  (with or without noise)
• In many practical problems, the data set cannot be linearly separated
• The non-linear SVM method:
  • Phase 1: transform the original input representation space into another space
    (usually of much higher dimension), so that the data represented in the new
    space is linearly separable
  • Phase 2: re-apply the formulas and steps of the linear SVM method in the new space
Transforming the representation space
• The idea is to map (transform) the data representation from the original space X
  to another space F by applying a non-linear mapping function φ
• In the transformed space, the original set of training examples
  (x1, y1), (x2, y2), …, (xr, yr) is represented as:
  (φ(x1), y1), (φ(x2), y2), …, (φ(xr), yr)
• Example: (x1, x2) → (x1², x2², x1x2)
  x = (2, 3), y = −1  ⇒  (φ(x) = (4, 9, 6), y = −1)
• Problem: the dimension of the transformed space can be very large
  => Use kernel functions
Kernel functions
• Consider the classification hyperplane in the transformed space:
  f(z) = <w*·φ(z)> + b* = Σ_(i=1..r) αi yi <φ(xi)·φ(z)> + b*
  • There is no need to compute φ(xi) and φ(z) explicitly
  • Only the inner product <φ(xi)·φ(z)> is needed
• Use a kernel function K:
  K(x, z) = <φ(x)·φ(z)>
  => We do not even need to know what φ is
Kernel functions - Example
• Polynomial kernel function:
  K(x, z) = <x·z>^d
• Consider d = 2, x = (x1, x2) and z = (z1, z2):
  <x·z>² = (x1z1 + x2z2)² = x1²z1² + 2 x1z1 x2z2 + x2²z2²
         = <(x1², x2², √2·x1x2) · (z1², z2², √2·z1z2)> = <φ(x)·φ(z)>,
  with φ(x) = (x1², x2², √2·x1x2)
• This example shows that the kernel function <x·z>² is the dot product of the two
  vectors φ(x) and φ(z) in the transformed space
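The identity above is easy to verify numerically. The sketch below compares K(x, z) = <x·z>² computed in the original space with the explicit dot product <φ(x)·φ(z)>, using the feature map φ(x) = (x1², x2², √2·x1x2) and two arbitrary points.

```python
import numpy as np

def phi(v):
    """Explicit feature map matching the degree-2 polynomial kernel <x.z>^2."""
    v1, v2 = v
    return np.array([v1**2, v2**2, np.sqrt(2) * v1 * v2])

def poly2_kernel(x, z):
    """K(x, z) = <x.z>^2 computed directly in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Both routes give the same value: the kernel trick never builds phi explicitly
print(poly2_kernel(x, z), np.dot(phi(x), phi(z)))
```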
Commonly used kernel functions
• Polynomial:
  K(x, z) = (<x·z> + θ)^d,  θ ∈ R, d ∈ N
• Gaussian radial basis function (RBF):
  K(x, z) = exp(−||x − z||² / (2σ²)),  σ > 0
• Sigmoidal:
  K(x, z) = tanh(β<x·z> − γ),  β, γ ∈ R
  (a logistic variant 1 / (1 + e^−(β<x·z> − γ)) is also sometimes used)
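In libraries such as scikit-learn these kernels are chosen by name. The sketch below fits an RBF and a polynomial SVM on a standard non-linearly-separable toy data set; note that scikit-learn writes the RBF kernel as exp(−γ||x − z||²), so its gamma corresponds to 1/(2σ²) in the formula above. The parameter values are arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable in the original space
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# The kernel replaces the dot product; no explicit feature map is ever computed
rbf_clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
poly_clf = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)

print(rbf_clf.score(X, y), poly_clf.score(X, y))
```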
SVMs - Conclusions
• SVM works only with an input space of real numbers
  • Nominal attributes must first be converted into numeric values
• SVM works only with 2-class classification
  • Multi-class classification problems must be converted into a set of 2-class
    problems, each of which is then solved separately (see the sketch below)
  • E.g.: One-vs-All, One-vs-One
• The separating hyperplane is difficult to interpret, especially when a kernel
  function is used:
  • SVM is therefore often used in applications where explaining the behaviour
    (decisions) of the system to the user is not an important requirement.
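As an illustration of these reductions, the hedged sketch below wraps a binary linear SVM into a One-vs-All classifier for a three-class data set; scikit-learn's SVC also accepts multi-class labels directly by training One-vs-One pairs internally. The data set is only an example.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class data: a binary SVM needs a multi-class reduction to handle it
X, y = load_iris(return_X_y=True)

# One-vs-All: one binary SVM per class
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# SVC on its own trains One-vs-One pairs of binary classifiers internally
ovo = SVC(kernel="linear").fit(X, y)

print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```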
Q&A
