L5: Support Vector Machines (SVMs)
Lương Thái Lê
Outline
1. Introduction
2. Separating hyperplane
3. Linear SVM with separable data
4. Linear SVM with non-separable data – Soft margin SVM
5. Non-linear SVM
• Kernel function
6. Conclusion
SVMs – Introduction (1)
• SVMs were proposed by V. Vapnik and colleagues in the 1970s in Russia
• SVM is a linear classifier whose purpose is to define a hyperplane that
separates two classes of data (a binary classifier). SVMs were later extended
to multi-class classification
• Kernel functions, also called transformation functions, are used for
non-linear classification cases
• SVM is a good (suitable) method for classification problems with large
attribute representation spaces
• SVM is known as one of the best classification methods for text
classification problems
SVMs – Introduction (2)
• Represent the set of r training examples:
{(x1, y1), (x2, y2), …, (xr, yr)}
• xi ∈ R^n is the input vector representing the i-th training example
• yi is the output label, yi ∈ {1, −1} = {positive class, negative class}
• SVM defines a linear discriminant (decision) function:
f(x) = <w.x> + b
• w is the weight vector corresponding to the attributes
• b is a real number (the bias)
• For each xi: xi is assigned to the positive class (yi = 1) if <w.xi> + b ≥ 0, and to the negative class (yi = −1) otherwise (a minimal sketch of this rule follows)
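A minimal NumPy sketch of this decision rule; the weight vector, bias and input below are made-up values used only for illustration, not taken from the lecture.

```python
import numpy as np

def decision_function(w, b, x):
    """f(x) = <w.x> + b for a single input vector x."""
    return np.dot(w, x) + b

def predict(w, b, x):
    """Positive class (+1) if f(x) >= 0, otherwise negative class (-1)."""
    return 1 if decision_function(w, b, x) >= 0 else -1

# Made-up weight vector, bias and input, purely for illustration
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])

print(decision_function(w, b, x))  # 0.25
print(predict(w, b, x))            # 1
```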
Outline
1. Introduction
2. Separating hyperplane
3. Linear SVM with separable data
4. Linear SVM with non-separable data – Soft margin SVM
5. Non-linear SVM
• Kernel function
6. Conclusion
Separating hyperplane
• The hyperplane separates the positive class training examples and the
negative class training examples:
<w.x> + b = 0
• There are many possible separating hyperplanes. Which one should be chosen?
Hyperplane with maximum margin
• SVM selects the separating hyperplane with the largest margin
• Machine learning theory shows that the maximum-margin separating
hyperplane minimizes the classification error
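A short sketch, assuming scikit-learn is available, that fits a linear SVM on made-up separable 2-D data and reports the resulting margin 2/||w||; the data and the very large C value are illustrative choices only.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D training data
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],         # positive class
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # negative class
y = np.array([1, 1, 1, -1, -1, -1])

# A very large error cost C approximates the hard-margin setting
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # learned weight vector
b = clf.intercept_[0]   # learned bias
print("w =", w, " b =", b)
print("margin =", 2.0 / np.linalg.norm(w))   # margin level (d+ + d-) = 2 / ||w||
```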
Outline
1. Introduction
2. Separating hyperplane
3. Linear SVM with separable data
4. Linear SVM with non-separable data – Soft margin SVM
5. Non-linear SVM
• Kernel function
6. Conclusion
Linearly separable data (1)
• Assume that the data set is linearly separable
Linear SVM with separable data
• Consider a positive example (x+, +1) and a negative example (x−, −1) that are
nearest to the hyperplane H0: <w.x> + b = 0
• Define two margin hyperplanes parallel to H0:
• H+ passes through x+ and is parallel to H0
• H− passes through x− and is parallel to H0
H+: <w.x+> + b = 1
H−: <w.x−> + b = −1
so that:
<w.xi> + b ≥ 1, if yi = 1
<w.xi> + b ≤ −1, if yi = −1
Calculate Margin Level (1)
• Margin level is the distance between two margin hyperplanes H+ and
H- .
• d+ is the distance between H+ and H0
• d- is the distance between H- and H0
• (d+ + d-) is margin level
• The distance from a point xi to the hyperplane H0 is:
|<w.xi> + b| / ||w||        (2)
• Since <w.x+> + b = 1 and <w.x−> + b = −1, we have d+ = d− = 1/||w||, so the margin level is:
(d+ + d−) = 2/||w||        (3)
• Maximizing the margin is therefore equivalent to:
Minimize: <w.w> / 2
Subject to: yi(<w.xi> + b) ≥ 1, ∀i = 1, …, r        (4)
Solve the constrained minimization problem (1)
• Introducing Lagrange multipliers αi ≥ 0, the Lagrangian of this primal problem is:
LP = <w.w>/2 − Σi=1..r αi [yi(<w.xi> + b) − 1]
• At the optimum, the KKT complementary condition holds:
αi [yi(<w.xi> + b) − 1] = 0, ∀i = 1, …, r        (5)
• (5) shows that only the xi lying on the margin hyperplanes (H+, H−) have αi > 0 (because
yi(<w.xi> + b) = 1 in this case) => those xi are called support vectors
• αi = 0 for all other xi
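A brief sketch (scikit-learn assumed, toy data made up) showing that a fitted linear SVM keeps exactly these points as support vectors, and that they lie on the margin hyperplanes.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable toy data in 2-D
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

# Only points lying on the margin hyperplanes are kept as support vectors;
# scikit-learn stores their signed multipliers y_i * alpha_i in dual_coef_.
print(clf.support_)          # indices of the support vectors within X
print(clf.dual_coef_)        # y_i * alpha_i (alpha_i > 0) for each support vector
for x_i in clf.support_vectors_:
    print(np.dot(w, x_i) + b)    # approximately +1 or -1 for support vectors
```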
Solve the constrained minimization problem (2)
• For this minimization problem, with a convex objective function and linear
constraints, the KKT conditions are necessary and sufficient for an optimal
solution.
• It is difficult to deal with the inequality constraints directly, so the
Lagrange method is used to derive the dual form of the optimization
problem:
Maximize: LD = Σi=1..r αi − (1/2) Σi Σj αi αj yi yj <xi.xj>
Conditions: Σi=1..r αi yi = 0; αi ≥ 0, ∀i = 1, …, r
• With a convex objective function and linear constraints, LD attains its maximum at
the same values of w, b and αi at which LP attains its minimum
• Solving the dual problem yields the αi (beyond the scope of this lecture), which are then used to find w* and b*
Find w* and b*
• w* = Σi=1..r αi yi xi (the sum effectively runs over the support vectors only, since αi = 0 for the others)
• b* is obtained from any support vector xk: b* = yk − <w*.xk>
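A sketch (scikit-learn assumed, same made-up toy data as above) that reconstructs w* and b* from the multipliers returned by the solver and checks them against the model's own coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Same made-up toy data as before
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

signed_alpha = clf.dual_coef_[0]      # y_i * alpha_i for the support vectors
sv = clf.support_vectors_

# w* = sum_i alpha_i * y_i * x_i  (the sum runs over the support vectors only)
w_star = np.sum(signed_alpha[:, None] * sv, axis=0)

# b* from any support vector x_k:  b* = y_k - <w*.x_k>
k = clf.support_[0]
b_star = y[k] - np.dot(w_star, X[k])

print(w_star, clf.coef_[0])          # should match the solver's weight vector
print(b_star, clf.intercept_[0])     # and its bias
```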
Linear SVM with non-separable data
• The case of linearly separable data is ideal (it rarely happens in practice)
• The data set may contain noise or errors (e.g. some examples are
mislabeled)
• If the data set contains noise, the separability conditions may not be satisfied:
• We cannot find w* and b* satisfying yi(<w.xi> + b) ≥ 1 for all i
• To work with noisy data, it is necessary to relax the margin constraints by
using slack variables ϵi
The new conditions:
yi(<w.xi> + b) ≥ 1 − ϵi, ∀i = 1, …, r
ϵi ≥ 0
=> We obtain a new optimization problem – the soft margin SVM
Soft Margin SVM
Minimize: <w.w>/2 + C Σi=1..r ϵi   (C is called the error cost)
Conditions: yi(<w.xi> + b) ≥ 1 − ϵi, ∀i = 1, …, r
            ϵi ≥ 0, ∀i = 1, …, r
• The Lagrangian of this primal problem:
LP = <w.w>/2 + C Σi=1..r ϵi − Σi=1..r αi [yi(<w.xi> + b) − 1 + ϵi] − Σi=1..r μi ϵi
Conditions: αi ≥ 0, μi ≥ 0, ∀i = 1, …, r
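A brief sketch, assuming scikit-learn and made-up noisy data, illustrating how the error cost C trades margin width against the penalty on margin violations; the specific C values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up noisy data: two overlapping blobs, so some points violate the margin
X = np.vstack([rng.normal(loc=1.0, scale=1.2, size=(50, 2)),
               rng.normal(loc=-1.0, scale=1.2, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# C trades off a wide margin against the penalty C * sum(epsilon_i)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C = {C:>6}: margin = {margin:.3f}, "
          f"support vectors = {len(clf.support_)}")
```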
Non-Linear SVM
• The formulas in the SVM method above require the data set to be linearly
separable (with or without noise)
• In many practical problems, data sets are not linearly separable
• Non-linear SVM method (a sketch of both phases follows this list):
• Phase 1: transform the original input space into another space
(usually of much higher dimension), so that the data represented in the new space becomes linearly separable
• Phase 2: re-apply the formulas and steps of the linear SVM classification
method in the new space
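A sketch of the two phases, assuming scikit-learn; the XOR-like data and the helper phi are made up for illustration, and phi uses the example mapping (x1, x2) → (x1^2, x2^2, x1x2) discussed in the next section.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up XOR-like data: the class depends on the sign of x1 * x2,
# so it is not linearly separable in the original 2-D space
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# Phase 1: transform (x1, x2) -> (x1^2, x2^2, x1*x2)
def phi(X):
    return np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])

# Phase 2: apply an ordinary linear SVM in the transformed space
svm_original = SVC(kernel="linear").fit(X, y)
svm_transformed = SVC(kernel="linear").fit(phi(X), y)

print("training accuracy, original space:   ", svm_original.score(X, y))
print("training accuracy, transformed space:", svm_transformed.score(phi(X), y))
```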
Transforming the representation space
• The idea is to map (transform) the data representation from the original space X
to another space F by applying a non-linear mapping function φ
• In the transformed space, the original set of training examples
{(x1, y1), (x2, y2), …, (xr, yr)} is represented as:
{(φ(x1), y1), (φ(x2), y2), …, (φ(xr), yr)}
• Example: φ(x1, x2) = (x1^2, x2^2, x1x2)
x = (2, 3), y = −1 ⇒ (φ(x) = (4, 9, 6), y = −1)
• Problem: the dimension of the transformed space can be very large (see the sketch below)
=> Use kernel functions
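A tiny sketch of why the transformed dimension explodes: the number of degree-d monomials of n input attributes grows combinatorially. The helper feature_dim is illustrative only.

```python
from math import comb

# Number of monomials of degree exactly d in n variables, i.e. the dimension
# of the feature space for the polynomial-style mapping shown above
def feature_dim(n, d):
    return comb(n + d - 1, d)

print(feature_dim(2, 2))        # 3  -> (x1^2, x2^2, x1*x2), as in the example
for n in (10, 100, 1000):
    print(n, feature_dim(n, 2), feature_dim(n, 5))
```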
Kernel functions
• Consider the classification hyperplane in the transformed space:
f(z) = <w*.φ(z)> + b* = Σi=1..r αi yi <φ(xi).φ(z)> + b*
• There is no need to compute φ(xi) and φ(z) explicitly
• Only the inner product <φ(xi).φ(z)> is needed
• Use a kernel function K:
K(x, z) = <φ(x).φ(z)>
=> We do not need to know what φ is
Kernel functions - Example
• Polynomial kernel function:
K(x, z) = <x.z>^d
• Consider d = 2, x = (x1, x2) and z = (z1, z2):
<x.z>^2 = (x1z1 + x2z2)^2 = x1^2 z1^2 + 2 x1x2 z1z2 + x2^2 z2^2
        = <(x1^2, x2^2, √2·x1x2) . (z1^2, z2^2, √2·z1z2)> = <φ(x).φ(z)>, with φ(x) = (x1^2, x2^2, √2·x1x2)
• This example shows that the kernel function <x.z>^2 is the dot product of the two
vectors φ(x) and φ(z) in the transformed space
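A small NumPy check of this identity; the helper names poly2_kernel and phi are made up for illustration, with φ(x) = (x1^2, x2^2, √2·x1x2) as derived above.

```python
import numpy as np

def poly2_kernel(x, z):
    """Polynomial kernel of degree 2: K(x, z) = <x.z>^2."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map with <phi(x).phi(z)> = <x.z>^2 for 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([2.0, 3.0])
z = np.array([1.0, -1.0])

print(poly2_kernel(x, z))         # (2*1 + 3*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))     # same value, computed via the explicit mapping
```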
Commonly used kernel functions
• Polynomial:
K(x, z) = (<x.z> + θ)^d,  θ ∈ R, d ∈ N
• Gaussian radial basis function (RBF):
K(x, z) = exp(−||x − z||^2 / (2σ^2)),  σ > 0
• Sigmoidal:
K(x, z) = tanh(β<x.z> − γ) = 2 / (1 + e^(−2(β<x.z> − γ))) − 1;  β, γ ∈ R
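A sketch of the RBF kernel and of selecting these kernels by name in scikit-learn; the toy data is made up, and note that in scikit-learn's parameterization gamma plays the role of 1/(2σ^2) and the polynomial kernel is (gamma·<x.z> + coef0)^degree.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2.0 * sigma ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([2.0, 0.0])))

# In scikit-learn these kernels are selected by name; for the RBF kernel
# the gamma parameter corresponds to 1 / (2 * sigma^2)
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])
clf_poly = SVC(kernel="poly", degree=2, coef0=1.0).fit(X, y)  # (gamma*<x.z> + coef0)^degree
clf_rbf = SVC(kernel="rbf", gamma=0.5).fit(X, y)
clf_sig = SVC(kernel="sigmoid").fit(X, y)                     # tanh-based kernel
print(clf_rbf.predict([[0.2, 0.9]]))
```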
SVMs - Conclusions
• SVM only works with an input space of real numbers
• For nominal attributes, it is necessary to convert the nominal values into
numeric values
• SVM only handles 2-class classification directly
• For multi-class classification problems, it is necessary to convert them into a set of 2-
class classification problems, and then solve each of these 2-class problems
separately (a sketch follows the conclusions below)
• E.g.: One vs All, One vs One
• The separating hyperplane is difficult to interpret, especially when a kernel
function is used:
• SVM is therefore often used in application problems where explaining the behavior
(decisions) of the system to the user is not an important requirement.
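As noted above, multi-class problems are reduced to several 2-class problems. A sketch, assuming scikit-learn and made-up 3-class data, of the One vs All and One vs One strategies.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(2)

# Made-up 3-class data: three well-separated blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (-3.0, 0.0, 3.0)])
y = np.repeat([0, 1, 2], 30)

# One vs All: one binary SVM per class (3 binary classifiers here)
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One vs One: one binary SVM per pair of classes (3 pairs here);
# SVC itself also applies a one-vs-one decomposition for multi-class input
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ova.estimators_), "one-vs-all classifiers")
print(len(ovo.estimators_), "one-vs-one classifiers")
print(ova.predict([[2.5, 3.2]]), ovo.predict([[2.5, 3.2]]))
```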
Outline
1. Introduction
2. Separating hyperplane
3. Linear SVM with separable data
4. Linear SVM with non-separable data – Soft margin SVM
5. Non-linear SVM
• Kernel function
6. Conclusion
Q&A