
Tutorial on SVM: motivations, formulations and extensions

Support Vector Machines

Chris Ding

Many slides adapted from Andrew Moore and UT Austin

Read: A tutorial on Support Vector Machines, by Chris Burges


Key points:

-- max-margin
-- use f(x) = +1, -1 to set the scaling constant
-- optimization: the dual optimization problem
-- KKT conditions, complementary slackness condition
-- separable (hard-margin) SVM vs non-separable (soft-margin) SVM
-- Kernel trick
-- XOR problem

Three homeworks
Perceptron

• Binary classification can be viewed as the task of separating classes in feature space:

    wTx + b = 0  (decision boundary)
    wTx + b > 0  →  f(x) = +1
    wTx + b < 0  →  f(x) = -1

    f(x) = sign(wTx + b)

• Many separating lines exist. Which of them is better? Which is optimal? The goal is to do better than the perceptron.
• This is a decision boundary problem: a discriminant function determines the decision boundary.
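To make the decision rule concrete, here is a tiny Python sketch (the weight vector w, bias b, and test points are made-up illustrative values, not taken from the slides):

import numpy as np

w = np.array([1.0, -2.0])   # illustrative weight vector
b = 0.5                     # illustrative bias

def f(x):
    # Linear decision rule: +1 on one side of the hyperplane w^T x + b = 0, -1 on the other
    return 1 if w @ x + b > 0 else -1

print(f(np.array([3.0, 1.0])), f(np.array([0.0, 2.0])))   # prints: 1 -1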
Classification Margin (1992)

• Distance from example xi to the separator is r = |wTxi + b| / ||w||
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the distance between the two lines f(x) = +1 and f(x) = -1.
Maximum Margin Classification
• Maximizing the margin is good according to intuition and
PAC theory.
• Implies that only support vectors matter; other training examples are ignorable.
Linear SVM Mathematically
• Let training set {(xi, yi)}i=1..n, xi ∈ Rd, yi ∈ {-1, 1} be separated by
a hyperplane with margin ρ. Then for each training example (xi, yi):

    wTxi + b ≤ -ρ/2 if yi = -1
    wTxi + b ≥  ρ/2 if yi = +1        ⇔        yi(wTxi + b) ≥ ρ/2

• For every support vector xs the above inequality is an equality.
After rescaling w and b by ρ/2 in the equality, we obtain that the
distance between each xs and the hyperplane is

    r = ys(wTxs + b) / ||w|| = 1 / ||w||

• The margin can then be expressed through the (rescaled) w and b as:

    ρ = 2r = 2 / ||w||
Set the derivative w.r.t. w equal to 0; set the derivative w.r.t. b equal to 0.
KKT complementary slackness condition
Linear SVMs Mathematically (cont.)

• The quadratic optimization problem:


Find w and b such that
    ρ = 2/||w|| is maximized
and for all (xi, yi), i=1..n : yi(wTxi + b) ≥ 1

Which can be reformulated as:


Find w and b such that
Φ(w) = ||w||2=wTw is minimized
and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1
Primal optimization problem transformed into dual optimization problem
KKT complementary slackness condition
Solving the Optimization Problem

Find w and b such that


Φ(w) =wTw is minimized
and for all (xi, yi), i=1..n : yi (wTxi + b) ≥ 1

• A quadratic objective with linear constraints. This original optimization problem is called the primal optimization problem.
• A Lagrange multiplier αi is associated with every inequality constraint. The final optimization is to solve the dual problem:

Find α1…αn such that


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
The Optimization Problem Solution

• Given a solution α1…αn to the dual problem, the solution to the primal is:

    w = Σαiyixi        b = yk - ΣαiyixiTxk   for any αk > 0

• Each non-zero αi indicates that the corresponding xi is a support vector.
• The classifying function is (note that we don't need w explicitly):

    f(x) = ΣαiyixiTx + b
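As a concrete illustration of solving the dual and then recovering w, b, and f(x), here is a minimal Python sketch. It uses a generic solver (scipy.optimize.minimize with SLSQP) rather than a dedicated SVM/QP package, and a small made-up separable data set; it is a sketch of the formulas above, not a production SVM solver.

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [0.0, 2.0], [1.0, 0.0], [-1.0, 0.0]])  # made-up separable points
y = np.array([1.0, 1.0, -1.0, -1.0])                             # labels in {-1, +1}

H = (y[:, None] * X) @ (y[:, None] * X).T          # H[i, j] = y_i y_j x_i^T x_j

def neg_Q(a):                                       # maximize Q(alpha) = minimize -Q(alpha)
    return -(a.sum() - 0.5 * a @ H @ a)

res = minimize(neg_Q, np.zeros(len(y)),
               bounds=[(0, None)] * len(y),                          # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})   # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                 # w = sum_i alpha_i y_i x_i
k = int(np.argmax(alpha))           # pick any support vector (alpha_k > 0)
b = y[k] - w @ X[k]                 # b = y_k - sum_i alpha_i y_i x_i^T x_k

f = lambda x: w @ x + b             # sign(f(x)) is the predicted class
print(np.round(alpha, 3), np.round(w, 3), round(float(b), 3))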
Soft Margin Classification
(Data has noise, not completely separable)

• What if the training set is not linearly separable?


• Slack variables ξi can be added to allow misclassification
of noisy examples.

Proposed in the 1970s.
Soft Margin Classification Mathematically

• The old formulation:


Find w and b such that
Φ(w) =wTw is minimized
and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1

• Modified formulation incorporates slack variables:

Find w and b such that


Φ(w) =wTw + CΣξi is minimized
and for all (xi, yi), i=1..n : yi(wTxi + b) ≥ 1 – ξi,  ξi ≥ 0

• Parameter C can be viewed as a way to control overfitting: it “trades off” the


relative importance of maximizing the margin and fitting the training data.
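To see the effect of C in practice, here is a minimal sketch using scikit-learn's SVC, which implements a soft-margin SVM of this form; the overlapping Gaussian data and the particular C values are illustrative choices only.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),      # class -1
               rng.normal(2.0, 1.0, (50, 2))])     # class +1, overlapping -> slack needed
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C: wide margin, many (bounded) support vectors; large C: narrower margin, fewer
    print(C, "support vectors:", clf.n_support_.sum())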
Soft Margin Classification – Solution
Invented 1995
• The dual problem is almost identical to the separable case (it would not be
identical if the 2-norm penalty CΣξi² for slack variables were used in the primal
objective; then we would need additional Lagrange multipliers for the slack variables):
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
• Again, xi with non-zero αi will be support vectors.
• Solution to the dual problem is:

    w = Σαiyixi
    b = yk(1 - ξk) - ΣαiyixiTxk   for any k s.t. αk > 0

• Again, we don't need to compute w explicitly for classification:

    f(x) = ΣαiyixiTx + b
Theoretical Justification for Maximum
Margins
• Vapnik has proved the following:
The class of optimal linear separators has VC dimension h bounded from above as

    h ≤ min(D²/ρ², m0) + 1

where ρ is the margin, D is the diameter of the smallest sphere that
encloses all of the training examples, and m0 is the dimensionality.

• Intuitively, this implies that regardless of dimensionality m0 we can


minimize the VC dimension by maximizing the margin ρ.

• Complexity of the classifier is kept small regardless of data dimensionality.


Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped
to some higher-dimensional feature space where the training set is
separable

Φ: x → φ(x)
Non-linear SVMs
• Some datasets are linearly separable in the original 1D space x.
• Other datasets are not linearly separable in x.
• Mapping the data to a higher-dimensional space, e.g. z = (x, x²), can make them separable.
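A small numerical sketch of this 1D example (the point values are made up): no threshold on x separates the middle cluster from the outer points, but after the map z = (x, x²) a threshold on the second coordinate does.

import numpy as np

x_pos = np.array([-0.5, 0.0, 0.5])        # middle cluster, label +1
x_neg = np.array([-2.0, -1.5, 1.5, 2.0])  # outer points, label -1

def phi(x):                                # map each point to z = (x, x^2)
    return np.stack([x, x ** 2], axis=1)

print(phi(x_pos))   # second coordinate <= 0.25 for all +1 points
print(phi(x_neg))   # second coordinate >= 2.25 for all -1 points -> z2 = 1 separates them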
Kernel Trick

• Both in the dual formulation of the problem and in the solution


training points appear only inside inner products:

Find α1…αN such that


Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

    f(x) = ΣαiyixiTx + b

• Thus the explicit coordinates (the mapping function φ) are not required.


Kernel trick
• Example: polynomial kernel

Transform the 2D input space to a 3D feature space:

    x = (x1, x2),   φ(x) = (x1², x2², √2 x1x2)
    z = (z1, z2),   φ(z) = (z1², z2², √2 z1z2)

    φ(x) · φ(z) = (x1², x2², √2 x1x2) · (z1², z2², √2 z1z2)
                = x1²z1² + x2²z2² + 2 x1z1 x2z2
                = (x1z1 + x2z2)²
                = (x · z)² = K(x, z)
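A quick numeric check of this identity (the two 2D vectors are arbitrary):

import numpy as np

def phi(v):                                 # explicit map (v1^2, v2^2, sqrt(2) v1 v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))          # inner product in the 3D feature space
print((x @ z) ** 2)             # kernel K(x, z) = (x . z)^2 -- same value, no explicit map needed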
Kernel trick + QP
• The max margin classifier can be found by solving

    argmax_α  Σj αj - ½ Σj,k αjαk yjyk (φ(xj) · φ(xk))
  = argmax_α  Σj αj - ½ Σj,k αjαk yjyk K(xj, xk)

• The weight vector (no need to compute and store it) is

    w = Σj αj yj φ(xj)

• The decision function is

    h(x) = sign(Σj αj yj (φ(x) · φ(xj)) + b) = sign(Σj αj yj K(x, xj) + b)
Copyright © 2001, 2003, Andrew W. Moore


SVM Kernel Functions
• Use kernel functions which compute

    (zi · zj) = (φ(xi) · φ(xj)) = K(xi, xj)

• The inner-product kernel K(a, b) = a · b is the (simplest) linear kernel
• Polynomial kernel: K(a, b) = (a · b + 1)^d
• Beyond polynomials, other very high dimensional basis functions form practical, useful kernels:
• Radial-basis-style kernel function:

    K(a, b) = exp(-||a - b||² / (2σ²))

• Neural-network-style kernel function:

    K(a, b) = tanh(κ a · b - δ)

  σ, κ and δ are model parameters chosen by CV.
Copyright © 2001, 2003, Andrew W. Moore
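For reference, minimal NumPy sketches of these kernel functions; the parameter names follow the slide (σ → sigma, κ → kappa, δ → delta) and the default values are illustrative only.

import numpy as np

def linear_kernel(a, b):
    return a @ b

def poly_kernel(a, b, d=2):
    return (a @ b + 1) ** d

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def tanh_kernel(a, b, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (a @ b) - delta)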
What Functions are Kernels?
• For some functions K(xi,xj), checking that K(xi,xj) = φ(xi)Tφ(xj), i.e., finding φ(xi), can be difficult.
• Mercer's theorem:
  Every positive semi-definite symmetric function is a kernel
• Positive semi-definite symmetric functions correspond to a positive semi-
definite symmetric Gram matrix:

    K = | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn) |
        | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn) |
        |    …         …         …      …     …     |
        | K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) |
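A minimal sketch of checking Mercer's condition numerically: build the Gram matrix for a sample of points and verify its eigenvalues are (numerically) non-negative. The RBF kernel with σ = 1 and the random sample points are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                      # illustrative sample points

# RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) with sigma = 1
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

eigvals = np.linalg.eigvalsh(K)                   # symmetric matrix -> real eigenvalues
print(eigvals.min() >= -1e-10)                    # True: K is positive semi-definite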
Checkerboard data: a standard linear classifier cannot separate the two classes.
(Colour shades show the f(x) value.)

Solve the XOR problem using kernel SVM

Inputs are {0,1}²; note the relabelling 0 ↔ (-1).
Final output of the SVM: (see figure)
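A minimal sketch of the XOR example using scikit-learn's SVC with the degree-2 polynomial kernel K(a, b) = (a·b + 1)²; relabelling the 0/1 inputs to ±1 follows the slide's note, and the large C value is just one way to approximate the hard-margin case.

import numpy as np
from sklearn.svm import SVC

X01 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                 # XOR labels in {-1, +1}

X = 2 * X01 - 1                              # relabel inputs: 0 -> -1, 1 -> +1

# gamma=1, coef0=1, degree=2 gives the kernel (a . b + 1)^2
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1e6).fit(X, y)
print(clf.predict(X))                        # [-1  1  1 -1]: XOR is separated in feature space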
Examples of Kernel Functions
• Linear: K(xi,xj)= xiTxj
– Mapping Φ: x → φ(x), where φ(x) is x itself

• Polynomial of power p: K(xi,xj) = (1 + xiTxj)^p
  – Mapping Φ: x → φ(x), where φ(x) has (d+p choose p) dimensions

• Gaussian (radial-basis function): K(xi,xj) = exp(-||xi - xj||² / (2σ²))
– Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is
mapped to a function (a Gaussian); combination of functions for support
vectors is the separator.

• Higher-dimensional space still has intrinsic dimensionality d


(the mapping is not onto). Linear separators in it correspond to
non-linear separators in original space.
Non-linear SVMs Mathematically

• Dual problem formulation:


Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

• The solution is: f(x) = ΣαiyiK(xi, x) + b

• Optimization techniques for finding αi’s remain the same!


SVM and other linear classifiers can only separate the space into 2 classes.

How to do multi-class classification (k > 2)?

Use a 2-class classifier to do k-class classification

• One vs others (see the code sketch after this list)
  – Build a classifier for each class against all other classes combined
  – Need to train K such classifiers
  – Use the largest score to determine the final class
• One vs one
  – Train K(K-1)/2 classifiers, each separating one class from another
  – Use majority voting to obtain the final class
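A minimal sketch of both strategies using scikit-learn's generic wrappers around a binary linear SVM; the 3-class blob data is made up.

from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # 3-class toy data

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)      # trains K classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)       # trains K(K-1)/2 classifiers

print(ovr.predict(X[:5]), ovo.predict(X[:5]))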


Multi-label Classification
• Classes are mutually exclusive
– Each handwritten letter belongs to exactly one class
– A student is either a 1st-, 2nd-, 3rd- or 4th-year student, and cannot be both (or more)
– The common case: multi-class exclusive classification
• Classes are mutually non-exclusive
– An article on drug design could also discuss the drug
company’s (and market) economics.
– An image has sky, building, road etc.
– Multi-class inclusive classification (multi-label classification)


One vs Others: more details

• Build a classifier between each class and its


complementary set (docs from all other classes).
• Given a test object, evaluate it for membership in each class.
• Assign the document to the class with:
– maximum score
– maximum confidence
– maximum probability
?

?
?

50
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in late 1990s.
• SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g.
graphs, sequences, relational data) by designing kernel functions for
such data.
• SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis [Schölkopf et
al. ’99], etc.
• Most popular optimization algorithms for SVMs use decomposition to
hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and
[Joachims ’99]
• Tuning SVMs remains a black art: selecting a specific kernel and
parameters is usually done in a try-and-see manner
• The most popular SVM software is LIBSVM from C.-J. Lin
Homework SVM1: SVM1a, SVM1b

Solve the SVM problem for the following data: find w and b.

SVM1a:
  X1: (1, 1),  y = +1
  X2: (-1, 1), y = +1
  X3: (0, -1), y = -1

SVM1b:
  X1: (1, 1, +1)
  X2: (-1, 1, +1)
  X3: (0, -1, -1)
  X4: (0, -2, -1)
Homework SVM2 (Optional. TA will provide some code example):

Generate 200 data points in 2 dimensions, 100 per class.
Make the 2 classes close enough that they are non-separable.
Run an SVM solver on the data. Set C = 0.01 or 0.1.
Find the data points where alpha_i = 0 or alpha_i = C.
Find the data points with ksi_i > 0.

Plot the lines f(x) = 1, 0, -1, adding red circles to the data points where
alpha_i = 0 and squares to the data points where alpha_i = C.

Explain what the ksi_i > 0 data points are.

See example in SVM slides.
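For reference, a minimal sketch of one way to set up this homework with scikit-learn and matplotlib. The data generation, the C value, and the plotting details are illustrative choices; note that SVC exposes yi·alpha_i for the support vectors through dual_coef_, from which alpha_i can be recovered, and the slack ksi_i can be computed as max(0, 1 - yi f(xi)).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(1.5, 1.0, (100, 2))])      # two overlapping classes
y = np.hstack([-np.ones(100), np.ones(100)])

C = 0.1
clf = SVC(kernel="linear", C=C).fit(X, y)

# alpha_i is zero for non-support vectors; |dual_coef_| gives alpha_i for support vectors
alpha = np.zeros(len(y))
alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())

f_vals = clf.decision_function(X)
ksi = np.maximum(0, 1 - y * f_vals)                  # slack variables ksi_i

# contour lines f(x) = -1, 0, +1
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=["--", "-", "--"])
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=15)

at_zero = np.isclose(alpha, 0)                       # alpha_i = 0 (non-support vectors)
at_C = np.isclose(alpha, C)                          # alpha_i = C (bounded support vectors)
plt.scatter(X[at_zero, 0], X[at_zero, 1], facecolors="none", edgecolors="red", s=60)
plt.scatter(X[at_C, 0], X[at_C, 1], marker="s", facecolors="none", edgecolors="black", s=60)
plt.show()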


Homework SVM3 (optional) :
Derive the dual problem from the primal problem, using a
quadratic penalty on the ksi's (slack variables).
