SVM — 2024-11-15
Chris Ding
-- max-margin
-- use f(x) = +1, -1 to set the scaling constant
-- optimization: the dual objective function
-- KKT conditions, complementary slackness condition
-- separable (hard-margin) SVM vs. non-separable (soft-margin) SVM
-- Kernel trick
-- XOR problem
Three homeworks
Perceptron
• Decision boundary: wTx + b = 0
• wTx + b > 0 on one side (class +1), wTx + b < 0 on the other (class −1)
• Decision rule: f(x) = sign(wTx + b)
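As a tiny illustration of this decision rule (the w and b below are made-up values, not from the lecture):

```python
import numpy as np

# Hypothetical weights and bias for a 2-D linear classifier
w = np.array([2.0, -1.0])
b = 0.5

def f(x):
    """Linear decision rule: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if w @ x + b > 0 else -1

print(f(np.array([1.0, 0.0])))   # 2.0 + 0.5 > 0  -> +1
print(f(np.array([-1.0, 1.0])))  # -3.0 + 0.5 < 0 -> -1
```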
SVM: do better than the perceptron.
• Distance from example xi to the separator is r = (wTxi + b) / ||w||.
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the distance between the two lines f(x) = 1 and f(x) = −1.
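A quick numeric check of the distance formula (illustrative values for w, b and xi, not from the lecture's figure):

```python
import numpy as np

# Illustrative hyperplane and point
w = np.array([3.0, 4.0])   # ||w|| = 5
b = -1.0
x_i = np.array([2.0, 1.0])

# Signed distance from x_i to the hyperplane wTx + b = 0
r = (w @ x_i + b) / np.linalg.norm(w)
print(r)   # (6 + 4 - 1) / 5 = 1.8
```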
Maximum Margin Classification
• Maximizing the margin is good according to intuition and PAC theory.
• Implies that only support vectors matter; other training examples are ignorable.
Linear SVM Mathematically
• Let the training set {(xi, yi)}i=1..n, xi ∈ R^d, yi ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):

      wTxi + b ≤ −ρ/2  if yi = −1
      wTxi + b ≥  ρ/2  if yi = +1        ⇔   yi(wTxi + b) ≥ ρ/2

• For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is

      r = ys(wTxs + b) / ||w|| = 1 / ||w||
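Written out for reference, this gives the usual textbook hard-margin quadratic program (maximizing the margin 2/||w|| is the same as minimizing ||w||²):

```latex
% Hard-margin SVM primal problem
\begin{aligned}
  \min_{\mathbf{w},\, b} \quad & \tfrac{1}{2}\,\|\mathbf{w}\|^{2} \\
  \text{s.t.} \quad & y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1,
      \qquad i = 1,\dots,n
\end{aligned}
```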
Proposed in the 1970s.
(Figure: slack variables ξi for margin violations; ξi = 0 for points outside the margin on the correct side.)
Soft Margin Classification Mathematically
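A standard way to write the soft-margin problem, with slack variables ξi and penalty parameter C (included here as a reference formulation):

```latex
% Soft-margin SVM primal problem: \xi_i measures the margin violation of x_i,
% and C trades margin size against training errors.
\begin{aligned}
  \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
      & \tfrac{1}{2}\,\|\mathbf{w}\|^{2} + C \sum_{i=1}^{n} \xi_i \\
  \text{s.t.} \quad
      & y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i,
        \qquad \xi_i \ge 0, \quad i = 1,\dots,n
\end{aligned}
```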
Non-linear SVMs
• Datasets that are linearly separable work out great; but what if the dataset is too hard to separate with a hyperplane?
• Idea: map the data to a higher-dimensional feature space where it becomes linearly separable:

      Φ: x → φ(x)
Kernel Trick
For x = (x1, x2), take the feature map φ(x) = (x1², x2², √2 x1x2). Then

      φ(x) · φ(z) = (x1², x2², √2 x1x2) · (z1², z2², √2 z1z2)
                  = x1²z1² + x2²z2² + 2 x1z1 x2z2
                  = (x1z1 + x2z2)²
                  = (x · z)² = K(x, z)

so the feature-space inner product can be computed directly from x and z, without ever forming φ(x).
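A tiny numeric sanity check of this identity (the vectors are arbitrary illustrative values):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the quadratic kernel in 2-D."""
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))   # inner product in feature space
print((x @ z) ** 2)      # kernel value (x . z)^2 -- the same number, 1.0
```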
Kernel trick + QP
• The max-margin classifier can be found by solving

      argmax_α  Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (φ(x_j) · φ(x_k))
    = argmax_α  Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k K(x_j, x_k)

  (with the usual constraints α_j ≥ 0 and Σ_j α_j y_j = 0), since writing z_i = φ(x_i) gives z_i · z_j = φ(x_i) · φ(x_j) = K(x_i, x_j).
• The inner-product kernel K(a, b) = a · b is the (simplest) linear kernel.
The optimization (and prediction) only needs the n × n kernel (Gram) matrix:

      K =  K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn)
              …         …         …      …     …
           K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn)
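A small sketch (assuming scikit-learn) of building the Gram matrix for the quadratic kernel K(x, z) = (x · z)² and handing it to an SVM with a precomputed kernel; the data and labels are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # toy data
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # toy labels

def quad_kernel(A, B):
    """Gram matrix for K(x, z) = (x . z)^2."""
    return (A @ B.T) ** 2

K = quad_kernel(X, X)                        # n x n kernel matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

# Prediction needs the kernel between new points and the training points
X_new = rng.normal(size=(3, 2))
print(clf.predict(quad_kernel(X_new, X)))
```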
(Figure: checkerboard data; colour shade shows the f(x) value. A standard linear classifier cannot separate the two classes.)
Solve XOR problem using kernel SVM
Note: 0 ↔ (−1)
Inputs are in {0, 1}².
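A minimal sketch (assuming scikit-learn) of fitting a kernel SVM to the four XOR points; labels 0 are mapped to −1 as noted above, and a degree-2 polynomial kernel is one choice that separates them:

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR inputs in {0,1}^2, with output 0 mapped to -1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# A degree-2 polynomial kernel makes XOR separable in feature space
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)

print(clf.predict(X))   # [-1  1  1 -1] -- all four points classified correctly
print(clf.dual_coef_)   # alpha_i * y_i for the support vectors
```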
Examples of Kernel Functions
• Linear: K(xi,xj)= xiTxj
– Mapping Φ: x → φ(x), where φ(x) is x itself
• Gaussian (radial-basis function): K(xi, xj) = exp(−||xi − xj||² / (2σ²))
– Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is
mapped to a function (a Gaussian); combination of functions for support
vectors is the separator.
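A quick check of the Gaussian formula against scikit-learn's rbf_kernel, which is parameterized by γ = 1/(2σ²); the vectors and σ are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

xi = np.array([[1.0, 2.0]])
xj = np.array([[2.0, 0.0]])
sigma = 1.5

# Direct evaluation of K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
manual = np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

# scikit-learn computes exp(-gamma * ||xi - xj||^2)
sk = rbf_kernel(xi, xj, gamma=1.0 / (2 * sigma ** 2))[0, 0]

print(manual, sk)   # identical values
```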
Multi-class Classification (Sec. 14.5)
• One vs. others (one vs. rest)
  – Build a classifier for each class against all other classes combined together.
  – Need to train K such classifiers.
  – Use the largest score to determine the final class.
• One vs. one
  – Train K(K−1)/2 classifiers, each separating one class from another.
  – Use majority voting to obtain the final class (a sketch of both strategies follows below).
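A small sketch (assuming scikit-learn) contrasting the two strategies on a 3-class toy dataset; LinearSVC is used as the underlying binary classifier purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # K = 3 classes

# One vs. others: K binary classifiers, pick the class with the largest score
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)

# One vs. one: K(K-1)/2 pairwise classifiers, final class by majority vote
ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)

print(len(ovr.estimators_))          # 3 = K
print(len(ovo.estimators_))          # 3 = K(K-1)/2 for K = 3
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```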
Multi-label Classification
• Classes are mutually exclusive
– Each handwritten letter belongs to exactly one class
– A student is a 1st-, 2nd-, 3rd-, or 4th-year student, and cannot be two or more at once
– The common case: multi-class exclusive classification
• Classes are mutually non-exclusive
– An article on drug design could also discuss the drug
company’s (and market) economics.
– An image may contain sky, buildings, roads, etc.
– Multi-class inclusive classification (multi-label classification); a per-label sketch follows below
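For the non-exclusive case, one common recipe (a sketch, assuming scikit-learn; the data is synthetic) is to train an independent binary SVM per label, so a sample can be positive for several labels at once:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic multi-label data: each sample may carry several of 3 labels
X, Y = make_multilabel_classification(n_samples=100, n_classes=3,
                                      random_state=0)

# One independent binary SVM per label
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, Y)

print(clf.predict(X[:3]))   # rows of 0/1 indicators, possibly several 1s per row
```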
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g.
graphs, sequences, relational data) by designing kernel functions for
such data.
• SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis [Schölkopf et
al. ’99], etc.
• Most popular optimization algorithms for SVMs use decomposition to
hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and
[Joachims ’99]
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done by trial and error.
• The most popular SVM software is LIBSVM, from C.-J. Lin.
Homework SVM1: SVM1a, SVM1b
Plot the lines f(x) = 1, 0, −1, adding red circles to the data points where alpha_i = 0 and squares to the data points where alpha_i = C. (A plotting sketch follows below.)
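One possible way to produce such a plot (a sketch, assuming scikit-learn and matplotlib; the 2-D data is synthetic, and the alpha_i are recovered from SVC's dual_coef_):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

C = 1.0
X, y = make_blobs(n_samples=60, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)

clf = SVC(kernel="linear", C=C).fit(X, y)

# alpha_i for every point: zero except at the support vectors
alpha = np.zeros(len(X))
alpha[clf.support_] = np.abs(clf.dual_coef_[0])

# Contours of f(x) at the levels -1, 0, +1
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
f = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contour(xx, yy, f, levels=[-1, 0, 1], linestyles=["--", "-", "--"])
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm")

# Red circles where alpha_i = 0; squares where alpha_i = C
zero = np.isclose(alpha, 0.0)
at_C = np.isclose(alpha, C)
plt.scatter(X[zero, 0], X[zero, 1], facecolors="none", edgecolors="red", s=120)
plt.scatter(X[at_C, 0], X[at_C, 1], marker="s", facecolors="none",
            edgecolors="black", s=140)
plt.show()
```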