4 - SVM
Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
  decision boundary:  w^T x + b = 0
  one class:          w^T x + b > 0
  the other class:    w^T x + b < 0
  classifier:         f(x) = sign(w^T x + b)
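To make the decision rule concrete, here is a minimal sketch in Python; the weight vector, bias, and test points are illustrative values of my own choosing, not from the slides.

```python
# A minimal sketch of the linear classifier f(x) = sign(w^T x + b);
# w, b and the test points are illustrative toy values.
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5

def f(x):
    return np.sign(w @ x + b)   # +1 on one side of the hyperplane w^T x + b = 0, -1 on the other

print(f(np.array([1.0, 1.0])), f(np.array([-1.0, 1.0])))   # 1.0 -1.0
```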
Linear Separators
• Which of the linear separators is optimal?
What is a good Decision Boundary?
• Many decision boundaries!
  – The Perceptron algorithm can be used to find such a boundary
• Are all decision boundaries equally good?
(Figure: several boundaries separating Class 1 from Class 2.)
Examples of Bad Decision Boundaries
(Figure: two examples of bad boundaries between Class 1 and Class 2.)
Finding the Decision Boundary
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
  For y_i = 1:    w^T x_i + b ≥ 1
  For y_i = -1:   w^T x_i + b ≤ -1
  So:   y_i (w^T x_i + b) ≥ 1,   ∀(x_i, y_i)
(Figure: Class 1 and Class 2 points labelled y = 1 and y = -1 on either side of the boundary, separated by margin m.)
Large-margin Decision Boundary
• The decision boundary should be as far away
from the data of both classes as possible
– We should maximize the margin, m
(Figure: the boundary placed midway between Class 1 and Class 2, with margin m.)
Finding the Decision Boundary
• The decision boundary should classify all points correctly:   y_i (w^T x_i + b) ≥ 1,  ∀i
• Since the margin is m = 2 / ||w||, maximizing the margin amounts to the constrained optimization problem
  minimize  (1/2) ||w||²   subject to   y_i (w^T x_i + b) ≥ 1,   i = 1, ..., n
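As a quick numerical illustration of the constraints and of the margin m = 2 / ||w||, here is a small sketch; the points, labels, w and b are toy values of my own choosing.

```python
# A minimal sketch: check the constraints y_i (w^T x_i + b) >= 1 and compute
# the margin m = 2 / ||w||; the data, w and b are illustrative toy values.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 0.0], [-2.0, -2.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, 0.5]), 0.0

print(y * (X @ w + b))           # every entry >= 1, so all constraints hold
print(2.0 / np.linalg.norm(w))   # margin m = 2 / ||w||, about 2.83 here
```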
Finding the Decision Boundary
• The Lagrangian is
  L = (1/2) w^T w + Σ_{i=1}^n α_i (1 - y_i (w^T x_i + b))
  – α_i ≥ 0
  – Note that ||w||² = w^T w
Gradient with respect to w and b
• Setting the gradient of L w.r.t. w and b to zero, we have
  L = (1/2) w^T w + Σ_{i=1}^n α_i (1 - y_i (w^T x_i + b))
    = (1/2) Σ_{k=1}^m w^k w^k + Σ_{i=1}^n α_i (1 - y_i (Σ_{k=1}^m w^k x_i^k + b))
  (n: number of examples, m: dimension of the space)
  ∂L/∂w^k = 0 for all k,   ∂L/∂b = 0
  which give   w = Σ_{i=1}^n α_i y_i x_i   and   Σ_{i=1}^n α_i y_i = 0
The Dual Problem
• If we substitute w = Σ_{i=1}^n α_i y_i x_i into L, we have
  L = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
  Since Σ_{i=1}^n α_i y_i = 0, the term involving b vanishes.
The Dual Problem
• The new objective function is in terms of the α_i only
• It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized (this comes out of the KKT theory)
• The dual problem is therefore:
  maximize  W(α) = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
  subject to  α_i ≥ 0,   Σ_{i=1}^n α_i y_i = 0
• w can be recovered by  w = Σ_{i=1}^n α_i y_i x_i
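One way to solve this dual numerically is to hand it to a generic constrained optimizer; the sketch below uses SciPy's SLSQP, and the tiny separable data set and variable names are illustrative assumptions rather than anything from the slides.

```python
# A minimal sketch of solving the hard-margin dual with scipy.optimize.minimize.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0],        # class +1
              [-1.0, -1.0], [-2.0, -2.0], [-2.0, 0.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

G = (y[:, None] * y[None, :]) * (X @ X.T)        # G_ij = y_i y_j x_i^T x_j

def neg_dual(a):
    return 0.5 * a @ G @ a - a.sum()             # minimize -W(alpha)

constraints = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * len(y)                           # alpha_i >= 0
res = minimize(neg_dual, np.zeros(len(y)), bounds=bounds, constraints=constraints)

alpha = res.x
w = (alpha * y) @ X                              # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                       # index of a support vector
b = y[sv] - w @ X[sv]                            # from y_sv (w^T x_sv + b) = 1
print(alpha.round(3), w.round(3), round(float(b), 3))
```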
Characteristics of the Solution
• Many of the α_i are zero
  – w is a linear combination of a small number of data points
  – This “sparse” representation can be viewed as data compression, as in the construction of a k-NN classifier
(Figure: only three points have non-zero multipliers, α_1 = 0.8, α_6 = 1.4, α_8 = 0.6; these are the support vectors, all the other α_i = 0.)
Characteristics of the Solution
• For testing with a new data point z
  – Compute  w^T z + b = Σ_{i=1}^n α_i y_i x_i^T z + b
  and classify z as class 1 if the sum is positive, and class 2 otherwise
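The same test can be written directly in terms of the multipliers, without forming w explicitly; alpha, X, y and b below are illustrative names carried over from the previous sketch.

```python
# A minimal sketch: classify a new point z from the multipliers alpha.
import numpy as np

def predict(z, alpha, X, y, b):
    f = np.sum(alpha * y * (X @ z)) + b    # f(z) = sum_i alpha_i y_i x_i^T z + b
    return 1 if f > 0 else 2               # class 1 if positive, class 2 otherwise
```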
Non-linearly Separable Problems
• We allow an “error” ξ_i in classification; it is based on the output of the discriminant function w^T x + b
• Σ_i ξ_i approximates the number of misclassified samples
(Figure: points of Class 1 and Class 2 that fall on the wrong side of the margin, each with an associated ξ_i.)
Soft Margin Hyperplane
• The new conditions become
  y_i (w^T x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0,   ∀i
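For a given w and b, the smallest feasible slack for each point and the resulting soft-margin objective can be computed as below; w, b and C are assumed inputs, only for illustration.

```python
# A minimal sketch: slacks xi_i satisfying y_i (w^T x_i + b) >= 1 - xi_i, xi_i >= 0,
# and the soft-margin objective (1/2)||w||^2 + C * sum_i xi_i.
import numpy as np

def slacks_and_objective(w, b, X, y, C):
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)        # smallest slack that makes each constraint hold
    objective = 0.5 * w @ w + C * xi.sum()
    return xi, objective
```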
The Optimization Problem
  L = (1/2) w^T w + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n α_i (1 - ξ_i - y_i (w^T x_i + b)) - Σ_{i=1}^n μ_i ξ_i
  ∂L/∂ξ_j = C - α_j - μ_j = 0
  ∂L/∂b = Σ_{i=1}^n y_i α_i = 0
The Dual Problem
  Substituting w = Σ_{j=1}^n α_j y_j x_j gives
  L = (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j + C Σ_{i=1}^n ξ_i
      + Σ_{i=1}^n α_i (1 - ξ_i - y_i (Σ_{j=1}^n α_j y_j x_j^T x_i + b)) - Σ_{i=1}^n μ_i ξ_i
  With  Σ_{i=1}^n y_i α_i = 0  and  C = α_j + μ_j,  this simplifies to
  L = - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j + Σ_{i=1}^n α_i
The Optimization Problem
• The dual of this new constrained optimization problem is
  maximize  W(α) = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i^T x_j
  subject to  0 ≤ α_i ≤ C,   Σ_{i=1}^n α_i y_i = 0
• The only difference from the linearly separable case is that α_i is now also bounded above by C
• The corresponding primal problem minimizes
  (1/2) ||w||² + C Σ_{i=1}^n ξ_i
  subject to  y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0
• C controls the trade-off between margin width and training error
Soft margin is more robust
Extension to Non-linear Decision Boundary
• So far, we have only considered large-margin classifiers with a linear decision boundary
• How do we generalize it to become nonlinear?
• Key idea: transform x_i to a higher-dimensional space to “make life easier”
  – Input space: the space where the points x_i are located
  – Feature space: the space of f(x_i) after the transformation
• Why transform?
  – A linear operation in the feature space is equivalent to a non-linear operation in the input space
  – Classification can become easier with a proper transformation. In the XOR problem, for example, adding the new feature x_1 x_2 makes the problem linearly separable (see the sketch after the XOR tables below)
XOR is not linearly separable
  X  Y  | XOR
  0  0  |  0
  0  1  |  1
  1  0  |  1
  1  1  |  0
With the extra feature XY it is linearly separable
  X  Y  XY | XOR
  0  0  0  |  0
  0  1  0  |  1
  1  0  0  |  1
  1  1  1  |  0
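A quick check of the XOR claim above; the particular weights and bias are an illustrative choice, not taken from the slides.

```python
# A minimal sketch: with the extra feature x1*x2, a single hyperplane
# reproduces the XOR labels.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
target = np.array([0, 1, 1, 0])                  # XOR truth table

X_aug = np.c_[X, X[:, 0] * X[:, 1]]              # append the x1*x2 feature
w, b = np.array([1.0, 1.0, -2.0]), -0.5          # one separating hyperplane in 3-D
pred = (X_aug @ w + b > 0).astype(int)
print(pred, bool(np.array_equal(pred, target)))  # [0 1 1 0] True
```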
Find a Feature Space
Transforming the Data
(Figure: each point is mapped by f(.) from the input space into the feature space.)
An Example for f(.) and K(.,.)
• Suppose f(.) is given as follows
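The explicit form of f(.) did not survive in this copy of the slides. A common choice for 2-D inputs, consistent with the degree-2 polynomial kernel K(x, y) = (x^T y + 1)² used later, is f(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2); the sketch below, with that assumed mapping, checks numerically that the feature-space inner product equals the kernel value.

```python
# A minimal sketch: verify f(x) . f(y) == (x^T y + 1)^2 for an assumed
# degree-2 feature map (the map itself is an illustrative assumption).
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    return (x @ y + 1.0) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(f(x) @ f(y), K(x, y))   # both print 4.0
```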
Kernels
• Given a mapping x → φ(x), a kernel is represented as the inner product
  K(x, y) = φ(x) · φ(y) = Σ_i φ_i(x) φ_i(y)
Modification Due to Kernel Function
• Change all inner products to kernel functions
• For training,
  Original:              maximize  W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
  With kernel function:  maximize  W(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
  (subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0 in both cases)
Modification Due to Kernel Function
• For testing, the new data z is classified as class 1 if f ≥ 0, and as class 2 if f < 0
  Original:              f = Σ_i α_i y_i x_i^T z + b
  With kernel function:  f = Σ_i α_i y_i K(x_i, z) + b
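Written as code, the kernelized decision function looks as follows; alpha, X, y, b and the particular kernel are illustrative names reused from the earlier sketches.

```python
# A minimal sketch of the kernelized discriminant f(z) = sum_i alpha_i y_i K(x_i, z) + b.
import numpy as np

def K(x, z):
    return (x @ z + 1.0) ** 2                     # e.g. the degree-2 polynomial kernel

def discriminant(z, alpha, X, y, b):
    return np.sum(alpha * y * np.array([K(xi, z) for xi in X])) + b

# classify z as class 1 if discriminant(z, ...) >= 0, as class 2 otherwise
```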
More on Kernel Functions
• Since the training of an SVM only requires the values of K(x_i, x_j), there is no restriction on the form of x_i and x_j
  – x_i can be a sequence or a tree, instead of a feature vector
Example
• Suppose we have five 1-D data points
  – x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e.
    y_1 = 1, y_2 = 1, y_3 = -1, y_4 = -1, y_5 = 1
(Figure: the five points at 1, 2, 4, 5, 6 shown on the number line, labelled by class.)
Example
• We use the polynomial kernel of degree 2
  – K(x, y) = (xy + 1)²
  – C is set to 100
Example
• By using a QP solver, we get
  – α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833
  – Note that the constraints are indeed satisfied: 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 2.5 - 7.333 + 4.833 = 0
  – The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}
• The discriminant function is
  f(z) = 2.5 (2z + 1)² - 7.333 (5z + 1)² + 4.833 (6z + 1)² + b ≈ 0.6667 z² - 5.333 z + 9
  where b = 9 follows from requiring f = ±1 at the support vectors (e.g. f(2) = 1)
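One way to reproduce this example, assuming scikit-learn is available (the slides themselves refer only to a generic QP solver); up to solver tolerance the support vectors and dual coefficients should match the values quoted above.

```python
# A minimal sketch of the 1-D example with sklearn's SVC; (gamma*x.y + coef0)^degree
# with gamma=1, coef0=1, degree=2 reproduces K(x, y) = (xy + 1)^2.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(C=100, kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)

print(clf.support_)              # indices of the support vectors
print(clf.dual_coef_)            # y_i * alpha_i for the support vectors
print(clf.intercept_)            # b
print(clf.decision_function(X))  # discriminant values at the training points
```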
(Figure: the value of the discriminant function plotted along the input axis, marked at the points 1, 2, 4, 5, 6.)
Kernel Functions
• In practical use of SVM, the user specifies the kernel
function; the transformation f(.) is not explicitly stated
• Given a kernel function K(xi, xj), the transformation f(.)
is given by its eigenfunctions (a concept in functional
analysis)
– Eigenfunctions can be difficult to construct explicitly
– This is why people only specify the kernel function without
worrying about the exact transformation
• Another view: kernel function, being an inner product,
is really a similarity measure between the objects
A kernel is associated with a transformation
– Given a kernel, in principle the transformation to the feature space that originates it can be recovered.
  For example, in one dimension the kernel K(x, y) = (xy + 1)² corresponds to the transformation
  x → (x², √2 x, 1),  since  (xy + 1)² = x² y² + 2xy + 1 = φ(x) · φ(y)
Examples of Kernel Functions
• Polynomial kernel up to degree d:   K(x, y) = (x^T y + 1)^d
Building new kernels
• If k_1(x, y) and k_2(x, y) are two valid kernels, then the following kernels are valid (a numerical sanity check follows this list)
  – Linear combination (c_1, c_2 ≥ 0)
    k(x, y) = c_1 k_1(x, y) + c_2 k_2(x, y)
  – Exponential
    k(x, y) = exp(k_1(x, y))
  – Product
    k(x, y) = k_1(x, y) k_2(x, y)
  – Polynomial transformation (Q: polynomial with non-negative coefficients)
    k(x, y) = Q(k_1(x, y))
  – Function product (f: any function)
    k(x, y) = f(x) k_1(x, y) f(y)
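A simple empirical sanity check of these constructions is that the Gram matrix they induce stays positive semidefinite; the base kernels, the function f, and the random data below are illustrative assumptions.

```python
# A minimal sketch: the Gram matrices of the combined kernels have no
# (significantly) negative eigenvalues on random data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def gram(k):
    return np.array([[k(x, y) for y in X] for x in X])

k1 = lambda x, y: x @ y                       # linear kernel
k2 = lambda x, y: (x @ y + 1.0) ** 2          # degree-2 polynomial kernel

candidates = {
    "linear combination": lambda x, y: 2.0 * k1(x, y) + 0.5 * k2(x, y),
    "exponential":        lambda x, y: np.exp(k1(x, y)),
    "product":            lambda x, y: k1(x, y) * k2(x, y),
    "function product":   lambda x, y: np.sin(x[0]) * k1(x, y) * np.sin(y[0]),
}
for name, k in candidates.items():
    print(name, np.linalg.eigvalsh(gram(k)).min() >= -1e-8)
```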
Polynomial kernel
Active Support Vector Learning
Summary: Steps for Classification
• Prepare the pattern matrix
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
  – You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters (a workflow sketch follows this list)
• Execute the training algorithm and obtain the α_i
• Unseen data can be classified using the α_i and the support vectors
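A compact workflow following these steps, sketched with scikit-learn; the RBF kernel, the small parameter grid, and the synthetic data are illustrative choices, not prescribed by the slides.

```python
# A minimal sketch: prepare data, pick kernel parameters and C on a validation
# set, train, and classify unseen points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)   # pattern matrix
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = None
for C in (0.1, 1, 10, 100):                 # value of C
    for gamma in (0.01, 0.1, 1.0):          # kernel parameter
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)   # training
        score = clf.score(X_val, y_val)     # validation accuracy
        if best is None or score > best[0]:
            best = (score, clf)

print(best[0], best[1].predict(X_val[:5]))  # classify unseen data
```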
Strengths and Weaknesses of SVM
• Strengths
– Training is relatively easy
• No local optima, unlike in neural networks
– It scales relatively well to high dimensional data
– Tradeoff between classifier complexity and error can
be controlled explicitly
– Non-traditional data like strings and trees can be used
as input to SVM, instead of feature vectors
• Weaknesses
– Need to choose a “good” kernel function.
Conclusion
• SVM is a useful alternative to neural networks
• Two key concepts of SVM: maximize the
margin and the kernel trick
• Many SVM implementations are available on
the web for you to try on your data set!
Resources
• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html