Lecture 8
Overview
• SVM for a linearly separable binary data set
• Main goal: design a hyperplane that classifies all training vectors into two classes
• The best model is the one that leaves the maximum margin from both classes
• The two class labels are +1 (positive examples) and -1 (negative examples)
[Figure: two classes of points in the (x1, x2) plane, separated by a maximum-margin hyperplane]
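As a concrete companion to the figure, here is a minimal sketch (not part of the slides, and assuming scikit-learn) that fits a linear SVM on a small separable toy set; a very large C is used to approximate the hard-margin case described above. Note that scikit-learn's decision function uses the convention w^T x + b rather than the slides' w^T x - b.

```python
# Minimal sketch (assumption: scikit-learn is available): a hard-margin-style
# linear SVM on a small, linearly separable 2-D toy data set.
import numpy as np
from sklearn.svm import SVC

# Toy data: class +1 in the upper-right, class -1 near the origin
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin SVM for separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0])           # normal vector of the hyperplane
print("b =", clf.intercept_[0])      # offset (sklearn convention: w^T x + b)
print("support vectors:\n", clf.support_vectors_)
```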
Intuition behind SVM
SVM more formally
Margin in terms of w
SVM as a minimization problem
$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\,\|\mathbf{w}\|^2 \qquad \text{(quadratic objective)}$$
$$\text{s.t.}\quad y_n\left(\mathbf{w}^{T}\mathbf{x}_n - b\right) - 1 \ge 0 \;\;\forall n \qquad \text{(linear constraints)}$$
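Because the objective is quadratic and the constraints are linear, this primal problem can be handed to any general-purpose constrained solver. The sketch below is illustrative only (not from the slides) and assumes SciPy; the toy data and variable names are my own choices. It keeps the slides' constraint convention y_n(w^T x_n - b) - 1 ≥ 0.

```python
# Minimal sketch (assumption: SciPy is available): solving the hard-margin
# primal QP directly with a general-purpose solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
d = X.shape[1]

def objective(z):
    w = z[:d]
    return 0.5 * np.dot(w, w)            # (1/2) ||w||^2

def margin_constraints(z):
    w, b = z[:d], z[d]
    return y * (X @ w - b) - 1.0         # each entry must be >= 0

res = minimize(objective,
               x0=np.zeros(d + 1),
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)
```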
We wish to find the w and b which minimize, and the α which maximizes, the primal Lagrangian LP (whilst keeping αi ≥ 0 ∀i). We can do this by differentiating LP with respect to w and b and setting the derivatives to zero:
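Writing out LP for the constraint convention used above, the two stationarity conditions are the standard ones:

$$L_P = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_n \alpha_n\left[y_n\left(\mathbf{w}^{T}\mathbf{x}_n - b\right) - 1\right]$$
$$\frac{\partial L_P}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_n \alpha_n y_n \mathbf{x}_n,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_n \alpha_n y_n = 0$$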
Characteristics of the Solution
▪ Many of the αi are zero (see the example below)
▪ w is a linear combination of a small number of data points
▪ This “sparse” representation can be viewed as data compression, as in the construction of the k-NN classifier
▪ The xi with non-zero αi are called support vectors (SV)
▪ The decision boundary is determined only by the SVs
▪ Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write
  $$\mathbf{w} = \sum_{j=1}^{s} \alpha_{t_j}\, y_{t_j}\, \mathbf{x}_{t_j}$$
[Figure: ten training points from Class 1 and Class 2 with their Lagrange multipliers; only α1 = 0.8, α6 = 1.4 and α8 = 0.6 are non-zero, so x1, x6 and x8 are the support vectors; all other αi = 0]
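The same sparse structure can be inspected directly in a fitted model. The sketch below (not from the slides, scikit-learn assumed, with randomly generated data) shows that only the support vectors carry non-zero multipliers and that w can be recovered from them alone.

```python
# Minimal sketch (assumption: scikit-learn is available): inspecting the
# sparse solution of a linear SVM.  dual_coef_ stores alpha_i * y_i for the
# support vectors only; every other training point has alpha_i = 0.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),
               rng.normal(loc=-2.0, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("indices of support vectors:", clf.support_)
print("alpha_i * y_i for the SVs :", clf.dual_coef_[0])

# w recovered from the support vectors alone matches clf.coef_
w_from_svs = clf.dual_coef_[0] @ clf.support_vectors_
print("w from SVs:", w_from_svs)
print("clf.coef_ :", clf.coef_[0])
```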
Example
Kernel trick
Non-linear SVMs: Feature spaces
▪ General idea: the original feature space can always be
mapped to some higher-dimensional feature space where the
training set is separable:
Φ: x → φ(x)
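A small sketch of this idea (not from the slides, scikit-learn assumed): concentric-circle data are not linearly separable in 2-D, but the explicit map φ(x1, x2) = (x1, x2, x1² + x2²) makes them separable by a plane in 3-D.

```python
# Minimal sketch (assumption: scikit-learn is available): an explicit feature
# map that turns a non-separable 2-D problem into a linearly separable 3-D one.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

def phi(X):
    # append the squared radius x1^2 + x2^2 as a third coordinate
    return np.column_stack([X, (X ** 2).sum(axis=1)])

print("accuracy in original 2-D space:",
      LinearSVC(C=10.0).fit(X, y).score(X, y))
print("accuracy after mapping phi    :",
      LinearSVC(C=10.0).fit(phi(X), y).score(phi(X), y))
```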
SVM for non-linear separability
Kernels
▪ Why use kernels?
▪ Make a non-separable problem separable
▪ Map the data into a better representational space
▪ Common kernels (a usage sketch follows this list)
▪ Linear: K(x,z) = x^T z
▪ Polynomial: K(x,z) = (1 + x^T z)^d
▪ Gives feature conjunctions
▪ Radial basis function: K(x,z) = exp(−||x − z||² / (2σ²)), an infinite-dimensional feature space
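As a quick comparison of these kernels, here is a sketch (not from the slides, scikit-learn assumed; the data set and parameter values are illustrative). In scikit-learn the polynomial kernel is (γ x^T z + coef0)^degree, so coef0 = 1 mirrors the (1 + x^T z)^d form above up to the γ scaling.

```python
# Minimal sketch (assumption: scikit-learn is available): comparing common
# kernels on a data set that is not linearly separable in the original space.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for name, clf in [
    ("linear",     SVC(kernel="linear")),
    ("polynomial", SVC(kernel="poly", degree=3, coef0=1.0)),
    ("RBF",        SVC(kernel="rbf", gamma="scale")),
]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:10s} kernel: mean CV accuracy = {acc:.3f}")
```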