Final - Support Vector Machine - Class - Modified
■ SVM works well with higher-dimensional data and thus avoids the
dimensionality problem.
We discuss:
• A classification technique for training data that are linearly
separable, known as a Linear SVM.
• A classification technique for training data that are not linearly
separable, known as a Non-linear SVM.
The maximum margin of the separating hyperplane is also a key idea.
Support Vector Machines - Linear classifier
[Figure: two 2-D datasets, one linearly separable and one not linearly separable]
Input Space to Feature Space
f(x, w, b) = sign(w·x + b)
• the classifier outputs +1 where w·x + b > 0
• the classifier outputs -1 where w·x + b < 0
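A minimal sketch of this decision rule in Python (the values of w, b, and the points X below are made up for illustration, not taken from the slides):

import numpy as np

w = np.array([2.0, -1.0])           # hypothetical weight vector
b = -0.5                            # hypothetical bias
X = np.array([[1.0, 0.0],           # a few example points
              [0.0, 1.0],
              [0.4, 0.1]])

scores = X @ w + b                  # w·x + b for every row of X
labels = np.sign(scores)            # +1 where w·x + b > 0, -1 where w·x + b < 0
print(scores, labels)               # [ 1.5 -1.5  0.2] -> [ 1. -1.  1.]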
Finding a hyperplane
W·X + b = 0
where W = [w1, w2, ..., wm], X = [x1, x2, ..., xm], and b is a real
constant.
f(xi) = wᵀxi + b
separates the categories for i = 1, ..., N.
• How can we find this separating hyperplane?
[Figure: candidate separating hyperplanes with normal vector w in the (X1, X2) plane; Perceptron example]
■ w·xi + b ≥ +1 when yi = +1
■ w·xi + b ≤ -1 when yi = -1
Equivalently, yi (w·xi + b) ≥ 1 ∀i
• To obtain the geometric distance from the hyperplane to a data
point, we normalize by the magnitude of w.
• We want the hyperplane that maximizes the geometric distance
to the closest data points:
d((w, b), xi) = yi (w·xi + b) / ||w|| ≥ 1 / ||w||
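A minimal numerical sketch of these two quantities, with a hypothetical (w, b) and a single labelled point (none of the numbers come from the lecture):

import numpy as np

w = np.array([3.0, 4.0])                    # hypothetical normal vector, ||w|| = 5
b = -2.0                                    # hypothetical bias
x_i, y_i = np.array([1.0, 1.0]), +1         # one labelled training point

functional_margin = y_i * (w @ x_i + b)     # y_i (w·x_i + b)
geometric_margin = functional_margin / np.linalg.norm(w)
print(functional_margin, geometric_margin)  # 5.0 and 1.0 for these values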
Linear SVM
Linear SVM for Linearly Non-Separable Data
Linear SVM Mathematically
[Figure: the two bounding hyperplanes through x+ and x-, with margin width M]
The two hyperplanes are parallel (they have the same normal w), and no
training points fall between them.
What we know:
■ w·x+ + b = +1
■ w·x- + b = -1
■ w·(x+ - x-) = 2, so the margin width is M = 2 / ||w||
Linear SVM Mathematically
■ Goal: 1) Correctly classify all training data:
w·xi + b ≥ +1 if yi = +1
w·xi + b ≤ -1 if yi = -1
i.e. yi (w·xi + b) ≥ 1 for all i
2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ wᵀw
■ Minimize ½ wᵀw
subject to yi (w·xi + b) ≥ 1 for all i
Searching for the MMH (Maximum Marginal Hyperplane)
Original Problem:
Find w and b such that
Φ(w) = ½ wᵀw is minimized,
and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1
Construct the Lagrangian function for this optimization:
L(w, b, α) = ½ wᵀw − Σi αi [yi (wᵀxi + b) − 1],   s.t. αi ≥ 0 ∀i
Setting ∂L/∂w = 0 and ∂L/∂b = 0 gives w = Σi αi yi xi and Σi αi yi = 0.
Substituting these back, we get the dual problem:
maxα: Σi αi − ½ Σi,j αi αj yi yj xiᵀxj
Subject to: αi ≥ 0 and Σi αi yi = 0
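The dual can be handed to any quadratic-programming routine. The sketch below solves it for a made-up six-point toy set using scipy's SLSQP optimizer as a stand-in QP solver; the data, names, and the 1e-5 support-vector threshold are illustrative choices, not part of the lecture.

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.5],
              [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

K = (X * y[:, None]) @ (X * y[:, None]).T        # K[i, j] = yi yj xi·xj

def neg_dual(alpha):                              # minimize the negative dual objective
    return -(alpha.sum() - 0.5 * alpha @ K @ alpha)

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),                           # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i yi = 0
alpha = res.x

w = (alpha * y) @ X                               # w = sum_i alpha_i yi xi
sv = alpha > 1e-5                                 # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)                    # from yi (w·xi + b) = 1 on support vectors
print(np.round(alpha, 3), np.round(w, 3), round(b, 3))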
Illustration: Linear SVM
Consider the case of binary classification starting with
training data of 8 tuples, as shown in Table 1.
Using quadratic programming, we can solve the KKT constraints
to obtain the Lagrange multipliers λi for each training tuple,
also shown in Table 1.
Note that only the first two tuples are support vectors in this case.
Let W = (w1, w2) and b denote the parameters to be determined.
We can solve for w1 and w2 as wj = Σi λi yi xij, and then obtain b
from the support-vector condition yi (W·Xi + b) = 1.
Table 1: training tuples and their Lagrange multipliers
x1   x2   y   λ
0.38 0.47 + 65.52
0.49 0.61 - 65.52
0.92 0.41 - 0
0.74 0.89 - 0
0.18 0.58 + 0
0.41 0.35 + 0
0.93 0.81 - 0
0.21 0.10 + 0
Thus, the MMH is −6.64x1 − 9.32x2 + 7.93 = 0 (also see Fig. 6).
Suppose the test data is X = (0.5, 0.5). Therefore,
δ(X) = W·X + b
     = −6.64 × 0.5 − 9.32 × 0.5 + 7.93
     = −0.05,
which is negative.
This implies that the test data falls on or below the MMH, and the
SVM classifies X as belonging to class label -.
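The same computation can be checked in a few lines. This sketch uses the rounded values from Table 1, so the recomputed W, b, and δ(X) differ slightly from the quoted −6.64, −9.32, 7.93, and −0.05, but the sign (and hence the predicted class) is the same.

import numpy as np

X = np.array([[0.38, 0.47], [0.49, 0.61]])     # the two support vectors from Table 1
y = np.array([+1.0, -1.0])
lam = np.array([65.52, 65.52])                 # their Lagrange multipliers

w = (lam * y) @ X                              # w_j = sum_i lambda_i y_i x_ij
b = np.mean(y - X @ w)                         # from y_i (W·X_i + b) = 1 on support vectors

x_test = np.array([0.5, 0.5])
delta = w @ x_test + b                         # delta(X) = W·X + b
print(np.round(w, 2), round(b, 2), round(delta, 2),
      "class -" if delta < 0 else "class +")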
Dataset with noise
OVERFITTING!
Soft Margin Classification
Slack variables ξi can be added to allow
misclassification of difficult or noisy examples.
Hard Margin vs. Soft Margin
■ The old (hard margin) formulation:
Find w and b such that
Φ(w) = ½ wᵀw is minimized and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1
■ The new (soft margin) formulation with slack variables:
Find w and b such that
Φ(w) = ½ wᵀw + C Σi ξi is minimized and for all {(xi, yi)}:
yi (wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0
■ Learned model (in both cases):
f(x) = Σi αi yi xiᵀx + b = wᵀx + b
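A rough illustration of the role of C in scikit-learn's SVC (the blob data, the mislabeled point, and the two C values below are arbitrary choices): a very large C approximates the hard margin, while a small C buys slack ξi and yields a wider margin.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
y[0] = 1                                        # one mislabeled (noisy) example

hard = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C: (almost) hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)    # small C: slack is cheap, margin widens

for name, clf in [("C = 1e6", hard), ("C = 0.1", soft)]:
    w = clf.coef_[0]
    print(name, "margin width 2/||w|| =", round(2 / np.linalg.norm(w), 3))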
Non-linear SVMs
■ Datasets that are linearly separable with some noise work out
great:
[Figure: 1-D examples on the x axis, and the same data after the mapping x → x²]
Non-Linear SVM
To understand this, note that a linear hyperplane is expressed as a
linear equation in the n-dimensional components, whereas a non-linear
hypersurface is a non-linear expression.
Φ: x → φ(x)
Concept of Non-Linear Mapping
Example: Non-linear mapping to linear SVM
The figure below shows an example of a 2-D data set consisting of
class label +1 (shown as +) and class label -1 (shown as -), where the
label is +1 if √((x1 − 0.5)² + (x2 − 0.5)²) > 0.2 and -1 otherwise.
We map the data from R² to R³:
R² ∋ X = (x1, x2)
R³ ∋ Z = (z1, z2, z3)
φ(X) = Z
so that the decision boundary takes the form w1 z1 + w2 z2 + w3 z3 + b = 0.
This is clearly a linear form in 3-D space. In other words,
W·X + b = 0 in R² has a mapped equivalent W·Z + b′ = 0 in R³.
This means that data which are not linearly separable in 2-D become
linearly separable in 3-D; that is, non-linear data can be classified by a
linear SVM classifier.
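A small sketch of this idea on synthetic data (here the circular class boundary is centered at the origin, unlike the (0.5, 0.5) example above, so that the three-component quadratic map (x1², √2·x1·x2, x2²) is sufficient):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 0.4, 1, -1)   # circular boundary in R²

Z = np.column_stack([X[:, 0] ** 2,                       # z1 = x1²
                     np.sqrt(2) * X[:, 0] * X[:, 1],     # z2 = √2·x1·x2
                     X[:, 1] ** 2])                      # z3 = x2²

print(SVC(kernel="linear", C=10).fit(X, y).score(X, y))  # well below 1: no line works in R²
print(SVC(kernel="linear", C=10).fit(Z, y).score(Z, y))  # ≈ 1.0: linear in the mapped space R³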
Classifier:
δ(x) = Σi=1..n λi yi (xi·x) + b
δ(z) = Σi=1..n λi yi φ(xi)·φ(x) + b
Learning:
Maximize Σi=1..n λi − ½ Σi,j λi λj yi yj (xi·xj)
Maximize Σi=1..n λi − ½ Σi,j λi λj yi yj φ(xi)·φ(xj)
Subject to: λi ≥ 0, Σi λi yi = 0
Similarly, φ(Xi) = [xi1², √2·xi1·xi2, xi2²] and
φ(Xj) = [xj1², √2·xj1·xj2, xj2²] are the transformed versions of Xi and
Xj in R³. Their dot product is
φ(Xi)·φ(Xj) = xi1²xj1² + 2·xi1·xi2·xj1·xj2 + xi2²xj2²
            = ([xi1, xi2]·[xj1, xj2])²
            = (Xi·Xj)²
Computational efficiency:
Another important benefit is easy and efficient computation.
Without the kernel trick we would have to transform every tuple into the
higher-dimensional space and take dot products there; with the kernel trick
we evaluate K(Xi, Xj) = (Xi·Xj)² directly in the original space, with fewer
dot products.
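A quick numerical check of this identity (the explicit mapping and the two test vectors are just examples):

import numpy as np

def phi(x):                                    # explicit quadratic mapping R² -> R³
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

Xi, Xj = np.array([0.38, 0.47]), np.array([0.49, 0.61])
lhs = phi(Xi) @ phi(Xj)                        # dot product after the mapping
rhs = (Xi @ Xj) ** 2                           # kernel evaluated in the original space
print(lhs, rhs, np.isclose(lhs, rhs))          # identical up to floating point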
■ Mahalanobis kernel:
K(X, Y) = e^(−(X − Y)ᵀ A (X − Y))
Used when statistical information about the data is known
(see the sketch after this list).
■ Feature Selection
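As an aside, a kernel of this form can be passed to scikit-learn's SVC as a custom (callable) kernel; the data and the matrix A below are hypothetical, and A should be positive semi-definite for the kernel to be valid.

import numpy as np
from sklearn.svm import SVC

A = np.array([[2.0, 0.0],                      # assumed positive-definite matrix A
              [0.0, 0.5]])

def mahalanobis_kernel(X, Y):
    # Gram matrix: K[i, j] = exp(-(x_i - y_j)ᵀ A (x_i - y_j))
    diff = X[:, None, :] - Y[None, :, :]
    return np.exp(-np.einsum("ijk,kl,ijl->ij", diff, A, diff))

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (25, 2)), rng.normal(1, 1, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

clf = SVC(kernel=mahalanobis_kernel).fit(X, y)
print(clf.score(X, y))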
Weakness of SVM
■ It is sensitive to noise:
- A relatively small number of mislabeled examples can
dramatically decrease performance.
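A small synthetic experiment of the kind this refers to (the data, C, and the 5% flip rate are arbitrary choices): flipping a handful of training labels can noticeably shift the learned boundary, and the effect grows with C (a harder margin).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
def make(n):                                         # one Gaussian blob per class
    X = np.vstack([rng.normal(-1.5, 1, (n, 2)), rng.normal(1.5, 1, (n, 2))])
    return X, np.array([-1] * n + [1] * n)

X_tr, y_tr = make(100)
X_te, y_te = make(500)

clean = SVC(kernel="linear", C=100).fit(X_tr, y_tr).score(X_te, y_te)

y_noisy = y_tr.copy()
flip = rng.choice(len(y_tr), size=10, replace=False) # mislabel 5% of the training set
y_noisy[flip] *= -1
noisy = SVC(kernel="linear", C=100).fit(X_tr, y_noisy).score(X_te, y_te)

print(clean, noisy)                                  # test accuracy typically drops with noise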