SVM Overview
However, there are many problems, such as XOR, which are not linearly separable.
2. SVM Approach
SVM uses linear models to implement nonlinear class boundaries. It transforms the input space
into a new, higher-dimensional feature space F using a nonlinear mapping. A linear model
constructed in the new space can then represent a nonlinear decision boundary in the original space:
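For example, here is a minimal sketch (assuming NumPy is available) showing that XOR with +/-1
encoding becomes linearly separable once the product of the two attributes is added as a third
coordinate:

    import numpy as np

    # XOR with +/-1 encoding: the label is the negative product of the inputs.
    X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
    y = np.array([-1, 1, 1, -1])

    # Nonlinear mapping phi(a1, a2) = (a1, a2, a1*a2): the added coordinate
    # a1*a2 equals -y here, so a linear rule in the new space separates XOR.
    Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
    pred = np.sign(-Z[:, 2])
    print(pred)  # [-1.  1.  1. -1.] -- matches y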
The instances that are closest to the maximum margin hyperplane are called support vectors. There is at
least one support vector for each class, and often there are more.
The set of support vectors uniquely defines the maximum margin hyperplane for the learning
problem. All other training instances are irrelevant; they can be removed without changing the
position and orientation of the hyperplane.
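A minimal sketch (assuming scikit-learn; the toy data is hypothetical) that fits a linear SVM and
inspects which training instances end up as support vectors:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
    print(clf.support_vectors_)  # the instances closest to the hyperplane: [1., 1.] and [3., 3.]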
b. The equation of the maximum margin hyperplane
In general, a hyperplane separating the two classes in the two-attribute case can be written as

f(x) = w_0 + w_1 a_1 + w_2 a_2

where a_1 and a_2 are the attribute values, and w_0, w_1, w_2 are weights.
The equation for the maximum margin hyperplane can also be written in terms of the support
vectors:

f(x) = b + \sum_{i=1}^{l} \alpha_i y_i \langle x_i, x \rangle

where the x_i are the l support vectors, y_i \in \{-1, +1\} is the class label of x_i,
\langle x_i, x \rangle denotes the dot product of x_i with the instance x being classified, and
b and the coefficients \alpha_i are numeric parameters determined by the learning algorithm.
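A minimal sketch (assuming scikit-learn and the model clf trained above) that evaluates this sum
directly; in scikit-learn, dual_coef_ stores the products alpha_i * y_i and intercept_ stores b:

    import numpy as np

    x_new = np.array([2.5, 2.0])
    dots = clf.support_vectors_ @ x_new              # <x_i, x> for each support vector
    f = clf.intercept_[0] + clf.dual_coef_[0] @ dots

    # Should agree with the library's own decision function.
    print(f, clf.decision_function([x_new])[0])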
4. Kernel Functions
According to the equation of the maximum margin hyperplane above, every time an instance is
classified its dot product with all support vectors must be computed. In the high-dimensional
space that results when a nonlinear transformation is applied to the instances, the number of
attributes can become huge, and computing the dot products becomes very expensive. The same
problem occurs during training. Working in such a high-dimensional space becomes intractable
very quickly.
With the transformation \Phi applied, the decision function becomes

f(x) = b + \sum_{i=1}^{l} \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle    (1)

The key observation (the "kernel trick") is that a kernel function K can compute this dot product
directly from the original attribute vectors, K(x_i, x) = \langle \Phi(x_i), \Phi(x) \rangle, so
the mapping \Phi never has to be carried out explicitly:

f(x) = b + \sum_{i=1}^{l} \alpha_i y_i K(x_i, x)    (2)
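A minimal sketch of equation (2); the names (decision, sv, alpha_y, b, kernel) are illustrative,
not a library API:

    import numpy as np

    def decision(x, sv, alpha_y, b, kernel):
        # f(x) = b + sum_i (alpha_i * y_i) * K(x_i, x) over the support vectors sv
        return b + sum(ay * kernel(xi, x) for xi, ay in zip(sv, alpha_y))

    # Any kernel can be plugged in, e.g. a degree-2 polynomial kernel:
    poly2 = lambda u, v: np.dot(u, v) ** 2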
Polynomial Kernel
The function \langle x, z \rangle^n represents a polynomial function of degree n applied to the
dot product of two vectors x and z. In the case when n = 2 and the number of attributes is 2,
the function becomes:
\langle x, z \rangle^2 = \left( \sum_{i=1}^{2} x_i z_i \right)^2
                       = (x_1 z_1 + x_2 z_2)^2
                       = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
                       = \langle (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2) \rangle
                       = \langle \Phi(x), \Phi(z) \rangle
Therefore, the kernel K becomes

K(x, z) = \langle x, z \rangle^2 = \langle \Phi(x), \Phi(z) \rangle, where \Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2).
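A quick numeric check of this identity (a sketch assuming NumPy):

    import numpy as np

    def phi(v):
        # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
        return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 4.0])

    print(np.dot(x, z) ** 2)       # 121.0 -- kernel in the original space
    print(np.dot(phi(x), phi(z)))  # 121.0 -- dot product in the mapped space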
Other Kernels
In addition to the polynomial kernel, RBF (radial basis function) and sigmoid kernels are often
used. Which kernel works best depends on the specific data and application, and is usually
decided empirically:
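A minimal sketch (assuming scikit-learn; the dataset is only for illustration) of trying the
kernels mentioned above on the same data:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    for k in ("poly", "rbf", "sigmoid"):
        clf = SVC(kernel=k).fit(X, y)
        print(k, clf.score(X, y))  # training accuracy per kernel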
Advantages:
1. Produces very accurate classifiers.
2. Less prone to overfitting; robust to noise.
Disadvantages:
1. SVM is inherently a binary classifier. To perform multi-class classification, several
binary classifiers must be combined, e.g. one-versus-rest (one class against all others,
for each class) or one-versus-one (a classifier per pair of classes); see the sketch
after this list.
2. Training is computationally expensive, so SVMs can be slow, especially on large datasets.
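A minimal sketch (assuming scikit-learn) of both reduction schemes mentioned in point 1:

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # one class vs. all others
    ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # one classifier per pair
    print(ovr.score(X, y), ovo.score(X, y))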