SVM - Worked Out Example

Dan Ventura

March 12, 2009

Abstract

We try to give a helpful, simple example that demonstrates a linear SVM and then extend the example to a simple nonlinear case to illustrate the use of mapping functions and kernels.

1 Introduction

Many learning models make use of the idea that any learning problem can be made easy with the right set of features. The trick, of course, is discovering that "right set of features", which in general is a very difficult thing to do. SVMs are another attempt at a model that does this. The idea behind SVMs is to make use of a (nonlinear) mapping function Φ that transforms data in input space to data in feature space in such a way as to render a problem linearly separable. The SVM then automatically discovers the optimal separating hyperplane (which, when mapped back into input space via Φ−1, can be a complex decision surface). SVMs are rather interesting in that they enjoy both a sound theoretical basis and state-of-the-art success in real-world applications.

To illustrate the basic ideas, we will begin with a linear SVM (that is, a model that assumes the data is linearly separable). We will then expand the example to the nonlinear case to demonstrate the role of the mapping function Φ, and finally we will explain the idea of a kernel and how it allows SVMs to make use of high-dimensional feature spaces while remaining tractable.

2 Linear Example
Suppose we are given a set of positively labeled data points in ℜ2 together with a set of negatively labeled data points in ℜ2 (see Figure 1).

Figure 1: Sample data points in ℜ2. Blue diamonds are positive examples and red squares are negative examples.
We would like to discover a simple SVM that accurately discriminates the two classes. Since the data is linearly separable, we can use a linear SVM (that is, one whose mapping function Φ() is the identity function). By inspection, it should be obvious that there are three support vectors (see Figure 2): s1 = (1, 0), s2 = (3, 1), and s3 = (3, −1).
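To make the setup concrete, here is a small Python sketch. The full training sets are shown only in Figure 1, so the coordinates below are illustrative assumptions consistent with that figure's description; the three support vectors are the ones just identified.

import numpy as np

# Assumed training data (the exact coordinates appear only in Figure 1;
# these points are illustrative stand-ins consistent with its description).
positive = np.array([[3, 1], [3, -1], [6, 1], [6, -1]], dtype=float)
negative = np.array([[1, 0], [0, 1], [0, -1], [-1, 0]], dtype=float)

# The three support vectors identified above: s1, s2, s3.
support_vectors = np.array([[1, 0], [3, 1], [3, -1]], dtype=float)

# A quick sanity check that the two classes are linearly separable:
# every assumed positive point lies to the right of every assumed negative point.
print(positive[:, 0].min() > negative[:, 0].max())   # True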
In what follows we will use vectors augmented with a 1 as a bias input, and for clarity we will differentiate these with an over-tilde. So, if s1 = (1, 0), then s̃1 = (1, 0, 1). Figure 3 shows the SVM architecture, and our task is to find values for the αi such that

α1Φ(s1) · Φ(s1) + α2Φ(s2) · Φ(s1) + α3Φ(s3) · Φ(s1) = −1
α1Φ(s1) · Φ(s2) + α2Φ(s2) · Φ(s2) + α3Φ(s3) · Φ(s2) = +1
α1Φ(s1) · Φ(s3) + α2Φ(s2) · Φ(s3) + α3Φ(s3) · Φ(s3) = +1

Figure 3: The SVM architecture.

Since for now we have let Φ() = I, this reduces to

α1 s̃1 · s̃1 + α2 s̃2 · s̃1 + α3 s̃3 · s̃1 = −1
α1 s̃1 · s̃2 + α2 s̃2 · s̃2 + α3 s̃3 · s̃2 = +1
α1 s̃1 · s̃3 + α2 s̃2 · s̃3 + α3 s̃3 · s̃3 = +1

Now, computing the dot products results in
2α1 + 4α2 + 4α3 = −1
4α1 + 11α2 + 9α3 = +1
4α1 + 9α2 + 11α3 = +1
A little algebra reveals that the solution to this system of equations is α1 = −3.5, α2 = 0.75, and α3 = 0.75.
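As a quick numerical check (not part of the original derivation), we can solve this 3 × 3 system directly; its matrix entries are simply the pairwise dot products of the augmented support vectors.

import numpy as np

# Augmented support vectors s~1 = (1,0,1), s~2 = (3,1,1), s~3 = (3,-1,1).
S = np.array([[1, 0, 1],
              [3, 1, 1],
              [3, -1, 1]], dtype=float)

K = S @ S.T                       # pairwise dot products: [[2,4,4],[4,11,9],[4,9,11]]
y = np.array([-1.0, +1.0, +1.0])  # desired outputs for the three support vectors

alpha = np.linalg.solve(K, y)
print(alpha)                      # [-3.5, 0.75, 0.75]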
Now, we can look at how these α values relate to the discriminating hyperplane; or, in other words, now that we have the αi, how do we find the hyperplane that discriminates the positive from the negative examples? It turns out that

w̃ = Σi αi s̃i
   = −3.5 (1, 0, 1) + 0.75 (3, 1, 1) + 0.75 (3, −1, 1)
   = (1, 0, −2)

Finally, remembering that our vectors are augmented with a bias, we can equate the last entry in w̃ with the hyperplane offset b and write the separating hyperplane equation y = wx + b with w = (1, 0) and b = −2. Plotting this line gives the expected decision surface (see Figure 4).

Figure 4: The discriminating hyperplane corresponding to the values α1 = −3.5, α2 = 0.75 and α3 = 0.75.
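Continuing the same sketch, the weight vector and the resulting decision function take only a few lines; the two test points at the end are arbitrary illustrations, not taken from the text.

import numpy as np

S = np.array([[1, 0, 1],       # augmented support vectors
              [3, 1, 1],
              [3, -1, 1]], dtype=float)
alpha = np.array([-3.5, 0.75, 0.75])

w_tilde = alpha @ S            # -> [ 1.  0. -2.]
w, b = w_tilde[:2], w_tilde[2]

def f(x):
    # Linear SVM decision function: the sign of w . x + b.
    return int(np.sign(w @ np.asarray(x, dtype=float) + b))

print(w, b)                    # [1. 0.] -2.0  (the hyperplane x1 = 2)
print(f((4, 1)), f((0, 0)))    # arbitrary test points: 1, -1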
3 Nonlinear Example

Now suppose instead that we are given positively labeled and negatively labeled data points in ℜ2 that are not linearly separable (see Figure 5).

Figure 5: Nonlinearly separable sample data points in ℜ2. Blue diamonds are positive examples and red squares are negative examples.

Our goal, again, is to discover a separating hyperplane that accurately discriminates the two classes. Of course, it is obvious that no such hyperplane exists in the input space (that is, in the space in which the original input data live). Therefore, we must use a nonlinear SVM (that is, one whose mapping function Φ is a nonlinear mapping from input space into some feature space). Define

Φ1(x1, x2) =  …   if …
              …   otherwise                                                    (1)

Referring back to Figure 3, we can see how Φ transforms our data before the dot products are performed. Therefore, we can rewrite the data in feature space, for both the positive and the negative examples (see Figure 6). Now we can once again easily identify the support vectors (see Figure 7).

Figure 7: The two support vectors (in feature space) are marked as yellow circles.

We again use vectors augmented with a 1 as a bias input and will differentiate them as before. Now, given the [augmented] support vectors, we must again find values for the αi. This time our constraints are
3α1 + 5α2 = −1
5α1 + 9α2 = +1

A little algebra reveals that the solution to this system of equations is α1 = −7 and α2 = 4. Just as before,

w̃ = Σi αi s̃i

and, remembering that our vectors are augmented with a bias, the last entry of w̃ gives the hyperplane offset b while the remaining entries give the weight vector w of the discriminating hyperplane in feature space. To classify a new point x, we compute

f(x) = σ(w · Φ(x) + b)                                                    (2)

where σ(z) returns the sign of z. For example, if we wanted to classify the point x = (4, 5) using the mapping function of Eq. 1, we would map it into feature space with Φ1, evaluate Eq. 2, and thus we would classify x = (4, 5) as negative.
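The following sketch checks these numbers. The exact form of Φ1 and the coordinates of the two support vectors appear only in Eq. 1 and the figures, so both are written here as assumptions, chosen to reproduce the dot products 3, 5, and 9 used in the constraints above and the negative classification of x = (4, 5).

import numpy as np

def phi1(x):
    # Assumed form of the mapping in Eq. 1 (the exact expression is only
    # given in the original equation/figures, so treat this as a placeholder).
    x1, x2 = x
    if np.hypot(x1, x2) > 2:
        return np.array([4 - x2 + abs(x1 - x2), 4 - x1 + abs(x1 - x2)])
    return np.array([x1, x2], dtype=float)

# Assumed support vectors in feature space (chosen to reproduce the Gram
# entries 3, 5, 9 used in the constraints above).
S = np.array([[1, 1, 1],      # negative support vector, augmented
              [2, 2, 1]],     # positive support vector, augmented
             dtype=float)

K = S @ S.T                   # [[3, 5], [5, 9]]
alpha = np.linalg.solve(K, np.array([-1.0, +1.0]))
print(alpha)                  # [-7.  4.]

w_tilde = alpha @ S           # [ 1.  1. -3.]
w, b = w_tilde[:2], w_tilde[2]

x = np.array([4.0, 5.0])
print(np.sign(w @ phi1(x) + b))   # -1.0  ->  classify (4, 5) as negative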
Looking again at the input space, we might be tempted to think this is not a reasonable classification; however, it is what our model says, and our model is consistent with all the training data. As always, there are no guarantees on generalization accuracy, and if we are not happy about our generalization, the likely culprit is our choice of Φ. Indeed, if we map our discriminating hyperplane (which lives in feature space) back into input space, we can see the effective decision surface of our model (see Figure 9).

Figure 9: The decision surface in input space corresponding to Φ1. Note the singularity.

Of course, we may or may not be able to improve generalization accuracy by choosing a different Φ; however, there is another reason to revisit our choice of mapping function. Consider, instead of Φ1, an alternative mapping

Φ2(x1, x2) = ( … , … , … )                                                    (3)

which transforms our data from 2-dimensional input space to 3-dimensional feature space. Using this alternative mapping, the positive examples and the negative examples are mapped to points that differ in their third feature. With a little thought, we realize that in this case, all 8 of the examples will be support vectors, with one common α value for the positive support vectors and another for the negative ones. Note that a consequence of this mapping is that we do not need to use augmented vectors (though it wouldn't hurt to do so) because the hyperplane in feature space goes through the origin, y = wx + b, with b = 0. Therefore, the discriminating feature is x3, and Eq. 2 reduces to f(x) = σ(x3). Figure 10 shows the decision surface induced in the input space for this new mapping function.
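Since Eq. 3 itself is not reproduced above, the sketch below uses an assumed 2-D to 3-D mapping whose third coordinate is positive for one class and negative for the other; it is meant only to illustrate how, after such a mapping, the sign of the third feature alone classifies the data.

import numpy as np

def phi2(x):
    # Assumed 2-D -> 3-D mapping standing in for Eq. 3: the first two features
    # are kept, and the third is positive for points far from the origin and
    # negative for points near it.
    x1, x2 = x
    return np.array([x1, x2, (x1**2 + x2**2 - 5.0) / 3.0])

# Assumed training points (consistent with Figure 5's description of an inner
# negative cluster surrounded by positive points).
positive = [(2, 2), (2, -2), (-2, -2), (-2, 2)]
negative = [(1, 1), (1, -1), (-1, -1), (-1, 1)]

print([phi2(p)[2] for p in positive])   # [1.0, 1.0, 1.0, 1.0]
print([phi2(n)[2] for n in negative])   # [-1.0, -1.0, -1.0, -1.0]

def f(x):
    # With this mapping, the sign of the third feature alone classifies a point.
    return np.sign(phi2(x)[2])

print(f((4, 5)), f((0.5, 0.5)))         # 1.0 -1.0

Note that, under this assumed mapping, the point (4, 5) comes out positive, whereas the Φ1-based model above classified it as negative; this echoes the point that the choice of Φ is what drives generalization behavior.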
4 The Kernel Trick

Notice that in everything we have done so far, the data points themselves appear only inside dot products, both in the constraints used to find the αi and in the classification function of Eq. 2. This is what makes the kernel trick possible: if we can find a kernel function k(x, y) = Φ(x) · Φ(y) that computes the feature-space dot product directly from the input-space vectors, we never have to represent Φ(x) explicitly, and we can work with very high-dimensional feature spaces while remaining tractable.
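As a standalone illustration (not taken from this text), the classic example is the homogeneous polynomial kernel: for the feature map Φ(x1, x2) = (x1^2, √2 x1 x2, x2^2), the kernel k(x, y) = (x · y)^2 computes Φ(x) · Φ(y) without ever forming the 3-dimensional vectors.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D inputs.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, y):
    # Polynomial kernel that computes phi(x) . phi(y) directly in input space.
    return float(np.dot(x, y)) ** 2

x, y = np.array([3.0, 1.0]), np.array([1.0, 2.0])
print(phi(x) @ phi(y), k(x, y))   # both 25.0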
5 Conclusion

Many important issues have been glossed over here, including how to choose a good kernel, slack variables for data that are not separable, the theory behind SVMs and their generalization behavior, the dual form of the optimization problem, and the quadratic programming (QP) methods used to solve it.