Machine Learning Notes SVM
A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning
model, capable of performing linear or nonlinear classification, regression, and even
outlier detection. It is one of the most popular models in Machine Learning, and any‐
one interested in Machine Learning should have it in their toolbox. SVMs are partic‐
ularly well suited for classification of complex but small- or medium-sized datasets.
This chapter will explain the core concepts of SVMs, how to use them, and how they
work.
Figure 5-1. Large margin classification
Notice that adding more training instances “off the street” will not affect the decision
boundary at all: it is fully determined (or “supported”) by the instances located on the
edge of the street. These instances are called the support vectors (they are circled in
Figure 5-1).
If we strictly impose that all instances be off the street and on the correct side, this is called hard margin classification. It only works if the data is linearly separable, and it is quite sensitive to outliers. To avoid these issues it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.
In Scikit-Learn’s SVM classes, you can control this balance using the C hyperparame‐
ter: a smaller C value leads to a wider street but more margin violations. Figure 5-4
shows the decision boundaries and margins of two soft margin SVM classifiers on a
nonlinearly separable dataset. On the left, using a high C value the classifier makes
fewer margin violations but ends up with a smaller margin. On the right, using a low
C value the margin is much larger, but many instances end up on the street. However,
it seems likely that the second classifier will generalize better: in fact even on this
training set it makes fewer prediction errors, since most of the margin violations are
actually on the correct side of the decision boundary.
The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM model (using the LinearSVC class with C = 1 and the hinge loss function, described shortly) to detect Iris-Virginica flowers:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
Then, as usual, you can use the model to make predictions:
>>> svm_clf.predict([[5.5, 1.7]])
array([ 1.])
Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it
is much slower, especially with large training sets, so it is not recommended. Another
option is to use the SGDClassifier class, with SGDClassifier(loss="hinge",
alpha=1/(m*C)). This applies regular Stochastic Gradient Descent (see Chapter 4) to
train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it
can be useful to handle huge datasets that do not fit in memory (out-of-core train‐
ing), or to handle online classification tasks.
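For instance, here is a minimal sketch of the SGDClassifier option, assuming X and y are the scaled iris features and labels from the previous example (the m and C names are placeholders for your training set size and your chosen C value):

from sklearn.linear_model import SGDClassifier

m, C = len(X), 1  # training set size and the C value you would have used
sgd_clf = SGDClassifier(loss="hinge", alpha=1/(m*C))
sgd_clf.fit(X, y)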
The LinearSVC class regularizes the bias term, so you should center
the training set first by subtracting its mean. This is automatic if
you scale the data using the StandardScaler. Moreover, make sure
you set the loss hyperparameter to "hinge", as it is not the default
value. Finally, for better performance you should set the dual
hyperparameter to False, unless there are more features than
training instances (we will discuss duality later in the chapter).
To implement this idea using Scikit-Learn, you can create a Pipeline containing a
PolynomialFeatures transformer (discussed in “Polynomial Regression” on page
121), followed by a StandardScaler and a LinearSVC. Let’s test this on the moons
dataset (see Figure 5-6):
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)
Polynomial Kernel
Adding polynomial features is simple to implement and can work great with all sorts
of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it
cannot deal with very complex datasets, and with a high polynomial degree it creates
a huge number of features, making the model too slow.
Fortunately, when using SVMs you can apply an almost miraculous mathematical
technique called the kernel trick (it is explained in a moment). It makes it possible to
get the same result as if you added many polynomial features, even with very high-
degree polynomials, without actually having to add them. So there is no combinato‐
rial explosion of the number of features since you don’t actually add any features. This
trick is implemented by the SVC class. Let’s test it on the moons dataset:
from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)
This code trains an SVM classifier using a 3rd-degree polynomial kernel. It is represented on the left of Figure 5-7. On the right is another SVM classifier using a 10th-degree polynomial kernel. Obviously, if your model is overfitting, you might want to reduce the polynomial degree; conversely, if it is underfitting, you can try increasing it. The coef0 hyperparameter controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
Another technique to tackle nonlinear problems is to add features computed using a similarity function, which measures how much each instance resembles a particular landmark. For example, take a one-dimensional dataset with two landmarks at x₁ = –2 and x₁ = 1, and define the similarity function as the Gaussian Radial Basis Function (RBF) with γ = 0.3: ϕγ(x, ℓ) = exp(–γ∥x – ℓ∥²). It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark). Now we are ready to compute the new features. For example, let's look at the instance x₁ = –1: it is located at a distance of 1 from the first landmark, and 2 from the second landmark. Therefore its new features are x₂ = exp(–0.3 × 1²) ≈ 0.74 and x₃ = exp(–0.3 × 2²) ≈ 0.30. The plot on the right of Figure 5-8 shows the transformed dataset (dropping the original features). As you can see, it is now linearly separable.
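A minimal NumPy sketch of this computation (the landmark locations and the γ value come from the example above):

import numpy as np

def gaussian_rbf(x, landmark, gamma=0.3):
    # Similarity: exp(-gamma * ||x - landmark||^2)
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

x = np.array([-1.0])
landmarks = [np.array([-2.0]), np.array([1.0])]
features = [gaussian_rbf(x, l) for l in landmarks]
print(features)  # approximately [0.74, 0.30]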
You may wonder how to select the landmarks. The simplest approach is to create a
landmark at the location of each and every instance in the dataset. This creates many
dimensions and thus increases the chances that the transformed training set will be
linearly separable. The downside is that a training set with m instances and n features
gets transformed into a training set with m instances and m features (assuming you
drop the original features). If your training set is very large, you end up with an
equally large number of features.
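Just like with polynomial features, the kernel trick lets you obtain a similar result without actually adding any similarity features. Here is a minimal sketch using the SVC class with the Gaussian RBF kernel on the moons dataset from earlier (the gamma and C values are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)

Increasing gamma makes the bell-shaped curve narrower, so each instance's range of influence shrinks and the decision boundary becomes more irregular; it therefore acts like a regularization hyperparameter (reduce it if the model is overfitting).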
Other kernels exist but are used much more rarely. For example, some kernels are
specialized for specific data structures. String kernels are sometimes used when classi‐
fying text documents or DNA sequences (e.g., using the string subsequence kernel or
kernels based on the Levenshtein distance).
With so many kernels to choose from, how can you decide which
one to use? As a rule of thumb, you should always try the linear
kernel first (remember that LinearSVC is much faster than SVC(ker
nel="linear")), especially if the training set is very large or if it
has plenty of features. If the training set is not too large, you should
try the Gaussian RBF kernel as well; it works well in most cases.
Then if you have spare time and computing power, you can also
experiment with a few other kernels using cross-validation and grid
search, especially if there are kernels specialized for your training
set’s data structure.
Computational Complexity
The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs.1 It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n).
1 “A Dual Coordinate Descent Method for Large-scale Linear SVM,” Lin et al. (2008).
SVM Regression
As we mentioned earlier, the SVM algorithm is quite versatile: not only does it sup‐
port linear and nonlinear classification, but it also supports linear and nonlinear
regression. The trick is to reverse the objective: instead of trying to fit the largest pos‐
sible street between two classes while limiting margin violations, SVM Regression
tries to fit as many instances as possible on the street while limiting margin violations
(i.e., instances off the street). The width of the street is controlled by a hyperparame‐
ter ϵ. Figure 5-10 shows two linear SVM Regression models trained on some random
linear data, one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ =
0.5).
Adding more training instances within the margin does not affect the model’s predic‐
tions; thus, the model is said to be ϵ-insensitive.
You can use Scikit-Learn’s LinearSVR class to perform linear SVM Regression. The
following code produces the model represented on the left of Figure 5-10 (the train‐
ing data should be scaled and centered first):
import numpy as np
from sklearn.svm import LinearSVR

X = 2 * np.random.rand(50, 1)  # random linear data (illustrative)
y = (4 + 3 * X + np.random.randn(50, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
To tackle nonlinear regression tasks, you can use a kernelized SVM model. For exam‐
ple, Figure 5-11 shows SVM Regression on a random quadratic training set, using a
2nd-degree polynomial kernel. There is little regularization on the left plot (i.e., a large
C value), and much more regularization on the right plot (i.e., a small C value).
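A minimal sketch of the model on the left of Figure 5-11, assuming X and y hold a quadratic training set:

from sklearn.svm import SVR

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)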
SVMs can also be used for outlier detection; see Scikit-Learn’s doc‐
umentation for more details.
The dashed lines represent the points where the decision function is equal to 1 or –1:
they are parallel and at equal distance to the decision boundary, forming a margin
around it. Training a linear SVM classifier means finding the value of w and b that
make this margin as wide as possible while avoiding margin violations (hard margin)
or limiting them (soft margin).
Training Objective
Consider the slope of the decision function: it is equal to the norm of the weight vec‐
tor, ∥ w ∥. If we divide this slope by 2, the points where the decision function is equal
to ±1 are going to be twice as far away from the decision boundary. In other words,
dividing the slope by 2 will multiply the margin by 2. Perhaps this is easier to visual‐
ize in 2D in Figure 5-13. The smaller the weight vector w, the larger the margin.
3 More generally, when there are n features, the decision function is an n-dimensional hyperplane, and the deci‐
sion boundary is an (n – 1)-dimensional hyperplane.
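We therefore want to minimize ∥w∥ to get a large margin, while requiring every instance to be off the street and on the correct side (defining t⁽ⁱ⁾ = –1 for negative instances and t⁽ⁱ⁾ = 1 for positive instances). This gives the hard margin linear SVM classifier objective:

Equation 5-3. Hard margin linear SVM classifier objective

$$
\underset{\mathbf{w},\,b}{\text{minimize}} \quad \frac{1}{2}\,\mathbf{w}^\mathsf{T}\mathbf{w}
\qquad \text{subject to} \quad t^{(i)}\!\left(\mathbf{w}^\mathsf{T}\mathbf{x}^{(i)} + b\right) \ge 1 \quad \text{for } i = 1, 2, \cdots, m
$$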
We are minimizing ½ wᵀ·w, which is equal to ½∥w∥², rather than minimizing ∥w∥. This is because it will give the same result (since the values of w and b that minimize a value also minimize half of its square), but ½∥w∥² has a nice and simple derivative (it is just w), while ∥w∥ is not differentiable at w = 0. Optimization algorithms work much better on differentiable functions.
To get the soft margin objective, we need to introduce a slack variable ζ⁽ⁱ⁾ ≥ 0 for each instance: ζ⁽ⁱ⁾ measures how much the iᵗʰ instance is allowed to violate the margin. We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin violations, and making ½ wᵀ·w as small as possible to increase the margin. This is where the C hyperparameter comes in: it allows us to define the trade-off between these two objectives. This gives us the constrained optimization problem in Equation 5-4.
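Equation 5-4. Soft margin linear SVM classifier objective

$$
\underset{\mathbf{w},\,b,\,\zeta}{\text{minimize}} \quad \frac{1}{2}\,\mathbf{w}^\mathsf{T}\mathbf{w} + C \sum_{i=1}^{m} \zeta^{(i)}
\qquad \text{subject to} \quad t^{(i)}\!\left(\mathbf{w}^\mathsf{T}\mathbf{x}^{(i)} + b\right) \ge 1 - \zeta^{(i)} \;\text{ and }\; \zeta^{(i)} \ge 0 \quad \text{for } i = 1, 2, \cdots, m
$$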
Quadratic Programming
The hard margin and soft margin problems are both convex quadratic optimization
problems with linear constraints. Such problems are known as Quadratic Program‐
ming (QP) problems. Many off-the-shelf solvers are available to solve QP problems
using a variety of techniques that are outside the scope of this book.5 The general
problem formulation is given by Equation 5-5.
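Equation 5-5. Quadratic Programming problem

$$
\underset{\mathbf{p}}{\text{minimize}} \quad \frac{1}{2}\,\mathbf{p}^\mathsf{T}\mathbf{H}\,\mathbf{p} + \mathbf{f}^\mathsf{T}\mathbf{p}
\qquad \text{subject to} \quad \mathbf{A}\,\mathbf{p} \le \mathbf{b}
$$

where p is an n_p-dimensional vector (n_p = number of parameters), H is an n_p × n_p matrix, f is an n_p-dimensional vector, A is an n_c × n_p matrix (n_c = number of constraints), and b is an n_c-dimensional vector.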
Note that the expression A·p ≤ b actually defines n_c constraints: pᵀ·a⁽ⁱ⁾ ≤ b⁽ⁱ⁾ for i = 1, 2, ⋯, n_c, where a⁽ⁱ⁾ is the vector containing the elements of the iᵗʰ row of A and b⁽ⁱ⁾ is the iᵗʰ element of b.
You can easily verify that if you set the QP parameters in the following way, you get
the hard margin linear SVM classifier objective:
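• n_p = n + 1, where n is the number of features (the +1 is for the bias term).
• n_c = m, the number of training instances.
• H is the n_p × n_p identity matrix, except with a zero in the top-left cell (to ignore the bias term).
• f = 0, an n_p-dimensional vector full of 0s.
• b = –1, an n_c-dimensional vector full of –1s.
• a⁽ⁱ⁾ = –t⁽ⁱ⁾ ẋ⁽ⁱ⁾, where ẋ⁽ⁱ⁾ is equal to x⁽ⁱ⁾ with an extra bias feature ẋ₀ = 1.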
5 To learn more about Quadratic Programming, you can start by reading Stephen Boyd and Lieven Vanden‐
berghe, Convex Optimization (Cambridge, UK: Cambridge University Press, 2004) or watch Richard Brown’s
series of video lectures.
So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf QP solver by passing it the preceding parameters. The resulting vector p will contain the bias term b = p₀ and the feature weights wᵢ = pᵢ for i = 1, 2, ⋯, n. Similarly, you can use a QP solver to solve the soft margin problem (see the exercises at the end of the chapter).
However, to use the kernel trick we are going to look at a different constrained opti‐
mization problem.
Equation 5-6. Dual form of the linear SVM objective

$$
\underset{\alpha}{\text{minimize}} \quad \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha^{(i)}\alpha^{(j)}\, t^{(i)} t^{(j)}\, \mathbf{x}^{(i)\mathsf{T}}\mathbf{x}^{(j)} \;-\; \sum_{i=1}^{m} \alpha^{(i)}
\qquad \text{subject to} \quad \alpha^{(i)} \ge 0 \quad \text{for } i = 1, 2, \cdots, m
$$
6 The objective function is convex, and the inequality constraints are continuously differentiable and convex
functions.
Once you find the vector α̂ that minimizes this equation (using a QP solver), you can compute the ŵ and b̂ that minimize the primal problem by using Equation 5-7 (where n_s is the number of support vectors, i.e., the instances with α̂⁽ⁱ⁾ > 0):

Equation 5-7. From the dual solution to the primal solution

$$
\hat{\mathbf{w}} = \sum_{i=1}^{m} \hat{\alpha}^{(i)} t^{(i)} \mathbf{x}^{(i)}
\qquad
\hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left(1 - t^{(i)}\, \hat{\mathbf{w}}^\mathsf{T}\mathbf{x}^{(i)}\right)
$$
The dual problem is faster to solve than the primal when the number of training
instances is smaller than the number of features. More importantly, it makes the ker‐
nel trick possible, while the primal does not. So what is this kernel trick anyway?
Kernelized SVM
Suppose you want to apply a 2nd-degree polynomial transformation to a two-
dimensional training set (such as the moons training set), then train a linear SVM
classifier on the transformed training set. Equation 5-8 shows the 2nd-degree polyno‐
mial mapping function ϕ that you want to apply.
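Equation 5-8. Second-degree polynomial mapping

$$
\phi(\mathbf{x}) = \phi\!\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}
$$

Notice that the transformed vector is three-dimensional instead of two-dimensional. Now let's see what happens to the dot product of two 2D vectors, a and b, if we apply this mapping and compute the dot product of the transformed vectors:

$$
\phi(\mathbf{a})^\mathsf{T}\phi(\mathbf{b}) = a_1^2 b_1^2 + 2\, a_1 b_1 a_2 b_2 + a_2^2 b_2^2 = \left(a_1 b_1 + a_2 b_2\right)^2 = \left(\mathbf{a}^\mathsf{T}\mathbf{b}\right)^2
$$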
How about that? The dot product of the transformed vectors is equal to the square of the dot product of the original vectors: ϕ(a)ᵀ·ϕ(b) = (aᵀ·b)².
Now here is the key insight: if you apply the transformation ϕ to all training instan‐
ces, then the dual problem (see Equation 5-6) will contain the dot product ϕ(x(i))T ·
ϕ(x(j)). But if ϕ is the 2nd-degree polynomial transformation defined in Equation 5-8,
then you can replace this dot product of transformed vectors simply by (x⁽ⁱ⁾ᵀ·x⁽ʲ⁾)².
So you don’t actually need to transform the training instances at all: just replace the
dot product by its square in Equation 5-6. The result will be strictly the same as if you
went through the trouble of actually transforming the training set then fitting a linear
SVM algorithm, but this trick makes the whole process much more computationally
efficient. This is the essence of the kernel trick.
The function K(a, b) = (aᵀ·b)² is called a 2nd-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product ϕ(a)ᵀ·ϕ(b)
based only on the original vectors a and b, without having to compute (or even to
know about) the transformation ϕ. Equation 5-10 lists some of the most commonly
used kernels.
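Equation 5-10. Common kernels

$$
\begin{aligned}
\text{Linear:} \quad & K(\mathbf{a}, \mathbf{b}) = \mathbf{a}^\mathsf{T}\mathbf{b} \\
\text{Polynomial:} \quad & K(\mathbf{a}, \mathbf{b}) = \left(\gamma\,\mathbf{a}^\mathsf{T}\mathbf{b} + r\right)^{d} \\
\text{Gaussian RBF:} \quad & K(\mathbf{a}, \mathbf{b}) = \exp\!\left(-\gamma \left\lVert \mathbf{a} - \mathbf{b} \right\rVert^{2}\right) \\
\text{Sigmoid:} \quad & K(\mathbf{a}, \mathbf{b}) = \tanh\!\left(\gamma\,\mathbf{a}^\mathsf{T}\mathbf{b} + r\right)
\end{aligned}
$$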
There is still one loose end we must tie. Equation 5-7 shows how to go from the dual solution to the primal solution in the case of a linear SVM classifier, but if you apply the kernel trick you end up with equations that include ϕ(x⁽ⁱ⁾). In fact, ŵ must have the same number of dimensions as ϕ(x⁽ⁱ⁾), which may be huge or even infinite, so you can't compute it. But how can you make predictions without knowing ŵ? Well, the good news is that you can plug the formula for ŵ from Equation 5-7 into the decision function for a new instance x⁽ⁿ⁾, and you get an equation with only dot products between input vectors. This makes it possible to use the kernel trick, once again (Equation 5-11).
Equation 5-11. Making predictions with a kernelized SVM

$$
h_{\hat{\mathbf{w}},\hat{b}}\!\left(\phi(\mathbf{x}^{(n)})\right)
= \hat{\mathbf{w}}^\mathsf{T}\phi(\mathbf{x}^{(n)}) + \hat{b}
= \sum_{i=1}^{m} \hat{\alpha}^{(i)} t^{(i)}\, \phi(\mathbf{x}^{(i)})^\mathsf{T}\phi(\mathbf{x}^{(n)}) + \hat{b}
= \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \hat{\alpha}^{(i)} t^{(i)}\, K\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(n)}\right) + \hat{b}
$$
Note that since α̂⁽ⁱ⁾ ≠ 0 only for support vectors, making predictions involves computing the dot product of the new input vector x⁽ⁿ⁾ with only the support vectors, not all the training instances. Of course, you also need to compute the bias term b̂ using the same trick (Equation 5-12).
Equation 5-12. Computing the bias term using the kernel trick

$$
\hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( 1 - t^{(i)} \sum_{\substack{j=1 \\ \hat{\alpha}^{(j)} > 0}}^{m} \hat{\alpha}^{(j)} t^{(j)}\, K\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) \right)
$$
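To make these two equations concrete, here is a minimal NumPy sketch, assuming alphas holds the dual solution, t the targets in {–1, 1}, and X the training instances (all names are mine, not a library API):

import numpy as np

def rbf_kernel(a, b, gamma=0.3):
    # Gaussian RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision_function(x_new, alphas, t, X, b, kernel=rbf_kernel):
    # Equation 5-11: sum only over the support vectors (alpha > 0)
    support = np.where(alphas > 0)[0]
    return sum(alphas[i] * t[i] * kernel(X[i], x_new) for i in support) + b

def bias_term(alphas, t, X, kernel=rbf_kernel):
    # Equation 5-12: average the margin condition over the support vectors
    support = np.where(alphas > 0)[0]
    return np.mean([1 - t[i] * sum(alphas[j] * t[j] * kernel(X[i], X[j])
                                   for j in support)
                    for i in support])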
If you are starting to get a headache, it's perfectly normal: it's an unfortunate side effect of the kernel trick.
Online SVMs
Before concluding this chapter, let’s take a quick look at online SVM classifiers (recall
that online learning means learning incrementally, typically as new instances arrive).
For linear SVM classifiers, one method is to use Gradient Descent (e.g., using
SGDClassifier) to minimize the cost function in Equation 5-13, which is derived
from the primal problem. Unfortunately it converges much more slowly than the
methods based on QP.
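Equation 5-13. Linear SVM classifier cost function

$$
J(\mathbf{w}, b) = \frac{1}{2}\,\mathbf{w}^\mathsf{T}\mathbf{w} \;+\; C \sum_{i=1}^{m} \max\!\left(0,\; 1 - t^{(i)}\!\left(\mathbf{w}^\mathsf{T}\mathbf{x}^{(i)} + b\right)\right)
$$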
The first sum in the cost function will push the model to have a small weight vector
w, leading to a larger margin. The second sum computes the total of all margin viola‐
tions. An instance’s margin violation is equal to 0 if it is located off the street and on
the correct side, or else it is proportional to the distance to the correct side of the
street. Minimizing this term ensures that the model makes the margin violations as small and as few as possible.
Hinge Loss
The function max(0, 1 – t) is called the hinge loss function. It is equal to 0 when t ≥ 1. Its derivative (slope) is equal to –1 if t < 1 and 0 if t > 1. It is not differentiable at t = 1, but just like for Lasso Regression (see “Lasso Regression” on page 130) you can still use Gradient Descent using any subderivative at t = 1 (i.e., any value between –1 and 0).
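A minimal sketch of the hinge loss and one valid subgradient choice (the function names are mine, not Scikit-Learn's):

import numpy as np

def hinge_loss(t):
    # max(0, 1 - t): zero once the instance is on the correct side with margin >= 1
    return np.maximum(0, 1 - t)

def hinge_subgradient(t):
    # Slope is -1 for t < 1 and 0 for t > 1; at t = 1 any value in [-1, 0] is valid (here: 0)
    return np.where(t < 1, -1.0, 0.0)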
Exercises
1. What is the fundamental idea behind Support Vector Machines?
2. What is a support vector?
3. Why is it important to scale the inputs when using SVMs?
4. Can an SVM classifier output a confidence score when it classifies an instance?
What about a probability?
5. Should you use the primal or the dual form of the SVM problem to train a model
on a training set with millions of instances and hundreds of features?
6. Say you trained an SVM classifier with an RBF kernel. It seems to underfit the
training set: should you increase or decrease γ (gamma)? What about C?
7. How should you set the QP parameters (H, f, A, and b) to solve the soft margin
linear SVM classifier problem using an off-the-shelf QP solver?
8. Train a LinearSVC on a linearly separable dataset. Then train an SVC and a
SGDClassifier on the same dataset. See if you can get them to produce roughly
the same model.
9. Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary
classifiers, you will need to use one-versus-all to classify all 10 digits. You may
want to tune the hyperparameters using small validation sets to speed up the pro‐
cess. What accuracy can you reach?
10. Train an SVM regressor on the California housing dataset.