
Chapter 4

LEARNING WITH SUPPORT VECTOR MACHINES (SVM)

4.1 INTRODUCTION

The classical regression and Bayesian classification statistical techniques (overview given in Section
3.5) are based on the very strict assumption that probability distribution models or probability
density functions are known. Unfortunately, in real-world practical situations, there is not enough
information about the underlying distributions, and distribution-free regression or classification is
needed that does not require knowledge of probability distributions. The only available information
is the training dataset.
Under the assumption that the data follows a normal distribution, statistical techniques result in
linear regression functions. For the classification problems with normally distributed classes and
equal covariance matrices for corresponding classes, we get linear discriminant functions using
statistical techniques.
The linear functions are extremely powerful for the regression/classification problems whenever
the stated assumptions are true. Unfortunately, the classical statistical paradigm turns out to be
inappropriate for many real-life problems because the underlying real-life data generation laws
may typically be very far from normal distribution.
Till the 1980s, most of the data analysis and learning methods were confined to linear statistical
techniques. Most of the optimal algorithms and theoretical results were available for inference
of linear dependencies from data; for nonlinear ones, only greedy algorithms, local minima, and
heuristic search were known.
A paradigm shift occurred in the 1980s when researchers, armed with powerful computers of
the day, boldly embraced nonlinear methods of learning. The simultaneous introduction of decision
trees (Chapter 8) and neural network algorithms (Chapter 5) revolutionized the practice of pattern
recognition and numeric prediction. These methods opened the possibility of efficient learning of
nonlinear dependencies.
A second paradigm shift occurred in the 1990s with the introduction of the ‘kernel methods’.
The differences with the previous approaches are worth mentioning. Most of the earlier learning

algorithms had, to a large extent, been based on heuristics or on loose analogies with natural
learning systems, e.g., model of nervous systems (neural networks). They were mostly the result of
creativity and extensive tuning by the designer, and the underlying reasons for their performance
were not fully understood. A large part of the work was devoted to designing heuristics to avoid
local minima in hypothesis search process.
With the emergence of computational learning theory (Section 2.3), new efficient representations
of nonlinear functions have been discovered and used for the design of learning algorithms. This
has led to the creation of powerful algorithms, whose training often amounts to optimization. In
other words, they are free from local minima. The use of optimization theory marks a radical
departure from the previous greedy search algorithms. In a way, researchers now have the power of
nonlinear function learning together with the conceptual and computational convenience that was,
hitherto, a characteristic of linear systems. Support Vector Machine (SVM), probably, represents
the best known example of this class of algorithms.
In the present state-of-the-art, no machine learning method is inherently superior to any other;
it is the type of problem, prior distribution, and other information that determine which method
should provide the best performance. If one algorithm seems to outperform another in a particular
situation, it is a consequence of its fit to the particular problem, not the general superiority of
the algorithm. Machine learning involves searching through a space of possible hypotheses to
determine one that fits the observed data and any prior knowledge held by the learner. In this book, we cover the hypothesis spaces that practitioners exploit in data mining problems.
In the current chapter, we present the basic concepts of SVM in an easily digestible way. Applied
learning algorithms for support vector classification and regression have been thoroughly explained.
The support vector machine is currently considered to be the best off-the-shelf learning algorithm
and has been applied successfully in various domains.
Support vector machines were originally designed for binary classification. Initial research efforts were directed towards combining several two-class SVMs to do multiclass classification. More recently, several single-shot multiclass classification algorithms have appeared in the literature. Our focus
in this chapter will be on binary classification; multiclass problems will be reduced to binary
classification problems.
Support vector regression is the natural extension of methods used for classification. We will
see in this chapter that SVM regression analysis retains all the properties of SVM classifiers. It has
already become a powerful technique for predictive data analysis with many applications in varied
areas of study.
In Section 3.5, we observed that the practical limitation of the statistical approach is the initial knowledge assumed to be available about the process under investigation. For the Bayes procedure, one should know the underlying probability distributions. If only the forms of the underlying distributions are known, we can use the training samples to estimate the values of their parameters.
In this chapter, we shall instead assume that we know the proper forms for the discriminant
functions, and use the samples to estimate the values of parameters of the classifier. We shall examine
various procedures for determining discriminant functions; none of them requiring knowledge of
the forms of underlying distributions.
We shall be concerned with linear discriminant functions which have a variety of pleasant
analytical properties. As we observed in Section 3.5, they can be optimal if the underlying

distributions are cooperative, such as Gaussians having equal covariance. Even when they are
not optimal, we might be willing to sacrifice some performance in order to gain the advantage of
their simplicity. Linear discriminant functions are relatively easy to compute and in the absence
of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial
classifiers.
The problem of finding a linear discriminant function will be formulated as a problem of
minimizing a criterion function. The obvious criterion function for classification purposes is
the misclassification error (refer to Section 2.8). Instead of deriving discriminants based on misclassification error, we investigate a related criterion function that is analytically more tractable.

4.2 LINEAR DISCRIMINANT FUNCTIONS FOR BINARY CLASSIFICATION

Let us assume two classes of patterns described by two-dimensional feature vectors (coordinates
x1 and x2) as shown in Fig. 4.1. Each pattern is represented by vector x = [x1 x2]T ∈ ℝ². In Fig. 4.1,
we have used circle to denote Class 1 patterns and square to denote Class 2 patterns. In general,
patterns of each class will be characterized by random distributions of the corresponding feature
vectors.
Figure 4.1 also shows a straight line separating the two classes. We can easily write the equation
of the straight line in terms of the coordinates (features) x1 and x2 using coefficients or weights w1
and w2 and a bias (offset) or threshold term w0, as given in Eqn (4.1). The weights determine the
slope of the straight line, and the bias determines the deviation from the origin of the straight line
intersections with the axes:
g(x) = w1 x1 + w2 x2 + w0 = 0 (4.1)
We say that g(x) is a linear discriminant function that divides (categorizes) ℝ² into two decision
regions.
Figure 4.1 Linear discriminant function in two-dimensional space (figure: the line g(x) = 0 separates Class 1, g(x) > 0, ŷ = +1, from Class 2, g(x) < 0, ŷ = –1; it meets the axes at x1 = –w0/w1 and x2 = –w0/w2)

The generalization of the linear discriminant function for an n-dimensional feature space in ℝⁿ is straightforward:
g(x) = wTx + w0 = 0 (4.2)
where x = [x1 x2 … xn]T is the feature vector, w = [w1 w2 … wn]T is the weight vector, and w0 is the bias parameter.
The discriminant function is now a linear n-dimensional surface, called a hyperplane; symbolized
as H in the discussion that follows.
For the discriminant function of the form of Eqn (4.2), a two-category classifier implements the
following decision rule:
Decide Class 1 if g(x) > 0 and Class 2 if g(x) < 0. (4.3)
Thus, x is assigned to Class 1 if the inner product wTx exceeds the threshold (bias) –w0, and to
Class 2 otherwise. If g(x) = 0, x can ordinarily be assigned to any class, but in this chapter, we shall
leave the assignment undefined.
Figure 4.2 shows the architecture of a typical implementation of the linear classifier. It consists
of two computational units: an aggregation unit and an output unit. The aggregation unit collects
the n weighted input signals w1 x1, w2 x2, …, wn xn and sums them together. Note that the summation
also has a bias term—a constant input x0 = 1 with a weight of w0. This sum is then passed on to the
output unit, which is a step filter that returns –1 if its input is negative and +1 if its input is positive.
In other words, the step filter implements the sign function:
$$\hat{y} = \mathrm{sgn}\!\left(\sum_{j=1}^{n} w_j x_j + w_0\right) \qquad (4.4a)$$
$$\;\;\;= \mathrm{sgn}(\mathbf{w}^T\mathbf{x} + w_0) = \mathrm{sgn}(g(\mathbf{x})) \qquad (4.4b)$$
The sgn function extracts the appropriate pattern label from the decision surface g(x). This means
that linear classifier implemented in Fig. 4.2 represents a decision function of the form,
ŷ = sgn (g(x)) = sgn (wTx + w0) (4.5)
The values ±1 at the output unit are not unique; only the change of sign matters. The pair ±1 is chosen for computational convenience.
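To make the decision rule concrete, here is a minimal sketch (Python/NumPy, not from the text) of the decision function of Eqn (4.5); the weights and the test pattern are hypothetical values.

```python
import numpy as np

def predict(x, w, w0):
    """Linear classifier of Eqn (4.5): y_hat = sgn(w^T x + w0)."""
    g = np.dot(w, x) + w0          # value of the decision surface g(x)
    return 1 if g > 0 else -1      # the case g(x) = 0 is left undefined in the text;
                                   # here it is arbitrarily mapped to -1

# hypothetical weights and a test pattern
w, w0 = np.array([1.0, 2.0]), -3.0
print(predict(np.array([2.0, 2.0]), w, w0))   # g = 3 > 0, so +1
```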
Figure 4.2 A simple linear classifier (figure: an aggregation unit sums the weighted inputs w1x1, …, wnxn together with the bias input x0 = 1 weighted by w0 to form g(x) = Σj wj xj + w0; an output unit applies the step/sign filter, giving ŷ ∈ {+1, –1})

If x(1) and x(2) are two points on the decision hyperplane, then the following is valid:

wTx(1) + w0 = wTx(2) + w0 = 0
This implies that
wT(x(1) - x(2)) = 0
The difference (x(1) - x(2)) obviously lies on the decision hyperplane for any x(1) and x(2).
The scalar product is equal to zero, meaning that the weights vector w is normal (perpendicular)
to the decision hyperplane. Without changing the normal vector w, varying w0 moves the
hyperplane parallel to itself. Note also that wTx + w0 = 0 has an inherent degree of freedom. We can
rescale the hyperplane to KwTx + Kw0 = 0 for K ∈ ℝ⁺ (positive real numbers) without changing
the hyperplane. Geometry for n = 2 with w1 > 0, w2 > 0 and w0 < 0 is shown in Fig. 4.3.
Figure 4.3 Linear decision boundary between two classes (figure: the hyperplane H, g(x) = 0, with normal vector w separates the half-space H⁺, g(x) > 0, from H⁻, g(x) < 0; xP is the normal projection of a point x lying at distance r from H, and d is the distance of H from the origin)

The location of any point x may be considered relative to the hyperplane H. Defining xP as the
normal projection of x onto H (shown in Fig. 4.3), we may decompose x as,
$$\mathbf{x} = \mathbf{x}_P + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|} \qquad (4.6)$$

where ||w|| is the Euclidean norm of w, and w/||w|| is a unit vector (unit length with direction that
of w). Since by definition

g(xP) = wTxP + w0 = 0
it follows that

$$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = \mathbf{w}^T\!\left(\mathbf{x}_P + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}\right) + w_0$$
$$= \mathbf{w}^T\mathbf{x}_P + w_0 + r\,\frac{\mathbf{w}^T\mathbf{w}}{\|\mathbf{w}\|} = r\,\frac{\|\mathbf{w}\|^2}{\|\mathbf{w}\|} = r\,\|\mathbf{w}\|$$

or

$$r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|} \qquad (4.7)$$

In other words, |g(x)| is a measure of the Euclidean distance of the point x from the decision
hyperplane H. If g(x) > 0, we say that the point x is on the positive side of the hyperplane, and if
g(x) < 0, we say that point x is on the negative side of the hyperplane. When g(x) = 0, the point x
is on the hyperplane H.
In general, the hyperplane H divides the feature space into two half-spaces: decision region H +
(positive side of hyperplane H ) for Class 1 (g(x) > 0) and region H – (negative side of hyperplane
H ) for Class 2 (g(x) < 0). The assignment of vector x to H + or H – can be implemented as,

$$\mathbf{w}^T\mathbf{x} + w_0 \;\begin{cases} > 0 & \text{if } \mathbf{x} \in H^+ \\ = 0 & \text{if } \mathbf{x} \in H \\ < 0 & \text{if } \mathbf{x} \in H^- \end{cases} \qquad (4.8)$$
The perpendicular distance d from the coordinate origin to the hyperplane H is given by –w0/‖w‖, as is seen below.

$$g(\mathbf{x}_d) = \mathbf{w}^T\mathbf{x}_d + w_0 = 0; \qquad \mathbf{x}_d = d\,\frac{\mathbf{w}}{\|\mathbf{w}\|}$$

Therefore,

$$d\,\frac{\mathbf{w}^T\mathbf{w}}{\|\mathbf{w}\|} + w_0 = 0; \quad d\,\frac{\|\mathbf{w}\|^2}{\|\mathbf{w}\|} = -w_0; \quad d = \frac{-w_0}{\|\mathbf{w}\|} \qquad (4.9)$$

The origin is on the negative side of H if w0 < 0, and if w0 > 0, the origin is on the positive side
of H. If w0 = 0, the hyperplane passes through the origin.
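As a quick numerical check of Eqns (4.7) and (4.9), the short sketch below (Python/NumPy, with hypothetical values of w, w0 and x) computes the signed distance r of a point from the hyperplane and the distance d of the origin from it.

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0        # hypothetical hyperplane 3x1 + 4x2 - 5 = 0
x = np.array([2.0, 1.0])

g = w @ x + w0                            # g(x)
r = g / np.linalg.norm(w)                 # Eqn (4.7): signed distance of x from H
d = -w0 / np.linalg.norm(w)               # Eqn (4.9): distance of the origin from H
print(g, r, d)                            # 5.0, 1.0, 1.0 -> x lies on the positive side of H
```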
Geometry for n = 3 is shown in Fig. 4.4.
Figure 4.4 Hyperplane H separates the feature space into two half-spaces H⁺ (g(x) > 0) and H⁻ (g(x) < 0)

4.3 PERCEPTRON ALGORITHM


Let us assume that we have a set of N samples x(1), x(2), …, x(N); some labeled Class 1 and some
labeled Class 2:
D : {(x(1), y(1)), …, (x(N), y(N))} (4.10)
with x(i) ∈ ℝⁿ, and y(i) ∈ {+1, –1}.
We want to use these samples to determine the weights w and w0 in a linear discriminant function
g(x) = wTx + w0 (4.11)
Let us say there is a reason behind the belief that there is a solution for which the likelihood of
error is quite low. This leads to the desire to seek a weight vector that correctly classifies all the
samples. If there does exist such a weight vector, the samples are said to be linearly separable
(Fig. 4.5a).
Figure 4.5 (a) Linearly separable data; (b) linearly inseparable data

Mostly, the classes are overlapped and the genuine separability is given by nonlinear decision boundaries (Fig. 4.5b). The samples in these cases are said to be linearly inseparable.
Consider now the problem of constructing a criterion function for solving the linear classification
problem to determine the weights w and w0. The criterion function, which is the most obvious for
the purpose of classification, is the number of samples misclassified by the weight vector. But
since this function is stepwise constant, it is naturally a weak candidate for gradient search. The
Perceptron Algorithm seems to be a better alternative for this criterion function.
Rosenblatt (1950s) proposed the machine—the perceptron—whose architecture encodes the
structure of a linear discriminant function (Fig. 4.3). Although it seemed initially promising, it was
quickly proved that perceptrons could not be trained to recognize many classes of problems. In spite of these limitations, the perceptron is an interesting machine because, even in such a simple system, we
can find most of the central concepts that we will need for the theory of Neural Networks (discussed
in the next chapter) and Support Vector Machines (discussed in the present chapter).
In the following, we describe perceptron training algorithm and its limitations for classification
problems.
A perceptron takes a vector of real-valued inputs xj; j = 1, …, n, calculates a linear combination of these inputs $\left(\sum_{j=1}^{n} w_j x_j = \mathbf{w}^T\mathbf{x}\right)$, then outputs +1 if the result is greater than the threshold (–w0) and –1 if the result is less than the threshold (Fig. 4.3). The perceptron algorithm tests the decision
function g(x) on each element in the training set, and if the test fails, it adjusts the free parameters w
and w0 incrementally. This process continues until all the elements of the training set are perfectly
classified.
At the heart of the algorithm are two update rules (here y(i) is the target output for the current training sample x(i), i = 1, …, N, and ŷ(i) is the output generated by the perceptron):

w ← w + Δw = w + η y(i) x(i)     (4.12a)

w0 ← w0 + Δw0 = w0 + η y(i) R²     (4.12b)

The quantity R is called the radius of the training data and can be considered the radius of the hypersphere centered at the origin of our coordinate system that encloses all the points of the dataset. In our case, where the data universe is ℝⁿ, this is simply the position vector length of the training set point located farthest from the origin [53]:

$$R = \max_{1 \le i \le N} \|\mathbf{x}^{(i)}\| \qquad (4.13)$$

where ‖·‖ stands for the Euclidean norm.


The quantity η is a positive scale factor (0 < η ≤ 1) that sets the step size. Called the learning rate, it controls the convergence speed of the search heuristic. Note that if η is too small, convergence is needlessly slow, whereas if η is too large, the correction process will overshoot, and can even diverge. So choice of η is crucial.

The intuition behind the update rules is that if ŷ(i) = y(i) (the point is correctly classified), there is no weights update (Δw = 0, Δw0 = 0). In case of a misclassified point (ŷ(i) ≠ y(i)), the rules attempt to correct the position of the decision surface in such a way that the point is no longer misclassified.
One may come across different expressions for the weight changes Δw, Δw0 in the literature compared to the ones given in (4.12). However, the conceptual framework of the weight updates remains the same. For example, the following update rules are probably most commonly employed:

w ← w + Δw = w + η (y(i) – ŷ(i)) x(i)
w0 ← w0 + Δw0 = w0 + η (y(i) – ŷ(i))     (4.14)

Here, we describe the conceptual framework with respect to the rules given in (4.12).
Consider a training dataset point (x(i), y(i)) with y(i) = +1. If this point is correctly classified by the decision surface (ŷ(i) = +1), Δw and Δw0 are zero and no weights are updated. Suppose the perceptron outputs a –1 when the target output is +1. The update rule (4.12a) attempts to correct this misclassification by rotating the decision surface in the direction of x(i). The rotation is accomplished by adding a scaled version of x(i) to the normal vector w (refer to Fig. 4.6). An analogous computation can be performed for a misclassified point with a target value of –1. In this case, the adjustment term will be subtracted from the normal vector, causing the rotation in the opposite direction.
Figure 4.6 The point x(i) is no longer misclassified after rotation of the hyperplane (figure: adding η x(i) to w rotates the decision surface wTx + w0 = 0 to (w + η x(i))Tx + w0 = 0)

The second update rule (4.12b) attempts to correct a misclassification by translating the decision
surface. We may rewrite the rule as,
–w0 ← –w0 – η y(i) R²

or b ← b – η y(i) R²     (4.15)

For a misclassified point with y(i) = +1, b is reduced and we need to translate the decision surface
in the direction opposite to the normal vector (refer to Fig. 4.7).
Figure 4.7 Translation of the hyperplane (figure: the boundary wTx = –w0 = b is shifted to wTx = b – ηR² to accommodate a misclassified point x(i))

The overall effect of the two update rules is demonstrated in Fig. 4.8. A square point is
misclassified at time step t. Decision surface is rotated and translated at time step t + 1. The square
point is now correctly classified but overcompensation results in misclassification of a circle point.
This misclassification in turn forces the perceptron algorithm to apply the update rules in the
opposite direction during the next iteration, leading to a decision surface at time step t + 2. This
process continues. If one episode is completed (all data used), another episode starts from the first
sample. The algorithm terminates when all the points are classified correctly.
Figure 4.8 Rotation and translation of the decision surface (figure: decision surfaces at time steps t + 1 and t + 2)

The solution is nonunique because there are more than one hyperplanes separating two linearly
separable classes (refer to Fig. 4.9). The decision surface search stops as soon as some surface is
found that separates the training set. This can lead to decision surfaces that are positioned close to
training set points. Considering that the training dataset is only an approximate representation of
the rest of the data universe, such solutions can lead to misclassifications of unseen points.
Attractiveness of the perceptron algorithm lies in its simplicity. There is, however, a major
problem associated with this algorithm for real-world solutions: datasets are almost certainly not
linearly separable, while the algorithm finds a separating hyperplane only for linearly separable
data. When the dataset is linearly inseparable, the test of the decision surface will always fail for
some subset of training points, regardless of the adjustments we make to the free parameters, and the
algorithm will loop forever. There may, however, be situations when linear separating hyperplane

can be a good solution even when the data are overlapped; an upper bound needs to be imposed on the number of iterations when this method is applied in practice.

Figure 4.9 Nonunique solution (figure: several hyperplanes separating the two classes)

Summary of the Perceptron Learning


Given the set of N data points that are used for training: x(i), y(i); i = 1, …, N.
Perform the following steps for i = 1, …, N.
Step 1: Choose the learning rate η > 0 (η = 0.1 may be a good initial choice) and initial weights w, w0 (initial weights can be random or zero).
Step 2: Apply the next (the first one for i = 1) training sample (x(i), y(i)) to the perceptron and, using Eqn (4.5), find the perceptron output ŷ(i) for the data pair applied and the given weights w and w0.
Step 3: Find the errors and adapt the weights using the update rules (4.12a)/(4.12b).
Step 4: Stop the adaptation of the weights if (y(i) – ŷ(i)) = 0 for all data pairs. Otherwise, go back to Step 2.
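The summary above can be turned into a short program. The sketch below (Python/NumPy, an illustrative choice) implements the update rules (4.12a)/(4.12b) together with the radius R of Eqn (4.13); the toy data, the value of η and the iteration cap (recommended earlier for data that may not be separable) are assumptions.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Perceptron learning with the update rules (4.12a)/(4.12b)."""
    N, n = X.shape
    w, w0 = np.zeros(n), 0.0
    R = np.max(np.linalg.norm(X, axis=1))          # radius of the data, Eqn (4.13)
    for _ in range(max_epochs):                    # cap iterations for inseparable data
        errors = 0
        for i in range(N):
            y_hat = 1 if X[i] @ w + w0 > 0 else -1
            if y_hat != y[i]:                      # misclassified: rotate and translate
                w = w + eta * y[i] * X[i]          # Eqn (4.12a)
                w0 = w0 + eta * y[i] * R**2        # Eqn (4.12b)
                errors += 1
        if errors == 0:                            # all points correctly classified
            break
    return w, w0

# illustrative linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])
print(train_perceptron(X, y))
```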

How do we cope with problems which are not linearly separable? The perceptron gives
us a simple method when the samples are linearly separable. The minimization of classification
error is the perceptron criterion function. It is an error-correcting process, as it requires that the weights be modified only when an error crops up. Since no weight vector can accurately classify
each sample in a nonseparable group, it is quite clear that the error-correcting process in perceptron
algorithm can never stop. Support Vector Machines (SVM), as we shall see in this chapter, seek a
weight vector that maximizes the margin (the minimum distance from the samples to the separating
hyperplane), and employ an optimization procedure that works well for both the linearly separable
and inseparable samples. The SVM criterion function of largest margin provides a unique solution
and promises a good classification with previously unseen data.

Another strong alternative is available in Neural Networks (NN). History has proved that limitations
of Rosenblatt’s perceptron can be overcome by Neural Networks (discussed in the next chapter).
The perceptron criterion function considers misclassified samples, and the gradient procedures
for minimization are not applicable. The neural networks primarily solve the regression problems
considering all the samples and minimum squared-error criterion, and employ gradient procedures
for minimization. The algorithms for separable—as well as inseparable—data classification are
first developed in the context of regression problems and then adapted for classification problems.
In the present chapter, our interest is in SVM-based solutions to real-life (nonlinear) classification
and regression problems. To explain how a support vector machine works for these problems, it is
perhaps easiest to start with the case of linearly separable patterns in the context of binary pattern
classification. In this context, the main idea of a support vector machine is to construct a hyperplane
as the decision surface in such a way that the margin of separation between Class 1 and Class
2 examples is maximized. We will then take up the more difficult case of linearly nonseparable
patterns. With the material on how to find the optimal hypersurface for linearly nonseparable
patterns at hand, we will formally describe the construction of a support vector machine for real-life
(nonlinear) pattern recognition task. As we shall see shortly, basically the idea of a support vector
machine hinges on the following two mathematical operations:
(i) Nonlinear mapping of input patterns into a high-dimensional feature space.
(ii) Construction of optimal hyperplane for linearly separating the feature vectors discovered in
Step (i).
The final stage of our presentation will be to extend these results for application to multiclass
classification problems, and nonlinear regression problems.

4.4 LINEAR MAXIMAL MARGIN CLASSIFIER FOR LINEARLY SEPARABLE DATA

Let the set of training (data) examples D be


D = {(x(1), y(1)), (x(2), y(2)), …, (x(N), y(N))}     (4.16)
where x(i) = [x1 x2 … xn]T is an n-dimensional input vector (pattern with n features) for the ith example in a real-valued space X ⊆ ℝⁿ; y(i) is its class label (output value), and y(i) ∈ {+1, –1}. +1 denotes Class 1 and –1 denotes Class 2.
To build a classifier, SVM finds a linear function of the form
g(x) = wTx + w0 (4.17)
so that the input vector x(i) is assigned to Class 1 if g(x(i)) > 0, and to Class 2 if g(x(i)) < 0, i.e.,

$$y^{(i)} = \begin{cases} +1 & \text{if } \mathbf{w}^T\mathbf{x}^{(i)} + w_0 > 0 \\ -1 & \text{if } \mathbf{w}^T\mathbf{x}^{(i)} + w_0 < 0 \end{cases} \qquad (4.18)$$

Hence, g(x) is a real-valued function; g: X ⊆ ℝⁿ → ℝ.


w = [w1 w2 … wn]T ∈ ℝⁿ is called the weight vector and w0 ∈ ℝ is called the bias.

In essence, SVM finds a hyperplane


w T x + w0 = 0 (4.19)
that separates Class 1 and Class 2 training examples. This hyperplane is called the decision
boundary or decision surface. Geometrically, the hyperplane (4.19) divides the input space into
two half spaces: one half for Class 1 examples and the other half for Class 2 examples. Note that
hyperplane (4.19) is a line in a two-dimensional space and a plane in a three-dimensional space.
For linearly separable data, there are many hyperplanes (lines in two-dimensional feature
space; Fig. 4.9) that can perform separation. How can one find the best one? The SVM framework provides a good answer to this question: among all the hyperplanes that minimize the training error, find the one with the largest margin (the gap between the data points of the two classes). This is an
intuitively acceptable approach: select the decision boundary that is far away from both the classes
(Fig. 4.10). Large-margin separation is expected to yield good classification on previously unseen
data, i.e., good generalization.
From Section 4.2, we know that in wTx + w0 = 0, w defines a direction perpendicular to the
hyperplane. w is called the normal vector (or simply normal) of the hyperplane. Without changing
the normal vector w, varying w0 moves the hyperplane parallel to itself. Note also that wTx + w0 =
0 has an inherent degree of freedom. We can rescale the hyperplane to KwTx + Kw0 = 0 for K ∈ ℝ⁺ (positive real numbers), without changing the hyperplane.
Figure 4.10 (a) Large margin separation; (b) small margin separation (figure: the separating line/decision boundary between Class 1, y = +1, and Class 2, y = –1, shown once with a large margin and once with a small margin)

Since SVM maximizes the margin between Class 1 and Class 2 data points, let us find the
margin. The linear function g(x) = wTx + w0 gives an algebraic measure of the distance r from x
to the hyperplane wTx + w0 = 0. We have seen earlier in Section 4.2 that this distance is given by
(Eqn (4.7))
$$r = \frac{g(\mathbf{x})}{\|\mathbf{w}\|} \qquad (4.20)$$
Now consider a Class 1 data point (x(i), +1) that is closest to the hyperplane wTx + w0 = 0 (Fig.
4.11).

The distance d1 of this data point from the hyperplane is

$$d_1 = \frac{g(\mathbf{x}^{(i)})}{\|\mathbf{w}\|} = \frac{\mathbf{w}^T\mathbf{x}^{(i)} + w_0}{\|\mathbf{w}\|} \qquad (4.21a)$$

Similarly,

$$d_2 = \frac{g(\mathbf{x}^{(k)})}{\|\mathbf{w}\|} = \frac{\mathbf{w}^T\mathbf{x}^{(k)} + w_0}{\|\mathbf{w}\|} \qquad (4.21b)$$

where (x(k), –1) is a Class 2 data point closest to the hyperplane wTx + w0 = 0.
Figure 4.11 Geometric interpretation of algebraic distances of points to a hyperplane for the two-dimensional case (figure: margin hyperplanes H1: wTx + w0 = +1 and H2: wTx + w0 = –1 on either side of H: wTx + w0 = 0, with the closest Class 1 point x(i) and Class 2 point x(k) at distances d1 and d2, and margin M between H1 and H2)

We define two parallel hyperplanes H1 and H2 that pass through x(i) and x(k), respectively. H1
and H2 are also parallel to the hyperplane wTx + w0 = 0. We can rescale w and w0 to obtain (this
rescaling, as we shall see later, simplifies the quest for significant patterns, called support vectors)
H1: wTx + w0 = +1
H2: wTx + w0 = –1     (4.22)

such that

wTx(i) + w0 ≥ 1 if y(i) = +1
wTx(i) + w0 ≤ –1 if y(i) = –1     (4.23a)
or equivalently
y(i) (wTx(i) + w0) ≥ 1 (4.23b)

which indicates that no training data fall between hyperplanes H1 and H2. The distance between the
two hyperplanes is the margin M. In the light of rescaling given by (4.22),
$$d_1 = \frac{1}{\|\mathbf{w}\|}; \qquad d_2 = \frac{-1}{\|\mathbf{w}\|} \qquad (4.24)$$

where the '–' sign indicates that x(k) lies on the side of the hyperplane wTx + w0 = 0 opposite to that where x(i) lies. From Fig. 4.11, it follows that

$$M = \frac{2}{\|\mathbf{w}\|} \qquad (4.25)$$

Equation (4.25) states that maximizing the margin of separation between classes is equivalent to
minimizing the Euclidean norm of the weight vector w.
Since SVM looks for the separating hyperplane that minimizes the Euclidean norm of the weight
vector, this gives us an optimization problem. A full description of the solution method requires a
significant amount of optimization theory, which is beyond the scope of this book. We will only
use relevant results from optimization theory, without giving formal definitions, theorems or proofs
(refer to [54, 55] for details).
Our interest here is in the following nonlinear optimization problem with inequality constraints:
minimize f (x)
(4.26)
subject to gi (x) ≥ 0; i = 1, …, m

where x = [x1 x2 … xn]T, and the functions f and gi are continuously differentiable.
The optimality conditions are expressed in terms of the Lagrangian function

$$L(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) \qquad (4.27)$$

where λ = [λ1 … λm]T is the vector of Lagrange multipliers.


An optimal solution to the problem (4.26) must satisfy the following necessary conditions, called
Karush-Kuhn-Tucker (KKT) conditions:
(i) ∂L(x, λ)/∂xj = 0; j = 1, …, n
(ii) gi(x) ≥ 0; i = 1, …, m     (4.28)
(iii) λi ≥ 0; i = 1, …, m
(iv) λi gi(x) = 0; i = 1, …, m
In view of condition (iii), the vector of Lagrange multipliers belongs to the set {λ ∈ ℝᵐ, λ ≥ 0}.
Also note that condition (ii) is the original set of constraints.
Our interest, as we will see shortly, is in convex functions f and linear functions gi. For this class of optimization problems, when there exist vectors x0 and λ0 such that the point (x0, λ0) satisfies the KKT conditions (4.28), then x0 gives the global minimum of the function f(x), with the constraint given in (4.26).
Let

$$L^*(\mathbf{x}) = \max_{\boldsymbol{\lambda} \in \mathbb{R}^m} L(\mathbf{x}, \boldsymbol{\lambda}), \quad \text{and} \quad L^*(\boldsymbol{\lambda}) = \min_{\mathbf{x} \in \mathbb{R}^n} L(\mathbf{x}, \boldsymbol{\lambda})$$

It is clear from these equations that for any x ∈ ℝⁿ and λ ∈ ℝᵐ,

L*(λ) ≤ L(x, λ) ≤ L*(x)

and thus, in particular,

L*(λ) ≤ L*(x)

This holds for any x ∈ ℝⁿ and λ ∈ ℝᵐ; so it holds for the λ that maximizes the left-hand side, and the x that minimizes the right-hand side. Thus,

$$\max_{\boldsymbol{\lambda} \in \mathbb{R}^m}\; \min_{\mathbf{x} \in \mathbb{R}^n} L(\mathbf{x}, \boldsymbol{\lambda}) \;\le\; \min_{\mathbf{x} \in \mathbb{R}^n}\; \max_{\boldsymbol{\lambda} \in \mathbb{R}^m} L(\mathbf{x}, \boldsymbol{\lambda})$$

The two problems, min-max and max-min, are said to be dual to each other. We refer to the
min-max problem as the primal problem. The objective to be minimized, L*(x), is referred to as
the primal function. The max-min problem is referred to as the dual problem, and L*(λ) as the
dual function. The optimal primal and dual function values are equal when f is a convex function
and gi are linear functions. The concept of duality is widely used in the optimization literature.
The aim is to provide an alternative formulation of the problem which is more convenient to solve
computationally and/or has some theoretical significance. In the context of SVM, the dual problem
is not only easy to solve computationally, but also crucial for using kernel functions to deal with
nonlinear decision boundaries. This will be clear later.
The nonlinear optimization problem defined in (4.26) can be represented as min-max problem,
as follows:
For the Lagrangian (4.27), we have

$$L^*(\mathbf{x}) = \max_{\boldsymbol{\lambda} \in \mathbb{R}^m} \left[ f(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) \right]$$

Since gi(x) ≥ 0 for all i, λi = 0 (i = 1, …, m) would maximize the Lagrangian; thus,

L*(x) = f(x)

Therefore, our original constrained problem (4.26) becomes the min-max primal problem:

$$\underset{\mathbf{x} \in \mathbb{R}^n}{\text{minimize}} \;\; L^*(\mathbf{x}) \qquad \text{subject to} \;\; g_i(\mathbf{x}) \ge 0; \; i = 1, \ldots, m$$

The concept of duality gives the following formulation for the max-min dual problem:

$$\underset{\boldsymbol{\lambda} \in \mathbb{R}^m,\, \boldsymbol{\lambda} \ge 0}{\text{maximize}} \;\; L^*(\boldsymbol{\lambda})$$

More explicitly, this nonlinear optimization problem with dual variables λ can be written in the form:

$$\underset{\boldsymbol{\lambda} \ge 0}{\text{maximize}} \;\; \min_{\mathbf{x} \in \mathbb{R}^n} \left[ f(\mathbf{x}) - \sum_{i=1}^{m} \lambda_i\, g_i(\mathbf{x}) \right] \qquad (4.29)$$

Let us now state the learning problem in SVM.


Given a set of linearly separable training examples,
D = {(x(1), y(1)), (x(2), y(2)), …, (x(N), y(N))},
the learning problem is to solve the following constrained minimization problem:

$$\begin{aligned} \text{minimize} \;\; & f(\mathbf{w}) = \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} \\ \text{subject to} \;\; & y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0) \ge 1; \quad i = 1, \ldots, N \end{aligned} \qquad (4.30)$$
This formulation is called the primal formulation of hard-margin SVM. Solving this problem
will produce the solutions for w and w0 which in turn, give us the maximal margin hyperplane wTx
+ w0 = 0 with the margin 2/||w||.
The objective function is quadratic and convex in parameters w, and the constraints are linear
in parameters w and w0. The dual formulation of this constrained optimization problem is obtained
as follows.
First we construct the Lagrangian:
$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \lambda_i \left[ y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0) - 1 \right] \qquad (4.31)$$
The KKT conditions are as follows:
(i) $\dfrac{\partial L}{\partial \mathbf{w}} = 0$, which gives $\mathbf{w} = \sum_{i=1}^{N} \lambda_i\, y^{(i)} \mathbf{x}^{(i)}$;
    $\dfrac{\partial L}{\partial w_0} = 0$, which gives $\sum_{i=1}^{N} \lambda_i\, y^{(i)} = 0$     (4.32)
(ii) y(i)(wTx(i) + w0) – 1 ≥ 0; i = 1, …, N
(iii) λi ≥ 0; i = 1, …, N
(iv) λi [y(i)(wTx(i) + w0) – 1] = 0; i = 1, …, N
From condition (i) of KKT conditions (4.32), we observe that the solution vector has an expansion in terms of training examples. Note that although the solution w is unique (due to the strict convexity of the function f(w)), the dual variables λi need not be. There is a dual variable λi for each training data point. Condition (iv) of KKT conditions (4.32) shows that for data points not on the margin hyperplanes (i.e., H1 and H2), λi = 0:

y(i)(wTx(i) + w0) – 1 > 0 ⇒ λi = 0

For data points on the margin hyperplanes, λi ≥ 0:

y(i)(wTx(i) + w0) – 1 = 0 ⇒ λi ≥ 0

However, the data points on the margin hyperplanes with λi = 0 do not contribute to the solution w, as is seen from condition (i) of KKT conditions (4.32). The data points on the margin hyperplanes with associated dual variables λi > 0 are called support vectors, which give the name to the algorithm, support vector machines.
To postulate the dual problem, we first expand Eqn (4.31), term by term, as follows:
$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \lambda_i\, y^{(i)} \mathbf{w}^T\mathbf{x}^{(i)} - w_0 \sum_{i=1}^{N} \lambda_i\, y^{(i)} + \sum_{i=1}^{N} \lambda_i \qquad (4.33)$$

Transformation from the primal to the corresponding dual is carried out by setting the partial
derivatives of the Lagrangian (4.33) with respect to the primal variables (i.e., w and w0) to zero, and
substituting the resulting relations back into the Lagrangian. The objective is to merely substitute
condition (i) of KKT conditions (4.32) into the Lagrangian (4.33) to remove the primal variables;
which gives us the dual objective function.
The third term on the right-hand side of Eqn (4.33) is zero by virtue of condition (i) of KKT
conditions (4.32). Furthermore, from this condition we have,
$$\mathbf{w}^T\mathbf{w} = \sum_{i=1}^{N} \lambda_i\, y^{(i)} \mathbf{w}^T\mathbf{x}^{(i)} = \sum_{i=1}^{N} \sum_{k=1}^{N} \lambda_i \lambda_k\, y^{(i)} y^{(k)}\, \mathbf{x}^{(i)T}\mathbf{x}^{(k)}$$

Accordingly, minimization of function L in Eqn (4.33) with respect to primal variables w and w0,
gives us the following dual objective function:
$$L^*(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N} \lambda_i \lambda_k\, y^{(i)} y^{(k)}\, \mathbf{x}^{(i)T}\mathbf{x}^{(k)} \qquad (4.34)$$

We may now state the dual optimization problem.


Given a set of linearly separable training examples {(x(i), y(i))}, i = 1, …, N, find the dual variables {λi}, i = 1, …, N, that maximize the objective function (4.34) subject to the constraints

• $\sum_{i=1}^{N} \lambda_i\, y^{(i)} = 0$     (4.35)
• λi ≥ 0; i = 1, …, N

This formulation is the dual formulation of the hard-margin SVM.
Having solved the dual problem numerically (using MATLAB's quadprog function, for example), the resulting optimum λi values are then used to compute w and w0. w is computed using condition (i) of KKT conditions (4.32):

$$\mathbf{w} = \sum_{i=1}^{N} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} \qquad (4.36)$$

and w0 is computed using condition (iv) of KKT conditions (4.32):

λi [y(i)(wTx(i) + w0) – 1] = 0; i = 1, …, N     (4.37)

Note that though there are N values of λi in Eqn (4.36), most vanish with λi = 0 and only a small percentage have λi > 0. The set of x(i) whose λi > 0 are the support vectors, and as we see in Eqn (4.36), w is the weighted sum of these training instances that are selected as the support vectors:

$$\mathbf{w} = \sum_{i \,\in\, svindex} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} \qquad (4.38)$$

where svindex denotes the set of indices of support vectors.


From Eqn (4.38), we see that the support vectors x(i); i ∈ svindex, satisfy

y(i)(wTx(i) + w0) = 1

and lie on the margin. We can use this fact to calculate w0 from any support vector as,

$$w_0 = \frac{1}{y^{(i)}} - \mathbf{w}^T\mathbf{x}^{(i)}$$

For y(i) ∈ {+1, –1}, we can equivalently express this equation as,

w0 = y(i) – wTx(i)     (4.39)

Instead of depending on one support vector to compute w0, in practice, all support vectors are used to compute w0, and then their average is taken for the final value of w0. This is because the values of λi are computed numerically and can have numerical errors.

$$w_0 = \frac{1}{|svindex|} \sum_{i \,\in\, svindex} \left( y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} \right) \qquad (4.40)$$

where |svindex| is the total number of indices in the set svindex, i.e., the total number of support vectors.
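Eqns (4.36)–(4.40) translate directly into a few lines of code. The following sketch (Python/NumPy, an illustrative choice) assumes arrays X, y and the multipliers lam returned by a numerical QP solver (such as quadprog or an equivalent), and recovers w and w0 by averaging over the support vectors; the tolerance tol is a hypothetical guard against solver round-off.

```python
import numpy as np

def recover_w_w0(X, y, lam, tol=1e-8):
    """Compute w (Eqn 4.38) and w0 (Eqn 4.40) from the dual solution lam."""
    sv = lam > tol                                       # lambda_i > 0: support vectors
    w = ((lam[sv] * y[sv])[:, None] * X[sv]).sum(axis=0) # weighted sum of support vectors
    w0 = np.mean(y[sv] - X[sv] @ w)                      # average over all support vectors
    return w, w0
```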
The majority of λi are 0, for which y(i)(wTx(i) + w0) > 1. These are the x(i) points that lie well away from the discriminant, and they have no effect on the hyperplane. The instances that are not support vectors carry no information; the same solution is obtained on removing any subset of them. From this viewpoint, the SVM algorithm can be said to be similar to the k-NN algorithm (Section 3.4), which stores only the instances neighboring the class discriminant.
During testing, we do not enforce a margin. We calculate

g(x) = wTx + w0     (4.41a)

and choose the class according to the sign of g(x): sgn(g(x)), which we call the indicator function iF,

iF = ŷ = sgn(wTx + w0)     (4.41b)

Choose Class 1 (ŷ = +1) if wTx + w0 > 0, and Class 2 (ŷ = –1) otherwise.



Example 4.1
In this example, we visualize SVM (hard-margin) formulation in two variables. Consider the toy
dataset given in Table 4.1.
SVM finds a hyperplane
H : w1 x1 + w2 x2 + w0 = 0
and two bounding planes
H1: w1 x1 + w2 x2 + w0 = +1
H2: w1 x1 + w2 x2 + w0 = –1
such that
w1 x1(i) + w2 x2(i) + w0 ≥ +1 if y(i) = +1
w1 x1(i) + w2 x2(i) + w0 ≤ –1 if y(i) = –1
or equivalently
y(i)(w1 x1(i) + w2 x2(i) + w0) ≥ 1
We write these constraints explicitly as (refer to Table 4.1),
(–1) [w1 + w2 + w0] ≥ 1
(–1) [2w1 + w2 + w0] ≥ 1
(–1) [w1 + 2w2 + w0] ≥ 1
(–1) [2w1 + 2w2 + w0] ≥ 1
(–1) [1.5w1 + 1.5w2 + w0] ≥ 1
(+1) [4w1 + 4w2 + w0] ≥ 1
(+1) [4w1 + 5w2 + w0] ≥ 1
(+1) [5w1 + 4w2 + w0] ≥ 1
(+1) [5w1 + 5w2 + w0] ≥ 1
(+1) [4.5w1 + 4.5w2 + w0] ≥ 1

Table 4.1 Data for classification

Sample i    x1(i)    x2(i)    y(i)
1           1        1        –1
2           2        1        –1
3           1        2        –1
4           2        2        –1
5           1.5      1.5      –1
6           4        4        +1
7           4        5        +1
8           5        4        +1
9           5        5        +1
10          4.5      4.5      +1

The constraint equations in matrix form:

$$\underbrace{\mathrm{diag}(-1,-1,-1,-1,-1,+1,+1,+1,+1,+1)}_{\mathbf{Y}} \left( \underbrace{\begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 1 & 2 \\ 2 & 2 \\ 1.5 & 1.5 \\ 4 & 4 \\ 4 & 5 \\ 5 & 4 \\ 5 & 5 \\ 4.5 & 4.5 \end{bmatrix}}_{\mathbf{X}} \underbrace{\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}}_{\mathbf{w}} + w_0 \underbrace{\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}}_{\mathbf{e}} \right) \ge \underbrace{\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}}_{\mathbf{e}}$$

or Y(Xw + w0e) ≥ e     (4.42a)
For a dataset with N samples, and n features in vector x,

$$\underset{(N \times N)}{\mathbf{Y}} = \begin{bmatrix} y^{(1)} & 0 & \cdots & 0 \\ 0 & y^{(2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & y^{(N)} \end{bmatrix}; \quad \underset{(N \times n)}{\mathbf{X}} = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(N)T} \end{bmatrix}; \quad \underset{(n \times 1)}{\mathbf{w}} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}; \quad \underset{(N \times 1)}{\mathbf{e}} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \qquad (4.42b)$$

$$\mathbf{Y} = \mathrm{diag}(\mathbf{y}); \qquad \mathbf{y} = [y^{(1)}\; y^{(2)} \ldots y^{(N)}]^T \qquad (4.42c)$$


Our aim is to find the weight matrix w and the bias term w0 that maximize the margin of separation
between the hyperplanes H1 and H2, and at the same time satisfy the constraint equations (4.42). It
gives us an optimization problem
maximize
w , w0
( 1
2
wT w )
(4.43)
subject to Y(Xw + w0e) ≥ e
Once we obtain w and w0, we have our decision boundary:
wTx + w0 = 0
and for a new unseen data point x, we assign sgn (wTx + w0) as the class value.
For solving the above problem in primal, we need to rewrite the problem in the standard QP
(Quadratic Programming) format.

Another way to solve the above primal problem is the Lagrangian dual. The problem is more easily solved in terms of its Lagrangian dual variables:

$$\underset{\boldsymbol{\lambda} \ge 0}{\text{maximize}} \left[ \min_{\mathbf{w},\, w_0} L(\mathbf{w}, w_0, \boldsymbol{\lambda}) \right] \qquad (4.44)$$

where (Eqn (4.31))

$$L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} - \boldsymbol{\lambda}^T \left[ \mathbf{Y}(\mathbf{X}\mathbf{w} + w_0\mathbf{e}) - \mathbf{e} \right]$$

and λ = [λ1 λ2 … λN]T is the (N × 1) vector of Lagrange multipliers.

It leads to the dual optimization problem (Eqns (4.34)–(4.35))

$$\underset{\boldsymbol{\lambda}}{\text{maximize}} \left[ \boldsymbol{\lambda}^T\mathbf{e} - \tfrac{1}{2}\,\boldsymbol{\lambda}^T \mathbf{Y}\mathbf{X}\mathbf{X}^T\mathbf{Y}\boldsymbol{\lambda} \right] \qquad \text{subject to} \;\; \mathbf{e}^T\mathbf{Y}\boldsymbol{\lambda} = 0; \;\; \boldsymbol{\lambda} \ge 0 \qquad (4.45)$$
Some standard quadratic optimization programs typically minimize the given objective function:

$$\underset{\boldsymbol{\lambda}}{\text{minimize}} \left[ \tfrac{1}{2}\,\boldsymbol{\lambda}^T\mathbf{Q}\boldsymbol{\lambda} - \mathbf{e}^T\boldsymbol{\lambda} \right] \qquad \text{subject to} \;\; \mathbf{y}^T\boldsymbol{\lambda} = 0; \;\; \boldsymbol{\lambda} \ge 0 \qquad (4.46)$$

where Q = Y X XT Y.

The input required for the above program is only X and y. It returns the Lagrange multiplier vector λ.
Having solved the dual problem numerically (using a standard optimization program), the optimum λi values are then used to compute w and w0 (Eqns (4.38), (4.40)):

$$\mathbf{w} = \sum_{i \,\in\, svindex} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} \qquad (4.47a)$$

$$w_0 = \frac{1}{|svindex|} \left[ \sum_{i \,\in\, svindex} \left( y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} \right) \right] \qquad (4.47b)$$

Using the MATLAB quadprog routine for the dataset of Table 4.1, we obtain [56]

λT = [0 0 0 0.25 0 0.25 0 0 0 0]

This means that the data points with indices 4 and 6 are support vectors; i.e., the support vectors are:

$$\begin{bmatrix} 2 \\ 2 \end{bmatrix}; \qquad \begin{bmatrix} 4 \\ 4 \end{bmatrix}$$
and svindex is {4, 6}; |svindex| = 2.

$$\mathbf{w} = \sum_{i \,\in\, svindex} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} = 0.25\,(-1)\begin{bmatrix} 2 \\ 2 \end{bmatrix} + 0.25\,(+1)\begin{bmatrix} 4 \\ 4 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$$

$$w_0 = \tfrac{1}{2}\left[ y^{(4)} - \mathbf{w}^T\mathbf{x}^{(4)} + y^{(6)} - \mathbf{w}^T\mathbf{x}^{(6)} \right] = \tfrac{1}{2}\left[ -1 - [w_1\;\; w_2]\begin{bmatrix} 2 \\ 2 \end{bmatrix} + 1 - [w_1\;\; w_2]\begin{bmatrix} 4 \\ 4 \end{bmatrix} \right] = -3$$

Therefore, the decision hyperplane g(x) is

g(x) = 0.5x1 + 0.5x2 – 3

and the indicator function is

iF = ŷ = sgn(g(x)) = sgn(0.5x1 + 0.5x2 – 3)
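As a cross-check of the numbers above, the dual QP (4.46) for the Table 4.1 data can be solved with any off-the-shelf QP routine. The sketch below uses Python with the cvxopt package (an assumed substitute for the MATLAB quadprog call in the text); it should approximately reproduce nonzero multipliers only at the support vectors (2, 2) and (4, 4), with w ≈ [0.5 0.5]ᵀ and w0 ≈ –3.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes the cvxopt package is available

# Table 4.1 data
X = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [1.5, 1.5],
              [4, 4], [4, 5], [5, 4], [5, 5], [4.5, 4.5]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1], dtype=float)
N = len(y)

# Q = Y X X^T Y (Eqn 4.46); a tiny ridge keeps the QP numerically well behaved
Q = (y[:, None] * X) @ (y[:, None] * X).T + 1e-8 * np.eye(N)

solvers.options['show_progress'] = False
sol = solvers.qp(P=matrix(Q), q=matrix(-np.ones(N)),
                 G=matrix(-np.eye(N)), h=matrix(np.zeros(N)),   # lambda >= 0
                 A=matrix(y.reshape(1, -1)), b=matrix(0.0))     # y^T lambda = 0
lam = np.ravel(sol['x'])

sv = lam > 1e-6                                        # support vectors: lambda_i > 0
w = ((lam[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)   # Eqn (4.47a)
w0 = np.mean(y[sv] - X[sv] @ w)                        # Eqn (4.47b)
print(np.round(lam, 3), w, round(w0, 2))               # expect w ~ [0.5 0.5], w0 ~ -3
```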

4.5 LINEAR SOFT MARGIN CLASSIFIER FOR OVERLAPPING CLASSES

The linear hard-margin classifier gives a simple SVM when the samples are linearly separable.
In practice, however, the training data is almost always linearly nonseparable because of random
errors owing to different reasons. For instance, certain instances may be wrongly labeled. The
labels could be different even for two input vectors that are identical.
If SVM has to be of some use, it should permit noise in the training data. But, with noisy data,
the linear SVM algorithm described in the earlier section will not obtain a solution, as the constraints cannot be satisfied. For instance, in Fig. 4.12, there exists a Class 2 point (square) in the Class 1
region, and a Class 1 point (circle) in the Class 2 area. However, in spite of the couple of mistakes,
the decision boundary seems to be good. But the hard-margin classifier presented previously cannot
be used, because all the constraints
y(i)(wTx(i) + w0) ≥ 1; i = 1, …, N
cannot be satisfied.
So the constraints have to be modified to permit mistakes. To allow errors in data, we can relax the margin constraints by introducing slack variables ξi (≥ 0), as follows:

wTx(i) + w0 ≥ 1 – ξi  for y(i) = +1
wTx(i) + w0 ≤ –1 + ξi  for y(i) = –1

Thus, we have the new constraints

y(i)(wTx(i) + w0) ≥ 1 – ξi; i = 1, …, N
ξi ≥ 0     (4.48)

The geometric interpretation is shown in Fig. 4.12.


Figure 4.12 Soft decision boundary (figure: (a) slack distances ξi/‖w‖ and ξk/‖w‖ measured from the margin hyperplanes wTx + w0 = ±1 around the decision boundary wTx + w0 = 0; (b) four instances 1–4 illustrating the cases discussed below)

In classifying an instance, there are four possible cases (see Fig. 4.12(b)). Instance 1 is on the correct side and far away from the margin; y(i)g(x(i)) > 1, ξi = 0. Instance 2 is on the correct side and on the margin; ξi = 0. For instance 3, ξi = 1 – g(x(i)), 0 < ξi < 1: the instance is on the correct side but inside the margin, not sufficiently far away. For instance 4, ξi = 1 + g(x(i)) > 1: the instance is on the wrong side; this is a misclassification.
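The four cases can be read off numerically from the slack values. A minimal sketch (Python/NumPy, with hypothetical w, w0 and points): for a given classifier, the smallest slack consistent with constraints (4.48) is ξi = max(0, 1 − y(i) g(x(i))), which is 0 for cases 1 and 2, strictly between 0 and 1 for case 3, and greater than 1 for case 4.

```python
import numpy as np

w, w0 = np.array([0.5, 0.5]), -3.0            # hypothetical classifier
X = np.array([[5.0, 5.0], [4.0, 4.0], [3.5, 3.0], [2.0, 3.0]])
y = np.array([1, 1, 1, 1])

g = X @ w + w0                                # g(x) for each instance
xi = np.maximum(0.0, 1.0 - y * g)             # slack of each instance
print(xi)   # ~[0, 0, 0.75, 1.5]: beyond margin, on margin, inside margin, misclassified
```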
We also need to penalize the errors in the objective function. A natural way is to assign an extra
cost for errors to change the objective function to
$$\tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} + C\left( \sum_{i=1}^{N} \xi_i \right); \qquad C \ge 0$$
where C is a user-specified penalty parameter. This parameter is a trade-off parameter between margin and mistakes.
The parameter C trades off complexity, as measured by the norm of the weight vector, and data misfit, as measured by the number of nonseparable points. Note that we are penalizing not only the misclassified points but also the ones in the margin, for better generalization. Increasing C corresponds to assigning a high penalty to errors, simultaneously resulting in larger weights. The width of the soft margin can be controlled by the penalty parameter C.
The new optimization problem becomes
$$\begin{aligned} \text{minimize} \;\; & \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{N} \xi_i \\ \text{subject to} \;\; & y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0) \ge 1 - \xi_i; \quad i = 1, \ldots, N \\ & \xi_i \ge 0; \quad i = 1, \ldots, N \end{aligned} \qquad (4.48a)$$

This formulation is called the soft-margin SVM.


Proceeding in a manner similar to that described earlier for the separable case, we may formulate the dual problem for nonseparable patterns as follows.
The Lagrangian

$$L(\mathbf{w}, w_0, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \lambda_i \left[ y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i \qquad (4.49)$$

where λi, μi ≥ 0 are the dual variables.
The KKT conditions for optimality are as follows:

(i) $\dfrac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} = 0$
    $\dfrac{\partial L}{\partial w_0} = -\sum_{i=1}^{N} \lambda_i\, y^{(i)} = 0$
    $\dfrac{\partial L}{\partial \xi_i} = C - \lambda_i - \mu_i = 0; \quad i = 1, \ldots, N$
(ii) y(i)(wTx(i) + w0) – 1 + ξi ≥ 0; i = 1, …, N     (4.50)
     ξi ≥ 0; i = 1, …, N
(iii) λi ≥ 0; i = 1, …, N
     μi ≥ 0; i = 1, …, N
(iv) λi (y(i)(wTx(i) + w0) – 1 + ξi) = 0; i = 1, …, N
     μi ξi = 0; i = 1, …, N
We substitute the relations in condition (i) of KKT conditions (4.50) into the Lagrangian (4.49) to obtain the dual objective function. From the relation C – λi – μi = 0, we can deduce that λi ≤ C because μi ≥ 0. Thus, the dual formulation of the soft-margin SVM is

$$\underset{\boldsymbol{\lambda}}{\text{maximize}} \;\; L^*(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \lambda_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{N} \lambda_i \lambda_k\, y^{(i)} y^{(k)}\, \mathbf{x}^{(i)T}\mathbf{x}^{(k)}$$
$$\text{subject to} \;\; \sum_{i=1}^{N} \lambda_i\, y^{(i)} = 0; \quad 0 \le \lambda_i \le C; \;\; i = 1, \ldots, N \qquad (4.51)$$

Interestingly, ξi and μi do not appear in the dual objective function; the objective function is identical to that for the separable case. The only difference is the constraint λi ≤ C (inferred from C – λi – μi = 0 and μi ≥ 0). The dual problem (4.51) can also be solved numerically, and the resulting λi values are then used to compute w and w0. The weight vector w is computed using Eqn (4.36).

The bias parameter w0 is computed using condition (iv) of KKT conditions (4.50):

λi (y(i)(wTx(i) + w0) – 1 + ξi) = 0     (4.52a)
μi ξi = 0     (4.52b)

Since we do not have values for ξi, we have to get around it. λi can have values in the interval 0 ≤ λi ≤ C. We will separate it into the following three cases:
Case 1: λi = 0
We know that C – λi – μi = 0. With λi = 0, we get μi = C. Since μi ξi = 0 (Eqn (4.52b)), this implies that ξi = 0; which means that the corresponding ith pattern is correctly classified without any error (as it would have been with the hard-margin SVM). Such patterns may lie on the margin hyperplanes or outside the margin. However, they do not contribute to the optimum value of w, as is seen from Eqn (4.36).
Case 2: 0 < λi < C
We know that C – λi – μi = 0. Therefore, μi = C – λi, which means μi > 0. Since μi ξi = 0 (Eqn (4.52b)), this implies that ξi = 0. Again the corresponding ith pattern is correctly classified. Also, from Eqn (4.52a), we see that for ξi = 0 and 0 < λi < C, y(i)(wTx(i) + w0) = 1; so the corresponding patterns are on the margin.
Case 3: λi = C
With λi = C, μi = 0, and Eqn (4.52a) gives y(i)(wTx(i) + w0) = 1 – ξi with ξi ≥ 0. When ξi > 0, the corresponding pattern is misclassified or lies inside the margin.
Note that support vectors have λi > 0, and they define w as given by Eqn (4.36). We can compute w from the following equation (refer to Eqn (4.38)):

$$\mathbf{w} = \sum_{i \,\in\, svindex} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} \qquad (4.53a)$$

where svindex denotes the set of indices of support vectors (patterns with λi > 0).
Of all the support vectors, those with λi < C (Case 2) are the ones that lie on the margin, and we can use them to calculate w0; they satisfy

y(i)(wTx(i) + w0) = 1

We can compute w0 from the following equation (refer to Eqn (4.40)):

$$w_0 = \frac{1}{|svmindex|} \sum_{i \,\in\, svmindex} \left( y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} \right) \qquad (4.53b)$$

where svmindex is the set of indices of the support vectors that fall on the margin.
Finally, expressions for both the decision function g(x) and an indicator function iF = sgn (g(x))
for a soft-margin classifier are the same as for linearly separable classes (refer Eqns (4.41)):
g(x) = wTx + w0 (4.54a)

iF = ŷ = sgn (g(x)) = sgn (wTx + w0) (4.54b)


The following points need the attention of the reader:
• A significant property of SVM is that the solution is sparse in λi. The majority of the training data points lie outside the margin area and their λi's in the solution are 0. The data points on the margin with λi = 0 do not contribute to the solution either. Only those data points that are on the margin hyperplanes with 0 < λi < C, and those misclassified or inside the margin (λi = C), contribute to the solution. In the absence of this sparsity property, SVM would be impractical for huge datasets.
• Parameter C in the optimization formulation (4.51) is the regularization parameter, fine-tuned with the help of cross-validation. It defines the trade-off between margin maximization and error minimization. If it is too large, there is a high penalty for nonseparable points, and we may store many support vectors and overfit. If it is too small, we may end up with very simple solutions that underfit. (A cross-validation sketch for tuning C follows this list.)
The tuning process can be rather time-consuming for huge datasets. Many heuristic rules for the selection of C have been recommended in the literature. Refer to [57] for a heuristic formula for the selection of the parameter, which has proven to be close to optimal in many practical situations. More about the difficulty associated with the choice of C will appear in a later section.
• The final decision boundary is

wTx + w0 = 0

Substituting for w and w0 from Eqns (4.53), we obtain

$$\left( \sum_{i \,\in\, svindex} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} \right)^{\!T}\! \mathbf{x} \;+\; \frac{1}{|svmindex|} \sum_{k \,\in\, svmindex} \left[ y^{(k)} - \left( \sum_{i \,\in\, svindex} \lambda_i\, y^{(i)} \mathbf{x}^{(i)} \right)^{\!T}\! \mathbf{x}^{(k)} \right] = 0$$

or

$$\sum_{i \,\in\, svindex} \lambda_i\, y^{(i)}\, \mathbf{x}^{(i)T}\mathbf{x} \;+\; \frac{1}{|svmindex|} \sum_{k \,\in\, svmindex} \left[ y^{(k)} - \sum_{i \,\in\, svindex} \lambda_i\, y^{(i)}\, \mathbf{x}^{(i)T}\mathbf{x}^{(k)} \right] = 0 \qquad (4.55)$$

We notice that w and w0 do not need to be explicitly computed. As we will see in a later section,
this is crucial for using kernel functions to handle nonlinear decision boundaries.
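As noted in the bullet on parameter C above, C is usually tuned by cross-validation. The sketch below is one possible way to do this with scikit-learn (an assumed tool choice; the text does not prescribe a package); the synthetic dataset, the candidate grid for C, and the fold count are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic two-class data with overlapping (linearly inseparable) classes
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y = np.where(y == 0, -1, 1)                     # labels in {-1, +1} as in the text

# grid search over the penalty parameter C with 5-fold cross-validation
grid = GridSearchCV(SVC(kernel='linear'),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# weight vector, bias and number of support vectors of the selected soft-margin classifier
clf = grid.best_estimator_
print(clf.coef_, clf.intercept_, len(clf.support_))
```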

Example 4.2
In Example 4.1, SVM (hard margin) formulation was developed in matrix form. In the following,
we give matrix form of SVM (soft margin) formulation [56].
If all the data points are not linearly separable, we allow training error or, in other words, allow points to lie between the bounding hyperplanes and beyond. When a point, say x(i) = [x1(i) x2(i) … xn(i)]T with y(i) = +1, lies either between the bounding hyperplanes or beyond (into the region where wTx(i) + w0 ≤ –1), we add a positive quantity ξi to the left of the inequality (refer to (4.23a)) to satisfy the constraint wTx(i) + w0 + ξi ≥ +1. Similarly, when a point with y(i) = –1 lies either between the bounding hyperplanes or beyond (into the region where wTx(i) + w0 ≥ +1), we subtract a positive quantity ξi from the left of the inequality (refer to (4.23a)) to satisfy the constraint wTx(i) + w0 – ξi ≤ –1. For all other points, let us assume that we are adding ξi terms with zero values. Thus, we have the constraints (Eqn (4.48))

y(i)(wTx(i) + w0) ≥ 1 – ξi;  ξi ≥ 0

The constraint equation in matrix form can be written as,

$$\mathbf{Y}(\mathbf{X}\mathbf{w} + w_0\mathbf{e}) + \boldsymbol{\xi} \ge \mathbf{e} \qquad (4.56)$$

where

$$\underset{(N \times 1)}{\boldsymbol{\xi}} = \begin{bmatrix} \xi_1 \\ \xi_2 \\ \vdots \\ \xi_N \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

and matrices/vectors Y, X, w and e have already been defined in Eqns (4.42).


The optimization problem becomes (Eqn (4.48a))

$$\underset{\mathbf{w},\, \boldsymbol{\xi}}{\text{minimize}} \;\; \left[ \tfrac{1}{2}\,\mathbf{w}^T\mathbf{w} + C\,\mathbf{e}^T\boldsymbol{\xi} \right] \qquad \text{subject to} \;\; \mathbf{Y}(\mathbf{X}\mathbf{w} + w_0\mathbf{e}) + \boldsymbol{\xi} \ge \mathbf{e}, \;\; \boldsymbol{\xi} \ge 0 \qquad (4.57)$$

Here, C is a scalar value (≥ 0) that controls the trade-off between margin and errors. This C is to be supplied at the time of training. Proper choice of C is crucial for good generalization performance of the classifier. Usually, the value of C is obtained by trial-and-error with cross-validation.
Minimization of the quantity ½ wTw + C eTξ with respect to w and ξ causes maximum separation between the bounding planes with minimum number of points crossing their respective bounding planes.
Proceeding in a manner similar to that described in Example 4.1 for the hard-margin SVM, we may formulate the dual problem for the soft-margin SVM as follows (Eqn (4.51)):

$$\underset{\boldsymbol{\lambda}}{\text{minimize}} \;\; \left\{ -\mathbf{e}^T\boldsymbol{\lambda} + \tfrac{1}{2}\,\boldsymbol{\lambda}^T\mathbf{Q}\boldsymbol{\lambda} \right\} \qquad \text{subject to} \;\; \mathbf{y}^T\boldsymbol{\lambda} = 0, \;\; 0 \le \boldsymbol{\lambda} \le C\mathbf{e} \qquad (4.58)$$

where λ = [λ1 λ2 … λN]T are the dual variables, and all other matrices/vectors have been defined earlier in Example 4.1.

The main difference between the soft-margin and hard-margin SVM classifiers is that in the case of soft margin, the Lagrange multipliers λi are bounded: 0 ≤ λi ≤ C. We will separate the bounds on λi into three cases:
1. λi = 0. This leads to ξi = 0, implying that the corresponding ith pattern is correctly classified.
2. 0 < λi < C. This also leads to ξi = 0, implying that the corresponding ith pattern is correctly classified. The points with ξi = 0 and 0 < λi < C fall on the bounding planes of the class to which the points belong. These are the support vectors that lie on the margin.
3. λi = C. In this case, ξi > 0; this implies that the corresponding ith pattern is misclassified or lies inside the margin. Since λi ≠ 0, these are also support vectors.
Once we obtain the Lagrange multipliers λ using quadratic programming, we can compute w and w0 using Eqns (4.53).
As we will see in Section 4.7, the SVM formulation for nonlinear classifiers is similar to the one given in this example. There, we will consider a toy dataset to illustrate the numerical solution of the SVM (soft-margin) problem.
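For readers who want to solve the matrix-form dual (4.58) directly, here is a minimal sketch using the cvxopt QP solver (an assumed choice; the text leaves the solver unspecified). X and y are the data matrix and label vector of Eqns (4.42), C is the chosen penalty, and the small ridge added to Q is a numerical-stability device, not part of the formulation.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes cvxopt is available

def soft_margin_svm(X, y, C):
    """Solve (4.58): min -e^T lam + (1/2) lam^T Q lam, s.t. y^T lam = 0, 0 <= lam <= C e."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    N = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T + 1e-8 * np.eye(N)  # Q = Y X X^T Y + ridge
    G = np.vstack([-np.eye(N), np.eye(N)])        # stacks lam >= 0 and lam <= C as G lam <= h
    h = np.hstack([np.zeros(N), C * np.ones(N)])
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(Q), matrix(-np.ones(N)), matrix(G), matrix(h),
                     matrix(y.reshape(1, -1)), matrix(0.0))
    lam = np.ravel(sol['x'])

    sv = lam > 1e-6                               # all support vectors (lambda_i > 0)
    on_margin = sv & (lam < C - 1e-6)             # Case 2 points, used for w0 (Eqn 4.53b)
    w = ((lam[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)            # Eqn (4.53a)
    w0 = np.mean(y[on_margin] - X[on_margin] @ w)
    return w, w0, lam
```

If no multiplier falls strictly between 0 and C, w0 has to be obtained differently (e.g., directly from the KKT conditions); the sketch does not handle that corner case.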

4.6 KERNEL-INDUCED FEATURE SPACES

So far we have considered parametric models for classification (and regression) in which the form
of mapping from input vector x to class label y (or continuous real-valued output y) is linear. One
common strategy in machine learning is to learn nonlinear functions with a linear machine. For this,
we need to change the representation of the data:
$$\mathbf{x} = \{x_1, \ldots, x_n\} \;\Rightarrow\; \boldsymbol{\phi}(\mathbf{x}) = \{\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\} \qquad (4.59)$$

where φ(·) is a nonlinear map from the input feature space to some other feature space. The selection of φ is constrained to yield a new feature space in which the linear machine can be used. Hence, the set of hypotheses we consider will be functions of the type

$$g(\boldsymbol{\phi}(\mathbf{x}), \mathbf{w}) = \sum_{l=1}^{m} w_l\, \phi_l(\mathbf{x}) + w_0 = \mathbf{w}^T\boldsymbol{\phi} + w_0 \qquad (4.60)$$

where w is now the m-dimensional weight parameter.
This means that we will build nonlinear machines in two steps:
(i) first, a fixed nonlinear mapping transforms the data into a new feature space, and
(ii) then a linear machine is used to classify the data in the new feature space.
By selecting the m functions φl(x) judiciously, one can approximate any nonlinear discriminant function in x by such a linear expansion. The resulting discriminant function is not linear in x, but it is linear in φ(x). The m functions merely map points in the n-dimensional x-space to points in the m-dimensional φ-space. The homogeneous discriminant function wTφ + w0 separates points in the transformed space by a hyperplane. Thus, the mapping from x to φ reduces the problem of finding a nonlinear discriminant function in x to one of finding a linear discriminant function in φ(x). With a clever choice of nonlinear φ-functions, we can obtain arbitrary nonlinear decision regions in x-space, in particular those leading to minimum errors.

The central difficulty is, naturally, choosing the appropriate m-dimensional mapping φ(x) of the original input feature vectors x. This approach to the design of nonlinear machines rests on the expectation that patterns which are not linearly separable in x-space become linearly separable in φ-space. The expectation is based on the observation that by selecting the functions φ_l(x); l = 1, …, m, judiciously and letting m be sufficiently large, one can approximate any nonlinear discriminant function in x by the linear expansion (4.60).
In the sequel, we will first try to justify the expectation that by going to a higher-dimensional space, the classification task may be transformed into a linear one, and then study popular alternatives for the choice of functions φ_l(·).
Let us first attempt to understand intuitively why going to a higher-dimensional space increases the chances of a linear separation. For linear function expansions such as

$$g(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{n} w_j x_j + w_0$$

or

$$g(\boldsymbol{\phi}(\mathbf{x}), \mathbf{w}) = \sum_{l=1}^{m} w_l\,\phi_l(\mathbf{x}) + w_0$$

the VC dimension increases as the number of weight parameters increases. Using a transformation φ(x) to a higher-dimensional φ-space (m > n) typically amounts to increasing the capacity (refer to Section 2.3) of the learning machine, and rendering separable problems that are not linearly separable to start with.
Cover's theorem [58] formalizes the intuition that the number of linear separations increases with the dimensionality. According to this theorem, the number of possible linear separations of N well-distributed points in an n-dimensional space (N > n + 1) equals

$$2\sum_{j=0}^{n} \binom{N-1}{j} = 2\sum_{j=0}^{n} \frac{(N-1)!}{(N-1-j)!\, j!}$$

The more we increase n, the more terms there are in the sum, and thus the larger is the resulting
number.
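To get a feel for how quickly this count grows with n, the following small sketch (Python; purely illustrative) evaluates Cover's sum for a fixed number of points N and increasing dimension n.

# Hedged sketch: evaluating Cover's count 2 * sum_{j=0}^{n} C(N-1, j).
from math import comb

def cover_count(N: int, n: int) -> int:
    """Number of linearly separable dichotomies of N well-distributed points in n dimensions."""
    return 2 * sum(comb(N - 1, j) for j in range(n + 1))

N = 20
total = 2 ** N                      # total number of dichotomies of N points
for n in (1, 2, 5, 10):
    print(f"n={n:2d}: {cover_count(N, n)} separable dichotomies out of {total}")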
The linear separability advantage of high-dimensional feature spaces comes at a cost: computational complexity. The approach of selecting φ_l(x); l = 1, …, m, with m → ∞ may not work; such a classifier would have too many free parameters to be determined from a limited number of training data. Also, if the number of parameters is too large relative to the number of training examples, the resulting model will overfit the data, degrading generalization performance.
So there are two problems: first, how do we choose the nonlinear mapping to a higher-dimensional space? Second, the computations involved will be costly. It so happens that in solving the quadratic optimization problem of the linear SVM (i.e., when searching for a linear SVM in the new, higher-dimensional space), the training tuples appear only in the form of dot products (refer to Eqn (4.51)):

⟨φ(x^(i)), φ(x^(k))⟩ = [φ(x^(i))]^T [φ(x^(k))]

Note that the dot product requires one multiplication and one addition for each of the m dimensions, and we have to compute the dot product with every one of the support vectors. In training, similar dot products must be computed many times. The dot-product computation is therefore heavy and costly, and we need a trick to avoid it.
Luckily, there is such a mathematical trick. Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function, K(x^(i), x^(k)), to the original input data. That is,

K(x^(i), x^(k)) = ⟨φ(x^(i)), φ(x^(k))⟩

In other words, everywhere that ⟨φ(x^(i)), φ(x^(k))⟩ appears in the training algorithm, we can replace it with K(x^(i), x^(k)). In this way, all calculations are made in the original input space, which is of potentially much lower dimensionality.
Another feature of the kernel trick addresses the problem of choosing the nonlinear mapping φ(x): it turns out that we do not even have to know what the mapping is. That is, admissible kernel substitutions K(x^(i), x^(k)) can be determined without first selecting a mapping function φ(x).
Let us study the kernel trick in a little more detail to appreciate this important property of SVMs for solving nonlinear classification problems.

Example 4.3
Given a database with feature vectors x = [x1 x2]^T, consider the nonlinear mapping

$$(x_1, x_2) \;\xrightarrow{\;\boldsymbol{\phi}\;}\; (\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \phi_3(\mathbf{x}), \phi_4(\mathbf{x}), \phi_5(\mathbf{x}), \phi_6(\mathbf{x}))$$

$$\boldsymbol{\phi}(\mathbf{x}) = [1 \;\; \sqrt{2}\,x_1 \;\; \sqrt{2}\,x_2 \;\; \sqrt{2}\,x_1 x_2 \;\; x_1^2 \;\; x_2^2]^T$$

For this mapping, the dot product is

$$\begin{aligned}
[\boldsymbol{\phi}(\mathbf{x}^{(i)})]^T \boldsymbol{\phi}(\mathbf{x}^{(k)})
&= [1 \;\; \sqrt{2}\,x_1^{(i)} \;\; \sqrt{2}\,x_2^{(i)} \;\; \sqrt{2}\,x_1^{(i)} x_2^{(i)} \;\; (x_1^{(i)})^2 \;\; (x_2^{(i)})^2]
\begin{bmatrix} 1 \\ \sqrt{2}\,x_1^{(k)} \\ \sqrt{2}\,x_2^{(k)} \\ \sqrt{2}\,x_1^{(k)} x_2^{(k)} \\ (x_1^{(k)})^2 \\ (x_2^{(k)})^2 \end{bmatrix} \\
&= 1 + 2x_1^{(i)} x_1^{(k)} + 2x_2^{(i)} x_2^{(k)} + 2x_1^{(i)} x_2^{(i)} x_1^{(k)} x_2^{(k)} + (x_1^{(i)})^2 (x_1^{(k)})^2 + (x_2^{(i)})^2 (x_2^{(k)})^2 \\
&= [(\mathbf{x}^{(i)})^T \mathbf{x}^{(k)} + 1]^2
\end{aligned}$$



The inner product of vectors in the new (higher-dimensional) space has thus been expressed as a function of the inner product of the corresponding vectors in the original (lower-dimensional) space. The kernel function is

K(x^(i), x^(k)) = [(x^(i))^T x^(k) + 1]^2
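The identity derived above is easy to confirm numerically. The sketch below (Python; the explicit map phi and the two sample vectors are illustrative assumptions) compares the feature-space dot product with the kernel evaluated directly in the input space.

# Hedged sketch: verifying [phi(x)]^T phi(z) == (x^T z + 1)^2 for the mapping of Example 4.3.
import numpy as np

def phi(x):
    """Explicit feature map for 2-D input x = [x1, x2]."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

def poly_kernel(x, z):
    """Kernel computed directly in the input space."""
    return (np.dot(x, z) + 1.0) ** 2

x_i = np.array([1.0, 2.0])   # illustrative sample vectors
x_k = np.array([3.0, -1.0])

explicit = np.dot(phi(x_i), phi(x_k))
via_kernel = poly_kernel(x_i, x_k)
print(explicit, via_kernel)   # both equal (1*3 + 2*(-1) + 1)^2 = 4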

Constructing Kernels
In order to exploit the kernel trick, we need to be able to construct valid kernel functions. A straightforward way of computing a kernel K(·) from a map φ(·) is to choose a feature mapping φ(x) and then use it to find the corresponding kernel:

$$K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) = [\boldsymbol{\phi}(\mathbf{x}^{(i)})]^T \boldsymbol{\phi}(\mathbf{x}^{(k)}) \qquad (4.61)$$

The kernel trick makes it possible to map the data implicitly into a higher-dimensional feature space, and to train a linear machine in such a space, potentially side-stepping the computational problems inherent in this high-dimensional space. One of the curious facts about using a kernel is that we do not need to know the underlying feature map φ(·) in order to be able to learn in the new feature space. The kernel trick shows a way of computing dot products in these high-dimensional spaces without explicitly mapping into them.
The straightforward way of computing the kernel K(·) from the map φ(·) given in Eqn (4.61) can be inverted, i.e., we can choose a kernel rather than the mapping and apply it to the learning algorithm directly.
We can, of course, first propose a kernel function and then expand it to identify φ(x). Identification of φ is not needed if we can show whether the proposed function is a kernel or not without recourse to the corresponding mapping function.

Mercer's Theorem: In the following, we introduce Mercer's theorem, which provides a test of whether a function K(x^(i), x^(k)) constitutes a valid kernel without having to construct the function φ(x) explicitly.
Let K(x^(i), x^(k)) be a symmetric function on a finite input space. Then K(x^(i), x^(k)) is a kernel function if and only if the matrix

$$\mathbf{K} = \begin{bmatrix}
K(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & K(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) & \cdots & K(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\
\vdots & \vdots & K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) & \vdots \\
K(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & K(\mathbf{x}^{(N)}, \mathbf{x}^{(2)}) & \cdots & K(\mathbf{x}^{(N)}, \mathbf{x}^{(N)})
\end{bmatrix} \qquad (4.62)$$

is positive semidefinite.
For any function K(x^(i), x^(k)) satisfying Mercer's theorem, there exists a space in which K(x^(i), x^(k)) defines an inner product. What Mercer's theorem does not disclose, however, is how to find this space. That is, we do not have a general tool to construct the mapping function φ(·) once we know the kernel function (in simple cases, we can expand K(x^(i), x^(k)) and rearrange it to give [φ(x^(i))]^T φ(x^(k))). Furthermore, we lack the means to know the dimensionality of the space, which can even be infinite. For further details, see [53].
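Although Mercer's theorem is stated over the input space, the positive-semidefiniteness condition can be checked empirically on any finite sample by examining the eigenvalues of the kernel matrix K of Eqn (4.62). A possible sketch (Python; the sample points and the candidate kernel are illustrative assumptions):

# Hedged sketch: empirical Mercer check -- build the kernel matrix on sample points
# and verify that all its eigenvalues are (numerically) non-negative.
import numpy as np

def kernel_matrix(X, kernel):
    N = X.shape[0]
    return np.array([[kernel(X[i], X[k]) for k in range(N)] for i in range(N)])

def is_psd(K, tol=1e-10):
    eigvals = np.linalg.eigvalsh(K)      # K is symmetric, so eigvalsh applies
    return bool(np.all(eigvals >= -tol))

rng = np.random.RandomState(1)
X = rng.randn(30, 2)                     # illustrative sample of input points

poly = lambda x, z: (np.dot(x, z) + 1.0) ** 2
print("polynomial kernel PSD on sample:", is_psd(kernel_matrix(X, poly)))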

What are some of the kernel functions that could be used? The properties of kernel functions that can replace the dot product in the scenario just described have been studied extensively. The most popular general-purpose kernel functions are:

Polynomial kernel of degree d

$$K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) = ((\mathbf{x}^{(i)})^T \mathbf{x}^{(k)} + c)^d;\;\; c > 0,\; d \ge 2 \qquad (4.63)$$

Gaussian radial basis function kernel

$$K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) = \exp\!\left( -\frac{\|\mathbf{x}^{(i)} - \mathbf{x}^{(k)}\|^2}{2\sigma^2} \right);\;\; \sigma > 0 \qquad (4.64)$$

The feature vector that corresponds to the Gaussian kernel has infinite dimensionality.

Sigmoidal kernel

$$K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) = \tanh(\beta\,(\mathbf{x}^{(i)})^T \mathbf{x}^{(k)} + \gamma) \qquad (4.65)$$

for appropriate values of β and γ so that Mercer's conditions are satisfied. One possibility is β = 2, γ = 1.
Each of these results in a different nonlinear classifier in (the original) input space.
There are no golden rules for determining which admissible kernel will result in the most
accurate SVM. In practice, the kernel chosen does not generally make a large difference in resulting
accuracy. SVM training always finds a global solution, unlike neural networks (discussed in the
next chapter) where many local minima usually exist.
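For reference, the three kernels of Eqns (4.63)-(4.65) are straightforward to code; a possible sketch follows (Python; the parameter values shown are illustrative defaults, not recommendations).

# Hedged sketch: the three general-purpose kernels of Eqns (4.63)-(4.65).
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel (x^T z + c)^d, with c > 0, d >= 2."""
    return (np.dot(x, z) + c) ** d

def gaussian_rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoidal_kernel(x, z, beta=2.0, gamma=1.0):
    """Sigmoidal kernel tanh(beta x^T z + gamma); Mercer's conditions hold only for some (beta, gamma)."""
    return np.tanh(beta * np.dot(x, z) + gamma)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_rbf_kernel(x, z), sigmoidal_kernel(x, z))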

4.7 NONLINEAR CLASSIFIER

The SVM formulations discussed till now require the Class 1 and Class 2 examples to be linearly separable, that is, the decision boundary must be a hyperplane. For many real-life datasets, however, the decision boundaries are nonlinear. To deal with the nonlinear case, the formulation and solution methods employed for the linear case are still applicable; only the input data are transformed from the original space into another space (generally a much higher-dimensional space) so that a linear decision boundary can separate Class 1 examples from Class 2 examples in the transformed space, called the feature space. The original data space is known as the input space.
Let the set of training (data) examples be

D = {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))} (4.66)

where x = [x1 x2 … xn]^T.


Figure 4.13 illustrates the process. In the input space, the training examples cannot be linearly
separated; in the feature space, they can be separated linearly.

Figure 4.13 Transformation from the input space to the feature space: x → φ(x)

With the transformation, the optimization problem in (4.48a) becomes

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\zeta_i \\
\text{subject to} \quad & y^{(i)}(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)}) + w_0) \ge 1 - \zeta_i;\;\; i = 1, \ldots, N \\
& \zeta_i \ge 0;\;\; i = 1, \ldots, N
\end{aligned} \qquad (4.67)$$

The corresponding dual is (refer to (4.51))

$$\begin{aligned}
\text{maximize} \quad & L^*(\boldsymbol{\lambda}) = \sum_{i=1}^{N}\lambda_i - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}\lambda_i\lambda_k\, y^{(i)} y^{(k)} [\boldsymbol{\phi}(\mathbf{x}^{(i)})]^T \boldsymbol{\phi}(\mathbf{x}^{(k)}) \\
\text{subject to} \quad & \sum_{i=1}^{N}\lambda_i y^{(i)} = 0 \\
& 0 \le \lambda_i \le C;\;\; i = 1, \ldots, N
\end{aligned} \qquad (4.68)$$
The potential issue with this strategy is that it may suffer from the curse of dimensionality. The number of dimensions in the feature space may be very large for certain useful transformations, even with a reasonable number of attributes in the input space. Luckily, explicit transformations are not required: for the dual problem (4.68), building the decision boundary only requires the evaluation of [φ(x^(i))]^T φ(x) in the feature space. With reference to (4.55), we have the following decision boundary in the feature space:

$$\sum_{i=1}^{N} \lambda_i\, y^{(i)} [\boldsymbol{\phi}(\mathbf{x}^{(i)})]^T \boldsymbol{\phi}(\mathbf{x}) + w_0 = 0 \qquad (4.69)$$

Thus, if we can compute [φ(x^(i))]^T φ(x) in the feature space using the input vectors x^(i) and x directly, then we do not need to know the feature vector φ(x) or even the mapping φ itself. In SVM, this is done through the use of kernel functions, denoted by K (details given in the previous section):

K(x^(i), x) = [φ(x^(i))]^T φ(x) (4.70)

We replace [φ(x^(i))]^T φ(x) in (4.69) with the kernel (4.70); we never need to know explicitly what φ is.

Example 4.4
The basic idea in designing nonlinear SVMs is to map input vectors x ∈ ℝ^n into higher-dimensional feature-space vectors z ∈ ℝ^m; m > n: z = φ(x), where φ represents a mapping ℝ^n → ℝ^m. Note that the input space is spanned by the components x_j; j = 1, …, n, of an input vector x, and the feature space is spanned by the components z_l; l = 1, …, m, of the vector z. By performing such a mapping, we expect that in the feature space, the learning algorithm will be able to linearly separate the mapped data by applying the linear SVM formulation. This approach leads to the solution of a quadratic optimization problem with inequality constraints in z-space. The solution for an indicator function, sgn(w^T z + w0), which is a linear classifier in the feature space, creates a nonlinear separating hypersurface in the original input space.
There are two basic problems in taking this approach when mapping an input x-space into a higher-order z-space:
1. Choice of φ(x), which should result in a rich class of decision hypersurfaces.
2. Calculation of the scalar products z^(i)T z^(k), which can be computationally very discouraging if the feature-space dimension m is very large.
The explosion in dimensionality from n to m can be avoided in the calculations by noticing that in the quadratic optimization problem (Eqn (4.51)), the training data appear only in the form of scalar products x^(i)T x^(k). These products are replaced by scalar products z^(i)T z^(k) in the feature space, and the latter are expressed by using a symmetric kernel function K(x^(i), x^(k)) that results in a positive semidefinite kernel matrix (Eqn (4.62))

$$\mathbf{K} = \begin{bmatrix}
K(\mathbf{x}^{(1)}, \mathbf{x}^{(1)}) & K(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) & \cdots & K(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) \\
\vdots & \vdots & K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) & \vdots \\
K(\mathbf{x}^{(N)}, \mathbf{x}^{(1)}) & K(\mathbf{x}^{(N)}, \mathbf{x}^{(2)}) & \cdots & K(\mathbf{x}^{(N)}, \mathbf{x}^{(N)})
\end{bmatrix} \qquad (4.71)$$

The required scalar products z^(i)T z^(k) in the feature space are calculated directly by computing the kernels K(x^(i), x^(k)) for the given training data vectors in the input space. In this way, we bypass the computational complexity of an extremely high-dimensional feature space. By applying kernels, we do not even have to know what the actual mapping φ(x) is. Thus, using the chosen kernel K(x^(i), x^(k)), an SVM can be constructed that operates in an infinite-dimensional space.

Another issue with the kernel-induced nonlinear classification approach is the choice of a particular type of kernel function. There is no clear-cut answer; no theoretical results yet prescribe a particular type of kernel function for a given application. For the time being, one can only suggest that various models be tried on a given dataset and the one with the best generalization capacity be chosen.
The learning algorithm for the nonlinear soft-margin SVM classifier has already been given. Let us give the matrix formulation here, which is helpful when using standard quadratic optimization software.
In matrix form, the nonlinear SVM (soft-margin) formulation is (Eqn (4.58))

$$\begin{aligned}
\underset{\boldsymbol{\lambda}}{\text{minimize}} \quad & \Big\{ -\mathbf{e}^T\boldsymbol{\lambda} + \tfrac{1}{2}\boldsymbol{\lambda}^T \mathbf{Q}\,\boldsymbol{\lambda} \Big\} \\
\text{subject to} \quad & \mathbf{y}^T\boldsymbol{\lambda} = 0, \;\; \mathbf{0} \le \boldsymbol{\lambda} \le C\mathbf{e}
\end{aligned} \qquad (4.72)$$

Here, Q = YKY, and K is the kernel matrix (Eqn (4.71)). This formulation follows from Eqn (4.68).
As an illustration, consider a toy dataset with n = 1.

$$\mathbf{X} = \begin{bmatrix} 1 \\ 2 \\ 5 \\ 6 \end{bmatrix};\;\; \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ 1 \\ -1 \end{bmatrix}$$

For the choice of the kernel function K(x^(i), x^(k)) = (x^(i)T x^(k) + 1)^2, the matrix Q = YKY is given by:

$$\mathbf{Q} = \begin{bmatrix}
4 & 9 & -36 & 49 \\
9 & 25 & -121 & 169 \\
-36 & -121 & 676 & -961 \\
49 & 169 & -961 & 1369
\end{bmatrix}$$

The SVM formulation with C = 50 yields [56]

λ = [0  2.5  7.333  4.833]^T = [λ1 λ2 λ3 λ4]^T.
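As a check, the multipliers quoted above can be reproduced by feeding Q, y and C of formulation (4.72) to an off-the-shelf QP solver. The sketch below uses the cvxopt package (an assumed choice of tooling; any standard quadratic programming routine would serve).

# Hedged sketch: solving the dual QP (4.72) for the toy dataset with cvxopt.
import numpy as np
from cvxopt import matrix, solvers

x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])
C = 50.0
N = len(x)

K = (np.outer(x, x) + 1.0) ** 2           # kernel matrix, K_ik = (x_i x_k + 1)^2
Q = np.outer(y, y) * K                    # Q = YKY

P = matrix(Q)
q = matrix(-np.ones(N))
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -lambda <= 0 and lambda <= C
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
A = matrix(y.reshape(1, -1))                           # equality constraint y^T lambda = 0
b = matrix(0.0)

solvers.options["show_progress"] = False
sol = solvers.qp(P, q, G, h, A, b)
lam = np.array(sol["x"]).ravel()
print(np.round(lam, 3))   # expected approximately [0, 2.5, 7.333, 4.833]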


From Eqn (4.55), we have

$$\mathbf{w} = \sum_{i \in svindex} \lambda_i\, y^{(i)} \mathbf{z}^{(i)}; \quad svindex = \{2, 3, 4\}$$

This gives

$$\begin{aligned}
\mathbf{w}^T\mathbf{z} &= \sum_{i \in svindex} \lambda_i\, y^{(i)} \mathbf{z}^{(i)T}\mathbf{z} \\
&= \sum_{i \in svindex} \lambda_i\, y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}) \qquad (4.73) \\
&= \sum_{i \in svindex} \lambda_i\, y^{(i)} (\mathbf{x}^{(i)T}\mathbf{x} + 1)^2
\end{aligned}$$

For the given data,

$$\begin{aligned}
\mathbf{w}^T\mathbf{z} &= (2.5)(-1)(2x + 1)^2 + (7.333)(1)(5x + 1)^2 + (4.833)(-1)(6x + 1)^2 \\
&= -0.667x^2 + 5.333x
\end{aligned}$$
The bias w0 is determined from the requirement that at the support-vector points x = 2, 5 and 6,
the outputs must be –1, +1 and –1, respectively. From Eqn (4.55), we have

$$\begin{aligned}
w_0 &= \frac{1}{|svindex|} \sum_{k \in svindex} \Big[ y^{(k)} - \Big( \sum_{i \in svindex} \lambda_i\, y^{(i)} \mathbf{z}^{(i)T}\mathbf{z}^{(k)} \Big) \Big] \\
&= \frac{1}{|svindex|} \sum_{k \in svindex} \Big[ y^{(k)} - \Big( \sum_{i \in svindex} \lambda_i\, y^{(i)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) \Big) \Big] \qquad (4.74) \\
&= \frac{1}{|svindex|} \sum_{k \in svindex} \Big[ y^{(k)} - \Big( \sum_{i \in svindex} \lambda_i\, y^{(i)} (\mathbf{x}^{(i)T}\mathbf{x}^{(k)} + 1)^2 \Big) \Big] \\
&= \tfrac{1}{3}\big[ -1 - (-0.667\,(2)^2 + 5.333\,(2)) + 1 - (-0.667\,(5)^2 + 5.333\,(5)) - 1 - (-0.667\,(6)^2 + 5.333\,(6)) \big] \\
&= -9
\end{aligned}$$
Therefore, the nonlinear decision function in the input space is

g(x) = −0.667x² + 5.333x − 9

and the indicator function is

i_F = ŷ = sgn(g(x)) = sgn(−0.667x² + 5.333x − 9)
consideration, are shown in Fig. 4.14.


Figure 4.14 Nonlinear SV classification for Example 4.4: the decision function g(x) and the indicator function plotted over 1 ≤ x ≤ 6
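The hand computation of Example 4.4 can also be cross-checked against a library SVM trained on the same four points. The sketch below assumes scikit-learn; its degree-2 polynomial kernel with gamma = 1 and coef0 = 1 matches the kernel (x^T x' + 1)^2 used above.

# Hedged sketch: cross-checking Example 4.4 with scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [5.0], [6.0]])
y = np.array([-1, -1, 1, -1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=50.0)
clf.fit(X, y)

print("support vectors :", clf.support_vectors_.ravel())   # expect [2, 5, 6]
print("y_i * lambda_i  :", clf.dual_coef_.ravel())          # expect approx [-2.5, 7.333, -4.833]
print("w0              :", clf.intercept_[0])               # expect approx -9

g = lambda x: -0.667 * x**2 + 5.333 * x - 9                 # hand-derived decision function
xs = np.array([[1.0], [3.0], [4.0], [6.0]])
print("library g(x)    :", clf.decision_function(xs))
print("hand-derived    :", g(xs.ravel()))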

4.8 REGRESSION BY SUPPORT VECTOR MACHINES

Initially developed for solving classification problems, SV techniques can also be successfully applied to regression (numeric prediction) problems. Unlike classification (pattern recognition) problems, where the desired outputs are discrete values y ∈ {+1, −1}, here the system responses y ∈ ℝ are continuous values. The general regression learning problem is set up as follows: the learning machine is given N training data

D : {(x^(1), y^(1)), …, (x^(N), y^(N))}; x ∈ ℝ^n, y ∈ ℝ (4.75)

where the inputs x are n-dimensional vectors and the scalar output y takes continuous values. The objective is to learn the input–output relationship ŷ = f(x): a nonlinear regression model.
In regression, typically some measure of approximation error is used instead of the margin between an optimal separating hyperplane and support vectors, which was used in the design of SV classifiers. In our regression formulations described in earlier chapters, we used the sum-of-error-squares criterion (Section 2.7). Here, in SV regression, our goal is to find a function ŷ = f(x) that has at most ε deviation (where ε is a prescribed parameter) from the actually obtained targets y for all the training data. In other words, we do not care about errors as long as they are less than ε, but any deviation larger than this is treated as a regression error.
To account for the regression error in our SV formulation, we use the ε-insensitive loss function:

$$|y - f(\mathbf{x})|_\varepsilon \;\triangleq\; \begin{cases} 0 & \text{if } |y - f(\mathbf{x})| \le \varepsilon \\ |y - f(\mathbf{x})| - \varepsilon & \text{otherwise} \end{cases} \qquad (4.76)$$
This loss (error) function defines an ε-insensitivity zone (ε-tube). We tolerate errors up to ε (data points (x^(i), y^(i)) within the ε-insensitivity zone or ε-tube), and errors beyond it (above/below the ε-tube) have a linear effect (unlike the sum-of-error-squares criterion). The error function is, therefore, more tolerant to noise and is thus more robust. There is a region of no error, which results in sparseness.
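The ε-insensitive loss of Eqn (4.76) is a one-liner in code; the sketch below (Python, purely illustrative) contrasts it with the squared error on a few residuals.

# Hedged sketch: the epsilon-insensitive loss of Eqn (4.76) versus squared error.
import numpy as np

def eps_insensitive_loss(residual, eps=0.5):
    """|y - f(x)|_eps: zero inside the eps-tube, linear outside it."""
    return np.maximum(np.abs(residual) - eps, 0.0)

residuals = np.array([-1.5, -0.4, 0.0, 0.3, 2.0])   # y - f(x) for a few illustrative points
print("eps-insensitive:", eps_insensitive_loss(residuals, eps=0.5))   # [1.0, 0.0, 0.0, 0.0, 1.5]
print("squared error  :", residuals ** 2)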
The parameter ε defines the required accuracy of approximation. An increase in ε means a relaxation of the accuracy requirement; it has a smoothing effect when modeling highly noisy data. On the other hand, a decrease in ε may result in a complex model that overfits the data (refer to Fig. 4.15).

Figure 4.15 One-dimensional SV regression with (a) ε = 0.1 and (b) ε = 0.5: each panel shows the data points, the fitted function f(x), and the ε-tube

In formulating an SV algorithm for regression, the objective is to minimize the error (loss) and ||w||² simultaneously. The role of ||w||² in the objective function is to reduce model complexity, thereby preventing overfitting and improving generalization.

4.8.1 Linear Regression


For pedagogical reasons, we begin by describing the SV formulation for linear regression, i.e., functions f(·) of the form

f(x) = w^T x + w0;  w ∈ ℝ^n, w0 ∈ ℝ (4.77)

Analogous to the 'soft margin' classifier described earlier, we introduce (non-negative) slack variables ζ_i, ζ_i*; i = 1, …, N, to measure the deviation of training examples outside the ε-insensitivity zone. Figure 4.16 shows what the ε-insensitivity zone looks like when the regression is linear.

Figure 4.16 (a) One-dimensional support vector linear regression: training data points (x^(i), y^(i)), the predicted f(x), the ε-tube, and the slacks ζ_i, ζ_i* for points above/below the tube; (b) the ε-insensitive loss as a function of y^(i) − ŷ^(i)

If a point (x^(i), y^(i)) falls within the ε-tube, the associated slacks ζ_i, ζ_i* are zero. If it lies above the tube, ζ_i > 0 and ζ_i* = 0; if it lies below the tube, ζ_i = 0 and ζ_i* > 0.

|y^(i) − f(x^(i))| − ε = ζ_i  for data 'above' the ε-insensitivity zone (4.78a)
|y^(i) − f(x^(i))| − ε = ζ_i*  for data 'below' the ε-insensitivity zone (4.78b)

The loss (error) is equal to zero for training data points inside the tube (|y^(i) − ŷ^(i)| ≤ ε); the loss is ζ_i for data 'above' the tube (y^(i) − ŷ^(i) − ε = ζ_i), and ζ_i* for data 'below' the tube (ŷ^(i) − y^(i) − ε = ζ_i*). Only the data points outside the tube contribute to the loss, with deviations penalized in a linear fashion.
Minimizing ||w||² simultaneously with minimizing the loss results in small values for w, and thereby a flat function f(x) given by (4.77).
Analogous to the SV formulation for soft-margin linear classifiers, we arrive at the following formulation of linear regression:

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}(\zeta_i + \zeta_i^*) \\
\text{subject to} \quad & y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} - w_0 \le \varepsilon + \zeta_i;\;\; i = 1, \ldots, N \\
& \mathbf{w}^T\mathbf{x}^{(i)} + w_0 - y^{(i)} \le \varepsilon + \zeta_i^*;\;\; i = 1, \ldots, N \\
& \zeta_i, \zeta_i^* \ge 0;\;\; i = 1, \ldots, N
\end{aligned} \qquad (4.79)$$
Note that the constant C > 0, which controls the trade-off between the approximation error and the weight-vector norm ||w||, is a design parameter chosen by the user. An increase in C penalizes larger errors (large ζ_i and ζ_i*) and in this way leads to a decrease in approximation error. However, this can be achieved only by increasing the weight-vector norm ||w||, which does not guarantee good generalization performance. Another design parameter chosen by the user is the required precision, embodied in the value of ε that defines the size of the ε-tube.
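These two design parameters map directly onto the epsilon and C arguments of common SVR implementations. The hedged sketch below assumes scikit-learn's SVR and synthetic noisy data; it merely illustrates how widening the ε-tube yields a sparser model.

# Hedged sketch: effect of the design parameters C and epsilon in epsilon-SV regression.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(-4, 4, 60)).reshape(-1, 1)
y = np.sinc(x).ravel() + 0.1 * rng.randn(60)          # noisy 1-D target (illustrative)

for eps in (0.1, 0.5):
    model = SVR(kernel="rbf", C=10.0, epsilon=eps)
    model.fit(x, y)
    # a wider tube is expected to leave fewer points outside it, hence fewer support vectors
    print(f"epsilon={eps}: {len(model.support_)} support vectors")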
As with procedures applied to SV classifiers, the constrained optimization problem (4.79) is
solved by forming the Lagrangian:
$$\begin{aligned}
L(\mathbf{w}, w_0, \boldsymbol{\zeta}, \boldsymbol{\zeta}^*, \boldsymbol{\lambda}, \boldsymbol{\lambda}^*, \boldsymbol{\mu}, \boldsymbol{\mu}^*)
= \;& \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}(\zeta_i + \zeta_i^*)
- \sum_{i=1}^{N}\lambda_i\,(\varepsilon + \zeta_i - y^{(i)} + \mathbf{w}^T\mathbf{x}^{(i)} + w_0) \\
& - \sum_{i=1}^{N}\lambda_i^*\,(\varepsilon + \zeta_i^* + y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)} - w_0)
- \sum_{i=1}^{N}(\mu_i\,\zeta_i + \mu_i^*\,\zeta_i^*)
\end{aligned} \qquad (4.80)$$

where w, w0, ζ_i and ζ_i* are the primal variables, and λ_i, λ_i*, μ_i, μ_i* ≥ 0 are the dual variables.
The KKT conditions are as follows:

(i)
$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{N}(\lambda_i - \lambda_i^*)\,\mathbf{x}^{(i)} = \mathbf{0}$$
$$\frac{\partial L}{\partial w_0} = \sum_{i=1}^{N}(\lambda_i^* - \lambda_i) = 0$$
$$\frac{\partial L}{\partial \zeta_i} = C - \lambda_i - \mu_i = 0;\;\; i = 1, \ldots, N$$
$$\frac{\partial L}{\partial \zeta_i^*} = C - \lambda_i^* - \mu_i^* = 0;\;\; i = 1, \ldots, N$$

(ii) ε + ζ_i − y^(i) + w^T x^(i) + w0 ≥ 0; i = 1, …, N
     ε + ζ_i* + y^(i) − w^T x^(i) − w0 ≥ 0; i = 1, …, N (4.81)
     ζ_i, ζ_i* ≥ 0; i = 1, …, N

(iii) λ_i, λ_i*, μ_i, μ_i* ≥ 0; i = 1, …, N

(iv) λ_i(ε + ζ_i − y^(i) + w^T x^(i) + w0) = 0; i = 1, …, N
     λ_i*(ε + ζ_i* + y^(i) − w^T x^(i) − w0) = 0; i = 1, …, N
     μ_i ζ_i = 0; i = 1, …, N
     μ_i* ζ_i* = 0; i = 1, …, N

Substituting the relations in condition (i) of the KKT conditions (4.81) into the Lagrangian (4.80) yields the dual objective function. The procedure parallels what has been followed earlier. The resulting dual optimization problem is

$$\begin{aligned}
\text{maximize} \quad & L^*(\boldsymbol{\lambda}, \boldsymbol{\lambda}^*) = -\varepsilon\sum_{i=1}^{N}(\lambda_i + \lambda_i^*) + \sum_{i=1}^{N}(\lambda_i - \lambda_i^*)\,y^{(i)} - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}(\lambda_i - \lambda_i^*)(\lambda_k - \lambda_k^*)\,\mathbf{x}^{(i)T}\mathbf{x}^{(k)} \\
\text{subject to} \quad & \sum_{i=1}^{N}(\lambda_i - \lambda_i^*) = 0;\;\; \lambda_i, \lambda_i^* \in [0, C]
\end{aligned} \qquad (4.82)$$

From condition (i) of the KKT conditions (4.81), we have

$$\mathbf{w} = \sum_{i=1}^{N}(\lambda_i - \lambda_i^*)\,\mathbf{x}^{(i)} \qquad (4.83)$$

Thus, the weight vector w is completely described as a linear combination of the training patterns x^(i). One of the most important properties of SVM is that the solution is sparse in λ_i, λ_i*. For |ŷ^(i) − y^(i)| < ε, the second factors in the following KKT conditions (conditions (iv) in (4.81)):

λ_i(ε + ζ_i − y^(i) + w^T x^(i) + w0) = λ_i(ε + ζ_i − y^(i) + ŷ^(i)) = 0
λ_i*(ε + ζ_i* + y^(i) − w^T x^(i) − w0) = λ_i*(ε + ζ_i* + y^(i) − ŷ^(i)) = 0    (4.84)

are nonzero; hence λ_i, λ_i* have to be zero. Equivalently, all the data points inside the ε-insensitive tube (a large number of training examples belong to this category) have the corresponding λ_i, λ_i* equal to zero. Further, from Eqn (4.84), it follows that only for |ŷ^(i) − y^(i)| ≥ ε may the dual variables λ_i, λ_i* be nonzero. Since there can never be a pair of dual variables λ_i, λ_i* which are both simultaneously nonzero, as this would require slacks in both directions ('above' the tube and 'below' the tube), we have λ_i × λ_i* = 0.
From conditions (i) and (iv) of the KKT conditions (4.81), it follows that

(C − λ_i)ζ_i = 0
(C − λ_i*)ζ_i* = 0    (4.85)

Thus, the only samples (x^(i), y^(i)) with corresponding λ_i, λ_i* = C lie outside the ε-insensitive tube around f. For λ_i, λ_i* ∈ (0, C), we have ζ_i, ζ_i* = 0 and, moreover, the second factor in Eqn (4.84) has to vanish. Hence, w0 can be computed as follows:

w0 = y^(i) − w^T x^(i) − ε  for λ_i ∈ (0, C)
w0 = y^(i) − w^T x^(i) + ε  for λ_i* ∈ (0, C)    (4.86)

All the data points with λ_i, λ_i* ∈ (0, C) may be used to compute w0, and their average then taken as the final value of w0.
Once we solve the quadratic optimization problem for λ and λ*, we see that all instances that fall inside the ε-tube have λ_i = λ_i* = 0; these are the instances that are fitted with enough precision. The support vectors satisfy either λ_i > 0 or λ_i* > 0, and are of two types. They may be instances that lie on the boundary of the tube (either λ_i or λ_i* is between 0 and C); we use these to calculate w0. Instances that fall outside the ε-tube (λ_i or λ_i* equal to C) are support vectors of the second type; for these instances, we do not have a good fit.
Using condition (i) of the KKT conditions (4.81), we can write the fitted line as a weighted sum of the support vectors:

$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = \sum_{i \in svindex}(\lambda_i - \lambda_i^*)\,\mathbf{x}^{(i)T}\mathbf{x} + w_0 \qquad (4.87a)$$

where svindex denotes the set of indices of the support vectors. Note that for each i ∈ svindex, one element of the pair (λ_i, λ_i*) is zero.
The parameter w0 may be obtained from either of the equations in (4.86). If the former is used, for which i belongs to the set of instances that correspond to support vectors on the upper boundary of the ε-tube (λ_i ∈ (0, C)) (let us denote these points as belonging to the set svm1index of indices of support vectors that fall on the upper boundary), then we have (refer to Eqn (4.55))

$$w_0 = \frac{1}{|svm1index|} \sum_{k \in svm1index} \Big[ y^{(k)} - \varepsilon - \Big( \sum_{i \in svindex} (\lambda_i - \lambda_i^*)\,\mathbf{x}^{(i)T}\mathbf{x}^{(k)} \Big) \Big] \qquad (4.87b)$$

4.8.2 Nonlinear Regression


For nonlinear regression, the quadratic optimization problem follows from Eqn (4.82):

$$\begin{aligned}
\text{maximize} \quad & L^*(\boldsymbol{\lambda}, \boldsymbol{\lambda}^*) = -\varepsilon\sum_{i=1}^{N}(\lambda_i + \lambda_i^*) + \sum_{i=1}^{N}(\lambda_i - \lambda_i^*)\,y^{(i)} - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}(\lambda_i - \lambda_i^*)(\lambda_k - \lambda_k^*)\,\mathbf{z}^{(i)T}\mathbf{z}^{(k)} \\
\text{subject to} \quad & \sum_{i=1}^{N}(\lambda_i - \lambda_i^*) = 0;\;\; \lambda_i, \lambda_i^* \in [0, C]
\end{aligned} \qquad (4.88)$$

where z = φ(x).
The dot product z^(i)T z^(k) = [φ(x^(i))]^T φ(x^(k)) in Eqn (4.88) can be replaced with a kernel K(x^(i), x^(k)).

The nonlinear regression function is (refer to Eqn (4.87a))

$$f(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = \sum_{i \in svindex}(\lambda_i - \lambda_i^*)\,[\boldsymbol{\phi}(\mathbf{x}^{(i)})]^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = \sum_{i \in svindex}(\lambda_i - \lambda_i^*)\,K(\mathbf{x}^{(i)}, \mathbf{x}) + w_0 \qquad (4.89)$$

The parameter w0 for this nonlinear regression solution may be obtained from either of the equations in (4.86). If the former is used, for which i belongs to the set of instances that correspond to support vectors on the upper boundary of the ε-tube (λ_i ∈ (0, C)) (let us denote these points as belonging to the set svm1index), then we have (refer to Eqn (4.87b))

$$w_0 = \frac{1}{|svm1index|} \sum_{k \in svm1index} \Big[ y^{(k)} - \varepsilon - \Big( \sum_{i \in svindex} (\lambda_i - \lambda_i^*)\,K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) \Big) \Big] \qquad (4.90)$$
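Expression (4.89) is exactly what library SVR implementations evaluate at prediction time. The sketch below (scikit-learn assumed; data and kernel width are illustrative) reconstructs f(x) from the fitted dual coefficients, support vectors and intercept, and checks it against the library's own prediction.

# Hedged sketch: rebuilding the nonlinear regression function (4.89) from a fitted SVR.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)            # illustrative noisy data

sigma = 1.0
model = SVR(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=10.0, epsilon=0.1)
model.fit(X, y)

def rbf(a, b, s=sigma):
    return np.exp(-np.sum((a - b) ** 2) / (2 * s**2))

def f_manual(x):
    # dual_coef_[0][j] equals (lambda_j - lambda_j*) for the j-th support vector
    sv = model.support_vectors_
    coef = model.dual_coef_[0]
    return sum(c * rbf(s, x) for c, s in zip(coef, sv)) + model.intercept_[0]

x_test = np.array([0.5])
print(f_manual(x_test), model.predict(x_test.reshape(1, -1))[0])   # should agree closely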

Regression problems, when solved using the SVM algorithm presented in this section, are basically quadratic optimization problems. Working through them with standard quadratic programming routines (e.g., in MATLAB) will be a rich learning experience for the reader. To help the reader use such standard routines, we give the matrix formulation of SVM regression in the example that follows.

Example 4.5
In this example, we illustrate the SVM nonlinear regressor formulation; the formulation will be described in matrix form so as to facilitate the use of standard quadratic optimization software.
As with nonlinear classification, input vectors x ∈ ℝ^n are mapped into vectors z of a higher-dimensional feature space: z = φ(x), where φ represents a mapping ℝ^n → ℝ^m. The linear regression problem is then solved in this feature space. The solution for a regression hypersurface, which is linear in the feature space, creates a nonlinear regression hypersurface in the original input space.
It can easily be shown that ŷ = w^T z + w0 is a regression expression, and with the ε-insensitive loss function, the formulation leads to solution equations of the form (refer to (4.82))

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\bar{\boldsymbol{\lambda}}^T \mathbf{Q}\,\bar{\boldsymbol{\lambda}} + \mathbf{g}^T\bar{\boldsymbol{\lambda}} \\
\text{subject to} \quad & [\mathbf{e}^T \;\; -\mathbf{e}^T]\,\bar{\boldsymbol{\lambda}} = 0, \;\text{and}\; 0 \le \lambda_i, \lambda_i^* \le C;\; i = 1, \ldots, N
\end{aligned} \qquad (4.91)$$

where

$$\mathbf{Q} = \begin{bmatrix} \mathbf{K} & -\mathbf{K} \\ -\mathbf{K} & \mathbf{K} \end{bmatrix};\quad
\bar{\boldsymbol{\lambda}} = \begin{bmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\lambda}^* \end{bmatrix};\quad
\mathbf{g} = \begin{bmatrix} \varepsilon\mathbf{e} - \mathbf{y} \\ \varepsilon\mathbf{e} + \mathbf{y} \end{bmatrix}$$

K is as given in Eqn (4.71), λ = [λ1 λ2 … λN]^T, and εe − y = [ε − y^(1)  ε − y^(2)  …  ε − y^(N)]^T.

After computing the Lagrange multipliers λ_i and λ_i* using a quadratic optimization routine, we find the optimal nonlinear regression function as (Eqns (4.89)–(4.90))

$$\begin{aligned}
f(\mathbf{x}) &= \sum_{i \in svindex}(\lambda_i - \lambda_i^*)\,\mathbf{z}^{(i)T}\mathbf{z} + w_0 \\
&= \sum_{i \in svindex}(\lambda_i - \lambda_i^*)\,[\boldsymbol{\phi}(\mathbf{x}^{(i)})]^T\boldsymbol{\phi}(\mathbf{x}) + w_0 \\
&= \sum_{i \in svindex}(\lambda_i - \lambda_i^*)\,K(\mathbf{x}^{(i)}, \mathbf{x})
+ \frac{1}{|svm1index|}\sum_{k \in svm1index}\Big[ y^{(k)} - \varepsilon - \Big( \sum_{i \in svindex}(\lambda_i - \lambda_i^*)\,K(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) \Big) \Big]
\end{aligned} \qquad (4.92)$$

There are a number of learning parameters that can be utilized for constructing SV machines for regression. The two most relevant are the insensitivity parameter ε and the penalty parameter C; both are chosen by the user. An increase in ε has a smoothing effect when modeling highly noisy data, since it means a reduction in the required accuracy of approximation. We have already commented on the selection of the parameter C; more on it will appear later.
SV training works almost perfectly for datasets that are not too large. However, when the number of data points is large, the quadratic programming problem becomes extremely difficult to solve with standard methods. Approaches to solving the quadratic programming problem for large datasets have been developed; we will discuss this aspect of SV training in a later section.

4.9 DECOMPOSING MULTICLASS CLASSIFICATION PROBLEM INTO BINARY


CLASSIFICATION TASKS

Support vector machines were originally designed for binary classification. Initial research attempts were directed towards combining several two-class SVMs to perform multiclass classification. More recently, several single-shot multiclass classification algorithms have appeared in the literature.
At present, there are two types of approaches for multiclass classification. In the first approach, called 'indirect methods', we construct several binary SVMs and combine their outputs to predict the class. In the second approach, called 'direct methods', we consider all classes in a single optimization problem. Because of the computational complexity of training in the direct methods, indirect methods are so far the most widely used, as they do not pose numerical difficulties during training. We limit our discussion to indirect methods.
There are two popular methods in the category of indirect methods: One-Against-All (OAA),
and One-Against-One (OAO). In the general case, both OAA and OAO are special cases of error-
correcting-output codes that decompose a multiclass problem to a set of two-class problems [59].

This results in a vector

c = [c1, …, cM]

of frequencies of 'wins' of each class (frequency of 'wins' = number of votes). The final decision is made in favor of the most frequent class:

Class q = arg max_{q = 1, …, M} c_q (4.96)


There is a likelihood of a tie in the voting scheme of OAO classification. We can break the tie by interpreting the actual values returned by the decision surfaces as confidence values: adding up the absolute values of the confidence assigned to each of the tied labels, we take the winner to be the tied label with the maximum sum of confidence values.
It seems that OAO classification solves our problem of unbalanced datasets. However, it solves
the problem at the expense of introducing a new complication: the fact that for M classes, we have
to construct M(M –1)/2 decision surfaces. For small M, the difference between the number of
decision surfaces we have to build for the OAA and OAO techniques, is not that drastic. (For M =
4, OAA requires 4 binary classifiers and OAO requires 6). However, for large M, the difference can
be quite drastic (For M = 10, OAA requires 10 binary classifiers and OAO requires 45).
The individual classifiers in the OAO technique, however, are usually smaller in size (they have fewer support vectors) than they would be in the OAA approach. This is for two reasons: first, the training sets are smaller, and second, the problems to be learned are usually easier, since the classes have less overlap. Since the QP in each classifier is smaller, training is faster.
Nevertheless, if M is large, then the resulting OAO system may be slower than the corresponding
OAA. Platt et al. [61] improved the OAO approach and proposed a method called Directed Acyclic
Graph SVM (DAGSVM) that forms a tree-like structure to facilitate the testing phase.
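Both decompositions are available off the shelf. The sketch below (scikit-learn assumed; the toy dataset is illustrative) trains the same base SVM under the one-against-one and one-against-all schemes for a 3-class problem and reports the number of binary classifiers built.

# Hedged sketch: One-Against-One (OAO) vs One-Against-All (OAA) decompositions of a
# multiclass problem into binary SVMs, using scikit-learn's meta-estimators.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

oao = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)   # M(M-1)/2 = 3 binary SVMs
oaa = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)  # M = 3 binary SVMs

print("OAO binary classifiers:", len(oao.estimators_))
print("OAA binary classifiers:", len(oaa.estimators_))
print("OAO training accuracy :", oao.score(X, y))
print("OAA training accuracy :", oaa.score(X, y))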

4.10 VARIANTS OF BASIC SVM TECHNIQUES

Basically, support vector machines form a learning algorithm family rather than a single algorithm.
The basic concept of the SVM-based learning algorithm is pretty simple: find a good learning
boundary while maximizing the margin (i.e., the distance between the closest learning samples that
correspond to different classes). Each algorithm that optimizes an objective function in which the
maximal margin heuristic is encoded, can be considered a variant of basic SVM.
In these variants, researchers propose improvements to gain speed, accuracy, lower memory requirements, or the ability to handle multiple classes. Each variant works well in a particular setting under particular circumstances.
Since the introduction of SVM, numerous variants have been developed. In this section, we
highlight some of these which have earned popularity in their usage or are being actively researched.

Changing the Metric of Margin from L2-norm to L1-norm


The standard L2-norm SVM is a widely used tool in machine learning. The L1-norm SVM is a
variant of the standard L2-norm SVM.

The L1-norm formulation is given by (Eqns (4.48a)):

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{N}\zeta_i \\
\text{subject to} \quad & y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0) \ge 1 - \zeta_i;\;\; i = 1, \ldots, N \\
& \zeta_i \ge 0;\;\; i = 1, \ldots, N
\end{aligned}$$

In the L2-norm SVM, the sum of squares of the error (slack) variables is minimized along with the reciprocal of the square of the margin between the boundary planes. The formulation of the problem is given by:

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{C}{2}\sum_{i=1}^{N}\zeta_i^2 \\
\text{subject to} \quad & y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0) \ge 1 - \zeta_i;\;\; i = 1, \ldots, N \\
& \zeta_i \ge 0;\;\; i = 1, \ldots, N
\end{aligned} \qquad (4.97)$$
It has been argued that the L1-norm penalty has advantages over the L2-norm penalty in certain scenarios, especially when there are redundant noise variables. The L1-norm SVM is able to delete many noise features by estimating their coefficients as zero, while the L2-norm SVM will use all the features. When there are many noise variables, the L2-norm SVM suffers severe damage caused by the noise features. Thus, the L1-norm SVM has an inherent variable-selection property, while this is not the case for the L2-norm SVM. In this book, our focus has been on L1-norm SVM formulations. Refer to [56] for L2-norm SVM formulations.
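The variable-selection behaviour described above can be observed with a linear SVM carrying an L1 penalty on the weight vector. The sketch below uses scikit-learn's LinearSVC as one common realization of this idea (not the exact formulation given above) and counts how many coefficients are driven to zero when noise features are present.

# Hedged sketch: sparsity of an L1-penalized linear SVM in the presence of noise features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 5 informative features plus 20 pure-noise features
X, y = make_classification(n_samples=400, n_features=25, n_informative=5,
                           n_redundant=0, random_state=0)

l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1, max_iter=5000).fit(X, y)
l2_svm = LinearSVC(penalty="l2", dual=True, C=0.1, max_iter=5000).fit(X, y)

print("L1 penalty: zero coefficients =", int(np.sum(l1_svm.coef_ == 0)), "of", X.shape[1])
print("L2 penalty: zero coefficients =", int(np.sum(l2_svm.coef_ == 0)), "of", X.shape[1])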
Replacing the Control Parameter C in Basic SVM (C-SVM); C ≥ 0, by the Parameter ν (ν-SVM); ν ∈ [0, 1]

As we have seen in the formulations of the basic SVM (C-SVM) presented in earlier sections, C is a user-specified penalty parameter. It is a trade-off between margin and mistakes. Proper choice of C is crucial for good generalization power of the classifier. Usually, the parameter is selected by trial-and-error with cross-validation.
Tuning of the parameter C can be quite time-consuming for large datasets. In the scheme proposed by Schölkopf et al. [62], the parameter C ≥ 0 in the basic SVM is replaced by a parameter ν ∈ [0, 1]. ν has been shown to be a lower bound on the fraction of support vectors and an upper bound on the fraction of instances having margin errors (instances that lie on the wrong side of the hyperplane). By playing with ν, we can control the fraction of support vectors, and this is advocated to be more intuitive than playing with C. However, as compared to C-SVM, its formulations are more complicated.
A formulation of ν-SVM is given by:

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} - \nu\rho + \frac{1}{N}\sum_{i=1}^{N}\zeta_i \\
\text{subject to} \quad & y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0) \ge \rho - \zeta_i;\;\; i = 1, \ldots, N \\
& \zeta_i \ge 0;\;\; i = 1, \ldots, N; \;\; \rho \ge 0
\end{aligned} \qquad (4.98)$$

Note that the parameter C does not appear in this formulation; instead there is the parameter ν. An additional parameter ρ also appears, which is a variable of the optimization problem and scales the margin: the margin is now 2ρ/||w|| (refer to Eqn (4.25)).
The formulation given by (4.98) represents a modification of the basic SVM classification (C-SVM) given by (4.48a) to obtain ν-SVM classification. With analogous modifications of the basic SVM regression (ε-SVM regression), we obtain ν-SVM regression.
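ν-SVM classification is exposed directly in common libraries. The sketch below (scikit-learn's NuSVC assumed; data illustrative) shows how increasing ν raises the fraction of training points retained as support vectors, consistent with ν being a lower bound on that fraction.

# Hedged sketch: nu as a lower bound on the fraction of support vectors (nu-SVM).
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

for nu in (0.1, 0.3, 0.6):
    clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)
    frac_sv = len(clf.support_) / len(X)
    print(f"nu={nu:.1f}: fraction of support vectors = {frac_sv:.2f}")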

Sequential Minimization Algorithms


Support vector machines have attracted many researchers in the last two decades due to many
interesting properties they enjoy: immunity to overfitting by means of regularization, guarantees on
the generalization error, robust training algorithms that are based on well established mathematical
programming techniques, and above all, their success in many real-world classification problems.
Despite the many advantages, basic SVM suffers from a serious drawback; it requires Quadratic
Programming (QP) solver to solve the problem. The amount of computer memory needed for a
standard QP solver increases exponentially with the size of the data. Therefore, the issue is whether
it is possible for us to scale up the SVM algorithm for huge datasets comprising thousands and
millions of instances. Many decomposition techniques have been developed to scale up the SVM
algorithm.
Techniques based on decomposition, break down a large optimization problem into smaller
problems, with each one involving merely some cautiously selected variables so that efficient
optimization is possible. Platt’s SMO algorithm (Sequential Minimal Optimization) [63] is an
extreme case of the decomposition techniques developed, which works on a set of two data points
at a time. Owing to the fact that the solution for a working set of two data points can be arrived
at analytically, the SMO algorithm does not invoke standard QP solvers. Due to its analytical
foundation, SMO and its improved versions [64, 65, 66] are particularly simple and are at present the most widely used. Many free software packages are available; the most popular are SVMlight [67] and LIBSVM [68].

Variants based on Trade-off between Complexity and Accuracy


Decomposition techniques handle the memory issue alone, by dividing a problem into a series of smaller ones. However, even these smaller problems are rather time-consuming for big datasets. A number of techniques for reducing the training time, at some cost in accuracy, have been suggested in the literature. Some of the popular ones are as follows.
LS-SVM (Least Squares Support Vector Machine): This is a least-squares version of the classical SVM. The LS-SVM classification formulation implicitly corresponds to a regression interpretation with binary targets y^(i) = ±1. Proposed by Suykens and Vandewalle [69], its formulation has equality constraints; a set of linear equations has to be solved instead of the quadratic programming problem of classical SVMs.

PSVM (Proximal Support Vector Machine): Developed by Fung and Mangasarian [70],
Proximal SVM leads to an extremely fast and simple algorithm for generating a classifier that is
obtained by solving a set of linear equations. Proximal SVM is comparable with standard SVM in
performance.
The key idea of PSVM is that it classifies points by assigning them to the closer of two parallel planes that are pushed apart as far as possible. These planes are pushed apart by introducing the term w^T w + w0² in the objective function of the classical L2-norm optimization problem.
LSVM (Lagrangian Support Vector Machine): A fast and simple algorithm, based on an implicit Lagrangian formulation of the dual of a simple reformulation of the standard quadratic program of a support vector machine, was proposed by Mangasarian [71]. The algorithm minimizes an unconstrained differentiable convex function for classifying N points in a given n-dimensional input space. An iterative Lagrangian Support Vector Machine (LSVM) algorithm is given for solving the modified SVM. This algorithm can accurately solve problems with millions of points, at a pace greater than SMO (if n is less than 100), without any optimization tools such as linear or quadratic programming solvers.

Multiclass based SVM Algorithms


Originally, the SVM was developed for binary classification; the basic idea to apply SVM technique
to multiclass problems is to decompose the multiclass problem into several two-class problems that
can be addressed directly using several SVMs (refer to Section 4.9). This decomposition approach
gives ‘indirect methods’ for multiclass classification problems.
Instead of creating several binary classifiers, a more natural way is to distinguish all classes in a single optimization process. This approach gives 'direct methods' for multiclass classification problems [72].
Weston and Watkins' Multiclass SVM: In the method proposed by Weston and Watkins [73] (the idea is similar to the OAA approach), for an M-class problem a single objective function is designed for training all M binary classifiers simultaneously, maximizing the margin from each class to the remaining classes. The main disadvantage of this approach is that the computational time may be very high due to the enormous size of the resulting QP. The OAA approach is generally preferred over this method.
Crammer and Singer’s Multiclass SVM: Crammer and Singer [74] presented another
'all-together' approach. It gives a compact set of constraints; however, the number of variables in its dual problem is high and may explode even for small datasets.
Simplified Multiclass SVM (SimMSVM): A simplified method, named SimMSVM [75], relaxes
the constraints of Crammer and Singer’s approach.
The support vector machine is currently considered to be among the best off-the-shelf learning algorithms and has been applied successfully in various domains. Schölkopf and Smola [53] is a classic book on the subject.
