Chapter 12

Large-Scale Machine Learning
Many algorithms are today classified as machine learning. These algorithms
share the goal of extracting information from data with the other algorithms
studied in this book. All algorithms for analysis of data are designed to produce
a useful summary of the data, from which decisions are made. Among many
examples, the frequent-itemset analysis that we did in Chapter 6 produces
information like association rules, which can then be used for planning a sales
strategy or for many other purposes.
However, algorithms classified as machine learning have a particular pur-
pose; they are designed to classify data that will be seen in the future. The
data to which the algorithm is applied is called the training set, and the result
of the machine-learning algorithm is called a classifier. For instance, the clus-
tering algorithms discussed in Chapter 7 produce clusters that not only tell us
something about the data being analyzed (the training set), but they allow us
to classify future data into one of the clusters that result from the clustering
algorithm. Thus, machine-learning enthusiasts often speak of clustering with
the neologism "unsupervised learning"; the term "unsupervised" refers to the fact
that the input data, or training set, does not tell the clustering algorithm what
the clusters should be. In supervised machine learning, the subject of this chap-
ter, the training set includes labels that indicate the correct classification for
its members.
In this chapter, we do not attempt to cover all the different approaches to
machine learning. We concentrate on methods that are suitable for very large
data and that have the potential for parallel implementation. We consider the
classical perceptron approach to learning a data classifier, where a hyperplane
that separates two classes is sought. Then, we look at more modern techniques
involving support-vector machines. Similar to perceptrons, these methods look
for hyperplanes that best divide the classes, so that few, if any, members of the
training set lie close to the hyperplane. We end with a discussion of nearest-
neighbor techniques, where data is classified according to the class(es) of its
nearest neighbors in some space.
12.1 The Machine-Learning Model

In this brief section we introduce the framework for machine-learning algorithms
and give the basic definitions.
12.1.1 Training Sets

The data to which a machine-learning (often abbreviated ML) algorithm is
applied is called a training set. A training set consists of a set of pairs (x, y),
called training examples, where

• x is a vector of values, often called a feature vector.

• y is the label, the classification value for x.

The objective of the ML process is to discover a function y = f(x) that best
predicts the value of y associated with unseen values of x. The type of y is in
principle arbitrary, but there are several common and important cases.
1. y is a real number. In this case, the ML problem is called regression.

2. y is a boolean value: true or false, more commonly written as +1 and −1,
respectively. In this case the problem is binary classification.

3. y is a member of some finite set. The members of this set can be thought
of as classes, and each member represents one class. The problem is
multiclass classification.

4. y is a member of some potentially infinite set, for example, a parse tree
for x, which is interpreted as a sentence.
12.1.2 Some Illustrative Examples
Example 12.1: Recall Fig. 7.1, repeated as Fig. 12.1, where we plotted the
height and weight of dogs in three classes: Beagles, Chihuahuas, and
Dachshunds. We can think of this data as a training set, provided the data
includes the variety of the dog along with each height-weight pair. Each pair
(x, y) in the training set consists of a feature vector x of the form
[height, weight]. The associated label y is the variety of the dog. An example
of a training-set pair would be ([5 inches, 2 pounds], Chihuahua).
An appropriate way to implement the decision function f would be to imag-
ine two lines, shown dashed in Fig. 12.1. The horizontal line represents a height
of 7 inches and separates Beagles from Chihuahuas and Dachshunds. The verti-
cal line represents a weight of 3 pounds and separates Chihuahuas from Beagles
and Dachshunds. The algorithm that implements f is:
if (height > 7) print Beagle;
else if (weight < 3) print Chihuahua;
else print Dachshund;

Figure 12.1: Repeat of Fig. 7.1, indicating the heights and weights of certain
dogs
Recall that the original intent of Fig. 7.1 was to cluster points without
knowing which variety of dog they represented. That is, the label associated
with a given height-weight vector was not available. Thus, clustering does not
really fit the machine-learning model, which is why it was given the variant
name "unsupervised learning." 2
Figure 12.2: Repeat of Fig. 11.1, to be used as a training set
Example 12.2: As an example of supervised learning, the four points (1, 2),
(2, 1), (3, 4), and (4, 3) from Fig. 11.1 (repeated here as Fig. 12.2) can be
thought of as a training set, where the vectors are one-dimensional. That is,
the point (1, 2) can be thought of as a pair ([1], 2), where [1] is the
one-dimensional feature vector x, and 2 is the associated label y; the other
points can be interpreted similarly.
Suppose we want to learn the linear function f(x) = ax + b that best
represents the points of the training set. A natural interpretation of "best"
is that the RMSE of the value of f(x) compared with the given value of y is
minimized. Since minimizing the RMSE is equivalent to minimizing the sum of
squared errors, we want to minimize

∑_{x=1}^{4} (ax + b − y_x)²

where y_x is the y-value associated with x. This sum is

(a + b − 2)² + (2a + b − 1)² + (3a + b − 4)² + (4a + b − 3)²

Simplifying, the sum is 30a² + 4b² + 20ab − 56a − 20b + 30. If we then take the
derivatives with respect to a and b and set them to 0, we get

60a + 20b − 56 = 0
20a + 8b − 20 = 0

The solution to these equations is a = 3/5 and b = 1. For these values the
sum of squared errors is 3.2.
Note that the learned straight line is not the principal axis that was dis-
covered for these points in Section 11.2.1. That axis was the line with slope
1, going through the origin, i.e., the line y = x. For this line, the sum of
squared errors is 4. The difference is that PCA, discussed in Section 11.2.1,
minimizes the sum of the squares of the lengths of the projections onto the
chosen axis, which is constrained to go through the origin. Here, we are
minimizing the sum of the squares of the vertical distances between the points
and the line. In fact, even had we tried to learn the line through the origin
with the least RMSE, we would not choose y = x. You can check that
y = (14/15)x has a lower sum of squared errors than 4. 2
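To make the arithmetic of Example 12.2 concrete, here is a short Python sketch (not from the text) that solves the same pair of normal equations for the four points and checks the sum of squared errors:

```python
# A quick check of Example 12.2: fit f(x) = ax + b by least squares and
# confirm a = 3/5, b = 1, with a sum of squared errors of 3.2.

points = [(1, 2), (2, 1), (3, 4), (4, 3)]   # the training set of Fig. 12.2

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xx = sum(x * x for x, _ in points)
sum_xy = sum(x * y for x, y in points)

# Normal equations: a*sum_xx + b*sum_x = sum_xy and a*sum_x + b*n = sum_y
a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - a * sum_x) / n

sse = sum((a * x + b - y) ** 2 for x, y in points)  # sum of squared errors
print(a, b, sse)   # approximately 0.6, 1.0, and 3.2
```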
Example 12.3: A common application of machine learning involves a training
set where the feature vectors x are boolean-valued and of very high dimension.
Each component represents a word in some large dictionary. The training set
consists of pairs, where the vector x represents a document of some sort. For
instance, the label y could be +1 or −1, with +1 representing that the document
(an email, e.g.) is spam. Our goal would be to train a classifier to examine
future emails and decide whether or not they are spam. We shall illustrate this
use of machine learning in Example 12.4.

Alternatively, y could be chosen from some finite set of topics, e.g., "sports,"
"politics," and so on. Again, x could represent a document, perhaps a Web
page. The goal would be to create a classifier for Web pages that assigned a
topic to each. 2
12.1.3 Approaches to Machine Learning

There are many forms of ML algorithms, and we shall not cover them all here.
Here are the major classes of such algorithms, each of which is distinguished by
the form by which the function f is represented.

1. Decision trees were discussed briefly in Section 9.2.7. The form of f is
a tree, and each node of the tree has a function of x that determines
to which child or children the search must proceed. While we saw only
binary trees in Section 9.2.7, in general a decision tree can have any
number of children for each node. Decision trees are suitable for binary
and multiclass classification, especially when the dimension of the feature
vector is not too large (large numbers of features can lead to overfitting).
2. Perceptrons are threshold functions applied to the components of the
vector x = [x_1, x_2, . . . , x_n]. A weight w_i is associated with the ith
component, for each i = 1, 2, . . . , n, and there is a threshold θ. The output
is +1 if

∑_{i=1}^{n} w_i x_i > θ

and the output is −1 otherwise. A perceptron is suitable for binary
classification, even when the number of features is very large, e.g., the
presence or absence of words in a document. Perceptrons are the topic of
Section 12.2.
3. Neural nets are acyclic networks of perceptrons, with the outputs of some
perceptrons used as inputs to others. These are suitable for binary or
multiclass classication, since there can be several perceptrons used as
output, with one or more indicating each class.
4. Instance-based learning uses the entire training set to represent the func-
tion f. The calculation of the label y associated with a new feature vector
x can involve examination of the entire training set, although usually some
preprocessing of the training set enables the computation of f(x) to proceed
efficiently. We shall consider an important kind of instance-based
learning, k-nearest-neighbor, in Section 12.4. For example, 1-nearest-
neighbor classies data by giving it the same class as that of its nearest
training example. There are k-nearest-neighbor algorithms that are ap-
propriate for any kind of classication, although we shall concentrate on
the case where y and the components of x are real numbers.
5. Support-vector machines are an advance over the algorithms traditionally
used to select the weights and threshold. The result is a classier that
tends to be more accurate on unseen data. We discuss support-vector
machines in Section 12.3.
12.1.4 Machine-Learning Architecture

Machine-learning algorithms can be classified not only by their general algorith-
mic approach, as we discussed in Section 12.1.3, but also by their underlying
architecture: the way data is handled and the way it is used to build the model.
One general issue regarding the handling of data is that there is a good reason to
withhold some of the available data from the training set. The remaining data
is called the test set. The problem addressed is that many machine-learning
algorithms tend to overfit the data; they pick up on artifacts that occur in the
training set but that are atypical of the larger population of possible data. By
using the test data, and seeing how well the classifier works on that, we can tell
if the classifier is overfitting the data. If so, we can restrict the machine-learning
algorithm in some way. For instance, if we are constructing a decision tree, we
can limit the number of levels of the tree.
Often, as in Examples 12.1 and 12.2, we use a batch learning architecture.
That is, the entire training set is available at the beginning of the process, and
it is all used in whatever way the algorithm requires to produce a model once
and for all. The alternative is on-line learning, where the training set arrives
in a stream and, like any stream, cannot be revisited after it is processed. In
on-line learning, we maintain a model at all times. As new training examples
arrive, we may choose to modify the model to account for the new examples.
On-line learning has the advantages that it can
1. Deal with very large training sets, because it does not access more than
one training example at a time.
2. Adapt to changes in the population of training examples as time goes on.
For instance, Google trains its spam-email classier this way, adapting
the classier for spam as new kinds of spam email are sent by spammers
and indicated to be spam by the recipients.
An enhancement of on-line learning, suitable in some cases, is active learn-
ing. Here, the classifier may receive some training examples, but it primarily
receives unclassified data, which it must classify. If the classifier is unsure of
the classification (e.g., the newly arrived example is very close to the bound-
ary), then the classifier can ask for ground truth at some significant cost. For
instance, it could send the example to Mechanical Turk and gather opinions of
real people. In this way, examples near the boundary become training examples
and can be used to modify the classifier.
12.1.5 Exercises for Section 12.1

Exercise 12.1.1: Redo Example 12.2 for the following different forms of f(x).

(a) Require f(x) = ax; i.e., a straight line through the origin. Is the line
y = (14/15)x that we discussed in the example optimal?

(b) Require f(x) to be a quadratic, i.e., f(x) = ax² + bx + c.
12.2 Perceptrons

A perceptron is a linear binary classifier. Its input is a vector
x = [x_1, x_2, . . . , x_d] with real-valued components. Associated with the
perceptron is a vector of weights w = [w_1, w_2, . . . , w_d], also with real-valued
components. Each perceptron has a threshold θ. The output of the perceptron
is +1 if w.x > θ, and the output is −1 if w.x < θ. The special case where
w.x = θ will always be regarded as "wrong," in the sense that we shall describe
in detail when we get to Section 12.2.1.

The weight vector w defines a hyperplane of dimension d − 1: the set of all
points x such that w.x = θ, as suggested in Fig. 12.3. Points on the positive
side of the hyperplane are classified +1 and those on the negative side are
classified −1.
Figure 12.3: A perceptron divides a space by a hyperplane into two half-spaces
12.2.1 Training a Perceptron with Zero Threshold

To train a perceptron, we examine the training set and try to find a weight
vector w and threshold θ such that all the feature vectors with y = +1 (the
positive examples) are on the positive side of the hyperplane and all those with
y = −1 (the negative examples) are on the negative side. It may or may not be
possible to do so, since there is no guarantee that any hyperplane separates all
the positive and negative examples in the training set.

We begin by assuming the threshold is 0; the simple augmentation needed
to handle an unknown threshold is discussed in Section 12.2.3. The follow-
ing method will converge to some hyperplane that separates the positive and
negative examples, provided one exists.
1. Initialize the weight vector w to all 0s.

2. Pick a learning-rate parameter c, which is a small, positive real number.

3. Consider each training example t = (x, y) in turn.

(a) Let y′ = w.x.

(b) If y′ and y have different signs, or if y′ = 0, then t is misclassified.
Replace w by w + cyx. That is, adjust w slightly in the direction of x.
Figure 12.4: A misclassified point x_1 moves the vector w
The two-dimensional case of this transformation on w is suggested in Fig.
12.4. Notice how moving w in the direction of x moves the hyperplane that is
perpendicular to w in a direction that makes it more likely that x will be on
the correct side of the hyperplane, although it does not guarantee that to be
the case.
Example 12.4: Let us consider training a perceptron to recognize spam email.
The training set consists of pairs (x, y) where x is a vector of 0s and 1s, with
each component x_i corresponding to the presence (x_i = 1) or absence (x_i = 0)
of a particular word in the email. The value of y is +1 if the email is known
to be spam and −1 if it is known not to be spam. While the number of words
found in the training set of emails is very large, we shall use a simplified
example where there are only five words: "and," "viagra," "the," "of," and
"nigeria." Figure 12.5 gives the training set of six vectors and their
corresponding classes.

In this example, we shall use learning rate c = 1/2, and we shall visit
each training example once, in the order shown in Fig. 12.5. We begin with
w = [0, 0, 0, 0, 0] and compute w.a = 0. Since 0 is not positive, we move w in
      and  viagra  the  of  nigeria   y
a      1     1      0    1     1     +1
b      0     0      1    1     0     −1
c      0     1      1    0     0     +1
d      1     0      0    1     0     −1
e      1     0      1    0     1     +1
f      1     0      1    1     0     −1

Figure 12.5: Training data for spam emails
Pragmatics of Training on Emails
When we represent emails or other large documents as training examples,
we would not really want to construct the vector of 0s and 1s with a
component for every word that appears even once in the collection of
emails. Doing so would typically give us sparse vectors with millions of
components. Rather, create a table in which all the words appearing in the
emails are assigned integers 1, 2, . . ., indicating their component. When we
process an email in the training set, make a list of the components in which
the vector has 1; i.e., use the standard sparse representation for the vector.
Only the vector w needs to have all its components listed, since it will not
be sparse after a small number of training examples have been processed.
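The scheme suggested in the box can be sketched as follows; `word_index`, `components`, and `sparse_dot` are illustrative names, not part of any standard library:

```python
# Sparse representation of emails, as suggested in the box: a global table
# assigns each word an integer component, each email is stored as the sorted
# list of components in which its vector has 1, and only w is kept dense.

word_index = {}          # word -> component number, grows as new words appear

def components(email_words):
    """Return the sorted list of components in which the email's vector is 1."""
    comps = set()
    for word in email_words:
        if word not in word_index:
            word_index[word] = len(word_index)
        comps.add(word_index[word])
    return sorted(comps)

def sparse_dot(w, comps):
    """Dot product of the dense weight vector w with a sparse 0/1 vector."""
    return sum(w[i] for i in comps if i < len(w))

doc = components(["viagra", "nigeria", "viagra"])
print(doc)  # [0, 1]: duplicate words contribute a single 1-component
```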
the direction of a by performing w := w + (1/2)(+1)a. The new value of w
is thus

w = [0, 0, 0, 0, 0] + [1/2, 1/2, 0, 1/2, 1/2] = [1/2, 1/2, 0, 1/2, 1/2]
Next, consider b. w.b = [1/2, 1/2, 0, 1/2, 1/2].[0, 0, 1, 1, 0] = 1/2. Since the
associated y for b is −1, b is misclassified. We thus assign

w := w + (1/2)(−1)b = [1/2, 1/2, 0, 1/2, 1/2] − [0, 0, 1/2, 1/2, 0] = [1/2, 1/2, −1/2, 0, 1/2]
Training example c is next. We compute

w.c = [1/2, 1/2, −1/2, 0, 1/2].[0, 1, 1, 0, 0] = 0

Since the associated y for c is +1, c is also misclassified. We thus assign

w := w + (1/2)(+1)c = [1/2, 1/2, −1/2, 0, 1/2] + [0, 1/2, 1/2, 0, 0] = [1/2, 1, 0, 0, 1/2]
Training example d is next to be considered:

w.d = [1/2, 1, 0, 0, 1/2].[1, 0, 0, 1, 0] = 1/2

Since the associated y for d is −1, d is misclassified as well. We thus assign

w := w + (1/2)(−1)d = [1/2, 1, 0, 0, 1/2] − [1/2, 0, 0, 1/2, 0] = [0, 1, 0, −1/2, 1/2]
For training example e we compute w.e = [0, 1, 0, −1/2, 1/2].[1, 0, 1, 0, 1] = 1/2.
Since the associated y for e is +1, e is classified correctly, and no change to w
is made. Similarly, for f we compute

w.f = [0, 1, 0, −1/2, 1/2].[1, 0, 1, 1, 0] = −1/2

so f is correctly classified. If we check a through d, we find that this w cor-
rectly classifies them as well. Thus, we have converged to a perceptron that
classifies all the training-set examples correctly. It also makes a certain amount
of sense: it says that "viagra" and "nigeria" are indicative of spam, while "of"
is indicative of nonspam. It considers "and" and "the" neutral, although we
would probably prefer to give "and," "or," and "the" the same weight. 2
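The training procedure of Section 12.2.1 is short enough to sketch directly in Python; the hypothetical `train_perceptron` below reproduces the single pass over the data of Fig. 12.5 worked out in Example 12.4:

```python
# Zero-threshold perceptron training (Section 12.2.1), run on the training
# set of Fig. 12.5 with learning rate c = 1/2, visiting each example once.

def train_perceptron(examples, c, passes=1):
    d = len(examples[0][0])
    w = [0.0] * d
    for _ in range(passes):
        for x, y in examples:
            y_prime = sum(wi * xi for wi, xi in zip(w, x))
            # Misclassified if the sign of w.x disagrees with y (0 counts as wrong).
            if y * y_prime <= 0:
                w = [wi + c * y * xi for wi, xi in zip(w, x)]
    return w

training = [                # columns: and, viagra, the, of, nigeria
    ([1, 1, 0, 1, 1], +1),  # a
    ([0, 0, 1, 1, 0], -1),  # b
    ([0, 1, 1, 0, 0], +1),  # c
    ([1, 0, 0, 1, 0], -1),  # d
    ([1, 0, 1, 0, 1], +1),  # e
    ([1, 0, 1, 1, 0], -1),  # f
]

w = train_perceptron(training, c=0.5)
print(w)  # [0.0, 1.0, 0.0, -0.5, 0.5], as derived in Example 12.4
```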
12.2.2 The Winnow Algorithm

There are many other rules one could use to adjust weights for a perceptron. Not
all possible algorithms are guaranteed to converge, even if there is a hyperplane
separating positive and negative examples. One that does converge is called
Winnow, and that rule will be described here. Winnow assumes that the feature
vectors consist of 0s and 1s, and the labels are +1 or −1. Unlike the basic
perceptron algorithm, which can produce positive or negative components in
the weight vector w, Winnow produces only positive weights.

Start the Winnow Algorithm with a weight vector w = [w_1, w_2, . . . , w_d],
all of whose components are 1, and let the threshold θ equal d, the number
of dimensions of the vectors in the training examples. Let (x, y) be the next
training example to be considered, where x = [x_1, x_2, . . . , x_d].
1. If w.x > θ and y = +1, or w.x < θ and y = −1, then the example is
correctly classified, so no change to w is made.

2. If w.x ≤ θ, but y = +1, then the weights for the components where x has
1 are too low as a group. Double each of the corresponding components
of w. That is, if x_i = 1 then set w_i := 2w_i.

3. If w.x ≥ θ, but y = −1, then the weights for the components where x has
1 are too high as a group. Halve each of the corresponding components
of w. That is, if x_i = 1 then set w_i := w_i/2.
Example 12.5: Let us reconsider the training data from Fig. 12.5. Initialize
w = [1, 1, 1, 1, 1] and let θ = 5. First, consider feature vector a = [1, 1, 0, 1, 1].
w.a = 4, which is less than θ. Since the associated label for a is +1, this
example is misclassified. When a +1-labeled example is misclassified, we must
double all the components where the example has 1; in this case, all but the
third component of a is 1. Thus, the new value of w is [2, 2, 1, 2, 2].

Next, we consider example b = [0, 0, 1, 1, 0]. w.b = 3, which is less than θ.
However, the associated label for b is −1, so no change to w is needed.

For c = [0, 1, 1, 0, 0] we find w.c = 3 < θ, while the associated label is +1.
Thus, we double the components of w where the corresponding components of
c are 1. These components are the second and third, so the new value of w is
[2, 4, 2, 2, 2].

The next two examples, d and e, require no change, since they are correctly
classified. However, there is a problem with f = [1, 0, 1, 1, 0], since w.f = 6 > θ,
while the associated label for f is −1. Thus, we must divide the first, third,
and fourth components of w by 2, since these are the components where f has
1. The new value of w is [1, 4, 1, 1, 2].
  x    y   w.x  OK?  and  viagra  the   of   nigeria
                      1     1      1     1      1
  a   +1    4   no    2     2      1     2      2
  b   −1    3   yes
  c   +1    3   no    2     4      2     2      2
  d   −1    4   yes
  e   +1    6   yes
  f   −1    6   no    1     4      1     1      2
  a   +1    8   yes
  b   −1    2   yes
  c   +1    5   no    1     8      2     1      2
  d   −1    2   yes
  e   +1    5   no    2     8      4     1      4
  f   −1    7   no    1     8      2    1/2     4

Figure 12.6: Sequence of updates to w performed by the Winnow Algorithm
on the training set of Fig. 12.5
We still have not converged. It turns out we must consider each of the
training examples a through f again. At the end of this process, the algorithm
has converged to a weight vector w = [1, 8, 2, 1/2, 4], which with threshold
θ = 5 correctly classifies all of the training examples in Fig. 12.5. The details
of the twelve steps to convergence are shown in Fig. 12.6. This figure gives the
associated label y and the computed dot product of w and the given feature
vector. The last five columns are the five components of w after processing
each training example. 2
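A minimal sketch of the Winnow Algorithm as just described, run on the same training set (the function name is illustrative):

```python
# The Winnow Algorithm (Section 12.2.2): weights start at 1, the threshold
# is d, and a misclassification doubles or halves the weights where x has 1.

def winnow(examples, max_passes=100):
    d = len(examples[0][0])
    w = [1.0] * d
    theta = float(d)
    for _ in range(max_passes):
        changed = False
        for x, y in examples:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            if y == +1 and dot <= theta:       # too low: double where x_i = 1
                w = [wi * 2 if xi == 1 else wi for wi, xi in zip(w, x)]
                changed = True
            elif y == -1 and dot >= theta:     # too high: halve where x_i = 1
                w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
                changed = True
        if not changed:                        # a full pass with no mistakes
            return w, theta
    return w, theta

training = [                # the training set of Fig. 12.5
    ([1, 1, 0, 1, 1], +1),
    ([0, 0, 1, 1, 0], -1),
    ([0, 1, 1, 0, 0], +1),
    ([1, 0, 0, 1, 0], -1),
    ([1, 0, 1, 0, 1], +1),
    ([1, 0, 1, 1, 0], -1),
]

w, theta = winnow(training)
print(w, theta)  # [1.0, 8.0, 2.0, 0.5, 4.0] 5.0, matching Fig. 12.6
```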
12.2.3 Allowing the Threshold to Vary

Suppose now that the choice of threshold 0, as in Section 12.2.1, or threshold
d, as in Section 12.2.2, is not desirable, or that we don't know what threshold
is best to use. At the cost of adding another dimension to the feature vectors,
we can treat the threshold θ as one of the components of the weight vector w.
That is:

1. Replace the vector of weights w = [w_1, w_2, . . . , w_d] by

   w′ = [w_1, w_2, . . . , w_d, θ]

2. Replace every feature vector x = [x_1, x_2, . . . , x_d] by

   x′ = [x_1, x_2, . . . , x_d, −1]
Then, for the new training set and weight vector, we can treat the threshold
as 0 and use the algorithm of Section 12.2.1. The justification is that w′.x′ ≥ 0
is equivalent to

∑_{i=1}^{d} w_i x_i + θ × (−1) = w.x − θ ≥ 0

which in turn is equivalent to w.x ≥ θ. The latter is the condition for a positive
response from a perceptron with threshold θ.
We can also apply the Winnow Algorithm to the modified data. Winnow
requires all feature vectors to have 0s and 1s as components. However, we can
allow a −1 in the feature-vector component for θ if we treat it in the manner
opposite to the way we treat components that are 1. That is, if the training
example is positive, and we need to increase the other weights, we instead
divide the component for the threshold by 2. And if the training example is
negative, and we need to decrease the other weights, we multiply the threshold
component by 2.
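The augmentation is easy to sketch for the basic perceptron (not Winnow), where the appended component can simply be −1 and the zero-threshold training rule of Section 12.2.1 applies unchanged; the function names here are illustrative:

```python
# Folding the threshold into the weight vector (Section 12.2.3): append a
# constant -1 component to every feature vector, so the last weight learned
# by the zero-threshold algorithm plays the role of the threshold theta.

def augment(examples):
    """Replace each x by x' = x + [-1], so that w'.x' = w.x - theta."""
    return [(x + [-1], y) for x, y in examples]

def train_zero_threshold(examples, c, passes):
    """The basic perceptron rule of Section 12.2.1 with threshold 0."""
    w = [0.0] * len(examples[0][0])
    for _ in range(passes):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + c * y * xi for wi, xi in zip(w, x)]
    return w

training = augment([([1, 1, 0, 1, 1], +1), ([0, 0, 1, 1, 0], -1)])
w_prime = train_zero_threshold(training, c=0.5, passes=10)
theta = w_prime[-1]   # learned threshold; w_prime[:-1] are the word weights
```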
Example 12.6: Let us modify the training set of Fig. 12.5 to incorporate a
sixth "word" that represents the negative of the threshold θ. The new data
is shown in Fig. 12.7.

      and  viagra  the  of  nigeria   θ    y
a      1     1      0    1     1     −1   +1
b      0     0      1    1     0     −1   −1
c      0     1      1    0     0     −1   +1
d      1     0      0    1     0     −1   −1
e      1     0      1    0     1     −1   +1
f      1     0      1    1     0     −1   −1

Figure 12.7: Training data for spam emails, with a sixth component representing
the negative of the threshold
We begin with a weight vector w with six 1s, as shown in the first line
of Fig. 12.8. When we compute w.a = 3, using the first feature vector a, we
are happy because the training example is positive, and so is the dot product.
However, for the second training example, we compute w.b = 1. Since the
example is negative and the dot product is positive, we must adjust the weights.
Since b has 1s in the third and fourth components, the 1s in the corresponding
components of w are replaced by 1/2. The last component, corresponding to θ,
must be doubled. These adjustments give the new weight vector
[1, 1, 1/2, 1/2, 1, 2], shown in the third line of Fig. 12.8.
  x    y    w.x   OK?  and  viagra  the   of   nigeria   θ
                        1     1      1     1      1      1
  a   +1     3    yes
  b   −1     1    no    1     1     1/2   1/2     1      2
  c   +1   −1/2   no    1     2      1    1/2     1      1
  d   −1    1/2   no   1/2    2      1    1/4     1      2

Figure 12.8: Sequence of updates to w performed by the Winnow Algorithm
on the training set of Fig. 12.7
The feature vector c is a positive example, but w.c = −1/2. Thus, we must
double the second and third components of w, because c has 1 in the cor-
responding components, and we must halve the last component of w, which
corresponds to θ. The resulting w = [1, 2, 1, 1/2, 1, 1] is shown in the fourth
line of Fig. 12.8. Negative example d is next. Since w.d = 1/2, we must again
adjust weights. We halve the weights in the first and fourth components and
double the last component, yielding w = [1/2, 2, 1, 1/4, 1, 2]. Now, all positive
examples have a positive dot product with the weight vector, and all negative
examples have a negative dot product, so there are no further changes to the
weights.

The designed perceptron has a threshold of 2. It has weights 2 and 1 for
"viagra" and "nigeria" and smaller weights for "and" and "of." It also has
weight 1 for "the," which suggests that "the" is as indicative of spam as
"nigeria," something we doubt is true. Nevertheless, this perceptron does
classify all examples correctly. 2
12.2.4 Multiclass Perceptrons

There are several ways in which the basic idea of the perceptron can be ex-
tended. We shall discuss transformations that enable hyperplanes to serve for
more complex boundaries in the next section. Here, we look at how perceptrons
can be used to classify data into many classes.

Suppose we are given a training set with labels in k different classes. Start
by training a perceptron for each class; these perceptrons should each have the
same threshold θ. That is, for class i treat a training example (x, i) as a positive
example, and all examples (x, j), where j ≠ i, as negative examples. Suppose
that the weight vector of the perceptron for class i is determined to be w_i after
training.

Given a new vector x to classify, we compute w_i.x for all i = 1, 2, . . . , k. We
take the class of x to be the value of i for which w_i.x is the maximum, provided
that value is at least θ. Otherwise, x is assumed not to belong to any of the k
classes.
For example, suppose we want to classify Web pages into a number of topics,
such as sports, politics, medicine, and so on. We can represent Web pages by a
vector with 1 for each word present in the page and 0 for words not present (of
course we would only visualize the pages that way; we wouldn't construct the
vectors in reality). Each topic has certain words that tend to indicate that topic.
For instance, sports pages would be full of words like "win," "goal," "played,"
and so on. The weight vector for that topic would give higher weights to the
words that characterize that topic.

A new page could be classified as belonging to the topic that gives the
highest score when the dot product of the page's vector and the weight vectors
for the topics are computed. An alternative interpretation of the situation is
to classify a page as belonging to all those topics for which the dot product
is above some threshold (presumably a threshold higher than the θ used for
training).
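The classification rule just described can be sketched as follows; the weight vectors and the threshold below are invented for illustration, not trained on real data:

```python
# Multiclass classification with one perceptron per class (Section 12.2.4):
# classify x by the largest w_i.x, provided that score is at least theta.

def classify_multiclass(weight_vectors, theta, x):
    """Return the index of the best class, or None if no score reaches theta."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weight_vectors]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= theta else None

# Hypothetical weight vectors for three topics over a 4-word vocabulary.
weights = [
    [2.0, 0.0, 1.0, 0.0],   # class 0, e.g. "sports"
    [0.0, 2.0, 0.0, 1.0],   # class 1, e.g. "politics"
    [1.0, 1.0, 0.0, 0.0],   # class 2
]

print(classify_multiclass(weights, theta=1.5, x=[1, 0, 1, 0]))  # 0 (score 3.0)
print(classify_multiclass(weights, theta=1.5, x=[0, 0, 0, 1]))  # None
```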
12.2.5 Transforming the Training Set
While a perceptron must use a linear function to separate two classes, it is
always possible to transform the vectors of a training set before applying a
perceptron-based algorithm to separate the classes. An example should give
the basic idea.
Example 12.7: In Fig. 12.9 we see a plot of places to visit from my home. The
horizontal and vertical coordinates represent latitude and longitude of places.
Some example places have been classified into day trips (places close enough
to visit in one day) and excursions, which require more than a day to visit.
These are the circles and squares, respectively. Evidently, there is no straight
line that separates day trips from excursions. However, if we replace the Carte-
sian coordinates by polar coordinates, then in the transformed space of polar
coordinates, the dashed circle shown in Fig. 12.9 becomes a hyperplane.
Formally, we transform the vector x = [x_1, x_2] into
[√(x_1² + x_2²), arctan(x_2/x_1)].
In fact, we can also do dimensionality reduction of the data. The angle of
the point is irrelevant; only the radius √(x_1² + x_2²) matters. Thus, we can
turn the point vectors into one-component vectors giving the distance of the
point from the origin. Associated with the small distances will be the class
label "day trip," while the larger distances will all be associated with the label
"excursion." Training the perceptron is extremely easy. 2
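The transformation of Example 12.7 is easy to express in code. One small liberty: `atan2` is used in place of arctan(x_2/x_1) so that points in all quadrants get a well-defined angle; the function names are otherwise illustrative:

```python
# The transformation of Example 12.7: map Cartesian points to polar
# coordinates, then keep only the radius as a one-component feature vector.

from math import atan2, sqrt

def to_polar(x1, x2):
    """[x1, x2] -> [radius, angle], the transform used in Example 12.7."""
    return [sqrt(x1 ** 2 + x2 ** 2), atan2(x2, x1)]

def radius_feature(x1, x2):
    """The one-dimensional feature after dropping the irrelevant angle."""
    return [sqrt(x1 ** 2 + x2 ** 2)]

# With a radius threshold, "day trip" vs. "excursion" becomes a
# one-dimensional, linearly separable problem.
print(radius_feature(3, 4))  # [5.0]
```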
12.2.6 Problems With Perceptrons
Despite the extensions discussed above, there are some limitations to the ability
of perceptrons to classify some data. The biggest problem is that sometimes
the data is inherently not separable by a hyperplane. An example is shown in
Fig. 12.10. In this example, points of the two classes mix near the boundary, so
that any line through the points will have points of both classes on at least one
of its sides.

Figure 12.9: Transforming from rectangular to polar coordinates turns this
training set into one with a separating hyperplane
One might argue that, based on the observations of Section 12.2.5, it should
be possible to find some function on the points that would transform them to
another space where they were linearly separable. That might be the case,
but if so, it would probably be an example of overfitting, the situation where
the classifier works very well on the training set, because it has been carefully
designed to handle each training example correctly. However, because the clas-
sifier is exploiting details of the training set that do not apply to other examples
that must be classified in the future, the classifier will not perform well on new
data.
Another problem is illustrated in Fig. 12.11. Usually, if classes can be sep-
arated by one hyperplane, then there are many different hyperplanes that will
separate the points. However, not all hyperplanes are equally good. For in-
stance, if we choose the hyperplane that is furthest clockwise, then the point
indicated by "?" will be classified as a circle, even though we intuitively see it
as closer to the squares. When we meet support-vector machines in Section
12.3, we shall see that there is a way to insist that the hyperplane chosen be
the one that in a sense divides the space most fairly.
Figure 12.10: A training set may not allow the existence of any separating
hyperplane

Yet another problem is illustrated by Fig. 12.12. Most rules for training
a perceptron stop as soon as there are no misclassified points. As a result,
the chosen hyperplane will be one that just manages to classify some of the
points correctly. For instance, the upper line in Fig. 12.12 has just managed
to accommodate two of the squares, and the lower line has just managed to
accommodate one of the circles. If either of these lines represents the final
weight vector, then the weights are biased toward one of the classes. That
is, they correctly classify the points in the training set, but the upper line
would classify new squares that are just below it as circles, while the lower line
would classify circles just above it as squares. Again, a more equitable choice
of separating hyperplane will be shown in Section 12.3.
12.2.7 Parallel Implementation of Perceptrons
The training of a perceptron is an inherently sequential process. If the number of dimensions of the vectors involved is huge, then we might obtain some parallelism by computing dot products in parallel. However, as we discussed in connection with Example 12.4, high-dimensional vectors are likely to be sparse and can be represented more succinctly than would be expected from their length.
In order to get significant parallelism, we have to modify the perceptron algorithm slightly, so that many training examples are used with the same estimated weight vector w. As an example, let us formulate the parallel algorithm as a map-reduce job.
The Map Function: Each Map task is given a chunk of training examples,
and each Map task knows the current weight vector w. The Map task computes
w.x for each feature vector x = [x_1, x_2, . . . , x_k] in its chunk and compares that
[Figure: squares and circles separated by several candidate hyperplanes; one point is marked "?"]

Figure 12.11: Generally, more than one hyperplane can separate the classes if they can be separated at all
dot product with the label y, which is +1 or −1, associated with x. If the signs agree, no key-value pairs are produced for this training example. However, if the signs disagree, then for each nonzero component x_i of x the key-value pair (i, cyx_i) is produced; here, c is the learning-rate constant used to train this perceptron. Notice that cyx_i is the increment we would like to add to the current ith component of w, and if x_i = 0, then there is no need to produce a key-value pair. However, in the interests of parallelism, we defer that change until we can accumulate many changes in the Reduce phase.
The Reduce Function: For each key i, the Reduce task that handles key i
adds all the associated increments and then adds that sum to the ith component
of w.
Probably, these changes will not be enough to train the perceptron. If any changes to w occur, then we need to start a new map-reduce job that does the same thing, perhaps with different chunks from the training set. However, even if the entire training set was used on the first round, it can be used again, since its effect on w will be different if w is different.
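To make one round concrete, here is a sketch in Python. It only simulates the map-reduce job in a single process; the names map_task and reduce_task, and the representation of sparse feature vectors as dictionaries from component number to value, are our own choices, not part of any framework.

```python
def map_task(chunk, w, c):
    """Emit (i, c*y*x_i) for every nonzero component x_i of every
    misclassified example; sparse vectors are dicts {index: value}."""
    pairs = []
    for x, y in chunk:                 # y is +1 or -1
        dot = sum(w.get(i, 0.0) * xi for i, xi in x.items())
        if y * dot <= 0:               # sign of w.x disagrees with label y
            for i, xi in x.items():
                pairs.append((i, c * y * xi))
    return pairs

def reduce_task(pairs, w):
    """For each key i, add the sum of its increments to the
    ith component of w."""
    for i, delta in pairs:
        w[i] = w.get(i, 0.0) + delta
    return w

# One round on two chunks, with learning rate c = 0.5, starting from w = 0:
w = {}
chunks = [[({0: 1.0, 1: 2.0}, +1)], [({0: 2.0, 1: 1.0}, -1)]]
w = reduce_task([p for ch in chunks for p in map_task(ch, w, 0.5)], w)
```

Repeating the round with the updated w, as described above, is then just a loop around these two calls.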
12.2.8 Exercises for Section 12.2
Exercise 12.2.1: Modify the training set of Fig. 12.5 so that example b also includes the word nigeria (yet remains a negative example; perhaps someone is telling about their trip to Nigeria). Find a weight vector that separates the positive and negative examples, using:
(a) The basic training method of Section 12.2.1.
(b) The Winnow method of Section 12.2.2.
Figure 12.12: Perceptrons converge as soon as the separating hyperplane reaches
the region between classes
(c) The basic method with a variable threshold, as suggested in Section 12.2.3.
(d) The Winnow method with a variable threshold, as suggested in Section 12.2.3.
! Exercise 12.2.2: For the following training set:

    ([1, 2], +1)    ([2, 1], +1)
    ([2, 3], −1)    ([3, 2], −1)

describe all the vectors w and thresholds θ such that the hyperplane (really a line) defined by w.x = θ separates the points correctly.
! Exercise 12.2.3: Suppose the following four examples constitute a training set:

    ([1, 2], −1)    ([2, 3], +1)
    ([2, 1], +1)    ([3, 2], −1)

(a) What happens when you attempt to train a perceptron to classify these points using 0 as the threshold?
!! (b) Is it possible to change the threshold and obtain a perceptron that correctly classifies these points?
(c) Suggest a transformation using quadratic polynomials that will transform these points so they become linearly separable.
12.3 Support-Vector Machines
We can view a support-vector machine, or SVM, as an improvement on the perceptron that is designed to address the problems mentioned in Section 12.2.6. An SVM selects one particular hyperplane that not only separates the points in the two classes, but does so in a way that maximizes the margin, the distance between the hyperplane and the closest points of the training set.
12.3.1 The Mechanics of an SVM
The goal of an SVM is to select a hyperplane w.x + b = 0¹ that maximizes the distance γ between the hyperplane and any point of the training set. The idea is suggested by Fig. 12.13. There, we see the points of two classes and a hyperplane dividing them.
[Figure: two classes of points separated by the hyperplane w.x + b = 0, with the support vectors labeled]

Figure 12.13: An SVM selects the hyperplane with the greatest possible margin between the hyperplane and the training points
There are also two parallel hyperplanes at distance γ from the central hyperplane w.x + b = 0, and these each touch one or more of the support vectors. The latter are the points that actually constrain the dividing hyperplane, in the sense that they are all at distance γ from the hyperplane. In most cases, a d-dimensional set of points has d + 1 support vectors, as is the case in Fig. 12.13. However, there can be more support vectors if too many points happen to lie on the parallel hyperplanes. We shall see an example based on the points of Fig. 11.1, where it turns out that all four points are support vectors, even though two-dimensional data normally has three.
¹Constant b in this formulation of a hyperplane is the same as the negative of the threshold in our treatment of perceptrons in Section 12.2.
A tentative statement of our goal is, given a training set

    (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)

• Maximize γ (by varying w and b) subject to the constraint that for all i = 1, 2, . . . , n, y_i(w.x_i + b) ≥ γ.
Notice that y_i, which must be +1 or −1, determines which side of the hyperplane the point x_i must be on, so the ≥ relationship to γ is always correct. However, it may be easier to express this condition as two cases: if y = +1, then w.x + b ≥ γ, and if y = −1, then w.x + b ≤ −γ.
Unfortunately, this formulation doesn't really work properly. The problem is that by increasing w and b, we can always allow a larger value of γ. For example, suppose that w and b satisfy the constraint above. If we replace w by 2w and b by 2b, we observe that for all i, y_i((2w).x_i + 2b) ≥ 2γ. Thus, 2w and 2b is always a better choice than w and b, so there is no best choice and no maximum γ.
12.3.2 Normalizing the Hyperplane
The solution to the problem that we described intuitively above is to normalize
the weight vector w. That is, the unit of measure perpendicular to the sepa-
rating hyperplane is the unit vector w/w. Recall that w is the Frobenius
norm, or the square root of the sum of the squares of the components of w. We
shall require that w be such that the parallel hyperplanes that just touch the
support vectors are described by the equations w.x+b = +1 and w.x+b = 1,
as suggested by Fig. 12.14.
[Figure: the hyperplanes w.x + b = +1, w.x + b = 0, and w.x + b = −1, with points x_1 and x_2 on the outer hyperplanes and a margin of width γ, measured along w/‖w‖, on each side of the central hyperplane]

Figure 12.14: Normalizing the weight vector for an SVM
Our goal becomes to maximize γ, which is now the multiple of the unit vector w/‖w‖ between the separating hyperplane and the parallel hyperplanes through the support vectors. Consider one of the support vectors, say x_2 shown in Fig. 12.14. Let x_1 be the projection of x_2 onto the far hyperplane, also as suggested by Fig. 12.14. Note that x_1 need not be a support vector or even a point of the training set. The distance from x_2 to x_1 in units of w/‖w‖ is 2γ. That is,

    x_1 = x_2 + 2γ(w/‖w‖)    (12.1)
Since x_1 is on the hyperplane defined by w.x + b = +1, we know that w.x_1 + b = 1. If we substitute for x_1 using Equation 12.1, we get

    w.(x_2 + 2γ(w/‖w‖)) + b = 1

Regrouping terms, we see

    w.x_2 + b + 2γ(w.w/‖w‖) = 1    (12.2)
But the first two terms of Equation 12.2, w.x_2 + b, sum to −1, since we know that x_2 is on the hyperplane w.x + b = −1. If we move this −1 from left to right in Equation 12.2 and then divide through by 2, we conclude that

    γ(w.w/‖w‖) = 1    (12.3)

Notice also that w.w is the sum of the squares of the components of w. That is, w.w = ‖w‖². We conclude from Equation 12.3 that γ = 1/‖w‖.
This equivalence gives us a way to reformulate the optimization problem originally stated in Section 12.3.1. Instead of maximizing γ, we want to minimize ‖w‖, which is the inverse of γ if we insist on normalizing the scale of w. That is, given a training set (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n):

• Minimize ‖w‖ (by varying w and b) subject to the constraint that for all i = 1, 2, . . . , n, y_i(w.x_i + b) ≥ 1.
Example 12.8: Let us consider the four points of Fig. 11.1, supposing that they alternate as positive and negative examples. That is, the training set consists of

    ([1, 2], +1)    ([2, 1], −1)
    ([3, 4], +1)    ([4, 3], −1)

Let w = [u, v]. Our goal is to minimize √(u² + v²), subject to the constraints we derive from the four training examples. For the first, where x_1 = [1, 2] and y_1 = +1, the constraint is (+1)(u + 2v + b) = u + 2v + b ≥ 1. For the second, where x_2 = [2, 1] and y_2 = −1, the constraint is (−1)(2u + v + b) ≥ 1, or 2u + v + b ≤ −1. The last two points are analogously handled, and the four constraints we derive are:
    u + 2v + b ≥ 1     2u + v + b ≤ −1
    3u + 4v + b ≥ 1    4u + 3v + b ≤ −1
We shall not cover in detail the subject of how one optimizes with constraints; the subject is broad, and many packages are available for you to use. Section 12.3.4 discusses one method, gradient descent, in connection with a more general application of SVM, where there is no separating hyperplane. An illustration of how this method works will appear in Example 12.9.
In this simple example, the solution is easy to see: b = 0 and w = [u, v] = [−1, +1]. It happens that all four constraints are satisfied exactly; i.e., each of the four points is a support vector. That case is unusual, since when the data is two-dimensional, we expect only three support vectors. However, the fact that the positive and negative examples lie on parallel lines allows all four constraints to be satisfied exactly. ✷
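The claimed solution is easy to verify directly. The short sketch below only checks that every constraint of the example holds with equality and computes the resulting margin γ = 1/‖w‖; it does not derive the solution.

```python
from math import sqrt

# The solution claimed in Example 12.8: w = [-1, +1], b = 0.
training = [([1, 2], +1), ([2, 1], -1), ([3, 4], +1), ([4, 3], -1)]
w, b = [-1.0, 1.0], 0.0

# y*(w.x + b) for each example; a value of exactly 1 means the constraint
# is satisfied with equality, i.e., the point is a support vector.
slacks = [y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in training]

gamma = 1 / sqrt(w[0] ** 2 + w[1] ** 2)   # the margin, 1/||w||
```

All four slack values come out exactly 1, confirming that every point is a support vector.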
12.3.3 Finding Optimal Approximate Separators
We shall now consider finding an optimal hyperplane in the more general case, where no matter which hyperplane we choose, there will be some points on the wrong side, and perhaps some points that are on the correct side, but too close to the separating hyperplane itself, so the margin requirement is not met. A typical situation is shown in Fig. 12.15. We see two points that are misclassified; they are on the wrong side of the separating hyperplane w.x + b = 0. We also see two points that, while they are classified correctly, are too close to the separating hyperplane. We shall call all these points bad points.
Each bad point incurs a penalty when we evaluate a possible hyperplane. The amount of the penalty, in units to be determined as part of the optimization process, is shown by the arrow leading to the bad point from the hyperplane on the wrong side of which the bad point lies. That is, the arrows measure the distance from the hyperplane w.x + b = 1 or w.x + b = −1. The former is the baseline for training examples that are supposed to be above the separating hyperplane (because the label y is +1), and the latter is the baseline for points that are supposed to be below (because y = −1).
We have many options regarding the exact formula that we wish to minimize. Intuitively, we want ‖w‖ to be as small as possible, as we discussed in Section 12.3.2. But we also want the penalties associated with the bad points to be as small as possible. The most common form of a trade-off is expressed by a formula that involves the term ‖w‖²/2 and another term that involves a constant times the sum of the penalties.
To see why minimizing the term ‖w‖²/2 makes sense, note that minimizing ‖w‖ is the same as minimizing any monotone function of ‖w‖, so it is at least an option to choose a formula in which we try to minimize ‖w‖²/2. It turns out to be desirable because its derivative with respect to any component of w is that component. That is, if w = [w_1, w_2, . . . , w_d], then ‖w‖²/2 is (1/2) Σ_{i=1}^d w_i², so its partial derivative ∂/∂w_i is w_i. This situation makes sense because, as we shall see, the derivative of the penalty term with respect to w_i is a constant times each
[Figure: hyperplanes w.x + b = −1, w.x + b = 0, and w.x + b = +1, with arrows to two misclassified points and to two points too close to the boundary]

Figure 12.15: Points that are misclassified or are too close to the separating hyperplane incur a penalty; the amount of the penalty is proportional to the length of the arrow leading to that point
x_i, the corresponding component of each feature vector whose training example incurs a penalty. That in turn means that the vector w and the vectors of the training set are commensurate in the units of their components.
Thus, we shall consider how to minimize the particular function

    f(w, b) = (1/2) Σ_{j=1}^d w_j² + C Σ_{i=1}^n max{0, 1 − y_i(Σ_{j=1}^d w_j x_ij + b)}    (12.4)
The first term encourages small ‖w‖, while the second term, involving the constant C that must be chosen properly, represents the penalty for bad points in a manner to be explained below. We assume there are n training examples (x_i, y_i) for i = 1, 2, . . . , n, and x_i = [x_i1, x_i2, . . . , x_id]. Also, as before, w = [w_1, w_2, . . . , w_d]. Note that the two summations Σ_{j=1}^d express the dot product of vectors.
The constant C, called the regularization parameter, reflects how important misclassification is. Pick a large C if you really do not want to misclassify points, but would accept a narrow margin. Pick a small C if you are OK with some misclassified points, but want most of the points to be far away from the boundary (i.e., the margin is large).
We must explain the penalty function (second term) in Equation 12.4. The summation over i has one term

    L(x_i, y_i) = max{0, 1 − y_i(Σ_{j=1}^d w_j x_ij + b)}
for each training example x_i. L is a hinge function, suggested in Fig. 12.16, and we call its value the hinge loss. Let z_i = y_i(Σ_{j=1}^d w_j x_ij + b). When z_i is 1 or more, the value of L is 0. But for smaller values of z_i, L rises linearly as z_i decreases.
[Figure: the hinge function max{0, 1 − z} plotted against z = y_i(w.x_i + b) for z from −2 to 3; it falls linearly until it reaches 0 at z = 1 and stays 0 thereafter]

Figure 12.16: The hinge function decreases linearly for z ≤ 1 and then remains 0
Since we shall have need to take the derivative with respect to each w_j of L(x_i, y_i), note that the derivative of the hinge function is discontinuous. It is −y_i x_ij for z_i < 1 and 0 for z_i > 1. That is, if y_i = +1 (i.e., the ith training example is positive), then

    ∂L/∂w_j = if Σ_{j=1}^d w_j x_ij + b ≥ 1 then 0 else −x_ij

Moreover, if y_i = −1 (i.e., the ith training example is negative), then

    ∂L/∂w_j = if Σ_{j=1}^d w_j x_ij + b ≤ −1 then 0 else x_ij

The two cases can be summarized as one, if we include the value of y_i, as:

    ∂L/∂w_j = if y_i(Σ_{j=1}^d w_j x_ij + b) ≥ 1 then 0 else −y_i x_ij    (12.5)
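The hinge loss and the derivative of Equation 12.5 are straightforward to express in code. The sketch below is illustrative only, and the function names hinge_loss and hinge_grad are ours:

```python
def hinge_loss(w, b, x, y):
    """L(x, y) = max(0, 1 - y*(w.x + b)), the hinge loss of one example."""
    z = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - z)

def hinge_grad(w, b, x, y, j):
    """dL/dw_j as in Equation 12.5: 0 when y*(w.x + b) >= 1, else -y*x[j]."""
    z = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 0.0 if z >= 1 else -y * x[j]
```

For instance, with w = [0, 1], b = −2, and the positive example x = [2, 2], we have z = 0, so the loss is 1 and the derivative with respect to the first component is −2.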
12.3.4 SVM Solutions by Gradient Descent
A common approach to solving Equation 12.4 is gradient descent. We compute the derivative of the equation with respect to b and each component w_j of the vector w. Since we want to minimize f(w, b), we move b and the components w_j in the direction opposite to the direction of the gradient. The amount we move each component is proportional to the derivative with respect to that component.
Our first step is to use the trick of Section 12.2.3 to make b part of the weight vector w. Notice that b is really the negative of a threshold on the dot product w.x, so we can append a (d + 1)st component b to w and append an extra component with value +1 to every feature vector in the training set (not −1 as we did in Section 12.2.3).
We must choose a constant η to be the fraction of the gradient that we move w in each round. That is, we assign

    w_j := w_j − η(∂f/∂w_j)

for all j = 1, 2, . . . , d + 1.
The derivative ∂f/∂w_j of the first term in Equation 12.4, (1/2) Σ_{j=1}^d w_j², is easy; it is w_j.² However, the second term involves the hinge function, so it is harder to express. We shall use an if-then expression to describe these derivatives, as in Equation 12.5. That is:
    ∂f/∂w_j = w_j + C Σ_{i=1}^n (if y_i(Σ_{j=1}^d w_j x_ij + b) ≥ 1 then 0 else −y_i x_ij)    (12.6)
Note that this formula applies to w_d+1, which is b, as well as to the weights w_1, w_2, . . . , w_d. We continue to use b instead of the equivalent w_d+1 in the if-then condition to remind us of the form in which the desired hyperplane is described.
To execute the gradient-descent algorithm on a training set, we pick:

1. Values for the parameters C and η.

2. Initial values for w, including the (d + 1)st component b.

Then, we repeatedly:

(a) Compute the partial derivatives of f(w, b) with respect to the w_j's.

(b) Adjust the values of w by subtracting η(∂f/∂w_j) from each w_j.
Example 12.9: Figure 12.17 shows six points, three positive and three negative. We expect that the best separating line will be horizontal, and the only question is whether or not the separating hyperplane and the scale of w allows the point (2, 2) to be misclassified or to lie too close to the boundary. Initially, we shall choose w = [0, 1], a vertical vector with a scale of 1, and we shall choose b = −2. As a result, we see in Fig. 12.17 that the point (2, 2) lies on the
[Figure: positive points (1,4), (2,2), and (3,4) above, negative points (1,1), (2,1), and (3,1) below; the initial hyperplane passes through (2,2), with a margin on each side and the vector w pointing upward]

Figure 12.17: Six points for a gradient-descent example
initial hyperplane and the three negative points are right at the margin. The parameter values we shall choose for gradient descent are C = 0.1 and η = 0.2.
We begin by incorporating b as the third component of w, and for notational convenience, we shall use u and v as the first two components, rather than the customary w_1 and w_2. That is, we take w = [u, v, b]. We also expand the two-dimensional points of the training set with a third component that is always 1. That is, the training set becomes

    ([1, 4, 1], +1)    ([2, 2, 1], +1)    ([3, 4, 1], +1)
    ([1, 1, 1], −1)    ([2, 1, 1], −1)    ([3, 1, 1], −1)
In Fig. 12.18 we tabulate the if-then conditions and the resulting contributions to the summations over i in Equation 12.6. The summation must be multiplied by C and added to u, v, or b, as appropriate, to implement Equation 12.6.
The truth or falsehood of each of the six conditions in Fig. 12.18 determines the contribution of the terms in the summations over i in Equation 12.6. We shall represent the status of each condition by a sequence of x's and o's, with x representing a condition that does not hold and o representing one that does. The first few iterations of gradient descent are shown in Fig. 12.19.
Consider line (1). It shows the initial value of w = [0, 1]. Recall that we use u and v for the components of w, so u = 0 and v = 1. We also see the initial value of b = −2. We must use these values of u and v to evaluate the conditions in Fig. 12.18. The first of the conditions in Fig. 12.18 is u + 4v + b ≥ +1. The left side is 0 + 4 + (−2) = 2, so the condition is satisfied. However, the second condition, 2u + 2v + b ≥ +1, fails. The left side is 0 + 2 + (−2) = 0. The fact

²Note, however, that d there has become d + 1 here, since we include b as one of the components of w when taking the derivative.
                                      for u   for v   for b
    if u + 4v + b ≥ +1 then 0 else     −1      −4      −1
    if 2u + 2v + b ≥ +1 then 0 else    −2      −2      −1
    if 3u + 4v + b ≥ +1 then 0 else    −3      −4      −1
    if u + v + b ≤ −1 then 0 else      +1      +1      +1
    if 2u + v + b ≤ −1 then 0 else     +2      +1      +1
    if 3u + v + b ≤ −1 then 0 else     +3      +1      +1

Figure 12.18: Sum each of these terms and multiply by C to get the contribution of bad points to the derivatives of f with respect to u, v, and b
        w = [u, v]            b       Bad      ∂f/∂u    ∂f/∂v    ∂f/∂b
    1.  [0.000, 1.000]     −2.000   oxoooo    −0.200    0.800   −2.100
    2.  [0.040, 0.840]     −1.580   oxoxxx     0.440    0.940   −1.380
    3.  [−0.048, 0.652]    −1.304   oxoxxx     0.352    0.752   −1.104
    4.  [−0.118, 0.502]    −1.083   xxxxxx    −0.118   −0.198   −1.083
    5.  [−0.094, 0.542]    −0.866   oxoxxx     0.306    0.642   −0.666
    6.  [−0.155, 0.414]    −0.733   xxxxxx

Figure 12.19: Beginning of the process of gradient descent
that the sum is 0 means the second point (2, 2) is exactly on the separating hyperplane, and not outside the margin. The third condition is satisfied, since 0 + 4 + (−2) = 2 ≥ +1. The last three conditions are also satisfied, and in fact are satisfied exactly. For instance, the fourth condition is u + v + b ≤ −1. The left side is 0 + 1 + (−2) = −1. Thus, the pattern oxoooo represents the outcome of these six conditions, as we see in the first line of Fig. 12.19.
We use these conditions to compute the partial derivatives. For ∂f/∂u, we use u in place of w_j in Equation 12.6. This expression thus becomes

    u + C(0 + (−2) + 0 + 0 + 0 + 0) = 0 + (1/10)(−2) = −0.2

The sum multiplying C can be explained this way. For each of the six conditions of Fig. 12.18, take 0 if the condition is satisfied, and take the value in the column labeled "for u" if it is not satisfied. Similarly, for v in place of w_j we get ∂f/∂v = 1 + (1/10)(0 + (−2) + 0 + 0 + 0 + 0) = 0.8. Finally, for b we get ∂f/∂b = −2 + (1/10)(0 + (−1) + 0 + 0 + 0 + 0) = −2.1.
We can now compute the new w and b that appear on line (2) of Fig. 12.19. Since we chose η = 1/5, the new value of u is 0 − (1/5)(−0.2) = 0.04, the new value of v is 1 − (1/5)(0.8) = 0.84, and the new value of b is −2 − (1/5)(−2.1) = −1.58.
To compute the derivatives shown in line (2) of Fig. 12.19 we must first check the conditions of Fig. 12.18. While the outcomes of the first three conditions have not changed, the last three are no longer satisfied. For example, the fourth condition is u + v + b ≤ −1, but 0.04 + 0.84 + (−1.58) = −0.7, which is not less than −1. Thus, the pattern of bad points becomes oxoxxx. We now have more nonzero terms in the expressions for the derivatives. For example,

    ∂f/∂u = 0.04 + (1/10)(0 + (−2) + 0 + 1 + 2 + 3) = 0.44
The values of w and b in line (3) are computed from the derivatives of line (2) in the same way as they were computed in line (2). The new values do not change the pattern of bad points; it is still oxoxxx. However, when we repeat the process for line (4), we find that all six conditions are unsatisfied. For instance, the first condition, u + 4v + b ≥ +1, is not satisfied, because −0.118 + 4 × 0.502 + (−1.083) = 0.807, which is less than 1. In effect, the first point has become too close to the separating hyperplane, even though it is properly classified.
We can see that in line (5) of Fig. 12.19, the problems with the first and third points are corrected, and we go back to pattern oxoxxx of bad points. However, at line (6), the points have again become too close to the separating hyperplane, so we revert to the xxxxxx pattern of bad points. You are invited to continue the sequence of updates to w and b for several more iterations.
One might wonder why the gradient-descent process seems to be converging on a solution where at least some of the points are inside the margin, when there is an obvious hyperplane (horizontal, at height 1.5) with a margin of 1/2 that separates the positive and negative points. The reason is that when we picked C = 0.1 we were saying that we really don't care too much whether there are points inside the margins, or even if points are misclassified. We were also saying that what was important was a large margin (which corresponds to a small ‖w‖), even if some points violated that same margin. ✷
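The calculations of this example are mechanical enough to check by program. In the sketch below (the function name step is ours), b is carried as a separate variable rather than as the third component of w:

```python
# The six points of Example 12.9, with the gradient-descent parameters
# chosen in the text.
training = [([1, 4], +1), ([2, 2], +1), ([3, 4], +1),
            ([1, 1], -1), ([2, 1], -1), ([3, 1], -1)]
C, eta = 0.1, 0.2

def step(w, b):
    """One gradient-descent update of Equation 12.6 for u, v, and b."""
    du, dv, db = w[0], w[1], b       # derivative of the (1/2)||w||^2 term
    for x, y in training:
        if y * (w[0] * x[0] + w[1] * x[1] + b) < 1:   # a bad point
            du -= C * y * x[0]
            dv -= C * y * x[1]
            db -= C * y
    return [w[0] - eta * du, w[1] - eta * dv], b - eta * db

w, b = step([0.0, 1.0], -2.0)   # line (2) of Fig. 12.19, up to rounding
```

Each further call to step reproduces, up to rounding, the next line of Fig. 12.19.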
12.3.5 Parallel Implementation of SVM
One approach to parallelism for SVM is analogous to what we suggested for
perceptrons in Section 12.2.7. You can start with the current w and b, and
in parallel do several iterations based on each training example. Then average
the changes for each of the examples to create a new w and b. If we distribute
w and b to each mapper, then the Map tasks can do as many iterations as we
wish to do in one round, and we need use the Reduce tasks only to average the
results. One iteration of map-reduce is needed for each round.
A second approach is to follow the prescription given here, but implement the computation of the second term in Equation 12.4 in parallel. The contribution from each training example can then be summed. This approach requires one round of map-reduce for each iteration of gradient descent.
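The second approach can be sketched as follows. Here chunk_penalty plays the role of a Map task, computing the partial sum of hinge losses over its chunk, and the final summation stands in for the Reduce task; both function names are ours, and the code below runs sequentially rather than on a cluster.

```python
def chunk_penalty(chunk, w, b):
    """Partial sum of the hinge losses over one chunk (a Map task's job)."""
    total = 0.0
    for x, y in chunk:
        z = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        total += max(0.0, 1.0 - z)
    return total

def penalty(chunks, w, b, C):
    """Add the per-chunk sums and scale by C (the Reduce task's job)."""
    return C * sum(chunk_penalty(chunk, w, b) for chunk in chunks)

# Three of the points of Example 12.9, split across two chunks:
chunks = [[([1, 4], +1)], [([2, 2], +1), ([1, 1], -1)]]
total = penalty(chunks, [0.0, 1.0], -2.0, 0.1)   # only ([2, 2], +1) has loss
```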
12.3.6 Exercises for Section 12.3
Exercise 12.3.1: Continue the iterations of Fig. 12.19 for three more iterations.
Exercise 12.3.2: The following training set obeys the rule that the positive examples all have vectors whose components sum to 10 or more, while the sum is less than 10 for the negative examples.

    ([3, 4, 5], +1)    ([2, 7, 2], +1)    ([5, 5, 5], +1)
    ([1, 2, 3], −1)    ([3, 3, 2], −1)    ([2, 4, 1], −1)
(a) Which of these six vectors are the support vectors?
! (b) Suggest a vector w and constant b such that the hyperplane defined by w.x + b = 0 is a good separator for the positive and negative examples. Make sure that the scale of w is such that all points are outside the margin; that is, for each training example (x, y), you have y(w.x + b) ≥ +1.
! (c) Starting with your answer to part (b), use gradient descent to find the optimum w and b. Note that if you start with a separating hyperplane, and you scale w properly, then the second term of Equation 12.4 will always be 0, which simplifies your work considerably.
! Exercise 12.3.3: The following training set obeys the rule that the positive examples all have vectors whose components have an odd sum, while the sum is even for the negative examples.

    ([1, 2], +1)    ([3, 4], +1)    ([5, 2], +1)
    ([2, 4], −1)    ([3, 1], −1)    ([7, 3], −1)

(a) Suggest a starting vector w and constant b that classifies at least three of the points correctly.
!! (b) Starting with your answer to (a), use gradient descent to find the optimum w and b.
12.4 Learning from Nearest Neighbors
In this section we consider several examples of learning, where the entire training set is stored, perhaps preprocessed in some useful way, and then used to classify future examples or to compute the value of the label that is most likely associated with the example. The feature vector of each training example is treated as a data point in some space. When a new point arrives and must be classified, we find the training example or examples that are closest to the new point, according to the distance measure for that space. The estimated label is then computed by combining the closest examples in some way.
12.4.1 The Framework for Nearest-Neighbor Calculations
The training set is first preprocessed and stored. The decisions take place when a new example, called the query example, arrives and must be classified.
There are several decisions we must make in order to design a nearest-
neighbor-based algorithm that will classify query examples. We enumerate
them here.
1. What distance measure do we use?
2. How many of the nearest neighbors do we look at?
3. How do we weight the nearest neighbors? Normally, we provide a function
(the kernel function) of the distance between the query example and its
nearest neighbors in the training set, and use this function to weight the
neighbors.
4. How do we define the label to associate with the query? This label is some function of the labels of the nearest neighbors, perhaps weighted by the kernel function, or perhaps not. If there is no weighting, then the kernel function need not be specified.
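The four decisions can be captured as parameters of one function. The sketch below is our own framing, with an inverse-distance kernel as an illustrative choice; passing kernel=None corresponds to using no weighting in decision 3.

```python
def knn_label(training, q, k, dist, kernel=None):
    """Estimate the label of query q from its k nearest neighbors.
    training: list of (feature_vector, label) pairs
    dist:     the distance measure (decision 1); k is decision 2
    kernel:   weighting function of distance (decision 3), or None
    The (weighted) average of the labels implements decision 4."""
    nearest = sorted(training, key=lambda ex: dist(ex[0], q))[:k]
    if kernel is None:                       # plain average of the labels
        return sum(y for _, y in nearest) / k
    weights = [kernel(dist(x, q)) for x, _ in nearest]
    return sum(wt * y for wt, (_, y) in zip(weights, nearest)) / sum(weights)

# One-dimensional illustration with three training points:
one_d = [([1], 1.0), ([2], 2.0), ([3], 4.0)]
dist = lambda x, q: abs(x[0] - q[0])
unweighted = knn_label(one_d, [2.2], 2, dist)             # (2.0 + 4.0) / 2
weighted = knn_label(one_d, [2.2], 2, dist, lambda d: 1 / d)
```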
12.4.2 Learning with One Nearest Neighbor
The simplest cases of nearest-neighbor learning are when we choose only the
one neighbor that is nearest the query example. In that case, there is no use
for weighting the neighbors, so the kernel function is omitted. There is also
typically only one possible choice for the labeling function: take the label of the
query to be the same as the label of the nearest neighbor.
Example 12.10: Figure 12.20 shows some of the examples of dogs that last appeared in Fig. 12.1. We have dropped most of the examples for simplicity, leaving only three Chihuahuas, two Dachshunds, and two Beagles. Since the height-weight vectors describing the dogs are two-dimensional, there is a simple and efficient way to construct a Voronoi diagram for the points, in which the perpendicular bisectors of the lines between each pair of points are constructed. Each point gets a region around it, containing all the points to which it is the nearest. These regions are always convex, although they may be open to infinity in one direction.³ It is also a surprising fact that, even though there are O(n²) perpendicular bisectors for n points, the Voronoi diagram can be found in O(n log n) time (see the References for this chapter).
In Fig. 12.20 we see the Voronoi diagram for the seven points. The boundaries that separate dogs of different breeds are shown solid, while the boundaries
³While the region belonging to any one point is convex, the union of the regions for two or more points might not be convex. Thus, in Fig. 12.20 we see that the region for all Dachshunds and the region for all Beagles are not convex. That is, there are points p_1 and p_2 that are both classified Dachshunds, but the midpoint of the line between p_1 and p_2 is classified as a Beagle, and vice-versa.
[Figure: Voronoi regions grouped into areas labeled Chihuahuas, Dachshunds, and Beagles]

Figure 12.20: Voronoi diagram for the three breeds of dogs
between dogs of the same breed are shown dashed. Suppose a query example q is provided. Note that q is a point in the space of Fig. 12.20. We find the region into which q falls, and give q the label of the training example to which that region belongs. Note that it is not too hard to find the region of q. We have to determine on which side of certain lines q falls. This process is the same as we used in Sections 12.2 and 12.3 to compare a vector x with a hyperplane perpendicular to a vector w. In fact, if the lines that actually form parts of the Voronoi diagram are preprocessed properly, we can make the determination in O(log n) comparisons; it is not necessary to compare q with all of the O(n log n) lines that form part of the diagram. ✷
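For a modest training set, a brute-force scan is a reasonable alternative to the Voronoi-based lookup: O(n) work per query rather than O(log n), but with no preprocessing at all. In this sketch the three dogs, and their height-weight numbers, are invented for illustration.

```python
def nearest_label(training, q):
    """Classify q by the label of its nearest training point, using
    squared Euclidean distance (no square root is needed to compare)."""
    def sq_dist(x):
        return sum((xi - qi) ** 2 for xi, qi in zip(x, q))
    x, y = min(training, key=lambda ex: sq_dist(ex[0]))
    return y

# Hypothetical (height, weight) training examples, one per region:
dogs = [([8, 5], "Chihuahua"), ([17, 12], "Dachshund"), ([15, 20], "Beagle")]
```

A query such as [9, 6] then falls in the Chihuahua region, in the sense that its nearest training point is the Chihuahua example.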
12.4.3 Learning One-Dimensional Functions
Another simple and useful case of nearest-neighbor learning has one-dimensional data. In this situation, the training examples are of the form ([x], y), and we shall write them as (x, y), identifying a one-dimensional vector with its lone component. In effect, the training set is a collection of samples of the value of a function y = f(x) for certain values of x, and we must interpolate the function f at all points. There are many rules that could be used, and we shall only outline some of the popular approaches. As discussed in Section 12.4.1, the approaches vary in the number of neighbors they use, whether or not the neighbors are weighted, and if so, how the weight varies with distance.
Suppose we use a method with k nearest neighbors, and x is the query point. Let x_1, x_2, . . . , x_k be the k nearest neighbors of x, and let the weight associated with training point (x_i, y_i) be w_i. Then the estimate of the label y for x is

    Σ_{i=1}^k w_i y_i / Σ_{i=1}^k w_i

Note that this expression gives the weighted average of the labels of the k nearest neighbors.
Example 12.11: We shall illustrate four simple rules, using the training set (1, 1), (2, 2), (3, 4), (4, 8), (5, 4), (6, 2), and (7, 1). These points represent a function that has a peak at x = 4 and decays exponentially on both sides. Note that this training set has values of x that are evenly spaced. There is no requirement that the points have any regular pattern. Some possible ways to interpolate values are:
[Figure: two plots of the interpolated function for x between 1 and 7]

(a) One nearest neighbor    (b) Average of two nearest neighbors

Figure 12.21: Results of applying the first two rules in Example 12.11
1. Nearest Neighbor. Use only the one nearest neighbor. There is no need
for a weighting. Just take the value of f(x), for any x, to be the label y
associated with the training-set point nearest to query point x. The result of
using this rule on the example training set described above is shown in
Fig. 12.21(a).
2. Average of the Two Nearest Neighbors. Choose 2 as the number of nearest
neighbors to use. The weights of these two are each 1/2, regardless of how
far they are from the query point x. The result of this rule on the example
training set is in Fig. 12.21(b).
3. Weighted Average of the Two Nearest Neighbors. We again choose two
nearest neighbors, but we weight them in inverse proportion to their dis-
tance from the query point. Suppose the two neighbors nearest to query
point x are x_1 and x_2. Suppose first that x_1 < x < x_2. Then the weight
of x_1, the inverse of its distance from x, is 1/(x - x_1), and the weight of
x_2 is 1/(x_2 - x). The weighted average of the labels is

   ( y_1/(x - x_1) + y_2/(x_2 - x) ) / ( 1/(x - x_1) + 1/(x_2 - x) )

which, when we multiply numerator and denominator by (x - x_1)(x_2 - x),
simplifies to

   ( y_1(x_2 - x) + y_2(x - x_1) ) / (x_2 - x_1)
This expression is the linear interpolation of the two nearest neighbors, as
shown in Fig. 12.22(a). When both nearest neighbors are on the same side
of the query x, the same weights make sense, and the resulting estimate is
an extrapolation. We see extrapolation in Fig. 12.22(a) in the range x = 0
to x = 1. In general, when points are unevenly spaced, we can find query
points in the interior where both neighbors are on one side.
4. Average of Three Nearest Neighbors. We can average any number of the
nearest neighbors to estimate the label of a query point. Figure 12.22(b)
shows what happens on our example training set when the three nearest
neighbors are used.
2
(a) Weighted average of two neighbors (b) Average of three neighbors
Figure 12.22: Results of applying the last two rules in Example 12.11
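The third rule's equivalence to linear interpolation is easy to check numerically. The following sketch (function and variable names are ours) weights the two nearest neighbors by inverse distance:

```python
train = [(1, 1), (2, 2), (3, 4), (4, 8), (5, 4), (6, 2), (7, 1)]

def weighted_two_nn(train, q):
    """Inverse-distance-weighted average of the two nearest neighbors.
    Assumes q does not coincide with a training point (distance > 0)."""
    (x1, y1), (x2, y2) = sorted(train, key=lambda p: abs(p[0] - q))[:2]
    w1, w2 = 1 / abs(q - x1), 1 / abs(q - x2)
    return (w1 * y1 + w2 * y2) / (w1 + w2)

# Between x = 3 and x = 4 this is the straight line through (3, 4) and (4, 8):
print(weighted_two_nn(train, 3.25))  # ≈ 5.0, i.e., 4 + 0.25 * (8 - 4)
```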
12.4.4 Kernel Regression
A way to construct a continuous function that represents the data of a training
set well is to consider all points in the training set, but weight the points using a
kernel function that decays with distance. A popular choice is to use a normal
distribution (or "bell curve"), so the weight of a training point x when the
query is q is e^{-(x-q)^2/2σ^2}. Here σ is the standard deviation of the distribution
and the query q is the mean. Roughly, points within distance σ of q are heavily
weighted, and those further away have little weight. The advantage of using
a kernel function that is itself continuous and that is defined for all points in
the training set is to be sure that the resulting function learned from the data
is itself continuous (see Exercise 12.4.6 for a discussion of the problem when a
simpler weighting is used).
Example 12.12: Let us use the seven training examples of Example 12.11.
To make calculation simpler, we shall not use the normal distribution as the
kernel function, but rather another continuous function of distance, namely
w = 1/(x - q)^2. That is, weights decay as the square of the distance. Suppose
the query q is 3.5. The weights w_1, w_2, . . . , w_7 of the seven training examples
(x_i, y_i) = (i, 8/2^{|i-4|}) for i = 1, 2, . . . , 7 are shown in Fig. 12.23.
1. x_i        1     2    3   4   5     6     7
2. y_i        1     2    4   8   4     2     1
3. w_i        4/25  4/9  4   4   4/9   4/25  4/49
4. w_i y_i    4/25  8/9  16  32  16/9  8/25  4/49

Figure 12.23: Weights of points when the query is q = 3.5
Lines (1) and (2) of Fig. 12.23 give the seven training points. The weight
of each when the query is q = 3.5 is given in line (3). For instance, for x_1 = 1,
the weight w_1 = 1/(1 - 3.5)^2 = 1/(2.5)^2 = 4/25. Then, line (4) shows each y_i
weighted by the weight from line (3). For instance, the column for x_2 has value
8/9 because w_2 y_2 = 2 × (4/9).
To compute the label for the query q = 3.5 we sum the weighted values
of the labels in the training set, as given by line (4) of Fig. 12.23; this sum is
51.23. We then divide by the sum of the weights in line (3). This sum is 9.29,
so the ratio is 51.23/9.29 = 5.51. That estimate of the value of the label for
q = 3.5 seems intuitively reasonable, since q lies midway between two points
with labels 4 and 8. 2
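The arithmetic of this example is easy to check in a few lines of Python (a sketch, with a function name of our choosing; the kernel 1/(x - q)^2 and the training set are those of the example):

```python
# Training set from Example 12.12: (x_i, y_i) = (i, 8/2^|i-4|) for i = 1..7
xs = list(range(1, 8))
ys = [8 / 2 ** abs(i - 4) for i in xs]

def kernel_estimate(q, xs, ys):
    """Weighted average of all labels, with kernel weights 1/(x - q)^2.
    Assumes q is not exactly a training point (see the sidebar on limits)."""
    ws = [1 / (x - q) ** 2 for x in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

print(round(kernel_estimate(3.5, xs, ys), 2))  # 5.51, as in the text
```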
12.4.5 Dealing with High-Dimensional Euclidean Data
We saw in Section 12.4.2 that the two-dimensional case of Euclidean data is
fairly easy. There are several large-scale data structures that have been
developed for finding near neighbors when the number of dimensions grows and
the training set is large.

Problems in the Limit for Example 12.12

Suppose q is exactly equal to one of the training examples x. If we use the
normal distribution as the kernel function, there is no problem with the
weight of x; the weight is 1. However, with the kernel function discussed in
Example 12.12, the weight of x is 1/(x - q)^2 = ∞. Fortunately, this weight
appears in both the numerator and denominator of the expression that
estimates the label of q. It can be shown that in the limit as q approaches
x, the label of x dominates all the other terms in both numerator and
denominator, so the estimated label of q is the same as the label of x.
That makes excellent sense, since q = x in the limit.

We shall not cover these structures here, because the
subject could fill a book by itself, and there are many places available to learn
about these techniques, collectively called multidimensional index structures.
The bibliography for this section mentions some of these sources for informa-
tion about such structures as kd-Trees, R-Trees, and Quad Trees.
Unfortunately, for high-dimensional data, there is little that can be done to
avoid searching a large portion of the data. This fact is another manifestation
of the curse of dimensionality from Section 7.1.3. Two ways to deal with the
curse are the following:
1. VA Files. Since we must look at a large fraction of the data anyway in
order to find the nearest neighbors of a query point, we could avoid a
complex data structure altogether. Accept that we must scan the entire
file, but do so in a two-stage manner. First, a summary of the file is
created, using only a small number of bits that approximate the values of
each component of each training vector. For example, if we use only the
high-order (1/4)th of the bits in numerical components, then we can create
a file that is (1/4)th the size of the full dataset. However, by scanning
this file we can construct a list of candidates that might be among the k
nearest neighbors of the query q, and this list may be a small fraction of
the entire dataset. We then look up only these candidates in the complete
file, in order to determine which k are nearest to q.
2. Dimensionality Reduction. We may treat the vectors of the training set
as a matrix, where the rows are the vectors of the training examples, and
the columns correspond to the components of these vectors. Apply one of
the dimensionality-reduction techniques of Chapter 11, to compress the
vectors to a small number of dimensions, small enough that the techniques
for multidimensional indexing can be used. Of course, when processing
a query vector q, the same transformation must be applied to q before
searching for q's nearest neighbors.
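The two-stage VA-file scan of point (1) can be sketched in a few lines of Python. This is our simplified rendition, not a prescribed design: quantizing each component to 256 levels and over-fetching a fixed multiple of k stand in for the bit-vector approximations and the lower/upper distance bounds a real VA-file would maintain.

```python
import heapq

def quantize(v, lo=0.0, hi=1.0, levels=256):
    """Approximate each component by one of `levels` evenly spaced values,
    playing the role of the few-bit summary file."""
    step = (hi - lo) / levels
    return tuple(lo + step * min(int((x - lo) / step), levels - 1) + step / 2
                 for x in v)

def va_knn(data, q, k, overfetch=3):
    """Two-stage k-NN: scan quantized vectors for candidates, then rank
    only the candidates by exact distance.  Returns indices into `data`."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    approx = [quantize(v) for v in data]           # stage 1: compact summary
    cand = heapq.nsmallest(overfetch * k, range(len(data)),
                           key=lambda i: dist2(approx[i], q))
    return heapq.nsmallest(k, cand,                # stage 2: exact distances
                           key=lambda i: dist2(data[i], q))

data = [(0.1, 0.2), (0.9, 0.8), (0.15, 0.25)]
print(va_knn(data, (0.1, 0.2), 2))  # [0, 2]
```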
12.4.6 Dealing with Non-Euclidean Distances
To this point, we have assumed that the distance measure is Euclidean. How-
ever, most of the techniques can be adapted naturally to an arbitrary distance
function d. For instance, in Section 12.4.4 we talked about using a normal dis-
tribution as a kernel function. Since we were thinking about a one-dimensional
training set in a Euclidean space, we wrote the exponent as (xq)
2
. However,
for any distance function d, we can use as the weight of a point x at distance
d(x, q) from the query point q the value of
e
_
d(xq)
_
2
/
2
Note that this expression makes sense if the data is in some high-dimensional
Euclidean space and d is the usual Euclidean distance or Manhattan distance or
any other distance discussed in Section 3.5.2. It also makes sense if d is Jaccard
distance or any other distance measure.
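For instance, here is a sketch of that weight with Jaccard distance standing in for d (the function names, the choice σ = 0.5, and the example sets are ours):

```python
import math

def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 minus the Jaccard similarity."""
    return 1 - len(a & b) / len(a | b)

def kernel_weight(x, q, d, sigma=1.0):
    """Kernel weight e^(-d(x,q)^2 / (2 sigma^2)) for any distance function d."""
    return math.exp(-d(x, q) ** 2 / (2 * sigma ** 2))

# Jaccard distance of these sets is 1 - 2/4 = 0.5, so the weight is e^(-1/2):
w = kernel_weight({1, 2, 3}, {2, 3, 4}, jaccard_distance, sigma=0.5)
print(round(w, 4))  # 0.6065
```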
However, for Jaccard distance and the other distance measures we consid-
ered in Section 3.5 we also have the option to use locality-sensitive hashing,
the subject of Chapter 3. Recall these methods are only approximate, and they
could yield false negatives: training examples that were near neighbors to a
query but that do not show up in a search.
If we are willing to accept such errors occasionally, we can build the buckets
for the training set and keep them as the representation of the training set.
These buckets are designed so we can retrieve all (or almost all, since there can
be false negatives) training-set points that have a minimum similarity to a
given query q. Equivalently, one of the buckets to which the query hashes will
contain all those points within some maximum distance of q. We hope that as
many nearest neighbors of q as our method requires will be found among those
buckets.
Yet if different queries have radically different distances to their nearest
neighbors, all is not lost. We can pick several distances d_1 < d_2 < d_3 < · · ·.
Build the buckets for LSH using each of these distances. For a query q, start
with the buckets for distance d_1. If we find enough near neighbors, we are done.
Otherwise, repeat the search using the buckets for d_2, and so on, until enough
nearest neighbors are found.
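The escalating search can be sketched as follows; `RadiusIndex` is a toy stand-in for an LSH index built for one distance, returning exactly the points within its radius (a real LSH index would return them only approximately, from hash buckets, with occasional false negatives):

```python
class RadiusIndex:
    """Toy stand-in for an LSH index tuned to one distance: candidates(q)
    returns the points within `radius` of q (here, by exhaustive scan)."""
    def __init__(self, points, radius):
        self.points, self.radius = points, radius
    def candidates(self, q):
        return [p for p in self.points if abs(p - q) <= self.radius]

def near_neighbors(q, indexes, k):
    """Try indexes built for increasing distances d1 < d2 < ... until at
    least k candidates are found; return whatever the last index gave."""
    cand = []
    for index in indexes:
        cand = index.candidates(q)
        if len(cand) >= k:
            break
    return cand

points = [1.0, 2.0, 10.0, 11.0]
idx = [RadiusIndex(points, r) for r in (0.5, 2.0, 10.0)]
print(near_neighbors(2.2, idx, 2))  # radius 0.5 yields only [2.0]; 2.0 yields [1.0, 2.0]
```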
12.4.7 Exercises for Section 12.4
Exercise 12.4.1: Suppose we modied Example 12.10 to look at the two
nearest neighbors of a query point q. Classify q with the common label if those
two neighbors have the same label, and leave q unclassified if the labels of the
neighbors are different.
(a) Sketch the boundaries of the regions for the three dog breeds on Fig. 12.20.
! (b) Would the boundaries always consist of straight line segments for any
training data?
Exercise 12.4.2: Suppose we have the following training set
([1, 2], +1)   ([2, 1], -1)
([3, 4], -1)   ([4, 3], +1)
which is the training set used in Example 12.9. If we use nearest-neighbor
learning with the single nearest neighbor as the estimate of the label of a query
point, which query points are labeled +1?
Exercise 12.4.3: Consider the one-dimensional training set
(1, 1), (2, 2), (4, 3), (8, 4), (16, 5), (32, 6)
Describe the function f(q), the label that is returned in response to the query
q, when the interpolation used is:
(a) The label of the nearest neighbor.
(b) The average of the labels of the two nearest neighbors.
! (c) The average, weighted by distance, of the two nearest neighbors.
(d) The (unweighted) average of the three nearest neighbors.
! Exercise 12.4.4: Apply the kernel function of Example 12.12 to the data of
Exercise 12.4.3. For queries q in the range 2 < q < 4, what is the label of q?
Exercise 12.4.5: What is the function that estimates the label of query points
using the data of Example 12.11 and the average of the four nearest neighbors?
!! Exercise 12.4.6: Simple weighting functions such as those in Example 12.11
need not define a continuous function. We can see that the constructed functions
in Fig. 12.21 and Fig. 12.22(b) are not continuous, but Fig. 12.22(a) is. Does the
weighted average of two nearest neighbors always give a continuous function?