Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for
linear classifiers, Support vector machines, obtaining probabilities from linear classifiers,
Going beyond linearity with kernel methods. Distance Based Models: Introduction,
Neighbours and exemplars, Nearest Neighbours classification, Distance Based Clustering,
Hierarchical Clustering.
----------------------------------------------------------------------------------------------------------------
Linear Models:
The models that can be understood in terms of lines and planes, commonly called linear
models. In machine learning, linear models are of particular interest because of their
simplicity. The reasons for this simplicity are as follows.
Linear models are parametric, meaning that they have a fixed form with a small number of
numeric parameters that need to be learned from data. This is different from tree or rule
models, where the structure of the model is not fixed in advance.
Linear models are stable, which is to say that small variations in the training data have
only limited impact on the learned model.
Linear models are less likely to overfit the training data than some other models, largely
because they have relatively few parameters.
The last two points can be summarized by saying that linear models have low variance but
high bias. Such models are often preferable when you have limited data and want to avoid
overfitting. High-variance, low-bias models such as decision trees are preferable if data is
abundant but underfitting is a concern.
The univariate least-squares regression coefficient, b̂ = σ_xy/σ_xx, can be understood by
noting that the covariance is measured in units of x times units of y (e.g., metres times
kilograms in Example 7.1) and the variance in units of x squared (e.g., metres squared), so
their quotient is measured in units of y per unit of x (e.g., kilograms per metre), which is
exactly what we expect of a slope.
In other words, univariate linear regression can be understood as consisting of two steps:
1. normalisation of the feature by dividing its values by the feature's variance;
2. calculating the covariance of the target variable and the normalised feature.
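To make this two-step recipe concrete, here is a minimal sketch in Python; the function name and the toy height/weight values are only for illustration (they are not the data of Example 7.1):

```python
import numpy as np

def univariate_least_squares(x, y):
    """Fit y = a + b*x by the least-squares method.

    The slope b is the covariance of x and y divided by the variance of x;
    the intercept follows from a = mean(y) - b * mean(x).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # sigma_xy / sigma_xx
    a = y.mean() - b * x.mean()
    return a, b

# Illustrative height (m) / weight (kg) pairs
heights = np.array([1.60, 1.70, 1.75, 1.80, 1.85])
weights = np.array([55.0, 65.0, 70.0, 78.0, 82.0])
a_hat, b_hat = univariate_least_squares(heights, weights)
print(a_hat, b_hat)   # intercept in kg, slope in kg per metre
```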
The sum of the residuals of the least-squares solution is zero:

Σ_{i=1}^{n} (y_i − (â + b̂ x_i)) = 0

The result follows because â = ȳ − b̂ x̄, as derived in Example 7.1. While this property is
intuitively appealing, it is worth keeping in mind that it also makes linear regression
susceptible to outliers: points that are far removed from the regression line, often because of
measurement errors.
Example 7.2 (The effect of outliers). Suppose that, as the result of a transcription error, one
of the weight values in Figure 7.1 is increased by 10 kg. Figure 7.2 shows that this has a
considerable effect on the least-squares regression line.
Figure 7.2. The effect of an outlier in univariate regression. One of the blue points got moved
up 10 units to the green point, changing the red regression line to the green line.
In the second form of this equation, y, a, X and ε are n-vectors, and b is a scalar. In the case
of d features, all that changes is that X becomes an n-by-d matrix and b becomes a d-vector of
regression coefficients. We can apply the by now familiar trick of using homogeneous
coordinates to simplify these equations as follows:

y = X⁰w + ε

with X⁰ an n-by-(d+1) matrix whose first column is all 1s and whose remaining columns are the
columns of X, and w has the intercept as its first entry and the regression coefficients as the
remaining d entries. For convenience we will often blur the distinction between these two
formulations and state the regression equation as y = Xw + ε, with X having d columns and w
having d rows; from the context it will be clear whether we are representing the intercept by
means of homogeneous coordinates, or have rather zero-centred the target and features to
achieve a zero intercept. In the univariate case we were able to obtain a closed-form solution
for w; can we do the same in the multivariate case? First, we are likely to need the
covariances between every feature and the target variable.
Consider the expression Xᵀy, which is a d-vector; its j-th entry is the product of the j-th row
of Xᵀ – i.e., the j-th column of X, which is (x_1j, . . . , x_nj) – with (y_1, . . . , y_n):

Σ_{i=1}^{n} x_ij y_i = n(σ_jy + μ_j ȳ)

Assuming for the moment that every feature is zero-centred, we have μ_j = 0 and thus Xᵀy is
a d-vector holding all the required covariances (times n).
In the multivariate case we also want to normalise the features to have unit variance, which
we can achieve by means of a d-by-d scaling matrix: a diagonal matrix with diagonal entries
1/(nσ_jj). If S is the diagonal matrix with diagonal entries nσ_jj, we can get the required
scaling matrix by simply inverting S. So our first stab at a solution for the multivariate
regression problem is

ŵ = S⁻¹Xᵀy

As it turns out, the general case requires a more elaborate matrix in place of S:

ŵ = (XᵀX)⁻¹Xᵀy

Notice that if we do assume uncorrelated features (σ₁₂ = 0 in the case of two features), the
components of ŵ reduce to σ_jy/σ_jj, which brings us back to Equation 7.2. Assuming
uncorrelated features effectively decomposes a multivariate regression problem into d
univariate problems.
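The closed-form solution can be checked with a short sketch; np.linalg.solve is used instead of forming the inverse explicitly, and the synthetic data is only there to show that the known coefficients are recovered:

```python
import numpy as np

def multivariate_least_squares(X, y):
    """Least-squares weights from the normal equations w = (X^T X)^{-1} X^T y.

    A column of ones is prepended (homogeneous coordinates), so the first
    entry of the returned vector is the intercept.
    """
    X0 = np.column_stack([np.ones(len(X)), X])     # n-by-(d+1) matrix X0
    return np.linalg.solve(X0.T @ X0, X0.T @ y)    # solves (X0^T X0) w = X0^T y

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)
print(multivariate_least_squares(X, y))   # approximately [3.0, 1.5, -2.0]
```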
THE PERCEPTRON
A linear classifier that will achieve perfect separation on linearly separable data is the
perceptron. The perceptron iterates over the training set, updating the weight vector every
time it encounters an incorrectly classified example.
For example, let x_i be a misclassified positive example; then we have y_i = +1 and w·x_i < t.
We therefore want to find w′ such that w′·x_i > w·x_i, which moves the decision boundary
towards and hopefully past x_i. This can be achieved by calculating the new weight vector as
w′ = w + ηx_i, where 0 < η ≤ 1 is the learning rate.
We then have w′·x_i = w·x_i + ηx_i·x_i > w·x_i, as required. Similarly, if x_j is a misclassified
negative example, then we have y_j = −1 and w·x_j > t. In this case we calculate the new weight
vector as w′ = w − ηx_j, and thus w′·x_j = w·x_j − ηx_j·x_j < w·x_j. The two cases can be
combined in a single update rule:

w′ = w + ηy_i x_i (7.8)
The perceptron training algorithm is given in Algorithm 7.1. It iterates through the training
examples until all examples are correctly classified. The algorithm can easily be turned into
an online algorithm that processes a stream of examples, updating the weight vector only if
the last received example is misclassified. The perceptron is guaranteed to converge to a
solution if the training data is linearly separable, but it won’t converge otherwise.
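The training loop described above can be sketched as follows; labels are assumed to be in {−1, +1}, the threshold is folded into the weight vector via homogeneous coordinates, and the epoch cap is an added safeguard against non-separable data:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Train a perceptron on labels in {-1, +1}.

    A constant 1 is appended to every instance so the threshold is learned
    as part of the weight vector. Weights start at zero, so the learning
    rate eta only scales the solution and can be left at 1.
    """
    X0 = np.column_stack([np.asarray(X, float), np.ones(len(X))])
    w = np.zeros(X0.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X0, y):
            if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                w += eta * yi * xi       # update rule w' = w + eta * y_i * x_i
                converged = False
        if converged:                    # all training examples correctly classified
            break
    return w
```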
Figure 7.5. (left) A perceptron trained with a small learning rate (η = 0.2). The circled
examples are the ones that trigger the weight update. (middle) Increasing the learning rate to
η = 0.5 leads in this case to a rapid convergence. (right) Increasing the learning rate further
to η = 1 may lead to too aggressive weight updating, which harms convergence. The starting
point in all three cases was the basic linear classifier.
Figure 7.5 gives a graphical illustration of the perceptron training algorithm. In this particular
example the weight vector is initialized to the basic linear classifier, which means the
learning rate does have an effect on how quickly we move away from the initial decision
boundary.
However, if the weight vector is initialised to the zero vector, it is easy to see that the learning
rate is just a constant factor that does not affect convergence. We will set it to 1 in what follows.
The key point of the perceptron algorithm is that, every time an example x_i is misclassified,
we add y_i x_i to the weight vector. After training has completed, each example has been
misclassified zero or more times; denote this number α_i for example x_i.
Using this notation the weight vector can be expressed as

w = Σ_{i=1}^{n} α_i y_i x_i

In other words, the weight vector is a linear combination of the training instances. The
perceptron shares this property with, e.g., the basic linear classifier:

w = Σ_{x∈D} α_{c(x)} c(x) x

where c(x) is the true class of example x (i.e., +1 or −1), α⊕ = 1/Pos and α⊖ = 1/Neg. In the
dual, instance-based view of linear classification we are learning instance weights α_i rather
than feature weights w_j. In this dual perspective, an instance x is classified as

ŷ = sign(Σ_{i=1}^{n} α_i y_i x_i·x)
This means that, during training, the only information needed about the training data is all
pairwise dot products: the n-by-n matrix G = XXᵀ containing these dot products is called the
Gram matrix. Algorithm 7.2 gives the dual form of the perceptron training algorithm.
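A matching sketch of the dual form, which only ever touches the data through the Gram matrix (again with labels in {−1, +1} and an epoch cap added for safety):

```python
import numpy as np

def dual_perceptron_train(X, y, max_epochs=100):
    """Dual (instance-based) perceptron training.

    alpha[i] counts how often example i triggered an update; only the
    Gram matrix of pairwise dot products is needed during training.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    G = X @ X.T                                        # Gram matrix G = X X^T
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        converged = True
        for i in range(len(y)):
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:   # misclassified
                alpha[i] += 1
                converged = False
        if converged:
            break
    return alpha   # equivalent primal weights: w = (alpha * y) @ X
```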
SUPPORT VECTOR MACHINES
For a given training set and decision boundary, let m⁺ be the smallest margin of any positive
example and m⁻ the smallest margin of any negative example; we want the sum of these to be as
large as possible.
This sum is independent of the decision threshold t, as long as the nearest positives and
negatives are kept on the correct side of the decision boundary, and so we readjust t such that
m⁺ and m⁻ become equal. Figure 7.7 depicts this graphically in a two-dimensional instance space.
The training examples nearest to the decision boundary are called support vectors: as we shall
see, the decision boundary of a support vector machine (SVM) is defined as a linear
combination of the support vectors.
Figure 7.7. The geometry of a support vector classifier. The circled data points are the
support vectors, which are the training examples nearest to the decision boundary. The
support vector machine finds the decision boundary that maximises the margin m/||w||.
The margin is thus defined as m/||w||, where m is the distance between the decision boundary
and the nearest training instances (at least one of each class) as measured along w. Since we
are free to rescale t, ||w|| and m, it is customary to choose m = 1. Maximising the margin
then corresponds to minimising ||w|| or, more conveniently, ½||w||², provided of course that
none of the training points fall inside the margin. This leads to a constrained quadratic
optimisation problem:

minimise ½||w||²  subject to  y_i(w·x_i − t) ≥ 1, 1 ≤ i ≤ n

We will approach this using the method of Lagrange multipliers. Adding the constraints with
multipliers α_i for each training example gives the Lagrange function

Λ(w, t, α_1, . . . , α_n) = ½||w||² − Σ_{i=1}^{n} α_i (y_i(w·x_i − t) − 1)
While this looks like a formidable formula, some further analysis will allow us to derive
the simpler dual form of the Lagrange function.
By taking the partial derivative of the Lagrange function with respect to t and setting it to 0
we obtain Σ_{i=1}^{n} α_i y_i = 0; taking the partial derivative with respect to w and setting
it to 0 gives w = Σ_{i=1}^{n} α_i y_i x_i. Substituting these back into the Lagrange function
yields its dual form

Λ(α_1, . . . , α_n) = −½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i·x_j + Σ_{i=1}^{n} α_i

The dual problem is to maximise this function under positivity constraints and one equality
constraint:

maximise Λ(α_1, . . . , α_n)  subject to  α_i ≥ 0 for 1 ≤ i ≤ n and Σ_{i=1}^{n} α_i y_i = 0
The dual form of the optimization problem for support vector machines illustrates two
important points.
First, it shows that searching for the maximum-margin decision boundary is equivalent to
searching for the support vectors: they are the training examples with non-zero Lagrange
multipliers, and through w = Σ_{i=1}^{n} α_i y_i x_i they completely determine the decision
boundary.
Secondly, it shows that the optimization problem is entirely defined by pairwise dot products
between training instances: the entries of the Gram matrix.
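For completeness, here is how a hard-margin classifier could be inspected with scikit-learn, assuming that library is available; the three points are made up, a very large C approximates the hard margin, and sklearn's sign convention for the intercept may differ from the t used in the text:

```python
import numpy as np
from sklearn.svm import SVC

# Three linearly separable examples (illustrative, not the data of Figure 7.8)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([-1, -1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print(clf.support_vectors_)        # training examples with non-zero multipliers
print(clf.dual_coef_)              # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)   # w and the intercept (sklearn's convention)

# The weight vector is a linear combination of the support vectors:
w_from_svs = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_svs, clf.coef_))   # True
```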
Figure 7.8. (left) A maximum-margin classifier built from three examples, with w = (0,−1/2)
and margin 2. The circled examples are the support vectors: they receive non-zero Lagrange
multipliers and define the decision boundary. (right) By adding a second positive example, the
decision boundary is rotated to w = (3/5,−4/5) and the margin decreases to 1.
The signed distance of an example x to the decision boundary is

d(x) = (w·x − t)/||w|| = w̄·x − t̄

with w̄ = w/||w|| the weight vector rescaled to unit length and t̄ = t/||w|| the correspondingly
rescaled intercept. The sign of this quantity tells us which side of the decision boundary we
are on: positive distances for points on the ‘positive’ side of the decision boundary (the
direction in which w points) and negative distances on the other side.
This geometric interpretation of the scores produced by linear classifiers offers an interesting
possibility for turning them into probabilities, a process called calibration.
Let d̄⊕ denote the mean distance of the positive examples to the decision boundary: i.e.,
d̄⊕ = w·μ⊕ − t, where μ⊕ is the mean of the positive examples and w is unit length (although the
latter assumption is not strictly necessary, as it will turn out that the weight vector will be
rescaled). It would not be unreasonable to expect that the distance of positive examples to the
decision boundary is normally distributed around this mean: that is, when plotting a histogram
of these distances, we would expect the familiar bell curve to appear.
Under this assumption, the probability density function of d is

P(d|⊕) = (1/√(2πσ²)) exp(−(d − d̄⊕)²/(2σ²))

Similarly, the distances of negative examples to the decision boundary can be expected to be
normally distributed around d̄⊖ = w·μ⊖ − t.
We will assume that both normal distributions have the same variance σ².
Suppose we now observe a point x with distance d(x). We classify this point as positive if
d(x) > 0 and as negative if d(x) < 0, but we want to attach a probability p̂(x) = P(⊕|d(x)) to
these predictions. Using Bayes’ rule we obtain

p̂(x) = P(d(x)|⊕)P(⊕) / (P(d(x)|⊕)P(⊕) + P(d(x)|⊖)P(⊖)) = LR(x)·clr / (LR(x)·clr + 1)

where LR is the likelihood ratio P(d(x)|⊕)/P(d(x)|⊖) obtained from the normal score
distributions, and clr = P(⊕)/P(⊖) is the class ratio. We will assume for simplicity that
clr = 1 in the derivation below. Furthermore, assume for now that σ² = 1 and d̄⊕ = −d̄⊖ = 1/2,
so that the class means are equidistant from the decision boundary and one unit of variance
apart. We then have

LR(x) = exp(−(d(x) − 1/2)²/2) / exp(−(d(x) + 1/2)²/2) = exp(d(x))

and hence p̂(x) = 1/(1 + exp(−d(x))): the logistic function shown in Figure 7.11.
Figure 7.11. The logistic function, a useful function for mapping distances from a linear
decision boundary into an estimate of the positive posterior probability. The fat red line
indicates the standard logistic function p̂(d) = 1/(1 + exp(−d)); this function can be used to
obtain probability estimates if the two classes are equally prevalent and the class means are
equidistant from the decision boundary and one unit of variance apart. The steeper and
flatter red lines show how the function changes if the class means are 2 and 1/2 units of
variance apart, respectively. The three blue lines show how these curves change if d0 = 1,
which means that the positives are on average further away from the decision boundary.
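A small sketch of this calibration step; the function and its defaults are chosen here to match the assumptions above (shared variance, class means at ±1/2), in which case it reduces to the standard logistic function:

```python
import math

def calibrate(d, d_pos=0.5, d_neg=-0.5, sigma2=1.0, clr=1.0):
    """Map a signed distance d to P(positive | d) under the two-Gaussian model.

    Assumes per-class distances are normal with means d_pos and d_neg and
    shared variance sigma2; clr is the class ratio P(+)/P(-). With the
    defaults this reduces to the standard logistic 1/(1 + exp(-d)).
    """
    log_lr = ((d - d_neg) ** 2 - (d - d_pos) ** 2) / (2.0 * sigma2)
    odds = math.exp(log_lr) * clr
    return odds / (1.0 + odds)

print(calibrate(0.0))   # 0.5: on the decision boundary
print(calibrate(2.0))   # about 0.88, the same as 1/(1 + exp(-2))
```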
Take the perceptron algorithm in dual form. The algorithm is a simple counting algorithm; the
only operation that is somewhat involved is testing whether example x_i is correctly
classified, by evaluating the sign of Σ_j α_j y_j x_j·x_i. The key component of this
calculation is the dot product x_i·x_j. Assume bivariate examples x_i = (x_i, y_i) and
x_j = (x_j, y_j) for notational simplicity; the dot product can then be written as
x_i·x_j = x_i x_j + y_i y_j. The corresponding instances in the quadratic feature space are
(x_i², y_i²) and (x_j², y_j²), and their dot product is x_i²x_j² + y_i²y_j². This is almost
equal to (x_i·x_j)² = x_i²x_j² + y_i²y_j² + 2x_i x_j y_i y_j, but not quite, because of the
third term of cross-products. We can capture this term by extending the feature vector with a
third feature √2·xy. This gives the following feature space: (x², y², √2·xy), in which the dot
product of two mapped instances is exactly the square of their dot product in the original
space.
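A quick numeric check of this construction (the points a and b are arbitrary): the dot product in the extended feature space equals the squared dot product in the original space, i.e., the quadratic kernel:

```python
import numpy as np

def phi(v):
    """Map a bivariate point (x, y) to the quadratic feature space (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x * x, y * y, np.sqrt(2.0) * x * y])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)     # dot product in the extended feature space ...
kernel = (a @ b) ** 2          # ... equals the squared original dot product
print(explicit, kernel, np.isclose(explicit, kernel))   # 1.0 1.0 True
```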
Distance-based models
What’s the relevance when trying to understand distance-based machine learning. Well, the
rank (row) and file (column) on a chessboard is not unlike a discrete or categorical feature in
machine learning.
We can switch to real-valued features by imagining a ‘continuous’ chessboard with infinitely
many, infinitesimally narrow ranks and files. Squares now become points, and distances are
not expressed as the number of squares travelled, but simply as a real number on some scale.
If we now look at the shapes obtained by connecting equidistant points, we see that many of
these carry over from the discrete to the continuous case.
For a King, for example, all points a given fixed distance away still form a square around the
current position; and for a KRook they still form a square rotated 45 degrees.
As it happens, these are special cases of the following generic concept.
If we let p grow to infinity we obtain Chebyshev distance, Dis_∞(x, y) = max_j |x_j − y_j|:
this is the distance experienced by a King on the chessboard, who can move diagonally as well
as horizontally and vertically, but only one step at a time.
Another interesting case is p = 0. This is not strictly a Minkowski distance; however, we can
define it as Dis_0(x, y) = Σ_j |x_j − y_j|⁰ under the understanding that x⁰ = 0 for x = 0 and 1
otherwise, so that it counts the number of coordinates in which x and y differ. This is
actually the distance experienced by a Rook on the chessboard: if both rank and file are
different the square is two moves away; if only one of them is different the square is one
move away.
If x and y are binary strings, this is also called the Hamming distance.
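These distances can be collected in one short sketch; the handling of p = 0 and p = ∞ follows the conventions just described:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p, with the limiting cases used in the text.

    p = 2 gives Euclidean, p = 1 Manhattan, p = inf Chebyshev (King move),
    and p = 0 counts differing coordinates (Rook move / Hamming distance).
    """
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if p == 0:
        return float(np.count_nonzero(diff))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 2))        # 5.0  (Euclidean)
print(minkowski(a, b, 1))        # 7.0  (Manhattan)
print(minkowski(a, b, np.inf))   # 4.0  (Chebyshev)
print(minkowski(a, b, 0))        # 2.0  (number of differing coordinates)
```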
Figure 8.4 investigates this for Minkowski distances of various orders. The triangle inequality
dictates that the distance from the origin to C is no more than the sum of the distances from
the origin to A (Dis(O,A)) and from A to C (Dis(A, C)). B is at the same distance from A as
C, regardless of the distance measure used; so Dis(O,A)+Dis(A,C) is equal to the distance
from the origin to B. So, if we draw a circle around the origin through B, the triangle
inequality dictates that C not be outside that circle. As we see in the left figure for Euclidean
distance, B is the only point where the circles around the origin and around A intersect, so
everywhere else the triangle inequality is a strict inequality.
The middle figure shows the same situation for Manhattan distance (p = 1). Now, B and C
are in fact equidistant from the origin, and so travelling via A to C is no longer a detour, but
just one of the many shortest routes. However, if we now decrease p further, we see that C
ends up outside the red shape, and is thus further away than B when seen from the origin,
whereas of course the sum of the distances from the origin to A and from A to C is still equal
to the distance from the origin to B. At this point, our intuition breaks down: Minkowski
distances with p < 1 are simply not very useful as distances since they all violate the triangle
inequality.
Nearest-neighbour classification
In the previous section we saw how to generalise the basic linear classifier to more than two
classes, by learning an exemplar for each class and using the nearest-exemplar decision rule
to classify new data. In fact, the most commonly used distance-based classifier is even more
straightforward than that: it simply uses each training instance as an exemplar. Consequently,
‘training’ this classifier requires nothing more than memorising the training data. This
extremely simple classifier is known as the nearest-neighbour classifier. Its decision regions
are made up of the cells of a Voronoi tessellation, with piecewise linear decision boundaries
selected from the Voronoi boundaries (since adjacent cells may be labelled with the same
class).
What are the properties of the nearest-neighbour classifier? First, notice that, unless the
training set contains identical instances from different classes, we will be able to separate the
classes perfectly on the training set – not really a surprise, as we memorized all training
examples! Furthermore, by choosing the right exemplars we can more or less represent any
decision boundary, or at least an arbitrarily close piecewise linear approximation. It follows
that the nearest-neighbour classifier has low bias, but also high variance: move any of the
exemplars spanning part of the decision boundary, and you will also change the boundary.
This suggests a risk of overfitting if the training data is limited, noisy or unrepresentative.
From an algorithmic point of view, training the nearest-neighbour classifier is very fast,
taking only O(n) time for storing n exemplars. The downside is that classifying a single
instance also takes O(n) time, as the instance will need to be compared with every exemplar
to determine which one is the nearest. It is possible to reduce classification time at the
expense of increased training time by storing the exemplars in a more elaborate data
structure, but this tends not to scale well to large numbers of features.
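A minimal brute-force sketch of this classifier (the class name and toy data are illustrative only):

```python
import numpy as np

class NearestNeighbourClassifier:
    """Brute-force 1-nearest-neighbour classifier.

    'Training' just stores the exemplars (O(n)); classifying one instance
    compares it against every stored exemplar (also O(n)).
    """
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        return self

    def predict_one(self, x):
        distances = np.linalg.norm(self.X - x, axis=1)   # Euclidean distances
        return self.y[np.argmin(distances)]              # label of the nearest exemplar

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
y = np.array(["red", "red", "blue"])
print(NearestNeighbourClassifier().fit(X, y).predict_one(np.array([3.5, 4.0])))  # blue
```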
In fact, high-dimensional instance spaces can be problematic for another reason: the infamous
curse of dimensionality. High-dimensional spaces tend to be extremely sparse, which means
that every point is far away from virtually every other point, and hence pairwise distances
tend to be uninformative. However, whether or not you are hit by the curse of dimensionality
is not simply a matter of counting the number of features, as there are several reasons why the
effective dimensionality of the instance space may be much smaller than the number of
features. For example, some of the features may be irrelevant and drown out the relevant
features’ signal in the distance calculations. In such a case it would be a good idea, before
building a distance-based model, to reduce dimensionality by performing feature selection,
as will be discussed in Chapter 10. Alternatively, the data may live on a manifold of lower
dimension than the instance space (e.g., the surface of a sphere is a two-dimensional manifold
wrapped around a three-dimensional object), which allows other dimensionality-reduction
techniques such as principal component analysis, which will be explained in the same
chapter. In any case, before applying nearest-neighbour classification it is a good idea to plot
a histogram of pairwise distances of a sample to see if they are sufficiently varied.
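Such a diagnostic might look as follows, assuming NumPy and matplotlib are available; the 100-dimensional Gaussian sample is only there to show the typical narrow histogram of a high-dimensional data set:

```python
import numpy as np
import matplotlib.pyplot as plt

def pairwise_distance_histogram(X, bins=30):
    """Plot a histogram of all pairwise Euclidean distances in a sample.

    A narrow, concentrated histogram suggests distances are uninformative
    (a symptom of the curse of dimensionality); a spread-out one is healthier.
    """
    X = np.asarray(X, float)
    diff = X[:, None, :] - X[None, :, :]            # n x n x d differences
    dists = np.sqrt((diff ** 2).sum(-1))            # n x n distance matrix
    upper = dists[np.triu_indices(len(X), k=1)]     # each pair counted once
    plt.hist(upper, bins=bins)
    plt.xlabel("pairwise distance")
    plt.ylabel("count")
    plt.show()

rng = np.random.default_rng(0)
pairwise_distance_histogram(rng.normal(size=(200, 100)))   # 100-dimensional sample
```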
Notice that the nearest-neighbour method can easily be applied to regression problems with a
real-valued target variable. In fact, the method is completely oblivious to the type of target
variable and can be used to output text documents, images and videos. It is also possible to
output the exemplar itself instead of a separate target, in which case we usually speak of
nearest-neighbour retrieval. Of course we can only output targets (or exemplars) stored in the
exemplar database, but if we have a way of aggregating these we can go beyond this
restriction by applying the k-nearest neighbour method. In its simplest form, the k-nearest
neighbour classifier takes a vote between the k ≥ 1 nearest exemplars of the instance to be
classified, and predicts the majority class. We can easily turn this into a probability estimator
by returning the normalized class counts as a probability distribution over classes.
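A sketch of this voting scheme for a single query point; the helper function and the toy data are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict_proba(X_train, y_train, x, k=3):
    """k-nearest-neighbour class distribution for a single query point x.

    Returns normalised class counts among the k nearest exemplars, which
    can be read as a probability estimate over the classes.
    """
    dists = np.linalg.norm(np.asarray(X_train, float) - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest exemplars
    counts = Counter(np.asarray(y_train)[nearest])
    return {label: c / k for label, c in counts.items()}

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
y = np.array(["orange", "orange", "red", "red", "red"])
print(knn_predict_proba(X, y, np.array([0.4, 0.4]), k=3))
# roughly {'orange': 2/3, 'red': 1/3}: 'orange' is the majority among the 3 neighbours
```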
Figure 8.9. (left) Decision regions of a 3-nearest neighbour classifier; the shading
represents the predicted probability distribution over the five classes. (middle) 5-nearest
neighbour. (right) 7-nearest neighbour.
Figure 8.9 illustrates this on a small data set of 20 exemplars from five different classes, for k
= 3, 5, 7. The class distribution is visualised by assigning each test point the class of a
uniformly sampled neighbour: so, in a region where two of k = 3 neighbours are red and one
is orange, the shading is a mix of two-thirds red and one-third orange. While for k = 3 the
decision regions are still mostly discernible, this is much less so for k = 5 and k = 7. This
may seem at odds with our earlier demonstration of the increase in the number of decision
regions with increasing k in Example 8.2. However, this increase is countered by the fact that
the probability vectors become more similar to each other. To take an extreme example: if k
is equal to the number of exemplars n, every test instance will have the same number of
neighbours and will receive the same probability vector which is equal to the prior
distribution over the exemplars. If k = n −1 we can reduce one of the class counts by 1,
which can be done in c ways: the same number of possibilities as with k = 1!
We conclude that the refinement of k-nearest neighbour – the number of different predictions
it can make – initially increases with increasing k, then decreases again. Furthermore, we can
say that the bias increases and the variance decreases with increasing k. There is no easy
recipe to decide what value of k is appropriate for a given data set. However, it is possible to
sidestep this question to some extent by applying distance weighting to the votes: that is, the
closer an exemplar is to the instance to be classified, the more its vote counts.
Figure 8.10 demonstrates this, using the reciprocal of the distance to an exemplar as the
weight of its vote. This blurs the decision boundaries, as the model now applies a
combination of grouping by means of the Voronoi boundaries, and grading by means of
distance weighting. Furthermore, since the weights decrease quickly for larger distances, the
effect of increasing k is much smaller than with unweighted voting. In fact, with distance
weighting we can simply put k = n and still obtain a model that makes different predictions in
different parts of the instance space. One could say that distance weighting makes k-nearest
neighbour more of a global model, while without it (and for small k) it is more like an
aggregation of local models.
If k-nearest neighbour is used for regression problems, the obvious way to aggregate the
predictions from the k neighbours is by taking the mean value, which can again be distance-
weighted. This would lend the model additional predictive power by predicting values that
aren’t observed among the stored exemplars. More generally, we can apply k-means to any
learning problem where we have an appropriate ‘aggregator’ for multiple target values.
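A sketch of distance-weighted k-nearest-neighbour regression under these conventions (the reciprocal-distance weights and the small epsilon are implementation choices):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=3, distance_weighted=True):
    """Predict a real-valued target as the (optionally distance-weighted)
    mean of the targets of the k nearest exemplars."""
    X_train = np.asarray(X_train, float)
    y_train = np.asarray(y_train, float)
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    if distance_weighted:
        # reciprocal-distance weights; the small constant guards against
        # division by zero when the query coincides with an exemplar
        w = 1.0 / (dists[nearest] + 1e-12)
        return float(np.average(y_train[nearest], weights=w))
    return float(y_train[nearest].mean())

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_regress(X, y, np.array([1.4]), k=3))   # about 1.88, a weighted mean of 1.0, 4.0 and 0.0
```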
Figure 8.10. (left) 3-nearest neighbour with distance weighting on the data from Figure 8.9.
(middle) 5-nearest neighbour. (right) 7-nearest neighbour.
Hierarchical clustering
In this section we take a look at methods that represent clusters using trees. Here we consider trees
called dendrograms, which are purely defined in terms of a distance measure. Because
dendrograms use features only indirectly, as the basis on which the distance measure is
calculated, they partition the given data rather than the entire instance space, and hence
represent a descriptive clustering rather than a predictive one.
A precise definition of a dendrogram is as follows.
Definition 8.4 (Dendrogram). Given a data set D, a dendrogram is a binary tree with the
elements of D at its leaves. An internal node of the tree represents the subset of elements in
the leaves of the subtree rooted at that node. The level of a node is the distance between the
two clusters represented by the children of the node. Leaves have level 0.
For this definition, we need a way to measure how close two clusters are. You might think
that this is straightforward: just calculate the distance between the two cluster means.
However, clusters are not always well represented by their means; furthermore, taking cluster
means as exemplars assumes Euclidean distance, and we may want to use one of the other
distance metrics. This has led to the introduction of the so-called
linkage function, which is a general way to turn pairwise point distances into pairwise cluster
distances.
Single and complete linkage both define the distance between clusters in terms of a particular
pair of points. Consequently, they cannot take the shape of the cluster into account, which is
why average and centroid linkage can offer an advantage. However, centroid linkage can lead
to non-intuitive dendrograms, as illustrated in Figure 8.17. The issue here is that we have
L({1}, {2}) < L({1}, {3}) and L({1}, {2}) < L({2}, {3}) but L({1}, {2}) > L({1,2}, {3}).
The first two inequalities mean that 1 and 2 are the first to be merged into a cluster; but the
second inequality means that the level of cluster {1,2,3} in the dendrogram drops below the
level of {1,2}. Centroid linkage violates the requirement of monotonicity, which stipulates
that L(A,B) < L(A,C) and L(A,B) < L(B,C) implies L(A,B) < L(A∪B,C) for any clusters A, B
and C. The other three linkage functions are monotonic (the example also serves as an
illustration why average linkage and centroid linkage are not the same).
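These linkage functions can be compared with SciPy, assuming it is installed; the three points below are a small constructed example of the centroid-linkage inversion just described:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Points 0 and 1 are closest (distance 1), but the centroid of {0, 1} is only
# 0.9 away from point 2, so centroid linkage merges the final cluster at a
# lower level than the first merge (the non-monotonicity discussed above).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)    # each row: the two merged clusters and the merge level
    print(method, Z[:, 2])           # merge levels; only 'centroid' decreases

dendrogram(linkage(X, method="centroid"))   # the second merge drops below the first
plt.show()
```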
Hierarchical clustering methods have the distinct advantage that the number of clusters does
not need to be fixed in advance. However, this advantage comes at considerable
computational cost. Furthermore, we now need to choose not just the distance measure used,
but also the linkage function.