
UNIT IV:

Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for
linear classifiers, Support vector machines, obtaining probabilities from linear classifiers,
Going beyond linearity with kernel methods. Distance Based Models: Introduction,
Neighbours and exemplars, Nearest Neighbours classification, Distance Based Clustering,
Hierarchical Clustering.
----------------------------------------------------------------------------------------------------------------
Linear Models:
Models that can be understood in terms of lines and planes are commonly called linear models. In machine learning, linear models are of particular interest because of their simplicity. The reasons for this simplicity are:
 Linear models are parametric, meaning that they have a fixed form with a small number of
numeric parameters that need to be learned from data. This is different from tree or rule
models, where the structure of the model is not fixed in advance.
 Linear models are stable, which is to say that small variations in the training data have
only limited impact on the learned model.
 Linear models are less likely to overfit the training data than some other models, largely
because they have relatively few parameters.
The last two points can be summarised by saying that linear models have low variance but high bias. Such models are often preferable when you have limited data and want to avoid overfitting. High-variance, low-bias models such as decision trees are preferable if data is abundant but underfitting is a concern.

4.1 THE LEAST-SQUARES METHOD:


The differences between the actual and estimated function values on the training examples are called residuals: εi = f(xi) − ˆf(xi). To learn linear models for classification and regression, the least-squares method, introduced by Carl Friedrich Gauss in the late eighteenth century, consists in finding the model parameters such that the sum of squared residuals Σi εi² is minimised. The following example illustrates the method in the simple case of a single feature, which is called univariate regression.
Figure 7.1. The red solid line indicates the result of applying linear regression to 10
measurements of body weight (on the y-axis, in kilograms) against body height (on the x-axis,
in centimetres). The orange dotted lines indicate the average height h = 181 and the average
weight w = 74.5; the regression coefficient ˆb = 0.78. The measurements were simulated by
adding normally distributed noise with mean 0 and variance 5 to the true model indicated by
the blue dashed line (b = 0.83).
It is worthwhile to note that the expression for the regression coefficient or slope ˆb derived in this example has n times the covariance between h and w in the numerator and n times the variance of h in the denominator. This is true in general: for a feature x and a target variable y, the regression coefficient is

ˆb = σxy / σxx     (7.2)

where σxy is the covariance between x and y and σxx is the variance of x.
This can be understood by noting that the covariance is measured in units of x times units of
y (e.g., metres times kilograms in Example 7.1) and the variance in units of x squared (e.g.,
metres squared), so their quotient is measured in units of y per unit of x (e.g., kilograms per
metre).
In other words, univariate linear regression can be understood as consisting of two steps:
1. normalisation of the feature by dividing its values by the feature's variance;
2. calculation of the covariance between the target variable and the normalised feature.
The sum of the residuals of the least-squares solution is zero:

Σi (yi − (ˆa + ˆb xi)) = 0

This follows because ˆa = ȳ − ˆb x̄ (with ȳ and x̄ the means), as derived in Example 7.1. While this property is
intuitively appealing, it is worth keeping in mind that it also makes linear regression
susceptible to outliers: points that are far removed from the regression line, often because of
measurement errors.
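For concreteness, the following is a minimal NumPy sketch of univariate least-squares regression along these lines. The height and weight values are made up for illustration and only loosely echo Example 7.1.

import numpy as np

# Hypothetical body heights (cm) and weights (kg); illustrative numbers only.
h = np.array([165., 170., 175., 178., 180., 182., 185., 188., 190., 195.])
w = np.array([ 62.,  66.,  70.,  72.,  74.,  76.,  79.,  82.,  84.,  90.])

# Slope = covariance(h, w) / variance(h); intercept from the means.
b_hat = np.cov(h, w, bias=True)[0, 1] / np.var(h)
a_hat = w.mean() - b_hat * h.mean()

residuals = w - (a_hat + b_hat * h)
print(b_hat, a_hat, residuals.sum())   # the residuals sum to (numerically) zero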
Example 7.2 (The effect of outliers). Suppose that, as the result of a transcription error, one
of the weight values in Figure 7.1 is increased by 10 kg. Figure 7.2 shows that this has a
considerable effect on the least-squares regression line.
Figure 7.2. The effect of an outlier in univariate regression. One of the blue points was moved up 10 units to the green point, changing the red regression line to the green line.

Multivariate linear regression:


We can write univariate linear regression in matrix form as

yi = a + b·xi + εi,   or, in vector form,   y = a1 + bx + ε

In the second form of this equation, y, 1, x and ε are n-vectors (1 is the all-ones vector and ε the vector of residuals), and a and b are scalars. In the case of d features, all that changes is that x becomes an n-by-d matrix X, and b becomes a d-vector of regression coefficients. We can apply the by now familiar trick of using homogeneous coordinates to simplify these equations as follows:

y = X0w + ε
with X0 an n-by-(d +1) matrix whose first column is all 1’s and the remaining columns are the
columns of X, and w has the intercept as its first entry and the regression coefficients as the
remaining d entries. For convenience we will often blur the distinction between these two formulations and state the regression equation as y = Xw + ε, with X having d columns and w having d rows; from the context it will be clear whether we are representing the intercept by means of homogeneous coordinates, or have instead zero-centred the target and features to achieve a zero intercept. In the univariate case we were able to obtain a closed-form solution for w; can we do the same in the multivariate case? First, we are likely to need the covariances between every feature and the target variable.
Consider the expression XTy, which is a d-vector; its j-th entry is the dot product of the j-th row of XT – i.e., the j-th column of X, which is (x1j, . . . , xnj) – with (y1, . . . , yn):

Σi xij yi = n(σjy + μj ȳ)

where σjy is the covariance between the j-th feature and the target, and μj is the j-th feature's mean. Assuming for the moment that every feature is zero-centred, we have μj = 0 and thus XTy is a d-vector holding all the required covariances (times n).
In the multivariate case we also want to normalise the features to have unit variance, which we can achieve by means of a d-by-d scaling matrix: a diagonal matrix with diagonal entries 1/(nσjj). If S is the diagonal matrix with diagonal entries nσjj, we can get the required scaling matrix by simply inverting S. So our first stab at a solution for the multivariate regression problem is

ˆw = S−1XTy

As it turns out, the general case requires a more elaborate matrix in place of S, namely XTX, giving

ˆw = (XTX)−1XTy

Notice that if we do assume the features to be uncorrelated (e.g., σ12 = 0 in the two-feature case), then the components of ˆw reduce to σjy/σjj, which brings us back to Equation 7.2. Assuming uncorrelated features effectively decomposes a multivariate regression problem into d univariate problems.
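A minimal NumPy sketch of this closed-form solution, using homogeneous coordinates for the intercept; the data below are randomly generated purely for illustration.

import numpy as np

def least_squares(X, y):
    # Closed-form estimate w = (X'X)^-1 X'y, with an all-ones column
    # prepended so that w[0] plays the role of the intercept.
    X0 = np.column_stack([np.ones(len(X)), X])
    # Solving the normal equations is numerically preferable to an explicit inverse.
    return np.linalg.solve(X0.T @ X0, X0.T @ y)

# Two made-up features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
print(least_squares(X, y))   # approximately [1.0, 2.0, -3.0]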

Using least-squares regression for classification


We can also use linear regression to learn a binary classifier by encoding the two classes as real numbers. For instance, we can label the Pos positive examples with y⊕ = +1 and the Neg negative examples with y⊖ = −1. It then follows that XTy = Pos·μ⊕ − Neg·μ⊖, where μ⊕ and μ⊖ are d-vectors containing each feature's mean value for the positive and negative examples, respectively.
In the general case, the least-squares classifier learns the decision boundary w·x = t with

w = (XTX)−1(Pos·μ⊕ − Neg·μ⊖)     (7.7)

We would hence assign class ˆy = sign(w·x − t) to instance x, where sign(s) = +1 if s > 0, 0 if s = 0, and −1 if s < 0.
Various simplifying assumptions can be made, including zero-centred features, equal
variance features, uncorrelated features and equal class prevalences. In the simplest case,
when all these assumptions are made, Equation 7.7 reduces to w = c(μ⊕ − μ⊖), where c is some scalar that can be incorporated in the decision threshold t. We recognise this as the basic linear classifier.
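A sketch of the least-squares classifier of Equation 7.7 in NumPy; labels are assumed to be ±1, and the threshold t is placed halfway between the mean scores of the two classes, which is an illustrative choice rather than something prescribed by the text.

import numpy as np

def least_squares_classifier(X, y):
    # w = (X'X)^-1 (Pos*mu_pos - Neg*mu_neg), i.e., Equation 7.7;
    # note that Pos*mu_pos - Neg*mu_neg equals X'y for +/-1 labels.
    pos, neg = X[y == +1], X[y == -1]
    rhs = len(pos) * pos.mean(axis=0) - len(neg) * neg.mean(axis=0)
    w = np.linalg.solve(X.T @ X, rhs)
    t = ((pos @ w).mean() + (neg @ w).mean()) / 2   # illustrative threshold
    return w, t

def predict(X, w, t):
    return np.sign(X @ w - t)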

THE PERCEPTRON
A linear classifier that will achieve perfect separation on linearly separable data is the
perceptron. The perceptron iterates over the training set, updating the weight vector every
time it encounters an incorrectly classified example.
For example, let xi be a misclassified positive example; then we have yi = +1 and w·xi < t. We therefore want to find w′ such that w′·xi > w·xi, which moves the decision boundary towards and hopefully past xi. This can be achieved by calculating the new weight vector as w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate.
We then have w′·xi = w·xi + ηxi·xi > w·xi as required. Similarly, if xj is a misclassified negative example, then we have yj = −1 and w·xj > t. In this case we calculate the new weight vector as w′ = w − ηxj, and thus w′·xj = w·xj − ηxj·xj < w·xj. The two cases can be combined in a single update rule:

w′ = w + ηyi xi     (7.8)
The perceptron training algorithm is given in Algorithm 7.1. It iterates through the training
examples until all examples are correctly classified. The algorithm can easily be turned into
an online algorithm that processes a stream of examples, updating the weight vector only if
the last received example is misclassified. The perceptron is guaranteed to converge to a
solution if the training data is linearly separable, but it won’t converge otherwise.
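A minimal sketch of the (primal) perceptron training algorithm just described, assuming ±1 labels and using homogeneous coordinates so that the threshold is learned as part of the weight vector.

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    # Add eta*y_i*x_i to the weight vector whenever example i is misclassified
    # (update rule 7.8); converges only if the data are linearly separable.
    X0 = np.column_stack([np.ones(len(X)), X])   # homogeneous coordinates
    w = np.zeros(X0.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X0, y):
            if yi * (xi @ w) <= 0:               # misclassified (or on the boundary)
                w += eta * yi * xi
                converged = False
        if converged:
            break
    return w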

Figure 7.5. (left) A perceptron trained with a small learning rate (η = 0.2). The circled
examples are the ones that trigger the weight update. (middle) Increasing the learning rate to
η = 0.5 leads in this case to a rapid convergence. (right) Increasing the learning rate further
to η = 1 may lead to too aggressive weight updating, which harms convergence. The starting
point in all three cases was the basic linear classifier.
Figure 7.5 gives a graphical illustration of the perceptron training algorithm. In this particular
example the weight vector is initialized to the basic linear classifier, which means the
learning rate does have an effect on how quickly we move away from the initial decision
boundary.
However, if the weight vector is initialised to the zero vector, it is easy to see that the learning rate is just a constant factor that does not affect convergence; we will therefore set it to 1 in what follows.
The key point of the perceptron algorithm is that, every time an example xi is misclassified, we add yixi to the weight vector. After training has completed, each example has been misclassified zero or more times; denote this number αi for example xi. Using this notation the weight vector can be expressed as

w = Σi αi yi xi

In other words, the weight vector is a linear combination of the training instances. The perceptron shares this property with, e.g., the basic linear classifier:

w = μ⊕ − μ⊖ = Σx α(x) c(x) x

where c(x) is the true class of example x (i.e., +1 or −1), and α(x) = α⊕ = 1/Pos for positive examples and α(x) = α⊖ = 1/Neg for negative examples. In the dual, instance-based view of linear classification we are learning instance weights αi rather than feature weights wj. In this dual perspective, an instance x is classified as

ˆy = sign(Σi αi yi xi·x − t)

This means that, during training, the only information needed about the training data is all pairwise dot products: the n-by-n matrix G = XXT containing these dot products is called the Gram matrix. Algorithm 7.2 gives the dual form of the perceptron training algorithm.
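A sketch in the spirit of the dual perceptron (Algorithm 7.2), learning instance weights αi and using only the Gram matrix of pairwise dot products; as in the primal sketch above, homogeneous coordinates are assumed so that no separate threshold is needed.

import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    # Learn instance weights alpha_i; only the Gram matrix G = X X' is used.
    X0 = np.column_stack([np.ones(len(X)), X])   # homogeneous coordinates
    G = X0 @ X0.T                                # Gram matrix of dot products
    alpha = np.zeros(len(X0))
    for _ in range(max_epochs):
        converged = True
        for i in range(len(X0)):
            # example i is misclassified if y_i * sum_j alpha_j y_j (x_j . x_i) <= 0
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1
                converged = False
        if converged:
            break
    return alpha   # the weight vector is w = sum_i alpha_i y_i x_i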
SUPPORT VECTOR MACHINES
For a given training set and decision boundary, let m+ be the smallest margin of any positive,
and m- the smallest margin of any negative, then we want the sum of these to be as large as
possible.
This sum is independent of the decision threshold t, as long as we keep the nearest positives and negatives on the correct side of the decision boundary, and so we can re-adjust t such that m+ and m− become equal. Figure 7.7 depicts this graphically in a two-dimensional instance space.
The training examples nearest to the decision boundary are called support vectors: as we shall
see, the decision boundary of a support vector machine (SVM) is defined as a linear
combination of the support vectors.

Figure 7.7. The geometry of a support vector classifier. The circled data points are the
support vectors, which are the training examples nearest to the decision boundary. The
support vector machine finds the decision boundary that maximises the margin m/||w||.
The margin is thus defined as m/||w||, where m is the distance between the decision boundary and the nearest training instances (at least one of each class) as measured along w. Since we are free to rescale t, ||w|| and m, it is customary to choose m = 1. Maximising the margin then corresponds to minimising ||w|| or, more conveniently, ½||w||², provided of course that none of the training points fall inside the margin. This leads to a quadratic, constrained optimisation problem:

minimise ½||w||²   subject to   yi(w·xi − t) ≥ 1,  1 ≤ i ≤ n

We will approach this using the method of Lagrange multipliers. Adding the constraints with multipliers αi for each training example gives the Lagrange function

Λ(w, t, α1, . . . , αn) = ½||w||² − Σi αi (yi(w·xi − t) − 1)
While this looks like a formidable formula, some further analysis will allow us to derive
the simpler dual form of the Lagrange function.
By taking the partial derivative of the Lagrange function with respect to t and setting it to 0, we find that for the optimal threshold t we have

Σi αi yi = 0

Similarly, by taking the partial derivative of the Lagrange function with respect to w we see that the Lagrange multipliers define the weight vector as a linear combination of the training examples:

∂Λ/∂w = w − Σi αi yi xi

Since this partial derivative is 0 for an optimal weight vector, we conclude that

w = Σi αi yi xi

the same expression as we derived for the perceptron.


There is a difference with the perceptron, though: there, the αi are non-negative integers counting how often an example has been misclassified, whereas for a support vector machine the αi are non-negative reals. What they have in common is that, if αi = 0 for a particular example xi, that example could be removed from the training set without affecting the learned decision boundary. In the case of support vector machines this means that αi > 0 only for the support vectors: the training examples nearest to the decision boundary.

Now, by plugging the expressions w = Σi αi yi xi and Σi αi yi = 0 back into the Lagrangian we are able to eliminate w and t, and hence obtain the dual optimisation problem, which is entirely formulated in terms of the Lagrange multipliers:

Λ(α1, . . . , αn) = −½ Σi Σj αi αj yi yj xi·xj + Σi αi

The dual problem is to maximise this function under positivity constraints and one equality constraint:

αi ≥ 0 for 1 ≤ i ≤ n,   and   Σi αi yi = 0
The dual form of the optimisation problem for support vector machines illustrates two important points.
First, it shows that searching for the maximum-margin decision boundary is equivalent to searching for the support vectors: they are the training examples with non-zero Lagrange multipliers, and through w = Σi αi yi xi they completely determine the decision boundary.
Secondly, it shows that the optimisation problem is entirely defined by pairwise dot products between training instances: the entries of the Gram matrix.
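This dual view can be made concrete with scikit-learn's linear SVM, assuming scikit-learn is available; support_vectors_ holds the support vectors and dual_coef_ the products αi yi, so the weight vector can be recovered as their linear combination, matching w = Σi αi yi xi. The toy data are made up for illustration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[ 2,  2], size=(20, 2)),
               rng.normal(loc=[-2, -2], size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel='linear', C=1e3).fit(X, y)   # large C approximates a hard margin
# dual_coef_ contains alpha_i * y_i for the support vectors only.
w = clf.dual_coef_ @ clf.support_vectors_
print(clf.support_vectors_)                   # training examples with alpha_i > 0
print(w, clf.coef_)                           # the two weight vectors coincide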
Figure 7.8. (left) A maximum-margin classifier built from three examples, with w = (0,−1/2)
and margin 2. The circled examples are the support vectors: they receive non-zero Lagrange
multipliers and define the decision boundary. (right) By adding a second positive example, the decision boundary is rotated to w = (3/5, −4/5) and the margin decreases to 1.

OBTAINING PROBABILITIES FROM LINEAR CLASSIFIERS


A linear classifier produces scores ˆs(xi ) =w·xi−t that are thresholded at 0 in order to classify
examples. Due to the geometric nature of linear classifiers, such scores can be used to obtain
the (signed) distance of xi from the decision boundary.
To see this, notice that the length of the projection of xi onto w is ||xi||cosθ, where θ is the angle between xi and w. Since w·xi = ||w|| ||xi|| cosθ, we can write this length as (w·xi)/||w||. This gives the following signed distance:

d(xi) = (w·xi − t)/||w||

that is, the score ˆs(xi) computed with the weight vector rescaled to unit length (w/||w||) and with the correspondingly rescaled intercept t/||w||. The sign of this quantity tells us which side of the decision boundary we are on: positive distances for points on the 'positive' side of the decision boundary (the direction in which w points) and negative distances on the other side.
This geometric interpretation of the scores produced by linear classifiers offers an interesting possibility for turning them into probabilities, a process called calibration. Let d⊕ denote the mean distance of the positive examples to the decision boundary: i.e., d⊕ = w·μ⊕ − t, where μ⊕ is the mean of the positive examples and w is unit length (although the latter assumption is not strictly necessary, as it will turn out that the weight vector will be rescaled anyway). It would not be unreasonable to expect that the distance of positive examples to the decision boundary is normally distributed around this mean: that is, when plotting a histogram of these distances, we would expect the familiar bell curve to appear. Under this assumption, the probability density function of d is

P(d|⊕) = (1/√(2πσ²)) exp(−(d − d⊕)²/(2σ²))

Similarly, the distances of negative examples to the decision boundary can be expected to be normally distributed around a mean d⊖. We will assume that both normal distributions have the same variance σ².
Suppose we now observe a point x with distance d(x). We classify this point as positive if d(x) > 0 and as negative if d(x) < 0, but we want to attach a probability ˆp(x) = P(⊕|d(x)) to these predictions. Using Bayes' rule we obtain

ˆp(x) = P(⊕|d(x)) = P(d(x)|⊕)P(⊕) / (P(d(x)|⊕)P(⊕) + P(d(x)|⊖)P(⊖)) = LR·clr / (LR·clr + 1)

where LR = P(d(x)|⊕)/P(d(x)|⊖) is the likelihood ratio obtained from the normal score distributions, and clr = P(⊕)/P(⊖) is the class ratio. We will assume for simplicity that clr = 1 in the derivation below. Furthermore, assume for now that σ² = 1 and d⊕ = −d⊖ = 1/2, so that the class means are one unit of variance apart and equidistant from the decision boundary. We then have

LR = exp(−(d(x) − 1/2)²/2) / exp(−(d(x) + 1/2)²/2) = exp(d(x))

and hence ˆp(x) = exp(d(x))/(exp(d(x)) + 1) = 1/(1 + exp(−d(x))): the logistic function.
Figure 7.11. The logistic function, a useful function for mapping distances from a linear
decision boundary into an estimate of the positive posterior probability. The fat red line
indicates the standard logistic function ˆp(d) = 1/(1 + exp(−d)); this function can be used to
obtain probability estimates if the two classes are equally prevalent and the class means are
equidistant from the decision boundary and one unit of variance apart. The steeper and
flatter red lines show how the function changes if the class means are 2 and 1/2 units of
variance apart, respectively. The three blue lines show how these curves change if d0 = 1,
which means that the positives are on average further away from the decision boundary.
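A small sketch of this calibration recipe, assuming a weight vector w and threshold t obtained from any linear classifier. Here the class means and the shared variance of the signed distances are estimated from the training data, which is one straightforward way to instantiate the Gaussian assumption above.

import numpy as np

def calibrate(X, y, w, t):
    # Turn linear-classifier scores into probabilities via the logistic of the
    # signed distance, with location and scale estimated from the data.
    d = (X @ w - t) / np.linalg.norm(w)          # signed distances
    d_pos, d_neg = d[y == +1].mean(), d[y == -1].mean()
    sigma2 = np.concatenate([d[y == +1] - d_pos, d[y == -1] - d_neg]).var()
    gamma = (d_pos - d_neg) / sigma2             # slope of the logistic
    d0 = (d_pos + d_neg) / 2                     # midpoint between the class means
    return lambda Xnew: 1 / (1 + np.exp(-gamma * ((Xnew @ w - t) / np.linalg.norm(w) - d0)))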

Going beyond linearity with kernel methods


The techniques discussed so far can be adapted to learn non-linear decision boundaries. The main idea is simple: transform the data non-linearly to a feature space in which linear classification can be applied. It is customary to call the transformed space the feature space and the original space the input space. The approach thus appears to be to transform the
training data to feature space and learn a model there. In order to classify new data we
transform that to feature space as well and apply the model. However, the remarkable thing is
that in many cases the feature space does not have to be explicitly constructed, as we can
perform all necessary operations in input space.
Example 7.8 (Learning a quadratic decision boundary). The data in Figure 7.14 (left) is not linearly separable, but both classes have a clear circular shape. Figure 7.14 (right) shows
the same data with the feature values squared. In this transformed feature space the data has
become linearly separable, and the perceptron is able to separate the classes. The resulting
decision boundary in the original space is a near-circle. Also shown is the decision boundary
learned by the basic linear classifier in the quadratic feature space, corresponding to an
ellipse in the original space.
In general, mapping points back from the feature space to the instance space is non-trivial.
E.g., in this example each class mean in feature space maps back to four points in the original
space, owing to the quadratic mapping.
Figure 7.14. (left) Decision boundaries learned by the basic linear classifier and the
perceptron using the square of the features. (right) Data and decision boundaries in the
transformed feature space.

Take the perceptron algorithm in dual form. The algorithm is a simple counting algorithm – the only operation that is somewhat involved is testing whether example xi is correctly classified, by evaluating the sign of yi Σj αj yj xi·xj. The key component of this calculation is the dot product xi·xj. Assuming bivariate examples xi = (xi, yi) and xj = (xj, yj) for notational simplicity (here yi and yj denote second coordinates, not class labels), the dot product can be written as xi·xj = xi xj + yi yj. The corresponding instances in the quadratic feature space are (xi², yi²) and (xj², yj²), and their dot product is xi²xj² + yi²yj². This is almost equal to (xi·xj)² = xi²xj² + yi²yj² + 2xi xj yi yj, but not quite, because of the third term of cross-products. We can capture this term by extending the feature vector with a third feature, √2·xy. This gives the following feature space:

(x², y², √2·xy)

in which the dot product of two transformed instances is exactly the square of their dot product in the original space.
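A sketch of the dual perceptron with the corresponding quadratic kernel κ(xi, xj) = (xi·xj)², which performs this computation without ever constructing the feature space explicitly; it mirrors the dual perceptron sketch given earlier, with the threshold omitted for simplicity.

import numpy as np

def kernel_perceptron(X, y, kernel=lambda A, B: (A @ B.T) ** 2, max_epochs=100):
    # Dual perceptron where the kernel matrix replaces the Gram matrix; the default
    # quadratic kernel corresponds to the feature space (x^2, y^2, sqrt(2)xy).
    K = kernel(X, X)
    alpha = np.zeros(len(X))
    for _ in range(max_epochs):
        converged = True
        for i in range(len(X)):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1
                converged = False
        if converged:
            break
    # A new point x would be classified as sign(sum_i alpha_i y_i kernel(x_i, x)).
    return alpha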
Distance-based models
What’s the relevance when trying to understand distance-based machine learning. Well, the
rank (row) and file (column) on a chessboard is not unlike a discrete or categorical feature in
machine learning.
We can switch to real-valued features by imagining a ‘continuous’ chessboard with infinitely
many, infinitesimally narrow ranks and files. Squares now become points, and distances are
not expressed as the number of squares travelled, but simply as a real number on some scale.
If we now look at the shapes obtained by connecting equidistant points, we see that many of
these carry over from the discrete to the continuous case.
For a King, for example, all points a given fixed distance away still form a square around the
current position; and for a KRook they still form a square rotated 45 degrees.
As it happens, these are special cases of the following generic concept: the Minkowski distance of order p,

Disp(x, y) = (Σj |xj − yj|^p)^(1/p)

with p = 2 giving Euclidean distance and p = 1 Manhattan distance. Letting p grow to infinity gives Dis∞(x, y) = maxj |xj − yj|: the distance experienced by the King on a chessboard, who can move diagonally as well as horizontally and vertically but only one step at a time; it is also called Chebyshev distance. Going in the other direction, we could take p = 0 and define Dis0(x, y) = Σj (xj − yj)^0, under the understanding that x^0 = 0 for x = 0 and 1 otherwise. This is not strictly a Minkowski distance; it is actually the distance experienced by a Rook on the chessboard: if both rank and file are different the square is two moves away, and if only one of them is different the square is one move away. If x and y are binary strings, this is also called the Hamming distance.
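A small helper implementing Minkowski distances, including the Chebyshev (p = ∞) and Hamming-style (p = 0) limits discussed above.

import numpy as np

def minkowski(x, y, p=2):
    # Minkowski distance of order p between two vectors.
    # p = 2: Euclidean, p = 1: Manhattan, p = inf: Chebyshev,
    # p = 0: number of differing coordinates (Hamming for binary strings).
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if p == 0:
        return float(np.count_nonzero(diff))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1 / p))

print(minkowski([0, 0], [3, 4], p=2))        # 5.0
print(minkowski([0, 0], [3, 4], p=1))        # 7.0
print(minkowski([0, 0], [3, 4], p=np.inf))   # 4.0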
Figure 8.4 investigates whether Minkowski distances of various orders satisfy the triangle inequality. The triangle inequality dictates that the distance from the origin to C is no more than the sum of the distances from
the origin to A (Dis(O,A)) and from A to C (Dis(A, C)). B is at the same distance from A as
C, regardless of the distance measure used; so Dis(O,A)+Dis(A,C) is equal to the distance
from the origin to B. So, if we draw a circle around the origin through B, the triangle
inequality dictates that C not be outside that circle. As we see in the left figure for Euclidean
distance, B is the only point where the circles around the origin and around A intersect, so
everywhere else the triangle inequality is a strict inequality.
The middle figure shows the same situation for Manhattan distance (p = 1). Now, B and C
are in fact equidistant from the origin, and so travelling via A to C is no longer a detour, but
just one of the many shortest routes. However, if we now decrease p further, we see that C
ends up outside the red shape, and is thus further away than B when seen from the origin,
whereas of course the sum of the distances from the origin to A and from A to C is still equal
to the distance from the origin to B. At this point, our intuition breaks down: Minkowski
distances with p < 1 are simply not very useful as distances since they all violate the triangle
inequality.

Neighbours and exemplars


Now that we understand the basics of measuring distance in instance space, we can proceed to consider the key ideas underlying distance-based models. The two most important of these are:
formulating the model in terms of a number of prototypical instances or exemplars, and
defining the decision rule in terms of the nearest exemplars or neighbours.
The most obvious choice of exemplar is the arithmetic mean or centroid, which is the point minimising the sum of squared Euclidean distances to the given points. Notice that minimising the sum of squared Euclidean distances of a given set of points is the same as minimising the average squared Euclidean distance. You may wonder what happens if we drop the square here: wouldn't it be more natural to take the point that minimises total Euclidean distance as exemplar? This point is known as the geometric median, as for univariate data it corresponds to the median or 'middle value' of a set of numbers.
In certain situations it makes sense to restrict an exemplar to be one of the given data points.
In that case, we speak of a medoid, to distinguish it from a centroid which is an exemplar that
doesn’t have to occur in the data. Finding a medoid requires us to calculate, for each data
point, the total distance to all other data points, in order to choose the point that minimises it.
Figure 8.5 shows a set of 10 data points where the different ways of determining exemplars all give different results. In particular, the mean and the squared 2-norm medoid can be overly sensitive to outliers.
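A sketch contrasting the centroid with a medoid, the latter computed by brute force as described (for each data point, the total distance to all other points); Euclidean distance is assumed as the default, and the data are made up to show the effect of an outlier.

import numpy as np

def centroid(X):
    # Arithmetic mean: minimises the total squared Euclidean distance.
    return X.mean(axis=0)

def medoid(X, dist=lambda a, b: np.linalg.norm(a - b)):
    # The data point with the smallest total distance to all other points.
    totals = [sum(dist(x, z) for z in X) for x in X]
    return X[int(np.argmin(totals))]

X = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.]])   # one outlier
print(centroid(X))   # pulled towards the outlier
print(medoid(X))     # one of the given points, much less affected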
Once we have determined the exemplars, the basic linear classifier constructs the decision
boundary as the perpendicular bisector of the line segment connecting the two exemplars. An
alternative, distance-based way to classify instances without direct reference to a decision
boundary is by the following decision rule: if x is nearest to μ⊕ then classify it as positive,
otherwise as negative; or equivalently, classify an instance to the class of the nearest
exemplar. If we use Euclidean distance as our closeness measure, simple geometry tells us we
get exactly the same decision boundary (Figure 8.6 (left)).
So the basic linear classifier can be interpreted from a distance-based perspective as
constructing exemplars that minimise squared Euclidean distance within each class, and then
applying a nearest-exemplar decision rule.
Another useful consequence of switching to the distance-based perspective is that the nearest-exemplar decision rule works equally well for more than two exemplars, which gives us a multi-class version of the basic linear classifier. Figure 8.7 (left) illustrates this for three exemplars. Each decision region is now bounded by two line segments.
As you would expect, the 2-norm decision boundaries are more regular than the 1-norm ones:
mathematicians say that the 2-norm decision regions are convex, which means that linear
interpolation between any two points in the region can never go outside it. Clearly, this doesn't hold for 1-norm decision regions (Figure 8.7 (right)).
To summarise, the main ingredients of distance-based models are
 distance metrics, which can be Euclidean, Manhattan, Minkowski or Mahalanobis, among
many others;
 exemplars: centroids that find a centre of mass according to a chosen distance metric, or
medoids that find the most centrally located data point; and
 distance-based decision rules, which take a vote among the k nearest exemplars.
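A sketch pulling these ingredients together: a multi-class nearest-exemplar classifier that uses class centroids as exemplars and Euclidean distance, which for two classes reproduces the distance-based reading of the basic linear classifier.

import numpy as np

def fit_exemplars(X, y):
    # One centroid exemplar per class.
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def nearest_exemplar_predict(Xnew, classes, exemplars):
    # Assign each instance to the class of its nearest exemplar.
    d = np.linalg.norm(Xnew[:, None, :] - exemplars[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]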

Nearest-neighbour classification
In the previous section we saw how to generalise the basic linear classifier to more than two
classes, by learning an exemplar for each class and using the nearest-exemplar decision rule
to classify new data. In fact, the most commonly used distance-based classifier is even more
straightforward than that: it simply uses each training instance as an exemplar. Consequently,
‘training’ this classifier requires nothing more than memorising the training data. This
extremely simple classifier is known as the nearest-neighbour classifier. Its decision regions
are made up of the cells of a Voronoi tesselation, with piecewise linear decision boundaries
selected from the Voronoi boundaries (since adjacent cells may be labelled with the same
class).
What are the properties of the nearest-neighbour classifier? First, notice that, unless the
training set contains identical instances from different classes, we will be able to separate the
classes perfectly on the training set – not really a surprise, as we memorized all training
examples! Furthermore, by choosing the right exemplars we can more or less represent any
decision boundary, or at least an arbitrarily close piecewise linear approximation. It follows
that the nearest-neighbour classifier has low bias, but also high variance: move any of the
exemplars spanning part of the decision boundary, and you will also change the boundary.
This suggests a risk of overfitting if the training data is limited, noisy or unrepresentative.
From an algorithmic point of view, training the nearest-neighbour classifier is very fast,
taking only O(n) time for storing n exemplars. The downside is that classifying a single
instance also takes O(n) time, as the instance will need to be compared with every exemplar
to determine which one is the nearest. It is possible to reduce classification time at the
expense of increased training time by storing the exemplars in a more elaborate data
structure, but this tends not to scale well to large numbers of features.
In fact, high-dimensional instance spaces can be problematic for another reason: the infamous
curse of dimensionality. High-dimensional spaces tend to be extremely sparse, which means
that every point is far away from virtually every other point, and hence pairwise distances
tend to be uninformative. However, whether or not you are hit by the curse of dimensionality
is not simply a matter of counting the number of features, as there are several reasons why the
effective dimensionality of the instance space may be much smaller than the number of
features. For example, some of the features may be irrelevant and drown out the relevant
features' signal in the distance calculations. In such a case it would be a good idea, before building a distance-based model, to reduce dimensionality by performing feature selection, as will be discussed in Chapter 10. Alternatively, the data may live on a manifold of lower dimension than the instance space (e.g., the surface of a sphere is a two-dimensional manifold wrapped around a three-dimensional object), which allows other dimensionality-reduction techniques such as principal component analysis, which will be explained in the same
chapter. In any case, before applying nearest-neighbour classification it is a good idea to plot
a histogram of pairwise distances of a sample to see if they are sufficiently varied.
Notice that the nearest-neighbour method can easily be applied to regression problems with a
real-valued target variable. In fact, the method is completely oblivious to the type of target
variable and can be used to output text documents, images and videos. It is also possible to
output the exemplar itself instead of a separate target, in which case we usually speak of
nearest-neighbour retrieval. Of course we can only output targets (or exemplars) stored in the
exemplar database, but if we have a way of aggregating these we can go beyond this
restriction by applying the k-nearest neighbour method. In its simplest form, the k-nearest neighbour classifier takes a vote between the k ≥ 1 nearest exemplars of the instance to be
classified, and predicts the majority class. We can easily turn this into a probability estimator
by returning the normalized class counts as a probability distribution over classes.
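A straightforward k-nearest-neighbour sketch with the probability estimator described above (normalised class counts among the k nearest exemplars); Euclidean distance is assumed.

import numpy as np

def knn_proba(X_train, y_train, x, k=3):
    # Return (classes, probability vector) for a single query point x.
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    classes = np.unique(y_train)
    counts = np.array([(nearest == c).sum() for c in classes])
    return classes, counts / k

def knn_predict(X_train, y_train, x, k=3):
    classes, p = knn_proba(X_train, y_train, x, k)
    return classes[p.argmax()]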

Figure 8.9. (left) Decision regions of a 3-nearest neighbour classifier; the shading
represents the predicted probability distribution over the five classes. (middle) 5-nearest neighbour. (right) 7-nearest neighbour.
Figure 8.9 illustrates this on a small data set of 20 exemplars from five different classes, for k
= 3, 5, 7. The class distribution is visualised by assigning each test point the class of a
uniformly sampled neighbour: so, in a region where two of k = 3 neighbours are red and one
is orange, the shading is a mix of two-thirds red and one-third orange. While for k = 3 the
decision regions are still mostly discernible, this is much less so for k = 5 and k = 7. This
may seem at odds with our earlier demonstration of the increase in the number of decision
regions with increasing k in Example 8.2. However, this increase is countered by the fact that
the probability vectors become more similar to each other. To take an extreme example: if k
is equal to the number of exemplars n, every test instance will have the same number of
neighbours and will receive the same probability vector which is equal to the prior
distribution over the exemplars. If k = n −1 we can reduce one of the class counts by 1,
which can be done in c ways: the same number of possibilities as with k = 1!
We conclude that the refinement of k-nearest neighbour – the number of different predictions
it can make – initially increases with increasing k, then decreases again. Furthermore, we can
say that the bias increases and the variance decreases with increasing k. There is no easy
recipe to decide what value of k is appropriate for a given data set. However, it is possible to
sidestep this question to some extent by applying distance weighting to the votes: that is, the
closer an exemplar is to the instance to be classified, the more its vote counts.
Figure 8.10 demonstrates this, using the reciprocal of the distance to an exemplar as the weight of its vote. This blurs the decision boundaries, as the model now applies a combination of grouping by means of the Voronoi boundaries and grading by means of distance weighting. Furthermore, since the weights decrease quickly for larger distances, the effect of increasing k is much smaller than with unweighted voting. In fact, with distance weighting we can simply put k = n and still obtain a model that makes different predictions in different parts of the instance space. One could say that distance weighting makes k-nearest neighbour more of a global model, while without it (and for small k) it is more like an aggregation of local models.
If k-nearest neighbour is used for regression problems, the obvious way to aggregate the predictions from the k neighbours is by taking the mean value, which can again be distance-weighted. This lends the model additional predictive power by allowing it to predict values that aren't observed among the stored exemplars. More generally, we can apply k-nearest neighbour to any learning problem where we have an appropriate 'aggregator' for multiple target values.

Figure 8.10. (left) 3-nearest neighbour with distance weighting on the data from Figure 8.9.
(middle) 5-nearest neighbour. (right) 7-nearest neighbour.
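A sketch of distance-weighted k-nearest-neighbour regression, using reciprocal distances as weights as in Figure 8.10; the small epsilon guarding against division by zero is an implementation detail, not something prescribed by the text.

import numpy as np

def knn_regress(X_train, y_train, x, k=3, eps=1e-12):
    # Distance-weighted mean of the targets of the k nearest neighbours.
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)          # reciprocal-distance weights
    return np.sum(w * y_train[idx]) / np.sum(w)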
Hierarchical clustering
In this section we take a look at methods that represent clusters using trees. Here we consider trees
called dendrograms, which are purely defined in terms of a distance measure. Because
dendrograms use features only indirectly, as the basis on which the distance measure is
calculated, they partition the given data rather than the entire instance space, and hence
represent a descriptive clustering rather than a predictive one.
A precise definition of a dendrogram is as follows.
Definition 8.4 (Dendrogram). Given a data set D, a dendrogram is a binary tree with the
elements of D at its leaves. An internal node of the tree represents the subset of elements in
the leaves of the subtree rooted at that node. The level of a node is the distance between the
two clusters represented by the children of the node. Leaves have level 0.
For this definition, we need a way to measure how close two clusters are. You might think that this is straightforward: just calculate the distance between the two cluster means. However, this ignores the spread and shape of the clusters; furthermore, taking cluster means as exemplars assumes Euclidean distance, and we may want to use one of the other distance metrics. This has led to the introduction of the so-called linkage function, which is a general way to turn pairwise point distances into pairwise cluster distances.
Single and complete linkage both define the distance between clusters in terms of a particular pair of points: the closest pair for single linkage and the furthest pair for complete linkage (average linkage averages over all pairs of points, and centroid linkage takes the distance between the cluster means). Consequently, single and complete linkage cannot take the shape of the cluster into account, which is why average and centroid linkage can offer an advantage. However, centroid linkage can lead to non-intuitive dendrograms, as illustrated in Figure 8.17. The issue here is that we have
L({1}, {2}) < L({1}, {3}) and L({1}, {2}) < L({2}, {3}) but L({1}, {2}) > L({1,2}, {3}).
The first two inequalities mean that 1 and 2 are the first to be merged into a cluster; but the
second inequality means that the level of cluster {1,2,3} in the dendrogram drops below the
level of {1,2}. Centroid linkage violates the requirement of monotonicity, which stipulates
that L(A,B) < L(A,C) and L(A,B) < L(B,C) implies L(A,B) < L(A∪B,C) for any clusters A, B
and C. The other three linkage functions are monotonic (the example also serves as an
illustration why average linkage and centroid linkage are not the same).
Hierarchical clustering methods have the distinct advantage that the number of clusters does
not need to be fixed in advance. However, this advantage comes at considerable
computational cost. Furthermore, we now need to choose not just the distance measure used,
but also the linkage function.
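A short example of hierarchical clustering with different linkage functions using SciPy (assuming scipy is available); scipy.cluster.hierarchy.linkage supports the 'single', 'complete', 'average' and 'centroid' methods discussed above, and dendrogram can draw the resulting tree. The two blobs of points are made up for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
D = np.vstack([rng.normal(loc=0.0, size=(10, 2)),
               rng.normal(loc=5.0, size=(10, 2))])

for method in ['single', 'complete', 'average', 'centroid']:
    Z = linkage(D, method=method)                     # each row records a merge and its level
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into two clusters
    print(method, labels)

# dendrogram(Z) would plot the tree for the last linkage method (requires matplotlib).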
