Linear models: The least-squares method, The perceptron: a heuristic learning algorithm for
linear classifiers, Support vector machines, obtaining probabilities from linear classifiers,
Going beyond linearity with kernel methods. Distance Based Models: Introduction,
Neighbours and exemplars, Nearest Neighbours classification, Distance Based Clustering,
Hierarchical Clustering.
----------------------------------------------------------------------------------------------------------------
Linear Models:
The models that can be understood in terms of lines and planes, commonly called linear
models. In machine learning, linear models are of particular interest because of their
simplicity. The reasons for this simplicity are as follows.
Linear models are parametric, meaning that they have a fixed form with a small number of
numeric parameters that need to be learned from data. This is different from tree or rule
models, where the structure of the model is not fixed in advance.
Linear models are stable, which is to say that small variations in the training data have
only limited impact on the learned model.
Linear models are less likely to overfit the training data than some other models, largely
because they have relatively few parameters.
The last two points can be summarized by saying that linear models have low variance but
high bias. Such models are often preferable when you have limited data and want to avoid
overfitting. High-variance, low-bias models such as decision trees are preferable if data is
abundant but underfitting is a concern.
The univariate least-squares regression coefficient, b̂ = σ_xy/σ_xx, can be understood by
noting that the covariance is measured in units of x times units of y (e.g., metres times
kilograms in Example 7.1) and the variance in units of x squared (e.g., metres squared), so
their quotient is measured in units of y per unit of x (e.g., kilograms per metre), which is
exactly what we expect of a slope.
In other words, univariate linear regression can be understood as consisting of two steps:
1. normalisation of the feature by dividing its values by the feature's variance;
2. calculating the covariance of the target variable and the normalised feature.
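To make this two-step recipe concrete, here is a minimal sketch in Python; the function name and the toy height/weight values are only for illustration (they are not the data of Example 7.1):

```python
import numpy as np

def univariate_least_squares(x, y):
    """Fit y = a + b*x by the least-squares method.

    The slope b is the covariance of x and y divided by the variance of x;
    the intercept follows from a = mean(y) - b * mean(x).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # sigma_xy / sigma_xx
    a = y.mean() - b * x.mean()
    return a, b

# Illustrative height (m) / weight (kg) pairs
heights = np.array([1.60, 1.70, 1.75, 1.80, 1.85])
weights = np.array([55.0, 65.0, 70.0, 78.0, 82.0])
a_hat, b_hat = univariate_least_squares(heights, weights)
print(a_hat, b_hat)   # intercept in kg, slope in kg per metre
```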
The sum of the residuals of the least-squares solution is zero:

Σ_{i=1}^{n} (y_i − (â + b̂ x_i)) = 0

The result follows because â = ȳ − b̂ x̄, as derived in Example 7.1. While this property is
intuitively appealing, it is worth keeping in mind that it also makes linear regression
susceptible to outliers: points that are far removed from the regression line, often because of
measurement errors.
Example 7.2 (The effect of outliers). Suppose that, as the result of a transcription error, one
of the weight values in Figure 7.1 is increased by 10 kg. Figure 7.2 shows that this has a
considerable effect on the least-squares regression line.
Figure 7.2. The effect of an outlier in univariate regression. One of the blue points got moved
up 10 units to the green point, changing the red regression line to the green line.
In the second form of this equation, y, a, X and ε are n-vectors, and b is a scalar. In the case
of d features, all that changes is that X becomes an n-by-d matrix and b becomes a d-vector of
regression coefficients. We can apply the by now familiar trick of using homogeneous
coordinates to simplify these equations as follows:

y = X⁰w + ε

with X⁰ an n-by-(d+1) matrix whose first column is all 1s and whose remaining columns are the
columns of X, and w has the intercept as its first entry and the regression coefficients as the
remaining d entries. For convenience we will often blur the distinction between these two
formulations and state the regression equation as y = Xw + ε, with X having d columns and w
having d rows; from the context it will be clear whether we are representing the intercept by
means of homogeneous coordinates, or have rather zero-centred the target and features to
achieve a zero intercept. In the univariate case we were able to obtain a closed-form solution
for w; can we do the same in the multivariate case? First, we are likely to need the
covariances between every feature and the target variable.
Consider the expression Xᵀy, which is a d-vector; its j-th entry is the product of the j-th row
of Xᵀ – i.e., the j-th column of X, which is (x_1j, . . . , x_nj) – with (y_1, . . . , y_n):

Σ_{i=1}^{n} x_ij y_i = n(σ_jy + μ_j ȳ)

Assuming for the moment that every feature is zero-centred, we have μ_j = 0 and thus Xᵀy is
a d-vector holding all the required covariances (times n).
In the multivariate case we also want to normalise the features to have unit variance, which
we can achieve by means of a d-by-d scaling matrix: a diagonal matrix with diagonal entries
1/(nσ_jj). If S is the diagonal matrix with diagonal entries nσ_jj, we can get the required
scaling matrix by simply inverting S. So our first stab at a solution for the multivariate
regression problem is

ŵ = S⁻¹Xᵀy

As it turns out, the general case requires a more elaborate matrix in place of S:

ŵ = (XᵀX)⁻¹Xᵀy

Notice that if we do assume uncorrelated features (σ₁₂ = 0 in the case of two features), the
components of ŵ reduce to σ_jy/σ_jj, which brings us back to Equation 7.2. Assuming
uncorrelated features effectively decomposes a multivariate regression problem into d
univariate problems.
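The closed-form solution can be checked with a short sketch; np.linalg.solve is used instead of forming the inverse explicitly, and the synthetic data is only there to show that the known coefficients are recovered:

```python
import numpy as np

def multivariate_least_squares(X, y):
    """Least-squares weights from the normal equations w = (X^T X)^{-1} X^T y.

    A column of ones is prepended (homogeneous coordinates), so the first
    entry of the returned vector is the intercept.
    """
    X0 = np.column_stack([np.ones(len(X)), X])     # n-by-(d+1) matrix X0
    return np.linalg.solve(X0.T @ X0, X0.T @ y)    # solves (X0^T X0) w = X0^T y

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)
print(multivariate_least_squares(X, y))   # approximately [3.0, 1.5, -2.0]
```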
THE PERCEPTRON
A linear classifier that will achieve perfect separation on linearly separable data is the
perceptron. The perceptron iterates over the training set, updating the weight vector every
time it encounters an incorrectly classified example.
For example, let x_i be a misclassified positive example; then we have y_i = +1 and w·x_i < t.
We therefore want to find w′ such that w′·x_i > w·x_i, which moves the decision boundary
towards and hopefully past x_i. This can be achieved by calculating the new weight vector as
w′ = w + ηx_i, where 0 < η ≤ 1 is the learning rate.
We then have w′·x_i = w·x_i + ηx_i·x_i > w·x_i, as required. Similarly, if x_j is a misclassified
negative example, then we have y_j = −1 and w·x_j > t. In this case we calculate the new weight
vector as w′ = w − ηx_j, and thus w′·x_j = w·x_j − ηx_j·x_j < w·x_j. The two cases can be
combined in a single update rule:

w′ = w + ηy_i x_i (7.8)
The perceptron training algorithm is given in Algorithm 7.1. It iterates through the training
examples until all examples are correctly classified. The algorithm can easily be turned into
an online algorithm that processes a stream of examples, updating the weight vector only if
the last received example is misclassified. The perceptron is guaranteed to converge to a
solution if the training data is linearly separable, but it won’t converge otherwise.
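The training loop described above can be sketched as follows; labels are assumed to be in {−1, +1}, the threshold is folded into the weight vector via homogeneous coordinates, and the epoch cap is an added safeguard against non-separable data:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Train a perceptron on labels in {-1, +1}.

    A constant 1 is appended to every instance so the threshold is learned
    as part of the weight vector. Weights start at zero, so the learning
    rate eta only scales the solution and can be left at 1.
    """
    X0 = np.column_stack([np.asarray(X, float), np.ones(len(X))])
    w = np.zeros(X0.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X0, y):
            if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                w += eta * yi * xi       # update rule w' = w + eta * y_i * x_i
                converged = False
        if converged:                    # all training examples correctly classified
            break
    return w
```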
Figure 7.5. (left) A perceptron trained with a small learning rate (η = 0.2). The circled
examples are the ones that trigger the weight update. (middle) Increasing the learning rate to
η = 0.5 leads in this case to a rapid convergence. (right) Increasing the learning rate further
to η = 1 may lead to too aggressive weight updating, which harms convergence. The starting
point in all three cases was the basic linear classifier.
Figure 7.5 gives a graphical illustration of the perceptron training algorithm. In this particular
example the weight vector is initialized to the basic linear classifier, which means the
learning rate does have an effect on how quickly we move away from the initial decision
boundary.
However, if the weight vector is initialised to the zero vector, it is easy to see that the learning
rate is just a constant factor that does not affect convergence. We will set it to 1 in what follows.
The key point of the perceptron algorithm is that, every time an example x_i is misclassified,
we add y_i x_i to the weight vector. After training has completed, each example has been
misclassified zero or more times; denote this number α_i for example x_i.
Using this notation the weight vector can be expressed as

w = Σ_{i=1}^{n} α_i y_i x_i

In other words, the weight vector is a linear combination of the training instances. The
perceptron shares this property with, e.g., the basic linear classifier:

w = Σ_{x∈D} α_{c(x)} c(x) x

where c(x) is the true class of example x (i.e., +1 or −1), α⊕ = 1/Pos and α⊖ = 1/Neg. In the
dual, instance-based view of linear classification we are learning instance weights α_i rather
than feature weights w_j. In this dual perspective, an instance x is classified as

ŷ = sign(Σ_{i=1}^{n} α_i y_i x_i·x)
This means that, during training, the only information needed about the training data is all
pairwise dot products: the n-by-n matrix G = XXᵀ containing these dot products is called the
Gram matrix. Algorithm 7.2 gives the dual form of the perceptron training algorithm.
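A matching sketch of the dual form, which only ever touches the data through the Gram matrix (again with labels in {−1, +1} and an epoch cap added for safety):

```python
import numpy as np

def dual_perceptron_train(X, y, max_epochs=100):
    """Dual (instance-based) perceptron training.

    alpha[i] counts how often example i triggered an update; only the
    Gram matrix of pairwise dot products is needed during training.
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    G = X @ X.T                                        # Gram matrix G = X X^T
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        converged = True
        for i in range(len(y)):
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:   # misclassified
                alpha[i] += 1
                converged = False
        if converged:
            break
    return alpha   # equivalent primal weights: w = (alpha * y) @ X
```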
SUPPORT VECTOR MACHINES
For a given training set and decision boundary, let m⁺ be the smallest margin of any positive
example and m⁻ the smallest margin of any negative example; we want the sum of these to be as
large as possible.
This sum is independent of the decision threshold t, as long as the nearest positives and
negatives are kept on the correct side of the decision boundary, and so we readjust t such that
m⁺ and m⁻ become equal. Figure 7.7 depicts this graphically in a two-dimensional instance space.
The training examples nearest to the decision boundary are called support vectors: as we shall
see, the decision boundary of a support vector machine (SVM) is defined as a linear
combination of the support vectors.
Figure 7.7. The geometry of a support vector classifier. The circled data points are the
support vectors, which are the training examples nearest to the decision boundary. The
support vector machine finds the decision boundary that maximises the margin m/||w||.
The margin is thus defined as m/||w||, where m is the distance between the decision boundary
and the nearest training instances (at least one of each class) as measured along w. Since we
are free to rescale t, ||w|| and m, it is customary to choose m = 1. Maximising the margin
then corresponds to minimising ||w|| or, more conveniently, ½||w||², provided of course that
none of the training points fall inside the margin. This leads to a constrained quadratic
optimisation problem:

minimise ½||w||²  subject to  y_i(w·x_i − t) ≥ 1, 1 ≤ i ≤ n

We will approach this using the method of Lagrange multipliers. Adding the constraints with
multipliers α_i for each training example gives the Lagrange function

Λ(w, t, α_1, . . . , α_n) = ½||w||² − Σ_{i=1}^{n} α_i (y_i(w·x_i − t) − 1)
While this looks like a formidable formula, some further analysis will allow us to derive
the simpler dual form of the Lagrange function.
By taking the partial derivative of the Lagrange function with respect to t and setting it to 0
we obtain Σ_{i=1}^{n} α_i y_i = 0; taking the partial derivative with respect to w and setting
it to 0 gives w = Σ_{i=1}^{n} α_i y_i x_i. Substituting these back into the Lagrange function
yields its dual form

Λ(α_1, . . . , α_n) = −½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i·x_j + Σ_{i=1}^{n} α_i

The dual problem is to maximise this function under positivity constraints and one equality
constraint:

maximise Λ(α_1, . . . , α_n)  subject to  α_i ≥ 0 for 1 ≤ i ≤ n and Σ_{i=1}^{n} α_i y_i = 0
The dual form of the optimization problem for support vector machines illustrates two
important points.
First, it shows that searching for the maximum-margin decision boundary is equivalent to
searching for the support vectors: they are the training examples with non-zero Lagrange
multipliers, and through w = Σ_{i=1}^{n} α_i y_i x_i they completely determine the decision
boundary.
Secondly, it shows that the optimization problem is entirely defined by pairwise dot products
between training instances: the entries of the Gram matrix.
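For completeness, here is how a hard-margin classifier could be inspected with scikit-learn, assuming that library is available; the three points are made up, a very large C approximates the hard margin, and sklearn's sign convention for the intercept may differ from the t used in the text:

```python
import numpy as np
from sklearn.svm import SVC

# Three linearly separable examples (illustrative, not the data of Figure 7.8)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([-1, -1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print(clf.support_vectors_)        # training examples with non-zero multipliers
print(clf.dual_coef_)              # alpha_i * y_i for each support vector
print(clf.coef_, clf.intercept_)   # w and the intercept (sklearn's convention)

# The weight vector is a linear combination of the support vectors:
w_from_svs = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_svs, clf.coef_))   # True
```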
Figure 7.8. (left) A maximum-margin classifier built from three examples, with w = (0,−1/2)
and margin 2. The circled examples are the support vectors: they receive non-zero Lagrange
multipliers and define the decision boundary. (right) By adding a second positive example, the
decision boundary is rotated to w = (3/5,−4/5) and the margin decreases to 1.
The signed distance of an example x to the decision boundary is

d(x) = (w·x − t)/||w|| = w̄·x − t̄

with w̄ = w/||w|| the weight vector rescaled to unit length and t̄ = t/||w|| the correspondingly
rescaled intercept. The sign of this quantity tells us which side of the decision boundary we
are on: positive distances for points on the ‘positive’ side of the decision boundary (the
direction in which w points) and negative distances on the other side.
This geometric interpretation of the scores produced by linear classifiers offers an interesting
possibility for turning them into probabilities, a process called calibration.
Let d̄⊕ denote the mean distance of the positive examples to the decision boundary: i.e.,
d̄⊕ = w·μ⊕ − t, where μ⊕ is the mean of the positive examples and w is unit length (although the
latter assumption is not strictly necessary, as it will turn out that the weight vector will be
rescaled). It would not be unreasonable to expect that the distance of positive examples to the
decision boundary is normally distributed around this mean: that is, when plotting a histogram
of these distances, we would expect the familiar bell curve to appear.
Under this assumption, the probability density function of d is

P(d|⊕) = (1/√(2πσ²)) exp(−(d − d̄⊕)²/(2σ²))

Similarly, the distances of negative examples to the decision boundary can be expected to be
normally distributed around d̄⊖ = w·μ⊖ − t.
We will assume that both normal distributions have the same variance σ².
Suppose we now observe a point x with distance d(x). We classify this point as positive if
d(x) > 0 and as negative if d(x) < 0, but we want to attach a probability p̂(x) = P(⊕|d(x)) to
these predictions. Using Bayes’ rule we obtain

p̂(x) = P(d(x)|⊕)P(⊕) / (P(d(x)|⊕)P(⊕) + P(d(x)|⊖)P(⊖)) = LR(x)·clr / (LR(x)·clr + 1)

where LR is the likelihood ratio P(d(x)|⊕)/P(d(x)|⊖) obtained from the normal score
distributions, and clr = P(⊕)/P(⊖) is the class ratio. We will assume for simplicity that
clr = 1 in the derivation below. Furthermore, assume for now that σ² = 1 and d̄⊕ = −d̄⊖ = 1/2,
so that the class means are equidistant from the decision boundary and one unit of variance
apart. We then have

LR(x) = exp(−(d(x) − 1/2)²/2) / exp(−(d(x) + 1/2)²/2) = exp(d(x))

and hence p̂(x) = 1/(1 + exp(−d(x))): the logistic function shown in Figure 7.11.
Figure 7.11. The logistic function, a useful function for mapping distances from a linear
decision boundary into an estimate of the positive posterior probability. The fat red line
indicates the standard logistic function p̂(d) = 1/(1 + exp(−d)); this function can be used to
obtain probability estimates if the two classes are equally prevalent and the class means are
equidistant from the decision boundary and one unit of variance apart. The steeper and
flatter red lines show how the function changes if the class means are 2 and 1/2 units of
variance apart, respectively. The three blue lines show how these curves change if d0 = 1,
which means that the positives are on average further away from the decision boundary.
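A small sketch of this calibration step; the function and its defaults are chosen here to match the assumptions above (shared variance, class means at ±1/2), in which case it reduces to the standard logistic function:

```python
import math

def calibrate(d, d_pos=0.5, d_neg=-0.5, sigma2=1.0, clr=1.0):
    """Map a signed distance d to P(positive | d) under the two-Gaussian model.

    Assumes per-class distances are normal with means d_pos and d_neg and
    shared variance sigma2; clr is the class ratio P(+)/P(-). With the
    defaults this reduces to the standard logistic 1/(1 + exp(-d)).
    """
    log_lr = ((d - d_neg) ** 2 - (d - d_pos) ** 2) / (2.0 * sigma2)
    odds = math.exp(log_lr) * clr
    return odds / (1.0 + odds)

print(calibrate(0.0))   # 0.5: on the decision boundary
print(calibrate(2.0))   # about 0.88, the same as 1/(1 + exp(-2))
```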
Take the perceptron algorithm in dual form. The algorithm is a simple counting algorithm; the
only operation that is somewhat involved is testing whether example x_i is correctly
classified, by evaluating the sign of Σ_j α_j y_j x_j·x_i. The key component of this
calculation is the dot product x_i·x_j. Assume bivariate examples x_i = (x_i, y_i) and
x_j = (x_j, y_j) for notational simplicity; the dot product can then be written as
x_i·x_j = x_i x_j + y_i y_j. The corresponding instances in the quadratic feature space are
(x_i², y_i²) and (x_j², y_j²), and their dot product is x_i²x_j² + y_i²y_j². This is almost
equal to (x_i·x_j)² = x_i²x_j² + y_i²y_j² + 2x_i x_j y_i y_j, but not quite, because of the
third term of cross-products. We can capture this term by extending the feature vector with a
third feature √2·xy. This gives the following feature space: (x², y², √2·xy), in which the dot
product of two mapped instances is exactly the square of their dot product in the original
space.
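A quick numeric check of this construction (the points a and b are arbitrary): the dot product in the extended feature space equals the squared dot product in the original space, i.e., the quadratic kernel:

```python
import numpy as np

def phi(v):
    """Map a bivariate point (x, y) to the quadratic feature space (x^2, y^2, sqrt(2)*x*y)."""
    x, y = v
    return np.array([x * x, y * y, np.sqrt(2.0) * x * y])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)     # dot product in the extended feature space ...
kernel = (a @ b) ** 2          # ... equals the squared original dot product
print(explicit, kernel, np.isclose(explicit, kernel))   # 1.0 1.0 True
```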
Distance-based models
What’s the relevance when trying to understand distance-based machine learning. Well, the
rank (row) and file (column) on a chessboard is not unlike a discrete or categorical feature in
machine learning.
We can switch to real-valued features by imagining a ‘continuous’ chessboard with infinitely
many, infinitesimally narrow ranks and files. Squares now become points, and distances are
not expressed as the number of squares travelled, but simply as a real number on some scale.
If we now look at the shapes obtained by connecting equidistant points, we see that many of
these carry over from the discrete to the continuous case.
For a King, for example, all points a given fixed distance away still form a square around the
current position; and for a KRook they still form a square rotated 45 degrees.
As it happens, these are special cases of the following generic concept.
If we let p grow to infinity we obtain Chebyshev distance, Dis_∞(x, y) = max_j |x_j − y_j|:
this is the distance experienced by a King on the chessboard, who can move diagonally as well
as horizontally and vertically, but only one step at a time.
Another interesting case is p = 0. This is not strictly a Minkowski distance; however, we can
define it as Dis_0(x, y) = Σ_j |x_j − y_j|⁰ under the understanding that x⁰ = 0 for x = 0 and 1
otherwise, so that it counts the number of coordinates in which x and y differ. This is
actually the distance experienced by a Rook on the chessboard: if both rank and file are
different the square is two moves away; if only one of them is different the square is one
move away.
If x and y are binary strings, this is also called the Hamming distance.
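These distances can be collected in one short sketch; the handling of p = 0 and p = ∞ follows the conventions just described:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p, with the limiting cases used in the text.

    p = 2 gives Euclidean, p = 1 Manhattan, p = inf Chebyshev (King move),
    and p = 0 counts differing coordinates (Rook move / Hamming distance).
    """
    diff = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if p == 0:
        return float(np.count_nonzero(diff))
    if np.isinf(p):
        return float(diff.max())
    return float((diff ** p).sum() ** (1.0 / p))

a, b = [0, 0], [3, 4]
print(minkowski(a, b, 2))        # 5.0  (Euclidean)
print(minkowski(a, b, 1))        # 7.0  (Manhattan)
print(minkowski(a, b, np.inf))   # 4.0  (Chebyshev)
print(minkowski(a, b, 0))        # 2.0  (number of differing coordinates)
```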
Figure 8.4 investigates this for Minkowski distances of various orders. The triangle inequality
dictates that the distance from the origin to C is no more than the sum of the distances from
the origin to A (Dis(O,A)) and from A to C (Dis(A, C)). B is at the same distance from A as
C, regardless of the distance measure used; so Dis(O,A)+Dis(A,C) is equal to the distance
from the origin to B. So, if we draw a circle around the origin through B, the triangle
inequality dictates that C not be outside that circle. As we see in the left figure for Euclidean
distance, B is the only point where the circles around the origin and around A intersect, so
everywhere else the triangle inequality is a strict inequality.
The middle figure shows the same situation for Manhattan distance (p = 1). Now, B and C
are in fact equidistant from the origin, and so travelling via A to C is no longer a detour, but
just one of the many shortest routes. However, if we now decrease p further, we see that C
ends up outside the red shape, and is thus further away than B when seen from the origin,
whereas of course the sum of the distances from the origin to A and from A to C is still equal
to the distance from the origin to B. At this point, our intuition breaks down: Minkowski
distances with p < 1 are simply not very useful as distances since they all violate the triangle
inequality.
Nearest-neighbour classification
In the previous section we saw how to generalise the basic linear classifier to more than two
classes, by learning an exemplar for each class and using the nearest-exemplar decision rule
to classify new data. In fact, the most commonly used distance-based classifier is even more
straightforward than that: it simply uses each training instance as an exemplar. Consequently,
‘training’ this classifier requires nothing more than memorising the training data. This
extremely simple classifier is known as the nearest-neighbour classifier. Its decision regions
are made up of the cells of a Voronoi tessellation, with piecewise linear decision boundaries
selected from the Voronoi boundaries (since adjacent cells may be labelled with the same
class).
What are the properties of the nearest-neighbour classifier? First, notice that, unless the
training set contains identical instances from different classes, we will be able to separate the
classes perfectly on the training set – not really a surprise, as we memorized all training
examples! Furthermore, by choosing the right exemplars we can more or less represent any
decision boundary, or at least an arbitrarily close piecewise linear approximation. It follows
that the nearest-neighbour classifier has low bias, but also high variance: move any of the
exemplars spanning part of the decision boundary, and you will also change the boundary.
This suggests a risk of overfitting if the training data is limited, noisy or unrepresentative.
From an algorithmic point of view, training the nearest-neighbour classifier is very fast,
taking only O(n) time for storing n exemplars. The downside is that classifying a single
instance also takes O(n) time, as the instance will need to be compared with every exemplar
to determine which one is the nearest. It is possible to reduce classification time at the
expense of increased training time by storing the exemplars in a more elaborate data
structure, but this tends not to scale well to large numbers of features.
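A minimal brute-force sketch of this classifier (the class name and toy data are illustrative only):

```python
import numpy as np

class NearestNeighbourClassifier:
    """Brute-force 1-nearest-neighbour classifier.

    'Training' just stores the exemplars (O(n)); classifying one instance
    compares it against every stored exemplar (also O(n)).
    """
    def fit(self, X, y):
        self.X, self.y = np.asarray(X, float), np.asarray(y)
        return self

    def predict_one(self, x):
        distances = np.linalg.norm(self.X - x, axis=1)   # Euclidean distances
        return self.y[np.argmin(distances)]              # label of the nearest exemplar

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
y = np.array(["red", "red", "blue"])
print(NearestNeighbourClassifier().fit(X, y).predict_one(np.array([3.5, 4.0])))  # blue
```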
In fact, high-dimensional instance spaces can be problematic for another reason: the infamous
curse of dimensionality. High-dimensional spaces tend to be extremely sparse, which means
that every point is far away from virtually every other point, and hence pairwise distances
tend to be uninformative. However, whether or not you are hit by the curse of dimensionality
is not simply a matter of counting the number of features, as there are several reasons why the
effective dimensionality of the instance space may be much smaller than the number of
features. For example, some of the features may be irrelevant and drown out the relevant
features’ signal in the distance calculations. In such a case it would be a good idea, before
building a distance-based model, to reduce dimensionality by performing feature selection,
as will be discussed in Chapter 10. Alternatively, the data may live on a manifold of lower
dimension than the instance space (e.g., the surface of a sphere is a two-dimensional manifold
wrapped around a three-dimensional object), which allows other dimensionality-reduction
techniques such as principal component analysis, which will be explained in the same
chapter. In any case, before applying nearest-neighbour classification it is a good idea to plot
a histogram of pairwise distances of a sample to see if they are sufficiently varied.
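Such a diagnostic might look as follows, assuming NumPy and matplotlib are available; the 100-dimensional Gaussian sample is only there to show the typical narrow histogram of a high-dimensional data set:

```python
import numpy as np
import matplotlib.pyplot as plt

def pairwise_distance_histogram(X, bins=30):
    """Plot a histogram of all pairwise Euclidean distances in a sample.

    A narrow, concentrated histogram suggests distances are uninformative
    (a symptom of the curse of dimensionality); a spread-out one is healthier.
    """
    X = np.asarray(X, float)
    diff = X[:, None, :] - X[None, :, :]            # n x n x d differences
    dists = np.sqrt((diff ** 2).sum(-1))            # n x n distance matrix
    upper = dists[np.triu_indices(len(X), k=1)]     # each pair counted once
    plt.hist(upper, bins=bins)
    plt.xlabel("pairwise distance")
    plt.ylabel("count")
    plt.show()

rng = np.random.default_rng(0)
pairwise_distance_histogram(rng.normal(size=(200, 100)))   # 100-dimensional sample
```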
Notice that the nearest-neighbour method can easily be applied to regression problems with a
real-valued target variable. In fact, the method is completely oblivious to the type of target
variable and can be used to output text documents, images and videos. It is also possible to
output the exemplar itself instead of a separate target, in which case we usually speak of
nearest-neighbour retrieval. Of course we can only output targets (or exemplars) stored in the
exemplar database, but if we have a way of aggregating these we can go beyond this
restriction by applying the k-nearest neighbour method. In its simplest form, the k-nearest
neighbour classifier takes a vote between the k ≥ 1 nearest exemplars of the instance to be
classified, and predicts the majority class. We can easily turn this into a probability estimator
by returning the normalized class counts as a probability distribution over classes.
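A sketch of this voting scheme for a single query point; the helper function and the toy data are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict_proba(X_train, y_train, x, k=3):
    """k-nearest-neighbour class distribution for a single query point x.

    Returns normalised class counts among the k nearest exemplars, which
    can be read as a probability estimate over the classes.
    """
    dists = np.linalg.norm(np.asarray(X_train, float) - x, axis=1)
    nearest = np.argsort(dists)[:k]                 # indices of the k nearest exemplars
    counts = Counter(np.asarray(y_train)[nearest])
    return {label: c / k for label, c in counts.items()}

X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
y = np.array(["orange", "orange", "red", "red", "red"])
print(knn_predict_proba(X, y, np.array([0.4, 0.4]), k=3))
# roughly {'orange': 2/3, 'red': 1/3}: 'orange' is the majority among the 3 neighbours
```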
Figure 8.9. (left) Decision regions of a 3-nearest neighbour classifier; the shading
represents the predicted probability distribution over the five classes. (middle) 5-nearest
neighbour. (right) 7-nearest neighbour.
Figure 8.9 illustrates this on a small data set of 20 exemplars from five different classes, for k
= 3, 5, 7. The class distribution is visualised by assigning each test point the class of a
uniformly sampled neighbour: so, in a region where two of k = 3 neighbours are red and one
is orange, the shading is a mix of two-thirds red and one-third orange. While for k = 3 the
decision regions are still mostly discernible, this is much less so for k = 5 and k = 7. This
may seem at odds with our earlier demonstration of the increase in the number of decision
regions with increasing k in Example 8.2. However, this increase is countered by the fact that
the probability vectors become more similar to each other. To take an extreme example: if k
is equal to the number of exemplars n, every test instance will have the same number of
neighbours and will receive the same probability vector which is equal to the prior
distribution over the exemplars. If k = n −1 we can reduce one of the class counts by 1,
which can be done in c ways: the same number of possibilities as with k = 1!
We conclude that the refinement of k-nearest neighbour – the number of different predictions
it can make – initially increases with increasing k, then decreases again. Furthermore, we can
say that the bias increases and the variance decreases with increasing k. There is no easy
recipe to decide what value of k is appropriate for a given data set. However, it is possible to
sidestep this question to some extent by applying distance weighting to the votes: that is, the
closer an exemplar is to the instance to be classified, the more its vote counts.
Figure 8.10 demonstrates this, using the reciprocal of the distance to an exemplar as the
weight of its vote. This blurs the decision boundaries, as the model now applies a
combination of grouping by means of the Voronoi boundaries, and grading by means of
distance weighting. Furthermore, since the weights decrease quickly for larger distances, the
effect of increasing k is much smaller than with unweighted voting. In fact, with distance
weighting we can simply put k = n and still obtain a model that makes different predictions in
different parts of the instance space. One could say that distance weighting makes k-nearest
neighbour more of a global model, while without it (and for small k) it is more like an
aggregation of local models.
If k-nearest neighbour is used for regression problems, the obvious way to aggregate the
predictions from the k neighbours is by taking the mean value, which can again be distance-
weighted. This would lend the model additional predictive power by predicting values that
aren’t observed among the stored exemplars. More generally, we can apply k-means to any
learning problem where we have an appropriate ‘aggregator’ for multiple target values.
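A sketch of distance-weighted k-nearest-neighbour regression under these conventions (the reciprocal-distance weights and the small epsilon are implementation choices):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=3, distance_weighted=True):
    """Predict a real-valued target as the (optionally distance-weighted)
    mean of the targets of the k nearest exemplars."""
    X_train = np.asarray(X_train, float)
    y_train = np.asarray(y_train, float)
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    if distance_weighted:
        # reciprocal-distance weights; the small constant guards against
        # division by zero when the query coincides with an exemplar
        w = 1.0 / (dists[nearest] + 1e-12)
        return float(np.average(y_train[nearest], weights=w))
    return float(y_train[nearest].mean())

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_regress(X, y, np.array([1.4]), k=3))   # about 1.88, a weighted mean of 1.0, 4.0 and 0.0
```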
Figure 8.10. (left) 3-nearest neighbour with distance weighting on the data from Figure 8.9.
(middle) 5-nearest neighbour. (right) 7-nearest neighbour.
Hierarchical clustering
In this section we take a look at methods that represent clusters using trees. Here we consider trees
called dendrograms, which are purely defined in terms of a distance measure. Because
dendrograms use features only indirectly, as the basis on which the distance measure is
calculated, they partition the given data rather than the entire instance space, and hence
represent a descriptive clustering rather than a predictive one.
A precise definition of a dendrogram is as follows.
Definition 8.4 (Dendrogram). Given a data set D, a dendrogram is a binary tree with the
elements of D at its leaves. An internal node of the tree represents the subset of elements in
the leaves of the subtree rooted at that node. The level of a node is the distance between the
two clusters represented by the children of the node. Leaves have level 0.
For this definition, we need a way to measure how close two clusters are. You might think
that this is straightforward: just calculate the distance between the two cluster means.
However, clusters are not always well represented by their means; furthermore, taking cluster
means as exemplars assumes Euclidean distance, and we may want to use one of the other
distance metrics. This has led to the introduction of the so-called
linkage function, which is a general way to turn pairwise point distances into pairwise cluster
distances.
Single and complete linkage both define the distance between clusters in terms of a particular
pair of points. Consequently, they cannot take the shape of the cluster into account, which is
why average and centroid linkage can offer an advantage. However, centroid linkage can lead
to non-intuitive dendrograms, as illustrated in Figure 8.17. The issue here is that we have
L({1}, {2}) < L({1}, {3}) and L({1}, {2}) < L({2}, {3}) but L({1}, {2}) > L({1,2}, {3}).
The first two inequalities mean that 1 and 2 are the first to be merged into a cluster; but the
second inequality means that the level of cluster {1,2,3} in the dendrogram drops below the
level of {1,2}. Centroid linkage violates the requirement of monotonicity, which stipulates
that L(A,B) < L(A,C) and L(A,B) < L(B,C) implies L(A,B) < L(A∪B,C) for any clusters A, B
and C. The other three linkage functions are monotonic (the example also serves as an
illustration why average linkage and centroid linkage are not the same).
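These linkage functions can be compared with SciPy, assuming it is installed; the three points below are a small constructed example of the centroid-linkage inversion just described:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Points 0 and 1 are closest (distance 1), but the centroid of {0, 1} is only
# 0.9 away from point 2, so centroid linkage merges the final cluster at a
# lower level than the first merge (the non-monotonicity discussed above).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)    # each row: the two merged clusters and the merge level
    print(method, Z[:, 2])           # merge levels; only 'centroid' decreases

dendrogram(linkage(X, method="centroid"))   # the second merge drops below the first
plt.show()
```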
Hierarchical clustering methods have the distinct advantage that the number of clusters does
not need to be fixed in advance. However, this advantage comes at considerable
computational cost. Furthermore, we now need to choose not just the distance measure used,
but also the linkage function.