Lecture 2: k-Nearest Neighbors
Lecture Notes
Sebastian Raschka
Department of Statistics
University of Wisconsin–Madison
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/
Fall 2018
2.1 Introduction
Nearest neighbor algorithms are among the “simplest” supervised machine learning algorithms and have been well studied in the field of pattern recognition over the last century. While nearest neighbor algorithms are not as popular as they once were, they are still widely used in practice, and I highly recommend that you at least consider the k-Nearest Neighbor algorithm as a predictive performance benchmark in classification projects when you are trying to develop more sophisticated models.
In this lecture, we will primarily talk about two different algorithms, the Nearest Neighbor (NN) algorithm and the k-Nearest Neighbor (kNN) algorithm. NN is just a special case of kNN, where k = 1. To avoid making this text unnecessarily convoluted, we will only use the abbreviation NN if we talk about concepts that do not apply to kNN in general. Otherwise, we will use kNN to refer to nearest neighbor algorithms in general, regardless of the value of k.
The kNN algorithm does not involve any model fitting; the training examples are merely stored during the training phase. For this reason, kNN is also called a lazy learning algorithm. What it means to be a lazy learning algorithm is that the processing of the training examples is postponed until making predictions1 – again, the training consists literally of just storing the training data.
1 When you are reading recent literature, note that the prediction step is now often called “inference.”
Then, to make a prediction (class label or continuous target), the kNN algorithm finds the k nearest neighbors of a query point and computes the class label (classification) or continuous target (regression) based on the k nearest (most “similar”) points. The exact mechanics will be explained in the next sections. However, the overall idea is that instead of approximating the target function f(x) = y globally, during each prediction, kNN approximates the target function locally. In practice, it is easier to learn to approximate a function locally than globally.
Figure 1: Illustration of the nearest neighbor classification algorithm in two dimensions (features x1 and x2). In the left subpanel, the training examples are shown as blue dots, and a query point that we want to classify is shown as a question mark. In the right subpanel, the class labels of the training examples are shown, and the dashed line indicates the nearest neighbor of the query point, assuming a Euclidean distance metric. The predicted class label is the class label of the closest data point in the training set (here: class 0).
In the previous lecture, we learned about different kinds of categorization schemes, which
may be helpful for understanding and distinguishing different types of machine learning
algorithms.
To recap, the categories we discussed were
• eager vs lazy;
• batch vs online;
• parametric vs nonparametric;
• discriminative vs generative.
Since kNN does not have an explicit training step and defers all of the computation until prediction, we already determined that kNN is a lazy algorithm.
Further, instead of devising one global model or approximation of the target function, for each different data point, there is a different local approximation, which depends on the data point itself as well as the training data points. Since the prediction is based on a comparison of a query point with data points in the training set (rather than a global model), kNN is also categorized as an instance-based (or “memory-based”) method. While kNN is a lazy instance-based learning algorithm, an example of an eager instance-based learning algorithm would be the support vector machine, which will be covered later in this course.
Lastly, because we do not make any assumption about the functional form of the kNN prediction, a kNN model is also considered a nonparametric model. However, categorizing kNN as either a discriminative or a generative model is less clear-cut.
While neural networks are gaining popularity in the computer vision and pattern recognition field, one area where k-nearest neighbor models are still commonly and successfully used is the intersection between computer vision, pattern classification, and biometrics (e.g., to make predictions based on extracted geometrical features2).
Other common use cases include recommender systems (via collaborative filtering3 ) and
outlier detection4 .
2.2 Nearest Neighbor Algorithm
After introducing the overall concept of the nearest neighbor algorithms, this section provides a more formal or technical description of the 1-nearest neighbor (NN) algorithm.
Training algorithm:
for i = 1, ..., n in the training dataset D, where |D| = n:
• store training example ⟨x[i], f(x[i])⟩
Prediction algorithm5:
closest_point := None
closest_distance := ∞
• for i = 1, ..., n:
  – current_distance := d(x[i], x[q])
  – if current_distance < closest_distance:
    closest_distance := current_distance; closest_point := x[i]
• prediction h(x[q]) := f(closest_point)
2 … “geometrical features extraction”. In: Procedia Computer Science 65 (2015), pp. 529–537.
3 Youngki Park et al. “Reversed CF: A fast collaborative filtering algorithm using a k-nearest neighbor graph”. In: Expert Systems with Applications 42.8 (2015), pp. 4022–4028.
4 Guilherme O. Campos et al. “On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study”. In: Data Mining and Knowledge Discovery 30.4 (2016), pp. 891–927.
5 We use “:=” as an assignment operator.
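To make the pseudocode above concrete, the following is a minimal Python sketch of the 1-NN prediction step (the names nn_predict, X_train, and y_train are hypothetical; NumPy is assumed for the distance computation):

import numpy as np

def nn_predict(X_train, y_train, x_query):
    """Return the target value of the single closest training example (1-NN)."""
    closest_idx, closest_dist = None, np.inf
    for i, x_i in enumerate(X_train):
        dist = np.sqrt(np.sum((x_i - x_query) ** 2))  # Euclidean distance
        if dist < closest_dist:
            closest_dist, closest_idx = dist, i
    return y_train[closest_idx]

# toy usage: two training examples from two classes
X_train = [np.array([1.0, 1.0]), np.array([5.0, 5.0])]
y_train = [0, 1]
print(nn_predict(X_train, y_train, np.array([1.5, 0.5])))  # -> 0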
2.3 Nearest Neighbor Decision Boundary
In this section, we will build some intuition for the decision boundary of the NN classification model. Assuming a Euclidean distance metric, the decision boundary between any two training examples a and b is a straight line. If a query point is located on the decision boundary, this means it is equidistant from both training examples a and b.
While the decision boundary between a pair of points is a straight line, the decision boundary
of the NN model on a global level, considering the whole training set, is a set of connected,
convex polyhedra. All points within a polyhedron are closest to the training example inside,
and all points outside the polyhedron are closer to a different training example.
Figure 3: Illustration of the nearest neighbor decision boundary as the union of the polyhedra of
training examples belonging to the same class.
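As a quick numerical illustration of the equidistance property described above (with hypothetical coordinates), the following sketch verifies that a point on the perpendicular bisector of two training examples a and b has the same Euclidean distance to both, i.e., it lies exactly on the pairwise decision boundary:

import numpy as np

a = np.array([0.0, 0.0])   # training example a
b = np.array([4.0, 0.0])   # training example b
midpoint = (a + b) / 2.0

# move away from the midpoint along the perpendicular bisector
query = midpoint + np.array([0.0, 3.0])

dist_a = np.linalg.norm(query - a)
dist_b = np.linalg.norm(query - b)
print(dist_a, dist_b)  # both sqrt(13) ~ 3.61, so the query is equidistant from a and b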
2.4 k-Nearest Neighbors
Previously, we described the NN algorithm, which makes a prediction by assigning the class label or continuous target value of the most similar training example to the query point (where similarity is typically measured using the Euclidean distance metric for continuous features).
Instead of basing the prediction on the single most similar training example, kNN considers the k nearest neighbors when predicting a class label (in classification) or a continuous target value (in regression).
2.4.1 Classification
In the classification setting, the simplest incarnation of the kNN model is to predict the target class label as the class label that is most often represented among the k most similar training examples for a given query point. In other words, the class label can be considered as the “mode” of the k training labels or the outcome of a “plurality voting.” Note that in the literature, kNN classification is often described as a “majority voting.” While the authors usually mean the right thing, the term “majority voting” is a bit unfortunate as it typically refers to a reference value of >50% for making a decision. In the case of binary predictions (classification problems with two classes), there is always a majority or a tie. Hence, a majority vote is also automatically a plurality vote. However, in multi-class settings, we do not require a majority to make a prediction via kNN. For example, in a three-class setting, a frequency greater than 1/3 (approx. 33.3%) could already be enough to assign a class label.
Figure 4: Illustration of majority versus plurality voting among the labels y of the k nearest neighbors. In example (A), the plurality vote coincides with the majority vote; in example (B), there is no majority (majority vote: None), yet a plurality vote still yields a prediction.
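To make the distinction between majority and plurality voting concrete, here is a minimal Python sketch (the helper names plurality_vote and majority_vote are hypothetical) applied to a three-class situation like the one described above:

from collections import Counter

def plurality_vote(labels):
    """Return the most frequent label among the k nearest neighbors."""
    return Counter(labels).most_common(1)[0][0]

def majority_vote(labels):
    """Return the most frequent label only if it exceeds 50% of the votes, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

neighbor_labels = [0, 1, 2, 1, 2, 2, 0]      # k = 7 neighbors, three classes
print(plurality_vote(neighbor_labels))        # -> 2 (3 of 7 votes, ~43%)
print(majority_vote(neighbor_labels))         # -> None (no class exceeds 50%)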
Remember that the NN prediction rule (recall that we defined NN as the special case of kNN with k = 1) is the same for both classification and regression. However, in kNN we have two distinct prediction algorithms:
• plurality voting among the class labels of the k nearest neighbors (classification);
• averaging the continuous target values of the k nearest neighbors (regression).
More formally, assume we have a target function f (x) = y that assigns a class label y ∈
{1, . . . , t} to a training example,
f : \mathbb{R}^m \rightarrow \{1, ..., t\}.   (3)
(Usually, we use the letter k to denote the number of classes in this course, but in the context
of k-NN, it would be too confusing.)
Assuming we identified the k nearest neighbors (D_k ⊆ D) of a query point x[q],

h(x^{[q]}) = \arg\max_{y \in \{1, \dots, t\}} \sum_{i=1}^{k} \delta\big(y, f(x^{[i]})\big).   (5)
Here, δ denotes the Kronecker delta function

\delta(a, b) = \begin{cases} 1, & \text{if } a = b, \\ 0, & \text{if } a \neq b. \end{cases}   (6)
Or, in simpler notation, if you remember the “mode” from introductory statistics classes:
h(x^{[q]}) = \mathrm{mode}\big(f(x^{[1]}), \dots, f(x^{[k]})\big).   (7)
A common distance metric to identify the k nearest neighbors D_k is the Euclidean distance measure,

d(x^{[a]}, x^{[b]}) = \sqrt{\sum_{j=1}^{m} \left(x_j^{[a]} - x_j^{[b]}\right)^2},   (8)
which is a pairwise distance metric that computes the distance between two data points x[a]
and x[b] over the m input features.
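Putting the pieces together, a minimal kNN classification sketch based on Euclidean distances and a plurality vote (hypothetical names knn_classify, X_train, y_train; NumPy assumed) could look like this:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Predict the class of x_query by plurality vote among its k nearest neighbors."""
    # Euclidean distances between the query point and all training examples, Eq. (8)
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k smallest distances, i.e., the neighbor set D_k
    nearest = np.argsort(dists)[:k]
    # plurality vote over the neighbors' class labels, Eqs. (5)-(7)
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy usage with a hypothetical 2D dataset
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0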
2.4.2 Regression
The general concept of k NN for regression is the same as for classification: first, we find
the k nearest neighbors in the dataset; second, we make a prediction based on the labels
of the k nearest neighbors. However, in regression, the target function is a real- instead of
discrete-valued function,
f : \mathbb{R}^m \rightarrow \mathbb{R}.   (9)
A common approach for computing the continuous target is to compute the mean or average
target value over the k nearest neighbors,
h(x^{[q]}) = \frac{1}{k} \sum_{i=1}^{k} f(x^{[i]}).   (10)
As an alternative to averaging the target values of the k nearest neighbors to predict the
label of a query point, it is also not uncommon to use the median instead.
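Finally, a minimal sketch of kNN regression following Equation (10), with the median as the alternative aggregation mentioned above (hypothetical names knn_regress, X_train, y_train; NumPy assumed):

import numpy as np

def knn_regress(X_train, y_train, x_query, k=3, aggregate="mean"):
    """Predict a continuous target as the mean (or median) of the k nearest neighbors' targets."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # neighbor set D_k
    neighbor_targets = y_train[nearest]
    if aggregate == "median":
        return np.median(neighbor_targets)
    return np.mean(neighbor_targets)  # Eq. (10)

# toy usage with a hypothetical 1D dataset
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X_train, y_train, np.array([2.5]), k=3))                      # mean -> ~2.07
print(knn_regress(X_train, y_train, np.array([2.5]), k=3, aggregate="median"))  # median -> 1.9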