
CSC 411: Introduction to Machine Learning

Lecture 2: Nearest Neighbours

Mengye Ren and Matthew MacKay

University of Toronto



Introduction

Today (and for the next 5 weeks) we’re focused on supervised learning.
This means we’re given a training set consisting of inputs and corresponding labels.
Machine learning can be seen as learning a program: the labels are the expected outputs of the correct program when given the inputs.

Task                       Inputs            Labels
object recognition         image             object category
image captioning           image             caption
document classification    text              document category
speech-to-text             audio waveform    text
...                        ...               ...

Goal: correctly predict labels for data not in the training set (“in the wild”)
i.e. our ML algorithm must generalize



Input Vectors

Machine learning algorithms need to handle lots of types of data: images, text, audio waveforms, credit card transactions, etc.

Common strategy: represent the input as an input vector in R^d
- Representation = a mapping to another space that’s easy to manipulate
- Vectors are a great representation since we can do linear algebra! (A small sketch of turning an image into a vector follows below.)
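For instance, a small colour image can be flattened into one long vector of pixel intensities. A minimal sketch, assuming NumPy and a 32 × 32 RGB image; the array here is random stand-in data, not a real image:

    import numpy as np

    # Stand-in for a 32 x 32 RGB image with pixel values in [0, 255]
    image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

    # Flatten into an input vector x in R^d, with d = 32 * 32 * 3 = 3072
    x = image.astype(np.float64).reshape(-1)
    print(x.shape)   # (3072,)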



Input Vectors

What an image looks like to the computer:

[Image credit: Andrej Karpathy]



Input Vectors
Can use raw pixels:

Can do much better if you compute a vector of meaningful features.


Input Vectors

Mathematically, our training set consists of a collection of pairs of an input vector x ∈ R^d and its corresponding target, or label, t
- Regression: t is a real number (e.g. stock price)
- Classification: t is an element of a discrete set {1, . . . , C}
- These days, t is often a highly structured object (e.g. image)

Denote the training set {(x^(1), t^(1)), . . . , (x^(N), t^(N))}
- Note: these superscripts have nothing to do with exponentiation!



Nearest Neighbours
Suppose we’re given a novel input vector x we’d like to classify.
The idea: find the nearest input vector to x in the training set and copy its
label.
Can formalize “nearest” in terms of Euclidean distance:

    \|x - y\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}

Algorithm:
1. Find the example (x*, t*) in the stored training set closest to x. That is:

       x^* = \operatorname{argmin}_{x^{(i)} \in \text{training set}} \operatorname{dist}(x^{(i)}, x)

2. Output y = t*

Note: we don’t need to compute the square root. Why?
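A minimal 1-NN sketch along these lines, assuming NumPy; X_train, t_train, and x_query are hypothetical arrays standing in for the stored training set and a query point:

    import numpy as np

    def nearest_neighbour_predict(X_train, t_train, x_query):
        # X_train: (N, D) array of training inputs, t_train: (N,) array of labels
        # Squared distances suffice: dropping the square root never changes the argmin
        sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
        return t_train[np.argmin(sq_dists)]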


Nearest Neighbours: Decision Boundaries

We can visualize the behavior in the classification setting using a Voronoi diagram.



Nearest Neighbours: Decision Boundaries

Decision boundary: the boundary between regions of input space assigned to different categories.



Nearest Neighbours: Decision Boundaries

Example: 3D decision boundary



k-Nearest Neighbours
[Pic by Olga Veksler]

Nearest Neighbours is sensitive to noise or mis-labeled data (“class noise”).
Solution?
Smooth by having the k nearest neighbours vote

Algorithm (kNN):
1. Find the k examples {(x^(r), t^(r))}_{r=1}^{k} closest to the test instance x
2. Classification output is the majority class:

       y = \arg\max_t \sum_{r=1}^{k} \mathbb{I}[t = t^{(r)}]
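A minimal sketch of this voting rule, assuming NumPy and integer class labels 0, ..., C−1; the names are hypothetical and mirror the 1-NN sketch above:

    import numpy as np

    def knn_predict(X_train, t_train, x_query, k=1):
        # Majority vote among the k training points closest to x_query
        sq_dists = np.sum((X_train - x_query) ** 2, axis=1)
        nearest = np.argsort(sq_dists)[:k]       # indices of the k closest training points
        votes = np.bincount(t_train[nearest])    # count each class among the neighbours
        return np.argmax(votes)                  # ties resolve to the smallest class label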



K-Nearest Neighbours
k=1

[Image credit: ”The Elements of Statistical Learning”]


K-Nearest Neighbours
k=15

[Image credit: ”The Elements of Statistical Learning”]


k-Nearest Neighbours

Tradeoffs in choosing k? Remember: the goal is to correctly classify unseen examples.

Small k
- Good at capturing fine-grained patterns
- May overfit, i.e. be sensitive to random idiosyncrasies in the training data

Large k
- Makes stable predictions by averaging over lots of examples
- May underfit, i.e. fail to capture important regularities

Rule of thumb: k < sqrt(n), where n is the number of training examples



Choosing Hyperparameters using a Validation Set

k is an example of a hyperparameter: something we can’t fit as part of the learning algorithm itself, but which controls the behavior of the algorithm.
We want to choose hyperparameters based on how well the algorithm generalizes.
Thus, we separate some of our available data into a validation set, distinct from the training set.
The model’s performance on the validation set indicates how well it generalizes:
- Choose the hyperparameters which lead to the best performance (lowest error) on the validation set, as in the sketch below
- Note: error here means the number of incorrectly classified examples
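A minimal sketch of that selection loop, assuming NumPy, the knn_predict sketch above, and hypothetical X_train/t_train and X_val/t_val splits; the candidate values of k are illustrative:

    import numpy as np

    def validation_error(X_train, t_train, X_val, t_val, k):
        # Fraction of validation examples that kNN misclassifies with this k
        preds = np.array([knn_predict(X_train, t_train, x, k=k) for x in X_val])
        return np.mean(preds != t_val)

    candidate_ks = [1, 3, 5, 7, 15, 31]
    errors = [validation_error(X_train, t_train, X_val, t_val, k) for k in candidate_ks]
    best_k = candidate_ks[int(np.argmin(errors))]   # hyperparameter with the lowest validation error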



Test Set
Now the hyperparameters might have overfit to the validation set! Validation performance is then not a good assessment of the generalization of the final algorithm.
Solution: separate an additional test set from the available data and evaluate on it once the hyperparameters are chosen.
- Available data is partitioned into 3 sets: training, validation, and test (a minimal split sketch follows below)

The test set is used only at the very end, to measure the generalization performance of the final configuration.
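One way to make that three-way partition, as a minimal sketch assuming NumPy and hypothetical arrays X and t holding all N available examples; the 70/15/15 proportions are illustrative, not from the lecture:

    import numpy as np

    rng = np.random.default_rng(0)
    N = len(X)                                  # X, t: hypothetical full dataset of N examples
    perm = rng.permutation(N)                   # shuffle before splitting
    n_train, n_val = int(0.7 * N), int(0.15 * N)

    train_idx = perm[:n_train]
    val_idx = perm[n_train:n_train + n_val]
    test_idx = perm[n_train + n_val:]           # used only at the very end

    X_train, t_train = X[train_idx], t[train_idx]
    X_val, t_val = X[val_idx], t[val_idx]
    X_test, t_test = X[test_idx], t[test_idx]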
K-Nearest Neighbours

[Image credit: ”The Elements of Statistical Learning”]



The Curse of Dimensionality

Low-dimensional visualizations are misleading!
- Given a new point, we want to classify it based on a point only a small distance away
- But in high dimensions, “most” points are far apart.

At least how many points are needed to guarantee the nearest neighbor is closer than ε?
- The volume of a single ball of radius ε is O(ε^d)
- The total volume of [0, 1]^d is 1.
- Therefore O((1/ε)^d) balls are needed to cover the volume.

Assuming the data follows a uniform distribution, the training set size must grow exponentially with the number of dimensions for points to be close by!



The Curse of Dimensionality

Edge length of the hypercube required to occupy a given fraction r of the volume of the unit hypercube [0, 1]^d is r^(1/d)
- If d = 10 and r = 0.1, the edge length required is 0.1^(1/10) ≈ 0.8
- To use 10% of the data to make our decision, we must cover 80% of the range of each dimension! (A quick numeric check follows below.)
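A quick check of the r^(1/d) formula for a few dimensions; a minimal sketch, with the chosen values of d purely illustrative:

    r = 0.1                          # fraction of the unit volume we want to capture
    for d in [1, 2, 3, 10, 100]:
        edge = r ** (1.0 / d)        # edge length of the sub-cube covering fraction r
        print(f"d = {d:3d}: edge length = {edge:.3f}")
    # d = 10 gives about 0.794, i.e. roughly 80% of each dimension's range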

[Image credit: ”The Elements of Statistical Learning”]



The Curse of Dimensionality
In high dimensions, “most” points are approximately the same distance apart (see the small simulation below).

Saving grace: some datasets (e.g. images) may have low intrinsic dimension, i.e. lie on or near a low-dimensional manifold. So nearest neighbours sometimes still works in high dimensions.
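A small simulation of that concentration effect; a minimal sketch assuming NumPy, with illustrative sample sizes and dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    for d in [2, 10, 100, 1000]:
        X = rng.random((1000, d))                        # 1000 uniform points in [0, 1]^d
        dists = np.linalg.norm(X[0] - X[1:], axis=1)     # distances from one point to the rest
        print(f"d = {d:5d}: min/max distance ratio = {dists.min() / dists.max():.2f}")
    # The ratio climbs toward 1 as d grows: the nearest and farthest points look alike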



Normalization

Nearest Neighbours can be sensitive to the ranges of different features.
Often, the units are arbitrary:

Simple fix: normalize each dimension to be zero mean and unit variance.
I.e., compute the mean µ_j and standard deviation σ_j, and take

    \tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}

Caution: depending on the problem, the scale might be important!
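A minimal sketch of that standardization, assuming NumPy and the hypothetical X_train/X_val/X_test arrays from earlier; the small constant added to σ is a zero-variance guard, not something from the lecture:

    import numpy as np

    # Per-dimension statistics, computed on the training set only
    mu = X_train.mean(axis=0)                   # shape (D,)
    sigma = X_train.std(axis=0) + 1e-8          # guard against zero-variance features

    X_train_norm = (X_train - mu) / sigma
    X_val_norm = (X_val - mu) / sigma           # reuse the training statistics on other splits
    X_test_norm = (X_test - mu) / sigma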



Computational Cost

Number of computations at training time: 0

Number of computations at test time, per query (naïve algorithm):
- Calculate D-dimensional Euclidean distances to N data points: O(ND)
- Sort the distances: O(N log N)

This must be done for each query, which is very expensive by the standards of a learning algorithm!

Need to store the entire dataset in memory!

Tons of work has gone into algorithms and data structures for efficient nearest neighbours with high dimensions and/or large datasets.
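As one concrete illustration of the naïve per-query cost above, the distance computation can at least be vectorized over a batch of queries; a minimal sketch assuming NumPy, not an optimization discussed in the lecture:

    import numpy as np

    def pairwise_sq_dists(Q, X):
        # Squared Euclidean distances between every query in Q (M, D) and every
        # training point in X (N, D), via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
        q_sq = np.sum(Q ** 2, axis=1)[:, None]   # (M, 1)
        x_sq = np.sum(X ** 2, axis=1)[None, :]   # (1, N)
        return q_sq - 2.0 * (Q @ X.T) + x_sq     # (M, N)

    # For kNN, np.argpartition(dists, k, axis=1)[:, :k] finds the k smallest
    # distances per query in O(N) time, avoiding the full O(N log N) sort.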



Example: Digit Classification
Decent performance when there is lots of data



Example: Digit Classification

KNN can perform a lot better with a good similarity measure.

Example: shape contexts for object recognition. In order to achieve invariance to image transformations, they tried to warp one image to match the other image.
- Distance measure: average distance between corresponding points on the warped images
Achieved 0.63% error on MNIST, compared with 3% for Euclidean KNN.
Competitive with conv nets at the time, but required careful engineering.

[Belongie, Malik, and Puzicha, 2002. Shape matching and object recognition using shape
contexts.]
Example: 80 Million Tiny Images

80 Million Tiny Images was the first extremely large image dataset. It consisted of color images scaled down to 32 × 32.
With a large dataset, you can find much better semantic matches, and KNN can do some surprising things.
Note: this required a carefully chosen similarity metric.

[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]



Example: 80 Million Tiny Images

[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]


Conclusions

Simple algorithm that does all its work at test time — in a sense, no learning!

Can control the complexity by varying k

Suffers from the Curse of Dimensionality

Next time: decision trees, another approach to regression and classification

