k-NN Annotated Slides
Announcement

Plan
Introduction

The k-NN Algorithm - Formal Definition
Input: a classification training dataset D = {(x_1, y_1), ..., (x_n, y_n)}, a parameter k ∈ N_+, and a distance metric d(x, x') (e.g., ||x − x'||_2, the Euclidean distance).

[Figure: 3-NN for binary classification using Euclidean distance]
k-NN Algorithm
Store all training data.
For any test point x:
1. Find its top k nearest neighbors (under metric d).
2. Return the most common label among these k neighbors (for regression, return the average value of the k neighbors).
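The two steps above translate directly into code. Below is a minimal NumPy sketch of the classifier as defined on this slide; the names (knn_predict, X_train, y_train) and the toy data are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point by majority vote among its k nearest
    training points under the Euclidean distance."""
    # 1. Distance from x_test to every stored training point.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # 2. Indices of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote over their labels (for regression, use the mean instead).
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: 3-NN for binary classification, as in the slide's figure.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> 1
```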
The choice of metric
[Figure: distance metric with r = 1]
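Only the label r = 1 survives from the figure, so I am assuming the metric family in question is the Minkowski (L_r) distance, which gives the Manhattan distance at r = 1 and the Euclidean distance at r = 2. The sketch below (my own illustration, not from the slides) shows that the choice of metric can change which point counts as the nearest neighbor.

```python
import numpy as np

def minkowski(x, z, r):
    """Minkowski distance: (sum_i |x_i - z_i|^r)^(1/r).
    r = 1 is the Manhattan distance, r = 2 the Euclidean distance."""
    return np.sum(np.abs(x - z) ** r) ** (1.0 / r)

x = np.array([0.0, 0.0])
a = np.array([0.9, 0.9])   # candidate neighbor A
b = np.array([0.0, 1.3])   # candidate neighbor B

for r in (1, 2):
    da, db = minkowski(x, a, r), minkowski(x, b, r)
    print(f"r={r}: d(x,A)={da:.2f}, d(x,B)={db:.2f} -> nearest is {'A' if da < db else 'B'}")
# r=1: d(x,A)=1.80, d(x,B)=1.30 -> nearest is B
# r=2: d(x,A)=1.27, d(x,B)=1.30 -> nearest is A
```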
The choice of k

1. What if we set k very large? The top k neighbors will include examples that are very far away...
2. What if we set k very small (k = 1)? The training error becomes 0.
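One common way to navigate this trade-off, sketched below under the assumption that scikit-learn is available (the sketch and its data are not from the slides), is to compare training and validation accuracy across several values of k: k = 1 memorizes the training set (zero training error), while a very large k averages over far-away neighbors and drifts toward predicting the overall majority class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

for k in (1, 5, 25, 125):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train acc={clf.score(X_tr, y_tr):.2f}  "
          f"val acc={clf.score(X_val, y_val):.2f}")
# k = 1 reproduces every training label (training error 0); as k grows toward
# the training-set size, the neighborhood includes points that are far away.
```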
1-Nearest Neighbors Decision Boundary (Cont)

Plan
Bayes Optimal Predictor

• Assume our data is collected in an i.i.d. fashion, i.e., (X, Y) ∼ P (say y ∈ {−1, 1})
• Bayes optimal predictor: h_opt(x) = arg max_y P(y|x)
• Example: P(+1|x) = 0.8, P(−1|x) = 0.2
  Question: What's the probability of h_opt making a mistake on x?
  Answer: ŷ = h_opt(x) = 1, so ε_BayesOpt = 1 − P(ŷ|x) = 0.2
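In general (a standard identity spelled out here for completeness; the general formula is not in the extracted slide text), the pointwise error of the Bayes optimal predictor is one minus the largest conditional class probability:

$$
\varepsilon_{\text{BayesOpt}}(x) = P\big(Y \neq h_{\text{opt}}(x) \mid X = x\big) = 1 - \max_{y} P(y \mid x), \qquad \text{here } 1 - \max\{0.8,\, 0.2\} = 0.2 .
$$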
Guarantee of k-NN when k = 1 and n → ∞
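The body of this slide did not survive extraction. The classical result usually stated under this heading is the Cover–Hart guarantee, paraphrased here for reference rather than quoted from the slides: for binary classification, as n → ∞ the error of the 1-NN rule is at most twice the Bayes optimal error,

$$
\lim_{n \to \infty} \varepsilon_{1\text{-NN}} \;\le\; 2\,\varepsilon_{\text{BayesOpt}} .
$$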
Curse of Dimensionality Explanation

• Example (Cont): If n = 1000, how big is l?

      p  |        l
  -------+----------
      2  | 0.100000
     10  | 0.630957
    100  | 0.954993
   1000  | 0.995405

• If p ≫ 0, almost the entire space is needed to find the 10-NN → this breaks down the k-NN assumption.
• Question: Could we increase the number of data points, n, until the nearest neighbors are truly close to the test point? How many data points would we need so that l becomes truly small?

The distance between two sampled points increases as p grows

In [0, 1]^p, we uniformly sample two points x, x' and calculate d(x, x') = ||x − x'||_2. Let's plot the distribution of such distances.

[Figure: distance increases as p → ∞]
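The table values are consistent with the volume argument l = (k/n)^(1/p) for k = 10 and n = 1000 (my reading; the formula itself is not in the extracted text), and the distance distribution can be reproduced by direct simulation:

```python
import numpy as np

k, n = 10, 1000

# Edge length l of a hypercube in [0, 1]^p expected to contain the k nearest
# of n uniformly distributed points: l^p ≈ k/n, so l = (k/n)^(1/p).
for p in (2, 10, 100, 1000):
    print(f"p={p:5d}  l={(k / n) ** (1.0 / p):.6f}")  # 0.100000, 0.630957, ...

# Distribution of the Euclidean distance between two uniform points in [0, 1]^p.
rng = np.random.default_rng(0)
for p in (2, 10, 100, 1000):
    x = rng.uniform(size=(2000, p))
    z = rng.uniform(size=(2000, p))
    d = np.linalg.norm(x - z, axis=1)
    print(f"p={p:5d}  mean distance={d.mean():.2f}  std={d.std():.2f}")
# For large p the mean distance grows like sqrt(p/6): sampled points spread far
# apart, so "nearest" neighbors are no longer close.
```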
Strengths, Weaknesses

• Strengths
  • k-NN is the simplest ML algorithm (a very good baseline, should always try it!)
  • It is very easy to understand and often gives reasonable performance without a lot of adjustment.
    → It is a good baseline method to try before considering more advanced techniques.
  • No training involved ("lazy"). New training examples can be added easily.
  • Works well when data is low-dimensional (e.g., can be compared against the Bayes optimal).
• Weaknesses
  • Data needs to be pre-processed.
  • Suffers when data is high-dimensional, because in high-dimensional space data points tend to spread far away from each other.
  • It is expensive and slow: to determine the nearest neighbor of a new point x, we must compute the distance to all n training examples.