Lecture 2 - Nearest-Neighbors Methods
Sajid Mahmood
CS 444
Slides were created by Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
[email protected]
LEARNING OBJECTIVES
● Implement the nearest-neighbor and k-nearest-neighbors algorithms.
● Compute distances between real-valued vectors as well as objects represented by categorical features.
● Define the decision boundary of the nearest-neighbor algorithm.
● Explain why kNN might not work well in high dimension.
NEAREST NEIGHBORS
HOW WOULD YOU COLOR THE BLANK CIRCLES?
PARTITIONING THE SPACE
The training data partitions the entire space.
NEAREST NEIGHBOR
● Learning:
– Store all the training examples.
● Prediction:
– For x: the label of the training example closest to it.
K NEAREST NEIGHBORS
● Learning:
– Store all the training examples.
● Prediction (see the sketch below):
– Find the k training examples closest to x.
– Classification: majority vote, i.e. predict the most frequent label among the k neighbors.
– Regression: predict the average of the labels of the k neighbors.
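A minimal sketch of both prediction rules in NumPy; the function name knn_predict and the brute-force Euclidean distances are illustrative choices, not the only way to implement kNN.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, task="classification"):
    # Brute-force Euclidean distances from x to every stored training example.
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest training examples and their labels.
    nn_idx = np.argsort(dists)[:k]
    nn_labels = y_train[nn_idx]
    if task == "classification":
        # Majority vote among the k neighbors.
        return Counter(nn_labels).most_common(1)[0][0]
    # Regression: average of the neighbors' labels.
    return nn_labels.mean()

# Toy usage: the 3 neighbors of (2.5, 2.5) are (2,2), (3,3), (1,1) -> labels 1, 1, 0 -> predict 1.
X_train = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y_train = np.array([0, 0, 1, 1])
knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=3)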
CHOICE OF K
● Small k: noisy.
The idea behind using more than one neighbor is to average out the noise.
● Large k: computationally intensive.
If k = n, then we predict
– for classification: the majority class,
– for regression: the average value.
● Set k by cross-validation (see the sketch below).
● Heuristic: k ≈ √n.
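One way to set k by cross-validation with scikit-learn (a sketch: GridSearchCV and KNeighborsClassifier are real classes, the breast-cancer dataset is only a convenient stand-in).

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Try odd values of k and keep the one with the best cross-validated accuracy.
param_grid = {"n_neighbors": np.arange(1, 32, 2)}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)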
NON-PARAMETRIC LEARNING
Non-parametric learning algorithm:
– the complexity of the decision function grows with the number of data points;
– contrast with linear regression (≈ as many parameters as features);
– usually, the decision function is expressed directly in terms of the training examples.
– Examples:
● kNN (this chapter)
● tree-based methods (Chap. 9)
● SVM (Chap. 10)
INSTANCE-BASED LEARNING
● Learning:
– Storing training instances.
● Predicting:
– Compute the label for a new instance based on its similarity with the stored instances: this is where the magic happens!
● Also called lazy learning.
● Similar to case-based reasoning:
– doctors treating a patient based on how patients with similar symptoms were treated,
– judges ruling court cases based on legal precedent.
COMPUTING DISTANCES & SIMILARITIES
DISTANCES BETWEEN INSTANCES
● Distance: a function d(x, x') that is non-negative, symmetric, equal to zero if and only if x = x', and satisfies the triangle inequality d(x, z) ≤ d(x, y) + d(y, z).
DISTANCES BETWEEN INSTANCES
● Euclidean distance: d(x, x') = √( Σ_j (x_j − x'_j)² )
● Manhattan distance: d(x, x') = Σ_j |x_j − x'_j|
– L1 = Manhattan.
– L2 = Euclidean.
– L∞ = Chebyshev: d(x, x') = max_j |x_j − x'_j|.
(See the sketch below.)
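The same three distances computed with NumPy (a small illustrative sketch; the two vectors are arbitrary):

import numpy as np

x  = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((x - x2) ** 2))  # L2: sqrt(1 + 4 + 0.25) ~ 2.29
manhattan = np.sum(np.abs(x - x2))          # L1: 1 + 2 + 0.5 = 3.5
chebyshev = np.max(np.abs(x - x2))          # L-infinity: max(1, 2, 0.5) = 2.0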
SIMILARITY BETWEEN INSTANCES
● Pearson's correlation:
ρ(x, x') = Σ_j (x_j − x̄)(x'_j − x̄') / ( √(Σ_j (x_j − x̄)²) · √(Σ_j (x'_j − x̄')²) )
● Geometric interpretation: for centered data (x̄ = x̄' = 0), Pearson's correlation is the cosine of the angle between the two vectors.
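A small check of the geometric interpretation in NumPy (the vectors are arbitrary; np.corrcoef is NumPy's built-in Pearson correlation):

import numpy as np

x  = np.array([2.0, 4.0, 6.0, 3.0])
x2 = np.array([1.0, 5.0, 4.0, 2.0])

# Pearson correlation.
pearson = np.corrcoef(x, x2)[0, 1]

# Cosine of the angle between the centered vectors: same value.
xc, x2c = x - x.mean(), x2 - x2.mean()
cosine = xc @ x2c / (np.linalg.norm(xc) * np.linalg.norm(x2c))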
CATEGORICAL FEATURES
● Ex: a feature that can take 5 values:
– Sports
– World
– Culture
– Internet
– Politics
● Naive encoding: x1 in {1, 2, 3, 4, 5}:
– Why is Sports closer to World than Politics?
● One-hot encoding: x1, x2, x3, x4, x5 (see the sketch below):
– Sports: [1, 0, 0, 0, 0]
– Internet: [0, 0, 0, 1, 0]
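A minimal one-hot encoding sketch in NumPy (the function name one_hot and the category list are illustrative):

import numpy as np

categories = ["Sports", "World", "Culture", "Internet", "Politics"]

def one_hot(value, categories):
    # Vector of zeros with a single 1 at the position of the category.
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(value)] = 1
    return vec

one_hot("Sports", categories)    # array([1, 0, 0, 0, 0])
one_hot("Internet", categories)  # array([0, 0, 0, 1, 0])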
CATEGORICAL FEATURES
● Represent an object as the list of presence/absence (or counts) of the features that appear in it.
● Example: small molecules; features = atoms and bonds of a certain type
– C, H, S, O, N...
– O-H, O=C, C-N...
BINARY REPRESENTATION
Example: x = [0 1 1 0 0 1 0 0 0 1 0 1 0 0 1]
(1st entry = 0: no occurrence of the 1st feature; 10th entry = 1: one or more occurrences of the 10th feature)
● Hamming distance: number of bits that are different.
Equivalent to the Manhattan (L1) distance on binary vectors.
● Tanimoto/Jaccard similarity: number of shared features (normalized):
s(x, x') = (number of features present in both x and x') / (number of features present in at least one of them).
COUNTS REPRESENTATION
Example: x = [0 1 2 0 0 1 0 0 0 4 0 1 0 0 7]
(1st entry = 0: no occurrence of the 1st feature; 10th entry = 4: number of occurrences of the 10th feature)
● MinMax similarity: number of shared features (normalized):
s(x, x') = Σ_j min(x_j, x'_j) / Σ_j max(x_j, x'_j).
CATEGORICAL FEATURES
● A = 100011010110 (binary) / 300011010120 (counts)
● B = 111011011110 (binary) / 211021011120 (counts)
● C = 111011010100 (binary) / 311011010100 (counts)
● Hamming distance:
d(A, B) = 3    d(A, C) = 3    d(B, C) = 2
● Tanimoto similarity:
s(A, B) = 6/9 = 0.67    s(A, C) = 5/8 = 0.63    s(B, C) = 7/9 = 0.78
● MinMax similarity (see the sketch below):
s(A, B) = 8/13 = 0.62    s(A, C) = 7/11 = 0.64    s(B, C) = 8/13 = 0.62
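The three measures can be checked with a short NumPy sketch (the helper names are illustrative); it reproduces the values above.

import numpy as np

def vec(s):
    # Turn a digit string such as "100011010110" into an integer vector.
    return np.array([int(c) for c in s])

A_bin, A_cnt = vec("100011010110"), vec("300011010120")
B_bin, B_cnt = vec("111011011110"), vec("211021011120")
C_bin, C_cnt = vec("111011010100"), vec("311011010100")

def hamming(x, y):
    # Number of positions where the bits differ.
    return int(np.sum(x != y))

def tanimoto(x, y):
    # Shared features over features present in at least one of the two.
    return np.sum(x & y) / np.sum(x | y)

def minmax(x, y):
    # Sum of elementwise minima over sum of elementwise maxima.
    return np.sum(np.minimum(x, y)) / np.sum(np.maximum(x, y))

hamming(A_bin, B_bin)    # 3
tanimoto(A_bin, B_bin)   # 6/9 ~ 0.67
minmax(A_cnt, B_cnt)     # 8/13 ~ 0.62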
CATEGORICAL FEATURES
● When new data has unknown features: ignore them.
BACK TO NEAREST NEIGHBORS
ADVANTAGES OF KNN
● Training is very fast:
– just store the training examples;
– smart indexing procedures can be used to speed up prediction (at the cost of slower training).
● Keeps the training data:
– useful if we want to do something else with it.
● Rather robust to noisy data (averaging k votes).
● Can learn complex functions.
DRAWBACKS OF KNN
● Memory requirements.
● Prediction can be slow:
– complexity of labeling 1 new data point: O(nd) by brute force, for n training examples in d dimensions;
– but kNN works best with lots of samples...
→ Efficient data structures (k-d trees, ball trees): construction in O(n log n) time, queries in O(log n) on average in low dimension (see the sketch below).
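A sketch of the indexing idea with scikit-learn's KDTree (a real class; the random data is only for illustration):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))   # 1000 training points in 3 dimensions

tree = KDTree(X_train)                 # built once, at "training" time
x_new = rng.normal(size=(1, 3))
dist, ind = tree.query(x_new, k=5)     # distances and indices of the 5 nearest neighbors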
● Exercise: draw the Voronoi cell of the blue dot.
VORONOI TESSELLATION
● Voronoi cell of x:
– the set of all points of the space closer to x than to any other point of the training set;
– a polyhedron.
● Voronoi tessellation of the space: the union of all Voronoi cells.
VORONOI TESSELLATION
● The Voronoi tessellation defines the decision boundary of the 1-nearest-neighbor classifier (see the sketch below).
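The tessellation itself can be computed with SciPy (scipy.spatial.Voronoi is a real class; the random points are only for illustration). The 1-NN decision regions are then unions of the cells of same-class training points.

import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(0)
X_train = rng.uniform(size=(20, 2))    # 20 training points in the plane

vor = Voronoi(X_train)
# vor.vertices: corners of the Voronoi cells
# vor.point_region / vor.regions: which cell belongs to which training point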
SUMMARY
● kNN:
– very simple training;
– prediction can be expensive.
● Relies on a “good” distance/similarity between instances.
● Decision boundary = Voronoi tessellation.
● Curse of dimensionality: hyperspace is very big.
REFERENCES
● A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
– kNN: Chap. 3.2-3.3
– Categorical variables: Chap. 3.1
– Curse of dimensionality: Chap. 3.5
● More on:
– k-d trees:
https://www.ri.cmu.edu/pub_files/pub1/moore_andrew_1991_1/moore_andrew_1991_1.pdf
http://www.alglib.net/other/nearestneighbors.php
– Voronoi tessellation:
http://philogb.github.io/blog/2010/02/12/voronoi-tessellation/
Lab
Even though we use the same scoring strategy, we don't get the same optimum. That's because the cross-validation evaluation strategy is different: scikit-learn computes one AUC per fold and averages them.
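For instance (a sketch; the dataset and model here are only placeholders), scikit-learn's cross_val_score returns one AUC per fold, and the cross-validated score is their average:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

fold_aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")  # one AUC per fold
print(fold_aucs, fold_aucs.mean())                                 # their average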
The kNN performs much worse than the linear models. With such a large number of features, this is not unexpected.
Computing nearest neighbors based on correlation works better than based on Minkowski distances. Indeed, this makes it possible to compare the profiles of the gene expressions (which genes have high expression / low expression simultaneously). Still, logistic regression works best.