
MACHINE LEARNING
CS 444
Dr. Sajid Mahmood

Slides were created by Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
[email protected]
LECTURE 2 - NEAREST-NEIGHBORS METHODS

LEARNING OBJECTIVES
● Implement the nearest-neighbor and k-nearest-neighbors algorithms.
● Compute distances between real-valued vectors as well as objects represented by categorical features.
● Define the decision boundary of the nearest-neighbor algorithm.
● Explain why kNN might not work well in high dimension.

NEAREST NEIGHBORS

HOW WOULD YOU COLOR THE BLANK CIRCLES?
PARTITIONING THE SPACE
The training data partitions the entire space.

NEAREST NEIGHBOR
● Learning:
– Store all the training examples
● Prediction:
– For x: predict the label of the training example closest to it

K NEAREST NEIGHBORS
● Learning:
– Store all the training examples
● Prediction:
– Find the k training examples closest to x
– Classification:
Majority vote: predict the most frequent label among the k neighbors.
– Regression:
Predict the average of the labels of the k neighbors.
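To make the procedure concrete, here is a minimal brute-force sketch in NumPy (not from the slides; the function and variable names are my own):

import numpy as np

def knn_predict(X_train, y_train, x, k=5, classification=True):
    # Brute force: distance from x to every stored training example
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training examples
    neighbors = np.argsort(dists)[:k]
    if classification:
        # Majority vote among the k neighbors
        labels, counts = np.unique(y_train[neighbors], return_counts=True)
        return labels[np.argmax(counts)]
    # Regression: average of the neighbors' labels
    return y_train[neighbors].mean()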
CHOICE OF K
● Small k: noisy
The idea behind using more than one neighbor is to average out the noise.
● Large k: computationally intensive
If k = n, we predict
– for classification: the majority class
– for regression: the average value
● Set k by cross-validation
● Heuristic: k ≈ √n
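With scikit-learn, one reasonable way to set k by cross-validation is a grid search; a sketch on the iris data (not the dataset used in the lab):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Try odd values of k and keep the one with the best 5-fold CV accuracy
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 31, 2))},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)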

NON-PARAMETRIC LEARNING
Non-parametric learning algorithm:
– the complexity of the decision function grows with the
number of data points.
– contrast with linear regression (≈ as many parameters as
features).
– Usually: decision function is expressed directly in terms
of the training examples.
– Examples:
● kNN (this chapter)
● tree-based methods (Chap. 9)
● SVM (Chap. 10)

INSTANCE-BASED LEARNING
● Learning:
– Store the training instances.
● Prediction:
– Compute the label for a new instance based on its similarity with the stored instances ← this is where the magic happens!
● Also called lazy learning.
● Similar to case-based reasoning:
– doctors treating a patient based on how patients with similar symptoms were treated,
– judges ruling court cases based on legal precedent.
COMPUTING DISTANCES & SIMILARITIES
DISTANCES BETWEEN INSTANCES
● Distance: a function d such that d(x, x') ≥ 0, d(x, x') = 0 ⇔ x = x', d(x, x') = d(x', x), and d(x, x'') ≤ d(x, x') + d(x', x'') (triangle inequality).
● Euclidean distance
d(x, x') = √( Σ_j (x_j − x'_j)² )
● Manhattan distance
d(x, x') = Σ_j |x_j − x'_j|
Why is this called the Manhattan distance? Because it measures the length of an axis-aligned, city-block path between x and x'.
● Lq-norm: Minkowski distance
d(x, x') = ( Σ_j |x_j − x'_j|^q )^(1/q)
– L1 = Manhattan.
– L2 = Euclidean.
– L∞ = Chebyshev distance: max_j |x_j − x'_j|.
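These distances are straightforward to compute with NumPy; a small sketch with illustrative values (not from the slides):

import numpy as np

def minkowski(x, xp, q=2):
    # Lq (Minkowski) distance: q=1 gives Manhattan, q=2 Euclidean
    return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

x, xp = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(minkowski(x, xp, q=1))    # Manhattan: 3.0
print(minkowski(x, xp, q=2))    # Euclidean: sqrt(5) ~ 2.236
print(np.max(np.abs(x - xp)))   # L-infinity (Chebyshev): 2.0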
SIMILARITY BETWEEN INSTANCES
● Pearson's correlation
ρ(x, x') = Σ_j (x_j − x̄)(x'_j − x̄') / ( √(Σ_j (x_j − x̄)²) √(Σ_j (x'_j − x̄')²) )
● Assuming the data is centered, this reduces to the cosine similarity: the dot product can be used to measure similarities.
s(x, x') = ⟨x, x'⟩ / (‖x‖ ‖x'‖)
Geometric interpretation: the cosine of the angle between x and x'.
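A sketch of both similarities in NumPy (the helper names are mine):

import numpy as np

def cosine_similarity(x, xp):
    # Cosine of the angle between x and x'
    return np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

def pearson(x, xp):
    # Pearson correlation = cosine similarity of the centered vectors
    return cosine_similarity(x - x.mean(), xp - xp.mean())

x, xp = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(pearson(x, xp))   # 1.0: the two profiles are perfectly correlated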
CATEGORICAL FEATURES
● Ex: a feature that can take 5 values
– Sports
– World
– Culture
– Internet
– Politics
● Naive encoding: x1 in {1, 2, 3, 4, 5}
– Why would Sports be closer to World than to Politics?
● One-hot encoding: x1, x2, x3, x4, x5
– Sports: [1, 0, 0, 0, 0]
– Internet: [0, 0, 0, 1, 0]
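A minimal one-hot encoding sketch (in practice a tool such as scikit-learn's OneHotEncoder would be used):

categories = ["Sports", "World", "Culture", "Internet", "Politics"]

def one_hot(value):
    # One binary feature per category; exactly one of them is set to 1
    return [1 if c == value else 0 for c in categories]

print(one_hot("Sports"))    # [1, 0, 0, 0, 0]
print(one_hot("Internet"))  # [0, 0, 0, 1, 0]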
CATEGORICAL FEATURES
● Represent an object as the list of presence/absence (or counts) of the features that appear in it.
● Example: small molecules
features = atoms and bonds of a certain type
– C, H, S, O, N...
– O-H, O=C, C-N....
BINARY REPRESENTATION
Example: x = [0 1 1 0 0 1 0 0 0 1 0 1 0 0 1]
(no occurrence of the 1st feature; one or more occurrences of the 10th feature)
● Hamming distance
Number of bits that are different: d(x, x') = Σ_j 1{x_j ≠ x'_j}
Equivalent to the Manhattan (L1) distance between the binary vectors.
● Tanimoto/Jaccard similarity
Number of shared features (normalized):
s(x, x') = Σ_j x_j x'_j / ( Σ_j x_j + Σ_j x'_j − Σ_j x_j x'_j )

COUNTS REPRESENTATION
Example: x = [0 1 2 0 0 1 0 0 0 4 0 1 0 0 7]
(no occurrence of the 1st feature; 4 occurrences of the 10th feature)
● MinMax similarity
Number of shared features (normalized):
s(x, x') = Σ_j min(x_j, x'_j) / Σ_j max(x_j, x'_j)
If x and x' are binary, MinMax and Tanimoto are equivalent.


CATEGORICAL FEATURES
● Compute the Hamming distance and the Tanimoto and MinMax similarities between these objects (binary / counts representations):
A = 100011010110 / 300011010120
B = 111011011110 / 211021011120
C = 111011010100 / 311011010100
CATEGORICAL FEATURES
● A = 100011010110 / 300011010120
● B = 111011011110 / 211021011120
● C = 111011010100 / 311011010100

● Hamming distance
d(A, B) = 3, d(A, C) = 3, d(B, C) = 2
● Tanimoto similarity
s(A, B) = 6/9 ≈ 0.67, s(A, C) = 5/8 ≈ 0.63, s(B, C) = 7/9 ≈ 0.78
● MinMax similarity
s(A, B) = 8/13 ≈ 0.62, s(A, C) = 7/11 ≈ 0.64, s(B, C) = 8/13 ≈ 0.62
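The three measures above are easy to check in NumPy; this sketch reproduces the values on the slide (helper names are mine):

import numpy as np

def hamming(x, xp):
    # Number of positions where the binary vectors differ
    return int(np.sum(x != xp))

def tanimoto(x, xp):
    # Shared features / features present in either object (binary vectors)
    both = np.sum(x * xp)
    return both / (np.sum(x) + np.sum(xp) - both)

def minmax(x, xp):
    # Sum of element-wise minima / sum of element-wise maxima (count vectors)
    return np.sum(np.minimum(x, xp)) / np.sum(np.maximum(x, xp))

def to_vec(s):
    # "100011010110" -> array([1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
    return np.array([int(c) for c in s])

A_bin, A_cnt = to_vec("100011010110"), to_vec("300011010120")
B_bin, B_cnt = to_vec("111011011110"), to_vec("211021011120")
C_bin, C_cnt = to_vec("111011010100"), to_vec("311011010100")

print(hamming(A_bin, B_bin), hamming(A_bin, C_bin), hamming(B_bin, C_bin))
# 3 3 2
print(tanimoto(A_bin, B_bin), tanimoto(A_bin, C_bin), tanimoto(B_bin, C_bin))
# 6/9 ~ 0.667, 5/8 = 0.625, 7/9 ~ 0.778
print(minmax(A_cnt, B_cnt), minmax(A_cnt, C_cnt), minmax(B_cnt, C_cnt))
# 8/13 ~ 0.615, 7/11 ~ 0.636, 8/13 ~ 0.615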
CATEGORICAL FEATURES
● When new data has unknown features: ignore them.
BACK TO NEAREST NEIGHBORS
ADVANTAGES OF KNN
● Training is very fast
– Just store the training examples.
– Can use smart indexing procedures to speed up testing (at the cost of slower training).
● Keeps the training data
– Useful if we want to do something else with it.
● Rather robust to noisy data (averaging k votes)
● Can learn complex functions
DRAWBACKS OF KNN
● Memory requirements
● Prediction can be slow.
– Complexity of labeling 1 new data point: O(np) for n training examples and p features (brute force).
But kNN works best with lots of samples...
→ Efficient data structures (k-D trees, ball-trees) trade longer construction time for faster queries.
→ Approximate solutions based on hashing.
● kNN is fooled by irrelevant attributes.
E.g. p = 1000 but only 10 features are relevant: distances become meaningless.
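As one example of such a data structure, scikit-learn's NearestNeighbors can index the training set with a k-D tree or a ball tree; a sketch on random data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(10000, 3)          # 10,000 training points in 3 dimensions

# Pay the construction cost once; each query is then much faster than brute force
index = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
distances, indices = index.kneighbors(rng.rand(1, 3))
print(indices)                  # indices of the 5 nearest training points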
DECISION BOUNDARY OF KNN
● Classification
● Decision boundary: line separating the positive from the negative regions.
● What decision boundary is the kNN building?
VORONOI TESSELLATION
● Voronoi cell of x:
– set of all points of the space closer to x than to any other point of the training set
– a polyhedron
● Voronoi tessellation of the space: union of all Voronoi cells.

Draw the Voronoi cell of the blue dot.
VORONOI TESSELLATION
● The Voronoi tessellation defines the decision boundary of the 1-NN.
● The kNN also partitions the space (in a more complex way).
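For two-dimensional data, the tessellation can be computed and plotted with SciPy; a sketch on random points (matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

rng = np.random.RandomState(0)
X = rng.rand(20, 2)             # 20 training points in the plane
vor = Voronoi(X)                # one Voronoi cell per training point
voronoi_plot_2d(vor)            # the 1-NN decision boundary follows these edges
plt.show()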
CURSE OF DIMENSIONALITY
● Remember from Chap 3
● When p ↗, the proportion of a hypercube outside of its inscribed hypersphere approaches 1.
● Volume of a p-sphere of radius r: V_p(r) = π^(p/2) r^p / Γ(p/2 + 1), which goes to 0 as p grows for fixed r.
● What this means:
– hyperspace is very big
– all points are far apart
– dimensionality reduction is needed.
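A quick numerical illustration (my own sketch): draw uniform points in the unit hypercube and count how many fall inside the inscribed hypersphere.

import numpy as np

rng = np.random.RandomState(0)
for p in [2, 5, 10, 20]:
    X = rng.rand(100000, p)                              # uniform in [0, 1]^p
    # Inscribed hypersphere: centre (0.5, ..., 0.5), radius 0.5
    inside = np.mean(np.linalg.norm(X - 0.5, axis=1) <= 0.5)
    print(p, inside)
# The fraction inside collapses as p grows:
# roughly 0.79 for p=2, 0.16 for p=5, 0.002 for p=10, essentially 0 for p=20.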
KNN VARIANTS
● ε-ball neighbors
– Instead of using the k nearest neighbors, use all points within a distance ε of the test point.
– What if there are no such points?
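scikit-learn implements this variant as RadiusNeighborsClassifier; its outlier_label argument is one way to handle test points with no neighbor inside the ε-ball. A sketch on the iris data:

from sklearn.datasets import load_iris
from sklearn.neighbors import RadiusNeighborsClassifier

X, y = load_iris(return_X_y=True)
# All training points within distance 0.5 of the test point vote;
# test points with an empty ball fall back to the most frequent class.
clf = RadiusNeighborsClassifier(radius=0.5, outlier_label="most_frequent")
clf.fit(X, y)
print(clf.predict(X[:3]))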
KNN VARIANTS
● Weighted kNN
– Weigh the vote of each neighbor according to the distance to the test point.
– Variant: learn the optimal weights [e.g. Swamidass, Azencott et al. 2009, Influence Relevance Voter]
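In scikit-learn, distance-weighted voting is available via weights="distance"; a sketch of the built-in 1/distance weighting (not the learned weights of the Influence Relevance Voter):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Closer neighbors get a larger voting weight (inverse of their distance)
clf = KNeighborsClassifier(n_neighbors=15, weights="distance")
print(cross_val_score(clf, X, y, cv=5).mean())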
COLLABORATIVE FILTERING
● Collaborative filtering: recommend items that similar users have liked in the past
(similar users = users with similar tastes)
● Item-based kNN
– similarity between items: adjusted cosine similarity
s(A, B) = Σ_u (r_uA − r̄_u)(r_uB − r̄_u) / ( √(Σ_u (r_uA − r̄_u)²) √(Σ_u (r_uB − r̄_u)²) )
where the sums run over the users u that rated both item A and item B, r_uA is the rating of item A by user u, and r̄_u is the average rating by user u.
COLLABORATIVE FILTERING
– score of item A for user u:
score(u, A) = Σ_{B ∈ N_k(A; u)} s(A, B) r_uB / Σ_{B ∈ N_k(A; u)} |s(A, B)|
where N_k(A; u) is the set of k nearest neighbors of A according to s among the items rated by user u.
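A compact sketch of the two formulas above, assuming a dense user × item rating matrix with NaN for missing ratings (function names are mine; real systems use sparse data structures):

import numpy as np

def adjusted_cosine(R, a, b):
    # Similarity of items a and b over the users who rated both
    both = ~np.isnan(R[:, a]) & ~np.isnan(R[:, b])
    if not both.any():
        return 0.0
    user_mean = np.nanmean(R[both], axis=1)      # average rating of each such user
    da, db = R[both, a] - user_mean, R[both, b] - user_mean
    denom = np.sqrt(np.sum(da ** 2)) * np.sqrt(np.sum(db ** 2))
    return np.sum(da * db) / denom if denom > 0 else 0.0

def score(R, u, a, k=2):
    # Similarity-weighted average of user u's ratings over the k items
    # most similar to item a among the items u has rated
    rated = np.where(~np.isnan(R[u]))[0]
    rated = rated[rated != a]                    # exclude the item itself
    sims = np.array([adjusted_cosine(R, a, b) for b in rated])
    top = np.argsort(-sims)[:k]
    return np.sum(sims[top] * R[u, rated[top]]) / np.sum(np.abs(sims[top]))

# Example: 3 users x 4 items, np.nan = not rated
R = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, 4.0, 1.0],
              [1.0, 1.0, np.nan, 5.0]])
print(score(R, u=0, a=2, k=2))   # predicted relevance of item 2 for user 0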
SUMMARY
● kNN
– very simple training
– prediction can be expensive
● Relies on a “good” distance/similarity between instances
● Decision boundary = Voronoi tessellation (for 1-NN)
● Curse of dimensionality: hyperspace is very big.
REFERENCES
● A Course in Machine Learning.
http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
– kNN: Chap 3.2-3.3
– Categorical variables: Chap 3.1
– Curse of dimensionality: Chap 3.5
● More on
– k-D trees:
https://www.ri.cmu.edu/pub_files/pub1/moore_andrew_1991_1/moore_andrew_1991_1.pdf
http://www.alglib.net/other/nearestneighbors.php
– Voronoi tessellation:
http://philogb.github.io/blog/2010/02/12/voronoi-tessellation/
Lab
Even though we use the same scoring strategy, we don't get the same optimum. That's because the cross-validation evaluation strategy is different: scikit-learn computes one AUC per fold and averages them.

The kNN performs much worse than the linear models. With such a large number of features, this is not unexpected.

Computing nearest neighbors based on correlation works better than based on Minkowski distances. Indeed, this makes it possible to compare gene expression profiles (which genes have high / low expression simultaneously). Still, logistic regression works best.
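A sketch of that comparison on synthetic data (not the lab's gene-expression dataset); "correlation" is a SciPy metric, so brute-force search is required:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: many features, few of them informative
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)
for metric in ["minkowski", "correlation"]:
    clf = KNeighborsClassifier(n_neighbors=11, metric=metric, algorithm="brute")
    auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean()
    print(metric, round(auc, 3))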
