
DM & BDA

k-Nearest Neighbours
Content

1 Motivation

2 Nearest Neighbours for Classification

3 Nearest Neighbours for Regression

4 The Curse of Dimensionality

5 Advantages & Shortcomings

6 Real-Life Applications

Reading: Shmueli et al., §7


Lantz, §3
Motivation

“Dark dining” restaurants:

You are served a tomato.

How do you recognise what you are eating?


Motivation

Previous food experience:


Ingredient   Sweetness   Crunchiness   Food Type
…            10          9             fruit
…            1           4             protein
…            10          1             fruit
…            7           10            vegetable
…            3           10            vegetable
…            1           1             protein
…            …           …             …
Motivation

Scatterplot in the sweetness-crunchiness plane:


[Scatterplot: the foods plotted with sweetness on the x-axis and crunchiness on the y-axis]
Motivation

Similar food types tend to be clustered together:


[Scatterplot: sweetness vs. crunchiness, with fruits, vegetables and proteins forming separate clusters]
Motivation

Let’s look at tomatoes:

[Scatterplot: the tomato and its 1st, 2nd and 3rd nearest neighbours in the sweetness-crunchiness plane]

1-Nearest Neighbour: the tomato is closest to oranges; oranges are fruits, hence tomatoes are fruits.

2-Nearest Neighbour: the tomato is closest to oranges and grapes; oranges and grapes are fruits, hence tomatoes are fruits.

3-Nearest Neighbour: the tomato is closest to oranges, grapes and nuts; 2/3 of the nearest neighbours are fruits, hence tomatoes are fruits.
Content

1 Motivation

2 Nearest Neighbours for Classification

3 Nearest Neighbours for Regression

4 The Curse of Dimensionality

5 Advantages & Shortcomings

6 Real-Life Applications

Reading: Shmueli et al., §7.1


Lantz, §3
k-Nearest Neighbours for Classification

Input
n training samples (Xi1, …, Xip; Yi), i = 1, …, n, with Xi1, …, Xip
predictors and Yi the categorical outcome.
New data point (X1, …, Xp) whose outcome should be found.

Algorithm
1 Find the k training samples with the smallest distance
dist((Xi1 , . . . , Xip ), (X1 , . . . , Xp )).

2 Assign to the new data point (X1, …, Xp) the majority category
among these k training samples.
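To make the two steps concrete, here is a minimal Python sketch of the algorithm (not part of the original note); it uses Euclidean distance via math.dist and the (sweetness, crunchiness) values from the food table, with function and variable names chosen purely for illustration.

```python
from collections import Counter
import math

def knn_classify(train_X, train_y, new_x, k=3):
    """Classify new_x by majority vote among its k nearest training samples."""
    # Distance from new_x to every training sample (Euclidean; math.dist needs Python 3.8+).
    distances = [(math.dist(x, new_x), label) for x, label in zip(train_X, train_y)]
    # Step 1: keep the k training samples with the smallest distance.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 2: assign the majority category among these k samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy version of the food example: (sweetness, crunchiness) -> food type
train_X = [(10, 9), (1, 4), (10, 1), (7, 10), (3, 10), (1, 1)]
train_y = ["fruit", "protein", "fruit", "vegetable", "vegetable", "protein"]

print(knn_classify(train_X, train_y, new_x=(6, 4), k=3))  # prediction for a new food
```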
k-Nearest Neighbours for Classification

Input
n training samples (Xi1, …, Xip; Yi), i = 1, …, n, with Xi1, …, Xip predictors and Yi the outcome.
New data point (X1, …, Xp) whose category should be found.

Algorithm
1 Find the k training samples with the smallest distance dist((Xi1, …, Xip), (X1, …, Xp)).
2 Assign to the new data point (X1, …, Xp) the majority category among these k training samples.

Questions:
1 Which distance function should we use?
2 What about binary/categorical predictors?
3 How should we choose k?
(1) Distance Functions

Typically, distances are measured by a q-norm:

$$\mathrm{dist}_q\big((X_{i1},\dots,X_{ip}),(X_1,\dots,X_p)\big) = \Big(\sum_{j=1}^{p} |X_{ij}-X_j|^q\Big)^{1/q}$$

q = 1 (Manhattan/taxicab distance):

$$\mathrm{dist}_1\big((X_{i1},\dots,X_{ip}),(X_1,\dots,X_p)\big) = \sum_{j=1}^{p} |X_{ij}-X_j|$$

q = 2 (Euclidean distance):

$$\mathrm{dist}_2\big((X_{i1},\dots,X_{ip}),(X_1,\dots,X_p)\big) = \sqrt{\sum_{j=1}^{p} (X_{ij}-X_j)^2}$$

q = ∞ (maximum distance):

$$\mathrm{dist}_\infty\big((X_{i1},\dots,X_{ip}),(X_1,\dots,X_p)\big) = \max\{|X_{ij}-X_j| : j=1,\dots,p\}$$
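As a quick illustration (not from the note), the three distances can be computed directly from their definitions; the vectors below reuse the (sweetness, crunchiness) encoding and are purely illustrative.

```python
def manhattan(x, z):
    # q = 1: sum of absolute coordinate differences
    return sum(abs(xj - zj) for xj, zj in zip(x, z))

def euclidean(x, z):
    # q = 2: square root of the sum of squared differences
    return sum((xj - zj) ** 2 for xj, zj in zip(x, z)) ** 0.5

def maximum(x, z):
    # q = infinity: largest absolute coordinate difference
    return max(abs(xj - zj) for xj, zj in zip(x, z))

x, z = (10, 9), (6, 4)      # e.g. (sweetness, crunchiness) of two foods
print(manhattan(x, z))      # 9
print(euclidean(x, z))      # about 6.40
print(maximum(x, z))        # 5
```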
(1) Distance Functions

[Figure: illustrations of the Manhattan distance, the Euclidean distance and the maximum distance]
(1) Scaling: Motivation

Nearest Neighbour methods are sensitive to scaling:

Imagine we add a “spiciness” dimension to our food example.
We measure spiciness according to the Scoville scale, whose values range from 0 into the hundreds of thousands, far beyond the 1 to 10 range of sweetness and crunchiness.

What would happen to our Nearest Neighbour classifier?
(1) Scaling: Methods

1 Min-max normalization: works well if the predictor is roughly uniform

$$X_{ij}^{\text{new}} = \frac{X_{ij}^{\text{old}} - \min\{X_{ij}^{\text{old}} : i=1,\dots,n\}}{\max\{X_{ij}^{\text{old}} : i=1,\dots,n\} - \min\{X_{ij}^{\text{old}} : i=1,\dots,n\}}$$

2 Z-score normalization: good in the presence of outliers

$$X_{ij}^{\text{new}} = \frac{X_{ij}^{\text{old}} - \mu_j}{\sigma_j} \quad\text{with}\quad \mu_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}^{\text{old}} \quad\text{and}\quad \sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(X_{ij}^{\text{old}} - \mu_j\big)^2}$$

The same transformation must be applied to the new data point (X1, …, Xp)!
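A minimal sketch of both normalizations for a single feature column, assuming plain Python lists (the helper names are invented for this example); note that the parameters estimated on the training column are reused for the new data point, exactly as the slide demands.

```python
def min_max_params(column):
    return min(column), max(column)

def min_max_scale(value, lo, hi):
    # Maps the training range [lo, hi] onto [0, 1].
    return (value - lo) / (hi - lo)

def z_score_params(column):
    n = len(column)
    mu = sum(column) / n
    sigma = (sum((v - mu) ** 2 for v in column) / n) ** 0.5   # population std, as in the formula
    return mu, sigma

def z_score_scale(value, mu, sigma):
    return (value - mu) / sigma

sweetness = [10, 1, 10, 7, 3, 1]                      # training column
lo, hi = min_max_params(sweetness)
scaled_train = [min_max_scale(v, lo, hi) for v in sweetness]
scaled_new = min_max_scale(6, lo, hi)                 # SAME lo/hi applied to the new data point
```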
(2) Binary and Categorical Predictors

Consider the binary predictor “male/female”:

The predictor can be replaced with 1 dummy numerical variable:

$$\text{male} = \begin{cases} 1 & \text{if male,} \\ 0 & \text{if female.} \end{cases}$$

Consider a 3-category temperature predictor “hot/medium/cold”:

1 The predictor can be replaced with 1 numerical variable:

$$\text{temp} = \begin{cases} 3 & \text{if hot,} \\ 2 & \text{if medium,} \\ 1 & \text{if cold.} \end{cases}$$

2 The predictor can be replaced with 2/3 dummy numerical variables:

$$\text{is\_hot} = \begin{cases} 1 & \text{if hot,} \\ 0 & \text{otherwise,} \end{cases} \qquad \text{is\_med} = \begin{cases} 1 & \text{if medium,} \\ 0 & \text{otherwise.} \end{cases}$$
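To make the encodings concrete, a small Python sketch (not from the note); the mapping and the dummy names is_hot/is_med mirror the definitions above and are otherwise arbitrary.

```python
# Option 1: one ordered numerical variable
temp_as_number = {"hot": 3, "medium": 2, "cold": 1}

# Option 2: dummy (indicator) variables; the remaining category is implied by all zeros
def temp_as_dummies(value):
    return {
        "is_hot": 1 if value == "hot" else 0,
        "is_med": 1 if value == "medium" else 0,
    }

print(temp_as_number["medium"])   # 2
print(temp_as_dummies("cold"))    # {'is_hot': 0, 'is_med': 0}
```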

(3) How to Choose k:
The Bias/Variance Trade-Off
Consider the dataset:

[Figure: the resulting k-NN classifier for k = 1 and for k = 99]

What happens if k approaches n?


(3) How to Choose k:
The Bias/Variance Trade-Off

[Plot: reducible error against model complexity (from large k on the left to small k on the right). Bias falls and variance rises with model complexity; the total error is minimised at an intermediate, optimal k.]

bias: large k converges towards the “training average” (underfitting)

variance: small k reacts to noise/outliers (overfitting)

rule of thumb: k = √n often gives good performance
(3) How to Choose k:
The Bias/Variance Trade-Off
The bias/variance trade-off in machine learning

[Figure: illustrations of the four combinations of low/high bias and low/high variance]
(3) How to Choose k:
Validation and Test Sets
Naïve idea: choose k that performs best on the training data
• For each k = 1, …, n (or a suitable subset), check what percentage of the training samples is correctly classified by the k-Nearest Neighbours method over all training data.
• Choose the parameter k that gives the best result.

[Plot: classification error on the training data (y-axis, 5% to 25%) against the parameter k (x-axis, 120 down to 0); the best estimator for the training data is the training data itself!]
(3) How to Choose k:
Validation and Test Sets
Better idea: split samples into training and validation sets!

[Diagram: the samples are split into 80% training data and 20% validation data]

Remove e.g. 20%, 1/3 or 1/2 of your training samples and use them as validation samples.

For each k = 1, …, n (or a suitable subset), check what percentage of the validation samples is correctly classified by the k-Nearest Neighbours method.

Choose the parameter k that gives the best result on the validation set.
(3) How to Choose k:
Validation and Test Sets
Better idea: split samples into training and validation sets!

[Diagram: 80% training data / 20% validation data. Plot: classification error on the validation data against the parameter k; the optimum lies somewhere at an intermediate k]
(3) How to Choose k:
Validation and Test Sets
Better idea: split samples into training and validation sets!

[Diagram: 80% training data / 20% validation data. Plot: classification error on the validation data against the parameter k; the optimum lies somewhere at an intermediate k]

Can we expect the same performance on new data?
(3) How to Choose k:
Validation and Test Sets
Estimate of generalization error: use test set!

[Diagram: the samples are split into 60% training data, 20% validation data and 20% test data]

Remove e.g. 40% of your training samples; use one half as validation samples, the other half as test samples.

For each k = 1, …, n (or a suitable subset), check what percentage of the validation samples is correctly classified by the k-Nearest Neighbours method.

Choose the parameter k that gives the best result.

For the optimal k, check what percentage of the test samples is correctly classified; this gives the generalization error.
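A sketch of this train/validation/test procedure (not from the note); it reuses the hypothetical knn_classify function from the earlier classification sketch, and the 60/20/20 split plus the candidate values of k are assumptions chosen for illustration.

```python
import random

def accuracy(train_X, train_y, eval_X, eval_y, k):
    hits = sum(knn_classify(train_X, train_y, x, k) == y
               for x, y in zip(eval_X, eval_y))
    return hits / len(eval_y)

def split_60_20_20(X, y, seed=0):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    a, b = int(0.6 * len(idx)), int(0.8 * len(idx))
    take = lambda ids: ([X[i] for i in ids], [y[i] for i in ids])
    return take(idx[:a]), take(idx[a:b]), take(idx[b:])

def choose_k(X, y, candidate_ks=(1, 3, 5, 7, 9)):
    (tr_X, tr_y), (va_X, va_y), (te_X, te_y) = split_60_20_20(X, y)
    # Pick the k with the best accuracy on the validation set ...
    best_k = max(candidate_ks, key=lambda k: accuracy(tr_X, tr_y, va_X, va_y, k))
    # ... and estimate the generalization error on the untouched test set.
    generalization_error = 1 - accuracy(tr_X, tr_y, te_X, te_y, best_k)
    return best_k, generalization_error
```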
(3) How to Choose k:
Validation and Test Sets
Estimate of generalization error: use test set!

[Diagram: 60% training data / 20% validation data / 20% test data. Plot: classification error against the parameter k, with the estimate of the generalization error (from the test set) marked at the chosen k]
k-Nearest Neighbours: Algorithm Variants

1 Propensities (“confidence values”) for predictions:

• The k-NN method assigns the majority category Y among the k nearest training samples to a new data point.
• We can define the following propensity for this assignment:

$$100\% \cdot \frac{\#\text{ of nearest neighbours with category } Y}{k} \;\in\; (0\%, 100\%]$$

2 Cutoff values (“don’t know’s”):

• We can specify a minimum confidence under which the k-NN method does not assign any category (“I don’t know!”).
• Useful if wrong categorization is costly (e.g. HIV diagnosis).
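A sketch combining both variants (not from the note): it returns the majority category together with its propensity and withholds the prediction when the propensity falls below a cutoff; names and defaults are illustrative.

```python
from collections import Counter
import math

def knn_classify_with_propensity(train_X, train_y, new_x, k=3, cutoff=0.0):
    # k nearest training samples by Euclidean distance
    nearest = sorted((math.dist(x, new_x), label)
                     for x, label in zip(train_X, train_y))[:k]
    votes = Counter(label for _, label in nearest)
    category, count = votes.most_common(1)[0]
    propensity = count / k              # share of the k neighbours carrying the majority category
    if propensity < cutoff:
        return None, propensity         # "I don't know!"
    return category, propensity
```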
Content

1 Motivation

2 Nearest Neighbours for Classification

3 Nearest Neighbours for Regression

4 The Curse of Dimensionality

5 Advantages & Shortcomings

6 Real-Life Applications

Reading: Shmueli et al., §7.2


Lantz, §3
k-Nearest Neighbours for Regression

Input
n training samples (Xi1, …, Xip; Yi), i = 1, …, n, with Xi1, …, Xip
predictors and Yi the numerical outcome.
New data point (X1, …, Xp) whose outcome should be found.

Algorithm
1 Find the k training samples with the smallest distance
dist((Xi1 , . . . , Xip ), (X1 , . . . , Xp )).

2 Assign to the new data point (X1, …, Xp) the average of the
numerical outcomes among these k training samples.
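A minimal sketch of the regression variant (not from the note); it is identical to the earlier classification sketch except that step 2 averages the numerical outcomes, and the names are again illustrative.

```python
import math

def knn_regress(train_X, train_y, new_x, k=3):
    # Step 1: the k training samples closest to new_x (Euclidean distance).
    nearest = sorted((math.dist(x, new_x), y)
                     for x, y in zip(train_X, train_y))[:k]
    # Step 2: predict the average numerical outcome of these k samples.
    return sum(y for _, y in nearest) / k
```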
Choice of k in Regression Problems

Choice of parameter k:

The parameter k can be chosen in the same way as for a classification problem (using a validation set or validation/test sets).

We replace the classification error with the mean square error:

$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m} \big(\hat{y}_i - y_i\big)^2$$

where m is the number of samples in the validation or test set, $\hat{y}_i$ is the predicted numerical response for sample i, and $y_i$ is the actual numerical response for sample i.
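The MSE criterion in code form, a sketch that assumes the hypothetical knn_regress function above and an already prepared training/validation split; the candidate values of k are arbitrary.

```python
def mse(y_pred, y_true):
    # Mean square error over the m validation (or test) samples.
    m = len(y_true)
    return sum((yh - y) ** 2 for yh, y in zip(y_pred, y_true)) / m

def choose_k_regression(tr_X, tr_y, va_X, va_y, candidate_ks=(1, 3, 5, 7, 9)):
    def validation_mse(k):
        preds = [knn_regress(tr_X, tr_y, x, k) for x in va_X]
        return mse(preds, va_y)
    # Keep the k with the smallest MSE on the validation set.
    return min(candidate_ks, key=validation_mse)
```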
Example: Pandora
music streaming & recommendation service that adapts to your taste

musicians characterize songs by up to 450 features on a 0-5 scale (the Music Genome Project)

Using a modified k-NN algorithm, Pandora recommends songs that are (dis-)similar to the ones previously (dis-)liked by the user
Content

1 Motivation

2 Nearest Neighbours for Classification

3 Nearest Neighbours for Regression

4 The Curse of Dimensionality

5 Advantages & Shortcomings

6 Real-Life Applications
Reading: Lantz, §3
The Curse of Dimensionality (1): Fixed-Size
Training Sets Don’t Cover the Space
Assumptions:

5,000 training samples with p numerical features in [0,1]


training samples are uniformly distributed on [0,1]^p
we measure distances by the maximum norm

Question: How close are the 4 nearest neighbours of a point?

Approximate answer: We need to find the smallest hypercube that covers 1/1,000 of the volume of the [0,1]^p hypercube.

the [0,1]^p hypercube has volume 1

a hypercube with side length c has volume c^p

$$c^p = \frac{1}{1000} \quad\Longrightarrow\quad c = \Big(\frac{1}{1000}\Big)^{1/p}$$
The Curse of Dimensionality (1): Fixed-Size
Training Sets Don’t Cover the Space
$$c = \Big(\frac{1}{1000}\Big)^{1/p}$$

[Plot: the required side length c (y-axis, 0 to 1) against the number of dimensions p (x-axis, 0 to 100)]

In 3 dimensions, we need 10% side length

In 100 dimensions, we need 93% side length
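A quick numerical check of the formula (a sketch, not from the note) reproduces the two values quoted on the chart.

```python
def side_length(p, volume_fraction=1/1000):
    # Side length of the hypercube that covers the given fraction of [0,1]^p.
    return volume_fraction ** (1 / p)

for p in (1, 2, 3, 10, 100):
    print(p, round(side_length(p), 3))
# p = 3   -> 0.1   (10% side length)
# p = 100 -> 0.933 (about 93% side length)
```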
The Curse of Dimensionality (1): Fixed-Size
Training Sets Don’t Cover the Space

Filling 50% of the volume takes side lengths of…

50% in R^1, 71% in R^2, 80% in R^3
The Curse of Dimensionality (2):
Similarity Breaks Down in High Dimensions
The volume of a high-dimensional orange
is concentrated in the skin, not the pulp
Volume of an n-dimensional ball of radius R:

$$V_n(R) = \frac{\pi^{n/2}}{\Gamma\big(\frac{n}{2}+1\big)}\, R^n,$$

where Γ is Euler’s gamma function. For a skin of thickness ε:

volume outside (skin): $\dfrac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}\,\big[R^n - (R-\varepsilon)^n\big]$

volume inside (pulp): $\dfrac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}\,(R-\varepsilon)^n$

$$\frac{\frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}\big[R^n-(R-\varepsilon)^n\big]}{\frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}\,R^n} = 1 - \Big(1-\frac{\varepsilon}{R}\Big)^n \;\longrightarrow\; 1 \quad \text{as } n \to \infty.$$
The Curse of Dimensionality (2):
Similarity Breaks Down in High Dimensions
The volume of a high-dimensional orange is concentrated in the skin, not the pulp

Volume of an n-dimensional ball of radius R: $V_n(R) = \dfrac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}\,R^n$, where Γ is Euler’s gamma function.

Consequence: If the training data is high-dimensional and each feature is uniformly distributed, the samples in fact live in a thin shell away from the centre!
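A short numerical illustration of the "orange" argument (a sketch, not from the note): the constant factor cancels, so the fraction of the ball's volume in the skin is 1 - (1 - ε/R)^n; the skin thickness of 5% of the radius is an arbitrary choice for this example.

```python
def skin_fraction(n, eps_over_R=0.05):
    # Fraction of an n-dimensional ball's volume within distance eps of the surface.
    return 1 - (1 - eps_over_R) ** n

for n in (2, 10, 50, 200):
    print(n, round(skin_fraction(n), 3))
# n = 2   -> 0.098
# n = 10  -> 0.401
# n = 50  -> 0.923
# n = 200 -> 1.0 (to three decimals)
```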
Content

1 Motivation

2 Nearest Neighbours for Classification

3 Nearest Neighbours for Regression

4 The Curse of Dimensionality

5 Advantages & Shortcomings

6 Real-Life Applications

Reading: Shmueli et al., §7.3


Lantz, §3
Advantages & Shortcomings

Advantages:

Very simple but often surprisingly effective

Fast training phase (just need to store the training set)

Non-parametric approach that can make use of large amounts of data

Shortcomings:

Does not produce a model that offers insights into the relationship between features and response

Slow classification phase (requires determination of the k nearest neighbours); k-NN is therefore called a lazy learner

Requires choice of a suitable k

Preprocessing required for scaling, binary/categorical features and missing values

Suffers from the curse of dimensionality
Content

1 Motivation

2 Nearest Neighbours for Classification

3 Nearest Neighbours for Regression

4 The Curse of Dimensionality

5 Advantages & Shortcomings

6 Real-Life Applications
Real-Life Applications

1 Optical Character Recognition

2 Face Recognition

3 Recommender Systems
