08 - KNN
k-Nearest Neighbours
Content
1 Motivation
6 Real-Life Applications
sweetness  crunchiness  food type
1          4            protein
10         1            fruit
7          10           vegetable
3          10           vegetable
1          1            protein
…          …            …
Motivation

[Figure: foods plotted by sweetness and crunchiness, with the 2-Nearest and 3-Nearest Neighbours of a new point highlighted.]

With 3 nearest neighbours, tomatoes are closest to oranges, grapes and nuts; 2/3 of the nearest neighbours are fruits, hence tomatoes are classified as fruits.
Input
n training samples (Xi1, …, Xip; Yi), i = 1, …, n, with Xi1, …, Xip
predictors and Yi the categorical outcome.
New data point (X1, …, Xp) whose outcome should be found.
Algorithm
1 Find the k training samples with the smallest distance
dist((Xi1, …, Xip), (X1, …, Xp)).
2 Assign to the new data point (X1, …, Xp) the majority category
among these k training samples.
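The two steps above can be sketched as follows. The first five training samples are the sweetness/crunchiness rows from the motivation example; the last sample and the query point are made up for illustration:

```python
from collections import Counter
import math

def knn_classify(train, new_point, k):
    """Classify new_point by majority vote among its k nearest
    training samples.  train is a list of (features, label) pairs."""
    # Step 1: sort training samples by Euclidean distance to the new point.
    by_dist = sorted(train, key=lambda s: math.dist(s[0], new_point))
    # Step 2: take the majority category among the k closest samples.
    labels = [label for _, label in by_dist[:k]]
    return Counter(labels).most_common(1)[0][0]

# (sweetness, crunchiness) -> food type, as in the motivation table.
foods = [((1, 4), "protein"), ((10, 1), "fruit"),
         ((7, 10), "vegetable"), ((3, 10), "vegetable"),
         ((1, 1), "protein"), ((8, 5), "fruit")]
print(knn_classify(foods, (7, 4), k=3))  # -> fruit
```

Note that `Counter.most_common` breaks ties by insertion order, so for even k a tie is resolved in favour of the closer neighbour's class here.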
k-Nearest Neighbours for Classification
(1) Distance Functions

Manhattan distance
Euclidean distance
Maximum distance
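The three distance functions can be sketched as follows (the sample points are illustrative):

```python
def manhattan(x, y):
    """Sum of coordinate-wise absolute differences (L1)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences (L2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def maximum(x, y):
    """Largest coordinate-wise absolute difference (L-infinity)."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 4), (4, 0)
print(manhattan(x, y), euclidean(x, y), maximum(x, y))  # 7 5.0 4
```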
(1) Scaling: Motivation

A categorical predictor (e.g. temperature with levels hot, medium, …) can be replaced with dummy numerical variables:

is_hot = 1 if hot, 0 otherwise.
is_med = 1 if medium, 0 otherwise.
…
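The dummy coding above can be sketched as follows; the `dummy_code` helper and the third level ("cold") are illustrative:

```python
def dummy_code(value, levels):
    """Replace a categorical value by 0/1 dummy variables,
    one per level except the last (the reference level)."""
    return [1 if value == lvl else 0 for lvl in levels[:-1]]

levels = ["hot", "medium", "cold"]
print(dummy_code("hot", levels))     # is_hot = 1, is_med = 0
print(dummy_code("medium", levels))  # is_hot = 0, is_med = 1
print(dummy_code("cold", levels))    # both zero: the reference level
```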
(3) How to Choose k:
The Bias/Variance Trade-Off

Consider the dataset:

[Figure: decision boundaries on the same dataset for k = 1 and for k = 99.]

[Figure: error vs parameter k (from 120 down to 0); the reducible error splits into bias and variance, with the optimum in between. For k = 1, the model is the training data itself!]
(3) How to Choose k:
Validation and Test Sets
Better idea: split samples into training and validation sets!
80% training data, 20% validation data.
[Figure: classification error on the validation set vs parameter k (from 120 down to 0); the optimum lies somewhere here.]
Can we expect the same performance on new data?
Estimate of generalization error: use the test set!

[Figure: generalization error (from the test set) vs parameter k (from 120 down to 0).]
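The train/validation/test procedure can be sketched as follows. The synthetic two-class data, the 60/20/20 split, and the odd-k candidate grid are illustrative (the slides use an 80/20 split for training/validation):

```python
import math
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k nearest training samples."""
    near = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    return Counter(label for _, label in near).most_common(1)[0][0]

def error_rate(train, data, k):
    """Fraction of misclassified points in data."""
    return sum(knn_predict(train, x, k) != y for x, y in data) / len(data)

# Synthetic two-class data (labels 0 and 3 double as class means).
random.seed(0)
points = [((random.gauss(c, 1), random.gauss(c, 1)), c)
          for c in (0, 3) for _ in range(50)]
random.shuffle(points)
train, valid, test = points[:60], points[60:80], points[80:]

# Pick k on the validation set ...
best_k = min(range(1, 21, 2), key=lambda k: error_rate(train, valid, k))
# ... then estimate the generalization error on the held-out test set.
print("best k:", best_k, "test error:", error_rate(train, test, best_k))
```

The test set is touched exactly once, after k has been fixed, so the reported error is an unbiased estimate of performance on new data.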
k-Nearest Neighbours: Algorithm Variants
Input
n training samples (Xi1, …, Xip; Yi), i = 1, …, n, with Xi1, …, Xip
predictors and Yi the numerical outcome.
New data point (X1, …, Xp) whose outcome should be found.
Algorithm
1 Find the k training samples with the smallest distance
dist((Xi1, …, Xip), (X1, …, Xp)).
2 Assign to the new data point (X1, …, Xp) the average of the
numerical outcomes among these k training samples.
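The regression variant can be sketched as follows (the 1-D samples, roughly y = 2x with noise, are made up):

```python
import math

def knn_regress(train, x, k):
    """Predict the average numerical outcome of the k nearest
    training samples.  train is a list of (features, outcome) pairs."""
    near = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    return sum(y for _, y in near) / k

samples = [((0.0,), 0.1), ((1.0,), 2.2), ((2.0,), 3.9), ((3.0,), 6.1)]
# Nearest two samples to x = 1.5 are (1.0, 2.2) and (2.0, 3.9),
# so the prediction is their average outcome.
print(knn_regress(samples, (1.5,), k=2))
```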
Choice of k in Regression Problems
Choice of parameter k: as in classification, e.g. via a validation set.
Reading: Lantz, §3
The Curse of Dimensionality (1): Fixed-Size
Training Sets Don’t Cover the Space
[Figure: edge length needed to cover a fixed fraction of uniformly distributed data vs dimension p (from 0 to 100).]
To capture 50% of uniformly distributed samples, a cube must span 50% of each axis in R^1, 71% in R^2, and 80% in R^3.
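A quick check of these numbers: assuming uniformly distributed data, a hypercube capturing a fraction f of the samples in p dimensions needs edge length f^(1/p) of each axis (the `edge_length` helper is illustrative):

```python
def edge_length(f, p):
    """Edge length (as a fraction of each axis) of a hypercube that
    captures a fraction f of uniformly distributed data in p dims."""
    return f ** (1 / p)

for p in (1, 2, 3, 10, 100):
    print(p, round(edge_length(0.5, p), 2))
```

For p = 3 this gives 0.794, which the slide rounds to 80%; by p = 100 the cube must span essentially the whole axis in every dimension.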
The Curse of Dimensionality (2):
Similarity Breaks Down in High Dimensions
The volume of a high-dimensional orange
is concentrated in the skin, not the pulp

Volume of an n-dimensional ball of radius R:
\[ V_n(R) = \frac{\pi^{n/2}}{\Gamma(n/2+1)}\, R^n, \]
where \(\Gamma\) is Euler's gamma function.

Volume outside (a skin of thickness \(\varepsilon\)):
\[ \frac{\pi^{n/2}}{\Gamma(n/2+1)}\left[R^n - (R-\varepsilon)^n\right] \]

Volume inside:
\[ \frac{\pi^{n/2}}{\Gamma(n/2+1)}\,(R-\varepsilon)^n \]

and
\[ \frac{\pi^{n/2}}{\Gamma(n/2+1)}\left[R^n - (R-\varepsilon)^n\right] \sim \frac{\pi^{n/2}}{\Gamma(n/2+1)}\, R^n \quad\text{as } n \to \infty. \]
Consequence: If the training data is high-dimensional and each feature is uniformly distributed, the samples in fact live in a thin shell away from the centre!
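A numerical sketch of this effect: in the ratio of inside to total volume the \(\pi^{n/2}/\Gamma(n/2+1)\) prefactor cancels, leaving ((R − ε)/R)^n, which vanishes as n grows. R = 1 and ε = 0.05 are illustrative choices:

```python
def inside_fraction(n, R=1.0, eps=0.05):
    """Fraction of an n-ball's volume at radius <= R - eps;
    the pi/Gamma prefactor cancels in the ratio."""
    return ((R - eps) / R) ** n

for n in (1, 10, 100, 1000):
    print(n, inside_fraction(n))
```

Already for n = 100, less than 1% of the volume lies outside the skin of thickness 0.05 — er, rather, less than 1% lies in the pulp; almost everything is skin.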
Real-Life Applications