Lecture_07_slides
k-Nearest Neighbour
Outline
Recap: Bayes rule with discrete features

P(y = c | x) = P(x | y = c) P(y = c) / P(x)

Prior probability: P(y = c)
Generative model (Naive Bayes assumption): P(x | y = c) = ∏_{i=1}^d P(x_i | y = c)
Why is this helpful? Each factor P(x_i | y = c) is easy to estimate from data.

Spam example: x_i indicates whether the word "money" appears in the email;
P(x_i = 1 | y = spam) = (# of spam emails containing "money") / (# of spam emails).
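As a small illustration of this estimate (the data below is a made-up toy example, not from the slides):

```python
# Minimal sketch: estimate P(word "money" appears | email is spam) by counting.
emails = [
    ("win money now", "spam"),
    ("money back guarantee", "spam"),
    ("meeting at noon", "ham"),
    ("send money please", "spam"),
    ("lunch tomorrow?", "ham"),
]

spam_emails = [text for text, label in emails if label == "spam"]
n_spam_with_money = sum("money" in text.split() for text in spam_emails)

# P(x_i = 1 | y = spam) = (# spam emails containing "money") / (# spam emails)
p_money_given_spam = n_spam_with_money / len(spam_emails)
print(p_money_given_spam)  # 1.0 for this toy data
```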
Bayes rule with continuous features, x ∈ R^d

P(y = c | x) = f_{x|y=c}(x) P(y = c) / f_x(x)

f_x: probability density function of x, a function f_x: R^d → R.
Note: Pr(x = v) = 0 for any single point v ∈ R^d,
but Pr(x ∈ D) = ∫_D f_x(x) dx for a region D ⊆ R^d.
Recall: ∫_{R^d} f_x(x) dx = 1 and f_x(x) ≥ 0.
Naive Bayes assumption
Continuous features
P(y = c | x) = f_{x|y=c}(x) P(y = c) / f_x(x)

f_{x|y=c}: conditional probability density of x given the class c.
Naive Bayes assumption: f_{x|y=c}(x) = ∏_{j=1}^d f_{x_j|y=c}(x_j),
i.e. the features are conditionally independent given the class c.
Gaussian Naive Bayes
Assumes that the generative model for each feature follows a Gaussian distribution:

f_{x_j|y=c}(x_j) = 1 / sqrt(2π σ²_{j,c}) · exp( −(x_j − μ_{j,c})² / (2σ²_{j,c}) )

μ_{j,c}: mean, σ²_{j,c}: variance, i.e. x_j | y = c ~ N(μ_{j,c}, σ²_{j,c}).
Gaussian Naive Bayes
Estimating the conditional Gaussian distribution for each feature
using sample data
Training data: {(x_i, y_i)}, i = 1, ..., N, with y_i ∈ {1, 2, ..., K}.

μ_{j,c} = (1 / |I_c|) Σ_{i ∈ I_c} x_{i,j}
σ²_{j,c} = (1 / |I_c|) Σ_{i ∈ I_c} (x_{i,j} − μ_{j,c})²

where I_c is the set of indices of the data points corresponding to class c.
Note: the unbiased variance estimate divides by |I_c| − 1 instead of |I_c|.
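A minimal NumPy sketch of these estimates (the function and variable names are my own, not from the slides):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and variances.

    X: (N, d) array of features, y: (N,) array of class labels.
    Returns dicts mapping class c -> prior, mean vector, variance vector.
    """
    priors, means, variances = {}, {}, {}
    N = len(y)
    for c in np.unique(y):
        Xc = X[y == c]                      # rows whose indices are in I_c
        priors[c] = len(Xc) / N             # P(y = c)
        means[c] = Xc.mean(axis=0)          # mu_{j,c} for every feature j
        variances[c] = Xc.var(axis=0)       # sigma^2_{j,c}, divides by |I_c|
        # use Xc.var(axis=0, ddof=1) for the unbiased estimate instead
    return priors, means, variances
```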
Gaussian Naive Bayes
Classifier prediction
P(y = c | x) = f_{x|y=c}(x) P(y = c) / f_x(x)

To predict the label y for a data point x, we need to compare P(y = c | x) for every c ∈ {1, 2, ..., K}.
The denominator f_x(x) is the same for any c ∈ {1, ..., K},
so it is sufficient to compare the numerators f_{x|y=c}(x) P(y = c).
Gaussian Naive Bayes
Naive Bayes assumption and final classifier
P(y = c | x) = f_{x_1|y=c}(x_1) f_{x_2|y=c}(x_2) … f_{x_d|y=c}(x_d) P(y = c) / f_x(x)

Naive Bayes assumption: f_{x|y=c}(x) = ∏_{j=1}^d f_{x_j|y=c}(x_j).
From the data we computed f_{x_j|y=c}(x_j) = N(μ_{j,c}, σ²_{j,c}).
Since the densities f_{x_j|y=c}(x_j) can be small, take the logarithm of the product above:

ŷ = argmax_c [ log P(y = c) + Σ_{j=1}^d log f_{x_j|y=c}(x_j) ]
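Continuing the sketch above with the same assumed names, prediction compares log P(y = c) + Σ_j log f_{x_j|y=c}(x_j) across the classes:

```python
import numpy as np

def predict_gaussian_nb(x, priors, means, variances):
    """Return the class maximizing log P(y=c) + sum_j log f_{x_j|y=c}(x_j)."""
    best_class, best_score = None, -np.inf
    for c in priors:
        mu, var = means[c], variances[c]
        # log of the Gaussian density, summed over the d features
        log_likelihood = np.sum(
            -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
        )
        score = np.log(priors[c]) + log_likelihood
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```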
Gaussian Naive Bayes
Summary
▪ Probabilistic classifier.
▪ Assumption: features are conditionally independent given the class; hard to verify in practice.
▪ Training is easy.
• 2) Show that, conditioned on being stopped, the probability of being arrested is independent of being Black.

Group    Population    Number arrested    Number stopped
Black    1.8 × 10^6    10 × 10^3          5 × 10^5
White    2.7 × 10^6    2 × 10^3           1 × 10^5
Define the events:
A: being Black
B: being arrested
C: being stopped
① P(A | B) = (10 × 10^3) / (12 × 10^3) = 5/6        (# Black arrested / # arrested)
   P(A) = (1.8 × 10^6) / (4.5 × 10^6) = 0.4
   (Recall: P(A | B) = P(A) if and only if A and B are independent.)
   Since P(A | B) ≠ P(A), A and B are not independent.
② Conditional independence given C: is P(A, B | C) = P(A | C) P(B | C)?
   P(A | C) = (5 × 10^5) / (6 × 10^5) = 5/6
   P(B | C) = (12 × 10^3) / (6 × 10^5) = 1/50
   P(A, B | C) = (10 × 10^3) / (6 × 10^5) = 1/60
   Indeed (5/6) · (1/50) = 1/60, so P(A, B | C) = P(A | C) P(B | C):
   given C (being stopped), A and B are conditionally independent.
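The arithmetic can be re-checked in a few lines of Python (numbers taken directly from the table above):

```python
# Counts from the table (Black and White rows combined where needed)
n_total      = 1.8e6 + 2.7e6   # 4.5e6 people
n_arrested   = 10e3 + 2e3      # 12e3 arrested
n_stopped    = 5e5 + 1e5       # 6e5 stopped
n_black      = 1.8e6
n_black_arr  = 10e3
n_black_stop = 5e5

# Unconditionally: P(A|B) != P(A), so A and B are not independent
print(n_black_arr / n_arrested)   # P(A|B) = 5/6 ≈ 0.833
print(n_black / n_total)          # P(A)   = 0.4

# Conditioned on being stopped: P(A,B|C) equals P(A|C) * P(B|C)
p_A_given_C  = n_black_stop / n_stopped   # 5/6
p_B_given_C  = n_arrested / n_stopped     # 1/50
p_AB_given_C = n_black_arr / n_stopped    # 1/60
print(abs(p_AB_given_C - p_A_given_C * p_B_given_C) < 1e-12)  # True
```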
k-Nearest Neighbour
k-NN Problem setup
Supervised machine learning: data consists of features and labels.
Recall terminology:
Features: input variables
Label: what we are predicting
k-NN Abstraction for classification
Minkowski (L_p) distance: D_p(x, x') = ( Σ_{j=1}^d |x_j − x'_j|^p )^{1/p},  p = 1, 2, 3, ...
p = 1: Manhattan (L1) distance
p = 2: Euclidean (L2) distance

For binary vectors x, x' ∈ {0, 1}^d, the Hamming distance is the number of positions in which the vectors differ.
Example: x = (1, 0, 0, 0, 1, ..., 0) and x' = (0, 1, 0, 0, 1, ..., 0) differ in two positions, so their Hamming distance is 2.
k-NN Distance Metric
How to measure the proximity between data points? → Measure distance
Euclidean (L2) distance: D(x, x') = sqrt( Σ_{j=1}^d (x_j − x'_j)² )
Manhattan (L1) distance: D(x, x') = Σ_{j=1}^d |x_j − x'_j|
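A minimal sketch of these distances in NumPy (the function names are my own):

```python
import numpy as np

def euclidean(x, xp):
    """L2 distance: sqrt(sum_j (x_j - x'_j)^2)."""
    return np.sqrt(np.sum((x - xp) ** 2))

def manhattan(x, xp):
    """L1 distance: sum_j |x_j - x'_j|."""
    return np.sum(np.abs(x - xp))

def minkowski(x, xp, p):
    """L_p distance: (sum_j |x_j - x'_j|^p)^(1/p)."""
    return np.sum(np.abs(x - xp) ** p) ** (1.0 / p)

def hamming(x, xp):
    """Number of positions in which two binary vectors differ."""
    return int(np.sum(x != xp))
```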
k-NN Feature scaling
Normalize the features so that features measured on different scales contribute comparably to the distance.
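One possible way to do this, as a sketch (the slide does not fix a particular scheme): z-score each feature using statistics computed on the training set only.

```python
import numpy as np

def standardize(X_train, X_test):
    """Rescale each feature to zero mean and unit variance (training statistics)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0   # avoid division by zero for constant features
    return (X_train - mean) / std, (X_test - mean) / std
```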
k-NN
Visualisation
When K = 1: take the label of the nearest neighbour.
[Figure: decision boundaries for K = 1, K = 3, K = 9]
Implementation
Given a test point x^test, we want to decide its label y^test.

For each test sample x^test:
▪ compute the distance D(x^test, x_i) to every training point (x_i, y_i), i = 1, ..., N
▪ find the k training points with the smallest distances
▪ predict the most common label among these k neighbours (sketched in code below)
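A minimal sketch of this procedure, assuming Euclidean distance and NumPy arrays for the training data (the names are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(x_test, X_train, y_train, k):
    """Predict the label of x_test by majority vote among its k nearest neighbours."""
    # distance D(x_test, x_i) to every training point, i = 1, ..., N
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # most common label among the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]
```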
Which distance metric and which value of k should we use? Very problem-dependent.
Must try them all out and see what works best.
k-NN
Setting hyperparameters
The best k can differ from dataset to dataset: use 5-fold cross-validation to choose the value of k.
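As a sketch of how this could be done with scikit-learn (a library choice of mine, not mentioned in the slides):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X_train, y_train, candidates=(1, 3, 5, 9, 15)):
    """Return the k with the highest mean 5-fold cross-validation accuracy."""
    scores = {}
    for k in candidates:
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X_train, y_train, cv=5).mean()
    return max(scores, key=scores.get)
```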
① k-NN does not work so well in high dimensions.
② Manhattan distance works better in high dimensions.
[Figure: distribution of all pairwise distances between randomly distributed points within d-dimensional unit squares]
References:
Theory: Aggarwal et al. 2001, On the Surprising Behavior of Distance Metrics in High Dimensional Space
More intuition: StackExchange article
k-NN
Summary
Advantages:
▪ Easy to implement
▪ No training required (tuning the hyperparameter k is the only "training")
▪ New data can be added seamlessly
▪ Versatile - useful for regression and classification

Disadvantages:
▪ Does not work as well in high dimensions
▪ Sensitive to noisy data and skewed class distribution
▪ Requires high memory
▪ Prediction stage is slow with large data, requires comparison with all samples in dataset
ML
Supervised: linear regression, logistic regression, Naive Bayes, k-NN, decision trees, neural networks
Unsupervised: clustering, dimensionality reduction
Reinforcement learning