Lecture_07_slides

The document outlines a course structure that includes topics such as Gaussian Naive Bayes and k-Nearest Neighbour (k-NN), along with student feedback and adjustments made to the course. It discusses the implementation of Gaussian Naive Bayes as a probabilistic classifier and the k-NN algorithm for classification based on proximity among data points. Additionally, it covers assessment methods and provides a framework for understanding conditional independence in the context of the course material.

Gaussian Naive Bayes

k-Nearest Neighbour
Outline

▪ Student reviews, discussion and measures to incorporate them

▪ Gaussian Naive Bayes

▪ k-Nearest Neighbour (k-NN)


Student reviews and how I accounted for them
1) Handwriting hard to follow and pace too fast:
- slower writing, recording of lectures

2) Course organization is not clear:
- added a sheet listing lectures and topics

3) Grading is not clear:
- see Moodle for a worked-out example
- an example final will be posted

4) More problem sets:
- two problem sets have been provided since the quiz; more will be provided throughout, plus worked examples in class

Miscellaneous comments to discuss.


Length and clarity of exercises
Connection with Analyse Numérique: the overlap seems to be only linear regression (an important topic)
Typos? Lecture 1: 0, Lecture 2: 2, Lecture 3: 1, Lecture 4: 0, Lecture 5: 0, Lecture 6: 0
2) Course structure

Week | Date | Topic | Quiz | PS | Python exercise (exercise hour)
1 | 25.09.23 | Introduction | - | - | EX00, EX01, EX02: Python background
2 | 02.10.23 | Linear regression | - | - | EX03: linear regression
3 | 09.10.23 | Logistic regression | - | - | EX04: case study, regression for system identification
4 | 16.10.23 | AI ethics | - | PS1 | -
5 | 23.10.23 | Multinomial logistic regression, feature engineering | 25.10.23 | - | -
6 | 30.10.23 | Data statistics, Naïve Bayes | - | - | EX05: logistic regression, cross-validation
7 | 06.11.23 | Gaussian Naïve Bayes, k-NN | - | - | EX06: k-NN, hyperparameter tuning
8 | 13.11.23 | Clustering (k-means), dimensionality reduction | - | PS2 | -
9 | 20.11.23 | PCA, Neural networks (NN) | 22.11.23 | - | EX07: neural network for character recognition
10 | 27.11.23 | Convolutional neural network (CNN) | - | - | EX08: CNN for image classification
11 | 04.12.23 | CNN, Decision tree | - | PS3 | EX09: Python background on the pandas package
12 | 11.12.23 | Random forest, AI in industry | 13.12.23 | - | EX10: decision tree for Titanic
13 | 18.12.23 | RL, Review | - | - | Practice final
3) Assessment

A sample of last year's exam will be available on Moodle


Course topic map: Introduction, Linear regression, Logistic regression, Feature engineering, Data statistics, Naive Bayes, k-NN, Clustering, Dimensionality reduction, Neural networks, Convolutional neural networks, Decision trees
Gaussian Naive Bayes
Recall - Naive Bayes for classification
Bayes' theorem for finite-valued features x ∈ {0, 1}^d (each x_j ∈ {0, 1}):

P(y = c | x) = P(x | y = c) P(y = c) / P(x)

Prior probability: P(y = c)
Generative model: P(x | y = c)

Naive Bayes assumption: the features are conditionally independent given the class,

P(x | y = c) = ∏_{j=1}^{d} P(x_j | y = c)
word "money
why helpful
-

? P(x; # emails hang


/y <)
:

/ I ↓ of emails
spam
xi
:

"Money" spam
example
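A tiny numeric illustration of this counting estimate (the counts below are made up purely for illustration):

# Hypothetical counts, purely for illustration
n_spam = 200                 # number of spam emails in the training set
n_spam_with_money = 150      # spam emails containing the word "money"

# Estimate of P(x_j = 1 | y = spam)
p_money_given_spam = n_spam_with_money / n_spam
print(p_money_given_spam)    # 0.75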
Bayes rule
Continuous features x ∈ ℝ^d

P(y = c | x) = f_{x|y=c}(x) P(y = c) / f_x(x)

Here f_x is the probability density function of the continuous random variable x, a function from ℝ^d to [0, ∞).

Note: Pr(x = v) = 0 for any single point v ∈ ℝ^d, but Pr(x ∈ D) = ∫_D f_x(x) dx for a region D ⊆ ℝ^d.

Recall: ∫_{ℝ^d} f_x(x) dx = 1 and f_x(x) ≥ 0 for all x ∈ ℝ^d.
Naive Bayes assumption
Continuous features
fx|y=c(x)P(y = c)
P(y = c | x) =
fx(x)
f_{x|y=c} is the conditional probability density of x given the class y = c (a well-defined probability density function).

Naive Bayes assumption: the features are conditionally independent given the class,

f_{x|y=c}(x) = ∏_{j=1}^{d} f_{x_j|y=c}(x_j)
Gaussian Naive Bayes
Assumes that the generative model for each feature follows a Gaussian distribution:

f_{x_j|y=c}(x_j) = (1 / √(2π σ_{j,c}²)) exp( −(x_j − μ_{j,c})² / (2 σ_{j,c}²) )

i.e. x_j | y = c ~ N(μ_{j,c}, σ_{j,c}²), where
μ_{j,c}: mean of the data for feature j given class label c
σ_{j,c}²: variance of the data for feature j given class label c
Gaussian Naive Bayes
Estimating the conditional Gaussian distribution for each feature
using sample data
Given sample data {(x^i, y^i)}_{i=1}^{N} with y^i ∈ {1, 2, ..., K}:

μ_{j,c} = (1 / |I_c|) ∑_{i ∈ I_c} x_j^i

σ_{j,c}² = (1 / |I_c|) ∑_{i ∈ I_c} (x_j^i − μ_{j,c})²

where I_c is the set of indices of the data points corresponding to class c.

Note: some texts define the variance by dividing by |I_c| − 1 instead of |I_c|.
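A minimal NumPy sketch of this estimation step (a sketch only; the function and variable names are illustrative, not taken from the course code):

import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate priors P(y=c), means mu_{j,c} and variances sigma^2_{j,c} per class.

    X: (N, d) array of continuous features, y: (N,) array of class labels.
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])            # P(y = c)
    means = np.array([X[y == c].mean(axis=0) for c in classes])      # mu_{j,c}
    variances = np.array([X[y == c].var(axis=0) for c in classes])   # sigma^2_{j,c}, divides by |I_c|
    return classes, priors, means, variances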
Gaussian Naive Bayes
Classifier prediction

fx|y=c(x)P(y = c)
P(y = c | x) =
fx(x)
To predict the label y for a data point x, we need to compare P(y = c | x) for every c ∈ {1, 2, ..., K}.

The denominator f_x(x) is the same for every c ∈ {1, 2, ..., K}, so it is sufficient to compare the numerators f_{x|y=c}(x) P(y = c).
Gaussian Naive Bayes
Naive Bayes assumption and final classifier
fx1|y=c(x1)fx2|y=c(x2)…fxd|y=c(xd)P(y = c)
P(y = c | x) =
fx(x)
Under the Naive Bayes assumption,

f_{x|y=c}(x) = ∏_{j=1}^{d} f_{x_j|y=c}(x_j)

with each f_{x_j|y=c} = N(μ_{j,c}, σ_{j,c}²) computed from the data.

Since the factors f_{x_j|y=c}(x_j) can be small, take the logarithm of the product above:

log( ∏_{j=1}^{d} f_{x_j|y=c}(x_j) · P(y = c) ) = ∑_{j=1}^{d} log f_{x_j|y=c}(x_j) + log P(y = c)

The classifier predicts the class c that maximizes this quantity.
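A matching prediction sketch based on the log-sum above (it reuses the hypothetical fit_gaussian_nb output from the previous sketch; again illustrative, not the official exercise code):

import numpy as np

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Predict the class maximizing sum_j log f_{x_j|y=c}(x_j) + log P(y = c)."""
    X = np.atleast_2d(X)                             # (N, d)
    # log N(x_j; mu_{j,c}, sigma^2_{j,c}) for every sample, class and feature: shape (N, K, d)
    log_f = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                    + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
    log_numerator = log_f.sum(axis=2) + np.log(priors)[None, :]   # drop the common denominator f_x(x)
    return classes[np.argmax(log_numerator, axis=1)]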
Gaussian Naive Bayes
Summary

▪ Probabilistic classifier

▪ Assumption: the features are conditionally independent given the class

▪ Training is easy

▪ The assumption is hard to verify in practice, but empirically it has worked well

See this explained example as introduction : https://www.youtube.com/watch?v=H3EjCKtlVog&t=60s


Exercise - conditional independence
Background of dataset and problem

The next exercise is inspired by this dataset (with small modifications of the numbers for ease of computation).
Source: Manuel Foerster and Dominik Karos Article
Step by step example - conditional independence
• 1) Show the probability of being arrested is not independent of being Black

Group | Population | Number arrested
Black | 1.8 × 10^6 | 10 × 10^3
White | 2.7 × 10^6 | 2 × 10^3

• 2) Show that, conditioned on being stopped, the probability of being arrested is independent of being Black

Group | Population | Number arrested | Number stopped
Black | 1.8 × 10^6 | 10 × 10^3 | 5 × 10^5
White | 2.7 × 10^6 | 2 × 10^3 | 1 × 10^5
Define the events:
A: being Black
B: being arrested
C: being stopped

1) P(A | B) = 10 × 10^3 / 12 × 10^3 = 5/6

P(A) = 1.8 × 10^6 / 4.5 × 10^6 = 2/5

(Recall: P(A | B) = P(A) if and only if A and B are independent.)

Since P(A | B) ≠ P(A), being arrested is not independent of being Black.

2) Conditional independence given C: is P(A, B | C) = P(A | C) P(B | C)?

P(A | C) = 5 × 10^5 / 6 × 10^5 = 5/6
P(B | C) = 12 × 10^3 / 6 × 10^5
P(A, B | C) = 10 × 10^3 / 6 × 10^5

P(A | C) P(B | C) = (5/6) × (12 × 10^3 / 6 × 10^5) = 10 × 10^3 / 6 × 10^5 = P(A, B | C)

So, conditioned on being stopped, being arrested is independent of being Black.
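A small Python check of the arithmetic above, using the counts from the table (variable names are illustrative):

from fractions import Fraction

population = {"Black": 1_800_000, "White": 2_700_000}
arrested = {"Black": 10_000, "White": 2_000}
stopped = {"Black": 500_000, "White": 100_000}

total_population = sum(population.values())
total_arrested = sum(arrested.values())
total_stopped = sum(stopped.values())

# 1) Is being arrested independent of being Black?
p_A_given_B = Fraction(arrested["Black"], total_arrested)   # 5/6
p_A = Fraction(population["Black"], total_population)       # 2/5
print(p_A_given_B == p_A)                                    # False -> not independent

# 2) Conditioned on being stopped, is it independent?
p_A_given_C = Fraction(stopped["Black"], total_stopped)     # 5/6
p_B_given_C = Fraction(total_arrested, total_stopped)       # 1/50
p_AB_given_C = Fraction(arrested["Black"], total_stopped)   # 1/60
print(p_AB_given_C == p_A_given_C * p_B_given_C)            # True -> conditionally independent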

k-Nearest Neighbour
k-NN Problem setup
Supervised machine learning: we are given a dataset D = {(x^i, y^i)}_{i=1}^{N} of features and labels.

Goal: Use a dataset to produce useful predictions on never-before-seen data.

Recall terminology:
Features: input variables
Label: what we are predicting
k-NN Abstraction for classification

Problem: Classifying data points among different categories.

The k-NN (k-Nearest Neighbors) algorithm assumes that data points of similar classes exist in close proximity
(similar inputs have similar outputs).

It classifies an unknown data point according to the category of its k nearest neighbors
(k = 1, 3, 5, ... for binary classification, odd to avoid ties).
k-NN Distance Metric
How to measure the proximity between data points? → Measure distance
For x^1, x^2 ∈ ℝ^d:

Minkowski distance: D(x^1, x^2) = ( ∑_{j=1}^{d} |x_j^1 − x_j^2|^p )^{1/p}, p = 1, 2, 3, ...

p = 1: Manhattan (ℓ1) distance
p = 2: Euclidean (ℓ2) distance

For binary vectors x^1, x^2 ∈ {0, 1}^d, e.g. x = (1, 0, 0, 0, 1, ..., 0):

Hamming distance: the number of positions in which the two vectors differ.

Example: two binary vectors that differ in exactly two positions have Hamming distance 2.
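A short Python sketch of these distances (illustrative only):

import numpy as np

def minkowski(x1, x2, p=2):
    """Minkowski distance; p=1 is the Manhattan distance, p=2 the Euclidean distance."""
    return np.sum(np.abs(x1 - x2) ** p) ** (1 / p)

def hamming(x1, x2):
    """Number of positions in which two binary vectors differ."""
    return int(np.sum(x1 != x2))

x1, x2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x1, x2, p=1), minkowski(x1, x2, p=2))       # 7.0 5.0
print(hamming(np.array([1, 0, 1]), np.array([0, 1, 1])))    # 2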
k-NN Distance Metric
How to measure the proximity between data points? → Measure distance

1)
13(x)D(x x) =

qu I Drankaltem
(x x) =

I
,
Euclidean

↑ Level sets of the two most common distances

I
-

L1 (Manhattan) distance L2 (Euclidean) distance


I
I

-
I
M I
ic

-
k-NN Feature scaling

Features might have different scales.
The distance metric gives more importance to the feature with the largest scale.
Normalizing the data lets all features contribute comparably to the distance.

Z-Score Standardisation: Set mean (µ) to 0, standard deviation (σ) to 1


For each feature j: x_j ← (x_j − μ_j) / σ_j
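A minimal sketch of this standardisation step (a sketch; computing the statistics on the training data and reusing them for the test data is an added assumption, not stated on the slide):

import numpy as np

def standardize(X_train, X_test):
    """Set each feature's mean to 0 and standard deviation to 1, using training statistics."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma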
k-NN
Feature scaling - Visualisation

[Figure: feature distributions before and after normalization]
k-NN
Visualisation When K=1:
take label of nearest neighbor

K=1

Which category does the point belong to? Gentoo


k-NN
K>1 Instead of copying label from nearest neighbor,
take majority vote from K closest points

K=3

Which category does the point belong to? Adélie


k-NN
K>1 Instead of copying label from nearest neighbor,
take majority vote from K closest points

K=9

Which category does the point belong to? Gentoo


What is pseudo-code?

▪ A description of an algorithm in a language-independent way

▪ It's what we think the algorithm should do before we encode it in
Python, MATLAB, C, etc.
▪ In this course, I don't require a formal syntax (otherwise, it becomes
programming again)
k-NN
Implementation

Given a test point x_test whose label y_test is unknown, we want to decide its label.

▪ Initialize k ≤ N

▪ For every known data point (x^i, y^i), i = 1, ..., N:

• Calculate the distance D(x^i, x_test) to the unknown data point

▪ Pick the k nearest known data points to the unknown data point; their indices are i_1, ..., i_k

▪ Get the labels y^{i_1}, ..., y^{i_k} of the selected k entries

▪ Return the mode (majority vote) of the k labels
k-NN
Implementation (in your exercises this week)
k-NN
Implementation

Save the training data {(x^i, y^i)}_{i=1}^{N}


k-NN
Implementation

For each test sample x_test:

▪ Find k-closest training data


▪ Predict mode of k-closest
training data
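A minimal NumPy implementation following these steps (a sketch, not the exercise solution; names such as knn_predict are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, p=2):
    """Predict the label of x_test by majority vote among its k nearest training points."""
    # Minkowski distance from x_test to every training point (p=2: Euclidean)
    distances = np.sum(np.abs(X_train - x_test) ** p, axis=1) ** (1 / p)
    nearest = np.argsort(distances)[:k]            # indices i_1, ..., i_k of the k closest points
    labels = y_train[nearest]                      # labels y^{i_1}, ..., y^{i_k}
    return Counter(labels).most_common(1)[0][0]    # mode (majority vote)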
k-NN
Implementation

Q: With N examples, how


fast are training and
prediction?

A: Train O(1), predict O(N)

If we are using k-NN for real-time decision-making, this could be bad: we want classifiers
that are fast at prediction; slow training can be OK.
k-NN
Hyperparameters

What is the best value of k to use ?


What is the best distance to use ?

These are hyperparameters: choices about the


model/algorithm that we set rather than learn

Very problem-dependent.
Must try them all out and see what works best.
k-NN
Setting hyperparameters

Validation set accuracy for different values


of k for our penguin classification task

(Seems that k = 5 or 7 works best


for this example)
Train, validate, test in ML
Setting hyperparameters
Your Dataset

Cross-Validation : Split data into folds, try each fold as validation


and average the results
[Diagram: the dataset split into Fold 1 ... Fold 5 plus a held-out Test set; each row uses a different fold as the validation set]

Useful for small datasets
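A minimal cross-validation sketch for choosing k (it reuses the hypothetical knn_predict helper from the implementation sketch; illustrative only):

import numpy as np

def cross_validate_k(X, y, k_values, n_folds=5):
    """Return the average validation accuracy of k-NN for each candidate value of k."""
    indices = np.random.permutation(len(X))
    folds = np.array_split(indices, n_folds)
    scores = {}
    for k in k_values:
        fold_accuracies = []
        for f in range(n_folds):
            val_idx = folds[f]
            train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            preds = np.array([knn_predict(X[train_idx], y[train_idx], x, k=k) for x in X[val_idx]])
            fold_accuracies.append(np.mean(preds == y[val_idx]))
        scores[k] = np.mean(fold_accuracies)       # average over the folds
    return scores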


k-NN
Setting hyperparameters

Different dataset:
5-fold cross-validation
for the value of k.

Each point: single


outcome.

The line goes


through the mean, bars
indicate standard
deviation

(Seems that k ~= 7 works best


for this data)
k-NN
For randomly distributed points in high dimensions, distances concentrate within a very small range
Notes:
1) In high dimensions, k-NN doesn't work so well.
2) The Manhattan distance works better in high dimensions.

[Figure: distribution of all pairwise distances between randomly distributed points within d-dimensional unit squares — Reference]

Theory: Aggarwal et al. 2001, On the Surprising Behavior of Distance Metrics in High Dimensional Space
More intuition: StackExchange article
k-NN
Summary

Advantages:
▪ Easy to implement
▪ No training required (the only "training" is tuning the hyperparameter k)
▪ New data can be added seamlessly
▪ Versatile: useful for regression and classification

Disadvantages:
▪ Does not work as well in high dimensions
▪ Sensitive to noisy data and skewed class distributions
▪ Requires high memory
▪ Prediction stage is slow with large data; requires comparison with all samples in the dataset
ML overview:
▪ Supervised learning: linear regression, logistic regression, Naive Bayes, k-NN, neural networks, decision trees
▪ Unsupervised learning (next): clustering, dimensionality reduction
▪ Reinforcement learning