
Pattern Classification

05. Density Estimation

AbdElMoniem Bayoumi, PhD

Spring 2023
Recap: Gaussian Densities
• Assume a multi-dimensional Gaussian density for each $P(X \mid C_i)$

• Features may be independent (or conditionally independent), i.e., independent Gaussians

• Features may be dependent in other cases
Recap: Applying Bayes Rule
• One way to apply Bayes' rule in practical situations:

– Obtain the training set $X(1), X(2), \dots, X(M)$

– Assume a multi-dimensional Gaussian density for each class, i.e., $P(X \mid C_i)$

– To obtain the form of each density we need $\mu_i$ and $\Sigma_i$ for each class $i$ → estimate them from the training set

– Estimate the a priori probabilities $P(C_i)$ from the training set, i.e., according to the frequency of each class

– Plug the obtained estimates into Bayes' rule to obtain the classification rule
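A minimal sketch of these steps in Python, assuming NumPy and SciPy are available; the helper names (`fit_gaussian_bayes`, `predict`) are illustrative and not part of the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X_train, y_train):
    """Estimate mu_i, Sigma_i, and the prior P(C_i) for each class from the training set."""
    params = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                 # training points belonging to class c
        params[c] = (Xc.mean(axis=0),              # mu_i
                     np.cov(Xc, rowvar=False),     # Sigma_i
                     len(Xc) / len(X_train))       # P(C_i) from class frequencies
    return params

def predict(params, x):
    """Bayes' rule: pick the class maximizing P(x | C_i) * P(C_i); the common denominator P(x) is dropped."""
    scores = {c: multivariate_normal.pdf(x, mean=mu, cov=cov) * prior
              for c, (mu, cov, prior) in params.items()}
    return max(scores, key=scores.get)
```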
Density Estimation
• In Bayes' rule, the probability densities have to be estimated

• One way is to assume that they are multivariate Gaussian and estimate the $\mu$ and $\Sigma$ of these distributions

• Alternatively, estimate the densities directly from the data
Histogram Analysis
$$\hat{p}(x) = \frac{m}{M \cdot (\text{size of bin})}$$

• m is the number of data points falling within a given range, i.e., a histogram bin

• M is the total number of points (that belong to the same class)

• Size of bin: the width of the histogram bin
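A minimal NumPy sketch of this estimate (the helper name `histogram_density` is illustrative; it assumes fixed-width bins with an edge at 0):

```python
import numpy as np

def histogram_density(data, bin_size, x):
    """p_hat(x) = m / (M * bin_size), where m counts the data points
    falling in the bin that contains x."""
    M = len(data)
    left_edge = np.floor(x / bin_size) * bin_size            # left edge of the bin containing x
    m = np.sum((data >= left_edge) & (data < left_edge + bin_size))
    return m / (M * bin_size)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)                                  # sample from a standard Gaussian
print(histogram_density(data, bin_size=0.5, x=0.2))           # roughly 0.4 (the true density near 0)
```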
Histogram Analysis
• Consider a 1-D example:

– m is the number of data points within the given range, e.g., $2 < x \le 3$

$$\hat{p}(2 < x \le 3) = \frac{m}{M \cdot (\text{size of bin})}$$

[Figure: histogram estimate $\hat{p}(x)$ next to the true density $p(x)$ from which the data was originally generated, both over $0 \le x \le 6$]
Histogram Analysis
$$\int_{x - \Delta/2}^{\,x + \Delta/2} p(x)\, dx \approx \Delta \cdot p(x)$$

• Probability(generated point $\in \left(x - \tfrac{\Delta}{2},\, x + \tfrac{\Delta}{2}\right)$) $\approx \Delta \cdot p(x) \equiv z$

[Figure: density $p(x)$ with a single bin of width $\Delta$ highlighted; bin size $\equiv \Delta$]
Histogram Analysis
$$\int_{x - \Delta/2}^{\,x + \Delta/2} p(x)\, dx \approx \Delta \cdot p(x)$$

• Probability(generated point $\in \left(X - \tfrac{\Delta}{2},\, X + \tfrac{\Delta}{2}\right)$) $\approx \Delta \cdot p(x) \equiv z$

• Assume we draw M points according to $p(x)$ → binomial distribution

• The number of points falling in the bin follows a binomial distribution with probability $z$
Histogram Analysis
$$P(k \text{ points fall in the bin out of } M \text{ points}) = \binom{M}{k} z^{k} (1 - z)^{M - k}$$

$$E[\#\text{ points in bin}] = M \cdot z = M \cdot p(x) \cdot \Delta$$

• Example: flip a coin 10 times

$$P(8 \text{ Heads}) = \binom{10}{8} p^{8} (1 - p)^{10 - 8}$$

$$E[\#\text{ Heads}] = p \cdot M = 0.5 \times 10 = 5$$

where $p \equiv$ probability of heads
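A quick numeric check of the coin example, using only the Python standard library:

```python
from math import comb

M, k, p = 10, 8, 0.5
prob_8_heads = comb(M, k) * p**k * (1 - p)**(M - k)   # binomial probability of exactly 8 heads
expected_heads = p * M                                # expected number of heads
print(prob_8_heads)                                   # ≈ 0.044
print(expected_heads)                                 # 5.0
```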
Histogram Analysis
$$E[\#\text{ points in bin}] = M \cdot z = M \cdot p(x) \cdot \Delta$$

• If k points fall in the histogram range, then

$$k \approx M \cdot p(x) \cdot \Delta$$

• Then, the estimate of $p(x)$ is:

$$\hat{p}(x) = \frac{k}{M \Delta} \qquad \text{Recall:}\quad \hat{p}(x) = \frac{m}{M \cdot (\text{size of bin})}$$
Histogram Analysis
• A weak method of estimation

• The density estimates are discontinuous, even though the true densities are assumed to be smooth

[Figure: discontinuous histogram estimate $\hat{p}(x)$ next to the smooth true density $p(x)$, both over $0 \le x \le 6$]
Naïve Estimator
• Instead of partitioning X, i.e., the feature space, into a number of prespecified ranges, we perform a similar range analysis for every X

[Figure: window of width h centered at X, spanning $X - \tfrac{h}{2}$ to $X + \tfrac{h}{2}$]

$$\hat{P}(X) = \frac{\#\text{points falling in } \left(X - \tfrac{h}{2},\, X + \tfrac{h}{2}\right)}{M h}$$
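A minimal NumPy sketch of the naïve estimator (the helper name is hypothetical):

```python
import numpy as np

def naive_estimate(data, h, x):
    """P_hat(x): fraction of points falling in (x - h/2, x + h/2), divided by h."""
    in_window = np.abs(data - x) < h / 2.0          # points inside the window centered at x
    return np.sum(in_window) / (len(data) * h)
```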
Naïve Estimator
• Drawbacks:

– Discontinuity of the density estimates

– All data points inside the window are weighted equally, regardless of their distance to the estimation point X
Kernel Density Estimator

• a.k.a. Parzen window density estimator

• Choose a bump function

[Figure: a single bump function]

• Summation of bump functions gives the estimate of the density

[Figure: density estimate formed by summing bump functions]
Kernel Density Estimator
• Choose the bump function as a Gaussian with standard deviation (bandwidth) h:

$$\phi_h(x) = \frac{e^{-x^2/(2h^2)}}{\sqrt{2\pi}\, h}$$

[Figure: Gaussian bump $\phi_h(x)$; h is the bandwidth]
Kernel Density Estimator
• Choose the bump function as a Gaussian with standard deviation (bandwidth) h:

$$\phi_h(x) = \frac{e^{-x^2/(2h^2)}}{\sqrt{2\pi}\, h}$$

• X(m) are the points generated from the density P(X) that we want to estimate

[Figure: bump function $\phi_h(X - X(m))$ centered at the data point X(m)]
Kernel Density Estimator

[Figure: a function $f(x)$ and its shifted copy $f(x - x_1)$ centered at $x_1$]
Kernel Density Estimator
• Summation of bump functions:

$$\hat{P}(X) = \frac{1}{M} \sum_{m=1}^{M} \phi_h\big(X - X(m)\big)$$

where the summation runs over the M generated points, and $\hat{P}(X)$ is the estimate of the density
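A minimal NumPy sketch of this estimator with the Gaussian bump above (the function name is illustrative; it is not SciPy's `gaussian_kde`):

```python
import numpy as np

def parzen_gaussian(data, h, x):
    """P_hat(x) = (1/M) * sum_m phi_h(x - X(m)), with a Gaussian bump of bandwidth h."""
    bumps = np.exp(-((x - data) ** 2) / (2.0 * h ** 2)) / (np.sqrt(2.0 * np.pi) * h)
    return bumps.mean()                          # average of the M bump values

rng = np.random.default_rng(0)
data = rng.normal(size=500)
print(parzen_gaussian(data, h=0.3, x=0.0))       # roughly 0.4 (true standard-normal density at 0)
```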
Kernel Density Estimator

[Figure: a true density (levels 0.25 and 0.75) compared with its kernel density estimate]
Kernel Density Estimator
• $\phi_h$ does not have to be Gaussian:

$$\phi_h(x) = \frac{1}{h}\, g\!\left(\frac{x}{h}\right)$$

where $g(\cdot)$ is any suitable bump function that integrates to 1:

$$\int_{-\infty}^{\infty} g(x)\, dx = 1$$

e.g.

$$g(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}} \;\rightarrow\; \phi_h(x) = \frac{e^{-x^2/(2h^2)}}{\sqrt{2\pi}\, h}$$
Kernel Density Estimator
• $\phi_h$ does not have to be Gaussian:

$$\phi_h(x) = \frac{1}{h}\, g\!\left(\frac{x}{h}\right)$$

where $g(\cdot)$ is any suitable bump function that integrates to 1:

$$\int_{-\infty}^{\infty} g(x)\, dx = 1$$

[Figure: a bump $f(x)$ supported on $[-1, 1]$ and its stretched version $f(x/h)$ supported on $[-h, h]$]
Kernel Density Estimator
• The naïve estimator is equivalent to a Parzen window estimator with:

$$g(x) = \begin{cases} 1, & -\tfrac{1}{2} \le x < \tfrac{1}{2} \\ 0, & \text{otherwise} \end{cases}$$

• In this case:

$$\hat{P}(X) = \frac{1}{Mh} \sum_{m=1}^{M} g\!\left(\frac{X - X(m)}{h}\right)$$

[Figure: box kernel $g\big((X - X(m))/h\big)$ of width h centered at the data point X(m), spanning $X(m) - \tfrac{h}{2}$ to $X(m) + \tfrac{h}{2}$]
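For comparison with the Gaussian sketch above, swapping in the box kernel reproduces the naïve estimator (again a hypothetical helper, not the lecture's code):

```python
import numpy as np

def parzen_box(data, h, x):
    """P_hat(x) = (1/(M*h)) * sum_m g((x - X(m))/h), with g(u) = 1 for -1/2 <= u < 1/2, else 0."""
    u = (x - data) / h
    g = ((u >= -0.5) & (u < 0.5)).astype(float)   # box kernel evaluated at each data point
    return g.sum() / (len(data) * h)
```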
Kernel Density Estimator
1-D form:

$$\hat{P}(X) = \frac{1}{M} \sum_{m=1}^{M} \phi_h\big(X - X(m)\big) = \frac{1}{Mh} \sum_{m=1}^{M} g\!\left(\frac{X - X(m)}{h}\right)$$

$$\int_{-\infty}^{\infty} g(x)\, dx = 1 \quad\Rightarrow\quad \int_{-\infty}^{\infty} \hat{P}(x)\, dx = 1$$
Kernel Density Estimator
Multi-dimensional form:

$$\hat{P}(X) = \frac{1}{M h^{N}} \sum_{m=1}^{M} g\!\left(\frac{X - X(m)}{h}\right)$$

$$\int_{-\infty}^{\infty} g(X)\, dX = 1$$

• For example, a multi-dimensional independent Gaussian density:

$$g(X) = \frac{e^{-\frac{1}{2}\sum_{i=1}^{N} x_i^2}}{(2\pi)^{N/2}}$$

(with a diagonal bandwidth matrix, each dimension can use its own bandwidth)
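A sketch of the multi-dimensional form with the independent Gaussian kernel and a diagonal bandwidth matrix; the per-dimension bandwidths h_i (and the product of the h_i replacing h^N in the normalization) are an assumption, not spelled out on the slide:

```python
import numpy as np

def parzen_multidim(data, h, x):
    """Multi-dimensional Parzen estimate with an independent Gaussian kernel.
    data: (M, N) samples, h: (N,) per-dimension bandwidths, x: (N,) query point."""
    M, N = data.shape
    u = (x - data) / h                                             # scale each dimension by its bandwidth
    g = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (N / 2)
    return g.sum() / (M * np.prod(h))
```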
How to choose h?

[Figure: density estimates obtained with a large h and with a small h]
How to choose h?
• Too small h → a bumpy, non-smooth estimate

• Too large h → the estimate could be so smooth that essential details of the density are lost or smoothed out
How to choose h?

[Figure: the true density compared with an estimate using too small an h and an estimate using too large an h]
Optimal h
• The optimal H (diagonal bandwidth matrix) can be approximated as:

$$H_i = \left(\frac{4}{(N+2)\, M}\right)^{\frac{1}{N+4}} \sigma_i \qquad \text{(normal reference rule)}$$

where $\sigma_i = \sqrt{\hat{\Sigma}_{i,i}}$

• $\hat{\Sigma}$ is the estimated covariance matrix, i.e.,

$$\hat{\Sigma} = \frac{1}{M} \sum_{m=1}^{M} \big(X(m) - \hat{\mu}\big)\big(X(m) - \hat{\mu}\big)^{T}$$

$$h_{opt} = \frac{1}{N} \sum_{i=1}^{N} H_i$$

$\hat{\Sigma}_{i,i} \equiv$ $i$-th diagonal element of $\hat{\Sigma}$; $N \equiv$ number of dimensions

For a multivariate normal kernel & diagonal bandwidth matrix

Bowman, A.W., and Azzalini, A. (1997), Applied Smoothing Techniques for Data Analysis, London: Oxford University Press [page 32].
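A small NumPy sketch of this rule (illustrative helper name; `bias=True` matches the 1/M covariance estimate on the slide):

```python
import numpy as np

def normal_reference_bandwidths(data):
    """H_i = (4 / ((N + 2) * M))**(1 / (N + 4)) * sigma_i for each dimension,
    and h_opt as their average, following the normal reference rule."""
    M, N = data.shape
    sigma = np.sqrt(np.diag(np.cov(data, rowvar=False, bias=True)))  # sigma_i = sqrt of diagonal of Sigma_hat
    H = (4.0 / ((N + 2) * M)) ** (1.0 / (N + 4)) * sigma
    return H, H.mean()                                               # h_opt = (1/N) * sum_i H_i
```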
Acknowledgment
• These slides have been created based on the lecture notes of Prof. Dr. Amir Atiya
