Parzen Window

• Introduction

– All parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multi-modal densities.

– Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.

– There are two types of nonparametric methods:

• Estimating the class-conditional densities P(x|ωj)
• Bypassing density estimation and going directly to a-posteriori probability estimation

The Parzen window is a technique used in density estimation.
Density Estimation
• Basic idea:
The probability that a vector x will fall in a region R is:

P = \int_R p(\mathbf{x}')\, d\mathbf{x}'    (1)

P is a smoothed (or averaged) version of the density function p(x). If we have a sample of size n, the probability that exactly k of the n points fall in R follows the binomial law:

P_k = \mathrm{BIN}(k \mid n, P) = \binom{n}{k} P^k (1 - P)^{n-k} = \frac{n!}{k!(n-k)!} P^k (1 - P)^{n-k}    (2)

where the binomial coefficient counts the distinct splits into k points inside R and n − k outside, and P^k (1 − P)^{n−k} is the probability that a particular set of k samples lies in R while the rest do not. The expected value and variance of k are:

E(k) = nP,   Var(k) = nP(1 − P)    (3)
What is the ML estimate of P?

\frac{\partial}{\partial P} \ln P_k = \frac{\partial}{\partial P}\left( \ln\binom{n}{k} + k \ln P + (n-k)\ln(1-P) \right) = \frac{k}{P} - \frac{n-k}{1-P} = 0

The maximum of P_k is therefore reached for

\hat{P} = \frac{k}{n}    (4)

• Therefore, the ratio k/n is a good estimate for the probability P and hence for the density function p.
• If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write:

\int_R p(\mathbf{x}')\, d\mathbf{x}' \simeq p(\mathbf{x}) V    (5)

where x is a point within R and V is the volume enclosed by R.

Combining equations (1), (4) and (5) yields (a short numerical check of this relation follows below):

p(\mathbf{x}) \simeq \frac{k/n}{V}
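As a hedged illustration of the relation p(x) ≈ (k/n)/V (my own sketch, not from the lecture; the function name frequency_density_estimate is made up), the following Python snippet draws n samples from a standard normal density and estimates p(0) by counting the fraction of samples that fall in a small interval around 0:

import numpy as np

def frequency_density_estimate(samples, x, half_width):
    """Estimate p(x) as (k/n)/V: count the fraction of samples falling in the
    small region R = [x - half_width, x + half_width] of volume (length) V."""
    n = len(samples)
    k = np.sum(np.abs(samples - x) <= half_width)   # number of samples in R
    V = 2.0 * half_width                            # volume of R
    return (k / n) / V

rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)               # true density: N(0, 1)
print(frequency_density_estimate(samples, x=0.0, half_width=0.1))
# should be close to 1/sqrt(2*pi) ≈ 0.399 for large n and small V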

FIGURE 4.1. The relative probability that an estimate given by Eq. 4 will yield a particular value for the probability density, here where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled, and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability.
• Density Estimation (cont’d)

– Justification of equation (5):

\int_R p(\mathbf{x}')\, d\mathbf{x}' \simeq p(\mathbf{x}) V    (5)

We assume that p(x) is continuous and that the region R is so small that p does not vary significantly within R. Since p(x') is approximately constant over R, it can be taken outside the integral:

\int_R p(\mathbf{x}')\, d\mathbf{x}' \simeq p(\mathbf{x}') \int_R d\mathbf{x}' = p(\mathbf{x}')\, \mu(R)

where μ(R) is: a length in R^1,
a surface area in R^2,
a volume in R^3,
a hypervolume in R^n.

Since p(x) ≈ p(x′) = constant over R, we therefore have (e.g., in R^3):

\int_R p(\mathbf{x}')\, d\mathbf{x}' \simeq p(\mathbf{x}) \cdot V

and

p(\mathbf{x}) \simeq \frac{k}{nV}
The volume V needs to approach 0 if this estimate is to converge to the true density at x.
– Practically, V cannot be allowed to become arbitrarily small, since the number of samples is always limited.
– One will have to accept a certain amount of variance in the ratio k/n.
– Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty.
To estimate the density at x, we form a sequence of regions R1, R2, … containing x: the first region is used with one sample, the second with two samples, and so on.

Let Vn be the volume of Rn, kn the number of samples falling in Rn, and pn(x) the nth estimate for p(x):

p_n(\mathbf{x}) = \frac{k_n / n}{V_n}    (7)
Three necessary conditions should hold if we want pn(x) to converge to p(x):

1) \lim_{n \to \infty} V_n = 0    (smaller and smaller regions)
2) \lim_{n \to \infty} k_n = \infty    (a huge number of observations should fall in R)
3) \lim_{n \to \infty} k_n / n = 0    (the observations in R should be a very small fraction of the total)

There are two different ways of obtaining sequences of regions that satisfy these conditions:
(a) Fix the volume Vn as a function of n, for instance by shrinking an initial region so that Vn = 1/√n, and determine kn from the data; one can then show that
pn(x) → p(x) as n → ∞.
This is called “the Parzen-window estimation method”.
(b) Fix the value of kn as some function of n, such as kn = √n, and determine the corresponding volume Vn from the data: the volume Vn is grown until it encloses kn neighbors of x (see the sketch after this list).
This is called “the kn-nearest-neighbor estimation method”.
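As a hedged illustration of method (b), the sketch below (one-dimensional data; the function name knn_density_estimate is my own, not from the lecture) grows an interval around x until it encloses kn = ⌈√n⌉ neighbors and then applies Eq. (7):

import numpy as np

def knn_density_estimate(samples, x):
    """k_n-nearest-neighbor estimate p_n(x) = (k_n/n) / V_n for 1-D data:
    grow the interval around x until it contains k_n = ceil(sqrt(n)) samples."""
    n = len(samples)
    k_n = int(np.ceil(np.sqrt(n)))
    distances = np.sort(np.abs(samples - x))
    radius = distances[k_n - 1]          # distance to the k_n-th nearest sample
    V_n = 2.0 * radius                   # length of the enclosing interval
    return (k_n / n) / V_n

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000)       # true density: N(0, 1)
print(knn_density_estimate(samples, 0.0))  # roughly 0.4 near the mode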
FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square. The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated.
• Parzen Windows (Kernel Density Estimation)
– The Parzen-window approach to estimating densities assumes that the region Rn is a d-dimensional hypercube:

V_n = h_n^d    (h_n: length of an edge of R_n)

Let φ(u) be the following window function (a unit “top-hat” kernel):

\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2}, \; j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}

– φ((x − xi)/hn) is equal to unity if xi falls within the hypercube of volume Vn centered at x, and equal to zero otherwise.
• The number of samples in this hypercube is:

k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)

By substituting kn into equation (7), we obtain the following estimate:

p_n(\mathbf{x}) = \frac{k_n / n}{V_n} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)

This is the Parzen-window estimate.

pn(x) estimates p(x) as an average of window functions of x and the samples xi (i = 1, …, n).
• The density estimate is a superposition of kernel functions centered at the samples xi (a minimal code sketch of this estimate follows below):

p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)

• φ(u) interpolates the density between samples.
• Each sample xi contributes to the estimate based on its distance from x.
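A minimal sketch of this estimate for d-dimensional data with the hypercube (top-hat) window; the function names are illustrative, not from the lecture:

import numpy as np

def hypercube_window(u):
    """Top-hat window: 1 if every component satisfies |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h_n):
    """Parzen-window estimate p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h_n)."""
    n, d = samples.shape
    V_n = h_n ** d                                     # volume of the hypercube
    contributions = hypercube_window((x - samples) / h_n)
    return contributions.sum() / (n * V_n)

rng = np.random.default_rng(0)
samples = rng.standard_normal((5_000, 2))              # 2-D standard normal samples
print(parzen_estimate(np.zeros(2), samples, h_n=0.5))  # true value 1/(2*pi) ≈ 0.159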
– We decide whether a sample xi lies inside the hypercube centered at x using the function φ: the number of samples within the hypercube located at x is

k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right)

– Example: for a small set of one-dimensional samples and a window of width hn, the Parzen-window estimate at a point x is obtained by counting the samples xi with |x − xi| ≤ hn/2 and dividing that count by n·hn.

– To approximate a piecewise-defined univariate probability density, we can use a Gaussian window instead of the rectangular pulse as the Parzen window. Looking at the Parzen-window estimates for different numbers of samples n and window sizes h, note that for a single sample we can only observe the form of the window function itself; as the number of samples increases, the estimate converges to the true density.
Properties of φ(u)
• The kernel function φ(u) can have a more general form (i.e., not just a hypercube).
• In order for pn(x) to be a legitimate estimate (nonnegative and integrating to one), φ(u) must be a valid density itself:

\varphi(\mathbf{u}) \ge 0, \qquad \int \varphi(\mathbf{u})\, d\mathbf{u} = 1
Effect of the window width hn on pn(x)
If we define the function δn(x) by

\delta_n(\mathbf{x}) = \frac{1}{V_n} \varphi\!\left(\frac{\mathbf{x}}{h_n}\right)

then we can write pn(x) as the average

p_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(\mathbf{x} - \mathbf{x}_i).

Since Vn = hn^d, hn clearly affects both the amplitude and the width of δn(x) (Fig. 4.3). Thus, as hn approaches zero, δn(x − xi) approaches a Dirac delta function centered at xi, and pn(x) approaches a superposition of delta functions centered at the samples.
FIGURE 4.3. Examples of two-dimensional circularly symmetric normal Parzen windows for three different values of h. Note that because the δn(x) are normalized, different vertical scales must be used to show their structure.

If hn is very large, the amplitude of δn is small, and pn(x) is the superposition of n broad, slowly changing functions: a very smooth, “out-of-focus” estimate of p(x).
FIGURE 4.4. Three Parzen-window density estimates based on the same set of five samples, using the window functions in Fig. 4.3. As before, the vertical axes have been scaled to show the structure of each distribution.

If hn is very small, the peak value of δn(x − xi) is large and occurs near x = xi. In this case pn(x) is the superposition of n sharp pulses centered at the samples: an erratic, “noisy” estimate.
If Vn is too large, the estimate will suffer from too little resolution; if Vn is too small, the estimate will suffer from too much statistical variability. With a limited number of samples, the best we can do is to seek some acceptable compromise. However, with an unlimited number of samples, it is possible to let Vn slowly approach zero as n increases and have pn(x) converge to the unknown density p(x).

Convergence conditions:
- φ(u) must be well-behaved (bounded and a valid density).
- Vn → 0, but at a rate slower than 1/n, so that nVn → ∞.

Refer to the textbook for the proof of the convergence of the mean and variance.
Expected Value / Variance of the estimate pn(x)
• The expected value of the estimate is a convolution of the window with the true density, and it approaches p(x) as Vn → 0:

E[p_n(\mathbf{x})] = \frac{1}{n} \sum_{i=1}^{n} E\!\left[ \frac{1}{V_n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_n}\right) \right] = \int \frac{1}{V_n} \varphi\!\left(\frac{\mathbf{x} - \mathbf{v}}{h_n}\right) p(\mathbf{v})\, d\mathbf{v} = \int \delta_n(\mathbf{x} - \mathbf{v})\, p(\mathbf{v})\, d\mathbf{v}.

• The variance of the estimate is bounded by:

\mathrm{Var}[p_n(\mathbf{x})] \le \frac{\sup \varphi(\cdot)\; E[p_n(\mathbf{x})]}{n V_n}

• The variance can therefore be decreased by letting nVn → ∞ (e.g., Vn = 1/√n). A small simulation illustrating this is sketched below.
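A hedged illustration (my own sketch, not from the lecture): repeat the Parzen estimate at a fixed point over many independent training sets and watch the empirical variance shrink as n grows with Vn = 1/√n.

import numpy as np

def parzen_at_zero(samples, h_n):
    """1-D Parzen estimate at x = 0 with a top-hat window of width h_n."""
    inside = np.abs(samples) <= h_n / 2.0
    return inside.sum() / (len(samples) * h_n)

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    h_n = n ** (-0.5)              # V_n = h_n = 1/sqrt(n), so n * V_n grows with n
    estimates = [parzen_at_zero(rng.standard_normal(n), h_n) for _ in range(200)]
    print(n, np.var(estimates))    # the empirical variance decreases as n increases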
• Illustration: the behavior of the Parzen-window method

• Case where p(x) ~ N(0,1)

Let φ(u) = (1/√(2π)) exp(−u²/2) and hn = h1/√n (n > 1), where h1 is a known parameter. Thus:

p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n} \varphi\!\left(\frac{x - x_i}{h_n}\right)

is an average of normal densities centered at the samples xi (a code sketch follows below).

Numerical results: for n = 1 and h1 = 1,

p_1(x) = \varphi(x - x_1) = \frac{1}{\sqrt{2\pi}} e^{-(x - x_1)^2 / 2} \sim N(x_1, 1)

For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable!
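A minimal sketch of this Gaussian-window estimator (illustrative names, assuming hn = h1/√n as above):

import numpy as np

def gaussian_parzen(x, samples, h1):
    """1-D Parzen estimate with a Gaussian window and h_n = h1 / sqrt(n)."""
    n = len(samples)
    h_n = h1 / np.sqrt(n)
    u = (x - samples[:, None]) / h_n                   # shape (n, len(x))
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # standard normal window
    return (phi / h_n).mean(axis=0)                    # average of n scaled windows

rng = np.random.default_rng(0)
xs = np.linspace(-4, 4, 9)
for n in (1, 10, 100):
    samples = rng.standard_normal(n)                   # true density: N(0, 1)
    print(n, np.round(gaussian_parzen(xs, samples, h1=1.0), 3))
# For n = 1 the estimate is exactly N(x1, 1); for larger n it approaches N(0, 1).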
Probability density estimates with different levels of smoothing (h = 0.2 and h = 0.5).
FIGURE 4.5. Parzen-window estimates of a univariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true density function), regardless of window width.
FIGURE 4.6. Parzen-window estimates of a bivariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width.
Figure 2.25. Illustration of the kernel density model. We see that h acts as a smoothing parameter: if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), the bimodal nature of the underlying distribution from which the data are generated (shown by the green curve) is washed out. The best density model is obtained for some intermediate value of h (middle panel).
• Case where p(x) = λ1·U(a,b) + λ2·T(c,d), an unknown density (a mixture of a uniform and a triangle density).

FIGURE 4.7. Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width.
Classification using kernel-based density estimation
• In classifiers based on Parzen-window estimation:

– We estimate the class-conditional densities for each category and classify a test point by the label corresponding to the maximum posterior.

– The decision region for a Parzen-window classifier depends upon the choice of window function, as illustrated in the following figure (a code sketch of such a classifier appears after the figure).
(Left panel: very low error on the training examples; right panel: better generalization.)
FIGURE 4.8. The decision boundaries in a two-dimensional Parzen-window dichotomizer depend on the window width h. At the left, a small h leads to boundaries that are more complicated than for the large h on the same data set, shown at the right. Apparently, for these data a small h would be appropriate for the upper region, while a large h would be appropriate for the lower region; no single window width is ideal overall.
Parzen classifier (example figures)
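A hedged sketch of a two-class Parzen-window classifier (all names are illustrative; it uses a Gaussian window and assumes equal priors unless priors are supplied):

import numpy as np

def parzen_class_density(x, class_samples, h_n):
    """Class-conditional Parzen estimate p_n(x | class) with a Gaussian window."""
    d = class_samples.shape[1]
    u = np.linalg.norm(x - class_samples, axis=1) / h_n
    phi = np.exp(-0.5 * u**2) / (2.0 * np.pi) ** (d / 2)
    return phi.sum() / (len(class_samples) * h_n ** d)

def classify(x, samples_by_class, h_n, priors=None):
    """Assign x to the class with the largest prior * estimated density."""
    k = len(samples_by_class)
    priors = priors if priors is not None else [1.0 / k] * k   # equal priors assumed
    scores = [p * parzen_class_density(x, s, h_n)
              for p, s in zip(priors, samples_by_class)]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
class0 = rng.normal(loc=[0, 0], scale=1.0, size=(200, 2))
class1 = rng.normal(loc=[3, 3], scale=1.0, size=(200, 2))
print(classify(np.array([0.5, 0.2]), [class0, class1], h_n=0.5))  # expected: 0
print(classify(np.array([2.8, 3.1]), [class0, class1], h_n=0.5))  # expected: 1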
Drawbacks of kernel-based methods

• They require a large number of samples.
• They require all the samples to be stored.
• Evaluation of the density can be very slow if the number of data points is large.
• Possible solution: use fewer kernels and adapt their positions and widths in response to the data (e.g., mixtures of Gaussians!).

Example from Dr. Khotanzad’s lecture notes (p. 144)
• Parzen Window (cont.)
– Parzen Window – Probabilistic Neural Network (PNN)
• Compute a Parzen estimate based on n patterns
– Patterns with d features sampled from c classes
– Each input unit is connected to the n pattern units

[Network diagram: input units x1, x2, …, xd are connected to pattern units p1, p2, …, pn through modifiable (trained) weights w11, …, wdn.]

[Network diagram: pattern units p1, …, pn are connected to category units ω1, …, ωc; each pattern unit emits a nonlinear activation that is summed by the category unit of its class.]
• Training the network → Algorithm
1. Normalize each pattern x of the training set to unit length.
2. Place the first training pattern on the input units.
3. Set the weights linking the input units and the first pattern unit such that w1 = x1.
4. Make a single connection from the first pattern unit to the category unit corresponding to the known class of that pattern.
5. Repeat the process for all remaining training patterns by setting the weights such that wk = xk (k = 1, 2, …, n).
We finally obtain the following network.
FIGURE 4.9. A probabilistic neural network (PNN) consists of d input units, n pattern units, and c category units. Each pattern unit forms the inner product of its weight vector and the normalized pattern vector x to form z = wᵗx, and then it emits exp[(z − 1)/σ²]. Each category unit sums such contributions from the pattern units connected to it. This ensures that the activity in each of the category units represents the Parzen-window density estimate using a circularly symmetric Gaussian window of covariance σ²I, where I is the d × d identity matrix.
• Testing the network
– Algorithm

1. Normalize the test pattern x and place it at the input units.
2. Each pattern unit computes the inner product to yield the net activation

z_k = net_k = \mathbf{w}_k^t \mathbf{x}

and emits the nonlinear function

f(z_k) = \exp\!\left(\frac{z_k - 1}{\sigma^2}\right)

– That is, if we let our effective window width be a constant, this exponential plays the role of the window function.
3. Each category (output) unit sums the contributions from all pattern units connected to it:

p_n(\mathbf{x} \mid \omega_j) \propto \sum_{i} f(z_i)

which, for equal priors, is proportional to the posterior P(ωj | x).

4. Classify by selecting the class with the maximum value of pn(x|ωj) (j = 1, …, c). A hedged code sketch of the full PNN is given below.
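A minimal PNN sketch under the assumptions above (unit-normalized patterns, Gaussian window of width σ, equal priors; all names are illustrative):

import numpy as np

class PNN:
    """Probabilistic neural network: pattern-unit weights are the normalized
    training patterns; category units sum exp((z - 1) / sigma^2)."""

    def __init__(self, sigma=0.5):
        self.sigma = sigma

    def train(self, patterns, labels):
        # Steps 1-5: normalize each pattern and store it as a weight vector w_k = x_k.
        norms = np.linalg.norm(patterns, axis=1, keepdims=True)
        self.weights = patterns / norms
        self.labels = np.asarray(labels)
        self.classes = np.unique(self.labels)
        return self

    def classify(self, x):
        x = x / np.linalg.norm(x)                 # normalize the test pattern
        z = self.weights @ x                      # inner products z_k = w_k . x
        activations = np.exp((z - 1.0) / self.sigma**2)
        sums = [activations[self.labels == c].sum() for c in self.classes]
        return self.classes[int(np.argmax(sums))]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 0], 0.2, (50, 2)), rng.normal([0, 1], 0.2, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
pnn = PNN(sigma=0.3).train(X, y)
print(pnn.classify(np.array([1.0, 0.1])))         # expected: 0
print(pnn.classify(np.array([0.1, 1.0])))         # expected: 1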
Benefits of PNNs
• Speed of learning: training requires only a single pass through the training data.
• Space complexity is O((n + 1)d); the time complexity for classification with a fully parallel implementation is O(1).
• New training patterns can be incorporated into a previously trained classifier quite easily.
Choosing the window function
• One of the problems encountered in the Parzen-window/PNN approach concerns the choice of the sequence of cell volumes V1, V2, … or, more generally, the overall window size (or indeed other window parameters, such as shape or orientation).
• If we take Vn = V1/√n, the results for any finite n will be very sensitive to the choice of the initial volume V1.
• If V1 is too small, most of the volumes will be empty, and the estimate pn(x) will be very erratic.
• On the other hand, if V1 is too large, important spatial variations in p(x) may be lost due to averaging over the cell volume.
