Parzen Window
The probability that exactly $k$ of the $n$ samples fall in the region $R$ follows the binomial law:

$$P_k = \binom{n}{k} P^k (1-P)^{n-k} \tag{2}$$

Here $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ is the number of unique splits into $k$ vs. $n-k$ samples, and $P^k(1-P)^{n-k}$ is the probability that a particular set of $k$ samples lies in $R$. The expected value and variance of $k$ are:

$$E(k) = nP, \qquad \operatorname{Var}(k) = nP(1-P) \tag{3}$$
What is the ML estimate of $P$? Setting the derivative of the log-likelihood to zero,

$$\frac{\partial}{\partial P}\ln P_k = \frac{\partial}{\partial P}\Big(\ln\binom{n}{k} + k\ln P + (n-k)\ln(1-P)\Big) = \frac{k}{P} - \frac{n-k}{1-P} = 0,$$

so $\max_\theta P_k(\theta)$ is reached for

$$\hat{\theta} = \hat{P} = \frac{k}{n} \tag{4}$$
• Therefore, the ratio k/n is a good estimate for the
probability P and hence for the density function p.
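A minimal simulation (Python/NumPy sketch) of this result: draw $k$ from the binomial of Eq. 2 and use $k/n$ as the estimate of $P$. The true value 0.7 matches the setting of Figure 4.1 below; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
P_true = 0.7                        # true probability mass of region R

for n in (10, 100, 10_000):
    k = rng.binomial(n, P_true)     # number of samples falling in R, Eq. 2
    print(f"n = {n:6d}   k/n = {k / n:.4f}")   # ML estimate of P, Eq. 4
```

As $n$ grows, $k/n$ concentrates ever more tightly around the true probability, exactly the behavior Figure 4.1 depicts.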
• If $p(x)$ is continuous and the region $R$ is so small that $p$ does not vary significantly within it, we can write:

$$\int_R p(x')\,dx' \simeq p(x)\,V \tag{5}$$

where $V$ is the volume enclosed by $R$.
FIGURE 4.1. The relative probability an estimate given by Eq. 4 will yield a
particular value for the probability density, here where the true probability
was chosen to be 0.7. Each curve is labeled by the total number of patterns n
sampled, and is scaled to give the same maximum (at the true probability).
The form of each curve is binomial, as given by Eq. 2. For large n, such
binomials peak strongly at the true probability. In the limit n → ∞, the curve
approaches a delta function, and we are guaranteed that our estimate will give
the true probability.
Density Estimation (cont'd)

Since $k/n$ estimates $P = \int_R p(x')\,dx'$ and, by Eq. 5,

$$\int_R p(x')\,dx' \simeq p(x)\int_R dx' = p(x)\,V, \tag{6}$$

the density estimate becomes

$$p(x) \simeq \frac{k/n}{V}.$$

(A numerical sketch of this estimate follows the list below.)
The volume $V$ needs to approach 0 if we want $p(x)$ itself rather than an averaged version of it, but:
– Practically, $V$ cannot be allowed to become arbitrarily small, since the number of samples is always limited.
– One will have to accept a certain amount of variance in the ratio $k/n$.
– Theoretically, if an unlimited number of samples were available, we could circumvent this difficulty.
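A numerical sketch of the estimate $p(x) \simeq \frac{k/n}{V}$, assuming for illustration a 1-D standard normal density and a small fixed region around the estimation point:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
samples = rng.standard_normal(n)            # true density: standard normal

x, V = 0.5, 0.2                             # estimation point; region volume (length)
k = np.sum(np.abs(samples - x) <= V / 2)    # k = number of samples inside R

p_hat = k / (n * V)                         # p(x) ~= (k/n) / V
p_true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(p_hat, p_true)                        # the estimate should be close to ~0.352
```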
To estimate the density at $x$, we form a sequence of regions $R_1, R_2, \dots$ containing $x$: the first region contains one sample, the second two samples, and so on.

Let $V_n$ be the volume of $R_n$, $k_n$ the number of samples falling in $R_n$, and $p_n(x)$ the $n$th estimate for $p(x)$:

$$p_n(x) = \frac{k_n/n}{V_n} \tag{7}$$
Three necessary conditions should hold if we want $p_n(x)$ to converge to $p(x)$:
1) $\lim_{n\to\infty} V_n = 0$ (smaller and smaller regions)
2) $\lim_{n\to\infty} k_n = \infty$ (ever more samples fall inside $R_n$)
3) $\lim_{n\to\infty} k_n/n = 0$ ($k_n$ grows more slowly than $n$)
The Parzen-window approach takes $R_n$ to be a $d$-dimensional hypercube of side length $h_n$ (so $V_n = h_n^d$) and defines the window function

$$\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2}, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$

• $\varphi((\mathbf{x}-\mathbf{x}_i)/h_n)$ is equal to unity if $\mathbf{x}_i$ falls within the hypercube of volume $V_n$ centered at $\mathbf{x}$, and equal to zero otherwise.
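A direct transcription of this window function (sketch; the 2-D test point is illustrative):

```python
import numpy as np

def phi(u):
    """Hypercube window: 1 if every coordinate satisfies |u_j| <= 1/2, else 0."""
    return float(np.all(np.abs(np.atleast_1d(u)) <= 0.5))

x  = np.array([0.0, 0.0])     # hypercube center
xi = np.array([0.3, -0.1])    # a sample point
h  = 1.0                      # side length of the hypercube
print(phi((x - xi) / h))      # 1.0: xi lies inside the hypercube centered at x
```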
• The number of samples in this hypercube is:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$$

so that, substituting into Eq. 7, the Parzen-window estimate becomes

$$p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$$
[Handwritten worked example, largely illegible: a piecewise-defined univariate density is estimated from a small set of samples using the hypercube window with width h = 3; the window function φ decides whether each sample falls inside the hypercube centered at the estimation point. A sketch of the same computation follows.]
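Because the worked example above is mostly unrecoverable, here is a minimal sketch of the same computation: a univariate Parzen estimate with the hypercube window. The sample values are illustrative assumptions; the window width h = 3 follows the example.

```python
import numpy as np

def phi(u):
    """Unit hypercube window (univariate): 1 if |u| <= 1/2, else 0."""
    return (np.abs(u) <= 0.5).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h in 1-D."""
    n = len(samples)
    return np.sum(phi((x - samples) / h)) / (n * h)

samples = np.array([2.0, 3.0, 4.0, 7.0])       # illustrative sample set
h = 3.0                                         # window width from the example
for x in (1.0, 3.0, 6.0):
    print(x, parzen_estimate(x, samples, h))    # 1/12, 3/12, 1/12
```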
Properties of φ(u)
• The kernel function $\varphi(\mathbf{u})$ can have a more general form (i.e., not just the hypercube).
• In order for $p_n(\mathbf{x})$ to be a legitimate estimate (nonnegative and integrating to one), $\varphi(\mathbf{u})$ must be a valid density itself:

$$\varphi(\mathbf{u}) \ge 0, \qquad \int \varphi(\mathbf{u})\,d\mathbf{u} = 1$$
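A quick numerical check of these two conditions for a candidate kernel; the Gaussian kernel is used here as the example:

```python
import numpy as np

def gaussian_kernel(u):
    """Candidate window function: the standard normal density."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

u = np.linspace(-10, 10, 100_001)
vals = gaussian_kernel(u)
print(np.all(vals >= 0))             # nonnegativity: True
print(vals.sum() * (u[1] - u[0]))    # Riemann-sum integral: ~1.0
```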
Effect of the window width hn on pn(x)
If we define the function $\delta_n(\mathbf{x})$ by

$$\delta_n(\mathbf{x}) = \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right),$$

then we can write $p_n(\mathbf{x})$ as the average

$$p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(\mathbf{x}-\mathbf{x}_i).$$

Since $V_n = h_n^d$, $h_n$ clearly affects both the amplitude and the width of $\delta_n(\mathbf{x})$ (Fig. 4.3). Thus, as $h_n$ approaches zero, $\delta_n(\mathbf{x}-\mathbf{x}_i)$ approaches a Dirac delta function centered at $\mathbf{x}_i$, and $p_n(\mathbf{x})$ approaches a superposition of delta functions centered at the samples.
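The amplitude/width trade-off is easy to see numerically. A sketch with a Gaussian $\varphi$ in one dimension (the choice of kernel and grid is illustrative):

```python
import numpy as np

def delta_n(x, h):
    """delta_n(x) = (1/V_n) * phi(x/h) with a Gaussian phi and V_n = h in 1-D."""
    return np.exp(-(x / h)**2 / 2) / (np.sqrt(2 * np.pi) * h)

x = np.linspace(-3, 3, 6001)
for h in (1.0, 0.5, 0.1):
    y = delta_n(x, h)
    print(f"h = {h}: peak = {y.max():.3f}, integral = {y.sum() * (x[1]-x[0]):.3f}")
# As h shrinks, the peak grows (~0.4 -> ~0.8 -> ~4.0) while the integral stays ~1:
# delta_n becomes taller and narrower, approaching a Dirac delta.
```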
FIGURE 4.3. Examples of two-dimensional circularly
symmetric normal Parzen windows for three different values of
h. Note that because the δ(x) are normalized, different vertical
scales must be used to show their structure.
Conditions for convergence:
– $\varphi(\mathbf{u})$ must be well-behaved.
– $V_n \to 0$, but at a rate slower than $1/n$ (i.e., $nV_n \to \infty$).
Refer to the textbook for the proof of the convergence of the mean and variance.
Expected Value/Variance of the estimate pn(x)
• The expected value of the estimate approaches $p(\mathbf{x})$ as $V_n \to 0$: it is a convolution of the window with the true density:

$$E[p_n(\mathbf{x})] = \frac{1}{n}\sum_{i=1}^{n} E\!\left[\frac{1}{V_n}\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)\right] = \int \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{v}}{h_n}\right) p(\mathbf{v})\,d\mathbf{v} = \int \delta_n(\mathbf{x}-\mathbf{v})\,p(\mathbf{v})\,d\mathbf{v}.$$
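A sketch verifying this convolution identity by Monte Carlo, assuming for illustration a standard normal $p$ and a Gaussian window:

```python
import numpy as np

rng = np.random.default_rng(2)
h, x = 0.5, 0.0
p       = lambda v: np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)            # true density
delta_n = lambda u: np.exp(-(u / h)**2 / 2) / (np.sqrt(2 * np.pi) * h)

# Left side: E[p_n(x)] estimated by averaging delta_n(x - x_i) over x_i ~ p
xi = rng.standard_normal(1_000_000)
lhs = delta_n(x - xi).mean()

# Right side: the convolution integral of delta_n with p, evaluated numerically
v = np.linspace(-8, 8, 100_001)
rhs = np.sum(delta_n(x - v) * p(v)) * (v[1] - v[0])
print(lhs, rhs)    # both ~0.357, the N(0, 1 + h^2) density at 0
```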
FIGURE 4.5. Parzen-window estimates of a univariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true density function), regardless of window width.
FIGURE 4.6. Parzen-window estimates of a bivariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width.
Figure 2.25. Illustration of the kernel density model. We see that h acts as a smoothing parameter: if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), the bimodal nature of the underlying distribution from which the data are generated (shown by the green curve) is washed out. The best density model is obtained for some intermediate value of h (middle panel).
• Case where $p(x) = \lambda_1\,U(a,b) + \lambda_2\,T(c,d)$ is the unknown density: a mixture of a uniform and a triangular density.
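A sketch of estimating such a mixture; the weights and parameters (0.5·U(0, 2) + 0.5·T(3, 5) with mode 4) and the Gaussian window are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Draw from the mixture: half uniform on [0, 2], half triangular on [3, 5]
from_uniform = rng.random(n) < 0.5
samples = np.where(from_uniform,
                   rng.uniform(0, 2, n),
                   rng.triangular(3, 4, 5, n))

def parzen(x, data, h):
    """Gaussian Parzen-window estimate at point x."""
    u = (x - data) / h
    return np.mean(np.exp(-u**2 / 2) / (np.sqrt(2 * np.pi) * h))

for x in (1.0, 2.5, 4.0):
    print(x, parzen(x, samples, h=0.2))
# Roughly 0.25 inside the uniform part, ~0 in the gap,
# and ~0.5 near the triangle's mode (slightly smoothed).
```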
Classification using kernel-based density estimation
• In classifiers based on Parzen-window estimation, we estimate the class-conditional densities from the training samples of each category and classify a test point by the label corresponding to the maximum posterior.
• A small window width yields very low error on the training examples; a larger width typically gives better generalization.
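A minimal two-class Parzen classifier (sketch; the 1-D synthetic data, equal priors, and window width are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic training data for two categories (equal priors assumed)
class0 = rng.normal(-1.0, 0.5, 200)
class1 = rng.normal(+1.0, 0.5, 200)

def parzen(x, data, h):
    u = (x - data) / h
    return np.mean(np.exp(-u**2 / 2) / (np.sqrt(2 * np.pi) * h))

def classify(x, h=0.3):
    """Assign x to the category with the larger density estimate."""
    return int(parzen(x, class1, h) > parzen(x, class0, h))

for x in (-1.2, 0.1, 1.5):
    print(x, classify(x))    # class 0 on the left, class 1 on the right
```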
FIGURE 4.8. The decision boundaries in a two-dimensional Parzen-window dichotomizer depend on the window width h. At the left, a small h leads to boundaries that are more complicated than for large h on the same data set, shown at the right. Apparently, for these data a small h would be appropriate for the upper region, while a large h would be appropriate for the lower region; no single window width is ideal overall.
Parzen classifier
Drawbacks of kernel-based methods
• The entire training set must be stored, and evaluating the estimate at a point requires visiting every sample; the cost grows with n, and the approach suffers in high dimensions.
[Figure: Probabilistic Neural Network (PNN) architecture — d input units x1, ..., xd are fully connected through weights w11, ..., wdn to n pattern units p1, ..., pn; each pattern unit emits a nonlinear activation and feeds one of c category units.]
• Training the network → Algorithm
1. Normalize each pattern x of the training set to unit length; the k-th pattern unit computes $z_k = \mathbf{w}_k^t \mathbf{x}$ and emits the nonlinear function

$$f(z_k) = \exp\!\left(\frac{z_k - 1}{\sigma^2}\right)$$

– That is, if we let our effective width h be a constant $\sigma$, this is a Gaussian window centered on $\mathbf{w}_k$: since $\|\mathbf{x}\| = \|\mathbf{w}_k\| = 1$, we have $\|\mathbf{x}-\mathbf{w}_k\|^2 = 2(1 - z_k)$, and hence $f(z_k) = e^{-\|\mathbf{x}-\mathbf{w}_k\|^2/(2\sigma^2)}$.
3. Each output unit sums the contributions from all pattern units connected to it:

$$p_n(\mathbf{x}\mid\omega_j) \propto \sum_{i:\,\mathbf{x}_i \in \omega_j} \varphi_i \;\propto\; P(\omega_j\mid\mathbf{x}),$$

so the test point is assigned to the category whose output unit is largest.
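Putting the steps together, a minimal PNN sketch (the 2-D data, labels, and σ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def normalize(X):
    """Step 1: scale each pattern to unit length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Illustrative training set: 2-D patterns from two categories
X = normalize(rng.normal([[1, 0]] * 10 + [[0, 1]] * 10, 0.3))
y = np.array([0] * 10 + [1] * 10)
W = X.copy()                                    # pattern-unit weights: w_k = x_k

def pnn_classify(x, sigma=0.3):
    x = x / np.linalg.norm(x)
    z = W @ x                                   # z_k = w_k^t x
    f = np.exp((z - 1) / sigma**2)              # pattern-unit activations
    scores = [f[y == c].sum() for c in (0, 1)]  # category units sum contributions
    return int(np.argmax(scores))

print(pnn_classify(np.array([0.9, 0.1])))       # expected: 0
print(pnn_classify(np.array([0.2, 1.0])))       # expected: 1
```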
Benefits of PNNs
• Speed of learning: only a single pass through the training data is required.
• The space complexity is O((n+1)d); the time complexity for classification by the parallel implementation is O(1).
• New training patterns can be incorporated into a previously trained classifier quite easily.
Choosing the window function
• One of the problems encountered in the Parzen-window/PNN approach concerns the choice of the sequence of cell volumes V1, V2, ... or the overall window size (or indeed other window parameters, such as shape or orientation).
• If we take Vn = V1/n, the results for any finite n will be very sensitive to the choice of the initial volume V1.
• If V1 is too small, most of the volumes will be empty, and the estimate pn(x) will be very erratic.
• On the other hand, if V1 is too large, important spatial variations in p(x) may be lost due to averaging over the cell volume. The sketch below illustrates both failure modes.
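A sketch of this sensitivity, assuming 1-D standard normal data, a hypercube window with Vn = V1/n as above, and illustrative values of V1:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000
samples = rng.standard_normal(n)
x = 0.0                                  # true density there: ~0.399

def parzen_box(x, data, h):
    """Univariate hypercube-window estimate with width h (V = h)."""
    return np.mean(np.abs(x - data) <= h / 2) / h

for V1 in (0.1, 100.0, 10_000.0):
    h = V1 / n                           # V_n = V_1 / n
    print(V1, parzen_box(x, samples, h))
# Typically: V1 too small -> erratic estimate (often exactly 0);
# V1 too large -> over-smoothed (~0.1 here); intermediate V1 -> close to 0.399.
```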