
CS 509: Pattern Recognition

Non-parametric Methods

Dr. Mohammed Ayoub Alaoui Mhamdi


Bishop's University
Sherbrooke, Qc, Canada
[email protected]
Introduction
 Density estimation with parametric models assumes that
the forms of the underlying density functions are known.
 However, common parametric forms do not always fit the
densities actually encountered in practice.
 In addition, most of the classical parametric densities are
unimodal, whereas many practical problems involve
multimodal densities.
 Non-parametric methods can be used with arbitrary
distributions and without the assumption that the forms of
the underlying densities are known.
 Histograms.
 Kernel Density Estimation / Parzen Windows.
 k-Nearest Neighbor Density Estimation.
 Real Example in Figure-Ground Segmentation
2
Histograms

3
Histogram Density Representation
 Consider a single continuous variable x and suppose we have
a set of N observations {x_1, ..., x_N}. Our goal is to model the
density p(x) from these observations.
 Standard histograms simply partition x into distinct bins of
width Δ_i and then count the number n_i of observations falling
into bin i.
 To turn this count into a normalized probability density, we
simply divide by the total number of observations N and by the
width of the bin.
 This gives us:

p_i = n_i / (N Δ_i)

 Hence the model for the density p(x) is constant over the
width of each bin. (And often the bins are chosen to have the
same width Δ_i = Δ.)
4
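The following is a minimal Python sketch of this estimator (the sample data, the bin width, and the function name histogram_density are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def histogram_density(data, bin_width):
    """Histogram density estimate: p_i = n_i / (N * bin_width) for each bin."""
    data = np.asarray(data)
    edges = np.arange(data.min(), data.max() + bin_width, bin_width)
    counts, _ = np.histogram(data, bins=edges)
    density = counts / (len(data) * bin_width)   # normalize so the estimate integrates to ~1
    return density, edges

# Illustrative usage on samples from a mixture of two Gaussians
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 1.0, 500)])
density, edges = histogram_density(data, bin_width=0.25)
print(density.sum() * 0.25)   # ~1.0, since each bin contributes p_i * bin_width
```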
Histogram Density Representation

5
Histogram Density as a Function of Bin Width

6
Histogram Density as a Function of Bin Width
 The green curve is the
underlying true density from
which the samples were drawn.
It is a mixture of two
Gaussians.
 When Δ is very small (top), the
resulting density is quite spiky
and hallucinates a lot of
structure that is not present in p(x).
 When Δ is very big (bottom), the resulting density is quite
smooth and consequently fails to capture the bimodality of p(x).
 It appears that the best results are obtained for some
intermediate value of Δ, which is given in the middle figure.
 In principle, a histogram density model is also dependent on
the choice of the edge location of each bin.
7
Analyzing the Histogram Density
What are the advantages and disadvantages of the
histogram density estimator?
Advantages:
 Simple to evaluate and simple to use.
 One can throw away the data set once the histogram is computed.
 Can be computed sequentially if data continues to come in.
Disadvantages:
 The estimated density has discontinuities due to the bin
edges rather than any property of the underlying density.
 Scales poorly (curse of dimensionality): we would have M^D bins
if we divided each variable in a D-dimensional space into M
bins.
8
What can we learn from Histogram Density
Estimation?
Lesson 1: To estimate the probability density at a particular
location, we should consider the data points that lie within
some local neighborhood of that point.
 This requires that we define some distance measure.
 There is a natural smoothness parameter describing the spatial
extent of the regions (this was the bin width Δ for the
histograms).
 Lesson 2: The value of the smoothing parameter should be
neither too large nor too small in order to obtain good
results.
With these two lessons in mind, we proceed to kernel
density estimation and nearest neighbor density estimation,
9 two closely related methods for density estimation.
The Space-Averaged / Smoothed Density
Consider again n samples x drawn from the underlying density
p(x).
Let R denote a small region containing x.
The probability mass associated with R is given by

P = ∫_R p(x′) dx′

Suppose we have n samples x_1, ..., x_n. The probability of each
sample falling into R is P.
How will the total number of points k falling into R be
distributed?
This will be a binomial distribution:

P(k) = (n choose k) P^k (1 − P)^(n−k)
10
The Space-Averaged / Smoothed Density
The expected value for k is thus

E[k] = nP

The binomial for k peaks very sharply about the mean.
So, we expect k/n to be a very good estimate for the
probability P (and thus for the space-averaged density).
This estimate is increasingly accurate as n increases.

11
The Space-Averaged / Smoothed Density
Assuming p(x) is continuous and that R is so small that p does
not appreciably vary within it, we can write:

P = ∫_R p(x′) dx′ ≈ p(x) V

where x is a point within R and V is the volume enclosed by R.
After some rearranging, we get the following estimate
for p(x):

p(x) ≈ (k/n) / V
12
Example
Simulated an example of estimating the density at x = 0.5 for
an underlying zero-mean, unit-variance Gaussian.
Varied the volume V used to estimate the density.
Red = 1000, Green = 2000, Blue = 3000, Yellow = 4000,
Black = 5000 samples.

13
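A rough way to reproduce this kind of experiment in Python (the specific volumes tried and the random seed are assumptions; only the point x = 0.5, the standard Gaussian, and the sample sizes come from the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = 0.5                                           # point at which we estimate the density
true_p = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)   # true N(0,1) density at 0.5, ~0.352

for n in (1000, 2000, 3000, 4000, 5000):           # sample sizes as in the plot legend
    samples = rng.standard_normal(n)
    for V in (0.01, 0.1, 0.5):                     # interval lengths ("volumes"); arbitrary choices
        k = np.sum(np.abs(samples - x0) <= V / 2)  # number of samples falling in the region
        print(n, V, k / (n * V), true_p)           # estimate p(x0) ~ k/(nV) vs. true value
```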
Practical Concerns
The validity of our estimate depends on two contradictory
assumptions:
1. The region must be sufficiently small that the density is
approximately constant over the region.
2. The region must be sufficiently large that the number of
points falling inside it is sufficient to yield a sharply peaked
binomial.
Another way of looking at it is to fix the volume V and increase
the number of training samples. Then the ratio k/n will
converge as desired. But this will only yield an estimate of
the space-averaged density P/V.
We want p(x), so we need to let V approach 0. However, with
a fixed number of samples, V will eventually become so small that
no points fall into it and our estimate would be useless: p(x) ≈ 0.
Note that in practice, we cannot let V become arbitrarily
small because the number of samples is always limited.
14
Practical Concerns
How can we skirt these limitations when an unlimited
number of samples is available?
 To estimate the density at x, form a sequence of regions
R_1, R_2, ... containing x: the first to be used with one sample,
the second with two samples, and so on.
 Let V_n be the volume of R_n, k_n be the number of samples falling in
R_n, and p_n(x) be the nth estimate for p(x):

p_n(x) = (k_n / n) / V_n    (*)

 If p_n(x) is to converge to p(x), we need the following three
conditions:

lim_{n→∞} V_n = 0,   lim_{n→∞} k_n = ∞,   lim_{n→∞} k_n / n = 0
15
Practical Concerns
 lim V_n = 0 ensures that our space-averaged density P/V will converge to p(x).
 lim k_n = ∞ basically ensures that the frequency ratio k_n/n will converge to
the probability P (the binomial will be sufficiently peaked).
 lim k_n/n = 0 is required for p_n(x) to converge at all. It also says that
although a huge number of samples will eventually fall within the
region R_n, they will form a negligibly small fraction of the
total number of samples.

16
Practical Concerns
There are two common ways of obtaining regions that
satisfy these conditions:
1. Shrink an initial region by specifying the volume V_n as
some function of n, such as V_n = V_1/√n. Then, we need to show that
p_n(x) converges to p(x). (This is like the Parzen window we’ll
talk about next.)
2. Specify k_n as some function of n, such as k_n = √n. Then, we grow
the volume V_n until it encloses k_n neighbors of x. (This is the
k-nearest-neighbor estimator.)

Both of these methods converge...

17
18
Parzen Windows
Let’s temporarily assume the region R_n is a d-dimensional
hypercube with h_n being the length of an edge.
The volume of the hypercube is given by

V_n = h_n^d

We can derive an analytic expression for k_n:

 Define a windowing function:

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and 0 otherwise

 This windowing function defines a unit hypercube centered
at the origin.
 Hence, φ((x − x_i)/h_n) is equal to unity if x_i falls within the hypercube of
volume V_n centered at x, and is zero otherwise.

19
Parzen Windows
The number of samples in this hypercube is therefore
given by

k_n = Σ_{i=1}^{n} φ((x − x_i)/h_n)

Substituting in equation (*) yields the estimate

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i)/h_n)

Hence, the windowing function φ, in this context called
a Parzen window, tells us how to weight all of the
samples in the data set to determine p_n(x) at a particular x.

20
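A minimal Python sketch of this hypercube Parzen estimate (function names and the test data are illustrative assumptions):

```python
import numpy as np

def hypercube_kernel(u):
    """phi(u): 1 if u lies inside the unit hypercube centered at the origin, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h^d."""
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    V = h ** d
    u = (x - samples) / h
    return hypercube_kernel(u).sum() / (n * V)

# Illustrative usage in one dimension with standard-normal data
rng = np.random.default_rng(2)
data = rng.standard_normal((500, 1))
print(parzen_estimate(np.array([0.0]), data, h=0.5))   # estimate of p(0), true value ~0.399
```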
Example

But, what undesirable traits from histograms are inherited


by Parzen window density estimates of the form we’ve
just defined?
Discontinuities...
Dependence on the bandwidth h_n.
21
Generalizing the Kernel Function
What if we allow a more general class of windowing
functions rather than the hypercube?
If we think of the windowing function as an interpolator,
rather than considering the window function about x only, we
can visualize it as a kernel sitting on each data sample x_i.
And, if we require the following two conditions on the
kernel function φ, then we can be assured that the resulting
density will be proper, i.e., non-negative and integrating to one:

φ(u) ≥ 0   and   ∫ φ(u) du = 1

For our previous case of V_n = h_n^d, it then follows that p_n(x) will also satisfy
these conditions.
22
Example: A Univariate Gaussian Kernel
A popular choice of the kernel is the Gaussian kernel:

φ(u) = (1/√(2π)) exp(−u²/2)

The resulting density estimate is given by:

p_n(x) = (1/n) Σ_{i=1}^{n} (1/(√(2π) h_n)) exp(−(x − x_i)²/(2 h_n²))

It will give us smoother estimates without the
discontinuities of the hypercube kernel.

23
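A short Python sketch of the Gaussian-kernel estimate above (the sample data and bandwidth are illustrative assumptions):

```python
import numpy as np

def gaussian_kde_1d(x, samples, h):
    """p_n(x) = (1/n) * sum_i N(x; x_i, h^2): a Gaussian kernel of width h on each sample."""
    x = np.atleast_1d(x)[:, None]                    # shape (m, 1) for broadcasting
    z = (x - samples[None, :]) / h                   # standardized distances, shape (m, n)
    kernel_vals = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * h)
    return kernel_vals.mean(axis=1)                  # average the kernels over the n samples

# Illustrative usage on a bimodal sample
rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 300)])
grid = np.linspace(-4, 4, 9)
print(gaussian_kde_1d(grid, samples, h=0.3))         # smooth estimate, no bin-edge discontinuities
```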
Effect of the Window Width
An important question is: what effect does the window
width h_n have on p_n(x)?
Define δ_n(x) as

δ_n(x) = (1/V_n) φ(x/h_n)

and rewrite p_n(x) as the average

p_n(x) = (1/n) Σ_{i=1}^{n} δ_n(x − x_i)
24
Effect of the Window Width
 h_n clearly affects both the amplitude and the width of δ_n(x).

25
Effect of Window Width (and, Hence, the Volume V_n)
But, for any value of h_n, the distribution δ_n is normalized:

∫ δ_n(x − x_i) dx = (1/V_n) ∫ φ((x − x_i)/h_n) dx = 1

If h_n is too large, the estimate will suffer from too little
resolution.
If h_n is too small, the estimate will suffer from too much
variability.
In theory (with an unlimited number of samples), we can
let V_n slowly approach zero as n increases, and then p_n(x) will
converge to the unknown p(x). But, in practice, we can, at
best, seek some compromise.

26
Example: Revisiting the Univariate Gaussian Kernel

27
Example: A Bimodal Distribution

28
Parzen Window-Based Classifiers
Estimate the densities for each category.
Classify a query point by the label corresponding to the
maximum posterior (i.e., one can include priors).
As you might have guessed, the decision regions for a Parzen
window-based classifier depend upon the kernel
function.

29
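A minimal sketch of such a classifier (the Gaussian kernel, equal priors, and the 1-D toy data are my assumptions; the slides do not fix these choices):

```python
import numpy as np

def kde(x, samples, h):
    """1-D Gaussian-kernel Parzen estimate of p(x) from the given samples."""
    z = (x - samples) / h
    return np.mean(np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * h))

def parzen_classify(x, class_samples, priors, h):
    """Return the index of the class maximizing prior * estimated class-conditional density."""
    scores = [prior * kde(x, s, h) for s, prior in zip(class_samples, priors)]
    return int(np.argmax(scores))

# Illustrative usage with two 1-D classes
rng = np.random.default_rng(4)
class_samples = [rng.normal(-1, 1, 200), rng.normal(2, 1, 200)]
print(parzen_classify(0.8, class_samples, priors=[0.5, 0.5], h=0.5))   # likely prints 1
```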
Parzen Window-Based Classifiers
During training, we can make the error arbitrarily low by
making the window sufficiently small, but this will have an
ill effect during testing (which is our ultimate concern).
Can you think of any systematic rules for choosing the
kernel width?
One possibility is to use cross-validation. Break up the data
into a training set and a validation set. Then, perform
training on the training set with varying bandwidths. Select
the bandwidth that minimizes the error on the validation
set.
There is little theoretical justification for choosing one
window width over another.
30
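A sketch of this cross-validation procedure for a kernel density estimate (the split sizes, the candidate bandwidths, and the use of held-out log-density as the validation criterion are assumptions, not prescribed by the slides):

```python
import numpy as np

def kde(x, samples, h):
    """1-D Gaussian-kernel density estimate evaluated at the points x."""
    z = (np.atleast_1d(x)[:, None] - samples[None, :]) / h
    return np.mean(np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * h), axis=1)

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(1, 1.0, 400)])
rng.shuffle(data)
train, val = data[:600], data[600:]              # simple train / validation split

best_h, best_score = None, -np.inf
for h in (0.05, 0.1, 0.2, 0.4, 0.8, 1.6):        # candidate bandwidths to compare
    score = np.mean(np.log(kde(val, train, h) + 1e-300))   # held-out log-density
    if score > best_score:
        best_h, best_score = h, score
print(best_h)                                    # bandwidth that performed best on the validation set
```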
Nearest Neighbor Methods
 Selecting the best window / bandwidth is a severe limiting
factor for Parzen window estimators.
 k-nearest-neighbor (k-NN) methods circumvent this problem by making the window
size a function of the actual training data.
 The basic idea here is to center our window around x and
let it grow until it captures k_n samples, where k_n is a function
of n.
 These samples are the k_n nearest neighbors of x.
 If the density is high near x, then the window will be relatively
small, leading to good resolution.
 If the density is low near x, the window will grow large, but it
will stop soon after it enters regions of higher density.
 In either case, we estimate p(x) according to

p_n(x) = (k_n / n) / V_n

where V_n is the volume of the window that captures the k_n
nearest neighbors of x.
31
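A minimal 1-D sketch of this estimator (the choice k_n = √n matches the rule mentioned earlier; the data and the symmetric-interval window are illustrative assumptions):

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """p_n(x) = (k/n) / V_n, where V_n is the length of the smallest symmetric
    interval around x that contains k of the samples."""
    n = len(samples)
    dists = np.sort(np.abs(samples - x))
    V = 2 * dists[k - 1]                  # window just wide enough to capture the k-th neighbor
    return (k / n) / V

rng = np.random.default_rng(6)
samples = rng.standard_normal(1000)
k = int(np.sqrt(len(samples)))            # k_n = sqrt(n), as suggested earlier
print(knn_density_1d(0.0, samples, k))    # estimate of p(0), true value ~0.399
```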
Nearest Neighbor Methods
We want k_n to go to infinity as n goes to infinity, thereby
assuring us that k_n/n will be a good estimate of the
probability that a point will fall in the window of volume
V_n.
But, we also want k_n to grow sufficiently slowly that the
size of our window will go to zero.
Thus, we also want k_n/n to go to zero.
Recall these conditions from the earlier discussion; they
ensure that p_n(x) converges to p(x) as n approaches infinity.

32
Examples of k-NN Estimation
Notice the discontinuities in the slopes of the estimate.

33
k-NN Estimation From 1 Sample

We don’t expect the density estimate from 1 sample to
be very good, but in the case of k-NN it will actually diverge!
With n = 1 and k_n = √n = 1, the estimate for p(x) is

p_1(x) = 1 / (2 |x − x_1|)

whose integral over x diverges.
34
But, as we increase the number of samples, the estimate will
improve.

35
Limitations
The k-NN estimator suffers from a flaw analogous to the one
from which the Parzen window methods suffer.
What is it? How do we specify k_n?
We saw earlier that the specification of k_n can lead to
radically different density estimates (in practical situations
where the number of training samples is limited).
 One could obtain a sequence of estimates by taking k_n = k_1 √n and
choosing different values of k_1.
But, like the Parzen window size, one choice is as good as
another absent any additional information.
Similarly, in classification scenarios, we can base our
judgement on classification error.
36
Posterior Estimation for Classification
 We can directly apply the k-NN approach to estimate the
posterior probabilities P(ω_i | x) from a set of n labeled samples.
 Place a window of volume V around x and capture k
samples, with k_i of them turning out to have label ω_i.
 The estimate for the joint probability is thus

p_n(x, ω_i) = (k_i / n) / V

 A reasonable estimate for the posterior is thus

P_n(ω_i | x) = p_n(x, ω_i) / Σ_j p_n(x, ω_j) = k_i / k

 Hence, the posterior probability for ω_i is simply the
fraction of samples within the window that are
labeled ω_i. This is a simple and intuitive result.
37
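A small Python sketch of this posterior estimate (the 2-D toy data and Euclidean distance are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_posteriors(x, samples, labels, k):
    """Estimate P(class | x) as k_i / k: the fraction of the k nearest samples with each label."""
    dists = np.linalg.norm(samples - x, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    counts = Counter(nearest)
    return {c: counts.get(c, 0) / k for c in np.unique(labels)}

# Illustrative usage with two 2-D classes
rng = np.random.default_rng(7)
samples = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)
print(knn_posteriors(np.array([2.5, 2.5]), samples, labels, k=15))   # mostly class 1
```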
Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

Figure-ground discrimination is an important low-level


vision task.
Want to separate the pixels that contain some
foreground object (specified in some meaningful way)
from the background.

38
Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

This paper presents a method for figure-ground


discrimination based on non-parametric densities for
the foreground and background.
They use a subset of the pixels from each of the two
regions. They propose an algorithm called iterative
sampling-expectation for performing the actual
segmentation.
The required input is simply a region of interest
(mostly) containing the object.

39
Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

Given a set of n samples {x_i}, where each x_i is a d-dimensional
vector.
We know the kernel density estimate is defined as

p̂(x) = (1/n) Σ_{i=1}^{n} Π_{j=1}^{d} (1/σ_j) φ((x_j − x_{ij}) / σ_j)

where the same kernel φ with a different bandwidth σ_j is

used in each dimension.

40
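A minimal sketch of this product-kernel estimate with a Gaussian kernel in each dimension (the data and bandwidths are illustrative assumptions):

```python
import numpy as np

def product_kernel_kde(x, samples, sigmas):
    """KDE with a Gaussian kernel in each dimension and its own bandwidth sigma_j:
    p(x) = (1/n) * sum_i prod_j (1/sigma_j) * phi((x_j - x_ij) / sigma_j)."""
    z = (x - samples) / sigmas                                    # shape (n, d)
    per_dim = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * sigmas)
    return np.mean(np.prod(per_dim, axis=1))                      # product over dims, mean over samples

# Illustrative usage with 3-D "color" feature vectors
rng = np.random.default_rng(8)
samples = rng.normal(0.5, 0.1, (500, 3))
print(product_kernel_kde(np.array([0.5, 0.5, 0.5]), samples, sigmas=np.array([0.02, 0.02, 0.1])))
```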
The Representation
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

 The representation used here is a function of RGB that
separates chromaticity from brightness.
 Separating the chromaticity from the brightness allows them
to use a wider bandwidth in the brightness dimension to
account for variability due to shading effects.
 And, much narrower kernels can be used on the
chromaticity channels to enable better discrimination.
41
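For illustration only, one common chromaticity-plus-brightness transform is sketched below; the exact formula used in the paper is not reproduced in these slides, so treat this as an assumption rather than the paper's definition:

```python
import numpy as np

def chromaticity_brightness(rgb):
    """One common chromaticity + brightness transform (an assumption here, not
    necessarily the paper's exact formula): r = R/(R+G+B), g = G/(R+G+B), s = (R+G+B)/3."""
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1, keepdims=True) + 1e-9   # guard against division by zero for black pixels
    r = rgb[..., 0:1] / total
    g = rgb[..., 1:2] / total
    s = total / 3.0
    return np.concatenate([r, g, s], axis=-1)

print(chromaticity_brightness([[120, 60, 30], [10, 10, 10]]))
```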
The Color Density
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

Given a sample of pixels {x_i}, the color density estimate is

given by

p̂(x) = (1/n) Σ_{i=1}^{n} Π_{j=1}^{3} K_{σ_j}(x_j − x_{ij})

where we have simplified the kernel definition:

K_σ(t) = (1/σ) φ(t/σ)

They use Gaussian kernels,

K_σ(t) = (1/(√(2π) σ)) exp(−t²/(2σ²)),

with a different bandwidth in each dimension.

42
Data-Driven Bandwidth
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

The bandwidth for each channel is calculated directly from

the image based on sample statistics: each σ_j is set as a
function of that channel's sample variance σ̂_j².

43
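The paper's exact bandwidth formula is not reproduced in these slides. As an illustration of a variance-based, data-driven rule, here is Silverman's normal-reference rule applied per channel (an assumption, not the paper's method):

```python
import numpy as np

def rule_of_thumb_bandwidth(samples):
    """Silverman's normal-reference rule, h = 1.06 * std * n^(-1/5), applied per channel.
    Shown only as an illustration of a variance-based, data-driven bandwidth; the paper's
    exact formula is not reproduced in these slides."""
    samples = np.atleast_2d(samples)
    n = samples.shape[0]
    return 1.06 * samples.std(axis=0, ddof=1) * n ** (-1 / 5)

rng = np.random.default_rng(9)
pixels = rng.normal([0.4, 0.3, 0.5], [0.02, 0.02, 0.15], size=(2000, 3))
print(rule_of_thumb_bandwidth(pixels))   # narrower bandwidths for the low-variance channels
```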
Initialization: Choosing the Initial Scale
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.
For initialization, they compute a distance between the
foreground and background distribution by varying the scale
of a single Gaussian kernel (on the foreground).
To evaluate the “significance” of a particular scale, they
compute the normalized KL-divergence between the two estimates,
where p̂_fg and p̂_bg are the density estimates for the foreground and
background regions, respectively. To compute each, they use
only a subset of the pixels (using all of the pixels would lead to quite
slow performance).

44
45
Iterative Sampling-Expectation Algorithm
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.
Given the initial segmentation, they need to refine the
models and labels to adapt better to the image.
However, this is a chicken-and-egg problem: if we knew the
labels, we could compute the models, and if we knew the
models, we could compute the best labels.
They propose an EM algorithm for this. The basic idea is to
alternate between estimating the probability that each pixel
belongs to each of the two classes, and then, given these
probabilities, refining the underlying models.
EM is guaranteed to converge (but only to a local
optimum).
46
Iterative Sampling-Expectation Algorithm
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

1. Initialize using the normalized KL-divergence.


2. Uniformly sample a set of pixels from the image to use in
the kernel density estimation. This is essentially the ‘M’
step (because we have a non-parametric density).
3. Update the pixel assignment based on maximum
likelihood (the ‘E’ step).
4. Repeat until stable. One can use a hard assignment of the
pixels and the kernel density estimator we’ve discussed,
or a soft assignment of the pixels and then a weighted
kernel density estimate (with the weights splitting each pixel
between the different classes).
5. The overall probability of a pixel belonging to the
foreground class is then obtained from the resulting
foreground and background density estimates.
47
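A heavily hedged sketch of the loop described above (the kde helper, the sample sizes, and the hard-assignment update are all assumptions; the paper's exact procedure is not reproduced in these slides):

```python
import numpy as np

def iterative_sampling_expectation(pixels, init_fg_mask, kde, n_samples=500,
                                   n_iters=10, seed=0):
    """Rough sketch of the iterative sampling-expectation loop described above.
    `pixels` is an (N, d) array of per-pixel feature vectors; `kde(points, samples)`
    is assumed to return density estimates of `points` under a kernel density model
    built from `samples` (this helper is an assumption, not defined by the slides)."""
    rng = np.random.default_rng(seed)
    fg_mask = init_fg_mask.copy()
    for _ in range(n_iters):
        # 'M' step: uniformly sample pixels from the current foreground / background regions
        fg_pool, bg_pool = np.flatnonzero(fg_mask), np.flatnonzero(~fg_mask)
        fg_idx = rng.choice(fg_pool, size=min(n_samples, fg_pool.size), replace=False)
        bg_idx = rng.choice(bg_pool, size=min(n_samples, bg_pool.size), replace=False)
        # 'E' step: hard-assign each pixel to the class under which it is more likely
        new_mask = kde(pixels, pixels[fg_idx]) > kde(pixels, pixels[bg_idx])
        if np.array_equal(new_mask, fg_mask):   # repeat until stable
            break
        fg_mask = new_mask
    return fg_mask
```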
Results: Stability
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

48
Results
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

49
Results
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.

50
Summary
Advantages:
 No assumptions are needed about the distributions ahead
of time (generality).
 With enough samples, convergence to an arbitrarily
complicated target density can be obtained.
Disadvantages:
 The number of samples needed may be very large
(number grows exponentially with the dimensionality of
the feature space).
 There may be severe requirements for computation time
and storage.

51
