Non-Parametric Methods
Histogram Density Representation
Consider a single continuous variable $x$ and let's say we have a set of $N$ observations of it. Our goal is to model the density $p(x)$ from this data.
Standard histograms simply partition $x$ into distinct bins of width $\Delta_i$ and then count the number $n_i$ of observations falling into bin $i$.
To turn this count into a normalized probability density, we simply divide by the total number of observations $N$ and by the width $\Delta_i$ of the bins.
This gives us:
$$p_i = \frac{n_i}{N \Delta_i}$$
Hence the model for the density $p(x)$ is constant over the width of each bin. (And often the bins are chosen to have the same width $\Delta_i = \Delta$.)
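A minimal sketch of this estimator in Python/NumPy, assuming equal-width bins; the data and bin counts below are only illustrative:

```python
import numpy as np

def histogram_density(data, num_bins=20):
    """Histogram density estimate: p_i = n_i / (N * Delta) for each bin i."""
    counts, edges = np.histogram(data, bins=num_bins)
    delta = edges[1] - edges[0]              # common bin width Delta
    density = counts / (len(data) * delta)   # normalize counts into a density
    return density, edges

# Illustrative data: samples from a mixture of two Gaussians (as in the figure)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 1.0, 500)])
density, edges = histogram_density(data, num_bins=30)
print(np.sum(density * np.diff(edges)))      # ~1.0, confirming the estimate integrates to 1
```

Varying `num_bins` (equivalently, the bin width $\Delta$) reproduces the spiky-versus-oversmoothed behavior discussed below.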
Histogram Density as a Function of Bin Width
The green curve is the
underlying true density from
which the samples were drawn.
It is a mixture of two
Gaussians.
When $\Delta$ is very small (top), the resulting density is quite spiky and hallucinates a lot of structure that is not present in $p(x)$.
When $\Delta$ is very big (bottom), the resulting density is quite smooth and consequently fails to capture the bimodality of $p(x)$.
It appears that the best results are obtained for some intermediate value of $\Delta$, which is shown in the middle figure.
In principle, a histogram density model is also dependent on the choice of the edge location of each bin.
Analyzing the Histogram Density
What are the advantages and disadvantages of the
histogram density estimator?
Advantages:
Simple to evaluate and simple to use.
One can throw away the data set once the histogram is computed.
Can be computed sequentially if data continues to come in.
Disadvantages:
The estimated density has discontinuities due to the bin
edges rather than any property of the underlying density.
Scales poorly (curse of dimensionality): we would have $M^D$ bins if we divided each variable in a $D$-dimensional space into $M$ bins.
What can we learn from Histogram Density
Estimation?
Lesson 1: To estimate the probability density at a particular
location, we should consider the data points that lie within
some local neighborhood of that point.
This requires we define some distance measure.
There is a natural smoothness parameter describing the spatial
extent of the regions (this was the bin width for the
histograms).
Lesson 2: The value of the smoothing parameter should be neither too large nor too small in order to obtain good results.
With these two lessons in mind, we proceed to kernel density estimation and nearest neighbor density estimation, two closely related methods for density estimation.
The Space-Averaged / Smoothed Density
Consider again $n$ samples $x_1, \dots, x_n$ drawn from the underlying density $p(x)$.
Let $R$ denote a small region containing $x$.
The probability mass associated with $R$ is given by
$$P = \int_R p(x')\, dx'$$
The Space-Averaged / Smoothed Density
Suppose we draw $n$ samples and let $k$ be the number of them that fall inside $R$. The expected value for $k$ is thus
$$E[k] = nP$$
The Space-Averaged / Smoothed Density
Assuming $p(x)$ is continuous and that $R$ is so small that $p$ does not appreciably vary within it, we can write:
$$P = \int_R p(x')\, dx' \simeq p(x)\, V$$
where $V$ is the volume enclosed by $R$. Combining this with $E[k] = nP$ gives the estimate
$$p(x) \simeq \frac{k/n}{V}$$
Example
Simulated an example of estimating the density at $x = 0.5$ for an underlying zero-mean, unit-variance Gaussian.
Varied the volume $V$ used to estimate the density.
Red = 1000, Green = 2000, Blue = 3000, Yellow = 4000, Black = 5000.
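A rough sketch of this kind of simulation, using the estimate $p(x) \simeq (k/n)/V$ with an interval of width $V$ centered at $x = 0.5$; interpreting the color-coded values as sample counts is an assumption, and the volumes and seed below are illustrative:

```python
import numpy as np

def local_density_estimate(samples, x0, volume):
    """Estimate p(x0) as (k/n)/V, where k counts the samples in an interval of width V around x0."""
    k = np.sum(np.abs(samples - x0) < volume / 2.0)
    return (k / len(samples)) / volume

rng = np.random.default_rng(0)
for n in (1000, 2000, 3000, 4000, 5000):
    samples = rng.normal(0.0, 1.0, n)
    estimates = [local_density_estimate(samples, 0.5, v) for v in (0.01, 0.1, 0.5, 1.0)]
    print(n, np.round(estimates, 3))      # true density N(0.5; 0, 1) is about 0.352
```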
Practical Concerns
The validity of our estimate depends on two contradictory
assumptions:
1. The region must be sufficiently small that the density is approximately constant over the region.
2. The region must be sufficiently large that the number of
points falling inside it is sufficient to yield a sharply peaked
binomial.
Another way of looking at it is to fix the volume $V$ and increase the number of training samples. Then the ratio $k/n$ will converge as desired. But this will only yield an estimate of the space-averaged density $P/V$.
We want $p(x)$, so we need to let $V$ approach 0. However, with a fixed number of samples, the region will become so small that no points will fall into it and our estimate would be useless: $p(x) \simeq 0$.
Note that in practice, we cannot let $V$ become arbitrarily small because the number of samples is always limited.
Practical Concerns
How can we skirt these limitations when an unlimited number of samples is available?
To estimate the density at $x$, form a sequence of regions $R_1, R_2, \dots$ containing $x$, with $R_1$ being used with one sample ($n = 1$), $R_2$ with two samples ($n = 2$), and so on.
Let $V_n$ be the volume of $R_n$, $k_n$ be the number of samples falling in $R_n$, and $p_n(x)$ be the $n$th estimate for $p(x)$:
$$p_n(x) = \frac{k_n / n}{V_n}$$
Practical Concerns
$\lim_{n \to \infty} V_n = 0$ ensures that our space-averaged density $P/V$ will converge to $p(x)$.
$\lim_{n \to \infty} k_n = \infty$ basically ensures that the frequency ratio $k_n/n$ will converge to the probability $P$ (the binomial will be sufficiently peaked).
$\lim_{n \to \infty} k_n/n = 0$ is required for $p_n(x)$ to converge at all. It also says that although a huge number of samples will fall within the small region $R_n$, they will form a negligibly small fraction of the total number of samples.
Practical Concerns
There are two common ways of obtaining regions that
satisfy these conditions:
1. Shrink an initial region by specifying the volume $V_n$ as some function of $n$, such as $V_n = 1/\sqrt{n}$. Then, we need to show that $p_n(x)$ converges to $p(x)$. (This is like the Parzen window we'll talk about next.)
2. Specify $k_n$ as some function of $n$, such as $k_n = \sqrt{n}$. Then, we grow the volume until it encloses $k_n$ neighbors of $x$. (This is the k-nearest-neighbor method.)
Parzen Windows
Let's temporarily assume the region $R_n$ is a $d$-dimensional hypercube with $h_n$ being the length of an edge.
The volume of the hypercube is given by
$$V_n = h_n^d$$
Parzen Windows
Define the window function
$$\varphi(u) = \begin{cases} 1 & |u_j| \le 1/2, \quad j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$
The number of samples in this hypercube is therefore given by
$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$$
Substituting into $p_n(x) = (k_n/n)/V_n$ yields the Parzen window estimate
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$$
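A minimal sketch of the hypercube Parzen estimate, directly implementing the formulas above; the data, edge length, and function name are illustrative choices, not part of the slides:

```python
import numpy as np

def parzen_hypercube_density(x, samples, h):
    """Parzen window estimate p_n(x) with a d-dimensional hypercube kernel of edge length h."""
    x = np.atleast_1d(x)
    samples = np.atleast_2d(samples)            # shape (n, d)
    d = samples.shape[1]
    u = (x - samples) / h                       # scaled offsets (x - x_i) / h
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi(u): 1 if the sample lies inside the hypercube
    return inside.sum() / (len(samples) * h**d)

# Illustrative 1-D data from a standard Gaussian
rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, 1000).reshape(-1, 1)
print(parzen_hypercube_density([0.0], samples, h=0.5))   # should be near 0.399
```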
Example
Effect of the Window Width
An important question is: what effect does the window width $h_n$ have on $p_n(x)$?
Define $\delta_n(x)$ as
$$\delta_n(x) = \frac{1}{V_n}\, \varphi\!\left(\frac{x}{h_n}\right)$$
so that the estimate can be written as an average of kernels centered at the samples:
$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta_n(x - x_i)$$
Effect of the Window Width
$h_n$ clearly affects both the amplitude and the width of $\delta_n(x)$.
Effect of Window Width (And, Hence, Volume $V_n$)
But, for any value of $h_n$, the distribution is normalized:
$$\int \delta_n(x - x_i)\, dx = \int \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right) dx = \int \varphi(u)\, du = 1$$
If $h_n$ is too large, the estimate will suffer from too little resolution.
If $h_n$ is too small, the estimate will suffer from too much variability.
In theory (with an unlimited number of samples), we can let $V_n$ slowly approach zero as $n$ increases and then $p_n(x)$ will converge to the unknown $p(x)$. But, in practice, we can, at best, seek some compromise.
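A short sketch illustrating this trade-off with a Gaussian kernel (a common smooth alternative to the hypercube window); the bandwidth values below are illustrative:

```python
import numpy as np

def gaussian_parzen_density(x, samples, h):
    """Parzen estimate with a univariate Gaussian kernel of bandwidth h."""
    u = (x - samples) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # standard normal kernel
    return kernels.sum() / (len(samples) * h)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, 200)
for h in (0.01, 0.2, 2.0):    # too small -> spiky, high variance; too large -> oversmoothed
    print(h, gaussian_parzen_density(0.0, samples, h))
```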
Example: Revisiting the Univariate Gaussian Kernel
Example: A Bimodal Distribution
Parzen Window-Based Classifiers
Estimate the densities for each category.
Classify a query point by the label corresponding to the
maximum posterior (i.e., one can include priors).
As you might have guessed, the decision regions for a Parzen window-based classifier depend upon the kernel function.
Parzen Window-Based Classifiers
During training, we can make the error arbitrarily low by making the window sufficiently small, but this will have an ill effect during testing (which is what we ultimately care about).
Can you think of any systematic rules for choosing the kernel width?
One possibility is to use cross-validation. Break up the data
into a training set and a validation set. Then, perform
training on the training set with varying bandwidths. Select
the bandwidth that minimizes the error on the validation
set.
There is little theoretical justification for choosing one
window width over another.
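A rough sketch of that cross-validation procedure for a two-class Parzen window classifier with a Gaussian kernel; the data split and candidate bandwidths are illustrative assumptions:

```python
import numpy as np

def kde(x, samples, h):
    """Gaussian-kernel Parzen density estimate at a scalar x."""
    u = (x - samples) / h
    return np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h))

def classify(x, class_samples, h):
    """Pick the class with the highest estimated density (equal priors assumed)."""
    return int(np.argmax([kde(x, s, h) for s in class_samples]))

rng = np.random.default_rng(3)
train = [rng.normal(-1, 1, 100), rng.normal(+1, 1, 100)]   # illustrative two-class 1-D data
val   = [rng.normal(-1, 1, 50),  rng.normal(+1, 1, 50)]

best_h, best_err = None, np.inf
for h in (0.05, 0.1, 0.3, 1.0, 3.0):                       # candidate bandwidths
    errors = sum(classify(x, train, h) != label
                 for label, points in enumerate(val) for x in points)
    if errors < best_err:
        best_h, best_err = h, errors
print(best_h, best_err / 100)    # bandwidth with the lowest validation error rate
```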
Nearest Neighbor Methods
Selecting the best window / bandwidth is a severe limiting
factor for Parzen window estimators.
Nearest-neighbor methods circumvent this problem by making the window size a function of the actual training data.
The basic idea here is to center our window around $x$ and let it grow until it captures $k_n$ samples, where $k_n$ is a function of $n$.
These samples are the $k_n$ nearest neighbors of $x$.
If the density is high near $x$, then the window will be relatively small, leading to good resolution.
If the density is low near $x$, the window will grow large, but it will stop soon after it enters regions of higher density.
In either case, we estimate $p_n(x)$ according to
$$p_n(x) = \frac{k_n / n}{V_n}$$
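A minimal one-dimensional sketch of this estimator, assuming $V_n$ is taken to be the length of the smallest interval around $x$ containing its $k_n$ nearest neighbors; the data and the choice $k_n = \sqrt{n}$ are illustrative:

```python
import numpy as np

def knn_density(x, samples, k):
    """k-nearest-neighbor density estimate: p(x) = (k/n) / V,
    where V is the width of the smallest interval around x containing the k nearest samples."""
    dists = np.sort(np.abs(samples - x))
    volume = 2.0 * dists[k - 1]            # interval radius = distance to the k-th neighbor
    return (k / len(samples)) / volume

rng = np.random.default_rng(4)
samples = rng.normal(0.0, 1.0, 1000)
print(knn_density(0.0, samples, k=int(np.sqrt(1000))))   # true value is about 0.399
```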
Nearest Neighbor Methods
We want $k_n$ to go to infinity as $n$ goes to infinity, thereby assuring us that $k_n/n$ will be a good estimate of the probability that a point will fall in the window of volume $V_n$.
But, we also want $k_n$ to grow sufficiently slowly so that the size of our window will go to zero.
Thus, we also want $k_n/n$ to go to zero.
Recall these conditions from the earlier discussion; they will ensure that $p_n(x)$ converges to $p(x)$ as $n$ approaches infinity.
Examples of $k_n$-Nearest-Neighbor Estimation
Notice the discontinuities in the slopes of the estimate.
Estimation From 1 Sample
But, as we increase the number of samples, the estimate will
improve.
Limitations
The $k_n$-nearest-neighbor estimator suffers from a flaw analogous to the one the Parzen window methods suffer from.
What is it? How do we specify $k_n$?
We saw earlier that the specification of $k_n$ can lead to radically different density estimates (in practical situations where the number of training samples is limited).
One could obtain a sequence of estimates by taking $k_n = k_1 \sqrt{n}$ and choosing different values of $k_1$.
But, like the Parzen window size, one choice is as good as another absent any additional information.
Similarly, in classification scenarios, we can base our judgement on classification error.
Posterior Estimation for Classification
We can directly apply these methods to estimate the posterior probabilities $P(\omega_i \mid x)$ from a set of $n$ labeled samples.
Place a window of volume $V$ around $x$ and capture $k$ samples, with $k_i$ of them turning out to have label $\omega_i$.
The estimate for the joint probability is thus
$$p_n(x, \omega_i) = \frac{k_i / n}{V}$$
and hence the posterior estimate is
$$P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_j p_n(x, \omega_j)} = \frac{k_i}{k}$$
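A small sketch of this posterior estimate for one-dimensional features; the data and the value of $k$ are illustrative:

```python
import numpy as np

def knn_posterior(x, samples, labels, k):
    """Estimate P(class | x) as k_i / k over the k nearest labeled samples."""
    nearest = np.argsort(np.abs(samples - x))[:k]
    counts = np.bincount(labels[nearest], minlength=labels.max() + 1)
    return counts / k

rng = np.random.default_rng(5)
samples = np.concatenate([rng.normal(-1, 1, 100), rng.normal(+1, 1, 100)])
labels = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
print(knn_posterior(0.8, samples, labels, k=15))   # posterior estimates for classes 0 and 1
```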
Example: Figure-Ground Discrimination
Source: Zhao and Davis. Iterative Figure-Ground Discrimination. ICPR 2004.
The Representation
Data-Driven Bandwidth
Initialization: Choosing the Initial Scale
For initialization, they compute a distance between the foreground and background distributions by varying the scale of a single Gaussian kernel (on the foreground).
To evaluate the "significance" of a particular scale, they compute a normalized KL-divergence between $p_f$ and $p_b$, the density estimates for the foreground and background regions respectively.
To compute each estimate, they use only a fraction of the pixels (using all of the pixels would lead to quite slow performance).
Iterative Sampling-Expectation Algorithm
Given the initial segmentation, they need to refine the
models and labels to adapt better to the image.
However, this is a chicken-and-egg problem. If we knew the labels, we could compute the models, and if we knew the models, we could compute the best labels.
They propose an EM algorithm for this. The basic idea is to
alternate between estimating the probability that each pixel
is of the two classes, and then given this probability to refine
the underlying models.
EM is guaranteed to converge (but only to a local optimum).
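A very rough sketch of the alternation described above, using Gaussian-kernel density estimates over a single pixel feature; the function names, sampling scheme, and parameters here are assumptions for illustration, not the implementation from Zhao and Davis:

```python
import numpy as np

def kde(x, samples, h):
    """Gaussian-kernel density estimates of a 1-D feature at each point in x."""
    u = (x[:, None] - samples[None, :]) / h
    return np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h), axis=1)

def sampling_expectation(features, fg_prob, h=0.1, iterations=10, sample_size=200):
    """Alternate between (1) building foreground/background KDE models from pixels sampled
    according to the current label probabilities and (2) re-estimating those probabilities."""
    rng = np.random.default_rng(0)
    for _ in range(iterations):
        # Sampling step: draw model pixels, weighted by the current soft labels
        fg = rng.choice(features, sample_size, p=fg_prob / fg_prob.sum())
        bg = rng.choice(features, sample_size, p=(1 - fg_prob) / (1 - fg_prob).sum())
        # Expectation step: re-estimate the per-pixel foreground probability
        p_fg, p_bg = kde(features, fg, h), kde(features, bg, h)
        fg_prob = p_fg / (p_fg + p_bg + 1e-12)
    return fg_prob

# Illustrative usage: fake pixel intensities and a crude initial soft segmentation
features = np.random.default_rng(1).uniform(0.0, 1.0, 5000)
refined = sampling_expectation(features, fg_prob=np.where(features > 0.5, 0.9, 0.1))
```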
Results
Summary
Advantages:
No assumptions are needed about the distributions ahead
of time (generality).
With enough samples, convergence to an arbitrarily
complicated target density can be obtained.
Disadvantages:
The number of samples needed may be very large
(number grows exponentially with the dimensionality of
the feature space).
There may be severe requirements for computation time
and storage.