Poisson Point Processes: Imaging, Tracking, and Sensing

Roy L. Streit

Roy L. Streit
Metron Inc.
1818 Library St
Reston, Virginia 20190-6281, USA
[email protected]
www.roystreit.com
Preface
that would have facilitated my own understanding, had I but known the material at
the outset.
I am indebted to many individuals and institutions for their help in writing this
book. I thank Prof. Don Tufts (University of Rhode Island) for the witty, but appro-
priate, phrase “the alternative tradition in signal processing.” This phrase captures
to my mind the novelty I feel about the subject of PPPs. I thank Dr. Wolfgang Koch
(Fraunhofer-FKIE/University of Bonn) for cluing me in to the splendid aphorism
that begins this book. It is a great moral support against the ubiquity of the advocates
of simulation. I thank Dr. Dale Blair (Georgia Tech Research Institute, GTRI) for
suggesting a tutorial as a way to socialize PPPs. That the project eventually became
a book is my own fault, not his.
I thank Dr. Keith Davidson (Office of Naval Research) for supporting the research
that became the basis of Chapter 6 on multitarget intensity tracking. I thank Metron,
Inc., for providing a nurturing mathematical environment that encouraged me to
explore seriously the various applications of PPPs. Such working environments are
the result of sustaining leadership and management over many years.
I thank Dr. Lawrence Stone, one of the founders of Metron, for many helpful
comments on early drafts of several chapters. These resulted in improvements of
content and clarity. I thank Dr. James Ferry (Metron) for helpful discussions over
many months. I have learned much from him. I thank Dr. Grant Boquet (Metron)
for his insight into wedge products, and for helping me to learn and use LaTeX. His
patience is remarkable. I also thank Dr. Lance Kaplan (US Army Research Labo-
ratory), Dr. Marcus Graham (US Naval Undersea Warfare Center), and Dr. Frank
Ehlers (NATO Undersea Research Centre, NURC) for their encouragement and
helpful comments on early drafts of the tutorial that started it all.
I thank my wife Nancy, our family Adam, Kristen, Andrew, and Katherine, and
our ever-hopeful four-legged companions, Sam and Eddie, for their steadfast love
and support. They are first to me, now and always.
Contents

1 Introduction
   1.1 Chapter Parade
      1.1.1 Part I: Fundamentals
      1.1.2 Part II: Applications
      1.1.3 Part III: Beyond the Poisson Point Process
      1.1.4 Appendices
   1.2 The Real Line Is Not Enough
   1.3 General Point Processes
   1.4 An Alternative Tradition

Part I Fundamentals
3 Intensity Estimation
   3.1 Maximum Likelihood Algorithms
      3.1.1 Necessary Conditions
      3.1.2 Gaussian Crosshairs and Edge Effects
   3.2 Superposed Intensities with Sample Data
      3.2.1 EM Method with Sample Data
      3.2.2 Interpreting the Weights
      3.2.3 Simple Examples
      3.2.4 Affine Gaussian Sums
   3.3 Superposed Intensities with Histogram Data
      3.3.1 EM Method with Histogram Data
      3.3.2 Affine Gaussian Sums
   3.4 Regularization
      3.4.1 Parametric Tying
      3.4.2 Bayesian Methods
Glossary
References
Index
Chapter 1
Introduction
Poisson point processes (PPPs) are very useful theoretical models for diverse appli-
cations involving the geometrical distribution of random occurrences of points in a
multidimensional space. Both the number of points and their locations are modeled
as random variables. Nonhomogeneous PPPs are designed specifically for applica-
tions in which spatial and/or temporal nonuniformity is important. Homogeneous
PPPs are idealized models useful only for applications involving spatial or temporal
uniformity.
The exposition shows that nonhomogeneous PPPs require little or no additional
conceptual and mathematical burden above that required by homogeneous PPPs.
1 Physics community folklore attributes this aphorism to Maxwell, but the available reference [68]
is to Kurt Lewin (1890-1947), a pioneer of social, organizational, and applied psychology.
important properties of PPPs follow from the simulation. This approach enables
those new to the subject to understand quickly what PPPs are about; however, it
is the reverse of the usual textbook approach in which the simulation procedure is
derived almost as an afterthought from a few “idealized” assumptions. The style
is informal throughout. The main properties of PPPs that are deemed most useful
to practitioners are discussed first. Many basic operations are applied to PPPs to
produce new point processes that are also PPPs. Several of the most important of
these operations are superposition, independent thinning, nonlinear mappings, and
stochastic transformations such as transition and measurement processes. Examples
are presented to assist understanding.
Chapter 3 discusses estimation problems for PPPs. The defining parameter of a
PPP is its intensity function, or intensity for short. When the intensity is known, the
PPP is fully characterized. In many applications the intensity is unknown and is esti-
mated from data. This chapter discusses the case when the form of intensity function
is specified in terms of a finite number of parameters. The estimation problem is to
determine appropriate values for these parameters from given measured data. Max-
imum likelihood (ML) and maximum a posteriori (MAP) methods are the primary
estimation methods explored in this book. ML algorithms for estimating an intensity
that is specified as a Gaussian sum are obtained by the method of Expectation-
Maximization (EM). Gaussian sums, when normalized to integrate to one, are called
Gaussian mixtures, and are widely used in applications to model probability density
functions (pdfs). Two different kinds of data are considered—PPP sample data that
comprise the points of a realization of a PPP, and histogram data that comprise only
the numbers of points that fall into a specified grid of histogram cells.
Chapter 4 explores the quality of estimators of intensity, where quality is quan-
tified in terms of the Cramér-Rao Bound (CRB). The CRB is a lower bound on the
variance of any unbiased estimator, and it is determined directly from the mathemat-
ical form of the likelihood function of the data. It is a remarkable fact that the CRB
for general PPP intensity estimation takes a simple form. The CRB of the Gaussian
sum intensity model in Chapter 3 is given in this chapter. The CRB is presented for
PPP sample data, sometimes called "count record" data, as well as for histogram data.
Sample data constitute a realization of the PPP, so the points are i.i.d. (independent
and identically distributed).
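To make the "simple form" concrete, here is a sketch of the standard expression for PPP sample data; the notation, with a parameter vector θ, is illustrative and may differ slightly from that used in Chapter 4. For a parametric intensity λ(x; θ) on the window R, the Fisher information matrix whose inverse bounds the covariance of any unbiased estimator θ̂ is

$$J(\theta) \;=\; \int_R \frac{1}{\lambda(x;\theta)}\, \bigl[\nabla_\theta \lambda(x;\theta)\bigr]\, \bigl[\nabla_\theta \lambda(x;\theta)\bigr]^{T}\, dx\,, \qquad \operatorname{Cov}\bigl(\hat{\theta}\bigr) \;\succeq\; J(\theta)^{-1}.$$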
Richardson in 1972 [102] and then again by Lucy in 1974 [72]. In these applications
it is known as the Richardson-Lucy algorithm.
SPECT (single photon emission computed tomography) is much more com-
monly used diagnostically than PET. The reconstructed image is estimated from
multiple snapshots made by a movable gamma camera. A reconstruction algorithm
for SPECT based on EM was derived by Miller, Snyder, and Miller [82] in 1985.
Loosely speaking, the algorithm averages several Shepp-Vardi algorithms, one for
each gamma camera snapshot; that is, it is a multi-snapshot average. It is presented
in Section 5.4.
Transmission tomography, commonly called computed tomography (CT), is discussed in
Section 5.5. A reconstruction algorithm based on EM was derived by Lange and
Carson [65] in 1984. While based on EM, its detailed structure differs significantly
from that of Shepp-Vardi algorithms. CRBs for PET and CT are presented in Sec-
tion 5.6.
Chapter 6 presents multitarget tracking applications of PPPs. The multitarget
state is modeled as a PPP. The Bayesian posterior point process is not a PPP, but
it is approximated by a PPP. It is called an intensity filter because it recursively
updates the intensity of the PPP approximation. An augmented state space enables
“on line” estimates of the clutter and target birth PPPs to be produced as intrinsic
parts of the filter. The information update of the intensity filter is seen to be identical
to the first step of the Shepp-Vardi algorithm for PET discussed in Chapter 5.
The PHD (Probability Hypothesis Density) filter is obtained from the intensity
filter by modifying the posterior PPP intensity with a priori knowledge of the clut-
ter and target birth PPPs. The PHD filter was first derived by other methods in
the multitarget application by Mahler in a series of papers beginning about 1994.
For details, see [76] and the papers referenced therein. The relationship between
Mahler’s method and the approach taken here is discussed.
The multisensor intensity filter is also presented in Chapter 6. The PPP multiple
target model is the same as in the single sensor case. The sensors are assumed condi-
tionally independent. The resulting intensity filter is essentially the same as the first
step of the Miller-Snyder-Miller algorithm for SPECT given in Chapter 5. Sensor
data are, by analogy, equivalent to the gamma camera snapshots.
Chapter 7 discusses distributed networked sensors using an approach based
on ideas from stochastic geometry, one of the classic roots of PPPs. These
results typically provide ensemble system performance estimates rather than a
performance estimate for a given sensor configuration. The contrast between
point-to-event and event-to-event properties is highlighted, and Slivnyak’s The-
orem is discussed as a method that relates these two concepts. Thresh-
old effects are dramatic and provide significant insights. For example, recent
results from geometric random graph theory show that—with very high
probability—randomly distributed networks very abruptly achieve communication
diversity as the sensor communication range increases. This result guarantees that
the overwhelming majority of random sensor distributions achieve (do not achieve)
communication diversity if the sensor communication range R is larger (smaller)
than some threshold, say R_thresh.
1.1.4 Appendices
Several appendices contain material that would interfere with the flow of the expo-
sition in one way or another. Others contain material that supplements the text.
whose realizations are sets in this class. For most applications, point processes are
defined on R^m, m ≥ 1.
A finite point process is a random variable whose realizations in any bounded
subset R of S are sets containing only a finite number of points of R. The number of
points and their locations can be chosen in many ways. An important subclass com-
prises finite point processes with independently and identically distributed (i.i.d.)
points.
There are many members of the class of i.i.d. finite point processes, at least two of
which are named processes. One is the binomial point process (BPP). The BPP is a
finite point process in which the number of points n is binomially distributed on the
integers {0, 1, . . . , K }, where K ≥ 0 is an integer parameter. The points of the BPP
are located according to a spatial random variable X on S with probability density
function (pdf) p X (x). Explicitly, for any bounded subset R ⊂ S, the probability of
n points occurring in R is
$$\Pr[n] \;=\; \binom{K}{n}\, p_R^{\,n}\, \bigl(1 - p_R\bigr)^{K-n}\,. \qquad (1.1)$$
this case often referred to as random finite sets. However, PPPs are also sometimes
defined on discrete spaces and discrete-continuous spaces (see Section 2.12). In
such cases, the discrete points of the PPP can be repeated with nonzero probability.
Sets do not, by definition, have repeated elements, so it is more accurate to speak of
the points in a PPP realization as a random finite list, or multiset, as such lists are
sometimes called. To avoid making too much of these subtleties, random finite sets
and lists are both referred to simply as PPP realizations.
PPPs constitute an alternative tradition to that of the more widely known stochastic
processes, with which they are sometimes confused. A stochastic process is a family
X (t) of random variables indexed by a parameter t that is usually, but not always,
thought of as time. The mathematics of stochastic processes, especially Gaussian
stochastic processes, is widely known and understood. They are pervasive, with
applications ranging from physics and engineering to finance and biology.
Poisson stochastic processes (again, not to be confused with PPPs) provided one
of the first models of white Gaussian noise. Specifically, in vacuum tubes, electrons
are emitted by a heated cathode and travel to the anode, where they contribute to
the anode current. The anode current is modeled as shot noise, and the fluctuations
of the current about the average are approximately a white Gaussian noise process
when the electron arrival rate is high [11, 101]. The emission times of the electrons
constitute a one dimensional PPP. Said another way, the occurrence times of the
jump discontinuities in the anode current are a realization of a PPP.
Wiener processes are also known as Brownian motion processes. The sample
paths are continuous, but nowhere differentiable. The process is sometimes thought
of more intuitively as integrated white Gaussian noise. Armed with this intuition it
may not be too surprising to the reader that nonhomogeneous PPPs approximate the
level crossings of the correlation function of Gaussian processes [48] and sequential
probability ratio tests.
The concept of independent increments is important for both point processes
and stochastic processes; however, the concept is not exactly the same for both. It is
therefore more appropriate to speak of independent scattering in point processes and
of independent increments for stochastic processes. As is seen in Chapter 2, every
PPP is an independent scattering process. In contrast, every independent increments
stochastic process is a linear combination of a Wiener process and a Poisson process
[119]. Further discussion is given in Section 2.9.1. The mathematics of PPPs is not
as widely known as that of stochastic processes.
PPPs have many properties such as linear superposition and invariance under
nonlinear transformation that are useful in various ways in many applications. These
fundamental properties are presented in the next chapter.
Part I
Fundamentals
Chapter 2
The Poisson Point Process
Readers new to PPPs are urged to read the first four subsections below in order.
After that, they are free to move about the chapter as their fancy dictates. There is
a lot of information here. It cannot be otherwise, for there are many wonderful and
useful properties of PPPs.
1 What he really said [27]: “It can scarcely be denied that the supreme goal of all theory is to make
the irreducible basic elements as simple and as few as possible without having to surrender the
adequate representation of a single datum of experience.”
The emphasis throughout the chapter is on the PPP itself, although applica-
tions are alluded to in several places. The event space of PPPs and other finite
point processes is described in Section 2.1. The concept of intensity is discussed
in Section 2.2. The important concept of orderliness is also defined. PPPs that are
orderly are discussed in Sections 2.3 through 2.11. PPPs that are not orderly are
discussed in the last section, which is largely devoted to PPPs on discrete and
discrete-continuous spaces.
The points of a PPP occur in the state space S. This space is usually the Euclidean
space, S = R^m, m ≥ 1, or some subset thereof. Discrete and discrete-continuous
spaces S are discussed in Section 2.12. PPPs can be defined on even more abstract
spaces, but this kind of generality is not needed for the applications discussed in this
book.
Realizations of PPPs on a subset R of S comprise the number n ≥ 0 and the
locations x1 , . . . , xn of the points in R. The realization is denoted by the ordered
pair
ξ = (n, {x1 , . . . , xn }) .
The set notation signifies only that the ordering of the points x j is irrelevant, but
not that the points are necessarily distinct. It is better to think of {x1 , . . . , xn } as an
unordered list. Such lists are sometimes called multisets. Context will always make
clear the intended usage, so for simplicity of language, the term set is used here and
throughout the book.
It is standard notation to include n explicitly in ξ even though n is determined
by the size of the set {x1 , . . . , xn }. There are many technical reasons to do so; for
instance, including n makes expectations easier to define and manipulate.
If n = 0, then ξ is the trivial event (0, ∅), where ∅ denotes the empty set. The
event space is the collection of all possible finite subsets of R:
$$\mathcal{E}(R) \;=\; \{(0, \emptyset)\} \,\cup\, \bigcup_{n=1}^{\infty} \bigl\{ (n, \{x_1, \ldots, x_n\}) : x_j \in R,\ j = 1, \ldots, n \bigr\}\,. \qquad (2.1)$$
The event space is clearly very much larger in some sense than the space S in which
the individual points reside.
2.2 Intensity
Every PPP is parameterized by a quantity called the intensity. Intensity is an intuitive
concept, but it takes different mathematical forms depending largely on whether the
state space S is continuous, discrete, or discrete-continuous. The continuous case
for all bounded subsets R of S, i.e., subsets contained in some finite radius m-
dimensional sphere. The sets R include—provided they are bounded—convex sets,
sets with “holes” and internal voids, disconnected sets such as the union of disjoint
spheres, and sets that are interwoven like chain links.
The intensity function λ(s) need not be continuous, e.g., it can have step dis-
continuities. The only requirement on λ(s) is the finiteness of the integral (2.2).
The special case of homogeneous PPPs on S = R^m with R = S shows that the
inequality (2.2) does not imply that ∫_S λ(s) ds < ∞. Finally, in physical problems,
the integral (2.2) is a dimensionless number, so λ(s) has units of number per unit
volume of R^m.
The intensity for general PPPs on the continuous space S takes the form

$$\lambda_D(s) \;=\; \lambda(s) \;+\; \sum_{j} w_j\, \delta(s - a_j)\,, \qquad (2.3)$$

where δ( · ) is the Dirac delta function and, for all j, the weights w_j are nonnegative
and the points a_j ∈ S are distinct: a_i ≠ a_j for i ≠ j. The intensity λ_D(s) is not a
function in the strict meaning of the term, but a “generalized” function. It is seen in
the next section that the PPP corresponding to the intensity λ D (s) is orderly if and
only if w j = 0 for all j; equivalently, a PPP is orderly if and only if the intensity
λ D (s) is a function, not a generalized function.
The concept of orderliness can be generalized so that finite point processes other
than PPPs can also be described as orderly. There are several nonequivalent defini-
tions of the general concept, as discussed in [118]; however, these variations are not
used here.
2.3 Realizations
The discussion in this section and through to Section 2.11 is implicitly restricted to
orderly PPPs, that is, to PPPs with a well defined intensity function on a continuous
space S ⊂ Rm . Realizations and other properties of PPPs on discrete and discrete-
continuous spaces are discussed in Section 2.12.
Realizations are conceptually straightforward to simulate for bounded subsets
of continuous spaces S ⊂ Rm . Bounded subsets are “windows” in which PPP
realizations are observed. Stipulating a window avoids issues with infinite sets; for
example, realizations of homogeneous PPPs on S = Rm have an infinite number
of points but only a finite number in any bounded window.
Every realization of a PPP on a bounded set R is an element of the event space
E(R). The realization ξ therefore comprises the number n ≥ 0 and the locations
{x1 , . . . , xn } of the points in R.
A two-step procedure, one step discrete and the other continuous, generates (or,
simulates) one realization ξ ∈ E(R) of a nonhomogeneous PPP with intensity λ(s)
on a bounded subset R of S. The procedure also fully reveals the basic statistical
structure of the PPP. If ∫_R λ(s) ds = 0, ξ is the trivial event. If ∫_R λ(s) ds > 0,
the realization is obtained as follows:

Step 1. Generate the number of points n ≥ 0 as a realization of the Poisson distributed random variable (2.4), i.e., Pr[n] = e^{-μ} μ^n / n! with mean μ = ∫_R λ(s) ds.

Step 2. Generate n i.i.d. points x_1, ..., x_n in R as realizations of a random variable X with pdf

$$p_X(s) \;=\; \frac{\lambda(s)}{\int_R \lambda(s)\, ds}\,, \qquad s \in R\,. \qquad (2.5)$$
The output is the ordered pair ξo = (n, (x1 , . . . , xn )). Replacing the ordered
n-tuple (x1 , . . . , xn ) with the set {x1 , . . . , xn } gives the PPP realization ξ =
(n, {x1 , . . . , xn }).
The careful distinction between ξo and ξ is made to avoid annoying, and some-
times confusing, problems later when order is important. For example, it is seen in
Section 2.4 that the pdfs (probability density functions) of ξo and ξ differ by a factor
of n! . Also, the points {x1 , . . . , xn } are i.i.d. when conditioned on the number n of
points. The conditioning on n is implicit in the statement of Step 2.
For continuous spaces S ⊂ Rm , an immediate consequence of Step 2 is that the
points {x1 , . . . , xn } are distinct with probability one: repeated elements are allowed
in theory, but in practice they never occur (with probability one). Another way to
say this is that the list, or multiset, {x1 , . . . , xn } is a set with probability one. The
statement fails to hold when the PPP is not orderly, that is, when the intensity (2.3)
has one or more Dirac delta function components. It also does not hold when the
state space S is discrete or discrete-continuous (see Section 2.12).
An acceptance-rejection procedure (see, e.g., [56]) is used to generate the i.i.d.
samples of (2.5). Let
$$\alpha \;=\; \max_{s \in R}\, \frac{p_X(s)}{g(s)}\,, \qquad (2.6)$$
where g(s) > 0 is any bounded pdf on R from which i.i.d. samples of R can
be generated via a known procedure. The function g( · ) is called the importance
function. For each point x with pdf g, compute t = p X (x)/(α g(x)). Next, generate
a uniform variate u on [0, 1] and compare u and t: if u > t, reject x; if u ≤ t, accept
it. The accepted samples are distributed as p X (x).
The acceptance-rejection procedure is inefficient for some problems, that is, large
numbers of i.i.d. samples from the pdf (2.5) may be drawn before finally accepting
n samples. As is well known, efficiency depends heavily on the choice of the impor-
tance function g( · ). Table 2.1 outlines the overall procedure and indicates how
the inefficiency can occur. If inefficiency is a concern, other numerical procedures
may be preferred in practice. Also, evaluating R λ(s) ds may require care in some
problems.
Table 2.1 Outline of the two-step procedure with acceptance-rejection sampling
• Step 1. Draw n from the Poisson distribution (2.4) with mean μ = ∫_R λ(s) ds.
   – IF n = 0, STOP.
• Step 2.
   – FOR j = 1 : c_eff · n
      • Draw random sample x with pdf g.
      • Compute t = λ(x) / (α μ g(x)).
      • Draw u uniformly on [0, 1]; accept x if u ≤ t, otherwise reject it.
   – Retain the first n accepted samples x_1, ..., x_n.
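The following is a minimal sketch of the two-step procedure with acceptance-rejection, written in Python with NumPy; the function names, the grid approximation of the integral, and the specific test intensity (loosely modeled on (2.8) without the notch) are illustrative assumptions, not the book's Table 2.1.

```python
import numpy as np

def sample_ppp(intensity, bound, region, rng, n_grid=200):
    """Draw one realization of a PPP on the rectangle region = (xlo, xhi, ylo, yhi).

    intensity : function (x, y) -> nonnegative intensity value
    bound     : any constant >= the maximum of the intensity on the region
    """
    xlo, xhi, ylo, yhi = region
    area = (xhi - xlo) * (yhi - ylo)

    # Step 1: the number of points is Poisson with mean mu = integral of the intensity.
    # Here mu is approximated on a grid; an exact integral can be used when available.
    xs = np.linspace(xlo, xhi, n_grid)
    ys = np.linspace(ylo, yhi, n_grid)
    mu = intensity(*np.meshgrid(xs, ys)).mean() * area
    n = rng.poisson(mu)

    # Step 2: draw n i.i.d. points with pdf lambda(s)/mu by acceptance-rejection,
    # using the uniform density on the region as the importance function g.
    points = []
    while len(points) < n:
        x = rng.uniform(xlo, xhi)
        y = rng.uniform(ylo, yhi)
        if rng.uniform() * bound <= intensity(x, y):
            points.append((x, y))
    return np.array(points)

rng = np.random.default_rng(0)
lam = lambda x, y: 20.0 / 64.0 + 80.0 * np.exp(-0.5 * (x**2 + y**2)) / (2 * np.pi)
pts = sample_ppp(lam, bound=20.0 / 64.0 + 80.0 / (2 * np.pi), region=(-4, 4, -4, 4), rng=rng)
print(len(pts), "points")
```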
Example 2.1 The two-step procedure is used to generate i.i.d. samples from a PPP
whose intensity function is nontrivially structured. These samples also show the
difficulty of observing this structure in small sample sets. Denote the multivariate
Gaussian pdf on Rm with mean μ and covariance matrix Σ by
$$N(s\,;\,\mu, \Sigma) \;=\; \frac{1}{\sqrt{\det(2\pi \Sigma)}}\, \exp\!\left( -\tfrac{1}{2}\, (s - \mu)^T \Sigma^{-1} (s - \mu) \right). \qquad (2.7)$$
$$\lambda(s) \;\equiv\; \lambda(x, y) \;=\; \frac{a}{64\,\sigma^2} \;+\; b\, f(x, y)\,, \qquad (2.8)$$
Fig. 2.1 Realizations of the PDF (2.9) of the intensity function (2.8) for σ = 1, a = 20, and b = 80. Samples are generated by the acceptance-rejection method. The prominent horizontal notch in the intensity is hard to see from the samples alone
For σ = 1, numerical integration gives the mean intensity μ = ∫_R λ(x, y) dx dy ≈ 92.25. A pseudo-random integer realization of the Poisson discrete variable (2.4) is n = 90, so 90 i.i.d. samples of the pdf (cf. (2.5))

$$p(x, y) \;=\; \frac{\lambda(x, y)}{\mu} \qquad (2.9)$$

are drawn via the acceptance-rejection procedure with g(x, y) = 1/(64 σ²). The pdf (2.9) is shown as a 3-D plot in Fig. 2.1a and as a set of equispaced contours in Fig. 2.1b. Figures 2.1c and 2.1d, respectively, show the 90 sample points with and without the intensity contours.
The horizontal “notch” is easily missed using these 90 samples in Fig. 2.1c. The
detailed structure of an intensity function can be estimated reliably only in special
circumstances, e.g., when a large number of realizations is available, or when the
PPP has a known parametric form (see Section 3.1).
2.4 Likelihood Function

The random variable Ξ with realizations in E(R) for every bounded subset R of
S is a PPP if its realizations are generated via the two-step procedure. Let pΞ (ξ )
denote the pdf of Ξ evaluated at Ξ = ξ . Let Ξ ≡ (N , X ), where N is the
number of points and X ≡ {x1 , . . . , x N } is the point set. Let the realization be
ξ = (n, {x_1, ..., x_n}). From the definition of conditioning,

$$p_\Xi(\xi) \;=\; p_N(n)\, p_{\mathcal{X}|N}(\{x_1, \ldots, x_n\} \mid n)\,. \qquad (2.10)$$

Because the points are i.i.d. conditioned on the number n,

$$p_{\mathcal{X}|N}(\{x_1, \ldots, x_n\} \mid n) \;=\; n!\, \prod_{j=1}^{n} p_X(x_j)\,, \qquad (2.11)$$
where X is the random variable corresponding to a single sample point whose pdf
is (2.5). The n! in (2.11) arises from the fact that there are n! equally likely ordered
i.i.d. trials that generate the unordered set X . Substituting (2.4) and (2.11) into (2.10)
gives the pdf of Ξ evaluated at ξ = (n, {x1 , . . . , xn }) ∈ E(R):
$$
\begin{aligned}
p_\Xi(\xi) &= p_N(n)\, p_{\mathcal{X}|N}(\{x_1, \ldots, x_n\} \mid n) \\
&= \exp\!\left( -\int_R \lambda(s)\, ds \right) \frac{\bigl( \int_R \lambda(s)\, ds \bigr)^n}{n!}\; n! \prod_{j=1}^{n} \frac{\lambda(x_j)}{\int_R \lambda(s)\, ds} \\
&= \exp\!\left( -\int_R \lambda(s)\, ds \right) \prod_{j=1}^{n} \lambda(x_j)\,, \qquad \text{for } n \ge 1. \qquad (2.12)
\end{aligned}
$$
$$
\begin{aligned}
p_{\mathcal{X}|N}(x_1, \ldots, x_n \mid n) &\equiv \frac{1}{n!}\, p_{\mathcal{X}|N}(\{x_1, \ldots, x_n\} \mid n) &\quad (2.13)\\
&= \prod_{j=1}^{n} p_X(x_j) &\quad (2.14)\\
&= \prod_{j=1}^{n} \frac{\lambda(x_j)}{\int_R \lambda(s)\, ds}\,. &\quad (2.15)
\end{aligned}
$$
Let ξ_o = (n, (x_1, ..., x_n)). Using (2.15) and the definition of conditioning gives

$$
\begin{aligned}
p_\Xi(\xi_o) &= p_N(n)\, p_{\mathcal{X}|N}(x_1, \ldots, x_n \mid n) &\quad (2.16)\\
&= \frac{1}{n!}\, \exp\!\left( -\int_R \lambda(s)\, ds \right) \prod_{j=1}^{n} \lambda(x_j)\,. &\quad (2.17)
\end{aligned}
$$
This notation interprets arguments in the usual way, so it is easier to understand and
manipulate than (2.12). For example, the discrete pdf p N (n) of (2.4) is merely the
integral of (2.17) over x_1, ..., x_n, but taking the same integral of (2.12) requires
additional thought to restore the missing n!.
The argument ξo in (2.17) is written simply as ξ below. This usage may cause
some confusion, since then the left hand side of (2.17) becomes pΞ (ξ ), which is
the same as the first equation in (2.12), a quantity that differs from it by a factor of
n!. A similar ambiguity arises from using the same subscript X | N on both sides of
(2.13). Context makes the intended meaning clear, so these abuses of notation will
not cause confusion.
In practice, when the number of points in a realization is very large, the points of
a PPP realization are often replaced by a smaller data set. If the smaller data set also
reduces the information content, the likelihood function obtained in this section no
longer applies. An example of a smaller data set (called histogram count data) and
its likelihood function is given in Section 2.9.1.
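For later reference, the following is a minimal sketch (Python with NumPy; the function and argument names are illustrative assumptions) of evaluating the log of the sample-data likelihood (2.12) for a given intensity function.

```python
import numpy as np

def ppp_loglik(points, intensity, integral_of_intensity):
    """Log of (2.12): minus the integral of lambda over R plus the sum of log lambda(x_j).

    points                : array of shape (n, d) holding one realization
    intensity             : function mapping an (n, d) array to n intensity values
    integral_of_intensity : value of the integral of lambda over the window R
    """
    return -integral_of_intensity + np.sum(np.log(intensity(points)))

# Example: homogeneous intensity lambda0 = 50 on the unit square.
rng = np.random.default_rng(1)
lambda0 = 50.0
pts = rng.uniform(size=(rng.poisson(lambda0), 2))
print(ppp_loglik(pts, lambda x: np.full(len(x), lambda0), lambda0))
```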
2.5 Expectations
Expectations are decidedly more interesting for point processes than for ordinary
random variables. Expectations are taken of real valued functions F defined on the
event space E(R), where R is a bounded subset of S. Thus F(ξ ) evaluates to a real
number for all ξ ∈ E(R). The expectation of F(ξ) is written in the very general form

$$E[F] \;=\; \sum_{\xi \,\in\, \mathcal{E}(R)} F(\xi)\, p_\Xi(\xi)\,,$$

where the sum, properly defined, is matched to the likelihood function of the point
process. In the case of PPPs, the likelihood function is that of the two-step simu-
lation procedure. The sum is often referred to as an “ensemble average” over all
realizations of the point process.
The sum is daunting because of the huge size of the set E(R). Defining the
expectation carefully is the first and foremost task of this section. The second is
to show that for PPPs the expectation, though fearsome, can be evaluated explicitly
for many functions of considerable application interest.
2.5.1 Definition
Let ξ = (n, {x1 , . . . , xn }). For analytical use, it is convenient to rewrite the
function F(ξ ) = F (n, {x1 , . . . , xn }) in terms of a function that uses an easily
understood argument list, that is, let
The sum in (2.21) is an odd looking discrete-continuous sum that needs interpreta-
tion. The conditional factorization

$$p_\Xi(\xi) \;=\; p_N(n)\, p_{\mathcal{X}|N}(x_1, \ldots, x_n \mid n)$$

gives the expectation the form

$$E[F] \;=\; \sum_{n=0}^{\infty} p_N(n) \int_R \cdots \int_R F(n, x_1, \ldots, x_n)\, p_{\mathcal{X}|N}(x_1, \ldots, x_n \mid n)\, dx_1 \cdots dx_n\,. \qquad (2.23)$$
The expectation is formidable, but it is not as bad as it looks. Its inherently straight-
forward structure is revealed by verifying that E[F] = 1 for F(n, x1 , . . . , xn ) ≡ 1.
The details of this trivial exercise are omitted.
The expectation of non-symmetric functions is undefined. The definition is extended, formally, to general functions, say G(n, x_1, ..., x_n), via the symmetrized version

$$G^{\mathrm{Sym}}(n, x_1, \ldots, x_n) \;=\; \frac{1}{n!} \sum_{\sigma} G\bigl(n, x_{\sigma(1)}, \ldots, x_{\sigma(n)}\bigr)\,, \qquad (2.24)$$

where the sum is over all n! permutations σ of the indices 1, ..., n.
The expectation of G is defined by E[G] = E G Sym . This definition works
because G Sym is a symmetric function of its arguments, a fact that is straightfor-
ward to verify. The definition is clearly compatible with the definition for symmetric
functions since G Sym (n, x1 , . . . , xn ) ≡ G(n, x1 , . . . , xn ) if G is symmetric.
The expectation is defined by (2.23) for any finite point process with events in
E (R), not just PPPs. For PPPs and other i.i.d. finite point processes (such as BPPs),
$$p_{\mathcal{X}|N}(x_1, \ldots, x_n \mid n) \;=\; \prod_{j=1}^{n} p_X(x_j)\,, \qquad (2.25)$$

so that

$$E[F] \;\equiv\; \sum_{n=0}^{\infty} p_N(n) \int_R \cdots \int_R F(n, x_1, \ldots, x_n) \prod_{j=1}^{n} p_X(x_j)\, dx_1 \cdots dx_n\,. \qquad (2.26)$$
PPPs are assumed throughout the remainder of this chapter, so the discrete proba-
bility distribution p N (n) and pdf p X (x) are given by (2.4) and (2.5).
The expected number of points in R is E[N (R)]. When the context clearly
identifies the set R, the expectation is written simply as E[N ]. By substituting
F(n, x_1, ..., x_n) ≡ n into (2.26) and observing that the integrals all integrate to one,

$$E[N] \;\equiv\; \sum_{n=0}^{\infty} n\, p_N(n) \;=\; \int_R \lambda(s)\, ds\,. \qquad (2.27)$$
Similarly, the variance of the number of points is

$$\mathrm{Var}[N] \;\equiv\; \sum_{n=0}^{\infty} n^2\, p_N(n) \;-\; \bigl(E[N]\bigr)^2 \;=\; \int_R \lambda(s)\, ds\,. \qquad (2.28)$$

The explicit sums in (2.27) and (2.28) are easily verified by direct calculation using (2.4).
Consider random sums of the form

$$F(\Xi) \;=\; \sum_{j=1}^{N} f(X_j)\,, \qquad (2.29)$$

that is, in terms of realizations,

$$F(n, x_1, \ldots, x_n) \;=\; \sum_{j=1}^{n} f(x_j)\,, \qquad \text{for } n \ge 1\,, \qquad (2.30)$$

and, for n = 0, F(0, ∅) ≡ 0. The special case of (2.30) for which f(x) ≡ 1 reduces to F(n, x_1, ..., x_n) = n, the number of points in R. The mean of F is given by
22 2 The Poisson Point Process
⎡ ⎤
N
E[F] = E ⎣ f (X j )⎦ (2.31)
j=1
= f (x) λ(x) dx . (2.32)
R
To verify (2.32), substitute (2.30) into (2.26):

$$E[F] \;=\; \sum_{n=0}^{\infty} p_N(n) \int_R \cdots \int_R \left( \sum_{j=1}^{n} f(x_j) \right) \prod_{j=1}^{n} p_X(x_j)\, dx_1 \cdots dx_n\,.$$

Because the points are i.i.d., each of the n terms of the inner sum contributes equally, so

$$E[F] \;=\; \sum_{n=0}^{\infty} p_N(n)\, n \int_R f(x)\, p_X(x)\, dx \;=\; E[N] \int_R f(x)\, p_X(x)\, dx \;=\; \int_R f(x)\, \lambda(x)\, dx\,.$$
Now let G be a second random sum,

$$G(\xi) \;=\; \sum_{j=1}^{n} g(x_j)\,, \qquad (2.33)$$

where g(x) is a real valued function. Then the expected value of the product is

$$E[F\,G] \;=\; \int_R f(x)\, \lambda(x)\, dx \int_R g(x)\, \lambda(x)\, dx \;+\; \int_R f(x)\, g(x)\, \lambda(x)\, dx\,. \qquad (2.34)$$
Before verifying this result in the next paragraph, note that since the means of F and G are determined as in (2.32), the result is equivalent to

$$\mathrm{Cov}[F, G] \;\equiv\; E[F\,G] - E[F]\, E[G] \;=\; \int_R f(x)\, g(x)\, \lambda(x)\, dx\,. \qquad (2.35)$$

Setting g = f gives the variance

$$\mathrm{Var}[F] \;=\; \int_R f^2(x)\, \lambda(x)\, dx\,. \qquad (2.36)$$

In the special case f(x) ≡ 1, (2.36) reduces to the variance (2.28) of the number of points in R.
The result (2.34) is verified by direct evaluation. Write
$$F(\xi)\, G(\xi) \;=\; \sum_{\substack{i, j = 1 \\ i \ne j}}^{n} f(x_i)\, g(x_j) \;+\; \sum_{j=1}^{n} f(x_j)\, g(x_j)\,. \qquad (2.37)$$
The second term in (2.37) is (2.30) with f (x j ) g(x j ) replacing f (x j ), so its expec-
tation is the second term of (2.34). The expectation of the first term is evaluated in
much the same way as (2.32); details are omitted. The identity (2.34) is sometimes
written
$$E\!\left[ \sum_{\substack{i, j = 1 \\ i \ne j}}^{N} f(X_i)\, g(X_j) \right] \;=\; \int_R f(x)\, \lambda(x)\, dx \int_R g(x)\, \lambda(x)\, dx\,. \qquad (2.38)$$
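The first and second moment formulas (2.32) and (2.34) are easy to check by simulation. The following is a quick Monte Carlo sketch (Python with NumPy; the homogeneous intensity and the test functions are illustrative choices, not from the book).

```python
import numpy as np

rng = np.random.default_rng(7)
lam, trials = 100.0, 20_000           # homogeneous intensity lam on [0, 1]
f = lambda x: x                        # test functions f(x) = x and g(x) = x**2
g = lambda x: x**2

F = np.empty(trials)
G = np.empty(trials)
for k in range(trials):
    pts = rng.uniform(size=rng.poisson(lam))   # one PPP realization on [0, 1]
    F[k], G[k] = f(pts).sum(), g(pts).sum()

# (2.32): E[F] = integral of f*lam = lam/2
# (2.34): E[FG] = (lam/2)(lam/3) + lam/4
print(F.mean(), lam / 2)
print((F * G).mean(), (lam / 2) * (lam / 3) + lam / 4)
```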
2 This section can be skipped entirely on a first reading of the chapter. The material presented is
used only in Chapter 8.
2.6 Campbell's Theorem

Under mild regularity conditions, Campbell's Theorem says that when θ is purely imaginary,

$$E\bigl[e^{\theta F}\bigr] \;=\; \exp\!\left( \int_R \bigl( e^{\theta f(x)} - 1 \bigr)\, \lambda(x)\, dx \right), \qquad (2.40)$$
R
where f (x) is a real valued function. The expectation exists for any complex θ for
which the integral converges. It is obtained by algebraic manipulation. Substitute
the explicit form (2.17) into the definition of expectation and churn:
$$
\begin{aligned}
E\bigl[e^{\theta F}\bigr] &= \sum_{n=0}^{\infty} p_N(n) \int_R \cdots \int_R e^{\theta \sum_{j=1}^{n} f(x_j)}\, p_{\mathcal{X}|N}(x_1, \ldots, x_n \mid n)\, dx_1 \cdots dx_n \\
&= e^{-\int_R \lambda(s)\, ds} \sum_{n=0}^{\infty} \frac{1}{n!} \int_R \cdots \int_R \left\{ \prod_{j=1}^{n} e^{\theta f(x_j)}\, \lambda(x_j) \right\} dx_1 \cdots dx_n \\
&= e^{-\int_R \lambda(s)\, ds} \sum_{n=0}^{\infty} \frac{1}{n!} \left( \int_R e^{\theta f(s)}\, \lambda(s)\, ds \right)^{\!n} \\
&= e^{-\int_R \lambda(s)\, ds}\, \exp\!\left( \int_R e^{\theta f(s)}\, \lambda(s)\, ds \right). \qquad (2.41)
\end{aligned}
$$
The last expression is obviously equivalent to (2.40). See [49, 57, 63] for further
discussion.
The characteristic function of F is given by (2.40) with θ = iω, where ω is
real and i = √−1, and R = ℝ. The convergence of the integral requires that the
Fourier transform of f exist as an ordinary function, i.e., it cannot be a generalized
function. As is well known, the moment generating function is closely related to the
characteristic function [93, Section 7.3]. Expanding the exponential gives
$$E\bigl[e^{i\omega F}\bigr] \;=\; E\!\left[ 1 + i\omega F + (i)^2\, \frac{\omega^2 F^2}{2!} + \cdots \right] \;=\; 1 + i\omega\, E[F] + (i)^2\, \frac{\omega^2}{2!}\, E[F^2] + \cdots,$$

so the moments of F are obtained by differentiation:

$$E\bigl[F^n\bigr] \;=\; (-i)^n \left. \frac{d^n}{d\omega^n}\, E\bigl[e^{i\omega F}\bigr] \right|_{\omega = 0}. \qquad (2.42)$$
The joint characteristic function of F and the random sum G of (2.33) is

$$E\bigl[e^{\,i\omega_1 F \,+\, i\omega_2 G}\bigr] \;=\; \exp\!\left( \int_R \bigl( e^{\,i\omega_1 f(x) \,+\, i\omega_2 g(x)} - 1 \bigr)\, \lambda(x)\, dx \right). \qquad (2.43)$$
To see this, simply use ω1 f (x) + ω2 g(x) in place of f (x) in (2.40). An imme-
diate by-product of this result is an expression for the joint moments of F and G.
Expanding (2.43) in a joint power series and assuming term by term integration is
valid gives
$$
\begin{aligned}
E\bigl[e^{\,i\omega_1 F + i\omega_2 G}\bigr] &= E\!\left[ 1 + i\omega_1 F + i\omega_2 G + \frac{(i\omega_1)^2}{2!} F^2 + (i\omega_1)(i\omega_2)\, F G + \frac{(i\omega_2)^2}{2!} G^2 + \cdots \right] \\
&= 1 + i\omega_1 E[F] + i\omega_2 E[G] + \frac{(i\omega_1)^2}{2!} E[F^2] + (i\omega_1)(i\omega_2)\, E[F G] + \frac{(i\omega_2)^2}{2!} E[G^2] + \cdots,
\end{aligned}
$$

where terms of order larger than two are omitted. Taking partial derivatives gives the joint moment of order (r, s) as

$$E\bigl[F^r G^s\bigr] \;=\; (-i)^{r+s} \left. \frac{\partial^r}{\partial \omega_1^r}\, \frac{\partial^s}{\partial \omega_2^s}\, E\bigl[e^{\,i\omega_1 F + i\omega_2 G}\bigr] \right|_{\omega_1 = \omega_2 = 0}. \qquad (2.44)$$
In particular, a direct calculation for the case r = s = 1 verifies the earlier result
(2.34).
The form (2.40) of the characteristic function also characterizes the PPP; that is,
a finite point process whose expectations of random sums satisfies (2.40) is neces-
sarily a PPP. The details are given in the next subsection.
Suppose that Ξ is a finite point process such that, for every nonnegative function f, the random sum

$$F(\Xi) \;=\; \sum_{j=1}^{N} f(X_j)\,, \qquad N \ge 1\,, \qquad (2.45)$$

satisfies

$$E\bigl[e^{-F}\bigr] \;=\; \exp\!\left( \int_R \bigl( e^{-f(x)} - 1 \bigr)\, \lambda(x)\, dx \right) \qquad (2.46)$$
for some nonnegative function λ(x). The goal is to show that (2.46) implies that
the finite point process Ξ is necessarily a PPP with intensity function λ(x). This is
done by showing that Ξ satisfies the independent scattering property for any finite
number k of sets A_j such that S = ∪_{j=1}^k A_j and A_i ∩ A_j = ∅ for i ≠ j.
Consider a nonnegative function f with values f 1 , f 2 , . . . , f k on the specified
sets A1 , A2 , . . . , Ak , respectively, so that
$$A_j \;=\; \{\, x : f(x) = f_j \,\}\,.$$

Let

$$m_j \;=\; \int_{A_j} \lambda(x)\, dx\,.$$
Observe that
$$\sum_{j=1}^{N} f(X_j) \;\equiv\; \sum_{j=1}^{k} f_j\, N(A_j)\,, \qquad (2.48)$$
where N (A j ) is the number of points in A j . For the given function f , the assumed
identity (2.46) is equivalent to
$$E\Bigl[ e^{-\sum_{j=1}^{k} f_j\, N(A_j)} \Bigr] \;=\; \exp\!\left( \sum_{j=1}^{k} \bigl( e^{-f_j} - 1 \bigr)\, m_j \right). \qquad (2.49)$$
Substituting z_j = e^{−f_j}, the identity (2.49) becomes

$$E\!\left[ \prod_{j=1}^{k} z_j^{\,N(A_j)} \right] \;=\; \prod_{j=1}^{k} e^{\,m_j (z_j - 1)}\,. \qquad (2.50)$$

By varying the choice of function values f_j ≥ 0, the result (2.50) is seen to hold for all z_j ∈ (0, 1).
The joint characteristic function of several random variables is the product
of the individual characteristic functions if and only if the random variables are
independent [93], and the characteristic function of the Poisson distribution with
mean m_j is (in this notation) e^{m_j (z_j − 1)}. Therefore, the counts N(A_j) are indepen-
dent and Poisson distributed with mean m j . Since the sets A j are arbitrary, the finite
point process Ξ is a PPP.
The class of functions for which the identity (2.46) holds must include the class
of all nonnegative functions that are piecewise constant, with arbitrarily specified
values f j , on an arbitrarily specified finite number of disjoint sets A j . The discus-
sion here is due to Kingman [63].
$$G_\Xi(f) \;=\; L_\Xi(-\log f) \;=\; E\!\left[ \exp\!\left( \sum_{j=1}^{N} \log f(X_j) \right) \right] \;=\; E\!\left[ \prod_{j=1}^{N} f(X_j) \right]. \qquad (2.52)$$
The probability generating functional is the analog for finite point processes of the
probability generating function for random variables.
The Laplace and probability generating functionals are defined for general finite
point processes Ξ , not just PPPs. If Ξ is a PPP with intensity function λ(x), then
$$G_\Xi(f) \;=\; \exp\!\left( \int_R \bigl( f(x) - 1 \bigr)\, \lambda(x)\, dx \right). \qquad (2.53)$$
2.7 Superposition
A very useful property of independent PPPs is that their sum is a PPP. Two PPPs
on S are superposed, or summed, if realizations of each are combined into one
event. Let Ξ and Υ denote these PPPs, and let their intensities be λ(s) and ν(s).
If (m, {x1 , . . . , xm }) and (n, {y1 , . . . , yn }) are realizations of Ξ and Υ , then the
combined event is (m + n, {x1 , . . . , xm , y1 , . . . , yn }). Knowledge of which points
originated from which realization is assumed lost.
The combined event is probabilistically equivalent to a realization of a PPP
whose intensity function is λ(s) + ν(s). To see this, let ξ = (r, {z 1 , . . . , zr }) ∈
E(R) be an event constructed in the manner just described. The partition of this
event into an m point realization of Ξ and an r − m point realization of Υ is
unknown. Let the set P_m and its complement P_m^c be such a partition, where
P_m ∪ P_m^c = {z_1, ..., z_r}. Let 𝒫_m denote the collection of all partitions of size m.
There are

$$\binom{r}{m} \;\equiv\; \frac{r!}{m!\,(r-m)!}$$

partitions in 𝒫_m. The pdf of the combined event ξ is

$$p(\xi) \;=\; \sum_{m=0}^{r} \frac{1}{\binom{r}{m}} \sum_{P_m \in \mathcal{P}_m} p_\Xi(m, P_m)\; p_\Upsilon\bigl(r - m,\, P_m^{\,c}\bigr)\,.$$
Substituting the PPP pdfs (2.17) for p_Ξ and p_Υ gives

$$p(\xi) \;=\; \frac{e^{-\mu}}{r!} \sum_{m=0}^{r}\; \sum_{P_m \in \mathcal{P}_m} \left( \prod_{z \in P_m} \lambda(z) \right) \left( \prod_{z \in P_m^{\,c}} \nu(z) \right),$$

where μ ≡ ∫_R (λ(s) + ν(s)) ds. The double sum in the last expression is recognized (after some thought) as an elaborate way to write an r-term product. Thus,
$$p(\xi) \;=\; \frac{e^{-\mu}}{r!} \prod_{i=1}^{r} \bigl( \lambda(z_i) + \nu(z_i) \bigr)\,. \qquad (2.54)$$
Comparing (2.54) to (2.12) shows that p(ξ ) is the pdf of a PPP with intensity func-
tion given by λ(s) + ν(s).
More refined methods that do not rely on partitions show that superposition holds
for a countable number of independent PPPs. The intensity of the superposed PPP
is the sum of the intensities of the constituent PPPs, provided the sum converges.
For details, see [63].
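The superposition property is immediate to exercise in simulation. The following is a minimal sketch (Python with NumPy; the helper function and the homogeneous intensities are illustrative assumptions): realizations of two independent PPPs are simply pooled, and the pooled count is Poisson with mean equal to the sum of the two mean counts.

```python
import numpy as np

rng = np.random.default_rng(2)

def homogeneous_ppp(rate, region, rng):
    """Realization of a homogeneous PPP with the given rate on a rectangle."""
    xlo, xhi, ylo, yhi = region
    n = rng.poisson(rate * (xhi - xlo) * (yhi - ylo))
    return np.column_stack([rng.uniform(xlo, xhi, n), rng.uniform(ylo, yhi, n)])

region = (0.0, 1.0, 0.0, 1.0)
xi = homogeneous_ppp(30.0, region, rng)       # intensity lambda
upsilon = homogeneous_ppp(20.0, region, rng)  # intensity nu
combined = np.vstack([xi, upsilon])           # superposition: intensity lambda + nu
print(len(xi), len(upsilon), len(combined))   # combined count ~ Poisson(50)
```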
The Central Limit Theorem for sums of random variables has an analog for point
processes called the Poisson Limit Theorem: the superposition of a large number
of “uniformly sparse” independent point processes converges in distribution to a
homogeneous PPP. These point processes need not be PPPs. The first statement and
proof of this difficult result dates to the mid-twentieth century. For details on R1 ,
see [62, 92]. The Poisson Limit Theorem also holds in the multidimensional case.
For these details, see [15, 40].
Example 2.2 Superposition of Gaussian Intensities. Consider the superposition of nine PPPs whose circular Gaussian intensities are centered on an equispaced grid:

$$\lambda_c(x, y) \;=\; \sum_{i \in \{-1,0,1\}}\; \sum_{j \in \{-1,0,1\}} c\, N\!\left( \begin{pmatrix} x \\ y \end{pmatrix} ;\, \begin{pmatrix} i \\ j \end{pmatrix},\, \sigma^2 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right). \qquad (2.55)$$
Fig. 2.2 Superposition of an equispaced grid of nine PPPs with circular Gaussian intensities (2.55) of equal weight and spread, σ = 1. Samples from the PPP components are generated independently and superposed to generate samples from the unimodal flat-topped intensity function
To see this, consider first the special case that α(x) is constant on R. Let
$$\mu \;=\; \int_R \lambda(x)\, dx\,, \qquad \mu_\alpha \;=\; \int_R \lambda_\alpha(x)\, dx\,, \qquad \beta \;=\; \frac{\mu_\alpha}{\mu}\,.$$
Conditioned on n points in the realization, the number m of points that survive the thinning is binomially distributed:

$$\Pr[m \mid n] \;=\; \binom{n}{m}\, \beta^m (1 - \beta)^{n-m}\,, \qquad m \le n\,.$$
$$
\begin{aligned}
\Pr[m] &= \sum_{n=m}^{\infty} \binom{n}{m}\, \beta^m (1 - \beta)^{n-m}\, \Pr[n] \qquad (2.57)\\
&= \sum_{n=m}^{\infty} \frac{n!}{m!\,(n-m)!}\, \beta^m (1 - \beta)^{n-m}\, \frac{\mu^n}{n!}\, e^{-\mu} \\
&= \frac{(\beta\mu)^m}{m!}\, e^{-\mu} \sum_{n=m}^{\infty} \frac{\bigl( (1 - \beta)\mu \bigr)^{n-m}}{(n-m)!} \\
&= \frac{(\beta\mu)^m}{m!}\, e^{-\beta\mu} \;\equiv\; \frac{\mu_\alpha^{\,m}}{m!}\, e^{-\mu_\alpha}\,.
\end{aligned}
$$
The probability that ξ_α has m points after thinning ξ is, by the preceding argument,
Poisson distributed with mean μ_α. The samples x′_1, ..., x′_m are i.i.d., and their
pdf on the cell R′ is λ_α(x)/μ_α. Now extend the intensity function from the cell R′ to
all of R by setting it to zero outside the cell. Superposing these cell-level PPPs and
taking the limit as cell size goes to zero shows that λ_α(x) is the intensity function on
the full set R. Further details are omitted.
An alternative demonstration exploits the acceptance-rejection method. Generate
a realization of the PPP with intensity function λ(x) from the homogeneous PPP
with intensity function Λ = max_{x∈R} λ(x). Redefine μ_α = ∫_R λ_α(x) dx, and let
|R| = ∫_R dx. The probability that no points remain in R after thinning by α(x) is

$$v(R) \;=\; \exp\!\left( -\int_R \lambda_\alpha(x)\, dx \right) \;=\; e^{-\mu_\alpha}\,.$$
The void probabilities v(R) for a sufficiently large class of “test” sets R characterize
a PPP, a fact whose proof is unfortunately outside the scope of the present book.
(A clean, relatively accessible derivation is given in [136, Theorem 1.2].) Given the
result, it is clear that the thinned process is a PPP with intensity function λα (x).
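Independent thinning is also straightforward to simulate. The following is a minimal sketch (Python with NumPy; the thinning function and the homogeneous parent intensity are illustrative assumptions): each point of a realization is kept with probability α(x), and the survivors form a realization of the PPP with intensity α(x)λ(x).

```python
import numpy as np

def thin(points, alpha, rng):
    """Independently keep each point x with probability alpha(x)."""
    keep = rng.uniform(size=len(points)) < np.array([alpha(p) for p in points])
    return points[keep]

rng = np.random.default_rng(3)
# Homogeneous PPP with intensity 200 on the unit square.
n = rng.poisson(200.0)
pts = rng.uniform(size=(n, 2))
# Thinning probability alpha(x, y) = x: survivors form a PPP with intensity 200*x.
survivors = thin(pts, lambda p: p[0], rng)
print(len(pts), "->", len(survivors))   # survivor count ~ Poisson(100)
```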
Example 2.3 Triple Thinning. The truncated and scaled zero-mean Gaussian inten-
sity function on the rectangle [−2σ, 2σ ] × [−2σ, 3σ ],
$$\lambda_c(x, y) \;=\; c\, N(x\,;\, 0, \sigma^2)\, N(y\,;\, 0, \sigma^2)\,,$$
is depicted in Fig. 2.3a for c = 2000 and σ = 1. Its mean intensity (i.e., the inte-
gral of λ2000 over the rectangle) is μ0 = 1862.99. Sampling the discrete Poisson
variate with mean μ0 gives, in this realization, 1892 points. Boundary conditions
are imposed by the thinning functions
Fig. 2.3 Triply thinning the Gaussian intensity function by (2.58) for σ = 1 and c = 2000 yields samples of an intensity with hard boundaries on three sides
$$
\begin{aligned}
\alpha_1(x, y) &= 1 - e^{-y} & &\text{if } y \ge 0 \\
\alpha_2(x, y) &= 1 - e^{\,x - 2} & &\text{if } x \le 2 \qquad (2.58)\\
\alpha_3(x, y) &= 1 - e^{-x - 2} & &\text{if } x \ge -2,
\end{aligned}
$$
where α j (x, y) = 0 for conditions not specified in (2.58). The overall thinning
function, α1 α2 α3 , is depicted in Fig. 2.3b overlaid on the surface corresponding to
λ1 . The intensity of the thinned PPP, namely α1 α2 α3 λ2000 , is nonzero only on the
rectangle [−2σ, 2σ ] × [0, 3σ ]. It is depicted in Fig. 2.3c. Thinning the 1892 points
of the realization of λ2000 leaves the 264 points depicted in Fig. 2.3d. These 264
points are statistically equivalent to a sample generated directly from the thinned
PPP. The mean thinned intensity is 283.19.
3 This name conveys genuine meaning in the point process context, but it seems of fairly recent
vintage [84, Section 3.1.2] and [123, p. 33]. It is more commonly called independent increments,
which can be confusing because the same name is used for a similar, but different, property of
stochastic processes. See Section 2.9.4.
Let A ⊂ R and B ⊂ R denote bounded subsets of R. The point processes Ξ(A) and Ξ(B)
are obtained by restricting realizations of Ξ to A and B, respectively. Simply put,
the points in ξ(A) are the points of ξ that are in A ∩ R, and the same for ξ(B).
This somewhat obscures the fact that the realizations ξ A and ξ B are obtained from
the same realization ξ . Intuition may suggest that constructing ξ A and ξ B from the
very same realization ξ will force the point processes Ξ (A) and Ξ (B) to be highly
correlated in some sense. Such intuition is in need of refinement, for it is incorrect.
This is the subtlety mentioned above.
Let ξ denote an arbitrary realization of a point process Ξ (A∪ B) on the set A∪ B.
The point process Ξ(A ∪ B) is an independent scattering process if

$$p_{\Xi(A \cup B)}(\xi) \;=\; p_{\Xi(A)}(\xi_A)\; p_{\Xi(B)}(\xi_B) \qquad (2.59)$$

for all disjoint subsets A and B of R, that is, for all subsets such that A ∩ B = ∅.
The pdfs in (2.59) are determined by the specific character of the point process,
so they are not in general those of a PPP. The product in (2.59) is the reason the
property is called independent scattering.
A nonhomogeneous multidimensional PPP is an independent scattering point
process. To see this it is only necessary to verify that (2.59) holds. Define thinning
probability functions, α(x) and β(x), by
$$\alpha(x) \;=\; \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A \end{cases} \qquad \text{and} \qquad \beta(x) \;=\; \begin{cases} 1, & \text{if } x \in B \\ 0, & \text{if } x \notin B. \end{cases}$$
The point processes Ξ (A) and Ξ (B) are obtained by α-thinning and β-thinning
realizations ξ of the PPP Ξ (A ∪ B), so they are PPPs. Let λ(x) be the intensity
function of the PPP Ξ (A ∪ B). Let ξ = (n, {x1 , . . . , xn }) be an arbitrary realiza-
tion of Ξ (A ∪ B). The pdf of ξ is, from (2.12),
$$p_{\Xi(A \cup B)}(\xi) \;=\; e^{-\int_{A \cup B} \lambda(x)\, dx} \prod_{j=1}^{n} \lambda(x_j)\,. \qquad (2.60)$$
Because the points of the α-thinned and β-thinned realizations are on disjoint sets
A and B, the realizations ξ_A = (i, {y_1, ..., y_i}) and ξ_B = (k, {z_1, ..., z_k}) are
necessarily such that i + k = n and

$$\{y_1, \ldots, y_i\} \cup \{z_1, \ldots, z_k\} \;=\; \{x_1, \ldots, x_n\}\,.$$
Because Ξ (A) and Ξ (B) are PPPs, the pdfs of ξ A and ξ B are
$$p_{\Xi(A)}(\xi_A) \;=\; e^{-\int_A \lambda(x)\, dx} \prod_{j=1}^{i} \lambda(y_j)\,, \qquad p_{\Xi(B)}(\xi_B) \;=\; e^{-\int_B \lambda(x)\, dx} \prod_{j=1}^{k} \lambda(z_j)\,.$$
The product of these two pdfs is clearly equal to that of (2.60). The key elements of
the argument are that the thinned processes are PPPs, and that the thinned realiza-
tions are free of overlap when the sets are disjoint. The argument extends easily to
any finite number of disjoint sets.
Example 2.4 Likelihood Function for Histogram Data. A fine illustration of the util-
ity of independent scattering is the way it makes the pdf of histogram data easy to
determine. Denote the cells of a histogram by R1 , . . . , R K , K ≥ 1. The cells are
assumed disjoint, so R j ∩ R j = ∅ for i = j. Histogram data are nonnegative
integers that count the number of points of a realization of a point process that fall
within the various cells. No record is kept of the locations of the points within any
cell. Histogram data are very useful for compressing large volumes of sample (point)
data.
Denote the histogram data by n 1:K ≡ {n 1 , . . . , n K }, where n j ≥ 0 is the number
of points of the process that lie in R j . Let the point process Ξ be a PPP, and let
Ξ (R j ) denote
the PPP obtained by restricting Ξ to R j . The intensity function of
Ξ(R_j) is ∫_{R_j} λ(s) ds. The histogram cells are disjoint. By independent scattering,
the PPPs Ξ (R1 ), . . . , Ξ (R K ) are independent and the pdf of the histogram data is
$$
\begin{aligned}
p(n_{1:K}) &= \prod_{j=1}^{K} \exp\!\left( -\int_{R_j} \lambda(s)\, ds \right) \frac{\Bigl( \int_{R_j} \lambda(s)\, ds \Bigr)^{n_j}}{n_j!} \qquad (2.61)\\
&= \exp\!\left( -\int_{R} \lambda(s)\, ds \right) \prod_{j=1}^{K} \frac{\Bigl( \int_{R_j} \lambda(s)\, ds \Bigr)^{n_j}}{n_j!}\,, \qquad (2.62)
\end{aligned}
$$
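The histogram-data log-likelihood implied by (2.62) is also simple to compute numerically. The following is a minimal sketch (Python with NumPy; the function name and the example cell means are illustrative assumptions).

```python
import numpy as np
from math import lgamma

def histogram_loglik(counts, cell_means):
    """Log of (2.62): counts n_j and cell means m_j = integral of lambda over cell R_j."""
    counts = np.asarray(counts, dtype=float)
    cell_means = np.asarray(cell_means, dtype=float)
    log_fact = np.array([lgamma(n + 1.0) for n in counts])
    return np.sum(-cell_means + counts * np.log(cell_means) - log_fact)

# Example: 4 cells with means 2, 5, 1, 3 and observed counts 1, 6, 0, 2.
print(histogram_loglik([1, 6, 0, 2], [2.0, 5.0, 1.0, 3.0]))
```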
The point process has realizations in the event space E([0, 1]), but it is not a PPP
because of the way the points are sampled for n = 3.
For any c ∈ [0, 1], define the random variable
$$X_c(x) \;=\; \begin{cases} 1, & \text{if } x < c \\ 0, & \text{if } x \ge c. \end{cases} \qquad (2.64)$$
The number of points in a realization of the point process in the interval [a, b]
conditioned on n points in [0, 1] is
$$G_n(a, b, m) \;=\; \Pr\bigl[\, \text{exactly } m \text{ points of } \{x_1, \ldots, x_n\} \text{ are in } [a, b] \,\bigr]\,. \qquad (2.65)$$
$$N_h + N_t \;=\; n\,.$$

$$N_h + N_t \;=\; N$$
holds, but it is not enough to induce any dependence whatever between Nh and Nt .
This property is counterintuitive when first encountered, but it plays an important
role in many applications. To give it a name, since one seems to be lacking in the
literature, Poisson’s gambit4 is the assumption that the number of Bernoulli trials is
Poisson distributed. Poisson’s gambit is realistic in many applications, but in oth-
ers it is only an approximation. The name is somewhat whimsical—it is not used
elsewhere in the literature.
Invoking Poisson’s gambit, the number N is an integer valued, Poisson dis-
tributed random variable with intensity λ > 0. Sampling N gives the length n of
the sequence of Bernoulli trials performed. Then n = n h + n t , where n h and n t
are the observed numbers of heads and tails. The random variables Nh and Nt are
independent Poisson distributed with mean intensities pλ and (1− p)λ, respectively.
To see this, note that the probability of a Poisson distributed number of n Bernoulli
trials with outcomes n_h and n_t is

$$\Pr[n_h, n_t] \;=\; \frac{n!}{n_h!\, n_t!}\, p^{\,n_h} (1-p)^{\,n_t}\; e^{-\lambda}\, \frac{\lambda^{\,n}}{n!} \;=\; \left( e^{-p\lambda}\, \frac{(p\lambda)^{\,n_h}}{n_h!} \right) \left( e^{-(1-p)\lambda}\, \frac{\bigl( (1-p)\lambda \bigr)^{\,n_t}}{n_t!} \right).$$
4 A gambit in chess involves sacrifice or risk with hope of gain. The sacrifice here is loss of control
over the number of Bernoulli trials, and the gain is independence of the numbers of different
outcomes.
The final product is the statement that the numbers of heads and tails have independent
Poisson distributions with the required parameters. For further comments, see, e.g.,
[52, Section 9.3] or [42, p. 48].
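Poisson's gambit is easy to see numerically. The following is a small simulation sketch (Python with NumPy; the parameter values are illustrative): when the number of Bernoulli trials is Poisson, the head and tail counts are independent Poisson variables.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, p, trials = 10.0, 0.3, 200_000

n = rng.poisson(lam, size=trials)        # Poisson number of Bernoulli trials
heads = rng.binomial(n, p)               # heads among the n trials
tails = n - heads

print(heads.mean(), tails.mean())        # ~ p*lam and (1-p)*lam
print(np.corrcoef(heads, tails)[0, 1])   # ~ 0: the two counts are uncorrelated
```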
Example 2.6 Independence of Thinned and Culled PPPs. The points of a PPP that
are retained and those that are culled during Bernoulli thinning are both PPPs. Their
intensities are p(x)λ(x) and (1 − p(x))λ(x), respectively, where p(x) is the prob-
ability that a point at x ∈ S is retained. Poisson’s gambit implies that the numbers
of points in these two PPPs are independent. Step 2 of the realization procedure
guarantees that the sample points of the two processes are independent. The
thinned and culled PPPs are therefore independent, and superposing them recovers
the original PPP, since the intensity function of the superposition is the sum of the
component intensities. In other words, splitting a PPP into two parts using Bernoulli
thinning, and subsequently merging the parts via superposition recovers the original
PPP.
Example 2.7 Coloring Theorem. Replace the Bernoulli trials in Example 2.6 by
independent multinomial trials with k ≥ 2 different outcomes, called “colors” in
[63, Chapter 5], with probabilities { p1 (x), . . . , pk (x)}, where
p1 (x) + · · · + pk (x) = 1 .
Poisson’s gambit and Step 2 of the realization procedure shows that the PPPs inde-
pendent. The intensity of their superposition is
k
λ j (x) = p j (x) λ(x) = λ(x),
j=1 j=1
If an orderly point process satisfies the independent scattering property and the num-
ber of points in any bounded set R is finite and not identically zero (with probability
one), then the number of points of the process in a given set R is necessarily Poisson
distributed—the Poisson distribution is inevitable (as Kingman wryly observes).
This result shows that if the number of points in realizations of the point process is
not Poisson distributed for even one set R, then it is not an independent scattering
process, and hence not a PPP. To see this, a physics-style argument (due to Kingman
[63, pp. 9–10]) is adopted.
Given a set A ≠ ∅ with no "holes", or voids, define the family of sets A_t, t ≥ 0, by

$$A_t \;=\; \bigcup_{a \in A} \bigl\{\, x \in \mathbb{R}^m : \|x - a\| \le t \,\bigr\}\,,$$

where ‖·‖ is the usual Euclidean distance. Because A has no voids, the boundary
of At encloses the boundary of As if t > s. Let
$$p_n(t) \;=\; \Pr\bigl[ N(A_t) = n \bigr] \qquad \text{and} \qquad q_n(t) \;=\; \Pr\bigl[ N(A_t) \le n \bigr]\,,$$
where N (At ) is the random variable that equals the number of points in a realization
that lie in At . The point process is orderly, so it is assumed that the function pn (t)
is differentiable. Let
μ(t) ≡ E [N (At )] .
Finding an explicit mathematical form for this expectation is not the goal here. The
goal is to show that
$$p_n(t) \;=\; e^{-\mu(t)}\, \frac{\mu^n(t)}{n!}\,.$$
$$A_t^h \;=\; A_{t+h} \setminus A_t\,.$$
Another way to write this probability uses independent scattering. For sufficiently
small h > 0, the probability that one point falls in Ath is
$$\mu(t+h) - \mu(t) \;=\; \Pr\bigl[ N(A_t^h) = 1 \bigr] \;\ge\; 0\,.$$
$$-\frac{d p_0(t)}{dt} \;=\; \frac{d\mu(t)}{dt}\, p_0(t) \qquad \Longleftrightarrow \qquad \frac{d}{dt}\bigl( \mu(t) + \log p_0(t) \bigr) \;=\; 0\,.$$
where the last step follows from pn (t) = qn (t) − qn−1 (t). Multiplying both sides
by e μ(t) and using the product differentiation rule gives
$$\frac{d}{dt}\Bigl( p_n(t)\, e^{\mu(t)} \Bigr) \;=\; \frac{d\mu(t)}{dt}\, p_{n-1}(t)\, e^{\mu(t)}\,.$$
Solving the recursion starting with (2.70) gives p_n(t) = e^{−μ(t)} μ^n(t)/n!, the Poisson density (2.4) with mean μ(t).
The class of sets without voids is a very large class of “test” sets. To see that
the Poisson distribution is inevitable for more general sets requires more elaborate
theoretical methods. Such methods are conceptually lovely and mathematically rig-
orous. They confirm but do not deepen the insights provided by the physics-style
argument, so they are not presented here.
where
$$\Lambda(t) \;=\; \int_{t_0}^{t} \lambda(\tau)\, d\tau\,, \qquad t \ge t_0\,.$$
The interarrival times are identically exponentially distributed if the PPP is homo-
geneous. Explicitly, for λ(t) ≡ λ0 ,
$$p_{t_{j-1}}(\tau) \;\equiv\; p_0(\tau) \;=\; \lambda_0\, e^{-\lambda_0 \tau}\,.$$
Because of the independent scattering property of PPPs, the interarrival times are also
independent in this case.
In contrast to the discontinuous sample paths of the Poisson process, the sample
paths of the Wiener process are continuous with probability one. For Wiener pro-
cesses, the random variable X (t1 ) is zero mean Gaussian distributed with variance
42 2 The Poisson Point Process
$$\nu(y) \;=\; \frac{1}{|A|}\, \lambda\bigl( A^{-1}(y - b) \bigr)\,, \qquad (2.75)$$
[100, Chapter 4]. Suppose that Ξ is a PPP with intensity function λ(x) > 0 for
all x ∈ S ≡ R¹, and let

$$y \;=\; f(x) \;=\; \int_0^x \lambda(t)\, dt \qquad \text{for } -\infty < x < \infty\,. \qquad (2.76)$$
The point process f (Ξ ) is a PPP with intensity one. To see this, use (2.74) to obtain
$$\nu(y) \;=\; \frac{\lambda\bigl( f^{-1}(y) \bigr)}{\bigl|\, \partial f(x)/\partial x \,\bigr|} \;=\; \frac{\lambda(x)}{\lambda(x)} \;=\; 1\,,$$

where the chain rule is used to show that |∂f^{-1}(y)/∂y| = 1/|∂f(x)/∂x|. An
alternative, but more direct, way to see the same thing is to observe that since f is
monotone, its inverse exists and the mean number of points in any bounded interval
[a, b] is

$$\int_{f^{-1}(a)}^{f^{-1}(b)} df(x) \;=\; \int_a^b dy \;=\; b - a\,. \qquad (2.77)$$
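The mapping (2.76) also gives a practical simulation recipe for one-dimensional nonhomogeneous PPPs: generate a unit-intensity PPP on the y axis and map its points back through f^{-1}. A minimal sketch follows (Python with NumPy; the particular intensity λ(t) = 2t on [0, T], for which f(t) = t² and f^{-1}(y) = √y, is an illustrative choice).

```python
import numpy as np

rng = np.random.default_rng(5)
T = 5.0                          # window [0, T]
mu = T**2                        # f(T) = integral of lambda(t) = 2t over [0, T]

# Unit-intensity PPP on [0, mu]: Poisson count, then i.i.d. uniform locations.
n = rng.poisson(mu)
y = np.sort(rng.uniform(0.0, mu, n))

# Map back through the inverse of f(t) = t**2.
t = np.sqrt(y)                   # realization of the PPP with intensity lambda(t) = 2t
print(n, t[:5])
```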
are all commensurate, that is, all have the same intrinsic dimension. For these func-
tions, if Ξ is a PPP, then so is f (Ξ ). The intensity function of f (Ξ ) is
$$\nu(y) \;=\; \int_{M(y)} \lambda\bigl( f^{-1}(y) \bigr)\, dM(y)\,, \qquad (2.81)$$
where dM(y) is the differential in the tangent space at the point f −1 (y) of the set
M(y). The special case of projection mappings provides the basic intuitive insight
into the nonlinear mapping property of PPPs. To see that the result holds requires
a more careful and mathematically subtle analysis than is deemed appropriate here.
See [63, Section 2.3] for further details.
In practice, the sets M(y) are commensurate for most nonlinear mappings. For
example, it is easy to see that the projections have this property. However, some
nonlinear functions do not. As the next example shows, the problem with forbidden
mappings is that they lead to “intensities” that are generalized functions.
Example 2.10 A Forbidden Nonlinear Mapping. The sets M(y) of the function f :
R2 → R1 defined by
$$y \;=\; f(x_1, x_2) \;=\; \begin{cases} 0, & \text{if } x_1^2 + x_2^2 < 1 \\[3pt] \sqrt{x_1^2 + x_2^2} - 1, & \text{if } x_1^2 + x_2^2 \ge 1 \end{cases}$$
and

$$\nu(y) \;=\; \int_{M(y)} 1\, d\theta \;=\; 2\pi\,(y + 1)\,, \qquad y > 0\,.$$
This gives

$$\nu(y) \;=\; 2\pi\,(y + 1) \;+\; \pi\,\delta(y)\,, \qquad y \ge 0\,,$$

a generalized function: the entire unit disk, of area π, is mapped to the single point y = 0.
Example 2.11 Polar Coordinate Projections. The change of variables from Carte-
sian to polar coordinates in the plane, given by
$$(y_1, y_2) \;=\; f(x_1, x_2) \;\equiv\; \Bigl( \bigl( x_1^2 + x_2^2 \bigr)^{1/2},\; \arctan(x_1, x_2) \Bigr),$$

maps a PPP with intensity function λ(x_1, x_2) on R² to a PPP with intensity function

$$\nu(y_1, y_2) \;=\; y_1\, \lambda\bigl( y_1 \cos y_2,\; y_1 \sin y_2 \bigr)\,.$$

If λ(x_1, x_2) ≡ 1, then ν(y_1, y_2) = y_1. From (2.79), the projection onto the range
y_1 gives a PPP on [0, ∞) ⊂ R¹ with intensity function ν(y_1) = 2π y_1, and the
projection onto the angle y_2 is of infinite intensity on [0, 2π]. Alternatively, if
λ(x_1, x_2) = (x_1² + x_2²)^{−1/2}, then ν(y_1, y_2) ≡ 1. The projection onto range is
ν(y_1) = 2π; the projection onto angle is ∞.
2.11 Stochastic Transformations

A PPP that undergoes a Markovian transition remains a PPP. Let Ψ be the transition
pdf, so that the likelihood that the point x in the state space S transforms to the
point y ∈ S is Ψ (y | x). Let Ξ be the PPP on S with intensity function λ(s), and let
ξ = (m, {x1 , . . . , xm }) be a realization of Ξ . After transitioning the constituent
points, this realization is η ≡ (m, {y1 , . . . , ym }), where y j is a realization of the
pdf Ψ ( · | x j ), j = 1, . . . , m. The realizations {y j } are independent. The transition
process, denoted by Ψ (Ξ ), is a PPP on S with intensity function
$$\nu(y) \;=\; \int_S \Psi(y \mid x)\, \lambda(x)\, dx\,. \qquad (2.83)$$
To see this, let R be any bounded subset of S. Let μ = ∫_R λ(s) ds and observe
that the likelihood of the transition event η is, by construction,

$$
\begin{aligned}
p(\eta) &= \int_R \cdots \int_R \left( \prod_{j=1}^{m} \Psi(y_j \mid x_j) \right) p_\Xi(m, \{x_1, \ldots, x_m\})\, dx_1 \cdots dx_m \\
&= \frac{e^{-\mu}}{m!} \int_R \cdots \int_R \left( \prod_{j=1}^{m} \Psi(y_j \mid x_j) \right) \left( \prod_{j=1}^{m} \lambda(x_j) \right) dx_1 \cdots dx_m \\
&= \frac{e^{-\mu}}{m!} \prod_{j=1}^{m} \int_R \Psi(y_j \mid x_j)\, \lambda(x_j)\, dx_j\,.
\end{aligned}
$$

By the definition (2.83) of ν,

$$p(\eta) \;=\; \frac{e^{-\mu}}{m!} \prod_{j=1}^{m} \nu(y_j)\,. \qquad (2.84)$$
Since

$$\int_R \nu(y)\, dy \;=\; \int_R \int_R \Psi(y \mid x)\, \lambda(x)\, dx\, dy \;=\; \int_R \lambda(x)\, dx \;=\; \mu\,, \qquad (2.85)$$
it follows from (2.12) that the transition Poisson process Ψ (Ξ ) is also a PPP.
A common sensor model takes the form

$$z \;=\; h(x) + w\,,$$
where h(x) is the measurement the sensor produces of a target at x in the absence
of noise, and the error w is zero mean Gaussian distributed with covariance matrix
Σ. The conditional pdf form of the very same equation is N(z | h(x), Σ), denoted here
by ℓ(z | x). The pdf form is general and not limited to additive noise, so it is used here.
Because ℓ(z | x) is a pdf,

$$\int_T \ell(y \mid x)\, dy \;=\; 1$$

for every x ∈ S.
Now, as in the previous section, let ξ = (m, {x₁, . . . , x_m}) be the PPP realiza-
tion and λ(x) the PPP intensity function. Each point x_j is observed by a sensor. The
sensor generates a measurement z_j ∈ T ≡ R^κ, κ ≥ 1, for the target x_j. The pdf of
this measurement is ℓ(y | x). In words, ℓ(z_j | x_j) is the pdf of z_j conditioned on x_j.
Let η = (m, {z₁, . . . , z_m}). Then η is a realization of a PPP defined on the range
T of the pdf ℓ. To see this, it is only necessary to follow the same reasoning used to
establish (2.83). The intensity function of this PPP is

    ν(y) = ∫_S ℓ(y | x) λ(x) dx ,    y ∈ T .    (2.86)

The PPP ℓ(Ξ) is called a "measurement" process because it includes the effects of
measurement errors. It is also an appropriate name for many applications, including
tracking. (It is called a translated process in [119, Chapter 3].)
Example 2.12 PPP Target Modeling. This example is multi-purpose. At the sim-
plest level, it is merely an example of a measurement process. Another purpose is
described shortly. For concreteness, the example is presented in terms of an active
sonar sensor. Such sensors generate a measurement of target location by transmitting
a “ping” and detecting the same ping after it reflects off a target, e.g., a ship. The sen-
sor estimates target direction θ from the arrival angle of the reflected ping, and it esti-
mates range r from the travel time difference between the transmitted and reflected
ping. In two dimensions, target measurements are range, r = (x 2 + y 2 )1/2 , and
angle, θ = arctan(x, y). In the notation above,
    h(x, y) = [ (x² + y²)^{1/2} , arctan(x, y) ]ᵀ .    (2.87)
The errors in these measurements are assumed to be additive zero mean Gaussian
distributed with variances σr2 and σθ2 , respectively. The measurement pdf condi-
tioned on target state is therefore
    ℓ(r, θ | x, y) = N( r ; (x² + y²)^{1/2} , σ_r² ) N( θ ; arctan(x, y) , σ_θ² ) .    (2.88)
Fig. 2.4 (four panels a–d) The predicted measurement PPP intensity function in polar coordinates
of a Gaussian-shaped PPP intensity function in the x-y plane: σ_x = σ_y = 1, σ_r = 0.1,
σ_θ = 0.15 (radians), and c = 200, x₀ = 6, y₀ = 0
Figure 2.4a, b give the intensities (2.89) and (2.90), respectively. A realization of the
PPP with intensity function λc (x, y) generated by the two step procedure is given
in Fig. 2.4c. Randomly perturbing each of these samples gives the realization in
Fig. 2.4d. The predicted intensity ν(r, θ ) is nearly Gaussian in the r -θ plane. If the
likelihood function (2.88) is truncated to the semi-infinite strip (2.82), the predicted
intensity (2.90) is also restricted to the semi-infinite strip.
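A minimal sketch of the two step procedure behind Fig. 2.4 follows (Python/NumPy). It assumes the planar intensity is λ_c(x, y) = c N((x, y) ; (x₀, y₀), diag(σ_x², σ_y²)) with the parameter values quoted in the figure caption; equations (2.89) and (2.90) themselves are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Parameters quoted in the Fig. 2.4 caption (Gaussian-shaped planar intensity).
c, x0, y0 = 200, 6.0, 0.0
sigma_x = sigma_y = 1.0
sigma_r, sigma_theta = 0.1, 0.15

# Two step realization of the PPP with intensity c * N((x, y); (x0, y0), diag).
n = rng.poisson(c)
x = x0 + sigma_x * rng.standard_normal(n)
y = y0 + sigma_y * rng.standard_normal(n)

# Noise-free measurement h(x, y) = (range, bearing), as in (2.87).
r = np.hypot(x, y)
theta = np.arctan2(y, x)

# Perturbing each point with the errors of (2.88) gives a realization of the
# measurement PPP with intensity nu(r, theta), as in Fig. 2.4d.
r_meas = r + sigma_r * rng.standard_normal(n)
theta_meas = theta + sigma_theta * rng.standard_normal(n)
print(n, r_meas.mean(), theta_meas.mean())
```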
λ = {λ1 , λ2 , . . .}.
For a bounded set R ⊂ Φ, let

    μ(R) = Σ_{j ∈ R} λ_j .
In Step 1, the total number of samples n is drawn from the Poisson random variable
with parameter μ(R). In Step 2, these n samples, denoted by φx j , are i.i.d. draws
from the multinomial distribution with pdf
    { λ_j / μ(R)  :  j ∈ R } .

The integers x_j range over the set of indices of the discrete points in R, but they are
otherwise unrestricted. The PPP realization is ξ = (n, {φ_{x₁}, . . . , φ_{x_n}}).
Nothing prevents the same discrete point, say φ j ∈ R, from occurring more than
once in the list {φx1 , . . . , φxn }; that is, repeated samples of the points in R are per-
mitted. The number n j of occurrences of φ j ∈ R as a point of the PPP realization ξ
is a Poisson distributed random variable with parameter λ j and pdf (2.91). Because
of Poisson’s gambit, these Poisson variates are independent. The two definitions are
therefore equivalent.
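The equivalence of the two definitions is easy to check by simulation. The sketch below (Python/NumPy, illustrative intensity values) draws discrete-space realizations both ways and compares the mean count at each point with the intensity vector.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = np.array([0.5, 2.0, 3.5])   # intensity vector on three discrete points

def counts_two_step(lam):
    """Step 1: total count ~ Poisson(mu(R)); Step 2: multinomial split."""
    n = rng.poisson(lam.sum())
    return rng.multinomial(n, lam / lam.sum())

def counts_per_point(lam):
    """Equivalent definition: independent Poisson count at each point."""
    return rng.poisson(lam)

trials = 50_000
a = np.mean([counts_two_step(lam) for _ in range(trials)], axis=0)
b = np.mean([counts_per_point(lam) for _ in range(trials)], axis=0)
print(a, b, lam)   # all three rows agree up to Monte Carlo error
```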
The event space of PPPs on Φ is

    E(R) = {(0, ∅)} ∪ ( ∪_{n=1}^∞ { (n, {φ_{x₁}, . . . , φ_{x_n}}) : φ_{x_j} ∈ R, j = 1, . . . , n } ) .    (2.92)
Except for the small change in notation that highlights the indices x j , it is identical
to (2.1). The pdf of the unordered realization ξ is
    p_Ξ(ξ) = e^{−Σ_{j∈R} λ_j} ∏_{j=1}^{n} λ_{x_j} .    (2.93)
This is the discrete space analog of the continuous space expression (2.12). The
expectation operator is changed only in that integrals are everywhere replaced by
sums over the discrete points of R ⊂ Φ. The notions of superposition and thinning
are also unchanged.
The intensity functions of transition and measurement processes are similar to
(2.83) and (2.86), but are modified to accommodate discrete spaces. The transition
pdf Ψ(φ_j | φ_i) is now a transition matrix whose (i, j)-entry is the probability that
the discrete state φ_i maps to the discrete state φ_j. The intensity of the transition
process Ψ(Ξ) is

    ν_j = Σ_i Ψ(φ_j | φ_i) λ_i .

Similarly, if the measurement pdf ℓ(φ_j | x) is the probability of the discrete measurement
φ_j conditioned on a continuous state x, the measurement intensity vector has components

    ν_j = ∫_S ℓ(φ_j | x) λ(x) dx ,

where λ(x) is the intensity function of a PPP, say Υ, on the state space S. If the
conditioning variable takes values u in a discrete space U, the pdf ℓ(φ_j | u) is the
probability of φ_j given u ∈ U and the measurement intensity vector is

    ν_j = Σ_{u ∈ U} ℓ(φ_j | u) λ(u) ,

where in this case λ(u) is the intensity vector of the discrete PPP defined on U. The
discrete-continuous case is discussed in the next section.
Example 2.13 Histograms. The cells {R j } of a histogram are probably the most
natural example of a set of discrete isolated points. Consider a PPP Ξ defined on
the underlying continuous space in which the histogram cells reside. Aggregating,
or quantizing, the i.i.d. points of realizations of Ξ into the nonoverlapping cells
{R j } and reporting only the total counts in each cell yields a realization of a PPP
on a discrete space with points φ j ≡ R j . The intensity vector of this discrete PPP,
call it Ξ_H, is

    λ_j = ∫_{R_j} λ_c(s) ds ,
where λc (s) is the intensity function of Ξ . By the independent scattering, since the
histogram cells {R j } are disjoint, the number of elements in cell R j is Poisson
distributed with parameter λ j . The fact that the points φ j are, or can be, repeated in
realizations of the discrete PPP Ξ H hardly needs saying.
Concrete examples of discrete spaces occur in emission and transmission tomog-
raphy. In these examples, the points in Φ correspond to the individual detectors in a
detector array, and the number of occurrences of φ j in a realization is the number of
detected photons (or other particle) in the j-th detector. These topics are discussed
in Chapter 5.
The event space of a PPP on the augmented space is E(R+ ). The event space E(R)
is a proper subset of E(R+ ).
Realizations are generated as before for the bounded sets R. For bounded sets
R+ , the integrals in (2.4) are replaced by the integrals over R+ as defined in (2.97);
otherwise, Step 1 is unchanged. Step 2 is modified slightly. If n is the outcome of
Step 1, then n i.i.d. Bernoulli trials with probabilities
    Pr[φ] = λ(φ) / ( λ(φ) + ∫_R λ(s) ds )

    Pr[R] = ∫_R λ(s) ds / ( λ(φ) + ∫_R λ(s) ds )
are performed. The number n(φ) is the number of occurrences of φ in the realiza-
tion. The number of i.i.d. samples drawn from R is n − n(φ).
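A realization on a one point augmented space can be generated in a few lines. The sketch below (Python/NumPy) assumes, purely for illustration, a homogeneous intensity on R = [0, 1]² and a scalar intensity λ(φ) at the augmented point.

```python
import numpy as np

rng = np.random.default_rng(4)

lam_phi = 0.7      # intensity assigned to the augmented point phi
lam_R = 25.0       # integral of lambda(s) over R (homogeneous on [0,1]^2 here)

# Step 1: total number of points on the augmented space S+ = R union {phi}.
n = rng.poisson(lam_phi + lam_R)

# Step 2: Bernoulli split between phi and R, then place the R-points i.i.d.
n_phi = rng.binomial(n, lam_phi / (lam_phi + lam_R))
pts_R = rng.uniform(0.0, 1.0, size=(n - n_phi, 2))

# n_phi is Poisson(lam_phi); repeated occurrences of phi are allowed.
print(n_phi, len(pts_R))
```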
The number n(φ) is a realization of a random variable, denoted by N (φ), that is
Poisson distributed with parameter λ(φ). This is seen from the discussion in Sec-
tion 2.9.2. The expected number of occurrences of φ is λ(φ). Also, the probability of
repeated occurrences of φ is never zero. The possibility of repeated occurrences of
φ is important to understanding augmented PPP models for applications such as
multitarget tracking.
The probability that the list {x1 , . . . , xn } is a set is the probability that no more
than one realization of φ occurs in the n Bernoulli trials. Consequently, if λ(φ) > 0,
the probability that the list {x1 , . . . , xn } is a set is strictly less than one. In aug-
mented spaces, random finite sets are more accurately described as random finite
lists.
The likelihood function and expectation operator are unchanged, except that the
integrals are over either R or R+ , as the case may be. Superposition and thinning
are unchanged. The intensity of the diffusion and prediction processes are also
unchanged from (2.83) and (2.86), except that the integrals are over S + .
It is necessary to define the transitions Ψ (y | φ) and Ψ (φ | y) for all y ∈ S,
as well as Ψ (φ | φ) = Pr[φ | φ]. The measurement, or data, likelihood function
L( · | φ) must also be defined. These quantities have natural interpretations in target
tracking.
Example 2.14 Tracking Interpretations. A one-point augmented space is used in
Chapter 6. The state φ is the hypothesis that no target is present in the tracking
region R, and the point x ∈ R is the hypothesis that a target is present with state x.
State transitions and the measurement likelihood function are interpreted in tracking
applications as follows: transitions Ψ(x | φ) from φ to a state x ∈ R model the appearance
of a new target (track initiation); transitions Ψ(φ | x) model the disappearance of a target
with state x (track termination); Ψ(φ | φ) is the probability that no target remains absent;
and L(z | φ) is the likelihood of a measurement z when no target is present, that is, a clutter
measurement.
Initiation and termination of target track is therefore an intrinsic part of the tracking
function when using a Bayesian tracking method (see Appendix C) on an augmented
state space S + .
As is seen in Chapter 6, augmented spaces play an important role in simplifying
difficult enumerations related to joint detection and tracking of targets. Only one
state φ is considered here, but there is no intrinsic limitation.
A non-orderly PPP on S is one whose intensity function contains a
number of weighted Dirac delta functions located at the isolated points {a_j}. The
points {a j } are identified with the discrete points Φ = {φ j }. Let S + = S ∪ Φ.
Realizations on the augmented space S + generated in the manner outlined above for
the one point augmented case map directly to realizations of the non-orderly PPP
on S via the identification φ j ↔ a j . Other matters are similarly handled.
Chapter 3
Intensity Estimation
The basic geometric and algebraic properties of PPPs are discussed in Chapter 2.
These fundamental properties are exploited in this chapter to obtain algorithms for
estimating the PPP intensity function from data using the method of maximum
likelihood (ML). Estimation algorithms are critically important in applications in
which intensity functions are not fully determined a priori. Many heuristic methods
have also been proposed in the literature; though not without interest, they are not
discussed here.
The fundamental notion is that the intensity function is specified parametrically,
and that appropriate numerical values of one or more of the parameters are unknown
and must be estimated from collected data. Two kinds of intensity models are nat-
ural for PPPs: those that do not involve superposition, and those that do. Moreover,
two kinds of data are natural for PPPs: sample data and histogram data (see below
in Section 3.1). This gives four combinations, each of which requires a different,
though closely related, ML algorithm. In all cases, the parameterized intensity is
written λ(s; θ ), where the parameter vector θ is estimated from data.
The first section of this chapter discusses intensity models that do not involve
superposition. The natural method of ML estimation is used: Find the right
likelihood function for the available data, set the gradient with respect to the param-
eters to be estimated equal to zero, and solve. The likelihood functions discussed in
this section correspond to PPP sample and histogram data. This takes care of two of
the four combinations mentioned above.
The remaining sections are all about intensity functions that involve superposi-
tion, or linear combinations, of intensity functions. The EM method is the natural
method for deriving ML estimation algorithms, or estimators, in this case. Read-
ers with little or no familiarity with it may want to consult Appendix A or other
references before reading these sections. Other readers, who wish merely to get
quickly to the algorithm, are provided with two tables that outline the steps of the
EM algorithm for affine Gaussian sums, perhaps the most important example. All
readers, even those for whom EM may have lost some of its charm, are provided
with enough details to “read along” with little need to lift a pencil. These details
reside primarily in the E-step and can be skipped at the reader’s pleasure with little
loss of continuity or essential content.
Parametric models of superposed intensities come in at least two forms—
Gaussian sums and step functions. In the latter, the steps often correspond to pixels
(or voxels) in an image. With sufficiently many terms, both models can approximate
any continuous intensity function arbitrarily closely. For that reason they are some-
times described as nonparametric even though they have parameters that must be
either specified or estimated. To distinguish them from truly nonparametric “mod-
ern” sequential Monte Carlo (SMC) models involving particles, Gaussian sums and
step functions are herein referred to as parametric models.
Gaussian sum models are the main objects of study in this chapter. Step function
models are also very important, but they are discussed mostly in Chapter 5 on
medical imaging. SMC methods are discussed in the context of the tracking appli-
cations in Section 6.3.
The ML estimate is, by definition, the global maximum. The problem in practice is
that the likelihood function is often plagued by multiple local maxima. For example,
when the intensity is a Gaussian shape of unknown location plus a constant (see
(4.38) below), the likelihood function of the location parameter may have multiple
local maxima. When it is important to find the global maximum, locally convergent
algorithms are typically restarted with different initializations, and the ML estimate
is taken to be the estimate with the largest likelihood found. This chapter is con-
tent to investigate algorithms that converge to a local maximum of the likelihood
function.
Two important kinds of data for PPPs are considered. One kind is PPP sample
data, sometimes called “count record” data, which means simply that the available
data set is a realization of a PPP with intensity λ(s; θ ). The other is histogram data,
in which a realization of the PPP is represented not by the points themselves, but by
the number of points that fall in a fixed number of nonoverlapping cells that partition
the space S. Such data are equivalent to a realization of a PPP on a discrete space.
In practice, both kinds of data must be carefully collected to avoid unintentionally
altering their essential Poisson character. The quality of the ML estimators may be
adversely affected if the data do not match the underlying PPP assumption.
The loglikelihood function of PPP sample data x = (m, {x₁, . . . , x_m}) is

    L_iid(θ ; x) = log p(x ; θ)
                 = − ∫_R λ(s ; θ) ds + Σ_{j=1}^m log λ(x_j ; θ) .    (3.1)

Setting the gradient of (3.1) with respect to θ equal to zero gives the necessary conditions

    Σ_{j=1}^m ( 1 / λ(x_j ; θ) ) ∇_θ λ(x_j ; θ) = ∫_R ∇_θ λ(s ; θ) ds .    (3.3)

This system of equations may have multiple solutions, depending on the form of the
intensity λ(x ; θ) and the particular PPP sample data x.
A similar system of equations holds for histogram data. Adopting the notation of
Section 2.9.1 for a K cell histogram with counts data m_{1:K} = {m₁, . . . , m_K}, the
logarithm of the pdf is, from (2.62),

    L_hist(θ ; m_{1:K}) = − ∫_R λ(s ; θ) ds − Σ_{j=1}^K log(m_j !)
                          + Σ_{j=1}^K m_j log( ∫_{R_j} λ(s ; θ) ds ) .    (3.4)
Setting the gradient of (3.4) with respect to θ equal to zero gives

    Σ_{j=1}^K ( m_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds = ∫_R ∇_θ λ(s ; θ) ds .    (3.5)
As for PPP sample data, the system may have multiple solutions.
It is necessary to verify that the loglikelihood function of the data is concave
at the ML solution, that is, that the negative Hessian matrix of the loglikelihood
function is positive definite. In practice, intuition often replaces verification.
This example is a two dimensional problem that involves training the “crosshairs” of
a receiver on a dim light source. In optical communications, this means estimating
the brightest direction to the source [114, 115, 119, Section 4.5]. The receiver in this
case is a photodetector with a (flat) photoemissive surface. (Photoemissive materials
give off electrons when they absorb energetic photons.) Photons are detected over
a specified finite time period. Recording the number and locations of each detected
photon provides i.i.d. data x = { x_j ∈ R² : j = 1, . . . , m }. This kind of data may
or may not be practical for large m, depending on the application. Feasible or not,
it is nonetheless interesting to consider. In practice, the photodetector surface R is
often divided into a number of disjoint regions that constitute histogram cells, and a
count of the number of photons arriving in each cell is recorded. The photodetector
surface is assumed to be rectangular of known size, and its center is taken as the
coordinate system origin. The axes are taken parallel to the sides of R.
Example 3.1 Sample Data. The intensity of the light distributed across the photode-
tector surface is proportional to a Gaussian pdf. The 2 × 2 covariance matrix Σ
determines the elliptical shape of the "spotlight". For simplicity, suppose the shape
is circular with known width ρ, so that Σ = Diag(ρ², ρ²). The light intensity is
λ(x ; I0 , μ) = I0 N (x ; μ, Σ) , (3.6)
where I0 /(2 π ρ 2 ) is the peak intensity and the vector μ = [μ1 , μ2 ]T ∈ R2 is the
location of the peak. The parameters to be estimated are θ = (I0 , μ1 , μ2 ). The
only constraint is that I0 > 0.
For sample data, the necessary conditions yield three equations in three
unknowns. Setting the derivative in (3.3) with respect to I0 to zero gives the ML
estimate
    Î₀ = m / ∫_R N(s ; μ, Σ) ds ,    (3.7)
where the integral is a double integral over the photoemissive surface. The estimate
Iˆ0 automatically satisfies the nonnegativity constraint. The value of μ in (3.7) is
the ML estimate μ̂, so it is coupled to the necessary equations for μ. Setting the
gradient1 in (3.3) with respect to the vector μ equal to zero, substituting the estimate
(3.7), and rearranging terms gives [114]
    ∫_R s N(s ; μ, Σ) ds / ∫_R N(s ; μ, Σ) ds = (1/m) Σ_{j=1}^m x_j .    (3.8)
The left hand side is the mean vector of the Gaussian pdf restricted to the set R.
The equation thus says that ML estimate μ̂ is such that the conditional mean on
R equals the sample mean. The conditional mean equation is uniquely solvable for
rectangular domains, as shown in Appendix B.
If the bulk of the source distribution is visible on the photoemissive surface, then
∫_R N(s ; μ̂, Σ) ds ≈ 1. This gives the approximation Î₀ ≈ m. Also, the left
hand side of (3.8) is approximately the unconditional mean μ, so the ML estimate
μ̂ approximates the sample mean in this case.
Example 3.2 Compensating for Edge Effects. When the peak of the light distribu-
tion lies near the edge of the photodetector, many source photons are not counted.
This example shows that the estimator (3.8) automatically compensates for the
missed measurements. Intuitively, this is because the sample mean in (3.8) is, by
its very nature, an estimate of the conditional mean expressed analytically by the
left hand side.
A realization of the PPP (3.6) with intensity I0 = 250 and mean μ =
[0.8, 0.5]T and ρ = 0.5 is generated. Only 143 of the points in the realization
fall on the photodetector surface, which is taken to be the square (−1, 1) × (−1, 1).
These points are depicted in the left hand side of Fig. 3.1. The ML estimated
mean is computed by solving (3.8) by the method discussed in Appendix B,
giving μ̂_ML = [0.75007, 0.53021]ᵀ. The error is remarkably small, especially
compared to the error in the sample mean of the 143 detected photons, namely
[0.49600, 0.37698]ᵀ. The estimated intensity, computed from (3.7) with μ̂_ML,
gives Î₀ = 250.72.
The right hand side of Fig. 3.1 repeats the procedure but with the mean shifted
further into the corner to [0.8, 0.8]T . Only 112 of the PPP samples fall on the
photodetector. The ML estimate is μ̂ M L = [0.74687, 0.79664]T . The sample mean
for the 112 detected photons is, by contrast, [0.49445, 0.51792]T . The estimated
intensity is Iˆ0 = 245.57.
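The conditional mean equation (3.8) is straightforward to solve numerically. The sketch below (Python with NumPy and SciPy) uses a simple fixed point iteration on each coordinate, which is one possible method but not necessarily the one described in Appendix B; the data generation mirrors Example 3.2, and all numerical choices are illustrative.

```python
import numpy as np
from scipy.stats import norm

def trunc_mean(mu, sigma, a, b):
    """Conditional mean of N(mu, sigma^2) restricted to [a, b]."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    z = norm.cdf(beta) - norm.cdf(alpha)
    return mu + sigma * (norm.pdf(alpha) - norm.pdf(beta)) / z

def solve_conditional_mean(xbar, sigma, a=-1.0, b=1.0, iters=100):
    """Solve E[s | s in [a,b]] = xbar for the location mu, one axis at a time.
    Sigma = diag(sigma^2, sigma^2) and a rectangular R make the axes separable."""
    mu = np.array(xbar, dtype=float)            # initialize at the sample mean
    for _ in range(iters):
        g = np.array([trunc_mean(mu[0], sigma, a, b),
                      trunc_mean(mu[1], sigma, a, b)])
        mu = mu + (xbar - g)                    # fixed-point correction step
    return mu

rng = np.random.default_rng(5)
I0, mu_true, rho = 250, np.array([0.8, 0.5]), 0.5
pts = mu_true + rho * rng.standard_normal((rng.poisson(I0), 2))
pts = pts[np.all(np.abs(pts) < 1.0, axis=1)]    # photons landing on (-1,1)x(-1,1)
mu_hat = solve_conditional_mean(pts.mean(axis=0), rho)
I0_hat = len(pts) / (
    (norm.cdf((1 - mu_hat) / rho) - norm.cdf((-1 - mu_hat) / rho)).prod())
print(mu_hat, I0_hat)
```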
Example 3.3 Histogram Data. ML estimates of the parameters of (3.6) are now
given for histogram data. The necessary condition (3.5) for I₀ gives
Fig. 3.1 ML mean and coefficient estimates for the model (3.6) obtained by solving equation (3.8).
The mean in the left hand figure is (0.8, 0.5). The mean in the right hand figure is (0.8, 0.8). Filled
circle = μ̂ M L ; Open circle = uncorrected sample mean; Square = true mean
    Î₀ = Σ_{j=1}^K m_j / ∫_R N(s ; μ, Σ) ds .    (3.9)
This estimator is identical to (3.7) since m = Σ_{j=1}^K m_j. It is also coupled to the
estimate μ̂. Manipulating the necessary conditions for μ in much the same way as
done in (3.8) gives

    ∫_R s N(s ; μ, Σ) ds / ∫_R N(s ; μ, Σ) ds
        = (1/m) Σ_{j=1}^K m_j ( ∫_{R_j} s N(s ; μ, Σ) ds / ∫_{R_j} N(s ; μ, Σ) ds ) .    (3.10)
If ∫_R N(s ; μ, Σ) ds ≈ 1, the left hand side of (3.10) is approximately μ, so that

    μ̂ ≈ (1/m) Σ_{j=1}^K m_j γ_j(μ̂) ,

where

    γ_j(μ̂) = ∫_{R_j} s N(s ; μ̂, Σ) ds / ∫_{R_j} N(s ; μ̂, Σ) ds

is the conditional mean of cell R_j. Since γ_j(μ̂) ∈ R_j regardless of μ̂, replace
the Gaussian density in each cell with a uniform density. This final approximation is

    μ̂ ≈ (1/m) Σ_{j=1}^K m_j γ̄_j ,

where

    γ̄_j = ( 1 / |R_j| ) ∫_{R_j} s ds

is the centroid of cell R_j.
once, while the M-step is the parameter update step. For readers unfamiliar with
EM, there may be no easier way to learn it than by first reading the appendix, or some
equivalent discussion elsewhere, and subsequently plunging headlong into the prob-
lem addressed in the next section. This problem is an excellent first application of
EM. Readers new to EM are forewarned that much of the action lies in understand-
ing and manipulating subscripts. Further background on EM and its many variations
are given in [80, 81].
The superposed intensity is the sum of L component intensities,

    λ(x ; θ) = Σ_{ℓ=1}^L λ_ℓ(x ; θ_ℓ) ,    (3.11)

where the parameter vector of the ℓ-th PPP is θ_ℓ, and θ = (θ₁, . . . , θ_L) is the
parameter vector for the superposition. Different components of λ(x ; θ) can take
different parametric forms. The parameters of the intensities λ_ℓ(x ; θ_ℓ) are assumed
parametrically untied, i.e., there is no functional relationship between the vectors θ_i
and θ_j for i ≠ j. This simplifying assumption is unnecessary theoretically, as well
as inappropriate in some applications. An example of the latter is when the centroids
of λ_ℓ(x ; θ_ℓ) are required to form an equispaced grid whose orientation and spacing
are to be estimated from data.
3.2.1.1 E-step
The natural choice of the “missing data” are the conditionally independent random
indices k j , 1 ≤ k j ≤ L , that identify which of the superposed PPPs generated the
point x j . Let
    x_c = ( (x₁, k₁), . . . , (x_m, k_m) )    (3.12)

denote the complete data. (In the language of Section 8.1, x_c is a realization of a
marked PPP.) For ℓ = 1, . . . , L, let

    x_c(ℓ) = { (x_j, k_j) : k_j = ℓ } .    (3.13)

Let n_c(ℓ) ≥ 0 denote the number of indices j such that k_j = ℓ, and let
ξ_c(ℓ) ≡ ( n_c(ℓ), { x_j : k_j = ℓ } ).
It follows from the definition of k_j that ξ_c(ℓ) is a realization of the PPP whose
intensity is λ_ℓ( · ). The pdf of ξ_c(ℓ) is (from (2.12))

    p(ξ_c(ℓ) ; θ_ℓ) = exp( − ∫_R λ_ℓ(s ; θ_ℓ) ds ) ∏_{j : k_j = ℓ} λ_ℓ(x_j ; θ_ℓ) .

Because the L superposed PPPs are independent, the complete data pdf is the product

    p(x_c ; θ) = ∏_{ℓ=1}^L p(ξ_c(ℓ) ; θ_ℓ)
               = exp( − ∫_R λ(s ; θ) ds ) ∏_{j=1}^m λ_{k_j}(x_j ; θ_{k_j}) .

The complete data loglikelihood is therefore

    L(θ ; x_c) = log p(x_c ; θ)
               = − ∫_R λ(s ; θ) ds + Σ_{j=1}^m log λ_{k_j}(x_j ; θ_{k_j}) .    (3.15)
Two identities are useful in the E-step. The first is

    Σ_{k₁, ..., k_m = 1}^{L} ∏_{j=1}^m ( λ_{k_j}(x_j ; θ_{k_j}) / λ(x_j ; θ) ) = 1 ,    (3.17)

and the second is

    Σ_{k₁, ..., k_{j−1}, k_{j+1}, ..., k_m = 1}^{L} ∏_{i=1}^m ( λ_{k_i}(x_i ; θ_{k_i}) / λ(x_i ; θ) )
        = λ_{k_j}(x_j ; θ_{k_j}) / λ(x_j ; θ) .    (3.18)

The E-step evaluates the conditional expectation of the complete data loglikelihood,

    Q(θ ; θ^{(n)}) ≡ Σ_{k₁=1}^{L} ··· Σ_{k_m=1}^{L} ( ∏_{j=1}^m λ_{k_j}(x_j ; θ_{k_j}^{(n)}) / λ(x_j ; θ^{(n)}) ) L(θ ; x_c) .    (3.19)

Substituting (3.15) gives, after interchanging the summation order and using (3.17)
and (3.18),

    Q(θ ; θ^{(n)}) = − ∫_R λ(s ; θ) ds
        + Σ_{j=1}^m Σ_{ℓ=1}^L ( λ_ℓ(x_j ; θ_ℓ^{(n)}) / λ(x_j ; θ^{(n)}) ) log λ_ℓ(x_j ; θ_ℓ) .    (3.20)

Equivalently, Q(θ ; θ^{(n)}) separates into the L term sum

    Q(θ ; θ^{(n)}) = Σ_{ℓ=1}^L Q_ℓ(θ_ℓ ; θ^{(n)}) ,    (3.21)

where

    Q_ℓ(θ_ℓ ; θ^{(n)}) = − ∫_R λ_ℓ(s ; θ_ℓ) ds
        + Σ_{j=1}^m ( λ_ℓ(x_j ; θ_ℓ^{(n)}) / λ(x_j ; θ^{(n)}) ) log λ_ℓ(x_j ; θ_ℓ) .    (3.22)
3.2.1.2 M-step
The M-step maximizes (3.21) over all feasible θ, that is,

    θ^{(n+1)} = arg max_θ Q(θ ; θ^{(n)}) .

Then θ^{(n+1)} satisfies the necessary conditions

    Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) ( 1 / λ_ℓ(x_j ; θ_ℓ) ) ∇_{θ_ℓ} λ_ℓ(x_j ; θ_ℓ) = ∫_R ∇_{θ_ℓ} λ_ℓ(s ; θ_ℓ) ds ,    (3.24)

where the weights are

    w_ℓ(x_j ; θ^{(n)}) = λ_ℓ(x_j ; θ_ℓ^{(n)}) / λ(x_j ; θ^{(n)}) .    (3.25)

Solving the L uncoupled systems (3.24) gives the recursive update for θ, namely
θ^{(n+1)} = ( θ₁^{(n+1)}, . . . , θ_L^{(n+1)} ). This completes the M-step.
The weights have a simple interpretation. The probability that the ℓ-th PPP generates a
point in an infinitesimal volume |dx| centered at x_j is λ_ℓ(x_j ; θ_ℓ^{(n)}) |dx|, where |dx| is
the infinitesimal volume. On the other hand, the probability that a point is generated by the
superposed PPP in the same infinitesimal is λ(x_j ; θ^{(n)}) |dx|. Therefore, the probability
that x_j is generated by the ℓ-th PPP conditioned on the event that it was generated by the
superposed PPP is the ratio of these two probabilities. Cancelling |dx| in the ratio gives the
weight w_ℓ(x_j ; θ^{(n)}). A more careful proof that avoids the use of infinitesimals is omitted.
The solution of (3.24) depends on the form of λ_ℓ(x ; θ_ℓ). It differs from the
direct ML necessary conditions (3.3) primarily by the presence of the weights
w_ℓ(x_j ; θ^{(n)}). Several illustrative examples are now given for the components of
the superposed intensity λ(x ; θ) in (3.11).
Example 3.4 Constant Intensity. Let the ℓ-th PPP intensity be the constant
λ_ℓ(s ; θ_ℓ) = I_ℓ on R. The only parameter is θ_ℓ = I_ℓ. The full parameter vector θ includes not only θ_ℓ
but also the parameters θ_j, j ≠ ℓ, of the other PPPs. Given θ^{(n)} and, in particular,
θ_ℓ^{(n)} = I_ℓ^{(n)}, the EM update I_ℓ^{(n+1)} is, from (3.24),

    I_ℓ^{(n+1)} = ( 1 / |R| ) Σ_{j=1}^m  I_ℓ^{(n)} / λ(x_j ; θ^{(n)}) ,    (3.27)

where |R| = ∫_R ds. The fraction

    w_ℓ(x_j ; θ^{(n)}) = I_ℓ^{(n)} / λ(x_j ; θ^{(n)})

is the probability that the point x_j is generated by the ℓ-th PPP, so the summation in
(3.27) is the expected number of points generated by the ℓ-th PPP conditioned on the
current parameter set θ^{(n)}. The division by |R| converts this number to intensity.
EM updates for θ_j, j ≠ ℓ, depend on the form of the PPP intensities.
Example 3.5 Scaling a Known Component. Let the ℓ-th PPP intensity be

    λ_ℓ(s ; θ_ℓ) = I_ℓ f_ℓ(s) ,    (3.28)

where f_ℓ(s) ≥ 0 is a known function and only the scale factor θ_ℓ = I_ℓ is estimated.
Proceeding as in Example 3.4, the EM update from (3.24) is

    I_ℓ^{(n+1)} = ( 1 / ∫_R f_ℓ(s) ds ) Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) ,    (3.29)

where

    w_ℓ(x_j ; θ^{(n)}) = I_ℓ^{(n)} f_ℓ(x_j) / λ(x_j ; θ^{(n)}) .

The denominator in (3.29) corrects for the intensity that lies outside of R. Substi-
tuting into (3.29) and moving I_ℓ^{(n)} outside the sum gives

    I_ℓ^{(n+1)} = ( I_ℓ^{(n)} / ∫_R f_ℓ(s) ds ) Σ_{j=1}^m  f_ℓ(x_j) / λ(x_j ; θ^{(n)}) .    (3.30)

If f_ℓ is the indicator function of a cell R_ℓ ⊂ R and the L cells {R_ℓ} are disjoint, the
update (3.30) becomes

    I_ℓ^{(n+1)} = ( 1 / |R_ℓ| ) Σ_{x_j ∈ R_ℓ}  I_ℓ^{(n)} f_ℓ(x_j) / λ(x_j ; θ^{(n)}) .    (3.32)

It is seen through the notational fog that the EM algorithm converges in a single
step to

    Î_ℓ = #{ j : x_j ∈ R_ℓ } / |R_ℓ| ,

where #{ · } is the number of points in the set. In other words, the ML estimator is
proportional to the histogram if the cells are of equal size.
The most important case in applications is the Gaussian component

    λ_ℓ(s ; θ_ℓ) = I_ℓ N(s ; μ_ℓ, Σ_ℓ) ,    (3.33)

where the constant I_ℓ is the total "signal level." In some applications one or more of
the parameters {I_ℓ, μ_ℓ, Σ_ℓ} are known. The more common possibilities are consid-
ered here.

Case 1. If I_ℓ is the only estimated parameter because μ_ℓ and Σ_ℓ are known, then
θ_ℓ = I_ℓ is estimated using the recursion (3.30) with f_ℓ(x) = N(x ; μ_ℓ, Σ_ℓ).

Case 2. If the signal level and location vector μ_ℓ are estimated, and Σ_ℓ is known,
then θ_ℓ = (I_ℓ, μ_ℓ). Given θ^{(n)} and thus θ_ℓ^{(n)} = ( I_ℓ^{(n)}, μ_ℓ^{(n)} ), the EM updates I_ℓ^{(n+1)}
and μ_ℓ^{(n+1)} are coupled. They are manipulated into a nested form in which μ_ℓ^{(n+1)} is
computed as the solution of a nonlinear equation, and then I_ℓ^{(n+1)} is computed from
μ_ℓ^{(n+1)}. To see this, proceed in the same manner as (3.29) to obtain
    I_ℓ^{(n+1)} = ( 1 / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds ) Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) ,    (3.34)

where in this equation μ_ℓ is the sought-after EM update, and the weights are

    w_ℓ(x_j ; θ^{(n)}) = I_ℓ^{(n)} N(x_j ; μ_ℓ^{(n)}, Σ_ℓ) / λ(x_j ; θ^{(n)}) .    (3.35)

The necessary condition with respect to μ_ℓ in (3.24) is, after multiplying both sides
by Σ_ℓ,

    Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) ( x_j − μ_ℓ ) = I_ℓ ∫_R N(s ; μ_ℓ, Σ_ℓ) ( s − μ_ℓ ) ds .    (3.36)

Substituting for I_ℓ the estimate I_ℓ^{(n+1)} from (3.34) and simplifying gives μ_ℓ^{(n+1)} as
the solution of a nonlinear equation:

    ∫_R s N(s ; μ_ℓ, Σ_ℓ) ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds
        = Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) x_j / Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) .    (3.37)
(Compare this expression to the direct ML equation (3.8) to see how the EM method
exploits Bayes Theorem to “split” the data into L parts, one for each component.)
The right hand side of this equation is a probabilistic mean of the data, that is, a
convex combination of {x_j}. In general, solving (3.37) for μ_ℓ^{(n+1)} requires numerical
methods. The estimate μ_ℓ^{(n+1)} is then substituted into (3.34) to evaluate I_ℓ^{(n+1)}.

If it is known that ∫_R N(s ; μ_ℓ, Σ_ℓ) ds ≈ 1, then ∫_R s N(s ; μ_ℓ, Σ_ℓ) ds ≈ μ_ℓ.
Given this approximation, the EM updates to (3.34) and (3.37) are "self solving", so that

    I_ℓ^{(n+1)} ≈ Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) = I_ℓ^{(n)} Σ_{j=1}^m  N(x_j ; μ_ℓ^{(n)}, Σ_ℓ) / λ(x_j ; θ^{(n)})    (3.38)

    μ_ℓ^{(n+1)} ≈ Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) x_j / Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) .    (3.39)

The approximate iteration requires that ∫_R N(s ; μ_ℓ^{(n)}, Σ_ℓ) ds ≈ 1 at every EM
iteration, so it cannot be used if the iterates μ_ℓ^{(n)} drift too close to the edge of the
set R.
Case 3. Finally, if the signal level, mean vector, and covariance matrix are all
estimated, then θ_ℓ = (I_ℓ, μ_ℓ, Σ_ℓ). Given θ^{(n)} and thus θ_ℓ^{(n)} = ( I_ℓ^{(n)}, μ_ℓ^{(n)}, Σ_ℓ^{(n)} ),
all three EM updates are coupled.
The gradients in (3.24) with respect to μ_ℓ and Σ_ℓ ² involve I_ℓ^{(n+1)}. The gradient
equation for μ_ℓ is essentially identical to (3.36); the gradient for Σ_ℓ is very similar
but messier. Substituting the estimate I_ℓ^{(n+1)} for I_ℓ in both equations using (3.34)
and simplifying gives μ_ℓ^{(n+1)} and Σ_ℓ^{(n+1)} as the solution of a coupled system of
nonlinear equations:

    ∫_R s N(s ; μ_ℓ, Σ_ℓ) ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds
        = Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) x_j / Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)})    (3.41)

    ∫_R N(s ; μ_ℓ, Σ_ℓ) (s − μ_ℓ)(s − μ_ℓ)ᵀ ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds
        = Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) (x_j − μ_ℓ)(x_j − μ_ℓ)ᵀ / Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) .    (3.42)
The update

    I_ℓ^{(n+1)} = ( 1 / ∫_R N(s ; μ_ℓ^{(n+1)}, Σ_ℓ^{(n+1)}) ds ) Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)})    (3.43)

is evaluated after solving for μ_ℓ^{(n+1)} and Σ_ℓ^{(n+1)}.
Again, if ∫_R N(s ; μ_ℓ, Σ_ℓ) ds ≈ 1, the left hand sides of the equations (3.41)
and (3.42) are the mean vector μ_ℓ and covariance matrix Σ_ℓ, respectively. Given
this approximation, the EM update equations are simply
² Use the matrix identity ∇_R N(s ; μ, R) = ½ N(s ; μ, R) [ −R⁻¹ + R⁻¹ (s − μ)(s − μ)ᵀ R⁻¹ ],
where the gradient is written in matrix form as ∇_R = [ ∂/∂ρ_{ij} ] and R = [ ρ_{ij} ].
    I_ℓ^{(n+1)} ≈ Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)})    (3.44)

    μ_ℓ^{(n+1)} ≈ Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) x_j / Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)})    (3.45)

    Σ_ℓ^{(n+1)} ≈ Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) ( x_j − μ_ℓ^{(n+1)} )( x_j − μ_ℓ^{(n+1)} )ᵀ / Σ_{j=1}^m w_ℓ(x_j ; θ^{(n)}) .    (3.46)
Table 3.1 EM algorithm for affine Gaussian sum with PPP sample data

Given data: x_{1:m} = {x₁, . . . , x_m}
Fit the intensity: λ(s ; θ) = λ_bgnd(s) + Σ_{ℓ=1}^L I_ℓ N(s ; μ_ℓ, Σ_ℓ),  s ∈ R
Estimate the parameters: θ = {(I_ℓ, μ_ℓ, Σ_ℓ) : ℓ = 1, . . . , L}

• FOR ℓ = 1 : L, initialize coefficients, means, and covariance matrices:
  I_ℓ(0) > 0, μ_ℓ(0) ∈ R, and Σ_ℓ(0) positive definite
• END FOR
• FOR EM iteration index n = 0, 1, 2, . . . until convergence:
  – FOR j = 1 : m and ℓ = 1 : L, compute:
        w_{ℓj}(n) = I_ℓ(n) N(x_j ; μ_ℓ(n), Σ_ℓ(n)) / ( λ_bgnd(x_j) + Σ_{ℓ'=1}^L I_{ℓ'}(n) N(x_j ; μ_{ℓ'}(n), Σ_{ℓ'}(n)) )
  – END FOR
  – FOR ℓ = 1 : L, compute:
      • N_ℓ(n) = Σ_{j=1}^m w_{ℓj}(n)
      • E_ℓ(n) = ( 1 / N_ℓ(n) ) Σ_{j=1}^m w_{ℓj}(n) x_j
      • V_ℓ(n) = ( 1 / N_ℓ(n) ) Σ_{j=1}^m w_{ℓj}(n) ( x_j − μ_ℓ(n) )( x_j − μ_ℓ(n) )ᵀ
      • Solve
            ∫_R s N(s ; μ_ℓ, Σ_ℓ) ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds = E_ℓ(n)
            ∫_R N(s ; μ_ℓ, Σ_ℓ) (s − μ_ℓ)(s − μ_ℓ)ᵀ ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds = V_ℓ(n)
        for μ_ℓ(n + 1) and Σ_ℓ(n + 1)
      • Compute:
            I_ℓ(n + 1) = N_ℓ(n) / ∫_R N(s ; μ_ℓ(n + 1), Σ_ℓ(n + 1)) ds
  – END FOR
• END FOR EM iteration (Test for convergence)
• If converged: FOR ℓ = 1 : L,  Î_ℓ = I_ℓ(n_last),  μ̂_ℓ = μ_ℓ(n_last),  Σ̂_ℓ = Σ_ℓ(n_last)  END FOR
The mean and covariance updates are nested in the approximate EM update, that
is, the update μ_ℓ^{(n+1)} is used in Σ_ℓ^{(n+1)}.
Table 3.1 outlines the steps of the EM algorithm for affine Gaussian sums, that
is, for a PPP that is a Gaussian sum plus an arbitrary intensity, λbgnd (s). This
intensity models the superposition of a PPP whose intensity is a Gaussian sum and
a background PPP of known intensity. In many applications, the background inten-
sity is assumed homogeneous, but this restriction is unnecessary for the estimation
algorithm to converge.
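For concreteness, the sketch below (Python/NumPy) implements the approximate form of Table 3.1, i.e., it uses the updates (3.44)–(3.46) and therefore assumes each Gaussian component keeps essentially all of its mass inside R. The test data, the background level, and all function names are illustrative assumptions rather than quantities taken from the text.

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """N(x; mu, cov) evaluated at the rows of x."""
    d = x - mu
    inv = np.linalg.inv(cov)
    q = np.einsum('ij,jk,ik->i', d, inv, d)
    return np.exp(-0.5 * q) / np.sqrt(np.linalg.det(2.0 * np.pi * cov))

def em_gaussian_sum(x, L, lam_bgnd, iters=200, seed=0):
    """Approximate EM of Table 3.1 using the updates (3.44)-(3.46)."""
    rng = np.random.default_rng(seed)
    m, dim = x.shape
    I = np.full(L, m / (2.0 * L))                       # coefficients
    mu = x[rng.choice(m, L, replace=False)].copy()      # means
    cov = np.array([np.cov(x.T) for _ in range(L)])     # covariances
    for _ in range(iters):
        # E-step: weights w_{lj} = I_l N(x_j; mu_l, cov_l) / lambda(x_j)
        comp = np.array([I[l] * gauss_pdf(x, mu[l], cov[l]) for l in range(L)])
        lam = lam_bgnd(x) + comp.sum(axis=0)
        w = comp / lam
        # M-step (approximate): (3.44)-(3.46)
        N = w.sum(axis=1)
        I = N
        mu = (w @ x) / N[:, None]
        for l in range(L):
            d = x - mu[l]
            cov[l] = (w[l, :, None, None] * np.einsum('ij,ik->ijk', d, d)).sum(0) / N[l]
    return I, mu, cov

# Illustrative use: two Gaussian clusters plus a homogeneous background on [0,10]^2.
rng = np.random.default_rng(1)
x = np.vstack([
    rng.normal([3, 3], 0.5, size=(rng.poisson(150), 2)),
    rng.normal([7, 6], 0.8, size=(rng.poisson(100), 2)),
    rng.uniform(0, 10, size=(rng.poisson(50), 2)),
])
I, mu, cov = em_gaussian_sum(x, L=2, lam_bgnd=lambda x: np.full(len(x), 50 / 100.0))
print(I, mu)
```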
The EM method is widely used for superposition (mixture) problems with either
independent or conditionally independent data, but it is versatile and is applicable
to any problem in which useful missing data can be found. In this section EM is
used for histogram data.
Histogram data are difficult to treat because the loglikelihood function (3.4) involves
integrals over the individual histogram cells. The key insight, for those who wish to
skip the notationally burdensome details until need arises, is that the points of the
PPP are the missing data. Said another way, the integrals are sums, and the variables
of integration are—like the index of the summation in the superposition—a very
appropriate choice for the missing data. Other choices are possible, but this is the
choice made here.
3.3.1.1 E-step
The histogram notation of Section 3.1 is retained, as is the superposed intensity
(3.11). In EM parlance, the incomplete data loglikelihood function is (2.62). Missing
data arise from the histogram nature of the data and from the intensity superposition.
For j = 1, . . . , K , the count m j in cell R j corresponds to m j points in cell R j , but
the precise locations of these points are not observed. Let
ξ j ≡ x j1 , . . . , x jm j , x jr ∈ R j , r = 1, . . . , m j ,
denote these locations. Denote the collection of all missing points by ξ1:K =
(ξ1 , . . . , ξ K ). The points in ξ1:K are the (missing) points of the PPP realization
from which the histogram data are generated. The missing data that arise from the
superposition are the same as before. The point x jr is generated by one of the com-
ponents of the superposition; denote the index of this component by k_{jr}, where
1 ≤ k_{jr} ≤ L. Let K_j = ( k_{j1}, . . . , k_{jm_j} ), and denote all missing indices by
K_{1:K} = ( K₁, . . . , K_K ).
The complete data are (m 1:K , ξ1:K , K1:K ). Because the points x jr are equivalent
to points of the PPP realization, and the indices k jr indicate the components of the
superposition that generated them, the definition of the complete data likelihood
function is
    p_hc( m_{1:K}, ξ_{1:K}, K_{1:K} ; θ ) = exp( − ∫_R λ(s ; θ) ds ) ∏_{j=1}^K ∏_{r=1}^{m_j} λ_{k_{jr}}( x_{jr} ; θ_{k_{jr}} ) .    (3.47)

Missing data do not affect the integral over all R because exp( − ∫_R λ(s ; θ) ds )
is the normalization constant that makes p_hc( · ) a pdf. The conditional pdf of the
missing data is, by Bayes Theorem, proportional to the product over j and r of the weights

    w_ℓ( s ; θ ) = λ_ℓ( s ; θ_ℓ ) / ∫_{R_j} λ( s' ; θ ) ds' ,    s ∈ R_j ,

evaluated at ℓ = k_{jr} and s = x_{jr}.
Unlike the weights (3.25), these weights are not dimensionless; they carry the same
units as the intensity. For j = 1, . . . , K , let
    ∫_{R_j} ··· ∫_{R_j} dx_{j1} ··· dx_{jm_j} ≡ ∫_{(R_j)^{m_j}} dξ_j ,

where dξ_j = dx_{j1} ··· dx_{jm_j}, and

    ∫_{(R₁)^{m₁}} ··· ∫_{(R_K)^{m_K}} dξ₁ ··· dξ_K ≡ ∫_{(R₁)^{m₁} × ··· × (R_K)^{m_K}} dξ_{1:K} ,

where dξ_{1:K} ≡ dξ₁ ··· dξ_K. It is easy to verify from the definition of the weights
that

    Σ_{K_{1:K}} ∫_{(R₁)^{m₁} × ··· × (R_K)^{m_K}} ∏_{j=1}^K ∏_{r=1}^{m_j} w_{k_{jr}}( x_{jr} ; θ ) dξ_{1:K} = 1 ,

where

    Σ_{K_{1:K}} ≡ Σ_{k_{11}=1}^L ··· Σ_{k_{1m₁}=1}^L  ···  Σ_{k_{K1}=1}^L ··· Σ_{k_{Km_K}=1}^L .
Integrating and summing over all missing data except x_{jr} and k_{jr} shows that

    Σ_{K_{1:K} \ k_{jr}} ∫_{(R₁)^{m₁} × ··· × (R_K)^{m_K} \ R_j} ∏_{j'=1}^K ∏_{r'=1}^{m_{j'}} w_{k_{j'r'}}( x_{j'r'} ; θ ) d( ξ_{1:K} \ x_{jr} )
        = w_{k_{jr}}( x_{jr} ; θ ) .    (3.50)
3.3.1.2 M-step
Let n ≥ 0 denote the EM iteration index, and let θ^{(0)} be given. The auxiliary
function is, by definition of the E-step,

    Q( θ ; θ^{(n)} ) = E[ log p_hc( m_{1:K}, ξ_{1:K}, K_{1:K} ; θ ) | θ^{(n)} ]    (3.51)
        ≡ Σ_{K_{1:K}} ∫_{(R₁)^{m₁} × ··· × (R_K)^{m_K}} log p_hc( m_{1:K}, ξ_{1:K}, K_{1:K} ; θ )
              ∏_{j=1}^K ∏_{r=1}^{m_j} w_{k_{jr}}( x_{jr} ; θ^{(n)} ) dξ_{1:K} .
Substituting the logarithm of (3.47) and paying attention to the algebra gives

    Q( θ ; θ^{(n)} ) = − ∫_R λ(s ; θ) ds
        + Σ_{j=1}^K Σ_{r=1}^{m_j} Σ_{k_{jr}=1}^L ∫_{R_j} log λ_{k_{jr}}( x_{jr} ; θ_{k_{jr}} ) w_{k_{jr}}( x_{jr} ; θ^{(n)} ) dx_{jr}
          × { Σ_{K_{1:K} \ k_{jr}} ∫_{(R₁)^{m₁} × ··· × (R_K)^{m_K} \ R_j} ∏_{(j', r') ≠ (j, r)} w_{k_{j'r'}}( x_{j'r'} ; θ^{(n)} ) d( ξ_{1:K} \ x_{jr} ) } .

From the identity (3.50), the term in braces is identically one. Carefully examining
the sum over k_{jr} shows that the index can be replaced by an index, say ℓ, that does
not depend on j and r. Making the sum over ℓ the first sum, and recognizing that
the integrals over x_{jr} for each index r are identical, gives the simplified auxiliary
function

    Q( θ ; θ^{(n)} ) = − ∫_R λ(s ; θ) ds + Σ_{ℓ=1}^L Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) log λ_ℓ( s ; θ_ℓ ) ds .    (3.52)
As in (3.21), Q( θ ; θ^{(n)} ) separates into the L term sum

    Q( θ ; θ^{(n)} ) = Σ_{ℓ=1}^L Q_ℓ( θ_ℓ ; θ^{(n)} ) ,    (3.53)

where

    Q_ℓ( θ_ℓ ; θ^{(n)} ) = − ∫_R λ_ℓ(s ; θ_ℓ) ds + Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) log λ_ℓ( s ; θ_ℓ ) ds .    (3.54)
For the Gaussian component

    λ_ℓ(s ; θ_ℓ) = I_ℓ N(s ; μ_ℓ, Σ_ℓ) ,    (3.55)

setting the gradient of (3.54) to zero and proceeding as in the sample data case gives

    I_ℓ^{(n+1)} = ( 1 / ∫_R N(s ; μ_ℓ^{(n+1)}, Σ_ℓ^{(n+1)}) ds ) Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) ds ,    (3.56)

together with

    ∫_R s N(s ; μ_ℓ, Σ_ℓ) ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds
        = Σ_{j=1}^K m_j ∫_{R_j} s w_ℓ( s ; θ^{(n)} ) ds / Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) ds    (3.57)

and

    ∫_R N(s ; μ_ℓ, Σ_ℓ) (s − μ_ℓ)(s − μ_ℓ)ᵀ ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds
        = Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) (s − μ_ℓ)(s − μ_ℓ)ᵀ ds / Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) ds .    (3.58)

The equations (3.57) and (3.58) are solved jointly for μ_ℓ^{(n+1)} and Σ_ℓ^{(n+1)}. Then,
I_ℓ^{(n+1)} is evaluated using (3.56). This completes the M-step.
If ∫_R N(s ; μ_ℓ, Σ_ℓ) ds ≈ 1, the updates simplify significantly. The left hand
side of (3.57) is the mean, so that

    μ_ℓ^{(n+1)} ≈ Σ_{j=1}^K m_j ∫_{R_j} s w_ℓ( s ; θ^{(n)} ) ds / Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) ds .    (3.59)

Similarly, the left hand side of (3.58) is the covariance matrix, so that

    Σ_ℓ^{(n+1)} ≈ Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) ( s − μ_ℓ^{(n+1)} )( s − μ_ℓ^{(n+1)} )ᵀ ds
                    / Σ_{j=1}^K m_j ∫_{R_j} w_ℓ( s ; θ^{(n)} ) ds .    (3.60)
Table 3.2 EM algorithm for affine Gaussian sum with PPP histogram data

Given data: m_{1:K} = {m₁, . . . , m_K} (K is the number of histogram cells)
Fit the intensity: λ(s ; θ) = λ_bgnd(s) + Σ_{ℓ=1}^L I_ℓ N(s ; μ_ℓ, Σ_ℓ),  s ∈ R
Estimate the parameters: θ = {(I_ℓ, μ_ℓ, Σ_ℓ) : ℓ = 1, . . . , L}

• FOR ℓ = 1 : L, initialize coefficients, means, and covariance matrices:
  – I_ℓ(0) > 0, μ_ℓ(0) ∈ R, and Σ_ℓ(0) positive definite
• END FOR
• FOR EM iteration index n = 0, 1, 2, . . . until convergence:
  – FOR ℓ = 1 : L and s ∈ R_j, j = 1 : K, define (be able to evaluate):
        w_ℓ(s ; n) = I_ℓ(n) N(s ; μ_ℓ(n), Σ_ℓ(n)) / ∫_{R_j} ( λ_bgnd(s') + Σ_{ℓ'=1}^L I_{ℓ'}(n) N(s' ; μ_{ℓ'}(n), Σ_{ℓ'}(n)) ) ds'
  – END FOR
  – FOR ℓ = 1 : L, compute:
      • N_ℓ(n) = Σ_{j=1}^K m_j ∫_{R_j} w_ℓ(s ; n) ds
      • E_ℓ(n) = ( 1 / N_ℓ(n) ) Σ_{j=1}^K m_j ∫_{R_j} s w_ℓ(s ; n) ds
      • V_ℓ(n) = ( 1 / N_ℓ(n) ) Σ_{j=1}^K m_j ∫_{R_j} w_ℓ(s ; n) ( s − μ_ℓ(n) )( s − μ_ℓ(n) )ᵀ ds
      • Solve
            ∫_R s N(s ; μ_ℓ, Σ_ℓ) ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds = E_ℓ(n)
            ∫_R N(s ; μ_ℓ, Σ_ℓ) (s − μ_ℓ)(s − μ_ℓ)ᵀ ds / ∫_R N(s ; μ_ℓ, Σ_ℓ) ds = V_ℓ(n)
        for μ_ℓ(n + 1) and Σ_ℓ(n + 1)
      • Compute:
            I_ℓ(n + 1) = N_ℓ(n) / ∫_R N(s ; μ_ℓ(n + 1), Σ_ℓ(n + 1)) ds
  – END FOR
• END FOR EM iteration (Test for convergence)
• If converged: FOR ℓ = 1 : L,  Î_ℓ = I_ℓ(n_last),  μ̂_ℓ = μ_ℓ(n_last),  Σ̂_ℓ = Σ_ℓ(n_last)  END FOR
Table 3.2 outlines the steps of the EM algorithm for affine Gaussian sums with
histogram data. It is structurally similar to Table 3.1.
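A corresponding sketch for histogram data is given below (Python with NumPy/SciPy). It approximates every cell integral by the integrand at the cell midpoint times the cell area and again uses the approximations (3.59)–(3.60); all of these are simplifying assumptions, not the full algorithm of Table 3.2, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_sum_hist(counts, centers, cell_area, L, lam_bgnd,
                         iters=200, seed=0):
    """Histogram-data EM sketch: cell integrals are approximated by
    (integrand at the cell midpoint) * cell_area."""
    rng = np.random.default_rng(seed)
    K, dim = centers.shape
    I = np.full(L, counts.sum() / (2.0 * L))
    mu = centers[rng.choice(K, L, replace=False, p=counts / counts.sum())].copy()
    cov = [np.cov(centers.T, aweights=counts) for _ in range(L)]
    for _ in range(iters):
        comp = np.array([I[l] * multivariate_normal.pdf(centers, mean=mu[l], cov=cov[l])
                         for l in range(L)])
        lam_cell = (lam_bgnd(centers) + comp.sum(axis=0)) * cell_area
        # w[l, j] ~ integral of w_l(s) over cell j: the component's share of the cell mass.
        w = comp * cell_area / lam_cell
        N = (counts * w).sum(axis=1)
        I = N
        mu = (counts * w) @ centers / N[:, None]
        for l in range(L):
            d = centers - mu[l]
            cov[l] = ((counts * w[l])[:, None, None]
                      * np.einsum('ij,ik->ijk', d, d)).sum(0) / N[l]
    return I, mu, cov
```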
3.4 Regularization
The affine Gaussian sum intensity of Tables 3.1 and 3.2 is

    λ(s) = λ_bgnd(s) + Σ_{ℓ=1}^L I_ℓ N(s ; μ_ℓ, Σ_ℓ) .    (3.61)

These heteroscedastic sums, as they are picturesquely called in the statistics litera-
ture, have L ( ½ n_x (n_x + 1) + n_x + 1 ) free parameters. There are so many param-
eters, in fact, that the loglikelihood function (3.1) is unbounded for conditionally
independent data for L ≥ 2. It is bounded only for L = 1, and then only if the
data are full rank.
As often as not, the ugly fact of unboundedness intrudes during the EM iteration
when a covariance matrix abruptly becomes numerically singular. What is happen-
ing is that the covariance matrix is “shrink-wrapping” itself onto some less than
full n x -dimensional data subspace. Over-fitting is a more classical name for the
phenomenon. The likelihood of the corresponding Gaussian component therefore
grows without bound as the EM algorithm bravely iterates toward a maximum it
cannot attain. In practice, unfortunately, it is all too easy to encounter initializations
that are in the domain of attraction of an unbounded point of the likelihood function.
Hence, the need for regularization.
Example 3.7 Contaminated Data. Tukey proposed an interesting mixture model for
data contaminated with outliers. The model is sometimes called a homothetic Gaus-
sian sum, after the Greek thetos, meaning “placed.” The components of a homo-
thetic Gaussian sum have the same mean vector, called the homothetic center. The
covariance matrices are linearly proportional to a common covariance matrix, so the
general form is
    λ(s) = Σ_{ℓ=1}^L I_ℓ N( s ; μ, ρ_ℓ² Σ ) .    (3.62)

The scale factors, or similitude ratios, ρ_ℓ, are specified. The estimated parameters
are the mean μ and the coefficients {I_ℓ}, as well as the matrix Σ if it is not specified
in the application. In some settings, it may be useful to allow components of the
homothetic model to have different covariance matrices.
A related model arises in estimating the power spectrum of a periodic signal with
fundamental frequency ω₀ and L harmonics. If the harmonics are pure spectral lines,
the intensity is

    λ(s) = N(s) + Σ_{ℓ=1}^L I_ℓ δ( s ; ℓ ω₀ ) ,

where s denotes frequency and N(s) is a known noise spectrum. However, if the
fundamental is a randomly modulated "narrow" broadband signal with mean fun-
damental frequency ω₀ and spectral width σ₀, a reasonable model of the power
spectrum is

    λ(s) = N(s) + Σ_{ℓ=1}^L I_ℓ N( s ; ℓ ω₀ , ℓ² σ₀² ) .    (3.63)
This model differs from the homothetic model because both the locations and the
widths of the harmonics are multiples of the location ω₀ and width σ₀ of the funda-
mental. The objective is to estimate ω̂0 and σ̂0 from DFT data. In low signal to noise
ratio (SNR) applications, and in adverse environments where individual harmonics
can fade in and out over time, measurements of the full harmonic structure over
a sliding time window may improve the stability and accuracy of these estimates.
An immediate technical problem arises—DFT data are real numbers, not integer
counts. However, artificially quantizing the DFT measurements renders the data as
nonnegative integers. Estimates of the fundamental frequency can be computed by
thinking of the DFT cells as cells of a histogram and the integer data as histogram
counts of a PPP. A partial justification for this interpretation is given in Appendix
F. The spectral model (3.63) was proposed for generalized (noninteger) harmonic
structure in [73].
As the above discussion shows, the data likelihood function for Gaussian sums is
bounded above only if the covariance matrices are bounded away from singularity.
One way to do this is to force the condition number of the covariance matrices to
remain below a specified threshold. In practice, this strategy is easily implemented
in every EM iteration by appropriately modifying the eigenvalues of Σ̂_ℓ^{(n)}, the EM
covariance matrix estimate at iteration n. However, this abusive practice destroys
the convergence properties of EM.
A principled method that avoids singularities entirely while also preserving EM
convergence properties employs a Bayesian methodology. In a Bayesian method,
an appropriate prior pdf is assigned to each of the parameters of the Gaussian sum.
For the covariance matrices, the natural prior is the Wishart matrix-valued pdf. The
hyperparameters of the Wishart density are the specified number of degrees of free-
dom and a positive definite target matrix. Similarly, the natural priors for the coeffi-
cients and mean vectors of the sum are Dirichlet and Gaussian densities, respec-
tively. Incorporating these Bayesian priors into the PPP likelihood function and
invoking the EM method leads to Bayesian estimators. In particular, the Bayesian
estimates for the covariance matrices of the Gaussian sum are necessarily bounded
away from singularity because the Wishart target matrix is positive definite. The
details for Gaussian mixtures, that is, for Gaussian sums that integrate to one, are
given in [99].
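As a concrete illustration of the idea (not necessarily the exact estimator derived in [99]), a conjugate inverse-Wishart style M-step shrinks each covariance update toward a positive definite target matrix, which keeps the update nonsingular even when a component collapses onto very few points:

```python
import numpy as np

def regularized_cov_update(scatter, N_l, target, dof, dim):
    """MAP-style covariance update for one component under an inverse-Wishart
    prior with scale matrix `target` and `dof` degrees of freedom (a sketch;
    the exact prior form used in [99] may differ).
    scatter = sum_j w_lj (x_j - mu_l)(x_j - mu_l)^T,  N_l = sum_j w_lj."""
    return (scatter + target) / (N_l + dof + dim + 1.0)

# Example: even with no effective data (N_l = 0, zero scatter), the update is
# a scaled copy of the target matrix and can never become singular.
print(regularized_cov_update(np.zeros((2, 2)), N_l=0.0,
                             target=0.1 * np.eye(2), dof=5.0, dim=2))
```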
The full potential of the Bayesian method is not really exploited here since the
prior densities are used to avoid matrix singularities and other numerical issues.
These methods add robustness to the numerical procedures, so they are certainly
valuable. Nonetheless, in practice, this kind of Bayesian parameter estimate does not
directly address the over-fitting problem itself. The story changes entirely, however,
if there are application-specific justifications for invoking Bayesian priors.
Chapter 4
Cramér-Rao Bound (CRB) for Intensity
Estimates
The most common measure of estimator quality is the Cramér-Rao bound (CRB)
on estimation variance. It is an important bound in both theory and practice. It
is important theoretically because no unbiased estimator of θ can have a smaller
variance than the CRB. Versions of the CRB are known for biased estimators, but
the bound in this case depends on the specific estimator used.
An estimator whose variance equals the CRB is said to be an “efficient” esti-
mator. Although efficient estimators do not exist for every problem, the CRB is still
useful in practice because, under mild assumptions, maximum likelihood estimators
are asymptotically unbiased and efficient. The CRB is also useful in another way.
Approximate estimation algorithms are often implemented to reduce computational
complexity or accommodate data quality, and it is highly desirable to know how
close these approximate estimators are to the CRB. Such studies, performed by
simulation or otherwise, may reveal that the approximate estimator is good enough for the application at hand.
1 The oldest surviving astronomical text in dialetto toscano (the Italian dialect spoken in Tuscany).
4.1 Background
Several facts about the CRB are worth mentioning explicitly at the outset. Perhaps
the most interesting to readers new to the CRB is that the CRB is determined solely
by the data pdf. In other words, the CRB does not involve any actual data. Data
influence the CRB only via the parametric form of the data pdf. For the CRB to be
useful, it is imperative that the pdf accurately describe the real data.
Classical thinking says the only good estimators are the unbiased estimators
which are defined in the next section. Such thinking is incorrect—there are use-
ful biased estimators in several applications. Moreover, techniques are known for
trading off variance and bias to minimize the mean squared error, a quantity of great
utility in practice. Nonetheless, the classical CRB for unbiased estimators is the
focus of the discussion here. The CRB for biased estimators is given in Section 4.1.4
for estimators with known bias.
For the special case of one parameter, the crucial insight that makes the CRB
tick is the Cauchy-Schwarz inequality. Multiparameter problems require a modi-
fied approach because of the necessity of defining what a lower bound means in
more than one dimension. Maximizing a ratio of positive definite quadratic forms
(a Rayleigh quotient) replaces the Cauchy-Schwarz inequality in the multiparameter
case. This approach parallels the one given in [46]. The CRB for multiple parameters
is discussed in this section in a general context.
Let p X (x ; θ ) denote the parameterized data pdf, where X is the random data, x
is a realization of the data, and θ is a real valued parameter vector.
Let
θ ∈ Θ ⊂ Rn θ ,
where n θ denotes the number of parameters and Θ is the set of all valid parameter
vectors. The pdf p X (x ; θ ) is assumed differentiable with respect to every compo-
nent of the vector θ for all x ∈ R ⊂ Rn x , where n x is the dimension of the data
space. The data space R is independent of θ . The gradient ∇θ p X (x ; θ ) is a column
vector in Rn θ .
For any parameter vector θ ∈ Θ, the Fisher Information Matrix (FIM) of θ is the
n θ × n θ matrix defined by
2 A square (real) matrix A is positive semidefinite if and only if c T A c ≥ 0 for all vectors c.
    J(θ) = E[ (∇_θ log p_X(x ; θ)) (∇_θ log p_X(x ; θ))ᵀ ]    (4.2)
         ≡ ∫_R (∇_θ log p_X(x ; θ)) (∇_θ log p_X(x ; θ))ᵀ p_X(x ; θ) dx ,    (4.3)
Several lower bounds on the variance of unbiased estimators are available, but the
best known and most widely used is the CRB. To see that (4.7) holds, first write the
equations that are equivalent to the statement that θ̂ (X ) is unbiased:
    ∫_R θ̂_i(x) p_X(x ; θ) dx = θ_i ,    i = 1, . . . , n_θ ,

where θ_i and θ̂_i(X) are the i-th components of θ and θ̂(X), respectively. Differenti-
ating each of these equations with respect to θ_j, j = 1, . . . , n_θ, gives

    ∫_R θ̂_i(x) p_X(x ; θ) ( ∂/∂θ_j ) log p_X(x ; θ) dx = δ_{ij} ,
Now substitute U(x) = aᵀ θ̂(x) and V(x) = bᵀ s(x ; θ₀). The numerator simpli-
fies to aᵀ b using (4.5) and (4.8). The first factor in the denominator is aᵀ Var(θ̂) a.
From (4.6), the second factor in the denominator is bᵀ J(θ₀) b. Making these substi-
tutions, squaring both sides of (4.9) and then multiplying through by aᵀ Var(θ̂) a
gives

    ( aᵀ b )² / ( bᵀ J(θ₀) b ) ≤ aᵀ Var(θ̂) a .    (4.10)
The maximum value of the Rayleigh quotient on the left hand side is unchanged
if b is multiplied by an arbitrary nonzero real number. The equality constraint
bᵀ J(θ₀) b = 1 eliminates this scale factor. The maximum value is attained at
the solution of the constrained optimization problem:

    b* = J⁻¹(θ₀) a / ( aᵀ J⁻¹(θ₀) a )^{1/2} .

Substituting b* into (4.10) gives

    aᵀ J⁻¹(θ₀) a ≤ aᵀ Var(θ̂) a    (4.13)

for all nonzero vectors a. Hence, Var(θ̂) − J⁻¹(θ₀) is positive semidefinite, and
this establishes (4.7).
4.1.4 Spinoffs
There is more to learn from (4.13). For all i and j, denote the (i, j)-th elements of
Var(θ̂) and J⁻¹(θ₀) by Var_{ij}(θ̂) and J⁻¹_{ij}(θ₀), respectively. Suppose all the com-
ponents of a are zero except the j-th, which equals one. Then (4.13) gives

    Var_{jj}(θ̂) ≥ J⁻¹_{jj}(θ₀) .    (4.14)
In words, this says that the smallest variance of any unbiased estimator of the j-th
parameter θ j is the ( j, j)-th element of the inverse of the FIM of θ . The result is
important in the many applications in which the off-diagonal elements of the CRB
are of no intrinsic interest.
The inequality (4.13) yields still more. If all the components of a are zero except
for the subset with indices in M ⊂ {1, . . . , n θ }, then
    Var_{M×M}(θ̂) ≥ J⁻¹_{M×M}(θ₀) .    (4.15)
Here, Var M×M θ̂ denotes the M × M submatrix of Var θ̂ obtained by removing
all rows and columns of Var θ̂ that are not in the index set M, and similarly for
−1
J M×M (θ0 ). The inequality (4.15) allows selected off-diagonal elements of the CRB
to be evaluated as needed in the application.
If p_X(x ; θ) is twice differentiable, then it can be shown that

    J(θ₀) = − E[ ∇_θ (∇_θ)ᵀ log p_X(x ; θ₀) ] ,    (4.16)

that is,

    J_{ij}(θ₀) = − ∫_R ( ∂² / ∂θ_i ∂θ_j ) log p_X(x ; θ₀) p_X(x ; θ₀) dx .    (4.17)
This form of the FIM is widely known and used in many applications. It is also the
inspiration behind the observed information matrix (OIM) that is often used when
the FIM is unavailable. For more details in a PPP
context, see Section 4.7.
An unbiased estimator is efficient if Var(θ̂) = CRB(θ₀). This definition is
standard in the statistical literature, but it is misleading in one respect: the best
unbiased estimator is not necessarily efficient. Said more carefully, there are pdfs
p X (x ; θ ) for which the unbiased estimator with the smallest covariance matrix is
known explicitly, but this estimator does not achieve the CRB.
An estimator θ̂ (x) is biased if
E θ̂ = θ0 + b(θ0 ) (4.18)
and b(θ0 ) is nonzero. The nonzero term b(θ0 ) is called the estimator bias. The bias
clearly depends on the particular estimator, and it is often difficult to evaluate. If the
form of b(θ0 ) is known, and b(θ0 ) is differentiable with respect to θ , then the CRB
for the biased estimator θ̂ is
    Var( θ̂ − θ₀ − b(θ₀) ) ≥ [ I + ∇_θ bᵀ(θ) |_{θ=θ₀} ] J⁻¹(θ₀) [ I + ∇_θ bᵀ(θ) |_{θ=θ₀} ]ᵀ ,    (4.19)

where I is the n_θ × n_θ identity matrix, and the gradients are evaluated at the true
value θ₀. The matrix dimensions are consistent since bᵀ(θ) = ( b₁(θ), . . . , b_{n_θ}(θ) )
is a row, and its gradient is the n_θ × n_θ matrix

    ( ∇_θ b₁(θ), . . . , ∇_θ b_{n_θ}(θ) ) .
The matrix J (θ0 ) is the FIM for θ0 . The bound depends on the estimator via the
derivative of the bias.
A Bayesian version of the CRB called the posterior CRB (PCRB) is useful when
the parameter θ is a random variable with a specified prior pdf. A good discussion
of the PCRB, and in fact the very first discussion of it anywhere, is found in Van
Trees.

4.2 CRB for PPP Intensity with Sample Data

For PPP sample data on the bounded set R, the FIM is amazingly simple. If
λ(s ; θ ) > 0 for all s ∈ R, then the FIM for unbiased estimators of θ is
    J(θ) = ∫_R ( 1 / λ(s ; θ) ) [ ∇_θ λ(s ; θ) ] [ ∇_θ λ(s ; θ) ]ᵀ ds .    (4.21)
To see this, start with the loglikelihood of a realization ξ = (n, {x₁, . . . , x_n}),

    log p_Ξ(ξ ; θ) = − log n! − ∫_R λ(s ; θ) ds + Σ_{j=1}^n log λ(x_j ; θ) .    (4.22)

Its gradient with respect to θ is

    ∇_θ log p_Ξ(ξ ; θ) = − ∫_R ∇_θ λ(s ; θ) ds + Σ_{j=1}^n ( 1 / λ(x_j ; θ) ) ∇_θ λ(x_j ; θ) .    (4.23)
The outer product of (4.23) with itself is the sum of three terms:

    ∇_θ log p_Ξ(ξ ; θ) [ ∇_θ log p_Ξ(ξ ; θ) ]ᵀ
        = [ ∫_R ∇_θ λ(s ; θ) ds ] [ ∫_R ∇_θ λ(s ; θ) ds ]ᵀ
        − 2 [ ∫_R ∇_θ λ(s ; θ) ds ] ( Σ_{j=1}^n ( 1 / λ(x_j ; θ) ) ∇_θ λ(x_j ; θ) )ᵀ    (4.24)
        + ( Σ_{i=1}^n ( 1 / λ(x_i ; θ) ) ∇_θ λ(x_i ; θ) ) ( Σ_{j=1}^n ( 1 / λ(x_j ; θ) ) ∇_θ λ(x_j ; θ) )ᵀ .
The FIM is the sum of the expected values of the three terms in (4.24), where the
expectation is given by (2.23). The expectation of the first term is trivial—it is the
expectation of a constant. The other two expectations look formidable, but they are
not. The sums in both terms are the same form as (2.30), so the expectation of the
second term is, from (2.32),

    − 2 [ ∫_R ∇_θ λ(s ; θ) ds ] [ ∫_R ∇_θ λ(s ; θ) ds ]ᵀ ,    (4.25)
Example 4.1 Scaled Intensity. Let λ(x ; I) = I f(x), where f(x) ≥ 0 is a known
function and I > 0 is the unknown scale factor. From (4.21), the FIM is
J(I) = ∫_R f(s) ds / I, so

    CRB(I) = I / ∫_R f(s) ds .

The ML estimator is

    Î_ML = m / ∫_R f(s) ds ,

where m is the number of points in the realization. It is unbiased because

    E[ Î_ML ] = E[m] / ∫_R f(s) ds = I ∫_R f(s) ds / ∫_R f(s) ds = I .

Its variance is

    Var[ Î_ML ] = Var[m] / ( ∫_R f(s) ds )² = I ∫_R f(s) ds / ( ∫_R f(s) ds )² = CRB(I) ,

so the ML estimator attains the lower bound and is, by definition, an efficient esti-
mator.
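The efficiency claim is easy to confirm by simulation. In the sketch below (Python/NumPy, illustrative numbers), only the total count m matters for Î_ML, so the simulation reduces to Poisson draws; the sample variance of the estimates should match CRB(I).

```python
import numpy as np

rng = np.random.default_rng(6)

# lambda(x; I) = I * f(x) on R = [0, 1], with f known; here f(x) = 1 + x (illustrative).
I_true = 40.0
f_integral = 1.5                   # integral of f over R
crb = I_true / f_integral          # CRB(I) = I / int_R f(s) ds

est = []
for _ in range(20_000):
    m = rng.poisson(I_true * f_integral)   # only the count matters for I_hat
    est.append(m / f_integral)             # I_hat_ML = m / int_R f(s) ds
est = np.asarray(est)
print(est.mean(), est.var(), crb)          # mean ~ I_true, variance ~ CRB
```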
Example 4.2 Linear Combination of Known Intensities. Generalizing the previous
example, let λ(x ; θ) = Iᵀ f(x), where the components of the vector f(x) ≡
( f₁(x), . . . , f_L(x) )ᵀ are known nonnegative functions and I = ( I₁, . . . , I_L )ᵀ is the
vector of unknown scale factors. From (4.21), the FIM is

    FIM(I) = ∫_R ( 1 / Iᵀ f(x) ) f(x) f(x)ᵀ dx .

Evaluating FIM(I) and its inverse CRB(I) requires numerical methods. The ML
estimator Î_ML is also not explicitly known, but is found numerically by solving
the necessary equations (3.3), or by the EM recursion (3.29). Even in this simple
example, it is not clear whether or not Î_ML is unbiased.
Example 4.3 Oops. For the parameter vector θ ∈ R², define the intensity (4.27) to be
constant on R with a value that depends on θ only through the distance ‖θ − c‖,
where c ∈ R² is given and R = [0, 1] × [0, 1]. The FIM is, from (4.21),

    J(θ) = ( θ − c )( θ − c )ᵀ / ‖θ − c‖² .
The matrix J (θ ) is clearly rank one, so the CRB fails to exist. The problem lies in
the parameterization, not the FIM. The intensity function (4.27) is constant on R, so
the “intrinsic” dimensionality of the intensity function parameter is only one. The
dimension of θ is two.
4.3 CRB for PPP Intensity with Histogram Data

For histogram data on K disjoint cells R₁, . . . , R_K, the FIM is

    J(θ) = Σ_{j=1}^K ( 1 / ∫_{R_j} λ(s ; θ) ds ) [ ∫_{R_j} ∇_θ λ(s ; θ) ds ] [ ∫_{R_j} ∇_θ λ(s ; θ) ds ]ᵀ .    (4.28)
This expression reduces to (4.21) in the limit as the number of histogram cells goes to
infinity and their size goes to zero. To see (4.28) requires nothing more than matrix
algebra, but it is worth presenting nonetheless. The rest of the section up to Example
4.4 can be skipped on a first reading.
Start with the loglikelihood function of the data. Let n_{1:K} = (n₁, . . . , n_K)
denote the integer counts in the histogram cells. From (2.62), the logarithm of the pdf
of the data is

    log p(n_{1:K} ; θ) = − Σ_{j=1}^K log( n_j ! ) − ∫_R λ(s ; θ) ds + Σ_{j=1}^K n_j log( ∫_{R_j} λ(s ; θ) ds ) .
Its gradient is

    ∇_θ log p(n_{1:K} ; θ) = − ∫_R ∇_θ λ(s ; θ) ds + Σ_{j=1}^K ( n_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds .    (4.29)
The FIM is the expectation of the outer product of this gradient with itself. As was
done with conditionally independent data, the outer product is written as the sum of
three terms:

    [ ∇_θ log p(n_{1:K} ; θ) ] [ ∇_θ log p(n_{1:K} ; θ) ]ᵀ
        = [ ∫_R ∇_θ λ(s ; θ) ds ] [ ∫_R ∇_θ λ(s ; θ) ds ]ᵀ
        − 2 [ ∫_R ∇_θ λ(s ; θ) ds ] ( Σ_{j=1}^K ( n_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds )ᵀ    (4.30)
        + ( Σ_{j=1}^K ( n_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds )
          ( Σ_{j'=1}^K ( n_{j'} / ∫_{R_{j'}} λ(s ; θ) ds ) ∫_{R_{j'}} ∇_θ λ(s ; θ) ds )ᵀ .
The first term is independent of the data, so it is identical to its expectation. The
second term is a product of two factors, the first of which is independent of the data
so it multiplies the expectation of the other factor, which is a sum over K terms.
Since K is the number of histogram cells and is fixed, the expectation of the sum is
    E[ Σ_{j=1}^K ( n_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds ]
        = Σ_{j=1}^K ( ∫_{R_j} λ(s ; θ) ds / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds
        = Σ_{j=1}^K ∫_{R_j} ∇_θ λ(s ; θ) ds = ∫_R ∇_θ λ(s ; θ) ds .    (4.31)
The expectation of the second term of (4.30) is therefore

    − 2 [ ∫_R ∇_θ λ(s ; θ) ds ] [ ∫_R ∇_θ λ(s ; θ) ds ]ᵀ .    (4.32)
The third term of (4.30) is the double sum

    Σ_{j=1}^K Σ_{j'=1}^K ( n_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds
                        ( ( n_{j'} / ∫_{R_{j'}} λ(s ; θ) ds ) ∫_{R_{j'}} ∇_θ λ(s ; θ) ds )ᵀ .
There are two cases. In the first case, j ≠ j' and the cells R_j and R_{j'} are disjoint.
The summands are therefore independent and the expectation of their product is the
product of their expectations. In the same manner as done in (4.31), the expectation
simplifies to

    E[ Σ_{j, j' = 1, j ≠ j'}^K ( n_j / ∫_{R_j} λ ds ) ( n_{j'} / ∫_{R_{j'}} λ ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds ( ∫_{R_{j'}} ∇_θ λ(s ; θ) ds )ᵀ ]
        = Σ_{j, j' = 1, j ≠ j'}^K [ ∫_{R_j} ∇_θ λ(s ; θ) ds ] [ ∫_{R_{j'}} ∇_θ λ(s ; θ) ds ]ᵀ .    (4.33)
In the other case, j = j' and the double sum reduces to the single sum

    Σ_{j=1}^K ( n_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds ( ( n_j / ∫_{R_j} λ(s ; θ) ds ) ∫_{R_j} ∇_θ λ(s ; θ) ds )ᵀ .

Because K is fixed, the expectation of the sum is the sum of the expectations. Denote
the summand by e_{jj} and write it as a product of sums:

    e_{jj} = ( Σ_{ρ=1}^{n_j} ∫_{R_j} ∇_θ λ(s ; θ) ds / ∫_{R_j} λ(s ; θ) ds )
             ( Σ_{ρ'=1}^{n_j} ∫_{R_j} ∇_θ λ(s ; θ) ds / ∫_{R_j} λ(s ; θ) ds )ᵀ .

Taking the expectation, and using E[ n_j² ] = ∫_{R_j} λ(s ; θ) ds + ( ∫_{R_j} λ(s ; θ) ds )² for the
Poisson count n_j, gives

    E[ e_{jj} ] = [ ∫_{R_j} ∇_θ λ(s ; θ) ds ] [ ∫_{R_j} ∇_θ λ(s ; θ) ds ]ᵀ
                 + ( 1 / ∫_{R_j} λ(s ; θ) ds ) [ ∫_{R_j} ∇_θ λ(s ; θ) ds ] [ ∫_{R_j} ∇_θ λ(s ; θ) ds ]ᵀ .    (4.34)
Now add the sum over j of (4.34) to (4.33). The double sum no longer has the
exception j = j
, so it becomes the product of single term sums over j. The single
term sums are identical to integrals over all R. Therefore, the expectation of the
third term is
    [ ∫_R ∇_θ λ(s ; θ) ds ] [ ∫_R ∇_θ λ(s ; θ) ds ]ᵀ + J(θ) .
Finally, adding the expectations of the three terms gives the FIM (4.28).
Example 4.4 Scaled Intensity (continued). Let the intensity be the same as in Exam-
ple 4.1. The FIM for the scale factor I using histogram data is

    J(I) = Σ_{j=1}^K ( 1 / ( I ∫_{R_j} f(x) dx ) ) ( ∫_{R_j} f(x) dx )²
         = ( 1 / I ) Σ_{j=1}^K ∫_{R_j} f(x) dx = ∫_R f(x) dx / I .
The CRB of I is therefore the same for both conditionally independent sample data
and histogram data. The ML estimator for histogram data is given by (3.9) with the
Gaussian pdf replaced by f(x). Its variance is

    Var[ Î_ML ] = Σ_{j=1}^K Var[ m_j ] / ( ∫_R f(s) ds )²
                = Σ_{j=1}^K I ∫_{R_j} f(s) ds / ( ∫_R f(s) ds )²
                = I ∫_R f(s) ds / ( ∫_R f(s) ds )² = I / ∫_R f(s) ds ,

so the ML estimator is efficient for histogram data as well.
4.4 CRB for PPP Intensity on Discrete Spaces

For a PPP on a discrete space Φ = {φ₁, φ₂, . . .} with intensity vector λ(θ) =
( λ₁(θ), λ₂(θ), . . . ), the FIM for unbiased estimators of θ is

    J(θ) = Σ_j ( 1 / λ_j(θ) ) [ ∇_θ λ_j(θ) ] [ ∇_θ λ_j(θ) ]ᵀ ,    (4.35)

where the derivatives are evaluated at the true value of θ. To see that (4.35) holds, it is only
necessary to follow the steps of the proof of (4.21) for PPPs on continuous spaces,
replacing integrals with the appropriate sums. This requires verifying that certain
results hold in the discrete case, e.g., Campbell's Theorem. Details are omitted.
A quick intuitive way to see that the result must hold is as follows: imagine that
the points of Φ are isolated points of the region over which the integral in (4.21) is
performed. Let the intensity function be a test function sequence for these isolated
points (of the kind used to define the Dirac delta function). Then (4.21) goes in the
limit to (4.35) as the test function sequence “converges” to the Dirac delta function.
Example 4.5 Intensity Vector. Let Ξ denote a PPP on the discrete space Φ =
{φ1 , φ2 } with intensity vector λ(θ ) = (λ1 (θ ), λ2 (θ )) , where θ = (θ1 , θ2 ) and
\[
\lambda_1(\theta) = \theta_1\,, \qquad \lambda_2(\theta) = \theta_2\,. \tag{4.36}
\]
Example 4.6 Parametrically Tied Intensity Vector. Let Ξ denote a PPP on the discrete space Φ = {φ_1, φ_2} with intensity vector λ(θ) = (I cos θ, I sin θ), where the scale I is known and the angle θ is the estimated parameter. The FIM for θ is
\[
J(\theta) \;=\; \frac{1}{I\cos\theta}\,(-I\sin\theta)^{2} \;+\; \frac{1}{I\sin\theta}\,(I\cos\theta)^{2}
\;=\; I\,\frac{\sin^{3}\theta + \cos^{3}\theta}{\sin\theta\,\cos\theta}\,.
\]
\[
\gamma_{SNR} \;=\; \frac{\lambda_S}{\lambda_N}\,.
\]
\[
\nabla_\mu\,\lambda(x\,;\mu) \;=\; \lambda_S\, N(x\,;\mu,\Sigma)\, \Sigma^{-1}(x-\mu)\,. \tag{4.39}
\]
From (4.21), the FIM for μ using conditionally independent sample data is written in the form
\[
J_R(\mu) \;=\; \lambda_S\, \Sigma^{-1}\, W_R(\mu)\, \Sigma^{-1}\,, \tag{4.41}
\]
where W_R(μ) is a weighted covariance matrix determined by the signal and noise intensities over the gate R. The matrix W_R(μ) is evaluated at the correct value of μ. The CRB is the inverse of the FIM, so
\[
CRB_R(\mu) \;=\; \frac{1}{\lambda_S}\, \Sigma\, W_R^{-1}(\mu)\, \Sigma\,. \tag{4.42}
\]
The coefficient λ_S sets the expected number of signal-generated points in a realization, and thus inversely scales the estimation variance at a given SNR. On the other hand, the trade-off between the signal and noise pdfs and their effect on the shape (eigenvalues and eigenvectors) of the weighted covariance matrix W_R(μ) is determined by the average fraction of points in the realization that originate from the signal. This fraction is governed by γ_SNR. Good estimation therefore depends on both λ_S and γ_SNR.
The CRB for the covariance matrix Σ can also be found using the general result
(4.21). The results seem to provide little insight and are somewhat tedious to obtain,
so they are not presented here. Details for the closely related problem of evaluating
the CRB of Σ for the classical multivariate Gaussian density using i.i.d. data can be
found in [116].
The general result (4.21) will also give the CRB for non-Gaussian signals, that
is, for data that are a realization of the PPP with intensity
Example 4.7 Effect of Gating on Estimated Mean. The structure of the CRB is
explored numerically for the special case R = R1 . Let
\[
R(\rho) \;=\; \{\, x : (x-\mu)^T \Sigma^{-1} (x-\mu) \le \rho^2 \,\},
\]
The term in brackets has units of length, so the CRB has units of length squared, as
required in one dimension.
The effect of SNR and gate size on the CRB is seen by plotting C R BR(ρ) (μ) as
a function of ρ for several values of SNR and for, say, σ = 1. Let pnoise (x) ≡ 1.
It is seen in Fig. 4.1 that the CRB decreases with increasing ρ and SNR. It is also
seen that there is no practical reason to use gates larger than ρ = 2.5 in the one
dimensional case, regardless of SNR.
Fig. 4.1 CRB of (4.43) for μ as a function of gate size ρ: λ_N = √(2π) ≈ 2.5 points per unit length; SNR = 3(1)10; σ = 1. Gates larger than ρ = 2.5 do little to improve estimation, at least in R¹
\[
\lambda(x\,;\mu,\Lambda) \;\equiv\; \lambda(x\,;\mu_1,\ldots,\mu_L,\lambda_1,\ldots,\lambda_L)
\;=\; \lambda_N\, p_{noise}(x) \;+\; \sum_{\ell=1}^{L} \lambda_\ell\, N(x\,;\mu_\ell,\Sigma_\ell)\,, \tag{4.44}
\]
L
λnoise( j) (x) ≡ λ N pnoise (x) + λ N (x ; μ , Σ ) (4.45)
=1
= j
in (4.38). The FIM and CRB are then evaluated using the appropriately interpreted
versions of (4.41) and (4.42).
\[
\begin{aligned}
\bigl[\lambda_i\, N(x\,;\mu_i,\Sigma_i)\, \Sigma_i^{-1}(x-\mu_i)\bigr]\,
\bigl[\lambda_j\, N(x\,;\mu_j,\Sigma_j)\, \Sigma_j^{-1}(x-\mu_j)\bigr]^T \qquad&\\
= \lambda_i \lambda_j\, N(x\,;\mu_i,\Sigma_i)\, N(x\,;\mu_j,\Sigma_j)\,
\Sigma_i^{-1}(x-\mu_i)(x-\mu_j)^T \Sigma_j^{-1}&\,.
\end{aligned} \tag{4.47}
\]
The FIM, J(μ), is an L × L block matrix. Its entries are matrices of size n_x × n_x, so its full dimension is (L n_x) × (L n_x). The ij-th block of J(μ) is the integral over R of (4.47) divided by λ(x; μ, Λ). Collecting the blocks, the FIM takes the form
\[
J(\mu) \;=\; D^{-1}(\Lambda)\; W_R(\mu,\Lambda)\; D^{-1}(\Lambda)\,,
\]
where
\[
D^{-1}(\Lambda) \;=\;
\begin{bmatrix}
\lambda_1 \Sigma_1^{-1} & 0 & \cdots & 0 \\
0 & \lambda_2 \Sigma_2^{-1} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_L \Sigma_L^{-1}
\end{bmatrix} \tag{4.51}
\]
\[
CRB(\mu) \;=\; D(\Lambda)\; W_R^{-1}(\mu,\Lambda)\; D(\Lambda)\,, \tag{4.52}
\]
where
\[
D(\Lambda) \;=\;
\begin{bmatrix}
\tfrac{1}{\lambda_1}\Sigma_1 & 0 & \cdots & 0 \\
0 & \tfrac{1}{\lambda_2}\Sigma_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \tfrac{1}{\lambda_L}\Sigma_L
\end{bmatrix}. \tag{4.53}
\]
The CRB depends jointly on all the means μ j because, in general, none of the block
matrix elements of WR (μ, Λ) are zero.
The joint CRB separates into CRBs for the individual PPPs in the superposition if
the matrix WR (μ, Λ) is block diagonal. This happens if all the off-diagonal n x × n x
blocks are approximately zero, that is, if
\[
W_R^{ij}(\mu,\Lambda) \;\approx\; 0 \quad \text{for all } i \neq j\,.
\]
The approximation is a good one if the coefficient of the outer product in the integral (4.49) is nearly zero for i ≠ j. If the Gaussian pdfs are all well separated, that is, if
\[
\max_{i \neq j}\; (\mu_i - \mu_j)^T \bigl(\Sigma_i + \Sigma_j\bigr)^{-1} (\mu_i - \mu_j) \;\gg\; 1\,,
\]
then the joint CRB is block diagonal and splits into separate CRBs for each mean.
More generally, the joint CRB splits the mean vectors into disjoint groups or clus-
ters. The CRBs of the means within each cluster are coupled, but the CRBs of means
in different clusters are independent.
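The separation criterion above can be checked numerically before deciding whether to invert the full joint FIM or to treat clusters of means separately. A minimal sketch is given below; the threshold value and the union-find clustering rule are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def separation_clusters(mus, sigmas, threshold=10.0):
    """Group Gaussian components whose pairwise separation statistic
    (mu_i - mu_j)^T (Sigma_i + Sigma_j)^{-1} (mu_i - mu_j) falls below
    `threshold`; well-separated clusters have approximately decoupled CRBs."""
    L = len(mus)
    parent = list(range(L))                      # union-find over overlapping components
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(L):
        for j in range(i + 1, L):
            d = mus[i] - mus[j]
            stat = float(d @ np.linalg.solve(sigmas[i] + sigmas[j], d))
            if stat < threshold:                 # not well separated: same cluster
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(L):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Example: the first two 2-D components overlap, the third is far away.
mus = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([20.0, 0.0])]
sigmas = [np.eye(2)] * 3
print(separation_clusters(mus, sigmas))          # e.g. [[0, 1], [2]]
```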
If the coefficients Λ and the means μ are estimated together, the CRB changes yet
again. The FIM in this case, denoted by J (μ, Λ), is larger than J (μ) by exactly L
rows and columns, so it has dimension (L + L n x ) × (L + L n x ). Partition it so
that
\[
J(\mu,\Lambda) \;=\;
\begin{bmatrix}
U_R(\mu,\Lambda) & V_R(\mu,\Lambda) \\
V_R^T(\mu,\Lambda) & W_R(\mu,\Lambda)
\end{bmatrix}. \tag{4.54}
\]
Finally, the L × (L n_x) matrix in the upper right hand corner of the partition, V_R(μ, Λ), contains the cross terms, that is, the terms corresponding to the outer products of the gradients with respect to the coefficients Λ and the means μ.
The lower left submatrix is clearly the transpose of V_R(μ, Λ). The joint CRB of Λ and μ is the inverse of J(μ, Λ). It differs significantly from the CRB for μ alone, which is the inverse of J(μ).
The inverse of the OIM is always positive definite when θ̂ M L is a local maximum
interior to the domain of allowed parameter vectors, Θ. Intuitively, it is tempting to
think of the OIM as the FIM without the expectation over the data. Unfortunately,
this succinct description is slightly inaccurate technically, since the FIM is evaluated
at the true parameter value while the OIM is evaluated at the ML estimate.
\[
\lambda(x\,;\theta) \;=\; \sum_{\ell=1}^{L} f_\ell(x\,;\theta_\ell)\,, \tag{4.58}
\]
where ξ = (n, {x1 , . . . , xn }) is the given realization of the PPP. The following
algebraic identity—whose terms are defined in the next paragraph—is obtained by
straightforward, but tedious, differentiation and matrix algebra:
\[
-\nabla_\theta\,[\nabla_\theta]^T \log p(\xi\,;\theta)
\;=\; A(\theta) + B(\theta) + C(\theta) + \sum_{j=1}^{n} S_j(\theta)\, S_j^T(\theta)\,. \tag{4.60}
\]
Those who wish to verify this result will find the identity
\[
\nabla_\theta \log \lambda(x\,;\theta) \;=\; \sum_{\ell=1}^{L} \frac{f_\ell(x\,;\theta_\ell)}{\lambda(x\,;\theta)}\, \nabla_\theta \log f_\ell(x\,;\theta_\ell)
\]
useful.
\[
w_\ell(x_j\,;\theta) \;=\; \frac{f_\ell(x_j\,;\theta_\ell)}{\lambda(x_j\,;\theta)}\,, \qquad j=1,\ldots,n\,,\;\; \ell=1,\ldots,L\,. \tag{4.62}
\]
The matrices A(θ), B(θ), and C(θ) are block diagonal, and their ℓ-th diagonal blocks are of size dim(θ_ℓ) × dim(θ_ℓ). The ℓ-th blocks are, for ℓ = 1, . . . , L,
\[
\begin{aligned}
A_\ell(\theta) &= -\sum_{j=1}^{n} w_\ell(x_j\,;\theta)\, \bigl[\nabla_{\theta_\ell} \log f_\ell(x_j\,;\theta_\ell)\bigr] \bigl[\nabla_{\theta_\ell} \log f_\ell(x_j\,;\theta_\ell)\bigr]^T \\
B_\ell(\theta) &= -\sum_{j=1}^{n} w_\ell(x_j\,;\theta)\, \nabla_{\theta_\ell} \bigl[\nabla_{\theta_\ell}\bigr]^T \log f_\ell(x_j\,;\theta_\ell) \\
C_\ell(\theta) &= \int_R \nabla_{\theta_\ell} \bigl[\nabla_{\theta_\ell}\bigr]^T f_\ell(x\,;\theta_\ell)\, dx\,.
\end{aligned}
\]
\[
\lambda(x\,;\theta) \;=\; \lambda_0(x) \;+\; \sum_{\ell=1}^{L} \lambda_\ell\, N(x\,;\mu_\ell,\Sigma_\ell)\,,
\]
where λ0 (x) is the intensity of a known PPP background process. Except for λ0 (x),
this sum is the same as (4.58) with
\[
f_\ell(x\,;\theta_\ell) \;=\; \lambda_\ell\, N(x\,;\mu_\ell,\Sigma_\ell)\,, \qquad \ell = 1,\ldots,L\,.
\]
After EM algorithm convergence, the OIM is easily evaluated using the weights that are computed during the last EM iteration. Here, for simplicity, the coefficients λ_ℓ and covariance matrices Σ_ℓ are specified, and only the mean vectors are estimated, so in the above θ_ℓ = μ_ℓ and θ ≡ μ = (μ_1, . . . , μ_L). The only change in the OIM calculation required to accommodate the affine term λ_0(x) is to adjust the weights to include it in the denominator; explicitly,
\[
w_\ell(x_j\,;\mu) \;=\; \frac{\lambda_\ell\, N(x_j\,;\mu_\ell,\Sigma_\ell)}{\lambda_0(x_j) + \sum_{\ell'=1}^{L} \lambda_{\ell'}\, N(x_j\,;\mu_{\ell'},\Sigma_{\ell'})}\,, \qquad j=1,\ldots,n\,,\;\; \ell=1,\ldots,L\,. \tag{4.63}
\]
With these weights, the terms in (4.60) become
\[
\begin{aligned}
A_\ell(\mu) &= -\,\Sigma_\ell^{-1} \left[\, \sum_{j=1}^{n} w_\ell(x_j\,;\mu)\, (x_j-\mu_\ell)(x_j-\mu_\ell)^T \right] \Sigma_\ell^{-1} \\
B_\ell(\mu) &= \left(\, \sum_{j=1}^{n} w_\ell(x_j\,;\mu) \right) \Sigma_\ell^{-1} \\
C_\ell(\mu) &= \lambda_\ell\, \Sigma_\ell^{-1} \left[\, \int_R N(x\,;\mu_\ell,\Sigma_\ell)\, \bigl(-\Sigma_\ell + (x-\mu_\ell)(x-\mu_\ell)^T\bigr)\, dx \right] \Sigma_\ell^{-1} \\
S_j(\mu) &=
\begin{bmatrix}
w_1(x_j\,;\mu)\, \Sigma_1^{-1}(x_j-\mu_1) \\
\vdots \\
w_L(x_j\,;\mu)\, \Sigma_L^{-1}(x_j-\mu_L)
\end{bmatrix}.
\end{aligned}
\]
In these equations, μ is taken to be the ML estimate μ̂_ML ≡ (μ̂_1, . . . , μ̂_L).
The equations are more intuitive when written in a different form. Adding A (μ)
and B (μ) gives
\[
A_\ell(\mu) + B_\ell(\mu) \;=\; \left(\, \sum_{j=1}^{n} w_\ell(x_j\,;\mu) \right) \Sigma_\ell^{-1} \bigl(\Sigma_\ell - \widetilde{\Sigma}_\ell \bigr)\, \Sigma_\ell^{-1}\,, \tag{4.64}
\]
where the weighted covariance matrix for the ℓ-th Gaussian term is
\[
\widetilde{\Sigma}_\ell \;=\; \frac{\sum_{j=1}^{n} w_\ell(x_j\,;\mu)\, (x_j-\mu_\ell)(x_j-\mu_\ell)^T}{\sum_{j=1}^{n} w_\ell(x_j\,;\mu)}\,.
\]
The coefficient of (4.64) is the conditional expected number of samples that originate from the ℓ-th Gaussian component. Similarly,
\[
C_\ell(\mu) \;=\; -\left(\int_R \lambda_\ell\, N(x\,;\mu_\ell,\Sigma_\ell)\, dx \right) \Sigma_\ell^{-1} \bigl(\Sigma_\ell - \overline{\Sigma}_\ell\bigr)\, \Sigma_\ell^{-1}\,, \tag{4.65}
\]
where Σ̄_ℓ is the covariance of the ℓ-th Gaussian pdf restricted to R.
The negative of the coefficient of (4.65) is the expected number of samples from the ℓ-th Gaussian component. If the bulk of the ℓ-th Gaussian component lies within R, the term C_ℓ(μ) is approximately zero.
With these forms it is clear that the first three terms in the OIM are comparable
in some situations, in which case it is not clear whether or not their sum is positive
definite. In these situations, the fourth term is an important one in the OIM, since
the OIM as a whole must be positive definite at the ML estimate μ̂_ML. The (ρ, ℓ)-th block component of the sum of outer products of S_j(μ) is
\[
\left[\, \sum_{j=1}^{n} S_j(\mu)\, S_j^T(\mu) \right]_{\rho \ell}
\;=\; \Sigma_\rho^{-1} \left(\, \sum_{j=1}^{n} w_\rho(x_j\,;\mu)\, w_\ell(x_j\,;\mu)\, (x_j-\mu_\rho)(x_j-\mu_\ell)^T \right) \Sigma_\ell^{-1}\,. \tag{4.66}
\]
The only off-diagonal block terms of the OIM come from (4.66). When the Gaussian
components are well separated, the off-diagonal blocks are small, and the OIM is
approximately block diagonal. Given good separation, then, for every ℓ the sum of the four terms is positive definite at the ML estimate μ̂_ML.
The OIM calculation requires only L numerical integrals in this case, significantly fewer than the analogous FIM calculation. Some of these integrals are unnecessary if the bulk of the corresponding Gaussian density evaluated at θ̂_ℓ lies inside R, for in this case the integral in C_ℓ(θ̂) vanishes. Avoiding numerical integration is an advantage in practice, provided it is verified by simulation or otherwise that the OIM performs satisfactorily as a surrogate for the FIM.
Abstract PPP methods for tomographic imaging are presented in this chapter. The
primary emphasis is on methods for emission tomography, but transmission tomog-
raphy is also included. The famous Shepp-Vardi algorithm for positron emission
tomography (PET) is obtained via the EM method for time-of-flight data. Single-
photon emission computed tomography (SPECT) is used in practice much more
often than PET. It differs from PET in many ways, yet the models and the mathemat-
ics of the two methods are similar. (Both PET and SPECT are also closely related
to multitarget tracking problems discussed in Chapter 6.) Transmission tomogra-
phy is the final topic discussed. The Lange-Carson algorithm is derived via the EM
method. CRBs for unbiased estimators for emission and transmission tomography
are discussed. Regularization and Grenander’s method of sieves are reviewed in the
last section.
Fig. 5.1 PET processing configuration. (Image released to the Wikimedia Commons by J. Langner
[66])
Positron-electron annihilation events emit pairs of (gamma) photons that move in opposite directions. Due to conservation of momentum, these directions are essentially collinear if the positrons and electrons have effectively zero velocity. Departures from straight-line motion degrade spatial resolution, so some systems model these effects to avoid losing resolution. Straight-line propagation is assumed here.
The raw measurement data are the arrival times of photons at an array of detectors that comprise scintillator crystals and photomultipliers. A pair of photons arriving within a sufficiently short time window (measured in nanoseconds, ns) at two appropriately sited detectors determines that an annihilation event occurred: the event lies on the chord segment connecting the detectors, and the specific location on the chord is determined by the time difference of arrivals. Photons without partners within the time window are discarded. The measurement procedure produces occasional spurious annihilation events.
Many subtleties attend to the data collection and preprocessing steps for PET
and SPECT. An excellent review of these issues as well as the field of emission
tomography as a whole is found in [70].
PET is now often part of a multisensor fusion system that combines other imag-
ing methodologies such as CT (computed tomography) and MRI (magnetic reso-
nance imaging). These topics are an active area of research.
Transmission tomography, more popularly known as computed tomography
(CT), uses multiple sets of fan beams or cone beams to determine the spatial dis-
tribution of material density, that is, the local spatial variability of the attenuation
coefficient. Both fan and cone beams employ a single source that produces radia-
tion in many directions simultaneously. This shortens the data gathering time. Fan
beams are used for two dimensional imaging problems, and their detector array is
one dimensional. Similarly, cone beams are used for three dimensional problems,
and they require a two dimensional detector array. In practice, CT scans are much
less expensive than PET because CT does not require the production of short-lived radioisotopes. Contrast enhancement agents may be used to improve imaging of
certain tissue types in medical applications. PET, SPECT, and CT are used diag-
nostically for very different purposes. PET and CT provide significantly higher
resolution images than SPECT.
EM methods are discussed for PET, SPECT, and transmission tomography.
Numerical issues arise with EM algorithms when the number of estimated parame-
ters is large. For PET and SPECT, the parameters are the intensities in the array of
pixels/voxels of the image; for CT, the parameters are the pixel attenuation coeffi-
cients. Good resolution therefore requires a large number of parameters. Regular-
ization methods compatible with the EM method can alleviate many of these issues.
See [119] for further discussion of this and other topics involving medical imaging
applications of PPPs and related point processes.
A prominent alternative method for transmission tomography is Fourier recon-
struction. This approach is fully discussed in the book [28], as well as in the paper
[22]. These are Fourier analysis based methods, of which the Projection-Slice The-
orem is perhaps the best known result in the field. The approach is classical analysis
in the sense that it is ultimately grounded on the Radon transform (1917) and the
Funk-Hecke Theorem, a lovely but little known result that stimulated Radon’s inter-
est in these problems. Fourier methods are not discussed further here.
Detection errors occur when one or both of the gamma photons causes detections on the wrong pair of detectors, that is, when the annihilation is deemed to occur on the wrong chord.
In time-of-flight (TOF) systems, the arrival times at the detectors are recorded,
but with a much smaller time window of about 0.5 ns. The differential propagation
time data are preprocessed to estimate the locations of every positron-electron anni-
hilation along the detector chord. The time window of 0.5 ns corresponds to a local-
ization uncertainty of about 7.5 cm along the chord. These TOF PET systems have
met with only limited success in practice as of about 2003 [70].
The PET reconstruction problem is presented in two forms. The first uses PPP
sample data that comprises the estimated locations of every annihilation event, i.e.,
TOF data. The algorithm given here for this data is a variant of the original Shepp-
Vardi algorithm. The discussion is based on [119, Chapter 3]. Even though TOF
PET is not in widespread current use, it is presented here because of the insight it
provides and its intuitive mathematical appeal.
The other PET reconstruction problem uses histogram data, that is, the data com-
prises only the numbers of annihilation events measured by the detectors. These
counts are a realization of a PPP defined on the discrete space of detectors. In the
language and notation of Section 2.12.1 for PPPs on discrete spaces, the detectors
are the points of the discrete space Φ. Histogram data are used in the original paper
by Shepp and Vardi [110].
\[
\lambda(x) \;=\; \sum_{r=1}^{K} \lambda_r\, I_r(x)\,, \qquad x \in R\,, \tag{5.1}
\]
where I_r(x) is the indicator function of pixel R_r.
The parameter vector is Λ = (λ_1, . . . , λ_K). From the physics, the input process is a PPP on R. Realizations of this PPP model the locations of positron-electron annihilations. An input point x ∈ R is the location of a positron-electron annihilation, and the detector array estimates that it occurred at the point y in the output (measurement) space T ⊂ R^{n_y}. For PET imaging, T ≡ R, but this restriction is not used here. The pdf of the measurement is ℓ(y | x), so that ∫_T ℓ(y | x) dy = 1
for all x. This pdf is assumed to be known. From (2.86) and (5.1), the measurement
point process is a PPP with intensity
\[
\mu(y) \;=\; \int_R \ell(y \mid x)\, \lambda(x)\, dx \tag{5.2}
\]
\[
\;=\; \sum_{r=1}^{K} \lambda_r\, f_r(y)\,, \qquad y \in T\,, \tag{5.3}
\]
where
\[
f_r(y) \;=\; \int_{R_r} \ell(y \mid x)\, dx\,. \tag{5.4}
\]
Denote the measured sample data by
\[
Y \;=\; (y_1, \ldots, y_m)\,, \qquad y_j \in T\,. \tag{5.5}
\]
Because this is TOF data, the points correspond to the estimated locations of
positron-electron annihilation events. The order of the points in Y is irrelevant, so
the incomplete data pdf is, from (2.12),
\[
p(Y\,;\Lambda) \;=\; e^{-\int_T \mu(y)\, dy}\; \prod_{j=1}^{m} \mu(y_j) \tag{5.6}
\]
\[
\;=\; e^{-\sum_{r=1}^{K} \lambda_r |R_r|}\; \prod_{j=1}^{m} \left(\, \sum_{r=1}^{K} \lambda_r\, f_r(y_j) \right), \tag{5.7}
\]
where |R_r| = ∫_{R_r} dx < ∞. The ML estimate of Λ maximizes (5.7). The Hessian matrix of the loglikelihood is
\[
\begin{aligned}
H(\Lambda) &= \nabla_\Lambda\, [\nabla_\Lambda]^T \log p(Y\,;\Lambda) \\
&= -\sum_{j=1}^{m} \frac{1}{\left(\sum_{r=1}^{K} \lambda_r f_r(y_j)\right)^{2}}
\begin{bmatrix} f_1(y_j) \\ \vdots \\ f_K(y_j) \end{bmatrix}
\bigl[\, f_1(y_j)\; \cdots\; f_K(y_j) \,\bigr].
\end{aligned} \tag{5.9}
\]
The quadratic form z T H (Λ) z is strictly negative for nonzero vectors z if the num-
ber of data points m is at least as great as the number of pixels K , and the m × K
matrix
\[
F \;=\;
\begin{bmatrix}
f_1(y_1) & f_2(y_1) & \cdots & f_K(y_1) \\
f_1(y_2) & f_2(y_2) & \cdots & f_K(y_2) \\
\vdots & \vdots & & \vdots \\
f_1(y_m) & f_2(y_m) & \cdots & f_K(y_m)
\end{bmatrix} \tag{5.10}
\]
is full rank. Because the Hessian matrix is negative definite for any Λ ≠ 0, the function p(Y ; Λ) is strictly log-concave; equivalently, its logarithm is strictly concave.
5.2.1.3 E-step
The EM method is used to derive a recursive algorithm to compute the ML estimate Λ̂. The recursion avoids solving the nonlinear system ∇_Λ p(Y ; Λ) = 0 directly. For the PET problem, EM yields a recursion for Λ̂ known as the Shepp-Vardi algorithm. Because the pdf (5.7) is unimodal, the EM method is guaranteed to converge to the ML estimate.
Let x j be the unknown location of the annihilation event that generated the
data point y j . The question is, “Which cell/pixel/voxel contains x j ?” Since this is
unknown, let the index k j ∈ {1, . . . , K } indicate the correct cell, that is, let
x j ∈ Rk j , j = 1, . . . , m. (5.11)
The missing data in the sense of EM are the indices \(\mathcal{K} = \{k_1, \ldots, k_m\}\). The joint pdf of \((\mathcal{K}, Y)\) is defined by
\[
p(\mathcal{K}, Y\,;\Lambda) \;=\; e^{-\sum_{r=1}^{K} \lambda_r |R_r|}\; \prod_{j=1}^{m} \lambda_{k_j}\, f_{k_j}(y_j)\,, \tag{5.12}
\]
\[
\begin{aligned}
p(k_1,\ldots,k_m\,;\Lambda) \;\equiv\; p(\mathcal{K} \mid Y\,;\Lambda)
&= \frac{p(\mathcal{K}, Y\,;\Lambda)}{p(Y\,;\Lambda)} \\
&= \prod_{j=1}^{m} \frac{\lambda_{k_j}\, f_{k_j}(y_j)}{\sum_{r=1}^{K} \lambda_r\, f_r(y_j)}\,.
\end{aligned} \tag{5.13}
\]
The dependence of the left hand expression on the data Y is suppressed to simplify
the notation. The logarithm of the joint density (5.12) is
\[
\log p(\mathcal{K}, Y\,;\Lambda) \;=\; -\sum_{r=1}^{K} \lambda_r |R_r| \;+\; \sum_{j=1}^{m} \log\bigl(\lambda_{k_j}\, f_{k_j}(y_j)\bigr)\,.
\]
Dropping the terms log f_{k_j}(y_j), which do not depend on Λ, gives
\[
\mathcal{L}(\Lambda) \;=\; -\sum_{r=1}^{K} \lambda_r |R_r| \;+\; \sum_{j=1}^{m} \log \lambda_{k_j}\,. \tag{5.14}
\]
In some applications, the functions log f_{k_j}(y_j) are retained because they incorporate a Bayesian a priori pdf and depend on one or more of the estimated parameters.
5.2.1.4 M-step
Let n ≥ 0 denote the EM recursion index, and let the initial value for the intensity parameter be \(\Lambda^{(0)} = \bigl(\lambda_1^{(0)}, \ldots, \lambda_K^{(0)}\bigr)\), where \(\lambda_r^{(0)} > 0\) for r = 1, . . . , K. The EM auxiliary function is
\[
Q\bigl(\Lambda\,;\Lambda^{(n)}\bigr) \;=\; E\bigl[\mathcal{L}(\Lambda)\,;\Lambda^{(n)}\bigr]
\;=\; \sum_{k_1=1}^{K} \cdots \sum_{k_m=1}^{K} \mathcal{L}(\Lambda)\; p\bigl(k_1,\ldots,k_m\,;\Lambda^{(n)}\bigr)\,. \tag{5.15}
\]
\[
Q\bigl(\Lambda\,;\Lambda^{(n)}\bigr) \;=\; -\sum_{r=1}^{K} \lambda_r |R_r|
\;+\; \sum_{k=1}^{K} \left(\, \sum_{j=1}^{m} \frac{\lambda_k^{(n)}\, f_k(y_j)}{\sum_{r=1}^{K} \lambda_r^{(n)}\, f_r(y_j)} \right) \log \lambda_k\,. \tag{5.16}
\]
Setting ∇_Λ Q(Λ ; Λ^{(n)}) = 0 gives the EM update:
\[
\lambda_k^{(n+1)} \;=\; \frac{1}{|R_k|} \sum_{j=1}^{m} \frac{\lambda_k^{(n)}\, f_k(y_j)}{\sum_{r=1}^{K} \lambda_r^{(n)}\, f_r(y_j)}\,, \qquad k = 1, \ldots, K\,. \tag{5.17}
\]
The Shepp-Vardi algorithm evaluates (5.17) until it satisfies the stipulated conver-
gence criterion. An intuitive interpretation of the recursion is given below. It is
summarized in Table 5.1.
Table 5.1 Shepp-Vardi algorithm for TOF data

• FOR j = 1 : m and r = 1 : K, compute and store F(j, r) = f_r(y_j)
• END FOR
• Initialize the PET image: Λ(0) = [λ_1(0), . . . , λ_K(0)]^T
• FOR EM iteration index n = 0, 1, 2, . . . until convergence:
  – Update the expected detector count vector, D(n) ≡ [D_1(n), . . . , D_m(n)]:
      D_j(n) = Σ_{r=1}^{K} λ_r(n) F(j, r),  j = 1, . . . , m
  – FOR r = 1 : K, update the pixel intensity:
      λ_r(n + 1) = (λ_r(n) / |R_r|) Σ_{j=1}^{m} F(j, r) / D_j(n)
  – END FOR
  – Update vectorized PET image: Λ(n + 1) = [λ_1(n + 1), . . . , λ_K(n + 1)]^T
• END FOR EM iteration (Test for convergence)
• If converged: Estimated PET image is Λ̂_ML = Λ(n_last) = [λ_1(n_last), . . . , λ_K(n_last)]^T
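A minimal sketch of the recursion (5.17) / Table 5.1 is given below, assuming the matrix F with entries F[j, r] = f_r(y_j) and the pixel measures |R_r| have already been computed; the function and argument names are illustrative.

```python
import numpy as np

def shepp_vardi_tof(F, pixel_area, lam0, n_iter=100, tol=1e-6):
    """EM (Shepp-Vardi) recursion (5.17) for TOF PET sample data.
    F:          (m, K) array, F[j, r] = f_r(y_j)
    pixel_area: (K,) array of |R_r|
    lam0:       (K,) positive initial intensities"""
    lam = np.asarray(lam0, dtype=float).copy()
    for _ in range(n_iter):
        D = F @ lam                                   # D_j(n) = sum_r lam_r F(j, r)
        lam_new = (lam / pixel_area) * (F / D[:, None]).sum(axis=0)
        if np.max(np.abs(lam_new - lam)) < tol:
            return lam_new
        lam = lam_new
    return lam
```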
The expression (5.17) is an ugly duckling in that it is prettier after taking the limit as |R_k| → 0. The limiting form for small cells (pixels or voxels) is not only insightful, but also useful in other applications. If
\[
\lambda_k^{(n+1)} \;\to\; \lambda^{(n+1)}(x)\,, \qquad
\lambda_k^{(n)} \;\to\; \lambda^{(n)}(x)\,,
\]
\[
\frac{f_k(y_j)}{|R_k|} \;\to\; \ell(y_j \mid x)\,, \qquad
\sum_{r=1}^{K} \lambda_r^{(n)} f_r(y_j) \;=\; \sum_{r=1}^{K} \lambda_r^{(n)} \frac{f_r(y_j)}{|R_r|}\, |R_r| \;\to\; \int_R \ell(y_j \mid s)\, \lambda^{(n)}(s)\, ds\,,
\]
then (5.17) becomes
\[
\lambda^{(n+1)}(x) \;=\; \lambda^{(n)}(x) \sum_{j=1}^{m} \frac{\ell(y_j \mid x)}{\int_R \ell(y_j \mid s)\, \lambda^{(n)}(s)\, ds}\,, \qquad x \in R\,. \tag{5.18}
\]
The form of the algorithm used in the PET application is (5.17), not (5.18).
Because there is at most one annihilation event per measurement, the sum over j
is the estimated number of annihilations originating in (x, x + dx) that generate a
measurement. Since annihilations form a PPP, this number is equated to the updated
intensity over (x, x + dx); that is, it is equated to λ(n+1) (x) |dx|. Dividing by |dx|
gives (5.18). A similar interpretation also holds for (5.17).
Most current PET systems record only photon detector pairs and do not collect TOF
data. In such systems, the absence of differential photon propagation times means
that the measurements of the location of the annihilation event along the chord con-
necting the detector pairs are not available. The only change to the PET problem
is that histogram data are used, not PPP sample data, but this change modifies the
likelihood function of the data. As a result the EM formulation is altered, as is the
reconstruction algorithm. These changes are detailed in this section.
The precise location of the event x within R_r is irrelevant because the intensity of annihilation events in pixel R_r is λ_r. The vector Λ of these parameters is estimated from the measured counts m_{1:L}.
\[
m_j \;=\; \sum_{r=1}^{K} m(r, j)\,, \qquad 1 \le j \le L\,. \tag{5.22}
\]
\[
E[m_j] \;=\; E\!\left[\, \sum_{r=1}^{K} m(r, j) \right] \;=\; \sum_{r=1}^{K} E[m(r, j)] \;=\; \sum_{r=1}^{K} \ell(j \mid r)\, \lambda_r\,. \tag{5.23}
\]
The observed data are independent Poisson distributed variables, so their joint pdf
is the product
\[
p(m_{1:L}\,;\Lambda) \;=\; \prod_{j=1}^{L} e^{-\sum_{r=1}^{K} \ell(j \mid r)\, \lambda_r}\;
\frac{\left(\sum_{r=1}^{K} \ell(j \mid r)\, \lambda_r\right)^{m_j}}{m_j!}\,. \tag{5.24}
\]
\[
p(m_{1:K,\,1:L}\,;\Lambda) \;=\; \prod_{j=1}^{L} \prod_{r=1}^{K} e^{-\ell(j \mid r)\, \lambda_r}\;
\frac{\bigl(\ell(j \mid r)\, \lambda_r\bigr)^{m(r,j)}}{m(r,j)!}\,. \tag{5.25}
\]
Since the numerator of the ratio (5.26) is a multinomial expansion, it follows easily
that
where the sum is over all indices m(r, j) that satisfy the measurement constraints
(5.22).
The logarithm of the complete data pdf is, from (5.25),
\[
\begin{aligned}
\log p(m_{1:K,\,1:L}\,;\Lambda)
&= \sum_{j=1}^{L} \sum_{r=1}^{K} \bigl\{ -\ell(j \mid r)\, \lambda_r + m(r,j) \log\bigl(\ell(j \mid r)\, \lambda_r\bigr) - \log m(r,j)! \bigr\} \\
&= c \;+\; \sum_{r=1}^{K} \left\{ -\lambda_r + \left(\, \sum_{j=1}^{L} m(r,j) \right) \log \lambda_r \right\},
\end{aligned} \tag{5.29}
\]
where in the last equation the constant c contains only terms not involving Λ.
5.3.2.2 E-step
Let n denote the EM iteration index, and let \(\Lambda^{(0)} = \bigl(\lambda_1^{(0)}, \ldots, \lambda_K^{(0)}\bigr) > 0\) be an initial set of intensities. For n ≥ 0, the auxiliary function of the EM method is,
by definition,
\[
\begin{aligned}
Q\bigl(\Lambda\,;\Lambda^{(n)}\bigr) &= E_{\Lambda^{(n)}}\bigl[\log p(m_{1:K,\,1:L}\,;\Lambda)\bigr] \\
&= \sum_{\{m(r,j)\}_{r,j=1}^{K,L}} \log p\bigl(m_{1:K,\,1:L}\,;\Lambda\bigr)\; p\bigl(m_{1:K,\,1:L} \,\big|\, m_{1:L}\,;\Lambda^{(n)}\bigr)\,,
\end{aligned} \tag{5.30}
\]
where the sum in (5.30) is over all indices m(r, j) that satisfy the L measurement
constraints (5.22). Substituting (5.29), dropping the irrelevant constant c, and using
the linearity of the expectation operator gives
\[
Q\bigl(\Lambda\,;\Lambda^{(n)}\bigr) \;=\; \sum_{r=1}^{K} \left\{ -\lambda_r + \left(\, \sum_{j=1}^{L} E_{\Lambda^{(n)}}\bigl[m(r,j)\bigr] \right) \log \lambda_r \right\}, \tag{5.31}
\]
where
\[
E_{\Lambda^{(n)}}\bigl[m(r,j)\bigr] \;=\; m_j\, \frac{\ell(j \mid r)\, \lambda_r^{(n)}}{\sum_{r'=1}^{K} \ell(j \mid r')\, \lambda_{r'}^{(n)}}\,. \tag{5.32}
\]
Substituting p(m 1:K ,1:L | m 1:L ; Λ) and summing over all indices except
m(1, j), . . . , and m(K , j) gives
\[
E_{\Lambda}\bigl[m(r,j)\bigr]
\;=\; \frac{\displaystyle \sum_{m(1,j),\ldots,m(K,j)=0}^{m_j}
\binom{m_j}{m(1,j)\,\cdots\, m(K,j)}\; m(r,j) \prod_{r'=1}^{K} \bigl(\ell(j \mid r')\, \lambda_{r'}\bigr)^{m(r',j)}}
{\displaystyle \left(\, \sum_{r'=1}^{K} \ell(j \mid r')\, \lambda_{r'} \right)^{m_j}}\,. \tag{5.34}
\]
Making the sum on m(r, j) the outermost sum gives the numerator of the ratio on
the right hand side of (5.34) as
\[
\sum_{m(r,j)=0}^{m_j} \frac{m_j!}{m(r,j)!\, \bigl(m_j - m(r,j)\bigr)!}\; m(r,j)\, \bigl(\ell(j \mid r)\, \lambda_r\bigr)^{m(r,j)}
\left[\, \sum_{r' \neq r} \ell(j \mid r')\, \lambda_{r'} \right]^{m_j - m(r,j)}.
\]
Canceling the factor m(r, j) in m(r, j)!, changing the index of summation to
m̃(r, j) = m(r, j) − 1, and shuffling terms yields
\[
m_j\, \ell(j \mid r)\, \lambda_r \sum_{\tilde m(r,j)=0}^{m_j-1}
\frac{(m_j-1)!}{\tilde m(r,j)!\, \bigl(m_j - 1 - \tilde m(r,j)\bigr)!}\,
\bigl(\ell(j \mid r)\, \lambda_r\bigr)^{\tilde m(r,j)}
\left[\, \sum_{r' \neq r} \ell(j \mid r')\, \lambda_{r'} \right]^{m_j - 1 - \tilde m(r,j)}.
\]
The sum in this last expression simplifies immediately using the binomial theorem:
\[
m_j\, \ell(j \mid r)\, \lambda_r \left[\, \sum_{r'=1}^{K} \ell(j \mid r')\, \lambda_{r'} \right]^{m_j - 1}.
\]
Substituting this form of the numerator into the expectation (5.34) and canceling
terms gives the desired expectation for any Λ. The expression (5.32) is the special
case Λ = Λ(n) . This concludes the E-step.
Maximizing (5.31) with respect to Λ gives the EM update
\[
\lambda_r^{(n+1)} \;=\; \lambda_r^{(n)} \sum_{j=1}^{L} \frac{m_j\, \ell(j \mid r)}{\sum_{r'=1}^{K} \ell(j \mid r')\, \lambda_{r'}^{(n)}}\,, \qquad 1 \le r \le K\,. \tag{5.35}
\]
Table 5.2 Shepp-Vardi algorithm for histogram data

• FOR j = 1 : L and r = 1 : K, compute and store the likelihood matrix entries: L(j, r) = ℓ(j | r)
• END FOR
• Initialize the vectorized PET image: Λ(0) = [λ_1(0), . . . , λ_K(0)]^T
• FOR EM iteration index n = 0, 1, 2, . . . until convergence:
  – Update the expected detector count vector, D(n) ≡ [D_1(n), . . . , D_L(n)]:
      D_j(n) = Σ_{r=1}^{K} L(j, r) λ_r(n),  j = 1, . . . , L
  – FOR r = 1 : K, update the pixel intensity:
      λ_r(n + 1) = λ_r(n) Σ_{j=1}^{L} m_j L(j, r) / D_j(n)
  – END FOR
  – Update vectorized PET image: Λ(n + 1) = [λ_1(n + 1), . . . , λ_K(n + 1)]^T
• END FOR EM iteration (Test for convergence)
• If converged: Estimated PET image is Λ̂_ML = Λ(n_last) = [λ_1(n_last), . . . , λ_K(n_last)]^T
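A minimal sketch of the histogram-data recursion (5.35) / Table 5.2 is given below, assuming the L × K likelihood matrix with entries ℓ(j | r) and the detector counts m_{1:L} are given. Unlike the TOF recursion, no pixel-size normalization 1/|R_k| appears. Names are illustrative.

```python
import numpy as np

def shepp_vardi_histogram(Lmat, counts, lam0, n_iter=200, tol=1e-8):
    """EM (Shepp-Vardi) recursion (5.35) for PET histogram data.
    Lmat:   (L, K) array, Lmat[j, r] = l(j | r)
    counts: (L,) detector counts m_1:L
    lam0:   (K,) positive initial intensities"""
    lam = np.asarray(lam0, dtype=float).copy()
    for _ in range(n_iter):
        D = Lmat @ lam                                   # expected detector counts
        lam_new = lam * (Lmat.T @ (counts / D))          # eq. (5.35)
        if np.max(np.abs(lam_new - lam)) < tol:
            return lam_new
        lam = lam_new
    return lam
```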
5.3.2.5 Convergence
General EM theory guarantees convergence of the iteration (5.35) to a stationary
point of the likelihood function. The strict concavity of the loglikelihood func-
tion guarantees that it converges to the global ML estimate, Λ̂_ML. To see that the observed data pdf p(m_{1:L} ; Λ) is strictly log-concave, differentiate (5.24) to find the Hessian matrix of its logarithm:
\[
\begin{aligned}
H(\Lambda) &= \nabla_\Lambda\, [\nabla_\Lambda]^T \log p(m_{1:L}\,;\Lambda) \\
&= -\sum_{j=1}^{L} \frac{m_j}{\left(\sum_{r=1}^{K} \ell(j \mid r)\, \lambda_r\right)^{2}}
\begin{bmatrix} \ell(j \mid 1) \\ \vdots \\ \ell(j \mid K) \end{bmatrix}
\bigl[\, \ell(j \mid 1)\; \cdots\; \ell(j \mid K) \,\bigr].
\end{aligned} \tag{5.36}
\]
The quadratic form z^T H(Λ) z is strictly negative for nonzero vectors z if the L × K likelihood matrix
\[
\mathcal{L} \;=\;
\begin{bmatrix}
\ell(1 \mid 1) & \ell(1 \mid 2) & \cdots & \ell(1 \mid K) \\
\ell(2 \mid 1) & \ell(2 \mid 2) & \cdots & \ell(2 \mid K) \\
\vdots & \vdots & & \vdots \\
\ell(L \mid 1) & \ell(L \mid 2) & \cdots & \ell(L \mid K)
\end{bmatrix} \tag{5.37}
\]
is full rank. Because the Hessian matrix is negative definite for any Λ ≠ 0, the observed data loglikelihood is strictly concave.
An older and lower resolution imaging procedure is called SPECT (single photon emission computed tomography). In SPECT, a radioisotope (most commonly, technetium-99) is introduced into the body. As the isotope decays, gamma photons are emitted in all directions. A gamma camera (also often called an Anger camera after its developer, Hal Anger, in 1957) takes a "snapshot" of the photons emitted in the direction of the camera. Unlike PET, which can be treated as a stack of two dimensional slice problems, SPECT requires solving an inherently three dimensional reconstruction problem.
A simplistic depiction of a gamma camera is given in Fig. 5.2. The camera com-
prises several parts:
• The emitted gamma photons are collimated using one of several methods (a
lead plate with parallel drilled holes is common). Fewer than 1% of the incident
gamma photons emerge from the collimator.
• Photons that emerge enter a thallium-activated sodium iodide (NaI(Tl)) crystal. The NaI(Tl) crystal is a flat circular plate approximately 1 cm thick and 30 cm
in diameter. (The thinner the crystal, the better the resolution but the less the
efficiency.) If a gamma photon is fully absorbed by an atom in the crystal (pho-
toelectric effect), the atom ejects an electron; if it is partially absorbed (Compton
effect), the atom ejects an electron and another gamma photon.
• Ejected electrons encounter other atoms in the crystalline lattice and produce visible light in a physical process called scintillation. The number of scintillated light photons is proportional to the energy incident on the crystal.
• The scintillated photons are few in number, so an array of hexagonally packed photo-multiplier tubes (PMTs) is affixed to the back of the crystal to boost the count (without also boosting the noise level). Typical gamma cameras use
Fig. 5.2 Sketch of the basic components of a gamma (Anger) camera used in SPECT
between 37 and 120 PMTs. The face of a PMT ranges in size from 5 to 7 cm
across.
• Position logic circuits are used to estimate the locations of the crystal atoms that
absorb gamma photons. An estimation procedure is necessary because several
PMTs detect light from the same event. The position estimate is a convex combi-
nation of the PMT data.
The output is a two dimensional “snapshot” of the estimated locations of the atoms
that absorb gamma photons in the NaI(Tl) crystal. The estimated locations depend
on the location of the camera.
In a typical SPECT imaging procedure, the gamma camera is moved to a fixed
number of different viewing positions around the object (and, naturally, is never in
physical contact with it). A snapshot is made at each camera position. The multiple
view snapshots are used to reconstruct the image. The reconstructed SPECT image
is the estimated intensity of radioisotope decay within the three dimensional volume
of the imaged object.
The clinical use of SPECT is well established. A common use of SPECT
is cardiac blood perfusion studies, in which an ECG/EKG (electrocardio-
gram/elektrokardiogram) acquires data from a beating heart and the heart rhythm
is used to time gate the SPECT data collection procedure.
SPECT is much more widely used than PET. There are several reasons for this
difference. One is that SPECT and PET are used for diagnostically different pur-
poses. Many diagnostic needs do not require the high resolution provided by PET.
126 5 Tomographic Imaging
Another is that the radioisotopes needed for SPECT are readily accessible compared
to those for PET. Yet another reason is simply the cost—SPECT procedures are rel-
atively inexpensive. In 2009, SPECT procedures cost on the order of US$500 and
were often performed in physician offices and walk-in medical facilities. In contrast,
PET cost US$30,000 or more and required specialized hospital-based equipment
and staff.
where n Θ is the fixed number of camera viewing positions. See Fig. 5.3. Let F(θ j )
denote the two dimensional plane of the camera face at view angle θ j . An arbitrary
point in F(θ j ) is denoted by (y, θ j ), where y ≡ (y1 , y2 ) ∈ R2 is a two-dimensional
coordinate system. The data comprising the snapshot at view angle θ j are the events
in the list
\[
S_j \;=\; \bigl\{\, (y_{j1}, \theta_j),\; (y_{j2}, \theta_j),\; \ldots,\; (y_{j,n_j}, \theta_j) \,\bigr\} \subset \mathbb{R}^3\,, \tag{5.38}
\]
where n j ≥ 1 is the number of position estimates for the snapshot at camera angle
θ_j. Let S ≡ {S_0, S_1, . . . , S_{n_Θ-1}}. The intensity function λ(x) is estimated from
the data S.
The intensity functions needed to formulate the SPECT likelihood function are
defined on the Cartesian product R3 × F(θ j ) of decay points x in the imaged object
Fig. 5.3 Geometry and coordinates for SPECT imaging of a three dimensional object for a gamma
camera with view angle θ. The center P of the camera face lies in the x1 –x2 plane for all camera
view angles θ
and observed gamma photon absorption points y on the crystal face of the camera
at view angle θ j . Let μ(x, y, θ j ) denote the intensity function of the j-th PPP.
There are n Θ of these PPPs. In terms of these intensity functions, the intensity to be
estimated is
\[
\lambda(x) \;=\; \sum_{j=0}^{n_\Theta - 1} \int_{F(\theta_j)} \mu(x, y, \theta_j)\, dy\,. \tag{5.39}
\]
The j-th integral in (5.39) is a double integral over the coordinates y ∈ F(θ j ) that
correspond to points on the camera face.
The intensity function μ(x, y, θ j ) is the superposition of the intensity functions
of two independent PPPs. One is determined by the detected photons. Its intensity
function is denoted by μ0 (x, y, θ j ). The other is determined by the photons that
arrive at the camera face but are not detected. The intensity function of the unde-
tected photons is denoted by μ_1(x, y, θ_j). Thus,
\[
\mu(x, y, \theta_j) \;=\; \mu_0(x, y, \theta_j) \;+\; \mu_1(x, y, \theta_j)\,. \tag{5.40}
\]
Both μ0 ( · ) and μ1 ( · ) are expressed in terms of three input functions. These inputs
are assumed known. They are discussed next.
and the engineering details of the system design. Its detailed mathematical form
depends on where the decay x occurred and the camera position θ . It is denoted by
pY |X Θ (y | x, θ ). A significant difference between this pdf and the analogous pdf for
PET is that it is depth dependent—it broadens with increasing distance between the
decay point x and the camera face. This pdf is assumed known.
Another function is the survival probability function
\[
\beta(x, y, \theta) \;=\; \Pr\!\left[\,
\begin{array}{l}
\text{a photon emitted at decay point } x \text{ moving}\\
\text{toward location } y \text{ on the camera face at}\\
\text{view angle } \theta \text{ arrives at the camera face}
\end{array}
\right]. \tag{5.41}
\]
The survival function depends on many factors. These factors include the efficiency
of the detector and the three dimensional attenuation density of the object being
imaged. The efficiency is determined as part of a calibration procedure. The atten-
uation is clearly a complex issue since it depends on the object being imaged. It is,
moreover, important for successful imaging. Methods for estimating it are discussed
in some detail in [82] and the references therein. The function β(x, y, θ ) is assumed
known.
The third required function is the conditional pdf p_{Θ|X}(θ_j | x). It is the fraction
of photons emanating from x that propagate toward the camera at view angle θ j .
This fraction is dependent on geometric factors such as the solid angle subtended
by the camera. It is also weighted by the relative length of the “dwell times” of the
camera at the camera angles Θ. It is assumed known.
and
\[
\mu_1(x, y, \theta_j) \;=\; p_{Y|X\Theta}(y \mid x, \theta_j)\; p_{\Theta|X}(\theta_j \mid x)\; \lambda(x)\; \beta(x, y, \theta_j)\,, \tag{5.43}
\]
respectively. These PPPs are independent because they result from an independent
thinning process determined by β(x, y, θ j ) (cf. Examples 2.6 and 2.7). Their sum
satisfies (5.40). The identity
\[
\sum_{j=0}^{n_\Theta - 1} \int_{\mathbb{R}^3 \times F(\theta_j)} \mu(x, y, \theta_j)\, dx\, dy \;=\; \int_{\mathbb{R}^3} \lambda(x)\, dx \tag{5.44}
\]
is used shortly.
The j-th snapshot S j is a realization of a PPP on the two dimensional camera
face, that is, on the plane of the camera face at view angle θ j . The photons arriving
These intensities are defined only for values of y ∈ F(θ j ) ⊂ R3 . In light of the
relationship (5.39), the function
\[
\begin{aligned}
L_{S_j}(\lambda) &= \exp\!\left\{ -\int_{F(\theta_j)} \mu(y, \theta_j)\, dy \right\} \prod_{r=1}^{n_j} \mu(y_{jr}, \theta_j) \\
&= \exp\!\left\{ -\int_{\mathbb{R}^3 \times F(\theta_j)} \mu(x, y, \theta_j)\, dx\, dy \right\} \prod_{r=1}^{n_j} \int_{\mathbb{R}^3} \mu(x, y_{jr}, \theta_j)\, dx
\end{aligned} \tag{5.46}
\]
is the likelihood function for λ(x) given the data S j . The PPPs for different camera
view angles are independent, so the product
\[
\begin{aligned}
L_{S}(\lambda) &= \prod_{j=0}^{n_\Theta - 1} L_{S_j}(\lambda) \\
&= \exp\!\left\{ -\sum_{j=0}^{n_\Theta - 1} \int_{\mathbb{R}^3 \times F(\theta_j)} \mu(x, y, \theta_j)\, dx\, dy \right\} \prod_{j=0}^{n_\Theta - 1} \prod_{r=1}^{n_j} \mu(y_{jr}, \theta_j) \\
&= \exp\!\left\{ -\int_{\mathbb{R}^3} \lambda(x)\, dx \right\} \prod_{j=0}^{n_\Theta - 1} \prod_{r=1}^{n_j} \int_{\mathbb{R}^3} \mu(x, y_{jr}, \theta_j)\, dx
\end{aligned} \tag{5.47}
\]
• The number N j (0) ≥ 0 of gamma photons that reach the camera but are not
detected;
• The locations {y jr : r = n j + 1, . . . , n j + N j (0)} at which the undetected
gamma photons exit the crystal face of the camera;
• The locations {x jr : r = 1, . . . , n j + N j (0)} of the decays that generated the
detected and undetected gamma photons.
The complete data (in the EM sense) for the j-th snapshot are
\[
S_j^{com} \;=\; \bigl\{\, (x_{j1}, y_{j1}, \theta_j),\; (x_{j2}, y_{j2}, \theta_j),\; \ldots,\; \bigl(x_{j, n_j+N_j(0)},\, y_{j, n_j+N_j(0)},\, \theta_j\bigr) \,\bigr\}\,, \tag{5.48}
\]
The logarithm is
\[
\log L_{S}^{com}(\lambda) \;=\; -\int_{\mathbb{R}^3} \lambda(x)\, dx
\;+\; \sum_{j=0}^{n_\Theta - 1} \Biggl\{\, \sum_{r=1}^{n_j} \log \mu_1(x_{jr}, y_{jr}, \theta_j)
\;+\; \sum_{r'=n_j+1}^{n_j+N_j(0)} \log \mu_0(x_{jr'}, y_{jr'}, \theta_j) \Biggr\}\,. \tag{5.50}
\]
\[
\log L_{S}^{com}(\lambda) \;=\; c \;-\; \int_{\mathbb{R}^3} \lambda(x)\, dx
\;+\; \sum_{j=0}^{n_\Theta - 1} \Biggl\{\, \sum_{r=1}^{n_j} \log \lambda(x_{jr})
\;+\; \sum_{r'=n_j+1}^{n_j+N_j(0)} \log \lambda(x_{jr'}) \Biggr\}\,, \tag{5.51}
\]
5.4.2.5 E-step
Let m ≥ 0 denote the EM iteration index, and let λ^{(0)}(x) > 0 be specified. The auxiliary function of the EM method is defined as the conditional expectation of (5.50):
\[
Q\bigl(\lambda \,\big|\, \lambda^{(m)}\bigr) \;=\; E_{\lambda^{(m)}}\bigl[\log L_{S}^{com}(\lambda)\bigr]
\;=\; c \;-\; \int_{\mathbb{R}^3} \lambda(x)\, dx \;+\; A \;+\; B\,, \tag{5.52}
\]
where
\[
A \;=\; E_{\lambda^{(m)}}\!\left[\, \sum_{j=0}^{n_\Theta - 1} \sum_{r'=n_j+1}^{n_j+N_j(0)} \log \lambda(x_{jr'}) \right] \tag{5.53}
\]
and
\[
B \;=\; E_{\lambda^{(m)}}\!\left[\, \sum_{j=0}^{n_\Theta - 1} \sum_{r=1}^{n_j} \log \lambda(x_{jr}) \right]. \tag{5.54}
\]
The expectations A and B are evaluated differently because the number of terms in the r' sum is both random and missing, while the number of terms in the r sum is specified.
To evaluate A, note that the j-th expectation in (5.53) is the expectation of a random sum, all of whose summands are log λ(x), with respect to the undetected target PPP on the camera face with view angle θ_j. The number of terms in the sum is Poisson distributed; therefore, replacing λ(x) in (5.42) with λ^{(m)}(x) and using Campbell's Theorem gives
\[
\begin{aligned}
A &= \sum_{j=0}^{n_\Theta - 1} \int_{\mathbb{R}^3 \times F(\theta_j)} \log \lambda(x)\;
p_{Y|X\Theta}(y \mid x, \theta_j)\; p_{\Theta|X}(\theta_j \mid x)\; \lambda^{(m)}(x)\, \bigl(1 - \beta(x, y, \theta_j)\bigr)\, dx\, dy \\
&= \int_{\mathbb{R}^3} \lambda^{(m)}(x)\, \log \lambda(x)
\left\{\, \sum_{j=0}^{n_\Theta - 1} \int_{F(\theta_j)} p_{Y|X\Theta}(y \mid x, \theta_j)\; p_{\Theta|X}(\theta_j \mid x)\, \bigl(1 - \beta(x, y, \theta_j)\bigr)\, dy \right\} dx \\
&= \int_{\mathbb{R}^3} \bigl(1 - \bar\beta(x)\bigr)\, \lambda^{(m)}(x)\, \log \lambda(x)\, dx\,,
\end{aligned} \tag{5.55}
\]
where
\[
\bar\beta(x) \;=\; \sum_{j=0}^{n_\Theta - 1} \int_{F(\theta_j)} p_{Y|X\Theta}(y \mid x, \theta_j)\; p_{\Theta|X}(\theta_j \mid x)\; \beta(x, y, \theta_j)\, dy \tag{5.56}
\]
is the mean survival probability of detected gamma photons originating from decays
at x. Equivalently,
\[
\bar\beta(x) \;=\; \sum_{j=0}^{n_\Theta - 1} p_{\Theta|X}(\theta_j \mid x)\; \beta_j(x)\,, \tag{5.57}
\]
where
\[
\beta_j(x) \;=\; \int_{F(\theta_j)} p_{Y|X\Theta}(y \mid x, \theta_j)\; \beta(x, y, \theta_j)\, dy \tag{5.58}
\]
is the probability that gamma photons generated from decays at x are detected in
the camera at view angle θ j .
Evaluating B requires using methods akin to those used in Chapter 3. From
(5.47) and (5.49), the conditional pdf of the missing points x jr for r ≤ n j and
all j is
\[
\prod_{j=0}^{n_\Theta - 1} \prod_{r=1}^{n_j} w_j(x_{jr} \mid y_{jr}, \theta_j)\,, \tag{5.59}
\]
where
\[
w_j(x \mid y, \theta_j) \;=\;
\frac{p_{Y|X\Theta}(y \mid x, \theta_j)\; p_{\Theta|X}(\theta_j \mid x)\; \beta(x, y, \theta_j)\; \lambda^{(m)}(x)}
{\int_{\mathbb{R}^3} p_{Y|X\Theta}(y \mid x', \theta_j)\; p_{\Theta|X}(\theta_j \mid x')\; \beta(x', y, \theta_j)\; \lambda^{(m)}(x')\, dx'}\,,
\qquad y \in F(\theta_j)\,. \tag{5.60}
\]
\[
\begin{aligned}
B &= \sum_{j=0}^{n_\Theta - 1} \int_{\mathbb{R}^3} \!\cdots\! \int_{\mathbb{R}^3}
\left(\, \sum_{r=1}^{n_j} \log \lambda(x_{jr}) \right)
\left(\, \prod_{r=1}^{n_j} w_j(x_{jr} \mid y_{jr}, \theta_j) \right) dx_{j1} \cdots dx_{j, n_j} \\
&= \sum_{j=0}^{n_\Theta - 1} \sum_{r=1}^{n_j} \int_{\mathbb{R}^3} w_j(x \mid y_{jr}, \theta_j)\, \log \lambda(x)\, dx \\
&= \int_{\mathbb{R}^3} f^{(m)}(x)\, \log \lambda(x)\, dx\,,
\end{aligned} \tag{5.61}
\]
where
\[
f^{(m)}(x) \;=\; \sum_{j=0}^{n_\Theta - 1} \sum_{r=1}^{n_j} w_j(x \mid y_{jr}, \theta_j)\,. \tag{5.62}
\]
The intuitive interpretation of f (m) (x) is that it is the expected number of decays at
the point x given the data from all camera view angles.
Substituting the expressions for A and B into (5.52) gives the final form of the
auxiliary function as
\[
Q\bigl(\lambda \,\big|\, \lambda^{(m)}\bigr) \;=\; c \;+\; \int_{\mathbb{R}^3}
\Bigl\{ -\lambda(x) + \Bigl[\, \bigl(1 - \bar\beta(x)\bigr)\, \lambda^{(m)}(x) + f^{(m)}(x) \Bigr] \log \lambda(x) \Bigr\}\, dx\,. \tag{5.63}
\]
5.4.2.6 M-step
The EM update is the solution of the maximization problem defined by
\[
\lambda^{(m+1)}(x) \;=\; \arg\max_{\lambda > 0}\; Q\bigl(\lambda \,\big|\, \lambda^{(m)}\bigr)\,. \tag{5.64}
\]
For a specified variation δ(x), the maximum of (5.65) with respect to ε must occur at ε = 0, for otherwise λ^{(m+1)}(x) is not the solution of the M-step. Evaluating the derivative of (5.65) with respect to ε at zero gives
\[
\int_{\mathbb{R}^3} \delta(x) \left[ -1 + \frac{\bigl(1 - \bar\beta(x)\bigr)\, \lambda^{(m)}(x) + f^{(m)}(x)}{\lambda^{(m+1)}(x)} \right] dx \;=\; 0\,. \tag{5.66}
\]
Since (5.66) must hold for all variations δ(x), it follows that
\[
-1 + \frac{\bigl(1 - \bar\beta(x)\bigr)\, \lambda^{(m)}(x) + f^{(m)}(x)}{\lambda^{(m+1)}(x)} \;\equiv\; 0 \quad \text{for all } x\,. \tag{5.67}
\]
Solving (5.67) for λ^{(m+1)}(x) gives the EM recursion
\[
\lambda^{(m+1)}(x) \;=\; \bigl(1 - \bar\beta(x)\bigr)\, \lambda^{(m)}(x) \;+\; f^{(m)}(x)\,. \tag{5.68}
\]
To facilitate later discussion (in Section 6.5), the recursion (5.68) is rewritten as a sum over the camera view angles:
\[
\lambda^{(m+1)}(x) \;=\; \lambda^{(m)}(x) \sum_{j=0}^{n_\Theta - 1} p_{\Theta|X}(\theta_j \mid x)
\Biggl[ \bigl(1 - \beta_j(x)\bigr)
\;+\; \sum_{r=1}^{n_j}
\frac{p_{Y|X\Theta}(y_{jr} \mid x, \theta_j)\; \beta(x, y_{jr}, \theta_j)}
{\int_{\mathbb{R}^3} p_{Y|X\Theta}(y_{jr} \mid x', \theta_j)\; p_{\Theta|X}(\theta_j \mid x')\; \beta(x', y_{jr}, \theta_j)\; \lambda^{(m)}(x')\, dx'} \Biggr]. \tag{5.69}
\]
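A discretized sketch of the update (5.68) on a voxel grid is given below. The per-view kernels, survival probabilities, and dwell fractions are supplied as arrays; all names and the grid discretization are illustrative assumptions, not the book's notation.

```python
import numpy as np

def spect_em_update(lam, kernels, p_theta, beta_bar, voxel_vol):
    """One EM update (5.68) for SPECT on a voxel grid.
    lam:       (V,) current intensity lambda^(m) at the V voxel centers
    kernels:   list over views j of (n_j, V) arrays K_j[r, i] =
               p_{Y|X,Theta}(y_jr | x_i, theta_j) * beta(x_i, y_jr, theta_j)
    p_theta:   (n_views, V) dwell fractions p_{Theta|X}(theta_j | x_i)
    beta_bar:  (V,) mean detection probability beta_bar(x_i), eq. (5.56)
    voxel_vol: voxel volume used to approximate the spatial integrals"""
    f = np.zeros_like(lam)                               # f^(m)(x), eq. (5.62)
    for j, K in enumerate(kernels):
        Kw = K * p_theta[j][None, :]                     # include p_{Theta|X} in the kernel
        denom = (Kw * lam[None, :]).sum(axis=1) * voxel_vol   # one value per detected photon
        f += (Kw * lam[None, :] / denom[:, None]).sum(axis=0)
    return (1.0 - beta_bar) * lam + f                    # eq. (5.68)
```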
5.5.1 Background
In transmission tomography the photon source is external to the tissue or other object
being imaged, so the photon count can be much higher than in PET. It is desirable
to minimize patient radiation exposure, so there are limits on how much the photon
count can be increased.
The estimated parameters are no longer the PPP intensities in a pixel grid, but
rather the attenuation coefficients of the pixels in the grid. PET measures radioac-
tivity concentration and, thus, preferentially images tissues with higher metabolic
rates. Transmission tomography measures x-ray absorption, which is related to
the average electron density (or effective atomic number) of the tissue. Readable
accounts of the physics involved are found in the book by Ter-Pogossian [131]. A
more recent discussion of effective atomic numbers and electron density is given in
[141].
As in PET, the image pixels are denoted by Rr , r = 1, . . . , K . Let μr be
the attenuation coefficient in Rr . The probability that a photon is absorbed while
traversing a line segment of length l lying inside Rr is 1 − e−μr l , so μr has units
of inverse length. Hence, the probability that a photon is not absorbed while travers-
ing an interval of length l inside Rr is e− μr l . Imaging in transmission tomogra-
phy is equivalent to estimating the attenuation coefficients μ = (μ1 , . . . , μ K ).
The EM method is used to estimate the attenuation coefficients. For various
reasons this problem is significantly more difficult than the PET algorithm. The
approach given here parallels that of Lange and Carson [65], who were the first to use the EM method for transmission tomography. In any event, the method here is very
interesting and potentially of utility in applications other than transmission tomog-
raphy. The word photon is used throughout the discussion because it matches the
attenuation assumptions; however, photons are only placeholders for any quantity
of interest that attenuates linearly.
d j = α j t,
so α j has units of number per unit time. The number of photons emitted by source
j is assumed to be Poisson distributed with parameter d j .
The number of photons arriving at the detectors in the array is denoted by
m 1:L = (m 1 , . . . , m L ).
\[
\prod_{r \in I_j} e^{-l_{jr}\, \mu_r} \;=\; \exp\!\Bigl( -\sum_{r \in I_j} l_{jr}\, \mu_r \Bigr).
\]
The expected number of photons arriving at detector j is therefore
\[
d_j \exp\!\Bigl( -\sum_{r \in I_j} l_{jr}\, \mu_r \Bigr). \tag{5.70}
\]
where the constants m_j log d_j − log m_j! are omitted because they do not depend on μ.
5.5.2.2 EM Convergence
The loglikelihood function (5.71) is strictly concave. To see this, differentiate to find
the (u, v)-th entry of the K × K Hessian matrix:
\[
\frac{\partial^2}{\partial \mu_u\, \partial \mu_v} \log p(m_{1:L}\,;\mu) \;=\; -\sum_{j=1}^{L} d_j\, l_{ju}\, l_{jv}\, \exp\!\Bigl(-\sum_{r \in I_j} l_{jr}\, \mu_r\Bigr).
\]
The Hessian is negative definite, and the loglikelihood strictly concave, if the L × K traversal length matrix [l_{jr}] is full rank. Many entries of the traversal length matrix are zero, since the indices of
the nonzero entries of row j are identical to the indices in I j . The matrix is full rank
if the configuration of pixels and source-detector pairs is properly designed. The full
rank condition is assumed to be incorporated into the system. Consequently, the EM
method is guaranteed to converge to the global ML estimate.
\[
\frac{d_j^{\,m_j(1)}}{m_j(1)!}\; e^{-d_j}\,.
\]
The likelihood function of the missing data of the j-th source-detector pair is therefore
The source-detector pairs are independent, so the joint likelihood function is the
product:
\[
p \;=\; \prod_{j=1}^{L} p_j\,. \tag{5.77}
\]
Retaining only terms that depend on the attenuation coefficients μ gives the com-
plete data loglikelihood function in the form
For source-detector pair j, the expected number of photons entering the t-th pixel along the traversal ordering is
\[
E[m_j(t)\,;\mu] \;=\; d_j \exp\!\left( -\sum_{t'=1}^{t-1} l_{j, I_j(t')}\, \mu_{I_j(t')} \right)
\;=\; \gamma_{jt} \;\equiv\; \gamma_{jt}(\mu)\,. \tag{5.80}
\]
is the probability of a photon successfully traversing pixel RI j (t) and also all pixels
it encounters on the way to the j-th detector. The joint pdf of m j (T j ) and m j (t) is,
from (5.79),
\[
\Pr[m_j(t),\, m_j(T_j)\,;\mu] \;=\; e^{-\gamma_{jt}}\, \frac{\gamma_{jt}^{\,m_j(t)}}{m_j(t)!}\;
\binom{m_j(t)}{m_j(T_j)} \left(\frac{\gamma_{j T_j}}{\gamma_{jt}}\right)^{m_j(T_j)} \left(1 - \frac{\gamma_{j T_j}}{\gamma_{jt}}\right)^{m_j(t) - m_j(T_j)}. \tag{5.82}
\]
\[
\begin{aligned}
\Pr[m_j(t) \mid m_j(T_j) = m_j\,;\mu]
&= \frac{\Pr[m_j(t),\, m_j(T_j) = m_j\,;\mu]}{\Pr[m_j(T_j) = m_j\,;\mu]} \\
&= \frac{\displaystyle e^{-\gamma_{jt}}\, \frac{\gamma_{jt}^{\,m_j(t)}}{m_j(t)!}\,
\frac{m_j(t)!}{m_j!\,\bigl(m_j(t)-m_j\bigr)!}
\left(\frac{\gamma_{jT_j}}{\gamma_{jt}}\right)^{m_j}
\left(1 - \frac{\gamma_{jT_j}}{\gamma_{jt}}\right)^{m_j(t)-m_j}}
{\displaystyle e^{-\gamma_{jT_j}}\, \frac{\gamma_{jT_j}^{\,m_j}}{m_j!}} \\
&= e^{-(\gamma_{jt} - \gamma_{jT_j})}\; \frac{\bigl(\gamma_{jt} - \gamma_{jT_j}\bigr)^{m_j(t)-m_j}}{\bigl(m_j(t)-m_j\bigr)!}\,.
\end{aligned} \tag{5.83}
\]
\[
E\bigl[m_j(t) \mid m_j\,;\mu\bigr] \;=\; \sum_{m_j(t)=m_j}^{\infty} m_j(t)\; \Pr[m_j(t) \mid m_j\,;\mu]
\;=\; m_j \;+\; \gamma_{jt} \;-\; \gamma_{jT_j}\,.
\]
To see this, substitute (5.83) and simplify. The details are straightforward and are
omitted.
gives
\[
Q\bigl(\mu\,;\mu^{(n)}\bigr) \;=\; \sum_{j=1}^{L} \sum_{t=1}^{T_j - 1}
\Bigl\{ \widetilde N_{jt}\bigl(\mu^{(n)}\bigr) \log e^{-l_{jt}\mu_t}
\;+\; \Bigl[ \widetilde M_{jt}\bigl(\mu^{(n)}\bigr) - \widetilde N_{jt}\bigl(\mu^{(n)}\bigr) \Bigr] \log\bigl(1 - e^{-l_{jt}\mu_t}\bigr) \Bigr\}\,. \tag{5.85}
\]
This form is awkward because the traversal ordering (5.74) does not naturally
aggregate the parameter μt into separate parts of the expression. Rewriting
Q(μ ; μ^{(n)}) in the original pixel ordering is more convenient for this purpose, and is straightforward.
For r = 1, . . . , K, let M_{jr}(μ^{(n)}) and N_{jr}(μ^{(n)}) denote the expected numbers of photons entering and exiting pixel r for source-detector pair j, conditioned on the measurements m_{1:L} and given the parameter vector μ^{(n)}. These expectations are evaluated by appropriately permuting the expectations M̃_{jt}(μ^{(n)}) and Ñ_{jt}(μ^{(n)}).
Now let Jr denote the indices j of the source-detector pairs whose photons
traverse pixel r . Then (5.85) becomes
\[
Q\bigl(\mu\,;\mu^{(n)}\bigr) \;=\; \sum_{r=1}^{K} \sum_{j \in J_r}
\Bigl\{ N_{jr}\bigl(\mu^{(n)}\bigr) \log e^{-l_{jr}\mu_r}
\;+\; \Bigl[ M_{jr}\bigl(\mu^{(n)}\bigr) - N_{jr}\bigl(\mu^{(n)}\bigr) \Bigr] \log\bigl(1 - e^{-l_{jr}\mu_r}\bigr) \Bigr\}\,. \tag{5.86}
\]
5.5.2.6 M-step
The EM recursive update is found by setting the gradient of (5.86) with respect
to μ equal to zero. EM works like magic here—the gradient equations decouple
completely into K equations, one for each attenuation coefficient. The equation for
the update of μr is
\[
\sum_{j \in J_r} \Bigl[ M_{jr}\bigl(\mu^{(n)}\bigr) - N_{jr}\bigl(\mu^{(n)}\bigr) \Bigr] \frac{l_{jr}}{e^{\,l_{jr}\mu_r} - 1}
\;=\; \sum_{j \in J_r} N_{jr}\bigl(\mu^{(n)}\bigr)\, l_{jr}\,. \tag{5.87}
\]
The solution of this transcendental equation is the EM update μ_r^{(n+1)}. The solution of this equation is unique. To see this, observe that the left hand side of (5.87) decreases monotonically from +∞ to zero as μ_r goes from zero to +∞. Since the right hand side is a positive constant for every EM iteration n, the equation (5.87) has a unique solution.
An explicit analytical solution of (5.87) is not available, but an explicit solution is
unnecessary for EM theory to work. Given the monotonicity of the left hand side of
the equation, any number of numerical methods will provide the solution to arbitrary
accuracy.
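As an illustration, the sketch below solves (5.87) for a single pixel by bracketing and root finding, given the conditioned expectations M_jr, N_jr and traversal lengths l_jr for the pairs j ∈ J_r. It assumes at least some absorption (M_jr > N_jr for some j) and a positive right hand side; names are illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def em_update_mu_r(M, N, l, mu_max=1e3):
    """Solve eq. (5.87) for the attenuation coefficient of one pixel.
    M, N, l: arrays over the source-detector pairs j in J_r containing
    M_jr(mu^(n)), N_jr(mu^(n)), and the traversal lengths l_jr."""
    rhs = np.sum(N * l)                                   # right hand side of (5.87)
    def g(mu):                                            # LHS(mu) - RHS, decreasing in mu
        return np.sum((M - N) * l / np.expm1(l * mu)) - rhs
    # LHS decreases from +inf to 0, so a root is bracketed in (0, mu_max)
    return brentq(g, 1e-12, mu_max)
```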
General EM theory guarantees that the iterates converge to a stationary point of
the likelihood function. Because of the strict concavity of the loglikelihood function,
this stationary point is the global ML estimate, μ̂ M L .
Solving the M-step numerically to arbitrary precision makes the Lange-Carson
algorithm an exact EM algorithm. Lange and Carson in their original 1984 paper
5.5 Transmission Tomography 141
[65] introduce an approximate solution to (5.87), but there is no longer any practical
reason to use it.
Improvements to the Lange-Carson algorithm are discussed in [91] and reviewed
in [34]. These include numerical algorithms to solve (5.87), methods to increase the
convergence rate, and regularization to improve image quality.
A summary of the Lange-Carson algorithm is given in Table 5.3.
Table 5.3 Lange-Carson algorithm for transmission tomography

• FOR j = 1 : L, determine the traversal ordering I_j and the lengths l_{jr}
• END FOR
• FOR j = 1 : L and r = 1 : K: M(j, r) = N(j, r) = 0
• END FOR
• Initialize the attenuation image: μ(0) = (μ_1(0), . . . , μ_K(0))
• FOR EM iteration index n = 0, 1, 2, . . . until convergence:
  – FOR j = 1 : L
    • Compute γ_{j, I_j(t)} = γ_{jt}(μ(n)) from (5.80), t = 1, . . . , T_j
    • FOR t = 1 : T_j − 1
        M(j, I_j(t)) = m_j − γ_{j, I_j(T_j)} + γ_{j, I_j(t)}
        N(j, I_j(t)) = m_j − γ_{j, I_j(T_j)} + γ_{j, I_j(t + 1)}
    • END FOR
  – END FOR
  – FOR r = 1 : K, solve for the update μ_r(n + 1) of the r-th attenuation coefficient:
      Σ_{j=1}^{L} (M(j, r) − N(j, r)) l_{jr} / (e^{l_{jr} μ_r} − 1) = Σ_{j=1}^{L} N(j, r) l_{jr}
  – END FOR
  – Update vectorized attenuation image: μ(n + 1) = (μ_1(n + 1), . . . , μ_K(n + 1))
• END FOR EM iteration (Test for convergence)
• If converged: Estimated attenuation coefficient image is μ̂_ML = (μ_1(n_last), . . . , μ_K(n_last))
The function f_r(y) is given by (5.4). The FIM is determined by the likelihood function ℓ(·) and by the sizes and locations of the pixels. Similarly, for histogram data, the piecewise constant intensity is such that
\[
\int_{R_j} \lambda(s)\, ds \;=\; \sum_{r=1}^{K} \ell(j \mid r)\, \lambda_r\,. \tag{5.89}
\]
\[
J_{hist}(\Lambda) \;=\; \sum_{j=1}^{L} \frac{1}{\sum_{r=1}^{K} \ell(j \mid r)\, \lambda_r}
\begin{bmatrix} \ell(j \mid 1) \\ \vdots \\ \ell(j \mid K) \end{bmatrix}
\bigl[\, \ell(j \mid 1)\; \cdots\; \ell(j \mid K) \,\bigr]\,. \tag{5.90}
\]
The diagonal entries of the CRB can be displayed as an image; this diagonal image reveals which pixels may have unreliable estimates. The off-
diagonal entries of the CRB are of interest, but are not as easily displayed intu-
itively. Such entries are potentially useful because they may reveal undesirable long
distance spatial correlations between pixel estimates.
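For a modest number of pixels, the histogram-data FIM (5.90) and its inverse can be formed directly, as in the sketch below; the direct inversion is exactly the step that becomes infeasible for large K, as discussed next. Names are illustrative.

```python
import numpy as np

def crb_diagonal_image(Lmat, lam, shape):
    """Histogram-data FIM (5.90) and the diagonal of its inverse (CRB),
    reshaped to the pixel grid.  Only feasible for modest K.
    Lmat: (L, K) likelihood matrix l(j | r); lam: (K,) true intensities."""
    expected = Lmat @ lam                                  # E[m_j], j = 1..L
    J = (Lmat / expected[:, None]).T @ Lmat                # sum_j l_j l_j^T / E[m_j]
    crb = np.linalg.inv(J)                                 # requires J nonsingular
    return np.diag(crb).reshape(shape)
```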
The drawback to the FIM for tomography is that practical imaging problems
require high resolution. If the field of view is large and the scale of the objects of
interest within the image is small, the number of pixels K rapidly becomes very
large. For example, a 100 × 150 pixelated image gives K = 15,000 pixels. Large K
causes two difficulties in applications. The first is that numerically evaluating any
one of the entries of the FIM is straightforward, but evaluating all K (K + 1)/2 of
them can be time consuming and require more care than is desirable for a devel-
opment effort. In many applications this problem can probably be circumvented in
various ways. The other is more daunting—inverting the full FIM to find the CRB is
infeasible for sufficiently large K. A more practical alternative is provided by Hero
and Fessler [50, 51], who propose a recursive algorithm to compute the CRB for a
subset of the intensities in a region of interest within the larger image.
5.7 Regularization
Tomographic imaging is known to require regularization to reduce noise artifacts
that are visually similar to the Gibbs phenomenon. These artifacts are not caused by
the EM method, but arise in any algorithm which estimates an infinite dimensional
quantity, in this case the intensity, from finite dimensional measured data.
Regularization methods for transmission tomography are reviewed in [34]. Sev-
eral interesting methods for regularizing PET are described in detail in [119].
These methods differ in important ways, but they have much in common. Perhaps
the simplest to use that preserves the EM form of the intensity estimator is Grenan-
der’s method.
The kernel and the space Z are arbitrary and can be chosen in any convenient
manner. For example, the kernel is often taken to be an appropriately dimensioned
Gaussian pdf. The sieve restricts λ(x) to the collection of all functions of the form
\[
\Lambda \;\equiv\; \left\{\, \lambda(x) \;=\; \int_Z k(x \mid z)\, \zeta(z)\, dz\,, \;\; \text{for some } \zeta(z) \,\right\}, \tag{5.92}
\]
where ζ(z) ≥ 0 and ∫_Z ζ(z) dz < ∞. In effect, the integral (5.92) is a low pass filter
applied to the intensity ζ(z) to produce a smoothed estimate of λ(x).
The basic idea is to compute the ML estimate ζ̂(z) from the data, and subsequently compute the ML intensity
\[
\hat\lambda_{ML}(x) \;=\; \int_Z k(x \mid z)\, \hat\zeta(z)\, dz\,. \tag{5.93}
\]
\[
p(Y\,;\Lambda) \;=\; e^{-\int \mu(y)\, dy}\; \prod_{j=1}^{m} \mu(y_j)\,.
\]
Keywords Intensity filter · PHD filter · Derivation via Bayes Theorem · Derivation
via expected target count · Derivation via Shepp-Vardi · Marked multisensor inten-
sity filter · Particle implementation · Gaussian sum implementation · Mean-shift
algorithm · Surrogate CRB · Regularization · Sources of target count error · Mul-
tisensor intensity filter · Variance reduction · Heterogeneous multisensor intensity
filter
to the target physics and to the sensor signal processors of most radar and sonar
systems. Unfortunately, exact MHT algorithms are intractable because the num-
ber of measurement assignment hypotheses grows exponentially with the number
of measurements. These problems are aggravated when multiple sensors are used.
Circumventing the computational difficulties of MHT requires approximation.
Approximate tracking methods based on PPP models are the topics of this chap-
ter. They show much promise in difficult problems with high target and clutter densi-
ties. The key insight is to model the distribution of targets in state space as a PPP, and
then use a filter to update the defining parameter of the PPP—its intensity. To update
the intensity is to update the PPP. The intensity function of the PPP approximation
characterizes the multiple target tracking model. This important point is discussed
further in Section 6.1.1.
The PPP intensity model uses an augmented state space, S + . This enables it
to estimate target birth and measurement clutter processes on-line as part of the
filtering algorithm.
Three approaches to the intensity filter are provided. The first is a Bayesian
derivation given in Appendix D. This approach relies on a “mean field” approxima-
tion of the posterior point process. The relationship between the mean field approach
and the “first moment intensity” approximation of the posterior point process used
by Mahler [74] is discussed. The second approach is a short but extraordinarily
insightful derivation that is ideal for readers who wish to avoid the Bayesian analy-
sis, at least on a first reading. The third approach is based on the connection to PET
and the Shepp-Vardi algorithm. The PET interpretation contributes significantly to
understanding the PPP target model. A special case of the intensity filter is the well
known PHD filter. It is obtained by assuming a known target birth-death process, a
known measurement clutter process, and restricting the intensity filter to the non-
augmented target state space S.
Implementation issues are discussed in Section 6.3. Current approaches use
either particle or Gaussian sum methods. An image processing method called the
mean shift algorithm is well suited to point target estimation, especially for particle
methods. Observed information matrices (cf. Section 4.7) are proposed as surrogates
for the error covariance matrices widely used in single target Bayesian filters. The
underlying statistical meaning of OIM estimates is as yet unresolved.
Other topics discussed include a Gaussian sum PPP filter that enables heteroge-
neous target motion and sensor measurement models to be used in an intensity filter
setting. See Section 6.2.2 for details. Sources of error in the estimated target count
are discussed in Section 6.4. Target count estimates are sensitive to the probability
of target detection. This function, PkD (x), depends on target state x at measure-
ment time tk . It varies over time because of slowly changing sensor characteristics
and environmental conditions. Monitoring these and other factors that affect target
detection probability is not typically considered part of the tracking problem.
Another topic is the multiple sensor intensity filter described in Section 6.5. This
topic is the subject of some debate. The filter presented here relies on the validity of
the target PPP model for every sensor. It is also closely related to the imaging prob-
lem SPECT, just as the intensity and PHD filters are related to PET. The variance
of the target count is reduced by averaging over the sensors, so that the variance of
estimated target count decreases with the number of sensors.
Several areas of on-going research are not discussed here. Much recent work in
the area is focused on the cardinalized PHD (CPHD) filter recently proposed by
Mahler [41]. This interesting filter does not assume a Poisson distributed number
of targets, so that the posterior finite point process is not an independent scattering
process (see Section 2.9.3). This contributes to what has been appropriately called
[37] “spooky action at a distance.” An even more recent topic of interest is that of
smoothing PHD filters [47, 86, 91]. The intention is to reduce variance by introduc-
ing time lag into the intensity function estimates.
The points of a realization of a PPP on the target state space are a poor representation
of the physical reality of a multiple target state. This is especially easy to see when
exactly one target is present, for then ideally
\[
\int_S \lambda(x)\, dx \;=\; 1\,. \tag{6.1}
\]
From (2.4), the probability that a realization of the PPP has exactly one point target is
p N (n = 1) = e−1 ≈ 37% .
Hence, 63% of all realizations have either no target or two or more targets. Evi-
dently, realizations of the PPP seriously mismodel this simple tracking problem.
The problem worsens with increasing the target count: if exactly n targets are
present, then the probability that a realization has exactly n points is e−n n n / n! ≈
(2 π n)−1/2 → 0 as n → ∞. Evidently, PPP realizations are poor models of real
targets.
One interpretation of the PPP approximation is that the critical element of the
multitarget model is the intensity function, not the PPP realizations. The shift of
perspective means that the integral (6.1) is the more physically meaningful quantity.
Said another way, the concept of expectation, or ensemble average over realiza-
tions, corresponds more closely to the physical target reality than do the realizations
themselves.
A huge benefit comes from accepting the PPP approximation to the multiple tar-
get state—exponential numbers of assignments are completely eliminated. The PPP
approximation finesses the data assignment problem by replacing it with a stochas-
tic imaging problem, and the imaging problem is easier to solve. It is fortuitous
that the imaging problem is mathematically the same problem that arises in PET; see Section 5.2. The "at most one measurement per target" rule for tracking corresponds
6.1.2.1 Formulation
Standard filtering notation is adopted, but modified to accommodate PPPs. The general Bayesian filtering problem is reviewed in Appendix C. Let $S = \mathbb{R}^{n_x}$ denote the $n_x$-dimensional single target state space. The augmented space is $S^+ \equiv S \cup \phi$, where $\phi$ represents the “target absent” hypothesis.
The single target transition function from time $t_{k-1}$ to time $t_k$, denoted by $\Psi_{k-1}(y\,|\,x) \equiv p_{\Xi_k|\Xi_{k-1}}(y\,|\,x)$, is assumed known for all $x, y \in S^+$. The augmented state space enables both target initiation and termination to be incorporated directly into $\Psi_{k-1}$ as specialized kinds of state transitions. Novel aspects of the transition function are:
$$\upsilon_k = \left(m, \{z_1, \ldots, z_m\}\right) \in \mathcal{E}(T)$$
Fig. 6.1 Block diagram of the Bayes update of the intensity filter on the augmented target state
space S + . Because the null state φ is part of the state space, target birth and measurement clutter
estimates are intrinsic to the predicted target and predicted measurement steps. The same block
diagram holds for the PHD filter on the nonaugmented space S
Let $f_{k|k-1}(x)$ denote the intensity of the output PPP. Adapting (2.83) to $S^+$ gives
$$f_{k|k-1}(x) = \int_{S^+} \Psi_{k-1}(x\,|\,y)\, f_{k-1|k-1}(y)\, dy . \qquad (6.2)$$
The predicted detected and undetected target intensities are
$$f^D_{k|k-1}(x) = P_k^D(x)\, f_{k|k-1}(x)$$
and
$$f^U_{k|k-1}(x) = \left(1 - P_k^D(x)\right) f_{k|k-1}(x) ,$$
as is seen from (2.86). The measurement PPP is a critical component of the intensity
filter because, as is seen in (6.9), it weights the individual terms in the sum that
comprises the filter.
Another way to see the importance of λk|k−1 (z) is to recall the classical single-
target Bayesian tracking problem. The standard Bayesian formulation gives
$$p(x\,|\,z) = \frac{p(z\,|\,x)\, p(x)}{p(z)} , \qquad (6.4)$$
where the denominator is a scale factor that makes the left hand side a true pdf. It is very easy to ignore $p(z)$ in practice because the numerator is computed by multiplication and the product is then scaled so that it is a pdf. When multiple conditionally
independent measurements are available, the conditional likelihood is a product and
it is the same story again for the scale factor. However, if the pdfs are summed, not
multiplied, the scale factor must be included for the individual terms to be compara-
ble. Such is the case with the intensity filter: the PPP model justifies adding Bayesian
pdfs instead of multiplying them, and the scale factors are crucial to making the sum
meaningful.
The scale factor clearly deserves a respectable name, and it has one. It is called
the partition function in statistical physics and the machine learning communities.
The undetected target process is unaffected by the measured data, so its information update is simply
$$f^U_{k|k}(x) = f^U_{k|k-1}(x) = \left(1 - P_k^D(x)\right) f_{k|k-1}(x) . \qquad (6.5)$$
This brings the right hand branch of Fig. 6.1 to the superposition stage, i.e., the
block that says “Add Intensities”.
The left hand branch is more difficult because, as it turns out, the information
updated detected target point process is not a PPP. This is a serious dilemma since
it is highly desirable both theoretically and computationally for the filter recur-
sion to remain a closed loop. The posterior point process of the detected targets
is therefore approximated by a PPP. Three methods are given for obtaining this
approximation.
The first method is a Bayesian derivation of the posterior density of the point pro-
cess on the event space E (S + ) followed by a “mean field” approximation. Details
about Bayesian filters in general are given in Appendix C. The Bayesian derivation
is mathematically rigorous, but not particularly insightful. The situation is improved
considerably by showing the close connections between the mean field approxima-
tion and the “first moment” approximation of the posterior point process.
To gain intuition, one need look no further than the second method. While not rigorous, it is intuitively appealing and very convincing. The perspective is further
enriched by the third method. It shows a direct connection between the information
update of the Bayesian filter and the Shepp-Vardi algorithm for PET imaging. This
third method also poses an interesting question about iterative updates of a Bayesian
posterior density.
The special case $p_k(z\,|\,\phi)$ is the pdf of a data point $z$ conditioned on the absent target state $\phi$, that is, the pdf of $z$ given that it is clutter. The predicted detected target process at time $t_k$ is a PPP with intensity $P_k^D(x)\, f_{k|k-1}(x)$. The intensity $f^D_{k|k}(x)$ is the intensity of a PPP that approximates the information updated, or Bayes posterior, detected target process.
The measured data at time $t_k$ are $m_k$ points in a measurement space $T$. Denote these data points by $Z_k = (z_1, \ldots, z_{m_k})$.
The information update of the detected target PPP is obtained intuitively as follows. The best current estimate of the probability that the point measurement $z_j$ originated from a physical target with state $x \in S$ in the infinitesimal $dx$ is (see Section 3.2.2 and also (5.19))
where the denominator is found using (6.3). Similarly, the probability that $z_j$ originated from a target with state $\phi$ is
Because of the “at most one measurement per target” rule, the sum of the ratios over all measurements $z_j$ is the estimated number of targets at $x$, or targets in $\phi$, that generated a measurement.
The estimated number of targets at $x \in S$ is set equal to the expected number of targets conditioned on the data $\upsilon_k$, namely $f^D_{k|k}(x)\,|dx|$. Cancelling $|dx|$ gives
$$f^D_{k|k}(x) = \sum_{j=1}^{m_k} \frac{p_k(z_j\,|\,x)\, P_k^D(x)\, f_{k|k-1}(x)}{\lambda_{k|k-1}(z_j)} . \qquad (6.9)$$
Similarly, the expected number of targets in state $\phi$ is the posterior intensity evaluated at $\phi$, namely, $f^D_{k|k}(\phi)$. The predicted measurement process with intensity
• The target state space S + corresponds to the space in which the radioisotope is
absorbed.
• The measured data Z k ⊂ T correspond to the measured locations of the annihi-
lation events. As noted in the derivation of Shepp-Vardi, the measurement space
T need not be the same as the state space.
• The posterior target intensity $f^D_{k|k}(x)$ corresponds to the annihilation event intensity.
$$f^D_{k|k}(x)^{(n+1)} = f^D_{k|k}(x)^{(n)} \sum_{j=1}^{m_k} \frac{p_k(z_j\,|\,x)}{\int_{S^+} p_k(z_j\,|\,s)\, f^D_{k|k}(s)^{(n)}\, ds} , \qquad (6.11)$$
where the predicted intensity $f^D_{k|k}(x)^{(0)} \equiv f^D_{k|k-1}(x) = P_k^D(x)\, f_{k|k-1}(x)$ initializes the algorithm. The first iteration of this version of the Shepp-Vardi algorithm
is clearly identical to the Bayesian information update (6.9) of the detected target
process.
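To make the correspondence concrete, here is a minimal sketch, in Python, of the iteration (6.11) on a one-dimensional gridded state space. The grid, the Gaussian measurement likelihood, and all numerical values are illustrative assumptions, not part of the original development; with n_iter=1 the function reproduces the information update (6.9) when the denominators are computed from the predicted detected-target intensity.

```python
import numpy as np

def shepp_vardi_update(f_pred_D, likelihood, cell_volume, n_iter=1):
    """Shepp-Vardi iterations (6.11) on a gridded state space (sketch).

    f_pred_D    : (N,) predicted detected-target intensity P_k^D(x) f_{k|k-1}(x) at grid points
    likelihood  : (m_k, N) array with likelihood[j, i] = p_k(z_j | x_i)
    cell_volume : volume of one grid cell, used to approximate the integrals in (6.11)
    """
    f = f_pred_D.copy()
    for _ in range(n_iter):
        denom = likelihood @ (f * cell_volume)             # int p_k(z_j | s) f(s) ds, one per z_j
        f = f * (likelihood / denom[:, None]).sum(axis=0)  # sum over measurements of p_k / denom
    return f

# Tiny 1-D illustration with a Gaussian measurement likelihood (all values illustrative)
grid = np.linspace(0.0, 10.0, 101)
dx = grid[1] - grid[0]
f_pred = np.full_like(grid, 0.02)                          # diffuse predicted intensity
meas = np.array([3.0, 7.0])
lik = np.exp(-0.5 * (meas[:, None] - grid[None, :]) ** 2) / np.sqrt(2.0 * np.pi)
f_post = shepp_vardi_update(f_pred, lik, dx, n_iter=1)     # first pass reproduces (6.9)
```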
The Shepp-Vardi iteration converges to an ML estimate of the target state inten-
sity given only data at time tk . It is independent of the data at times t1 , . . . , tk−1
except insofar as the initialization influences the ML estimate. In other words, the
iteration leads to an ML estimate of an intensity that does not include the effect of
a Bayesian prior. The problem lies not in the PET interpretation but in the pdf of
the data. To see this it suffices to observe that the parameters of the pdf (5.7) are
not constrained by a Bayesian prior and, consequently, the Shepp-Vardi algorithm
converges to an estimate that is similarly unconstrained. It is, moreover, not obvious
how to impose a Bayesian prior on the PET parameters that does not disappear in
the small cell limit.
Superposing the PPP approximation of the detected target process and the unde-
tected target PPP gives
where
is the predicted measurement clutter intensity. The probability $P_k^D(\phi)$ in (6.16) is the probability that a $\phi$ hypothesis generates a measurement at time $t_k$.
The computational parts of the intensity filter are outlined in Table 6.1. The
table clarifies certain interpretive issues that are glossed over in the discussion.
Implementation methods are discussed elsewhere in this chapter.
$$p(\eta_k) = e^{-\int_{\mathbb{R}^{n_z}} \lambda_{k|k}(z)\, dz} \prod_{j=1}^{m_k} \lambda_{k|k}(z_j) = e^{-N_{k|k}} \prod_{j=1}^{m_k} \lambda_{k|k}(z_j) , \qquad (6.18)$$
where
$$N_{k|k} = \int_{\mathbb{R}^{n_z}} \lambda_{k|k}(z)\, dz = \int_{S^+} \left( \int_{\mathbb{R}^{n_z}} p_k(z\,|\,x)\, dz \right) f_{k|k}(x)\, dx .$$
The modeling assumptions of the intensity filter are very general, and specializa-
tions are possible. The most important is the PHD filter discussed in Section 6.2.1.
By assuming certain kinds of a priori knowledge concerning target birth and mea-
surement clutter, and adjusting the filter appropriately, the intensity filter reduces to
the PHD filter. The differences between the intensity and PHD filters are nearly all
attributable to the augmented state space S + . That is, the intensity filter uses the
augmented single target state space S + = S ∪ φ, while the PHD filter uses only
the single target space S. Using S practically forces the PHD filter to employ target
birth and death processes to model initiation and termination of targets.
A different kind of specialization is the marked multitarget intensity filter. This
is a parameterized linear Gaussian sum intensity filter that interprets measurements
as target marks. This interpretation is interesting in the context of PPP target models
because it implies that the joint measurement-target point process is a PPP. Details are
discussed in Section 6.2.2.
The state $\phi$ is the basis for the on-line estimates of the intensities of the target birth and measurement clutter PPPs given by (6.13) and (6.15), respectively. If, however,
the birth and clutter intensities are known a priori to be bk (x) and λk (z), then the
predictions b̂k (x) and λ̂k (z) can be replaced by bk (x) and λk (z). This is the basic
strategy taken by the PHD filter.
The use of a posteriori methods makes good sense in many applications. For
example, they can help regularize parameter estimates. These methods can also
incorporate information not included a priori in the Bayes filter. For example,
Jazwinski [54] uses an a posteriori method to derive the Schmidt-Kalman filter for
bias compensation. These methods may improve performance: if the a priori birth and clutter intensities are more accurate or stable than their on-line estimated counterparts, the PHD filter may provide better tracking performance.
Given these substitutions, the augmented space is no longer needed and can be
eliminated. This requires some care. If the recursion is simply restricted to S and no
other changes are made, the filter will not be able to discard targets and the target
count may balloon out of control. To balance the target birth process, the PHD filter
uses a death probability before propagating the multitarget intensity f k−1|k−1 (x).
This probability was intentionally omitted from the intensity filter because transition
into φ is target death, and it is redundant to have two death models.
The death process is a Bernoulli thinning process applied to the PPP at time tk−1
before targets transition and are possibly detected. Let dk−1 (x) denote the prob-
ability that a target at time tk−1 dies before transitioning to time tk . The surviving
target point process is a PPP and its intensity is (1 − dk−1 (x)) f k−1|k−1 (x). Adding
Bernoulli death and restricting the recursion to S reduces the intensity filter to the
PHD filter.
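For concreteness, the following is a minimal sketch, assuming a gridded state space, of the prediction step just described: Bernoulli survival thinning, transport by a known transition density, and addition of an a priori birth intensity. All names, and the quadrature approximation of the integral, are illustrative assumptions rather than the book's implementation.

```python
import numpy as np

def phd_predict(f_prev, death_prob, transition, birth, cell_volume):
    """PHD prediction on a grid: thin by survival, transport by the transition density, add births.

    f_prev      : (N,) posterior intensity f_{k-1|k-1} at the grid points
    death_prob  : (N,) death probabilities d_{k-1}(x)
    transition  : (N, N) array, transition[i, j] ~ Psi_{k-1}(x_i | x_j)
    birth       : (N,) a priori birth intensity b_k(x)
    """
    survived = (1.0 - death_prob) * f_prev             # Bernoulli thinning of the prior PPP
    predicted = transition @ (survived * cell_volume)  # int Psi(x | y) (1 - d(y)) f(y) dy
    return birth + predicted
```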
The intensity filter assumes targets have the same motion model, and that the sen-
sor measurement likelihood function is the same for all targets and data. Such
assumptions are idealized at best. An alternative approach is to develop a param-
eterized intensity filter that accommodates heterogeneous target motion models and
measurement pdfs by using target-specific parameterizations. The notion of target-
specific parameterizations in the context of PPP target modeling seems inevitably to
lead to the idea of modeling individual targets as a PPP, and then using superposi-
tion to obtain the aggregate PPP target model. Parameter estimation using the EM
method is natural to superposition problems, as shown in Chapter 3. The marked
multisensor intensity filter (MMIF) is one instance of such an approach.
The MMIF builds on the basic idea that a target at state x is “marked” with
a measurement z. If the target is modeled as a PPP, then the joint measurement-
target vector (z, x) is a PPP on the Cartesian product of the measurement and target
spaces. This is an intuitively reasonable result, but the details needed to see that it is
true are postponed to Section 8.1. The MMIF uses a linear Gaussian target motion
and measurement model for each target and superposes them against a background
clutter model. Since the Gaussian components correspond to different targets, they
need not have the same motion model. Similarly, different sensor measurement
models are possible. Superposition therefore leads to an affine Gaussian sum inten-
sity function on the joint measurement-target space. The details of the EM method
and the final MMIF recursion are given in Appendix E.
The MMIF adheres to the “at most one measurement per target” rule, but only in the mean, or on average. It does this by reinterpreting the single target pdf as a
PPP intensity function, and by interpreting measurements as the target marks. The
expected number of targets that the PPP on the joint measurement-target space pro-
duces is one.
Another feature of the MMIF is that the EM weights depend on the Kalman
filter innovations. The weights in other Gaussian sum filters often involve scaled
multiples of the measurement variances, resulting in filters that are somewhat akin
to “nearest neighbor” tracking filters.
The limitation of MMIF and other parameterized sum approaches is the require-
ment to use a fixed number of terms in the sum. This strongly affects its ability to
model the number of targets. In practice, various devices can compensate for this
limitation, but they are not intrinsic to the filter.
6.3 Implementation
Simply put, targets correspond to the local peaks of the intensity function and the
areas of uncertainty correspond to the contours, or isopleths, of the intensity. Very
often in practice, isopleths are approximated by ellipsoids in target state space cor-
responding to error covariance matrices. Methods for locating the local peak con-
centrations of intensity and finding appropriate covariance matrices to measure the
width of the peaks are discussed in this section.
Implementation of intensity filters therefore involves two issues. Firstly,
it is necessary to develop a computationally viable representation of the informa-
tion updated intensity function of the filter. Two basic representations are proposed,
one based on particles and the other on Gaussian sums. Secondly, postprocessing
procedures are applied to the intensity function representation to extract the num-
ber of detected targets, together with their estimated states and corresponding error
covariance matrices. Analogous versions of both issues arise in classical single tar-
get Bayesian filters.
The fact remains, however, that a proper statistical interpretation of target point
estimates and their putative error covariances is lacking for intensity filters. The
concern may be dismissed in practice because they are intuitively meaningful and
closely resemble their single target Bayesian analogs. The concern is nonetheless
worrisome and merits further study.
The most common and by far the easiest implementation of nonlinear filters is by
particle, or sequential Monte Carlo (SMC), methods. In such methods the poste-
rior pdf is represented nonparametrically by a set of particles in target state space,
together with a set of associated weights, and estimated target count. Typically these
weights are uniform, so the spatial distribution of particles represents the variabil-
ity of the posterior density. An excellent discussion of SMC methods for Bayesian
single target tracking applications is found in the first four chapters of [104].
Published particle methods for the general intensity filter are limited to date to
the PHD filter. Extensions to the intensity filter are not reported here. An early and
well described particle methodology (as well as an interesting example for tracking
on roads) for PHD filters is given in [111]. Particle methods and their convergence
properties for the PHD filter are discussed in detail in a series of papers by Vo et al.
[137]. Interested readers are urged to consult them for specifics.
Tracking in a surveillance region R using SMC methods starts with an initial set
of particles and weights at time tk−1 together with the estimated number of targets
in R:
$$\left\{ x_{k-1|k-1}(\ell),\, w_{k-1|k-1}(\ell) : \ell = 1, \ldots, L_{SMC} \right\} \quad \text{and} \quad N_{k-1|k-1} ,$$
where $w_{k-1|k-1}(\ell) = 1/L_{SMC}$ for all $\ell$. For PHD filters the particle method proceeds in several steps that mimic the procedure outlined in Table 6.2:
$$\left( 1 - P_k^D(x) + \sum_{j=1}^{m} \frac{p_k(z_k(j)\,|\,x)\, P_k^D(x)}{\lambda_{k|k-1}(z_k(j))} \right) , \qquad (6.20)$$
where $z_k(1), \ldots, z_k(m)$ are the measurements at time $t_k$. The updated particle weights are nonuniform.
• Normalization. Compute the scale factor, call it Nk|k , of the sum of the updated
particle weights. Divide all the particle weights by Nk|k to normalize the weights.
• Resampling. Particles are resampled by choosing i.i.d. samples from the discrete
pdf defined by the normalized weights. Resampling restores the particle weights
to uniformity.
If the resampling step is omitted, the SMC method leads to particle weight distributions that rapidly concentrate on a small handful of particles and therefore poorly represent the posterior intensity. There are many ways to do the resampling in
practice.
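The weight update (6.20), the normalization, and the resampling step can be sketched as follows. The constant detection probability, the a priori clutter intensity, and the likelihood callback are simplifying assumptions made only for this illustration.

```python
import numpy as np

def phd_particle_update(particles, weights, meas, p_det, lik_fn, clutter_intensity, rng):
    """Sketch of the SMC information update (6.20), normalization, and resampling.

    particles         : (L, d) predicted particle states
    weights           : (L,) predicted weights; their sum approximates the predicted target count
    meas              : iterable of measurements z_k(1), ..., z_k(m)
    p_det             : detection probability, assumed constant for this illustration
    lik_fn            : lik_fn(z, particles) -> (L,) values of p_k(z | x) at the particles
    clutter_intensity : a priori clutter intensity at z, assumed constant
    """
    L = len(weights)
    multiplier = np.full(L, 1.0 - p_det)
    for z in meas:
        lik = lik_fn(z, particles)                                # p_k(z | x_l) for each particle
        lam = clutter_intensity + np.sum(p_det * lik * weights)   # predicted measurement intensity
        multiplier += p_det * lik / lam
    new_weights = weights * multiplier                            # Eq. (6.20) applied per particle
    N_kk = new_weights.sum()                                      # scale factor, see (6.21)
    idx = rng.choice(L, size=L, p=new_weights / N_kk)             # resample from normalized weights
    return particles[idx], np.full(L, N_kk / L), N_kk
```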
By computing $N_{k|k}$ before resampling, it is easy to see that
$$N_{k|k} \approx \int_R f_{k|k}(x)\, dx = E\left[\,\text{Number of targets in } R\,\right] . \qquad (6.21)$$
The estimator is poor for sets R0 that are only a small fraction of the total volume
of R.
The primary limitations of particle approaches in many applications are due to
the so-called Curse of Dimensionality1 : the number of particles needed to repre-
sent the intensity function grows exponentially as the dimension of the state space
increases. Most applications to date seem to be limited to four or five dimensions.
The curse is so wicked that Moore’s Law (the doubling of computational capability
every 18 months) by itself will do little to increase the effective dimensional limit
over a human lifetime. Moore’s Law and improved methods together will undoubt-
edly increase the number of dimensions for which particle filters are practical, but
it remains to be seen if general filters of dimension much larger than say six can be
treated directly.
1 The name was first used in 1961 by Richard E. Bellman [9]. The name is apt in very many
problems; however, some modern methods in machine learning actually exploit high dimensional
embeddings.
$$\left\{ x_{k|k}(\ell) : \ell = 1, \ldots, L_{SMC} \right\} .$$
$$\lambda_k(x) = I_k \sum_{\ell=1}^{L_{SMC}} N\left(x\,;\,x_{k|k}(\ell),\, \Sigma_{ker}\right) , \qquad (6.23)$$
where $N(x\,;\,x_{k|k}(\ell), \Sigma_{ker})$ is the kernel. The covariance matrix $\Sigma_{ker}$ is specified, not estimated. Intuitively, the larger $\Sigma_{ker}$, the fewer the number of local maxima in the intensity (6.23), and conversely. The scale factor $I_k > 0$ is estimated by the particle filter and is taken as known here.
The form (6.23) has no parameters to estimate, so extend it by defining
$$\lambda_k(x\,;\,\mu) = I_k \sum_{\ell=1}^{L_{SMC}} N\left(x\,;\,x_{k|k}(\ell) - \mu,\, \Sigma_{ker}\right) , \qquad (6.24)$$
where μ is an unknown rigid translation of the intensity (6.23). It is not hard to see
that the ML estimate of μ is a local maximum of the kernel estimate, that is, a point
estimate for a target. The vector μ is estimated from data using the EM method. The
clever part is using an artificial data set with only one point in it, namely, the origin.
Let $r = 0, 1, \ldots$ denote the EM iteration index, and let $\mu^{(0)}$ be a specified initial value for the mean. The auxiliary function is given by (3.20) with $m = 1$, $x_1 = 0$, $L = L_{SMC}$, $\theta = \mu$, and
$$\lambda_\ell(x\,;\,\mu) \equiv I_k\, N\left(x\,;\,x_{k|k}(\ell) - \mu,\, \Sigma_{ker}\right) .$$
$$w_\ell\left(\mu^{(r)}\right) = \frac{N\left(0\,;\,x_{k|k}(\ell) - \mu^{(r)},\, \Sigma_{ker}\right)}{\sum_{\ell'=1}^{L_{SMC}} N\left(0\,;\,x_{k|k}(\ell') - \mu^{(r)},\, \Sigma_{ker}\right)} = \frac{N\left(x_{k|k}(\ell)\,;\,\mu^{(r)},\, \Sigma_{ker}\right)}{\sum_{\ell'=1}^{L_{SMC}} N\left(x_{k|k}(\ell')\,;\,\mu^{(r)},\, \Sigma_{ker}\right)} . \qquad (6.25)$$
The auxiliary function in the present case requires no sum over $j$ as done in (3.20), so
$$Q\left(\mu\,;\,\mu^{(r)}\right) = -I_k + \sum_{\ell=1}^{L_{SMC}} w_\ell\left(\mu^{(r)}\right) \log\left[ I_k\, N\left(0\,;\,x_{k|k}(\ell) - \mu,\, \Sigma_{ker}\right) \right] . \qquad (6.26)$$
Substituting (6.25) and canceling the common factor gives the classical mean-shift iteration:
$$\mu^{(r+1)} = \frac{\sum_{\ell=1}^{L_{SMC}} N\left(x_{k|k}(\ell)\,;\,\mu^{(r)},\, \Sigma_{ker}\right) x_{k|k}(\ell)}{\sum_{\ell=1}^{L_{SMC}} N\left(x_{k|k}(\ell)\,;\,\mu^{(r)},\, \Sigma_{ker}\right)} . \qquad (6.28)$$
The update of the mean is a convex combination of the particle set. Convergence to a local maximum, $\mu^{(r)} \to \hat{x}_{k|k}$, is guaranteed as $r \to \infty$.
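A minimal sketch of the mean-shift iteration (6.28), assuming an isotropic Gaussian kernel and a particle cloud stored as an array (both illustrative choices, not the book's implementation):

```python
import numpy as np

def mean_shift(particles, sigma_ker, mu0, n_iter=100, tol=1e-8):
    """Mean-shift iteration (6.28) with an isotropic Gaussian kernel.

    particles : (L, d) particle locations x_{k|k}(l)
    sigma_ker : kernel standard deviation (scalar stand-in for Sigma_ker)
    mu0       : (d,) starting point; different starts converge to different intensity peaks
    """
    mu = np.asarray(mu0, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum((particles - mu) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / sigma_ker ** 2)       # unnormalized Gaussian kernel weights
        mu_new = w @ particles / w.sum()             # convex combination of the particles
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu
```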
Different initializations are needed for different targets, so the mean shift algorithm needs a preliminary clustering method to initialize it, as well as to determine the number of peaks in the data that correspond to targets. Also, the size of the kernel depends somewhat on the number of particles and may need to be adjusted to smooth the intensity surface appropriately.
Identifiability remains a problem with the mean shift algorithm, that is, there is no association of a point estimate with a particular target except through the choice of the starting point of the iteration. This may cause problems when targets are in close
proximity.
One way to try to resolve the problem is to use the particles themselves as points
to feed into another tracking algorithm. This method exploits serial structure in the
filter estimates and may disambiguate closely spaced targets. However, particles are
serially correlated and do not satisfy the conditional independence assumptions of
measurement data, so the resulting track estimates may be biased.
$$J(\mu) = I_k\, \Sigma_{ker}^{-1}\, \Sigma(\mu)\, \Sigma_{ker}^{-1} , \qquad (6.29)$$
$$\Sigma(\mu) = \sum_{\ell=1}^{L} \sum_{\ell'=1}^{L} \int_{\mathbb{R}^{n_x}} w(x, \mu\,;\,\ell, \ell') \left( x - x_{k|k}(\ell) + \mu \right) \left( x - x_{k|k}(\ell') + \mu \right)^T dx . \qquad (6.30)$$
The CRB is $J^{-1}(\mu)$ evaluated at the true value of $\mu$. Because the integral is over all of $\mathbb{R}^{n_x}$, a change of variables shows that the information matrix $J(\mu)$ is independent of the true value of $\mu$. This means that the FIM is not target specific.
A local bound is desired, since the mean shift algorithm converges to a local peak of the intensity. By restricting the intensity model to a specified bounded gate $G \subset \mathbb{R}^{n_x}$, the integral in (6.30) is similarly restricted. The matrix $\Sigma(\mu)$ is thus a function of $G$. The gated CRB is local to the gate, i.e., it is a function of the target within the gate.
$$\log p(\mu) = -I_k\, L_{SMC} + \log\left[ \sum_{\ell=1}^{L_{SMC}} N\left(x_{k|k}(\ell)\,;\,\mu,\, \Sigma_{ker}\right) \right] .$$
$$-\nabla_\mu \nabla_\mu^T \log p(\mu) = \Sigma_{ker}^{-1} + \frac{1}{\kappa^2} \left[ \sum_{\ell=1}^{L_{SMC}} N\left(\mu\,;\,x_{k|k}(\ell), \Sigma_{ker}\right) \Sigma_{ker}^{-1} \left(\mu - x_{k|k}(\ell)\right) \right] \left[ \sum_{\ell=1}^{L_{SMC}} N\left(\mu\,;\,x_{k|k}(\ell), \Sigma_{ker}\right) \Sigma_{ker}^{-1} \left(\mu - x_{k|k}(\ell)\right) \right]^T - \frac{1}{\kappa}\, \Sigma_{ker}^{-1} \left[ \sum_{\ell=1}^{L_{SMC}} N\left(\mu\,;\,x_{k|k}(\ell), \Sigma_{ker}\right) \left(\mu - x_{k|k}(\ell)\right) \left(\mu - x_{k|k}(\ell)\right)^T \right] \Sigma_{ker}^{-1} ,$$
where
$$\kappa = \sum_{\ell=1}^{L_{SMC}} N\left(\mu\,;\,x_{k|k}(\ell), \Sigma_{ker}\right) .$$
The observed information matrix is evaluated at the MAP estimate $\mu = \hat{x}_{k|k}$. The middle term is proportional to $\nabla_\mu p(\mu)\, \left[\nabla_\mu p(\mu)\right]^T$ and so is zero at any stationary point of $p(\mu)$, e.g., at the MAP estimate $\mu = \hat{x}_{k|k}$. The OIM is therefore
$$OIM\left(\hat{x}_{k|k}\right) = \Sigma_{ker}^{-1} - \Sigma_{ker}^{-1} \left[ \frac{\sum_{\ell=1}^{L_{SMC}} N\left(\hat{x}_{k|k}\,;\,x_{k|k}(\ell), \Sigma_{ker}\right) \left(\hat{x}_{k|k} - x_{k|k}(\ell)\right) \left(\hat{x}_{k|k} - x_{k|k}(\ell)\right)^T}{\sum_{\ell=1}^{L_{SMC}} N\left(\hat{x}_{k|k}\,;\,x_{k|k}(\ell), \Sigma_{ker}\right)} \right] \Sigma_{ker}^{-1} . \qquad (6.32)$$
The CRB surrogate is $OIM^{-1}\left(\hat{x}_{k|k}\right)$. The inverse exists because the OIM is positive definite at $\hat{x}_{k|k}$. This matrix is, in turn, a surrogate for the error covariance matrix.
The OIM for x̂k|k can be computed efficiently in conjunction with any EM
method (see [69] for a general discussion). As noted in Section 4.7, the statistical
interpretation of the OIM is unresolved in statistical circles. Its utility should be
carefully investigated in applications.
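As an illustration, the OIM (6.32) can be evaluated directly from the particle cloud. The isotropic kernel and the helper name below are assumptions made only for this sketch; its inverse is the covariance surrogate discussed above.

```python
import numpy as np

def observed_information(x_hat, particles, sigma_ker):
    """Observed information matrix (6.32) at a point estimate, isotropic Gaussian kernel assumed."""
    d = particles.shape[1]
    Sinv = np.eye(d) / sigma_ker ** 2               # Sigma_ker^{-1} for an isotropic kernel
    diffs = x_hat - particles                       # rows are x_hat - x_{k|k}(l)
    w = np.exp(-0.5 * np.sum(diffs ** 2, axis=1) / sigma_ker ** 2)
    outer = diffs[:, :, None] * diffs[:, None, :]   # (L, d, d) outer products
    S = (w[:, None, None] * outer).sum(axis=0) / w.sum()
    return Sinv - Sinv @ S @ Sinv                   # inverse is the error covariance surrogate
```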
covariance matrices are extracted from the means and variances of the Gaussian
sum instead of a myriad of particles. Gaussian sum methods are also potentially
more computationally practical than particle methods for very large numbers of
targets.
The Gaussian sum approach is especially attractive when constant survival and
detection functions are assumed. This assumption means that the PPP thinning func-
tions are independent of state. In this case, assuming also linear Gaussian target
motion and measurement models, the prediction and information update steps are
closed form.
Gaussian sum implementations of the PHD filter are carefully discussed by Vo
and his colleagues [139]. An unnormalized Gaussian sum is used to approximate
the intensity function. These methods are important because they have the potential
to be useful in higher dimensions than particle filters. Nonetheless, despite some
comments to the contrary, Gaussian sum intensity filters do not escape the curse of
target state space dimensionality.
Gaussian sum methods for intensity estimation comprise several steps:
• Prediction. The target intensity at time tk−1 is a Gaussian sum, to which is added
a target birth process that is modeled by a Gaussian sum. The prediction equation
for every component in the Gaussian sum is identical to a Kalman filter prediction
equation.
• Component Update. For each point measurement, the predicted Gaussian compo-
nents are updated using the usual Kalman update equations. The update therefore
increases the number of terms in the Gaussian sum if there is more than one mea-
surement. This step has two parts. In the first, the means and covariance matrices
are evaluated. In the second, the coefficients of the Gaussian sum are updated by
a multiplicative procedure.
• Merging and Pruning. The components of the Gaussian sum are merged and
pruned to obtain a “nominal” number of terms. Various reasonable strategies
are available for such purposes, as detailed in [139]. This step is the analog of
resampling in the particle method.
Some form of pruning is necessary to keep the size of the Gaussian sum bounded
over time, so the last—and most heuristic—step cannot be omitted.
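The following sketch illustrates the prediction and per-measurement Kalman component update steps listed above for linear Gaussian models with constant detection probability and a constant a priori clutter intensity. The function names, and the use of scipy for the Gaussian pdf, are illustrative assumptions; merging and pruning are omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gm_predict(comps, F, Q, birth_comps):
    """Kalman prediction of each Gaussian component (w, m, P); birth components are appended."""
    predicted = [(w, F @ m, F @ P @ F.T + Q) for (w, m, P) in comps]
    return predicted + list(birth_comps)

def gm_update(comps, meas, H, R, p_det, clutter_intensity):
    """Per-measurement Kalman updates with multiplicative weight update (Gaussian sum sketch)."""
    updated = [(w * (1.0 - p_det), m, P) for (w, m, P) in comps]   # missed-detection terms
    for z in meas:
        cand = []
        for (w, m, P) in comps:
            S = H @ P @ H.T + R                                    # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)
            q = multivariate_normal.pdf(z, mean=H @ m, cov=S)      # predicted measurement pdf
            cand.append((p_det * w * q,
                         m + K @ (z - H @ m),
                         (np.eye(len(m)) - K @ H) @ P))
        norm = clutter_intensity + sum(c[0] for c in cand)
        updated.extend((cw / norm, cm, cP) for (cw, cm, cP) in cand)
    return updated   # merging and pruning would follow to bound the number of components
```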
Left out of this discussion are details that relate the weights of the Gaussian
components to the estimated target count. These details can be found in [139]. For
nonlinear target motion and measurement models, [139] proposes both the extended
and the unscented Kalman filters. Vo and his colleagues also present Gaussian sum
implementations of the CPHD filter in [138, 139].
6.3.6 Regularization
Intensity filters are in the same class of stochastic inverse problems as image recon-
struction in emission tomography—the sequence t0 , t1 , . . . , tk of intensity filter
estimates f k|k (x) is essentially a movie (in dimension n x ) in target state space. As
discussed in Section 5.7, such problems suffer from serious noise and numerical arti-
facts. The high dimensionality of the PPP parameter, i.e., the number of voxels of the
intensity function, makes regularization a priority in all applications. Regularization
for intensity filters is a relatively new subject. Methods such as cardinalization are
inherently regularizing.
Grenander’s method of sieves used in Section 5.7.1 for regularizing PET adapts
to the intensity filter, but requires some additional structure. The sieve kernel
$k_0(x\,|\,u)$ is a pdf on $S^+$, so that
$$\int_{S^+} k_0(x\,|\,u)\, dx = 1 . \qquad (6.33)$$
The kernel k0 can be a function of time tk if desired. The restriction (6.34) is also
imposed on the predicted target intensity:
$$f_{k|k-1}(x) = \int_{U^+} k_0(x\,|\,u)\, \zeta_{k|k-1}(u)\, du \quad \text{for some } \zeta_{k|k-1}(u) > 0 . \qquad (6.35)$$
where
$$\tilde{p}_k(z\,|\,u) = \int_{S^+} p_k(z\,|\,x)\, k_0(x\,|\,u)\, dx . \qquad (6.37)$$
where
$$\tilde{\Psi}_{k-1}(x\,|\,u) = \int_{S^+} \Psi_{k-1}(x\,|\,y)\, k_0(y\,|\,u)\, dy . \qquad (6.38)$$
for all points x ∈ S + . Like the sieve kernel k0 ( · ), the Bayesian kernel k1 (v | x) is
very flexible. It is easily verified that the function
$$\Phi_{k-1}(v\,|\,u) = \int_{S^+} k_1(v\,|\,x)\, \tilde{\Psi}_{k-1}(x\,|\,u)\, dx \qquad (6.40)$$
The information updated intensity $\hat{\zeta}_{k|k}(u)$ is evaluated via the intensity filter using the regularized measurement pdf (6.37) and the predicted measurement intensity (6.36). The regularized target state intensity at time $t_k$ is the integral
$$f_{k|k}(x) = \int_{U^+} k_0(x\,|\,u)\, \hat{\zeta}_{k|k}(u)\, du . \qquad (6.41)$$
The regularized intensity $f_{k|k}(x)$ depends on the sieve and Bayesian kernels $k_0(\cdot)$ and $k_1(\cdot)$.
The question of how best to define the k0 ( · ) and k1 ( · ) kernels depends on the
application. It is common practice to define kernels using Gaussian pdfs. As men-
tioned in Section 5.7, the sieve kernel is a kind of measurement smoothing kernel.
If dim(U) < dim(S), the Bayesian kernel disguises observability issues, that is,
many points x ∈ S map with the same probability to a given point u ∈ U. This
provides a mechanism for target state space smoothing.
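As a simple illustration of the smoothing role of the sieve kernel, the sketch below evaluates an integral of the form (6.41) by quadrature on a one-dimensional grid with a Gaussian kernel. The grid, kernel width, and intensity are illustrative assumptions made only for this example.

```python
import numpy as np

def sieve_smooth(zeta, grid, kernel_sigma):
    """Evaluate f(x) = int k0(x | u) zeta(u) du on a 1-D grid with a Gaussian kernel k0."""
    du = grid[1] - grid[0]
    diff = grid[:, None] - grid[None, :]                           # x_i - u_j
    k0 = np.exp(-0.5 * (diff / kernel_sigma) ** 2) / (np.sqrt(2 * np.pi) * kernel_sigma)
    return k0 @ (zeta * du)                                        # quadrature over u

grid = np.linspace(0.0, 10.0, 201)
zeta = 5.0 * np.exp(-0.5 * ((grid - 4.0) / 0.2) ** 2)              # sharply peaked intensity
f_reg = sieve_smooth(zeta, grid, kernel_sigma=0.5)                 # regularized, smoother intensity
```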
6.4 Estimated Target Count
Accurate knowledge of the target detection probability function $P_k^D(x)$ is crucial to correctly estimating target count. An incorrect value of $P_k^D(x)$ is a source of systematic error. For example, if the filter uses the value $P_k^D(x) = 0.5$ but in fact all targets always show up in the measured data, the estimated mean target count will be high by a factor of two. This example is somewhat extreme, but it makes the point that correctly setting the detection probability is an important task for the track management function. The task involves executive knowledge about changing sensor performance characteristics, as well as executive decisions external to the tracking algorithm about the number of targets actually present—decisions that feed back to validate estimates of $P_k^D(x)$. Henceforth, the probability of detection function $P_k^D(x)$ is assumed accurate.
There are other possible sources of error in target count estimates. Birth-death
processes can be difficult to tune in practice, regardless of whether they are modeled
implicitly as transitions into and out of a state φ in the intensity filter, or explicitly
as in the PHD filter. If in an effort to detect new targets early and hold track on them
as long as possible, births are too spontaneous and deaths are too infrequent, the
target count will be too high on average. Conversely, it will be too low with delayed
initiation and early termination. Critically damped designs, however that concept
is properly defined in this context, would seem desirable in practice. In any event,
tuning is a function of the track management system.
Under the PPP model, the estimated expected number of targets in a given region,
A, is the integral over A of the estimated multitarget intensity. Because the number
is Poisson distributed, the variance of the estimated number of targets is equal to
the mean number. This large variance is an unhappy fact of life. For example, if 10 targets are present, the standard deviation of the estimated number is $\sqrt{10} \approx 3$. It is therefore foolhardy in practice to assume that the estimate of the number of targets is the number actually present. Variance reduction in the target
count estimate is a high priority from the track management point of view for both
intensity and PHD filters.
tribute to the filter. Consequently, if the individual sensors estimate target count
correctly, so does the multisensor intensity filter.
Moreover, and just as importantly, the variance of the target count estimate of the
multisensor intensity filter is reduced by a factor of M compared to that of a single
sensor, where M is the number of sensors, assuming for simplicity that the sensor
variances are identical.
This important variance reduction property is analogous to estimators in other
applications. An especially prominent and well known example is power spectral
estimation of wideband stationary time series. For such signals the output bins of
the DFT of non-overlapped blocks of sampled data are distributed with a mean
level equal to the signal power in the bin, and the variance equal to the mean. This
property of the periodogram is well known, as is the idea of time averaging the peri-
odogram, i.e., the non-overlapped DFT outputs, to reduce the variance of spectral
estimates.2 The Wiener-Khinchin theorem justifies averaging the short term Fourier
transforms of nonoverlapped data records as a way to estimate the power spectrum
with reduced variance. In practice, the number of DFT records averaged is often
about 25.
The multisensor intensity filter has low computational complexity and is applicable to distributed heterogeneous sensor networks. It is thus practical and widely useful.
Speculating now for the sheer fun of it, if the number of data records in a power
spectral average carries over to multisensor multitarget tracking problems, then the
multisensor intensity filter achieves satisfactory performance for many practical pur-
poses with about 25 sensors.
To motivate the discussion, consider the SPECT imaging application of Section 5.4.
In SPECT, a single gamma camera is moved to several view angles and a snapshot is
taken of light observed emanating from gamma photon absorption events. The EM
recursion given by (5.69) is the superposition of the intensity functions estimated
by each of the camera view angles. Intuitively, different snapshots cannot contain
data from the same absorption event, so the natural way to fuse the multiple cam-
era images into one image is to add them, after first weighting by the fraction of
radioisotope that is potentially visible in each. The theoretical justification is that,
since the number of decays is unknown and Poisson distributed, the estimates of
the spatial distribution of the radioisotope obtained from different view angles are
independent, not conditionally independent, so the intensity functions (images) are
superposed.
The general multisensor multitarget filtering problem is not concerned with
radioisotope decays, but rather with physical entities (aircraft, ships, etc.) that per-
sist over long periods of time—target physics are very different from the physics
2 Averaging trades off variance reduction and spectral resolution. It was first proposed by
M. S. Bartlett [6] in 1948.
In homogeneous problems the sensor coverages are identical, i.e., $C_k(\ell) \equiv C_k$ for all $\ell$. Heterogeneous problems are those that are not homogeneous.
Two sensors with the same coverage need not have the same, or even closely
related, probability of detection functions. As time passes, homogeneous problems
may turn into heterogeneous ones, and vice versa. In practice, it is probably desir-
able to set a small threshold to avoid issues with very small probabilities of detec-
tion. Homogeneous and heterogeneous problems are discussed separately.
where $f^{Fused}_{k|k-1}(x)$ is the predicted target intensity based on the fused estimate $f^{Fused}_{k-1|k-1}(x)$ at time $t_{k-1}$. The measured data from sensor $\ell$ is denoted $\xi_k(\ell)$, and the corresponding sensor likelihood function is
$$L_k(\xi_k(\ell)\,|\,x\,;\,\ell) = 1 - P_k^D(x\,;\,\ell) + \sum_{j=1}^{m_k(\ell)} \frac{p_k(z_k(j\,;\,\ell)\,|\,x\,;\,\ell)\, P_k^D(x\,;\,\ell)}{\lambda_{k|k-1}(z_k(j\,;\,\ell)\,;\,\ell)} . \qquad (6.46)$$
$$f^{Fused}_{k|k}(x) = \frac{1}{M} \sum_{\ell=1}^{M} f_{k|k}(x\,;\,\ell) = \left( \frac{1}{M} \sum_{\ell=1}^{M} L_k(\xi_k(\ell)\,|\,x\,;\,\ell) \right) f^{Fused}_{k|k-1}(x) . \qquad (6.47)$$
If the sensor-level intensity filters are maintained by particles, and the number of particles is the same for all sensors, the multisensor averaging filter is implemented merely by pooling all the particles (and randomly downsampling to the desired number of particles, if necessary).
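A minimal sketch of these two fusion options, assuming a common state-space grid in the first case and per-sensor particle clouds in the second (function names and the downsampling strategy are illustrative assumptions):

```python
import numpy as np

def fuse_sensor_intensities(intensities):
    """Average the single-sensor posterior intensities on a common state-space grid, Eq. (6.47)."""
    return np.mean(np.stack(intensities, axis=0), axis=0)

def fuse_particle_clouds(clouds, counts, n_out, rng):
    """Particle version: pool all sensor particle clouds and downsample to n_out particles.

    clouds : list of (L_l, d) particle arrays, one per sensor
    counts : list of single-sensor target-count estimates N_{k|k}(l)
    The fused count is the average of the sensor counts, as in Eq. (6.48).
    """
    pooled = np.vstack(clouds)
    idx = rng.choice(len(pooled), size=n_out, replace=True)
    fused_count = float(np.mean(counts))
    return pooled[idx], fused_count
```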
Multisensor fusion methods sometimes rank sensors by some relative quality
measure. This is unnecessary for the multisensor intensity filter. The reason is that
sensor quality, as measured by the probability of detection functions $P_k^D(x\,;\,\ell)$ and the sensor measurement pdfs $p_k(z\,|\,x\,;\,\ell)$, is automatically included in (6.47).
The multisensor intensity filter estimates the number of targets as
$$N^{Fused}_{k|k} = \int_S f^{Fused}_{k|k}(x)\, dx = \frac{1}{M} \sum_{\ell=1}^{M} \int_S f_{k|k}(x\,;\,\ell)\, dx = \frac{1}{M} \sum_{\ell=1}^{M} N_{k|k}(\ell) , \qquad (6.48)$$
where $N_{k|k}(\ell)$ is the number of targets estimated by sensor $\ell$. Taking the expectation of both sides gives
$$E\left[ N^{Fused}_{k|k} \right] = \frac{1}{M} \sum_{\ell=1}^{M} E\left[ N_{k|k}(\ell) \right] . \qquad (6.49)$$
If the individual sensors are unbiased on average, or in the mean, then $E[N_{k|k}(\ell)] = N$ for all $\ell$, where $N$ is the true number of targets present. Consequently, the multisensor intensity filter is also unbiased.
The estimate $N_{k|k}(\ell)$ is Poisson distributed, and the variance of a Poisson distribution is equal to its mean, so
$$\mathrm{Var}\left[ N_{k|k}(\ell) \right] = N , \qquad \ell = 1, \ldots, M .$$
Because the terms in the sum in (6.48) are independent, the variance of the average is
$$\mathrm{Var}\left[ N^{Fused}_{k|k} \right] = \frac{1}{M^2} \sum_{\ell=1}^{M} \mathrm{Var}\left[ N_{k|k}(\ell) \right] = \frac{N}{M} . \qquad (6.50)$$
In words, the standard deviation of the estimated target count in the multisensor intensity filter is smaller than that of individual sensors by a factor of $\sqrt{M}$, where $M$ is the number of fielded sensors. This is an important result for spatially distributed networked sensors.
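A quick Monte Carlo illustration of (6.50), with purely illustrative values of N, M, and the number of trials:

```python
import numpy as np

# Monte Carlo check of (6.50): averaging M independent Poisson(N) count estimates
# reduces the variance of the fused count from N to N / M (illustrative values).
rng = np.random.default_rng(0)
N_true, M, trials = 10, 25, 100_000
counts = rng.poisson(N_true, size=(trials, M))     # per-sensor target count estimates
fused = counts.mean(axis=1)                        # Eq. (6.48)
print(fused.mean(), fused.var())                   # approx. 10 and 10 / 25 = 0.4
```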
The averaging multisensor intensity filter is derived by Bayesian methods in
[124]. It is repeated here in outline. The Bayesian derivation of the single sensor
intensity filter in Appendix D is a good guide to the overall structure of most of the
argument.
The key is to exploit the PPP target model on the augmented space $S^+$. Following the lead of (D.5) in Appendix D, the only PPP realizations with nonzero likelihood have $m_k = \sum_{\ell=1}^{M} m_k(\ell)$ microtargets. The $m_k$ PPP microtargets are
paired with the m k sensor data points, so the overall joint likelihood function is
the product of the sensor data likelihoods given the microtarget assignments. This
product is then summed over all partitions of the m k microtargets into parts of size
m k (1), . . . , m k (M).
The sum over all partitions is the Bayes posterior pdf on the event space E(S + ). It
is a very complex sum, but it has important structure. In particular, the single target
marginal pdfs are identical, that is, the integrals over all but one microtarget state
are all the same. After tedious algebraic manipulation, the single target marginal pdf
is seen to be
$$p^{Fused}_X(x) = \frac{1}{m_k} \sum_{\ell=1}^{M} L_k(\xi_k(\ell)\,|\,x\,;\,\ell)\, f^{Fused}_{k|k-1}(x) , \qquad x \in S^+ . \qquad (6.51)$$
The mean field approximation is now invoked as in (D.13). Under this approximation, $f^{Fused}_{k|k}(x) = c\, p^{Fused}_X(x)$, where the constant $c > 0$ is estimated. From (6.17) and (6.18), the measurement intensity is
$$\lambda^{Fused}_{k|k}(z) = c \int_{S^+} p_k(z\,|\,x)\, p^{Fused}_X(x)\, dx , \qquad (6.52)$$
$$L(c\,;\,\xi_k(1), \ldots, \xi_k(M)) = \prod_{\ell=1}^{M} \left\{ e^{-\int_{S^+} c\, p^{Fused}_X(x)\, dx} \prod_{j=1}^{m_k(\ell)} \lambda^{Fused}_{k|k}(z_k(j\,;\,\ell)) \right\} \propto e^{-cM}\, c^{m_k} . \qquad (6.53)$$
Setting the derivative with respect to $c$ to zero and solving gives the ML estimate $\hat{c}_{ML} = m_k / M$. The multisensor intensity filter is $\hat{c}_{ML}\, p^{Fused}_X(x)$. Further purely technical details of the Bayesian derivation provide little additional insight, so they are omitted.
The multiplication of the conditional likelihoods of the sensor data happens at the PPP event level, where the correct associations of sensor data to targets are assumed
aged, not multiplied. The multisensor intensity filter therefore cannot reduce the
area of uncertainty of the extracted target point estimates. In other words, the mul-
tisensor intensity averaging filter cannot improve spatial resolution. Intuitively, the
multisensor filter achieves variance reduction in the target count by foregoing spatial
resolution of the target point estimates.
When the probability of detection functions are not identical, the multisensor inten-
sity filter description is somewhat more involved. At each target state x the only
sensors that are averaged are those whose detection functions are nonzero at x. This
leads to a “quilt-like” fused intensity that may have discontinuities at the boundaries
of sensor detection coverages.
The Bayesian derivation of (6.47) outlined above assumes that all the microtar-
gets of the PPP realizations can be associated to any of the M sensors. If, however,
any of these microtargets fall outside the coverage set of a sensor, then the assign-
ment is not valid. The way around the problem is to partition the target state space
appropriately.
The set
$$C = \cup_{\ell=1}^{M} C(\ell) \qquad (6.54)$$
contains points in target state space that are covered by at least one sensor. Partition
C into disjoint, nonoverlapping sets $B_\rho$ that comprise points covered by exactly $\rho$ sensors, $\rho = 1, \ldots, M$. Now partition $B_\rho$ into subsets $B_{\rho,1}, \ldots, B_{\rho,j_\rho}$ that are covered by different combinations of $\rho$ sensors. To simplify notation, denote the sets $B_{\rho j}$ by $\{A_\omega\}$, $\omega = 1, 2, \ldots, \Omega$. The sets are disjoint and their union is all of $C$:
$$C = \cup_{\omega=1}^{\Omega} A_\omega , \qquad A_i \cap A_j = \emptyset \ \text{for}\ i \neq j . \qquad (6.55)$$
No smaller number of sets satisfies (6.55) and also has the property that each set Aω
in the partition is covered by the same subset of sensors.
The overall multisensor intensity filter operates on the partition {Aω }. The assign-
ment assumptions of the multisensor intensity filter are satisfied in each of the sets
Aω . Thus, the overall multisensor filter is
$$f^{Fused}_{k|k}(x) = \frac{1}{|A_\omega|} \left( \sum_{\ell \in I(A_\omega)} L_k(\xi_k(\ell)\,|\,x\,;\,\ell) \right) f^{Fused}_{k|k-1}(x) , \qquad x \in A_\omega , \qquad (6.56)$$
where $I(A_\omega)$ are the indices of the sensors that contribute to the coverage of $A_\omega$, and $|A_\omega|$ is the number of sensors that do so.
The multisensor intensity filter is thus a kind of “patchwork” with the pieces
being the sets Aω of the partition. The variance of the multisensor filter is not the
same throughout C—the more sensors contribute to the coverage of a set in the
partition, the smaller the variance in that set.
A simple way to write the multisensor filter in the general case is
$$f^{Fused}_{k|k}(x) = \frac{\sum_{\ell=1}^{M} w_k(x\,;\,\ell)\, L_k(\xi_k(\ell)\,|\,x\,;\,\ell)}{\sum_{\ell=1}^{M} w_k(x\,;\,\ell)}\, f^{Fused}_{k|k-1}(x) , \qquad (6.57)$$
where
$$w_k(x\,;\,\ell) = \begin{cases} 1, & \text{if } P_k^D(x\,;\,\ell) > 0 , \\ 0, & \text{if } P_k^D(x\,;\,\ell) = 0 , \end{cases} \qquad (6.58)$$
7 Distributed Sensing

Two central problems of distributed sensor fields are sensor communication and
target detection. Both problems are discussed in this chapter. Sensor communication
problems deal with connectivity issues. Two aspects of connectivity are discussed,
one local and the other global. Local issues are addressed by the distribution of
distances from a target to the nearest sensor in the field. This distance relates to target
detection capability of the fielded sensors. A different but closely related notion is
the distribution of distances between sensors in the field. This distance relates to the
capability of two sensors to communicate with each other. These distributions are
obtained for sensors that are located at the points of a nonhomogeneous PPP. The
distinction between these distributions is highlighted by Slivnyak’s Theorem.
Communication issues for the sensor field as a whole are not addressed directly
by distance distributions. Global connectivity issues are discussed in terms of the
statistical properties of the sensor connectivity graph. This work relates concepts of
communication channel diversity to that of k-connectivity of a geometric random
graph. Communication diversity is one example of a threshold phenomenon, that is,
of a property of a random graph that abruptly takes hold as some appropriate scale
is increased. The recognition that threshold phenomena are ubiquitous in geometric
random graphs is an exciting area of new research.
Target detection by a fielded multisensor array also has local and global aspects.
The local issue deals with the ability of a single sensor to detect a target. The
target to nearest sensor distance distributions address this issue. The probability
that a fielded sensor array will detect a target is a global, or field-level, detection
issue. Field level detection capability is modeled as a coverage problem. Homoge-
neous sensor fields with isotropic detection, that is, fields of identical sensors with
the same detection range regardless of sensor location, are the easiest to analyze.
Extensions to fields with directional sensors with uniformly distributed orientations
are also straightforward. Further extensions to anisotropic problems with preferred
directional orientations are also tractable.
Detection problems bring ideas from stochastic geometry into play naturally.
There are six million stories in stochastic geometry. Only a few are told here. Further
details of the beautiful connections to integral geometry are banished (cruelly) to the
references. (Fortunately, [108] is a delight to read.)
A word of caution is in order. Sensor fields are often deployed in a systematic
manner that, while it may have random aspects, is not amenable to exact or approx-
imate PPP modeling. In such cases, the distance distributions and detection proba-
bilities of PPP models are low fidelity, or even inappropriate. Model limitations can
be mitigated in some applications by using the extensions and variations of PPPs
given in Chapter 8. For example, the points in realizations of the Matérn hard core
point process are not closer together than a specified distance, h. This gives a more
efficient, i.e., nonoverlapping, sensor spatial distribution than does a PPP; however,
they are also harder to analyze because the hard core constraint means that the points
are not i.i.d. conditioned on their number. In other words, the minimum separation
constraint reduces the spatial variability of the points. A different example is that
of a cluster process. These processes increase the spatial variability of the points.
Distance distributions for cluster point processes are given in [140]; see also [8].
For the moment, suppose the target is located at the origin. Let D1 ≤ D2 ≤ · · ·
denote the distances arranged in increasing order from the origin to the points of a
PPP with intensity $\lambda(s)$ on $S \subset \mathbb{R}^m$. The pdfs of these distances are easily computed. The approach here is based on the essentially geometric method of [132]. More accessible references are [16, 43, 52]. It is assumed that $\int_S \lambda(s)\, ds = \infty$, so there are infinitely many points in the realization.
The distances $D_n$ are random variables, since they depend on the realization of the PPP. Let $r_n$ be a realization of $D_n$, and let $r_0 = 0$. Let
Thus, for $a < b$, $S(b) - S(a)$ is the shell centered at the origin of $\mathbb{R}^m$ with inner and outer radii $a$ and $b$, respectively, intersected with $S$. Let $N_A$ denote the number of points of the PPP in $A \subset S$.
The event $0 < r_1 < r_2 < \cdots < r_n$ comprises several events: no points are in $S(r_1) - S(0)$, one point is in $S(r_1 + \Delta r_1) - S(r_1)$, no points are in $S(r_2) - S(r_1 + \Delta r_1)$, one point is in $S(r_2 + \Delta r_2) - S(r_2)$, etc. These shells are nested and not overlapped, so their probabilities are independent. Hence, setting $r_0 + \Delta r_0 \equiv 0$,
Let
$$\mu(r) = \int_{S(r)} \lambda(s)\, ds .$$
Then
$$\Pr\left[ N_{S(r_j) - S(r_{j-1} + \Delta r_{j-1})} = 0 \right] = e^{-\mu(r_j) + \mu(r_{j-1} + \Delta r_{j-1})} \qquad (7.3)$$
and
$$\Pr\left[ N_{S(r_j + \Delta r_j) - S(r_j)} = 1 \right] = e^{-\mu(r_j + \Delta r_j) + \mu(r_j)} \left[ \mu(r_j + \Delta r_j) - \mu(r_j) \right] .$$
Multiplying these probabilities over $j = 1, \ldots, n$ gives
$$\Pr\left[ 0 < r_1 < r_2 < \cdots < r_n \right] = e^{-\mu(r_n + \Delta r_n)} \prod_{j=1}^{n} \mu'(\tilde{r}_j)\, \Delta r_j .$$
Dividing by $\Delta r_1 \cdots \Delta r_n$ and taking the limits as $\Delta r_j \to 0$ gives the pdf of the ordered event $0 < r_1 < r_2 < \cdots < r_n$:
$$p(r_1, \ldots, r_n) = e^{-\mu(r_n)} \prod_{j=1}^{n} \mu'(r_j) . \qquad (7.5)$$
Integrating over $r_1, \ldots, r_{n-1}$ gives $p_{D_n}(r_n)$, the pdf of $D_n$. The required integrals are consistent with the event order $0 < r_1 < r_2 < \cdots < r_n$, so that
$$p_{D_n}(r_n) = \int_0^{r_n} \cdots \int_0^{r_2} p(r_1, \ldots, r_n)\, dr_1 \cdots dr_{n-1} .$$
The result is
$$p_{D_n}(r) = \frac{\mu^{n-1}(r)}{(n-1)!}\, e^{-\mu(r)}\, \frac{d}{dr}\mu(r) . \qquad (7.6)$$
and set
$$\mu(r\,;\,a) = \int_{S(r;\,a)} \lambda(s)\, ds . \qquad (7.8)$$
Proceeding as before leads to densities (7.6) that depend parametrically on the point
a for nonhomogeneous PPPs. This dependence can be quite dramatic. Consider,
e.g., a strongly peaked unimodal intensity. The distance distributions for a point a
far removed from the mode of the PPP intensity will have large means compared to
the means of the distance distributions obtained from a point b that is near the mode
of the intensity.
Example 7.1 Homogeneous PPPs. For homogeneous PPPs the densities (7.6) are
independent of the choice of a. In this case,
$$\mu(r) \equiv \lambda\, c_m\, r^m , \qquad (7.9)$$
where
$$c_m = \frac{\pi^{m/2}}{\Gamma\left(\frac{m}{2} + 1\right)} , \qquad (7.10)$$
and, from (7.6),
$$p_{D_n}(r) = \frac{(\lambda c_m)^n\, m\, r^{mn-1}}{(n-1)!}\, e^{-\lambda c_m r^m} . \qquad (7.11)$$
Example 7.2 Distances in the plane. In the plane, the pdf of the distance from a
target to the nearest sensor is
$$p_{D_1}(r) = 2\, \lambda\, \pi\, r\, e^{-\lambda \pi r^2} .$$
Its mean is $E[D_1] = 1/(2\sqrt{\lambda})$, and its variance is
$$\mathrm{Var}[D_1] = E\left[D_1^2\right] - E[D_1]^2 = \frac{4 - \pi}{4\, \lambda\, \pi} .$$
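The mean and variance in Example 7.2 are easy to check by simulation; the window size, intensity, and number of trials below are illustrative assumptions, and a large window stands in for the plane:

```python
import numpy as np

# Monte Carlo check of Example 7.2: nearest-sensor distance from the origin for a
# homogeneous planar PPP of intensity lam (illustrative values).
rng = np.random.default_rng(1)
lam, half_width, trials = 2.0, 10.0, 20_000
dists = []
for _ in range(trials):
    n = rng.poisson(lam * (2 * half_width) ** 2)          # number of sensors in the window
    pts = rng.uniform(-half_width, half_width, size=(n, 2))
    dists.append(np.min(np.linalg.norm(pts, axis=1)))
d = np.array(dists)
print(d.mean(), 1 / (2 * np.sqrt(lam)))                   # sample mean vs. 1/(2 sqrt(lambda))
print(d.var(), (4 - np.pi) / (4 * lam * np.pi))           # sample variance vs. (4 - pi)/(4 lambda pi)
```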
Then, replacing μ by μ̃ in (7.6) gives the pdf of the n-th ordered value of {h(·)}
evaluated at the points of the PPP realizations. Examples include:
• $h_\Sigma(x) = \left( x^T \Sigma\, x \right)^{1/2}$, where $\Sigma$ is a positive definite matrix, and
• $h_p(x) = \left( \sum_{i=1}^{m} |x_i|^p \right)^{1/p}$, $p > 0$.
For $0 < p < 1$, the function $h_p(x)$ is not a generalization of the concept of distance since it is not convex. The functions $h_p(x)$ are of great interest in compressive sensing.
Example 7.4 Connection to Extreme Value Distributions. For n = 1, the nearest
neighbor distribution in $\mathbb{R}^m$ is the integral of the pdf of the distance $D_1$:
$$F(x) = \int_0^x p_{D_1}(r)\, dr = \int_0^x m\, \lambda\, c_m\, r^{m-1}\, e^{-\lambda c_m r^m}\, dr = 1 - e^{-\lambda c_m x^m} , \qquad x > 0 . \qquad (7.13)$$
The rate of decay of this function increases very rapidly as the dimension m
increases, yet another symptom of the Curse of Dimensionality.
The function F(x) is the famous Weibull distribution. It is one of the three stable
laws of extreme value statistics that are allowed by the so-called Trinity Theorem.
The other two stable laws are named for Fréchet and Gumbel. The Trinity Theo-
rem holds whenever the cumulative distribution function of the underlying sample
distribution is continuous and has an inverse.
The nearest neighbor is the minimum of a large number of realizations of i.i.d.
random variables, so the Trinity Theorem must hold for nonhomogeneous intensi-
ties. The limiting distribution for nonhomogeneous PPPs is currently unavailable,
but it is likely that it is Weibull with an intensity equal to the local intensity λ(a)
at the point a from which nearest neighbor distances are centered. Higher order
correction terms would need to be determined by asymptotic methods.
The distances between a sensor and all the other sensors in a given realization pose a conceptually more difficult problem. For one thing there is no legitimate concept of
a reference sensor since different realizations comprise different points.
One way to approach the problem is to average over the sensors in a given real-
ization, and then to seek the expectation of this sum over all PPP realizations. For
example, the distance between a point in a realization and its nearest neighbor can
be averaged over all the points of the realization. Such sums are random sums, but
of a more specialized kind than are used in Campbell’s Theorem. Evaluating the
expectation of these random sums is the goal of Slivnyak’s Theorem.
The event space used here is identical to the event space $\mathcal{E}(S)$, but with the integer counts removed. An example of such a function is the nearest neighbor distance from an
arbitrary point x ∈ S to the points in ξ :
with $f_{NN}(x, \{\emptyset\}) = 0$. The pdf of $f_{NN}$ is given in the previous section when $\xi$ is a realization of a PPP.
For distributed sensors a more interesting example is the nearest neighbor dis-
tance from an arbitrary vertex x j in ξ to the other points in the same realization:
$$f_{VV}\left(x_j, \{x_1, \ldots, x_n\} \setminus x_j\right) \equiv f_{VV}\left(x_j, \{x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n\}\right) = \min_{\substack{1 \le i \le n \\ i \ne j}} \left\| x_i - x_j \right\| . \qquad (7.16)$$
The sum of the nearest neighbor distances of the vertices in the PPP realization $\xi$ is
$$F(\xi) = \sum_{j=1}^{n} f_{VV}\left(x_j, \{x_1, \ldots, x_n\} \setminus x_j\right) . \qquad (7.17)$$
The second step moves the integrals of the expectation inside the sum over j. The
last step follows from recognizing that the n-fold integrals over S are identical.
Now, for every n, the integral over xn is unchanged by replacing xn with the dummy
variable, x. Therefore,
$$E[F] = \int_S \left\{ \sum_{n=1}^{\infty} \frac{e^{-\int_S \lambda(x)\, dx}}{(n-1)!} \int_S \cdots \int_S f(x, \{x_1, \ldots, x_{n-1}\}) \prod_{j=1}^{n-1} \lambda(x_j)\, dx_1 \cdots dx_{n-1} \right\} \lambda(x)\, dx = \int_S E\left[ f(x, \{x_1, \ldots, x_n\}) \right] \lambda(x)\, dx ,$$
where the last step is merely shifting the index n ← n − 1 so that the infinite sum
starts at n = 0. This gives Slivnyak’s Theorem:
$$E\left[ \sum_{j=1}^{n} f\left(x_j, \{x_1, \ldots, x_n\} \setminus x_j\right) \right] = \int_S E\left[ f(x, \{x_1, \ldots, x_n\}) \right] \lambda(x)\, dx . \qquad (7.18)$$
The result relates two different kinds of averages. One is how a point in a realization
relates to other points in the very same realization. The other is how a given point
relates to the points of an arbitrary PPP realization. The latter average can be easier
to evaluate analytically.
where $N[\,\cdot\,]$ is a set function that counts the number of points in its argument. In terms of the function $f_r$, the degree of vertex $x_j$ is
$$f_r\left(x_j, \{x_1, \ldots, x_n\} \setminus x_j\right) , \qquad (7.20)$$
so the aggregate vertex degree, summed over all vertices in a given realization of
the graph G r , is the random sum
$$\sum_{j=1}^{n} f_r\left(x_j, \{x_1, \ldots, x_n\} \setminus x_j\right) . \qquad (7.21)$$
where S(r ) is given by (7.1) and S(r ; x) by (7.7). Hence, from (7.18), the expected
total vertex degree is
$$d_\Sigma(r) = \int_S \left( \int_{S(r)} \lambda(y + x)\, dy \right) \lambda(x)\, dx . \qquad (7.22)$$
The mean vertex degree is the ratio of $d_\Sigma(r)$ and the expected number of vertices. This ratio, written in a form that emphasizes convexity, is
$$d(r) = \int_{S(r)} \frac{\int_S \lambda(y + x)\, \lambda(x)\, dx}{\int_S \lambda(x)\, dx}\, dy . \qquad (7.24)$$
$$D_1^V = \frac{|S|\, \lambda \,/\, \left(2\sqrt{\lambda}\right)}{E[\text{Number of vertices}]} = \frac{1}{2\sqrt{\lambda}} . \qquad (7.25)$$
The average nearest vertex distance is identical to the mean nearest neighbor dis-
tance for homogeneous PPPs, a result that accords well with intuition.
The question considered in this section is “How many communication paths are
there between pairs of sensors in the network?” The question is answered with high
probability, a phrase that is commonplace in the theory of random graphs. As is
implicitly clear, the answer provided uses only the i.i.d. property of PPPs.
Several standard definitions from graph theory are useful. Let G denote a given
(nonrandom) graph.
• The minimum degree of any vertex in G is denoted by δ(G), that is,
Let |A| denote the area of the set A. The expected coverage is E[ |C| ], where the
expectation is over the disc/grain centers. The probability that any given point x in
R is covered by (contained in) C is defined by
$$\Pr[x \in C] = \frac{E\left[\,|C|\,\right]}{|R|} . \qquad (7.28)$$
The coordinate system is chosen so that the double integral over the point set G on
the right hand side of (7.29) does not change when the lines in L are subjected to a
rigid motion in the plane.1 It can be verified that the natural coordinate system with
this invariance property is the so-called normal form of the line:
where $\rho \ge 0$ is the distance from the origin to the foot of the perpendicular to the line $\ell$, and $\theta$, $0 \le \theta < 2\pi$, is the angle measured counterclockwise from the positive x-axis to the perpendicular line. The angle $\theta$ is not limited to $[0, \pi)$ because the perpendicular is a line segment with one endpoint always at the origin.
See Fig. 7.1. The double integral over the point set G in (7.29) has units of length.
1 The importance of invariance was not recognized in early discussions of the concept of a random
line. Bertrand’s paradox (1888) involves three different answers to a question about the length of a
random chord of a circle. The history is reviewed in [61, Chapter 1].
Fig. 7.1 Depiction of coordinate system and the line (ρ, θ). The support function p(θ) and sup-
port line for a convex set K is also shown. The origin O is any specified point interior to K . The
line (ρ, θ) intersects K whenever 0 ≤ ρ ≤ p(θ). The thickness T (θ) is defined later in (7.45)
Let K be a bounded convex set in the plane, and take as the origin any point
interior to K . The double integral over all lines that intersect K is very simple:
$$\int_{G \cap K \neq \emptyset} d\rho\, d\theta = L , \qquad (7.31)$$
where p(θ ) is the distance from the origin at angle θ to the tangent line to K . The
function p(θ ) is called the support function of K , and the tangent line is called a
support line. The support function and line are depicted in Fig. 7.1. It can be shown
that $p(\theta) + p''(\theta) > 0$ and that the infinitesimal of arclength of $K$ is
$$ds = \left( p(\theta) + p''(\theta) \right) d\theta . \qquad (7.33)$$
Thus,
$$L = \int_0^L ds = \int_0^{2\pi} \left( p(\theta) + p''(\theta) \right) d\theta = \int_0^{2\pi} p(\theta)\, d\theta , \qquad (7.34)$$
where the integral over $p''(\theta)$ is zero because $p'(\theta)$ is periodic. Comparing (7.32) and (7.34) gives (7.31).
$$\Pr\left[\, G \cap K_1 \neq \emptyset \;\middle|\; G \cap K \neq \emptyset \,\right] = \frac{|K_1|}{|K|} . \qquad (7.36)$$
$$E[\text{chord length}] = \frac{\pi\, |K|}{L} , \qquad (7.37)$$
where $L$ is the perimeter of $K$. To see this, let the line $\ell(\rho, \theta)$ intersect $K$. Denote by $\sigma(\rho, \theta)$ the length of the line segment, or chord, subtended by $K$. The mean chord length of lines that intersect $K$ is the ratio
$$E[\text{chord length}] = \frac{\int_{G \cap K \neq \emptyset} \sigma(\rho, \theta)\, d\rho\, d\theta}{\int_{G \cap K \neq \emptyset} d\rho\, d\theta} .$$
For any angle θ , the infinitesimal σ (ρ, θ ) dρ is an element of area for K . One part
of K is covered by area elements with angle θ , and the other part is covered by
elements with angle θ + π . The area elements are not overlapped, so integrating
over all area elements gives the numerator as
$$\int_0^\pi \left[ \int_0^{p(\theta)} \sigma(\rho, \theta)\, d\rho + \int_0^{p(\theta + \pi)} \sigma(\rho, \theta + \pi)\, d\rho \right] d\theta = \int_0^\pi |K|\, d\theta = \pi\, |K| . \qquad (7.38)$$
Hence
$$E[\text{chord length}] = \frac{\pi\, |K|}{\int_{G \cap K \neq \emptyset} d\rho\, d\theta} , \qquad (7.39)$$
and (7.31) then gives (7.37).
In the general case, chord length is replaced by the sum of the subtended line seg-
ments [2].
A field of $n$ sensors with congruent detection regions, denoted $K_1, \ldots, K_n$, is dropped at random so that for every $j$ the region $K_j$ intersects the surveillance region $K_0$. The orientation of the sensors is also random, an assumption that matters only when the sensor detection capability is not circular.
Let $I_k$ be the part of R that is covered by exactly $k$ sensors. Let $F_0$ and $L_0$ denote the area and perimeter of $K_0$, respectively. Similarly, let $F$ and $L$ denote the area and perimeter of the sets $K_j$. The expected value of the area of $I_k$ is
n (2 π F)k (2 π F0 + L 0 L)n − k F0
E [ |Ik | ] = . (7.40)
k (2 π (F + F0 ) + L 0 L)n
It is no accident that this expression resembles the binomial theorem; see [108, pp. 98–99]. To see this result it is necessary to introduce the concept of mixed areas (also called quermassintegrals), first studied by Minkowski (c. 1903). The product L₀L in (7.40) is a mixed area. This endeavor is left to the references.
Dividing (7.40) by the area of the surveillance region gives the probability that a
target is detected by precisely k sensors:
Pr[ k-coverage ] = E[ |I_k| ] / F₀
                = C(n, k) (2πF)^k (2πF₀ + L₀L)^{n−k} / ( 2π(F + F₀) + L₀L )^n .   (7.41)
For k = 0 the result gives the probability that a target is not detected by even one
sensor. The set I0 is called the vacancy, and it is of much interest in modern coverage
theory [44].
Now suppose the number of sensors n and the surveillance area F₀ grow together so that the sensor density λ is held fixed:

n / F₀ = λ   ⇔   F / F₀ = λF / n .

Regardless of the shape of K₀, the ratio of its perimeter to its area, L₀/F₀, goes to zero in this limit. Manipulating (7.41) and taking the limit as n → ∞ gives
Pr[ k-coverage ] = (λF)^k e^{−λF} / k! .   (7.42)
To see this, note that

2πF / ( 2πF + 2πF₀ + L₀L ) = κ λF / n ,   (7.43)

where

κ = ( λF/n + 1 + L₀L/(2πF₀) )^{−1} .
In the limit as n → ∞, the middle terms both go to one, and the last term goes to
the exponential. This establishes (7.42).
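A simple way to see (7.42) numerically is to scatter sensors with density λ over a region much larger than one detection disk and count how many disks cover randomly chosen probe points; the counts should be approximately Poisson with mean λF. The sketch below (Python; all numerical values are assumptions chosen only for illustration) does exactly this, keeping the probe points away from the boundary to suppress edge effects.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
lam, r, A = 2.0, 0.5, 10.0                 # sensor density, detection radius, square side
F = math.pi * r**2                         # area of one sensor's detection disk

n_sensors = rng.poisson(lam * A * A)
sensors = rng.uniform(0.0, A, size=(n_sensors, 2))

# Probe points kept a distance r from the boundary to avoid edge effects.
probes = rng.uniform(r, A - r, size=(4000, 2))
d2 = ((probes[:, None, :] - sensors[None, :, :])**2).sum(axis=2)
counts = (d2 <= r * r).sum(axis=1)         # number of sensors covering each probe point

for k in range(5):
    poisson_pk = (lam * F)**k * math.exp(-lam * F) / math.factorial(k)
    print(k, round((counts == k).mean(), 4), round(poisson_pk, 4))
```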
Fig. 7.2 Depiction of the pill-shaped detection region of a sensor with a circular detection region with detection range R_D drifting for a known time period Δt = t − t₀ and known constant velocity v. The point P(t₀) is the initial sensor location, and P(t) is its location at time t
Fig. 7.3 Depiction of the varying coverage of 10 drifting sensors' pill-shaped detection regions. The length of the rectangular section of the pill is proportional to drift speed and duration of drift; the width is the detection range of the sensor. (a) All sensors drift to the east. (b) If, instead, all sensors drift to the southeast, the total coverage and sensor overlap change. (c) All 10 drift to the east on average, but each with a slightly different angle. (d) Random orientations as required by the coverage theory of Section 7.3.1
mean drift direction. As depicted in Fig. 7.3d, pills scattered randomly in angle and
location do not exhibit a preferred orientation. The coverage results of the previous
section are only valid for the situation depicted in Fig. 7.3d. Therefore, they are either inappropriate or, at best, an approximation that is reasonable only if the anisotropy is small, that is, if the pill-shaped regions are nearly circular because v ≈ 0.
A different kind of example of the need for anisotropy arises in barrier problems.
In these problems the convex surveillance set K is typically long and narrow, and
long traverses are unrealistic. These infrequent, but long, traverses make the average
chord length longer than may be reasonable in some applications. For example, the
average chord length for a rectangular barrier region K of width ℓ and length D is, by Crofton's Theorem (7.37),

E[chord length] = π ℓ D / (2ℓ + 2D) ≈ π ℓ / 2 ≈ 1.57 ℓ ,   (7.44)

where the approximation holds when D ≫ ℓ.
Long transit lengths, while infrequent, nonetheless make the average transit length
significantly larger than the shortest possible transit ℓ. If target trajectories are more
likely to be roughly perpendicular to the long side of K , the isotropic model is of low
fidelity and needs modification. Given a pdf of the angle θ of transit, an expression
for the expected transit length can be developed using the methods discussed below.
Anisotropic problems are studied by Dufour [25] in R^κ, κ = 2, 3. Much of this
work seems to be unpublished, but accessible versions of some of the results are
given in [120, pp. 55–74] and also [108, p. 104]. The former discusses anisotropy
generally and gives an interesting application of anisotropy to motor vehicle traffic
flow analysis. Most if not all of Dufour’s results are analytically explicit in the sense
that they are amenable to numerical calculation. Of the many results in [25], two are
discussed here to illustrate how anisotropy enters the problem.
The distribution dF(θ) of the orientation angle is assumed known. This notation accommodates diverse angle distributions, including the case when there is only one possible orientation. If F is differentiable, then F′(θ) is the pdf of the orientation angle.
Let K ⊥ (θ ) denote the orthogonal projection of a bounded set K ⊂ R2 onto a line
through the origin with angle θ . The set K ⊥ (θ ) is an interval on the line if K is
connected (i.e., if K is not the union of two or more disjoint sets). The thickness of
K is defined by
T(θ) = ∫_{K⊥(θ)} dρ .   (7.45)
Pr[ line segment intersects K₁ | line segment intersects K₀ ] = E[T₁] / E[T₀] ,

where T₀ and T₁ denote the thicknesses of K₀ and K₁.
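The thickness (7.45) of a convex set is the length of its orthogonal projection, so the expectations E[T₀] and E[T₁] reduce to one-dimensional integrals over the angle distribution dF(θ). The sketch below (Python; the rectangle dimensions and the orientation pdf are assumptions chosen only to make the calculation concrete) evaluates the ratio for two nested axis-aligned rectangles.

```python
import numpy as np

def thickness_rect(theta, a, b):
    # Thickness T(theta) of an axis-aligned a-by-b rectangle: the length of
    # its orthogonal projection onto a line at angle theta, as in (7.45).
    return a * np.abs(np.cos(theta)) + b * np.abs(np.sin(theta))

theta = np.linspace(0.0, np.pi, 2000, endpoint=False)
dtheta = theta[1] - theta[0]

# Assumed anisotropic orientation pdf concentrated near theta = pi/2.
pdf = np.sin(theta) ** 4
pdf /= pdf.sum() * dtheta

ET0 = (thickness_rect(theta, 10.0, 1.0) * pdf).sum() * dtheta   # barrier K0
ET1 = (thickness_rect(theta,  4.0, 1.0) * pdf).sum() * dtheta   # subset K1
print("E[T0] =", ET0, " E[T1] =", ET1, " ratio =", ET1 / ET0)
```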
7.4 Stereology
Stereology is most often defined as the study of three dimensional objects by means
of plane cross-sections through the object. A 3-D object in R3 can be studied by
2-D and 1-D sections, or by combination of them. More generally, it is the study
of the structure of objects in Rm that are known only via measurements of lower
dimensional linear sections. Stereology is of great practical importance in many
fields ranging from geology (e.g., core samples) to biology (e.g., microscope slide
smears, tissue sections, needle biopsies).
Tomography might be seen as a specialized form of stereology because it
involves (nondestructive) reconstruction of 3-D properties from thin 2-D sections.
However, tomography computes as many plane sections as are needed to image an
object fully, and is therefore typically data rich compared to stereology. Only data
starved tomographic applications (e.g., acoustic tomography of the ocean volume)
might be considered comparable to the problems of stereology.
Distributed multiple sensor problems are not currently thought of as problems of
stereology. The field of view (FOV) of any given sensor in a distributed sensor field
is a limited cross-section of the medium in which the sensors reside. Interpreting
this limited data to understand and estimate statistical properties of the medium as
a whole is a stereological interpretation of the problem. Admittedly, the connection
seems tenuous at best, but the potential of stereology to contribute to understanding
the problems involved is real enough to justify inclusion of a brief mention of the
topic here.
Problems in stereology are difficult, but sometimes have surprising and pleasing
solutions. An excellent example of this is Delesse’s principle from mineralogy, and
it dates to 1848. For concreteness, suppose a rock sample is sliced by a saw and the
cross-section polished. The cross-section is examined for the presence of a specified
mineral. The ratio of the area of the cross-section that intersected the mineral to the
total area of the cross-section is called the area fraction, denoted A_A. Similarly, the ratio of the total volume of the mineral to the total volume of the rock sample is the volume fraction, denoted by V_V. Assuming the rock sample is a representative specimen from a much larger homogeneous volume, Delesse's principle says that the expected volume fraction equals the expected area fraction:

A_A = V_V .

The expectation is over all possible two-dimensional slices through the rock. The area fraction is an unbiased estimator of the volume fraction. The variance of the estimate is much harder to evaluate. Delesse's principle is very practical since the area fractions of several rock samples are much easier to measure than volume fractions.
Rosiwal extended the result in 1898 to line fractions. Draw an equispaced grid of parallel lines on the polished rock surface. Measure the length of the line segments that overlay the mineral of interest, and divide by the total length of the line segments. This ratio is the line fraction L_L. Rosiwal's principle says that

L_L = A_A = V_V ,

provided, as before, the rock is a sample from a larger homogeneous volume. Line fractions are even easier to measure than area fractions.
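Delesse's principle is easy to verify numerically for a simple geometry. In the sketch below (Python; the geometry is an assumption made for illustration), the "mineral" is a ball of radius R centered in a unit cube, and the cube is sliced by random planes z = const. The average area fraction converges to the volume fraction (4/3)πR³.

```python
import numpy as np

rng = np.random.default_rng(2)
R, z0 = 0.3, 0.5                      # mineral: a ball of radius R centered in a unit cube
V_V = 4.0 / 3.0 * np.pi * R**3        # true volume fraction (cube volume = 1)

# Delesse: slice the cube with random planes z = const and average the area fraction.
z = rng.uniform(0.0, 1.0, 100_000)
h2 = R**2 - (z - z0)**2
A_A = np.where(h2 > 0.0, np.pi * h2, 0.0)   # mineral area fraction in each slice

print("mean area fraction:", A_A.mean())
print("volume fraction   :", V_V)
```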
Similar applications in microscopy are complicated by the simple fact that even
the sharpest knife may not cut a cell, but simply push it aside, unless the tissue is
frozen first. Freezing is not always an option. There are other practical issues as
well.
These methods do not provide estimates of the number of objects in a vol-
ume. In the biological application, for example, the number of cells cut by a two-
dimensional tissue slice does not in general indicate the number of cells in the tissue
volume.
Modern stereology is a highly interdisciplinary field with many connections to
stochastic geometry. An excellent modern book on the subject is [3], which also
gives references to the original papers of Delesse and Rosiwal. The relationships
between stereology and integral geometry, convexity, and Crofton-style formulae
are discussed in [2] and the references cited therein. An informative overview is
also given in [44, Section 1.9].
Example 7.6 Connections. An old puzzle related to Delesse’s principle goes as fol-
lows: a region in the plane is such that every line through the origin intersects it in
a line segment of length 2. Is the region a unit radius circle centered at the origin?
The answer is no, and the cardioid r = 1 − cos θ is a nice counter-example: the chord along the line through the origin at angle θ has length (1 − cos θ) + (1 − cos(θ + π)) = 2. If, however, the region is centrally symmetric, the answer is yes. Extended to three
dimensions, the problem is: an object in R3 is such that the area of its intersection
with every plane through the origin is π . Is the object the unit sphere centered at the
origin? The answer [32] in this case is the same as in the plane, but the solution is
also much deeper mathematically. Counter-examples are found by using spherical
harmonics. This problem is a special case of the Funk-Hecke theorem [29, 109] for
finding the spherical harmonic expansion of a function knowing only its integrals
on all the great circles. While seeking the analog of this theorem in the plane (given
integrals of a function on all straight lines), Radon found in 1917 what is now called
the Radon transform, which is recognized as the mathematical basis of tomography.
Part III
Beyond the Poisson Point Process
Chapter 8
A Profusion of Point Processes
This chapter discusses only a few of the bewilderingly large variety of point pro-
cesses that are not PPPs and are also useful in applications. The MCMC revolution,
as it has been called, will undoubtedly increase the variety and number of non-PPPs
that find successful application in real world problems.
Many useful point processes are built upon a PPP foundation, which assists in
their simulation and aids in their theoretical development. Marked PPPs are dis-
cussed first. Despite appearances, marked PPPs turn out in the end to be equivalent
to PPPs.
Hard core processes are discussed next. They have less spatial variability than
PPPs because points are separated by a specified minimum distance. Loosely
speaking, the points are “mutually repelled” from each other. The Matérn hard core
processes use dependent thinning to enforce point separation.
Cluster processes are presented next. They have greater spatial variability than
PPPs because, as the name suggests, points tend to gather closely together—points
are in a loose sense “mutually attractive.” Poisson and Neyman-Scott cluster pro-
cesses are discussed. A special case of the latter is the Matérn cluster process, which
uses a resampling procedure to encourage point proximity.
The Cox, or doubly stochastic, processes are discussed next. These are complex
processes that push the concept of the ensemble to include the intensity function. In
other words, the intensity function is itself a random variable. It is shown that a Cox
process whose intensity function is a random sum is a Neyman-Scott process.
Many useful point processes are not directly related to PPPs. One of these—the
Gibbs, or Markov, point process—is discussed briefly here.
• Let ξ = (m, {x₁, ..., x_m}) be a realization of a PPP with intensity λ(s) on the space S. Given m, the points x_j are i.i.d. samples of the random variable X whose pdf is p_X(x) = λ(x) / ∫_S λ(s) ds.
• The mark is a random variable U on a mark space U with conditional pdf p_{U|X}(u | x). Given the PPP realization ξ, generate the marks {u₁, ..., u_m} ⊂ U as independent realizations of p_{U|X}( · | x_j), j = 1, ..., m.
• Pair the points with their corresponding marks to obtain the marked realization (m, {(x₁, u₁), ..., (x_m, u_m)}). A simulation sketch is given below.
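The following minimal simulation sketch (Python) follows the three steps above. The intensity λ(s) on S = [0, 10] and the Gaussian conditional mark pdf are assumptions chosen only to make the example concrete.

```python
import numpy as np

rng = np.random.default_rng(3)

# Step 1: realization of a PPP on S = [0, 10] with assumed intensity lambda(s) = 2 + s/5,
# generated by thinning a homogeneous PPP with intensity lam_max.
lam = lambda s: 2.0 + s / 5.0
lam_max, S_len = 4.0, 10.0
m = rng.poisson(lam_max * S_len)
cand = rng.uniform(0.0, S_len, m)
x = cand[rng.uniform(0.0, lam_max, m) < lam(cand)]   # retained points are the PPP points

# Step 2: attach a mark to each point; the assumed conditional mark pdf
# p_{U|X}(u | x) is Gaussian with mean x and unit variance.
u = x + rng.standard_normal(x.size)

# Step 3: pair points with marks.
marked = list(zip(x, u))
print(len(marked), "marked points; first few:", marked[:3])
```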
The mark space U can be very general, but it is typically either a discrete set or
a subset of the Euclidean space Rκ . An example is the MMIF of Section 6.2.2 and
Appendix E. In this application, the targets are the points of a PPP, the measurements
are the marks, and the measurement likelihood function is the conditional mark pdf.
In the simplest case, the marks are independent of ξ , so that
pU | X (u | x j ) ≡ pU (u) .
The marks in this case are independent of the locations of the points in ξ as well as
the number m. This kind of marked PPP is called a compound PPP by Snyder, and
its theory is well developed in [118, Chapter 3] and [119, Chapter 4].
Several compound PPPs are used earlier in the book, but without comment. The
“Coloring Theorem” of Example 2.7 in Chapter 2 is an example of a marked PPP.
The marks are the colors assigned to the points. In Chapter 3, marks are intro-
duced as part of the complete data; that is, the marks are the missing data of the
EM method. A specific example is the complete data (3.12) for estimating super-
posed PPPs, which is a realization of a compound PPP in which the mark space is
U = {1, . . . , L}.
Yet another example, one that could have been discussed as a marked PPP but
was not, is the intensity filter of Chapter 6 when the target PPP is split during the
prediction step into the detected and undetected target PPPs. In this case, the detec-
tion process is equivalent to a marking procedure with marks U = {0, 1}, where
zero/one denotes target non-detection/detection.
Let f(x, u) be a real valued function on S × U, and define the sum

F = Σ_{j=1}^{m} f(x_j, u_j) .

Then

E[ e^{−F} ] = exp( ∫_S ∫_U ( e^{−f(x, u)} − 1 ) p_{U|X}(u | x) λ(x) dx du )
           = exp( ∫_{S × U} ( e^{−f(x, u)} − 1 ) μ(x, u) dx du ) .   (8.3)

Applying Campbell's Theorem gives the expectation of the expression (8.4) with respect to the PPP ξ with intensity function λ(x):

E[ e^{−F} ] = E_Ξ[ e^{−Σ_{j=1}^{M} g(X_j)} ]
           = exp( ∫_S ( e^{−g(x)} − 1 ) λ(x) dx ) .

Substituting e^{−g(x)} = ∫_U e^{−f(x, u)} p_{U|X}(u | x) du gives

E[ e^{−F} ] = exp( ∫_S ( ∫_U e^{−f(x, u)} p_{U|X}(u | x) du − 1 ) λ(x) dx )
           = exp( ∫_S ∫_U ( e^{−f(x, u)} − 1 ) p_{U|X}(u | x) du λ(x) dx ) .
A filtered Poisson process is the output of a function that is similar to (2.30). Let
h(x, y ; u) be a real valued function defined for all x, y ∈ S and u ∈ U. Given
the realization ξ ≡ (m, {(x₁, u₁), ..., (x_m, u_m)}) of the marked PPP, define the random sum

F(y) = 0 ,                             if m = 0 ,
F(y) = Σ_{j=1}^{m} h(y, x_j ; u_j) ,   if m ≥ 1 .   (8.6)
pU (u) ≡ pU (u ; θ ) ,
if the sphere of radius h centered at x j contains no points with marks smaller than
u j . Points are deleted from the realization only after all points are determined to
be either thinned or retained. This kind of thinning is not Bernoulli independent
thinning.
The intensity function of the resulting Matérn hard core process is

λ_Matérn(II) = ( 1 − e^{−λ₀ c_m h^m} ) / ( c_m h^m ) ,   (8.9)

where c_m is the volume of the unit radius sphere in R^m (see (7.9)). To see this, follow the method of [123]: the points with marks smaller than t form a thinned PPP whose intensity is λ₀ t. Hence, r(t) = exp(−λ₀ c_m h^m t) is the probability that this thinned PPP has no points in the sphere of radius h. Equivalently, r(t) is the probability that a point at x with mark t is retained. Hence,

p = ∫_0^1 r(t) dt = ( 1 − e^{−λ₀ c_m h^m} ) / ( λ₀ c_m h^m )
is the probability that the point x is retained. The intensity λMatérn(II) is the product
of p and the PPP intensity λ0 .
The intensity of the Matérn process increases with increasing initial intensity λ₀. From (8.9), the limiting intensity is

λ_MaxMatérn(II) = lim_{λ₀ → ∞} λ_Matérn(II) = 1 / ( c_m h^m ) .

The limiting intensity is one point per sphere of radius h. For m = 2 and h = 1,

λ_MaxMatérn(II) = 1/π ≈ 0.318 .

For comparison, regular hexagonal packing gives the maximum possible intensity of one point per hexagon (inscribed in the unit circle), or λ_Hex = 2√3/9 ≈ 0.385 with h = 1.
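The retention argument above is easy to check by simulation. The sketch below (Python, with assumed parameter values) applies Matérn type II thinning on a square with toroidal distances and compares the empirical intensity of the retained points with (8.9) for m = 2.

```python
import numpy as np

rng = np.random.default_rng(4)
lam0, h, A = 5.0, 0.3, 10.0                # parent intensity, hard-core distance, square side

n = rng.poisson(lam0 * A * A)
pts = rng.uniform(0.0, A, size=(n, 2))
marks = rng.uniform(0.0, 1.0, n)           # i.i.d. marks used for dependent thinning

# Toroidal distances suppress edge effects on the square.
d = np.abs(pts[:, None, :] - pts[None, :, :])
d = np.minimum(d, A - d)
dist = np.sqrt((d**2).sum(axis=2))
close = (dist < h) & ~np.eye(n, dtype=bool)

# Matern type II rule: keep a point only if no point within distance h has a smaller mark.
keep = np.array([not np.any(close[i] & (marks < marks[i])) for i in range(n)])

lam_emp = keep.sum() / (A * A)
lam_thy = (1.0 - np.exp(-lam0 * np.pi * h**2)) / (np.pi * h**2)   # Eq. (8.9) with m = 2
print("empirical intensity:", lam_emp, "  theory (8.9):", lam_thy)
```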
Example 8.2 Matérn Method I. This method starts with the same realization ξ as in the previous example, but more aggressively thins the points of the PPP. Here, pairs of points in the realization ξ that are separated by a distance h or less from each other are removed from ξ. Pairs of points are removed only after first identifying all pairs to be removed. The intensity function of the resulting hard core process is

λ_Matérn(I) = λ₀ e^{−λ₀ c_m h^m} .   (8.10)

To see this it is only necessary to see that the probability that a given point is retained is e^{−λ₀ c_m h^m}, and to multiply this probability by λ₀.
Variable size hard core models are defined by specifying a mark that denotes, say,
the radius of the hard core. Soft core models are defined in [123, p. 163]. Hard core
processes are very difficult to analyze theoretically, so in most applications numerical modeling and simulation are necessary in practice. As pointed out by Diaconis
[24], MCMC methods play an important role in the modern developments of these
problems.
ξ_C = ∪_{j=1}^{n} ξ(j)
is a realization of the cluster process ΞC . The parent points x j are not in the realiza-
tion ξC . (In some applications, parent points are retained [103].) Cluster processes
are very general, and it is desirable to specialize them further.
Poisson cluster processes are cluster processes in which the parent process is a non-
homogeneous PPP. The family of daughter processes remain general finite point
processes.
The Neyman-Scott process is a Poisson cluster process in which the daughter
processes take a special form; realizations are generated via the following proce-
dure. The first step yields a realization ξ = (n, {x1 , . . . , xn }) of the parent PPP
Ξ whose intensity function is λ(x), x ∈ S. The second step draws i.i.d. samples
k j , j = 1, . . . , n, from a discrete random variable K on the nonnegative integers
with specified probabilities

p_k ≡ Pr[ K = k ] ,   k = 0, 1, 2, ... .   (8.11)

This step is equivalent to that of a marked PPP. Let h(x), x ∈ S, be a specified pdf. The final step draws k_j i.i.d. samples from the pdf h(x). Denote these samples by x_{ij}, i = 1, ..., k_j. The i-th child of x_j is x_j + x_{ij}, which therefore has the shifted pdf h(x − x_j). Let k = Σ_{j=1}^{n} k_j. The realization of the Neyman-Scott cluster process is

ξ_C = ( k, { x_j + x_{ij} : i = 1, ..., k_j and j = 1, ..., n } ) ∈ E(S) .   (8.12)
The defining parameters of the Neyman-Scott cluster process are the PPP parent
intensity function λ(x), the distribution of the number K , and the daughter pdf h(x).
The probability generating functional of the general Neyman-Scott process is given
in the next section.
A special case of the Neyman-Scott process is the Matérn cluster process. In this case, the clusters are homogeneous PPPs with intensity λ₀ on the sphere of radius R. The discrete random variable K in the simulation is in this case Poisson distributed with parameter c_κ R^κ λ₀, where c_κ is the volume of the unit sphere in R^κ. The points x_{ij} of the simulation are i.i.d. and uniformly distributed in the sphere of radius R. The defining parameters of the Matérn cluster process are the PPP intensity function λ(x), the homogeneous PPP intensity λ₀, and the radius R.
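A realization of a Matérn cluster process can be generated directly from the recipe above. The sketch below (Python; the parent intensity, cluster intensity, radius, and window are assumed values) places a Poisson number of daughters uniformly in a disk of radius R about each parent of a homogeneous parent PPP; only the daughters are retained.

```python
import numpy as np

rng = np.random.default_rng(5)
lam_parent, lam0, R, A = 0.2, 8.0, 0.5, 10.0   # parent intensity, cluster intensity, radius, square side

n_par = rng.poisson(lam_parent * A * A)
parents = rng.uniform(0.0, A, size=(n_par, 2))

daughters = []
for p in parents:
    k = rng.poisson(lam0 * np.pi * R**2)         # Poisson number of daughters per parent
    r = R * np.sqrt(rng.uniform(0.0, 1.0, k))    # uniform in the disk of radius R
    ang = rng.uniform(0.0, 2.0 * np.pi, k)
    daughters.append(p + np.column_stack((r * np.cos(ang), r * np.sin(ang))))

pts = np.vstack(daughters) if daughters else np.empty((0, 2))
print(n_par, "parents,", len(pts), "daughter points in the realization")
```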
Ξ_parent = ( N, X|N ) ,   where   X|N = { X₁, ..., X_N }
are the conditionally i.i.d. points of Ξparent . Denote the number of daughter points
generated by the parent point X j by the discrete random variable K j . By definition
of the Neyman-Scott process, the variables K 1 , . . . , K N are i.i.d. with pdf given by
(8.11). Let η(t) denote the probability generating function of the discrete random
variable K :
η(t) = Σ_{k=0}^{∞} p_k t^k .
The daughter points of the parent point X j are the random variables
X_{1j}, ..., X_{K_j, j} .
By definition of the Neyman-Scott process, they are i.i.d. with conditional pdf
p_{X_{ij} | X_j}( x | X_j = x_j ) = h( x − x_j ) ,   i = 1, ..., K_j ,   (8.14)

where the cluster pdf h(x) is given. The characteristic functional of Ξ evaluated for a given function f is, by definition,

G_Ξ( f ) = E_Ξ[ Π_{j=1}^{N} Π_{i=1}^{K_j} f( X_{ij} ) ] .
The random variables X j of the parent points do not appear in the product because
they are not points of the output process. The expectation is evaluated in the nested
form:
G_Ξ( f ) = E_{Ξ_parent}[ Π_{j=1}^{N} E_{K_j | X_j}[ E_{X_{1j} X_{2j} ··· X_{K_j, j} | K_j, X_j}[ Π_{i=1}^{K_j} f( X_{ij} ) ] ] ] .   (8.15)

Because the daughter points are conditionally i.i.d., the innermost expectation factors into the product of expectations:

E_{X_{1j} X_{2j} ··· X_{K_j, j} | K_j, X_j}[ Π_{i=1}^{K_j} f( X_{ij} ) ] = Π_{i=1}^{K_j} E_{X_{ij} | X_j}[ f( X_{ij} ) ]
   = Π_{i=1}^{K_j} ∫_S f(x) h( x − X_j ) dx = Π_{i=1}^{K_j} ∫_S f( x + X_j ) h(x) dx
   = ( ∫_S f( x + X_j ) h(x) dx )^{K_j} .   (8.16)

Averaging over K_j then gives

E_{K_j | X_j}[ · ] = Σ_{k_j = 0}^{∞} ( ∫_S f( x + X_j ) h(x) dx )^{k_j} p_{k_j}
                  = η( ∫_S f( x + X_j ) h(x) dx ) .   (8.17)
Using the generating functional of the parent PPP (see Eqn. (2.53)) gives the char-
acteristic functional of the Neyman-Scott cluster process as
G_Ξ( f ) = exp( ∫_S [ η( ∫_S f( x + s ) h(x) dx ) − 1 ] λ(s) ds ) .   (8.18)
A Cox process is a PPP in which the intensity function is randomly selected from a
well defined space of possible intensities, say Λ. Thus, realizations of a Cox process
are the result of sampling first from the space Λ to obtain the intensity function λ(x),
and then finding a realization of the PPP with intensity λ(x) via the two step pro-
cedure of Section 2.3. The idea is that the intensity space Λ characterizes possible
environments in which a PPP might be a good model—provided the right intensity
is used.
To set ideas, consider a homogeneous PPP whose intensity λ is chosen so that the mean number of points μ = λ |R| in the window R is drawn from an exponential pdf:

p_M(μ) = (1/μ₀) exp( −μ/μ₀ ) ,   (8.19)

where μ₀ > 0 is the specified mean.
The one dimensional version of this example is used in [83] to model neural spike
trains. Carrying the general notion over to PPPs gives the Cox process.
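Mixing the Poisson count over the exponential pdf (8.19) produces a geometric distribution for the number of points in R, a fact that is easy to confirm by simulation. The sketch below (Python; μ₀ and the number of trials are assumed) draws the random mean from (8.19) and then the Poisson count.

```python
import numpy as np

rng = np.random.default_rng(6)
mu0 = 3.0
mu = rng.exponential(mu0, 200_000)   # random mean number of points, pdf (8.19)
N = rng.poisson(mu)                  # number of points of the Cox process in R

p = mu0 / (1.0 + mu0)
for n in range(5):
    # Mixed Poisson counts follow a geometric law: Pr[N = n] = (1 - p) p^n.
    print(n, round((N == n).mean(), 4), round((1.0 - p) * p**n, 4))
```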
Λ( x ; N, X|N ) = μ Σ_{j=1}^{N} h( x − X_j ) ,   (8.21)
where μ > 0 is a known scale constant, h(x) is a specified pdf on S, and where
the points X = {X 1 , . . . , X N } are the random points of a realization of a PPP
with intensity function λ(x) on S. Bartlett (1964) showed that this Cox process is a
Neyman-Scott cluster process with a Poisson distributed number of daughter points,
i.e., the random variable K in (8.11) is Poisson distributed.
To see Bartlett’s result, evaluate the characteristic functional of the Cox process
Ξ . Let {U1 , . . . , U N } denote N points of a realization of Ξ . The required expecta-
tion is equivalent to the nested expectations
G_Ξ( f ) = E_Λ[ E_{Ξ|Λ}[ Π_{j=1}^{N} f( U_j ) ] ] .
The inner expectation is the characteristic functional of the PPP with intensity
Λ(x ; N , X | N ); explicitly,
E_{Ξ|Λ}[ · ] = exp( ∫_S ( f(u) − 1 ) Λ( u ; N, X|N ) du )
            = exp( ∫_S ( f(u) − 1 ) μ Σ_{j=1}^{N} h( u − X_j ) du )
            = Π_{j=1}^{N} Ψ( X_j ) ,
where
Ψ( X_j ) = exp( μ [ ∫_S f( u + X_j ) h(u) du − 1 ] ) .
Since the points X j are the points of a PPP with intensity λ(x), the characteristic
functional of Ξ is
G_Ξ( f ) = E_Λ[ Π_{j=1}^{N} Ψ( X_j ) ]
        = exp( ∫_S [ Ψ(x) − 1 ] λ(x) dx ) .   (8.22)
Comparing (8.22) with (8.18) shows that

η(s) = e^{μ (s − 1)} ,   (8.23)

where

s = ∫_S f( u + x ) h(u) du .   (8.24)
The right hand side of (8.23) is the probability generating function of the discrete Poisson
distribution with mean μ. Therefore, the Cox process has a Poisson distributed
number of points that are distributed around parent points with pdf h(u), u ∈ S.
Since probability generating functionals characterize finite orderly point processes
([16, p. 625]), the Cox process with random intensity function given by (8.21) is a
Neyman-Scott process with a Poisson distributed number of points. (General condi-
tions under which a Poisson cluster process is a Cox process are not known. Further
details are given in [16, pp. 663–664].)
Cox processes are useful in applications in which the parameter vector of the inten-
sity function is the solution of a stochastic differential equation (SDE). An example
is the intensity function (4.38) when the mean μt of the Gaussian component is time
dependent and satisfies the Ito diffusion equation [90, p. 104]
closely related to renewal processes (see [119, Chapters 6 and 7]). These processes
are well discussed elsewhere.
and the initial state probability mass function by π( · ) on C. The initial time is
t0 ≡ t (0). Homogeneous PPPs are defined for each state c( j) with intensity λc( j) .
(Nonhomogeneous PPPs can also be used.)
A realization of a MMPP on the interval [t (0), T ] is obtained from a two stage
procedure. The first stage generates a realization of A on the time interval [t (0), T ]
(see [100]). The initial state c(t (0)) at time t (0) is a realization of π( · ). Including
t(0), the switching times and states of the Markov chain are {t(0), t(1), ..., t(ℓ)} and {c(t(0)), c(t(1)), c(t(2)), ..., c(t(ℓ))}, respectively. Let t(ℓ + 1) = T. The second stage generates a realization of the PPP with intensity λ_{c(t(j))} on the time interval [t(j), t(j + 1)), j = 0, 1, ..., ℓ. The concatenation of the realizations of these Markov switched PPPs is a realization of the MMPP.
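The two-stage procedure is straightforward to simulate. The sketch below (Python) uses an assumed two-state Markov chain with exponential holding times; each holding interval contributes a homogeneous PPP with the rate of the current state.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 100.0
rates = {0: 1.0, 1: 8.0}        # per-state PPP intensities lambda_c (assumed values)
switch = {0: 0.2, 1: 0.5}       # exponential switching rates of the two-state chain
pi0 = 0.5                       # initial probability of state 0

t, state = 0.0, (0 if rng.uniform() < pi0 else 1)
events = []
while t < T:
    t_next = min(t + rng.exponential(1.0 / switch[state]), T)   # next switching time
    n = rng.poisson(rates[state] * (t_next - t))                # PPP arrivals on [t, t_next)
    events.extend(np.sort(rng.uniform(t, t_next, n)))
    t, state = t_next, 1 - state

print(len(events), "arrivals on [0, T]; empirical mean rate:", len(events) / T)
```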
MMPPs are stochastic processes often used to model bursty phenomena in vari-
ous applications, especially in telecommunications. Other applications include load
modeling that changes abruptly depending on the outcome of certain events, say, an
alternating hypothesis SPRT (sequential probability ratio test) that controls arrival
rates in a queue. The superposition of MMPPs is an MMPP, a fact that facilitates
applications. The transition function of the superposed MMPP in terms of the com-
ponent MMPPs can be found in [95].
p( x₁, ..., xₙ | n, λ ) = ( 1 / ( Zₙ n! ) ) exp( −E( x₁, ..., xₙ ) ) ,   (8.27)
Chapter 9
The Cutting Room Floor
Abstract Several topics of interest not discussed elsewhere in this book are
mentioned here.
The main properties of PPPs that seem (to the author) insightful and useful in appli-
cations are reviewed in this book. They do not appear to be gathered into one place
elsewhere. The applications to medical imaging and tomography, multiple target
tracking, and distributed sensor detection are intended to provide insight into meth-
ods and models of PPPs and related point processes, and to serve as an entry point
to exciting new applications of PPPs that are of active research interest.
Many interesting topics are naturally omitted from the book. For reasons already
mentioned in the introductory Chapter 1, nearly all discussion of one dimensional
point processes is omitted. Other omissions are detailed below in a chapter by chap-
ter review. A brief section on possible directions for further work follows.
Chapter 2 on the basics of PPPs omits many topics simply because they are
not used in the applications presented later in the book. Many of these topics have
diverse applications in one dimensional processes, and they are also interesting
in themselves. The connections with Poisson random measures are not mentioned.
More generally, the connections between stochastic processes and point processes
are only briefly mentioned. Markov point processes are omitted. This is especially
unfortunate because it means that physically important point processes with spatial
point to point correlations or interactions are treated only via example. The Matérn
hard core processes of Section 8.2 are excellent examples of Markov point pro-
cesses.
Chapter 3 on estimation is heavily weighted toward superposition because of
the needs of applications. Superposition leads rather naturally to an emphasis on
the method of EM. The dependence of the convergence rate of the EM method on
the OIM is not discussed. Since the convergence rate of EM based algorithms is
ultimately only linear, other algorithms merit consideration. Some methods use the
OIM to accelerate EM algorithm convergence; others use hybrid techniques that
switch from EM to other more rapidly convergent algorithms. MCMC methods are
also omitted from the discussion.
Chapter 4 discussion of the CRB is fairly complete for the purposes of PPPs. The
Bayesian CRB, or posterior CRB (PCRB), is not treated even though it is useful
in some applications, e.g., tracking. The notion of a lower bound for parameters
that are inherently discrete is available, but not discussed here, even though such
bounds are potentially useful in applications. The Hammersley-Chapman-Robbins
(HCR) bound holds for discrete parameters [14, 45]. It is a special case of the earlier
Barankin bound [5, 134]. When the parameter is continuously differentiable, the
HCR bound becomes the CRB.
Chapter 5 on PET, SPECT, and transmission tomography only scratches the sur-
face. Absent from the chapter are details about specific technical issues involved in
practical applications of the Shepp-Vardi and related algorithms. These details are
important to specialists and not without interest to others. Overcoming these issues
often involves exploiting the enormous flexibility of the algorithm. Examples that
illustrate the current state of the art would entice readers to learn more about these
topics, without delving too deeply into the intricacies of the methods. Happily, the
CRB is already available for PET and methods are available for computing the CRB
for subsets of the full PET image [49].
Chapter 6 on multitarget tracking applications of PPPs is an active area of
research. It is reasonable to ask about the CRB of the intensity filter, since the
intensity is the defining parameter. In this case, however, the target motion model is a
Bayesian prior and the CRBs of Chapter 4 must be extended to include this case. The
PCRB for the intensity filter is not available, but it is important for understanding
the quality of the intensity filter estimate. Computing it will surely be a complicated
affair because the intensity function is the small cell limit of a sequence of step
functions in the single target state space. Presumably, after first finding the PCRB
for a finite number of cells, taking the small cell limit of a suitably normalized
version yields a function on the product space S × S.
9.2 Possible Trends
Some argue that theoretical analysis of point processes is of little value in real world
applications because only extensive high fidelity simulations can give quantitative
understanding of the many variables of practical interest. This argument is easily
refuted in applications such as PET imaging, but more difficult to refute fully in
applications such as distributed sensing that use point processes to gain insight
rather than as an exact model.
The debate between theory and simulation will undoubtedly continue for years,
but it is a healthy debate reminiscent of the debate between theoretical and exper-
imental physicists. Practical problems will undoubtedly challenge theory, and the
explanatory power of theoretical methods will grow. Because of their extraordinary flexibility, MCMC methods will become a progressively more important tool
as demands on model sophistication and fidelity increase. The idea of perfect
simulation via the “coupling from the past” (CFTP) technique of Propp and Wilson
[96] will undoubtedly enrich applications. These methods will blur the distinction
between theory and simulation.
Geometrically distributed sensors and social networks (e.g., the Internet) are
both graphs whose vertices (points, sensors, agents, etc.) are connected by edges.
Edges represent connectivity that arises from geometric proximity, or from some
non-physical social communication link. The empirical evidence shows that the
vertex connectivity of geometric random graphs is very different from that of social
network graphs—complex social networks typically have a power law distribution
on connectivity, whereas geometric random graphs do not. Detecting subgraphs that
are proximate both geometrically and socially is an important problem. Adding
the element of time makes the problems dynamic and even more realistic. Inte-
gral geometry and its methods, especially Boolean models, may eventually be seen
as a key component of a mathematical foundation of detection in these kinds of
problems. It will be interesting to see what manner of contribution MCMC methods
make to the subject.
Appendix A
Expectation-Maximization (EM) Method
A.1 Formulation
The observed data z are called the incomplete data. As is seen shortly, the name
makes sense in the context of the EM method. The measurements z are a realization
of the random variable Z , and Z takes values in the space Z. The pdf of the data
z is specified parametrically by p Z (z ; θ ), where θ ∈ Θ is an unknown parameter
vector. Here, Θ is the set of all valid parameter values.
The likelihood function of θ is the pdf p Z (z ; θ ) thought of as a function of θ for
a given z. It is assumed that the likelihood of the data for any θ ∈ Θ is finite. It is
also assumed that the likelihood function of θ is uniformly bounded above, that is,
p Z (z ; θ ) ≤ B < ∞ for all θ ∈ Θ. The latter assumption is important.
The maximum likelihood (ML) estimate is

θ̂_ML = arg max_{θ ∈ Θ} p_Z( z ; θ ) .
For pdfs differentiable with respect to θ, the natural way to compute θ̂_ML is to solve the so-called necessary conditions

∇_θ p_Z( z ; θ ) = 0 .
p_{K|Z}( k | z ; θ ) = p_{ZK}( z, k ; θ ) / p_Z( z ; θ ) = p_{ZK}( z, k ; θ ) / ∫_K p_{ZK}( z, k ; θ ) dk .   (A.2)
Let n ≥ 0 denote the EM iteration index. Let θ (0) ∈ Θ be an initial (valid) value of
the parameter.
A.1.1 E-step

Given the current iterate θ^{(n−1)}, the E-step evaluates the EM auxiliary function

Q( θ ; θ^{(n−1)} ) = ∫_K ( log p_{ZK}( z, k ; θ ) ) p_{K|Z}( k | z ; θ^{(n−1)} ) dk .
A.1.2 M-step

The M-step chooses the update θ^{(n)} ∈ Θ to maximize Q( θ ; θ^{(n−1)} ). It is enough that the update merely increase the auxiliary function, that is, Q( θ^{(n)} ; θ^{(n−1)} ) ≥ Q( θ^{(n−1)} ; θ^{(n−1)} ).
If the update θ (n) is chosen in this way, the method is called the generalized EM
(GEM) method. The GEM method is very useful in many problems. It is the starting
point of other closely related EM-based methods. A prominent example is the SAGE
(Space Alternating Generalized EM) algorithm [35].
A.1.3 Convergence
Under mild conditions, convergence is guaranteed to a critical point of the likeli-
hood function, that is, θ (n) → θ ∗ such that
∇_θ p_Z( z | θ ) |_{θ = θ*} = 0 .
Experience shows that in practice, θ ∗ is almost certainly a local maximum and not
a saddle point, so that θ ∗ = θ̂ M L .
One of the mild conditions that cannot be overlooked is the assumption that
the likelihood function of θ is uniformly bounded above. If it is not, then the EM
iteration will converge only if the initial value of θ is by chance in the domain of
attraction of a point of local maximum likelihood. Otherwise, it will diverge, mean-
ing that the likelihood function of the iterates will grow unbounded. Unboundedness
would not be an issue were it not for the fact that the only valid values of θ have
finite likelihoods, so the iterates—if they converge to a point—converge to an invalid
parameter value, that is, to a point not in Θ. Estimation of heteroscedastic Gaussian
sums is notorious for this behavior (due to covariance matrix collapse). See Section
3.4 for further discussion.
The details of the EM convergence proof are given in too many places to repeat
all the details here. The book [80] gives a full and careful treatment of convergence,
as well as examples where convergence fails in various ways.
For present purposes, it is enough to establish two facts. One is that if the update
θ (n) satisfies (A.5), then it also increases the likelihood function. The other is that
a critical point of the auxiliary function is also a critical point of the likelihood
function, and conversely.
To see that each EM step monotonically increases the likelihood function, recall
that log x ≤ x − 1 for all x > 0 with equality if and only if x = 1. Then, using
only the above definitions,
0 < Q( θ ; θ^{(n−1)} ) − Q( θ^{(n−1)} ; θ^{(n−1)} )
  = ∫_K ( log p_{ZK}( z, k ; θ ) ) p_{K|Z}( k | z ; θ^{(n−1)} ) dk − ∫_K ( log p_{ZK}( z, k ; θ^{(n−1)} ) ) p_{K|Z}( k | z ; θ^{(n−1)} ) dk
  = ∫_K log[ p_{ZK}( z, k ; θ ) / p_{ZK}( z, k ; θ^{(n−1)} ) ] p_{K|Z}( k | z ; θ^{(n−1)} ) dk
  ≤ ∫_K [ p_{ZK}( z, k ; θ ) / p_{ZK}( z, k ; θ^{(n−1)} ) − 1 ] ( p_{ZK}( z, k ; θ^{(n−1)} ) / p_Z( z ; θ^{(n−1)} ) ) dk
  = ( 1 / p_Z( z ; θ^{(n−1)} ) ) ∫_K [ p_{ZK}( z, k ; θ ) − p_{ZK}( z, k ; θ^{(n−1)} ) ] dk
  = ( 1 / p_Z( z ; θ^{(n−1)} ) ) [ p_Z( z ; θ ) − p_Z( z ; θ^{(n−1)} ) ] .

Hence p_Z( z ; θ ) > p_Z( z ; θ^{(n−1)} ), that is, the likelihood increases.
To see the second fact, let θ₀ be a critical point of the likelihood function. Then

0 = ∇_{θ₀} p_Z( z ; θ₀ )
  = ∫_K ∇_{θ₀} p_{ZK}( z, k ; θ₀ ) dk
  = ∫_K p_{ZK}( z, k ; θ₀ ) ∇_{θ₀} log p_{ZK}( z, k ; θ₀ ) dk
  = p_Z( z ; θ₀ ) ∫_K ( p_{ZK}( z, k ; θ₀ ) / p_Z( z ; θ₀ ) ) ∇_{θ₀} log p_{ZK}( z, k ; θ₀ ) dk
  = p_Z( z ; θ₀ ) ∫_K p_{K|Z}( k | z ; θ₀ ) [ ∇_θ log p_{ZK}( z, k ; θ ) ]_{θ = θ₀} dk
  = p_Z( z ; θ₀ ) [ ∇_θ Q( θ ; θ₀ ) ]_{θ = θ₀} .
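The monotone increase of the likelihood is easy to observe numerically. The sketch below (Python) runs EM on a two-component Gaussian mixture with known weights and common variance, estimating only the two means; all numerical values are assumptions made for illustration. The printed log-likelihood never decreases from iteration to iteration.

```python
import numpy as np

rng = np.random.default_rng(8)
z = np.concatenate((rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)))  # observed data

w, sig = np.array([0.3, 0.7]), 1.0      # known mixture weights and common std (assumed)
mu = np.array([0.0, 1.0])               # initial parameter theta^(0)

def loglik(mu):
    comp = w * np.exp(-0.5 * (z[:, None] - mu)**2) / np.sqrt(2 * np.pi * sig**2)
    return np.log(comp.sum(axis=1)).sum()

for it in range(10):
    # E-step: posterior probability that each z came from each component (the missing data).
    comp = w * np.exp(-0.5 * (z[:, None] - mu)**2) / np.sqrt(2 * np.pi * sig**2)
    resp = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize the auxiliary function -> responsibility-weighted means.
    mu = (resp * z[:, None]).sum(axis=0) / resp.sum(axis=0)
    print(f"iter {it}: log-likelihood = {loglik(mu):.4f}, mu = {mu}")
```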
The widely referenced paper [23] is the synthesis of several earlier discoveries of
EM in a statistical setting. However, EM is not essentially statistical in character, but
is rather only one member of a larger class of strictly numerical methods called iter-
ative majorization [19–21]. This connection is not often mentioned in the literature,
but the insight was noticed almost immediately after the publication of [23].
The observed data likelihood function p_Z( z ; θ ) is the envelope of a two-parameter family of functions. This family is defined in the EM method by the auxiliary function, Q(θ ; φ). As seen from the results in [23], the data likelihood function majorizes every function in this family. For each specified parameter φ, the function Q(θ ; φ) is tangent to p_Z( z ; θ ) at the point φ. This situation is depicted
in Fig. A.1 for the sequence φ = θ0 , θ1 , θ2 , . . . . It is now intuitively clear that
EM based algorithms monotonically increase the data likelihood function, and that
they converge with high probability to a local maximum of the likelihood function
p Z (z ; θ ) that depends on the starting point.
It is very often observed in practice that EM based algorithms make large strides
toward the solution in the early iterations, but that progress toward the solution
Fig. A.1 Iterative majorization interpretation of the observed (incomplete) data likelihood function as the envelope of the EM auxiliary function, Q(θ, φ)
where
x̄ ≡ (1/m) Σ_{j=1}^{m} x_j ∈ R^{n_x} .
The j-th component of x̄ is denoted by x̄ j , which should not be confused with the
data point x j ∈ Rn x . The solution to (B.1) is unique and straightforward to compute
numerically for rectangular multidimensional regions
R = [a₁, b₁] × ··· × [a_{n_x}, b_{n_x}]

and diagonal covariance matrices Σ = Diag( σ₁², ..., σ_{n_x}² ).
The conditional mean of a univariate Gaussian distributed random variable con-
ditioned on realizations in the interval [−1, 1] is defined by
M[ μ, σ² ] ≡ ∫_{−1}^{1} s N( s ; μ, σ² ) ds / ∫_{−1}^{1} N( s ; μ, σ² ) ds ,   (B.2)
and
lim_{μ → −∞} M[ μ, σ² ] = −1 .
The most important fact about the function M is that it is strictly monotone increasing as a function of μ. Consequently, for any number c such that −1 < c < 1, and variance σ², the solution μ of the general equation

M[ μ, σ² ] = c   (B.3)

exists and is unique. To see this, it is only necessary to verify that the derivative M′[ μ, σ² ] ≡ (∂/∂μ) M[ μ, σ² ] > 0 for all μ. The inequality is intuitively obvious from its definition as a conditional mean. The function M[ μ, σ² ] is plotted for several values of σ in Fig. B.1. Evaluating the inverse function M⁻¹[ μ, σ² ] efficiently is left as an exercise.
Fig. B.1 Plots of M[μ, σ 2 ] from σ = 0.1 to σ = 1 in steps of 0.1. The monotonicity of
M[μ, σ 2 ] is self evident. The steepness of the transition from −1 to +1 increases steadily with
decreasing σ
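Although the exercise is left to the reader, one workable approach is sketched below (Python): evaluate M[μ, σ²] with the closed-form truncated-normal mean and invert it by bisection, relying on the strict monotonicity established above. The bracket is an assumption and must contain the root.

```python
import math

def M(mu, sigma):
    # Conditional mean (B.2) of N(mu, sigma^2) restricted to [-1, 1],
    # via the standard truncated-normal mean formula.
    a, b = (-1.0 - mu) / sigma, (1.0 - mu) / sigma
    phi = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return mu + sigma * (phi(a) - phi(b)) / (Phi(b) - Phi(a))

def M_inverse(c, sigma, lo=-6.0, hi=6.0, tol=1e-12):
    # Solve M(mu, sigma^2) = c for -1 < c < 1 by bisection; M is strictly
    # increasing in mu, so the root is unique (bracket assumed to contain it).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if M(mid, sigma) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mu = M_inverse(0.4, 0.5)
print("mu =", mu, "  check M(mu) =", M(mu, 0.5))
```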
In the multidimensional problem, the conditional mean equations are, for j = 1, ..., n_x,

∫_{a₁}^{b₁} ··· ∫_{a_{n_x}}^{b_{n_x}} s_j Π_{i=1}^{n_x} N( s_i ; μ_i, σ_i² ) ds₁ ··· ds_{n_x}
  / ∫_{a₁}^{b₁} ··· ∫_{a_{n_x}}^{b_{n_x}} Π_{i=1}^{n_x} N( s_i ; μ_i, σ_i² ) ds₁ ··· ds_{n_x} = x̄_j .   (B.4)

The integrals over the variables other than s_j cancel from the ratio, so (B.4) reduces to the univariate equation

∫_{a_j}^{b_j} s_j N( s_j ; μ_j, σ_j² ) ds_j / ∫_{a_j}^{b_j} N( s_j ; μ_j, σ_j² ) ds_j = x̄_j .   (B.5)
Substituting

s_j = ( (b_j − a_j)/2 ) x + ( a_j + b_j )/2

maps the interval [a_j, b_j] onto [−1, 1] and puts (B.5) into the form (B.3). Solving

M[ μ̃_j , ( 2σ_j / (b_j − a_j) )² ] = ( 2 / (b_j − a_j) ) ( x̄_j − ( a_j + b_j )/2 )   (B.7)

for the transformed mean μ̃_j, and then inverting the substitution, gives μ_j.
The expression (B.4) holds with obvious modifications for more general regions
R. However, the multidimensional integral over all the variables except x j is a func-
tion of x j in general and does not cancel from the ratio as it does in (B.5).
Appendix C
Bayesian Filtering
A brief review of general Bayesian filtering is given in this appendix. The discussion
sets the conceptual and notational foundation for Bayesian filtering on PPP event
spaces, all without mentioning PPPs until the very end. Gentler presentations that
readers may find helpful are widely available (e.g., [4, 54, 104, 122]).
The notation used in this appendix is used in Section 6.1 and also in the alterna-
tive derivation of Appendix D.
The denominator is the pdf of the measurement z k given that it is generated by the
target with pdf pk|k−1 (xk ):
p_{k|k}( x_k ) = p_{Υ_k|Ξ_k}( z_k | x_k ) p_{k|k−1}( x_k ) / π_{k|k−1}( z_k ) .   (C.5)
and
z j = H j (x j ) + w j , (C.7)
The joint pdf is found by substituting (C.8a)–(C.8c) into (C.1). The recursion (C.3)–(C.5) gives the posterior pdf on S = R^{n_x}.
The linear Gaussian Kalman filter assumes that

p_{k|k}( x_k ) = N( x_k | x̂_{k|k}, P_{k|k} ) ,   (C.10)

where x̂_{k|k} is the point estimate of the target at time t_k and P_{k|k} is the associated error covariance matrix. Explicitly, for j = 0, ..., k − 1,
P_{j+1|j} = F_j P_{j|j} F_j^T + Q_j   (C.11a)
W_{j+1} = P_{j+1|j} H_{j+1}^T [ H_{j+1} P_{j+1|j} H_{j+1}^T + R_{j+1} ]^{−1}   (C.11b)
P_{j+1|j+1} = P_{j+1|j} − W_{j+1} H_{j+1} P_{j+1|j}   (C.11c)
x̂_{j+1|j+1} = F_j x̂_{j|j} + W_{j+1} ( z_{j+1} − H_{j+1} F_j x̂_{j|j} ) .   (C.11d)
These equations are not necessarily good for numerical purposes, especially when
observability is an issue. In practice, the information form of the Kalman filter is
always preferable.
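A direct transcription of (C.11a)–(C.11d) is given below (Python) for a small assumed constant-velocity example. It is only a sketch of the covariance form, not the numerically preferred information form mentioned above.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One predict/update cycle implementing (C.11a)-(C.11d)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q                              # (C.11a)
    S = H @ P_pred @ H.T + R
    W = P_pred @ H.T @ np.linalg.inv(S)                   # (C.11b), the gain
    P_new = P_pred - W @ H @ P_pred                       # (C.11c)
    x_new = x_pred + W @ (z - H @ x_pred)                 # (C.11d), innovation form
    return x_new, P_new

# Assumed one-dimensional nearly-constant-velocity model.
dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.25]])

x, P = np.zeros(2), np.eye(2)
for z in [0.9, 2.1, 2.8, 4.2]:                            # assumed position measurements
    x, P = kalman_step(x, P, np.array([z]), F, Q, H, R)
print("state estimate:", x)
```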
The Kalman filter is often written in terms of measurement innovations when the measurement and target motion models are linear. The predicted target state at time t_j is

x̂_{j|j−1} = F_{j−1} x̂_{j−1|j−1} .

The innovation at time t_j is the difference between the actual measurement and the predicted measurement:

z_j − H_j x̂_{j|j−1} .

The reason for the name innovation is now self-evident. The information updated target state is

x̂_{j|j} = x̂_{j|j−1} + W_j ( z_j − H_j x̂_{j|j−1} ) .   (C.15)
The update (C.15) is perhaps more intuitive, but it is the same as before.
For completeness, the smoothing (or, lagged) Kalman filter is given here. The
posterior pdf is denoted by
p_{j|k}( x_j ) = N( x_j | x̂_{j|k}, Σ_{j|k} ) ,   (C.16)
where x̂ j|k is the point estimate of the target at time t j given all the data {z 1 , . . . , z k }
up to and including time tk , and Σ j|k is the associated error covariance matrix. These
quantities are computed by the backward recursion: For j = k −1, . . . , 0, the point
estimates are
x̂_{j|k} = x̂_{j|j} + P_{j|j} F_j^T P_{j+1|j}^{−1} ( x̂_{j+1|k} − F_j x̂_{j|j} ) .   (C.17)

The innovation form of the filter, if it is desirable to think of the smoothing filter in such terms, is written in terms of the state innovations, x̂_{j+1|k} − F_j x̂_{j|j}. The corresponding error covariance matrices are

Σ_{j|k} = P_{j|j} + P_{j|j} F_j^T P_{j+1|j}^{−1} ( Σ_{j+1|k} − P_{j+1|j} ) P_{j+1|j}^{−1} F_j P_{j|j} .   (C.18)
The smoothing recursions (C.17)–(C.18) were first derived in 1965 by Rauch, Tung,
and Striebel [98].
The multitarget intensity filter is derived by Bayesian methods in this appendix. The
posterior point process is developed first, and then the posterior point process is
approximated by a PPP. Finally, the last section discusses the relationship between
this method and the “first moment” approximation of the posterior point process.
The steps of the intensity filter are outlined in Fig. 6.1. The PPP interpretations of
these steps are thinning, approximating the Bayes update with a PPP, and superpo-
sition. The PPP at time tk is first thinned by detection. The two branches of the thin-
ning are the detected and undetected target PPPs. Both branches are important. Their
information updates are different. The undetected target PPP is the lesser branch. Its
information update is a PPP. The detected target branch is the main branch, and
its information update comprises two key steps. Firstly, the Bayes update of the
posterior point process of Ξk on E(S + ) given data up to and including time tk is
obtained. The posterior is not a PPP, as is seen below from the form of its pdf in
(D.10). Secondly, the posterior point process is approximated by a PPP, and a low
computational complexity expression for the intensity of the approximating PPP is
obtained. The two branches of detection thinning are recombined by superposition
to obtain the intensity filter update.
The random variables Ξk−1|k−1 , Ξk|k−1 , and Υk|k−1 are defined as in Appendix C.
The state space of Ξk−1|k−1 and Ξk|k−1 is E(S + ), where E(S + ) is a union of sets
defined as in (2.1). Similarly, the event space of Υk|k−1 is E(T ), not T .
The process Ξk−1|k−1 is assumed to be a PPP, so it is parameterized by its inten-
sity f k−1|k−1 (s), s ∈ S + . A realization ξk ∈ E(S + ) of Ξk−1|k−1 is transitioned
to time tk via the single target transition function Ψk−1 (y | x). Its intensity is, using
(2.83),
f_{k|k−1}(x) = ∫_{S⁺} Ψ_{k−1}( x | s ) f_{k−1|k−1}(s) ds .   (D.1)
The point process Ξ_{k|k} is the sum of detected and undetected target processes, denoted by Ξ_{k|k}^D and Ξ_{k|k}^U, respectively. They are obtained from the same realizations of Ξ_{k|k−1}, so they would seem to be highly correlated. However, the number of
of Ξk|k−1 , so they would seem to be highly correlated. However, the number of
points in the realization is Poisson distributed, so they are actually independent. See
Section 2.9.
The undetected target process Ξ_{k|k}^U is the predicted target PPP Ξ_{k|k−1} thinned by 1 − P_k^D(s), where P_k^D(s) is the probability of detecting a target at s. Thus Ξ_{k|k}^U is a PPP, and

f_{k|k}^U(x) = ( 1 − P_k^D(x) ) f_{k|k−1}(x)   (D.2)
is its intensity.
The detected target process Ξ_{k|k}^D is the predicted target PPP Ξ_{k|k−1} that is thinned by P_k^D(s) and subsequently updated by Bayesian filtering. Thinning yields the predicted detected target PPP, and

f_{k|k−1}^D(x) = P_k^D(x) f_{k|k−1}(x)   (D.3)
is its intensity.
The predicted measurement process Υ_{k|k−1} is obtained from Ξ_{k|k−1}^D via the pdf of a single point measurement z ∈ T conditioned on a target located at s ∈ S⁺. The quantity p_k(z | φ) is the likelihood of z if it is a false alarm. See Section 2.12. Thus, Υ_{k|k−1} is a PPP on T and

λ_{k|k−1}(z) = ∫_{S⁺} p_k( z | s ) P_k^D(s) f_{k|k−1}(s) ds ,   (D.4)
is its intensity.
The measurement set is υ_k = (m, {z₁, ..., z_m}), where z_j ∈ T. The conditional
pdf of υk is defined for arbitrary target realizations ξk = (n, {x1 , . . . , xn }) ∈
E(S + ). All the points x j of ξk , whether they are a true target (x j ∈ Rn x ) or are clutter
(x j = φ), generate a measurement so that only when m = n is the measurement
likelihood non-zero. The correct assignment of point measurements to targets in ξk
is unknown. All such assignments are equally probable, so the pdf averages over all
possible assignments of data to false alarms and targets. Because φ is a target state,
the measurement pdf is
p_{Υ_k|Ξ_k}( υ_k | ξ_k ) = (1/m!) Σ_{σ ∈ Sym(m)} Π_{j=1}^{m} p_k( z_{σ(j)} | x_j ) ,   if m = n ,
                        = 0 ,   if m ≠ n ,   (D.5)
where Sym(m) is the set of all permutations on the integers {1, 2, . . . , m}.
The lower branch of (D.5) is a consequence of the “at most one measurement
per target” rule together with the augmented target state space S + . To elaborate, the
points in a realization ξ of the detected target PPP are targets, some of which have
p_{k|k}( ξ_k ) = p_{Υ_k|Ξ_k}( υ_k | ξ_k ) p_{k|k−1}( ξ_k ) / π_{k|k−1}( υ_k ) .   (D.6)
Substituting (D.7), (D.8), and (D.5) into (D.6) and using obvious properties of per-
mutations gives the posterior pdf of Ξ_{k|k}^D:

p_{k|k}( ξ_k ) = (1/m!) Σ_{σ ∈ Sym(m)} Π_{j=1}^{m} p_k( z_{σ(j)} | x_j ) P_k^D( x_j ) f_{k|k−1}( x_j ) / λ_{k|k−1}( z_{σ(j)} ) .   (D.10)
If ξ_k does not contain exactly m points, then p_{k|k}( ξ_k ) = 0. Conditioning Ξ_{k|k}^D on m points gives

p_{k|k}( x₁, ..., x_m ) = (1/m!) Σ_{σ ∈ Sym(m)} Π_{j=1}^{m} p_k( z_{σ(j)} | x_j ) P_k^D( x_j ) f_{k|k−1}( x_j ) / λ_{k|k−1}( z_{σ(j)} ) .   (D.11)
a problem for the recursion. One way around it is to approximate Ξ_{k|k}^D by a PPP and
The pdf p_{k|k}( x₁, ..., x_m ) = p_{k|k}( x_{σ(1)}, ..., x_{σ(m)} ) for all σ ∈ Sym(m); therefore, integrating it over all of its arguments except, say, the ℓth argument gives the same result regardless of the choice of ℓ. The form of the "single target marginal" is, using (D.4),

p_{k|k}( x_ℓ ) ≡ ∫_{S⁺} ··· ∫_{S⁺} p_{k|k}( x₁, ..., x_m ) Π_{i=1, i≠ℓ}^{m} dx_i
  = (1/m!) Σ_{σ ∈ Sym(m)} ∫_{(S⁺)^{m−1}} Π_{j=1}^{m} [ p_k( z_{σ(j)} | x_j ) P_k^D( x_j ) f_{k|k−1}( x_j ) / λ_{k|k−1}( z_{σ(j)} ) ] Π_{i=1, i≠ℓ}^{m} dx_i
  = Σ_{r=1}^{m} Σ_{σ ∈ Sym(m), σ(ℓ)=r} (1/m!) p_k( z_{σ(ℓ)} | x_ℓ ) P_k^D( x_ℓ ) f_{k|k−1}( x_ℓ ) / λ_{k|k−1}( z_{σ(ℓ)} )
  = (1/m) Σ_{r=1}^{m} p_k( z_r | x_ℓ ) P_k^D( x_ℓ ) f_{k|k−1}( x_ℓ ) / λ_{k|k−1}( z_r ) .   (D.12)
p_{k|k}( x₁, ..., x_m ) ≈ Π_{j=1}^{m} p_{k|k}( x_j ) .   (D.13)
L( c | ξ_k ) = e^{−∫_{S⁺} c p_{k|k}(s) ds} (1/m!) Π_{j=1}^{m} c p_{k|k}( x_j ) ∝ e^{−c} c^m .
f_{k|k}^D(x) = Σ_{r=1}^{m} p_k( z_r | x ) P_k^D(x) f_{k|k−1}(x) / λ_{k|k−1}( z_r ) ,   (D.14)
where λ_{k|k−1}( z_r ) is given by (D.4).
The posterior point process Ξ_{k|k}^D is a finite point process whose realizations contain exactly m points.
dates to the 1950s. (An excellent reference is [17, Chapter 5].) This theory is now
applied to Ξ_{k|k}^D.
and X|N is the point set. From [17, Section 5.3], the Janossy probability density of a finite point process is defined by

j_n( x₁, ..., x_n ) = n! Pr[ N = n ] p_{X|N}( x₁, ..., x_n | n ) .

Janossy densities were encountered (but left unnamed) early in Chapter 2, (2.10). Using the ordered argument list as in (2.13) gives

j_m( x₁, ..., x_m ) = m! p_{k|k}( x₁, ..., x_m ) ,   and   j_n ≡ 0 for n ≠ m ,

where p_{k|k}( x₁, ..., x_m ) is the posterior pdf given by (D.11). The first moment
intensity is denoted in [17] by m 1 (x). From [17, Lemma 5.4.III], it is given in terms
of the Janossy density functions by
m₁(x) = Σ_{n=0}^{∞} (1/n!) ∫_{S⁺} ··· ∫_{S⁺} j_{n+1}( x, x₁, ..., x_n ) dx₁ ··· dx_n .   (D.24)
Only the n = m − 1 term of (D.24) is nonzero, so

m₁(x) = ( 1/(m−1)! ) ∫_{(S⁺)^{m−1}} j_m( x, x₁, ..., x_{m−1} ) dx₁ ··· dx_{m−1}
      = m ∫_{(S⁺)^{m−1}} p_{k|k}( x, x₁, ..., x_{m−1} ) dx₁ ··· dx_{m−1} .   (D.25)

The integral (D.25) is exactly m times the integral in (D.12), so the first moment approximation to Ξ_{k|k}^D is identical to the intensity (D.14).
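To make the update concrete, the following sketch (Python) evaluates (D.2), (D.4), and (D.14) on a one-dimensional grid. The Gaussian measurement likelihood, constant detection probability, and the constant stand-in for the clutter contribution to the predicted measurement intensity are all assumptions made only for illustration.

```python
import numpy as np

# Minimal 1-D grid sketch of the intensity-filter measurement update (D.14).
x = np.linspace(0.0, 10.0, 501)
dx = x[1] - x[0]
f_pred = 0.4 * np.exp(-0.5 * (x - 3.0)**2) + 0.6 * np.exp(-0.5 * (x - 7.0)**2)  # f_{k|k-1}
Pd = 0.9                                   # detection probability, assumed constant
sig_z = 0.5                                # measurement noise std (assumed)
clutter = 0.05                             # assumed constant clutter intensity in z
meas = [2.8, 7.3]                          # measurement set at time t_k (assumed)

def lik(z):                                # p_k(z | x): assumed Gaussian measurement model
    return np.exp(-0.5 * ((z - x) / sig_z)**2) / (sig_z * np.sqrt(2 * np.pi))

f_undet = (1.0 - Pd) * f_pred              # (D.2): undetected-target intensity
f_det = np.zeros_like(x)
for z in meas:
    lam_z = clutter + np.sum(lik(z) * Pd * f_pred) * dx   # predicted measurement intensity (D.4)
    f_det += lik(z) * Pd * f_pred / lam_z                  # one term of the sum in (D.14)

f_post = f_undet + f_det                   # superposition of the two branches
print("expected number of targets:", f_post.sum() * dx)
```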
Appendix E
MMIF: Marked Multitarget Intensity Filter
This appendix derives the marked multitarget intensity filter (MMIF) recursion for
linear Gaussian target and measurement models via the EM method. Targets are
modeled as PPPs that are “marked” with measurements.
As seen from Section 8.1, measurement-marked target PPPs are equivalent to
ordinary PPPs on the Cartesian product of the measurement and target spaces. These
joint PPPs are superposed, and the target states estimated via the EM method. As
mentioned in Section 6.2.2, the MMIF satisfies the “at most one measurement per
target rule” in the mean.
Ψ_{k−1}( x_k(ℓ) | x_{k−1}(ℓ) ) = N( x_k(ℓ) ; F_{k−1}(ℓ) x_{k−1}(ℓ), Q_{k−1}(ℓ) ) ,   (E.2)

where the system matrix F_{k−1}(ℓ) ∈ R^{n_x × n_x} and the process noise covariance matrix Q_{k−1}(ℓ) ∈ R^{n_x × n_x} are specified. The target motion model is equivalent to x_k(ℓ) = F_{k−1}(ℓ) x_{k−1}(ℓ) + u_{k−1}(ℓ), where the process noise u_{k−1}(ℓ) ∈ R^{n_x} is zero mean Gaussian distributed with covariance matrix Q_{k−1}(ℓ). The process noises are assumed independent from target to target.
Each target is modeled as a PPP. It is assumed, recursively, that the intensity
function of target ℓ at time t_{k−1} is
f^ℓ_{k−1|k−1}(x) = Î_{k−1|k−1}(ℓ) N( x ; x̂_{k−1|k−1}(ℓ), P_{k−1|k−1}(ℓ) ) ,   (E.3)

where the MAP estimate x̂_{k−1|k−1}(ℓ), its covariance matrix P_{k−1|k−1}(ℓ), and intensity Î_{k−1|k−1}(ℓ) are known. Under the target motion model (E.2), the predicted
detected target intensity function at time tk is
f^ℓ_{k|k−1}(x) = P_k^D(ℓ) I_k(ℓ) N( x ; x̂_{k|k−1}(ℓ), P_{k|k−1}(ℓ) ) ,   (E.4)

where P_k^D(ℓ) is the probability of detecting target ℓ at time t_k and is assumed independent of target state x. Also, the predicted state and covariance matrix of target ℓ are

x̂_{k|k−1}(ℓ) = F_{k−1}(ℓ) x̂_{k−1|k−1}(ℓ)   (E.5)
P_{k|k−1}(ℓ) = F_{k−1}(ℓ) P_{k−1|k−1}(ℓ) F_{k−1}^T(ℓ) + Q_{k−1}(ℓ) .   (E.6)
The coefficient I_k(ℓ) is estimated from data at time t_k as part of the MMIF recursion.
where the measurement matrix H_k(ℓ) ∈ R^{n_z × n_x} and the measurement noise covariance matrix R_k(ℓ) ∈ R^{n_z × n_z} are both specified. The measurement model is equivalent to z = H_k(ℓ) x + v_k(ℓ), where the measurement noise v_k(ℓ) ∈ R^{n_z} is zero mean Gaussian distributed with covariance matrix R_k(ℓ). The measurement and target process noises are assumed independent.
Measurements are modeled as marks that are associated with targets that are
realizations of a target PPP. Marked processes are described in a general setting in
Chapter 8. As seen from the Marking Theorem of Section 8.1, a measurement-
marked target PPP is equivalent to a PPP on the Cartesian product of the mea-
surement and target spaces, that is, on Rn z × Rn x . The measurement process is not
assumed to be a PPP.
The intensity function of the joint measurement-target PPP of target ℓ in state x ∈ R^{n_x} at time t_k is, from the expression (8.2),

λ^ℓ_{k|k}(z, x) = f^ℓ_{k|k−1}(x) N( z ; H_k(ℓ) x, R_k(ℓ) ) .   (E.8)
From the basic property of PPPs, the expected number of marked detected targets, that is, the number of targets with a measurement, is the multiple integral over R^{n_z} × R^{n_x}:

∫_{R^{n_z} × R^{n_x}} λ^ℓ_{k|k}(z, x) dz dx = ∫_{R^{n_x}} f^ℓ_{k|k−1}(x) dx = P_k^D(ℓ) I_k(ℓ) .   (E.9)
This statement is equivalent to the “at most one measurement per target rule,” but
only in the mean.
Substituting (E.4) and (E.7) gives the joint measurement-target intensity function
λ^ℓ_{k|k}(z, x) = P_k^D(ℓ) I_k(ℓ) N( x ; x̂_{k|k−1}(ℓ), P_{k|k−1}(ℓ) ) N( z ; H_k(ℓ) x, R_k(ℓ) )
             = P_k^D(ℓ) I_k(ℓ) N( x ; x̂_{k|k}(z ; ℓ), P_{k|k}(ℓ) ) N( z ; ẑ_{k|k−1}(ℓ), S_{k|k}(ℓ) ) ,   (E.10)
where, using x̂_{k|k−1}(ℓ) and P_{k|k−1}(ℓ) above, the usual Kalman filter equations give

ẑ_{k|k−1}(ℓ) = H_k(ℓ) x̂_{k|k−1}(ℓ)
S_{k|k}(ℓ) = H_k(ℓ) P_{k|k−1}(ℓ) H_k^T(ℓ) + R_k(ℓ)
W_k(ℓ) = P_{k|k−1}(ℓ) H_k^T(ℓ) [ H_k(ℓ) P_{k|k−1}(ℓ) H_k^T(ℓ) + R_k(ℓ) ]^{−1}
P_{k|k}(ℓ) = P_{k|k−1}(ℓ) − W_k(ℓ) H_k(ℓ) P_{k|k−1}(ℓ)
x̂_{k|k}(z ; ℓ) = F_{k−1}(ℓ) x̂_{k−1|k−1}(ℓ) + W_k(ℓ) ( z − ẑ_{k|k−1}(ℓ) ) .   (E.11)
λ_{k|k}(z, x) = λ⁰_{k|k}(z) + Σ_{ℓ=1}^{L} λ^ℓ_{k|k}(z, x)
            = I_k(0) q_k(z) + Σ_{ℓ=1}^{L} P_k^D(ℓ) I_k(ℓ) N( x ; x̂_{k|k}(z ; ℓ), P_{k|k}(ℓ) ) N( z ; ẑ_{k|k−1}(ℓ), S_{k|k}(ℓ) ) .   (E.13)
This sum parameterizes the likelihood function of the MMIF filter. The EM method
uses it in the next section to derive a recursion for estimating target states and the
intensity coefficients.
z k (1 : m k ) = {z k (1), . . . , z k (m k )} ,
xk (1 : m k ) = {xk (1), . . . , xk (m k )} ,
Because the target model is a PPP, the data Zk are a realization of the measurement-
target PPP with intensity (E.13). Its likelihood function is
p( Z_k ) = e^{−∫_{R^{n_z} × R^{n_x}} λ_{k|k}(z, x) dz dx} Π_{j=1}^{m_k} λ_{k|k}( z_k(j), x_k(j) )
        = e^{−I_k(0) − Σ_{ℓ=1}^{L} P_k^D(ℓ) I_k(ℓ)} Π_{j=1}^{m_k} [ λ⁰_{k|k}( z_k(j) ) + Σ_{ℓ=1}^{L} λ^ℓ_{k|k}( z_k(j), x_k(j) ) ] ,   (E.14)
xk ( j) = χk (σ j ), j = 1, . . . , m k . (E.16)
In other words, measurements that arise from the same mode have exactly the same
target state. The constraints (E.16) violate the exact form of the "at most one mea-
surement per target rule”, but it is not violated in the mean. The target states to be
estimated are χk (1 : L).
The superposition in (E.14) is a clear indication of the utility of the EM method
for computing MAP estimates. In EM parlance, (E.14) is the incomplete data pdf.
It is natural (indeed, other choices seem contrived here) to let the indices σ ≡
{σ1 , . . . , σm k } denote the missing data. The complete data pdf is defined by
p( Z_k, σ ) = e^{−I_k(0) − Σ_{ℓ=1}^{L} P_k^D(ℓ) I_k(ℓ)} Π_{j=1}^{m_k} λ^{σ_j}_{k|k}( z_k(j), χ_k(σ_j) ) .   (E.17)
Let Ik (0 : L) ≡ (Ik (0), Ik (1), . . . , Ik (L)). The posterior pdf of σ is, by the defi-
nition of conditioning,
p( σ | χ_k(1 : L), I_k(0 : L) ) = p( Z_k, σ ) / p( Z_k )
                                = Π_{j=1}^{m_k} w_{σ_j}( z_k(j) ; χ_k(1 : L), I_k(0 : L) ) ,   (E.18)
where, for ℓ = 1, ..., L,

w_ℓ( z ; χ_k(1 : L), I_k(0 : L) )
   = P_k^D(ℓ) I_k(ℓ) N( χ_k(ℓ) ; x̂_{k|k}(z ; ℓ), P_{k|k}(ℓ) ) N( z ; ẑ_{k|k−1}(ℓ), S_{k|k}(ℓ) )
     / [ I_k(0) q_k(z) + Σ_{ℓ′=1}^{L} P_k^D(ℓ′) I_k(ℓ′) N( χ_k(ℓ′) ; x̂_{k|k}(z ; ℓ′), P_{k|k}(ℓ′) ) N( z ; ẑ_{k|k−1}(ℓ′), S_{k|k}(ℓ′) ) ] ,   (E.19)

and, for clutter,

w₀( z ; χ_k(1 : L), I_k(0 : L) )
   = I_k(0) q_k(z)
     / [ I_k(0) q_k(z) + Σ_{ℓ′=1}^{L} P_k^D(ℓ′) I_k(ℓ′) N( χ_k(ℓ′) ; x̂_{k|k}(z ; ℓ′), P_{k|k}(ℓ′) ) N( z ; ẑ_{k|k−1}(ℓ′), S_{k|k}(ℓ′) ) ] .   (E.20)
The coefficient e^{−I_k(0) − Σ_{ℓ=1}^{L} P_k^D(ℓ) I_k(ℓ)} cancels out in the weight calculation. The weights are ratios of intensities. They are the probabilities that the measurement z is generated by target ℓ, or by clutter if ℓ = 0.
Let r = 0, 1, . . . be the EM iteration index, and let χk(0) (1 : L) and Ik(0) (0 : L)
be specified initial values of the target states and their intensity coefficients. The EM
auxiliary function is the conditional expectation
Q( χ_k(1 : L), I_k(0 : L) | χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) )
   = Σ_σ { log p( Z_k, σ ) } Π_{j=1}^{m_k} w_{σ_j}( z_k(j) ; χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) ) .   (E.21)
Carrying out the sum over σ gives

Q( χ_k(1 : L), I_k(0 : L) | χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) ) = −I_k(0) − Σ_{ℓ=1}^{L} P_k^D(ℓ) I_k(ℓ)
   + Σ_{ℓ=0}^{L} Σ_{j=1}^{m_k} w_ℓ( z_k(j) ; χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) ) log λ^ℓ_{k|k}( z_k(j), χ_k(ℓ) ) .   (E.22)
Using (3.38) gives the EM update for the intensity coefficient of the ℓ-th target as

I_k^{(r+1)}(ℓ) = ( 1 / P_k^D(ℓ) ) Σ_{j=1}^{m_k} w_ℓ( z_k(j) ; χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) ) .   (E.23)
The factor P_k^D(ℓ) cancels the same factor in the weights (E.19). For clutter, ℓ = 0, the updated intensity coefficient is

I_k^{(r+1)}(0) = Σ_{j=1}^{m_k} w₀( z_k(j) ; χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) ) .   (E.24)
The EM update for the state of the ℓ-th target is the weighted mean

χ_k^{(r+1)}(ℓ) = Σ_{j=1}^{m_k} w_ℓ( z_k(j) ; χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) ) x̂_{k|k}( z_k(j) ; ℓ )
              / Σ_{j=1}^{m_k} w_ℓ( z_k(j) ; χ_k^{(r)}(1 : L), I_k^{(r)}(0 : L) ) .   (E.25)
A more intuitive way to write the result is to substitute for $\hat{x}_{k|k}(z_k(j)\,;\ell)$ using (E.11). By linearity, the updated state is given by the Kalman filter
$\chi_k^{(r+1)}(\ell) = F_{k-1}(\ell)\, \hat{x}_{k-1|k-1}(\ell) + W_k(\ell)\bigl(\bar{z}_{k|k}^{\,(r+1)}(\ell) - \hat{z}_{k|k-1}(\ell)\bigr)$,   (E.26)
where $\bar{z}_{k|k}^{\,(r+1)}(\ell)$ is the centroid of the measurements $z_k(1\!:\!m_k)$ under the weights appearing in (E.25).
When the EM iterations terminate, say at iteration $r_{\mathrm{last}}$, the MMIF outputs at time $t_k$ are
$\hat{I}_{k|k}(\ell) = I_k^{(r_{\mathrm{last}})}(\ell), \qquad 0 \le \ell \le L,$   (E.28)
$\hat{x}_{k|k}(\ell) = \chi_k^{(r_{\mathrm{last}})}(\ell), \qquad 1 \le \ell \le L.$   (E.29)
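For readers who want to trace the recursion numerically, the following is a minimal sketch of one EM pass of (E.19)–(E.25) in Python/NumPy for scalar states and measurements. It is not the MMIF implementation itself: the Kalman quantities $\hat{x}_{k|k}(z;\ell)$, $\hat{z}_{k|k-1}(\ell)$, $P_{k|k}(\ell)$, $S_{k|k}(\ell)$ and the clutter density $q_k$ are simply supplied as inputs, and all function and variable names are illustrative assumptions.

import numpy as np

def gauss(x, mean, var):
    # Scalar Gaussian density N(x; mean, var).
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mmif_em_step(z, chi, I, pd, qk, xhat, Pkk, zhat, Skk):
    # One EM pass of (E.19)-(E.25), scalar case.
    #   z    : measurements z_k(1:m_k)
    #   chi  : current target state estimates chi_k^{(r)}(1:L)
    #   I    : current intensity coefficients I_k^{(r)}(0:L); I[0] is clutter
    #   pd   : detection probabilities P_k^D(l), length L
    #   qk   : clutter measurement density, a callable of z
    #   xhat : callable (z, l) -> Kalman-updated state, as in (E.11)
    #   Pkk, zhat, Skk : posterior variances, predicted measurements,
    #                    and innovation variances for each target l
    m, L = len(z), len(chi)
    w = np.zeros((L + 1, m))
    for j, zj in enumerate(z):
        num = np.array([I[0] * qk(zj)] + [
            pd[l] * I[l + 1]
            * gauss(chi[l], xhat(zj, l), Pkk[l])   # N(chi_k(l); xhat_{k|k}(z;l), P_{k|k}(l))
            * gauss(zj, zhat[l], Skk[l])           # N(z; zhat_{k|k-1}(l), S_{k|k}(l))
            for l in range(L)])
        w[:, j] = num / num.sum()                  # weights (E.19)-(E.20)
    I_new = np.empty(L + 1)
    I_new[0] = w[0].sum()                          # clutter intensity update (E.24)
    I_new[1:] = w[1:].sum(axis=1) / pd             # target intensity updates (E.23)
    chi_new = np.array([np.dot(w[l + 1], [xhat(zj, l) for zj in z]) / w[l + 1].sum()
                        for l in range(L)])        # state updates (E.25)
    return chi_new, I_new

Iterating this step until the estimates stabilize reproduces the recursion summarized in (E.28) and (E.29).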
Appendix F: Linear Filter Model

The instantaneous output power of a linear filter with a stationary Gaussian input signal is exponentially distributed when the passband of the filter is a subset of the input signal band. This appendix derives this result as the limit of a PPP model. The points of the PPP realizations are in the spectral (frequency) domain.
The frequency domain output of a linear filter with a classical stationary Gaussian input signal is equivalent to a compound PPP, assuming the input signal bandwidth is larger than the filter bandwidth. In this case the pdf of the instantaneous filter output power, $a^2$, in the filter passband is
$p(a^2\,; S^2) = \dfrac{1}{S^2} \exp\Bigl(-\dfrac{a^2}{S^2}\Bigr).$   (F.1)
The parameter of this exponential distribution is equal to the signal power, $S^2 < \infty$, in the filter passband.
The input signal is modeled in the frequency domain, not the time domain. (This
contrasts sharply with the time domain model mentioned in Section 8.1.2.) The
frequency domain PPP “emits” points, or shots, across the entire signal band. The
only shots that pass through the filter are those in the filter passband. The shots that
pass through the filter comprise a noncausal PPP process in a frequency domain
equal to that of the filter passband. In essence, the filter acts as a thinning process
on the input PPP. The filtered, or thinned, points are not observed directly in the
filter output. Instead, each point carries a complex-valued mark, or phase, and the
measurement is the coherent sum u of the marks of the points that pass through the
filter. Thus, the measurement u is known, but the number n of shots in the filter
output and the individual marks comprising the coherent sum are all unknown.
Let λ(ω) be the filtered PPP intensity—it is defined in the frequency domain
over the filter passband, B. It is shown in this appendix that the function λ(ω) is the
signal power spectrum.
Let $\nu \equiv \int_B \lambda(x)\,\mathrm{d}x$ be the mean number of points in $B$. Now, suppose there are $n$ points with marks $u_k = (c_k, s_k)^{\mathsf T}$, where $c_k$ and $s_k$ are the in-phase and quadrature components of the $k$-th mark, respectively. The marks are assumed i.i.d. with pdf
$p(u_k) \equiv \mathcal{N}\!\left( \begin{pmatrix} c_k \\ s_k \end{pmatrix} ;\, \begin{pmatrix} 0 \\ 0 \end{pmatrix},\, \dfrac{\hbar^2}{2} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right) = \dfrac{1}{\pi \hbar^2} \exp\Bigl(-\dfrac{c_k^2 + s_k^2}{\hbar^2}\Bigr).$   (F.2)
The mark power, the expected value of its squared magnitude, is $\hbar^2$. (The choice of the symbol $\hbar^2$ to represent mark power is intended to suggest that the limit $\hbar \to 0$ will eventually be taken.) The pdf of the coherent mark sum $u = \sum_{k=1}^{n} u_k \equiv [c, s]^{\mathsf T}$ is
$p(u \mid n) = \dfrac{1}{\pi n \hbar^2} \exp\Bigl(-\dfrac{u^{\mathsf T} u}{n \hbar^2}\Bigr).$   (F.3)
The pdf of the joint event $(u, n)$ is, from (2.4) and (F.3),
$p(u) = \sum_{n=0}^{\infty} p(u, n) = \sum_{n=0}^{\infty} p(n)\, p(u \mid n) = e^{-\nu} \left[\, 1 + \dfrac{1}{\pi \hbar^2} \sum_{n=1}^{\infty} \dfrac{\nu^n}{n!\, n} \exp\Bigl(-\dfrac{u^{\mathsf T} u}{n \hbar^2}\Bigr) \right].$   (F.4)
The mark power is tied to the signal power by requiring
$\nu \hbar^2 = S^2 \;\Longleftrightarrow\; \hbar^2 = \nu^{-1} S^2.$   (F.5)
The interchange of summation and double integral in the first step is justified
because the series is absolutely convergent. Substituting (F.5) and taking the limit
gives
The limit (F.7) follows immediately from the Fourier inversion formula.
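The limiting exponential law can also be checked by simulation. The sketch below (Python/NumPy; not from the text, and with arbitrarily chosen values of $S^2$ and $\nu$) draws the shot count $n$ from a Poisson distribution with mean $\nu$, forms the coherent sum of $n$ i.i.d. complex marks with per-mark power $\hbar^2 = S^2/\nu$ as in (F.2), (F.3), and (F.5), and compares the resulting power samples with the exponential law (F.1).

import numpy as np

rng = np.random.default_rng(0)
S2, nu, trials = 2.0, 400.0, 20_000      # signal power and mean shot count (arbitrary)
hbar2 = S2 / nu                          # per-mark power fixed by (F.5)

power = np.empty(trials)
for t in range(trials):
    n = rng.poisson(nu)                                    # shots in the passband
    marks = rng.normal(0.0, np.sqrt(hbar2 / 2.0), (n, 2))  # i.i.d. marks, pdf (F.2)
    u = marks.sum(axis=0)                                   # coherent mark sum
    power[t] = u @ u                                        # instantaneous power |u|^2

print("mean power   :", power.mean())          # should approach S2
print("P(power > S2):", np.mean(power > S2))   # exponential law gives exp(-1), about 0.368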
F.2.1 Utility
Frequency domain PPPs that model the outputs of filters with nonoverlapped pass-
bands are independent. This follows from the interpretation of the PPPs as thinned
versions of the same PPP. (See the end of Section 2.8.) In contrast, for the usual
time domain model, the complex-valued outputs of the cells of a discrete Fourier
transform are asymptotically independent if the windows are not overlapped, and if
the input is a wideband stationary Gaussian signal.
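This independence can be illustrated directly. In the Python sketch below (an arbitrary homogeneous intensity and two arbitrary disjoint passbands, not taken from the text), the points of one frequency-domain PPP realization are split between two non-overlapping bands, and the resulting counts come out essentially uncorrelated.

import numpy as np

rng = np.random.default_rng(1)
rate, W, trials = 5.0, 10.0, 50_000        # intensity per unit frequency, total band width
counts = np.empty((trials, 2))
for t in range(trials):
    pts = rng.uniform(0.0, W, rng.poisson(rate * W))   # one homogeneous PPP realization
    counts[t] = [np.sum(pts < 3.0),                    # points in passband [0, 3)
                 np.sum((pts >= 6.0) & (pts < 9.0))]   # points in passband [6, 9)
print("sample correlation of the two passband counts:", np.corrcoef(counts.T)[0, 1])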
The marked PPP model of signal power spectrum supports the development of
algorithms based on the method of EM.
Glossary
Affine Sum Used in parameter estimation problems in which the intensity function
is of the form f 0 (x) + Σi f i (x ; θi ), where the estimated parameters are {θi }. The
word affine refers to the fact that no parameters are estimated for the term f 0 (x).
Augmented Target State Space Typically, a target state space S + = S ∪ φ com-
prising a continuous component S, such as S ⊂ Rn , and a discrete component φ
not in S. The points in S + represent mutually exclusive and exhaustive statistical
hypotheses about the target. The state φ is interpreted as the hypothesis that a target
generates measurements statistically indistinguishable from clutter, i.e., the target is
a “clutter target”. More generally, a finite or countable number of discrete hypothe-
ses can be used.
Bayes-Markov Filter A sequential estimation method that recursively determines the posterior density, or pdf, of the target state conditioned on the available data. Point estimates and some measure of the area of uncertainty (AOU) are extracted from the posterior density in different ways, depending on the application; however, point estimates and their AOUs characterize the posterior density only for the linear-Gaussian Kalman filter.
Binomial Point Process An i.i.d. point process in which the number of points in a
specified set R ⊂ S is binomially distributed with parameter equal to the integral
over R of a specified pdf on S. It provides an interesting contrast to the Poisson
point process.
Campbell’s Theorem A classic theorem (1909) that gives an explicit form for the
expected value of the random sum Σi f (xi ), where {xi } are the points of a realiza-
tion of a PPP. It is a special case of Slivnyak’s Theorem.
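The explicit form is $\mathrm{E}\bigl[\Sigma_i f(x_i)\bigr] = \int f(x)\,\lambda(x)\,\mathrm{d}x$ for a PPP with intensity $\lambda(x)$. A quick Monte Carlo check (a Python sketch; the intensity, test function, and interval are arbitrary illustrations, not from the text) is:

import numpy as np

rng = np.random.default_rng(2)
lam = lambda x: 3.0 * np.exp(-x)     # intensity on [0, 4]  (arbitrary choice)
f = lambda x: x ** 2                 # test function         (arbitrary choice)
a, b, lam_max, trials = 0.0, 4.0, 3.0, 50_000

sums = np.empty(trials)
for t in range(trials):
    # Sample the PPP on [a, b] by thinning a homogeneous PPP of rate lam_max.
    n = rng.poisson(lam_max * (b - a))
    x = rng.uniform(a, b, n)
    keep = rng.uniform(0.0, 1.0, n) < lam(x) / lam_max
    sums[t] = np.sum(f(x[keep]))     # the random sum over the retained points

grid = np.linspace(a, b, 100_001)
print("Monte Carlo E[sum f(x_i)] :", sums.mean())
print("Campbell integral of f*lam:", np.sum(f(grid) * lam(grid)) * (grid[1] - grid[0]))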
Clutter Point measurements that do not originate from any physical target under
track. Clutter can be persistent, as in ground clutter, and it can be statistical in nature,
that is, arise from the locations of threshold crossings of fluctuations in some ambi-
ent noise background.
Cramér-Rao Bound (CRB) A lower bound on the variance of any unbiased
parameter estimate. It is derived directly from the likelihood function of the data.
Data to Target Assignment Problem A problem that arises with tracking single
targets in clutter, and multiple targets either with or without clutter.
Dirac Delta Function Not really a function at all, but an operator that performs a
point evaluation of the integrand of an integral. Often defined as a limit of a sequence
of test functions.
Expectation-Maximization A method used to obtain ML and MAP parameter
estimation algorithms. It is especially well suited to likelihood functions of pro-
cesses that are sums or superpositions of other, simpler processes. Under broad con-
ditions, it is guaranteed convergent to a local maximum of the likelihood function.
Finite Point Process A geometrical distribution of random occurrences of finitely
many points, the number and locations of which are typically random. In general,
finite point processes are not Poisson point processes.
Fisher Information Matrix (FIM) The inverse of the Cramér-Rao Bound matrix.
Gaussian Mixture A Gaussian sum whose integral is one. Equivalently, a Gaussian
sum whose weights sum to one. Every Gaussian mixture is a pdf.
Gaussian Sum A weighted sum of multivariate Gaussian probability density func-
tions, where the weights are non-negative. Gaussian sums are not, in general, pdfs.
Generalized Functions These are not truly functions at all, but operators. The
concept of point masses leads to the need to define integrals over discrete sets
in a nontrivial way. One way to do this is measure theory, another is generalized
functions. The classic example is the Dirac delta function δ(x); see Test Functions
below. The classic book [71] by Lighthill is charming, short, and very readable.
Homogeneous Poisson Point Process A PPP whose intensity is a (non-negative)
constant.
Independent Increments A concept defined for stochastic processes X (t) which
states that the random variables X (b) − X (a) and X (d) − X (c) are independent
if the intervals (a, b) and (c, d) are disjoint.
Independent Scattering A concept defined for point processes which states that
the numbers and locations of points of a point process in two different sets are inde-
pendent if the sets are disjoint. (Sometimes called independent increments, despite
the confusion in terminology with stochastic processes.)
Intensity The defining parameter of a PPP. In general, intensity is the sum of a
nonnegative ordinary function and at most countably many Dirac delta functions.
Intensity Filter A multi-target tracking filter that recursively updates the intensity
function of a PPP approximation to the target state point process.
Intensity Function The intensity for orderly PPPs is an ordinary function, typically denoted by λ(x). Intuitively, λ(x) dx specifies the expected number of points of the process in the infinitesimal region dx centered at x.
processes need to be vigilant when reading to avoid confusing the Poisson distribu-
tion with the Poisson point process.)
Poisson’s Gambit A term that describes the act of modeling the number of trials
in a sequence of Bernoulli trials as Poisson distributed. The name pertains to
situations wherein the Poisson assumption is imposed by the modeler rather than
arising naturally in the application. The advantage of the assumption is, e.g., that the
numbers of occurrences of heads and tails in a series of coin flips are independent
under Poisson’s gambit.
Poisson Point Process (PPP) A special kind of point process characterized (param-
eterized) by an intensity function λ(x) that specifies both the number and probability
density of the locations of points. The number of points in a bounded set R is Poisson distributed with parameter $\mu = \int_R \lambda(x)\,\mathrm{d}x$. Conditioned on $n$, the points are i.i.d. in R with pdf $\lambda(x)/\mu$. Compare to Binomial point processes (BPP).
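This two-step characterization translates directly into a sampling recipe. The minimal Python sketch below (the intensity, interval, and bound are arbitrary illustrations, not from the text) draws one PPP realization on a bounded interval by first drawing the Poisson-distributed number of points and then placing them i.i.d. with pdf λ(x)/μ via acceptance-rejection.

import numpy as np

def sample_ppp(lam, a, b, lam_max, mu, rng):
    # One PPP realization on R = [a, b]:
    # (1) draw the number of points n ~ Poisson(mu), mu = integral of lam over R;
    # (2) place the n points i.i.d. on R with pdf lam(x)/mu (acceptance-rejection).
    n = rng.poisson(mu)
    pts = []
    while len(pts) < n:
        x = rng.uniform(a, b)
        if rng.uniform(0.0, lam_max) < lam(x):   # accept with probability lam(x)/lam_max
            pts.append(x)
    return np.array(pts)

rng = np.random.default_rng(3)
lam = lambda x: 2.0 + np.sin(x)                  # intensity on [0, 10] (arbitrary choice)
mu = 20.0 + (1.0 - np.cos(10.0))                 # integral of lam over [0, 10]
pts = sample_ppp(lam, 0.0, 10.0, 3.0, mu, rng)
print(len(pts), "points drawn; expected count is", mu)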
Positron Emission Tomography (PET) A widely used medical imaging method-
ology that is based on estimating the spatial intensity of positron decay. The well known
Shepp-Vardi algorithm (1982) is the basis of most, if not all, intensity estimation
algorithms.
Posterior Cramér-Rao Bound (PCRB) A lower bound on the variance of an unbi-
ased nonlinear tracking filter.
Probability Density Function (pdf) A function that represents the probability that
a realization x of a random variable X falls in the infinitesimal region dx.
Probability Hypothesis Density (PHD) An additive theory of Bayesian evidence
accrual first proposed by Stein and Winter [121].
PHD Filter A multitarget sequential tracking filter which avoids the data to tar-
get assignment problem by using a Poisson point process (PPP) to approximate
the multitarget state. The correspondence of data to individual target tracks is not
maintained. Target birth and clutter processes are assumed known a priori.
Radon-Nikodym Derivative A measure theoretic term that, with appropriate qual-
ifications and restricted to Rn , is another name for the likelihood ratio of the data
under two different hypotheses.
Random Sum See Campbell’s Theorem and Slivnyak’s Theorem.
Sequential Monte Carlo A generic name for particle methods.
Slivnyak’s Theorem An important theorem (1962) about the expected value of
a random sum that depends on how a point in a PPP realization relates to other
points in the same realization. The random sum is Σi f (xi , {x1 , . . . , xn } \ xi ).
Campbell’s Theorem is the special case f (xi , · ) ≡ f (xi ).
Single Photon Emission Computed Tomography (SPECT) A widely used
medical imaging technique for estimating the spatial intensity of radioisotope decay
based on multiple gamma (Anger) camera snapshots. The physics differs from PET
in that only one gamma photon arises from each decay.
Target An entity characterized by a point in a target state space. Typically, the
target state evolves sequentially, that is, its state changes over time (or other time-
like variable).
Target State Space A set whose elements completely characterize the properties
of a target that are of interest in an application. Typically, this set is the vector
space Rn , or some subset thereof, that represents target kinematic properties or other
properties, e.g., radar cross section. This space is sometimes augmented by discrete
attributes that represent specific categorical properties, e.g., target identity.
Test Functions A sequence of infinitely differentiable functions used in proofs involving generalized functions. For example, for the Dirac delta function δ(x), the test sequence is often taken to be the sequence of Gaussian pdfs $\mathcal{N}(x\,; 0, \sigma_n^2)$, where $\sigma_n \to 0$, so that for any continuous function f(x), $\int_{\mathbb{R}} f(x)\,\delta(x)\,\mathrm{d}x \equiv \lim_{\sigma_n \to 0} \int_{\mathbb{R}} f(x)\,\mathcal{N}(x\,; 0, \sigma_n^2)\,\mathrm{d}x = f(0)$.
Test Sets A collection of “simple” sets, e.g., intervals, that generate the Borel sets,
which in turn are used to define measurable sets. A common style of proof is to
demonstrate a result on a sufficiently rich class of test sets, and then invoke appro-
priate general limit theorems to extend the result to measurable sets.
Transmission Tomography A method of imaging the variation in spatial density
of an object known only from measurements of the line integrals of the density
function for a collection of straight lines.
List of Acronyms
References
20. J. de Leeuw and W. J. Heiser. Convergence of correction matrix algorithms for multidimen-
sional scaling. In J. C. Lingoes, editor, Geometric Representations of Relational Data, pages
735–752. Mathesis Press, Ann Arbor, MI, 1977.
21. J. de Leeuw and W. J. Heiser. Multidimensional scaling with restrictions on the configuration.
In P. R. Krishnaiah, editor, Multivariate Analysis, volume 5, pages 501–522. North-Holland,
Amsterdam, 1980.
22. A. H. Delaney and Y. Bresler. A fast and accurate iterative reconstruction algorithm for
parallel-beam tomography. IEEE Transactions on Image Processing, IP-5(5):740–753, 1996.
23. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1–39, 1977.
24. P. Diaconis. The Markov chain Monte Carlo revolution. Bulletin (New Series) of the
American Mathematical Society, 46:179–205, 2009.
25. S. W. Dufour. Intersections of Random Convex Regions. PhD thesis, Department of Statistics,
Stanford University, Stanford, CA, 1972.
26. B. Efron and D. V. Hinkley. Assessing the accuracy of the maximum likelihood estimator:
Observed versus expected Fisher information (with discussion). Biometrika, 65(3):457–487,
1978.
27. A. Einstein. On the method of theoretical physics. Philosophy of Science, 1(2):163–169,
1934.
28. C. L. Epstein. Introduction to the Mathematics of Medical Imaging. SIAM Press, Philadel-
phia, PA, second edition, 2007.
29. A. Erdélyi, editor. Higher Transcendental Functions, volume 2. Bateman Manuscript Project,
New York, 1953.
30. O. Erdinc, P. Willett, and Y. Bar-Shalom. A physical space approach for the probability
hypothesis density and cardinalized probability hypothesis density filters. In Proceedings
of the SPIE Conference on Signal Processing of Small Targets, Orlando, FL, volume 6236,
April 2006.
31. O. Erdinc, P. Willett, and Y. Bar-Shalom. The bin-occupancy filter and its connection to the
PHD filters. IEEE Transactions on Signal Processing, 57: 4232–4246, 2009.
32. K. J. Falconer. Applications of a result on spherical integration to the theory of convex sets.
The American Mathematical Monthly, 90(10):690–693, 1983.
33. P. Faure. Theoretical model of reverberation noise. Journal of the Acoustical Society of
America, 36(2):259–266, 1964.
34. J. A. Fessler. Statistical image reconstruction methods for transmission tomography. In
M. Sonka and J. M. Fitzpatrick, editors, Handbook of Medical Imaging, SPIE, Bellingham,
Washington, volume 2, pages 1–70, 2000.
35. J. A. Fessler and A. O. Hero. Space-alternating generalized expectation- maximization algo-
rithm. IEEE Transactions on Signal Processing, SP-42(10):2664–2677, 1994.
36. P. M. Fishman and D. L. Snyder. The statistical analysis of space-time point processes. IEEE
Transactions on Information Theory, IT-22:257–274, 1976.
37. D. Fränken, M. Schmidt, and M. Ulmke. “Spooky action at a distance” in the cardinalized
probability hypothesis density filter. IEEE Transactions on Aerospace and Electronic Sys-
tems, AES-45(4):1657–1664, October 2009.
38. K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function,
with application in pattern recognition. IEEE Transactions on Information Theory, IT-21(1):
32–40, January 1975.
39. I. I. Gikhman and A. V. Skorokhod. Introduction to the Theory of Random Processes. Dover,
Mineola, NY, Unabridged republication of 1969 edition, 1994.
40. J. R. Goldman. Stochastic point processes: Limit theorems. The Annals of Mathematical
Statistics, 38(3):771–779, 1967.
41. I. R. Goodman, R. P. S. Mahler, and H. T. Nguyen. Mathematics of Data Fusion. Kluwer,
Dordrecht, 1997.
42. G. R. Grimmett and D. D. Stirzaker. Probability and Random Processes. Oxford University
Press, Oxford, Third edition, 2001.
68. K. Lewin. The research center for group dynamics at Massachusetts Institute of Technology.
Sociometry, 8: 126–136, 1945.
69. T. A. Lewis. Finding the observed information matrix when using the EM algorithm. Journal
of the Royal Statistical Society, Series B (Methodological), 44(2):226–233, 1982.
70. R. M. Lewitt and S. Matej. Overview of methods for image reconstruction from projections
in emission computed tomography. Proceedings of the IEEE, 91(10):1588–1611, 2003.
71. M. J. Lighthill. Introduction to Fourier Analysis and Generalized Functions. Cambridge
University Press, London, 1958.
72. L. B. Lucy. An iterative technique for the rectification of observed distributions. The Astro-
nomical Journal, 79:745–754, 1974.
73. T. E. Luginbuhl. Estimation of General, Discrete-Time FM Signals. PhD thesis, Department
of Electrical Engineering, University of Connecticut, Storrs, CT, 1999.
74. R. P. S. Mahler. Multitarget Bayes filtering via first-order multitarget moments. IEEE Trans-
actions on Aerospace and Electronic Systems, AES-39:1152–1178, 2003.
75. R. P. S. Mahler. PHD filters of higher order in target number. IEEE Transactions on
Aerospace and Electronic Systems, AES-43:1523–1543, 2007.
76. R. P. S. Mahler. Statistical Multisource-Multitarget Information Fusion. Artech House,
Boston, MA, 2007.
77. B. Matérn. Spatial variation. Meddelanden fran Statens Skogsforskningsinstitut (Communi-
cations of the State Forest Research Institute), 49(5):163–169, 1960.
78. B. Matérn. Spatial Variation. Number 36 in Lecture Notes in Statistics. Springer, New York,
second edition, 1986.
79. G. Matheron. Random Sets and Integral Geometry. John Wiley & Sons, New York, 1975.
80. G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, New York,
1997.
81. G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley, New York, 2000.
82. M. I. Miller, D. L. Snyder, and T. R. Miller. Maximum-likelihood reconstruction for
single-photon emission computed-tomography. IEEE Transactions on Nuclear Science,
NS-32(1):769–778, 1985.
83. P. Mitra. Spectral analysis: Point processes, August 16, 2006. http://wiki.neufo.org/
neufo/jsp/Wiki?ParthaMitra.
84. J. Møller and R. P. Waagepetersen. Statistical Inference and Simulation for Spatial Point
Processes. Chapman & Hall/CRC, Boca Raton, FL, 2004.
85. M. R. Morelande, C. M. Kreucher, and K. Kastella. A Bayesian approach to multiple target
detection and tracking. IEEE Transactions on Signal Processing, SP-55:1589–1604, 2007.
86. N. Nandakumaran, T. Kirubarajan, T. Lang, and M. McDonald. Gaussian mixture probability
hypothesis density smoothing with multiple sensors. IEEE Transactions on Aerospace and
Electronic Systems. Accepted for publication, 2010.
87. N. Nandakumaran, T. Kirubarajan, T. Lang, M. McDonald, and K. Punithakumar. Multitarget
tracking using probability hypothesis density smoothing. IEEE Transactions on Aerospace
and Electronic Systems. submitted, March 2008.
88. J. K. Nelson, E. G. Rowe, and G. C. Carter. Detection capabilities of randomly-deployed
sensor fields. International Journal of Distributed Sensor Networks, 5(6):708 – 728, 2009.
89. R. Niu, P. Willett, and Y. Bar-Shalom. Matrix CRLB scaling due to measurements of uncer-
tain origin. IEEE Transactions on Signal Processing, SP-49:1325–1335, 2001.
90. B. Øksendal. Stochastic Differential Equations, An Introduction with Applications. Springer,
Berlin, Fourth edition, 1995.
91. J. A. O’Sullivan and J. Benac. Alternating minimization algorithms for transmission tomog-
raphy. IEEE Transactions on Medical Imaging, MI-26:283–297, 2007.
92. C. Palm. Intensitätsschwankungen im Fernsprechverkehr. Ericsson Technics, 44:1–189, 1943.
93. A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New
York, 1965.
94. M. D. Penrose. On k-connectivity for a geometric random graph. Random Structures and
Algorithms, 15(2):145–164, 1999.
121. M. C. Stein and C. L. Winter. An additive theory of probabilistic evidence accrual, 1993.
Report LA-UR-93-3336, Los Alamos National Laboratories.
122. L. D. Stone, T. L. Corwin, and C. A. Barlow. Bayesian Multiple Target Tracking. Artech
House, Inc., Norwood, MA, 1999.
123. D. Stoyan, W. S. Kendall, and Joseph Mecke. Stochastic Geometry and its Applications.
Wiley, Chichester, second edition, 1995.
124. R. L. Streit. Multisensor multitarget intensity filter. In Proceedings of the International
Conference on Information Fusion, Cologne, Germany. ISIF, pp 1694–1701 30 June–3 July
2008.
125. R. L. Streit. PHD intensity filtering is one step of a MAP estimation algorithm for positron
emission tomography. In Proceedings of the International Conference on Information
Fusion, Seattle. ISIF, pp 308–315 6 July – 9 July 2009.
126. R. L. Streit and T. E. Luginbuhl. Maximum likelihood method for probabilistic multi-
hypothesis tracking. In Proceedings of the SPIE Conference on Signal and Data Processing
of Small Targets, volume 2235, pages 394–405, Orlando, FL, 1991.
127. R. L. Streit and T. E. Luginbuhl. A probabilistic multi-hypothesis tracking algorithm without
enumeration and pruning. In Proceedings of the Sixth Joint Service Data Fusion Symposium,
pages 1015–1024, Laurel, Maryland, 1993.
128. R. L. Streit and T. E. Luginbuhl. Probabilistic multi-hypothesis tracking, 1995. Technical
Report 10,428, Naval Undersea Warfare Center, Newport, RI.
129. R. L. Streit and Tod E. Luginbuhl. Estimation of Gaussian mixtures with rotationally invari-
ant covariance matrices. Communications in Statistics: Theory and Methods, 26:2927–2944,
1997.
130. R. L. Streit and L. D. Stone. Bayes derivation of multitarget intensity filters. In Proceedings
of the International Conference on Information Fusion, Cologne, Germany. ISIF, pp.1686–
1693 30 June–3 July 2008.
131. M. Ter-Pogossian. The Physical Aspects of Diagnostic Radiology. Hoeber Medical Division,
Harper and Rowe, New York, 1967.
132. H. R. Thompson. Distribution of distance to Nth neighbour in a population of randomly
distributed individuals. Ecology, 37:391–394, 1956.
133. P. Tichavský, C. H. Muravchic, and A. Nehorai. Posterior Cramér-Rao bounds for discrete-
time nonlinear filtering. IEEE Transactions on Signal Processing, SP-46:1386–1396, 1998.
134. H. L. Van Trees. Detection, Estimation, and Modulation Theory—Part I. Wiley, New York,
1968.
135. H. L. Van Trees and Kristine L. Bell, editors. Bayesian Bounds for Parameter Estimation
and Nonlinear Filtering and Tracking. Wiley, 2007.
136. M. N. M. van Lieshout. Markov Point Processes and Their Applications. Imperial College
Press, London, 2000.
137. B.-N. Vo, S. Singh, and A. Doucet. Sequential Monte Carlo methods for multi-target filtering
with random finite sets. IEEE Transactions on Aerospace and Electronic Systems, AES-
41:1224–1245, 2005.
138. B.-T. Vo, B.-N. Vo, and A. Cantoni. The cardinalized probability hypothesis density filter
for linear Gaussian multi-target models. In Proceedings of the 40th Annual Conference on
Information Sciences and Systems, Princeton, NJ, pp.681–686 March 22–24 2006.
139. B.-T. Vo, B.-N. Vo, and A. Cantoni. Analytic implementations of the cardinalized probability
hypothesis density filter. IEEE Transactions on Signal Processing, SP-55:3553–3567, 2007.
140. W. G. Warren. The center-satellite concept as a basis for ecological sampling (with discus-
sion). In G. P. Patil, E. C. Pielou, and W. E. Waters, editors, Statistical Ecology, volume 2,
pages 87–118. Pennsylvania State University Press, University Park, PA, 1971. (ISBN 0-271-
00112-7).
141. Y. Watanabe. Derivation of linear attenuation coefficients from CT numbers for low-energy
photons. Physics in Medicine Biology, 44:2201–2211, 1999.
142. T. A. Wettergren and M. J. Walsh. Localization accuracy of track-before-detect search strate-
gies for distributed sensor networks. EURASIP Journal on Advances in Signal Processing,
Article ID 264638:15, 2008.
Index

A
Acceptance-rejection procedure, 14, 31
Affine Gaussian sums, 72, 78, 103
Ambient noise, 221
Anisotropy, 195
Attenuation, 135
Augmented space, 50, 53

B
Barrier problems, 196
Bayes-Markov filter, 235
Bayesian data splitting, 70
Bayesian filtering, 233
Bayesian method, 80
Bernoulli thinning, 30
Bernoulli trial, 36
Bertrand's paradox, 191
Binomial point process, 7, 20
Boolean process, 6

C
Calculus of variations, 126
Campbell's Theorem, 23
Cauchy-Schwarz, 84
Central Limit Theorem, 29
Characteristic function, 24
Cluster process, 210
Coloring Theorem, 38, 205
Complete graph, 190
Compton effect, 124
Conditional mean equation, 229
Convex combination, 165
Count record data, 3
Coupling from the past, 222
Coverage, 6, 190
Cox process, 213
Cramér-Rao bound, 3, 81, 142, 220
Crofton's Theorem, 196

D
Delesse's principle, 198
Dirac delta function, 13, 45, 94, 258
Dirichlet density, 80
Discrete spaces, 50
Discrete-continuous integral, definition, 53
Discrete-continuous spaces, 50
Distance distributions, 180

E
Ensemble average, 19, 149
Erdös-Rényi, 187
Expectation, 18
Expectation of a random sum, 21
Expectation of outer product of random sums, 23
Expectation-Maximization, 3, 63, 112, 164, 223
Extreme value distributions, 184

F
Field of view, 198
Filtered process, 207
Finite point processes, 7
First moment intensity, 148, 244
Fisher information matrix, 142, 207
Fourier reconstruction, 112
Fourier transform, 24
Funk-Hecke Theorem, 112

G
Gamma (Anger) camera, 124
Gating, 82
Gaussian mixtures, 3, 80
Gaussian sum, 3, 69, 168
Generalized EM, 225
Generalized functions, 13, 44, 258
Geometric random graph, 4, 187
Geometry, stochastic, 6
Germ-grain model, 6
Gibbs phenomenon, 143
Gibbs process, 216
Grand canonical ensemble, 216
Grenander's Method of Sieves, 143, 144, 169

H
Hard core process, 208
Harmonic spectrum, 79
Hausdorff space, 50
Heteroscedastic sums, 78
Histogram data, 18, 35
Histograms, 52
Homogeneous PPP, 13, 189
Homoscedastic sums, 78
Homothetic sum, 79
Hyperparameters, 80

I
Importance function, 15
Independent increments, 8, 41
Independent scattering, 8, 33
Inevitability of Poisson distribution, 38
Innovations, 236
Intensity, 12, 209
Inverse probability mapping on the real line, 43
Isopleth, 161
Iterative majorization, 5, 227
Ito differential equation, 215

J
Janossy density, 154
Janossy density function, 244
Joint detection and tracking, 50

K
k-connectivity, 189
k-coverage, 194
K-distribution, 221
Kalman filter, 235
Kernel estimator, 164

L
Laplace functional, 27
Lattices, 50
Level curves, 184
Likelihood function, histogram data, 35
Likelihood function, ordered data, 18
Likelihood function, unordered data, 17
Luginbuhl's harmonic spectrum, 79

M
Machine learning, 234
Marked processes, 204
Marking Theorem, 205
Markov chain, 216
Markov Chain Monte Carlo, 203, 221
Markov modulated Poisson process, 216
Markov point process, 216
Markov point processes, 220
Markov transition function, 46
Matérn hard core process, 209
Maximum likelihood, 3, 57
Maximum a posteriori, 3
Mean shift algorithm, 164
Measurement process, 47
Microstates, 150
Microtargets, 150, 175
Moments of a random sum, 24
Moore's Law, 163
Multi-sensor intensity filter, 172
Multinomial thinning (coloring), 38
Multinomial trials, 38
Multisensor intensity filter, 134, 173
Multisets, 8, 12
Multitarget tracking, 237
Multivariate Gaussian density, expression for, 15

N
Nearest neighbor graph, 187
Nearest neighbor tracking, 160
Negative binomial distribution, 221
Neural spike trains, 213
Neyman-Scott cluster process, 210
Nonhomogeneous PPP, definition, 13
Nonlinear transformation, 42

O
Observed information matrix, 87, 100, 228
Olber's paradox, 45

P
Parameter tying, 78
Pareto optimization, 190
Particle method, 161
Partition function, 234
Percolation, 221
Photo-multiplier tube, 124
Photoelectric effect, 60, 124
Photoemission, 60
Poisson approximation to binomial distribution, 195
Poisson cluster process, 210
Poisson gambit, 37