An Introduction to Sparse Stochastic Processes
Providing a novel approach to sparsity, this comprehensive book presents the theory of
stochastic processes that are ruled by linear stochastic differential equations and that
admit a parsimonious representation in a matched wavelet-like basis.
Two key themes are the statistical property of infinite divisibility, which leads to two
distinct types of behavior – Gaussian and sparse – and the structural link between linear
stochastic processes and spline functions, which is exploited to simplify the mathematical analysis. The core of the book is devoted to investigating sparse processes, including a complete description of their transform-domain statistics. The final part develops practical signal-processing algorithms that are based on these models, with special emphasis on biomedical image reconstruction.
This is an ideal reference for graduate students and researchers with an interest in
signal/image processing, compressed sensing, approximation theory, machine learning,
or statistics.
MICHAEL UNSER is Professor and Director of the Biomedical Imaging Group at the École
Polytechnique Fédérale de Lausanne (EPFL), Switzerland. He is a member of the Swiss
Academy of Engineering Sciences, a Fellow of EURASIP, and a Fellow of the IEEE.
POUYA D. TAFTI is a data scientist currently residing in Germany, and a former member
of the Biomedical Imaging Group at EPFL, where he conducted research on the theory
and applications of probabilistic models for data.
“Over the last twenty years, sparse representation of images and signals became a very
important topic in many applications, ranging from data compression, to biological
vision, to medical imaging. The book Sparse Stochastic Processes by Unser and Tafti is
the first work to systematically build a coherent framework for non-Gaussian processes
with sparse representations by wavelets. Traditional concepts such as Karhunen–Loève analysis of Gaussian processes are nicely complemented by the wavelet analysis of Lévy processes which is constructed here. The framework presented here has a classical feel
while accommodating the innovative impulses driving research in sparsity. The book
is extremely systematic and at the same time clear and accessible, and can be recom-
mended both to engineers interested in foundations and to mathematicians interested in
applications.”
David Donoho, Stanford University
“This is a fascinating book that connects the classical theory of generalised functions (distributions) to the modern sparsity-based view on signal processing, as well as stochastic processes. Some of the early motivations given by I. Gelfand on the importance of generalised functions came from physics and, indeed, signal processing and sampling. However, this is probably the first book that successfully links the more abstract theory with modern signal processing. A great strength of the monograph is that it considers both the continuous and the discrete model. It will be of interest to mathematicians and engineers having appreciations of mathematical and stochastic views of signal processing.”
Anders Hansen, University of Cambridge
An Introduction to Sparse Stochastic Processes
MICHAEL UNSER and POUYA D. TAFTI
École Polytechnique Fédérale de Lausanne
University Printing House, Cambridge CB2 8BS, United Kingdom
www.cambridge.org
Information on this title: www.cambridge.org/9781107058545
© Cambridge University Press 2014
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2014
Printed in the United Kingdom by Clays, St Ives plc
A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication Data
Unser, Michael A., author.
An introduction to sparse stochastic processes / Michael Unser and Pouya Tafti, École polytechnique fédérale,
Lausanne.
pages cm
Includes bibliographical references and index.
ISBN 978-1-107-05854-5 (Hardback)
1. Stochastic differential equations. 2. Random fields. 3. Gaussian processes. I. Tafti, Pouya, author.
II. Title.
QA274.23.U57 2014
519.2/3–dc23    2014003923
ISBN 978-1-107-05854-5 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
To Gisela, Lucia, Klaus, and Murray
Contents
1 Introduction 1
1.1 Sparsity: Occam’s razor of modern signal processing? 1
1.2 Sparse stochastic models: the step beyond Gaussianity 2
1.3 From splines to stochastic processes, or when Schoenberg meets Lévy 5
1.3.1 Splines and Legos revisited 5
1.3.2 Higher-degree polynomial splines 8
1.3.3 Random splines, innovations, and Lévy processes 9
1.3.4 Wavelet analysis of Lévy processes and M-term approximations 12
1.3.5 Lévy’s wavelet-based synthesis of Brownian motion 15
1.4 Historical notes: Paul Lévy and his legacy 16
12 Conclusion 326
References 347
Index 363
Preface
In the years since 2000, there has been a significant shift in paradigm in signal processing, statistics, and applied mathematics that revolves around the concept of sparsity and the search for “sparse” representations of signals. Early signs of this (r)evolution go back to the discovery of wavelets, which have now superseded classical Fourier techniques in a number of applications. The other manifestation of this trend is the emergence of data-processing schemes that minimize an $\ell_1$ norm as opposed to the squared $\ell_2$ norm associated with the traditional linear methods. A highly popular
research topic that capitalizes on those ideas is compressed sensing. It is the quest for
a statistical framework that would support this change of paradigm that led us to the
writing of this book.
The cornerstone of our formulation is the classical innovation model, which is equivalent to the specification of stochastic processes as solutions of linear stochastic differential equations (SDE). The non-standard twist here is that we allow for non-Gaussian driving terms (white Lévy noise) which, as we shall see, has a dramatic effect on the type of signal being generated. A fundamental property, hinted at in the title of the book, is that the non-Gaussian solutions of such SDEs admit a sparse representation in an adapted wavelet-like basis. While a sizable part of the present material is an outgrowth
of our own research, it is founded on the work of Lévy (1930) and Gelfand (arguably,
the second most famous Soviet mathematician after Kolmogorov), who derived general
functional tools and results that are hardly known by practitioners but, as we argue in
the book, are extremely relevant to the issue of sparsity. The other important source
of inspiration is spline theory and the observation that splines and stochastic processes
are ruled by the same differential equations. This is the reason why we opted for the
innovation approach which facilitates the transposition of analytical techniques from
one field to the other. While the formulation requires advanced mathematics that are
carefully explained in the book, the underlying model has a strong engineering appeal
since it constitutes the natural extension of the traditional filtered-white-noise interpretation of a Gaussian stationary process.
The book assumes that the reader has a good understanding of linear systems
(ordinary differential equations, convolution), Hilbert spaces, generalized functions
(i.e., inner products, Dirac impulses, linear operators), the Fourier transform, basic
statistical signal processing, and (multivariate) statistics (probability density and characteristic functions). By contrast, there is no requirement for prior knowledge of splines, stochastic differential equations, or advanced functional analysis (function spaces, Bochner’s theorem, operator theory, singular integrals) since these topics are treated in a self-contained fashion.
Several people have had a crucial role in the genesis of this book. The idea of defining sparse stochastic processes originated during the preparation of a talk for Martin Vetterli’s 50th birthday (which coincided with the anniversary of the launching of Sputnik) in an attempt to build a bridge between his signals with a finite rate of innovation
and splines. We thank him for his long-time friendship and for convincing us to under-
take this writing project. We are grateful to our former collaborator, Thierry Blu, for
his precious help in the elucidation of the functional link between splines and stochastic
processes. We are extremely thankful to Arash Amini, Julien Fageot, Pedram Pad, Qiyu
Sun, and John-Paul Ward for many helpful discussions and their contributions to mathematical results. We are indebted to Emrah Bostan, Ulugbek Kamilov, Hagai Kirshner,
Masih Nilchian, and Cédric Vonesch for turning the theory into practice and for running
the signal- and image-processing experiments described in Chapters 10 and 11. We are
most grateful to Philippe Thévenaz for his intelligent editorial advice and his spotting of
multiple errors and inconsistencies, while we take full responsibility for the remaining
ones. We also thank Phil Meyler, Sarah Marsh and Gaja Poggiogalli from Cambridge
University Press, as well as John King for his careful copy-editing.
The authors also acknowledge very helpful and stimulating discussions with Ben
Adcock, Emmanuel Candès, Volkan Cevher, Robert Dalang, Mike Davies, Christine
De Mol, David Donoho, Pier-Luigi Dragotti, Michael Elad, Yonina Eldar, Jalal Fadili,
Mario Figueiredo, Vivek Goyal, Rémy Gribonval, Anders Hansen, Nick Kingsbury,
Gitta Kutyniok, Stamatis Lefkimmiatis, Gabriel Peyré, Robert Nowak, Jean-Luc Starck, and Dimitri Van De Ville, as well as a number of other researchers involved in the field. The European Research Council (ERC) and the Swiss National Science Foundation provided partial support throughout the writing of the book.
Notation
Abbreviations
ADMM Alternating-direction method of multipliers
AL Augmented Lagrangian
AR Autoregressive
ARMA Autoregressive moving average
AWGN Additive white Gaussian noise
BIBO Bounded input, bounded output
CAR Continuous-time autoregressive
CARMA Continuous-time autoregressive moving average
CCS Consistent cycle spinning
DCT Discrete cosine transform
fBm Fractional Brownian motion
FBP Filtered backprojection
FFT Fast Fourier transform
FIR Finite impulse response
FISTA Fast iterative shrinkage/thresholding algorithm
ICA Independent-component analysis
id Infinitely divisible
i.i.d. Independent identically distributed
IIR Infinite impulse response
ISTA Iterative shrinkage/thresholding algorithm
JPEG Joint Photographic Experts Group
KLT Karhunen–Loève transform
LMMSE Linear minimum-mean-square error
LPC Linear predictive coding
LSI Linear shift-invariant
MAP Maximum a posteriori
MMSE Minimum-mean-square error
MRI Magnetic resonance imaging
PCA Principal-component analysis
pdf Probability density function
PSF Point-spread function
ROI Region of interest
SαS Symmetric-alpha-stable
SDE Stochastic differential equation
SNR Signal-to-noise ratio
WSS Wide-sense stationary
Sets
N, Z+ Non-negative integers, including 0
Z Integers
R Real numbers
R+ Non-negative real numbers
C Complex numbers
Rd d-dimensional Euclidean space
Zd d-dimensional integers
Various notation
$\mathrm{j}$ Imaginary unit such that $\mathrm{j}^2 = -1$
$\lceil x \rceil$ Ceiling: smallest integer at least as large as x
$\lfloor x \rfloor$ Floor: largest integer not exceeding x
$(x_1 : x_n)$ n-tuple $(x_1, x_2, \dots, x_n)$
$\|f\|$ Norm of the function f (see Section 3.1.2)
$\|f\|_{L_p}$ $L_p$-norm of the function f (in the sense of Lebesgue)
$\|a\|_{\ell_p}$ $\ell_p$-norm of the sequence a
$\langle \varphi, s \rangle$ Scalar (or duality) product
$\langle f, g \rangle_{L_2}$ $L_2$ inner product
$f^\vee$ Reversed signal: $f^\vee(r) = f(-r)$
$(f * g)(r)$ Continuous-domain convolution
$(a * b)[n]$ Discrete-domain convolution
$\hat{\varphi}(\omega)$ Fourier transform of $\varphi$: $\int_{\mathbb{R}^d} \varphi(r)\, e^{-\mathrm{j}\langle \omega, r \rangle}\, \mathrm{d}r$
$\hat{f} = \mathcal{F}\{f\}$ Fourier transform of f (classical or generalized)
$f = \mathcal{F}^{-1}\{\hat{f}\}$ Inverse Fourier transform of $\hat{f}$
$\overline{\mathcal{F}}\{f\}(\omega) = \mathcal{F}\{f\}(-\omega)$ Conjugate Fourier transform of f
Spaces
$\mathcal{X}, \mathcal{Y}$ Generic vector spaces (normed or nuclear)
$L_2(\mathbb{R}^d)$ Finite-energy functions: $\int_{\mathbb{R}^d} |f(r)|^2\, \mathrm{d}r < \infty$
Operators
$\mathrm{Id}$ Identity
$\mathrm{D} = \frac{\mathrm{d}}{\mathrm{d}t}$ Derivative
$\mathrm{D_d}$ Finite difference (discrete derivative)
$\mathrm{D}^N$ Nth-order derivative
$\partial^{\mathbf{n}}$ Partial derivative of order $\mathbf{n} = (n_1, \dots, n_d)$
$\mathrm{L}$ Whitening operator (LSI)
$\hat{L}(\omega)$ Frequency response of L (Fourier multiplier)
$\rho_{\mathrm{L}}$ Green’s function of L
$\mathrm{L}^*$ Adjoint of L such that $\langle \varphi_1, \mathrm{L}\varphi_2 \rangle = \langle \mathrm{L}^*\varphi_1, \varphi_2 \rangle$
$\mathrm{L}^{-1}$ Right inverse of L such that $\mathrm{L}\mathrm{L}^{-1} = \mathrm{Id}$
$h(r_1, r_2)$ Generalized impulse response of $\mathrm{L}^{-1}$
$\mathrm{L}^{-1*}$ Left inverse of $\mathrm{L}^*$ such that $(\mathrm{L}^{-1*})\mathrm{L}^* = \mathrm{Id}$
$\mathrm{L_d}$ Discrete counterpart of L
$\mathcal{N}_{\mathrm{L}}$ Null space of L
$\mathrm{P}_\alpha$ First-order differential operator: $\mathrm{D} - \alpha \mathrm{Id}$, $\alpha \in \mathbb{C}$
$\mathrm{P}_{(\alpha_1:\alpha_N)}$ Differential operator of order N: $\mathrm{P}_{\alpha_1} \circ \cdots \circ \mathrm{P}_{\alpha_N}$
$\Delta_\alpha$ First-order weighted difference
$\Delta_{(\alpha_1:\alpha_N)}$ Nth-order weighted differences: $\Delta_{\alpha_1} \circ \cdots \circ \Delta_{\alpha_N}$
$\partial_\tau^\gamma$ Fractional derivative of order $\gamma \in \mathbb{R}^+$ and phase $\tau$
$(-\Delta)^{\frac{\gamma}{2}}$ Fractional Laplacian of order $\gamma \in \mathbb{R}^+$
$\mathrm{I}_p^{\gamma*}$ $L_p$-stable left inverse of $(-\Delta)^{\frac{\gamma}{2}}$
Probability
X, Y Generic scalar random variables
$\mathscr{P}_X$ Probability measure on $\mathbb{R}$ of X
$p_X(x)$ Probability density function (univariate)
$\Phi_X(x)$ Potential function: $-\log p_X(x)$
$\mathrm{prox}_X(x, \lambda)$ Proximal operator
$p_{\mathrm{id}}(x)$ Infinitely divisible probability law
$\mathbb{E}\{\cdot\}$ Expected-value operator
$m_n$ nth-order moment: $\mathbb{E}\{X^n\}$
$\kappa_n$ nth-order cumulant
$\hat{p}_X(\omega)$ Characteristic function of X: $\mathbb{E}\{e^{\mathrm{j}\omega X}\}$
$f(\omega)$ Lévy exponent: $\log \hat{p}_{\mathrm{id}}(\omega)$
$v(a)$ Lévy density
$p_{(X_1:X_N)}(\mathbf{x})$ Multivariate probability density function
$\hat{p}_{(X_1:X_N)}(\boldsymbol{\omega})$ Multivariate characteristic function
$m_{\mathbf{n}}$ Moment with multi-index $\mathbf{n} = (n_1, \dots, n_N)$
$\kappa_{\mathbf{n}}$ Cumulant with multi-index $\mathbf{n}$
$H_{(X_1:X_N)}$ Differential entropy
$I(X_1, \dots, X_N)$ Mutual information
$D(p \| q)$ Kullback–Leibler divergence
1 Introduction

1.1 Sparsity: Occam’s razor of modern signal processing?

The hypotheses of Gaussianity and stationarity play a central role in the standard statistical formulation of signal processing. They fully justify the use of the Fourier transform
as the optimal signal representation and naturally lead to the derivation of optimal linear
filtering algorithms for a large variety of statistical estimation tasks. This classical view
of signal processing is elegant and reassuring, but it is not at the forefront of research
anymore.
Starting with the discovery of the wavelet transform in the late 1980s [Dau88,Mal89],
researchers in signal processing have progressively moved away from the Fourier
transform and have uncovered powerful alternatives. Consequently, they have ceased modeling signals as Gaussian stationary processes and have adopted a more deterministic, approximation-theoretic point of view. The key developments that are presently
reshaping the field, and which are central to the theory presented in this book, are
summarized below.
• Novel transforms and dictionaries for the representation of signals. New redundant
and non-redundant representations of signals (wavelets, local cosine, curvelets) have
emerged since the mid 1990s and have led to better algorithms for data compression,
data processing, and feature extraction. The most prominent example is the wavelet-based JPEG-2000 standard for image compression [CSE00], which outperforms the widely-used JPEG method based on the DCT (discrete cosine transform). Another
illustration is wavelet-domain image denoising, which provides a good alternative to
more traditional linear filtering [Don95]. The various dictionaries of basis functions
that have been proposed so far are tailored to specific types of signals; there does not
appear to be one that fits all.
• Sparsity as a new paradigm for signal processing. At the origin of this new trend
is the key observation that many naturally occurring signals and images – in particular, the ones that are piecewise-smooth – can be accurately reconstructed from a “sparse” wavelet expansion that involves many fewer terms than the original
number of samples [Mal98]. The concept of sparsity has been systematized and
extended to other transforms, including redundant representations (a.k.a. frames); it
is at the heart of recent developments in signal processing. Sparse signals are easy
to compress and to denoise by simple pointwise processing (e.g., shrinkage) in the
transformed domain. Sparsity provides an equally powerful framework for dealing
1.2 Sparse stochastic models: the step beyond Gaussianity
While the recent developments listed above are truly remarkable and have resulted in significant algorithmic advances, the overall picture and understanding is still far from being complete. One limiting factor is that the current formulations of compressed sensing and sparse-signal recovery are fundamentally deterministic. By drawing on the analogy with the classical linear theory of signal processing, where there is an equivalence between quadratic energy-minimization techniques and minimum-mean-square-error (MMSE) estimation under the Gaussian hypothesis, there is a good chance that further progress is achievable by adopting a complementary statistical-modeling point of view. 1 The crucial ingredient that is required to guide such an investigation is a sparse counterpart to the classical family of Gaussian stationary processes (GSP). This
book focuses on the formulation of such a statistical framework, which may be aptly
qualified as the next step after Gaussianity under the functional constraint of linearity.
In light of the elements presented in the introduction, the basic requirements for a
comprehensive theory of sparse stochastic processes are as follows:
• Backward compatibility. There is a large body of literature and methods based on the
modeling of signals as realizations of GSP. We would like the corresponding iden-
tification, linear filtering, and reconstruction algorithms to remain applicable, even
though they obviously become suboptimal when the Gaussian hypothesis is violated.
This calls for an extended formulation that provides the same control of the correlation structure of the signals (second-order moments, Fourier spectrum) as the classical
theory does.
• Continuous-domain formulation. The proper interpretation of qualifying terms
such as “piecewise-smooth,” “translation-invariant,” “scale-invariant,” “rotation-
invariant” calls for continuous-domain models of signals that are compatible with
the conventional (finite-dimensional) notion of sparsity. Likewise, if we intend to
optimize or possibly redesign the signal-acquisition system as in generalized sampling and compressed sensing, the very least is to have a model that characterizes the
information content prior to sampling.
• Predictive power. Among other things, the theory should be able to explain why
wavelet representations can outperform the older Fourier-related types of decompositions, including the KLT, which is optimal from the classical perspective of variance
concentration.
• Ease of use. To have practical usefulness, the framework should allow for the derivation of the (joint) probability distributions of the signal in any transformed domain. This calls for a linear formulation with the caveat that it needs to accommodate non-Gaussian distributions. In that respect, the best thing beyond Gaussianity is infinite divisibility.
1 It is instructive to recall the fundamental role of statistical modeling in the development of traditional signal processing. The standard tools of the trade are the Fourier transform, Shannon-type sampling, linear filtering, and quadratic energy-minimization techniques. These methods are widely used in practice: they are powerful, easy to deploy, and mathematically convenient. The important conceptual point is that they are justifiable based on the theory of Gaussian stationary processes (GSP). Specifically, one can invoke the following optimality results:
• The Fourier transform as well as several of its real-valued variants (e.g., DCT) are asymptotically equivalent to the Karhunen–Loève transform (KLT) for the whole class of GSP. This supports the use of sinusoidal transforms for data compression, data processing, and feature extraction. The underlying notion of optimality here is energy compaction, which implies decorrelation. Note that the decorrelation is equivalent to independence in the Gaussian case only.
• Optimal filters. Given a series of linear measurements of a signal corrupted by noise, one can readily specify its optimal reconstruction (LMMSE estimator) under the general Gaussian hypothesis. The corresponding algorithm (Wiener filter) is linear and entirely determined by the covariance structure of the signal and noise. There is also a direct connection with variational reconstruction techniques since the Wiener solution can also be formulated as a quadratic energy-minimization problem (Gaussian MAP estimator) (see Section 10.2.2).
• Optimal sampling/interpolation strategies. While this part of the story is less known, one can also invoke estimation-theoretic arguments to justify a Shannon-type, constant-rate sampling, which ensures a minimum loss of information for a large class of predominantly lowpass GSP [PM62, Uns93]. This is not totally surprising since the basis functions of the KLT are inherently bandlimited. One can also derive minimum-mean-square-error interpolators for GSP in general. The optimal signal-reconstruction algorithm takes the form of a hybrid Wiener filter whose input is discrete (signal samples) and whose output is a continuously defined signal that can be represented in terms of generalized B-spline basis functions [UB05b].
Such models provide a very direct link with the theory of linear systems, which allows for the use of standard engineering notions such as the impulse and frequency responses of a system.
1.3 From splines to stochastic processes, or when Schoenberg meets Lévy

We shall start our journey by making an interesting connection between splines, which are deterministic objects with some inherent sparsity, and Lévy processes, with a special focus on the compound-Poisson process, which constitutes the archetype of a
sparse stochastic process. The key observation is that both categories of signals –
namely, deterministic and random – are ruled by the same differential equation.
They can be generated via the proper integration of an “innovation” signal that carries
all the necessary information. The fun is that the underlying differential system is only
marginally stable, which requires the design of a special antiderivative operator. We
then use the close relationship between splines and wavelets to gain insight into the
ability of wavelets to provide sparse representations of such signals. Specifically, we
shall see that most non-Gaussian Lévy processes admit a better M-term representation in the Haar wavelet basis than in the classical Karhunen–Loève transform (KLT),
which is usually believed to be optimal for data compression. The explanation for this
counter-intuitive result is that we are breaking some of the assumptions that are implicit
in the proof of optimality of the KLT.
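To make this concrete, the following rough numerical sketch (our own, not the book's code; the helper names compound_poisson and m_term_error are hypothetical) reproduces the flavor of the experiment: it measures the M-term approximation error of a compound-Poisson path in the Haar basis against the DCT, which the text invokes as the practical stand-in for the Gaussian/KLT solution. It assumes the numpy, scipy, and PyWavelets packages.

```python
# A minimal sketch of the M-term approximation experiment (assumed setup).
import numpy as np
from scipy.fft import dct, idct
import pywt

rng = np.random.default_rng(0)

def compound_poisson(n, rate=0.1):
    """Piecewise-constant Levy process: integrated stream of random impulses."""
    jumps = rng.random(n) < rate              # Bernoulli stand-in for Poisson jumps
    return np.cumsum(jumps * rng.standard_normal(n))

def m_term_error(s, M, transform, inverse):
    c = transform(s)
    idx = np.argsort(np.abs(c))[:-M]          # indices of all but the M largest
    c[idx] = 0.0                              # keep only the M largest coefficients
    return np.mean((s - inverse(c)) ** 2)

def haar(s):
    return np.concatenate(pywt.wavedec(s, 'haar'))

def ihaar(c):
    # Rebuild the coefficient pyramid expected by waverec (dyadic lengths).
    sizes = [len(a) for a in pywt.wavedec(np.zeros(N), 'haar')]
    return pywt.waverec(list(np.split(c, np.cumsum(sizes)[:-1])), 'haar')

N, M = 1024, 64
s = compound_poisson(N)
print("Haar M-term MSE:", m_term_error(s, M, haar, ihaar))
print("DCT  M-term MSE:", m_term_error(s, M,
      lambda x: dct(x, norm='ortho'), lambda c: idct(c, norm='ortho')))
```

On sparse (compound-Poisson) paths the Haar error is typically markedly smaller, whereas for Brownian motion the two transforms are comparable, in line with the discussion above.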
Figure 1.1 Examples of spline signals. (a) Cardinal spline interpolant of degree zero
(piecewise-constant). (b) Cardinal spline interpolant of degree one (piecewise-linear).
(c) Non-uniform D-spline or compound-Poisson process, depending on the interpretation
(deterministic vs. stochastic).
Observe that the basis functions {β+0 (t − k)}k∈Z are non-overlapping and orthonormal,
and that their linear span defines the space of cardinal polynomial splines of degree zero.
Moreover, since β+0 (t) takes the value one at the origin and vanishes at all other inte-
gers, the expansion coefficients in (1.3) coincide with the original samples of the signal.
Equation (1.3) is nothing but a mathematical representation of the sample-and-hold method of interpolation which yields the type of “Lego-like” signal shown in Figure 1.1a.
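As a minimal illustration (ours, not from the book), the sample-and-hold interpolant and the jump sizes that its derivative uncovers (formalized in (1.5) just below) can be computed in a few lines of Python:

```python
# A small sketch (assumed, not from the book) of sample-and-hold interpolation
# with degree-zero B-splines, plus recovery of the jump sizes a1[k].
import numpy as np

f_samples = np.array([0.0, 1.0, 0.5, 2.0, 2.0, -1.0])

def f1(t):
    """Piecewise-constant interpolant: f1(t) = sum_k f[k] beta0(t - k)."""
    k = np.clip(np.floor(t).astype(int), 0, len(f_samples) - 1)
    return f_samples[k]              # beta0(t - k) = 1 exactly when k <= t < k+1

a1 = np.diff(f_samples)              # jump sizes: a1[k] = f[k] - f[k-1]
print(f1(np.array([0.2, 1.7, 3.9])))   # -> [0.  1.  2.]
print(a1)                               # weights of the Dirac impulses below
```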
A defining property of piecewise-constant signals is that they exhibit “sparse” first-
order derivatives that are zero almost everywhere, except at the points of transition
where differentiation is only meaningful in the sense of distributions. In the case of the
cardinal spline specified by (1.3), we have that
$$\mathrm{D}f_1(t) = \sum_{k \in \mathbb{Z}} a_1[k]\, \delta(t - k), \qquad (1.5)$$
where the weights of the integer-shifted Dirac impulses $\delta(\cdot - k)$ are given by the corresponding jump size of the function: $a_1[k] = f[k] - f[k-1]$. The main point is that the application of the operator $\mathrm{D} = \frac{\mathrm{d}}{\mathrm{d}t}$ uncovers the spline discontinuities (a.k.a. knots) which are located on the integer grid: its effect is that of a mathematical A-to-D conversion since the right-hand side of (1.5) corresponds to the continuous-domain representation of a discrete signal commonly used in the theory of linear systems. In the nomenclature of splines, we say that $f_1(t)$ is a cardinal D-spline, 3 which is a special case
3 Other brands of splines are defined in the same fashion by replacing the derivative D by some other differential operator generically denoted by L.
Figure 1.2 Causal polynomial B-splines. (a) Construction of the B-spline of degree zero starting
from the causal Green’s function of D. (b) B-splines of degree n = 0, . . . , 4 (light to dark), which
become more bell-shaped (and beautiful) as n increases.
of a general non-uniform D-spline where the knots can be located arbitrarily (see Figure 1.1c).
The next fundamental observation is that the expansion coefficients in (1.5) are obtained via a finite-difference scheme which is the discrete counterpart of differentiation. To get some further insight, we define the finite-difference operator
$$\Delta_+ f(t) = f(t) - f(t-1) = \mathrm{D}(\beta_+^0 * f)(t),$$
where the smoothing kernel is precisely the B-spline generator for the expansion (1.3).
An equivalent manifestation of this property can be found in the relation
$$\beta_+^0(t) = \Delta_+ \mathbb{1}_+(t) = \mathbb{1}_+(t) - \mathbb{1}_+(t-1), \qquad (1.6)$$
where the unit step $\mathbb{1}_+(t) = \mathbb{1}_{[0,+\infty)}(t)$ (a.k.a. the Heaviside function) is the causal Green’s function 4 of the derivative operator. This formula is illustrated in Figure 1.2a.
Its Fourier-domain counterpart is
$$\hat{\beta}_+^0(\omega) = \frac{1 - e^{-\mathrm{j}\omega}}{\mathrm{j}\omega}. \qquad (1.7)$$
The piecewise-linear interpolant of the same samples is obtained by switching to B-spline atoms of degree one, as in
$$f_2(t) = \sum_{k \in \mathbb{Z}} f[k+1]\, \beta_+^1(t - k), \qquad (1.8)$$
where
$$\beta_+^1(t) = (\beta_+^0 * \beta_+^0)(t) = \begin{cases} t, & \text{for } 0 \le t < 1 \\ 2 - t, & \text{for } 1 \le t < 2 \\ 0, & \text{otherwise} \end{cases} \qquad (1.9)$$
is the causal B-spline of degree one, a triangular function centered at t = 1. Note that
the use of a causal generator is compensated by the unit shifting of the coefficients in
(1.8), which is equivalent to recentering the basis functions on the sampling locations.
The main advantage of $f_2$ in (1.8) over $f_1$ in (1.3) is that the underlying function is now continuous, as illustrated in Figure 1.1b.
In an analogous manner, one can construct higher-degree spline interpolants that are
piecewise polynomials of degree n by considering B-spline atoms of degree n obtained
from the (n + 1)-fold convolution of β+0 (t) (see Figure 1.2b). The generic version of such
a higher-order spline model is
$$f_n(t) = \sum_{k \in \mathbb{Z}} c[k]\, \beta_+^n(t - k), \qquad (1.10)$$
with
$$\beta_+^n(t) = (\underbrace{\beta_+^0 * \beta_+^0 * \cdots * \beta_+^0}_{n+1\ \text{factors}})(t).$$
The catch, though, is that, for n > 1, the expansion coefficients c[k] in (1.10) are not
identical to the sample values f[k] anymore. Yet, they are in a one-to-one correspondence with them and can be determined efficiently by solving a linear system of equations
tions that has a convenient band-diagonal Toeplitz structure [Uns99].
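To make the band-diagonal system concrete, here is a hedged sketch for the cubic case n = 3. It relies only on the standard values (1/6, 2/3, 1/6) of the centered cubic B-spline at the integers, and it simplifies the boundary handling to zero padding; production code (e.g., scipy.ndimage.spline_filter1d) uses mirror boundary conditions instead.

```python
# A minimal sketch (assumed simplifications) of the tridiagonal Toeplitz solve
# that maps samples f[k] to cubic B-spline coefficients c[k].
import numpy as np
from scipy.linalg import solve_banded

f = np.array([0.0, 1.0, 0.5, 2.0, 2.0, -1.0])   # samples f[k]
n = len(f)
ab = np.zeros((3, n))
ab[0, 1:] = 1/6       # super-diagonal: beta3(-1) = 1/6
ab[1, :]  = 4/6       # main diagonal:  beta3(0)  = 2/3
ab[2, :-1] = 1/6      # sub-diagonal:   beta3(1)  = 1/6
c = solve_banded((1, 1), ab, f)                  # B-spline coefficients c[k]

# Check: the B-spline expansion reproduces the samples at the integers.
beta3 = np.array([1/6, 4/6, 1/6])
print(np.allclose(np.convolve(c, beta3, mode='same'), f))   # -> True
```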
The higher-order counterparts of relations (1.7) and (1.6) are
$$\hat{\beta}_+^n(\omega) = \left(\frac{1 - e^{-\mathrm{j}\omega}}{\mathrm{j}\omega}\right)^{n+1}$$
and
$$\beta_+^n(t) = \mathrm{D_d}^{n+1}\, \mathrm{D}^{-(n+1)}\delta(t) = \mathrm{D_d}^{n+1}\, \frac{(t)_+^n}{n!} = \sum_{k=0}^{n+1} (-1)^k \binom{n+1}{k} \frac{(t-k)_+^n}{n!} \qquad (1.11)$$
with $(t)_+ = \max(0, t)$. The latter explicit time-domain formula follows from the fact that the impulse response of the (n+1)-fold integrator (or, equivalently, the causal Green’s function of $\mathrm{D}^{n+1}$) is the one-sided power function $\mathrm{D}^{-(n+1)}\delta(t) = \frac{t_+^n}{n!}$. This elegant
formula is due to Schoenberg, the father of splines [Sch46]. He also proved that the
polynomial B-spline of degree n is the shortest cardinal Dn+1 -spline and that its integer
translates form a Riesz basis of such polynomial splines. In particular, he showed that
the B-spline representation (1.10) is unique and stable, in the sense that
$$\|f_n\|_{L_2}^2 = \int_{\mathbb{R}} |f_n(t)|^2\, \mathrm{d}t \le \|c\|_{\ell_2}^2 = \sum_{k \in \mathbb{Z}} c[k]^2.$$
Note that the inequality above becomes an equality for n = 0 since the squared
L2 -norm of the corresponding piecewise-constant function is easily converted into a
sum. This also follows from Parseval’s identity because the B-spline basis {β+0 (·−k)}k∈Z
is orthonormal.
One last feature is that polynomial splines of degree n are inherently smooth, in the
sense that they are n-times differentiable everywhere with bounded derivatives – that is,
Hölder continuous of order n. In the cardinal setting, this follows from the property that
In the cardinal setting, this follows from the property that
$$\mathrm{D}^n \beta_+^n(t) = \mathrm{D}^n\, \mathrm{D_d}^{n+1}\, \mathrm{D}^{-(n+1)}\delta(t).$$

1.3.3 Random splines, innovations, and Lévy processes

Consider now a random stream of weighted Dirac impulses
$$w(t) = \mathrm{D}s(t) = \sum_n A_n\, \delta(t - t_n), \qquad (1.12)$$
where the locations $t_n$ of the Dirac impulses are uniformly distributed over the real line (Poisson distribution with rate parameter λ) and the weights $A_n$ are independent identically distributed (i.i.d.) with amplitude distribution $p_A(a)$. For simplicity, we are also assuming that $p_A$ is symmetric with finite variance $\sigma_A^2 = \int_{\mathbb{R}} a^2\, p_A(a)\, \mathrm{d}a$. We shall
refer to w as the innovation of the signal s since it contains all the parameters that are
necessary for its description. Clearly, s is a signal with a finite rate of innovation, a term
that was coined by Vetterli et al. [VMB02].
The idea now is to reconstruct s from its innovation w by integrating (1.12). This
requires the specification of some boundary condition to fix the integration constant.
Since the constraint in the definition of Lévy processes is s(0) = 0 (with probability
one), we first need to find a suitable antiderivative operator, which we shall denote by
$\mathrm{D}_0^{-1}$. In the event when the input function is Lebesgue integrable, the relevant operator is readily specified as
$$\mathrm{D}_0^{-1}\varphi(t) = \int_{-\infty}^{t} \varphi(\tau)\, \mathrm{d}\tau - \int_{-\infty}^{0} \varphi(\tau)\, \mathrm{d}\tau = \begin{cases} \int_0^t \varphi(\tau)\, \mathrm{d}\tau, & \text{for } t \ge 0 \\ -\int_t^0 \varphi(\tau)\, \mathrm{d}\tau, & \text{for } t < 0. \end{cases}$$
Equivalently, in the Fourier domain,
$$\mathrm{D}_0^{-1}\varphi(t) = \int_{\mathbb{R}} \frac{e^{\mathrm{j}\omega t} - 1}{\mathrm{j}\omega}\, \hat{\varphi}(\omega)\, \frac{\mathrm{d}\omega}{2\pi},$$
which can be extended, by duality, to a much larger class of generalized functions (see Chapter 5). This is feasible because the latter expression is a regularized version of an integral that would otherwise be singular, since the division by $\mathrm{j}\omega$ is tempered by a proper correction in the numerator: $e^{\mathrm{j}\omega t} - 1 = \mathrm{j}\omega t + O(\omega^2)$. It is important to note that $\mathrm{D}_0^{-1}$ is scale-invariant (in the sense that it commutes with scaling), but not shift-invariant, unlike $\mathrm{D}^{-1}$. Our reason for selecting $\mathrm{D}_0^{-1}$ over $\mathrm{D}^{-1}$ is actually more fundamental than just
Having the proper inverse operator at our disposal, we can apply it to formally solve
the stochastic differential equation (1.12). This yields the explicit representation of the
sparse stochastic process:
$$s(t) = \mathrm{D}_0^{-1} w(t) = \sum_n A_n\, \mathrm{D}_0^{-1}\{\delta(\cdot - t_n)\}(t) = \sum_n A_n\, \big(\mathbb{1}_+(t - t_n) - \mathbb{1}_+(-t_n)\big), \qquad (1.14)$$
where the second term 1+ (−tn ) in the last parenthesis ensures that s(0) = 0. Clearly,
the signal defined by (1.14) is piecewise-constant (random spline of degree zero) and
its construction is compatible with the classical definition of a compound-Poisson pro-
cess, which is a special type of Lévy process. A representative example is shown in
Figure 1.1c.
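A minimal simulation sketch of (1.14) follows (our own, not the book's code; the rate λ = 2 and the Gaussian amplitude law are arbitrary choices). The second step term pins s(0) = 0, in agreement with the definition of a Lévy process.

```python
# A hedged sketch: simulate the compound-Poisson process via (1.14).
import numpy as np

rng = np.random.default_rng(1)
lam, T = 2.0, 10.0                        # Poisson rate, observation window
n_jumps = rng.poisson(lam * 2 * T)
t_n = rng.uniform(-T, T, n_jumps)         # jump locations on [-T, T)
A_n = rng.standard_normal(n_jumps)        # i.i.d. symmetric amplitudes

def s(t):
    """Evaluate (1.14): piecewise-constant random spline with s(0) = 0."""
    step = (t[:, None] >= t_n).astype(float)     # 1_+(t - t_n)
    step0 = (0.0 >= t_n).astype(float)           # 1_+(-t_n)
    return (step - step0) @ A_n

t = np.linspace(-T, T, 5)
print(s(t), "s(0) =", s(np.array([0.0]))[0])     # s(0) is exactly 0
```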
It can be shown that the innovation w specified by (1.12), made of random impulses,
is a special type of continuous-domain white noise with the property that
$$R_w(\tau) = \mathbb{E}\{w(t + \tau)\, w(t)\} = \sigma_w^2\, \delta(\tau), \qquad (1.15)$$
where $\sigma_w^2 = \lambda \sigma_A^2$ is the product of the Poisson rate parameter λ and the variance $\sigma_A^2$ of the amplitude distribution. More generally, we can determine the correlation functional of the innovation, which is given by
$$\mathbb{E}\{\langle w, \varphi_1 \rangle \langle w, \varphi_2 \rangle\} = \sigma_w^2\, \langle \varphi_1, \varphi_2 \rangle. \qquad (1.16)$$
(Figure 1.3: sample paths obtained by applying the integrator to white noise; panels show the Gaussian case (Brownian motion) and the SαS (Cauchy) case.)
Closely related to s is its increment process
$$u(t) = s(t) - s(t - 1) = (\beta_+^0 * w)(t). \qquad (1.17)$$
This implies, among other things, that u is stationary, while the original Lévy process
s is not (since $\mathrm{D}_0^{-1}$ is not shift-invariant). It also suggests that the samples of the increment
ment process u are independent if they are taken at a distance of 1 or more apart, the
limit corresponding to the support of the rectangular convolution kernel β+0 . When the
autocorrelation function Rw (τ ) of the driving noise is well defined and given by (1.15),
we can easily determine the autocorrelation of u as
$$R_u(\tau) = \mathbb{E}\{u(t + \tau)\, u(t)\} = (\beta_+^0 * (\beta_+^0)^\vee * R_w)(\tau) = \sigma_w^2\, \beta_+^1(\tau + 1), \qquad (1.18)$$
where $(\beta_+^0)^\vee(t) = \beta_+^0(-t)$. It is proportional to the autocorrelation of a rectangle, which is a triangular function (centered B-spline of degree one).
Of special interest to us are the samples of u on the integer grid, which are characterized for $k \in \mathbb{Z}$ as
$$u[k] = s(k) - s(k - 1) = \langle w, \beta_+^0(k - \cdot) \rangle. \qquad (1.19)$$
The relation on the right-hand side can be used to show that the u[k] are i.i.d. because w is white, stationary, and the supports of the analysis functions $\beta_+^0(k - t)$ are non-overlapping. We shall refer to $\{u[k]\}_{k \in \mathbb{Z}}$ as the discrete innovation of s. Its determination involves the sampling of s at the integers and a discrete differentiation (finite differences), in direct analogy with the generation of the continuous-domain innovation $w(t) = \mathrm{D}s(t)$.
The discrete innovation sequence u[·] will play a fundamental role in signal processing because it constitutes a convenient tool for extracting the statistics and characterizing the samples of a stochastic process. It is probably the best practical way of
presenting the information because
• we never have access to the full signal s(t), which is a continuously defined entity,
and
• we cannot implement the whitening operator (derivative) exactly, not to
mention that the continuous-domain innovation w(t) does not admit an interpretation
as an ordinary function of t. For instance, Brownian motion is not differentiable
anywhere in the classical sense.
This points to the fact that the continuous-domain innovation model is a theoretical
construct. Its primary purpose is to facilitate the determination of the joint probability
distributions of any series of linear measurements of a wide class of sparse stochastic
processes, including the discrete version of the innovation which has the property of
being maximally decoupled.
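The practical recipe is thus: sample, then apply finite differences. The toy sketch below (ours; it uses a random walk with Laplace-distributed increments as a stand-in for the integer samples of a Lévy process) shows the decoupling effect numerically:

```python
# A hedged sketch: the discrete innovation u[k] = s(k) - s(k-1) decorrelates
# the samples of a (stand-in) Levy process.
import numpy as np

rng = np.random.default_rng(2)
K = 100_000
w = rng.laplace(size=K)                  # i.i.d. increments (sparse-ish law)
s_samples = np.cumsum(w)                 # samples s(1), ..., s(K)
u = np.diff(s_samples)                   # discrete innovation u[k]

def lag1_corr(x):
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

print("lag-1 correlation of s:", lag1_corr(s_samples))   # close to 1
print("lag-1 correlation of u:", lag1_corr(u))           # close to 0
```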
1.3.4 Wavelet analysis of Lévy processes and M-term approximations

Figure 1.4 Dual pair of multiresolution bases where the first kind of functions (wavelets) are the
derivatives of the second (hierarchical basis functions). (a) (Unnormalized) Haar wavelet basis.
(b) Faber–Schauder basis (a.k.a. Franklin system).
The Haar wavelets are given by
$$\psi_{i,k}(t) = 2^{-i/2}\, \psi_{\mathrm{Haar}}\!\left(\frac{t}{2^i} - k\right), \qquad (1.20)$$
where i and k are the scale (dilation of $\psi_{\mathrm{Haar}}$ by $2^i$) and location (translation of $\psi_{i,0}$ by $2^i k$) indices, respectively. A closely related system is the Faber–Schauder basis $\{\phi_{i,k}(\cdot)\}_{i \in \mathbb{Z}, k \in \mathbb{Z}}$, which is made up of B-splines of degree one in a wavelet-like configuration (see Figure 1.4).
Specifically, the hierarchical triangle basis functions are given by
$$\phi_{i,k}(t) = \beta_+^1\!\left(\frac{t - 2^i k}{2^{i-1}}\right). \qquad (1.21)$$
While these functions are orthogonal within any given scale (because they are non-
overlapping), they fail to be so across scales. Yet, they form a Schauder basis, which is
a somewhat weaker property than being a Riesz basis of L2 (R).
The fundamental observation for our purpose is that the Haar system can be obtained by differentiating the Faber–Schauder one, up to some amplitude factor. Specifically, we have the relations
$$\mathrm{D}\phi_{i,k} = 2^{1 - i/2}\, \psi_{i,k} \qquad (1.22)$$
and
$$\mathrm{D}_0^{-1} \psi_{i,k} = 2^{i/2 - 1}\, \phi_{i,k}. \qquad (1.23)$$
Let us now apply (1.22) to the formal determination of the wavelet coefficients of the Lévy process $s = \mathrm{D}_0^{-1} w$. The crucial manipulation, which will be justified rigorously within the framework of generalized stochastic processes (see Chapter 3), is $\langle s, \mathrm{D}\phi_{i,k} \rangle = \langle \mathrm{D}^* s, \phi_{i,k} \rangle = -\langle w, \phi_{i,k} \rangle$, where we have used the adjoint relation $\mathrm{D}^* = -\mathrm{D}$
Figure 1.5 Haar wavelets vs. KLT: M-term approximation errors for different brands of Lévy
processes. (a) Gaussian (Brownian motion). (b) Compound-Poisson with Gaussian jump
distribution and e−λ = 0.9. (c) Alpha-stable (symmetric Cauchy). The results are averages
over 1000 realizations.
of the KLT. The other point is that the KLT solution is not defined for the third type of
SαS process, whose theoretical covariances are unbounded – this does not prevent us
from applying the Gaussian solution/DCT to a finite-length realization whose $\ell_2$-norm is finite (almost surely). This simple experiment with various stochastic models corroborates the results obtained with image compression where the superiority of wavelets
over the DCT (e.g., JPEG2000 vs. JPEG) is well established.
1.3.5 Lévy’s wavelet-based synthesis of Brownian motion

Since the Haar system is orthonormal, we may formally expand the innovation as $w = \sum_{i,k} Z_{i,k}\, \psi_{i,k}$ with wavelet coefficients $Z_{i,k} = \langle w, \psi_{i,k} \rangle$. This is acceptable 5 under the finite-variance hypothesis on w. Since the Haar basis is orthogonal, the coefficients $Z_{i,k}$ in the above expansion are fully decorrelated, but not necessarily independent, unless the white noise is Gaussian or the corresponding basis functions do not overlap. We then construct the Lévy process $s = \mathrm{D}_0^{-1} w$ by integrating the wavelet expansion of the innovation, which yields
$$s(t) = \sum_{i,k} Z_{i,k}\, 2^{i/2 - 1}\, \phi_{i,k}(t). \qquad (1.24)$$
5 The convergence in the sense of distributions is ensured since the wavelet coefficients of a rapidly decaying test function ϕ are rapidly decaying as well.
The representation (1.24) is of special interest when the noise is Gaussian, in which
case the coefficients Zi,k are i.i.d. and follow a standardized Gaussian distribution. The
formula then maps into Lévy’s recursive mid-point method of synthesizing Brownian
motion, which Yves Meyer singles out as the first use of wavelets to be found in the
literature (see [JMR01, pp. 20–24]). The Faber–Schauder expansion (1.24) stands out as
a localized, practical alternative to Wiener’s original construction of Brownian motion,
which involves a sum of harmonic cosines (KLT-type expansion).
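Lévy's recursive mid-point method is easy to reproduce numerically. The sketch below is our own rendering of it, not the book's code; the only non-trivial ingredient is the standard deviation of the midpoint correction, which for Brownian motion over an interval of length h is sqrt(h/4), matching the geometric decay of the Faber–Schauder coefficients in (1.24).

```python
# A hedged sketch of Levy's recursive mid-point synthesis of Brownian motion.
import numpy as np

rng = np.random.default_rng(3)

def brownian_midpoint(levels):
    """Synthesize B(t) on a grid of 2**levels + 1 points over [0, 1]."""
    b = np.array([0.0, rng.standard_normal()])    # endpoints B(0), B(1)
    for i in range(levels):
        mid = 0.5 * (b[:-1] + b[1:])              # Faber-Schauder (linear) guess
        scale = 0.5 ** (i / 2 + 1)                # sqrt(h/4) with h = 2**(-i)
        mid += scale * rng.standard_normal(len(mid))
        out = np.empty(2 * len(b) - 1)
        out[0::2], out[1::2] = b, mid             # interleave old points, midpoints
        b = out
    return b

b = brownian_midpoint(10)
print(len(b), b[0])   # 1025 grid points, starts at 0
```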
1.4 Historical notes: Paul Lévy and his legacy

Paul Lévy was a highly original thinker who ended up being one of the most influential figures of modern probability theory [Tay75, Loè73]. Among his many contributions to the field are the introduction of the characteristic function as an analytical
tool, the characterization of the limit of sums of independent random variables with
unbounded variance (stable distributions), and the investigation of infinitely divisible
laws which ultimately led to the specification of the complete family of additive – or
Lévy – processes. In this latter respect, Michel Loève singles out his 1934 report on
the integration/summation of independent random components [Lév34] as one of the
most important probability papers ever published. There, Lévy is bold enough to make
the transition from a discrete to a continuous indexing in a running sum. This results
in the construction of a random function that is a generalization of Brownian motion
and one of the earliest instances of a non-Gaussian stochastic process. If one leaves the
mathematical technicalities aside, this is very much in the spirit of (1.2), except for the
presence of the more general weighting kernel h.
During his tenure as professor at the prestigious École Polytechnique in Paris from
1920 to 1959, Paul Lévy only supervised four Ph.D. students. 6 Every one of them turned out to be a brilliant scientist whose work is intertwined with the material presented
in this book.
The first, Wolfgang Döblin (Ph.D. 1938 at age 23; co-advised by Maurice Fréchet), was a German Jew who acquired French citizenship in 1936. Döblin was an extraordinarily gifted mathematician. His short career ended tragically on the front of World
War II when he took his own life to avoid being captured by the German troops entering
France. Yet, during the year he served as a French soldier, he was able to make fundamental contributions to the theory of Markov processes and stochastic integration that
predate the work of Itô; these were discovered in 2000 in a sealed envelope deposited
at the French Academy of Sciences – see [BY02] as well as [Pet05] for a romanced
account of Döblin’s life.
Lévy’s second student, Michel Loève (Ph.D. 1941; co-advised by Maurice Fréchet),
is a prominent name in modern probability theory. He was a famous professor at Berkeley who is best known for the development of the spectral representation of second-order stationary processes (the Karhunen–Loève transform).
The third student was Benoit B. Mandelbrot (Ph.D. 1952), who is universally recognized as the inventor of fractals. In his early work, Mandelbrot introduced the use of
non-Gaussian random walks – that is, Lévy processes – into financial statistics. In particular, he showed that the rate of change of prices in markets was much better characterized by alpha-stable distributions (which are heavy-tailed) than by Gaussians [Man63].
Interestingly, it is also statistics, albeit Gaussian statistics, that led him to the discovery
of fractals when he characterized the class of self-similar processes known as fractional Brownian motion (fBm). While fBm corresponds to a fractional-order integration of white Gaussian noise, the construction is somewhat technical for it involves the resolution of a singular integral. 7 The relevance to the present study is that an important
subclass of sparse processes is made up by the non-Gaussian cousins of fBms and their
multidimensional extension (see Section 7.5).
Lévy’s fourth and last student, Georges Matheron (Ph.D. 1958), was the founding
father of the field of geostatistics. Being interested in the prediction of ore concentration, he developed a statistical method for the interpolation of random fields from
non-uniform samples [Mat63]. His method, called kriging, uses the prior knowledge
that the field is a Brownian motion and determines the interpolant that minimizes the
mean-square estimation error. Interestingly, the solution, which is specified by a space-
dependent regression equation, happens to be a spline function whose type is determined
by the correlation structure (variogram) of the underlying field. There is also an intimate
link between kriging and data-approximation methods based on radial basis functions
and/or reproducing-kernel Hilbert spaces [Mye92] – in particular, thin-plate splines that
are associated with the Laplace operator [Wah90]. While the Gaussian hypothesis is
implicit to Matheron’s work, it is arguably the earliest link established between splines
and stochastic processes.
2 Roadmap to the book
The writing of this book was motivated by our desire to formalize and extend the ideas
presented in Section 1.3 to a class of differential operators much broader than the derivative D. Concretely, this translates into the investigation of the family of stochastic
processes specified by the general innovation model that is summarized in Figure 2.1.
The corresponding generator of random signals (upper part of the diagram) has two
fundamental components: (1) a continuous-domain noise excitation w, which may be
thought of as being composed of a continuum of independent identically distributed
(i.i.d.) random atoms (innovations), and (2) a deterministic mixing procedure (formally
described by L−1 ) which couples the random contributions and imposes the correlation
structure of the output. The concise description of the model is Ls = w, where L is the
whitening operator. The term “innovation” refers to the fact that w represents the unpredictable part of the process. When the inverse operator L−1 is linear shift-invariant (LSI), the signal generator reduces to a simple convolutional system which is characterized by its impulse response (or, equivalently, its frequency response). Innovation
modeling has a long tradition in statistical communication theory and signal processing; it is the basis for the interpretation of a Gaussian stationary process as a filtered
version of a white Gaussian noise [Kai70, Pap91].
In the present context, the underlying objects are continuously defined. The innovation model then results from defining a stochastic process (or random field when
the index variable r is a vector in Rd ) as the solution of a stochastic differential
equation (SDE) driven by a particular brand of noise. The non-standard aspect here
is that we are considering the innovation model in its greatest generality, allowing
for non-Gaussian inputs and differential systems that are not necessarily stable. We
shall argue that these extensions are essential for making this type of modeling compatible with the latest developments in signal processing pertaining to the use of
wavelets and sparsity-promoting reconstruction algorithms. Specifically, we shall
see that it is possible to generate a wide variety of sparse processes by replacing
the traditional Gaussian input by some more general brand of (Lévy) noise, within
the limits of mathematical admissibility. We shall also demonstrate that such processes admit a sparse representation in a wavelet basis under the assumption that L is
scale-invariant. The difficulty there is that scale-invariant SDEs are inherently unstable
(due to the presence of poles at the origin); yet, we shall see that they can still result in
a proper specification of fractal-type processes, albeit not within the usual framework
of stationary processes. The non-trivial aspect of these generalizations is that they
20 Roadmap to the book
(Block diagram: white noise w(r) → L−1, with appropriate boundary conditions → generalized stochastic process s(r); L is the whitening operator.)
Figure 2.1 Innovation model of a generalized stochastic process. The process is generated by
application of the (linear) inverse operator L−1 to a continuous-domain white-noise process w.
The generation mechanism is general in the sense that it applies to the complete family of Lévy
noises, including Gaussian noise as the most basic (non-sparse) excitation. The output process s
is stationary if and only if L−1 is shift-invariant.
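As a discrete-domain caricature of Figure 2.1 (our own sketch, not from the book), one can drive the same stable first-order system, here the exact discretization of (D + a)s = w as an AR(1) recursion, with a Gaussian and with a heavy-tailed SαS (Cauchy) innovation. Both outputs share the same second-order correlation structure; only the non-Gaussian one is sparse, exhibiting rare large excursions.

```python
# A hedged sketch: same whitening operator, two brands of innovation.
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(4)
n, a = 2000, 0.05
rho = np.exp(-a)                          # pole of the discretized system

w_gauss = rng.standard_normal(n)          # Gaussian innovation
w_stable = rng.standard_cauchy(n)         # SalphaS (Cauchy) innovation

# s[k] = rho * s[k-1] + w[k]  <=>  L^{-1} acts as a causal exponential filter
s_gauss = lfilter([1.0], [1.0, -rho], w_gauss)
s_sparse = lfilter([1.0], [1.0, -rho], w_stable)
print(s_gauss[:3], s_sparse[:3])
```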
To motivate our approach, we start with an informal discussion, leaving the technicalities aside. The stochastic process s in Figure 2.1 is constructed by applying the (integral)
operator L−1 to some continuous-domain white noise w. In most cases of interest, L−1
has an infinitely supported impulse response which introduces long-range dependencies. If we are aiming at a concise statistical characterization of s, it is essential that we
somehow invert this integration process, the ideal being to apply the operator L which
would give back the innovation signal w that is fully decoupled. Unfortunately, this is
not feasible in practice because we do not have access to the signal s(r) over the entire
domain r ∈ Rd , but only to its sampled values on a lattice or, more generally, to a series
of coefficients in some appropriate basis. Our analysis options are essentially two-fold,
as described in Sections 2.1.1 and 2.1.2.
Applying the discrete counterpart $\mathrm{L_d}$ of the whitening operator to the sampled process yields the generalized increments
$$u(r) = \mathrm{L_d}\, s(r) = (\beta_L * w)(r), \qquad (2.1)$$
where
$$\beta_L(r) = \mathrm{L_d}\, \mathrm{L}^{-1} \delta(r). \qquad (2.2)$$
This suggests that the decoupling effect will be the strongest when the convolution kernel $\beta_L$ is the most localized (minimum support) and closest to an impulse. 1 We call $\beta_L$ the generalized B-spline associated with the operator L. For a given operator L, the challenge will be to design the most localized kernel $\beta_L$, which is the way of approaching the discretization problem that best matches our statistical objectives. The good news is that this is a standard problem in spline theory, meaning that we can take advantage of the large body of techniques available in this area, even though they have hardly been applied to the stochastic setting so far.
The alternative is to analyze s with wavelet-like functions that can be factorized as
$$\psi_i = \mathrm{L}^* \phi_i. \qquad (2.3)$$
Here, $\mathrm{L}^*$ is the adjoint operator of L and $\phi_i$ is some smoothing kernel with good
localization properties. The interpretation is that the wavelet transform provides some
kind of multiresolution version of the operator L with the effective width of the kernels
φi increasing in direct proportion to the scale; typically, φi (r) ∝ φ0 (r/2i ). Then, the
wavelet analysis of the stochastic process s reduces to
$$\langle s, \psi_i(\cdot - r_0) \rangle = \langle s, \mathrm{L}^* \phi_i(\cdot - r_0) \rangle = \langle \mathrm{L}s, \phi_i(\cdot - r_0) \rangle = \langle w, \phi_i(\cdot - r_0) \rangle = (\phi_i^\vee * w)(r_0), \qquad (2.4)$$
where $\phi_i^\vee(r) = \phi_i(-r)$ is the reversed version of $\phi_i$. The remarkable aspect is that the effect is essentially the same as in (2.1), so that it makes good sense to develop a common framework to analyze white noise.
1 One may be tempted to pretend that $\beta_L$ is a Dirac impulse, which amounts to neglecting all discretization effects. Unfortunately, this is incorrect and most likely to result in false statistical conclusions. In fact, we shall see that the localization deteriorates as the order of the operator increases, inducing higher (Markov) orders of dependencies.
This is all nice in principle as long as one can construct “L-compatible” wavelet
bases. For instance, if L is a pure nth-order derivative operator – or by extension, a
scale-invariant differential operator – then the above reasoning is directly applicable to
conventional wavelet bases. Indeed, these are known to behave like multiscale versions
of derivatives due to their vanishing-moment property [Mey90, Dau92, Mal09]. In prior
work, we have linked this behavior, as well as a number of other fundamental wavelet
properties, to the polynomial B-spline convolutional factor that is necessarily present
in every wavelet that generates a multiresolution basis of L2 (R) [UB03]. What is not
so widely known is that the spline connection extends to a much broader variety of
operators – not necessarily scale-invariant – and that it also provides a general recipe
for constructing wavelet-like basis functions that are matched to some given operator L.
This has been demonstrated in one dimension (1-D) for the entire family of ordinary differential operators [KU06]. The only significant difference with the conventional theory
of wavelets is that the smoothing kernels φi are not necessarily rescaled versions of each
other.
Note that the “L-compatible” property is relatively robust. For instance, if $\mathrm{L} = \mathrm{L}' \mathrm{L}_0$, then an “L-compatible” wavelet is also $\mathrm{L}'$-compatible with $\phi_i' = \mathrm{L}_0 \phi_i$. The design challenge in the context of stochastic modeling is thus to come up with a suitable wavelet basis such that $\phi_i$ in (2.3) is most localized – possibly, of compact support.
The reasoning of Section 2.1 is appealing because of its conceptual simplicity and
generality. Yet, the precise formulation of the theory requires some special care because
the underlying stochastic objects are infinite-dimensional and possibly highly singular.
For instance, we are faced with a major difficulty at the outset because the continuous-domain input of our model (the innovation w) does not admit a conventional interpretation as a function of the domain variable r. This entity can only be probed indirectly
by forming scalar products with test functions in accordance with Laurent Schwartz’
theory of distributions, so that the use of advanced mathematics is unavoidable.
For the benefit of readers who would be unfamiliar with concepts used in this book,
we provide the relevant mathematical background in Chapter 3, which also serves the
purpose of introducing the notation. The first part is devoted to the definition of the
relevant function spaces, with special emphasis on generalized functions (a.k.a. tempered distributions), which play a central role in our formulation. The second part
reviews the classical, finite-dimensional tools of probability theory and shows how
some concepts (e.g., characteristic function, Bochner’s theorem) are extendable to the
infinite-dimensional setting within the framework of Gelfand’s theory of generalized
stochastic processes [GV64].
Chapter 4 is devoted to the mathematical specification of the innovation model. Since
the theory gravitates around the notion of Lévy exponents, we start with a systematic investigation of such functions, denoted by $f(\omega)$, which are fundamental to the
(classical) study of infinitely divisible probability laws. In particular, we discuss their
the transformed domain. Apart from a shaping effect that can be quantified, the resulting
probability density function remains within the same family of infinitely divisible laws.
In the final part of the book, we illustrate the use of these stochastic models (and
the corresponding analytical tools) with the formulation of algorithms for the recovery of signals and images from incomplete, noisy measurements. In Chapter 10, we
develop a general framework for the discretization of linear inverse problems in a
B-spline basis, which is analogous to the finite-element method for solving PDEs.
The central element is the “projection” of the continuous-domain stochastic model
onto the (finite-dimensional) reconstruction space in order to specify the prior statistical distribution of the signal. This naturally yields the maximum a posteriori
solution to the signal-reconstruction problem. The framework is illustrated with the
derivation of practical algorithms for magnetic resonance imaging, deconvolution
microscopy, and tomographic reconstruction. Remarkably, the resulting MAP estimators are compatible with the non-quadratic regularization schemes (e.g., $\ell_1$-minimization, LASSO, and/or non-convex $\ell_p$ relaxation) that are currently in favor in imaging. To get a handle on the quality of the reconstruction, we then rely on the
innovation model to investigate the extent to which one is able to “optimally” denoise
sparse signals. In particular, we demonstrate the feasibility of MMSE reconstruction
when the signal belongs to the class of Lévy processes, which provides us with a gold
standard against which to compare other algorithms.
In Chapter 11, we present alternative wavelet-based reconstruction methods that are
typically faster than the fixed-scale techniques of Chapter 10. These methods capitalize
on the orthonormality of the wavelet basis, which provides direct control of the norm
of the signal. We show that the underlying optimization task is amenable to iterative
thresholding algorithms (ISTA or FISTA) which are simple to deploy and well suited
for large-scale problems. We also investigate the effect of cycle spinning, which is a
fundamental ingredient for making wavelets competitive in terms of image quality. Our
closing topic is the use of statistical modeling for the improvement of standard wavelet-based denoising – in particular, the optimization of the wavelet-domain thresholding
functions and the search of a consensus solution across multiple wavelet expansions in
order to minimize the global estimation error.
3 Mathematical context and background
In this chapter we summarize some of the mathematical preliminaries for the remaining chapters. These concern the function spaces used in the book, duality, generalized
functions, probability theory, and generalized random processes. Each of these topics is
discussed in a separate section.
For the most part, the theory of function spaces and generalized functions can be
seen as an infinite-dimensional generalization of linear algebra (function spaces generalize $\mathbb{R}^n$, and continuous linear operators generalize matrices). Similarly, the theory of
generalized random processes involves the generalization of the idea of a finite random
vector in Rn to an element of an infinite-dimensional space of generalized functions.
To give a taste of what is to come, we briefly compare finite- and infinite-dimensional
theories in Tables 3.1 and 3.2. The idea, in a nutshell, is to replace vectors by (generalized) functions. Formally, this extension amounts to replacing some finite sums (in the
finite-dimensional formulation) by integrals. Yet, in order for this to be mathematically
sound, one needs to properly define the underlying objects as elements of some infinite-dimensional vector space, to specify the underlying notion(s) of convergence (which is
not an issue in Rn ) while ensuring that some basic continuity conditions are met.
Fundamental to our formulation is the material on infinite-dimensional probability
theory from Section 3.4.4 to the end of the chapter. The mastery of those notions requires
a good understanding of function spaces and generalized functions, which are covered
in the first part of the chapter. The impatient reader who is not directly concerned with
the full mathematical details may skip what follows the tables at first reading and consult
the relevant sections later as needed.
By the term “function” we shall mean elements of various function spaces. At a minimum, a function space is a set $\mathcal{X}$ along with some criteria for determining, first, whether
or not a given “function” ϕ = ϕ(r) belongs to X (in mathematical notation, ϕ ∈ X )
and, second, given ϕ, ϕ′ ∈ X, whether or not ϕ and ϕ′ describe the same object in X (in mathematical notation, ϕ = ϕ′). Most often, in addition to these, the space X has
additional structure (see below).
In this book we shall deal largely with two types of function spaces: complete normed
spaces such as Lebesgue Lp spaces, and nuclear spaces such as the Schwartz space S
and the space D of compactly supported test functions, as well as their duals S′ and D′, which are spaces of generalized functions. These two categories of spaces (complete-normed and nuclear) cannot overlap, except in finite dimensions. Since the function spaces that are of interest to us are infinite-dimensional (they do not have a finite vector-space basis), the two categories are mutually exclusive.

Table 3.1 Comparison of notions of linear algebra with those of functional analysis and the theory of distributions (generalized functions). See Sections 3.1–3.3 for an explanation. (Columns: Finite-dimensional | Infinite-dimensional; table body not reproduced.)

Table 3.2 Comparison of notions of finite-dimensional statistical calculus with the theory of generalized stochastic processes. See Section 3.4 for an explanation. (Columns: Finite-dimensional | Infinite-dimensional; table body not reproduced.)
The structure of each of the aforementioned spaces has two aspects. First, as a vector
space over the real numbers or its complexification, the space has an algebraic structure.
Second, with regard to the notions of convergence and taking of limits, the space has
a topological structure. The algebraic structure lends meaning to the idea of a linear
operator on the space, while the topological structure gives rise to the concept of a
continuous operator or map, as we shall see shortly.
All the spaces considered here have a similar algebraic structure. They are either vector spaces over R, meaning that, for any ϕ, ϕ′ in the space and any a ∈ R, the operations of addition (ϕ, ϕ′) ↦ ϕ + ϕ′ and multiplication by scalars ϕ ↦ aϕ are defined and map the space (denoted henceforth by X) into itself. Or, we may take the complexification of a real vector space X, composed of elements of the form ϕ = ϕr + jϕi, with ϕr, ϕi ∈ X
and j denoting the imaginary unit. The complexification is then a vector space over C. In
the remainder of the book, we shall denote a real vector space and its complexification
by the same symbol. The distinction, when important, will be clear from the context.
For the spaces with which we are concerned in this book, the topological structure
is completely specified by providing a criterion for the convergence of sequences. 1 By
this we mean that, for any given sequence (ϕi ) in X and any ϕ ∈ X , we are equipped
with the knowledge of whether or not ϕ is the limit of (ϕi ). A topological space is a set
X with topological structure. For normed spaces, the said criterion is given in terms of
a norm, while in nuclear spaces it is given in terms of a family of seminorms, as we shall
discuss below. But before that, let us first define linear and continuous operators.
An operator is a mapping from one vector space to another; that is, a rule that asso-
ciates an output function A{ϕ} ∈ Y (also written as Aϕ) with each input ϕ ∈ X .
The above definition of continuity coincides with the stricter topological definition for
spaces we are interested in.
1 This is in contrast with those topological spaces where one needs to consider generalizations of the notion
of a sequence involving partially ordered sets (the so-called nets or filters). Spaces in which a knowledge
of sequences suffices are called sequential.
We shall assume that the topological structure of our vector spaces is such that the
operations of addition and multiplication by scalars in R (or C) are continuous. With
this compatibility condition, our object is called a topological vector space.
Having defined the two types of structure (algebraic and topological) and their rela-
tion with operators in abstract terms, let us now show concretely how the topological
structure is defined for some important classes of spaces.
In a normed space X, equipped with the norm ‖·‖, a sequence (ϕi) converges to ϕ ∈ X if and only if
$$\lim_{i\to\infty} \|\varphi - \varphi_i\| = 0.$$
Let (ϕi) be a sequence in X such that, for any ε > 0, there exists an N ∈ N with ‖ϕi − ϕj‖ < ε for all i, j ≥ N.
Such a sequence is called a Cauchy sequence. A normed space X is complete if it does not have any holes, in the sense that, for every Cauchy sequence (ϕi) in X, there exists a ϕ ∈ X such that lim_i ϕi = ϕ (in other words, if every Cauchy sequence has a limit in
X ). A normed space that is not complete can be completed by introducing new points
corresponding to the limits of equivalent Cauchy sequences. For example, the real line
is the completion of the set of rational numbers with respect to the absolute-value norm.
Examples
Important examples of complete normed spaces are the Lebesgue spaces. The Lebesgue spaces Lp(R^d), 1 ≤ p ≤ ∞, are composed of functions whose Lp(R^d) norm, denoted as ‖·‖p, is finite, where
$$\|\varphi\|_{L_p} = \begin{cases} \left(\int_{\mathbb{R}^d} |\varphi(\boldsymbol{r})|^p \,\mathrm{d}\boldsymbol{r}\right)^{1/p} & \text{for } 1 \le p < \infty \\ \operatorname{ess\,sup}_{\boldsymbol{r}\in\mathbb{R}^d} |\varphi(\boldsymbol{r})| & \text{for } p = \infty, \end{cases}$$
and where two functions that are equal almost everywhere are considered to be equivalent.
We may also define weighted Lp spaces by replacing the shift-invariant Lebesgue measure (dr) by a weighted measure w(r) dr in the above definitions. In that case, w(r) is assumed to be a measurable function that is (strictly) positive almost everywhere. In particular, for w(r) = 1 + |r|^α with α ≥ 0 (or, equivalently, w(r) = (1 + |r|)^α), we denote the associated norms as ‖·‖_{p,α} and the corresponding normed spaces as L_{p,α}(R^d). The latter spaces are useful when characterizing the decay of functions at infinity. For example, L_{∞,α}(R^d) is the space of functions that are bounded by a constant multiple of 1/(1 + |r|^α) almost everywhere (a.e.).
Some remarkable inclusion properties of L_{p,α}(R^d), 1 ≤ p ≤ ∞, α > 0, are
• α > α′ implies L_{p,α}(R^d) ⊂ L_{p,α′}(R^d);
• L_{∞, d/p+ε}(R^d) ⊂ L_p(R^d) for any ε > 0.
Finally, we define the space of rapidly decaying functions, R(R^d), as the intersection of all L_{∞,α}(R^d) spaces, α > 0, or, equivalently, as the intersection of all L_{∞,α}(R^d) with α ∈ N. In other words, R(R^d) contains all bounded functions that essentially decay faster than 1/|r|^α at infinity for all α ∈ R+. A sequence (fi) converges in (the topology of) R(R^d) if and only if it converges in all L_{∞,α}(R^d) spaces.
The causal exponential ρ_α(r) = 1_{[0,∞)}(r) e^{αr} with Re(α) < 0, which is central to linear systems theory, is a prototypical example of a function included in R(R).
that certain function spaces are nuclear, in order to use certain results that are true for
nuclear spaces (specifically, the Minlos–Bochner theorem; see below). For the sake of
completeness, a general definition of nuclear spaces is given at the end of this section,
but this definition may safely be skipped without compromising the presentation.
Specifically, it will be sufficient for us to know that the spaces D (Rd ) and S (Rd ),
which we shall shortly define, are nuclear, as are the Cartesian products and powers of
nuclear spaces, and their closed subspaces.
To define these spaces, we need to identify their members, as well as the criterion of
convergence for sequences in the space.
A sequence (ϕi) of functions in D(R^d) converges in D(R^d) if and only if
(1) there exists a compact (here, meaning closed and bounded) subset K of R^d such that all ϕi are supported inside K; and
(2) the sequence (ϕi), together with all of its derivatives, converges uniformly; that is, it converges in all of the seminorms sup_{r} |∂^n ϕ(r)|, n ∈ N^d.
In other words, S(R^d) is populated by functions that, together with all of their derivatives, decay faster than the inverse of any polynomial at infinity.
The topology of S (Rd ) is defined by positing that a sequence (ϕi ) converges in
S (Rd ) if and only if it converges in all of the above seminorms.
The Schwartz space has the remarkable property that its complexification is invariant under the Fourier transform. In other words, the Fourier transform, defined by the integral
$$\hat{\varphi}(\boldsymbol{\omega}) = \mathcal{F}\{\varphi\}(\boldsymbol{\omega}) = \int_{\mathbb{R}^d} \varphi(\boldsymbol{r})\, \mathrm{e}^{-\mathrm{j}\langle \boldsymbol{r}, \boldsymbol{\omega}\rangle} \,\mathrm{d}\boldsymbol{r}$$
and inverted by
$$\varphi(\boldsymbol{r}) = \mathcal{F}^{-1}\{\hat{\varphi}\}(\boldsymbol{r}) = \int_{\mathbb{R}^d} \mathrm{e}^{\mathrm{j}\langle \boldsymbol{r}, \boldsymbol{\omega}\rangle}\, \hat{\varphi}(\boldsymbol{\omega}) \,\frac{\mathrm{d}\boldsymbol{\omega}}{(2\pi)^d},$$
is a continuous map from S(R^d) into itself. Our convention here is to use ω ∈ R^d as the generic Fourier-domain index variable.
In addition, both S (Rd ) and D (Rd ) are closed and continuous under differentiation
of any order and multiplication by polynomials. Lastly, they are included in R (Rd ) and
hence in all the Lebesgue spaces, Lp (Rd ), which do not require any smoothness.
Let ℓp denote the space of sequences (ci) with Σ_{i∈N} |ci|^p < ∞. A continuous linear operator A : X → Y is nuclear if there exist an equicontinuous sequence (ai) of linear functionals on X, an operator Λ : ℓ∞ → ℓ1 : (ci) ↦ (λi ci) where Σ_i |λi| < ∞, and a bounded sequence ψ = (ψi) in Y with synthesis map M_ψ : (ci) ↦ Σ_{i∈N} ci ψi, such that we can write A = M_ψ ∘ Λ ∘ A_a, where A_a : ϕ ↦ (ai(ϕ)). Equivalently,
$$A : \varphi \mapsto \sum_{i\in\mathbb{N}} \lambda_i\, a_i(\varphi)\, \psi_i.$$
Given a space X , a functional on X is a map f that takes X to the scalar field R (or C).
In other words, f takes a function ϕ ∈ X as argument and returns the number f (ϕ).
When X is a vector space, we may consider linear functionals on it, where linearity
has the same meaning as in Definition 3.1. When f is a linear functional, it is customary to use the notation ⟨ϕ, f⟩ in place of f(ϕ).
The set of all linear functionals on X, denoted as X∗, can be given the structure of a vector space in the obvious way by the identity ⟨ϕ, af + bg⟩ = a⟨ϕ, f⟩ + b⟨ϕ, g⟩ for scalars a, b and f, g ∈ X∗.
As a general rule, in this book we shall adopt some standard topologies and only work with the corresponding continuous dual space, which we shall simply call the dual. Also, henceforth, we shall assume the scalar product ⟨·, ·⟩ to be restricted to X × X′. There, the space X may vary but is necessarily paired with its continuous dual.
Following the restrictions of the previous paragraph, we sometimes say that the adjoint of A : X → Y exists, meaning that the algebraic adjoint A∗ : Y∗ → X∗, when restricted to Y′, maps into X′, so that we can write
$$\langle A\varphi, f\rangle = \langle \varphi, A^*f\rangle,$$
where the scalar products on the two sides are now restricted to Y × Y′ and X × X′, respectively.
One can define different topologies on X′ by providing various criteria for convergence. The only one we shall need to deal with is the weak-∗ topology, which indicates (for a sequential space X) that (fi) converges to f in X′ if and only if lim_i ⟨ϕ, fi⟩ = ⟨ϕ, f⟩ for all ϕ ∈ X.
The duality is realized through the pairing
$$\langle \varphi, f\rangle = \int_{\mathbb{R}^d} \varphi(\boldsymbol{r})\, f(\boldsymbol{r})\,\mathrm{d}\boldsymbol{r} \tag{3.3}$$
for ϕ ∈ Lp(R^d) and f ∈ L_{p′}(R^d). In particular, L2(R^d), which is the only Hilbert space of the family, is its own dual.
To see that linear functionals described by the above formula with f ∈ L_{p′} are continuous on Lp, we can rely on Hölder's inequality, which states that
$$|\langle \varphi, f\rangle| \le \|\varphi\|_p\, \|f\|_{p'} \tag{3.4}$$
for 1 ≤ p, p′ ≤ ∞ and 1/p + 1/p′ = 1. The special case of (3.4) for p = 2 yields the Cauchy–Schwarz inequality.
function in the sense of Section 3.3). Ordinary locally integrable functions 2 (in particular, all Lp functions and all continuous functions) can be identified with elements of D′(R^d) by using (3.3). By this we mean that any locally integrable function f defines a continuous linear functional on D(R^d) where, for ϕ ∈ D(R^d), ⟨ϕ, f⟩ is given by (3.3). However, not all elements of D′(R^d) can be characterized in this way. For instance, the Dirac functional δ (a.k.a. Dirac impulse), which maps ϕ ∈ D(R^d) to the value ⟨ϕ, δ⟩ = ϕ(0), belongs in D′(R^d) but cannot be written as an integral à la (3.3). Even in this and similar cases, we may sometimes write ∫_{R^d} ϕ(r) f(r) dr, keeping in mind that the integral is no longer a true (i.e., Lebesgue) integral, but simply an alternative notation for ⟨ϕ, f⟩.
In similar fashion, the dual of S(R^d), denoted as S′(R^d), is defined and called the space of tempered (or Schwartz) distributions. Since D ⊂ S and any sequence that converges in the topology of D also converges in S, it follows that S′(R^d) is (can be identified with) a smaller space (i.e., a subspace) of D′(R^d). In particular, not every locally integrable function belongs in S′. For example, locally integrable functions of exponential growth have no place in S′ as their scalar product with Schwartz test functions via (3.3) is not in general finite (much less continuous). Once again, S′(R^d) contains objects that are not functions on R^d in the true sense of the word. For example, δ also belongs in S′(R^d).
$$\hat{f}(\boldsymbol{\omega}) = \mathcal{F}\{f\}(\boldsymbol{\omega}) = \int_{\mathbb{R}^d} f(\boldsymbol{r})\,\mathrm{e}^{-\mathrm{j}\langle \boldsymbol{r},\boldsymbol{\omega}\rangle}\,\mathrm{d}\boldsymbol{r}$$
for any f ∈ L1(R^d). This definition admits a unique extension, F : L2(R^d) → L2(R^d), which is an isometry map (Plancherel's theorem). The fact that the Fourier transform preserves the L2 norm of a function (up to a normalization factor) is a direct consequence of Parseval's relation,
$$\langle f, g\rangle_{L_2} = \frac{1}{(2\pi)^d}\,\langle \hat{f}, \hat{g}\rangle_{L_2},$$
whose duality-product equivalent is ⟨f, ĝ⟩ = ⟨f̂, g⟩.
2 A function on Rd is called locally integrable if its integral over any closed bounded set is finite.
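As a quick numerical illustration of Parseval's relation (our own check, not from the book), one can verify the identity on a Gaussian test pair for which the transforms are known in closed form; the grid and test functions below are arbitrary choices.

```python
import numpy as np

dr = 0.01
r = np.arange(-20.0, 20.0, dr)       # space-domain grid
w = np.arange(-20.0, 20.0, dr)       # Fourier-domain grid

f = np.exp(-r**2 / 2)                # f(r) = exp(-r^2/2)
g = np.exp(-(r - 1)**2 / 2)          # g(r) = f(r - 1)

f_hat = np.sqrt(2*np.pi) * np.exp(-w**2 / 2)   # closed-form F{f}
g_hat = f_hat * np.exp(-1j * w)                # shift -> modulation

lhs = (f * g).sum() * dr                                      # <f, g>_{L2}
rhs = ((f_hat * np.conj(g_hat)).sum() * dr / (2*np.pi)).real  # (2pi)^{-1} <f^, g^>
print(lhs, rhs)   # both ~ sqrt(pi)*exp(-1/4) ~ 1.3803
```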
In addition, we may reasonably suppose that the phenomenon under observation has
some form of continuity, meaning that
$$\lim_i \langle \varphi_i, f\rangle = \langle \varphi, f\rangle,$$
where (ϕi ) is a sequence of sensors that tend to ϕ in a certain sense. We denote the set
of all sensors by X . In the light of the above notions of linear combinations and limits
defined in X , mathematically the space of sensors then has the structure of a topological
vector space.
Given the above properties and the definitions of the previous sections, we conclude that f represents an element of the continuous dual X′ of X. Given that our sensors, as previously noted, are assumed to be localized in R^d, we may model them as compactly supported or rapidly decaying functions on R^d, denoted by the same symbols (ϕ, ψ, . . .) and, in the case where f also corresponds to a function on R^d, relate the observation ⟨ϕ, f⟩ to the functional form of ϕ and f by the identity (3.3), i.e., ⟨ϕ, f⟩ = ∫_{R^d} ϕ(r) f(r) dr. We exclude from consideration those functions f for which the above integral is undefined or infinite for some ϕ ∈ X.
However, we are not limited to taking f to be a true function of r ∈ Rd . By requiring
our sensor or test functions to be smooth, we can permit f to become singular; that is,
to depend on the value of ϕ and/or of its derivatives at isolated points/curves inside Rd .
An example of a singular generalized function f, which we have already noted, is the
Dirac distribution (or impulse) δ that measures the value of ϕ at the single point r = 0
(i.e., ϕ, δ = ϕ(0)).
Mathematically, we define generalized functions as members of the continuous dual X′ of a nuclear space X of functions, such as D(R^d) or S(R^d).
3 The connection with previous sections should already be apparent from this choice of notation.
$$\langle \partial^{\boldsymbol{n}}\varphi, f\rangle = \langle \varphi, \partial^{\boldsymbol{n}*} f\rangle.$$
Now, using integration by parts in (3.3), for ϕ, f in D(R^d) or S(R^d) we see that ∂^{n∗} = (−1)^{|n|} ∂^n. In other words, we can write
$$\langle \varphi, \partial^{\boldsymbol{n}} f\rangle = (-1)^{|\boldsymbol{n}|}\, \langle \partial^{\boldsymbol{n}}\varphi, f\rangle. \tag{3.5}$$
The idea is then to use (3.5) as the defining formula in order to extend the action of the derivative operator ∂^n for any f ∈ D′(R^d) or S′(R^d).
Formulas for scaling, shifting (translation), rotation, and other geometric transformations of distributions are obtained in a similar manner. For instance, the translation by r0 of a generalized function f is defined via the identity
$$\langle \varphi, f(\cdot - \boldsymbol{r}_0)\rangle = \langle \varphi(\cdot + \boldsymbol{r}_0), f\rangle.$$
More generally, given an adjoint pair of operators U, U∗ on the test functions, we set
$$\langle \varphi, Uf\rangle = \langle U^*\varphi, f\rangle \qquad \text{and} \qquad \langle \varphi, U^*f\rangle = \langle U\varphi, f\rangle$$
for all f. A similar definition gives the extension of adjoint pairs D(R^d) → D(R^d) to operators D′(R^d) → D′(R^d).
Examples of operators S (Rd ) → S (Rd ) that can be extended in the above fashion
include derivatives, rotations, scaling, translation, time-reversal, and multiplication by
smooth functions of slow growth in the space–time domain. The other fundamental
operation is the Fourier transform, which is treated in Section 3.3.3.
The (generalized) Fourier transform of f ∈ S′(R^d) is defined through the identity
$$\langle \varphi, \hat{f}\rangle = \langle \hat{\varphi}, f\rangle, \qquad \text{where } \hat{\varphi}(\boldsymbol{\omega}) = \int_{\mathbb{R}^d} \mathrm{e}^{-\mathrm{j}\langle \boldsymbol{r},\boldsymbol{\omega}\rangle}\, \varphi(\boldsymbol{r})\,\mathrm{d}\boldsymbol{r}.$$
For example, since we have
$$\int_{\mathbb{R}^d} \varphi(\boldsymbol{r})\,\mathrm{d}\boldsymbol{r} = \langle \varphi, 1\rangle = \hat{\varphi}(\boldsymbol{0}) = \langle \hat{\varphi}, \delta\rangle,$$
we conclude that the (generalized) Fourier transform of δ is the constant function 1.
The fundamental property of the generalized Fourier transform is that it maps S′(R^d) into itself and that it is invertible, with F^{-1} = (1/(2π)^d) F̄, where F̄{f} = F{f^∨}. This
quasi-self-reversibility – also expressed by the first row of Table 3.3 – implies that
any operation on generalized functions that is admissible in the space–time domain
has its counterpart in the Fourier domain, and vice versa. For instance, multiplication
with a smooth function in the Fourier domain corresponds to a convolution in the signal
domain. Consequently, the familiar functional identities concerning the classical Fourier transform, such as the formulas for change of variables and differentiation, among others, also hold true for this generalization. These are summarized in Table 3.3.

Table 3.3 Properties of the generalized Fourier transform.

Space–time domain | Fourier domain
$\hat{f}(\boldsymbol{r}) = \mathcal{F}\{f\}(\boldsymbol{r})$ | $(2\pi)^d f(-\boldsymbol{\omega})$
$f^{\vee}(\boldsymbol{r}) = f(-\boldsymbol{r})$ | $\hat{f}(-\boldsymbol{\omega}) = \hat{f}^{\vee}(\boldsymbol{\omega})$
$\overline{f}(\boldsymbol{r})$ | $\overline{\hat{f}(-\boldsymbol{\omega})}$
$f(\mathbf{A}^T \boldsymbol{r})$ | $\frac{1}{|\det \mathbf{A}|}\, \hat{f}(\mathbf{A}^{-1}\boldsymbol{\omega})$
$f(\boldsymbol{r} - \boldsymbol{r}_0)$ | $\mathrm{e}^{-\mathrm{j}\langle \boldsymbol{r}_0, \boldsymbol{\omega}\rangle}\, \hat{f}(\boldsymbol{\omega})$
$\mathrm{e}^{\mathrm{j}\langle \boldsymbol{r}, \boldsymbol{\omega}_0\rangle} f(\boldsymbol{r})$ | $\hat{f}(\boldsymbol{\omega} - \boldsymbol{\omega}_0)$
$\partial^{\boldsymbol{n}} f(\boldsymbol{r})$ | $(\mathrm{j}\boldsymbol{\omega})^{\boldsymbol{n}}\, \hat{f}(\boldsymbol{\omega})$
$\boldsymbol{r}^{\boldsymbol{n}} f(\boldsymbol{r})$ | $\mathrm{j}^{|\boldsymbol{n}|}\, \partial^{\boldsymbol{n}} \hat{f}(\boldsymbol{\omega})$
$(g * f)(\boldsymbol{r})$ | $\hat{g}(\boldsymbol{\omega})\, \hat{f}(\boldsymbol{\omega})$
$g(\boldsymbol{r})\, f(\boldsymbol{r})$ | $(2\pi)^{-d}\, (\hat{g} * \hat{f})(\boldsymbol{\omega})$
In addition, the reader can find in Appendix A a table of Fourier transforms of some
important singular generalized functions in one and several variables.
THEOREM 3.1 (Schwartz' kernel theorem: first form) Every continuous linear operator A : S(R^d) → S′(R^d) can be written in the form
$$A\{\varphi\}(\boldsymbol{r}) = \langle \varphi,\, a(\boldsymbol{r}, \cdot)\rangle \tag{3.6}$$
for some generalized function a ∈ S′(R^d × R^d). In particular, A{δ(· − s0)}(r) = a(r, s0), which corresponds to making the formal substitution ϕ = δ(· − s0) in (3.6). One can therefore view a(·, s0) as the generalized impulse response of A.
An equivalent statement of Theorem 3.1 is that every continuous bilinear form l on S(R^d) × S(R^d) can be written as
$$l(\varphi_1, \varphi_2) = \int_{\mathbb{R}^d}\!\int_{\mathbb{R}^d} \varphi_1(\boldsymbol{r})\,\varphi_2(\boldsymbol{s})\, a(\boldsymbol{r},\boldsymbol{s})\,\mathrm{d}\boldsymbol{r}\,\mathrm{d}\boldsymbol{s} \tag{3.8}$$
for some generalized function a ∈ S′(R^d × R^d).
One may argue that the signal-domain notation that is used in both (3.6) and (3.8) is somewhat abusive since A{ϕ} and a do not necessarily have an interpretation as classical functions (see statement on the notation in Section 3.1.1). The purists therefore prefer to denote (3.8) as
$$l(\varphi_1, \varphi_2) = \langle \varphi_1 \otimes \varphi_2,\, a\rangle \tag{3.9}$$
with l(ϕ1, ϕ) = ⟨ϕ1, Aϕ⟩, where Aϕ ∈ S′(R^d) is the generalized function specified by (3.6) or, equivalently, by the inner "integral" (duality product) with respect to s in (3.8).
The convolution of two distributions will then correspond to the composition of two LSI operators. To fix ideas, let us take two distributions f and h, with corresponding operators Af and Ah. We then wish to identify f ∗ h with the composition Af Ah. However, note that, by the kernel theorem, Af and Ah are initially defined S → S′. Since the codomain of Ah (the space S′) does not match the domain of Af (the space S), this composition is a priori undefined.
There are two principal situations where we can get around the above limitation. The first is when the range of Ah is limited to S ⊂ S′ (i.e., Ah maps S to itself instead of the much larger S′). This is the case for the distributions with a smooth Fourier transform that we discussed previously.
The second situation where we may define the convolution of f and h is when the range of Ah can be restricted to some space X (i.e., Ah : S → X) and, furthermore, Af has a continuous extension to X; that is, we can extend it as Af : X → S′.
An important example of the second situation is when the distributions in question
belong to the spaces Lp (Rd ) and Lq (Rd ) with 1 ≤ p, q ≤ ∞ and 1/p + 1/q ≤ 1. In this
case, their convolution is well defined and can be identified with a function in Lr (Rd ),
1 ≤ r ≤ ∞, with
$$1 + \frac{1}{r} = \frac{1}{p} + \frac{1}{q}.$$
Moreover, for f ∈ Lp(R^d) and h ∈ Lq(R^d), we have
$$\|f * h\|_{L_r} \le \|f\|_{L_p}\, \|h\|_{L_q}.$$
This result is Young's inequality for convolutions. An important special case of this identity, most useful in derivations, is obtained for q = 1 and p = r:
$$\|f * h\|_{L_p} \le \|f\|_{L_p}\, \|h\|_{L_1}. \tag{3.11}$$
The latter formula indicates that Lp (Rd ) spaces are “stable” under convolution with
elements of L1 (Rd ) (stable filters).
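This stability is easy to probe numerically. The following sketch (our illustration; the test signal and filter are arbitrary choices) checks the special case ‖f ∗ h‖_p ≤ ‖f‖_p ‖h‖_1 on a sampled grid.

```python
import numpy as np

def lp_norm(x, p, dr):
    # Riemann-sum approximation of the L_p norm on a uniform grid
    return (np.abs(x) ** p).sum() ** (1.0 / p) * dr ** (1.0 / p)

rng = np.random.default_rng(0)
dr = 0.01
r = np.arange(-10.0, 10.0, dr)

f = rng.standard_normal(r.size)            # arbitrary test signal
h = np.exp(-np.abs(r))                     # stable filter, ||h||_1 = 2

fh = np.convolve(f, h, mode="same") * dr   # discretized convolution (f*h)(r)

p = 2.0
print(lp_norm(fh, p, dr), lp_norm(f, p, dr) * np.abs(h).sum() * dr)
# first value <= second value, in accordance with Young's inequality
```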
The first observation is that the definition guarantees linearity and shift-invariance. Moreover, since S(R^d) ⊂ Lp(R^d) ⊂ S′(R^d), the multiplier operator can be viewed as acting on Lp(R^d); its boundedness is quantified by the operator norm
$$\|T\|_{L_p} = \sup_{f \in L_p(\mathbb{R}^d)\setminus\{0\}} \frac{\|Tf\|_{L_p}}{\|f\|_{L_p}}.$$
In practice, it is often sufficient to work out bounds for the extreme cases (e.g.,
p = 1, +∞) and to then invoke the Riesz–Thorin interpolation theorem to extend the
results to the p values in between.
THEOREM 3.4 (Riesz–Thorin) Let T be a linear operator that is bounded on Lp1(R^d) as well as on Lp2(R^d), with 1 ≤ p1 ≤ p2. Then, T is also bounded for any p ∈ [p1, p2] in the sense that there exist constants Cp = ‖T‖_{Lp} with min(Cp1, Cp2) ≤ Cp ≤ max(Cp1, Cp2) such that
$$\|Tf\|_{L_p} \le C_p\, \|f\|_{L_p}.$$
The next theorem summarizes the main results that are available on the characterization of convolution operators on Lp(R^d).
is met with
$$\|T\|_{L_p} \le \|\mu_h\|_{TV} = \sup_{\|\varphi\|_{\infty} \le 1} \langle \varphi, h\rangle. \tag{3.12}$$
We note that the above theorem is an extension of (3.11) since being a finite
Borel measure is less restrictive a condition than h ∈ L1 (Rd ). To see this, we invoke
Lebesgue’s decomposition theorem, stating that a finite measure μh admits a unique
decomposition as
μh = μac + μsing ,
where μac is an absolutely continuous measure and μsing a singular measure whose mass
is concentrated on a set whose Lebesgue measure is zero. If μsing = 0, then there exists a unique function h ∈ L1(R^d) – the Radon–Nikodym derivative of μh with respect to the Lebesgue measure – such that μh(E) = ∫_E h(r) dr for every Borel set E.
Mikhlin’s condition, which can absorb some degree of discontinuity at the origin, is
easy to check in practice. It is stronger than the minimal boundedness requirement for
p = 2.
$$P_X(A) = \int_A p_X(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}$$
5 Probability distributions should not be confused with the distributions in the sense of Schwartz (i.e., generalized functions) that were introduced in Section 3.3. It is important to distinguish the two usages, in part because, as we describe here, in finite dimensions a connection can be made between probability distributions and positive generalized functions.
6 In classical probability theory, a pdf is defined as the Radon–Nikodym derivative of a probability measure with respect to some other measure, typically the Lebesgue measure (as we shall assume). This requires the probability measure to be absolutely continuous with respect to the latter measure. The definition of the generalized pdf given here is more permissive, and also includes measures that are singular with respect to the Lebesgue measure (for instance the Dirac measure of a point, for which the generalized pdf is a Dirac distribution). This generalization relies on identifying measures on the Euclidean space with positive linear functionals.
The functions f, g above are assumed to be fixed, and the joint event that A occurs for X and B for Y is given by f^{−1}(A) ∩ g^{−1}(B). If the outcome of X has no bearing on the outcome of Y and vice versa, then X and Y are said to be independent. In terms of probabilities, this translates into the probability factorization rule
$$P_{X,Y}\big(f^{-1}(A) \cap g^{-1}(B)\big) = P_X(A)\cdot P_Y(B) = P_{X,Y}\big(f^{-1}(A)\big)\cdot P_{X,Y}\big(g^{-1}(B)\big).$$
$$\hat{p}_X(\boldsymbol{\omega}) = E\{\mathrm{e}^{\mathrm{j}\langle\boldsymbol{\omega},\boldsymbol{x}\rangle}\} = \int_{\mathbb{R}^n} \mathrm{e}^{\mathrm{j}\langle\boldsymbol{\omega},\boldsymbol{x}\rangle}\, p_X(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x} = \overline{\mathcal{F}}\{p_X\}(\boldsymbol{\omega}), \tag{3.14}$$
or, in terms of the underlying probability measure,
$$\hat{p}_X(\boldsymbol{\omega}) = \int_{\mathbb{R}^n} \mathrm{e}^{\mathrm{j}\langle\boldsymbol{\omega},\boldsymbol{x}\rangle}\, P_X(\mathrm{d}\boldsymbol{x}) = E\{\mathrm{e}^{\mathrm{j}\langle\boldsymbol{\omega},\boldsymbol{x}\rangle}\}.$$
Conversely, the function specified by (3.14) with p_X(r) ≥ 0 and ∫_{R^n} p_X(r) dr = 1 is positive definite, uniformly continuous, and such that |p̂_X(ω)| ≤ p̂_X(0) = 1.
The interesting twist (which is due to Lévy) is that the positive definiteness of p̂_X and its continuity at 0 imply continuity everywhere (as well as boundedness).
Since, by the above theorem, p̂_X uniquely identifies P_X, it is called the characteristic function of the probability measure P_X (recall that the probability measure P_X is related to the density p_X by P_X(E) = ∫_E p_X(x) dx for sets E in the σ-algebra over R^n).
The next theorem characterizes weak convergence of measures on Rn in terms of
their characteristic functions.
$$\hat{P}_X(\varphi) = \int_{X'} \mathrm{e}^{\mathrm{j}\langle\varphi, x\rangle}\, P_X(\mathrm{d}x) = E\{\mathrm{e}^{\mathrm{j}\langle\varphi, X\rangle}\}.$$
truly powerful aspect is that this rule condenses all the information about the statistical
distribution of some underlying infinite-dimensional random object X. When working
with characteristic functionals, we shall see that computing probabilities and deriving
various properties of the said processes are all reduced to analytical derivations.
7 We shall use the terms random/stochastic “process” and “field” almost interchangeably. The distinction, in
general, lies in the fact that, for a random process, the parameter is typically interpreted as time, while, for
a field, the parameter is typically multidimensional and interpreted as a spatial or spatio-temporal location.
Table 3.4 Comparison of innovation models in finite- and infinite-dimensional settings. See Sections 4.3–4.5 for a detailed explanation.

Finite-dimensional | Infinite-dimensional
Standard Gaussian i.i.d. vector W = (W1, . . . , WN), with $\hat{p}_W(\boldsymbol{\omega}) = \mathrm{e}^{-\frac{1}{2}|\boldsymbol{\omega}|^2}$, ω ∈ R^N | Standard Gaussian white noise w, with $\hat{P}_w(\varphi) = \mathrm{e}^{-\frac{1}{2}\|\varphi\|_2^2}$, ϕ ∈ S
Multivariate Gaussian vector X = AW, with $\hat{p}_X(\boldsymbol{\omega}) = \mathrm{e}^{-\frac{1}{2}|\mathbf{A}^T\boldsymbol{\omega}|^2}$ | Gaussian generalized process s = Aw (for continuous A : S′ → S′), with $\hat{P}_s(\varphi) = \mathrm{e}^{-\frac{1}{2}\|A^*\varphi\|_2^2}$
General i.i.d. vector W = (W1, . . . , WN) with exponent f, with $\hat{p}_W(\boldsymbol{\omega}) = \mathrm{e}^{\sum_{n=1}^{N} f(\omega_n)}$ | General white noise w with Lévy exponent f, with $\hat{P}_w(\varphi) = \mathrm{e}^{\int_{\mathbb{R}^d} f(\varphi(\boldsymbol{r}))\,\mathrm{d}\boldsymbol{r}}$
entity belonging to the dual X′ of X, since we have not yet defined a proper probability measure on X′. 8 Doing so involves some additional steps.
PROPOSITION 3.10 Let s be a generalized random process with a characteristic functional P̂_s(ϕ) = E{e^{j⟨ϕ,s⟩}} that is continuous over the function space X. Then,
$$\hat{p}_{(Y_1:Y_N)}(\boldsymbol{\omega}) = \hat{P}_{s,\varphi_1:\varphi_N}(\boldsymbol{\omega}) = \hat{P}_s\Big(\sum_{n=1}^{N} \omega_n \varphi_n\Big),$$
where the observation functions ϕn ∈ X are fixed, Yn = ⟨ϕn, s⟩, and ω = (ω1, . . . , ωN) plays the role of the N-dimensional Fourier variable.
Proof The continuity assumption over the function space X (which need not be nuclear) ensures that the manipulation is legitimate. Starting from the definition of the characteristic function of y = (Y1, . . . , YN), we have
$$\begin{aligned} \hat{p}_{(Y_1:Y_N)}(\boldsymbol{\omega}) &= E\big\{\exp\big(\mathrm{j}\langle\boldsymbol{\omega}, \boldsymbol{y}\rangle\big)\big\} \\ &= E\Big\{\exp\Big(\mathrm{j}\sum_{n=1}^{N} \omega_n \langle\varphi_n, s\rangle\Big)\Big\} \\ &= E\Big\{\exp\Big(\mathrm{j}\,\Big\langle\sum_{n=1}^{N} \omega_n \varphi_n,\, s\Big\rangle\Big)\Big\} \qquad \text{(by linearity of the duality product)} \\ &= \hat{P}_s\Big(\sum_{n=1}^{N} \omega_n \varphi_n\Big) \qquad \text{(by definition of } \hat{P}_s\text{)}. \end{aligned}$$
The density p_{(Y_1:Y_N)} is then obtained by inverse (conjugate) Fourier transformation.
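For the Gaussian case treated further below, the proposition can be verified by brute force: discretize the white noise on a grid and compare the empirical characteristic function of Y = ⟨ϕ, w⟩ against P̂_w(ωϕ) = e^{−½ω²‖ϕ‖₂²}. This is our own sanity-check sketch; the grid, test function, and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
dr = 0.02
r = np.arange(-5.0, 5.0, dr)
phi = np.exp(-r**2)                        # test function phi in S(R)

# Discretized Gaussian white noise: i.i.d. N(0, 1/dr) samples per grid cell,
# so that <phi, w> ~ N(0, ||phi||_2^2) as dr -> 0.
n_mc = 5000
W = rng.standard_normal((n_mc, r.size)) / np.sqrt(dr)
Y = (W * phi).sum(axis=1) * dr             # Monte Carlo samples of <phi, w>

omega = 1.3
emp = np.exp(1j * omega * Y).mean()                     # E{exp(j*omega*Y)}
theo = np.exp(-0.5 * omega**2 * (phi**2).sum() * dr)    # P_w(omega*phi)
print(abs(emp - theo))   # small, up to Monte Carlo/discretization error
```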
Similarly, the formalism allows one to retrieve all first- and second-order moments of the generalized stochastic process s. To that end, one considers the mean and correlation functionals defined and computed as
$$M_s(\varphi) = E\{\langle\varphi, s\rangle\} = (-\mathrm{j})\,\frac{\mathrm{d}}{\mathrm{d}\omega}\, \hat{p}_{\langle\varphi,s\rangle}(\omega)\Big|_{\omega=0} = (-\mathrm{j})\,\frac{\mathrm{d}}{\mathrm{d}\omega}\, \hat{P}_s(\omega\varphi)\Big|_{\omega=0} \tag{3.16}$$
$$B_s(\varphi_1, \varphi_2) = E\{\langle\varphi_1, s\rangle \langle\varphi_2, s\rangle\} = (-\mathrm{j})^2\, \frac{\partial^2}{\partial\omega_1 \partial\omega_2}\, \hat{p}_{\langle\varphi_1,s\rangle,\langle\varphi_2,s\rangle}(\omega_1, \omega_2)\Big|_{\omega_1,\omega_2=0} = (-\mathrm{j})^2\, \frac{\partial^2}{\partial\omega_1 \partial\omega_2}\, \hat{P}_s(\omega_1\varphi_1 + \omega_2\varphi_2)\Big|_{\omega_1,\omega_2=0}. \tag{3.17}$$
When the space of test functions is nuclear (X = S(R^d) or D(R^d)) and the above quantities are well defined, we can find generalized functions ms (the generalized mean) and Rs (the generalized correlation function) such that Ms(ϕ) = ⟨ϕ, ms⟩ and Bs(ϕ1, ϕ2) = ⟨ϕ1 ⊗ ϕ2, Rs⟩.
Given a characteristic functional P̂_w : Y → C that satisfies the three conditions of Theorem 3.9 (continuity, positive definiteness, and normalization), we obtain a new functional
$$\hat{P}_s : X \to \mathbb{C}$$
fulfilling the same properties by composing P̂_w and U as per
$$\hat{P}_s(\varphi) = \hat{P}_w(U\varphi) \qquad \text{for all } \varphi \in X. \tag{3.18}$$
Writing
$$\hat{P}_s(\omega\varphi) = E\{\mathrm{e}^{\mathrm{j}\omega\langle\varphi,s\rangle}\} = \hat{p}_{\langle\varphi,s\rangle}(\omega)$$
and
$$\hat{P}_w(\omega U\varphi) = E\{\mathrm{e}^{\mathrm{j}\omega\langle U\varphi,w\rangle}\} = \hat{p}_{\langle U\varphi,w\rangle}(\omega)$$
for generalized processes s and w, we deduce that the random variables ⟨ϕ, s⟩ and ⟨Uϕ, w⟩ have the same characteristic functions and therefore follow the same probability law. The manipulation that led to Proposition 3.10 shows that a similar relation exists, more generally, for any finite collection of observations ⟨ϕn, s⟩ and ⟨Uϕn, w⟩, 1 ≤ n ≤ N, N ∈ N.
This seems to indicate that, in a sense, the random model s, which we have defined using (3.18), can be interpreted as the application of U∗ to the original random model w. However, things are complicated by the fact that, unless X and Y are nuclear spaces, we may not be able to interpret w and s as random elements of Y′ and X′, respectively. Therefore the application of U∗ : Y′ → X′ to s should be understood to be merely a formal construction.
On the other hand, by requiring X to be nuclear and Y to be either nuclear or completely normed, we see immediately that P̂_s : X → C fulfills the requirements of the Minlos–Bochner theorem, and thereby defines a generalized random process with realizations in X′.
The previous discussion suggests the following approach to defining generalized random processes: take a continuous, positive definite functional P̂_w : Y → C on some (nuclear or completely normed) space Y. Then, for any continuous operator U defined from a nuclear space X into Y, the composition
$$\hat{P}_s = \hat{P}_w(U\cdot)$$
specifies a bona fide generalized random process s with realizations in X′.
Since a white noise is meant to have independent values at every point, we therefore require that the random variables ⟨ϕ1, w⟩ and ⟨ϕ2, w⟩ be independent whenever the test functions ϕ1 and ϕ2 have disjoint supports.
Since the joint characteristic function of independent random variables factorizes (is separable), we can formulate the above property in terms of the characteristic functional P̂_w of w as
$$\hat{P}_w(\varphi_1 + \varphi_2) = \hat{P}_w(\varphi_1)\, \hat{P}_w(\varphi_2).$$
An important class of characteristic functionals fulfilling this requirement are those that can be written in the form
$$\hat{P}_w(\varphi) = \mathrm{e}^{\int_{\mathbb{R}^d} f(\varphi(\boldsymbol{r}))\,\mathrm{d}\boldsymbol{r}}. \tag{3.19}$$
The canonical example is the white Gaussian noise, whose characteristic functional is P̂_w(ϕ) = e^{−‖ϕ‖₂²/2} (see Table 3.4). Note that this functional is a special instance of (3.19) with f(ω) = −½ω². The Gaussian appellation is justified by observing that, for any N test functions ϕ1, . . . , ϕN, the random
variables ⟨ϕ1, w⟩, . . . , ⟨ϕN, w⟩ are jointly Gaussian. Indeed, we can apply Proposition 3.10 to obtain the joint characteristic function
$$\hat{p}_{\varphi_1:\varphi_N}(\boldsymbol{\omega}) = \exp\Big(-\frac{1}{2}\,\Big\|\sum_{n=1}^{N} \omega_n \varphi_n\Big\|_2^2\Big).$$
By taking the inverse Fourier transform of the above expression, we find that the random variables ⟨ϕn, w⟩, n = 1, . . . , N, have a multivariate Gaussian distribution with mean 0 and covariance matrix with entries
$$C_{mn} = \langle \varphi_m, \varphi_n\rangle.$$
The independence of ⟨ϕ1, w⟩ and ⟨ϕ2, w⟩ is obvious whenever ϕ1 and ϕ2 have disjoint support. This justifies calling the process white. 9 In this special case, even mere orthogonality of ϕ1 and ϕ2 is enough for independence, since for ϕm ⊥ ϕn we have Cmn = 0.
From (3.16) and (3.17), we also find that w has 0 mean and “correlation function”
Rw (r, s) = δ(r − s), which should also be familiar. In fact, this last expression is some-
times used to formally “define” white Gaussian noise.
A filtered white Gaussian noise is obtained by applying a continuous convolution (i.e., LSI) operator U∗ : S′ → S′ to the Gaussian innovation, in the sense described in Section 3.5.4.
Let us denote the convolution kernel of the operator U : S → S (the adjoint of U∗) by h. 10 The convolution kernel of U∗ : S′ → S′ is then h∨. Following Section 3.5.4,
we find the following characteristic functional for the filtered process U∗w = h∨ ∗ w:
$$\hat{P}_{U^*w}(\varphi) = \mathrm{e}^{-\frac{1}{2}\|h * \varphi\|_2^2}.$$
In turn, it yields the following mean and correlation functions
$$m_{U^*w}(\boldsymbol{r}) = 0, \qquad R_{U^*w}(\boldsymbol{r}, \boldsymbol{s}) = (h * h^{\vee})(\boldsymbol{r} - \boldsymbol{s}),$$
as expected.
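These second-order statistics can be reproduced in simulation. The sketch below (our illustration, with an arbitrary causal exponential kernel) filters a discretized Gaussian white noise and compares the empirical autocorrelation with (h ∗ h∨)(τ), which equals e^{−|τ|}/2 for h(t) = 1_{[0,∞)}(t) e^{−t}.

```python
import numpy as np

rng = np.random.default_rng(2)
dr = 0.05
n = 1 << 16
w = rng.standard_normal(n) / np.sqrt(dr)      # discretized Gaussian white noise

t = np.arange(0.0, 10.0, dr)
h = np.exp(-t)                                # causal exponential kernel

s = np.convolve(w, h)[:n] * dr                # filtered noise
s = s[500:]                                   # discard the start-up transient

for k in (0, 10, 20):                         # lags tau = k*dr
    emp = (s[:s.size - k] * s[k:]).mean()
    print(k * dr, emp, np.exp(-k * dr) / 2)   # empirical vs (h*h^v)(tau)
```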
Section 3.3
For a comprehensive treatment of generalized functions, we recommend the books of
Gelfand and Shilov [GS64] and Schwartz [Sch66] (the former being more accessible
while maintaining rigor). The results on Fourier multipliers are covered by Hörmander
[Hör80] and Mikhlin et al. [MP86].
A historical precursor to the theory of generalized functions is the "operational method" of Heaviside, appearing in his collected works in the last decade of the nineteenth century [Hea71]. The introduction of the Lebesgue integral was a major step
that gave a precise meaning to the concept of the almost-everywhere equivalence of
functions. Dirac introduced his eponymous distribution as a convenient notation in the
1920s. Sobolev [Sob36] developed a theory of generalized functions in order to define
weak solutions of partial differential equations. But it was Laurent Schwartz [Sch66]
who put forth the formal and comprehensive theory of generalized functions (distri-
butions) as we use it today (first edition published in 1950). His work was further
developed and exposed by the Russian school of Gelfand et al.
Section 3.4
Kolmogorov is the founding father of the modern axiomatic theory of probability which
is based on measure theory. We still recommend his original book [Kol56] as the main
reference for the material presented here. Newer and more advanced results can be
found in the encyclopedic works of Bogachev [Bog07] and Fremlin [Fre00, Fre01,
Fre02, Fre03, Fre08] on measure theory.
Paul Lévy defined the characteristic function in the early 1920s and is responsible for
turning the Fourier–Stieltjes apparatus into one of the most useful tools of probability
theory [Lév25, Tay75]. The foundation of the finite-dimensional Fourier approach is
Bochner’s theorem, which appeared in 1932 [Boc32].
Interestingly, it was Kolmogorov himself who introduced the characteristic functional
in 1935 as an equivalent (infinite-dimensional) Fourier-based description of a measure
on a Banach space [Kol35]. This tool then lay dormant for many years. The theoretical
breakthrough came when Minlos proved the equivalence between this functional and
the characterization of probability measures on duals of nuclear spaces (Theorem 3.9) –
as hypothesized by Gelfand [Min63, Kol59]. This powerful framework now constitutes
the infinite-dimensional counterpart of the traditional Fourier approach to probability
theory.
What is lesser known is that Laurent Schwartz, who also happened to be Paul
Lévy’s son-in-law, revisited the theory of probability measures on infinite-dimensional
topological vector spaces, including developments from the French school, in the final
years of his career [Sch73b, Sch81b]. These later works are highly abstract, as one may
expect from their author. This makes for an interesting contrast with Paul Lévy, who
had a limited interest in axioms and whose research was primarily guided by an extra-
ordinary intuition.
Section 3.5
The concept of generalized stochastic processes, including the characterization of
continuously defined white noises, was introduced by Gelfand in 1955 [Gel55].

4 Continuous-domain innovation models
The stochastic processes that we wish to characterize are those generated by linear trans-
formation of non-Gaussian white noise. If we were operating in the discrete domain and
restricting ourselves to a finite number of dimensions, we would be able to use any
sequence of i.i.d. random variables wn as system input and rely on conventional multi-
variate statistics to characterize the output. This strongly suggests that the specification
of the mixing matrix (L−1 ) and the probability density function (pdf) of the innovation
is sufficient to obtain a complete description of a linear stochastic process, at least in the
discrete setting.
But our goal is more ambitious since we place ourselves in the context of continuously defined processes. The situation is then not quite as straightforward because: (1)
we are dealing with infinite-dimensional objects, (2) it is much harder to properly define
the notion of continuous-domain white noise, and (3) there are theoretical restrictions
on the class of admissible innovations. While this calls for an advanced mathematical
machinery, the payoff is that the continuous-domain formalism lends itself better to
analytical computations, by virtue of the powerful tools of functional and harmonic
analysis. Another benefit is that the non-Gaussian members of the family are necessarily
sparse as a consequence of the theory which rests upon the powerful characterization
and existence theorems by Lévy–Khintchine, Minlos, Bochner, and Gelfand–Vilenkin.
As in the subsequent chapters, we start by providing some intuition in the first sec-
tion and then proceed with a more formal characterization. Section 4.2 is devoted to
an in-depth investigation of Lévy exponents, which are intimately tied to the family of
infinitely divisible distributions in the classical (scalar) theory of probability. What is
non-standard here and fundamental to our argument is the link that is made between
infinite divisibility and sparsity in Section 4.2.3. In Section 4.3, we apply those results
to the Fourier-domain characterization of a multivariate linear model driven by an infinitely divisible noise vector, which primarily serves as preparation for the subsequent infinite-dimensional generalization. In Section 4.4, we extend the formulation to the
continuous domain, which results in the proper specification of white Lévy noise w (or
non-Gaussian innovations) as a generalized stochastic process (in the sense of Gelfand
and Vilenkin) with independent “values” at every point. The fundamental result is that a
given brand of noise (or innovations) is uniquely specified by its Lévy exponent f (ω) via
its characteristic functional P̂_w(ϕ). Finally, in Section 4.5, we characterize the statistical effect of the mixing operator L^{−1} (general linear model) and provide mathematical
conditions on f and L that ensure that the resulting process s = L−1 w is well defined
mathematically.
In the simplest setting, we observe the continuous-domain innovation w through an array of unit rectangular windows, which yields the random variables
$$X_k = \langle \mathrm{rect}(\cdot - k),\, w\rangle.$$
The concept is illustrated in Figure 4.1. The main point that will be made clearer in what follows is that there is a one-to-one correspondence between the pdf of Xk – the so-called canonical pdf p_id(x) = p_{X(rect)}(x) – and the complete functional description of w via its characteristic functional, which we shall investigate in Section 4.4. What is more remarkable (and distinct from the discrete setting) is that this canonical pdf cannot be arbitrary: the theory dictates that it must be part of the family of infinitely divisible (id) laws (see Section 4.2).
The prime example of an id pdf is the Gaussian distribution illustrated in Figure 4.1a. As already mentioned, it is the only non-sparse member of the family. All other distributions exhibit either a mass density at the origin (like the compound-Poisson example in Figure 4.1c with Prob(x = 0) = e^{−λ} = 0.75 and Gaussian amplitude distribution) or a slower rate of decay at infinity (heavy-tail behavior). The Laplace probability law of Figure 4.1b results in the mildest possible form of sparsity – indeed, it can be proven that there is a gap between the Gaussian and the other members of the family in the sense that there is no id distribution with p(x) = e^{−O(|x|^{1+ε})} with 0 < ε < 1. In other words, a non-Gaussian p_id(x) is constrained to decay like e^{−λ|x|} or slower – typically, like O(1/|x|^r) with r > 1 (inverse polynomial/algebraic decay). The sparsest example in Figure 4.1 is provided by the Cauchy distribution p_Cauchy(x) = 1/(π(x² + 1)), which is part of the symmetric-alpha-stable (SαS) family (here, α = 1). The SαS distributions with α ∈ (0, 2) are notorious for their heavy-tail behavior and the fact that their moments E{|x|^p} are unbounded for p > α.
Figure 4.1 Examples of canonical, infinitely divisible probability density functions and
corresponding observations of a continuous-domain white-noise process through an array of
non-overlapping rectangular integration windows. (a) Gaussian distribution (not sparse). (b)
Laplace distribution (moderately sparse). (c) Compound-Poisson distribution (finite rate of
innovation). (d) Cauchy distribution (ultra-sparse = heavy-tailed with unbounded variance).
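Because the observations X_k = ⟨rect(· − k), w⟩ through non-overlapping windows are i.i.d. draws from the canonical pdf, a simulation in the spirit of Figure 4.1 reduces to sampling the four laws directly. Below is a hedged sketch (our own choice of parameters, mirroring the Prob(x = 0) = 0.75 of panel (c)).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000                              # one draw per rectangular window

gaussian = rng.standard_normal(n)                 # panel (a): not sparse
laplace  = rng.laplace(scale=1.0, size=n)         # panel (b): moderately sparse
cauchy   = rng.standard_cauchy(n)                 # panel (d): heavy-tailed

# Panel (c): compound Poisson with Gaussian amplitudes and Prob(X=0) = 0.75.
lam = -np.log(0.75)                               # exp(-lam) = 0.75
N = rng.poisson(lam, size=n)                      # impulse count per window
cpoisson = np.array([rng.standard_normal(k).sum() for k in N])

print((cpoisson == 0).mean())   # ~0.75: the probability mass at zero
```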
The importance of Lévy exponents in mathematical statistics is that they are tightly
linked with the property of infinite divisibility.
DEFINITION 4.2 (Infinite divisibility) A random variable X with generic pdf pid is
infinitely divisible (id) if and only if, for any N ∈ Z+ , there exist i.i.d. random variables
X1 , . . . , XN such that X has the same distribution as X1 + · · · + XN .
The foundation of the theory of such random variables is that their characteristic func-
tions are in one-to-one correspondence with Lévy exponents. While the better-known
formulation of this equivalence is provided by the Lévy–Khintchine theorem (Theo-
rem 4.2), we prefer first to express it in functional terms, building upon the work of
three giants in harmonic analysis: Lévy, Bochner, and Schoenberg.
THEOREM 4.1 (Lévy–Schoenberg) Let p̂_id(ω) = E{e^{jωX}} = ∫_R e^{jωx} p_id(x) dx be the characteristic function of an infinitely divisible (id) random variable X. Then,
$$f(\omega) = \log \hat{p}_{\mathrm{id}}(\omega)$$
is a Lévy exponent in the sense of Definition 4.1. Conversely, if f is a valid Lévy exponent, then the inverse Fourier integral
$$p_{\mathrm{id}}(x) = \int_{\mathbb{R}} \mathrm{e}^{f(\omega)}\, \mathrm{e}^{-\mathrm{j}\omega x}\, \frac{\mathrm{d}\omega}{2\pi}$$
yields the pdf of an id random variable.
The proof is given in the supplementary material in Section 4.2.4. As for the latter implication, we observe that the condition f(0) = 0 ⇔ p̂_id(0) = 1 implies that ∫_R p_id(x) dx = 1, while the positive definiteness ensures that p_id(x) ≥ 0 so that it is a valid pdf.
The corresponding density function v : R → R+, which is such that μv(da) = v(a) da, is called the Lévy density.
1 In most mathematical texts, the Lévy–Khintchine decomposition is formulated in terms of a Lévy measure μv rather than a density. Even though Lévy measures need not always have a density in the sense of the Radon–Nikodym derivative with respect to the Lebesgue measure (i.e., as an ordinary function), following Bourbaki we may still identify them with positive linear functionals, which we represent notationally as integrals against a "generalized" density: μv(E) = ∫_E v(a) da for any set E in the Borel algebra on R\{0}.
We observe that, as in the case of a pdf, the density v is not necessarily an ordinary
function, for it may include isolated Dirac impulses (discrete part of the measure) as
well as a singular component.
THEOREM 4.2 (Lévy–Khintchine) A probability distribution p_id is infinitely divisible (id) if and only if its characteristic function can be written as
$$\hat{p}_{\mathrm{id}}(\omega) = \int_{\mathbb{R}} p_{\mathrm{id}}(x)\,\mathrm{e}^{\mathrm{j}\omega x}\,\mathrm{d}x = \exp\big(f(\omega)\big), \tag{4.2}$$
with
$$f(\omega) = \mathrm{j}b_1\omega - \frac{b_2\omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \big(\mathrm{e}^{\mathrm{j}a\omega} - 1 - \mathrm{j}a\omega\, \mathbb{1}_{|a|<1}(a)\big)\, v(a)\,\mathrm{d}a, \tag{4.3}$$
where b1 ∈ R and b2 ∈ R+ are some arbitrary constants, and where v is an admissible Lévy density; 1_{|a|<1}(a) is the indicator function of the set Ω = {a ∈ R : |a| < 1} (i.e., 1_Ω(a) = 1 if a ∈ Ω and 1_Ω(a) = 0 otherwise).
The admissibility condition (4.1) guarantees that the right-hand integral in (4.3) is well defined; this follows from the bounds |e^{jaω} − 1 − jaω| < a²ω² and |e^{jaω} − 1| < min(|aω|, 2). An important aspect of the theory is that this allows for (non-integrable) Lévy densities with a singular behavior around the origin; for instance, v(a) = O(1/|a|^{2+ε}) with ε ∈ [0, 1) as a → 0.
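Theorem 4.2 is also directly usable for computation: given an admissible Lévy exponent, the pdf follows by numerical inversion of e^{f(ω)}. The sketch below is our own illustration (the quadrature parameters are arbitrary); it uses the Laplace exponent f(ω) = −log(1 + ω²), whose Lévy triplet appears in Section 4.2.3, and recovers p_id(x) = ½e^{−|x|}.

```python
import numpy as np

dw = 0.01
w = np.arange(-400.0, 400.0, dw)
char = 1.0 / (1.0 + w**2)            # exp(f(w)) with f(w) = -log(1 + w^2)

x = np.array([0.0, 0.5, 1.0, 2.0])
# p_id(x) = (2*pi)^{-1} int exp(f(w)) exp(-j*w*x) dw; cosine form since char is even
pid = np.array([(char * np.cos(w * xi)).sum() * dw for xi in x]) / (2*np.pi)
print(pid)                            # ~ 0.5*exp(-|x|) = [0.5, 0.303, 0.184, 0.068]
```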
The connection with Theorem 4.1 is that the Lévy–Khintchine expansion (4.3) provides a complete characterization of the conditionally positive definite functions of order one, as specified in Definition 4.1. This theme is further developed in Appendix B, which contains the proof of the above statement and also makes interesting links with theoretical results that are fundamental to machine learning and approximation theory.
In Section 4.4, we shall indicate how id distributions (or, equivalently, Lévy expo-
nents) can be used to specify an extended family of continuous-domain white-noise
processes. In that context, we shall typically require that pid has a well-defined first-
order absolute moment and/or that it is symmetric with respect to the origin, which
leads to the following simplifications of the canonical representation.
COROLLARY 4.3 Let p_id be an infinitely divisible pdf whose characteristic function is given by p̂_id(ω) = e^{f(ω)}. Then, depending on the properties of p_id or, equivalently, on the Lévy density v, the Lévy exponent f admits the following Lévy–Khintchine-type representations:
(1) p_id symmetric, i.e., p_id(x) = p_id(−x), if and only if
$$f(\omega) = -\frac{b_2\omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \big(\cos(a\omega) - 1\big)\, v(a)\,\mathrm{d}a \tag{4.4}$$
(2) p_id id with ∫_{R\{0}} a² v(a) da < ∞ if and only if
$$f(\omega) = \mathrm{j}b_1'\omega - \frac{b_2\omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \big(\mathrm{e}^{\mathrm{j}a\omega} - 1 - \mathrm{j}a\omega\big)\, v(a)\,\mathrm{d}a \tag{4.5}$$
(3) p_id id with ∫_{R\{0}} |a| v(a) da < ∞ if and only if
$$f(\omega) = \mathrm{j}b_1''\omega - \frac{b_2\omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \big(\mathrm{e}^{\mathrm{j}a\omega} - 1\big)\, v(a)\,\mathrm{d}a \tag{4.6}$$
where b1′ ∈ R, b2 ∈ R+, b1″ = b1′ − ∫_{R\{0}} a v(a) da, and v(a) ≥ 0 is an admissible Lévy density.
These are obtained by direct manipulation of (4.3) with b1′ = b1 + ∫_{|a|≥1} a v(a) da. Equation (4.4) is valid in all generality, provided that we interpret the integral as a Cauchy principal-value limit (see Appendix A) to handle potential (symmetric) singularities around the origin. The Lévy–Khintchine formulas (4.4) and (4.5) are fundamental because they give an explicit, constructive characterization of the noise functionals that are central to our formulation. From now on, we rely on Corollary 4.3 to specify admissible Lévy exponents: the parameters (b1, b2, v) will be referred to as the Lévy triplet of f(ω).
For completeness, we also mention the less classical (but equivalent) representation of the Lévy exponent as
$$f(\omega) = \mathrm{j}b_1\omega - \frac{b_2\omega^2}{2} + \hat{v}(\omega) - \hat{v}(0), \tag{4.7}$$
where v̂(ω) = F̄{v}(ω) is the (conjugate) Fourier transform of v in the sense of generalized functions. The idea here is to rely on the powerful theory of generalized functions to seamlessly absorb the (potential) singularity 2 of v at a = 0. The interested reader can refer to Appendix A for complementary explanations.
Below is a summary of known criteria for identifying admissible Lévy exponents,
some being more operational than others [GV64, pp. 275–282]. These are all conse-
quences of Bochner’s theorem, which provides a Fourier-domain equivalence between
continuous, positive definite functions and probability density functions (or positive
Borel measures). See Appendix B for an overview and discussion of the functional
notion of positive definiteness and corresponding mathematical tools.
Interestingly, it was Schoenberg (the father of splines) who first established the equivalence between Statements (1) and (5) (see proof of the direct part in Section 4.2.4).
2 This corresponds to interpreting v as the generalized function associated with the finite part (p.f., i.e., Hadamard's "partie finie") of the classical Lévy measure: ⟨ϕ, v⟩ = p.f. ∫_R ϕ(a) μv(da).
The equivalence between (4) and (5) follows from Bochner's theorem (Theorem 3.7). The fact that (2) implies (4) is a side product of the proof in Section 4.2.4, while the converse implication is a direct consequence of (3). Indeed, if f(ω) has a Lévy–Khintchine representation, then the same is true for τf(ω), which also implies that the whole family of pdfs {p_{Xτ}}_{τ∈R+} is infinitely divisible. The latter is uniquely specified by f and therefore in one-to-one correspondence with the canonical id distribution p_id(x) = p_{Xτ}(x)|_{τ=1}. Another important observation is that p̂_{Xτ}(ω) = e^{τf(ω)} = (p̂_id(ω))^τ, so that p_{Xτ} in Statement (4) may be interpreted as the τ-fold (possibly, fractional) convolution of p_id.
In our work, we sometimes need to limit ourselves to some particular subset of Lévy
exponents.
DEFINITION 4.4 (p-admissibility) A Lévy exponent f with derivative f′ is called p-admissible if it satisfies the inequality
$$|f(\omega)| + |\omega|\,|f'(\omega)| \le C\,|\omega|^p \tag{4.8}$$
for some constant C > 0 and 0 < p ≤ 2.
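As a quick sanity check of this definition (our own computation, anticipating the power-law example given below), take f_α(ω) = −|ω|^α with 0 < α ≤ 2. Then
$$f_\alpha'(\omega) = -\alpha\,|\omega|^{\alpha-1}\operatorname{sgn}(\omega) \quad (\omega \neq 0),$$
so that
$$|f_\alpha(\omega)| + |\omega|\,|f_\alpha'(\omega)| = |\omega|^{\alpha} + \alpha\,|\omega|^{\alpha} = (1+\alpha)\,|\omega|^{\alpha};$$
that is, (4.8) holds with C = 1 + α and p = α.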
PROPOSITION 4.5 The generic Lévy exponents
(1) f1(ω) = ∫_{R\{0}} (e^{jaω} − 1) v1(a) da with ∫_R |a| v1(a) da < ∞
(2) f2(ω) = ∫_{R\{0}} (cos(aω) − 1) v2(a) da with ∫_R a² v2(a) da < ∞
(3) f3(ω) = ∫_{R\{0}} (e^{jaω} − 1 − jaω) v3(a) da with ∫_R a² v3(a) da < ∞
are p-admissible with p1 = 1, p2 = 2, and p3 = 2, respectively.
Proof The first result follows from the bounds |e^{jaω} − 1| ≤ |a|·|ω| and |de^{jaω}/dω| ≤ |a|. The second is based on |cos(aω) − 1| ≤ |aω|²/2 and |sin(aω)| ≤ |aω|; specifically, these yield |f2(ω)| ≤ (ω²/2) ∫_R a² v2(a) da and |ω| |f2′(ω)| ≤ ω² ∫_R a² v2(a) da. As for the third exponent, we also use the inequality |e^{jaω} − 1 − jaω| ≤ |aω|²/2, which yields
$$|\omega|\,|f_3'(\omega)| = |\omega|\,\Big|\int_{\mathbb{R}\setminus\{0\}} \mathrm{j}a\big(\mathrm{e}^{\mathrm{j}a\omega} - 1\big)\, v_3(a)\,\mathrm{d}a\Big| \le \omega^2 \int_{\mathbb{R}} a^2\, v_3(a)\,\mathrm{d}a.$$
Since the p-admissibility property is preserved through summation, this covers a large portion of the Lévy exponents specified in Corollary 4.3.
Examples
The power law fα(ω) = −|ω|^α with 0 < α ≤ 2 is Lévy α-admissible; it generates the SαS id distributions [Fel71]. Note that fα(ω) fails to be conditionally positive definite for α > 2, meaning that the inverse Fourier transform of e^{−|ω|^α} exhibits negative values and is therefore not a valid pdf. The upper acceptable limit is α = 2 and corresponds to the Gaussian law. More generally, a Lévy exponent that is symmetric and twice-differentiable at the origin (which is equivalent to the variance of the corresponding id distribution being finite) is p-admissible with p = 2; this follows as a direct consequence of Corollary 4.3 and Proposition 4.5.
Another fundamental instance, which generates the complete family of id compound-Poisson distributions, is
$$f_{\mathrm{Poisson}}(\omega) = \lambda \int_{\mathbb{R}\setminus\{0\}} \big(\mathrm{e}^{\mathrm{j}a\omega} - 1\big)\, p_A(a)\,\mathrm{d}a,$$
where λ > 0 is the Poisson rate and pA(a) ≥ 0 the amplitude pdf with ∫_R pA(a) da = 1. In general, f_Poisson(ω) is p-admissible with p = 1 provided that E{|A|} = ∫_R |a| pA(a) da < ∞ (see Proposition 4.5). If, in addition, pA is symmetric with a bounded variance, then the Poisson range of admissibility extends to p ∈ [1, 2]. Further examples of symmetric id distributions are documented in Table 4.1. Their Lévy exponent is simply obtained by taking f(ω) = log p̂_X(ω).
The relevance of id distributions for signal processing is that any linear combination of independent id random variables is id as well. Indeed, let X1 and X2 be two independent id random variables with Lévy exponents f1 and f2, respectively; then, it is not difficult to show that a1X1 + a2X2, where a1 and a2 are arbitrary constants, is id with Lévy exponent f(ω) = f1(a1ω) + f2(a2ω), as shown below.
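The verification takes one line (our rendering of the omitted computation): by independence,
$$\hat{p}_{a_1X_1 + a_2X_2}(\omega) = E\{\mathrm{e}^{\mathrm{j}\omega(a_1X_1 + a_2X_2)}\} = E\{\mathrm{e}^{\mathrm{j}(a_1\omega)X_1}\}\, E\{\mathrm{e}^{\mathrm{j}(a_2\omega)X_2}\} = \mathrm{e}^{f_1(a_1\omega)}\, \mathrm{e}^{f_2(a_2\omega)} = \mathrm{e}^{f_1(a_1\omega) + f_2(a_2\omega)},$$
and the resulting exponent f(ω) = f1(a1ω) + f2(a2ω) inherits continuity, conditional positive definiteness, and f(0) = 0 from f1 and f2, so it is again a valid Lévy exponent.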
Table 4.1 Primary families of symmetric, infinitely divisible probability laws. The special functions Kα(x), Γ(z), and B(x, y) are defined in Appendix C. (Table body not reproduced.)
Recall that the admissibility of f(ω) implies that τf(ω) is a valid Lévy exponent as well for any τ ≥ 0.
To further our understanding of id distributions, it is instructive to characterize the atoms of the Lévy–Khintchine representation. Focusing on the simplest form (4.6), we identify three types of elementary constituents, with the third type being motivated by the decomposition of a (continuous) Lévy density into a weighted "sum" of Dirac impulses: v(a) = ∫_R v(τ) δ(a − τ) dτ ≈ Σ_n λ_n δ(a − τ_n) with λ_n = v(τ_n)(τ_n − τ_{n−1}):
(1) Linear term f1(ω) = jb1ω. This corresponds to the (degenerate) pdf of a constant X1 = b1 with p_{X1}(x) = δ(x − b1).
(2) Quadratic term f2(ω) = −½b2ω². As already mentioned, this leads to the centered Gaussian distribution with variance b2.
More generally, when v(a) = λp(a) where p(a) ≥ 0 is some arbitrary pdf with ∫_R p(a) da = 1, we can make a compound-Poisson 4 interpretation with
$$f_{\mathrm{Poisson}}(\omega) = \lambda \int_{\mathbb{R}} \big(\mathrm{e}^{\mathrm{j}a\omega} - 1\big)\, p(a)\,\mathrm{d}a = \lambda\big(\hat{p}(\omega) - 1\big),$$
where p̂(ω) = ∫_R e^{jaω} p(a) da is the characteristic function of p = pA. Using the fact that p̂(ω) is bounded, we apply the same type of Taylor-series argument and express the characteristic function as
$$\mathrm{e}^{f_{\mathrm{Poisson}}(\omega)} = \mathrm{e}^{-\lambda} \sum_{m=0}^{\infty} \frac{\big(\lambda\,\hat{p}(\omega)\big)^m}{m!} = \hat{p}_Y(\omega).$$
3 The standard form of the discrete Poisson probability model is Prob(N = n) = e^{−λ} λ^n / n! with n ∈ N. It provides the probability of a given number of independent events (n) occurring in a fixed space–time interval when the average rate of occurrence is λ. The Poisson parameter is equal to the expected value of N, but also to its variance: λ = E{N} = Var{N}.
4 The compound-Poisson probability model has two components. The first is a random variable N that follows a Poisson distribution with parameter λ, and the second a series of i.i.d. random variables A1, A2, A3, . . . with pdf pA which are drawn at each trial of N. Then, Y = Σ_{n=1}^{N} An is a compound-Poisson random variable with Poisson parameter λ and amplitude pdf pA. Its mean and variance are given by E{Y} = λE{A} and Var{Y} = λE{A²}, respectively.
Proof Let p̂_id be the characteristic function of some id distribution p_id and consider an arbitrary sequence τn ↓ 0. Then,
$$\hat{p}_{X_n}(\omega) = \exp\Big(\tau_n^{-1}\big(\hat{p}_{\mathrm{id}}(\omega)^{\tau_n} - 1\big)\Big)$$
The id pdfs for which v ∉ L1(R) are generally smoother than the compound-Poisson ones for they do not display a singularity (Dirac impulse) at the origin, unlike (4.9). Yet,
depending on the degree of concentration (or singularity) of v around the origin, they
will typically exhibit a peaky behavior around the mean. While this class of distributions
is responsible for the additional level of complication in the Lévy–Khintchine formula –
as compared to the simpler Poisson version (4.6) – we argue that it is highly relevant
for applications because of the many possibilities that it offers. Somewhat surprisingly,
there are many families of id distributions with singular Lévy density that are more
tractable mathematically than their compound-Poisson cousins found in Table 4.1, at
least when considering their pdf.
Finite-variance case
We first assume that the second moment m2 = ∫_R a² v(a) da of the Lévy density is finite, which also implies that ∫_{|a|>1} |a| v(a) da < ∞ because of the admissibility condition. Hence, the corresponding Lévy–Khintchine representation is (4.5). An interesting non-Poisson example of infinitely divisible probability laws that falls into this category (with non-integrable v) is the Laplace density with Lévy triplet (0, 0, v(a) = e^{−|a|}/|a|) and p_id(x) = ½e^{−|x|} (see Figure 4.1b). This model is particularly relevant in the context of sparse signal recovery because it provides a tight connection between Lévy processes and total-variation regularization [UT11, Section VI].
Now, if the Lévy density is Lebesgue integrable, we can pull the linear correction out of the Lévy–Khintchine integral and represent f using (4.6) with v(a) = λpA(a) and ∫_R pA(a) da = 1. This implies that we can decompose X into the sum of two independent Gaussian and compound-Poisson random variables. The variances of the Gaussian and Poisson components are σ² = b2 and λE{A²}, respectively. The Poisson component is sparse because its probability density function exhibits the mass distribution e^{−λ}δ(x) at the origin shown in Figure 4.1c, meaning that the chances, for a continuous amplitude distribution, of getting zero are overwhelmingly higher than for any other value, especially for smaller values of λ > 0. It is therefore justifiable to use 0 ≤ e^{−λ} < 1 as our Poisson sparsity index.
Infinite-variance case
We now turn our attention to the case where the second moment of the Lévy density is unbounded, which we like to label as "super-sparse." To justify this terminology, we invoke the Ramachandran–Wolfe theorem [Ram69, Wol71], which states that the pth moment E{|x|^p} with p ∈ R+ of an infinitely divisible distribution is finite if and only if ∫_{|a|>1} |a|^p v(a) da < ∞ (see Section 9.5). For p ≥ 2, the latter is equivalent to ∫_R |a|^p v(a) da < ∞ because of the Lévy admissibility condition. It follows that the cases that are not covered by the previous scenario (including the Gaussian + Poisson model) necessarily give rise to distributions whose moments of order p are unbounded for p ≥ 2. The prototypical representatives of such heavy-tail distributions are the alpha-stable ones (see Figure 4.1d) or, by extension, the broad family of infinitely divisible probability laws that are in their domain of attraction. It has been shown that these distributions precisely fulfill the requirement for ℓp compressibility [AUM11], which is a stronger form of sparsity than the presence of a probability mass at the origin.
Proof of Theorem 4.1 (Lévy–Schoenberg)
Let p̂_id(ω) = ∫_R e^{jωx} p_id(x) dx be the characteristic function of an id random variable. Then, by definition, p̂_id(ω)^{1/n} is a valid characteristic function for any n ∈ Z+. Since the convolution of two pdfs is a pdf, we can also take integer powers, which results in p̂_id(ω)^{m/n} being a characteristic function. For any irrational number τ > 0, we can specify a sequence of rational numbers m/n that converges to τ so that p̂_id(ω)^{m/n} → p̂_id(ω)^τ, with the limit function being continuous. This implies that p̂_{Xτ}(ω) = p̂_id(ω)^τ is a characteristic function for any τ ≥ 0 by Lévy's continuity theorem (Theorem 3.8). Moreover, p̂_{Xτ}(ω) = (p̂_{X_{τ/s}}(ω))^s must be non-zero for any finite τ, owing to the fact that lim_{s→∞} p̂_{X_{τ/s}}(ω) = 1. In particular, p̂_id(ω) = p̂_{Xτ}(ω)|_{τ=1} ≠ 0, so that we can write it as p̂_id(ω) = e^{f(ω)}, where f(ω) is continuous with Re f(ω) ≤ 0 and f(0) = 0. Hence, p̂_{Xτ}(ω) = p̂_id(ω)^τ = e^{τf(ω)} = ∫_R e^{jωx} p_{Xτ}(x) dx, where p_{Xτ}(x) is a valid pdf for any τ ∈ [0, ∞), which is Statement (4) in Proposition 4.4. By Bochner's theorem (Theorem 3.7), this is equivalent to e^{τf(ω)} being positive definite for any τ ≥ 0 with f(ω) continuous and f(0) = 0. The first part of Theorem 4.1 then follows as a corollary of the next fundamental result, which is due to Schoenberg.
Proof We only give the easier part (if statement) and refer to [Sch38, Joh66] for the complete details. The property that e^{τf(ω)} is positive definite for every τ > 0 is expressed as
$$\sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \overline{\xi}_n\, \mathrm{e}^{\tau f(\omega_m - \omega_n)} \ge 0,$$
where the derivation involves the quantity
$$a(\tau) = \int_{\mathbb{R}} \frac{x}{1+x^2}\, \frac{p_{X_\tau}(x)}{\tau}\,\mathrm{d}x.$$
The technical part of the work, which is quite tedious and is not included here, is to
show that the above integrals are bounded and that the two limits are well defined in the
sense that a(τ ) → a0 and Kτ → K (weakly) as τ ↓ 0 with K being a finite measure.
This ultimately yields Khintchine’s canonical representation,
2
jxω x +1
f (ω) = jωa0 + ejωx − 1 − 2
K(dx),
R 1+x x2
where a₀ ∈ R and K is some bounded Borel measure. A potential advantage of Khintchine's
representation is that the corresponding measure K is not singular. The connection
with the standard Lévy–Khintchine formula is b² = K(0⁺) − K(0⁻) and v(x) dx =
((x² + 1)/x²) K(dx) for x ≠ 0. It is also possible to work out a relation between a and b₁,
which depends upon the type of linear compensation in the canonical representation.
The above manipulation shows that the coefficients of the linear and quadratic terms
of the Lévy–Khintchine formula (4.3) are primarily due to the non-integrable part of

g(x) = lim_{τ↓0} p_{Xτ}(x)/τ = ((x² + 1)/x²) k(x),

where k(x) dx = K(dx).
By convention, the classical Lévy density v is assumed to be zero at the origin, so that
it differs from g by a point distribution that is concentrated at the origin.
4.3 Finite-dimensional innovation model
To set the stage for the infinite-dimensional extension to come in Section 4.4, it is
instructive to investigate the structure of a purely discrete innovation model whose input is
the random vector u = (U₁, . . . , U_N) of i.i.d. infinitely divisible random variables. The
generic Nth-order pdf of the discrete innovation variable u is

p_{(U1:UN)}(u₁, . . . , u_N) = ∏_{n=1}^{N} p_id(u_n),    (4.10)

where p_id(x) = F^{−1}{e^{f(ω)}}(x) and f is the Lévy exponent of the underlying id
distribution. Since p_{(U1:UN)}(u₁, . . . , u_N) is separable due to the independence assumption, we
can write its characteristic function as the product of individual id factors:

p̂_{(U1:UN)}(ω) = E_{(U1:UN)}{e^{j⟨ω,u⟩}} = ∏_{n=1}^{N} e^{f(ω_n)} = exp( ∑_{n=1}^{N} f(ω_n) ).    (4.11)
Based on this equation, we can determine any marginal distribution by setting the
appropriate frequency variables to zero. For instance, we find that the first-order pdf
of X_n, the nth component of x, is given by

p_{Xn}(x) = F^{−1}{ e^{f_n(ω)} }(x),

where

f_n(ω) = ∑_{m=1}^{N} f(a_{nm} ω)

with weighting coefficients a_{nm} = [A]_{nm} = [L^{−1}]_{nm}. The key observation here is that
f_n is an admissible Lévy exponent, which implies that the underlying distribution is
infinitely divisible (by Theorem 4.1), with the same being true for all the marginals
and, by extension, the distribution of any linear measurement(s) of x. While this provides
a general mechanism for characterizing the probability law(s) of the discrete
signal x within the classical framework of multivariate statistics, it is a priori difficult
to perform the required computations (matrix inverse and inverse Fourier transforms)
analytically, except in the Gaussian case, where the exponent is quadratic. Indeed, in
this latter situation, (4.12) simplifies to p̂_{(X1:XN)}(ω) = e^{−½‖Aᵀω‖²}, which is the Fourier
transform of a multivariate Gaussian distribution with zero mean and covariance matrix
E{xxᵀ} = AAᵀ.
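A minimal sketch (ours, assuming NumPy) of this Gaussian special case: with i.i.d. standard Gaussian innovations u and x = Au, the empirical covariance of x converges to AAᵀ; the 4×4 random matrix A simply stands in for the inverse system matrix L⁻¹:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4
A = rng.standard_normal((N, N))          # stands in for A = L^{-1} (illustrative)

u = rng.standard_normal((N, 200_000))    # i.i.d. Gaussian innovation vectors
x = A @ u                                # discrete innovation model x = A u

emp_cov = (x @ x.T) / u.shape[1]         # empirical E{x x^T}
print(np.max(np.abs(emp_cov - A @ A.T))) # small deviation from A A^T
```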
As we shall see in the next two sections, these results are transposable to the infinite-dimensional
setting (see Table 3.4). While this may look like an unnecessary complication
at first sight, the payoff is a theory that lends itself better to an analytical treatment
using the powerful tools of harmonic analysis. The essence of the generalization is to
replace the frequency variable ω by a generic test function ϕ ∈ S(R^d), the sums in
(4.11) and (4.12) by Lebesgue integrals, and the matrix inverses by Green's functions,
which can often be specified explicitly. To make an analogy, it is conceptually and
practically easier to formulate a comprehensive (deterministic) theory of linear systems by
using Fourier analysis and convolution operators than by relying on linear algebra, with
the same applying here. At the end of the exercise, it is still possible to come back to
the discrete domain.
4.4 White Lévy noises or innovations
Having gained a solid understanding of Lévy exponents, we can now move to the
specification of a corresponding family of continuous-domain white-noise processes to drive
the innovation model in Figure 2.1. To that end, we rely on Gelfand's theory of
generalized stochastic processes [GV64], which was briefly summarized in Section 3.5.
This powerful formalism allows for the complete and remarkably concise description
of a generalized stochastic process by its characteristic functional. While the latter is
not widely used in the standard formulation of stochastic processes, it lends itself quite
naturally to the specification of generalized white-noise processes in terms of Lévy
exponents, in direct analogy with what we have done before for id distributions.
DEFINITION 4.5 (White Lévy noise or innovation) A generalized stochastic process
w in D′(R^d) is called a white Lévy noise (or innovation) if its characteristic functional is
given by

P̂_w(ϕ) = E{e^{j⟨ϕ,w⟩}} = exp( ∫_{R^d} f(ϕ(r)) dr ),    (4.13)

where f is a valid Lévy exponent and ϕ is a generic test function in D(R^d) (the space of
infinitely differentiable functions of compact support).
Equation (4.13) is very similar to (4.2) and its multivariate extension (4.11). The key
differences are that the frequency variable is now replaced by the generic test function
ϕ ∈ D(R^d) (which is a more general infinite-dimensional entity) and that the sum
inside the exponential in (4.11) is replaced by an integral over the domain of ϕ. The
fundamental point is that P̂_w(ϕ) is a continuous, positive definite functional on D(R^d)
with the key property that P̂_w(ϕ₁ + ϕ₂) = P̂_w(ϕ₁)P̂_w(ϕ₂) whenever ϕ₁ and ϕ₂ have
non-overlapping support (i.e., ϕ₁(r)ϕ₂(r) = 0). The first part of the statement ensures
that these generalized processes are well defined (by the Minlos–Bochner theorem),
while the separability property implies that they take independent values at all points,
which partially justifies the “white noise” nomenclature. Remarkably, Gelfand and
Vilenkin have shown that there is also a converse implication [GV64, Theorem 6, p. 283]:
Equation (4.13) specifies a continuous, positive definite functional on D(R^d) (and hence
defines an admissible white-noise process in D′(R^d)) if and only if f is a Lévy exponent.
This ensures that the Lévy family constitutes the broadest possible class of acceptable
white-noise inputs for our innovation model.
by Gelfand and Vilenkin. This requires a minimal restriction on the class of admissible
Lévy densities, in reference to Definition 4.3, to compensate for the lack of compact
support of the functions in S(R^d).
Proof By the Minlos–Bochner theorem (Theorem 3.9), it suffices to show that P̂_w(ϕ)
is a continuous, positive definite functional over S(R^d) with P̂_w(0) = 1, where the
latter follows trivially from f(0) = 0. The positive definiteness is a direct consequence
of the exponential nature of the characteristic functional and the conditional positive
definiteness of the Lévy exponent (see Section 4.4.3 and the paragraph following
Equation (4.20)).
The only delicate part is to prove continuity in the topology of S(R^d). To that end,
we consider a sequence of functions ϕ_n that converge to ϕ in S(R^d). First, we recall that
S(R^d) ⊂ Lp(R^d) for all 0 < p ≤ +∞. Moreover, the convergence in S(R^d) implies
the convergence in all the Lp spaces. Indeed, if we select k > 0 such that kp > 1 and
ε₀ > 0, we have, for n sufficiently large,

|ϕ_n(r) − ϕ(r)| ≤ ε₀ / (|r| + 1)^k.

Since ‖ϕ_n − ϕ‖_{Lp} → 0 for all p > 0 as ϕ_n converges to ϕ in S(R^d), we conclude that
lim_{n→∞} |G(ϕ_n) − G(ϕ)| = 0, which proves the continuity of P̂_w(ϕ).
Note that (4.14), which will be referred to as Lévy–Schwartz admissibility, is a very
slight restriction on the classical condition (ε = 0) for id laws (see (4.1) in Definition 4.3).
The fact that ε can be chosen arbitrarily small reflects the property that the
functions in S have a faster-than-algebraic decay. Another equivalent formulation of
Lévy–Schwartz admissibility is

E{ |⟨ϕ, w⟩|^ε } < ∞ for some ε > 0,    (4.15)

which follows from (9.10) and Proposition 9.10 in Chapter 9. Along the same lines, it
can be shown that the finiteness of the εth-order moment in (4.15) for any non-trivial
ϕ₀ implies that the same holds true for all ϕ ∈ S(R^d).
From now on, we implicitly assume that the Lévy–Schwartz admissibility condition
is met.
To exemplify the procedure, we select a quadratic exponent, which is trivially admissible
(since v(a) = 0). This results in

P̂_{w_Gauss}(ϕ) = exp( −(b²/2) ‖ϕ‖²_{L₂} ),

which is the functional that completely characterizes the white Gaussian noise of the
classical theory of stationary processes.
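For the Gaussian case, (4.13) is easy to verify by Monte Carlo on a discretized 1-D domain (a sketch of ours, assuming NumPy; b = 1 and a Gaussian bump for ϕ are illustrative). With grid step h, the noise integrated over each cell is N(0, b²h), so ⟨ϕ, w⟩ becomes a finite sum:

```python
import numpy as np

rng = np.random.default_rng(3)
b, h = 1.0, 0.01
r = np.arange(-5.0, 5.0, h)
phi = np.exp(-r**2)                            # test function (illustrative)

# Monte Carlo samples of <phi, w>: cell integrals of the noise are N(0, b^2 h).
W = rng.normal(scale=b * np.sqrt(h), size=(50_000, r.size))
X = W @ phi

lhs = np.mean(np.exp(1j * X))                  # E{exp(j<phi, w>)}
rhs = np.exp(-0.5 * b**2 * np.sum(phi**2) * h) # exp(-(b^2/2)||phi||_{L2}^2)
print(lhs, rhs)                                # real parts agree; imaginary ~ 0
```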
where r_k are random point locations in R^d and where the A_k are i.i.d. random variables
with pdf p_A. The random events are indexed by k (using some arbitrary ordering); they
are mutually independent and follow a spatial Poisson distribution. Specifically, let Ω
be any finite-measure subset of R^d; then the probability of observing N(Ω) = n events
in Ω is

Prob( N(Ω) = n ) = e^{−λVol(Ω)} (λVol(Ω))^n / n!,

where Vol(Ω) is the measure (or spatial volume) of Ω. This is to say that the Poisson
parameter λ represents the average number of random impulses per unit hyper-volume.
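Sampling such an impulsive Poisson noise on a bounded window is straightforward; the sketch below (ours, assuming NumPy; the square window, the rate λ, and the Laplace amplitude law are illustrative) draws N(Ω) ~ Poisson(λ Vol(Ω)), then uniform locations and i.i.d. amplitudes:

```python
import numpy as np

rng = np.random.default_rng(4)
lam, side = 10.0, 4.0                     # rate per unit area; square window
vol = side**2                             # Vol(Omega)

n_events = rng.poisson(lam * vol)                        # N(Omega)
locations = rng.uniform(0.0, side, size=(n_events, 2))   # uniform r_k in Omega
amplitudes = rng.laplace(size=n_events)                  # i.i.d. A_k ~ p_A

# The pairs (r_k, A_k) specify the realization w = sum_k A_k delta(. - r_k).
print(n_events, lam * vol)                # count fluctuates around lam*Vol
```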
The link with the formal specification of Lévy noise in Definition 4.5 is as follows.

THEOREM 4.9 The characteristic functional of the impulsive Poisson noise specified
by (4.16) is

P̂_{w_Poisson}(ϕ) = E{e^{j⟨ϕ,w⟩}} = exp( ∫_{R^d} f_Poisson(ϕ(r)) dr )    (4.17)

with

f_Poisson(ω) = λ ∫_R (e^{jaω} − 1) p_A(a) da.    (4.18)
By the order-statistics property of Poisson processes, the r_n are independent and all
equivalent in distribution to a random variable r that is uniform on Ω_ϕ.
Using the law of total expectation, we expand the characteristic functional of w,
P̂_w(ϕ) = E{e^{j⟨ϕ,w⟩}}, as

P̂_w(ϕ) = E{ E{ e^{j⟨ϕ,w⟩} | N_{w,ϕ} } }
= E{ E{ ∏_{n=1}^{N_{w,ϕ}} e^{jA_n ϕ(r_n)} | N_{w,ϕ} } }
= E{ ∏_{n=1}^{N_{w,ϕ}} E{ e^{jAϕ(r)} } }    (by independence)
= E{ ∏_{n=1}^{N_{w,ϕ}} E{ E{ e^{jAϕ(r)} | A } } }    (total expectation)
= E{ ∏_{n=1}^{N_{w,ϕ}} ( ∫_{Ω_ϕ} e^{jAϕ(r)} dr ) / Vol(Ω_ϕ) }    (as r is uniform in Ω_ϕ)
= E{ ∏_{n=1}^{N_{w,ϕ}} ( ∫_R ∫_{Ω_ϕ} e^{jaϕ(r)} dr p_A(a) da ) / Vol(Ω_ϕ) }.    (4.19)
The last expression has the inner expectation expanded in terms of the pdf p_A of the
random variable A. Defining the auxiliary functional

M(ϕ) = ∫_{Ω_ϕ} ∫_R e^{jaϕ(r)} p_A(a) da dr,

we rewrite (4.19) as

E{ ∏_{n=1}^{N_{w,ϕ}} M(ϕ)/Vol(Ω_ϕ) } = E{ ( M(ϕ)/Vol(Ω_ϕ) )^{N_{w,ϕ}} }.
Next, we use the fact that N_{w,ϕ} is a Poisson random variable to compute the above
expectation directly:

E{ ( M(ϕ)/Vol(Ω_ϕ) )^{N_{w,ϕ}} } = ∑_{n≥0} ( M(ϕ)/Vol(Ω_ϕ) )^n e^{−λVol(Ω_ϕ)} (λVol(Ω_ϕ))^n / n!
= e^{−λVol(Ω_ϕ)} ∑_{n≥0} (λM(ϕ))^n / n!
= e^{−λVol(Ω_ϕ)} e^{λM(ϕ)}    (Taylor)
= exp( λ( M(ϕ) − Vol(Ω_ϕ) ) ).
We now replace M(ϕ) by its integral equivalent, noting also that Vol(Ω_ϕ) =
∫_{Ω_ϕ} ∫_R 1 · p_A(a) da dr, whereupon we obtain the expression

P̂_w(ϕ) = exp( λ ∫_{Ω_ϕ} ∫_R (e^{jaϕ(r)} − 1) p_A(a) da dr ).
The interest of this result is twofold. First, it gives a concrete meaning to the
compound-Poisson scenario in Figure 4.1c, allowing for a description in terms of
conventional point processes. In the same vein, we can propose a physical analogy for
the elementary Poisson term f₃(ω) = λ(e^{ja₀ω} − 1) in Section 4.2.2 with p_A(a) = δ(a − a₀):
the counting of photons impinging on the detectors of a CCD camera with the photon
density being constant over R^d and the integration time proportional to λ. The
corresponding process is usually termed “photon noise” in optical imaging. Second, the
explicit noise model (4.16) suggests a practical mechanism for generating generalized
Poisson processes as a weighted sum of shifted Green's functions of L, each Dirac
impulse being replaced by the response of the inverse operator in the innovation model
in Figure 2.1.
Note that the above description of generalized compound-Poisson processes is compatible
with the usual definition of finite-rate-of-innovation signals. Yet, this is far from
the whole story since the impulsive Poisson noise is the only member of the Lévy
family whose “innovation rate,” as measured by λ, is finite.
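The generation mechanism mentioned above is easy to emulate for L = D, whose causal Green's function is the unit step: the sketch below (ours, assuming NumPy; the rate, horizon, and Gaussian jump law are illustrative) replaces each Dirac impulse by a step and produces a piecewise-constant compound-Poisson process:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, T = 2.0, 10.0                               # jump rate and time horizon

n_jumps = rng.poisson(lam * T)
t_jumps = np.sort(rng.uniform(0.0, T, n_jumps))  # Poisson impulse locations
a_jumps = rng.standard_normal(n_jumps)           # i.i.d. jump amplitudes

# s(t) = sum_k a_k u(t - t_k): each impulse is replaced by the Green's
# function of L = D (the unit step), giving a piecewise-constant signal.
t = np.linspace(0.0, T, 2000)
s = (a_jumps[None, :] * (t[:, None] >= t_jumps[None, :])).sum(axis=1)
```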
ξ₁, . . . , ξ_N ∈ C, and N ∈ Z₊.

where P̂_w^{1/N}(ϕ) = exp( ∫_{R^d} f(ϕ(r))/N dr ) is the characteristic functional of a Lévy noise.
The justification is that f(ω)/N is a valid Lévy exponent for any N ≥ 1. In the impulsive
Poisson case, this simply translates into the Poisson density parameter λ being divided
by N.
Interestingly, there is also a converse to the statement in Proposition 4.14 [PR70]:
a generalized process s in S′(R^d) is infinitely divisible if and only if its characteristic
functional can be written as P̂_s(ϕ) = exp( F(ϕ) ), where F(ϕ) is a continuous,
conditionally positive definite functional over S(R^d) (or generalized Lévy exponent)
as specified in Definition 4.6 [PR70, main theorem]. While this general characterization
is nice conceptually, it is hardly discriminative since the underlying notion of infinite
divisibility applies to all concrete families of generalized stochastic processes that are
known to us. In particular, it does not require the “whiteness” property that is
fundamental for defining proper innovations.
We recall that the term “white” is used in reference to white light, whose electromag-
netic spectrum is distributed over the visible band in a way that stimulates all color
receptors of the eye equally. This is in opposition to “colored” noise, whose spectral
content is not equally distributed.
5 In the statistical literature, a second-order process usually designates a stochastic process whose second-
order moments are all well defined. In the case of generalized processes, the property refers to the existence
of the correlation functional.
Proof We have that B_w(ϕ₁, ϕ₂) = E{X₁X₂}, where X₁ = ⟨ϕ₁, w⟩ and X₂ = ⟨ϕ₂, w⟩ are
real-valued random variables with joint characteristic function p̂_{X1,X2}(ω) =
exp( F(ω₁ϕ₁ + ω₂ϕ₂) ) with ω = (ω₁, ω₂). We then invoke the moment-generating
property of the Fourier transform, which translates into

B_w(ϕ₁, ϕ₂) = E{X₁X₂} = (−j)² ∂²p̂_{X1,X2}(ω)/∂ω₁∂ω₂ |_{ω₁=0, ω₂=0},

where f_{X1,X2}(ω) = F(ω₁ϕ₁ + ω₂ϕ₂) is the cumulant generating function of p̂_{X1,X2}.
The required first derivative with respect to ω_i, i = 1, 2, is given by

∂f_{X1,X2}(ω)/∂ω_i = ∫_{R^d} f′( ω₁ϕ₁(r) + ω₂ϕ₂(r) ) ϕ_i(r) dr.

Similarly, we get
By combining those results and using the property that f_{X1,X2}(0) = 0, we conclude that

E{X₁X₂} = −f″(0)⟨ϕ₁, ϕ₂⟩ − ( f′(0) )² ⟨ϕ₁, 1⟩⟨ϕ₂, 1⟩,

which is equivalent to (4.21) under the hypothesis that f′(0) = 0. It is also clear from
(4.22) that this latter condition is equivalent to the zero-mean property of the noise;
that is, E{⟨ϕ, w⟩} = 0 for all ϕ ∈ S(R^d). Finally, we note that (4.21) is compatible
with the more general cumulant formula (9.22) if we set n = (1, 1), |n| = 2, and
κ₂ = (−j)²f″(0).
The result is interesting because it gives us some insight into the nature of continuous-domain
noise. The limit process involves random Dirac impulses that get denser as
n increases. When the variance of the noise σ_w² = −f″(0) is finite, the increase of the
average number of impulses per unit volume λ_n = O(n) is compensated by a decrease of
the variance of the amplitude distribution in inverse proportion: Var{A_n} = σ_w²/n. While
any of the generalized noise processes in the sequence is as rough as a Dirac impulse,
this picture suggests that the degree of singularity of the limit object in the non-Poisson
scenario can potentially be reduced due to the accumulation of impulses and the fact that
the variance of their amplitude distribution converges to zero. The particular example
that we have in mind is Gaussian white noise, which can be obtained as a limit of
compound-Poisson processes with contracting Gaussian amplitude distributions.
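This limit can be visualized numerically. The sketch below (ours, assuming NumPy/SciPy) samples unit-interval increments of compound-Poisson noises with rate λ_n = n and amplitude variance σ_w²/n and measures their distance to the Gaussian limit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sigma2 = 1.0
for n in (1, 10, 100, 1000):
    counts = rng.poisson(n, size=20_000)               # lam_n = n impulses
    scale = np.sqrt(sigma2 / n)                        # Var{A_n} = sigma2 / n
    x = np.array([rng.normal(scale=scale, size=c).sum() for c in counts])
    # Kolmogorov-Smirnov distance to N(0, sigma2) shrinks as n grows.
    print(n, stats.kstest(x, "norm", args=(0.0, np.sqrt(sigma2))).statistic)
```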
As already mentioned, the class of generalized stochastic processes that are of interest
to us are those defined through the generic innovation model Ls = w (linear stochastic
differential equation) where the differential operator L is shift-invariant and where the
driving term w is a continuous-domain white Lévy noise. Having made sense of the
latter, we can now proceed with the specification of the class of admissible whitening
operators L. The key requirement here is that the model be invertible (in the sense
of generalized functions) which, by duality, translates into some continuity constraint
on the adjoint operator L−1∗ . For the time being, we shall limit ourselves to making
some general statements about L and its inverse that ensure existence, while deferring
to Chapter 5 for concrete examples of admissible operators.
where L−1 is an appropriate right inverse of L. The above manipulation obviously only
makes sense if the action of the adjoint operator L−1∗ is well defined over Schwartz’
class S (Rd ) of test functions – ideally, a continuous mapping from S (Rd ) into itself
or, possibly, Lp (Rd ) (or some variant) if one imposes suitable restrictions on the Lévy
exponent f to maintain continuity.
We like to refer to (4.23) as the analysis statement of the model, and to (4.24) – or
its shorthand, s = L−1 w – as the synthesis description. Of course, this will only work
properly if we have an exact equivalence, meaning that there is a proper and unique
definition of L−1 . The latter will need to be made explicit on a case-by-case basis with
the possible help of boundary conditions.
= ∫_{S′(R^d)} e^{j⟨ϕ,s⟩} P_s(ds),

where the latter expression involves an abstract infinite-dimensional integral over the
space of tempered distributions and provides the connection with the defining measure
P_s on S′(R^d). P̂_s is a functional S(R^d) → C that associates the complex number
P̂_s(ϕ) with each test function ϕ ∈ S(R^d) and which is endowed with three fundamental
properties: positive definiteness, continuity, and normalization (i.e., P̂_s(0) = 1). It
can also be specified using the more concrete formula

P̂_s(ϕ) = ∫_R e^{jy} dP_{⟨ϕ,s⟩}(y),    (4.25)

which involves a classical Stieltjes integral with respect to the probability law
P_{Y=⟨ϕ,s⟩}(y) = Prob(Y < y), where Y = ⟨ϕ, s⟩ is a conventional scalar random variable,
once ϕ is fixed.
For completeness, we also recall the meaning of the underlying terminology in the
context of a generic (normed or nuclear) space X of test functions.
The truly powerful aspect of the theorem is that it suffices to check that P̂_s satisfies the
three defining conditions – positive definiteness, continuity over S(R^d), and normalization
– to prove that it is a valid characteristic functional, which then automatically
ensures the existence of the process since the corresponding measure over S′(R^d) is
well defined.
Then, the generalized stochastic process s that is characterized by E{e^{j⟨ϕ,s⟩}} = P̂_s(ϕ) =
exp( ∫_{R^d} f( L^{−1∗}ϕ(r) ) dr ) is well defined in S′(R^d) and satisfies the innovation model
Ls = w, where w is a Lévy innovation with exponent f.
When f is p-admissible (see Definition 4.4) with p ≥ 1, the second condition can be
replaced by the weaker requirement that U is a continuous linear map from S(R^d) into
Lp(R^d).
Proof First, we prove that s is a bona fide generalized stochastic process in S′(R^d)
by showing that P̂_s(ϕ) is a continuous, positive definite functional on S(R^d) such that
P̂_s(0) = 1 (by the Minlos–Bochner theorem).
The Lévy noise functional P̂_w(ϕ) = exp( ∫_{R^d} f(ϕ(r)) dr ) is continuous over S(R^d)
by construction (see Theorem 4.8). This, together with the assumption that L^{−1∗} is
a continuous operator on S(R^d), implies that the composed functional P̂_s(ϕ) =
P̂_w(L^{−1∗}ϕ) is continuous on S(R^d). The reasoning is also applicable when L^{−1∗} is a
continuous operator S(R^d) → Lp(R^d) and P̂_w(ϕ) is continuous over Lp(R^d) – see the
triangular diagram in Figure 3.1 with X = S(R^d) and Y = Lp(R^d). This latter scenario
is covered by Theorem 8.2, which establishes the positive definiteness and continuity
of P̂_w over Lp(R^d) when f is p-admissible (see Section 8.2). The case where L^{−1∗} is
a continuous operator S(R^d) → R(R^d) is handled in the same fashion by invoking
Proposition 8.1.
Next, for any given set of functions ϕ₁, . . . , ϕ_N ∈ S(R^d) and coefficients ξ₁, . . . ,
ξ_N ∈ C, we have

∑_{m=1}^{N} ∑_{n=1}^{N} P̂_s(ϕ_m − ϕ_n) ξ_m ξ̄_n
= ∑_{m=1}^{N} ∑_{n=1}^{N} P̂_w( L^{−1∗}(ϕ_m − ϕ_n) ) ξ_m ξ̄_n
= ∑_{m=1}^{N} ∑_{n=1}^{N} P̂_w( L^{−1∗}ϕ_m − L^{−1∗}ϕ_n ) ξ_m ξ̄_n    (by linearity)
≥ 0    (by the positivity of P̂_w),
4.6 Bibliographical notes
distributions in Table 4.1. Further distributional properties relating to decay and the
effect of repeated convolution are presented in Chapter 9.
Section 4.4
The specification of white Lévy noise by means of its characteristic functional (see
Definition 4.5) is based on a series of theorems by Gelfand and Vilenkin [GV64].
Interestingly, the generic form (4.13) is not only sufficient for defining a (stationary)
innovation, as proven by these authors, but also necessary if one adds the observability
constraint that X_id = ⟨rect, w⟩ is a well-defined random variable [AU14]. The restriction
of the family to the space of tempered distributions was investigated by Fageot
et al. [Fag14]. Theorem 4.9 is adapted from [UT11].
The abstract characterization of infinite divisibility and the full generalization of the
Lévy–Khintchine formula for measures over topological vector spaces are covered in the
works of Fernique and Prakasa Rao [Fer67, PR70].
Section 4.5
The innovation or filtered-white-noise model has a long tradition in communication and
statistical signal processing in relation to time-series analysis [BS50, WM57, Kai70].
The classical assumption is that the excitation noise (innovation) is Gaussian and that
the shaping filter is causal with a causal inverse. The innovation, as defined by Wiener
and Masani, then corresponds to the unpredictable part of the signal; that is, the
difference between the value of the signal at a given time t and the optimal linear forecast
of that value based on the information available prior to t. Thanks to the Gaussian hypothesis,
one can then formulate a coherent correlation theory of such processes, using standard
Fourier and Hilbert-space techniques, in which continuous-domain white noise only
intervenes as a formal entity; that is, a Gaussian process whose power spectrum is a
constant. This purely spectral description of white noise is consistent with the Wiener–
Khintchine theorems, 6 which explains its popularity among engineers [Pap91, Yag86].
The non-Gaussian extension of the innovation model that is presented in this chapter
is conceptually similar, but relies on the more elaborate definition of continuous-domain
white noise and the functional tools that were developed by Gelfand to formulate his
theory of generalized stochastic processes [Gel55, GV64]. A slightly more restrictive
version of the model with Gaussian and/or impulsive Poisson excitation was presented
in [UT11]. The original statement of Theorem 4.17 for d = 1 can be found in [UTS14].
While the level of generality of this result is sufficient for our purpose, we must warn
the reader that the framework cannot directly handle non-linear transformations because
the underlying objects are generalized functions, which are intrinsically linear. For
completeness, we mention the existence of an extended theory of white noise, due to Hida,
which is aimed at overcoming this limitation [HKPS93,HS04,HS08]. This theory gives
a meaning to certain classes of non-linear white-noise functionals – in analogy with Itô’s
calculus – but it is mathematically quite involved and beyond the scope of this book.
6 The Wiener–Khintchine theorem states that the autocorrelation function of a second-order stationary
process is the inverse Fourier transform of its power spectrum.
5 Operators and their inverses
In this chapter we review three classes of linear shift-invariant (LSI) operators:
convolution operators with stable LSI inverses, operators that are linked with ordinary
differential equations, and fractional operators.
The first class, considered in Section 5.2, is composed of the broad family of
multidimensional operators whose inverses are stable convolution operators – or filters.
Convolution operators play a central role in signal processing. They are easy to characterize
mathematically via their impulse response. The corresponding generative model
for stochastic processes amounts to LSI filtering of a white noise, which automatically
yields stationary processes.
Our second class is the 1-D family of ordinary differential operators with constant
coefficients, which is relevant to a wide range of modeling applications. In the “stable”
scenario, reviewed in Section 5.3, these operators admit stable LSI inverses on S and
are therefore included in the previous category. On the other hand, when the differential
operators have one or more zeros on the imaginary axis (the marginally stable/unstable
case), they have a non-trivial null space in S′, which consists of (exponential)
polynomials. This implies that they are no longer unconditionally invertible on S′, and
that we can at best identify left- or right-side inverses, which should additionally fulfill
appropriate “boundedness” requirements in order to be usable in the definition of
stochastic processes. However, as we shall see in Section 5.4, obtaining an inverse with
the required boundedness properties is feasible but requires giving up shift-invariance.
As a consequence, stochastic processes defined by these operators are generally
non-stationary.
The third class of LSI operators, investigated in Section 5.5, consists of fractional
derivatives and/or Laplacians in one and several dimensions. Our focus is on the family
of linear operators on S that are simultaneously homogeneous (scale-invariant up to
a scalar coefficient) and invariant under shifts and rotations. These operators are
intimately linked to self-similar processes and fractals [BU07, TVDVU09]. Once again,
finding a stable inverse operator to be used in the definition of self-similar processes
poses a mathematical challenge since the underlying system is inherently unstable. The
difficulty is evidenced by the fact that statistical self-similarity is generally not compatible
with stationarity, which means that a non-shift-invariant inverse operator needs to
be constructed. Here again, a solution may be found by extending the approach used for
the previous class of operators. From our first example in Section 5.1, we shall actually
see that one is able to reconcile the classical theory of stationary processes with that of
self-similar ones by viewing the latter as a limit case of the former.
Before we begin our discussion of operators, let us formalize some notions of
invariance.
where R is any orthogonal matrix in R^{d×d} (by using orthogonal matrices in the definition,
we take into account both proper and improper rotations, with respective determinants
1 and −1).
(D − αId)−1 ϕ = ρα ∗ ϕ.
Figure 5.1 Comparison of antiderivative operators. (a) Input signal. (b) Result of the
shift-invariant integrator and its adjoint. (c) Result of the scale-invariant integrator I₀ and its
Lp-stable adjoint I₀∗; the former yields a signal that vanishes at the origin, while the latter
enforces the decay of the output as t → −∞ at the cost of a jump discontinuity at the origin.
Likewise, it is easy to see that the corresponding adjoint inverse operator L^{−1∗} is
specified by

(D − αId)^{−1∗}ϕ = ρ_α∨ ∗ ϕ,

where ρ_α∨(r) = ρ_α(−r) is the time-reversed version of ρ_α. Thanks to its rapid decay, ρ_α∨
defines a continuous linear translation-invariant map from S(R) into itself.
This allows us to express the solution of the first-order SDE as a filtered version of
the input noise, s_α = ρ_α ∗ w. It follows that s_α is a stationary process that is completely
specified by its characteristic form E{e^{j⟨ϕ,s_α⟩}} = exp( ∫_R f( (ρ_α∨ ∗ ϕ)(r) ) dr ), where f is
the Lévy exponent of the innovation w (see Section 4.5).
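On a fine grid, the convolution s_α = ρ_α ∗ w turns into a first-order recursion, which gives a simple simulator for this stationary process (a sketch of ours, assuming NumPy; the pole α = −1, the step h, and the Gaussian innovation are illustrative – a Poisson or stable innovation can be substituted to obtain sparse versions of the same process):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, h, n = -1.0, 0.01, 100_000        # stable pole Re(alpha) < 0, grid step

w = rng.normal(scale=np.sqrt(h), size=n) # discretized white Gaussian innovation

s = np.zeros(n)
decay = np.exp(alpha * h)                # discrete-time pole e^{alpha h}
for k in range(1, n):
    s[k] = decay * s[k - 1] + w[k]       # recursive form of (rho_alpha * w)
```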
Let us now focus our attention on the limit case α = 0, which yields an operator L = D
that is scale-invariant. Here too, it is possible to specify the LSI inverse (integrator)

D^{−1}ϕ(r) = ∫_{−∞}^{r} ϕ(τ) dτ = (1₊ ∗ ϕ)(r),

whose output is well defined pointwise when ϕ ∈ S(R). The less favorable aspect is
that the classical LSI integrator does not fulfill the usual stability requirement due to the
non-integrability of its impulse response 1₊ ∉ L₁(R). This implies that D^{−1∗}ϕ = 1₊∨ ∗ ϕ
is generally not in Lp(R). Thus, we are no longer fulfilling the admissibility condition in
Theorem 4.17. The source of the problem is the lack of decay of D^{−1∗}ϕ(r) as r → −∞
when ∫_R ϕ(τ) dτ = ϕ̂(0) ≠ 0 (see Figure 5.1b). Fortunately, this can be compensated
by defining the modified antiderivative operator

I₀∗ϕ(r) = ∫_r^∞ ϕ(τ) dτ − ϕ̂(0) 1₊(−r) = (1₊∨ ∗ ϕ)(r) − ( ∫_R ϕ(τ) dτ ) 1₊∨(r)
        = D^{−1∗}ϕ(r) − (D^{−1∗}ϕ)(−∞) 1₊∨(r),    (5.3)

which happens to be the only left inverse of D∗ = −D that is both scale-invariant and
Lp-stable for any p > 0. The adjoint of I₀∗ specifies the adjusted¹ integrator

I₀ϕ(r) = ∫_0^r ϕ(τ) dτ = D^{−1}ϕ(r) − (D^{−1}ϕ)(0).    (5.4)
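The difference between D^{−1∗} and the corrected operator I₀∗ of (5.3) can be reproduced numerically (a sketch of ours, assuming NumPy; the shifted Gaussian test function is illustrative): the plain anticausal integral tends to ϕ̂(0) ≠ 0 as r → −∞, whereas the corrected output decays, at the cost of a jump at the origin, as in Figure 5.1:

```python
import numpy as np

h = 0.01
r = np.arange(-10.0, 10.0, h)
phi = np.exp(-(r - 1.0)**2)              # test function with phi_hat(0) != 0

# Anticausal adjoint integrator: (D^{-1*} phi)(r) = int_r^inf phi(tau) dtau
d_inv_adj = phi[::-1].cumsum()[::-1] * h

# Corrected operator (5.3): subtract the limit value on the half-line r <= 0
i0_adj = d_inv_adj - d_inv_adj[0] * (r <= 0)

print(d_inv_adj[0], i0_adj[0])           # tails at r -> -inf: ~sqrt(pi) vs ~0
```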
5.2 Shift-invariant inverse operators
The association of LSI operators with convolution integrals will be familiar to most
readers. In effect, we saw in Section 3.3.5 that, as a consequence of Schwartz' kernel
theorem, every continuous LSI operator L : S(R^d) → S(R^d) corresponds to a
convolution

L : ϕ → L{δ} ∗ ϕ

or, equivalently, in the Fourier domain,

L : ϕ → F^{−1}{ L̂(ω) ϕ̂(ω) }.

1 Any two valid right inverses can only differ by a component (constant) that is in the null space of the
operator. The scale-invariant solution is the one that forces the output to vanish at the origin.
We call L̂ the Fourier multiplier or symbol associated with the operator L.
From the Fourier-domain characterization of L, we see that, if L̂ is smooth and does
not grow too fast, then L maps S(R^d) back into S(R^d). This is in particular true if L{δ}
(the impulse response) is an ordinary locally integrable function with rapid decay. It is
also true if L is a 1-D linear differential operator with constant coefficients, in which
case L{δ} is a finite sum of derivatives of the Dirac distribution and the corresponding
Fourier multiplier L̂(ω) is a polynomial in jω.
For operators with smooth Fourier multipliers that are nowhere zero in R^d and not
decaying (or decaying slowly) at ∞, we can define the inverse L^{−1} : S(R^d) → S(R^d)
of L by

L^{−1} : ϕ → F^{−1}{ ϕ̂(ω) / L̂(ω) }.

This inverse operator is also linear and shift-invariant, and has the convolution kernel

ρ_L = F^{−1}{ 1 / L̂(ω) },    (5.5)

which is in effect the Green's function of the operator L. Thus, we may write

L^{−1} : ϕ → ρ_L ∗ ϕ.
For the cases in which L̂(ω) vanishes at some points, its inverse 1/L̂(ω) is not in
general a locally integrable function, but even in the singular case we may still be able
to regularize the singularities at the zeros of L̂(ω) and obtain a singular “generalized
function” whose inverse Fourier transform, per (5.5), once again yields a convolution
kernel ρ_L that is a Green's function of L. The difference is that in this case, for an
arbitrary ϕ ∈ S(R^d), ρ_L ∗ ϕ may no longer belong to S(R^d).
As in our introductory example, the simplest scenario occurs when the inverse operator
L^{−1} is shift-invariant with an impulse response ρ_L that has sufficient decay for the
system to be BIBO-stable (bounded input, bounded output).

PROPOSITION 5.1 Let L^{−1}ϕ(r) = (ρ_L ∗ ϕ)(r) = ∫_{R^d} ρ_L(r′)ϕ(r − r′) dr′ with ρ_L ∈
L₁(R^d) (or, more generally, where ρ_L is a complex-valued Borel measure of bounded
variation). Then, L^{−1} and its adjoint specified by L^{−1∗}ϕ(r) = (ρ_L∨ ∗ ϕ)(r) =
∫_{R^d} ρ_L(−r′)ϕ(r − r′) dr′ are both Lp-stable in the sense that

‖ρ_L ∗ ϕ‖_{Lp} ≤ ‖ρ_L‖_{L1} ‖ϕ‖_{Lp}

for all p ≥ 1. In particular, this ensures that L^{−1∗} continuously maps S(R^d) →
Lp(R^d).
The result follows from Theorem 3.5. For the sake of completeness, we shall establish
the bound based on the two extreme cases p = 1 and p = +∞.

Proof To obtain the L₁ bound, we manipulate the norm of the convolution integral as

‖ρ_L ∗ ϕ‖_{L1} = ∫_{R^d} | ∫_{R^d} ρ_L(r′)ϕ(r − r′) dr′ | dr
            ≤ ∫_{R^d} |ρ_L(r′)| ∫_{R^d} |ϕ(r − r′)| dr dr′ = ‖ρ_L‖_{L1} ‖ϕ‖_{L1},

by the triangle inequality and Fubini's theorem.
Note that the L1 condition in Proposition 5.1 is the standard hypothesis that is made in
the theory of linear systems to ensure the BIBO stability of an analog filter. It is slightly
stronger than the total variation (TV) condition in Theorem 3.5, which is necessary and
sufficient for both BIBO (p = ∞) and L1 stabilities.
If, in addition, ρL (r) decays faster than any polynomial (e.g., is compactly supported
or decays exponentially), then we can actually ensure S -continuity so that there is no
restriction on the class of corresponding stochastic processes.
and r ∈ R^d. Then, L^{−1} and L^{−1∗} are S-continuous in the sense that ϕ ∈ S ⇒
L^{−1}ϕ, L^{−1∗}ϕ ∈ S, with both operators being bounded in an appropriate sequence of
seminorms.
The key here is that the convolution with ρL preserves the rapid decay of the test
function ϕ. The degree of smoothness of the output is not an issue because, for non-
constant functions, the convolution operation commutes with differentiation.
The good news is that the entire class of stable 1-D differential systems with rational
transfer functions and poles in the left half of the complex plane falls into the category
of Proposition 5.2. The application of such operators provides us with a convenient
mechanism for solving ordinary differential equations, as detailed in Section 5.3.
The S-continuity property is important for our formulation. It also holds for all
shift-invariant differential operators whose impulse response is a point distribution; e.g.,
(D − Id){δ} = δ′ − δ. It is preserved under convolution, which justifies the factorization
of operators into simpler constituents.
5.3 Stable differential systems in 1-D
The generic form of a linear shift-invariant differential equation in 1-D with (deterministic
or random) output s and driving term w is

∑_{n=0}^{N} a_n D^n s = ∑_{m=0}^{M} b_m D^m w,    (5.6)

where the a_n and b_m are arbitrary complex coefficients with the normalization constraint
a_N = 1. Equation (5.6) thus covers the general 1-D case of Ls = w, where L is a
shift-invariant operator with the rational transfer function
The poles of the system, which are the roots of the characteristic polynomial p_N(ζ) =
ζ^N + a_{N−1}ζ^{N−1} + · · · + a₀ with Laplace variable ζ ∈ C, are denoted by {α_n}_{n=1}^{N}. In
the standard causal-stable scenario where Re(α_n) < 0 for n = 1, . . . , N, the solution is
obtained as

1/L̂(ω) = q_M(jω) ∏_{n=1}^{N} 1/(jω − α_n)    (5.8)
        = b_M ∏_{m=1}^{M} (jω − γ_m) / ∏_{n=1}^{N} (jω − α_n),    (5.9)

which is then broken into simple constituents, either by serial composition of first-order
factors or by decomposition into simple partial fractions. We are providing the fully
factorized form (5.9) of the transfer function to recall the property that a stable Nth-order
system is completely characterized by its poles {α_n}_{n=1}^{N} and zeros {γ_m}_{m=1}^{M}, up to
the scalar gain b_M.
P_α = D − αId

corresponds to the convolution kernel (impulse response) (δ′ − αδ) and Fourier multiplier
(jω − α). For Re(α) ≠ 0 (the stable case in signal-processing parlance), the reciprocal
Fourier multiplier, (jω − α)^{−1}, is non-singular, with the well-defined inverse Fourier
transform

ρ_α(r) = F^{−1}{ 1/(jω − α) }(r) = { e^{αr} 1_{[0,∞)}(r)   if Re(α) < 0
                                    −e^{αr} 1_{(−∞,0]}(r)  if Re(α) > 0.

Clearly, in either the causal (Re(α) < 0) or the anti-causal (Re(α) > 0) case, these
functions decay rapidly at infinity, and convolutions with them map S(R) back into
itself. Moreover, a convolution with ρ_α inverts P_α : S(R) → S(R) from both the left
and the right, for which we can write

P_α^{−1}ϕ = ρ_α ∗ ϕ    (5.10)

for Re(α) ≠ 0.
Since P_α and P_α^{−1}, Re(α) ≠ 0, are both S-continuous (continuous from S(R) into
S(R)), their action can be transferred to the space S′(R) of Schwartz distributions by
identifying the adjoint operator P_α∗ with (−P_{−α}), in keeping with the identity

s(r) = P_{α_N}^{−1} · · · P_{α_1}^{−1} q_M(D){w}(r).    (5.12)
This translates into the following formulas for the corresponding inverse operator L^{−1}:

L^{−1} = P_{α_N}^{−1} · · · P_{α_1}^{−1} q_M(D)
      = b_M P_{α_N}^{−1} · · · P_{α_1}^{−1} P_{γ_1} · · · P_{γ_M},

which are consistent with (5.8) and (5.9), respectively. These operator-based manipulations
are legitimate since all the elementary constituents in (5.11) and (5.12) are
S-continuous convolution operators. We also recall that all Lp-stable and, a fortiori,
S-continuous, convolution operators satisfy the properties of commutativity, associativity,
and distributivity, so that the ordering of the factors in (5.11) and (5.12) is immaterial.
Interestingly, this divide-and-conquer approach to the problem of inverting a
differential operator is also extendable to the unstable scenarios (with Re(α_n) = 0 for
some values of n), the main difference being that the ordering of operators becomes
important (partial loss of commutativity).
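The factorized inversion can be mimicked in discrete time by chaining first-order recursions, one per stable pole (a sketch of ours, assuming NumPy; the two poles are illustrative). Swapping the two stages leaves the output unchanged, in line with the commutativity noted above:

```python
import numpy as np

def p_alpha_inverse(x, alpha, h):
    """Apply P_alpha^{-1} (causal exponential filter) on a grid of step h."""
    y = np.zeros_like(x)
    decay = np.exp(alpha * h)            # matched discrete pole, Re(alpha) < 0
    for k in range(1, x.size):
        y[k] = decay * y[k - 1] + h * x[k]
    return y

rng = np.random.default_rng(8)
h = 0.01
w = rng.standard_normal(20_000)          # discretized driving term

# L = (D - a1 Id)(D - a2 Id): invert by cascading the two first-order stages.
s1 = p_alpha_inverse(p_alpha_inverse(w, -1.0, h), -3.0, h)
s2 = p_alpha_inverse(p_alpha_inverse(w, -3.0, h), -1.0, h)
print(np.max(np.abs(s1 - s2)))           # ordering is immaterial (~0)
```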
5.4 Unstable Nth-order differential systems
Classically, a differential system is categorized as being unstable when some of its poles
are in the right complex half-plane, including the imaginary axis. Mathematically, there
is no fundamental reason for excluding the cases Re(α_n) > 0 because one can simply
switch to an anti-causal configuration, which preserves the exponential decay of the
response, as we did in defining these inverses in Section 5.3.1.
The only tricky situation occurs for purely imaginary poles of the form α_m = jω₀ with
ω₀ ∈ R, to which we now turn our attention.
Let S_{jω₀} denote the image of S(R) under P_{jω₀}. This is the same as the subspace of
S(R) consisting of functions ϕ for which²

∫_{−∞}^{+∞} e^{−jω₀r} ϕ(r) dr = 0.

In particular, for jω₀ = 0, we obtain S₀, the space of Schwartz test functions with
vanishing zeroth-order moment. We may then view P_{jω₀} as an operator S(R) → S_{jω₀},
and this operator now has an inverse P_{jω₀}^{−1} : S_{jω₀} → S(R) defined by

P_{jω₀}^{−1}ϕ(r) = (ρ_{jω₀} ∗ ϕ)(r),    (5.13)

where

ρ_{jω₀}(r) = F^{−1}{ 1/(jω − jω₀) }(r) = ½ sign(r) e^{jω₀r}.    (5.14)

Specifically, this LSI operator satisfies the right- and left-inverse relations

P_{jω₀} P_{jω₀}^{−1} ϕ = ϕ for all ϕ ∈ S_{jω₀}
P_{jω₀}^{−1} P_{jω₀} ϕ = ϕ for all ϕ ∈ S(R).

In order to be able to use such inverse operators for defining stochastic processes, we
need to extend P_{jω₀}^{−1} to all of S(R).
Note that, unlike the case of P_α^{−1} with Re(α) ≠ 0, here the extension of P_{jω₀}^{−1} to
an operator S(R) → S′(R) is in general not unique. For instance, P_{jω₀}^{−1} may also be
specified as

P_{jω₀,+}^{−1}ϕ(r) = (ρ_{jω₀,+} ∗ ϕ)(r),    (5.15)

which defines the same operator on S_{jω₀} but not on S(R). In fact, we could as well
have taken any impulse response of the form ρ_{jω₀}(r) + p₀(r), where p₀(r) = c₀e^{jω₀r} is an
oscillatory component that is in the null space of P_{jω₀}. By contrast, the Lp-continuous
inverses that we define below remain the same for all of these extensions. To convey the
idea, we shall first consider the extension based on the causal operator P_{jω₀,+}^{−1} defined by
(5.15). Its adjoint is denoted by P_{jω₀,+}^{−1∗} and amounts to an (anti-causal) convolution with
ρ_{jω₀,+}∨.
To solve the stochastic differential equation P_{jω₀}s = w, we need to find a left inverse
of the adjoint P_{jω₀}∗ acting on the space of test functions, which maps S(R) into the
required Lp space (Theorem 4.17). The problem with the “shift-invariant”³ extensions
of the inverse defined above is that the image of S(R) is not contained in arbitrary
Lp spaces. For this reason, we now introduce a different, “corrected” extension of the

2 To see this, note that (D − jω₀Id)ϕ(r) = e^{jω₀r} D{e^{−jω₀r}ϕ(r)}.
3 These operators are shift-invariant because they are defined by means of convolutions.
inverse of P_{jω₀}∗ that maps S(R) to R(R) – therefore, a fortiori, also into all Lp spaces.
This corrected left inverse, which we shall denote by I_{ω₀}∗, is constructed as

I_{ω₀}∗ϕ(r) = P_{jω₀,+}^{−1∗}ϕ(r) − ( lim_{y→−∞} P_{jω₀,+}^{−1∗}ϕ(y) ) ρ_{jω₀,+}∨(r)
           = (ρ_{jω₀,+}∨ ∗ ϕ)(r) − ϕ̂(−ω₀) ρ_{jω₀,+}∨(r),    (5.17)

in direct analogy with (5.3). As in our introductory example, the role of the second term
is to remove the tail of P_{jω₀,+}^{−1∗}ϕ(r). This ensures that the output decays fast enough to
belong to R(R). It is not difficult to show that the convolutional definition of I_{ω₀}∗ given
by (5.17) does not depend on the specific choice of the impulse response within the
class of admissible LSI inverses of P_{jω₀} on S_{jω₀}. Then, we may simplify the notation by
writing

for any Green's function ρ_{jω₀} of the operator P_{jω₀}. While the left-inverse operator I_{ω₀}∗
fixes the decay, it fails to be a right inverse of P_{jω₀}∗ unless ϕ ∈ S_{jω₀} or, equivalently,
ϕ̂(−ω₀) = 0.
The corresponding right inverse of P_{jω₀} is provided by the adjoint of I_{ω₀}∗. It is
identified via the scalar-product manipulation

where ρ_{jω₀} is defined by (5.14). The specificity of I_{ω₀} is to impose the boundary condition
s(0) = 0 on the output s = I_{ω₀}ϕ, irrespective of the input function ϕ. This is achieved
by the addition of a component that is in the null space of P_{jω₀}. This also explains why
we may replace ρ_{jω₀} in (5.19) by any other Green's function of P_{jω₀}, including the causal
one given by (5.16).
In particular, for jω₀ = 0 (that is, for P_{jω₀} = D), we have

I₀ϕ(r) = ∫_{−∞}^{r} ϕ(t) dt − ∫_{−∞}^{0} ϕ(t) dt = { ∫_0^r ϕ(t) dt,   r ≥ 0
                                                   −∫_r^0 ϕ(t) dt,  r < 0,

which is equivalent to the solution described in Section 5.1 (see Figure 5.1).
Putting everything together, with the definitions of Section 5.3.2, we arrive at the
following corollary of Theorem 5.3:

We shall now use this result to solve the differential equation (5.11) in the non-stable
scenario. To that end, we order the poles in such a way that the unstable ones come last,
with α_{N−K+m} = jω_m, 1 ≤ m ≤ K, where K is the number of purely imaginary poles. We
thus specify the right-inverse operator

which we then apply to w to obtain the solution s = L^{−1}w. In effect, by applying I_{(ω_K:ω_1)}
last, we are also enforcing the K linear boundary conditions

s(0) = 0
P_{jω_K}{s}(0) = 0
⋮
P_{jω_2} · · · P_{jω_K}{s}(0) = 0.    (5.25)

⟨ϕ, P_{(α_1:α_{N−K})} P_{(jω_1:jω_K)} s⟩ = ⟨ϕ, P_{(α_1:α_{N−K})} P_{(jω_1:jω_K)} L^{−1}w⟩
= ⟨P_{(jω_1:jω_K)}∗ P_{(α_1:α_{N−K})}∗ ϕ, I_{(ω_K:ω_1)} P_{(α_{N−K}:α_1)}^{−1} q_M(D)w⟩
= ⟨P_{(α_{N−K}:α_1)}^{−1∗} I_{(ω_K:ω_1)}∗ P_{(jω_1:jω_K)}∗ P_{(α_1:α_{N−K})}∗ ϕ, q_M(D)w⟩
= ⟨ϕ, q_M(D)w⟩,

where the composed operator acting on ϕ in the third line reduces to the identity and
where we have made use of Corollary 5.5. This proves that s satisfies the differential
equation (5.11) with driving term w, subject to the boundary conditions (5.25).
on the solution s = I_{ω₀,ϕ₀}w. This leads to the definition of the right-inverse operator

I_{ω₀,ϕ₀}ϕ(r) = (ρ_{jω₀} ∗ ϕ)(r) − ( ⟨ρ_{jω₀} ∗ ϕ, ϕ₀⟩ / ϕ̂₀(−ω₀) ) e^{jω₀r},    (5.27)

where ρ_{jω₀} is a Green's function of P_{jω₀} and ϕ₀ is some given rapidly decaying function
such that ϕ̂₀(−ω₀) ≠ 0. In particular, if we set ω₀ = 0 and ϕ₀ = δ, we recover
the scale-invariant integrator I₀ = I_{0,δ} that was used in our introductory example
(Section 5.1) to provide the connection with the classical theory of Lévy processes. The
Fourier-domain counterpart of (5.27) is

I_{ω₀,ϕ₀}ϕ(r) = ∫_R ϕ̂(ω) ( e^{jωr} − (ϕ̂₀(−ω)/ϕ̂₀(−ω₀)) e^{jω₀r} ) / ( j(ω − ω₀) ) dω/2π.    (5.28)

The above operator is well defined pointwise for any ϕ ∈ L₁(R). Moreover, it is a right
inverse of (D − jω₀Id) on S(R) because the regularization in the numerator amounts
to a sinusoidal correction that is in the null space of the operator. The adjoint of I_{ω₀,ϕ₀}
is specified by the Fourier-domain integral

I_{ω₀,ϕ₀}∗ϕ(r) = ∫_R ( ϕ̂(ω) − (ϕ̂(−ω₀)/ϕ̂₀(−ω₀)) ϕ̂₀(ω) ) / ( −j(ω + ω₀) ) e^{jωr} dω/2π,    (5.29)

which is non-singular too, thanks to the regularization in the numerator. The beneficial
effect of this adjustment is that I_{ω₀,ϕ₀}∗ is R-continuous and Lp-stable, unlike its more
conventional shift-invariant counterpart P_{jω₀}^{−1∗}. The time-domain counterpart of (5.29) is

I_{ω₀,ϕ₀}∗ϕ(r) = (ρ_{jω₀}∨ ∗ ϕ)(r) − (ϕ̂(−ω₀)/ϕ̂₀(−ω₀)) (ϕ₀ ∗ ρ_{jω₀}∨)(r),    (5.30)
where ρ_{jω₀} is a Green's function of P_{jω₀}. This relation is very similar to (5.18), with
the notable difference that the second term is convolved with ϕ₀. This suggests that we
can restore the smoothness of the output by picking a kernel ϕ₀ with a sufficient degree
of differentiability. In fact, by considering a sequence of such kernels in S(R) that
converge to the Dirac distribution (in the weak sense), we can specify a left-inverse
operator that is arbitrarily close to I_{ω₀}∗ and yet S-continuous.
While the imposition of generalized boundary conditions of the form (5.26) has some
significant implications for the statistical properties of the signal (non-stationary behavior),
it is less of an issue for signal processing because of the use of analysis tools
(wavelets, finite-difference operators) that stationarize these processes – in effect, filtering
out the null-space components – so that the traditional tools of the trade remain
applicable. Therefore, to simplify the presentation, we shall only consider boundary
conditions at zero in what follows, and work with the operators I_{ω₀}∗ and I_{ω₀}.
mer. We also note that ∂_τ⁰ is equivalent to the fractional Hilbert transform operator H^τ
investigated in [CU10]. A special case of the semigroup property is ∂_τ^γ = ∂_{γ/2}^γ ∂_{τ−γ/2}⁰ =
D^γ H^{τ−γ/2}, which indicates that the fractional derivatives of order γ are all related to
D^γ via a fractional Hilbert transform. The latter is a unitary operator (all-pass filter) that
essentially acts as a shift operator on the oscillatory part of a wavelet.
The property that ∂_τ^γ can be factorized as ∂_τ^γ = D^n ∂_{τ′}^α, with n ∈ N, α = γ − n,
τ′ = τ − n/2, and D^n : S(R) → S(R), has important consequences for the theory. In
particular, it suggests several equivalent descriptions of the generalized function r₊^λ/Γ(λ+1),
as in

⟨ϕ, r₊^λ/Γ(λ+1)⟩ = ⟨ϕ, D^n{ r₊^{λ+n}/Γ(λ+n+1) }⟩
                 = ⟨D^{n∗}ϕ, r₊^{λ+n}/Γ(λ+n+1)⟩
                 = ((−1)^n/Γ(λ+n+1)) ∫_0^∞ r^{λ+n} ϕ^{(n)}(r) dr.    (5.33)

The last equality of (5.33) with n = max(0, ⌈−λ⌉) provides an operational definition that
reduces to a conventional integral.
In principle, we can obtain the shift-invariant inverse of the derivative operator ∂_τ^γ
with γ ≥ 0 by taking the order to be negative (fractional integrator) and reversing
the sign of τ. Yet, based on (5.32), which is valid for γ ∈ R∖N, we see that this is
problematic because the impulse response becomes more delocalized as γ decreases.
As in the case of the ordinary derivative D^n, this calls for a stabilized version of the
inverse.
continuously maps S(R) into Lp(R) for p > 0, τ ∈ R, and γ ∈ R₊, subject to the
restriction γ + 1/p ≠ 1, 2, . . .. It is a linear operator that is scale-invariant and is a left
inverse of ∂_{−τ}^γ = ∂_τ^{γ∗}. The adjoint operator ∂_{−τ,p}^{−γ} is given by

∂_{−τ,p}^{−γ}ϕ(r) = (1/2π) ∫_R ( e^{jωr} − ∑_{k=0}^{⌈γ+1/p⌉−1} (jωr)^k/k! ) / ( (−jω)^{γ/2−τ} (jω)^{γ/2+τ} ) ϕ̂(ω) dω

and is the proper scale-invariant right inverse of ∂_τ^γ to be applied to generalized
functions.
Proof We only give a sketch of the proof for p ≥ 1, leaving out the derivation of
Proposition 5.9, which is somewhat technical. The first observation is that the operator
can be factorized as ∂_{−τ,p}^{−γ∗} = ∂_{τ′}^α (I₀∗)^{n_p} with α = n_p − γ and τ′ = τ − n_p/2, where I₀∗
is the corrected (adjoint) integrator defined by (5.21) with ω₀ = 0. The integer order of
pre-integration is n_p = ⌈γ + 1/p − 1⌉ + 1 = ⌈γ + 1/p⌉, which implies that the residual
degree of differentiation α is constrained according to

1/p − 1 < α = n_p − γ < 1/p.

This allows us to write ∂_{−τ,p}^{−γ∗}ϕ = ∂_{τ′}^α φ, where φ = (I₀∗)^{n_p}ϕ is rapidly decaying by
Corollary 5.5. The required ingredient to complete the proof is a result analogous to
Theorem 5.7 which would ensure that ∂_{τ′}^α φ ∈ Lp(R) for α > 1/p − 1. The easy scenario is
when ϕ ∈ S(R) has n_p vanishing moments, in which case φ = (I₀∗)^{n_p}ϕ ∈ S(R) so that
Theorem 5.7 is directly applicable. In general, however, φ is (only) rapidly decreasing;
this is addressed by the following extension.

PROPOSITION 5.9 The fractional operator ∂_τ^α I₀∗ is continuous from S(R) to Lp(R)
for p > 0, τ ∈ R, and 1/p − 1 < α < 1/p. It also admits a continuous extension R(R) →
Lp(R) for p ≥ 1.

The reason for including the operator I₀∗ in the statement is to avoid making explicit
hypotheses about the derivative of φ, which is rapidly decaying but also exhibits a
Dirac impulse at the origin with a weight (−ϕ̂(0)). Since I₀∗ : R(R) → R(R) (by
Proposition 5.4), the global continuity result for p ≥ 1 then follows from the chaining
of these elementary operators.
Similarly, we establish the left-inverse property by considering the factorization

∂_{−τ}^γ = (D∗)^{n_p} ∂_{−τ′}^{−α}

and by recalling that I₀∗ is a left inverse of D∗ = −D. The result then follows from
the identity ∂_{τ′}^α ∂_{−τ′}^{−α} = Id, which is a special case of the semigroup property of
scale-invariant LSI operators, under the implicit assumption that the underlying operations
are well defined in the Lp sense.
A final observation that gives insight into the design of Lp-stable inverse operators
is that the form of (5.34) for p = 1 coincides with the finite-part definition (see Appendix A)
of the generalized function

g_r(ω) = e^{jωr} / ( (−jω)^{γ/2−τ} (jω)^{γ/2+τ} ) ∈ S′(R),

where the finite-part regularization in the latter integral is the same as in the definition
in (A.1) of the generalized function x₊^λ with −Re(λ) − 1 = γ. The catch with (5.34) is
that the number of regularization terms n_p = ⌈γ + 1/p⌉ is not solely dependent upon γ,
but also on 1/p.
ρ^γ = γρ^{γ−2},

initially valid for Re(γ) − 2 > −d, in the variable γ. For the details of the previous two
observations we refer the reader to Appendix A and Tafti [Taf11, Section 2.2].⁴
Finally, it is important to remark that, unlike integer-order operators, unless γ is a
positive even integer the image of S(R^d) under L^γ is not contained in S(R^d). Specifically,
while for any ϕ ∈ S, L^γϕ is always an infinitely differentiable regular function,
in the case of γ ≠ 2m, m = 1, 2, . . ., it may have slow (polynomial) decay or growth, in
direct analogy with the general 1-D scenario characterized by Theorem 5.7.
Put more simply, the fractional Laplacian of a Schwartz function is not in general a
Schwartz function.

ρ̂^γ(ω) ρ̂^{−γ}(ω) = 1 / ( Γ((d+γ)/2) Γ((d−γ)/2) ) for ω ≠ 0,⁵    (5.35)

4 The difference in factors of (2π)^{d/2} and (2π)^d between the formulas given here and in Tafti [Taf11] is due
to different normalizations used in the definition of the Fourier transform.
5 Here we exclude the cases where the gamma functions in the denominator have poles, namely where their
argument is a negative integer. For details see [Taf11, Section 2.2.2].
THEOREM 5.11 ([Taf11], Theorem 2.aq and Corollary 2.am) The operator

L_p^{−γ∗} : ϕ → ρ^{γ−d} ∗ ϕ − ∑_{|k| ≤ ⌈Re(γ)+d/p⌉−d} (∂^k ρ^{γ−d} / k!) ∫_{R^d} y^k ϕ(y) dy    (5.36)

with Re(γ) + d/p ≠ 1, 2, 3, . . . is rotation-invariant and homogeneous of order (−γ) in
the sense of Definition 5.2. It maps S(R^d) continuously into Lp(R^d) for p ≥ 1.
The adjoint of L_p^{−γ∗} is given by

L_p^{−γ} : ϕ → ρ^{γ−d} ∗ ϕ − ∑_{|k| ≤ ⌈Re(γ)+d/p⌉−d} (∂^k L^{−γ}ϕ(0) / k!) r^k.

If we exclude the cases where the denominator of (5.35) has a pole, we may normalize
the above operators to find left and right inverses corresponding to the fractional
Laplacian (−Δ)^{γ/2}. The next theorem gives an equivalent Fourier-domain characterization of
these operators.
these operators.
γ∗
THEOREM 5.12 (see [SU12, Theorem 3.7]) Let Ip with p ≥ 1 and γ > 0 be the
isotropic fractional integral operator defined by
∂ k
ϕ (0)ωk
ϕ (ω) −
k!
|k|≤γ + pd −d dω
Iγp ∗ ϕ(r) = ejω,r . (5.37)
Rd ωγ (2π)d
γ∗
Then, under the condition that γ = 2, 4, . . . and γ + d
p = 1, 2, 3, . . ., Ip is the unique
γ
left inverse of (− ) 2 that continuously maps S (Rd ) into Lp (Rd ) for p ≥ 1 and is scale-
γ
invariant. The adjoint operator Ip , which is the proper scale-invariant right inverse of
γ
(− ) 2 , is given by
j|k| rk ωk
ejω,r −
k!
|k|≤γ + pd −d dω
Iγp ϕ(r) =
ϕ (ω) . (5.38)
Rd ωγ (2π)d
γ γ∗
The fractional integral operators Ip and Ip are both scale-invariant of order (−γ ),
but they are not shift-invariant.
5.6 Discrete convolution operators
We conclude this chapter by providing a few basic results on discrete convolution
operators and their inverses. These will turn out to be helpful for establishing the existence
of certain spline interpolators which are required for the construction of operator-like
wavelet bases in Chapter 6 and for the representation of autocorrelation functions in
Chapter 7.
The convention in this book is to use square brackets to index sequences. This allows
one to distinguish them from functions of a continuous variable. In other words, h(·)
or h(r) stands for a function defined on a continuum, while h[·] or h[k] denotes some
discrete sequence. The notation h[·] is often simplified to h when the context is clear.
The discrete convolution between two sequences h[·] and a[·] over Z^d is defined as

(h ∗ a)[n] = ∑_{m∈Z^d} h[m] a[n − m].    (5.39)

This convolution describes how a digital filter with discrete impulse response h acts on
some input sequence a. If h ∈ ℓ₁(Z^d), then the map a[·] → (h ∗ a)[·] is a continuous
operator ℓp(Z^d) → ℓp(Z^d). This follows from Young's inequality for sequences,

‖h ∗ a‖_p ≤ ‖h‖₁ ‖a‖_p,    (5.40)

where

‖a‖_p = { ( ∑_{n∈Z^d} |a[n]|^p )^{1/p}  for 1 ≤ p < ∞
          sup_{n∈Z^d} |a[n]|          for p = ∞.
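Young's inequality (5.40) is easy to check numerically (a sketch of ours, assuming NumPy; the random taps and input are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
h = rng.standard_normal(15)              # filter taps h[.]
a = rng.standard_normal(200)             # input sequence a[.]

b = np.convolve(h, a)                    # b[n] = (h * a)[n]
for p in (1, 2, np.inf):
    lhs = np.linalg.norm(b, p)
    rhs = np.linalg.norm(h, 1) * np.linalg.norm(a, p)
    print(p, lhs <= rhs + 1e-9)          # (5.40): always True
```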
Note that the condition h ∈ ℓ₁(Z^d) is the discrete counterpart of the (more involved)
TV condition in Theorem 3.5. As in the continuous formulation, it is necessary and
sufficient for stability in the extreme cases p = 1, +∞. A simplifying aspect of the discrete
setting is that the boundedness of the operator for p = ∞ implies all the other forms of
ℓp-stability because of the embedding ℓp(Z^d) ⊂ ℓq(Z^d) for any 1 ≤ p < q ≤ ∞. The
latter property is a consequence of the basic norm inequality

‖a‖_p ≥ ‖a‖_q ≥ ‖a‖_∞

for all a ∈ ℓp(Z^d) ⊂ ℓq(Z^d) ⊂ ℓ∞(Z^d).
In the Fourier domain, (5.39) maps into the multiplication of the discrete Fourier
transforms of h and a as

b[n] = (h ∗ a)[n] ⇔ B(ω) = H(ω)A(ω),

where we are using capital letters to denote the discrete-time Fourier transforms of the
underlying sequences. Specifically, we have that

B(ω) = F_d{b}(ω) = ∑_{k∈Z^d} b[k] e^{−j⟨ω,k⟩}.

THEOREM 5.13 (Wiener's lemma) Let H(ω) = ∑_{k∈Z^d} h[k]e^{−j⟨ω,k⟩}, with h ∈ ℓ₁(Z^d),
be a stable discrete Fourier multiplier such that H(ω) ≠ 0 for ω ∈ [−π, π]^d. Then,
G(ω) = 1/H(ω) has the same property in the sense that it can be written as 1/H(ω) =
∑_{k∈Z^d} g[k]e^{−j⟨ω,k⟩} with g ∈ ℓ₁(Z^d).
The so-defined filter g identifies a stable convolution inverse of h with the property
that

(g ∗ h)[n] = (h ∗ g)[n] = δ[n],

where

δ[n] = { 1 for n = 0
         0 for n ∈ Z^d ∖ {0}

is the Kronecker unit impulse.
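Wiener's lemma can be illustrated numerically by sampling G(ω) = 1/H(ω) on a fine frequency grid and returning to the tap domain (a sketch of ours, assuming NumPy; the two-tap filter is an illustrative choice whose symbol never vanishes on [−π, π]):

```python
import numpy as np

h = np.array([1.0, 0.5])                 # H(omega) = 1 + 0.5 e^{-j omega} != 0

n = 1024                                 # fine frequency grid
H = np.fft.fft(h, n)
g = np.real(np.fft.ifft(1.0 / H))        # inverse filter taps: g[k] ~ (-0.5)^k

print(np.round(np.convolve(h, g)[:6], 6))  # ~ Kronecker impulse [1, 0, 0, ...]
```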
Section 5.2
The specification of Lp-stable convolution operators is a central topic in harmonic
analysis [SW71, Gra08]. The basic result in Proposition 5.1 relies on Young's inequality
with q = r = 1 [Fou77]. The complete class of functions that result in S-continuous
convolution kernels is provided by the inverse Fourier transform of the space of smooth,
slowly increasing Fourier multipliers, which play a crucial role in the theory of generalized
functions [Sch66]. They extend R(R^d) in the sense that they also contain point
distributions such as δ and its derivatives.
Section 5.3
The operational calculus that is used for solving ordinary differential equations (ODEs)
can be traced to Heaviside, who also introduced the symbol D for the derivative
operator. It was initially met with skepticism because Heaviside's exposition lacked rigor.
Nowadays, the preferred method for solving ODEs is based on the Laplace transform
or the Fourier transform. The operator-based formalism that is presented in Section 5.3
is a standard application of distribution theory and Green's functions [Kap62, Zem10].
Section 5.4
The extension of the operational calculus for solving unstable ODE/SDEs is a more
recent development. It was initiated in [BU07] in an attempt to link splines and fractals.
The material presented in this section is based on [UTS14], for the most part. The
generalized boundary conditions of Section 5.4.3 were introduced in [UTAK14].
Section 5.5
Fractional derivatives and Laplacians play a central role in the theory of splines, the
primary reason being that these operators are scale-invariant [Duc77, UB00]. The proof
of Proposition 5.6 can be found in [UB07, Proposition 2], while Theorem 5.7 is a slight
extension of [UB07, Theorem 3]. The derivation of the corresponding left and right
inverses for p = 2 is carried out in the companion paper [BU07]. These results were
extended to the multivariate setting for both the Gaussian [TVDVU09] and the generalized
Poisson settings [UT11]. A detailed investigation of the Lp-stable left inverses
of the fractional Laplacian, together with some unicity results, is provided in [SU12].
Further results and proofs can be found in [Taf11].
Section 5.6
Discrete convolution is a central topic in digital signal processing [OSB99]. The discrete
version of Young’s inequality can be established by using the same technique as for
the proof of Proposition 5.1. In that respect, the condition h ∈ 1 (Zd ) is the standard
criterion for stability in the theory of discrete-time linear systems [OSB99, Lat98]. It is
necessary and sufficient for BIBO stability (p = ∞) and for the preservation of absolute
summability (p = 1). Wiener stated his famous lemma in 1932 [Wie32, Lemma IIe].
Other relevant references are [New75, Sun07].
6 Splines and wavelets
where the B-spline basis functions (rectangles) are dilated versions of the cardinal ones
by a factor of $2^i$:
$$\phi_{i,k}(r) = \beta_+^0\!\left(\frac{r - 2^i k}{2^i}\right) = \begin{cases} 1, & \text{for } r \in \big[2^i k,\, 2^i(k+1)\big) \\ 0, & \text{otherwise.} \end{cases} \quad (6.2)$$
The variable $i$ is the scale index that specifies the resolution (or knot spacing) $a = 2^i$,
while the integer $k$ encodes the spatial location. The B-spline of degree zero, $\phi = \phi_{0,0} = \beta_+^0$, is the scaling function of the representation. Interestingly, it is the identification of a proper scaling function that constitutes the most fundamental step in the
construction of a wavelet basis of $L_2(\mathbb{R})$.
DEFINITION 6.1 (Scaling function) $\phi \in L_2(\mathbb{R})$ is a valid scaling function if and only
if it satisfies the following three properties:
• Two-scale relation:
$$\phi(r/2) = \sum_{k \in \mathbb{Z}} h[k]\, \phi(r - k), \quad (6.3)$$
where the sequence $h \in \ell_1(\mathbb{Z})$ is the so-called refinement mask
• Partition of unity:
$$\sum_{k \in \mathbb{Z}} \phi(r - k) = 1 \quad (6.4)$$
• The set of functions $\{\phi(\cdot - k)\}_{k \in \mathbb{Z}}$ forms a Riesz basis.
In practice, a given brand of (orthogonal) wavelets (e.g., Daubechies or splines) is
often summarized by its refinement filter h since the latter uniquely specifies φ, subject
to the admissibility constraints (6.4) and φ ∈ L2 (R). In the case of the B-spline of degree
zero, we have that $h[k] = \delta[k] + \delta[k - 1]$, where
$$\delta[k] = \begin{cases} 1, & \text{for } k = 0 \\ 0, & \text{otherwise} \end{cases}$$
is the discrete Kronecker impulse. This translates into what we jokingly refer to as the
Lego–Duplo relation 1
$$\beta_+^0(r/2) = \beta_+^0(r) + \beta_+^0(r - 1). \quad (6.5)$$
The fact that $\beta_+^0$ satisfies the partition of unity is obvious. Likewise, we already observed
in Chapter 1 that $\beta_+^0$ generates an orthogonal system, which is a special case of a Riesz
basis.
By considering the rescaled version of such a basis, we specify the subspace of splines
at scale $i$ as
$$V_i = \Big\{ s_i(r) = \sum_{k \in \mathbb{Z}} c_i[k]\, \phi_{i,k}(r) \; : \; c_i \in \ell_2(\mathbb{Z}) \Big\} \subset L_2(\mathbb{R}),$$
1 Duplos are the larger-scale versions of Lego building blocks and are more suitable for smaller children to
play with. The main point of the analogy with wavelets is that Legos and Duplos are compatible; they can
be combined to build more complex shapes. The enabling property is that a Duplo is equivalent to two
smaller Legos placed next to each other, as expressed by (6.5).
Figure 6.1 Multiresolution signal analysis using piecewise-constant splines with a dyadic scale
progression. Left: multiresolution pyramid. Right: error signals between two successive levels of
the pyramid.
$$c_i[k] = \frac{1}{2}\, c_{i-1}[2k] + \frac{1}{2}\, c_{i-1}[2k + 1] = \big(c_{i-1} * \tilde{h}\big)[2k]. \quad (6.6)$$
It is run recursively for i = 1, . . . , imax where imax denotes the bottom level of the
pyramid. The outcome is a multiresolution analysis of our input signal s0 .
In order to uncover the wavelets, it is enlightening to look at the residual signals
ri (r) = si−1 (r) − si (r) ∈ Vi−1 on the right side of Figure 6.1. While these are splines
that live at the same resolution as si−1 , they actually have half the apparent degrees of
freedom. These error signals exhibit a striking sign-alternation pattern due to the fact
that two consecutive samples (ci−1 [2k], ci−1 [2k + 1]) are at an equal distance from their
mean value $(c_i[k])$. This suggests rewriting the residuals more concisely in terms of
oscillating basis functions (wavelets) at scale $i$, like
$$r_i(r) = \sum_{k \in \mathbb{Z}} d_i[k]\, \psi_{i,k}(r), \qquad d_i[k] = \big(c_{i-1} * \tilde{g}\big)[2k], \quad (6.8)$$
where the mother wavelet $\psi$ satisfies
$$\psi(r/2) = \sum_{k \in \mathbb{Z}} g[k]\, \phi(r - k),$$
which is the wavelet counterpart of the two-scale relation (6.3). In the present example,
we have $g[k] = (-1)^k h[k]$, which is a general relation that is characteristic of an orthogonal design. Likewise, in order to gain in generality, we have chosen to express
the decomposition algorithms (6.6) and (6.8) in terms of discrete convolution (filtering) and downsampling operations where the corresponding Haar analysis filters are
$\tilde{h}[k] = \frac{1}{2}\, h[-k]$ and $\tilde{g}[k] = \frac{1}{2}\, (-1)^k h[-k]$. The Hilbert-space interpretation of this
approximation process is that $r_i \in W_i$, where $W_i$ is the orthogonal complement of $V_i$ in
$V_{i-1}$; that is, $V_{i-1} = V_i + W_i$ with $V_i \perp W_i$ (as a consequence of the orthogonal-projection
theorem).
Finally, we can close the loop by observing that
$$s_0(r) = s_{i_{\max}}(r) + \sum_{i=1}^{i_{\max}} \underbrace{\big(s_{i-1}(r) - s_i(r)\big)}_{r_i(r)}
= \sum_{k \in \mathbb{Z}} c_{i_{\max}}[k]\, \phi_{i_{\max},k}(r) + \sum_{i=1}^{i_{\max}} \sum_{k \in \mathbb{Z}} d_i[k]\, \psi_{i,k}(r), \quad (6.10)$$
Figure 6.2 Decomposition of a signal into orthogonal scale components. The error signals
ri = si−1 − si between two successive signal approximations are expanded using a series of
properly scaled wavelets.
More generally, we can push the argument to the limit and apply the decomposition
to any finite-energy function:
$$\forall s \in L_2(\mathbb{R}), \qquad s = \sum_{i \in \mathbb{Z}} \sum_{k \in \mathbb{Z}} d_i[k]\, \psi_{i,k}, \quad (6.11)$$
where $d_i[k] = \langle s, \tilde{\psi}_{i,k} \rangle_{L_2}$ and $\{\tilde{\psi}_{i,k}\}_{(i,k) \in \mathbb{Z}^2}$ is a suitable (bi-)orthogonal wavelet basis
with the property that $\langle \tilde{\psi}_{i,k}, \psi_{i',k'} \rangle_{L_2} = \delta[k - k']\, \delta[i - i']$.
Remarkably, the whole process described above – except the central expressions in
(6.6) and (6.8), and the equations explicitly involving β+0 – is completely generic and
applicable to any other wavelet basis of L2 (R) that is specified in terms of a wavelet
filter g and a scaling function φ (or, equivalently, an admissible refinement filter h). The
bottom line is that the wavelet decomposition and reconstruction algorithm is fully described by the four digital filters $(h, g, \tilde{h}, \tilde{g})$ that form a perfect-reconstruction filterbank.
The Haar transform is associated with the shortest possible filters. Its less favorable as-
pects are that the basis functions are discontinuous and that the scale-truncated error
decays only like the first power of the sampling step a = 2i (first order of approxima-
tion).
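As a minimal illustration of this four-filter structure, the sketch below implements one analysis/synthesis stage of the Haar filterbank, i.e., the averaging rule (6.6) together with its residual channel; the function names and the test signal are ours, chosen for illustration only:

```python
import numpy as np

def haar_analysis(c):
    """One stage of (6.6) and its residual channel: filtering + downsampling."""
    c = np.asarray(c, dtype=float)
    mean = 0.5 * (c[0::2] + c[1::2])      # coarse coefficients c_i[k]
    diff = 0.5 * (c[0::2] - c[1::2])      # wavelet coefficients d_i[k]
    return mean, diff

def haar_synthesis(mean, diff):
    """Perfect reconstruction from the two channels."""
    c = np.empty(2 * len(mean))
    c[0::2] = mean + diff
    c[1::2] = mean - diff
    return c

c0 = np.array([4.0, 2.0, 3.0, 3.0, 1.0, 5.0, 2.0, 0.0])
c1, d1 = haar_analysis(c0)
print(np.allclose(haar_synthesis(c1, d1), c0))   # True: perfect reconstruction
```

Iterating haar_analysis on the mean channel reproduces the multiresolution pyramid of Figure 6.1.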
The fundamental point of our formulation is that the Haar wavelet is matched to
d
the pure derivative operator D = dr , which goes hand-in-hand with Lévy processes (see
Chapter 1). In that respect, the critical observations relating to spline and wavelet theory
are as follows:
• All piecewise-constant functions can be interpreted as D-splines.
• The Haar wavelet acts as a smoothed version of the derivative in the sense that
$\psi_{\mathrm{Haar}} = \mathrm{D}\phi$, where $\phi$ is an appropriate kernel (triangle function).
• The B-spline of degree zero can be expressed as $\beta_+^0 = \beta_{\mathrm{D}} = \mathrm{D}_d \mathrm{D}^{-1} \delta$, where the
finite-difference operator $\mathrm{D}_d$ is the discrete counterpart of $\mathrm{D}$.
We shall now show how these ideas are extendable to a much broader class of differen-
tial operators L.
By proceeding in a similar manner with the other monomials and combining the results,
we find that
$$(r - r_0)^n = \sum_{m + k = n} \frac{n!}{m!\, k!}\, (-1)^{|k|}\, r^m\, r_0^k = \sum_{m \le n} b_m(r_0)\, r^m,$$
with polynomial coefficients $b_m(r_0)$ that depend upon the multi-index $m$ and the shift
$r_0$. Finally, we note that the exponential factor $e^{\langle z_0, r\rangle}$ can be shifted by $r_0$ by simple
multiplication with a constant (see (6.12) below). These facts taken together establish
the structure of the underlying vector space. As for the last statement, we rely on the
theory of Lie groups, which tells us that the only finite-dimensional collections of functions
that are translation-invariant are made of exponential polynomials. The pure exponentials
$e^{\langle z_0, r\rangle}$ (with $n = 0$) are special in that respect: they are the eigenfunctions of the shift
operator in the sense that
$$e^{\langle z_0,\, r - r_0\rangle} = \lambda(r_0)\, e^{\langle z_0, r\rangle} \quad (6.12)$$
with the (complex) eigenvalue $\lambda(r_0) = e^{-\langle z_0, r_0\rangle}$, and hence the only elements that specify
shift-invariant subspaces of dimension 1.
Since our formulation relies on the theory of generalized functions, we shall focus on
the restriction of $\mathcal{N}_{\mathrm{L}}$ to $\mathcal{S}'(\mathbb{R}^d)$. This rules out the exponential factors $z_0 = \boldsymbol{\alpha}_0 + j\boldsymbol{\omega}_0$
in Proposition 6.1 with $\boldsymbol{\alpha}_0 \in \mathbb{R}^d \setminus \{0\}$, for which the Fourier-multiplier operator is not
necessarily well defined. We are then left with null-space atoms of the form $e^{j\langle \omega_0, r\rangle}\, r^n$
with $\omega_0 \in \mathbb{R}^d$, which are functions of slow growth.
The next important ingredient is the Green’s function ρL of the operator L. Its defining
property is LρL = δ, where δ is the d-dimensional Dirac distribution. Since there are
many equivalent Green’s functions of the form ρL + p0 where p0 ∈ NL is an arbitrary
component of the null space, we resolve the ambiguity by defining the (primary) Green’s
function of L as
$$\rho_{\mathrm{L}}(r) = F^{-1}\left\{ \frac{1}{\widehat{L}(\omega)} \right\}(r), \quad (6.13)$$
with the requirement that $\rho_{\mathrm{L}} \in \mathcal{S}'(\mathbb{R}^d)$ is an ordinary function of slow growth. In other
words, we want $\rho_{\mathrm{L}}(r)$ to be defined pointwise for any $r \in \mathbb{R}^d$ and to grow no faster
than a polynomial. The existence of the generalized inverse Fourier transform (6.13)
imposes some minimal continuity and decay conditions on $1/\widehat{L}(\omega)$ and also puts some
restrictions on the number and nature of its singularities (e.g., the zeros of $\widehat{L}(\omega)$).
DEFINITION 6.2 (Spline admissibility) The Fourier-multiplier operator $\mathrm{L} : \mathcal{X}(\mathbb{R}^d) \to \mathcal{S}'(\mathbb{R}^d)$ with frequency response $\widehat{L}(\omega)$ is called spline admissible if (6.13) is well defined and $\rho_{\mathrm{L}}(r)$ is an ordinary function of slow growth.
An important characteristic of spline-admissible operators is the rate of growth of
their frequency response at infinity.
DEFINITION 6.3 (Order of a Fourier multiplier) The Fourier multiplier $\widehat{L}(\omega)$ is of
(asymptotic) order $\gamma \in \mathbb{R}^+$ if there exist a radius $R \in \mathbb{R}^+$ and a constant $C$ such that
$$C\, |\omega|^{\gamma} \le |\widehat{L}(\omega)| \quad (6.14)$$
for all $|\omega| \ge R$, where $\gamma$ is critical in the sense that the condition fails for any larger
value.
The order is in direct relation with the degree of smoothness of the Green’s function
ρL . In the case of a scale-invariant operator, it also coincides with the scaling order (or
degree of homogeneity) of L(ω). For instance, the fractional-derivative operator Dγ ,
which is defined via the Fourier multiplier $(j\omega)^{\gamma}$, is of order $\gamma$. Its Green's function is
given by (see Table A.1)
$$\rho_{\mathrm{D}^{\gamma}}(r) = F^{-1}\left\{ \frac{1}{(j\omega)^{\gamma}} \right\}(r) = \frac{r_+^{\gamma - 1}}{\Gamma(\gamma)}, \quad (6.15)$$
where $\Gamma$ is Euler's gamma function (see Appendix C) and $r_+^{\gamma - 1} = \max(0, r)^{\gamma - 1}$. Clearly,
the latter is a function of slow growth. It has a single singularity at the origin whose
Hölder exponent is $(\gamma - 1)$, and is infinitely differentiable everywhere else. It follows
that $\rho_{\mathrm{D}^{\gamma}}$ is uniformly Hölder-continuous of degree $(\gamma - 1)$. This is one less than the order
of the operator. On the other hand, the null space of $\mathrm{D}^{\gamma}$ consists of the polynomials of
degree $N = \lceil \gamma \rceil - 1$ since $\frac{\mathrm{d}^n (j\omega)^{\gamma}}{\mathrm{d}\omega^n} \propto (j\omega)^{\gamma - n}$ is vanishing at the origin up to order $N$
with $(\gamma - 1) \le N < \gamma$ (see argument in Section 6.4.1).
A fundamental result is that all partial differential operators with constant coefficients
are spline-admissible. This follows from the Malgrange–Ehrenpreis theorem, which
guarantees the existence of their Green’s function [Mal56, Wag09]. The generic form
of such operators is
$$\mathrm{L}_N = \sum_{|n| \le N} a_n\, \partial^n$$
with $a_n \in \mathbb{R}$, where $\partial^n$ is the multi-index notation for $\frac{\partial^{n_1 + \cdots + n_d}}{\partial r_1^{n_1} \cdots \partial r_d^{n_d}}$. The corresponding
Fourier multiplier is $\widehat{L}_N(\omega) = \sum_{|n| \le N} a_n\, j^{|n|}\, \omega^n$, which is a polynomial of degree $N$.
The operator is elliptic if $\widehat{L}_N(\omega)$ vanishes at the origin and nowhere else. More
generally, it is called quasi-elliptic of order $\gamma$ if $\widehat{L}_N(\omega)$ fulfills the growth condition
in Definition 6.3.
in Definition 6.3. For d = 1, it is fairly easy to determine ρL using standard Fourier-
inversion techniques (see Chapter 5). Moreover, the condition for quasi-ellipticity of
order N is automatically satisfied. When moving to higher dimensions, the study of
partial differential operators and the determination of their Green's functions become
more challenging because of the absence of a general multidimensional factorization
mechanism. Yet, it is possible to treat special cases in full generality, such as the scale-invariant operators (with homogeneous, but not necessarily rotation-invariant, Fourier
multipliers) and the class of rotation-invariant operators that are polynomials of the
Laplacian $(-\Delta)$.
The location of the Dirac impulses specifies the spline discontinuities (or knots). The
term “cardinal” refers to the particular setting where these are located on the Cartesian
grid Zd .
The remarkable aspect of this definition is that the operator L has the role of a
mathematical A-to-D converter since it maps a continuously defined signal s into a
discrete sequence a = (a[k]). Also note that the weighted sum of Dirac impulses on
the right-hand side of the above equation can be interpreted as the continuous-domain
representation of the discrete signal a – it is a hybrid-type representation that is com-
monly used in the theory of linear systems to model ideal sampling (multiplication with
a train of Dirac impulses).
The underlying concept of a spline is fairly general and it naturally extends to non-
uniform grids.
D E F I N I T I O N 6.5 (Non-uniform spline) Let {rk }k∈S be a set of points (not necessarily
finite) that specifies a (non-uniform) grid in Rd . Then, a function s(r) (possibly of slow
growth) is a non-uniform L-spline with knots $\{r_k\}_{k \in S}$ if and only if
$$\mathrm{L}s(r) = \sum_{k \in S} a_k\, \delta(r - r_k).$$
The direct implication of this definition is that a (non-uniform) L-spline with knots
$\{r_k\}$ can generally be expressed as
$$s(r) = \sum_{k \in S} a_k\, \rho_{\mathrm{L}}(r - r_k) + p_0(r),$$
where $p_0 \in \mathcal{N}_{\mathrm{L}}$ is an appropriate null-space component.
DEFINITION 6.6 (Riesz basis) A sequence of functions $\{\phi_k(r)\}_{k \in \mathbb{Z}}$ in $L_2(\mathbb{R}^d)$ forms a
Riesz basis if and only if there exist two constants $A$ and $B$ such that
$$A\, \|c\|_{\ell_2} \le \Big\| \sum_{k \in \mathbb{Z}} c_k\, \phi_k \Big\|_{L_2} \le B\, \|c\|_{\ell_2}$$
for any sequence $c = (c_k) \in \ell_2$. More generally, the basis is $L_p$-stable if there exist two
constants $A_p$ and $B_p$ such that
$$A_p\, \|c\|_{\ell_p} \le \Big\| \sum_{k \in \mathbb{Z}} c_k\, \phi_k \Big\|_{L_p} \le B_p\, \|c\|_{\ell_p}.$$
This definition imposes an equivalence between the $L_2$ ($L_p$, resp.) norm of the continuously defined function $s(r) = \sum_{k \in \mathbb{Z}} c_k \phi_k(r)$ and the $\ell_2$ ($\ell_p$, resp.) norm of its expansion coefficients $(c_k)$. This ensures that the representation is stable in the sense that a
small perturbation of the expansion coefficients results in a perturbation of comparable
magnitude on $s(r)$, and vice versa. Also note that the lower inequality implies that the
functions $\{\phi_k\}$ are linearly independent (by setting $s(r) = 0$), which is the defining property of a basis in finite dimensions – but which, on its own, does not ensure stability in
infinite dimensions. When $A = B = 1$, we have a perfect norm equivalence, which translates into the basis being orthonormal (Parseval's relation). Finally, we point out that
the existence of the bounds $A$ and $B$ ensures that the (infinite) Gram matrix is positive
definite, so that it can be readily diagonalized to yield an equivalent orthogonal basis.
In the (multi-)integer shift-invariant case where the basis functions are given by
φk (r) = φ(r − k), k ∈ Zd , there is a simpler equivalent reformulation of the Riesz
basis requirement of Definition 6.6.
THEOREM 6.2 Let $\phi(r) \in L_2(\mathbb{R}^d)$ be a B-spline-like generator whose Fourier transform is denoted by $\widehat{\phi}(\omega)$. Then, $\{\phi(r - k)\}_{k \in \mathbb{Z}^d}$ forms a Riesz basis with Riesz bounds $A$
and $B$ if and only if
$$0 < A^2 \le \sum_{n \in \mathbb{Z}^d} |\widehat{\phi}(\omega + 2\pi n)|^2 \le B^2 < \infty \quad (6.17)$$
for almost every $\omega$. Moreover, the basis is $L_p$-stable for all $1 \le p \le +\infty$ if, in addition,
$\phi$ satisfies the decay condition (6.18).
In practice, the Riesz bounds are determined from the Gram sequence via
$$\sum_{n \in \mathbb{Z}^d} |\widehat{\phi}(\omega + 2\pi n)|^2 = \sum_{k \in \mathbb{Z}^d} \big(\phi * \phi^{\vee}\big)(k)\, e^{-j\langle \omega, k\rangle}, \quad (6.19)$$
where the right-hand side follows from Poisson's summation formula applied to the
sampling at the integers of the autocorrelation function $(\phi * \phi^{\vee})(r)$. Equation (6.19)
is especially advantageous in the case of compactly supported B-splines, for which the
autocorrelation is often known explicitly (as a B-spline of twice the order), since it
reduces the calculation to a finite sum over the support of the Gram sequence (discrete-
domain Fourier transform).
Theorem 6.2 is a fundamental result in sampling and approximation theory [Uns00].
It is instructive here to briefly run through the L2 part of the proof, which also serves as
a refresher on some of the standard properties of the Fourier transform. In particular, we
emphasize the interaction between the continuous and discrete aspects of the problem.
Proof We start by computing the Fourier transform of $s(r) = \sum_{k \in \mathbb{Z}^d} c[k]\, \phi(r - k)$,
which gives
$$F\{s\}(\omega) = \sum_{k \in \mathbb{Z}^d} c[k]\, e^{-j\langle \omega, k\rangle}\, \widehat{\phi}(\omega) \quad \text{(by linearity and shift property)}$$
$$= C(e^{j\omega}) \cdot \widehat{\phi}(\omega),$$
Here, we have used the fact that $C(e^{j\omega})$ is $2\pi$-periodic and the non-negativity of the
integrand to interchange the summation and the integral (Fubini). This naturally leads
to the inequality
$$A^2\, \|c\|_{\ell_2}^2 \le \|s\|_{L_2}^2 \le B^2\, \|c\|_{\ell_2}^2,$$
where we are now making use of Parseval's identity for sequences, so that
$$\|c\|_{\ell_2}^2 = \int_{[0, 2\pi]^d} \big| C(e^{j\omega}) \big|^2\, \frac{\mathrm{d}\omega}{(2\pi)^d}.$$
The final step is to show that these bounds are sharp. This can be accomplished through
the choice of some particular (bandlimited) sequence c[·].
Note that the “almost everywhere” part of (6.17) can be dropped when φ ∈ L1 (Rd )
because the Fourier transform of such a function is continuous (Riemann–Lebesgue
lemma).
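As a numerical companion to Theorem 6.2, the following sketch evaluates the periodized spectrum in (6.17) for the degree-1 (triangle) B-spline, assuming the standard formula $|\widehat{\phi}(\omega)|^2 = \mathrm{sinc}^4(\omega/2\pi)$ for this generator; the truncation range of the sum is an arbitrary choice of ours:

```python
import numpy as np

def phi_hat_sq(w):
    """|phi_hat(w)|^2 for the degree-1 (triangle) B-spline: sinc(w/2pi)^4."""
    return np.sinc(w / (2 * np.pi)) ** 4        # np.sinc(x) = sin(pi x)/(pi x)

w = np.linspace(0.0, 2 * np.pi, 1001)
n = np.arange(-200, 201)                        # truncation of the sum over Z
gram = phi_hat_sq(w[:, None] + 2 * np.pi * n[None, :]).sum(axis=1)

print(gram.min(), gram.max())                   # ~1/3 and ~1: bounds A^2, B^2
print(np.allclose(gram, (2 + np.cos(w)) / 3, atol=1e-4))   # closed-form Gram
```

The estimated bounds $A^2 \approx 1/3$ and $B^2 = 1$ confirm that the triangle generator forms a (non-orthonormal) Riesz basis.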
While the result of Theorem 6.2 is restricted to the classical Lp spaces, there is no fun-
damental difficulty in extending it to wider classes of weighted (with negative powers)
$L_p$ spaces by imposing some stricter condition than (6.18) on the decay of $\phi$. For instance, if $\phi$ has exponential decay, then the definition of the function space $V_\phi$ can be
extended to all sequences $c$ that are growing no faster than a polynomial. This happens
to be the appropriate framework for sampling generalized stochastic processes which
do not live in the Lp spaces since they are not decaying at infinity.
Rather than aiming for the highest level of generality right away, we propose to first
examine the 1-D first-order scenario in some detail. First-order differential models are
important theoretically because they go hand-in-hand with the Markov property. In that
respect, they constitute the next level of generalization beyond the Lévy processes.
Mathematically, the situation is still quite comparable to that of the derivative operator
in the sense that it leads to a nice, self-contained construction of (exponential) B-splines
and wavelets. The interesting aspect, though, is that the underlying basis functions are
no longer conventional wavelets that are dilated versions of a single prototype: they
now fall into the lesser-known category of non-stationary 2 wavelets.
2 In the terminology of wavelets, the term “non-stationary” refers to the property that the shape of the wavelet
changes with scale, but not with respect to the location, as the more usual statistical meaning of the term
would suggest.
Figure 6.3 Comparison of basis functions related to the first-order differential operator
Pα = D − αI for α = 0, −1, −2, −4 (dark to light). (a) Green’s functions ρα (r). (b) Exponential
B-splines βα (r). (c) Augmented spline interpolators ϕint (r). (d) Orthonormalized versions of the
exponential spline wavelets ψα (r) = P∗α ϕint (r).
where the embedding Vi ⊇ Vj for i ≤ j is obvious from the (dyadic) hierarchy of spline
knots, so that sj ∈ Vj implies that sj ∈ Vi with an appropriate subset of its coefficients
ai [k] being zero.
We now detail the construction of a wavelet basis at resolution 1 such that W1 =
$\mathrm{span}\{\psi_{1,k}\}_{k \in \mathbb{Z}}$ with $W_1 \perp V_1$ and $V_1 + W_1 = V_0 = \mathrm{span}\{\beta_\alpha(\cdot - k)\}_{k \in \mathbb{Z}}$. The recipe is to
take $\psi_{1,k}(r) = \psi_\alpha(r - 1 - 2k)/\|\psi_\alpha\|_{L_2}$, where $\psi_\alpha$ is the mother wavelet given by
$$\psi_\alpha(r) = \mathrm{P}^{\mathrm{H}}_\alpha\, \varphi_{\mathrm{int},\alpha}(r) \propto \mathrm{P}^{\mathrm{H}}_\alpha\, \beta_\alpha(r).$$
From
the fact that ψ1,k ∈ V0 for all k ∈ Z, one can show that these wavelets span W1 , which
translates into
W1 = v(r) = v1 [k]ψ1,k (r) : v1 ∈ 2 (Z) .
k∈Z
This method of construction extends to the other wavelet subspaces Wi provided that
the interpolating kernel ϕint,α is replaced by its proper counterpart at resolution a = 2i−1
and the sampling grid adjusted accordingly. Ultimately, this results in a wavelet basis of
L2 (R) whose members are all Pα -splines – that is, piecewise-exponential with parameter
α – but not dilates of the same prototype unless α = 0. Otherwise, the corresponding
decomposition is not fundamentally different from a conventional wavelet expansion.
The basis functions are equally well localized and the scheme admits the same type of
fast reversible filterbank algorithm, albeit with scale-dependent filters [KU06].
The procedure of Section 6.3.1 remains applicable for the broad class of spline-
admissible operators (see Definition 6.2) in one or multiple dimensions. The two
ingredients for constructing a generalized B-spline basis are: (1) the knowledge of the
Green’s function ρL of the operator L, and (2) the availability of a discrete approxima-
tion (finite-difference-like) of the operator of the form
Ld s(r) = dL [k]s(r − k) (6.23)
k∈Zd
In light of Theorem 6.2, the latter property requires the existence of the two Riesz
bounds A and B such that
$$0 < A^2 \le \sum_{n \in \mathbb{Z}^d} |\widehat{\beta}_{\mathrm{L}}(\omega + 2\pi n)|^2 = \sum_{n \in \mathbb{Z}^d} \frac{\big| \sum_{k \in \mathbb{Z}^d} d_{\mathrm{L}}[k]\, e^{-j\langle k, \omega\rangle} \big|^2}{\big| \widehat{L}(\omega + 2\pi n) \big|^2} \le B^2. \quad (6.26)$$
so that $\beta_{\mathrm{L}}$ is itself a cardinal L-spline in accordance with Definition 6.4. The bottom
line in Definition 6.8 is that any cardinal L-spline admits a unique representation in the
B-spline basis $\{\beta_{\mathrm{L}}(\cdot - k)\}_{k \in \mathbb{Z}^d}$ as
$$s(r) = \sum_{k \in \mathbb{Z}^d} c[k]\, \beta_{\mathrm{L}}(r - k). \quad (6.28)$$
The foundation of spline theory is that there are two complementary ways of representing
splines using different types of basis functions: Green's functions vs. B-splines. The
first representation follows directly from Definition 6.4 (see also (6.16)) and is given by
$$s(r) = \sum_{k \in \mathbb{Z}^d} a[k]\, \rho_{\mathrm{L}}(r - k) + p_0(r), \quad (6.30)$$
where p0 ∈ NL is a suitable element of the null space of L and where ρL = L−1 δ is the
Green’s function of the operator. The functions ρL (·−k) are non-local and very far from
being orthogonal. In many cases, they are not even part of VL , which raises fundamental
issues concerning the L2 convergence of the infinite sum 4 in (6.30) and the conditions
that must be imposed upon the expansion coefficients a[·]. The second type of B-spline
expansion (6.28) does not have such stability problems. This is the primary reason why
it is favored by practitioners.
4 Without further assumptions on $\rho_{\mathrm{L}}$ and $a$, (6.30) is only valid in the weak sense of distributions.
where the generalized B-spline βL satisfies the conditions in Definition 6.8. The Riesz-
basis property ensures that the representation is stable in the sense that, for all s ∈ VL ,
we have that
$$A\, \|c\|_{\ell_2} \le \|s\|_{L_2} \le B\, \|c\|_{\ell_2}. \quad (6.31)$$
Here, $\|c\|_{\ell_2} = \big( \sum_{k \in \mathbb{Z}^d} |c[k]|^2 \big)^{\frac{1}{2}}$ is the $\ell_2$-norm of the B-spline coefficients $c$. The fact
that the underlying functions are cardinal L-splines is a simple consequence of the atoms
being splines themselves. Moreover, we can easily make the link with (6.30) by using
(6.27), which yields
$$\mathrm{L}s(r) = \sum_{k \in \mathbb{Z}^d} c[k]\, \mathrm{L}\beta_{\mathrm{L}}(r - k) = \sum_{k \in \mathbb{Z}^d} \underbrace{(c * d_{\mathrm{L}})[k]}_{a[k]}\, \delta(r - k).$$
The less obvious aspect, which is implicit in the definition of the B-spline, is the com-
pleteness of the representation in the sense that the B-spline basis spans the space VL
defined by (6.29). We shall establish this by showing that the B-splines are capable of
reproducing ρL as well as any component p0 ∈ NL in the null space of L. The impli-
cation is that any function of the form (6.30) admits a unique expansion in a B-spline
basis. This is also true when the function is not in $L_2(\mathbb{R}^d)$, in which case the B-spline
coefficients $c$ are no longer in $\ell_2(\mathbb{Z}^d)$, due to the discrete–continuous norm equivalence
(6.31).
Applying this reproduction argument to the Green's function results in
$$\rho_{\mathrm{L}}(r) = \sum_{k \in \mathbb{Z}^d} p[k]\, \beta_{\mathrm{L}}(r - k). \quad (6.32)$$
To illustrate the concept, let us get back to our introductory example in Section 6.3.1,
with $\mathrm{L} = \mathrm{P}_\alpha = (\mathrm{D} - \alpha \mathrm{Id})$ where $\mathrm{Re}(\alpha) < 0$. The frequency response of this first-order
operator is
$$\widehat{P}_\alpha(\omega) = j\omega - \alpha,$$
and the corresponding reproduction sequence is $p[k] = F_d^{-1}\big\{ \frac{1}{1 - e^{\alpha} e^{-j\omega}} \big\}[k] = e^{\alpha k}\, \mathbb{1}_+(k)$,
where $F_d^{-1}$ denotes the discrete-domain inverse Fourier transform. 5 The application of
(6.32) then yields the exponential-reproduction formula
$$\mathbb{1}_+(r)\, e^{\alpha r} = \sum_{k=0}^{\infty} e^{\alpha k}\, \beta_\alpha(r - k), \quad (6.33)$$
where $\beta_\alpha$ is the exponential B-spline defined by (6.21). Note that the range of applicability of (6.33) extends to $\mathrm{Re}(\alpha) \le 0$.
It turns out that this reproduction property is induced by the matching null-space
constraint (6.24) that is imposed upon the localization filter. While the reproduction of
exponentials is interesting in its own right, we shall focus here on the important case
of polynomials and provide a detailed Fourier-based analysis. We start by recalling that
the general form of a multidimensional polynomial of total degree N is
qN (r) = an rn ,
|n|≤N
5 Our definition of the inverse discrete Fourier transform in 1-D is $F_d^{-1}\big\{ H(e^{j\omega}) \big\}[k] = \frac{1}{2\pi} \int_{-\pi}^{\pi} H(e^{j\omega})\, e^{j\omega k}\, \mathrm{d}\omega$ with $k \in \mathbb{Z}$.
using the multi-index notation with $n = (n_1, \dots, n_d) \in \mathbb{N}^d$, $r^n = r_1^{n_1} \cdots r_d^{n_d}$, and $|n| = n_1 + \cdots + n_d$. The generalized Fourier transform of $q_N \in \mathcal{S}'(\mathbb{R}^d)$ (see Table 3.3 and
entry $r^n f(r)$ with $f(r) = 1$) is given by
$$\widehat{q}_N(\omega) = (2\pi)^d \sum_{|n| \le N} a_n\, j^{|n|}\, \partial^n \delta(\omega),$$
where $\partial^n \delta$ denotes the $n$th partial derivative of the multidimensional Dirac impulse $\delta$.
Hence, the Fourier multiplier $\widehat{L}$ will annihilate the polynomials of order $N$ if and only if
$\widehat{L}(\omega)\, \partial^n \delta(\omega) = 0$ for all $|n| \le N$. To understand when this condition is met, we expand
$\widehat{L}(\omega)\, \partial^n \delta(\omega)$ in terms of $\partial^k \widehat{L}(0)$, $|k| \le |n|$, by using the general product rule for the
manipulation of Dirac impulses and their derivatives, given by
$$f(r)\, \partial^n \delta(r - r_0) = \sum_{k + l = n} \frac{n!}{k!\, l!}\, (-1)^{|n| + |l|}\, \partial^k f(r_0)\, \partial^l \delta(r - r_0).$$
The latter follows from Leibniz' rule for partially differentiating a product of functions,
$$\partial^n (f \varphi) = \sum_{k + l = n} \frac{n!}{k!\, l!}\, \partial^k f\, \partial^l \varphi,$$
and the adjoint relation $\langle \varphi, f\, \partial^n \delta(\cdot - r_0) \rangle = \langle \partial^{n*}(f \varphi), \delta(\cdot - r_0) \rangle$ with $\partial^{n*} = (-1)^{|n|}\, \partial^n$.
This allows us to conclude that the necessary and sufficient condition for the inclusion
of the polynomials of order $N$ in the null space of $\mathrm{L}$ is
$$\partial^n \widehat{L}(0) = 0, \quad \text{for all } n \in \mathbb{N}^d \text{ with } |n| \le N, \quad (6.34)$$
which is equivalent to $\widehat{L}(\omega) = O(\|\omega\|^{N+1})$ around the origin. Note that this behavior is
prototypical of scale-invariant operators such as fractional derivatives and Laplacians.
The same condition has obviously to be imposed upon the localization filter $\mathrm{L}_d$ in order
for the Fourier transform of the B-spline in (6.25) to be non-singular at the origin. Since
$\widehat{L}_d(\omega)$ is $2\pi$-periodic, we have that
$$\partial^n \widehat{L}_d(2\pi k) = 0, \quad k \in \mathbb{Z}^d, \; n \in \mathbb{N}^d \text{ with } |n| \le N. \quad (6.35)$$
For practical convenience, we shall assume that the B-spline $\beta_{\mathrm{L}}$ is normalized to have a
unit integral 6 so that $\widehat{\beta}_{\mathrm{L}}(0) = 1$. Based on (6.35) and $\widehat{\beta}_{\mathrm{L}}(\omega) = \widehat{L}_d(\omega)/\widehat{L}(\omega)$, we find that
$$\begin{cases} \widehat{\beta}_{\mathrm{L}}(0) = 1 \\ \partial^n \widehat{\beta}_{\mathrm{L}}(2\pi k) = 0, & k \in \mathbb{Z}^d \setminus \{0\}, \; n \in \mathbb{N}^d \text{ with } |n| \le N, \end{cases} \quad (6.36)$$
which are the so-called Strang–Fix conditions of order $N$. Recalling that $j^{|n|}\, \partial^n \widehat{\beta}_{\mathrm{L}}(\omega)$ is
the Fourier transform of $r^n \beta_{\mathrm{L}}(r)$ and that periodization in the signal domain corresponds
to a sampling in the Fourier domain, we finally deduce that
$$\sum_{k \in \mathbb{Z}^d} (r - k)^n\, \beta_{\mathrm{L}}(r - k) = j^{|n|}\, \partial^n \widehat{\beta}_{\mathrm{L}}(0) = C_n, \qquad n \in \mathbb{N}^d \text{ with } 0 < |n| \le N, \quad (6.37)$$
with the implicit assumption that $\beta_{\mathrm{L}}$ has a sufficient order of algebraic decay for the
above sums to be convergent. The special case of (6.37) with $n = 0$ reads
$$\sum_{k \in \mathbb{Z}^d} \beta_{\mathrm{L}}(r - k) = 1 \quad (6.38)$$
and is called the partition of unity. It reflects the fact that $\beta_{\mathrm{L}}$ reproduces the constants.
More generally, (6.37) or (6.36) is equivalent to the existence of sequences $p_n$ such
that
$$r^n = \sum_{k \in \mathbb{Z}^d} p_n[k]\, \beta_{\mathrm{L}}(r - k) \quad \text{for all } |n| \le N, \quad (6.39)$$
from which one deduces that $p_{(1,\dots,1)}[k] = k + C_{(1,\dots,1)}$. The other sequences $p_n$, which
are polynomials in $k$, may be determined in a similar fashion by proceeding recursively.
Another equivalent way of stating the Strang–Fix conditions of order $N$ is that the sums
$$\sum_{k \in \mathbb{Z}^d} k^n\, \beta_{\mathrm{L}}(r - k) = \sum_{l \in \mathbb{Z}^d} j^{|n|}\, \partial_\omega^n \Big[ e^{-j\langle \omega, r\rangle}\, \widehat{\beta}_{\mathrm{L}}(-\omega) \Big]_{\omega = 2\pi l}$$
are polynomials with leading term $r^n$ for all $|n| \le N$. The left-hand-side expression
follows from Poisson's summation formula 7 applied to the function $f(x) = x^n\, \beta_{\mathrm{L}}(r - x)$
with $r$ being considered as a constant shift.
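These reproduction properties are easy to verify numerically. The sketch below checks, for the causal triangle B-spline (order 2, so that $N = 1$), the partition of unity (6.38) and the reproduction of the monomial $r$ as in (6.39); the probe interval and the number of shifts are ad hoc choices of ours:

```python
import numpy as np

def beta1(r):
    """Causal degree-1 B-spline: triangle supported on [0, 2], peak at r = 1."""
    return np.clip(1.0 - np.abs(r - 1.0), 0.0, None)

r = np.linspace(3.0, 7.0, 101)            # probe interval, away from edge effects
k = np.arange(0, 11)[:, None]             # enough shifts to cover the interval

print(np.allclose(beta1(r - k).sum(axis=0), 1.0))           # (6.38): partition of unity
print(np.allclose((k * beta1(r - k)).sum(axis=0), r - 1.0)) # (6.39): reproduces r - 1
```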
Localization
The guiding principle for designing B-splines is to produce basis functions that are
maximally localized on Rd . Ideally, B-splines should have the smallest possible sup-
port, which is the property that makes them so useful in applications. When it is not
possible to construct compactly supported basis functions, the B-spline should at least
be concentrated around the origin and satisfy some decay bound with the tightest pos-
sible constants. The primary types of spatial localization, by order of preference, are:
(1) Compact support: $\beta_{\mathrm{L}}(r) = 0$ for all $r \notin \Omega$, where $\Omega \subset \mathbb{R}^d$ is a bounded set with the
smallest possible Lebesgue measure
(2) Exponential decay: $|\beta_{\mathrm{L}}(r)| \le C \exp(-\alpha |r|)$ with the largest possible $\alpha \in \mathbb{R}^+$
(3) Algebraic decay: $|\beta_{\mathrm{L}}(r)| \le \frac{C}{1 + \|r\|^{\alpha}}$ with the largest possible $\alpha \in \mathbb{R}^+$.
By relying on the classical relations that link spatial decay to the smoothness of the
Fourier transform, one can get a good estimate of spatial decay based on the knowledge
7 The standard form of Poisson's summation formula is $\sum_{k \in \mathbb{Z}^d} f(k) = \sum_{l \in \mathbb{Z}^d} \widehat{f}(2\pi l)$. It is valid for any
Fourier pair $f, \widehat{f} = F\{f\} \in L_1(\mathbb{R}^d)$ with sufficient decay for the two sums to be convergent.
of the Fourier transform $\widehat{\beta}_{\mathrm{L}}(\omega) = \widehat{L}_d(\omega)/\widehat{L}(\omega)$ of the B-spline. Since the localization
filter $\widehat{L}_d(\omega)$ acts by compensating the (potential) singularities of $\widehat{L}(\omega)$, the guiding principle is that the rate of decay is essentially determined by the degree of differentiability
of $\widehat{L}(\omega)$.
Specifically, if $\widehat{\beta}_{\mathrm{L}}(\omega)$ is differentiable up to order $N$, then the B-spline $\beta_{\mathrm{L}}$ is guaranteed to have an algebraic decay of order $N$. To show this, we consider the Fourier-transform pair $r^n \beta_{\mathrm{L}}(r) \leftrightarrow j^{|n|}\, \partial^n \widehat{\beta}_{\mathrm{L}}$ subject to the constraint that $\partial^n \widehat{\beta}_{\mathrm{L}} \in L_1(\mathbb{R}^d)$ for all
$|n| < N$. From the definition of the inverse Fourier integral, it immediately follows that
$$\big| r^n\, \beta_{\mathrm{L}}(r) \big| \le \frac{1}{(2\pi)^d}\, \big\| \partial^n \widehat{\beta}_{\mathrm{L}} \big\|_{L_1},$$
which, when properly combined over all multi-indices $|n| < N$, yields an algebraic
decay estimate with $\alpha = N$. By pushing the argument to the limit, we see that exponential
decay (which is faster than any order of algebraic decay) requires that $\widehat{\beta}_{\mathrm{L}} \in C^{\infty}(\mathbb{R}^d)$
(infinite order of differentiability), which is only possible if $\widehat{L}(\omega) \in C^{\infty}(\mathbb{R}^d)$ as well.
The ultimate limit in Fourier-domain regularity is when $\widehat{\beta}_{\mathrm{L}}$ has an analytic extension
that is an entire function. 8 In fact, by the Paley–Wiener theorem (Theorem 6.3 below),
one achieves compact support of $\beta_{\mathrm{L}}$ if and only if $\widehat{\beta}_{\mathrm{L}}(\zeta)$ is an entire function of exponential type.
nential type. To explain this concept, we focus on the 1-D case where the B-spline βL is
supported in the finite interval [−A, +A]. We then consider the holomorphic Fourier (or
Fourier–Laplace) transform of the B-spline, given by
$$\widehat{\beta}_{\mathrm{L}}(\zeta) = \int_{-A}^{+A} \beta_{\mathrm{L}}(r)\, e^{-\zeta r}\, \mathrm{d}r \quad (6.40)$$
and bound its modulus as
$$\big| \widehat{\beta}_{\mathrm{L}}(\zeta) \big| \le e^{A |\zeta|} \int_{-A}^{+A} |\beta_{\mathrm{L}}(r)|\, \mathrm{d}r
\le e^{A |\zeta|}\, \sqrt{\int_{-A}^{+A} 1\, \mathrm{d}r}\; \sqrt{\int_{-A}^{+A} |\beta_{\mathrm{L}}(r)|^2\, \mathrm{d}r}
= e^{A |\zeta|}\, \sqrt{2A}\; \|\beta_{\mathrm{L}}\|_{L_2},$$
where we have applied the Cauchy–Schwarz inequality to derive the second inequality.
Since $e^{-\zeta r}$ for $r$ fixed is itself an entire function and (6.40) is convergent over the whole
complex plane, the conclusion is that $\widehat{\beta}_{\mathrm{L}}(\zeta)$ is entire as well, in addition to being a
function of exponential type $A$ as indicated by the bound. The whole strength of the
Paley–Wiener theorem is that the implication also works the other way around.
8 An entire function is a function that is analytic over the whole complex plane C.
THEOREM 6.3 (Paley–Wiener) A function $f \in L_2(\mathbb{R})$ is supported in $[-A, +A]$ if and only if its Fourier–Laplace transform
$$F(\zeta) = \int_{\mathbb{R}} f(r)\, e^{-\zeta r}\, \mathrm{d}r$$
is an entire function of exponential type $A$, meaning that there exists a constant $C$ such
that
$$|F(\zeta)| \le C\, e^{A |\zeta|}$$
for all $\zeta \in \mathbb{C}$.
The result implies that one can deduce the support of f from its Laplace transform. We
can also easily extend the result to the case where the support is not centered around
the origin by applying the Paley–Wiener theorem to the autocorrelation function $(f * f^{\vee})(r)$.
The latter is supported in the interval $[-2A, 2A]$, which is twice the size of the support of
$f$, irrespective of its center. This suggests the following expression for the determination
of the support of a B-spline:
$$\mathrm{support}(\beta_{\mathrm{L}}) = \limsup_{R \to \infty} \frac{\log\, \sup_{|\zeta| \le R} \big| \widehat{\beta}_{\mathrm{L}}(\zeta)\, \widehat{\beta}_{\mathrm{L}}(-\zeta) \big|}{R}. \quad (6.41)$$
It returns twice the exponential type of the recentered B-spline, which gives
$\mathrm{support}(\beta_{\mathrm{L}}) = 2A$. While this formula is only strictly valid when $\widehat{\beta}_{\mathrm{L}}(\zeta)$ is an entire
function, it can be used otherwise as an operational measure of localization when the
underlying B-spline is not compactly supported. Interestingly, (6.41) provides a measure that is additive with respect to convolution and proportional to the order $\gamma$. For
instance, the support of an (exponential) B-spline associated with an ordinary differential operator of order $N$ is precisely $N$, as a consequence of the factorization property of
such B-splines (see Sections 6.4.2 and 6.4.4).
To get some insight into (6.41), let us consider the case of the polynomial B-spline of
order one (or degree zero) with $\beta_{\mathrm{D}}(r) = \mathbb{1}_{[0,1)}(r)$ and Laplace transform
$$\widehat{\beta}_{\mathrm{D}}(\zeta) = \frac{1 - e^{-\zeta}}{\zeta}.$$
The required product in (6.41) is
$$\widehat{\beta}_{\mathrm{D}}(\zeta)\, \widehat{\beta}_{\mathrm{D}}(-\zeta) = \frac{e^{\zeta} - 2 + e^{-\zeta}}{\zeta^2},$$
which is analytic over the whole complex plane because of the pole–zero cancellation
at $\zeta = 0$. For $R$ sufficiently large, we clearly have that
$$\max_{|\zeta| \le R} \big| \widehat{\beta}_{\mathrm{D}}(\zeta)\, \widehat{\beta}_{\mathrm{D}}(-\zeta) \big| = \frac{e^{R} - 2 + e^{-R}}{R^2} \to \frac{e^{R}}{R^2}.$$
By plugging the above expression into (6.41), we finally get
$$\mathrm{support}(\beta_{\mathrm{D}}) = \limsup_{R \to \infty} \frac{R - 2 \log R}{R} = 1,$$
which is the desired result. While the above calculation may look like overkill for the
determination of the already-known support of $\beta_{\mathrm{D}}$, it becomes quite handy for making predictions for higher-order operators. To illustrate the point, we now consider the
B-spline of order $\gamma$ associated with the (possibly fractional) derivative operator $\mathrm{D}^{\gamma}$,
whose Fourier–Laplace transform is
$$\widehat{\beta}_{\mathrm{D}^{\gamma}}(\zeta) = \left( \frac{1 - e^{-\zeta}}{\zeta} \right)^{\gamma}.$$
We can then essentially replicate the previous manipulation while moving the order out
of the logarithm to deduce that
$$\mathrm{support}(\beta_{\mathrm{D}^{\gamma}}) = \limsup_{R \to \infty} \frac{\gamma R - 2\gamma \log R}{R} = \gamma.$$
This shows that the "support" of the B-spline is equal to its order, with the caveat that
the underlying Fourier–Laplace transform $\widehat{\beta}_{\mathrm{D}^{\gamma}}(\zeta)$ is only analytic (and entire) when the
order $\gamma$ is a positive integer. This points to the fundamental limitation that a B-spline
associated with a fractional operator – that is, when $\widehat{L}(\zeta)$ is not an entire function –
cannot be compactly supported.
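The operational measure (6.41) can also be evaluated numerically. The sketch below does so for $\beta_{\mathrm{D}}$; by the maximum-modulus principle, the supremum over the disk $|\zeta| \le R$ is attained on the circle $|\zeta| = R$, which is where we sample. The convergence to 1 is slow, in agreement with the $-2\log R/R$ correction computed above:

```python
import numpy as np

def beta_hat(z):
    """Fourier-Laplace transform of beta_D: (1 - e^{-z}) / z (z != 0 here)."""
    return (1 - np.exp(-z)) / z

# estimate support(beta_D) via (6.41), sampling the circle |zeta| = R
for R in (20.0, 80.0, 320.0):
    zeta = R * np.exp(1j * np.linspace(0.0, 2.0 * np.pi, 4096))
    s = np.abs(beta_hat(zeta) * beta_hat(-zeta)).max()
    print(R, np.log(s) / R)      # ~0.70, ~0.89, ~0.96 -> tends to 1
```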
Smoothness
The smoothness of a B-spline refers to its degree of continuity and/or differentiability.
Since a B-spline is a linear combination of shifted Green’s functions, its smoothness is
the same as that of ρL .
Smoothness descriptors come in two flavors – Hölder continuity vs. Sobolev differ-
entiability – depending on whether the analysis is done in the signal or Fourier domain.
Due to the duality between Fourier decay and order of differentiation, the smoothness
of $\beta_{\mathrm{L}}$ may be predicted from the growth of $\widehat{L}(\omega)$ at infinity, without need for the explicit calculation of $\rho_{\mathrm{L}}$. To that end, one considers the Sobolev spaces $W_2^{\alpha}$, which are
defined as
$$W_2^{\alpha}(\mathbb{R}^d) = \left\{ f : \int_{\mathbb{R}^d} \big(1 + \|\omega\|^2\big)^{\alpha}\, |\widehat{f}(\omega)|^2\, \mathrm{d}\omega < \infty \right\}.$$
it is sufficient to check for the second. To that end, we recall the stability conditions
$\beta_{\mathrm{L}} \in L_1(\mathbb{R}^d)$ and $d_{\mathrm{L}} \in \ell_1(\mathbb{Z}^d)$, which are implicit in the B-spline construction (6.25).
These, together with the order condition (6.14), imply that $|\widehat{L}_d(\omega)| \le \|d_{\mathrm{L}}\|_{\ell_1}$ and
$$\big| \widehat{\beta}_{\mathrm{L}}(\omega) \big| = \frac{|\widehat{L}_d(\omega)|}{|\widehat{L}(\omega)|} \le \min\left( \|\beta_{\mathrm{L}}\|_{L_1},\; \frac{\|d_{\mathrm{L}}\|_{\ell_1}}{C\, \|\omega\|^{\gamma}} \right).$$
This latter bound allows us to control the $L_2$ norm of $(-\Delta)^{\alpha/2} \beta_{\mathrm{L}}$ by splitting the spectral
range of integration as
$$\big\| (-\Delta)^{\alpha/2} \beta_{\mathrm{L}} \big\|_{L_2}^2 = \int_{\mathbb{R}^d} \|\omega\|^{2\alpha}\, |\widehat{\beta}_{\mathrm{L}}(\omega)|^2\, \frac{\mathrm{d}\omega}{(2\pi)^d}
= \underbrace{\int_{\|\omega\| < R} \|\omega\|^{2\alpha}\, |\widehat{\beta}_{\mathrm{L}}(\omega)|^2\, \frac{\mathrm{d}\omega}{(2\pi)^d}}_{I_1} + \underbrace{\int_{\|\omega\| > R} \|\omega\|^{2\alpha}\, |\widehat{\beta}_{\mathrm{L}}(\omega)|^2\, \frac{\mathrm{d}\omega}{(2\pi)^d}}_{I_2}$$
$$\le \|\beta_{\mathrm{L}}\|_{L_1}^2 \int_{\|\omega\| < R} \|\omega\|^{2\alpha}\, \frac{\mathrm{d}\omega}{(2\pi)^d} + \frac{\|d_{\mathrm{L}}\|_{\ell_1}^2}{C^2} \int_{\|\omega\| > R} \|\omega\|^{2\alpha - 2\gamma}\, \frac{\mathrm{d}\omega}{(2\pi)^d}.$$
The first integral $I_1$ is finite due to the boundedness of the domain. As for $I_2$, it is
convergent provided that the rate of decay of the integrand is faster than $d$, which corresponds to the critical Sobolev exponent $\alpha = \gamma - d/2$.
As the final step of the analysis, we invoke the Sobolev embedding theorems to infer
that $\beta_{\mathrm{L}}$ is Hölder-continuous of order $r$ with $r < \alpha - \frac{d}{2} = (\gamma - d)$, which essentially
means that $\beta_{\mathrm{L}}$ is differentiable up to order $r$ with bounded derivatives. One should keep
in mind, however, that the latter estimate is a lower bound on Hölder continuity, unlike
the Sobolev exponent in Proposition 6.4, which is sharp. For instance, in the case of the
1-D Fourier multiplier $(j\omega)^{\gamma}$, we find that the corresponding (fractional) B-spline – if it
exists – should have a Sobolev smoothness $(\gamma - \frac{1}{2})$ and a Hölder regularity $r < (\gamma - 1)$.
Note that the latter is arbitrarily close (but not equal) to the true estimate $r_0 = (\gamma - 1)$
that is readily deduced from the Green's function (6.15).
for all $\omega \in [0, 2\pi]^d$. When $\mathrm{L}_1 = \mathrm{L}_2^{\gamma}$ with $\gamma \ge 0$, the auxiliary condition is automatically satisfied.
Proof Since $\beta_{\mathrm{L}_1}, \beta_{\mathrm{L}_2} \in L_1(\mathbb{R}^d)$, the same holds true for $\beta_{\mathrm{L}}$ (by Young's inequality).
From the Fourier-domain definition (6.25) of the B-splines, we have
$$\widehat{\beta}_{\mathrm{L}_i}(\omega) = \frac{\sum_{k \in \mathbb{Z}^d} d_{\mathrm{L}_i}[k]\, e^{-j\langle k, \omega\rangle}}{\widehat{L}_i(\omega)} = \frac{\widehat{L}_{d,i}(\omega)}{\widehat{L}_i(\omega)},$$
which implies that
$$\beta_{\mathrm{L}} = F^{-1}\left\{ \frac{\widehat{L}_{d,1}(\omega)\, \widehat{L}_{d,2}(\omega)}{\widehat{L}_1(\omega)\, \widehat{L}_2(\omega)} \right\} = F^{-1}\left\{ \frac{\widehat{L}_d(\omega)}{\widehat{L}(\omega)} \right\} = \mathrm{L}_d\, \mathrm{L}^{-1} \delta.$$
As for the upper Riesz bound, we observe that
$$\sum_{n \in \mathbb{Z}^d} |\widehat{\beta}_{\mathrm{L}}(\omega + 2\pi n)|^2 = \sum_{n \in \mathbb{Z}^d} |\widehat{\beta}_{\mathrm{L}_1}(\omega + 2\pi n)|^2\, |\widehat{\beta}_{\mathrm{L}_2}(\omega + 2\pi n)|^2
\le \left( \sum_{n \in \mathbb{Z}^d} |\widehat{\beta}_{\mathrm{L}_1}(\omega + 2\pi n)|\, |\widehat{\beta}_{\mathrm{L}_2}(\omega + 2\pi n)| \right)^2$$
$$\le \left( \sum_{n \in \mathbb{Z}^d} |\widehat{\beta}_{\mathrm{L}_1}(\omega + 2\pi n)|^2 \right) \left( \sum_{n \in \mathbb{Z}^d} |\widehat{\beta}_{\mathrm{L}_2}(\omega + 2\pi n)|^2 \right) \le B_1^2\, B_2^2 < +\infty,$$
where the second line follows from the norm inequality $\|a\|_{\ell_2} \le \|a\|_{\ell_1}$ and the third from
Cauchy–Schwarz; $B_1$ and $B_2$ are the upper Riesz bounds of $\beta_{\mathrm{L}_1}$ and $\beta_{\mathrm{L}_2}$, respectively.
The additional condition in the proposition takes care of the lower Riesz bound.
where $\beta_\alpha = \beta_{\mathrm{P}_\alpha}$ is the first-order exponential spline defined by (6.21). The Fourier-domain counterpart of (6.42) is
$$\widehat{\beta}_{\boldsymbol{\alpha}}(\omega) = \prod_{n=1}^{N} \frac{1 - e^{\alpha_n}\, e^{-j\omega}}{j\omega - \alpha_n}, \quad (6.43)$$
• For $\boldsymbol{\alpha} = (0, \dots, 0)$, one recovers Schoenberg's classical polynomial B-splines of degree $(N - 1)$ [Sch46, Sch73a], as expressed by the notational equivalence
$$\beta_+^{n}(r) = \beta_{\mathrm{D}^{n+1}}(r) = \beta_{\underbrace{(0, \dots, 0)}_{n+1}}(r).$$
• Complex conjugation:
$$\overline{\beta_{\boldsymbol{\alpha}}(r)} = \beta_{\overline{\boldsymbol{\alpha}}}(r)$$
• Modulation by parameter shifting:
$$e^{j\omega_0 r}\, \beta_{\boldsymbol{\alpha}}(r) = \beta_{\boldsymbol{\alpha} + \mathbf{j}\omega_0}(r),$$
with the convention that $\mathbf{j} = (j, \dots, j)$.
Finally, we point out that exponential B-splines can be computed explicitly on a case-
by-case basis using the mathematical software described in [Uns05, Appendix A].
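In the absence of such dedicated software, a single-pole exponential B-spline can also be evaluated by brute-force numerical inversion of (6.43) with $N = 1$; the quadrature grid below is an arbitrary choice of ours, and the closed form $\beta_\alpha(r) = e^{\alpha r}$ on $[0, 1)$ serves as the reference:

```python
import numpy as np

alpha = -1.0
w = np.linspace(-2000.0, 2000.0, 400001)       # quadrature grid (ad hoc)
dw = w[1] - w[0]
# single-pole case of (6.43): (1 - e^{alpha} e^{-jw}) / (jw - alpha)
beta_hat = (1 - np.exp(alpha) * np.exp(-1j * w)) / (1j * w - alpha)

def beta_alpha(r):
    """Brute-force inverse Fourier integral at the point r (Riemann sum)."""
    return ((beta_hat * np.exp(1j * w * r)).sum() * dw / (2 * np.pi)).real

for r in (0.25, 0.5, 0.75):
    print(beta_alpha(r), np.exp(alpha * r))    # ~ e^{alpha r}, up to truncation error
```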
What is remarkable about this construction is the way in which the classical B-spline
formulas of Section 1.3.2 carry over to the fractional case almost literally, by merely
replacing n by α. This is especially striking when we compare (6.47) to (1.11), as well
as the expanded versions of these formulas given below, which follow from the (gener-
alized) binomial expansion of (1 − e−jω )α+1 .
Likewise, it is possible to construct the $(\alpha, \tau)$ extension of these B-splines. They are
associated with the operators $\mathrm{L} = \partial_\tau^{\alpha+1} \longleftrightarrow (j\omega)^{\frac{\alpha+1}{2} + \tau}\, (-j\omega)^{\frac{\alpha+1}{2} - \tau}$ with $\tau \in \mathbb{R}$ [BU03].
This family covers the entire class of translation- and scale-invariant operators in 1-D
(see Proposition 5.6).
The fractional B-splines share virtually all the properties of the classical B-splines,
including the two-scale relation, and can also be used to define fractional wavelet bases
with an order γ = α + 1 that varies continuously. They only lack positivity and compact
support. Their most notable properties are summarized below.
• Generalization. For α integer, they are equivalent to the classical polynomial splines.
The fractional B-splines interpolate the polynomial ones in very much the same way
as the gamma function interpolates the factorials.
• Stability. All brands of fractional B-splines satisfy the Riesz-basis condition in Theo-
rem 6.2.
• Regularity. The fractional splines are α-Hölder continuous; their critical Sobolev
exponent (degree of differentiability in the L2 sense) is α + 1/2 (see Proposition 6.4).
• Polynomial reproduction. The fractional B-splines reproduce the polynomials of
degree $N = \lceil \alpha \rceil$ that are in the null space of the operator $\mathrm{D}^{\alpha+1}$ (see Section 6.2.1).
• Decay. The fractional B-splines decay at least like |r|−α−2 ; the causal ones are com-
pactly supported for α integer.
• Order of approximation. The fractional splines have the non-integer order of approxi-
mation α + 1, a property that is rather unusual in approximation theory.
6.4 Generalized B-spline basis 141
• Fractional derivatives. Simple formulas are available for obtaining the fractional de-
rivatives of B-splines. In addition, the corresponding fractional spline wavelets be-
have essentially like fractional-derivative operators.
We encourage the reader who finds the present list incomplete to work on expanding it.
The good news for the present study is that the polyharmonic B-splines are particularly
relevant for image-processing applications because they are associated with the class
of operators that are scale- and rotation-invariant. They naturally come into play when
considering isotropic fractal-type random fields.
The principal message of this section is that B-splines – no matter the type – are
localized functions with an equivalent width that increases in proportion to the order.
In general, the fractional brands and the non-separable multidimensional ones are not
compactly supported. The important issue of localization and decay is not yet fully
resolved in higher dimensions. Also, since Ld s = βL ∗ Ls, it is clear that the search for a
“good” B-spline βL is intrinsically related to the problem of finding an accurate nume-
rical approximation Ld of the differential operator L. Looking at the discretization issue
from the B-spline perspective leads to new insights and sometimes to non-conventional
solutions. For instance, in the case of the Laplacian $\mathrm{L} = \Delta$, the continuous-domain
localization requirement points to the choice of the 2-D discrete operator $\Delta_d$ described
by the $3 \times 3$ filter mask
$$\text{Isotropic discrete Laplacian:} \quad \frac{1}{6} \begin{pmatrix} -1 & -4 & -1 \\ -4 & 20 & -4 \\ -1 & -4 & -1 \end{pmatrix},$$
which is not the standard version used in numerical analysis. This particular set of
weights produces a much nicer, bell-shaped polyharmonic B-spline than the conven-
tional finite-difference mask, which induces significant directional artifacts, especially
when one starts iterating the operator [VDVBU05].
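A quick sanity check of this mask (a sketch assuming SciPy is available) is that it sums to zero, hence annihilates constants, and that it reproduces the (negated) Laplacian of a quadratic exactly:

```python
import numpy as np
from scipy.signal import convolve2d

# the isotropic 3x3 mask quoted above (a discretization of -Laplacian)
mask = np.array([[-1.0, -4.0, -1.0],
                 [-4.0, 20.0, -4.0],
                 [-1.0, -4.0, -1.0]]) / 6.0

x, y = np.meshgrid(np.arange(16.0), np.arange(16.0), indexing="ij")
f = x ** 2 + y ** 2                      # Laplacian of f is 4 everywhere

out = convolve2d(f, mask, mode="valid")
print(mask.sum())                        # 0: the mask annihilates constants
print(np.unique(np.round(out, 12)))      # [-4.]: exact on quadratics
```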
In direct analogy with the first-order scenario in Section 6.3.3, we shall now take
advantage of the general B-spline formalism to construct a wavelet basis that is
matched to some generic operator L.
where $\mathrm{L}_{d,i}$ is the discretized version of $\mathrm{L}$ on the grid $\mathbf{D}^i \mathbb{Z}^d$. The Fourier-domain counterpart of this equation is
$$\widehat{\beta}_{\mathrm{L},i}(\omega) = \frac{\sum_{k \in \mathbb{Z}^d} d_i[k]\, e^{-j\langle \omega, \mathbf{D}^i k\rangle}}{\widehat{L}(\omega)}. \quad (6.48)$$
The implicit requirement for the multiresolution decomposition scheme to work is that
βL,i generates a Riesz basis. This needs to be asserted on a case-by-case basis.
A particularly favorable situation occurs when the operator $\mathrm{L}$ is scale-invariant with
$\widehat{L}(a\omega) = |a|^{\gamma}\, \widehat{L}(\omega)$. Let $i' > i$ be two multiresolution levels of the pyramid such that
$\mathbf{D}^{i' - i} = m \mathbf{I}$, where $m$ is a proportionality constant. It is then possible to relate the
B-spline at resolution $i'$ to the one at the finer level $i$ via the simple dilation relation
$$\beta_{\mathrm{L},i'}(r) \propto \beta_{\mathrm{L},i}(r/m).$$
This is shown by considering the Fourier transform of $\beta_{\mathrm{L},i}(r/m)$, which is written as
$$|m|^d\, \widehat{\beta}_{\mathrm{L},i}(m\omega) = |m|^d\, \frac{\sum_{k \in \mathbb{Z}^d} d_i[k]\, e^{-j\langle \omega, m \mathbf{D}^i k\rangle}}{\widehat{L}(m\omega)}
= |m|^{d - \gamma}\, \frac{\sum_{k \in \mathbb{Z}^d} d_i[k]\, e^{-j\langle \omega, \mathbf{D}^{i' - i} \mathbf{D}^i k\rangle}}{\widehat{L}(\omega)}
= |m|^{d - \gamma}\, \frac{\sum_{k \in \mathbb{Z}^d} d_i[k]\, e^{-j\langle \omega, \mathbf{D}^{i'} k\rangle}}{\widehat{L}(\omega)}$$
and found to be compatible with the form of $\widehat{\beta}_{\mathrm{L},i'}(\omega)$ given by (6.48) by taking $d_{i'}[k] \propto d_i[k]$.
di [k]. The prototypical scenario is the dyadic configuration D = 2I for which the
B-splines at level i are all constructed through the dilation of the single prototype
βL = βL,0 , subject to the scale-invariance constraint on L. This happens, for instance, for
the classical polynomial splines which are associated with the Fourier multipliers (jω)N .
A crucial ingredient for the fast wavelet-transform algorithm is the two-scale relation
that links the B-spline basis functions at two successive levels of resolution. Specifically,
we have that
$$\beta_{\mathrm{L},i+1}(r) = \sum_{k \in \mathbb{Z}^d} h_i[k]\, \beta_{\mathrm{L},i}(r - \mathbf{D}^i k),$$
where the sequence $h_i$ specifies the scale-dependent refinement filter. The frequency
response of $h_i$ is obtained by taking the ratio of the Fourier transforms of the
corresponding B-splines as
$$\widehat{h}_i(\omega) = \frac{\widehat{\beta}_{\mathrm{L},i+1}(\omega)}{\widehat{\beta}_{\mathrm{L},i}(\omega)} \quad (6.49)$$
$$= \frac{\sum_{k \in \mathbb{Z}^d} d_{i+1}[k]\, e^{-j\langle \omega, \mathbf{D}^{i+1} k\rangle}}{\sum_{k \in \mathbb{Z}^d} d_i[k]\, e^{-j\langle \omega, \mathbf{D}^i k\rangle}}, \quad (6.50)$$
which is $2\pi (\mathbf{D}^T)^{-i}$-periodic and hence defines a valid digital filter with respect to the
spatial grid $\mathbf{D}^i \mathbb{Z}^d$.
To illustrate those relations, we return to our introductory example in Section 6.1:
the Haar wavelet transform, which is associated with the Fourier multipliers $j\omega$ (derivative) and $(1 - e^{-j\omega})$ (finite-difference operator). The dilation matrix is $\mathbf{D} = 2$ and the
localization filter is the same at all levels because the underlying derivative operator is
scale-invariant. By plugging those entities into (6.48), we obtain the Fourier transform
of the corresponding B-spline at resolution $i$ as
$$\widehat{\beta}_{\mathrm{D},i}(\omega) = 2^{-i/2}\, \frac{1 - e^{-j 2^i \omega}}{j\omega},$$
where the normalization by $2^{-i/2}$ is included to standardize the norm of the B-splines.
The application of (6.49) then yields
$$\widehat{h}_i(\omega) = \frac{1}{\sqrt{2}}\, \frac{1 - e^{-j 2^{i+1} \omega}}{1 - e^{-j 2^i \omega}} = \frac{1}{\sqrt{2}}\, \big(1 + e^{-j 2^i \omega}\big),$$
which, up to the normalization by $\sqrt{2}$, is the expected refinement filter with coefficients
proportional to $(1, 1)$ that are independent of the scale.
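The Haar instance of (6.49) can be validated numerically; in the sketch below, the frequency grid is our own choice, picked to avoid the removable singularities of the quotient:

```python
import numpy as np

def beta_hat(w, i):
    """Haar B-spline at scale i in the Fourier domain (normalized)."""
    return 2.0 ** (-i / 2) * (1 - np.exp(-1j * (2 ** i) * w)) / (1j * w)

i = 2
w = np.linspace(0.1, 1.4, 100)        # avoids zeros of the denominator
ratio = beta_hat(w, i + 1) / beta_hat(w, i)
expected = (1 + np.exp(-1j * (2 ** i) * w)) / np.sqrt(2)
print(np.allclose(ratio, expected))   # True: the Haar case of (6.49)
```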
is the Fourier transform of the dual B-spline $\tilde{\beta}_{\mathrm{L}}$. The above factorization implies that
$\varphi_{\mathrm{int}}(r) = (\tilde{\beta}_{\mathrm{L}} * \beta_{\mathrm{L}}^{\vee})(r)$, which ensures the biorthonormality $\langle \tilde{\beta}_{\mathrm{L}}(\cdot - k), \beta_{\mathrm{L}}(\cdot - k') \rangle_{L_2} = \delta[k - k'] = \varphi_{\mathrm{int}}(k - k')$ of the basis functions.
which can be used to show that $\mathrm{L}^{\mathrm{H}} \varphi_{\mathrm{int},i}(\cdot - \mathbf{D}^i k) \propto \mathrm{L}^{\mathrm{H}}_{d,i}\, \tilde{\beta}_{\mathrm{L},i}(\cdot - \mathbf{D}^i k) \in V_i$ for any
$k \in \mathbb{Z}^d$.
While we have seen that this scheme produces an orthonormal basis for the first-order operator $\mathrm{P}_\alpha$ in Section 6.3.3, the general procedure only guarantees semi-orthogonality. More precisely, it ensures the orthogonality between the wavelet sub-
spaces Wi . If necessary, one can always fix the intra-scale orthogonality a posteriori by
forming appropriate linear combinations of wavelets at a given resolution. The resulting
orthogonal wavelets will still be L-admissible in the sense of Definition 6.7. However,
for d > 1, intra-scale orthogonalization is likely to spoil the simple, convenient struc-
ture of the above construction, which uses a single generator per scale irrespective of
the number of dimensions. Indeed, the examples of multidimensional orthogonal wave-
let transforms that can be found in the literature – either separable or non-separable –
systematically involve M = (det(D) − 1) distinct wavelet generators per scale. More-
over, unlike the present operator-like wavelets, they generally do not admit an explicit
analytical description.
6.6 Bibliographical notes
Section 6.1
Alfréd Haar constructed the orthogonal Haar system as part of his Ph.D. thesis, which
he defended in 1909 under the supervision of David Hilbert [Haa10]. From then on,
the Haar system remained relatively unnoticed until it was revitalized by the discov-
ery of wavelets nearly one century later. Stéphane Mallat set the foundation of the
multiresolution theory of wavelets in [Mal89] with the help of Yves Meyer, while
Ingrid Daubechies constructed the first orthogonal family of compactly supported wave-
lets [Dau88]. Many of the early constructions of wavelets are based on splines [Mal89,
CW91, UAE92, UAE93]. The connection with splines is actually quite fundamental in
the sense that all multiresolution wavelet bases, including the non-spline brands such
as Daubechies’, necessarily include a B-spline as a convolution factor – the latter is
responsible for their primary mathematical properties such as vanishing moments, dif-
ferentiability, and order of approximation [UB03]. Further information on wavelets can
be found in several textbooks [Dau92, Mey90, Mal09].
Section 6.2
Splines constitute a beautiful topic of investigation in their own right, with hundreds
of papers specifically devoted to them. The founding father of the field is Schoenberg,
who, during wartime, was asked to develop a computational solution for constructing
an analytic function that fits a given set of equidistant noisy data points [Sch88]. He
came up with the concept of spline interpolation and proved that polynomial spline
functions have a unique expansion in terms of B-splines [Sch46]. While splines can also
be specified for non-uniform grids and extended in a variety of ways [dB78,Sch81a], the
cardinal setting is especially pleasing because it lends itself to systematic treatment with
the aid of the Fourier transform [Sch73a]. The relation between splines and differential
operators was recognized early on and led to the generalization known as L-splines
[SV67].
The classical reference on partial differential operators and Fourier multipliers is
[Hör80]. A central result of the theory is the Malgrange–Ehrenpreis theorem [Mal56,
Ehr54], and its extension stating that the convolution with a compactly supported gener-
alized function is invertible [Hör05].
The concept of a Riesz basis is standard in functional analysis and approximation
theory [Chr03]. The special case where the basis functions are integer translates of a
single generator is treated in [AU94]. See also [Uns00] for a review of such representa-
tions in the context of sampling theory.
Section 6.3
The first-order illustrative example is borrowed from [UB05a, Figure 1] for the
construction of the exponential B-spline, and from [KU06, Figure 1] for the wave-
let part of the story.
Section 6.4
The 1-D theory of cardinal L-splines for ordinary differential operators with constant
coefficients is due to Micchelli [Mic76]. In the present context, we are especially con-
cerned with ordinary differential equations, which go hand-in-hand with the extended
family of cardinal exponential splines [Uns05]. The properties of the relevant B-splines
are investigated in full detail in [UB05a], which constitutes the ground material for
Section 6.4. A key property of B-splines is their ability to reproduce polynomials. It is
ensured by the Strang–Fix conditions (6.37) which play a central role in approximation
theory [dB87, SF71]. While there is no fundamental difficulty in specifying cardinal-
spline interpolators in multiple dimensions, it is much harder to construct compactly
supported B-splines, except for the special cases of the box splines [dBH82, dBHR93]
and exponential box splines [Ron88]. For elliptic operators such as the Laplacian, it is
possible to specify exponentially decaying B-splines, with the caveat that the construc-
tion is not unique [MN90b,Rab92a,Rab92b]. This calls for some criterion to identify the
most localized solution [VDVBU05]. B-splines, albeit non-compactly supported ones,
can also be specified for fractional operators [UB07]. This line of research was initiated
by Unser and Blu with the construction of the fractional B-splines [UB00]. As sugges-
ted by the name, the (Gaussian) stochastic counterparts of these B-splines are Mandel-
brot’s fractional Brownian motions [MVN68], as we shall see in Chapters 7 and 8. The
association is essentially the same as the connection between the B-spline of degree
zero (rect) and Brownian motion, or, by extension, the whole family of Lévy processes
(see Section 1.3).
Section 6.5
de Boor et al. were among the first to extend the notion of multiresolution analysis
beyond the idea of dilation and to propose a general framework for constructing “non-
stationary” wavelets [dBDR93]. Khalidov and Unser proposed a systematic method for
constructing wavelet-like basis functions based on exponential splines and proved that
these wavelets behave like differential operators [KU06]. The material in Section 6.5
is an extension of those ideas to the case of a generic Fourier-multiplier operator in
multiple dimensions; the full technical details can be found in [KUW13]. Operator-like
wavelets have also been specified within the framework of conventional multiresolution
analysis; in particular, for the Laplacian and its iterates [VDVBU05, TVDVU09] and
for the various brands of 1-D fractional derivatives [VDVFHUB10], which have the
common property of being scale-invariant. Finally, we mention that each exponential-
spline wavelet has a compactly supported Daubechies counterpart that is orthogonal
and operator-like in the sense of having the same vanishing exponential moments
[VBU07].
7 Sparse stochastic processes
Having dealt with the technicalities of defining acceptable inverse operators, we can
now apply the framework to characterize – and also generate – relevant families of
sparse processes. As in the previous chapters, we start with a simple example to expose
the key ideas. Then, in Section 7.2, we develop the generalized version of the innovation
model that covers the complete spectrum of Gaussian and sparse stochastic processes.
We characterize the solution(s) in full generality, while pinpointing the conditions under
which the so-defined processes are stationary or self-similar. In Section 7.3, we provide
a complete description of the stationary processes, including the CARMA (continuous-
time autoregressive moving average) family which constitutes the non-Gaussian exten-
sion of the classical ARMA processes. In Section 7.4, we turn our attention to non-
stationary signals and characterize the important class of Lévy-type processes that are
defined by unstable linear SDEs. Finally, in Section 7.5, we investigate fractal-type pro-
cesses (not necessarily Gaussian) that are solutions of fractional, scale-invariant SDEs.
7.1 Introductory example: non-Gaussian AR(1) processes
Consider the first-order stochastic differential equation
$$\mathrm{P}_\alpha s_\alpha = w,$$
where $w$ is a white Lévy noise excitation. We have already seen that the solution for
$\mathrm{Re}(\alpha) < 0$ is given by $s_\alpha = \mathrm{P}_\alpha^{-1} w = \rho_\alpha * w$, where $\rho_\alpha$ is the impulse response of
the underlying system. We have also shown in Section 5.3.1 that the concept remains
applicable for $\mathrm{Re}(\alpha) > 0$ using the extended definition (5.10) of $\mathrm{P}_\alpha^{-1}$. Since $\mathrm{P}_\alpha^{-1}$ is a
$\mathcal{S}$-continuous convolution operator, this results in a well-defined stationary process, the
Gaussian version of which is often referred to as the Ornstein–Uhlenbeck process.
To make the connection with splines, we observe that the first-order impulse response
can be written as a sum of exponential B-splines,
$$\rho_\alpha(r) = \sum_{k=0}^{\infty} e^{\alpha k}\, \beta_\alpha(r - k), \quad (7.1)$$
as illustrated in Figure 7.1a. The B-spline generator $\beta_\alpha(r)$ is defined by (6.21) and is
supported in the interval $[0, 1)$. A key observation is that the B-spline coefficients $e^{\alpha k}$
Figure 7.1 Spline-based representation of the impulse response and autocorrelation function of a
differential system with a pole at α = −1. (a) The impulse response (dashed line) is decomposed
as a linear combination of the integer shifts of an exponential B-spline (solid line). (b) The
autocorrelation is synthesized by interpolating its sample values at the integers (discrete
autocorrelation); the corresponding (second-order) spline interpolation kernels are represented
using solid lines.
in (7.1) correspond to the impulse response of the digital filter $\Delta_\alpha^{-1}$ described by the
transfer function $\frac{1}{1 - e^{\alpha} z^{-1}}$, which is the natural discrete counterpart of $\mathrm{P}_\alpha^{-1}$.
The operator version of (7.1) therefore reads $\rho_\alpha = \mathrm{P}_\alpha^{-1} \delta = \Delta_\alpha^{-1} \beta_\alpha$, which makes an
interesting connection between the analog and discrete versions of a first-order operator.
We have shown in prior work that this type of relation is fundamental to the theory of
linear systems and that it carries over for higher-order systems [Uns05].
Since the driving term is white, the correlation structure of the process (second-order
statistics) is fully characterized by the (Hermitian-symmetric) autocorrelation of the
impulse response. In the case where $\alpha$ is real-valued and negative, we get
$$R_{\rho_\alpha}(r) = \langle \rho_\alpha(\cdot + r), \rho_\alpha \rangle = (\rho_\alpha^{\vee} * \rho_\alpha)(r) \propto e^{\alpha |r|} = \sum_{k \in \mathbb{Z}} e^{\alpha |k|}\, \varphi_{\mathrm{int},\alpha}(r - k), \quad (7.2)$$
so that the normalized autocorrelation of the process is
$$c_{s_\alpha}(r) = \frac{\mathrm{E}\{s_\alpha(\cdot + r)\, s_\alpha(\cdot)\}}{\mathrm{E}\{|s_\alpha|^2\}} = e^{\alpha |r|}.$$
The rightmost sum in (7.2) expresses the fact that the continuous-domain correlation
function can be reconstructed from the discrete-domain one, $c_{s_\alpha}[k] = c_{s_\alpha}(r)\big|_{r=k} = e^{\alpha |k|}$
(the sampled version of the former), by using a Shannon-type interpolation formula,
which involves the "augmented-order" spline kernel $\varphi_{\mathrm{int},\alpha}$ introduced in Section 6.3.2.
Let $\sigma_w^2 < +\infty$ denote the variance of a zero-mean white input noise $w$ observed
through a normalized observation window $\varphi / \|\varphi\|$. Then, it is well known (Wiener–Khintchine theorem) that the power spectrum of $s_\alpha$ is given by
$$\Phi_{s_\alpha}(\omega) = \sigma_w^2\, |\widehat{\rho}_\alpha(\omega)|^2,$$
with $\widehat{\rho}_\alpha(\omega) = \frac{1}{j\omega - \alpha}$. We note, however, that the power spectrum provides a complete
characterization of the process only when the input noise is Gaussian.
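To see these second-order statistics emerge empirically, one can simulate the sampled process as a discrete AR(1) recursion driven by non-Gaussian innovations. In the sketch below, Laplace-distributed samples act as a convenient stand-in for the integrated Lévy increments; the estimated correlation sequence approaches $e^{\alpha |k|}$ irrespective of the innovation law, whereas higher-order statistics would distinguish the sparse from the Gaussian case:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n = -0.5, 200000
a = np.exp(alpha)                          # pole of the sampled system

u = rng.laplace(scale=1.0, size=n)         # heavier-tailed innovations
s = np.zeros(n)
for k in range(1, n):                      # AR(1): s[k] = a s[k-1] + u[k]
    s[k] = a * s[k - 1] + u[k]

s -= s.mean()
c = np.array([np.mean(s[m:] * s[: n - m]) for m in range(6)])
print(c / c[0])                            # ~ exp(alpha * m), m = 0, ..., 5
```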
7.2 General abstract characterization
The generic solution of the general innovation model (4.23) is given by $s = \mathrm{L}^{-1} w$,
where $w$ is a particular brand of white noise with Lévy exponent $f$ (see Definition 4.1)
and $\mathrm{L}^{-1}$ a proper right inverse of $\mathrm{L}$. The theoretical results of Section 4.5.2 guarantee
the existence of this solution as a generalized stochastic process over $\mathcal{S}'(\mathbb{R}^d)$ provided
that $\mathrm{L}^{-1*}$ (the adjoint of $\mathrm{L}^{-1}$) and $f$ satisfy some joint regularity conditions. The three
configurations of interest that balance the range of acceptable innovations are listed
below for further reference.
Note that the conditions of Definition 7.1 are given in order of increasing level of
complexity in the inversion of the whitening operator L when the problem is ill-posed
over $\mathcal{S}'(\mathbb{R}^d)$.
Now, if $(\mathrm{L}^{-1*}, f)$ is admissible, then the underlying generalized stochastic processes, or random fields when $d > 1$, are well defined and completely specified by their
characteristic functional.
We have already used this mechanism in Section 4.4.3, Proposition 4.15 to determine
the covariance of a white Lévy noise under the standard zero-mean and finite-variance
assumptions. Specifically, we showed that the correlation/covariance form of the inno-
vation process w is given by
$$B_w(\varphi_1, \varphi_2) = \sigma_w^2\,\langle\varphi_1, \varphi_2\rangle, \quad (7.6)$$
with $\sigma_w^2 = -f''(0)$. Here, we rely on duality (i.e., $\langle\varphi, L^{-1}w\rangle = \langle L^{-1*}\varphi, w\rangle$) to transfer this result to the output of the general innovation model (4.23) as
$$B_s(\varphi_1, \varphi_2) = B_{L^{-1}w}(\varphi_1, \varphi_2) = B_w(L^{-1*}\varphi_1, L^{-1*}\varphi_2) = \sigma_w^2\,\langle L^{-1}L^{-1*}\varphi_1, \varphi_2\rangle, \quad (7.7)$$
which is consistent with (7.5) under the implicit assumptions that $\sigma_w^2 = -f''(0)$ and $f'(0) = 0$. Finally, we recover the autocorrelation function of s by making the substitution $\varphi_1 = \delta(\cdot - r_1)$ and $\varphi_2 = \delta(\cdot - r_2)$ in (7.7), which leads to
$$R_s(r_1, r_2) = E\{s(r_1)\, s(r_2)\} = B_s\big(\delta(\cdot - r_1), \delta(\cdot - r_2)\big) = \sigma_w^2\,\langle L^{-1}L^{-1*}\delta(\cdot - r_1), \delta(\cdot - r_2)\rangle. \quad (7.8)$$
This is justified by the kernel theorem (see Section 3.3.4), which allows one to express the correlation functional as
$$B_s(\varphi_1, \varphi_2) = \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} R_s(r_1, r_2)\,\varphi_1(r_1)\,\varphi_2(r_2)\, dr_1\, dr_2.$$
The bottom line is that the correlation structure of the process is entirely determined
by the impulse response of the Hermitian symmetric operator L−1 L−1∗ , which may or
may not be shift-invariant.
For formalization purposes, it is useful to categorize stochastic processes based on
whether or not they are invariant to the elementary coordinate transformations. Inva-
riance here is not meant literally, but probabilistically, in the sense that the application
of a given spatial transformation (translation, rotation, or scaling) leaves the probability
laws of the process unchanged.
However, since the objects of interest are generalized functions, we need to properly define the underlying notions. The translation by $r_0 \in \mathbb{R}^d$ of a generalized function $\phi \in \mathcal{S}'(\mathbb{R}^d)$ is denoted by $\phi(\cdot - r_0)$, while its scaling by a is written as $\phi(\cdot/a)$. The definition of these operations (see Section 3.3.2) is
$$\langle\varphi, \phi(\mathbf{T}\cdot)\rangle = \frac{1}{|\det\mathbf{T}|}\,\langle\varphi(\mathbf{T}^{-1}\cdot), \phi\rangle$$
for any pair $(\varphi, \phi) \in \mathcal{S}(\mathbb{R}^d) \times \mathcal{S}'(\mathbb{R}^d)$, where it is implicitly assumed that the $(d \times d)$ coordinate transformation matrix $\mathbf{T}$ is invertible. The scaling by a > 0 is obtained by selecting $\mathbf{T} = a\mathbf{I}$, whose determinant is $a^d$.
The scaling order H is also called the Hurst exponent of the process. Here, the relevant adjoint relation is $\langle\varphi, a^H s(\cdot/a)\rangle = a^H |a|^d \langle\varphi(a\cdot), s\rangle$, which follows from the definition of the affine transformation and the linearity of the duality product.
One can readily check that all Lévy noises are stationary and isotropic. Self-similarity,
by contrast, is a more restrictive property that is only shared by the stable members of
the family for which the exponent f is a homogeneous function of degree α.
Similarly, one can also define weaker forms of invariance by considering the effect of
a transformation on the first- and second-order moments only. This leads to the notions
of wide-sense stationarity, isotropy, and self-similarity under the implicit assumption
that the variances are finite (second-order process).
A second-order process s is wide-sense stationary (WSS) if
$$E\{\langle\varphi, s\rangle\} = E\{\langle\varphi(\cdot + r_0), s\rangle\}$$
$$B_s(\varphi_1, \varphi_2) = B_s\big(\varphi_1(\cdot + r_0), \varphi_2(\cdot + r_0)\big)$$
for any $\varphi, \varphi_1, \varphi_2 \in \mathcal{S}(\mathbb{R}^d)$ and any $r_0 \in \mathbb{R}^d$ or, equivalently, if its (generalized) mean $E\{s(r)\}$ is constant and its (generalized) autocorrelation is a function of the relative displacement only; that is, $R_s(r_1, r_2) = R_s(r_1 - r_2)$. If, in addition, $R_s(r_1 - r_2) = R_s(|r_1 - r_2|)$, then the process is WSS isotropic.
For the innovation model $s = L^{-1}w$, we have $E\{s(r)\} = 0$ whenever $f'(0) = 0$, which
is a property that is shared by all symmetric Lévy exponents (and all concrete examples
considered in this book). This zero-mean assumption facilitates the treatment of second-
order processes (with minimal loss in generality).
Then, depending on the characteristics of L−1 (or, equivalently, L−1∗ ), the process s
enjoys the following properties.
(1) If $L^{-1}$ is linear shift-invariant, then s is stationary and
$$h(r, r') = h(r - r', 0) = \rho_L(r - r'),$$
1 The class of such admissible Lévy exponents are the α-stable ones; the symmetric members of the family are $f_\alpha(\omega; s_0) = -|s_0\,\omega|^{\alpha}$.
Since the Lévy innovations w are all intrinsically stationary, there is no distinction in
this model between stationarity and WSS, except for the fact that the latter requires the
variance σw2 to be finite. This is not so for self-similarity, which is a more-demanding
property. In that respect, we note that there is no contradiction between Statements (3)
and (4) in Theorem 7.1 because the second-order moments of α-stable processes (for
which f is homogeneous of order α) are undefined for α < 2 (due to the unbounded
variance of the noise). The intersection occurs for α = 2 (Gaussian scenario) while
larger homogeneity indices (α > 2) are excluded by the Lévy admissibility condition.
The last result in Theorem 7.1 is fundamental, for it tells us when s(r) can be interpreted as a conventional stochastic process; that is, as a random function of the index variable r. In particular, the point values of the process can then be given the stochastic-integral form $s(r) = \int_{\mathbb{R}^d} h(r, \tau)\, W(\mathrm{d}\tau)$, which shows the connection with conventional stochastic integration (Itô calculus).
Here, W is a random measure over Rd that is formally defined as W(E) = 1E , w for
any Borel set E ⊆ Rd .
While we have already pointed out the incompatibility between stationarity and self-
similarity, there is a formal way to bypass this limitation by enforcing stationarity selec-
tively through the test functions whose moments are vanishing (up to a certain order).
Specifically, we shall specify in Section 7.5 processes that fulfill this quasi-stationarity
condition with the help of the Lp -stable, scale-invariant inverse of L∗ defined in Sec-
tion 5.5. This construction results in self-similar processes with stationary increments,
the prototypical example being fractional Brownian motion. But before that we shall
investigate other concrete examples of generalized stochastic processes, starting with
the simpler stationary ones.
If the inverse operator $L^{-1}$ is shift-invariant with generic impulse response $\rho_L \in L_1(\mathbb{R}^d)$,
then (7.12) is equivalent to a convolutional system with s(r) = (ρL ∗ w)(r). We can then
apply (7.3) in conjunction with Proposition 5.1 to obtain the characteristic functional of
this process, which reads
s (ϕ) = exp
P f (ρL∨ ∗ ϕ)(r) dr . (7.14)
Rd
More generally, we may consider generalized processes that are obtained by LSI filter-
ing of an innovation process w and are not necessarily solutions of a stochastic differ-
ential equation.
PROPOSITION 7.2 (Generalized stationary processes) Let $s = h * w$, where $\|\mu_h\|_{TV} < \infty$ (bounded variation) (resp., h is rapidly decaying) and w is a white-noise process over $\mathcal{S}'(\mathbb{R}^d)$ whose Lévy exponent f is p-admissible (resp., Lévy–Schwartz admissible). Then, s is a generalized stochastic process in $\mathcal{S}'(\mathbb{R}^d)$ that is stationary and completely specified by the characteristic functional $\hat{P}_s(\varphi) = \exp\left(\int_{\mathbb{R}^d} f\big((h^{\vee} * \varphi)(r)\big)\, dr\right)$. In general, the process is non-Gaussian unless $f(\omega) = -\frac{\sigma_w^2}{2}|\omega|^2$.
The proof is the same as that of Statement (1) in Theorem 7.1. As for the existence of the process when f is p-admissible, we rely on the convolution inequality $\|h^{\vee} * \varphi\|_{L_p} = \|h * \varphi^{\vee}\|_{L_p} \le \|\mu_h\|_{TV}\,\|\varphi\|_{L_p}$, which ensures that the $L_p$ condition in Definition 7.1 is satisfied. In that respect, we note that the bounded-variation hypothesis on h (which is less stringent than $h \in L_1$) is the minimal requirement for stability when p = 1, while it can be weakened to $\|\hat{h}\|_{L_\infty} < \infty$ for p = 2 (see Statements (1) and (3) in Theorem 3.5).
In addition, when $h \in L_p(\mathbb{R}^d)$ (resp., $h \in \mathcal{R}(\mathbb{R}^d)$), we can invoke the last part of Theorem 7.1 to show that the point values s(r) of the process are well defined, so that s also admits a classical interpretation.
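As an illustration of the filtered-white-noise model, here is a minimal simulation sketch (the kernel $h(t) = e^{-t}\mathbb{1}_{t\ge 0}$, the compound-Poisson innovation, and the discretization step are all ad hoc choices made for the example):

```python
# Minimal sketch: s = h * w for a compound-Poisson innovation w and a causal,
# rapidly decaying kernel h (crude Riemann discretization of the continuous model).
import numpy as np

rng = np.random.default_rng(1)
dt, T, lam = 0.01, 200.0, 0.5               # grid step, horizon, Poisson rate
n = int(T / dt)
w = np.zeros(n)
n_imp = rng.poisson(lam * T)                # number of Dirac impulses in [0, T]
idx = rng.integers(0, n, size=n_imp)        # uniform impulse locations
np.add.at(w, idx, rng.normal(size=n_imp) / dt)   # a Dirac ~ height 1/dt on the grid

h = np.exp(-np.arange(0, 10, dt))           # h(t) = e^{-t} for t >= 0
s = np.convolve(w, h)[:n] * dt              # s = (h * w)(t) restricted to [0, T]
print(s[:5])
```

Replacing the impulsive innovation by dense Gaussian samples yields the corresponding Gaussian member of the family with identical second-order statistics.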
$$\rho_L(r) = L_d^{-1}\beta_L(r) = \sum_{k\in\mathbb{Z}^d} p[k]\,\beta_L(r - k), \quad (7.15)$$
with $p[k] = \int_{[-\pi,\pi]^d} \frac{e^{j\langle\omega,k\rangle}}{\hat{L}_d(\omega)}\,\frac{d\omega}{(2\pi)^d} \in \ell_1(\mathbb{Z}^d)$. The concept also generalizes for the specification of the second-order moments.
where the interpolation function is defined by (6.51) and where the expansion coefficients $R_s[k] = E\{s(\cdot + k)\, s(\cdot)\} = R_s(r)\big|_{r=k}$ correspond to the discrete-domain version of the correlation.
The power spectrum of the process is the Fourier transform of the autocorrelation
function,
$$\Phi_s(\omega) = \mathcal{F}\{R_s\}(\omega) = \frac{\sigma_w^2}{|\hat{L}(\omega)|^2}, \quad (7.16)$$
an expression that is consistent with the interpretation of the signal as a filtered white
noise.
As expected, this yields a correlation function that only depends on the relative differ-
ence r = (r1 − r2 ) of the index variables. To establish the validity of the interpolation
formula, we consider (6.51) and manipulate the Fourier-domain expression for ϕint as
$$\hat{\varphi}_{\mathrm{int}}(\omega) = \frac{|\hat{\beta}_L(\omega)|^2}{\sum_{n\in\mathbb{Z}^d} |\hat{\beta}_L(\omega + 2\pi n)|^2}$$
$$= \frac{|\hat{L}_d(\omega)|^2 / |\hat{L}(\omega)|^2}{\sum_{n\in\mathbb{Z}^d} |\hat{L}_d(\omega + 2\pi n)|^2 / |\hat{L}(\omega + 2\pi n)|^2} \quad \text{(from the definition of the generalized B-spline)}$$
$$= \frac{1/|\hat{L}(\omega)|^2}{\sum_{n\in\mathbb{Z}^d} 1/|\hat{L}(\omega + 2\pi n)|^2}$$
$$= \frac{\Phi_s(\omega)}{\sum_{n\in\mathbb{Z}^d} \Phi_s(\omega + 2\pi n)},$$
where the simplification in the third line results from the property that $\hat{L}_d(\omega)$ (transfer function of a digital filter) is 2π-periodic. The final formula is the ratio of $\Phi_s(\omega)$ (the continuous-domain power spectrum of s given by (7.16)) and its discrete-domain counterpart $\sum_{k\in\mathbb{Z}^d} R_s[k]\, e^{-j\langle\omega,k\rangle} = \sum_{n\in\mathbb{Z}^d} \Phi_s(\omega + 2\pi n)$ (by Poisson's summation formula), which proves the desired result.
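The chain of equalities can also be verified numerically. The sketch below (the truncation of the periodization sum and the value of ω are arbitrary choices) evaluates the ratio both from the power spectrum and from the exponential B-spline for the first-order operator $L = D - \alpha\mathrm{Id}$ in d = 1:

```python
# Numerical check of phi_int^(w) = Phi_s(w) / sum_n Phi_s(w + 2 pi n)
# for L = D - alpha Id, where Phi_s(w) = sigma^2 / |j w - alpha|^2 (sigma = 1).
import numpy as np

alpha, omega = -1.0, 0.7
ns = np.arange(-2000, 2001)                 # truncated periodization sum
phi_s = lambda w: 1.0 / np.abs(1j * w - alpha) ** 2
lhs = phi_s(omega) / np.sum(phi_s(omega + 2 * np.pi * ns))

# Same ratio built from the exponential B-spline beta_alpha:
beta = lambda w: (1 - np.exp(alpha - 1j * w)) / (1j * w - alpha)
rhs = np.abs(beta(omega)) ** 2 / np.sum(np.abs(beta(omega + 2 * np.pi * ns)) ** 2)
print(lhs, rhs)                             # the two values coincide
```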
Note that we can also write an extended version of this result for the class of spline-
admissible operators L with βL ∈ L1 (Rd ) which requires that f be p-admissible for some
p ∈ [1, 2].
Proof The characterization of the B-spline in the statement of the proposition ensures
that the composition of Ld and L−1 is LSI with impulse response βL = Ld L−1 δ. Since Ld
is LSI, the condition is trivially satisfied when L−1 is shift-invariant as well. Generally,
the condition will also be met in the non-stationary scenarios considered later in this
chapter (see Proposition 7.6), the fundamental reason being that Ld must annihilate
the components that are in the null space of L (see Section 6.4 on the construction of
generalized B-splines). This property allows us to write u = Ld s = Ld L−1 w = βL ∗ w.
The result then follows as a direct consequence of Propositions 7.2 and 7.3.
variance of the process $E\{|s_{\mathrm{Gauss}}(r_0)|^2\}$ is finite. The simplest example that fits the generic filtered-white-noise model but does not meet the latter condition is $w_{\mathrm{Gauss}}$ with h = δ.
specify not only Gaussian CARMA processes, but also a whole variety of non-Gaussian
variants associated with more general Lévy innovations. These extended CARMA pro-
cesses are stationary and completely characterized by the characteristic functional (7.14)
with ρL = hα,γ and d = 1 without any restriction on the Lévy exponent f. Moreover,
since the underlying kernel h(t0 , τ ) = hα,γ (t0 − τ ) for t0 fixed is bounded and expo-
nentially decreasing, they admit an interpretation as an ordinary stochastic process (by
Theorem 7.1).
The autocorrelation function of the process is defined under the additional second-order hypotheses $f''(0) = -\sigma_w^2$ and $f'(0) = 0$. It is given by
$$R_{s_\alpha}(t) = \sigma_w^2\,(h_{\alpha,\gamma} * h_{\alpha,\gamma}^{\vee})(t),$$
Lévy processes and their extensions can also be defined in the introduced framework,
but their specification is more delicate due to the fact that their underlying SDE is un-
stable. This requires the use of the “regularized” bounded inverse operators that we
presented in Section 5.4. As this constitutes a significant departure from the traditional
shift-invariant setting, we shall detail the construction in Section 7.4.1 and provide the
connection with the classical theory.
In addition, one typically requires the process to fulfill some form of probabilistic
continuity.
In our framework, the equivalent of the above processes is obtained as a solution of
the stochastic differential equation
$$DW = \dot{W} = w, \quad (7.20)$$
subject to the boundary condition W(0) = 0, where $D = \frac{d}{dt}$ is the derivative operator and W is to be defined as a random element of $\mathcal{S}'(\mathbb{R})$. The driving term w in (7.20) is a 1-D Lévy innovation, as defined in Section 4.4, with characteristic functional
$$E\{e^{j\langle\varphi, w\rangle}\} = \hat{P}_w(\varphi) = \exp\left(\int_{\mathbb{R}} f\big(\varphi(r)\big)\, dr\right). \quad (7.21)$$
We recall that w has the property of independence at every point, meaning that any pair of random variables $\langle\varphi, w\rangle$ and $\langle\psi, w\rangle$, for test functions ϕ and ψ of disjoint support, are independent. In terms of the characteristic functional, this property translates into having $\hat{P}_w(\omega_1\varphi + \omega_2\psi)$ factorize as
$$\hat{P}_w(\omega_1\varphi + \omega_2\psi) = \hat{P}_w(\omega_1\varphi)\,\hat{P}_w(\omega_2\psi) \quad \text{for disjointly supported } \varphi \text{ and } \psi. \quad (7.22)$$
To say that a generalized random process W fulfills (7.20) is, for us, to have
$$\langle D^*\varphi, W\rangle = \langle\varphi, w\rangle \quad \text{for all } \varphi \in \mathcal{S}(\mathbb{R}). \quad (7.23)$$
The particular solution that we retain is $W = I_0 w$ or, equivalently,
$$\langle\varphi, W\rangle = \langle I_0^*\varphi, w\rangle, \quad (7.24)$$
where $I_0^*$ is the left inverse $\mathcal{S}(\mathbb{R}) \to \mathcal{R}(\mathbb{R})$ of $D^*$ specified in Section 5.4.1. In view of (7.24), $\langle\varphi, W\rangle$ is probabilistically characterized by the functional
$$\hat{P}_W(\varphi) = \hat{P}_w(I_0^*\varphi). \quad (7.25)$$
To see that (7.24) implies (7.23), note, first, that for any $\tilde\varphi \in \mathcal{S}(\mathbb{R})$ that can be written as $\tilde\varphi = D^*\varphi$ for some $\varphi \in \mathcal{S}(\mathbb{R})$, we have
$$\langle\tilde\varphi, W\rangle = \langle D^*\varphi, W\rangle = \langle I_0^* D^*\varphi, w\rangle.$$
Now, since $I_0^*$ is a left inverse $\mathcal{S}(\mathbb{R}) \to \mathcal{R}(\mathbb{R})$ of $D^*: \mathcal{S}(\mathbb{R}) \to \mathcal{S}(\mathbb{R})$, we find
$$\langle D^*\varphi, W\rangle = \langle\varphi, w\rangle$$
which follows from (5.17) with $\omega_0 = 0$ and ϕ being replaced by $\delta(\cdot - \tau)$. While (7.27) and (7.28) are equivalent identities with the role of the variables t and τ being interchanged, the main point is that the kernel on the right of (7.28) for τ fixed is compactly supported. This allows us to invoke Theorem 7.1 to show that the point values of the process, $W(t_n) = \langle\mathbb{1}_{(0,t_n]}, w\rangle$, are ordinary random variables.
Having defined a particular solution of (7.20) as I0 w, let us now show that it is
consistent with the axiomatic definition of a Lévy process given by Definition 7.7. The
zero boundary condition at the origin (Property (1) in Definition 7.7) is imposed by the
operator I0 (see Theorem 5.3 with ω0 = 0). As for the other two properties, we recall
that, for t ≥ 0,
$$I_0^*\varphi(t) = \int_t^{+\infty} \varphi(\tau)\, d\tau,$$
from which we deduce
$$I_0^*\delta_T^*\varphi(t) = \int_t^{\infty} \big(\varphi(\tau) - \varphi(\tau + T)\big)\, d\tau = \int_t^{t+T} \varphi(\tau)\, d\tau = \big(\mathbb{1}_{[-T,0)} * \varphi\big)(t).$$
From there, we see that, for the increment process $\delta_T W$,
$$\langle\varphi, \delta_T W\rangle = \langle\delta_T^*\varphi, W\rangle = \langle I_0^*\delta_T^*\varphi, w\rangle = \langle\mathbb{1}_{[-T,0)} * \varphi, w\rangle,$$
which is equivalent to
$$\delta_T W = W(\cdot) - W(\cdot - T) = \mathbb{1}_{(0,T]} * w$$
because $\mathbb{1}_{(0,T]} = \mathbb{1}_{[-T,0)}^{\vee}$. Now, since w is stationary, we have that
$$X_t = \langle\varphi(\cdot - t), \delta_T W\rangle = \langle\mathbb{1}_{[-T,0)} * \varphi(\cdot - t), w\rangle \overset{w}{=} \langle\mathbb{1}_{[-T,0)} * \varphi, w\rangle = \langle\varphi, \delta_T W\rangle = X_0$$
for all $t \in \mathbb{R}$, where $\overset{w}{=}$ denotes equivalence in law. This proves that $\delta_T W$ is stationary.
Finally, by writing $W(t_m) - W(t_{m-1}) = \langle\mathbb{1}_{(t_{m-1},t_m]}, w\rangle$ and using Proposition 3.10 in combination with (7.22), we see that the joint characteristic function of the increments $U_1 = W(t_2) - W(t_1)$, $U_2 = W(t_3) - W(t_2)$, ..., $U_{n-1} = W(t_n) - W(t_{n-1})$ with $0 \le t_1 < t_2 < \ldots < t_n$ separates as
$$\hat{p}_{(U_1:U_{n-1})}(\omega_1, \ldots, \omega_{n-1}) = \hat{P}_w\big(\omega_1\mathbb{1}_{(t_1,t_2]}\big)\,\hat{P}_w\big(\omega_2\mathbb{1}_{(t_2,t_3]}\big)\cdots\hat{P}_w\big(\omega_{n-1}\mathbb{1}_{(t_{n-1},t_n]}\big),$$
which establishes the independence of the increments.
operators $I_{(\omega_K:\omega_1)}$ and $T_{LSI} = P_{(\alpha_1:\alpha_{N-K})}^{-1}\, q_M(D)$, the last of which is linear shift-invariant and S-continuous. While the ordering of the factors of $T_{LSI}$ is immaterial (due to the commutativity of convolution), this is not so for the Kth-order integration operator $I_{(\omega_K:\omega_1)} = I_{\omega_K}\cdots I_{\omega_1}$, which is shift-variant and directly responsible for the boundary conditions. Specifically, the prescribed ordering of the imaginary poles imposes the K linear boundary conditions at the origin
$$\begin{cases} s_\alpha(0) = 0 \\ P_{j\omega_K}\, s_\alpha(0) = 0 \\ \quad\vdots \\ P_{j\omega_2}\cdots P_{j\omega_K}\, s_\alpha(0) = 0, \end{cases} \quad (7.31)$$
which are part of the definition of the underlying generalized Lévy process sα .
Finally, we invoke Corollary 5.5 and Theorem 4.17 to get the full characterization of
the Nth-order generalized Lévy process in terms of its characteristic functional
$$\hat{P}_{s_\alpha}(\varphi) = \exp\left(\int_{\mathbb{R}} f\big(T_{LSI}^*\, I_{(\omega_1:\omega_K)}^*\,\varphi(t)\big)\, dt\right), \quad (7.32)$$
where f is the Lévy–Schwartz exponent of the innovation, $I_{(\omega_1:\omega_K)}^*$ is the composite stabilized integration operator defined by (5.24) and (5.18), and $T_{LSI}$ is the convolution operator corresponding to the stable part of the system with Fourier multiplier
$$\hat{T}_{LSI}(\omega) = \frac{q_M(j\omega)\,\prod_{k=1}^{K}(j\omega - j\omega_k)}{p_N(j\omega)} = \frac{q_M(j\omega)}{\prod_{n=1}^{N-K}(j\omega - \alpha_n)},$$
where $\mathrm{Re}(\alpha_n) \ne 0$ for $1 \le n \le N - K$. Clearly, the extended filtered-white-noise model (7.30) is compatible with the definition of CARMA processes of Section 7.3.4 if we simply set K = 0; it also yields the classical Lévy processes when N = K = 1 and $\omega_1 = 0$. In that respect, we observe that the derived process $P_{j\omega_1}\cdots P_{j\omega_K}\, s_\alpha = T_{LSI}\, w$ is stationary (by Proposition 7.2 and the right-inverse property of $I_{(\omega_K:\omega_1)}$), so that we may interpret K as the order of stationary deficiency.
operator L that meets the boundary condition sα (0) = 0 imposed by the presence of the
pole jω0 on the imaginary axis. The non-stationary correlation of the continuous and
discrete-domain versions of the process is fully specified by
$$R_{s_\alpha}(t, t') = E\{s_\alpha(t)\, s_\alpha(t')\} = v_{s_\alpha}(t' - t) - e^{j\omega_0 t}\, v_{s_\alpha}(t') - e^{-j\omega_0 t'}\, v_{s_\alpha}(-t)$$
$$R_{s_\alpha}[k, k'] = E\{s_\alpha[k]\, s_\alpha[k']\} = v_{s_\alpha}[k' - k] - e^{j\omega_0 k}\, v_{s_\alpha}[k'] - e^{-j\omega_0 k'}\, v_{s_\alpha}[-k],$$
where $v_{s_\alpha}(t) = \sigma_w^2\,\rho_{LL^*}(t)$ and where $v_{s_\alpha}[k] = v_{s_\alpha}(t)\big|_{t=k}$. These two entities are linked through the exponential-spline interpolation formula
$$v_{s_\alpha}(t) = \sum_{k\in\mathbb{Z}} v_{s_\alpha}[k]\,\varphi_{\mathrm{int}}(t - k),$$
where $\varphi_{\mathrm{int}}(t)$ is specified by (6.51). Moreover, $v_{s_\alpha}(t) = O(|t|)$ is a function of slow growth
whose generalized Fourier transform is given by
$$\hat{v}_{s_\alpha}(\omega) = \frac{\sigma_w^2}{|\hat{L}(-\omega)|^2} = V_{s_\alpha}(e^{j\omega})\,\hat{\varphi}_{\mathrm{int}}(\omega),$$
where $\hat{L}(\omega)$ is the frequency response of L, which exhibits a zero at $\alpha_N = j\omega_0$.
Proof From (7.8), we have that $R_{s_\alpha}(t_1, t_2) = \sigma_w^2\, L^{-1}L^{-1*}\{\delta(\cdot - t_1)\}(t_2)$. Since the system has a single singularity on the imaginary axis at $j\omega_0$, (7.32) implies that $L^{-1*} = T_{LSI}^*\, I_{\omega_0}^*$, where $T_{LSI}$ is a BIBO-stable LSI system whose transfer function is $\hat{T}_{LSI}(\omega) = \frac{j(\omega + \omega_0)}{\hat{L}(\omega)}$. Using the Fourier-domain formula (5.21) of $I_{\omega_0}^*$ and the fact that $T_{LSI}$ is a standard convolution operator, we find that the Fourier transform of $T_{LSI}^*\, I_{\omega_0}^*\{\delta(\cdot - t_1)\}$ is given by
$$\hat{T}_{LSI}(-\omega)\,\frac{e^{-j\omega t_1} - e^{j\omega_0 t_1}}{-j(\omega + \omega_0)}.$$
Likewise, we have that $L^{-1} = I_{\omega_0}\, T_{LSI}$, which, by considering the complex-conjugate counterpart of (5.20) for $I_{\omega_0}$, yields
$$L^{-1}L^{-1*}\{\delta(\cdot - t_1)\}(t) = \int_{\mathbb{R}} |\hat{T}_{LSI}(-\omega)|^2\,\frac{e^{-j\omega t_1} - e^{j\omega_0 t_1}}{-j(\omega + \omega_0)}\,\frac{e^{j\omega t} - e^{-j\omega_0 t}}{j(\omega + \omega_0)}\,\frac{d\omega}{2\pi}$$
$$= \int_{\mathbb{R}} |\hat{T}_{LSI}(-\omega)|^2\,\frac{e^{j\omega(t - t_1)} - e^{j\omega_0 t_1} e^{j\omega t} - e^{-j\omega_0 t} e^{-j\omega t_1} + e^{-j\omega_0(t - t_1)}}{|\omega + \omega_0|^2}\,\frac{d\omega}{2\pi} \quad (7.33)$$
$$= \rho_{LL^*}(t - t_1) - e^{j\omega_0 t_1}\,\rho_{LL^*}(t) - e^{-j\omega_0 t}\,\rho_{LL^*}(-t_1).$$
The critical step in this derivation is the evaluation of the integral in (7.33) which,
contrary to appearances, is non-singular, due to the presence of the fourth term in the
numerator. To make this explicit, we recall that the proper specification of the inverse
with $\mathrm{Re}(\alpha_n) \ne 0$ for $1 \le n \le N - K$ and $\alpha_{K+k} = j\omega_k$ for $1 \le k \le K$, where $I_{\omega_k}$ and $P_{\alpha_n}^{-1}$ are specified by (5.19) and (5.10), respectively. Then, the generalized B-spline $\beta_L$ defined by (7.36) and (7.34) has the following properties:
$$\Delta_\alpha L^{-1}\varphi = \beta_L * \varphi$$
$$L^{-1*}\Delta_\alpha^*\varphi = \beta_L^{\vee} * \varphi$$
for all $\varphi \in \mathcal{S}(\mathbb{R})$.
Proof The key is to rely on the factorization property (6.42) of the exponential
B-spline βα and to consider the elementary factors one at a time. To that end, we first
establish that
$$\Delta_\alpha P_\alpha^{-1}\varphi = \beta_\alpha * \varphi \quad (7.37)$$
$$\Delta_{j\omega_k}\, I_{\omega_k}\varphi = \beta_{j\omega_k} * \varphi, \quad (7.38)$$
as well as the adjoint counterparts of these relations. The first identity is a direct consequence of the definition of the first-order exponential B-spline
$$\beta_\alpha(t) = \Delta_\alpha\rho_\alpha(t) = \mathcal{F}^{-1}\left\{\frac{1 - e^{\alpha - j\omega}}{j\omega - \alpha}\right\}(t),$$
where $\rho_\alpha = \mathcal{F}^{-1}\{1/(j\omega - \alpha)\}$ is the Green's function of the operator $P_\alpha$ or, equivalently, the impulse response of the inverse operator $P_\alpha^{-1}$. As for the second identity, we apply the time-domain definition (5.19) of $I_{\omega_k}$, which yields
where we have used the fact that $\Delta_{j\omega_k}$ annihilates the sinusoidal components that are in the null space of $P_{j\omega_k}$, together with the associativity of convolution operators such as $\Delta_{j\omega_k}$.
By applying (7.37) and (7.38) recursively and making use of the commutativity of
S -continuous LSI operators, we find that
$$\Delta_\alpha L^{-1}\varphi = \Delta_{(\alpha_1:\alpha_{N-1})}\,\Delta_{j\omega_K} I_{\omega_K}\, I_{(\omega_{K-1}:\omega_1)}\, P_{\alpha_{N-K}}^{-1}\cdots P_{\alpha_1}^{-1}\, q_M(D)\{\varphi\}$$
$$= \beta_{j\omega_K} * \Delta_{(\alpha_1:\alpha_{N-2})}\,\Delta_{j\omega_{K-1}} I_{\omega_{K-1}}\, I_{(\omega_{K-2}:\omega_1)}\, P_{\alpha_{N-K}}^{-1}\cdots P_{\alpha_1}^{-1}\, q_M(D)\{\varphi\}$$
$$\quad\vdots$$
$$= \beta_{j\omega_K} * \cdots * \beta_{j\omega_1} * \beta_{\alpha_{N-K}} * \cdots * \beta_{\alpha_1} * q_M(D)\{\varphi\}$$
$$= q_M(D)\{\beta_{\alpha_1} * \cdots * \beta_{\alpha_N}\} * \varphi$$
$$= \beta_L * \varphi.$$
The adjoint counterpart of this identity is obtained by applying the same divide-and-
conquer strategy.
statistical structure than sα . Indeed, by combining the result of Proposition 7.6 with
(7.30), we show that
$$\langle\varphi, u_\alpha\rangle = \langle\varphi, \Delta_\alpha s_\alpha\rangle = \langle\Delta_\alpha^*\varphi, L^{-1}w\rangle = \langle L^{-1*}\Delta_\alpha^*\varphi, w\rangle = \langle\beta_L^{\vee} * \varphi, w\rangle$$
for all $\varphi \in \mathcal{S}(\mathbb{R})$. This is equivalent to
$$u_\alpha = \Delta_\alpha s_\alpha = \beta_L * w, \quad (7.39)$$
which is a form that is much more convenient than sα = L−1 w because the convo-
lution with βL preserves stationarity. The other favorable aspect is that the general-
ized B-spline βL (which has a compact support) is much better localized than the
Green’s function ρL , especially in the non-stable scenario where ρL exhibits polyno-
mial growth. We are now in the position to invoke Proposition 7.4 to get the complete
statistical characterization of uα , including its correlation function which is proportional
to $\beta_{LL^*}(t) = (\beta_L * \beta_L^{\vee})(t)$, where $\beta_L$ is the generalized exponential B-spline defined by (7.36).
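The practical benefit of (7.39) is easy to observe numerically: with a B-spline of unit support, unit-spaced samples of $u_\alpha$ decorrelate. A minimal sketch (Gaussian innovation and illustrative grid step of our choosing):

```python
# Minimal sketch: generalized increments u_alpha(t) = s(t) - e^{alpha} s(t-1)
# of a first-order process s = rho_alpha * w; integer samples are decorrelated.
import numpy as np

rng = np.random.default_rng(7)
alpha, dt = -1.0, 0.01
m = int(1 / dt)                                  # samples per unit interval
w = rng.normal(scale=1 / np.sqrt(dt), size=500 * m)

rho = np.exp(alpha * np.arange(0, 15, dt))       # causal Green's function of P_alpha
nfft = w.size + rho.size - 1
s = np.fft.irfft(np.fft.rfft(w, nfft) * np.fft.rfft(rho, nfft), nfft)[: w.size] * dt

u = s[m:] - np.exp(alpha) * s[:-m]               # Delta_alpha s at unit lag
uk = u[::m]                                      # integer samples u_alpha[k]
print(np.corrcoef(uk[:-1], uk[1:])[0, 1])        # ~ 0 (up to sampling error)
```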
Figure 7.2 Example 1: generation of generalized stochastic processes with whitening operator L = D and pole vector α = (0). (a) B-spline functions $\beta_L(t) = \mathrm{rect}(t - \frac{1}{2})$ and $\beta_{LL^*}(t) = \mathrm{tri}(t)$. (b) Brownian motion. (c) Compound-Poisson process with λ = 1/32 and Gaussian amplitude distribution $p_A(a) = (2\pi)^{-1/2} e^{-a^2/2}$. (d) SαS Lévy motion with α = 1.2.
innovations: Gaussian (panel b), impulsive Poisson (panel c), and symmetric-alpha-
stable (SαS) with α = 1.2 (panel d).
The relevant operators are:
• Example 1: L = D (Lévy process)
• Example 2: L = D2 (second-order extension of Lévy process)
• Example 3: L = (D − α1 Id)(D − α2 Id) and α = (j3π/4, −j3π/4) (generalized Lévy
process)
• Example 4: L = (D−α1 Id)(D−α2 Id) and α = (−0.05+jπ/2, −0.05−jπ/2) (CAR(2)
process).
The corresponding B-splines (βL and βLL∗ ) are shown in the upper-left panel of each
figure.
The signals that are displayed side-by-side share the same whitening operator, but
they differ in their sparsity patterns, which come in three flavors: none (Gaussian), finite
rate of innovation (Poisson), and heavy-tailed statistics (SαS). The Gaussian signals are
uniformly textured, while the generalized Poisson processes are piecewise-smooth by
construction.
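These examples can be reproduced in a few lines. The sketch below (our own illustration; the SαS increments are drawn with the standard Chambers–Mallows–Stuck generator, and all parameter values are arbitrary) integrates i.i.d. increments into the three types of Lévy paths:

```python
# Minimal sketch: Brownian, compound-Poisson, and SaS Levy paths obtained by
# cumulating i.i.d. increments over a step dt (Example 1, L = D).
import numpy as np

rng = np.random.default_rng(2)
n, dt, lam, a = 4096, 1.0 / 32, 1.0 / 32, 1.2

def path(increments):
    """Integrate i.i.d. increments into a process with W(0) = 0."""
    return np.concatenate(([0.0], np.cumsum(increments)))

brownian = path(rng.normal(scale=np.sqrt(dt), size=n))
poisson = path(rng.normal(size=n) * (rng.random(n) < lam * dt))  # rare Gaussian jumps
# symmetric-alpha-stable increments (Chambers-Mallows-Stuck generator)
U = rng.uniform(-np.pi / 2, np.pi / 2, size=n)
E = rng.exponential(size=n)
x = np.sin(a * U) / np.cos(U) ** (1 / a) * (np.cos((1 - a) * U) / E) ** ((1 - a) / a)
stable = path(x * dt ** (1 / a))
print(brownian[-1], poisson[-1], stable[-1])
```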
Lowpass processes
The classical Lévy processes (Figure 7.2) are obtained by integration of white Lévy
noise. They go hand-in-hand with the B-spline of degree zero (rect) and its autocorrela-
tion (triangle function), which is a B-spline of degree 1. The Gaussian version (Figure 7.2b)
is a Brownian motion. It is quite rough and nowhere differentiable in the classical sense.
Yet, it is mean-square continuous due to the presence of the single pole at the origin.
The Poisson version (compound-Poisson process) is piecewise-constant, each jump cor-
responding to the occurrence of a Dirac impulse. The SαS Lévy motion exhibits local
fluctuations punctuated by large (but rare) jumps, as is characteristic for this type of
process [ST94, App09]. Overall, it is the jump behavior that dominates, making it even
sparser than its Poisson counterpart.
The example in Figure 7.3 (second-order extension of a Lévy process) corresponds to
one additional level of integration, which yields smoother signals (i.e., one-time diffe-
rentiable in the classical sense). The corresponding Poisson process is piecewise-linear,
while the SαS version looks globally smoother than the Gaussian one, except for a few
sharp discontinuities in its slope. The basic B-spline here is a triangle, while βLL∗ is
a cubic B-spline. The signals in Figures 7.2 and 7.3 are non-stationary; the underlying
processes have the remarkable property of being self-similar (fractals) due to the scale-
invariance of the pure-derivative operators. The Gaussian and SαS processes are strictly
self-similar in the sense that the statistics are preserved through rescaling. By contrast,
the scaling of the Poisson processes necessitates some corresponding adjustment of the
rate parameter λ [UT11].
Bandpass processes
The second-order signals in Figure 7.4 are non-stationary as well, but no longer low-
pass nor self-similar. They are real-valued, and C1 -continuous almost everywhere (pair
of complex-conjugate poles in the left complex plane). They constitute some kind
Figure 7.5 Example 4: generation of generalized stochastic processes with whitening operator
L = (D − α1 Id)(D − α2 Id) and α = (−0.05 + jπ/2, −0.05 − jπ/2). (a) B-spline functions βL and
βLL∗ . (b) Gaussian AR(2) process. (c) Generalized Poisson process with λ = 1/32 and Gaussian
amplitude distribution. (d) SαS AR(2) process with α = 1.2.
where sm is some elementary process with whitening operator Lm and Lévy function
fm (ω). As a demonstration of the concept, we have synthesized some acoustic samples
by mixing random signals associated with elementary musical notes (pair of poles at the
corresponding frequency). These can be downloaded from http://www.sparseprocesses.
org/. The Gaussian versions are diffuse, cluttered, and boring to listen to. The general-
ized Poisson and SαS samples are more interesting perceptually – reminiscent of chimes
– with the latter sounding less dry and more realistic. Note that mixing does not gain us
anything in the Gaussian case because the resulting signal is still part of the traditional
family of Gaussian ARMA processes (this follows from Parseval's relation and the fact that $\sum_{m=1}^{M} \frac{\sigma_w^2}{|\hat{L}_m(-\omega)|^2}$ is expressible as an equivalent rational power spectrum). This is not so for the non-Gaussian members of the family, which are not decomposable in general, meaning that the mixing of sparse processes opens up new modeling perspectives.
Interestingly, the Gaussian acoustic samples are almost impossible to compress using
mp3/AAC, while the generalized Poisson and SαS ones can be faithfully reproduced at
a much lower bit rate.
In Sections 7.5.1 and 7.5.2, we review genuine self-similar models that result
from the application of scale-invariant operators to SαS innovations, and then devote
Section 7.5.3 to the case of Poisson innovations, which yield wide-sense self-similar
models.
with α ∈ (0, 2], where $\|\cdot\|_\alpha^\alpha$ denotes the αth power of the $L_\alpha$ (quasi-)norm and $s_0$ is an arbitrary positive normalization constant. Note that the largest domain of definition of $\hat{P}_{w_{\alpha,s_0}}$ on which it remains finite is the Lebesgue space $L_\alpha(\mathbb{R}^d)$. The characteristic functional $\hat{P}_{w_{\alpha,s_0}}$ is also continuous with respect to the $L_\alpha$ topology (and, a fortiori, with respect to any finer topology such as those of $\mathcal{D}, \mathcal{S} \subset L_\alpha$).
The innovation $w_{\alpha,s_0}$ defined by (7.40) is stationary, isotropic, and self-similar with scaling index $\frac{d}{\alpha} - d$ in the following sense:
$$\langle\varphi, w_{\alpha,s_0}(\cdot/a)\rangle \overset{d}{=} a^{d - \frac{d}{\alpha}}\,\langle\varphi, w_{\alpha,s_0}\rangle \quad \text{in probability law (self-similarity).}$$
Figure 7.6 Gaussian vs. sparse fractal-like processes: comparison of fractional Brownian motion (left column) and generalized Poisson (right column) stochastic processes as the order increases (H = 0.5, 0.75, 1.25, 1.5). The processes that are side-by-side have the same order $\gamma = H + \frac{1}{2}$ and identical second-order statistics.
$$\delta_{y_i}: f \mapsto f - f(\cdot - y_i).$$
Then, the increment process $\delta_Y^* s_{\alpha,H}$ is stationary.
$$\int_{\mathbb{R}^d} y^k\,\delta_Y\varphi(y)\, dy = 0$$
for $|k| \le n$ (this is proved easily by induction on n or by differentiation in the Fourier domain). From there, according to (5.36) we have
$$L_\alpha^{-\gamma*}\,\delta_Y\varphi = \rho^{\gamma - d} * (\delta_Y\varphi)$$
because the correction term for $\delta_Y\varphi$ that normally distinguishes $L_\alpha^{-\gamma*}$ from the shift-invariant inverse is zero in this case. Consequently,
$$\langle\varphi(\cdot - h), \delta_Y^* s_{\alpha,H}\rangle = \langle\delta_Y\varphi(\cdot - h), s_{\alpha,H}\rangle$$
$$= \big\langle L_\alpha^{-H-d+\frac{d}{\alpha}*}\,\delta_Y\varphi(\cdot - h),\, w_\alpha\big\rangle$$
$$= \big\langle L_\alpha^{-H-d+\frac{d}{\alpha}*}\,\delta_Y\varphi,\, w_\alpha(\cdot + h)\big\rangle \quad \text{(by shift-invariance)}$$
$$\overset{d}{=} \big\langle L_\alpha^{-H-d+\frac{d}{\alpha}*}\,\delta_Y\varphi,\, w_\alpha\big\rangle \quad \text{(by the stationarity of } w_\alpha)$$
$$= \langle\varphi, \delta_Y^* s_{\alpha,H}\rangle \quad \text{in probability law,}$$
which proves that $\delta_Y^* s_{\alpha,H}$ is stationary.
(4) Variogram and covariance in the finite-variance case (with 0 < H < 1). For α < 2,
fractional stable motions have infinite variance. But for α = 2, the covariance struc-
ture of fractional Brownian fields can be derived from the characteristic functional
of its (n + 1)th-order increments. Here, we show this derivation for 0 < H < 1.
PROPOSITION 7.8 (Self-similar variograms) The variogram and the covariance function of a fractional Brownian field with 0 < H < 1 are given by
$$2\gamma_{s_{2,H}}(r, s) = E\{|s_{2,H}(r) - s_{2,H}(s)|^2\} \propto 2\rho^{2H}(r - s)$$
and
$$R_{s_{2,H}}(r, s) = E\{s_{2,H}(r)\, s_{2,H}(s)\} \propto 2\rho^{2H}(r) - 2\rho^{2H}(r - s) + 2\rho^{2H}(s),$$
respectively, where $\rho^{\gamma}$ is the γ-homogeneous distribution defined in Theorem 5.10.
Proof Let us fix r and s and temporarily denote by u the increment process
$$u(h) = s_{2,H}(r + h) - s_{2,H}(s + h) = \delta_{r-s}\, s_{2,H}(r + h).$$
Then, the variogram of s2,H corresponds to the variance of u at 0. To compute it, we
first observe that
$$\langle\varphi, \delta_{r-s}\, s_{2,H}(\cdot + r)\rangle = \langle\varphi, \delta_{r-s}\, L_2^{-H-d/2} w(\cdot + r)\rangle$$
$$= \big\langle L_2^{-H-d/2*}\{\delta_{r-s}^*\,\varphi(\cdot + r)\}, w\big\rangle$$
$$= \big\langle \rho^{H-d/2} * \varphi(\cdot + r) - \rho^{H-d/2} * \varphi(\cdot + s), w\big\rangle$$
$$= \big\langle \big(\rho^{H-d/2}(\cdot + r) - \rho^{H-d/2}(\cdot + s)\big) * \varphi, w\big\rangle.$$
This shows that, for fixed r and s, u is a filtered Gaussian white noise with generalized covariance function
$$\big(\rho^{H-d/2}(\cdot + r) - \rho^{H-d/2}(\cdot + s)\big) * \big(\rho^{H-d/2}(\cdot + r) - \rho^{H-d/2}(\cdot + s)\big)^{\vee} \propto 2\rho^{2H}(r - s) - 2\rho^{2H}(\cdot),$$
using the symmetry and convolution properties of $\rho^{\gamma}$. In particular, for the variance at 0 of u (which is the same as the variance everywhere, due to stationarity), we have
$$2\gamma_{s_{2,H}}(r, s) = E\{|s_{2,H}(r) - s_{2,H}(s)|^2\} \propto 2\rho^{2H}(r - s).$$
By developing the above result, we then find the generalized covariance
$$R_{s_{2,H}}(r, s) = E\{s_{2,H}(r)\, s_{2,H}(s)\} \propto 2\rho^{2H}(r) - 2\rho^{2H}(r - s) + 2\rho^{2H}(s).$$
with $\gamma = H + \frac{1}{2}$, where we have applied (5.38) and used Parseval's relation to rewrite $\|L_2^{-\gamma*}\varphi\|_2^2$ in the Fourier domain. It is important to understand that (7.42) completely characterizes fBm. While there are several equivalent ways of writing the denominator in the Fourier-domain integral, we have chosen the form $|\omega|^{2\gamma} = |(-j\omega)^{\gamma}|^2$. In terms of operators, this translates into
" #
PB (ϕ) = exp − 1 Iγ ∗ {ϕ}L ,
H 2 0,2 2
γ∗
where I0,2 is the canonical left inverse of the fractional-derivative operator Dγ ∗ (see
definitions in Table 7.1). This, in turn, implies that fBm is the solution of the fractional
stochastic differential equation
$$D^{\gamma} B_H = w, \quad (7.43)$$
where w is a normalized Gaussian white noise.
By writing the explicit form of $I_{0,2}^{\gamma*}$ with γ ∈ (0.5, 1.5), we obtain
$$I_{0,2}^{\gamma*}\{\varphi\}(\tau) = \int_{\mathbb{R}} \frac{\hat\varphi(\omega) - \hat\varphi(0)}{(-j\omega)^{\gamma}}\, e^{j\omega\tau}\,\frac{d\omega}{2\pi} \quad (7.44)$$
$$= (D^{-\gamma+1})^*\, I_0^*\{\varphi\}(\tau), \quad (7.45)$$
where $I_0^* = I_{0,2}^{1*}$ is the regularized integrator that we have already encountered during our investigation of Lévy processes. One can also verify that (7.44) coincides with the $L_p$-stable left inverse $\partial_{\tau,2}^{-\gamma*}$ of Theorem 5.8 with τ = −γ/2 and p = 2. The interest of (7.45) is that it suggests a possible representation of fBm as
$$B_H = I_0\, D^{-H+1/2} w,$$
where $I_0$ imposes the boundary condition $B_H(0) = 0$.
To determine the underlying kernel denoted by $h_{\gamma,2}(t, \tau)$, we recall that $h(t, \tau) = L^{-1}\{\delta(\cdot - \tau)\}(t) = L^{-1*}\{\delta(\cdot - t)\}(\tau)$. Specifically, by inserting the Fourier transform of $\delta(\cdot - t)$ into (7.44), we find that
$$h_{\gamma,2}(t, \tau) = \int_{\mathbb{R}} \frac{e^{-j\omega t} - 1}{(-j\omega)^{\gamma}}\, e^{j\omega\tau}\,\frac{d\omega}{2\pi}$$
$$= \frac{1}{\Gamma(\gamma)}\Big(\big(-(\tau - t)\big)_+^{\gamma-1} - (-\tau)_+^{\gamma-1}\Big)$$
$$= \frac{1}{\Gamma(\gamma)}\Big((t - \tau)_+^{\gamma-1} - (-\tau)_+^{\gamma-1}\Big), \quad (7.46)$$
which is consistent with the relation
$$\mathcal{F}^{-1}\left\{\frac{1}{(j\omega)^{\alpha+1}}\right\}(t) = \frac{t_+^{\alpha}}{\Gamma(\alpha + 1)}$$
for $\alpha \notin \mathbb{Z}$ (see Table A.1).
Examples of such kernel functions with t = t1 = 1 (fixed) are shown in Figure 7.7 –
the ones of interest with γ ∈ (0.5, 1.5) have their area shaded. These are representatives
of the two main regimes where the functions are either bounded γ ∈ [1, 3/2) or
unbounded γ ∈ (1/2, 1) with two (square-integrable) singularities at τ = 0 and τ = t1 .
While hγ ,2 (t, τ ) is made up of individual atoms (one-sided power functions) whose
Figure 7.7 Comparison of the standard and higher-order fBm-related kernels $h_{\gamma_1,2}(1, \tau)$ (shaded) vs. $h_{\gamma_2,2}(1, \tau)$ (dashed line) with $\gamma_2 = \gamma_1 + 1$: (a) $(\gamma_1, \gamma_2) = (0.8, 1.8)$; (b) $(\gamma_1, \gamma_2) = (1, 2)$; (c) $(\gamma_1, \gamma_2) = (1.2, 2.2)$.
energy is infinite, the remarkable twist is that the combination in (7.46) yields a function
of τ that is square-integrable.
Proof We consider the scenario $t_1 \ge 0$. First, we note that (7.47) with γ = 1 and $t = t_1$ simplifies to
$$h_{1,p}(t_1, \tau) = \mathbb{1}_{(0,t_1]}(\tau),$$
which is compactly supported and hence rapidly decaying (see Figure 7.7b). For the other values of γ, we split the range of integration into two parts in order to handle the singularities separately from the decay of the tail.
(1) $\tau \in [-t_1, t_1]$ with $t_1 > 0$ finite. This part of the $L_p$-integral will be bounded if $\int_0^1 \tau^{p(\gamma-1)}\, d\tau < \infty$. This happens for $1 - \frac{1}{p} < \gamma$, which is always satisfied for p = 2 since $\gamma = H + \frac{1}{2} > \frac{1}{2}$.
(2) $\tau \in (-\infty, -t_1)$. Here, we base our derivation on the factorized representation $I_{0,2}^{\gamma} = I_0\, D^{-\gamma+1}$, which follows from (7.45). The operator $D^{-\gamma+1}$ is a shift-invariant operator. Its impulse response is given by
$$\rho_{D^{\gamma-1}}(t) = \mathcal{F}^{-1}\left\{\frac{1}{(j\omega)^{\gamma-1}}\right\}(t) = \frac{t_+^{\gamma-2}}{\Gamma(\gamma - 1)},$$
Taking advantage of Proposition 7.9, we invoke Theorem 7.1 to show that the generalized process $B_H(t) = I_0 D^{-H+1/2} w(t)$ admits a classical interpretation as a random function of t. The theorem also yields the stochastic integral representation
$$B_H(t) = \int_{\mathbb{R}} h_{\gamma,2}(t, \tau)\, W(\mathrm{d}\tau) = \frac{1}{\Gamma(\gamma)}\int_{\mathbb{R}} \big((t - \tau)_+^{\gamma-1} - (-\tau)_+^{\gamma-1}\big)\, W(\mathrm{d}\tau),$$
where the last equation with γ = H + 1/2 is the one that was originally used by Mandelbrot and Van Ness to define fBm. Here, $W(\tau) = B_{1/2}(\tau)$ is the standard Brownian motion whose derivative in the sense of generalized functions is w.
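A crude numerical transcription of this representation reads as follows (a sketch only: the truncation interval, the grid step, and the choice H > 1/2, which keeps the kernel bounded, are ours):

```python
# Minimal sketch: fBm samples via a discretized Mandelbrot-Van Ness integral
# with the kernel h_{gamma,2}(t, tau) of (7.46), gamma = H + 1/2.
import numpy as np
from math import gamma as Gamma

rng = np.random.default_rng(3)
H, dt, T0, T = 0.75, 0.01, 50.0, 1.0
g = H + 0.5
tau = np.arange(-T0, T, dt)
dW = rng.normal(scale=np.sqrt(dt), size=tau.size)   # Brownian increments

def fbm(t):
    """B_H(t) ~ (1/Gamma(g)) int [(t-tau)_+^{g-1} - (-tau)_+^{g-1}] dW(tau)."""
    k = (np.clip(t - tau, 0, None) ** (g - 1)
         - np.clip(-tau, 0, None) ** (g - 1)) / Gamma(g)
    return np.sum(k * dW)

print(fbm(0.0), fbm(0.5), fbm(1.0))                 # fbm(0.0) = 0 by construction
```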
While this result is reassuring, the real power of the distributional approach is that it
naturally lends itself to generalizations. For instance, we may extend the definition to
larger values of H by applying an additional number of integrals. The relevant adjoint
operator is $I_0^{n*} = (I_0^*)^n$, whose Fourier-domain expression is
$$(I_0^*)^n\{\varphi\}(\tau) = \int_{\mathbb{R}} \frac{\hat\varphi(\omega) - \sum_{k=0}^{n}\hat\varphi^{(k)}(0)\frac{\omega^k}{k!}}{(-j\omega)^n}\, e^{j\omega\tau}\,\frac{d\omega}{2\pi}. \quad (7.49)$$
This formula, which is consistent with the form of $I_{0,2}^{\gamma*}$ in Table 7.1 with γ = n, follows from the definition of $I_0^*$ and the property that the Fourier transform of $\varphi \in \mathcal{S}(\mathbb{R})$ admits the Taylor-series representation $\hat\varphi(\omega) = \sum_{k=0}^{\infty}\hat\varphi^{(k)}(0)\frac{\omega^k}{k!}$, which yields the required limits at the origin. This allows us to consider the higher-order version of (7.43) for $\gamma = H + \frac{1}{2}$ with $H \in \mathbb{R}^+\backslash\mathbb{N}$, the solution of which is formally expressed as
$$B_H = I_0^n\, D^{-(H-n)+\frac{1}{2}} w,$$
where $n = \lfloor H\rfloor$. This construction translates into the extended version of fBm specified by the characteristic functional
$$E\{e^{j\langle\varphi, B_H\rangle}\} = \hat{P}_{B_H}(\varphi) = \exp\left(-\frac{1}{2}\int_{\mathbb{R}} \left|\frac{\hat\varphi(\omega) - \sum_{k=0}^{\lfloor H\rfloor}\hat\varphi^{(k)}(0)\frac{\omega^k}{k!}}{(-j\omega)^{H+1/2}}\right|^2 \frac{d\omega}{2\pi}\right), \quad (7.50)$$
where the case of interest for fBm is p = 2 (see examples in Figure 7.7). For γ = n ∈ N,
the expression simplifies to
$$h_{n,p}(t, \tau) = I_0^{n*}\{\delta(\cdot - t)\}(\tau) = \begin{cases} \mathbb{1}_{(0,t]}(\tau)\,\frac{(t - \tau)^n}{n!}, & \text{for } 0 < t \\ -\mathbb{1}_{[t,0)}(\tau)\,\frac{(t - \tau)^n}{n!}, & \text{for } t < 0, \end{cases}$$
For p = 2, this covers the whole range of non-integer Hurst exponents $H = \gamma - 1/2$ so that the classical interpretation of $B_H(t)$ remains applicable. Moreover, since $I_0^n$ is a right inverse of $D^n$, these processes satisfy the extended boundary conditions
$$B_H^{(k)}(0) = 0$$
for $k = 0, \ldots, \lceil H\rceil - 1$.
Similarly, we can fractionally integrate the SαS noise wα to generate stable self-
similar processes that are inherently sparse for α < 2. The operation is acceptable as
long as $I_{0,p}^{\gamma}$ meets the $L_p$-stability requirement of Theorem 5.8 with p = α, which we restate as
$$H_\alpha = \gamma - 1 + \frac{1}{\alpha} \in \mathbb{R}^+\backslash\mathbb{N}.$$
Interestingly enough, this is the same condition as in Proposition 7.10. This ensures that
the so-constructed fractional stable motions (fSm) are well defined in the generalized sense (by the Minlos–Bochner theorem), as well as in the ordinary sense (by Theorem 7.1). Statement (4) of Theorem 7.1 also indicates that the parameter $H_\alpha$ actually
represents the Hurst exponent of these processes, which is consistent with the analysis
of Section 7.5.1 (in particular, (7.41) with d = 1).
It is also possible to define stable fractal processes by considering any other variant $\partial_\tau^{\gamma}$ of fractional derivative within the family of scale-invariant operators (see Proposition 5.6). Unlike in the Gaussian case where the Fourier phase is irrelevant, the fractional SDE $\partial_\tau^{\gamma} s_{\alpha,H} = w_\alpha$ will specify a wider variety of self-similar processes with the same overall properties.
where the sequence {rk } forms a Poisson point process in R+ and the amplitudes Ak
are i.i.d. Clearly, realizations of the above process correspond to random polynomial
splines of degree zero with random knots (see Figure 1.1). This characterization on the
half-axis is consistent with the one given on the full axis in Equation (1.14).
More generally, we may define fractional processes with a compound-Poisson inno-
vation and the fractional integrators identified in Section 5.5.1, which give rise to ran-
dom fractional splines. The subclass of these generalized Poisson processes associated
with causal fractional integrators finds the following representation as a piecewise frac-
tional polynomial on the positive half-axis:
$$\sum_{k\in\mathbb{Z}} A_k\, (r - r_k)_+^{\gamma - 1}.$$
They are whitened by the causal fractional-derivative operator Dγ whose Fourier
symbol is (jω)γ . Some examples of such processes are given in the right column of
Figure 7.6.
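Such random fractional splines are straightforward to synthesize; a minimal sketch (Poisson knots, i.i.d. Gaussian amplitudes, and all parameters chosen arbitrarily for illustration):

```python
# Minimal sketch: random fractional spline sum_k A_k (r - r_k)_+^{gamma-1}
# with Poisson-distributed knots r_k on [0, T].
import numpy as np

rng = np.random.default_rng(4)
lam, T, gam = 1 / 32, 512.0, 1.6
n_knots = rng.poisson(lam * T)
r_k = rng.uniform(0, T, size=n_knots)        # Poisson point process on [0, T]
A_k = rng.normal(size=n_knots)               # i.i.d. amplitudes

r = np.linspace(0, T, 4096)
s = np.sum(A_k[:, None]
           * np.clip(r[None, :] - r_k[:, None], 0, None) ** (gam - 1), axis=0)
print(s[:5])
```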
Figure 7.8 Gaussian fBm (top row) vs. sparse generalized Poisson images (bottom row) in 2-D.
The fields in each column share the same operator and second-order statistics.
variable with parameter λ Vol(Ω), where λ is a fixed density parameter and Vol(Ω) is the volume (Lebesgue measure) of Ω. Second, fixing Ω and conditioning on the number of Diracs therein, the distribution of the position of each Dirac is uniform in Ω and independent of all the other Diracs (their amplitudes are also independent). We can
represent a realization of w symbolically as
$$w(r) = \sum_{k\in\mathbb{Z}} A_k\,\delta(r - r_k),$$
where Ak are the i.i.d. amplitudes with some generalized probability density pA , the
rk are the positions coming from a Poisson innovation (i.e., fulfilling the above require-
ments) independent from the Ak , and the ordering (the index k) is insignificant. We recall
that, by Theorem 4.9, the characteristic functional of the compound-Poisson process is
related to the density of the said amplitudes by
" #
PwPoisson (ϕ) = exp λ e jaϕ(r)
− 1 pA (a) da dr .
Rd R
We shall limit ourselves here to amplitude distributions that have, at a minimum, a finite
first moment.
The combination of $\hat{P}_{w_{\mathrm{Poisson}}}$ with the operators noted previously involves no addi-
tional subtleties compared to the case of α-stable innovations, at least for the processes
with finite mean that we are considering here. It is, however, noteworthy to recall that
the compound-Poisson Lévy exponents are p-admissible with p = 1 and, in the special
case of symmetric and finite-variance amplitude distributions, for any p ∈ [1, 2] (see
Proposition 4.5). The inverse operator with which $\hat{P}_{w_{\mathrm{Poisson}}}$ is composed therefore needs to be chosen/constructed such that it maps $\mathcal{S}(\mathbb{R}^d)$ continuously into $L_p(\mathbb{R}^d)$. Such operators were identified in Section 5.4.2 for integer-order SDEs and Sections 5.5.3 and 5.5.1 for fractional-order derivatives and Laplacians.
where ρL is the Green’s function of L and p0 (r) ∈ NL is a component in the null space
of the whitening operator. The reason for the presence of p0 is to satisfy the boundary
conditions at the origin that are imposed upon the process.
We provide examples of the outcome of applying inverse fractional Laplacians to a
Poisson field in Figure 7.8. The fact that the Green’s function of the underlying fractio-
nal Laplacian is proportional to $r^{H-1}$ explains the change of appearance of these pro-
cesses when switching from the singular mode with H < 1 to the continuous one with
H > 1. As H increases further, the generalized Poisson processes become smoother and
more similar visually to their fractal Gaussian counterparts (fractional Brownian field),
which are displayed on the upper row of Figure 7.8 for comparison. Next, we shall look
at another interesting example, the Mondrian processes.
$$\sum_{k\in\mathbb{Z}} A_k\,\mathbb{1}_{r_1, r_2 \ge 0}(r - r_k),$$
in direct parallel to (7.52), where the rk are distributed uniformly in any neighborhood
in the positive quarter-plane (Poisson point distribution).
A sample realization of this process, which bears some resemblance to paintings by
Piet Mondrian, is shown in Figure 7.9.
Section 7.1
The first-order processes considered in Section 7.1 are often referred to as non-Gaussian
Ornstein–Uhlenbeck processes; one of their privileged areas of application is financial
modeling [BNS01]. The representation of the autocorrelation in terms of B-splines is
discussed in [KMU11, Section II.B], albeit in the purely Gaussian context. The fact
that exponential B-splines provide the formal connection between ordinary continuous-
domain differential operators and their discrete-domain finite-difference counterpart
was first made explicit in [Uns05].
Section 7.2
Our formulation relies heavily on Gelfand and Vilenkin’s theory of generalized sto-
chastic processes, in which stationarity plays a predominant role [GV64]. The main
analytical tools are the characteristic and the correlation functionals which have been
used with great effect by Russian probabilists in the 1970s and 1980s (e.g., [Yag86]),
but which are lesser known in Western circles. The probable reason is that the domi-
nant paradigm for the investigation of stochastic processes is measure-theoretic – as
opposed to distributional – meaning that processes are defined through stochastic inte-
grals (with the help of Itô’s calculus and its non-Gaussian variants) [Øks07, Pro04].
Both approaches have their advantages and limitations. On one hand, the Itô calculus
can handle certain non-linear operations on random processes that cannot be dealt with
in the distributional approach. Gelfand’s framework, on the other hand, is ideally suited
for performing any kind of linear operations, including some, such as fractional deriva-
tives, which are much more difficult to define in the other framework. Theorem 7.1 is
fundamental in that respect because it provides the higher-level elements for performing
the translation between the two modes of representation.
Section 7.3
The classical theory of stationary processes was developed for the most part in the
1930s. The concept of (second-order) stationarity is due to Khintchine, who established
the correlation properties of such processes [Khi34]. Other key contributors include
Wiener [Wie30], Doob [Doo37], Cramér [Cra40], and Kolmogorov [Kol41]. Classical
textbooks on stochastic processes are [Bar61,Doo90,Wie64]. The manual by Papoulis is
very popular among engineers [Pap91]. The processes described by Proposition 7.2 are
often called linear processes (under the additional finite-variance hypothesis) [Bar61].
Their conventional representation as a stochastic integral is
$$s(t) = \int_{\mathbb{R}} h(t - \tau)\, dW(\tau) = \int_{-\infty}^{t} h(t - \tau)\, dW(\tau), \quad (7.53)$$
which coincides with the Lévy exponent f of our formulation (see Proposition 4.12). A
special case of the model is the so-called shot noise, which is obtained by taking w to
be a pure Poisson noise [Ric77, BM02].
Implicit in Proposition 7.3 is the fact that the autocorrelation of a second-order stationary process is a cardinal $L^*L$-spline.
between smoothing-spline techniques and linear mean-square estimation in the sense of
Wiener [UB05b]. It also simplifies the identification of the whitening operator L from
sampled data [KMU11].
The Gaussian CARMA processes are often referred to as Gaussian stationary
processes with rational spectral density. They are the solutions of ordinary sto-
chastic differential equations and are treated in most books on stochastic processes
[Bar61, Doo90, Pap91]. They are usually defined in terms of a stochastic integral such
as (7.53) where h is the impulse response of the underlying Nth-order differential
system and W(t) a standard Brownian motion. Alternatively, one can solve the SDE
using state-space techniques; this is the approach taken by Brockwell to characterize
the extended family of Lévy-driven CARMA processes [Bro01]. Brockwell makes
the assumption that $E\{|W(1)|^{\epsilon}\} < \infty$ for some $\epsilon > 0$, which is equivalent to our
Lévy–Schwartz admissibility condition (see (9.10) and surrounding text) and makes the
formulations perfectly compatible. The CARMA model can also be extended to N ≤ M,
in which case it yields stationary processes that are no longer defined pointwise. The
Gaussian theory of such generalized CARMA processes is given in [BH10]. In light of
what has been said, such results are also transposable to the more general Lévy setting,
since the underlying convolution operator remains S -continuous (under the stability
assumption that there is no pole on the imaginary axis).
Section 7.4
The founding paper that introduces Lévy processes – initially called additive processes –
is the very same that uncovers the celebrated Lévy–Khintchine formula and specifies the
family of α-stable laws [Lév34]. We can therefore only concur with Loève’s statement
about its historical importance (see Section 1.4). Besides Lévy’s classical monograph
[Lév65], other recommended references on Lévy processes are [Sat94, CT04, App09].
The book by Cont and Tankov is a good starting point, while Sato’s treatise is remark-
able for its precision and completeness.
The operator-based method of resolution of unstable stochastic differential equa-
tions that is deployed in this section was developed by the authors with the help of
Qiyu Sun [UTS14]. The approach provides a rigorous backing for statements such as
“a Lévy process is the integral of a non-Gaussian white noise,” “a continuous-domain
non-Gaussian white noise is the (weak) derivative of a Lévy process,” or “a Lévy pro-
cess is a non-Gaussian process with a 1/ω spectral behavior,” which are intuitively
appealing, but less obviously related to the conventional definition. The main advantage
of the framework is that it allows for the direct transposition of deterministic methods
for solving linear differential equations to the stochastic setting, irrespective of any
stability considerations. Most of the examples of sparse stochastic processes are taken
from [UTAK14].
Section 7.5
Fractional Brownian motion (fBm) with Hurst exponent 0 < H < 1 was introduced
by Mandelbrot and Van Ness in 1968 [MVN68]. Interestingly, there is an early paper
by Kolmogorov that briefly mentions the possibility of defining such stochastic objects
[Kol40]. The multidimensional counterparts of these processes are fractional Brownian
fields [Man01], which are also solutions of the fractional SDE $(-\Delta)^{\gamma/2} s = w$ where w
is a Gaussian white noise [TVDVU09]. The family of stable self-similar processes with
0 < H < 1 is investigated in [ST94], while their higher-order extensions are briefly
considered in [Taf11].
The formulation of fBms as generalized stochastic processes was initiated by Blu
and Unser in order to establish an equivalence between MMSE estimation and spline
interpolation [BU07]. A by-product of this study was the derivation of (7.42) and (7.50).
The latter representation is compatible with the higher-order generalization of fBm for
H ∈ R+ \Z+ proposed by Perrin et al. [PHBJ+ 01], with the underlying kernel (7.51)
being the same.
The scale-invariant Poisson processes of Section 7.5.3 were introduced by the authors
[UT11]; they are the direct stochastic counterparts of spline functions where the knots
and the strength of the singularities are assigned in a random fashion. The examples,
including the Mondrian process, are taken from that paper on sparse stochastic
processes.
8 Sparse representations
When the Lévy process is sampled uniformly on the integer grid, we have access to its equispaced increments
$$u[k] = W(k) - W(k - 1), \quad (8.1)$$
which are i.i.d. Moreover, due to the property that W(0) = 0 (almost surely), the relation can be inverted as
$$W(k) = \sum_{n=1}^{k} u[n]. \quad (8.2)$$
In terms of the innovation w = DW, the increments admit the representation
$$u[k] = \langle\beta_0^{\vee}(\cdot - k), w\rangle = u_0(t)\big|_{t=k}, \quad (8.3)$$
where $\beta_0 = \beta_D = \beta_+^0$ is the rectangle B-spline defined by (1.4) (or (6.21) with α = 0) and w is an innovation process with Lévy exponent f. Based on Proposition 7.4, we deduce that $u_0(t)$, which is defined for all $t \in \mathbb{R}$, is stationary with characteristic functional $\hat{P}_{u_0}(\varphi) = \hat{P}_w(\beta_0^{\vee} * \varphi)$, where $\hat{P}_w$ is defined by (7.21). Let us now consider the random variable $U = \langle\delta(\cdot - k), u_0\rangle = u[k]$. To obtain its characteristic function, we rely on the theoretical framework of Section 8.2 and simply substitute $\varphi = \omega\delta(\cdot - k)$ in $\hat{P}_{u_0}(\varphi)$, which yields
$$\hat{p}_U(\omega) = \hat{P}_w\big(\omega\beta_0^{\vee}(\cdot - k)\big) = \hat{P}_w\big(\omega\beta_0^{\vee}\big) \quad \text{(by stationarity)}$$
$$= \exp\left(\int_{\mathbb{R}} f\big(\omega\beta_0^{\vee}(t)\big)\, dt\right)$$
$$= e^{f(\omega)} = \hat{p}_{\mathrm{id}}(\omega). \quad (8.4)$$
The disappearance of the integral results from the binary nature of β0∨ and the fact that
f (0) = 0. This shows that the increments of the Lévy process are infinitely divisible (id)
with canonical pdf pU = pid , which corresponds to the observation of the innovation
through a unit rectangular window (see Proposition 4.12). Likewise, we find that the
joint characteristic function of $U_1 = \langle\beta_0^{\vee}(\cdot - k_1), w\rangle = u[k_1]$ and $U_2 = \langle\beta_0^{\vee}(\cdot - k_2), w\rangle = u[k_2]$ for any $|k_1 - k_2| \ge 1$ factorizes as
$$\hat{p}_{U_1,U_2}(\omega_1, \omega_2) = \hat{P}_w\big(\omega_1\beta_0^{\vee}(\cdot - k_1) + \omega_2\beta_0^{\vee}(\cdot - k_2)\big) = \hat{p}_U(\omega_1)\,\hat{p}_U(\omega_2),$$
where $\hat{p}_U$ is given by (8.4). Here, we have used the fact that the supports of $\beta_0^{\vee}(\cdot - k_1)$ and $\beta_0^{\vee}(\cdot - k_2)$ are disjoint, together with the independence at every point of w (Proposition 4.11). This implies that the random variables $U_1$ and $U_2$ are independent
for all pairs of distinct indices (k1 , k2 ), which proves that the sequence of Lévy incre-
ments {u[k]}k∈Z is i.i.d. with pdf pU = pid . The bottom line is that the decoupling of the
samples of a Lévy process afforded by (8.1) is perfect and that this transformation is
reversible.
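This perfect decoupling is also easy to probe empirically: by (7.22), the joint characteristic function of two successive increments must factorize into the product of the marginals. A minimal sketch (compound-Poisson-type increments and test frequencies chosen arbitrarily):

```python
# Minimal sketch: empirical check that successive unit increments of a simulated
# Levy-type process are independent (joint characteristic function factorizes).
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
u = rng.normal(size=n) * (rng.random(n) < 1 / 32)   # i.i.d. sparse increments
W = np.concatenate(([0.0], np.cumsum(u)))           # Levy-type path, W(0) = 0

d = np.diff(W)                                      # recover the increments u[k]
w1, w2 = 0.9, -1.3                                  # arbitrary test frequencies
joint = np.mean(np.exp(1j * (w1 * d[:-1] + w2 * d[1:])))
prod = np.mean(np.exp(1j * w1 * d)) * np.mean(np.exp(1j * w2 * d))
print(abs(joint - prod))                            # ~ 0, up to sampling error
```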
The alternative is to expand the Lévy process in a wavelet basis that is matched to the
operator L = D. To that end, we revisit the Haar analysis of Section 1.3.4 by applying
the generalized wavelet framework of Section 6.5.3 with d = 1 and (scalar) dilation
D = 2. The required ingredient is the $D^*D$-interpolator with respect to the grid $2^{i-1}\mathbb{Z}$: $\varphi_{\mathrm{int},i-1}(t) = \beta_{(0,0)}(t/2^{i-1} + 1)$, which is a symmetric triangular function of unit height and support $[-2^{i-1}, 2^{i-1}]$ (piecewise-linear B-spline). The specialized form of (6.54) then yields the derivative-like wavelets
$$\psi_{i,k} = D^*\phi_i(\cdot - 2^{i-1}k),$$
where $\phi_i(t) = 2^{i/2-1}\varphi_{\mathrm{int},i-1}(t) \propto \phi_0(t/2^i)$ is the normalized triangular kernel at resolution level i and $k \in \mathbb{Z}\backslash 2\mathbb{Z}$. The difference with the classical Haar-wavelet formulas
(1.19) and (1.20) is that the polarity is reversed (since D∗ = −D) and the location index
k restricted to the set of odd integers. Apart from this change in indexing convention,
required by the general (multidimensional) formulation, the two sets of basis functions
are equivalent (see Figure 1.4). Next, we recall that the Lévy process can be represented
as s = I0 w, where I0 is the integral operator defined by (5.4). This allows us to express
its Haar wavelet coefficients as
$$V_i[k] = \langle\psi_{i,k}, s\rangle = \langle D^*\phi_i(\cdot - 2^{i-1}k), I_0 w\rangle = \langle I_0^* D^*\phi_i(\cdot - 2^{i-1}k), w\rangle \quad \text{(by duality)}$$
$$= \langle\phi_i(\cdot - 2^{i-1}k), w\rangle, \quad \text{(left-inverse property of } I_0^*)$$
which is very similar to (8.3), except that the rectangular smoothing kernel $\beta_0^{\vee}$ is replaced by a triangular one. By considering the continuous version $V_i(t) = \langle\phi_i(\cdot - t), w\rangle$ of the wavelet transform at scale i, we then invoke Proposition 7.2 to show that $V_i(t)$ is stationary with characteristic functional $\hat{P}_{V_i}(\varphi) = \hat{P}_w(\phi_i * \varphi)$. Moreover, since the smoothing kernels $\phi_i(\cdot - 2^{i-1}k)$ for i fixed and odd indices k are not overlapping, we deduce that the sequence of Haar-wavelet coefficients $\{V_i[k]\}_{k\in\mathbb{Z}\backslash 2\mathbb{Z}}$ is i.i.d. with characteristic function $\hat{p}_{V_i}(\omega) = E\{e^{j\omega\langle\phi_i, w\rangle}\} = \hat{P}_w(\omega\phi_i)$.
If we now compare the wavelet situation for i = 1 with that of the Lévy increments,
we observe that the wavelet analysis involves one more layer of smoothing of the inno-
vation with β0 (since φ1 = ϕint,0 = β0 ∗ β0∨ ), which slightly complicates the statistical
calculations. For i > 1, there is an additional coarsening effect which has some interest-
ing statistical implications (see Section 9.8).
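The extra smoothing layer can be made tangible numerically. In the sketch below (Gaussian innovation; grid step and lengths are arbitrary), scale-1 coefficients are obtained by sampling the triangle-smoothed innovation at spacing 2, and their variance matches $\|\phi_1\|_{L_2}^2 = 2/3$ for a unit-variance Gaussian innovation:

```python
# Minimal sketch: scale-1 coefficients V_1[k] = <phi_1(. - t_k), w> obtained by
# smoothing a discretized Gaussian innovation with the triangle phi_1 = beta_0 * beta_0^v.
import numpy as np

rng = np.random.default_rng(6)
dt = 0.01
m = int(1 / dt)
w = rng.normal(scale=1 / np.sqrt(dt), size=200 * m)  # white Gaussian innovation

tri = 1 - np.abs(np.arange(-m, m + 1) * dt)          # triangle supported on [-1, 1]
V = np.convolve(w, tri)[m:-m] * dt                   # V_1(t) = <phi_1(. - t), w>
V1 = V[::2 * m]                                      # non-overlapping windows
print(V1.var(), 2 / 3)                               # variance = ||phi_1||_2^2
```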
While the smoothing effect on the innovation is qualitatively the same in both sce-
narios, there are fundamental differences too. In the wavelet case, the underlying dis-
crete transform is orthogonal, but the coefficients are not fully decoupled because of the residual dependencies across scales.
The implicit assumption here is that the process s has a sufficient degree of continuity
for its sample values to be well defined. Since s = L−1 w, it is actually sufficient that the
operator L−1 satisfies some mild regularity conditions. The simplest scenario is when
L−1 is LSI with its impulse response ρL belonging to some function space X ⊆ L1 (Rd ).
Then,
$$s[k] = \langle\delta(\cdot - k), s\rangle = \langle\rho_L^{\vee}(\cdot - k), w\rangle.$$
Now, if the characteristic functional of the noise admits a continuous extension from $\mathcal{S}(\mathbb{R}^d)$ to $\mathcal{X}$ (specifically, $\mathcal{X} = \mathcal{R}(\mathbb{R}^d)$ or $\mathcal{X} = L_p(\mathbb{R}^d)$, as shown in Section 8.2.2), we obtain the characteristic function of $s[k]$ by replacing $\varphi \in \mathcal{X}$ by $\omega\rho_L^{\vee}(\cdot - k)$ in $\hat{P}_w(\varphi)$.
The continuity and positive definiteness of Pw ensure that the sample values of the
process have a well-defined probability distribution (by Bochner’s Theorem 3.7) so that
they can be interpreted as conventional random variables. The argument carries over to
the non-shift-invariant scenario as well, by replacing the samples by their generalized
increments. This is equivalent to sampling the generalized increment process u = Ld s,
which is stationary by construction (see Proposition 7.4).
The second discretization option is to expand s in a suitable basis whose expansion
coefficients (or projections) are given by
$$Y_n = \langle\psi_n, s\rangle,$$
where the $\psi_n$ are appropriate analysis functions (typically, wavelets). The argument concerning the existence of the underlying expansion coefficients as conventional random variables is based on the following manipulation:
$$Y_n = \langle\psi_n, L^{-1}w\rangle = \langle L^{-1*}\psi_n, w\rangle = \langle\phi_n, w\rangle.$$
The condition for admissibility, which is essentially the same as in the previous scenario, is that $\phi_n = L^{-1*}\psi_n \in \mathcal{X}$, subject to the constraint that $\hat{P}_w$ admits an extension that is continuous and positive definite over $\mathcal{X}$. This is consistent with the notion of L-admissible wavelets (Definition 6.7), which asks that $\psi = L^*\phi$ with $\phi \in \mathcal{X} \subseteq L_1(\mathbb{R}^d)$.
The proof is the same as that of Theorem 4.8. It suffices to replace all occurrences of
S (Rd ) by R (Rd ) and to restate the pointwise inequalities of the proof in the “almost
everywhere” sense. The reason is that the only property that matters in the proof is the
decay of ϕ, while its smoothness is irrelevant to the argument.
As we show next, reducing the constraints on the decay of the analysis functions is
possible too, by imposing additional conditions on f.
$$F(\varphi) = \log\hat{P}_w(\varphi) = \int_{\mathbb{R}^d} f\big(\varphi(r)\big)\, dr$$
is continuous over Lp (Rd ) and conditionally positive definite of order one (see
Definition 4.6). To that end, we first observe that F(ϕ) is well defined for all ϕ ∈ Lp (Rd )
with
$$|F(\varphi)| \le \int_{\mathbb{R}^d} \big|f\big(\varphi(r)\big)\big|\, dr \le C\|\varphi\|_p^p,$$
due to the p-admissibility of f. This assumption also implies that $|\omega|\,|f'(\omega)| \le C|\omega|^p$, which translates into
$$|f(\omega_2) - f(\omega_1)| = \left|\int_{\omega_1}^{\omega_2} f'(\omega)\, d\omega\right| \le C\left|\int_{\omega_1}^{\omega_2} |\omega|^{p-1}\, d\omega\right| \le C\,\max(|\omega_1|^{p-1}, |\omega_2|^{p-1})\, |\omega_2 - \omega_1|,$$
where we have used the fact that max(a, b) ≤ |a| + |b − a|. Next, we consider a
convergent sequence {ϕn }∞ d
n=1 in Lp (R ) whose limit is denoted by ϕ. We then have
that
" #
F(ϕn ) − F(ϕ) ≤ C |ϕ(r)|p−1 |ϕn (r) − ϕ(r)| + |ϕn (r) − ϕ(r)|p dr
"R
d
#
≤ C ϕp ϕn − ϕp + ϕn − ϕp ,
p−1 p
p
(by Hölder’s inequality with q = p−1 )
where the right-hand side converges to zero as $\lim_{n\to\infty}\|\varphi_n - \varphi\|_p = 0$, which proves
the continuity of F on Lp (Rd ). The second part of the statement is a direct conse-
quence of the conditional positive definiteness of f. Indeed, for every choice ϕ1 , . . . ,
ϕN ∈ Lp (Rd ), ξ1 , . . . , ξN ∈ C, and N ∈ Z+ , we have that
$$\sum_{m=1}^{N}\sum_{n=1}^{N} F(\varphi_m - \varphi_n)\,\xi_m\bar{\xi}_n = \int_{\mathbb{R}^d} \underbrace{\sum_{m=1}^{N}\sum_{n=1}^{N} f\big(\varphi_m(r) - \varphi_n(r)\big)\,\xi_m\bar{\xi}_n}_{\ge 0}\, dr \ge 0$$
subject to the constraint $\sum_{n=1}^{N}\xi_n = 0$.
We recall that the generalized increment process associated with the stochastic process $s = L^{-1}w$ is defined as $u = L_d s$.
$\beta_L = L_d\rho_L \in L_1(\mathbb{R}^d)$. Then, the operator $L^{-1}: \varphi(r) \mapsto \int_{\mathbb{R}^d} h(r, r')\varphi(r')\, dr'$ is a valid right inverse of L if its kernel is of the form
$$h(r, r') = L^{-1}\{\delta(\cdot - r')\}(r) = \rho_L(r - r') + p_0(r; r'),$$
where $p_0(r; r')$ with $r'$ fixed is included in the null space $\mathcal{N}_L$ of L. Moreover, we have that
$$L_d L^{-1}\varphi = \beta_L * \varphi \quad (8.5)$$
$$L^{-1*}L_d^*\varphi = \beta_L^{\vee} * \varphi \quad (8.6)$$
for all $\varphi \in \mathcal{S}(\mathbb{R}^d)$.
Proof The equivalent kernel (see kernel Theorem 3.1) of the composed operator $LL^{-1}$ is
$$LL^{-1}\{\delta(\cdot - r')\} = L\{h(\cdot, r')\} = L\{\rho_L(\cdot - r')\} + \underbrace{L\{p_0(\cdot; r')\}}_{0} = \delta(\cdot - r'),$$
which proves that LL−1 = Id, and hence that L−1 is a valid right inverse of L. Next, we
consider the composed operator Ld L−1 and apply the same procedure to show that
$$L_d L^{-1}\{\delta(\cdot - r')\} = \beta_L(\cdot - r'),$$
where we have used the property that Ld p0 (·; r ) = 0 since NL is included in the null
space of $L_d$ (see (6.24) and text immediately below). It follows that
$$u = L_d s = L_d L^{-1} w = \beta_L * w. \quad (8.7)$$
Since this relation is central to our argument, we want to detail the explicit steps of its
derivation. Specifically, for all ϕ ∈ S (Rd ), we have that
$$\langle\varphi, u\rangle = \langle\varphi, \mathrm{L}_d s\rangle = \langle\varphi, \mathrm{L}_d\mathrm{L}^{-1}w\rangle = \langle\mathrm{L}^{-1*}\mathrm{L}_d^*\varphi,\, w\rangle \quad\text{(by duality)} = \langle\beta_L^\vee * \varphi,\, w\rangle \quad\text{(from (8.6))}.$$
This in turn implies that the characteristic functional of u = Ld s is given by
$$\widehat{\mathscr{P}}_u(\varphi) = \widehat{\mathscr{P}}_w\big(\beta_L^\vee * \varphi\big). \qquad (8.8)$$
This points once more to the importance of selecting the discrete operator Ld such that
the support of βL = Ld ρL is minimal or decays as fast as possible when a compact
support is not achievable.
This analysis clearly shows that the support of the B-spline governs the range of
dependency of the generalized increments with the property that u[k1 ] and u[k2 ] are
independent whenever |k1 − k2 | ≥ support(βL ). In particular, this implies that the
sequence u[·] is i.i.d. if and only if support(βL ) ≤ 1, which is precisely the case for
the first-order B-splines βα with α ∈ C, which go hand-in-hand with the Lévy (α = 0)
and AR(1) processes.
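To make the decoupling tangible, here is a minimal simulation sketch (ours, assuming a Laplace-distributed innovation as a simple sparse example): the samples of the integrated process are strongly correlated, while their finite differences are not.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
w = rng.laplace(size=N)          # i.i.d. innovation -> sparse Levy-type process
s = np.cumsum(w)                 # s = L^{-1} w with L = d/dt (alpha = 0)
u = np.diff(s)                   # generalized increments u = L_d s

def lag1_corr(x):
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

print(f"lag-1 correlation of s: {lag1_corr(s):+.3f}")   # close to 1
print(f"lag-1 correlation of u: {lag1_corr(u):+.3f}")   # close to 0
```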
where d_α and b_L^+ are defined by (8.12) and (8.14), respectively. The driving term e[k]
is a discrete stationary white noise (“white” meaning fully decorrelated or with a flat
power spectrum). However, e[k] is a valid innovation sequence with i.i.d. samples only
if the corresponding continuous-domain process is Gaussian, or, in full generality (i.e.,
in the non-Gaussian case), if it is a first-order Markov or Lévy-type process with N = 1.
Proof Since |B_L^+(e^{jω})|² = B_L(e^{jω}) is non-vanishing and is a trigonometric polynomial of e^{jω} whose roots are inside the unit circle, we have the guarantee that the inverse filter whose frequency response is 1/B_L^+(e^{jω}) is causal-stable. It follows that

$$\Phi_e(\mathrm{e}^{\mathrm{j}\omega}) = \frac{\sigma_w^2\,\sum_{n\in\mathbb{Z}}\big|\widehat{\beta}_L(\omega+2\pi n)\big|^2}{B_L(\mathrm{e}^{\mathrm{j}\omega})} = \sigma_w^2,$$

which proves the first part of the statement. As for the second part, we recall that decorrelation is equivalent to independence in the Gaussian case only. In the non-Gaussian case, the only way to ensure independence is by restricting ourselves to a first-order process, which results in an AR(1)-type equation with e[n] = u_{α1}[n]. Indeed, since the corresponding B-spline has support of size 1, u_{α1}[·] is i.i.d., which implies that p_U(u_{α1}[k] | {u_{α1}[k − m]}_{m∈Z^+}) = p_U(u_{α1}[k]). This is equivalent to s_{α1}[·] having the Markov property since

$$p_S\big(s_{\alpha_1}[k]\,\big|\,\{s_{\alpha_1}[k-m]\}_{m\in\mathbb{Z}^+}\big) = p_U\big(u_{\alpha_1}[k]\big) = p_S\big(s_{\alpha_1}[k]\,\big|\,s_{\alpha_1}[k-1]\big).$$
The Riesz-basis property of the B-spline ensures that the denominator of (8.15) is non-
vanishing, so that we may invoke Wiener’s lemma (Theorem 5.13) to show that the filter
is stable. Generally, the support of its impulse response is infinite, unless the support of
the B-spline is unity (Markov property).
filter L̃_d. This option is especially useful for decoupling fractal-type processes that are associated with fractional whitening operators whose discrete counterparts have an infinite support. The use of ordinary finite-difference operators, in particular, is motivated by Theorem 7.7 because of their stationarizing effect.
The guiding principle is to select a localization filter L̃_d that has a compact support and is associated with some "augmented" operator L̃ = L_0 L, where L_0 is a suitable differential operator. The natural candidate for L̃ is a (non-fractional) differential operator of integer order γ̃ ≥ γ whose null space is identical to that of L and whose B-spline β_{L̃} has a compact support. This latter function is given by

$$\beta_{\tilde L} = \tilde{\mathrm{L}}_d\,\rho_{\tilde L},$$
This shows that φ is a (finite) linear combination of the integer shifts of ρL (the Green’s
function of the original operator L), and hence is a cardinal L-spline (see Definition 6.4).
To describe the decoupling effect of L̃_d on the signal s = L^{-1}w, we observe that (8.17) is equivalent to φ = L̃_d L^{-1}δ, which yields

$$\tilde u = \tilde{\mathrm{L}}_d\,s = \tilde{\mathrm{L}}_d\mathrm{L}^{-1}w = \phi * w.$$
This equation is the same as (8.7), except that we have replaced the original B-spline
βL by the smoothing kernel defined by (8.16). The procedure is acceptable whenever
φ ∈ Lp (Rd ), and its decay at infinity is comparable to that of βL . We call such a scheme
a robust localization because its qualitative effect is the same as that of the canonical
operator Ld . For instance, we can rely on the results of Chapter 9 to show that the
statistical distributions of ũ and u have the same global properties (sparsity pattern).
The price to pay is a slight loss in decoupling power because the localization of φ is
worse than that of βL , itself being the best that can be achieved within the given spline
space (B-spline property).
To assess the remaining dependencies, it is instructive to determine the corresponding autocorrelation function

$$R_{\tilde u}(r) = \mathbb{E}\big\{\tilde u(\cdot + r)\,\tilde u(\cdot)\big\} = \sigma_w^2\,(\mathrm{L}_0\mathrm{L}_0^*)\big\{\beta_{\tilde L} * \beta_{\tilde L}^\vee\big\}(r), \qquad (8.18)$$

which primarily depends upon the size of the augmented-order B-spline β_{L̃}. Now, in the special case where L_0 is an all-pass operator with |L̂_0(ω)|² = 1, we have that |β̂_{L̃}(ω)|² = |β̂_L(ω)|², so that the autocorrelation functions of u = L_d s and ũ = L̃_d s are identical. This implies that the decorrelation effects of the localization filters L_d and L̃_d are equivalent, which justifies replacing one by the other.
where

$$d_{\alpha,0}[k] = (-1)^k\binom{\alpha+1}{\frac{\alpha+1}{2}+k} \qquad (8.21)$$

and

$$\rho_{\alpha,0}(t) = \mathcal{F}^{-1}\left\{\frac{1}{|\omega|^{\alpha+1}}\right\}(t) = \begin{cases}\dfrac{(-1)^{n+1}}{\pi\,(2n)!}\,t^{2n}\log|t|, & \text{for } \alpha = 2n \in 2\mathbb{N}\\[2mm] \dfrac{-1}{2\,\Gamma(\alpha+1)\sin\!\big(\frac{\pi}{2}\alpha\big)}\,|t|^{\alpha}, & \text{for } \alpha \in \mathbb{R}^+\setminus 2\mathbb{N}.\end{cases}$$
Observe that ρ_{α,0} is the Green's function of the fractional-derivative operator ∂_0^{α+1}, while the d_{α,0}[k] are the coefficients of the corresponding (canonical) localization filter. The simplest instance occurs for α = 1, where (8.20) reduces to

$$\beta_0^1(t) = \tfrac{1}{2}|t+1| - |t| + \tfrac{1}{2}|t-1|,$$

which is the triangular function supported in [−1, 1] (symmetric B-spline of degree one)
shown in Figure 8.1c. In general, however, the fractional B-splines β0α (t) with α ∈ R+
are not compactly supported, unless α is an odd integer. In fact, they can be shown to
decay like O(1/tα+2 ), which is a behavior that is characteristic of fractional operators.
[Figure 8.1 appears here: plots of the localization kernels discussed in the text; panel (c) shows the symmetric B-spline of degree one (triangle function).]
$$\phi_{\alpha,1}(t) = \partial_0^{1-\alpha}\beta_0^1(t) = \frac{1}{\Gamma(\alpha+1)\sin\!\big(\frac{\pi}{2}\alpha\big)}\Big(\tfrac{1}{2}|t+1|^{\alpha} - |t|^{\alpha} + \tfrac{1}{2}|t-1|^{\alpha}\Big).$$
This function decays like O(1/|t|^{2−α}) since the finite-difference operator asymptotically acts as a second-order derivative. The relevant functions for α = 1/2 are shown in Figure 8.1. The fractional-derivative operator ∂_0^{1/2} is a non-local operator. Its application to the compactly supported B-spline β_0^1 produces a kernel with algebraically decaying tails. While β_0^{1/2} and φ_{1/2,1} are qualitatively similar, the former function is better localized with smaller secondary lobes, which reflects the superior decoupling performance
of the canonical scheme. Yet, this needs to be balanced against the fact that the robust
version uses a three-tap filter (second-order finite difference), as opposed to the solution
(8.21) of infinite support dictated by the theory.
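The coefficients d_{α,0}[k] of (8.21) can be evaluated directly with the generalized binomial; the short sketch below (ours, not part of the original text) contrasts the three-tap filter obtained for α = 1 with the slowly decaying, infinitely supported solution for a fractional α:

```python
import numpy as np
from scipy.special import binom   # generalized binomial: accepts real arguments

for alpha in (0.5, 1.0):
    k = np.arange(-5, 6)
    d = (-1.0) ** k * binom(alpha + 1, (alpha + 1) / 2 + k)
    print(f"alpha = {alpha}:", np.round(d, 4))
```

For α = 1 the output is the second finite difference (−1, 2, −1) padded with exact zeros, whereas for α = 0.5 every tap is non-zero and decays only algebraically.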
The other option for uncoupling the information is to analyze the signal s(r) using
wavelet-like functions. The implicit assumption for the following properties is that we
have a real-valued wavelet basis available that is matched to the operator L. Specifically,
the structure of the transform must be such that the basis functions ψ^{(n)}_{i,k} at scale i are
where the φ^{(n)}_i with n = 1, ..., N_0 are scale-dependent smoothing kernels whose width is proportional to the scale a_i = det(D)^{i/d}, where D is the underlying dilation matrix (e.g., D = 2I for a standard dyadic scale progression). In the traditional wavelet transform, L is scale-invariant and the wavelets at resolution i are dilated versions of the primary ones at level i = 0 with φ^{(n)}_i(r) ∝ φ^{(n)}_0(D^{−i}r).
Here, for simplicity of notation, we shall focus on the general operator-like design
of Section 6.5 with N0 = 1, which has the advantage of involving the single mother
wavelet

$$\psi_i(r) = \mathrm{L}^*\phi_i(r),$$

while the complete set of basis functions is represented by

$$\psi_{i,k} = \mathrm{L}^*\phi_i\big(\cdot - D^{i-1}k\big) = \mathrm{L}^*\phi_{i,k} \qquad (8.22)$$

with i ∈ Z and k ∈ Z^d \ DZ^d. The technical assumption for the wavelet coefficients Y_i[k] = ⟨ψ_{i,k}, s⟩ to be well defined is that L^{-1∗}ψ_{i,k} = φ_{i,k} ∈ X, where X = R(R^d) or, in the event that f is p-admissible, X = L_p(R^d) with p ∈ [1, 2].
where we have used the fact that L−1∗ is a valid (continuous) left inverse of L∗ . Since
φi ∈ X , we can invoke Proposition 7.2 to prove the first part.
For the second part, we also use the fact that Y_{i,k} = ⟨ψ_{i,k}, s⟩ = ⟨φ_{i,k}, w⟩. Based on the definition of the characteristic functional P̂_w(φ) = E{e^{j⟨φ,w⟩}}, we then obtain

$$\widehat{p}_{Y_i}(\omega) = \mathbb{E}\big\{\mathrm{e}^{\mathrm{j}\omega Y_{i,k}}\big\} = \mathbb{E}\big\{\mathrm{e}^{\mathrm{j}\langle\omega\phi_{i,k},\,w\rangle}\big\} = \mathbb{E}\big\{\mathrm{e}^{\mathrm{j}\langle\omega\phi_i,\,w\rangle}\big\} \quad\text{(by stationarity and linearity)}$$
$$= \widehat{\mathscr{P}}_w(\omega\phi_i) = \exp\left(\int_{\mathbb{R}^d} f\big(\omega\phi_i(r)\big)\,\mathrm{d}r\right),$$
where we have inserted the explicit form given in (4.13). The result then follows by identification. Since P̂_w : X → C is a continuous, positive definite functional with P̂_w(0) = 1, we conclude that the function f_{φ_i} is conditionally positive definite of order one (by Schoenberg's correspondence, Theorem 4.7) so that it is a valid Lévy exponent (see Definition 4.1). This proves that the underlying pdf is infinitely divisible (by Theorem 4.1).
where f is the Lévy exponent of the innovation process w. The coefficients are independent
if the kernels φi1 ,k1 and φi2 ,k2 have disjoint support. Their correlation is given by
Proof The first formula is obtained by substitution of φ = ω_1ψ_{i_1,k_1} + ω_2ψ_{i_2,k_2} in E{e^{j⟨φ,s⟩}} = P̂_w(L^{-1∗}φ) and simplification using the left-inverse property of L^{-1∗}. The statement about independence follows from the exponential nature of the characteristic function and the property that f(0) = 0, which allows for the factorization of the characteristic function when the supports of the kernels are disjoint (independence of the noise at every point). The correlation formula is obtained by direct application of (7.7) with φ_1 = ψ_{i_1,k_1} = L^∗φ_{i_1,k_1} and φ_2 = ψ_{i_2,k_2} = L^∗φ_{i_2,k_2}.
It should be noted, however, that the quality of the decoupling depends strongly on the
spread of the smoothing kernels φi , which should be chosen to be maximally localized
for best performance. In the case of a first-order system (see the example in Section 6.3.3
and the wavelets in Figure 6.3d), the basis functions for i fixed are not overlapping,
which implies that the wavelet coefficients within a given scale are independent. This is
not so across scales because of the cone-shaped region where the support of the kernels
φi1 and φi2 overlap. Incidentally, the inter-scale correlation of wavelet coefficients is
often exploited in practice to improve coding performance [Sha93] and signal recon-
struction by imposing joint sparsity constraints [CNB98].
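A short experiment (ours; the Laplace innovation and the dyadic block implementation of the Haar analysis are illustrative choices) confirms the scale-by-scale decoupling: adjacent Haar coefficients at a fixed scale of a simulated Lévy-type process are essentially uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2 ** 16
s = np.cumsum(rng.laplace(size=N))       # Levy-type process, L = d/dt

def haar_coeffs(x, scale):
    """Non-overlapping orthonormal Haar coefficients at a dyadic scale."""
    m = 2 ** scale                       # support length of the wavelet
    x = x[: len(x) // m * m].reshape(-1, m)
    half = m // 2
    return (x[:, half:].sum(axis=1) - x[:, :half].sum(axis=1)) / np.sqrt(m)

y = haar_coeffs(s, 4) ; y = y - y.mean()
print(np.dot(y[:-1], y[1:]) / np.dot(y, y))   # ~0: within-scale decoupling
```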
Let p̂_{(X_1:X_K)}(ω), ω ∈ R^K, be the multidimensional characteristic function associated with the Kth-order joint pdf p_{(X_1:X_K)}(x) whose polynomial moments are all assumed to be finite. Then, if the function f(ω) = log p̂_{(X_1:X_K)}(ω) is well defined, we can write the full multidimensional Taylor series expansion

$$f(\boldsymbol{\omega}) = \log\widehat{p}_{(X_1:X_K)}(\boldsymbol{\omega}) = \sum_{n=|\mathbf{n}|=0}^{\infty}\partial^{\mathbf{n}}f(\mathbf{0})\,\frac{\boldsymbol{\omega}^{\mathbf{n}}}{\mathbf{n}!} = \sum_{n=|\mathbf{n}|=0}^{\infty}\underbrace{(-\mathrm{j})^{|\mathbf{n}|}\,\partial^{\mathbf{n}}f(\mathbf{0})}_{\kappa_{\mathbf{n}}}\,\frac{\mathrm{j}^{|\mathbf{n}|}\,\boldsymbol{\omega}^{\mathbf{n}}}{\mathbf{n}!}, \qquad (8.25)$$
where the internal summation is through all multi-indices whose sum is |n| = n. By
definition, the cumulants of p(X1 :XK ) (x) are the coefficients of this expansion, so that
κn = (−j)|n| ∂ n f (0).
The interest of these quantities is that they are in one-to-one relation with the (multidi-
mensional) moments of p(X1 :XK ) defined by
PROPOSITION 8.8 (Higher-order wavelet dependencies) Let {(i_k, k_k)}_{k=1}^K be a set of indices and {Y_k = ⟨ψ_{i_k,k_k}, s⟩}_{k=1}^K with ψ_{i_k,k_k} = L^∗φ_{i_k,k_k} be the corresponding wavelet coefficients of the generalized stochastic process s in Proposition 8.6. Then, the joint characteristic function of (Y_1, ..., Y_K) is given by

$$\widehat{p}_{(Y_1:Y_K)}(\omega_1,\ldots,\omega_K) = \exp\left(\int_{\mathbb{R}^d} f\big(\omega_1\phi_{i_1,k_1}(r) + \cdots + \omega_K\phi_{i_K,k_K}(r)\big)\,\mathrm{d}r\right),$$

where f is the Lévy exponent of the innovation process w. The coefficients are independent if the kernels φ_{i_k,k_k} with k = 1, ..., K have disjoint support. The wavelet cumulants with multi-index n are given by

$$\kappa_{\mathbf{n}}\{Y_1:Y_K\} = \kappa_{|\mathbf{n}|}\int_{\mathbb{R}^d}\phi_{i_1,k_1}^{n_1}(r)\cdots\phi_{i_K,k_K}^{n_K}(r)\,\mathrm{d}r,$$

provided that the κ_{|n|} and the above integrals are finite.
Proof Since Y_k = ⟨ψ_{i_k,k_k}, L^{-1}w⟩ = ⟨φ_{i_k,k_k}, w⟩, the first part is obtained by direct substitution of φ = ω_1φ_{i_1,k_1} + ··· + ω_Kφ_{i_K,k_K} in the characteristic functional of the innovation E{e^{j⟨φ,w⟩}} = P̂_w(φ) = exp(∫_{R^d} f(φ(r)) dr). To prove the second part, we start from the Taylor-series expansion of f, which reads

$$f(\omega) = \sum_{n=0}^{\infty}\kappa_n\,\frac{(\mathrm{j}\omega)^n}{n!},$$
where the κ_n are the cumulants of the canonical innovation pdf p_id. Next, we consider the multidimensional wavelet Lévy exponent f_Y(ω) = log p̂_{(Y_1:Y_K)}(ω) and expand it as

$$f_Y(\boldsymbol{\omega}) = \int_{\mathbb{R}^d} f\big(\omega_1\phi_{i_1,k_1}(r) + \cdots + \omega_K\phi_{i_K,k_K}(r)\big)\,\mathrm{d}r$$
$$= \int_{\mathbb{R}^d}\sum_{n=0}^{\infty}\kappa_n\,\frac{\big(\mathrm{j}\omega_1\phi_{i_1,k_1}(r) + \cdots + \mathrm{j}\omega_K\phi_{i_K,k_K}(r)\big)^n}{n!}\,\mathrm{d}r \quad\text{(1-D Taylor expansion of } f)$$
$$= \int_{\mathbb{R}^d}\sum_{n=0}^{\infty}\frac{\kappa_n}{n!}\sum_{n=|\mathbf{n}|}\frac{n!}{\mathbf{n}!}\,\big(\mathrm{j}\omega_1\phi_{i_1,k_1}(r)\big)^{n_1}\cdots\big(\mathrm{j}\omega_K\phi_{i_K,k_K}(r)\big)^{n_K}\,\mathrm{d}r \quad\text{(multinomial expansion of inner sum)}$$
$$= \sum_{n=|\mathbf{n}|=0}^{\infty}\kappa_n\,\frac{\mathrm{j}^{|\mathbf{n}|}\boldsymbol{\omega}^{\mathbf{n}}}{\mathbf{n}!}\int_{\mathbb{R}^d}\phi_{i_1,k_1}^{n_1}(r)\cdots\phi_{i_K,k_K}^{n_K}(r)\,\mathrm{d}r.$$

The formula for the cumulants of (Y_1, ..., Y_K) is then obtained by identification with (8.25).
We note that, for N = 2, we recover Proposition 8.8. For instance, under the second-order hypothesis, we have that

$$\kappa_1 = -\mathrm{j}\,f'(0) = 0$$
We conclude the chapter with a detailed investigation of the effect of such signal
decompositions on first-order processes. We are especially interested in the evaluation
of performance for data reduction and the comparison with the “optimal” solutions pro-
vided by the Karhunen–Loève transform (KLT) and independent-component analysis
(ICA). While ICA is usually determined empirically by running a suitable algorithm, the good news is that it can be worked out analytically for this particular class of signal models and used as a gold standard.
The Gaussian AR(1) model is of special relevance since it has been put forward in the
past to justify two popular source-encoding algorithms: linear predictive coding (LPC)
and DCT-based transform coding [JN84]. LPC, on the one hand, is used for voice com-
pression in the GSM standard for mobile phones (2G cellular network). It is also part
of the FLAC lossless audio codec. The DCT, on the other hand, was introduced as an
approximation of the KLT of an AR(1) process [Ahm74]. It forms the core of the widely
used JPEG method of lossy compression for digital pictures. Our primary interest here
is to investigate the extent to which these classical tools of signal processing remain
relevant when switching to the non-Gaussian regime.
$$(\mathrm{D} - \alpha_1\mathrm{Id})\,s(t) = w(t)$$

with a_1 = e^{α_1}. From the result in Section 8.3.1 and Proposition 8.4, we know that u[·] is i.i.d. with an infinitely divisible distribution characterized by the (modified) Lévy exponent

$$f_U(\omega) = \int_{\mathbb{R}} f\big(\omega\beta_{\alpha_1}(t)\big)\,\mathrm{d}t = \int_0^1 f\big(\omega\,\mathrm{e}^{\alpha_1 t}\big)\,\mathrm{d}t.$$

Here, β_{α_1}(t) = 1_{[0,1)}(t) e^{α_1 t} is the exponential B-spline associated with the first-order whitening operator L = D − α_1 Id (see Section 6.3) and f is the exponent of the continuous-domain innovation w.
To make the link with predictive coding and classical estimation theory, we observe
that s̃[k] = a1 s[k − 1] is the best (minimum-error) linear predictor of s[k] given the
past of the signal {s[k − n]}∞ n=1 . This suggests interpreting the generalized increments
u[k] = s[k] − s̃[k] as prediction errors.
For the benefit of the reader, we recall that the main idea behind linear predictive
coding (LPC) is to sequentially transmit the prediction error u[k], rather than the sample
values of the signal, which are inherently correlated. One also typically uses higher-
order AR models to better represent the effect of the sound-production system. A re-
finement for real-world signals is to re-estimate the prediction coefficients from time to
time in order to readapt the model to the data.
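The LPC interpretation takes one line of code to verify (our sketch, with an illustrative heavy-tailed Student innovation): the prediction errors of the AR(1) recursion coincide with the i.i.d. driving sequence.

```python
import numpy as np

rng = np.random.default_rng(3)
N, a1 = 50_000, 0.9
u = rng.standard_t(df=3, size=N)        # i.i.d. heavy-tailed innovation u[k]
s = np.empty(N)                         # AR(1): s[k] = a1 s[k-1] + u[k]
s[0] = u[0]
for k in range(1, N):
    s[k] = a1 * s[k - 1] + u[k]

s_pred = a1 * s[:-1]                    # best linear predictor of s[k]
err = s[1:] - s_pred                    # prediction error = increment u[k]
print(np.allclose(err, u[1:]))          # True: LPC transmits u, not s
```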
$$\mathbf{L}x = u,$$

where u = (U_1, ..., U_N) with U_n = u[n] and L is a Toeplitz matrix with non-zero entries [L]_{n,n} = 1 and [L]_{n,n−1} = −a_1 with a_1 = e^{α_1}.
Assuming that L is invertible,1 we then solve this linear system of equations, which yields

$$x = \mathbf{L}^{-1}u, \qquad (8.28)$$

$$y = (Y_1, \ldots, Y_N) = \mathbf{T}x \qquad (8.29)$$
1 This property is dependent upon a proper choice of discrete boundary conditions (e.g., s[0] = 0).
But, before that, let us determine the transform-domain statistics of the signal in order
to qualify the effect of T. Generally, if we know the Nth-order pdf of x, we can readily
deduce the joint pdf of the transform-domain coefficients y = Tx as
$$p_{(Y_1:Y_N)}(y) = \frac{1}{|\det(\mathbf{T})|}\,p_{(X_1:X_N)}\big(\mathbf{T}^{-1}y\big). \qquad (8.30)$$

The Fourier equivalent of this relation is

$$\widehat{p}_{(Y_1:Y_N)}(\boldsymbol{\omega}) = \widehat{p}_{(X_1:X_N)}\big(\mathbf{T}^T\boldsymbol{\omega}\big),$$
where

$$\mathbf{A} = \mathbf{T}\mathbf{L}^{-1} \qquad (8.31)$$

is the composite matrix that combines the effect of noise shaping (innovation model) and the linear transformation of the data. The row vectors of A are a_m = (a_{m,1}, ..., a_{m,N}) with a_{m,n} = [A]_{m,n}, while its column vectors are denoted by b_n = (a_{1,n}, ..., a_{N,n}).
where a_{m,n} = [A]_{m,n} = [b_n]_m and κ_m{U} = (−j)^m ∂^m f_U(0) is the (scalar) cumulant of order m = Σ_{n=1}^N m_n of the innovation. Finally, the marginal distribution of Y_n = ⟨t_n, x⟩ = ⟨a_n, u⟩ is infinitely divisible with Lévy exponent

$$f_{Y_n}(\omega) = \sum_{m=1}^{N} f_U\big(a_{n,m}\,\omega\big). \qquad (8.35)$$
Moreover,

$$\widehat{p}_{(Y_1:Y_N)}(\boldsymbol{\omega}) = \mathbb{E}\big\{\mathrm{e}^{\mathrm{j}\langle y,\boldsymbol{\omega}\rangle}\big\} = \mathbb{E}\big\{\mathrm{e}^{\mathrm{j}\langle\mathbf{A}u,\boldsymbol{\omega}\rangle}\big\} = \mathbb{E}\big\{\mathrm{e}^{\mathrm{j}\langle u,\mathbf{A}^T\boldsymbol{\omega}\rangle}\big\} = \widehat{p}_{(U_1:U_N)}\big(\mathbf{A}^T\boldsymbol{\omega}\big).$$
whose exponent is the sought-after quantity. Implicit to this result is the fact that the infinite-divisibility property is preserved when performing linear combinations of id variables. Specifically, let f_1 and f_2 be two valid Lévy exponents. Then, it is not hard to see that f(ω) = f_1(a_1ω) + f_2(a_2ω), where a_1, a_2 ∈ R are arbitrary constants, is a valid Lévy exponent as well (in reference to Definition 4.1).
Proposition 8.9 in conjunction with (8.31) provides us with the complete characterization of the transform-domain statistics. To illustrate its usage, we shall now deduce the expression for the transform-domain covariances. Specifically, the covariance between Y_{n_1} and Y_{n_2} is given by the second-order cumulant with multi-index m = e_{n_1} + e_{n_2}, whose expression (8.33) with m = 2 simplifies to

$$\mathbb{E}\Big\{\big(Y_{n_1} - \mathbb{E}\{Y_{n_1}\}\big)\big(Y_{n_2} - \mathbb{E}\{Y_{n_2}\}\big)\Big\} = \kappa_2\{U\}\sum_{n=1}^{N}a_{n_1,n}\,a_{n_2,n} = \sigma_0^2\,\langle a_{n_1}, a_{n_2}\rangle, \qquad (8.36)$$
where σ_0² = −f_U″(0) is the variance of the innovation and where a_{n_1} and a_{n_2} are the n_1th and n_2th row vectors of the matrix A = [a_1 ··· a_N]^T, respectively. In particular, for n_1 = n_2 = n, we find that the variance of Y_n is given by Var{Y_n} = σ_0² ‖a_n‖_2².
Covariance matrix
An equivalent way of expressing second-order moments is to use covariance matrices. Specifically, the covariance matrix of the random vector x ∈ R^N is defined as

$$\mathbf{C}_X = \mathbb{E}\Big\{\big(x - \mathbb{E}\{x\}\big)\big(x - \mathbb{E}\{x\}\big)^T\Big\}. \qquad (8.37)$$

$$\mathbf{C}_Y = \mathbf{T}\mathbf{C}_X\mathbf{T}^T = \mathbf{A}\mathbf{C}_U\mathbf{A}^T = \sigma_0^2\,\mathbf{A}\mathbf{A}^T, \qquad (8.38)$$

where A is defined by (8.31). The reader can easily verify that this result is equivalent to (8.36). Likewise, the second-order transcription of the innovation model is

$$\mathbf{C}_X = \mathbf{L}^{-1}\mathbf{C}_U\mathbf{L}^{-T} = \sigma_0^2\,(\mathbf{L}^T\mathbf{L})^{-1}. \qquad (8.39)$$
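Relation (8.39) lends itself to a direct numerical check (our sketch; the boundary condition s[0] = 0 of the footnote above is implicit in the matrix construction):

```python
import numpy as np

rng = np.random.default_rng(4)
N, a1, trials = 16, 0.9, 200_000
L = np.eye(N) - a1 * np.eye(N, k=-1)      # [L]nn = 1, [L]n,n-1 = -a1

u = rng.standard_normal((trials, N))      # unit-variance innovation, sigma0 = 1
x = np.linalg.solve(L, u.T).T             # x = L^{-1} u  (8.28)

C_model = np.linalg.inv(L.T @ L)          # sigma0^2 (L^T L)^{-1}  (8.39)
C_empir = np.cov(x, rowvar=False)
print(np.abs(C_model - C_empir).max())    # small (Monte-Carlo error)
```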
Differential entropy
A final important theoretical quantity is the differential entropy of the random vector x = (X_1, ..., X_N), which is defined as

$$H_{(X_1:X_N)} = \mathbb{E}\big\{-\log p_{(X_1:X_N)}(x)\big\}. \qquad (8.40)$$
For instance, the differential entropy of the N-dimensional multivariate Gaussian pdf with mean x_0 and covariance matrix C_X,

$$p_{\mathrm{Gauss}}(x) = \frac{1}{\sqrt{(2\pi)^N\det(\mathbf{C}_X)}}\,\exp\Big(-\tfrac{1}{2}(x-x_0)^T\mathbf{C}_X^{-1}(x-x_0)\Big),$$

is given by

$$H(p_{\mathrm{Gauss}}) = \frac{1}{2}\big(N + N\log(2\pi) + \log\det(\mathbf{C}_X)\big) = \frac{1}{2}\log\big((2\pi\mathrm{e})^N\det(\mathbf{C}_X)\big). \qquad (8.41)$$
The special relevance of this expression is that the Gaussian distribution is known to have the maximum differential entropy among all densities with a given covariance C_X. This leads to the inequality

$$-H_{(X_1:X_N)} + \frac{1}{2}\log\big((2\pi\mathrm{e})^N\det(\mathbf{C}_X)\big) \ \ge\ 0, \qquad (8.42)$$

where the quantity on the left is called the negentropy.
This implies that the solution is optimal in the Gaussian scenario, where decorrelation
is equivalent to independence. PCA is essentially the same technique, except that it
replaces CX by a scatter matrix that is estimated empirically from the data.
In addition to the decorrelation property, the KLT minimizes any criterion of the form (see [Uns84, Appendix A])

$$R(\mathbf{T}) = \sum_{n=1}^{N} G\big(\mathrm{Var}\{Y_n\}\big), \qquad (8.45)$$

$$(\mathbf{L}^T\mathbf{L})\,h = \frac{\sigma_0^2}{\lambda}\,h.$$
For the AR(1) model, LT L is tridiagonal. This can then be converted into a set of
second-order difference equations that may be solved recursively. In particular, for
α_1 = 0 (Lévy process), the eigenvectors for n = 1, ..., N correspond to the sinusoidal sequences

$$h_n[k] = \frac{2}{\sqrt{2N+1}}\,\sin\left(\pi\,\frac{2n-1}{2N+1}\,k\right). \qquad (8.46)$$
Depending on the boundary conditions, one can obtain similar formulas for the eigen-
vectors when α1 = 0. The bottom line is that the solutions are generally sinusoids, with
minor variations in phase and frequency. This is consistent with the fact that the correla-
tion matrix C_X is very close to being circulant, and that circulant matrices are diagonalized
by the discrete Fourier transform (DFT). Another “universal” transform that provides an
excellent approximation of the KLT of an AR(1) process (see [Ahm74]) is the discrete
cosine transform (DCT), whose basis vectors for n = 1, ..., N are

$$g_n[k] = \sqrt{\frac{2}{N}}\,\cos\left(\frac{\pi n}{N}\Big(k - \frac{1}{2}\Big)\right). \qquad (8.47)$$
A key property is that the DCT is asymptotically optimal in the sense that its per-
formance is equivalent to that of the KLT when the block size N tends to infinity.
Remarkably, this is a result that holds for the complete class of wide-sense-stationary
processes [Uns84], which may explain why this transform performs so well in practice.
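Both (8.46) and the diagonalizing property can be verified in a few lines (our sketch, not from the original text): the principal eigenvector of (LᵀL)⁻¹ in the Lévy case coincides, up to sign, with the first sinusoidal sequence h_1.

```python
import numpy as np

N = 64
L = np.eye(N) - np.eye(N, k=-1)             # Levy case: a1 = 1 (alpha1 = 0)
C = np.linalg.inv(L.T @ L)                  # covariance up to sigma0^2
eigval, eigvec = np.linalg.eigh(C)          # KLT basis = eigenvectors of C

k = np.arange(1, N + 1)
h1 = 2 / np.sqrt(2 * N + 1) * np.sin(np.pi * (2 * 1 - 1) / (2 * N + 1) * k)
v1 = eigvec[:, -1]                          # principal eigenvector
print(abs(abs(v1 @ h1) - 1.0) < 1e-8)       # matches (8.46) up to sign
```

Substituting the DCT vectors of (8.47) instead shows that they nearly diagonalize the same covariance matrix, which is the content of the asymptotic-optimality statement above.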
Independent-component analysis
The decoupling afforded by the KLT emphasizes decorrelation, which is not necessa-
rily appropriate when the underlying model is non-Gaussian. Instead, one would rather
like to obtain a representation that maximizes statistical independence, which is the
goal pursued by independent-component analysis (ICA). Unlike the KLT, there is no
single ICA but a variety of numerical solutions that differ in terms of the criterion being
optimized [Com94]. The preferred measure of independence is mutual information
(MI), with the caveat that it is often difficult to estimate reliably from the data. Here, we
take advantage of our analytic framework to bypass this estimation step, which is the
empirical part of ICA.
Specifically, let

$$z = (Z_1, \ldots, Z_N) = \mathbf{T}_{\mathrm{ICA}}\,x$$

with T_ICA^T T_ICA = I. By definition, the mutual information of the random vector (Z_1, ..., Z_N) is given by

$$I(Z_1, \ldots, Z_N) = \sum_{n=1}^{N} H_{Z_n} - H_{(Z_1:Z_N)} \ \ge\ 0, \qquad (8.48)$$

where H_{Z_n} = −∫_R p_{Z_n}(z) log p_{Z_n}(z) dz is the differential entropy of the component Z_n (which is computed from the marginal distribution p_{Z_n}), and H_{(Z_1:Z_N)} is the Nth-order differential entropy of z (see (8.40)). The relevance of this criterion is that I(Z_1, ..., Z_N) = 0 if and only if the variables Z_1, ..., Z_N are statistically independent.
The other important property is that H_{(Z_1:Z_N)} = H_{(Y_1:Y_N)} = H_{(X_1:X_N)}, meaning that the joint differential entropy does not depend upon the choice of transformation as long as |det(T)| = 1 (see (8.43)). Therefore, the ICA transform that minimizes I(Z_1, ..., Z_N)
subject to the orthonormality constraint is also the one that minimizes the sum of the
transform-domain entropies. We call this optimal solution the min-entropy ICA.
The implicit understanding with ICA is that the components of z are (approximately)
independent so that it makes good sense to approximate the joint pdf of z by the product
of the marginals given by

$$p_{(Z_1:Z_N)}(z) \approx \prod_{n=1}^{N}p_{Z_n}(z_n) = q_{(Z_1:Z_N)}(z). \qquad (8.49)$$
As it turns out, the min-entropy ICA minimizes the Kullback–Leibler divergence be-
tween p(Z1 :ZN ) and its separable approximation q(Z1 :ZN ) . Indeed, the Kullback–Leibler
divergence between two N-dimensional probability density functions p and q is de-
fined as

$$D\big(p\,\|\,q\big) = \int_{\mathbb{R}^N}\log\left(\frac{p(z)}{q(z)}\right)p(z)\,\mathrm{d}z.$$
$$D\Big(p_{(Z_1:Z_N)}\,\Big\|\,\prod_{n=1}^{N}p_{Z_n}\Big) = -H_{(Z_1:Z_N)} - \int_{\mathbb{R}^N}p_{(Z_1:Z_N)}(z)\,\log\Big(\prod_{n=1}^{N}p_{Z_n}(z_n)\Big)\,\mathrm{d}z$$
$$= -H_{(Z_1:Z_N)} - \sum_{n=1}^{N}\int_{\mathbb{R}}p_{Z_n}(z)\log p_{Z_n}(z)\,\mathrm{d}z = -H_{(Z_1:Z_N)} + \sum_{n=1}^{N}H_{Z_n} = I(Z_1,\ldots,Z_N) \ \ge\ 0,$$

with equality to zero if and only if p_{(Z_1:Z_N)}(z) and its product approximation in (8.49) are equal.
In the special case where the innovation follows an SαS distribution with f_U(ω) = −|s_0ω|^α, we can derive a form of the entropy criterion that is directly amenable to numerical optimization. Indeed, based on Proposition 8.9, we find that the characteristic function of Y_n is given by

$$\widehat{p}_{Y_n}(\omega) = \exp\Big(-\sum_{m=1}^{N}\big|s_0\,a_{n,m}\,\omega\big|^{\alpha}\Big) = \exp\big(-s_0^{\alpha}\,\|a_n\|_{\alpha}^{\alpha}\,|\omega|^{\alpha}\big) = \mathrm{e}^{-|s_n\omega|^{\alpha}} \quad\text{with } s_n = s_0\|a_n\|_{\alpha}. \qquad (8.50)$$
This implies that pYn is a rescaled version of pU so that its entropy is given by
Hence, minimizing the mutual information (or, equivalently, the sum of the entropies of
the transformed coefficients) is equivalent to the optimization of
$$I(\mathbf{T};\alpha) = -\sum_{n=1}^{N}\log\|a_n\|_{\alpha} = -\frac{1}{\alpha}\sum_{n=1}^{N}\log\left(\sum_{m=1}^{N}\big|[\mathbf{T}\mathbf{L}^{-1}]_{n,m}\big|^{\alpha}\right). \qquad (8.51)$$
Based on (8.48), (8.44), and the property that |det(L)| = 1, we can actually verify that
(8.51) is the exact formula for the mutual information I(Y1 , . . . , YN ). In particular, for
α = 2 (Gaussian case), we find that
$$I(\mathbf{T};2) = -\sum_{n=1}^{N}\log\|a_n\|_2 = -\frac{1}{2}\sum_{n=1}^{N}\log\mathrm{Var}\{Y_n\},$$
which is a special instance of (8.45). It follows that, for α = 2, ICA is equivalent to the
KLT, as expected.
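A sketch of the comparison behind Figure 8.2 (ours, not from the original text): by (8.50), the marginal dispersion of Y_n is s_0‖a_n‖_α, so the decoupling quality of a transform is governed by the log-norms of the rows of A = TL⁻¹ (cf. (8.51)). We evaluate this criterion, written here so that smaller values indicate better decoupling, for the identity, DCT, and Haar transforms:

```python
import numpy as np
from scipy.fft import dct

def haar_matrix(n):
    """Orthonormal Haar transform matrix; n must be a power of two."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                # coarse (scaling) part
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])  # finest-scale wavelets
    return np.vstack([top, bot]) / np.sqrt(2.0)

def decoupling_criterion(T, Linv, alpha):
    """sum_n log ||a_n||_alpha with A = T L^{-1}; smaller = more independent."""
    A = T @ Linv
    return np.log((np.abs(A) ** alpha).sum(axis=1)).sum() / alpha

N, alpha = 64, 1.0                              # SaS Levy process, alpha = 1
Linv = np.linalg.inv(np.eye(N) - np.eye(N, k=-1))
T_dct = dct(np.eye(N), norm="ortho", axis=0)    # orthonormal DCT-II matrix

for name, T in [("identity", np.eye(N)), ("DCT", T_dct),
                ("Haar", haar_matrix(N))]:
    print(f"{name:>8}: {decoupling_criterion(T, Linv, alpha):7.1f}")
```

Running this reproduces the qualitative ordering of Figure 8.2 for small α: the Haar transform decouples markedly better than the DCT, which in turn improves substantially on the identity.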
Figure 8.2 Decoupling ability of various transforms of size N = 64 for the representation of SαS Lévy processes as a function of the stability/sparsity index α. The criterion is the average mutual information; the curves compare the identity, DCT/KLT, Haar wavelet, and optimal (ICA) transforms. The Gaussian case α = 2 corresponds to a Brownian motion.
We also examined the basis functions of ICA and found that they were very similar to
Haar wavelets. In particular, it appeared that the ICA training algorithm would almost
systematically uncover the N/2 shorter Haar wavelets, which is not overly surprising
since these are basis vectors also shared by the discrete whitening filter L.
Remarkably, this statistical model supports the (quasi)-optimality of wavelets. Since
mutual information is in direct relation to the transform-domain entropies which are
predictors of coding performance, these results also provide an explanation of the
superiority of wavelets for the M-term approximation of Lévy processes reported in
Section 1.3.4 (see Figure 1.5).
The graph in Figure 8.3 provides the matching results for AR(1) signals with a1 = 0.9.
Also included is a comparison between the Haar transform and the operator-like wavelet
transform that is matched to the underlying AR(1) model. The conclusions are essen-
tially the same as before. Here, too, the DCT closely replicates the performance of the
KLT associated with the Gaussian brand of these processes. While the Haar transform
is superior to the DCT for the sparser processes (small values of α), it is generally
outperformed by the operator-like wavelet transform, which confirms the relevance of
applying matched basis functions.
The finding that a fixed set of basis functions (operator-like wavelet transform) is
capable of essentially replicating the performance of ICA is good news for applications.
Indeed, the computational cost of the wavelet algorithm is O(N), as opposed to O(N2 )
for ICA, not to mention the price of the training procedure (iterative optimization) which
is even more demanding (O(N2 ) per iteration of a gradient-based scheme). A further
conceptual advantage is that the operator-like wavelets are known in analytic form (see
Section 6.3.3), while ICA can, at best, only be determined numerically by running a
suitable optimization algorithm.
Figure 8.3 Decoupling ability of various transforms for the representation of stable AR(1) signals of size N = 64 with correlation a1 = 0.9 as a function of the stability/sparsity index α. The criterion is the average mutual information; the curves compare the identity, DCT/KLT, Haar wavelet, operator-like wavelet, and optimal (ICA) transforms. The DCT is known to be asymptotically optimal in the Gaussian case (α = 2).
Sections 8.1–8.4
The property that finite differences decouple Lévy processes is well known – in fact, it
is the starting point of the definition of such “additive” processes [Lév65]. By contrast,
the observation that Haar wavelets have the same kind of ability (on a scale-by-scale
basis) is much more recent [UTS14].
A crucial theoretical aspect is the extension of the domain of the characteristic func-
tional that was carried out in Section 8.2. The proof of Theorem 8.2 is adapted from
[UTS14, Theorem 3], while more general results for arbitrary Lévy exponents can be
found in [Fag14].
The material in Section 8.3 is an extension of the results presented in [UTAK14].
In that respect, we note that the correspondence between continuous-time and discrete-
time ARMA models (the Gaussian part of Proposition 8.4) is a classical result in the
theory of Gaussian stationary processes [Doo90]. The localization of the fractional-
derivative operators in Section 8.3.5 is intimately linked to the construction of fractional
B-splines with the help of the generalized binomial expansion (see [BU03, UB00]).
The theoretical results on wavelet-domain statistics in Section 8.4 are an extension
of [UTS14, Section V.D]. In particular, the general cumulant formulas (Proposition 8.8)
are new, to the best of our knowledge.
Section 8.5
The use of linear transforms for the decorrelation of signals is a classical topic in signal
and image processing [Pra91, JN84, Jai89]. The DCT was introduced by Ahmed et al.
as a practical substitute for the KLT of an AR(1) process [Ahm74]. As part of his thesis
work, Unser proved its asymptotic equivalence with the KLT for the complete class
of Gaussian stationary processes [Uns84]. The AR(1) model has frequently been used
for comparing linear transforms using various performance metrics derived from the
Gaussian hypothesis [PAP72, HP76, Jai79, JN84]. While such measures clearly point
to the superiority of sinusoidal transforms, they lose their relevance in the context of
sparse processes. The derivation of the KLT of a Lévy process (see (8.46)) can be found
in [KPAU13, Appendix II].
Two classical references on ICA are [Com94, HO00]. In practice, ICA is determined
from the data based on the minimization of a suitable contrast that favors independence
or, by extension, non-Gaussianity/sparsity. There is some empirical evidence of a link
between ICA and wavelets. The first is a famous experiment by Olshausen and Field,
who computed ICA from a large collection of natural image patches and pointed out
the similarity between the extracted factors and a directional wavelet/Gabor analysis
[OF96a]. In 1999, Cardoso and Donoho reported a numerical experiment involving a
(non-Gaussian) sawtooth process that resulted in ICA basis functions that were remark-
ably similar to wavelets [Car99]. The characterization of ICA for SαS processes that is
presented in Section 8.5.4, and the demonstration of the connection with operator-like
wavelets, are based on the more recent work of Pad and Unser [PU13].
The source for the calculation of the differential entropies in Table 8.1 is [LR78].
9 Infinite divisibility and transform-domain statistics
where f is the Lévy exponent of the continuous-domain innovations. This also translates
into a correspondence between the Lévy density vϕ (a) of fϕ and the canonical v(a) that
characterizes the innovation (see Theorem 4.2).
The central part of the chapter is devoted to the demonstration that fϕ (resp., vϕ ) pre-
serves the primary features of f (resp. v) and by the same token those of the underlying
pdf. The properties of unimodality, self-decomposability, and stability are covered in
Sections 9.2, 9.3, and 9.4, respectively, while the characterization of the decay and the
determination of the moments are carried out in Sections 9.5–9.6. The conclusion that
follows is that the shape of the analysis window does not fundamentally change the
nature of the transform-domain pdfs. This is good news for practical applications since
it allows us to do some relatively robust modeling by sticking to a particular family of id
distributions (such as the Student, symmetric gamma, or SαS laws in Table 4.1) whose
dispersion and decay parameters can be tuned to fit the statistics of a given type of sig-
nal. These findings also suggest that the transform-domain statistics are only mildly
dependent upon the choice of a particular wavelet basis as long as the analysis wavelets
are matched to the whitening operator of the process in the sense that ψi = L∗ φi with
φi ∈ Lp (Rd ).
In the case where the operator is scale-invariant (fractal process), we can be more
specific and obtain a precise characterization of the evolution of the wavelet statistics
across scale. Our mathematical analysis hinges upon the semigroup properties of id
laws, which are reviewed in Section 9.7. These results are then used in Section 9.8 to
show that the wavelet-domain pdfs are ruled by a diffusion-like equation. This allows
us to prove that the wavelet-domain pdfs converge to Gaussians – or, more generally, to
stable distributions – as the scale gets coarser.
$$\widehat{p}_{X_\varphi}(\omega) = \widehat{\mathscr{P}}_w(\omega\varphi) = \mathrm{e}^{f_\varphi(\omega)},$$

$$b_{2,\varphi} = b_2\,\|\varphi\|_{L_2}^2$$

$$v_\varphi(a) = \int_{\Omega_\varphi}\frac{1}{|\varphi(r)|}\,v\big(a/\varphi(r)\big)\,\mathrm{d}r, \qquad (9.2)$$

where Ω_φ ⊆ R^d denotes the domain over which φ(r) ≠ 0. Moreover, f_φ(ω) (resp., v_φ) satisfies the same type of admissibility condition as f(ω) (resp., v).
Note that the restriction φ ∈ L_1(R^d) is only required for the non-centered scenarios, where ∫_R a v(a) da ≠ 0 and b_1 ≠ 0. The corresponding Lévy parameter b_{1,φ} then depends upon the type of Lévy–Khintchine formula. In the case of the fully compensated representation (4.5), we have b_{1,φ} = b_1 ∫_{R^d} φ(r) dr.
LEMMA 9.2 Let v(a) ≥ 0 be a valid Lévy density such that ∫_R min(a², |a|^p) v(a) da < ∞ for some p ∈ [0, 2]. Then, for any given φ ∈ L_p(R^d) ∩ L_2(R^d), the modified density v_φ(a) specified by (9.2) is Lévy-admissible with

$$\int_{\mathbb{R}}\min(a^2, |a|^p)\,v_\varphi(a)\,\mathrm{d}a < 2\|\varphi\|_{L_2}^2\int_{|a|<1}a^2v(a)\,\mathrm{d}a + 2\|\varphi\|_{L_p}^p\int_{|a|\ge1}|a|^pv(a)\,\mathrm{d}a. \qquad (9.3)$$

If, in addition, ∫_R |a|^p v(a) da < ∞ for some p ≥ 0, then the result holds for any φ ∈ L_p(R^d) and v_φ satisfies the inequality

$$\int_{\mathbb{R}}|a|^p\,v_\varphi(a)\,\mathrm{d}a < 2\|\varphi\|_{L_p}^p\int_{\mathbb{R}}|a|^pv(a)\,\mathrm{d}a. \qquad (9.4)$$
where μ_φ is the measure describing the amplitude distribution of φ(r) within the range θ ∈ R, with zero contribution at θ = 0 (to avoid dividing by zero). For further convenience, we also define the measure μ_{|φ|} that specifies the amplitude distribution of |φ(r)|.
To check the finiteness of ∫_{|a|>1} |a|^p v_φ(a) da, we first consider the contribution I_1 of the positive values

$$I_1 = \int_1^{\infty}|a|^p v_\varphi(a)\,\mathrm{d}a = \int_1^{\infty}\int_{-\infty}^{\infty}\frac{|a|^p}{|\theta|}\,v(a/\theta)\,\mu_\varphi(\mathrm{d}\theta)\,\mathrm{d}a = \int_{-\infty}^{\infty}\int_1^{\infty}\frac{|a|^p}{|\theta|}\,v(a/\theta)\,\mathrm{d}a\;\mu_\varphi(\mathrm{d}\theta)$$
$$= \int_0^{\infty}\int_{1/\theta}^{\infty}|a'\theta|^p\,v(a')\,\mathrm{d}a'\;\mu_\varphi(\mathrm{d}\theta) \le \int_0^{\infty}\left(\int_{\min(1,1/\theta)}^{1}|a'\theta|^p\,v(a')\,\mathrm{d}a' + \int_1^{\infty}|a'\theta|^p\,v(a')\,\mathrm{d}a'\right)\mu_\varphi(\mathrm{d}\theta),$$

where we are relying on Tonelli's theorem to interchange the integrals and where we have made the reverse change of variable a′ = a/θ. The crucial step is to note that a′θ ≥ 1 for the range of values within the first inner integral, which yields

$$I_1 \le \int_0^{\infty}\left(\int_{\min(1,1/\theta)}^{1}|a'\theta|^2\,v(a')\,\mathrm{d}a' + \int_1^{\infty}|a'\theta|^p\,v(a')\,\mathrm{d}a'\right)\mu_\varphi(\mathrm{d}\theta)$$
$$\le \int_{-\infty}^{\infty}\theta^2\,\mu_\varphi(\mathrm{d}\theta)\int_0^1 a^2v(a)\,\mathrm{d}a + \int_{-\infty}^{\infty}|\theta|^p\,\mu_\varphi(\mathrm{d}\theta)\int_1^{\infty}a^pv(a)\,\mathrm{d}a.$$
Proceeding in the same fashion for the negative values and recalling that ∫_R |θ|^p μ_φ(dθ) = ‖φ‖_{L_p}^p, we find that

$$\int_{|a|\ge1}|a|^p v_\varphi(a)\,\mathrm{d}a \le \|\varphi\|_{L_2}^2\int_{|a|<1}a^2v(a)\,\mathrm{d}a + \|\varphi\|_{L_p}^p\int_{|a|\ge1}|a|^pv(a)\,\mathrm{d}a,$$

where we have used ∫_R min(a², |a|^p) v(a) da = ∫_{|a|<1} a² v(a) da + ∫_{|a|≥1} |a|^p v(a) da. As for the quadratic part (I_2) of the admissibility condition, we consider the integral

$$I_2 = \int_0^1 a^2 v_\varphi(a)\,\mathrm{d}a = \int_0^1\int_{-\infty}^{\infty}\frac{a^2}{\theta}\,v(a/\theta)\,\mu_\varphi(\mathrm{d}\theta)\,\mathrm{d}a = \int_0^{\infty}\int_0^{1/\theta}(a'\theta)^2\,v(a')\,\mathrm{d}a'\;\mu_\varphi(\mathrm{d}\theta)$$
with the change of variable a′ = a/θ. Since a′ < 1/θ within the bounds of the inner integral, we have

$$I_2 \le \int_0^{\infty}\left(\int_0^{1}(a'\theta)^2\,v(a')\,\mathrm{d}a' + \int_1^{\max(1,1/\theta)}|a'\theta|^p\,v(a')\,\mathrm{d}a'\right)\mu_\varphi(\mathrm{d}\theta)$$
$$\le \int_{-\infty}^{\infty}\theta^2\,\mu_\varphi(\mathrm{d}\theta)\int_0^1a^2v(a)\,\mathrm{d}a + \int_{-\infty}^{\infty}|\theta|^p\,\mu_\varphi(\mathrm{d}\theta)\int_1^{\infty}a^pv(a)\,\mathrm{d}a,$$

so that

$$\int_{|a|<1}a^2 v_\varphi(a)\,\mathrm{d}a \le \|\varphi\|_{L_2}^2\int_{|a|<1}a^2v(a)\,\mathrm{d}a + \|\varphi\|_{L_p}^p\int_{|a|\ge1}|a|^pv(a)\,\mathrm{d}a < \infty,$$

which is then combined with the previous result to yield (9.3). The announced L_p inequality is obtained in a similar fashion without the necessity of splitting the integrals into subparts.
LEMMA 9.3 Let v(a) ≥ 0 be a Lévy density such that ∫_R min(a², |a|^p) v(a) da < ∞ for some p ∈ [0, 2]. Then, the non-Gaussian Lévy exponent

$$g(\omega) = \int_{\mathbb{R}\setminus\{0\}}\big(\mathrm{e}^{\mathrm{j}a\omega} - 1 - \mathrm{j}a\omega\,\mathbb{1}_{|a|<1}(a)\big)\,v(a)\,\mathrm{d}a$$

is bounded by

$$|g(\omega)| \le C_2|\omega|^2 + C_q|\omega|^q,$$

with C_2 = ∫_{|a|<1}|a|² v(a) da, q = min(1, p), and C_q = 2∫_{|a|≥1}|a|^q v(a) da.
for all 0 ≤ q ≤ p and, in particular, q = min(1, p). We then consider the inequalities

$$\big|\mathrm{e}^{\mathrm{j}x} - 1 - \mathrm{j}x\big| \le x^2$$

and

$$\big|\mathrm{e}^{\mathrm{j}x} - 1\big| \le \min(|x|, 2) \le 2\min(|x|, 1) \le 2|x|^q,$$
where the restriction q ≤ 1 ensures that |x|^q ≥ |x| for |x| ≤ 1. By combining these elements, we construct the upper bound

$$|g(\omega)| \le \int_{|a|<1}\big|\mathrm{e}^{\mathrm{j}a\omega} - 1 - \mathrm{j}a\omega\big|\,v(a)\,\mathrm{d}a + \int_{|a|\ge1}\big|\mathrm{e}^{\mathrm{j}a\omega} - 1\big|\,v(a)\,\mathrm{d}a$$
Proof of Theorem 9.1 First, we use Lemma 9.3 together with the Lévy–Khintchine formula (4.3) to show that the modified Lévy exponent f_φ is a well-defined function of ω. This is achieved by establishing the upper bound

$$|f_\varphi(\omega)| \le A_1|\omega| + A_2|\omega|^2 + A_q|\omega|^q, \qquad (9.5)$$

where q = min(1, p), A_1 = |b_1| ‖φ‖_{L_1}, A_2 = (b_2/2 + ∫_{|a|<1}|a|² v(a) da) ‖φ‖_{L_2}², and A_q = 2∫_{|a|≥1}|a|^q v(a) da ‖φ‖_{L_q}^q.
To lay out the technique of proof, we first assume that ∫_{|a|>1}|a| v(a) da < ∞ and consider the Lévy–Khintchine representation (4.5) of f. This yields

$$f_\varphi(\omega) = \mathrm{j}\int_{\mathbb{R}^d}b_1\varphi(r)\,\omega\,\mathrm{d}r - \frac{\omega^2}{2}\int_{\mathbb{R}^d}b_2|\varphi(r)|^2\,\mathrm{d}r + g_\varphi(\omega) = \mathrm{j}\underbrace{\left(b_1\int_{\mathbb{R}^d}\varphi(r)\,\mathrm{d}r\right)}_{b_{1,\varphi}}\omega - \underbrace{b_2\,\|\varphi\|_{L_2}^2}_{b_{2,\varphi}}\,\frac{\omega^2}{2} + g_\varphi(\omega),$$
where

$$g_\varphi(\omega) = \int_{\mathbb{R}^d}\int_{\mathbb{R}\setminus\{0\}}\big(\mathrm{e}^{\mathrm{j}a\varphi(r)\omega} - 1 - \mathrm{j}a\varphi(r)\omega\big)\,v(a)\,\mathrm{d}a\,\mathrm{d}r,$$
with the property that g_φ(0) = 0. Next, we identify v_φ(a) by making the change of variable a′(r) = aφ(r), while restricting the domain of integration to the subregion of R^d over which the argument is non-zero:

$$g_\varphi(\omega) = \int_{\Omega_\varphi}\int_{\mathbb{R}\setminus\{0\}}\big(\mathrm{e}^{\mathrm{j}a'\omega} - 1 - \mathrm{j}a'\omega\big)\,v\big(a'/\varphi(r)\big)\,\frac{1}{|\varphi(r)|}\,\mathrm{d}a'\,\mathrm{d}r = \int_{\mathbb{R}\setminus\{0\}}\big(\mathrm{e}^{\mathrm{j}a'\omega} - 1 - \mathrm{j}a'\omega\big)\underbrace{\int_{\Omega_\varphi}\frac{1}{|\varphi(r)|}\,v\big(a'/\varphi(r)\big)\,\mathrm{d}r}_{v_\varphi(a')}\,\mathrm{d}a',$$
where the interchange of integrals is legitimate thanks to (9.5) (by Fubini). Lemma 9.2
ensures that vϕ is admissible in accordance with Definition 4.3.
The scenario ∫_{|a|>1}|a| v(a) da = ∞ is trickier because it calls for a more careful compensation of the singularity of the Lévy density. The classical Lévy–Khintchine formula (4.4) leads to an integral of the form

$$g_\varphi(\omega) = \int_{\mathbb{R}^d}\int_{\mathbb{R}\setminus\{0\}}\Big(\mathrm{e}^{\mathrm{j}a\varphi(r)\omega} - 1 - \mathrm{j}a\varphi(r)\omega\,h\big(a,\varphi(r)\big)\Big)\,v(a)\,\mathrm{d}a\,\mathrm{d}r$$

with h(a, φ(r)) = 1_{|a|<1}(a). It turns out that the exact form of the compensation h(a, φ(r)) is not important as long as it stabilizes the integral by introducing an appropriate linear bias, which results in a modification of the constant b_1. Instead of the canonical solution, we propose an alternative regularization with h(a, φ(r)) = 1_{|aφ(r)|<1}(a), which is compatible with the change of variable a′ = aφ(r). The rationale is that this particular choice is guaranteed to lead to a convergent integral, as a consequence of Lemma 9.2, so that the remainder of the proof is the same as in the previous case: change of variable and interchange of integrals justified by Fubini's theorem. This also leads to some modified constant b_{1,φ} which is necessarily finite since both g_φ(ω) and f_φ(ω) are bounded.
To carry out the proof of Lemma 9.2, we have exploited the fact that the integration of f against a function φ amounts to a spectral mixing f(ω; μ_φ) = ∫_R f(θω) μ_φ(dθ). This results in an admissible Lévy exponent provided that the pth absolute moment of the measure μ_φ is finite: ∫_R |θ|^p μ_φ(dθ) < ∞. The equivalence with f_φ(ω) is obtained by defining μ_φ((−∞, θ]) = Meas{r : φ(r) < θ and φ(r) ≠ 0}, which specifies the amplitude distribution of φ(r) as r ranges over R^d. To gain further insight into the mixing process, we like to view the Lebesgue integral ∫_R f(θω) μ_φ(dθ) as the limit of a sequence of Lévy exponents f_N(ω) = Σ_{n=1}^N f(θ_{n,N}ω) τ_n, each corresponding to a characteristic function of the form e^{f_N(ω)} = Π_{n=1}^N p̂_X(s_nω)^{τ_n} with s_n = θ_{n,N} and τ_n = μ_φ([θ_{n−1,N}, θ_{n,N})). The latter is interesting because it provides a convolutional interpretation of the mixing process, and also because it shows that all that matters is the amplitude distribution of φ, and not its actual spatial structure.
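The fact that only the amplitude distribution of φ matters can be checked directly (our sketch; the window and the Cauchy exponent f(ω) = −|ω| are illustrative assumptions): the direct integral of f(ωφ(r)) and the histogram-based mixture agree.

```python
import numpy as np

t = np.linspace(0, 1, 10_001)
dt = t[1] - t[0]
phi = np.where(t < 0.5, 1.0, -1.0) * np.sin(np.pi * t)  # some analysis window

f = lambda w: -np.abs(w)               # Cauchy Levy exponent
omega = 3.0

direct = np.sum(f(omega * phi)) * dt   # f_phi(omega) = int f(omega phi(r)) dr

# Mixture over the amplitude distribution mu_phi (histogram of phi values)
hist, edges = np.histogram(phi, bins=200)
theta = 0.5 * (edges[:-1] + edges[1:])
tau = hist * dt                        # measure of each amplitude bin
mixed = np.sum(f(omega * theta) * tau)

print(direct, mixed)                   # nearly equal
```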
COROLLARY 9.4 (Symmetric Lévy exponents) Let f be an admissible Lévy exponent and let φ ∈ L_p(R^d) for p ≥ 1 be a function such that its amplitude distribution (or histogram) μ_φ is symmetric. Then, f_φ(ω) = ∫_{R^d} f(ωφ(r)) dr = ∫_R f(θω) μ_φ(dθ) is a valid symmetric Lévy exponent that admits the canonical representation (4.4) with modified Lévy parameters (b_φ, v_φ(a) = v_φ(−a)), as specified in Theorem 9.1. Conversely, if f is symmetric to start with, then f_φ stays symmetric for any φ, irrespective of its amplitude distribution.
Proof Based on the Lévy–Khintchine representation of f and by relying on Fubini's theorem to justify the interchange of integrals, we get

$$f_\varphi(\omega) = -\frac{b_2}{2}\,\omega^2\underbrace{\int_{\mathbb{R}}\theta^2\,\mu_\varphi(\mathrm{d}\theta)}_{\|\varphi\|_2^2} + \int_{\mathbb{R}}\int_{\mathbb{R}}\big(\mathrm{e}^{\mathrm{j}a\theta\omega} - 1 - \mathrm{j}a\theta\omega\big)\,v(a)\,\mathrm{d}a\,\mu_\varphi(\mathrm{d}\theta)$$
$$= -\frac{b_2}{2}\,\omega^2\,\|\varphi\|_2^2 + 2\int_{\mathbb{R}}\int_0^{\infty}\big(\cos(a\theta\omega) - 1\big)\,\mu_\varphi(\mathrm{d}\theta)\,v(a)\,\mathrm{d}a,$$
where we have made use of the symmetry assumption μϕ (E) = μϕ (−E). Since the
above formula is symmetric in ω and fϕ (ω) is a valid Lévy exponent (by Theorem 9.1),
we can invoke Corollary 4.3, which yields the desired result. The converse part of
Corollary 9.4 is obvious.
The practical relevance of Corollary 9.4 is that we can restrict ourselves to a sym-
metric model of noise without any loss of generality as long as the analysis function
(typically, a wavelet) has a symmetric amplitude distribution μϕ . This is equivalent to
all odd moments of μϕ being zero, including the mean.
One of the cornerstones of our formulation is that the mixing (white-noise integration)
does not fundamentally affect the key properties of the Lévy exponent, as will be made
clear in what follows.
The pleasing observation is that all the examples of id distributions in Table 4.1 are
in this category.
PROPOSITION 9.6 (Alternative statement) Let f(ω) = f(−ω) be a real-valued admissible Lévy exponent of class C associated with a symmetric unimodal distribution and φ a d-dimensional function such that ‖φ‖_{L_p} < ∞ for p ∈ R^+. Then, f_φ(ω) = ∫_{R^d} f(ωφ(r)) dr is a valid Lévy exponent that retains the symmetry and unimodality properties.

The interest of the above is more in the constructive proof given below than in the statement, which is slightly less general than Corollary 9.5.
Proof of Proposition 9.6 The result is obtained as a corollary of two theorems in [GK68]: (i) the convolution of two symmetric¹ unimodal distributions is unimodal, and (ii) if a sequence of unimodal distributions converges to a distribution, then the limit function is unimodal as well. The idea is then to view the Lebesgue integral as the limit as n goes to infinity of the sequence of sums Σ_{k=1}^{n²−1} f(ωs_{k,n}) μ_{k,n}, where μ_{k,n} is the measure of the set E_{k,n} = {r : k/n ≤ |φ(r)| < (k+1)/n} and s_{k,n} = arg min_{φ(r): r∈E_{k,n}} f(ωφ(r)), recalling that the Lévy exponent f is continuous. Each individual term corresponds to a characteristic function e^{μ_{k,n} f(ωs_{k,n})} (Fourier-domain convolutional factor) that is id (thanks to the rescaling and exponentiation property), symmetric, and unimodal. Finally, we rely on the admissibility condition and Lebesgue's dominated-convergence theorem to show that the sequence converges to the limit f_φ(ω) = f_φ(−ω), which specifies a valid symmetric id distribution.
This proof also suggests that the class-C property is the tightest possible condition in
the mixing scenario with arbitrary ϕ. Indeed, class C plus symmetry is necessary and
sufficient for pXτ (x) to be unimodal for all τ ∈ R+ [Wol78].
Another interesting class of id distributions are those that are completely monotonic. We recall that a function (or density) q(x) is completely monotonic on R^+ if it is of class C^∞ with alternating derivatives, so that

$$(-1)^n\,\frac{\mathrm{d}^n q(x)}{\mathrm{d}x^n} \ge 0 \quad\text{for } n \in \mathbb{N},\ x \ge 0.$$
Thanks to Bernstein’s theorem, the symmetric version of completely monotonic distri-
butions can be expressed as mixtures of Laplacians; that is,
+∞ 1 −λ|x|
q(x) = λe pZ (λ) dλ (9.6)
0 2
for some mixing density p_Z(λ) on R^+. By making the change of variable θ = 1/λ (scale of the exponential distribution), we may also express q(x) as

$$q(x) = \int_0^{+\infty}\frac{1}{2\theta}\,\mathrm{e}^{-|x|/\theta}\,p_{1/Z}(\theta)\,\mathrm{d}\theta = \int_0^{\infty}p_Y(x|\theta)\,p_{1/Z}(\theta)\,\mathrm{d}\theta$$
1 While translating Gnedenko and Kolmogorov’s book into English, K. L. Chung realized that Lapin’s theo-
rem on the convolution of unimodal distributions, which does not impose the symmetry condition, is
generally not true. The result for symmetric distributions goes back to Wintner in 1936 (see [Wol78] for a
historical account).
with pZ (λ) dλ = p1/Z (θ ) dθ , where pY (x|θ ) is the Laplace distribution with standard
deviation θ . The probabilistic interpretation of the above expansion is that of an expo-
nential scale mixture: q = pX is the pdf of the ratio X = Y/Z of two independent random
variables Y and Z with pdfs pY (standardized Laplacian with λ = 1) and pZ (under the
constraint that Z is positive), respectively.
Complete monotonicity is one of the few simple criteria that ensure that a distribution is infinitely divisible [SVH03, Theorem 10.1]. Moreover, one can readily show that a symmetric completely monotonic distribution is log-convex, since

$$\frac{\mathrm{d}^2\log q(x)}{\mathrm{d}x^2} = \frac{q''(x)}{q(x)} - \left(\frac{q'(x)}{q(x)}\right)^2 \ge 0. \qquad (9.7)$$
Indeed, based on the canonical form (9.6), we have that

$$\big(q'(x)\big)^2 = \left(\int_0^{+\infty}-\lambda\,\mathrm{e}^{-\lambda|x|}\,\frac{\lambda}{2}\,p_Z(\lambda)\,\mathrm{d}\lambda\right)^2 \le \int_0^{+\infty}\mathrm{e}^{-\lambda|x|}\,\frac{\lambda}{2}\,p_Z(\lambda)\,\mathrm{d}\lambda\ \int_0^{+\infty}\lambda^2\,\mathrm{e}^{-\lambda|x|}\,\frac{\lambda}{2}\,p_Z(\lambda)\,\mathrm{d}\lambda = q(x)\,q''(x),$$

where we have made use of the Cauchy–Schwarz inequality applied to the inner product ⟨u, v⟩_Z = E_Z{(λ/2) uv} with u(λ) = e^{−λ|x|/2} and v(λ) = −λ e^{−λ|x|/2}.
Note that, unlike monotonicity, complete monotonicity is generally not preserved
through spectral mixing.
A related property at the other end of the spectrum is log-concavity. It is equivalent to the convexity of the log-likelihood potential Φ_X(x) = −log p_X(x), which is advantageous for optimization purposes and for designing MAP estimators (see Chapter 10).
bolic secant distributions. It should be kept in mind, however, that most id distributions
are not log-concave since the property is incompatible with a slower-than-exponential
decay. The limit case is the symmetric unimodal Laplace distribution, which is both
log-concave and log-convex.
$$\widehat{p}_X(\omega) = \widehat{p}_X(\lambda\omega)\cdot\widehat{p}_{X_\lambda}(\omega), \qquad (9.8)$$

where p̂_{X_λ}(ω) is a valid characteristic function for any λ ∈ (0, 1).
All the examples of id distributions in Table 4.1 are self-decomposable. For instance, in the case of the symmetric gamma (sym gamma) distribution, we have that

$$\widehat{p}_\lambda(\omega) = \frac{\widehat{p}_X(\omega)}{\widehat{p}_X(\lambda\omega)} = \left(\lambda^2 + (1-\lambda^2)\,\frac{1}{1+\omega^2}\right)^{r}.$$
Proof First, we invoke Theorem 9.1, which ensures that f_φ is a proper Lévy exponent to start with. The self-decomposability condition (9.8) is equivalent to

$$f(\omega) = f(\lambda\omega) + f_\lambda(\omega),$$

where f_λ(ω) is a valid Lévy exponent for any λ ∈ (0, 1). Inserting φ and taking the integral on both sides gives

$$f_\varphi(\omega) = \int_{\mathbb{R}^d} f\big(\lambda\omega\varphi(r)\big)\,\mathrm{d}r + \int_{\mathbb{R}^d} f_\lambda\big(\omega\varphi(r)\big)\,\mathrm{d}r = f_\varphi(\lambda\omega) + f_{\lambda,\varphi}(\omega),$$

where f_{λ,φ}(ω) = ∫_{R^d} f_λ(ωφ(r)) dr. Finally, since |f_λ(ω)| ≤ |f(ω)| + |f(λω)|, we are guaranteed that the integration with respect to φ yields an acceptable Lévy exponent. This implies that both e^{f_φ(ω)} and e^{f_{λ,φ}(ω)} are valid characteristic functions, which establishes the self-decomposability property.
Stability is the most stringent distributional property in the chain since all stable distri-
butions are self-decomposable and hence unimodal of class C.
DEFINITION 9.3 A random variable X is called (strictly) stable if, for every n ∈ Z+ ,
there exists cn such that X = cn (X1 + · · · + Xn ) in law, where the Xi are i.i.d. with pdf
pX (x).
The definition clearly implies infinite divisibility. On the side of the Lévy exponent, the condition translates into the homogeneity requirement f(aω) = a^α f(ω), with α ∈ (0, 2] being the stability index. One can then exploit this property to get the complete parameterization of the family of stable distributions. If we add the symmetry constraint, then the class of homogeneous Lévy exponents reduces to f(ω) ∝ −|ω|^α. These exponents are associated with the symmetric-alpha-stable (SαS) laws whose characteristic function is specified by

$$\widehat{p}_X(\omega) = \mathrm{e}^{-|s_0\omega|^{\alpha}}.$$
$$\int_{\mathbb{R}}|x|^{\theta_1}\,\mathrm{e}^{\theta_2|x|}\,p_X(x)\,\mathrm{d}x < \infty \iff \int_{|a|>1}|a|^{\theta_1}\,\mathrm{e}^{\theta_2|a|}\,v(a)\,\mathrm{d}a < \infty$$

The first two equivalences are deduced as special cases of the last one by considering the weighting functions g_θ(x) = max(1, |x|^θ) and g_{θ_1,θ_2}(x) = max(1, |x|^{θ_1}) e^{θ_2|x|}, which are submultiplicative for θ, θ_2 > 0 and θ_1 ≥ 0.
As an application of Theorem 9.9, we can restate the Lévy–Schwartz admissibility
condition in Theorem 4.8 as
The key here is that X_id = ⟨1_{[0,1)}, w⟩ (the canonical observation of the innovation w) is an id random variable whose characteristic function is p̂_id(ω) = e^{f(ω)} (see Proposition 4.12).
The direct implication is that pX will have the same decay behavior as v, keeping in
mind that it can be no better than supra-exponential (e.g., when v is compactly suppor-
ted), unless we are in the purely Gaussian scenario with v(a) = 0. This also means that,
from a decay point of view, supra-exponential and compact support are to be placed in
the same category.
The fundamental reason for the equivalences in Theorem 9.9 is that the corresponding
types of decay (exponential vs. polynomial) are preserved through convolution. In the
latter case, this can be quantified quite precisely using a generalized version of the
Young inequality for weighted Lp -spaces (see [AG01]):
for any p ≥ 1 and any family of submultiplicative weighting functions gθ (x). The
canonical example of weighting function that is used to characterize algebraic decay is
g_α(x) = (1 + |x|)^α with α ∈ R^+, which is submultiplicative with constant C_α = 1.
In light of Theorem 9.9, our next result implies that the rate of polynomial decay and
the moment-related properties of pX are preserved through the white-noise integration
process.
$$m_{v_\varphi,p} = \int_{\mathbb{R}}|a|^p\,v_\varphi(a)\,\mathrm{d}a = \|\varphi\|_{L_p}^p\,m_{v,p} \qquad (9.11)$$

and is well defined whenever m_{v,p}, the pth-order absolute moment of v, is bounded. Likewise, the integration against φ will preserve the finiteness of all absolute moments of the underlying distribution, including those for p ≤ 2.
Proof For the admissibility of v_φ, we refer to Theorem 9.1. To show that the integrand in (9.2) is well defined when φ(r) tends to zero and that v_φ has the same decay properties at infinity as v, we consider the bound v(a) < C|a|^{−p_0} with p_0 ≥ 1, which implies that (1/|φ(r)|) v(a/φ(r)) ≤ C|φ(r)|^{p_0−1}|a|^{−p_0}. It follows that v_φ(a) ≤ C‖φ‖_{L_p}^p |a|^{−p_0} with p = p_0 − 1, meaning that the rate of decay of v is preserved. To check the integrability of v_φ, we make the reverse change of variable a′ = φ(r) a and obtain

$$\int_{\mathbb{R}}\int_{\Omega_\varphi}\frac{1}{|\varphi(r)|}\,v\big(a/\varphi(r)\big)\,\mathrm{d}r\,\mathrm{d}a = \int_{\mathbb{R}}\int_{\Omega_\varphi}v(a')\,\mathrm{d}r\,\mathrm{d}a'$$
The fact is that, with id distributions, it is often simpler to work with cumulants than with ordinary moments, the idea being that cumulants are to the Lévy exponent what moments are to the characteristic function.
Specifically, let p̂_X(ω) = ∫_R p_X(x) e^{jωx} dx be the characteristic function of a pdf (not necessarily id) whose moments m_p = ∫_R x^p p_X(x) dx < ∞ are assumed to be well defined for any p ∈ N. The boundedness of the moments implies that p̂_X(ω) (the conjugate
where the second formula provides the definition of the κ_n, which are the cumulants of p_X. This clearly shows that p_X is uniquely characterized by its moments (by way of its characteristic function) or, equivalently, in terms of its cumulants by way of its cumulant-generating function log p̂_X(ω). Another equivalent way of expressing this correspondence is

$$\kappa_n\{X\} = \frac{1}{\mathrm{j}^n}\,\frac{\mathrm{d}^n\log\widehat{p}_X(\omega)}{\mathrm{d}\omega^n}\bigg|_{\omega=0},$$
which is the direct counterpart of (9.9). By equating the Taylor series of the exponential of the right-hand sum in (9.13) with (9.12), one can derive a direct relation between the moments and the cumulants

$$m_p = \sum_{n=0}^{p-1}\binom{p-1}{n}\,m_n\,\kappa_{p-n},$$

$$\kappa_2 = \mu_2$$
$$\kappa_3 = \mu_3$$
$$\kappa_4 = \mu_4 - 3\mu_2^2$$
$$\kappa_5 = \mu_5 - 10\mu_3\mu_2$$
$$\kappa_6 = \mu_6 - 15\mu_4\mu_2 - 10\mu_3^2 + 30\mu_2^3,$$
consistent with id distributions being more peaky than a Gaussian around the mean and
exhibiting fatter tails.
A further theoretical motivation for using cumulants is that they provide a direct
measure of the deviation from Gaussianity since the cumulants of a Gaussian are necessarily zero for n > 2 (because log ĝ_σ(ω) = −(σ²/2) ω²).
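The moment–cumulant recursion above inverts readily; the sketch below (ours, not from the original text) recovers the cumulants from the moments of the standardized Laplace pdf and reproduces κ₂ = 2, κ₄ = 12, κ₆ = 240:

```python
from math import comb

def cumulants_from_moments(m):
    """Invert m_p = sum_{n=0}^{p-1} C(p-1, n) m_n kappa_{p-n}; m[0] must be 1."""
    kappa = [0.0] * len(m)
    for p in range(1, len(m)):
        kappa[p] = m[p] - sum(comb(p - 1, n) * m[n] * kappa[p - n]
                              for n in range(1, p))
    return kappa

# Moments of the Laplace pdf (1/2) e^{-|x|}: m_p = p! for even p, 0 for odd p
moments = [1.0, 0.0, 2.0, 0.0, 24.0, 0.0, 720.0]
print(cumulants_from_moments(moments))   # kappa_2 = 2, kappa_4 = 12, kappa_6 = 240
```

The strictly positive even cumulants are consistent with the peaky, fat-tailed behavior just described.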
A final practical advantage of cumulants is that they offer a convenient means of
quantifying – and possibly inverting – the effect of the white-noise integration process.
PROPOSITION 9.11 Let f be an admissible Lévy exponent and let φ ∈ L_p(R^d) for all p ≥ 1. Then, the cumulants of f_φ(ω) = ∫_{R^d} f(ωφ(r)) dr are related to those of f(ω) by

$$\kappa_{n,\varphi} = \kappa_n\int_{\mathbb{R}^d}\big(\varphi(r)\big)^n\,\mathrm{d}r.$$
Proof We start by writing the Taylor series expansion of f as

$$f(\omega) = \sum_{n=1}^{\infty}\kappa_n\,\frac{(\mathrm{j}\omega)^n}{n!},$$
where the quantities κ_n are, by definition, the cumulants of the density function p_X(x) = F^{-1}{e^{f(ω)}}(x). Next, we replace ω by ωφ(r) and integrate over R^d, which gives

$$\int_{\mathbb{R}^d} f\big(\omega\varphi(r)\big)\,\mathrm{d}r = \sum_{n=1}^{\infty}\kappa_n\,\frac{(\mathrm{j}\omega)^n}{n!}\int_{\mathbb{R}^d}\big(\varphi(r)\big)^n\,\mathrm{d}r = f_\varphi(\omega).$$
The last step is to equate the above expression to the expansion of the cumulant-generating function f_φ,

$$f_\varphi(\omega) = \sum_{n=1}^{\infty}\kappa_{n,\varphi}\,\frac{(\mathrm{j}\omega)^n}{n!},$$

which yields the desired result.
A direct implication is that the odd-order cumulants of ⟨φ, w⟩ are zero whenever the analysis kernel φ has a symmetric amplitude distribution. This results in a modified Lévy exponent f_φ that is real-valued symmetric, which is consistent with Corollary 9.4. In particular, this condition is fulfilled when the analysis kernel has an axis of antisymmetry; that is, when there exists r_0 such that φ(r_0 + r) = −φ(r_0 − r), ∀r ∈ R^d.
As for the non-zero even-order cumulants, we can use the relation

$$\kappa_{2m} = \frac{\kappa_{2m,\varphi}}{\|\varphi\|_{L_{2m}(\mathbb{R}^d)}^{2m}}$$

to recover the cumulants of w from the moments of the observed random variable ⟨φ, w⟩.
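This inversion can be mimicked on simulated data (our sketch; the cosine kernel and the symmetric-gamma discretization of the Laplace innovation are illustrative assumptions): unbiased k-statistics estimate κ_{2m,φ}, and division by ‖φ‖_{L_{2m}}^{2m} recovers the innovation cumulants κ₂ = 2 and κ₄ = 12 approximately.

```python
import numpy as np
from scipy.stats import kstat

rng = np.random.default_rng(5)
M, K = 64, 100_000                    # grid cells per kernel, sample size
dt = 1.0 / M
t = (np.arange(M) + 0.5) * dt
phi = np.cos(np.pi * t)               # hypothetical analysis kernel on [0, 1)

# Discretized Laplace innovation: each cell is a symmetric-gamma variate of
# shape dt, so that its Levy exponent is dt*f with f(w) = -log(1 + w^2)
w = rng.gamma(dt, size=(K, M)) - rng.gamma(dt, size=(K, M))
Y = w @ phi                           # samples of <phi, w>

for m in (1, 2):
    k_emp = kstat(Y, n=2 * m)         # empirical kappa_{2m, phi}
    norm = np.sum(np.abs(phi) ** (2 * m)) * dt
    print(f"kappa_{2 * m} ~ {k_emp / norm:+.2f}")   # expect ~ +2 and ~ +12
```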
It turns out that every id pdf is embedded in a semigroup that plays a central role in
the classical theory of Lévy processes. These convolution semigroups define natural
families of id pdfs such as the sym gamma and Meixner distributions, which extend the Laplace and hyperbolic secant distributions, respectively. In Section 9.8, we put the semigroup property to good use to characterize the behavior of the wavelet-domain statistics across scales.
Let us recall the exponentiation property: if p_{X_1} is id with characteristic function p̂_{X_1}(ω) = e^{f(ω)}, then p_{X_τ} with p̂_{X_τ}(ω) = e^{τf(ω)} = (p̂_{X_1}(ω))^τ is id as well for any τ ∈ R^+. A direct implication is the convolution relation that relates the pdfs at scale τ + τ_0 and τ_0 with τ > 0, expressed as

$$p_{X_{\tau+\tau_0}}(x) = \big(p_{X_\tau} * p_{X_{\tau_0}}\big)(x).$$
This suggests that the family of pdfs {p_{X_τ} : τ ∈ [0, ∞)} is endowed with a semigroup-like structure. To specify the semigroup properties and spell out the implications, we introduce the family of linear convolution operators Q_τ with τ ≥ 0,

$$\mathrm{Q}_\tau q(x) = \big(p_{X_\tau} * q\big)(x)$$

for any q ∈ L_1(R). Clearly, the family {Q_τ : τ ≥ 0} is bounded over L_1(R) with ‖Q_τ‖ = sup_{‖q‖_{L_1}=1} ‖Q_τ q‖_{L_1} ≤ 1 (as a consequence of Young's inequality) and is such that: (1) Q_{τ_1}Q_{τ_2} = Q_{τ_1+τ_2} for τ_1, τ_2 ∈ [0, ∞), (2) Q_0 = Id, and (3) lim_{τ↓0} ‖Q_τ q − q‖_{L_1} = 0 for any q ∈ L_1(R). It therefore satisfies all the properties of a strongly continuous contraction semigroup. Such semigroups are entirely characterized by their infinitesimal generator G, which is defined as

$$\mathrm{G}q(x) = \lim_{\tau\downarrow 0}\frac{\mathrm{Q}_\tau q(x) - q(x)}{\tau}. \qquad (9.14)$$
Based on this generator, the members of the semigroup can be represented via the exponential map

$$\mathrm{Q}_\tau = \mathrm{e}^{\tau\mathrm{G}} = \mathrm{Id} + \tau\mathrm{G} + \frac{1}{2!}\tau^2\mathrm{G}^2 + \cdots.$$
Likewise, we may also write

$$\lim_{\Delta\tau\to 0}\frac{p_{X_{\tau+\Delta\tau}}(x) - p_{X_\tau}(x)}{\Delta\tau} = \mathrm{G}\,p_{X_\tau}(x), \qquad (9.15)$$

which implies that p_{X_τ}(x) = p(x, τ) is the solution of the partial differential equation

$$\frac{\partial}{\partial\tau}\,p(x,\tau) = \mathrm{G}\,p(x,\tau),$$
with initial condition p(x, 0) = p_{X_0}(x) = δ(x). In the present case where Q_τ = e^{τG} is shift-invariant, we have a direct correspondence with the frequency response e^{τf(ω)}. By transposing Definition (9.14) into the Fourier domain, we identify G as the LSI operator specified by

$$\mathrm{G}q(x) = \lim_{\tau\downarrow 0}\int_{\mathbb{R}}\frac{\mathrm{e}^{\tau f(\omega)} - 1}{\tau}\,\widehat{q}(\omega)\,\mathrm{e}^{-\mathrm{j}\omega x}\,\frac{\mathrm{d}\omega}{2\pi} = \int_{\mathbb{R}}\widehat{q}(\omega)\,f(\omega)\,\mathrm{e}^{-\mathrm{j}\omega x}\,\frac{\mathrm{d}\omega}{2\pi}, \qquad (9.16)$$
where q̂(ω) is the (conjugate) Fourier transform of q(x). The fact that the Lévy exponent f(ω) is the frequency response of G has some pleasing consequences for the interpretation of the PDE that rules the evolution of p_{X_τ}(x) as a function of τ.
which is quite
attractive from
an engineering perspective. In the case where f (ω) =
2
−b2 ω2 + R ejaω −1−jaω v(a) da, we obtain the explicit form of the impulse response
by the formal inverse (conjugate) Fourier transformation
b2
g(x) = G{δ}(x) = δ (x) + δ(x − a) − δ(x) + aδ (x) v(a) da,
2 R
where δ′ and δ″ are the first and second derivatives of the Dirac impulse. Some further simplification is possible if we can split the second component of g(x) into parts, although this requires some special care because all non-Poisson Lévy densities are singular around the origin. A first simplification occurs when ∫_R |a| v(a) da < ∞, which allows us to pull δ′ out of the integral with its weight being m1 = ∫_R a v(a) da. To bypass the singularity issue, we consider the sequence of non-singular Lévy densities v1/n(a) = v(a) for |a| > 1/n and zero otherwise, which converges to v(a) as n goes to infinity. Using the fact that v1/n is Lebesgue-integrable (as a consequence of the admissibility condition), we can perform a standard (conjugate) Fourier inversion, which yields

g(x) = m1 δ′(x) + (b2/2) δ″(x) + lim_{n→∞} ( v1/n(x) − δ(x) ∫_{|a|>1/n} v(a) da ),
with the limit component limn→∞ v1/n (x) = v(x) being the original Lévy density. The
above representation is enlightening because the impulse response is now composed of
two terms: a distribution that is completely localized at the origin (linear combination
of Dirac impulse and its derivatives up to order two) plus a smoothing component that
converges to the initial Lévy density v. The Dirac correction is actually crucial because
it converts an essentially lowpass filter (convolution with v(x) ≥ 0) into a highpass one,
which is consistent with the requirement that f(0) = 0.
We can also use this result to describe the evolution of pXτ for some small increment Δτ, as

pX_{τ+Δτ}(x) = F^{-1}{ p̂Xτ(ω) ( 1 + Δτ f(ω) + ((Δτ)²/2!) f²(ω) + ··· ) }(x)

= pXτ(x) + Δτ (g ∗ pXτ)(x) + ((Δτ)²/2!) (g ∗ g ∗ pXτ)(x) + O((Δτ)³),
where g = G{δ} is the impulse response of the semigroup generator. Since g is represen-
table as the sum of a point distribution concentrated at the origin and a Lévy density v,
this partly explains why pXτ will be endowed with the properties of v that are preserved
through convolution; in particular, the rate of decay (exponential vs. polynomial), sym-
metry, and unimodality.
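The semigroup structure is easy to probe numerically. The sketch below uses the assumed exponent f(ω) = −log(1 + ω²), which generates the sym gamma family (τ = 1 gives the Laplace law): the pdf at scale τ is obtained by FFT inversion of e^{τf(ω)}, and the convolution relation between scales is checked directly. Grid sizes are illustrative.

import numpy as np

N, L = 2**12, 80.0                     # grid size and spatial support
dx = L / N
x = (np.arange(N) - N // 2) * dx
omega = 2 * np.pi * np.fft.fftfreq(N, d=dx)
f = -np.log(1 + omega**2)              # Levy exponent; tau = 1 is the Laplace law

def pdf(tau):
    # p_{X_tau} = inverse (conjugate) Fourier transform of exp(tau * f)
    return np.fft.fftshift(np.fft.ifft(np.exp(tau * f))).real / dx

p_a, p_b, p_ab = pdf(0.5), pdf(1.5), pdf(2.0)
p_conv = dx * np.convolve(p_a, p_b)[N // 2 : N // 2 + N]   # aligned convolution
print(np.max(np.abs(p_ab - p_conv)))   # small: p_{tau1+tau2} = p_{tau1} * p_{tau2}
print(np.max(np.abs(pdf(1.0) - 0.5 * np.exp(-np.abs(x)))))  # tau = 1: Laplace pdf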
9.8 Multiscale analysis

The semigroup property is especially relevant for describing the evolution across scale
of the wavelet-domain statistics of a sparse stochastic process. The premise for the
analysis is that the wavelet can be factorized as

ψ(r) = L∗φ(r),

where φ ∈ L1(Rd) is a suitable “smoothing” kernel. We also assume that the wavelets are normalized to have a constant L2 norm across scales. In the framework of the continuous wavelet transform, the wavelet at scale a > 0 is therefore given by

ψa(r) = a^{−d/2} ψ(r/a) = L∗φa(r), with φa(r) = a^{γ−d/2} φ(r/a),
where we have made use of the scale-invariance property of L (see Definition 5.2 in
Chapter 5). The modified Lévy exponent at scale a can therefore be determined to be
fφa(ω) = ∫_{Rd} f( ω a^{γ−d/2} φ(r/a) ) dr

= ∫_{Rd} f( ω a^{γ−d/2} φ(r′) ) a^d dr′ (change of variable r′ = r/a)

= a^d fφ( a^{γ−d/2} ω ). (9.20)
Thus, we see that there are two mechanisms at play that determine the evolution of the wavelet distribution across scale. The first is a simple change of amplitude of the wavelet coefficients,² with their standard deviation being scaled by the factor a^{γ−d/2} that multiplies ω in (9.20). The second is the multiplication of the Lévy exponent by τ = a^d, which induces the kind of convolution semigroup investigated in Section 9.7, tied to the smoothing kernel φ and indexed by the parameter τ ∈ R+.

2 Let Y = bX, where X is a random variable with pdf pX and b a fixed scaling constant. Then pY(x) = |b|^{-1} pX(x/b), Var(Y) = b² Var(X), and p̂Y(ω) = p̂X(bω). In the present case, X is id with p̂X(ω) = exp(fφ(ω)).

Next, we
consider the random variables associated with the wavelet coefficients of the stochastic process s(r) at some fixed location r0 and scale a:

Va = ⟨ψa(· − r0), s⟩ = ⟨φa(· − r0), w⟩.

Since the Lévy noise w is stationary, it follows that Va has an id distribution with modified Lévy exponent fφa, as specified by (9.20). This allows us to express pVa, the pdf of the wavelet coefficients Va, as

pVa(x) = pid( x; a^d, fφ(b·) ) = |b|^{-1} pid( x/b; a^d, fφ )

with b = a^{γ−d/2}. Instead of rescaling the argument of fφ or the pdf, we can also consider the renormalized wavelet coefficients Ya = Va / a^{γ−d/2}, whose distribution is characterized by

pYa(x) = pid( x; a^d, fφ ),

which indicates that the evolution across scale is part of the same extended id family.
This connection allows us to transpose the results of Section 9.7 into the wavelet domain
and to infer the corresponding evolution equation. These findings are summarized as
follows.
Since we know the wavelet-domain Lévy exponent fφa(ω) = log p̂Va(ω), we apply the technique of Section 9.6 to determine the evolution of the cumulants across scale. We obtain

κn{Va} = (1/j^n) (d^n/dω^n) log p̂Va(ω) |_{ω=0}

= (1/j^n) (d^n/dω^n) [ a^d fφ( a^{γ−d/2} ω ) ] |_{ω=0}

= a^d ( a^{γ−d/2} )^n (1/j^n) (d^n/dω^n) fφ(ω) |_{ω=0} (chain rule of differentiation)

= a^d a^{n(γ−d/2)} κn{V1}. (9.22)
In the case of the variance (i.e., κ2{Va} = Var{Va}), this simplifies to

Var{Va} = a^{2γ} Var{V1}. (9.23)

Not too surprisingly, the result is compatible with the scaling law of the variance of the wavelet coefficients of a fractional Brownian field, Var{Va} = σ0² a^{2H+d}, where H = γ − d/2 is the Hurst exponent of the process. The latter corresponds to the special Gaussian version of the theory with κn{Va} = 0 for n > 2.
The implication of (9.22) is that the evolution of the wavelet cumulants across scale is linear in a log-log plot. Specifically, we have that

log κn{Va} = ( d + n(γ − d/2) ) log a + log κn{V1},

which suggests a simple regression scheme for estimating γ from the moments of the wavelet coefficients of a self-similar process.
Based on (9.22), we relate the evolution of the kurtosis to the scale as

η4(a) = κ4{Va} / κ2²{Va} = a^{−d} η4(1). (9.24)
This implies that the kurtosis (if initially well defined) converges to zero as the scale
gets coarser. We also see that the rate of convergence is universal (independent of the
order γ ) and that it is faster in higher dimensions.
The normalization ratio for the other cumulants is given by

ηm(a) = κm{Va} / κ2^{m/2}{Va} = ( 1 / a^{d(m/2−1)} ) ηm(1), (9.25)
which shows that the relative contributions of the higher-order cumulants (with m > 2)
tend to zero as the scale increases. This implies that the limit distribution converges to
a Gaussian under the working hypothesis that the moments (or cumulants) of pid are
well defined. This asymptotic behavior happens to be a manifestation of a generalized
version of the central-limit theorem, the idea being that the dilation of the observation
window has an effect that is equivalent to the summation of an increasing number of
i.i.d. random contributions.
This confirms the fact that the stability property is conserved in the wavelet domain. In the particular case of a Gaussian process with α = 2 and s0 = σ0/√2, we obtain sa,φ = s0 a^γ ‖φ‖L2, which is compatible with (9.23).
The remarkable aspect is that the combination of (9.26) and (9.27) specifies the limit
distribution of the wavelet pdf for sufficiently large a under very general conditions,
where α ≤ 2 is the critical exponent of the underlying distribution. The parameter s0 is
related to the αth moment of the canonical pdf, where α is the largest possible exponent
for this moment to be finite, the standard case being α = 2.
Since the derivation of this kind of limit result for α < 2 (generalized version of
the central-limit theorem) is rather technical, we concentrate now on the finite-variance
case (α = 2), which can be established under rather general conditions using a basic
Taylor-series argument.
For simplicity, we consider a centered scenario with modified Lévy triplet b1,φ = 0, b2,φ = b2 ‖φ‖²L2, and vφ as specified by (9.2) such that ∫_R t vφ(t) dt = 0. Note that these conditions are automatically satisfied when the variance of the innovations is finite and f is real-valued symmetric (see Corollary 9.4). Due to the finite-variance hypothesis, we have that m2,φ = ∫_R t² vφ(t) dt < ∞. This allows us to write the Taylor-series expansion of fφa(ω/a^γ) as

fφa(ω/a^γ) = a^d fφ( a^{γ−d/2} ω / a^γ )

= a^d fφ( a^{−d/2} ω )

= −(b2,φ/2) ω² − (m2,φ/2) ω² + O(a^{−d} ω⁴),
which corresponds to the Lévy exponent of the normalized variable Za = Va/a^γ. This implies that (1) the variance of Za is given by E{Za²} = b2,φ + m2,φ = Var{V1} and is independent of a, and (2) lim_{a→+∞} fφa(ω/a^γ) = −((b2,φ + m2,φ)/2) ω², which indicates that the limit distribution of Za is a centered Gaussian (central-limit theorem).
Practically, this translates into the Gaussian approximation of the pdf of the wavelet
coefficients given by
pVa(x) ≈ Gauss( 0, a^{2γ} Var{V1} ),
which becomes more and more accurate as the scale a increases. We note the simplicity
of the asymptotic model and the fact that it is consistent with (9.23) which specifies the
general evolution of the variance across scale.
9.9 Notes and pointers to the literature

While most of the results in this chapter are specific to sparse stochastic processes, the
presentation relies heavily on standard results from the theory of infinitely divisible
laws. Much of this theory gravitates around the one-to-one relation that exists between
the pdf (pid ) and the Lévy density (v) that appears in the Lévy–Khintchine representation
(4.2) of the exponent f (ω), the general idea being to translate the properties from one
domain to the other. Two classical references on this subject are the textbooks by Sato
[Sat94] and Steutel and Van Harn [SVH03].
The novel twist here is that the observation of white noise through an analysis win-
dow ϕ results in a modified Lévy exponent fϕ and hence a modified Lévy density vϕ ,
as specified by (9.2). The technical aspect is to make sure that both fϕ and vϕ are ad-
missible, which is the topic of Section 9.1. Determining the properties of the pdf of
Xϕ = ϕ, w then reduces to the investigation of vϕ , which can be carried out using
the classical tools of the theory of id laws. An alternative proof of Theorem 9.1 as
well as some complementary results on infinite divisibility and tail decay are reported
in [AU14].
Many of the basic concepts such as stability and self-decomposability go back to
the groundwork of Lévy [Lév25, Lév54]. The general form of Theorem 9.9 is due to
Kruglov [Kru70] (see also [Sat94, pp. 159–160]), but there are antecedents from Rama-
chandran [Ram69] and Wolfe [Wol71].
The use of semigroups and potential theory is a fruitful approach to the study of
continuous-time Markov processes, including Lévy processes. The concept appears to
have been pioneered by William Feller [Fel71, see Chapters 9 and 10]. Suggested read-
ings on this fascinating topic are [Sat94, Chapter 8], [App09, Chapter 3], and [Jac01].
The transposition of these tools in Section 9.8 for characterizing the evolution of the
wavelet statistics across scale is new, to the best of our knowledge; an expanded version
of this material will be published elsewhere.
10 Recovery of sparse signals
In this chapter, we apply the theory of sparse stochastic processes to the reconstruction
of signals from noisy measurements. The foundation of the approach is the specifi-
cation of a corresponding (finite-dimensional) Bayesian framework for the resolution
of ill-posed inverse problems. Given some noisy measurement vector y ∈ RM pro-
duced by an imaging or signal acquisition device (e.g., optical or X-ray tomography,
magnetic resonance), the problem is to reconstruct the unknown object (or signal) s as
a d-dimensional function of the space-domain variable r ∈ Rd based on the accurate
physical modeling of the imaging process (which is assumed to be linear).
The non-standard aspect here is that the reconstruction problem is stated in the conti-
nuous domain. A practical numerical scheme is obtained by projecting the solution onto
some finite-dimensional reconstruction space. Interestingly, the derived MAP estimators result in optimization problems that are very similar to the variational formulations that are in use today in the field of bioimaging, including Tikhonov regularization and ℓ1-norm minimization. The proposed framework provides insights of a statistical nature
and also suggests novel computational schemes and solutions.
The chapter is organized as follows. In Section 10.1, we present a general method for
the discretization of a linear inverse problem in a shift-invariant basis. The correspon-
ding finite-dimensional statistical characterization of the signal is obtained by suitable
“projection” of the innovation model onto the reconstruction space. This information is
then used in Section 10.2 to specify the maximum a posteriori (MAP) reconstruction
of the signal. We also develop an iterative optimization scheme that alternates between
a classical linear reconstructor and a shrinkage estimator that is specified by the signal
prior. In Section 10.3, we apply these techniques to the reconstruction of biomedical
images. After reviewing the physical principles of image formation, we derive practical
MAP estimators for the deconvolution of fluorescence micrographs, for the reconstruc-
tion of magnetic resonance images, and for X-ray computed tomography. We present
illustrative examples and discuss the connections with several reconstruction algorithms
currently in favor. In Section 10.4, we investigate the extent to which such variational
methods approximate the minimum-mean-square-error (MMSE) solution for the sim-
pler problem of signal denoising. To that end, we present a direct algorithm for the
MMSE denoising of Lévy processes that is based on belief propagation. This optimal
solution is then used as reference for assessing the performance of non-Gaussian MAP
estimators.
10.1 Discretization of linear inverse problems
We note that these conditions are met by the polynomial B-splines. These functions
are known to offer the best cost–quality tradeoff among the known families of interpo-
lation kernels [TBU00]. The computational cost is typically proportional to the support
of a basis function, while the quality is determined by its approximation order (asymp-
totic rate of decay of the approximation error). The approximation order of a B-spline
of degree n is (n + 1), which is the maximum that is achievable with a support of size
n + 1 [BTU01].
In practice, it makes good sense to choose β compactly supported so that condition
(2) is automatically satisfied. Under such a hypothesis, we can extend the representation
for sequences ch [·] with polynomial growth, which may be required for handling non-
stationary signals such as Lévy processes and fractals.
Our next ingredient is a biorthogonal (generalized) function β̃ such that

PVh s(r) = Σ_{k∈Zd} (1/h^d) ⟨ β̃( (· − hk)/h ), s ⟩ β( (r − hk)/h ).
Using (10.1), it is not hard to verify that PVh is idempotent (i.e., PVh PVh s = PVh s) and hence a valid (linear) projection operator. Moreover, the partition of unity guarantees that the error of approximation decays like ‖s − PVh s‖Lp = O(h) (or even faster if β has an order of approximation greater than one), so that the discretization error becomes negligible for h sufficiently small.
To simplify the notation, we shall assume from here on that h = 1 and that the control of the discretization error is adequate. (If not, we can decrease the sampling step further, which may also be achieved through an appropriate rescaling of the system of coordinates in which r is expressed.) Hence, our signal reconstruction model is

s(r) = Σ_{k∈Zd} s[k] β(r − k),

with

u[k] = (dL ∗ s[·])[k],

where the right-hand convolution between dL and s[·] is discrete. Specifically, dL is the discrete-domain impulse response of Ld (i.e., Ld{δ}(r) = Σ_{k∈Zd} dL[k] δ(r − k)), while the signal coefficients are given by s[k] in (10.2). Based on the defining properties s = L^{-1}w and Ld L^{-1}δ = βL, where βL is the B-spline associated with L, we express the discrete innovation u as

u[k] = ⟨ (β̃ ∗ βL∨)(· − k), w ⟩.
This implies that u is stationary and that its statistical properties are in direct relation with those of the white Lévy noise w (continuous-domain innovation). Recalling that w is specified by its characteristic form P̂w(ϕ) = E{e^{j⟨ϕ,w⟩}}, we obtain the joint characteristic function of the innovation values in a K-point neighborhood, u = (u[k])_{k∈K}, by direct substitution of ϕ(r) = Σ_{k∈K} ωk (β̃ ∗ βL∨)(r − k), where ω = (ωk)_{k∈K} is the corresponding Fourier-domain indexing vector. Specifically, p̂U(ω), which is the (conjugate) Fourier transform of the joint pdf pU(u), is given by

p̂U(ω) = P̂w( Σ_{k∈K} ωk (β̃ ∗ βL∨)(· − k) ), (10.3)
⟨ϕint, δ(· − k)⟩ = δk,

so that the corresponding biorthogonal analysis function is β̃(r) = δ(r). This is the solution that minimizes the overlap of the basis functions in (10.3).
Decoupling simplification
To obtain an innovation-domain description that is more directly applicable in practice, we make the decoupling simplification p̂U(ω) ≈ Π_{k∈K} p̂U(ωk), which is equivalent to assuming that the discrete innovation sequence u[·] is i.i.d. This means that the Kth-order joint pdf of the discrete innovation can be factorized as

pU(u) = Π_{k∈K} pU( u[k] ), (10.4)
Note that the latter comes as a special case of (10.3) with K = 1. The important practical point is that pU(x) is infinitely divisible, with modified Lévy exponent

log p̂U(ω) = f_{β̃∗βL∨}(ω) = ∫_{Rd} f( ω (β̃ ∗ βL∨)(r) ) dr. (10.6)

This makes it fall within the general family of distributions investigated in Chapter 9 (by setting ϕ = β̃ ∗ βL∨). Equivalently, we can write the corresponding log-likelihood function

−log( pU(u) ) = Σ_{k∈K} ΦU( u[k] ) (10.7)

with ΦU(u) = −log pU(u).
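For intuition, pU and the potential ΦU can be evaluated numerically from (10.6) by quadrature and an FFT inversion of the characteristic function. The sketch below assumes a Laplace-type exponent f(ω) = −log(1 + ω²) and a hypothetical triangular kernel standing in for β̃ ∗ βL∨; both are illustrative stand-ins rather than quantities taken from the text.

import numpy as np

N, S = 2**12, 40.0
dx = S / N
x = (np.arange(N) - N // 2) * dx
omega = 2 * np.pi * np.fft.fftfreq(N, d=dx)

r = np.linspace(-1, 1, 201)                    # support of the kernel
dr = r[1] - r[0]
kernel = np.maximum(0.0, 1.0 - np.abs(r))      # stand-in for beta_tilde * beta_L^v

f = lambda u: -np.log(1.0 + u**2)              # Levy exponent (Laplace-type)

# Modified exponent (10.6): f_phi(w) = integral of f(w * kernel(r)) dr
f_phi = np.array([np.sum(f(w * kernel)) * dr for w in omega])

pU = np.fft.fftshift(np.fft.ifft(np.exp(f_phi))).real / dx   # id pdf of u[k]
PhiU = -np.log(np.maximum(pU, 1e-300))                        # potential in (10.7)
print(pU.sum() * dx)                                          # about 1: valid pdf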
suitable modification of the basis functions that intersect the boundaries of the ROI. The corresponding signal representation then reads

s(r) = Σ_{k∈Ω} s[k] βk(r),

where βk(r) is the basis function corresponding to β(r − k) in (10.2), up to the modifications at the boundaries. This model is specified by the K-dimensional signal vector s = (s[k])_{k∈Ω} that is related to the discrete innovation vector u = (u[k])_{k∈Ω} by

u = Ls,

where L is the (K × K) matrix representation of Ld, the discrete version of the whitening operator L.
The general form of a linear, continuous-domain measurement model is

ym = ⟨ηm, s⟩ + n[m], m = 1, . . . , M,

with sampling/imaging functions {ηm}_{m=1}^M and additive measurement noise n[·]. The sampling function ηm(r) represents the spatio-temporal response of the mth detector of an imaging/acquisition device. For instance, it can be a 3-D point-spread function in deconvolution microscopy, a line integral across a 2-D or 3-D specimen in computed tomography, or a complex exponential (or a localized version thereof) in the case of magnetic resonance imaging. This measurement model is conveniently described in matrix-vector form as

y = y0 + n = Hs + n, (10.9)

where
• y = (y1, . . . , yM) is the M-dimensional measurement vector
• s = (s[k])_{k∈Ω} is the K-dimensional signal vector
• H is the (M × K) system matrix with entries [H]m,k = ⟨ηm, βk⟩.
The Bayesian formulation of the reconstruction problem calls for the determination of the posterior distribution pS|Y, which depends on the prior distribution
pS and the underlying noise model. Using Bayes' rule, we have that

pS|Y(s|y) = pY|S(y|s) pS(s) / pY(y) = pN(y − Hs) pS(s) / pY(y)

= (1/Z) pN(y − Hs) pS(s),
where the proportionality factor Z is not essential to the estimation procedure because it only depends on the input data y, which is a known quantity. We also note that Z can be recalculated by imposing the normalization constraint ∫_{RK} pS|Y(s|y) ds = 1, which is the way it is handled in message-passing algorithms (see Section 10.4.2). The next step is to introduce the discrete innovation variable u = Ls, whose pdf pU(u) has been derived explicitly. If the linear mapping between u and s is one-to-one,¹ we clearly have that

pS(s) ∝ pU(Ls).
Using this relation together with the decoupling simplification (10.4), we find that

pS|Y(s|y) ∝ pN(y − Hs) pU(Ls) ≈ pN(y − Hs) Π_{k∈Ω} pU( [Ls]k ), (10.11)
where pU is specified by (10.6) and solely depends on the Lévy exponent f of the
continuous-domain innovation w and the B-spline kernel βL = Ld L−1 δ associated with
the whitening operator L.
In the standard additive white Gaussian noise scenario (AWGN), we find that

pS|Y(s|y) ∝ exp( −‖y − Hs‖² / (2σ²) ) Π_{k∈Ω} pU( [Ls]k ),
The estimate sMMSE(y) is optimal in that it is closest (in the mean-square sense) to the (unknown) noise-free signal s; i.e., E{|sMMSE(y) − s|²} = min E{|s̃(y) − s|²} among all signal estimators s̃(y). The downside of this estimator is that it is difficult to compute in practice, except for special cases such as those discussed in Sections 10.2.2 and 10.4.2.
1 A similar proportionality relation can be established for the cases where L has a non-empty null space via
the imposition of the boundary conditions of the SDE. While the rigorous formulation can be carried out, it
is often not worth the effort because it will only result in a very slight modification of the solution such that
s1 (r) satisfies some prescribed boundary conditions (e.g., s1 (0) = 0 for a Lévy process), which are artificial
anyway. The pragmatic approach, which better suits real-world applications, is to ignore this technical issue
by adopting the present stationary formulation, and to let the optimization algorithm adjust the null-space
component of the signal to produce the solution that is maximally consistent with the data.
10.2 MAP estimation and regularization
Among the statistical estimators that incorporate prior information, the most popular
is the MAP solution, which extracts the mode of the posterior distribution. Its use can
be justified by the fact that it produces the signal estimate that best explains the ob-
served data. While MAP does not necessarily yield the estimator with the best average
performance, it has the advantage of being tractable numerically.
Here, we make use of the prior information that the continuous-domain signal s satis-
fies the innovation model Ls = w, where w is a white Lévy noise. The finite-dimensional
transposition of this model (under the decoupling simplification) is that the discrete
innovation vector Ls = u can be assumed to be i.i.d., 2 where L is the matrix counter-
part of the whitening operator L. For a given set of noisy measurements y = Hs + n
with AWGN of variance σ 2 , we obtain the MAP estimator through the maximization of
(10.11). This results in

sMAP = arg min_{s∈RK} ( (1/2) ‖y − Hs‖²₂ + σ² Σ_{k∈Ω} ΦU( [Ls]k ) ), (10.12)
with ΦU(x) = −log pU(x), where pU(x) is given by (10.5). Observe that the cost functional in (10.12) has two components: a data term (1/2)‖y − Hs‖²₂ that enforces the consistency between the data and the simulated, noise-free measurements Hs, and a second regularization term that favors the likeliest solutions in reference to the prior stochastic model. The balancing factor is the variance σ², which amplifies the influence of the prior information as the data get noisier. The specificity of the present formulation is that the potential function is given by the log-likelihood of the infinitely divisible random variable U, which has strong theoretical implications, as discussed in Sections 10.2.1 and 10.2.3.
For the time being, we observe that the general form of the estimator (10.12) is com-
patible with the standard variational approaches used in signal processing. The three
cases of interest that correspond to valid id (infinitely divisible) log-likelihood func-
tions (see Table 4.1) are:
(1) Gaussian: pU(x) = (1/√(2πσ0²)) e^{−x²/(2σ0²)} ⇒ ΦU(x) = x²/(2σ0²) + C1

(2) Laplace: pU(x) = (λ/2) e^{−λ|x|} ⇒ ΦU(x) = λ|x| + C2

(3) Student: pU(x) = (1/B(r, 1/2)) ( 1/(x² + 1) )^{r+1/2} ⇒ ΦU(x) = (r + 1/2) log(x² + 1) + C3,
where the constants C1, C2, and C3 can be ignored since they do not affect the solution. The first quadratic potential leads to the classical Tikhonov regularizer, which yields a stabilized linear solution. The second absolute-value potential produces an ℓ1-type regularizer; it is the preferred solution for solving deterministic compressed-sensing and sparse-signal-recovery problems. If L is a first-order derivative operator, then (10.12) maps into total-variation (TV) regularization, which is widely used in applications [ROF92]. The third log-based potential is interesting as well because it relates to the limit of an ℓp-relaxation scheme when p tends to zero [WN10]. The latter has been proposed by several authors as a practical “debiasing” method for improving the sparsity of the solution of a compressed-sensing problem [CW08]. The connection between the log and ℓp-norm relaxation is provided by the limit

log x² = lim_{p→0} ( x^{2p} − 1 ) / p,

which is compatible with the Student prior for x² ≫ 1.

2 The underlying discrete innovation sequence u[k] = Ld s(r)|_{r=k} is stationary and therefore identically distributed. It is also independent for Markov and/or Gaussian processes, but only approximately so otherwise. To justify the decoupling simplification, we like to invoke the minimum-support property of the B-spline βL = Ld L^{-1}δ ∈ L1(Rd), where Ld is the discrete counterpart of the whitening operator L.
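The three potentials, together with a numerical check of the log/ℓp limit, are straightforward to write down. This is a minimal sketch; the parameter values (σ0, λ, r) are illustrative defaults, not values from the text.

import numpy as np

def phi_gauss(x, sigma0=1.0):                  # Tikhonov (quadratic) potential
    return x**2 / (2 * sigma0**2)

def phi_laplace(x, lam=1.0):                   # l1 / TV-type potential
    return lam * np.abs(x)

def phi_student(x, r=1.0):                     # log-type (heavy-tailed) potential
    return (r + 0.5) * np.log(x**2 + 1)

# The log potential as the p -> 0 limit of an lp relaxation
xval = 3.0
for p in (1e-1, 1e-2, 1e-4):
    print((xval**(2 * p) - 1) / p, "->", np.log(xval**2))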
Table 10.1 Asymptotic behavior of the potential function ΦX(x) for the infinitely divisible distributions in Table 4.1.ᵃ

ᵃ Γ(z) and ψ(1)(r) are Euler's gamma and first-order polygamma functions, respectively (see Appendix C).
The id family covers a range of behaviors, with potential functions falling in between the linear solution (Laplace and hyperbolic secant) and the log form that is characteristic of the heavier-tailed laws (Student and SαS).
sLMMSE = Css HT ( H Css HT + Cnn )^{-1} y,

where Cnn = σ²I is the (N × N) covariance matrix of the noise. We note that the LMMSE solution is also valid for the non-Gaussian, finite-variance scenarios: it provides the MMSE solution among all linear estimators, irrespective of the type of signal model.
3 We recommend the use of the classical discrete whitening filter LG = Css^{−1/2} as a substitute for L in the Gaussian case because it results in an exact formulation. However, we do not advise one to do so for non-Gaussian models because it may induce undesirable long-range dependencies, the difficulty being that decorrelation alone is no longer synonymous with statistical independence.
It is well known from estimation theory that the Gaussian MAP and LMMSE solutions are equivalent. This can be seen by considering the following sequence of equivalent matrix identities:

HT H Css HT + σ² Css^{-1} Css HT = HT H Css HT + σ² HT

( HT H + σ² Css^{-1} ) Css HT = HT ( H Css HT + σ² I )

Css HT ( H Css HT + σ² I )^{-1} = ( HT H + σ² Css^{-1} )^{-1} HT,

where we have used the hypothesis that the covariance matrix Css is invertible.
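The equivalence is easy to verify numerically for a random instance (a minimal check; the dimensions and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
M, K, s2 = 6, 10, 0.3
H = rng.standard_normal((M, K))
A = rng.standard_normal((K, K))
Css = A @ A.T + np.eye(K)                     # symmetric positive-definite covariance

lmmse = Css @ H.T @ np.linalg.inv(H @ Css @ H.T + s2 * np.eye(M))
map_g = np.linalg.inv(H.T @ H + s2 * np.linalg.inv(Css)) @ H.T
print(np.allclose(lmmse, map_g))              # True: Gaussian MAP == LMMSE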
The availability of the closed-form solution (10.14) or (10.15) is nice conceptually,
but it is not necessarily applicable for large-scale problems because the system matrix
is too large to be stored in memory and inverted explicitly. The usual numerical
approach is to solve the corresponding system of linear equations iteratively using the
conjugate-gradient (CG) method. The convergence speed of CG can often be improved
by applying some problem-specific preconditioner. A particularly favorable situation is
when the matrix HT H + σ² LGT LG is block-Toeplitz (or circulant) and is diagonalized by
the Fourier transform. The signal reconstruction can then be computed very efficiently
with the help of an FFT-based inversion. This strategy is applicable for the basic fla-
vors of deconvolution, computed tomography, and magnetic resonance imaging. It also
makes the link with the classical methods of Wiener filtering, filtered backprojection,
or backprojection filtering, which result in direct image reconstruction.
with Φ1(u) = λ|u| (Laplace law) and Φ2(u) = |u|²/(2σ0²) (Gaussian). The first is a soft-threshold (shrinkage operator) and the second a linear scaling (scalar Wiener filter). While not all id laws lend themselves to such an analytic treatment, we can nevertheless determine the asymptotic form of their proximal operator. For instance, if ΦU is symmetric and twice-differentiable at the origin, which is the case for most⁴ examples in Table 10.1, then

proxΦU(y; σ²) = y / ( 1 + σ² Φ″U(0) ) as y → 0.
The result is established by using a basic first-order Taylor-series argument: Φ′U(u) = Φ″U(0) u + O(u²). The required slope parameter is

Φ″U(0) = −p″U(0)/pU(0) = ∫_R ω² p̂U(ω) dω / ∫_R p̂U(ω) dω, (10.20)

which is computable from the moments of the characteristic function.
To determine the larger-scale behavior, one has to distinguish between the (non-Gaussian) intermediate scenarios, where the asymptotic trend of the potential is predominantly linear (e.g., Laplace, hyperbolic secant, sym gamma, and Meixner families), and the distributions with algebraic decay, where it is logarithmic (Student and SαS stable), as made explicit in Table 10.1. (This information can readily be extracted from the special-function formulas in Appendix C.) For the first exponential and sub-exponential category, we have that Φ′U(u) = b1 + O(1/u) as u → ∞ with b1 > 0, which implies that

proxΦU(y; σ²) ∼ y − σ² b1 as y → +∞.

For the heavier-tailed laws with logarithmic potential, the derivative Φ′U(u) vanishes at infinity, so that

proxΦU(y; σ²) ∼ y as y → ∞.
4 Only the Laplace law and its sym gamma variants with r < 3/2 fail to meet the differentiability
requirement.
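In the same spirit, the proximal operator can be evaluated numerically for any potential. The sketch below uses a bounded scalar search with arbitrary bounds; for non-convex potentials such as Student's, it only returns a local minimizer, so it is a diagnostic tool rather than a production solver.

import numpy as np
from scipy.optimize import minimize_scalar

def prox(y, sigma2, potential):
    # prox_Phi(y; sigma^2) = argmin_u 0.5*(u - y)^2 + sigma^2 * Phi(u)
    res = minimize_scalar(lambda u: 0.5 * (u - y)**2 + sigma2 * potential(u),
                          bounds=(y - 20.0, y + 20.0), method="bounded")
    return res.x

phi_laplace = lambda u, lam=1.0: lam * abs(u)
phi_student = lambda u, r=1.0: (r + 0.5) * np.log(u**2 + 1)

soft = lambda y, t: np.sign(y) * max(abs(y) - t, 0.0)   # closed form, Laplace law

y, s2 = 2.5, 0.5
print(prox(y, s2, phi_laplace), soft(y, s2))            # both ~2.0 (soft-threshold)
print(prox(y, s2, phi_student))                         # approaches y for large y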
Figure 10.1 Normalized Student potentials for r = 2, 4, 8, 16, 32 (dark to light) and corresponding proximity maps. (a) Student potentials ΦStudent(y) with unit signal variance. (b) Shrinkage operator proxΦStudent(y; σ²) for σ² = 1.
The constrained problem is conveniently handled via the augmented-Lagrangian (AL) functional

LA(s, u, α) = (1/2)‖y − Hs‖²₂ + σ² Σ_{k∈Ω} ΦU( u[k] ) − αT( Ls − u ) + (μ/2)‖Ls − u‖²₂,

where the vector α = (αk)_{k∈Ω} represents the Lagrange multipliers for the desired constraint.
To find the solution, we handle the optimization task sequentially according to the alternating-direction method of multipliers (ADMM) [BPCE10]. Specifically, we consider the unknown variables s and u in succession, minimizing LA(s, u, α) with respect to each of them while keeping α and the other one fixed. This is combined with an update of α to refine the current estimate of the Lagrange multipliers. By using the index k to denote the iteration number, the algorithm cycles through the following steps until convergence:

(1) sk+1 = arg min_s LA(s, uk, αk);
(2) αk+1 = αk − μ( Lsk+1 − uk );
(3) uk+1 = arg min_u LA(sk+1, u, αk+1).
We now look into the details of the optimization. The first step amounts to the minimization of the quadratic form in s given by

LA(s, uk, αk) = (1/2)‖y − Hs‖²₂ − (Ls − uk)T αk + (μ/2)‖Ls − uk‖²₂ + C1, (10.21)

where C1 = C1(uk) is a constant that does not depend on s. By setting the gradient of the above expression to zero, as in

∂LA(s, uk, αk)/∂s = −HT(y − Hs) − LT( αk − μ(Ls − uk) ) = 0, (10.22)

we obtain the intermediate linear estimate

sk+1 = ( HT H + μ LT L )^{-1} ( HT y + zk+1 ), (10.23)

with zk+1 = LT( αk + μ uk ). Remarkably, this is essentially the same result as the Gaussian solution (10.14) with a slight adjustment of the data term and regularization strength. Note that the condition Ker(H) ∩ Ker(L) = {0} is required for this linear solution to be well defined and unique.
To justify the update of the Lagrange multipliers, we note that (10.22) can be rewritten as

∂LA(s, uk, αk)/∂s = −HT(y − Hs) − LT αk+1 = 0,

which is consistent with the global optimality conditions

Ls∗ − u∗ = 0,
−HT(y − Hs∗) − LT α∗ = 0,

where (s∗, u∗, α∗) is a stationary point of the optimization problem. This ensures that αk+1 gets closer to the correct vector of Lagrange multipliers α∗ as uk converges to u∗ = Ls∗.
As for the third step, we define ũ[k] = [Lsk+1]k and rewrite the AL criterion as

LA(sk+1, u, αk+1) = C2 + Σ_{k∈Ω} ( σ² ΦU( u[k] ) + αk u[k] + (μ/2)( ũ[k] − u[k] )² )

= C3 + Σ_{k∈Ω} ( (μ/2)( ũ[k] − αk/μ − u[k] )² + σ² ΦU( u[k] ) ),

where C2 and C3 are constants that do not depend on u. This shows that the optimization problem is decoupled and that the update can be obtained by direct application of the proximal operator (10.16) in a coordinate-wise fashion, so that

uk+1 = proxΦU( Lsk+1 − (1/μ) αk+1 ; σ²/μ ). (10.24)
A few remarks are in order. While there are other possible numerical approaches to the present MAP-estimation problem, the proposed algorithm is probably the simplest to deploy because it makes use of two very basic modules: a linear solver (akin
to a Wiener filter) and a model-specific proximal operator that can be implemented as
a pointwise non-linearity. The approach provides a powerful recipe for improving on
some prior linear solution by reapplying the solver sequentially and embedding it into a
proper computational loop. The linear solver needs to be carefully engineered because
it has a major impact on the efficiency of the method. The adaptation to a given sparse
stochastic model amounts to a simple adjustment of the proximal map (lookup table).
The ADMM is guaranteed to converge to the global optimum when the cost func-
tional is convex. Unfortunately, convexity is not necessarily observed in our case (see
Table 10.1). But, since each step of the algorithm involves an exact minimization, the
cost functional is guaranteed to decrease so that the method remains applicable in non-
convex situations. There is the risk, though, that it gets trapped into a local optimum. A
way around the difficulty is to consider a warm start that may be obtained by running
the ℓ2 (Gauss) or ℓ1 (Laplace) version of the method.
gradient) and present practical reconstruction results that highlight the influence of the potential function.
• Gaussian prior: ΦGauss(x) = Ax². In this case, the algorithm implements a classical linear reconstruction. This solution is representative of the level of performance of the reconstruction methods that are currently in use and that do not impose any sparsity on the solution.
• Laplace prior: ΦLaplace(x) = B|x|. This configuration is at the limit of convexity and imposes some medium level of sparsity. It corresponds to an ℓ1-type minimization and is presently very popular for the recovery of sparse signals in the context of compressed sensing.
• Student prior: ΦStudent(x) = C log(x² + ε). This log-like penalty is non-convex and allows for the kind of heavy-tail behavior associated with the sparsest processes.
The regularization constants A, B, C are optimized for each experiment by comparison of the solution with the noise-free reference (oracle) to obtain the highest possible SNR. The proximal step of the algorithm described by (10.24) is adapted to handle the discrete gradient operator by merely shrinking its magnitude. This is the natural vectorial extension dictated by Definition (10.16). The shrinkage function for the Laplace prior is a soft-threshold (see (10.18)), while Student's solution with small ε is much closer to a hard threshold and favors sparser solutions. The reconstruction is initialized in a systematic fashion: the solution of the Gaussian estimator is used as initialization for the Laplace estimator and the output of the Laplace estimator is used as initial guess for Student's estimator. The parameter for Student's estimator is set to ε = 10^{-2}.
position. This implies that the fluorescence microscope acts as a linear shift-invariant
system. It can therefore be described by the convolution equation
where h3D is the 3-D impulse response, also known as the point-spread function
(PSF) of the microscope. A high-performance microscope can be modeled as a perfect
aberration-free optical system whose only limiting factor is the finite size of the pupil
of the objective. The PSF is then given by
" x y z #2
h3D (x, y, z) = I0 pλ , , , (10.26)
M M M2
where I0 is a constant gain, M is the magnification factor, and pλ is the coherent dif-
fraction pattern of an ideal point source with emission wavelength λ that is induced by
the pupil. The square modulus accounts for the fact that the quantity measured at the
detector is the (incoherent) light intensity, which is the energy of the electric field. Also
note that the rescaling is not the same along the z dimension. The specific form of pλ ,
as provided by the Fraunhofer theory of diffraction, is
pλ(x, y, z) = ∫_{R²} P(ω1, ω2) exp( j2πz (ω1² + ω2²)/(2λf0²) ) exp( −j2π (xω1 + yω2)/(λf0) ) dω1 dω2, (10.27)
where f0 is the focal length of the objective and P(ω1, ω2) = 1_{‖ω‖<R0} is the pupil function. The latter is an indicator function that describes the circular aperture of radius
R0 in the so-called Fourier plane. The ratio R0 /f0 is a good predictor of the numerical
aperture 6 (NA), the optical parameter that is used in microscopy to specify the resolu-
tion of an objective through Abbe’s law.
The 3-D PSF given by (10.26) is shown in Figure 10.2. We observe that h3D is cir-
cularly symmetric with respect to the origin (x, y) = (0, 0) in the planes perpendicular
to the optical axis z and that it exhibits characteristic diffraction rings. It is narrowest
in the focal x-y plane with z = 0. The focal spot in Figure 10.2a is the Airy pattern
that determines the lateral resolution of the microscope (see (10.29) and the discus-
sion below). By contrast, the PSF is significantly broader in the axial direction. It also
spreads out linearly along the lateral dimension as one moves away from the focal plane.
The latter represents the effect of defocusing, with the external cone-shaped envelope in
Figure 10.2c being consistent with the simplified behavior predicted by ray optics. This
shows that, besides the fundamental limit on the lateral resolution that is imposed by the
pupil function, the primary source of blur in wide-field microscopy is along the optical
axis and is due to the superposition of the light contributions coming from the neighbor-
ing planes that are out of focus. The good news is that these effects can be partly com-
pensated through the use of 3-D deconvolution techniques. In practical deconvolution
6 The precise definition is NA = n sin θ , where n is the index of refraction of the operating medium (e.g., 1.0
for air, 1.33 for pure water, and 1.51 for immersion oil) and θ is the half-angle of the cone of light entering
the objective. The small-angle approximation for a normal use in air is NA ≈ R0 /f0 . A finer analysis that
takes into account the curvature of the lens shows that this simple ratio formula remains accurate even at
large numerical apertures in a well-corrected optical system.
Figure 10.2 Visualization of the 3-D PSF of a wide-field microscope in a normalized coordinate system. (a) Cut through the lateral x-y plane with z = 0 (in focus). (b) Cut through a lateral x-y plane with z = 1 (out of focus). (c) Cut through the axial x-z plane with y = 0.
microscopy, one typically acquires a focal series of images (called a z-stack) which are
then deconvolved in 3-D by using a PSF that is derived either experimentally, through
the imaging of fluorescent nano beads, or theoretically see (10.26) and (10.27) , based
on the optical parameters of the microscope (e.g., NA, λ, M). Note that there are also
optical solutions, such as confocal or light-sheet microscopy, for partially suppressing
the out-of-focus light, but that these require more sophisticated instrumentation and lon-
ger acquisition times. These modalities may also benefit from deconvolution, but to a
lesser extent.
Here, for simplicity, we shall concentrate on the simpler 2-D scenario where one is
imaging a thin horizontal specimen s whose fluorescence emitters are confined to the
focal plane z = 0. The image formation then simplifies to the purely 2-D convolution
autocorrelation function of the circular pupil function P(ω1, ω2). The calculation of this convolution yields

ĥ2D(ω) = π ( arccos(ω/ω0) − (ω/ω0) √(1 − (ω/ω0)²) ) for 0 ≤ ω < ω0, and 0 otherwise, (10.30)
where

ω0 = 2R0/(λf0) = π/r0 ≈ 2NA/λ
is the Rayleigh frequency. This shows that a microscope is a radially symmetric low-
pass filter whose cutoff frequency ω0 imposes a fundamental limit on resolution. A
Shannon-type sampling argument suggests that the ultimate resolution is r0 , assuming
that one is able to deconvolve the image. Alternatively, one can apply Rayleigh’s space-
domain criterion, which stipulates that the minimum distance at which two point sources
can be separated is when the first diffraction minimum coincides with the maximum
response of the second source. Since the function (J1 (r)/r)2 reaches its first zero at
r = 3.8317, this corresponds to a resolution limit at dRayleigh = 0.61λ × (f0 /R0 ). This
is consistent with Abbe’s celebrated formula for the diffraction limit of a microscope:
dAbbe = λ/(2NA). The better objectives have a large numerical aperture with the cur-
rent manufacturing limit being NA < 1.45 (with oil immersion). This puts the resolu-
tion limit to about one-third the wavelength λ (or, one-half the wavelength for the more
typical value of NA = 1). Sometimes in the literature, the PSF is approximated by a 2-D
isotropic Gaussian. The standard deviation that provides the closest fit with the physical
model (10.29) is σ0 = 0.42λ(f0 /R0 ) = 0.84/ω0 .
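For a sense of the numbers involved, the limits above are a one-line calculation (the wavelength and numerical aperture are assumed, illustrative values):

lam, NA = 520e-9, 1.0              # emission wavelength (m) and numerical aperture
d_abbe = lam / (2 * NA)            # Abbe diffraction limit, about 260 nm
d_rayleigh = 0.61 * lam / NA       # Rayleigh criterion, about 317 nm
sigma0 = 0.42 * lam / NA           # Gaussian fit to the PSF (using NA ~ R0/f0)
print(d_abbe, d_rayleigh, sigma0)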
Discretization
From now on, we assume that ω0 ≤ π so that we can sample the data on the integer
grid while meeting the Nyquist criterion. The corresponding analysis functions, which
are indexed by m = (m1 , m2 ), are therefore given by
ηm (x, y) = h2D (x − m1 , y − m2 ).
In order to discretize the system, we select a sinc basis {sinc(x − k)}k∈Z2 with
sinc(x, y) = sinc(x)sinc(y),
where sinc(x) = sin(πx)/(πx). The entries of the system matrix in (10.9) are then obtained as

[H]m,k = ⟨ηm, sinc(· − k)⟩ = h2D(m − k).
In effect, this is equivalent to constructing the system matrix from the samples of the
PSF since h2D is already band-limited as a result of the imaging physics (diffraction-
limited microscope).
Figure 10.3 Images used in deconvolution experiments. (a) Stem cells surrounded by goblet cells.
(b) Nerve cells growing around fibers. (c) Artery cells.
Experimental results
The reference data are provided by the three microscopic images in Figure 10.3, which
display different types of cells. The input images of size (512 × 512) are blurred with
a Gaussian PSF of support (9 × 9) and standard deviation σ0 = 4 to simulate the effect
of a wide-field microscope with a low-NA objective. The measurements are degraded
with additive white Gaussian noise so as to meet some prescribed blurred SNR (BSNR),
defined as BSNR = var(Hs)/σ 2 .
For deconvolution, the algorithm is run for a maximum of 500 iterations, or until the
absolute relative error between the successive iterates is less than 5×10−6 . The results are
summarized in Table 10.2. The first observation is that the standard linear deconvolution
(MAP estimator based on a Gaussian prior) performs remarkably well for the image in
Figure 10.3a, which is heavily textured. The MAP estimator based on the Laplace prior,
on the other hand, yields the best performance for images having sharp edges with a
moderate amount of texture, such as those in Figures 10.3b,c. This confirms the general
claim that it is possible to improve the reconstruction performance through the promotion
of sparse solutions. However, as the application of the Student prior to images typically
encountered in microscopy demonstrates, exaggeration in the enforcement of sparsity is
a distinct risk. Finally, we note that the Gaussian and Laplace versions of the algorithm
are compatible with the methods commonly used in the field; for instance, 2 -Tikhonov
regularization [PMC93] and 1 /TV regularization [DBFZ+ 06].
Table 10.2 Deconvolution performance of MAP estimators based on different prior distributions. The
best results are shown in boldface.
frequency that is proportional to the strength of the magnetic field. The basic idea of
magnetic resonance imaging (MRI) is to induce a space-dependent variation of the fre-
quency of resonance by imposing spatial magnetic gradients. The specimen is then exci-
ted by applying pulsed radio waves that cause the nuclei (or spins) in the specimen to
produce a rotating magnetic field detectable by the receiving coil(s) of the scanner.
Here, we shall focus on 2-D MRI, where the excitation is confined to a single plane. In
effect, by applying a proper sequence of magnetic gradient fields, one is able to sample
the (spatial) Fourier transform of the spin density s(r) with r ∈ R2 . Specifically, the mth
(noise-free) measurement is given by
s(ωm ) = s(r)e−jωm ,r dr,
R2
where the sampling occurs according to some predefined k-space trajectory (the conven-
tion in MRI is to use k = ωm as the spatial frequency variable). This is to say that the
underlying analysis functions are the complex exponentials ηm (r) = e−jωm ,r .
The basic problem in MRI is then to reconstruct s(r) based on the partial knowledge
of its Fourier coefficients which are also corrupted by noise. While the reconstruction in
the case of a dense Cartesian sampling amounts to a simple inverse Fourier transform,
it becomes more challenging for other trajectories, especially as the sampling density
decreases.
For simplicity, we discretize the forward model by using the same sinc basis functions as for the deconvolution problem of Section 10.3.2. This results in the system matrix

[H]m,k = e^{−j⟨ωm, k⟩},

under the assumption that ‖ωm‖∞ ≤ π. The clear advantage of using the sinc basis is that H reduces to a discrete Fourier-like matrix, with the caveat that the frequency sampling is not necessarily uniform.

Table 10.3 MR reconstruction performance of MAP estimators based on different prior distributions.

Figure 10.4 Data used in MR reconstruction experiments. (a) Cross section of a wrist. (b) Angiography image. (c) k-space sampling pattern along 40 radial lines.
A convenient feature of this imaging model is that the matrix HT H is circulant so that
the linear iteration step of the algorithm can be computed in exact form using the FFT.
Experimental results
To illustrate the method, we consider the reconstruction of the two MR images of size (256 × 256) shown in Figure 10.4: a cross section of a wrist and an MR angiogram. The
Fourier-domain measurements are simulated using the type of radial sampling pattern
shown in Figure 10.4c. The reconstruction algorithm is run with the same stopping
criteria as in Section 10.3.2. The reconstruction results for two sampling scenarios are
quantified in Table 10.3.
The first observation is that the estimator based on the Laplace prior generally outper-
forms the Gaussian solution, which corresponds to the traditional type of linear recons-
truction. The Laplace prior is a clear winner for the wrist image, which has sharp edges
and some amount of texture. While this is similar to the microscopy scenario, the ten-
dency appears to be more systematic for MRI: we were unable to find a single MR scan
in our database for which the Gaussian solution performs best. Yet, the supremacy of the ℓ1 solution is not universal, as illustrated by the reconstruction of the angiogram, for
Figure 10.5 X-ray tomography and the Radon transform. (a) Imaging geometry. (b) 2-D reconstruction of a tomogram. (c) Its Radon transform (sinogram).
which the Student prior yields the best results, because this image is inherently sparse
and composed of piecewise-smooth components. Similarly to the microscopy modality,
we note that the present MAP formulation is compatible with the deterministic schemes
used for practical MRI reconstruction; in particular, the methods that rely on total varia-
tion [BUF07] and log-based regularization [TM09], which are in direct correspondence
with the Laplace and Student priors, respectively.
The Radon transform of the absorption map s is given by

Rθ{s}(t) = ∫_{R²} s(x) δ( t − ⟨x, θ⟩ ) dx, (10.32)

with θ = (cos θ, sin θ) and index variables t ∈ R and θ ∈ [−π, π]. An example of a Radon transform corres-
ponding to a cross section of the human thorax is shown in Figure 10.5c. The resulting
family of functions gθ (t) = Rθ {s}(t) is called the sinogram, owing to the property that
the trajectory of a point (x0 , y0 ) = (R0 cos φ0 , R0 sin φ0 ) in Radon space is a sinusoidal
curve given by t0 (θ ) = x0 cos θ + y0 sin θ = R0 cos(θ − φ0 ).
In practice, the measurements correspond to the sampled values of the Radon trans-
form of the absorption map s(x) at a series of points (tm , θm ), m = 1, . . . , M. From
(10.32), we deduce that the analysis functions are
ηm(x) = δ( tm − ⟨x, θm⟩ ).
Discretization
For discretization purposes, we represent the absorption distribution as the weighted sum of separable B-spline-like basis functions:

s(x) = Σ_k s[k] β(x − k),

with β(x) = β(x)β(y), where β(x) is a suitable symmetric kernel (typically, a polynomial B-spline of degree n). The constraint here is that β ought to have a short support
to reduce computations, which rules out the use of the sinc basis.
In order to determine the system matrix, we need to compute the Radon transform of the basis functions. The properties of the Radon transform that are helpful for that purpose are

(1) the projected translation-invariance

Rθ{ϕ(· − x0)}(t) = Rθ{ϕ}( t − ⟨x0, θ⟩ );

(2) the Fourier central-slice theorem

∫_R Rθ{ϕ}(t) e^{−jωt} dt = ϕ̂(ω)|_{ω = ωθ}. (10.35)
The first result is obtained by a simple change of variable in (10.32). The second is
a direct consequence of (10.35). The Fourier central-slice theorem states that the 1-D
Fourier transform of the projection of ϕ at angle θ is equal to a corresponding central
cut of the 2-D Fourier transform of the function. The result is easy to establish for θ = 0,
in which case the t and x axes coincide. For this particular configuration, we have that
∫_R R0{ϕ}(x) e^{−jωx} dx = ∫_R ( ∫_R ϕ(x, y) dy ) e^{−jωx} dx

= ∫_{R²} ϕ(x, y) e^{−jωx} dx dy

= ϕ̂(ω, 0), (by definition)
where the interchange of integrals in the second line requires that ϕ ∈ L1 (R2 ) (Fubini).
The angle-dependent formula (10.35) is obtained by rotating the system of coordinates
and invoking the rotation property of the Fourier transform.
Next we show that the Radon transform of the basis functions can be obtained through the convolution of two rescaled 1-D kernels. Specifically, for a separable function ϕ(x) = ϕ1(x)ϕ2(y), Proposition 10.1 states that

Rθ{ϕ(· − x0)}(t) = ϕθ(t − t0),

with t0 = ⟨x0, θ⟩ and ϕθ = (1/|cos θ|) ϕ1(·/cos θ) ∗ (1/|sin θ|) ϕ2(·/sin θ). To see this, note that the separability of ϕ implies that

ϕ̂(ω) = ϕ̂1(ω1) ϕ̂2(ω2),

where ϕ̂1 and ϕ̂2 are the 1-D Fourier transforms of ϕ1 and ϕ2, respectively. The Fourier central-slice theorem then implies that

ϕ̂θ(ω) = F{Rθ ϕ}(ω) = ϕ̂1(ω cos θ) ϕ̂2(ω sin θ).

Next, we note that ϕ̂1(ω cos θ) and ϕ̂2(ω sin θ) are the 1-D Fourier transforms of (1/|cos θ|) ϕ1(t/cos θ) and (1/|sin θ|) ϕ2(t/sin θ), respectively. The final result then follows from (10.33) and the property that the Fourier-domain product maps into a time-domain convolution.
The entries of the system matrix are therefore obtained as

[H]m,k = Rθm{β(· − k)}(tm) = βθm( tm − ⟨k, θm⟩ ),

where βθm(t) is the projection of β(x) = β(x)β(y) along the direction θm, as specified in Proposition 10.1.
We shall now apply the result of Proposition 10.1 to determine the Radon transform of a symmetric tensor-product polynomial spline of degree n. The relevant 1-D formula for β(x) is

β^n(x) = Σ_{k=0}^{n+1} (−1)^k ( n+1 choose k ) ( x − k + (n+1)/2 )_+^n / n!,
which is the recentered version of (1.11). Next, by making use of the distributivity of convolution and the relation

( t_+^{n1} / n1! ) ∗ ( t_+^{n2} / n2! ) = t_+^{n1+n2+1} / (n1 + n2 + 1)!,
we find that

Rθ{ β^n(x) β^n(y) }(t) = Σ_{k=0}^{n+1} Σ_{k′=0}^{n+1} (−1)^{k+k′} ( n+1 choose k )( n+1 choose k′ ) ( t + ((n+1)/2 − k) cos θ + ((n+1)/2 − k′) sin θ )_+^{2n+1} / ( |cos θ|^{n+1} |sin θ|^{n+1} (2n+1)! ), (10.36)
which provides an explicit formula for the Radon transform of a B-spline of any degree n for θ ∉ {0, ±π/2, ±π}; when the projection is along a coordinate axis, the Radon transform is simply β^n(t). This result is a special case of the general box-spline calculus described in [ENU12].
For the present experiments, we select n = 1. This corresponds to piecewise-bilinear basis functions whose Radon transforms are the (non-uniform) cubic splines specified by (10.36) for θ ∉ {0, ±π/2, ±π}, or simple triangle functions otherwise. The Radon profiles are stored in a lookup table to speed up computations. In essence, the forward matrix H amounts to a “standard” projection with angle-dependent interpolation weights given by βθ, while HT is the corresponding backprojection. For a parallel geometry, their computational complexity is O( N × Mθ × (n + 1) ), where N is the number of pixels (or B-spline coefficients) of the reconstructed image, Mθ the number of distinct angular directions, and n the degree of the B-spline.
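Such a Radon profile is easy to evaluate numerically via Proposition 10.1, as a convolution of two rescaled 1-D kernels. The sketch below uses degree n = 1 (triangle kernels); the grid parameters are arbitrary.

import numpy as np

def bspline1(t):
    return np.maximum(0.0, 1.0 - np.abs(t))       # degree-1 B-spline (triangle)

def radon_profile(theta, t):
    # Proposition 10.1: the projection of beta(x)beta(y) is the convolution of
    # the rescaled kernels beta(./cos)/|cos| and beta(./sin)/|sin|
    c, s = abs(np.cos(theta)), abs(np.sin(theta))
    if min(c, s) < 1e-8:                          # projection along a coordinate axis
        return bspline1(t)
    dt = t[1] - t[0]
    full = np.convolve(bspline1(t / c) / c, bspline1(t / s) / s)
    return dt * full[len(t) // 2 : len(t) // 2 + len(t)]   # recenter on the grid

t = np.linspace(-2.0, 2.0, 801)
profile = radon_profile(np.pi / 6, t)
print(profile.sum() * (t[1] - t[0]))              # about 1: projections preserve mass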
Experimental results
We consider the two images shown in Figure 10.6. The first is the Shepp–Logan (SL)
phantom of size (256 × 256), while the second is a real CT reconstruction of the cross
section of the lung of size (750 × 750). In the simulations of the forward model, we
use a standard parallel geometry with an angular sampling that is matched to the size
of the images. Specifically, the projections are taken along Mθ = 180, 360 equiangular
directions for the lung image and Mθ = 120, 180 directions for the SL phantom. The
measurements are degraded with Gaussian noise with a signal-to-noise ratio of 20 dB.
For the reconstruction, we solve the quadratic minimization problem (10.21) itera-
tively by using 50 conjugate-gradient (inner) iterations. The reconstruction results are
reported in Table 10.4.
Table 10.4 Reconstruction results of X-ray computed tomography using different estimators.
Figure 10.6 Images used in X-ray tomographic reconstruction experiments. (a) The Shepp-Logan (SL) phantom. (b) Cross section of a human lung.
We observe that the imposition of the strong level of sparsity brought by Student
priors is advantageous for the SL phantom. This is not overly surprising given that the
SL phantom is an artificial construct composed of piecewise-constant regions (ellipses).
For the realistic lung image (true CT), we find that the Gaussian solution outperforms
the others. Similarly to the deconvolution and MRI problems, the present MAP esti-
mators are in line with the Tikhonov-type [WLLL06] and TV [XQJ05] reconstructions
used for X-ray CT.
10.3.5 Discussion
During our investigation of real image-reconstruction problems, we have highlighted
the similarity between deterministic sparsity-promoting methods and MAP estimators
for sparse stochastic processes. The experiments we conducted with different imaging
modalities confirm the importance of sparse modeling in the reconstruction of biomed-
ical images. We found that imposing a medium level of sparsity, as afforded by the
Laplace prior (ℓ1-norm minimization), is beneficial in most instances. Heavier-tailed
priors are available too, but they are helpful only for a limited class of images that
are inherently sparse. At the other end of the palette is the “classical” linear type of
reconstruction (Gaussian prior), which performs remarkably well for images whose
content is more diffuse/textured or when the inverse problem is well conditioned. This
confirms that the efficiency of a potential function depends strongly on the type of image
being considered. In our model, this is related to the Lévy exponent of the underlying
continuous-domain innovation process w, which is in direct relationship with the signal
prior.
As far as the relevance of the underlying model is concerned, we like to view the
present set of techniques and continuous-domain stochastic models as a conceptual
framework for deriving and refining state-of-the-art algorithms in a principled fashion.
The reassuring aspect is that the approach gives support to several algorithms that are
presently used in the field.
The next step, of course, would be to determine how to best fit the model to the
data. However, the inherent difficulty with this Bayesian view of the problem is that
there is actually no guarantee that (non-Gaussian) MAP estimation performs best for
the class of signals for which it is designed. There is even evidence that a slight model
mismatch (e.g., modification of the MAP criterion) can be beneficial in some instances
(see Section 10.4.3 for explicit illustrations of this statement).
The current challenge is to take full advantage of the statistical model and to find a
proper way of constraining the solution. One possible approach is to specify recons-
truction methods that are (sub)optimal in the MMSE sense for particular classes of
stochastic processes. While it is still not clear how this can be achieved in full general-
ity, Section 10.4 demonstrates how to proceed for the simpler signal-denoising scenario
where the system matrix is the identity.
10.4 The quest for the minimum-error solution

In the Bayesian framework where the prior distribution of the signal is known, the
optimal (MMSE) reconstruction is the conditional expectation of the signal given the
measurements. Unfortunately, the direct computation of the MMSE solution, which is
specified by an N-dimensional integral, is intractable numerically for the (non-Gaussian)
cases of interest to us. This is to be contrasted with the previous MAP formulation which
translates into the “Gibbs energy” minimization problem (10.12) that can be solved
numerically using standard optimization techniques. Since the algorithms favored by
practitioners are based on similar variational principles, a key issue is to characterize
their degree of (sub)optimality and, in the case of deficiency of the MAP criterion, to
understand how the energy functional should be modified in order to improve the quality
of the reconstruction.
In this section, we investigate the problem of the denoising of Lévy processes for
which questions regarding optimality can be answered to a large extent. Specifically, we
shall define the corresponding MMSE signal estimator, derive a computational solution
based on belief propagation, and use the latter as gold standard to assess the performance
of the primary types of MAP estimators previously considered.
The data-formation model is
$$p_{Y|X}(\mathbf{y}|\mathbf{x}) = \prod_{n=1}^{N} p_{Y|X}(y_n|x_n),$$
which assumes that the noise contributions are independent and characterized by the
conditional pdf $p_{Y|X}$. For instance, in the case of AWGN, we have that $p_{Y|X}(y|x) =
g_\sigma(y-x)$, where $g_\sigma$ is a centered Gaussian distribution with standard deviation $\sigma$.
Given the measurements y = (y1 , . . . , yN ), the problem is to recover the unknown
signal vector x = (x1 , . . . , xN ) based on the knowledge that the latter is a realization
of a sparse first-order process of the type characterized in Section 8.5.2. This prior
information is summarized by the stochastic difference equation
un = xn − a1 xn−1 ,
where (un ) is an i.i.d. sequence with infinitely divisible pdf pU , with the implicit
convention that x0 = 0 (or, alternatively, x0 = xN if we are applying circular boundary
conditions). This model covers the cases of the non-Gaussian AR(1) processes (when
|a1 | < 1) and of the Lévy processes for a1 = 1. The posterior distribution of the signal
is therefore given by
$$p_{X_1:X_N|Y_1:Y_N}(\mathbf{x}|\mathbf{y}) = \frac{1}{Z}\prod_{n=1}^{N} p_{Y|X}(y_n|x_n)\,\prod_{n=1}^{N} p_U(\underbrace{x_n - a_1 x_{n-1}}_{u_n}), \qquad (10.37)$$
where Z is a proper normalization constant. We can then formally specify the optimal
signal estimate as
$$\mathbf{x}_{\mathrm{MMSE}}(\mathbf{y}) = \mathrm{E}\{\mathbf{x}|\mathbf{y}\} = \int_{\mathbb{R}^N} \mathbf{x}\; p_{X_1:X_N|Y_1:Y_N}(\mathbf{x}|\mathbf{y})\,\mathrm{d}\mathbf{x}. \qquad (10.38)$$
For completeness, we now briefly show that the conditional mean (10.38) minimizes
the mean-square estimation error among all signal estimators. An estimator x̃ = x̃(y) is
a specific function of the measurement vector y and its performance is measured by the
(conditional) mean-square estimation error
$$\mathrm{E}\{\|\mathbf{x}-\tilde{\mathbf{x}}\|^2\,|\,\mathbf{y}\} = \int_{\mathbb{R}^N} \|\mathbf{x}-\tilde{\mathbf{x}}\|^2\; p_{X_1:X_N|Y_1:Y_N}(\mathbf{x}|\mathbf{y})\,\mathrm{d}\mathbf{x}. \qquad (10.40)$$
Since y is fixed, we can minimize this expression by annihilating its partial derivatives
with respect to x̃. This gives
$$\frac{\partial\,\mathrm{E}\{\|\mathbf{x}-\tilde{\mathbf{x}}\|^2\,|\,\mathbf{y}\}}{\partial \tilde{\mathbf{x}}} = -\int_{\mathbb{R}^N} 2\,(\mathbf{x}-\tilde{\mathbf{x}})\; p_{X_1:X_N|Y_1:Y_N}(\mathbf{x}|\mathbf{y})\,\mathrm{d}\mathbf{x} = 0,$$
which proves that (10.38) is the MMSE solution. The implicit assumption here is that
$\int_{\mathbb{R}^N} \|\mathbf{x}\|^n\, p_{X|Y}(\mathbf{x}|\mathbf{y})\,\mathrm{d}\mathbf{x} < \infty$ for $n = 1, 2$ so that (10.40) is well defined and so that we
can safely differentiate under the integral sign (by Lebesgue's dominated-convergence
theorem).
Finally, we note that the MMSE estimator provided by (10.38) has the following
properties:
• It is unbiased with E{xMMSE (y)} = E{x}.
• It satisfies the statistical “orthogonality” principle
$$\mathrm{E}\Big\{\tilde{\mathbf{x}}(\mathbf{y})^{T}\big(\mathbf{x}-\mathbf{x}_{\mathrm{MMSE}}(\mathbf{y})\big)\Big\} = 0.$$
$$p_{X_1:X_3|Y_1:Y_3}(\mathbf{x}|\mathbf{y}) \propto p_U(x_1)\, p_{Y|X}(y_1|x_1)\, p_U(x_2-x_1)\, p_{Y|X}(y_2|x_2)\, p_U(x_3-x_2)\, p_{Y|X}(y_3|x_3). \qquad (10.41)$$
Figure 10.7 Factor-graph representation of the posterior distribution (10.41) or (10.37) in greater
generality. The boxed nodes represent the factors of the pdf and the circled nodes the unknown
variables. The presence of an edge between a factor and a variable node indicates that the
latter variable is active within the factor. The functions $\mu^-_{X_2}(x)$ and $\mu^+_{X_2}(x)$ represent the beliefs
at the variable node $X_2$; they condense all the statistical information coming from the left and
right of the graph, respectively.
The crucial step for exact Bayesian inference is to marginalize p(X1 :XN |Y1 :YN ) (x|y) with
respect to xn . In short, we need to integrate the posterior pdf over all other variables.
For instance, in the case of $x_2$, we get
$$p_{X_2|Y_1:Y_3}(x_2|\mathbf{y}) = \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_1:X_3|Y_1:Y_3}(\mathbf{x}|\mathbf{y})\,\mathrm{d}x_1\,\mathrm{d}x_3,$$
where the partial integrations over $x_1$ and $x_3$ give rise to the left and right belief
functions $\mu^-_{X_1}(x_1)$ and $\mu^+_{X_3}(x_3)$, respectively.
To evaluate the marginal with respect to $x_3$, we take advantage of the previous integration over $x_1$ encoded in the "belief" function $\mu^-_{X_2}(x_2)$ and proceed as follows:
$$p_{X_3|Y_1:Y_3}(x_3|\mathbf{y}) = \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_1:X_3|Y_1:Y_3}(\mathbf{x}|\mathbf{y})\,\mathrm{d}x_1\,\mathrm{d}x_2 \propto \underbrace{\int_{\mathbb{R}} \mu^-_{X_2}(x_2)\, p_{Y|X}(y_2|x_2)\, p_U(x_3-x_2)\,\mathrm{d}x_2}_{\mu^-_{X_3}(x_3)}\;\cdot\; p_{Y|X}(y_3|x_3)\;\cdot\; \underbrace{1}_{\mu^+_{X_3}(x_3)}\,.$$
The emerging pattern is that the marginal distribution of the variable $x_n$ can be expressed
as a product of three terms: the forward belief $\mu^-_{X_n}(x_n)$, the local data term $p_{Y|X}(y_n|x_n)$,
and the backward belief $\mu^+_{X_n}(x_n)$. The beliefs are propagated as follows.
• Initialization. Set
$$\mu^-_{X_1}(x) = p_U(x), \qquad \mu^+_{X_N}(x) = 1.$$
• Recursion. For $n = 2, \dots, N$ (forward) and $n = N-1, \dots, 1$ (backward), compute
$$\mu^-_{X_n}(x) \propto \int_{\mathbb{R}} \mu^-_{X_{n-1}}(z)\, p_{Y|X}(y_{n-1}|z)\, p_U(x - a_1 z)\,\mathrm{d}z \qquad (10.42)$$
$$\mu^+_{X_n}(x) \propto \int_{\mathbb{R}} \mu^+_{X_{n+1}}(z)\, p_{Y|X}(y_{n+1}|z)\, p_U(z - a_1 x)\,\mathrm{d}z. \qquad (10.43)$$
The symbol ∝ denotes a renormalization such that the resulting function integrates to
one. The critical part of this algorithm is the evaluation of the convolution-like
integrals (10.42) and (10.43). The scalar belief functions $\big\{\mu^-_{X_n}(x), \mu^+_{X_n}(x)\big\}_{n=1}^{N}$ that result
from these calculations also need to be stored, which presupposes some form of discre-
tization.
clear in Chapters 4 and 9. The second reason is that, for $a_1 = 1$ (Lévy process), (10.42)
as well as (10.43) may be rewritten as the convolution
$$\mu^-_{X_n}(x) \propto \int_{\mathbb{R}} g(z)\, p_U(x-z)\,\mathrm{d}z = (p_U * g)(x) = \mathcal{F}^{-1}\big\{\mathrm{e}^{f_U(\omega)}\,\hat{g}(\omega)\big\}(x),$$
where $g(z) = \mu^-_{X_{n-1}}(z)\, p_{Y|X}(y_{n-1}|z)$. In order to obtain a cost-effective implementation,
we suggest evaluating the various formulas using pointwise products only, switching
back and forth between the "time" domain to compute $g(z)$ and the Fourier domain to
evaluate the convolution. The corresponding algorithm is summarized below. For sim-
plicity, we assume that the underlying pdfs are symmetric; the presence of complex
conjugation and the differences in convention from the Fourier transform as used by
statisticians are then inconsequential.
• Initialization. Set
$$\hat{\mu}^-_{X_1}(\omega) = \mathrm{e}^{f_U(\omega)}, \qquad \hat{\mu}^+_{X_N}(\omega) = \delta(\omega).$$
• Update. The Fourier-domain counterpart of the belief recursion reads
$$\hat{\mu}^-_{X_n}(\omega) \propto \hat{p}_U(\omega)\,\hat{g}(\omega).$$
The symbol ∝ denotes a renormalization such that the value of the Fourier transform at
the origin is one. We have taken advantage of the moment-generating property of the
Fourier transform to establish (10.45).
The conventional and Fourier-based versions of the BP algorithm yield the exact
MMSE estimator for our problem. However, both involve continuous mathematics
(integrals and/or Fourier transforms) and neither one can be implemented in the given
form. The simplest and most practical generic solution is to represent the belief func-
tions by their samples on a uniform grid with a sampling step that is sufficiently fine
and to truncate their support while maintaining the error within an acceptable bound.
Integrals are then approximated by Riemann sums and the Fourier transform is imple-
mented using the FFT.
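For concreteness, the following minimal sketch (in Python, assuming NumPy/SciPy) implements the sampled forward pass for a Lévy process ($a_1 = 1$) in AWGN; the grid limits, the Laplace increment pdf, and all variable names are illustrative choices rather than prescriptions from the text.

```python
import numpy as np
from scipy.signal import fftconvolve

# Uniform grid on which the beliefs are sampled (truncated support).
x = np.linspace(-10.0, 10.0, 1024)
dx = x[1] - x[0]

lam = 2.0                                   # illustrative Laplace increments
p_U = 0.5 * lam * np.exp(-lam * np.abs(x))  # infinitely divisible pdf p_U

def likelihood(y_n, sigma):
    """AWGN likelihood p_{Y|X}(y_n | x), sampled on the grid."""
    return np.exp(-(y_n - x) ** 2 / (2.0 * sigma ** 2))

def forward_messages(y, sigma=1.0):
    """Forward beliefs mu^-_{X_n} for a Levy process (a_1 = 1)."""
    mu = np.empty((len(y), x.size))
    mu[0] = p_U / (p_U.sum() * dx)          # initialization: mu^-_{X_1} = p_U
    for n in range(1, len(y)):
        g = mu[n - 1] * likelihood(y[n - 1], sigma)   # g(z) = mu^-(z) p(y|z)
        conv = fftconvolve(g, p_U, mode="same") * dx  # (p_U * g)(x), FFT-based
        mu[n] = conv / (conv.sum() * dx)              # renormalize to unit mass
    return mu
```

The backward pass mirrors this loop with the indices reversed, and the posterior marginal at node $n$ is proportional to the product of the two beliefs with the local likelihood, whose grid average yields the MMSE estimate.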
Figure 10.8 Examples of Lévy processes with increasing degree of sparsity. (a) Brownian motion
(with Gaussian increments). (b) Lévy–Laplace motion. (c) Compound–Poisson process. (d)
Lévy flight with Cauchy-distributed increments.
The signal length was set to N = 100. Each data point in a graph is an average resulting
from 500 realizations of the denoising experiment. The MMSE estimator (Fourier-based
implementation of belief propagation) relies on the correct prior signal model, while the
regularization parameter τ of each of the other three estimators is kept constant for a
given noise level, and adjusted to yield the lowest collective estimation error.
We show in the graph of Figure 10.9 the signal-to-noise improvement of the various
algorithms for the denoising of Brownian motion. The first observation is that the results
of the BP MMSE estimator and the Wiener filter (LMMSE=MAP Gaussian) are indis-
tinguishable and that these methods perform best over the whole range of experimenta-
tion, in agreement with the theory. The worst results are obtained for TV regularization,
while the Log penalty gives intermediate results. A possible explanation of the latter
finding is that the Log potential is quadratic around the origin, so that it can replicate
the behavior of ℓ2, but only over some limited range of input values.
A similar scenario is repeated in Figure 10.10 for the compound-Poisson process. We
note that the corresponding MAP estimate is a constant signal since the probability of
its increments being zero is overwhelmingly larger (Dirac mass at the origin) than any
other acceptable value. This trivial estimator is excluded from the comparison. At low
noise levels where the sparsity of the source dictates the structure, the performance of
the TV estimator is very close to that of the MMSE denoiser, which can be considered
as gold standard. Yet, the relative performance of the TV estimator deteriorates with
increasing noise, so much so that TV ends up being worst at the other end of the scale. One
can observe a reverse trend for the LMMSE estimator, which progressively converges
to the MMSE solution as the variance of the noise increases. Here, the explanation is
that the statistics of the noisy signal is dominated by the Gaussian constituent, which is
favorable to the LMMSE estimator.
Figure 10.9 SNR improvement as a function of the level of noise for Brownian motion. The
denoising methods by order of decreasing performance are: MMSE estimator (which is
equivalent to the LMMSE and MAP estimators), Log regularization, and TV regularization.
Figure 10.10 SNR improvement as a function of the level of noise for a piecewise-constant signal
(compound-Poisson process). The denoising methods are: MMSE estimator, Log regularization,
TV regularization, and LMMSE estimator.
The more challenging case of a Lévy flight, which is the sparsest process in the series,
is documented in Figure 10.11. Here, the MAP estimator (Log potential) performs well
over the whole range of experimentation. A possible explanation is that the corrupted
signal looks sparse even at large noise powers since the convolution of a heavy-tailed
pdf with a Gaussian remains heavy-tailed. The dominance of the non-Gaussian regime
also explains why the LMMSE performs so poorly. The main limitation of the LMMSE
algorithm is that it fails to preserve the sharp edges that are characteristic of this type of
signal.
The last example, in Figure 10.12, is particularly telling because the results go against
our initial expectation, especially at higher noise levels. While the MAP (TV) estimator
performs well in the low-noise regime, it progressively falls behind all other estimators
as the variance of the noise increases. Particularly surprising is the good behavior of the
LMMSE algorithm, which matches the MMSE solution at higher noise levels. Apart
from the MMSE denoiser, there is no single estimator that outperforms the others over
the whole range of noise. The possible reason for the poor performance of MAP is that
the underlying signal is at the very low end of sparsity, with its general appearance being
rather similar to Brownian motion (see Figure 10.8a,b). This finding suggests that one
should be cautious with the Bayesian justification of ℓ1-norm minimization techniques
Figure 10.11 SNR improvement as a function of the level of noise for a Lévy flight with
Cauchy-distributed increments. The denoising methods by order of decreasing performance are:
MMSE estimator, MAP estimator (Log regularization), TV regularization, and LMMSE
estimator.
Figure 10.12 SNR improvement as a function of the level of noise for a Lévy process with
Laplace-distributed increments. The denoising methods are: MMSE estimator, LMMSE
estimator, MAP estimator (TV regularization), and Log regularization.
based on Laplace priors since MAP-TV does not necessarily perform so well for the
model to which it is matched – in fact, often worse than the classical Wiener solution
(LMMSE).
While this series of experiments stands as a warning against a strict application of
the MAP principle, it also shows that it is possible to specify variational estimators that
approximate the MMSE solution well. The caveat is that the best performing potential is
not necessarily the prior log-likelihood function associated with the probability model,
which calls for further investigation.
Section 10.3
Self-similar probability models are commonly used as prior knowledge for image pro-
cessing [PPLV02]. The property of scale-invariance is supported by empirical obser-
vations of the power spectrum of natural images [Fie87, RB94, OF96b, SO01]; it is
also motivated by physics and biology [Man82, Rud97, MG01]. The non-Gaussian cha-
racter of images is well documented too [SLSZ03]. The wavelet-domain histograms,
in particular, are typically leptokurtotic (with a peak at the origin) and heavy-tailed
[Fie87,Mal89,SLG02,PSWS03]. The same holds true for the pdfs of image derivatives,
which are often exponential or even subexponential. Grenander and Srivastava have
shown that this behavior can be induced by a simple generative model that involves the
random superposition of a fixed collection of templates [GS01]. Mumford and Gidas
have introduced a scale-invariant adaptation of this model that takes the form of a sto-
chastic wavelet expansion with a random placement and scaling of the atoms [MG01].
This random wavelet model has the same phenomenological characteristics – infinite
divisibility and wide-sense self-similarity – as the sparse stochastic processes being
considered here. However, it does not lend itself as well to statistical inference because
of the lack of an underlying innovation model.
discretization that is used in the present experiments is slightly more involved but has
better approximation properties [ENU12].
Section 10.4
The connection between TV denoising and the MAP estimation of Lévy processes was
pointed out in [UT11]. The direct solution for the MMSE denoising of Lévy processes,
which is based on belief propagation, was proposed by Kamilov et al. [KPAU13]. For
a general presentation of the methods of belief propagation and message passing, we
refer to the articles of Loeliger et al. [KFL01, Loe04]. The paper by Amini et al.
[AKBU13] provides the basic tools for the proper statistical formulation of signal recov-
ery for higher-order sparse stochastic models. It also includes the type of experimental
comparison presented in Section 10.4.3. The term “Lévy flight” was coined by Mandel-
brot [Man82]. This stochastic model induces a chaotic behavior with random displace-
ments interspersed with sudden jumps. It is characteristic of the path followed by birds
and other animals when searching for food [VBB+ 02].
Several authors have identified deficiencies of non-Gaussian MAP estimation tech-
niques [Nik07, Gri11, SDFR13]. Conversely, Gribonval has shown that, in the AWGN
scenario, there exists a penalized least-squares estimator with an appropriate penalty
that is equivalent to the MMSE solution, and that this modified penalty may not coincide
with the prior log-likelihood function associated with the underlying statistical model.
11 Wavelet-domain methods
A simple and surprisingly effective approach for removing noise in images is to expand
the signal in an orthogonal wavelet basis, apply a soft-threshold to the wavelet coef-
ficients, and reconstruct the “denoised” image by inverse wavelet transformation. The
classical justification for the algorithm is that i.i.d. noise is spread out uniformly in the
wavelet domain while the signal gets concentrated in a few significant coefficients (spar-
sity property) so that the smaller values can be primarily attributed to noise and easily
suppressed.
In this chapter, we take advantage of our statistical framework to revisit such wave-
let-based reconstruction methods. Our first objective is to present some alternative
dictionary-based techniques for the resolution of general inverse problems based on
the same stochastic models as in Chapter 10. Our second goal is to take advantage
of the orthogonality of wavelets to get a deeper understanding of the effect of proxi-
mal operators while investigating the possibility of optimizing shrinkage/thresholding
functions for better performance. Finally, we shall attempt to bridge the gap between
operator-based regularization, as discussed in Sections 10.2–10.3, and the imposition
of sparsity constraints in the wavelet domain. Fundamentally, this relates to the dicho-
tomy between an analysis point of view of the problem (typically in the form of the
minimization of an energy functional with a regularization term) vs. a synthesis point
of view, where a signal is represented as a sum of elementary constituents (wavelets).
The chapter is composed of two main parts. The first is devoted to inverse problems
in general. Specifically, in Section 11.1 we apply our general discretization and model-
ing paradigm to the derivation of wavelet-domain MAP estimators for the resolution of
linear inverse problems. One of the key differences from the innovation-based formula-
tion of Chapter 10 is the presence of scale-dependent potential functions whose form is
specified by the stochastic model. We then address practical issues in Section 11.2 with
the presentation of the two primary iterative thresholding algorithms (ISTA and FISTA).
These methods are illustrated with the deconvolution of fluorescence micrographs.
The second part of the chapter focuses on the denoising problem, with the aim
of improving upon simple soft-thresholding and wavelet-domain MAP estimation.
Section 11.3 presents a detailed investigation of shrinkage functions in relation to infi-
nitely divisible laws, with the emphasis on pointwise estimators that are optimal in the
MMSE sense. In Section 11.4, we show how the performance of wavelet denoising
can be boosted even further through the use of redundant representations (tight wavelet
frames). In particular, we describe the concept of consistent cycle spinning, which
provides a conceptual bridge with the optimal estimation techniques of Section 10.4.
We then close the circle in Section 11.4.4 by combining all ingredients – tight operator-
like wavelet frames, MMSE shrinkage functions, and consistent cycle spinning – and
present an iterative wavelet-based algorithm that converges empirically to the reference
MMSE solution of Section 10.4.2.
$$v_i[k] = \langle \tilde{\psi}_{i,k}, s \rangle.$$
where φ̃i ∈ L1 (Rd ) is some suitable (possibly, scale-dependent) smoothing kernel and
D is the dilation matrix that specifies the multiresolution decomposition. Recalling that
s = L−1 w, this implies that
so that it is possible to derive any finite-dimensional joint pdf of the wavelet coefficients
vi [·] by using the general white-noise analysis exposed in Chapters 8 and 9. In particular,
Proposition 8.6 tells us that $p_{V_i}$, the pdf of the wavelet coefficients at scale $i$, is infinitely
divisible with modified Lévy exponent $f_{\tilde{\phi}_i}(\omega) = \int_{\mathbb{R}^d} f\big(\omega\,\tilde{\phi}_i(\mathbf{r})\big)\,\mathrm{d}\mathbf{r}$.
where i denotes the wavelet-domain index set corresponding to the region of interest
(ROI) . Note that the above expansion spans the same signal space as (10.8), provided
that we select β = βL as the scaling function of the wavelet system {ψi,k }.
The signal in (11.3) is uniquely specified by an N-dimensional vector v of pooled
wavelet coefficients vi [k], k ∈ i , i = 1, . . . , Imax . The right-hand side of (11.3) also
indicates that there is a linear, one-to-one correspondence between the sequence of
wavelet coefficients vi [·] and the discrete signal s[·]. This mapping specifies the discrete
wavelet transform which admits a fast filterbank implementation. In vector notation,
this translates into
$$\mathbf{v} = \tilde{\mathbf{W}}\mathbf{s} \quad \Leftrightarrow \quad \mathbf{s} = \mathbf{W}\mathbf{v}$$
with $\mathbf{W} = \tilde{\mathbf{W}}^{-1}$, where the entries of the $(N \times N)$ wavelet matrices $\tilde{\mathbf{W}}$ and $\mathbf{W}$ are given by
$$[\tilde{\mathbf{W}}]_{(i,k),k'} = \langle \tilde{\psi}_{i,k}, \beta_{L,k'} \rangle, \qquad [\mathbf{W}]_{k',(i,k)} = \langle \tilde{\beta}_{L,k'}, \psi_{i,k} \rangle,$$
respectively. Also note that the wavelet basis is orthonormal if and only if ψ̃i,k = ψi,k ,
which translates into W̃ = WT being an orthonormal matrix; this latter property presup-
poses that the underlying scaling functions are orthogonal too.
With the above convention, we write the wavelet version of the measurement equa-
tion (10.9) as
y = Hwav v + n,
with wavelet-domain system matrix $\mathbf{H}_{\mathrm{wav}}$ whose entries are given by
$$[\mathbf{H}_{\mathrm{wav}}]_{m,(i,k)} = \langle \eta_m, \psi_{i,k} \rangle, \qquad (11.4)$$
where ηm is the analysis function corresponding to the mth measurement. The link with
(10.10) in Section 10.1.2 is Hwav = HW with the proper choice of analysis function
β̃ = β̃L .
For the purpose of simplification and mathematical tractability, we now make the
same kind of decoupling simplification as in Section 10.1.2, treating the wavelet com-
ponents as if they were independent. 1 Using Bayes’ rule, we get the corresponding
1 While this approximation is legitimate within a given scale for sufficiently well-localized wavelets, it is
less so between scales because the wavelet smoothing kernels $\tilde{\phi}_i$ and $\tilde{\phi}_{i'}$ of distinct scales typically overlap. (A more refined
probabilistic model should take those inter-scale dependencies into consideration.)
posterior distribution
$$p_{\mathbf{V}|\mathbf{Y}}(\mathbf{v}|\mathbf{y}) \propto \exp\Big(-\frac{\|\mathbf{y}-\mathbf{H}_{\mathrm{wav}}\mathbf{v}\|_2^2}{2\sigma^2}\Big)\prod_i \prod_{k} p_{V_i}\big(v_i[k]\big),$$
where $p_{V_i}$ is the (conjugate) inverse Fourier transform of $\hat{p}_{V_i}(\omega) = \mathrm{e}^{f_{\tilde{\phi}_i}(\omega)}$. By maximizing $p_{\mathbf{V}|\mathbf{Y}}$, we derive the wavelet-domain version of the MAP estimator
$$\mathbf{v}_{\mathrm{MAP}}(\mathbf{y}) = \arg\min_{\mathbf{v}} \left\{ \frac{1}{2}\,\|\mathbf{y}-\mathbf{H}_{\mathrm{wav}}\mathbf{v}\|_2^2 + \sigma^2 \sum_i \sum_{k} \Phi_{V_i}\big(v_i[k]\big) \right\}, \qquad (11.5)$$
which is similar to (10.12), except that it now involves the series of wavelet potentials
$\Phi_{V_i}(x) = -\log p_{V_i}(x)$.
The specificity of the present MAP formulation is that the potential functions $\Phi_{V_i}$ are
scale-dependent and tied to the Lévy exponent $f$ of the continuous-domain innovation $w$.
Since the pdfs $p_{V_i}$ of the wavelet coefficients are infinitely divisible with Lévy exponent
$f_{\tilde{\phi}_i}$, we can determine the exact form of the potentials as
$$\Phi_{V_i}(x) = -\log \int_{\mathbb{R}} \exp\big(f_{\tilde{\phi}_i}(\omega) - \mathrm{j}\omega x\big)\,\frac{\mathrm{d}\omega}{2\pi} \qquad (11.6)$$
with
$$f_{\tilde{\phi}_i}(\omega) = \int_{\mathbb{R}^d} f\big(\omega\,\tilde{\phi}_i(\mathbf{r})\big)\,\mathrm{d}\mathbf{r},$$
where φ̃i is the wavelet smoothing kernel at resolution i in (11.2). Moreover, we can
rely on the theoretical analysis of id potentials in Section 10.2.1, which remains valid in
the wavelet domain, to extract the global characteristics of $\Phi_{V_i}$. The general trend that
emerges is that these characteristics are mostly insensitive to the exact shape of φ̃i , and
hence to the choice of a particular wavelet basis.
The assumption that the underlying signal s(r) is (second-order) self-similar has
direct repercussions on the form of the potentials and their evolution across scale. Based
on the analysis in Section 9.8, we find that the wavelet-domain pdfs pVi are members of
the same class. Specifically, their Lévy exponent at resolution i is given by
$$f_{\tilde{\phi}_i}(\omega) = 2^{id}\, f_{\tilde{\phi}}\big(2^{i(\gamma-d/2)}\,\omega\big). \qquad (11.7)$$
It follows that the wavelet potential at resolution $i$ can be written as
$$\Phi_{V_i}(x) = i\,\log b_1 + \Phi\Big(\frac{x}{b_1^{\,i}};\, 2^{id}\Big). \qquad (11.8)$$
The main point is that, up to a dilation by $(b_1)^i$, the wavelet potentials are part of the
parametric family $\Phi(x, \tau)$, which corresponds to the natural semigroup extension of the
wavelet pdf at scale $i = 0$.
Interestingly, we can also provide an iterated convolution interpretation of this result
by considering the pdfs of the scale-normalized wavelet coefficients zi = vi /(b1 )i . To
see this, we express the characteristic function of zi as
" #
pZi (ω) =
pVi (ω/bi1 ) = exp 2id fφ̃ ω
2d
= pZi−1 (ω)
2id
= pZ0 (ω) ,
which indicates that $p_{Z_i}$ is the $2^{id}$-fold convolution of $p_{Z_0} = p_{V_0}$, which is itself the
pdf of the wavelet coefficients at resolution 0. In 1-D, this translates into the recursive
relation
$$p_{Z_i}(x) = \big(p_{Z_{i-1}} * p_{Z_{i-1}}\big)(x), \qquad (11.9)$$
which we like to view as the probabilistic counterpart of the two-scale relation of the
multiresolution theory of the wavelet transform. Incidentally, this iterated convolution
relation also explains why pZi spreads out and converges to a Gaussian as the scale
increases. Observe that the effect is more pronounced in higher dimensions since the
number of elementary convolution factors in the probabilistic two-scale relation grows
exponentially with d.
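As a quick numerical illustration of the probabilistic two-scale relation (11.9), the following sketch (assuming a Laplace pdf at scale 0, an illustrative choice) iterates the self-convolution and prints the excess kurtosis, which is halved at each scale on its way to the Gaussian value of zero.

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 4096)
dx = x[1] - x[0]
p = 0.5 * np.exp(-np.abs(x))                  # p_{Z_0}: Laplace pdf at scale 0

for i in range(4):
    m2 = np.sum(x ** 2 * p) * dx              # second moment (zero-mean pdf)
    m4 = np.sum(x ** 4 * p) * dx
    print(f"scale {i}: excess kurtosis = {m4 / m2 ** 2 - 3.0:.3f}")
    p = np.convolve(p, p, mode="same") * dx   # p_{Z_i} = p_{Z_{i-1}} * p_{Z_{i-1}}
    p /= p.sum() * dx                         # compensate discretization leakage
```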
Having specified the statistical reconstruction problem in a wavelet basis, we now des-
cribe numerical methods of solution. To that end, we consider the general optimization
problem
$$\min_{\mathbf{s}} \Big\{ \frac{1}{2}\,\|\mathbf{y}-\mathbf{H}\mathbf{s}\|_2^2 + \tau\,\Phi(\mathbf{W}^T\mathbf{s}) \Big\}, \qquad (11.10)$$
where $\Phi(\mathbf{v}) = \sum_{n=1}^{N} \Phi_n(v_n)$ is a separable potential function and $\mathbf{W}^{-1} = \mathbf{W}^T$ an
orthonormal transform matrix. The qualitative effect of the second term in (11.10) is
to favor solutions that admit a sparse wavelet expansion; the strength of this "regu-
larization" constraint is controlled by the parameter $\tau \in \mathbb{R}^+$. Clearly, the solution
of (11.10) is equivalent to the MAP estimator (11.5) if we set $\tau = \sigma^2$ and $\Phi(\mathbf{v}) =
\sum_i \sum_{k} \Phi_{V_i}\big(v_i[k]\big)$.
While a possible approach for solving (11.10) is to apply the ADMM algorithm of
Section 10.2.4 with the replacement of L by WT and a slight adjustment for scale-
dependent potentials, we shall present two alternative techniques (ISTA and FISTA)
that capitalize on the orthogonality of the matrix W. The second algorithm (FISTA) is
a modification of the first one that results in faster convergence.
11.2.1 Preliminaries
To exploit the separability of the potential function , we restate the reconstruction
problem in terms of the wavelet coefficients v = (v1 , . . . , vN ) = WT s as the minimization
of the cost functional
$$\mathcal{C}(\mathbf{v}) = \frac{1}{2}\,\|\mathbf{y}-\mathbf{H}_{\mathrm{wav}}\mathbf{v}\|_2^2 + \tau \sum_{n=1}^{N}\Phi_n(v_n), \qquad (11.11)$$
where Hwav = HW. In order to gain insights into the algorithmic components of ISTA,
we first investigate two extreme cases for which the solution can be written down
explicitly.
Least-squares estimation
For τ = 0, the minimization of (11.11) reduces to a classical least-squares estimation
problem, and there is no advantage in expressing the signal in terms of wavelets. The
solution of the reconstruction problem is given by
$$\mathbf{s}_{\mathrm{LS}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y},$$
under the assumption that $\mathbf{H}^T\mathbf{H}$ is invertible. When the underlying matrix is too large
to be inverted numerically, the corresponding linear system of equations is solved itera-
tively. The simplest iterative reconstruction method is the Landweber algorithm
$$\mathbf{s}^{k+1} = \mathbf{s}^k + \mu\,\mathbf{H}^T\big(\mathbf{y}-\mathbf{H}\mathbf{s}^k\big), \qquad (11.12)$$
so that
$$\tilde{\mathbf{v}} = \mathrm{prox}\big(\mathbf{z};\, \tau\big) = \begin{pmatrix} \mathrm{prox}_1(z_1;\tau) \\ \vdots \\ \mathrm{prox}_N(z_N;\tau) \end{pmatrix}, \qquad (11.14)$$
where the definition of the underlying proximal operators (vectorial and scalar) is
consistent with the formulation of Section 10.2.3. Hence, the solution ṽ can be com-
puted by applying a series of component-wise shrinkage/thresholding functions to the
wavelet coefficients of y. This is the model-based version of the standard denoising
algorithm mentioned in the introduction. The relation between proxn and the under-
lying probability model is investigated in more detail in Section 11.3. The bottom line
is that these are scale-dependent non-linear maps (see examples in Figure 11.5) that
can be precomputed and stored in a lookup table, which makes the denoising procedure
very efficient.
where C0 (vk , y) is a term that does not depend on v and where the auxiliary variable zk
is given by
$$\mathbf{z}^k = \mathbf{v}^k + \tfrac{1}{L}\,\mathbf{H}_{\mathrm{wav}}^T\big(\mathbf{y}-\mathbf{H}_{\mathrm{wav}}\mathbf{v}^k\big) = \mathbf{W}^T\Big(\mathbf{s}^k + \tfrac{1}{L}\,\mathbf{H}^T\big(\mathbf{y}-\mathbf{H}\mathbf{s}^k\big)\Big). \qquad (11.17)$$
The crucial point is that the minimization of (11.16) with respect to v is equivalent to
the denoising problem (11.13). This implies that
" τ#
arg min C (v, vk ) = prox zk ; ,
v∈RN L
which corresponds to a shrinkage/thresholding of the wavelet coefficients of the signal.
The form of the update equation (11.17) is also highly suggestive, for it boils down to a
Landweber iteration (see (11.12)) followed by a wavelet transform. The resulting ISTA
is summarized in Algorithm 1.
Algorithm 1: ISTA solves $\mathbf{s}^\star = \arg\min_{\mathbf{s}} \big\{\frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{s}\|_2^2 + \tau\,\Phi(\mathbf{W}^T\mathbf{s})\big\}$
input: $\mathbf{A} = \mathbf{H}^T\mathbf{H}$, $\mathbf{a} = \mathbf{H}^T\mathbf{y}$, $\mathbf{s}^0$, $\tau$, and $L$
set: $k \leftarrow 0$
repeat
  $\mathbf{s}^{k+1} \leftarrow \mathbf{s}^k + \frac{1}{L}\big(\mathbf{a} - \mathbf{A}\mathbf{s}^k\big)$ (Landweber step)
  $\mathbf{v}^{k+1} \leftarrow \mathrm{prox}\big(\mathbf{W}^T\mathbf{s}^{k+1};\, \frac{\tau}{L}\big)$ (wavelet-domain denoising)
  $\mathbf{s}^{k+1} \leftarrow \mathbf{W}\mathbf{v}^{k+1}$ (inverse wavelet transform)
  $k \leftarrow k+1$
until stopping criterion
return $\mathbf{s}^k$
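A compact transcription of Algorithm 1 in Python is given below; it is a sketch in which `wt`, `iwt`, and `prox` are placeholders for an orthonormal analysis/synthesis pair ($\mathbf{W}^T$, $\mathbf{W}$) and a component-wise shrinkage, to be supplied by the reader (for example from PyWavelets).

```python
import numpy as np

def ista(y, H, wt, iwt, prox, tau, n_iter=200):
    """ISTA for min_s 0.5*||y - H s||^2 + tau * Phi(W^T s).

    wt/iwt : analysis (W^T) and synthesis (W) transforms, assumed orthonormal
    prox   : component-wise shrinkage (v, t) -> shrunken v
    """
    A = H.T @ H
    a = H.T @ y
    L = np.linalg.norm(A, 2)              # Lipschitz constant of the gradient
    s = np.zeros(H.shape[1])
    for _ in range(n_iter):
        s = s + (a - A @ s) / L           # Landweber step
        v = prox(wt(s), tau / L)          # wavelet-domain denoising
        s = iwt(v)                        # inverse wavelet transform
    return s

# Soft threshold (Laplace MAP) as the prototypical shrinkage function,
# e.g. s_hat = ista(y, H, wt, iwt, soft, tau=0.1):
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
```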
The remarkable aspect is that this simple sequence of Landweber updates and
wavelet-domain thresholding operations converges to the solution of (11.10). The only
subtle point is that the strength of the thresholding (τ/L) is tied to the step size of the
gradient update.
Beck and Teboulle [BT09b] have proposed a refinement of the scheme, called
the fast iterative shrinkage/thresholding algorithm (FISTA), which improves the rate of
convergence by one order. This is achieved via a controlled over-relaxation that utilizes
the previous iterates to produce a better guess for the next update. A possible imple-
mentation of FISTA is shown in Algorithm 2.
Algorithm 2: FISTA solves $\mathbf{s}^\star = \arg\min_{\mathbf{s}} \big\{\frac{1}{2}\|\mathbf{y}-\mathbf{H}\mathbf{s}\|_2^2 + \tau\,\Phi(\mathbf{W}^T\mathbf{s})\big\}$
input: $\mathbf{A} = \mathbf{H}^T\mathbf{H}$, $\mathbf{a} = \mathbf{H}^T\mathbf{y}$, $\mathbf{s}^0$, $\tau$, and $L$
set: $k \leftarrow 0$, $\mathbf{w}^0 \leftarrow \mathbf{W}^T\mathbf{s}^0$, $t_0 \leftarrow 0$
repeat
  $\mathbf{w}^{k+1} \leftarrow \mathrm{prox}\Big(\mathbf{W}^T\big(\mathbf{s}^k + \frac{1}{L}(\mathbf{a} - \mathbf{A}\mathbf{s}^k)\big);\, \frac{\tau}{L}\Big)$ (ISTA step)
  $t_{k+1} \leftarrow \frac{1}{2}\Big(1 + \sqrt{1 + 4t_k^2}\,\Big)$
  $\mathbf{v}^{k+1} \leftarrow \mathbf{w}^{k+1} + \frac{t_k - 1}{t_{k+1}}\big(\mathbf{w}^{k+1} - \mathbf{w}^k\big)$
  $\mathbf{s}^{k+1} \leftarrow \mathbf{W}\mathbf{v}^{k+1}$
  $k \leftarrow k+1$
until stopping criterion
return $\mathbf{s}^k$
The only difference from ISTA is the update of vk+1 , which is an extrapolation of the
two previous ISTA computations wk+1 and wk . The variable tk controls the strength of
the over-relaxation, which increases with k up to some asymptotic limit.
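In code, the over-relaxation amounts to a few extra lines relative to the ISTA sketch above; the function below reuses the same placeholder arguments and is likewise only a sketch.

```python
import numpy as np

def fista(y, H, wt, iwt, prox, tau, n_iter=200):
    """FISTA variant of the ISTA sketch above (same placeholder arguments)."""
    A, a = H.T @ H, H.T @ y
    L = np.linalg.norm(A, 2)
    s = np.zeros(H.shape[1])
    w, t = wt(s), 0.0
    for _ in range(n_iter):
        w_new = prox(wt(s + (a - A @ s) / L), tau / L)    # plain ISTA step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))  # relaxation schedule
        v = w_new + ((t - 1.0) / t_new) * (w_new - w)     # extrapolation
        s, w, t = iwt(v), w_new, t_new
    return s
```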
The theoretical justification for FISTA (see [BT09b, Theorem 4.4]) is that the scheme
improves the convergence such that, for any $k > 1$,
$$\mathcal{C}\big(\mathbf{v}^k_{\mathrm{FISTA}}\big) - \mathcal{C}(\mathbf{v}^\star) \leq \frac{2L\,\|\mathbf{v}^0 - \mathbf{v}^\star\|_2^2}{(k+1)^2}.$$
Practically, switching from a linear to a quadratic convergence rate can translate to
a spectacular speed improvement over ISTA, with the advantage that this change of
regime essentially comes for free. FISTA therefore constitutes the method of choice for
wavelet-based regularization; it typically delivers state-of-the-art performance for the
kind of large-scale optimization problems encountered in imaging.
Figure 11.1 Comparison of the convergence properties of ISTA (light) and FISTA (dark) for the
image in 11.2(b) as a function of the iteration index.
used the same type of potential functions: $\Phi_{\mathrm{Gauss}}(x) = A_i|x|^2$, $\Phi_{\mathrm{Laplace}}(x) = B_i|x|$, and
$\Phi_{\mathrm{Student}}(x) = C_i \log(x^2 + \epsilon)$, where $A_i$, $B_i$, and $C_i$ are some proper scale-dependent
constants. As in the previous experiments, the overall regularization strength $\tau$ was
tuned for best performance (maximum SNR with respect to the reference). Here, we are
presenting the results for the image of nerve cells (see Figure 10.3b) with the use of ℓ1
wavelet-domain regularization.
The plot in Figure 11.1 documents the evolution of the cost functional (11.11) as a
function of the iteration index for both ISTA and FISTA. It illustrates the faster conver-
gence rate of FISTA, in agreement with Beck and Teboulle’s prediction. As far as qual-
ity is concerned, a general observation is that the output of the basic version of the
wavelet-based reconstruction algorithm is not on a par with the results of Section 10.3.
The main problem (see Figure 11.2f) is that the reconstructed images suffer from arti-
facts (in the form of wavelet footprints) that are typically the consequence of the lack
of shift-invariance of the wavelet representation. Fortunately, there is a simple remedy
to correct for this effect via a mechanism called cycle spinning. 2 The approach is to
randomly shift the signal back and forth during the course of iterations, which is equi-
valent to cycling through a family of shifted wavelet transforms, as will be described in
Section 11.4.2. Incorporating cycle spinning in ISTA does not increase the compu-
tational cost but improves the SNR of the reconstruction significantly, as shown in
Figure 11.2e. Hence, we end up with a result that is comparable in quality to the output
of the MAP reconstruction algorithm of Section 10.2 (see Figure 11.2c). This trend per-
sists with other images and across imaging modalities. Combining cycle spinning with
FISTA is feasible as well, with the advantage that the convergence rate of the latter is
typically superior to that of the ADMM technique.
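Cycle spinning itself requires no more than a wrapper around a shift-variant denoiser; in the following sketch, the `denoise` callable is a hypothetical stand-in for one pass of wavelet-domain shrinkage.

```python
import numpy as np

def cycle_spin(y, denoise, n_shifts=8):
    """Average a shift-variant wavelet denoiser over circular shifts."""
    out = np.zeros(len(y))
    for m in range(n_shifts):
        out += np.roll(denoise(np.roll(y, m)), -m)  # shift, denoise, unshift
    return out / n_shifts
```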
While averaging across shifts appears to be essential for making wavelets competi-
tive, we are left with the conceptual problem that the cycle-spun version of ISTA does
not rigorously fit our statistical formulation. It converges to a solution that is not a strict
2 Cycle spinning is used almost systematically for the wavelet-based reconstructions showcased in the liter-
ature. However, the method is rarely accounted for in the accompanying theory.
Figure 11.2 Results of deconvolution experiment: (a) Blurry and noisy input of the deconvolution
algorithm (BSNR=20dB). (b) Ground truth image (nerve cells). (c) Result of MAP
deconvolution with TV regularization (SNR=15.23 dB). (d) Result of wavelet-based
deconvolution (SNR=12.73dB). (e) Result of wavelet-based deconvolution with cycle spinning
(SNR=15.18dB). (f) Zoomed comparison of results for the region marked in (b).
minimizer of (11.10) but, rather, to some kind of average over a family of “shifted”
wavelet transforms. While this description is largely empirical, there is a theoretical
explanation of the phenomenon for the simpler signal-denoising problem. Specifically,
in Section 11.4, we shall demonstrate that cycle spinning necessarily improves denois-
ing performance (see Proposition 11.3) and that it can be seen as an alternative means
of computing the “exact” MAP estimators of Section 10.4.3. In other words, cycle spin-
ning somehow compensates for the inter-scale dependencies of wavelet coefficients that
were neglected when writing (11.5).
The most favorable aspect of wavelet-domain processing is that it offers direct control
over the reconstruction error, thanks to Parseval’s relation. In particular, it allows for
a more refined design of thresholding functions based on the minimum-mean-square-
error (MMSE) principle. This is the reason why we shall now investigate non-iterative
strategies for improving simple wavelet-domain denoising.
In the remainder of the chapter, we concentrate on the problem of signal denoising with
H = I (identity) or, equivalently, Hwav = W, under the assumption that the transform
matrix W is orthonormal. The latter ensures that any reduction of the quadratic error
achieved in the wavelet domain is automatically transferred to the signal domain.
In this particular setting, we can address the important issue of the dependency be-
tween the wavelet-domain thresholding functions and the prior probability model. Our
practical motivation is to improve the standard algorithm by identifying the solution that
minimizes the mean-square estimation error. To specify the underlying scalar estimation
problem, we transpose the measurement equation y = s + n into the wavelet domain as
z = WT s + WT n = v + n
⇔ zi [k] = vi [k] + ni [k], (11.18)
where vi and ni are the wavelet coefficients of the noise-free signal s and of the AWGN
n, respectively. Since the wavelet transform is orthonormal, the transformed noise
$\mathbf{W}^T\mathbf{n}$ remains white, so that $n_i$ is Gaussian i.i.d. with variance $\sigma^2$. Now, when the wave-
let coefficients vi are statistically independent as has been assumed so far, the denoising
can be performed in a separable fashion by considering the wavelet coefficients indivi-
dually. The estimation problem is then to recover v from the noisy coefficient z = v + n,
where we have dropped the wavelet indices to simplify the notation. Irrespective of the
statistical criterion used (MAP vs. MMSE), the estimator ṽ(z) will be a function of the
(scalar) noisy input z, in agreement with the standard wavelet-denoising procedure.
Next, we develop the theory associated with the statistical wavelet-based estimators.
The prior information is provided by the wavelet-domain pdfs pVi , which are known
to be infinitely divisible (see Proposition 8.6). We then make use of those results to
characterize and compare the shrinkage/thresholding functions associated with the id
distributions of Table 4.1.
are only approximately independent. The MMSE estimator of vi , given the noisy coef-
ficient z, is provided by the posterior mean
This means that we can derive the explicit form of $v_{\mathrm{MMSE}}(z)$ for any given $p_{V_i}$ via the
evaluation of the Gaussian convolution integrals
$$p_Z(z) = (g_\sigma * p_{V_i})(z) = \mathcal{F}^{-1}\Big\{\mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\,\hat{p}_{V_i}(\omega)\Big\}(z) \qquad (11.22)$$
$$p'_Z(z) = (g'_\sigma * p_{V_i})(z) = \mathcal{F}^{-1}\Big\{\mathrm{j}\omega\,\mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\,\hat{p}_{V_i}(\omega)\Big\}(z). \qquad (11.23)$$
These can be calculated in either the time or frequency domain. The frequency-domain
formulation offers more convenience for the majority of id distributions and is also
directly amenable to numerical computation with the help of the FFT. Likewise, we use
(11.21) to infer the general asymptotic behavior of this estimator.
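A minimal quadrature sketch of the frequency-domain route is given below; it relies on the identity $v_{\mathrm{MMSE}}(z) = z - \sigma^2\Phi'_Z(z) = z + \sigma^2\, p'_Z(z)/p_Z(z)$, with $p_Z$ and $p'_Z$ obtained as in (11.22)-(11.23); the grid sizes and the helper name `mmse_table` are illustrative assumptions.

```python
import numpy as np

def mmse_table(p_hat, sigma=1.0, z_max=10.0, n_z=501, w_max=40.0, n_w=4001):
    """Tabulate v(z) = z + sigma^2 * p_Z'(z)/p_Z(z) for a symmetric prior
    specified by its characteristic function p_hat (assumed real and even)."""
    z = np.linspace(-z_max, z_max, n_z)
    w = np.linspace(0.0, w_max, n_w)
    dw = w[1] - w[0]
    spec = np.exp(-0.5 * (sigma * w) ** 2) * p_hat(w)  # FT of p_Z = g_sigma * p_V
    p_z = (np.cos(np.outer(z, w)) @ spec) * dw / np.pi           # cf. (11.22)
    dp_z = -(np.sin(np.outer(z, w)) @ (w * spec)) * dw / np.pi   # cf. (11.23)
    return z, z + sigma ** 2 * dp_z / p_z

# Example: Laplace prior with parameter lam (characteristic function known):
lam = 2.0
z, v = mmse_table(lambda w: lam ** 2 / (lam ** 2 + w ** 2))
```

As a sanity check, feeding in a Gaussian characteristic function reproduces the linear LMMSE shrinkage discussed after Theorem 11.1.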
THEOREM 11.1 Let $z = v + n$, where $v$ is infinitely divisible with symmetric pdf $p_V$ and
$n$ is Gaussian-distributed with variance $\sigma^2$. Then, the MMSE estimator of $v$ given $z$ has
the linear behavior around the origin given by
$$v_{\mathrm{MMSE}}(z) = z\,\big(1 - \sigma^2\,\Phi''_Z(0)\big) + O(z^3), \qquad (11.24)$$
where
$$\Phi''_Z(0) = \frac{\int_{\mathbb{R}} \omega^2\, \mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\, \hat{p}_{V_i}(\omega)\,\mathrm{d}\omega}{\int_{\mathbb{R}} \mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\, \hat{p}_{V_i}(\omega)\,\mathrm{d}\omega} > 0. \qquad (11.25)$$
If, in addition, $p_{V_i}$ is unimodal and does not decay faster than an exponential, then
$$v_{\mathrm{MMSE}}(z) \sim v_{\mathrm{MAP}}(z) \sim z - \sigma^2 b_1 \quad \text{as } z \to \infty,$$
where $b_1 = \lim_{x\to\infty} \Phi'_Z(x) = \lim_{x\to\infty} \Phi'_V(x) \geq 0$.
Proof Since the Gaussian kernel $g_\sigma$ is infinitely differentiable, the same holds true
for $p_Z = p_{V_i} * g_\sigma$ even if $p_{V_i}$ is not necessarily smooth to start with (e.g., a
compound-Poisson or Laplace distribution). This implies that the second-order Taylor
series $\Phi_Z(z) = -\log\big(p_Z(z)\big) = \Phi_Z(0) + \frac{1}{2}\Phi''_Z(0)\,z^2 + O(z^4)$ is well defined, which yields
(11.24). The expression for $\Phi''_Z(0)$ follows from (10.20) with $\hat{p}_Z(\omega) = \mathrm{e}^{-\omega^2\sigma^2/2}\,\hat{p}_{V_i}(\omega)$.
We also note that the Fourier-domain moments that appear in (11.25) are positive and
finite because $\hat{p}_{V_i}(\omega) = \mathrm{e}^{f_{\tilde{\phi}_i}(\omega)} \geq 0$ is tempered by the Gaussian window. Next, we recall
that the Gaussian is part of the family of strongly unimodal functions which have the
remarkable property of preserving the unimodality of the functions they are convolved
with [Sat94, pp. 394–399]. The second part then follows from the fact that the convolu-
tion with $g_\sigma$, which decays much faster than $p_{V_i}$, does not modify the decay at the tail
of the distribution.
Several remarks are in order. First, the linear approximation (11.24) is exact in the
Gaussian case. It actually yields the classical linear (LMMSE) estimator
$$v_{\mathrm{LMMSE}}(z) = \frac{\sigma_i^2}{\sigma_i^2 + \sigma^2}\, z,$$
where $\sigma_i^2$ is the variance of the signal contribution in the $i$th wavelet channel. Indeed,
when $p_{V_i}$ is a Gaussian distribution, we have that $\Phi_Z(z) = \frac{z^2}{2(\sigma_i^2+\sigma^2)}$, which, upon substi-
tution in (11.21), yields the $v_{\mathrm{LMMSE}}$ estimator.
Second, by applying Parseval's relation, we can express the slope of the MMSE esti-
mator at the origin as the ratio of time-domain integrals
$$1 - \sigma^2\Phi''_Z(0) = 1 - \sigma^2\,\frac{\int_{\mathbb{R}} \frac{\sigma^2 - x^2}{\sigma^4}\,\mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x}{\int_{\mathbb{R}} \mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x} = \frac{\int_{\mathbb{R}} x^2\, \mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x}{\sigma^2 \int_{\mathbb{R}} \mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x}, \qquad (11.26)$$
which may be simpler to evaluate for some id distributions.
Laplace distribution
The Laplace distribution with parameter $\lambda$ is defined as
$$p_{\mathrm{Laplace}}(x;\lambda) = \frac{\lambda}{2}\,\mathrm{e}^{-\lambda|x|}.$$
Its variance is given by $\sigma_0^2 = \frac{2}{\lambda^2}$. The Lévy exponent is $f_{\mathrm{Laplace}}(\omega;\lambda) = \log \hat{p}_{\mathrm{Laplace}}(\omega;\lambda)
= \log\big(\frac{\lambda^2}{\lambda^2+\omega^2}\big)$, which is p-admissible with $p = 2$. The Laplacian potential is
$$\Phi_{\mathrm{Laplace}}(x;\lambda) = \lambda|x| - \log\frac{\lambda}{2}.$$
Since the second term of $\Phi_{\mathrm{Laplace}}$ does not depend on $x$, this translates into a MAP
estimator that minimizes the ℓ1-norm in the corresponding wavelet channel. It is well
known that the solution of this optimization problem yields the soft-threshold estimator
(see [Tib96, CDLL98, ML99])
$$v_{\mathrm{MAP}}(z;\lambda) = \begin{cases} z-\lambda, & z > \lambda \\ 0, & z\in[-\lambda,\lambda] \\ z+\lambda, & z < -\lambda. \end{cases}$$
By applying the time-domain versions of (11.22) and (11.23), one can also derive the
analytical form of the corresponding MMSE estimator in AWGN. For reference pur-
poses, we give its normalized version with $\sigma^2 = 1$ as
$$v_{\mathrm{MMSE}}(z;\lambda) = z - \lambda\,\frac{\mathrm{erf}\big(\frac{z-\lambda}{\sqrt{2}}\big) - \mathrm{e}^{2\lambda z}\,\mathrm{erfc}\big(\frac{\lambda+z}{\sqrt{2}}\big) + 1}{\mathrm{erf}\big(\frac{z-\lambda}{\sqrt{2}}\big) + \mathrm{e}^{2\lambda z}\,\mathrm{erfc}\big(\frac{\lambda+z}{\sqrt{2}}\big) + 1},$$
where erfc(t) = 1−erf(t) denotes the complementary (Gaussian) error function, which is
a result that can be traced back to [HY00, Proposition 1]. A comparison of the estimators
for the Laplace distribution with λ = 2 and unit noise variance is given in Figure 11.3b.
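For completeness, here is a direct transcription of the two estimators (soft threshold and the closed-form MMSE expression above, with $\sigma = 1$); it is a sketch intended for reproducing curves such as those of Figure 11.3b.

```python
import numpy as np
from scipy.special import erf, erfc

def v_map_laplace(z, lam):
    """Soft threshold: MAP estimator for a Laplace prior (sigma = 1)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def v_mmse_laplace(z, lam):
    """Closed-form MMSE estimator for a Laplace prior in AWGN, sigma = 1."""
    s2 = np.sqrt(2.0)
    e = np.exp(2.0 * lam * z) * erfc((lam + z) / s2)
    num = erf((z - lam) / s2) - e + 1.0
    den = erf((z - lam) / s2) + e + 1.0
    return z - lam * num / den

z = np.linspace(-4.0, 4.0, 9)
print(v_map_laplace(z, 2.0))
print(v_mmse_laplace(z, 2.0))
```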
While the graph of the MMSE estimator has a smoother appearance than that of the
soft-thresholding function, it also exhibits two distinct regimes that are well repre-
sented by first-order polynomials: behavior around the origin vs. behavior at ±∞. How-
ever, the transition between the two regimes is much more progressive in the MMSE
case. Asymptotically, the MAP and MMSE estimators are equivalent, as predicted by
Theorem 11.1. The key difference occurs around the origin, where the MMSE estimator
is linear (in accordance with Theorem 11.1) and quite distinct from a thresholding func-
tion. This means that the MMSE estimator will never annihilate a wavelet coefficient,
which somewhat contradicts the predominant paradigm for recovering sparse signals.
Figure 11.3 Comparison of potential functions (z) and pointwise estimators v(z) for signals
with matched Laplace and sech distributions corrupted by AWGN with σ = 1. (a) Laplace
(dashed) and sech (solid) potentials. (b) Laplace MAP estimator (light), MMSE (dark) estimator,
and its first-order equivalent (dot-dashed line) for λ = 2. (c) Sech MAP (light) and MMSE (dark)
estimators for σ0 = π/4.
Remarkably, its characteristic function is part of the same class of distributions, with
$\hat{p}_{\mathrm{sech}}(\omega;\sigma_0) = \mathrm{sech}(\omega\sigma_0)$. The corresponding potential is $\Phi_{\mathrm{sech}}(x;\sigma_0) = \log\cosh\big(\frac{\pi x}{2\sigma_0}\big) + \log(2\sigma_0)$,
which is convex and increasing for $x \geq 0$. Indeed, the second derivative of the potential
function is
$$\Phi''_{\mathrm{sech}}(x) = \frac{\pi^2}{4\sigma_0^2}\,\mathrm{sech}^2\Big(\frac{\pi x}{2\sigma_0}\Big),$$
which is positive. Note that, for large absolute values of $x$, $\Phi_{\mathrm{sech}}(x) \sim \frac{\pi}{2\sigma_0}|x| + \log\sigma_0$,
suggesting that it is essentially equivalent to the ℓ1-type Laplace potential (see
Figure 11.3a). However, unlike the latter, it is infinitely differentiable everywhere,
with a quadratic behavior around the origin.
The corresponding MAP and MMSE estimators, with a parameter value that is
matched to the Laplace example, are shown in Figure 11.3c. An interesting observation
is that the sech MAP thresholding functions are very similar to the MMSE Laplacian
ones over the whole range of values. This would suggest using hyperbolic-secant-
penalized least-squares regression as a practical substitute for the MMSE Laplace
solution.
Figure 11.4 Examples of pointwise MMSE estimators vMMSE (z) for signals corrupted by
AWGN with σ = 1 and the fixed SNR = 1. (a) Student priors with r = 2, 4, 8, 16, 32, +∞ (dark to
light). (b) Compound-Poisson priors with λi = 1/8, 1/4, 1/2, 1, 2, 4, ∞ (dark to light) and
Gaussian amplitude distribution.
The Student MAP estimator is specified by a third-order polynomial equation that can
be solved explicitly. This results in the thresholding functions shown in Figure 10.1b.
We have also observed experimentally that the Student MAP and MMSE estimators are
rather close to each other, with linear trends around the origin that become indistingui-
shable as r increases; this can be verified by comparing Figure 10.1b and Figure 11.4a.
This finding is also consistent with the distributions becoming more Gaussian-like for
larger r.
Note that Definition (11.27) remains valid in the super-sparse regimes with r ∈ (0, 1],
provided that the normalization constant C > 0 is no longer tied to r and σ0 . The catch
is that the variance of the signal is unbounded for r ≤ 1, which tends to flatten the
shrinkage function around the origin, but maintains continuity since Student is infinitely
differentiable.
Compound-Poisson family
We have already mentioned that the Poisson case results in pdfs that exhibit a Dirac
distribution at the origin and are therefore unsuitable for MAP estimation. A compound-
Poisson variable is typically generated by integration of a random sequence of Dirac
impulses with some amplitude distribution pA and a density parameter λ corresponding
to the average number of impulses within the integration window. The generic form of
a compound-Poisson pdf is given by (4.9). It can be written as pPoisson (x) = e−λ δ(x) +
(1 − e−λ )pA,λ (x), where the pdf pA,λ describes the distribution of the non-zero values.
The determination of the MMSE estimator from (11.21) requires the computation
of $\Phi'_Z(z) = -p'_Z(z)/p_Z(z)$. The most convenient approach is to evaluate the required
factors using the right-hand expressions in (11.22) and (11.23), where $p_{V_i}$ is specified
by its Poisson parameters as in Table 4.1. This leads to
$$\hat{p}_{V_i}(\omega) = \exp\big(\lambda_i\,(\hat{p}_{A_i}(\omega) - 1)\big),$$
where λi ∈ R+ and pAi : R → C are the Poisson rate and the characteristic function
of the Poisson amplitude distribution at resolution i, respectively. Moreover, due to the
multiscale structure of the analysis, the wavelet-domain Poisson parameters are related
to each other by
$$\lambda_i = \lambda_0\, 2^{id}, \qquad \hat{p}_{A_i}(\omega) = \hat{p}_{A_0}\big(2^{i(\gamma-d/2)}\,\omega\big),$$
which follows directly from (11.7). The first formula highlights the fact that the sparse-
ness of the wavelet distributions, as measured by the proportion e−λi of zero coeffic-
ients, decreases substantially as the scale gets coarser. Also note that the strength of this
effect increases with the number of dimensions.
Some examples of MMSE thresholding functions corresponding to a sequence
of compound-Poisson signals with Gaussian amplitude distributions are shown in
Figure 11.4b. Not too surprisingly, the smaller λ (dark curve), the stronger the thre-
sholding behavior at the origin. In that experiment, we have considered a wavelet-like
progression of the rate parameter λ, while keeping the signal-to-noise ratio constant to
facilitate the comparison. For larger values of λ (light), the estimator converges to the
LMMSE solution (thin black line), which is consistent with the fact that the distribution
becomes more and more Gaussian-like.
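These curves can be reproduced with a frequency-domain tabulation such as the hypothetical `mmse_table` routine sketched earlier, by supplying the compound-Poisson characteristic function; the parameter values below are illustrative.

```python
import numpy as np

def p_hat_poisson(w, lam=0.5, sa=1.0):
    """Characteristic function exp(lam*(p_hat_A(w) - 1)) with a Gaussian
    amplitude distribution of standard deviation sa (illustrative values)."""
    return np.exp(lam * (np.exp(-0.5 * (sa * w) ** 2) - 1.0))

# e.g., z, v = mmse_table(lambda w: p_hat_poisson(w, lam=0.25))
```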
Figure 11.5 Sequence of wavelet-domain estimators vi (z) for a Laplace-type Lévy process
corrupted by AWGN with σ = 1 and wavelet resolutions i = 0 (dark) to 4 (light). (a) LMMSE
(or Brownian-motion MMSE) estimators. (b) Sym gamma MAP estimators. (c) Sym gamma
MMSE estimators. The reference (fine-scale) parameters are σ02 = 1 (SNR0 = 1) and r0 = 1
(Laplace distribution). The scale progression is dyadic.
progressive convergence to the identity map is easiest to describe for a Gaussian signal
where the sequence of estimators is linear – this is illustrated in Figure 11.5a for $\gamma = 1$
and $d = 1$ (Brownian motion) so that $b_1 = 2^{1/2}$.
In the non-Gaussian case, the sequence of wavelet-domain estimators will be part of
some specific family that is completely determined by pV0 , the wavelet pdf at scale 0, as
described in Section 11.1.2. When the variance of the signal is finite, the implication of
the underlying semigroup structure and iterated convolution relations is that the MAP
and MMSE estimators both converge to the Gaussian solution (LMMSE estimator) as
the scale gets coarser (light curves). Thus, the global picture remains very similar to
the Gaussian one, as illustrated in Figure 11.5b,c. Clearly, the most significant non-
linearities can be found at the finer scale (dark curves), where the sparsity effect is
prominent.
The example shown (wavelet analysis of a Lévy process in a Haar basis) was set up
so that the fine-level MAP estimator is a soft-threshold. As discussed next, the wavelet-
domain estimators are all part of the sym gamma family, which is the semigroup exten-
sion of the Laplace distribution. An interesting observation is that the thresholding
behavior fades with a coarsening of the scale. Again, this points to the fact that the
non-Gaussian effects (non-linearities) are the most significant at the finer levels of the
wavelet analysis where the signal-to-noise ratio is also the least favorable.
Figure 11.6 Comparison of MAP and MMSE estimators v(z) for a series of sym gamma and
Meixner-distributed random variables with r = 1/4, 1, 4, 64, +∞ (dark to light) corrupted by
white Gaussian noise of the same power as the signal (SNR=1). (a) Sym gamma MAP
estimators. (b) Meixner MAP estimators. (c) Sym gamma MMSE estimators. (d) Meixner
MMSE estimators.
Some further examples of sym gamma MAP and MMSE estimators over a range of
orders are shown in Figure 11.6 under constant signal-to-noise ratio to highlight the
differences in sparsity behavior. We observe that the MAP estimators have a hard-to-
soft-threshold behavior for r < 3/2, which is consistent with the discontinuity of the
potential at the origin. For larger values of r, the trend becomes more linear. By contrast,
the MMSE estimator is much closer to the LMMSE (thin black line) around the origin.
For larger signal values, both estimators result in a more or less progressive transition
between the two extreme lines of the cone (identity and LMMSE) that is controlled by
r – the smaller values of r correspond to the sparser scenarios with vMMSE being closer
to identity.
The Meixner family in Table 4.1 with order r > 0 and scale parameter s0 ∈ R+
provides the same type of extension for the hyperbolic secant distribution with essen-
tially the same functionality. Mathematically, it is closely linked to the gamma function
whose relevant properties are summarized in Appendix C. As shown in Table 10.1,
the Meixner potential has the same asymptotic behavior as the sym gamma potential
at infinity, with the advantage of being much smoother (infinitely differentiable) at the
origin. This implies that the curves of the gamma and Meixner estimators are globally
quite similar. The main difference is that the Meixner MAP estimator is guaranteed to
be linear around the origin, irrespective of the value of r, and in better agreement with
the MMSE solution than its gamma counterpart.
Cauchy distribution
The prototypical example of a heavy-tail distribution is the symmetric Cauchy distribu-
tion with dispersion parameter s0 , which is given by
$$p_{\mathrm{Cauchy}}(x;s_0) = \frac{s_0}{\pi\,(s_0^2 + x^2)}. \qquad (11.29)$$
It is a special case of a SαS distribution (with $\alpha = 1$) as well as a symmetric Student
with $r = \frac{1}{2}$.
Since the Cauchy distribution is stable, we can invoke Proposition 9.8, which ensures
that the wavelet coefficients of a Cauchy process are Cauchy-distributed too. For illus-
tration purposes, we consider the analysis of a stable Lévy process (a.k.a. Lévy flight)
in an orthonormal Haar wavelet basis with ψ = D∗ φ, where φ is a triangular smoothing
kernel. The corresponding wavelet-domain Cauchy parameters may be determined from
(9.27) with $\gamma = 1$, $d = 1$, and $\alpha = 1$, which yields $s_i = s_0\,(2\sqrt{2})^i$.
While the variance of the Cauchy distribution is unbounded, an analytical characteriz-
ation of the corresponding MAP estimator can be obtained by solving a cubic equation.
The MMSE solution is then described by a cumbersome formula that involves expo-
nentials and the error function erf. In particular, we can evaluate (11.24) to linearize its
behavior around the origin as
$$v_{\mathrm{MMSE}}(z_i;s_i) = z_i\left(1 - \sigma^2\left(\frac{1}{\sigma^2} + \frac{s_i^2}{\sigma^4} - \sqrt{\frac{2}{\pi}}\,\frac{s_i}{\sigma^3}\,\frac{\mathrm{e}^{-\frac{s_i^2}{2\sigma^2}}}{\mathrm{erfc}\big(\frac{s_i}{\sqrt{2}\,\sigma}\big)}\right)\right) + O(z_i^3). \qquad (11.30)$$
The corresponding MAP and MMSE shrinkage functions with $s_0 = \frac{1}{4}$ and resolution
levels $i = 0, \dots, 4$ are shown in Figure 11.7. The difference between the two types of
estimator is striking around the origin and is much more dramatic at finer scales ($i = 0$
(dark) and $i = 1$). As expected, all estimators converge to the identity map for large
input values, due to the slow (algebraic) decay of the Cauchy distribution. We observe
that the effect of processing (deviation from identity) becomes less and less significant
at coarser scales (light curves). This is consistent with the relative increase of the signal
contribution while the power of the noise remains constant across wavelet channels.
A powerful strategy for improving the performance of the basic wavelet-based denoisers
described in Section 11.3 is through the use of an overcomplete representation. Here,
we formalize the idea of cycle spinning by expanding the signal in a wavelet frame. In
essence, this is equivalent to considering a series of “shifted” orthogonal wavelet trans-
forms in parallel. The denoising task thereby reduces to finding a consensus solution.
We show that this can be done either through simple averaging or by constructing a
solution that is globally consistent by way of an iterative refinement procedure.
To demonstrate the concept and the virtues of an optimized design, we concentrate
on the model-based scenario of Section 10.4. The first important ingredient is the proper
choice of basis functions, which is discussed in Section 11.4.1. Then, in Section 11.4.2,
we switch to a redundant representation (tight wavelet frame) with a demonstration
of its benefits for noise reduction. In Section 11.4.3, we introduce the idea of consistent
cycle spinning, which results in an iterative variant of the basic denoising algorithm. The
impact of each of these refinements, including the use of the MMSE shrinkage functions
of Section 11.3, is evaluated experimentally in Section 11.4.4. The final outcome is
an optimized wavelet-based algorithm that is able to replicate the MMSE results of
Chapter 10.
Figure 11.8 Operator-like wavelets and exponential B-splines for the first-order operator
L = D − α1 Id with α1 = 0 (light) and α1 = −1 (dark). (a) Fine-level exponential B-splines
βα1 ,0 (t). (b) Coarse-level exponential B-splines βα1 ,1 (t). (c) Wavelet smoothing kernels
ϕint (t − 1). (d) Operator-like wavelets ψα1 ,1 (t) = L∗ ϕint (t − 1).
The fundamental ingredient for the implementation of the wavelet transform is that
the scaling function (exponential B-splines) and wavelets at resolution i satisfy the two-
scale relation
$$\begin{bmatrix} \beta_{\alpha_1,i}(t-2^ik) \\ \psi_{\alpha_1,i}(t-2^ik) \end{bmatrix} \propto \begin{bmatrix} 1 & \mathrm{e}^{2^{(i-1)}\alpha_1} \\ -\mathrm{e}^{2^{(i-1)}\alpha_1} & 1 \end{bmatrix} \cdot \begin{bmatrix} \beta_{\alpha_1,i-1}(t-2^ik) \\ \beta_{\alpha_1,i-1}(t-2^ik-2^{i-1}) \end{bmatrix}, \qquad (11.31)$$
which involves two filters of length 2 (row vectors of the (2 × 2) transition matrix)
since the underlying B-splines and wavelets are non-overlapping for distinct values
of k. In particular, for i = 1 and k = 0, we get that βα1 ,1 (t) ∝ βα1 (t) + a1 βα1 (t − 1) and
ψα1 ,1 (t) ∝ −a1 βα1 (t) + βα1 (t − 1) with a1 = eα1 . These relations can be visualized in
Figures 11.8b and 11.8d, respectively. Also, for $\alpha_1 = 0$, we recover the Haar system,
for which the underlying filters h and g (sum and difference, with $\mathrm{e}^{2^i\alpha_1} = 1 = a_1$) do
not depend upon the scale $i$ (see (6.6) and (6.7) in Section 6.1). Finally, we note
that the proportionality factor in (11.31) is set by renormalizing the basis func-
tions on both sides such that their norm is unity, which results in an orthonormal
transform.
The corresponding fast wavelet-transform algorithm is derived by assuming that the
fine-scale expansion of the input signal is $s(t) = \sum_{k\in\mathbb{Z}} s[k]\,\beta_{\alpha_1,0}(t-k)$. To specify the
first iteration of the algorithm, we observe that the support of $\psi_{\alpha_1,1}(\cdot-2k)$ (resp.,
$\beta_{\alpha_1,1}(\cdot-2k)$) overlaps with the fine-scale B-splines at locations $2k$ and $2k+1$ only.
Due to the orthonormality of the underlying basis functions, this results in
$$\begin{bmatrix} s_1[k] \\ v_1[k] \end{bmatrix} = \begin{bmatrix} \langle s, \beta_{\alpha_1,1}(\cdot-2k)\rangle \\ \langle s, \psi_{\alpha_1,1}(\cdot-2k)\rangle \end{bmatrix} = \frac{1}{\sqrt{1+|\mathrm{e}^{\alpha_1}|^2}}\begin{bmatrix} 1 & \mathrm{e}^{\alpha_1} \\ -\mathrm{e}^{\alpha_1} & 1 \end{bmatrix}\cdot\begin{bmatrix} s[2k] \\ s[2k+1] \end{bmatrix}, \qquad (11.32)$$
The final key observation is that the computation of the first level of wavelet coef-
ficients is analogous to the determination of the discrete increment process u[k] =
s[k] − a1 s[k − 1] (see Section 8.5.1) in the sense that v1 [k] ∝ u[2k + 1] is a subsampled
version of the latter.
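A minimal sketch of this first analysis stage (11.32), assuming a real $\alpha_1$ and a signal given by its fine-scale coefficients $s[k]$, reads as follows.

```python
import numpy as np

def analysis_level1(s, alpha1=0.0):
    """One level of the orthonormal operator-like wavelet analysis (11.32)."""
    a1 = np.exp(alpha1)
    c = 1.0 / np.sqrt(1.0 + a1 ** 2)   # normalization of the 2x2 transition matrix
    even, odd = s[0::2], s[1::2]
    s1 = c * (even + a1 * odd)         # coarse (B-spline) coefficients s_1[k]
    v1 = c * (-a1 * even + odd)        # wavelet coefficients v_1[k]
    return s1, v1

# For alpha1 = 0 (a1 = 1), this is the orthonormal Haar transform and
# v1[k] is proportional to the increment u[2k+1] = s[2k+1] - s[2k].
s1, v1 = analysis_level1(np.arange(8, dtype=float))
```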
where vm = WTm s.
where the equality on the right-hand side results from the application of Parseval’s iden-
tity for each individual basis. Next, we express the quadratic error between an arbitrary
vector z = (z1 , . . . , zM ) ∈ RMN and its approximation by Ax as
$$
\|z - Ax\|^2 = \sum_{m=1}^{M} \|z_m - W_m^{T} x\|^2 = \sum_{m=1}^{M} \|W_m z_m - x\|^2 \quad \text{(by Parseval)}.
$$
This error is minimized by setting its gradient with respect to x to zero; that is,

$$
\frac{\partial}{\partial x}\,\|z - Ax\|^2 = -2\sum_{m=1}^{M} (W_m z_m - x) = 0,
$$

which yields

$$
x_{\mathrm{LS}} = \frac{1}{M}\sum_{m=1}^{M} W_m z_m = A^{\dagger} z.
$$
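This identity is easy to verify numerically. In the following sketch (Python/NumPy), random orthonormal matrices stand in for the shifted wavelet transforms (a simplification of ours); the averaged synthesis coincides with the pseudoinverse solution, and the tight-frame property AᵀA = M I holds by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3
# M arbitrary orthonormal NxN matrices (stand-ins for shifted wavelet transforms)
Ws = [np.linalg.qr(rng.standard_normal((N, N)))[0] for _ in range(M)]

A = np.vstack([W.T for W in Ws])         # analysis: A x = (W_1^T x, ..., W_M^T x)
z = rng.standard_normal(M * N)

x_ls = sum(W @ z[m*N:(m+1)*N] for m, W in enumerate(Ws)) / M
assert np.allclose(x_ls, np.linalg.pinv(A) @ z)   # x_LS = A† z
assert np.allclose(A.T @ A, M * np.eye(N))        # tight-frame property
```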
In practice, when the wavelet expansion is performed over I resolution levels, the number of distinct shifted wavelet transforms is at most M = 2^I. In direct analogy with (11.13), the transposition of the wavelet-denoising problem to the context of wavelet frames is then

$$
\tilde{z} = \arg\min_{z} \left\{ \tfrac{1}{2}\,\|A^{\dagger}z - y\|_2^2 + \tfrac{\tau}{M}\,\Phi(z) \right\} \qquad (11.37)
$$

$$
= \arg\min_{z} \left\{ \tfrac{1}{2}\,\|z - Ay\|_2^2 + \tau\,\Phi(z) \right\} \quad \text{(due to the tight-frame property)}
$$

$$
= \mathrm{prox}_{\Phi}(Ay; \tau) = (\tilde{v}_1, \ldots, \tilde{v}_M).
$$

The corresponding signal estimate

$$
\tilde{s} = A^{\dagger}\tilde{z} = \frac{1}{M}\sum_{m=1}^{M} W_m \tilde{v}_m \qquad (11.38)
$$

is the average of the solutions (11.34) obtained with each individual wavelet basis.
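In algorithmic form, the averaged (cycle-spun) denoiser can be sketched as follows (Python/NumPy). Soft-thresholding is used here as a stand-in for the generic proximal shrinkage, and the random orthonormal bases are placeholders of our own for the shifted wavelet transforms:

```python
import numpy as np

def soft(v, tau):
    """Soft-threshold, the prox of tau*|.| (stand-in for a generic shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def cycle_spun_denoise(y, Ws, tau):
    """Average of the per-basis estimates s~_m = W_m f(W_m^T y)."""
    return sum(W @ soft(W.T @ y, tau) for W in Ws) / len(Ws)

# toy usage with random orthonormal bases as stand-ins
rng = np.random.default_rng(2)
N = 8
Ws = [np.linalg.qr(rng.standard_normal((N, N)))[0] for _ in range(3)]
y = rng.standard_normal(N)
print(cycle_spun_denoise(y, Ws, tau=0.5))
```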
A remarkable property is that the cycle-spun version of wavelet denoising is guaranteed to improve upon the non-redundant version of the algorithm.

PROPOSITION 11.3 Let y = s + n be the samples of a signal s corrupted by zero-mean i.i.d. noise n, and let s̃_m be the corresponding signal estimates given by the wavelet-based denoising algorithm (11.35) with m = 1, ..., M. Then, under the assumption that the mean-square errors of the individual wavelet denoisers are equivalent, the averaged signal estimate (11.38) satisfies

$$
\mathrm{E}\{\|\tilde{s} - s\|^2\} \le \mathrm{E}\{\|\tilde{s}_m - s\|^2\}
$$

for any m = 1, ..., M.
Proof The residual noise in the orthonormal wavelet basis W_m is (ṽ_m − E{v_m}), where ṽ_m = W_mᵀ s̃_m and E{v_m} = W_mᵀ E{y} = W_mᵀ s, because of the assumption of zero-mean noise. This allows us to express the total noise power over the M wavelet bases as

$$
\|\tilde{z} - As\|^2 = \sum_{m=1}^{M} \|\tilde{v}_m - \mathrm{E}\{v_m\}\|^2.
$$
The favorable aspect of considering a redundant representation is that the inverse frame operator A† is an orthogonal projector onto the signal space R^N with the property that

$$
\|A^{\dagger} w\|^2 \le \frac{1}{M}\,\|w\|^2
$$

for all w ∈ R^{MN}. This follows from the Pythagorean relation ‖w‖² = ‖AA†w‖² + ‖(I − AA†)w‖² (projection theorem) and the tight-frame property, which is equivalent to ‖AA†w‖² = M‖A†w‖². By applying this result to w = z̃ − As, we obtain

$$
\|A^{\dagger}(\tilde{z} - As)\|^2 = \|\tilde{s} - s\|^2 \le \frac{1}{M}\sum_{m=1}^{M} \|\tilde{v}_m - \mathrm{E}\{v_m\}\|^2. \qquad (11.39)
$$
Next, we take the statistical expectation of (11.39), which yields

$$
\mathrm{E}\{\|\tilde{s} - s\|^2\} \le \frac{1}{M}\sum_{m=1}^{M} \mathrm{E}\big\{\|\tilde{v}_m - W_m^{T} s\|^2\big\}. \qquad (11.40)
$$
The final result then follows from Parseval’s relation (norm preservation of individual
wavelet transforms) and the weak stationarity hypothesis (MSE equivalence of shifted-
wavelet denoisers).
Note that this general result does not depend on the type of wavelet-domain proces-
sing – MAP vs. MMSE, or even scalar vs. vectorial – as long as the non-linear mapping
ṽm = f (vm ) is fixed and applied in a consistent fashion. The inequality in Proposition
11.3 also suggests that one can push the denoising performance further by optimizing
the MSE globally in the signal domain, which is not the same as minimizing the error
for each individual wavelet denoiser. The only downside of the redundant synthesis
formulation (11.37) is that the underlying cost function loses its statistical interpretation
(e.g., MAP criterion) because of the inherent coupling that results from considering
multiple series of wavelet coefficients. The fundamental limitation there is that it is
impossible to specify a proper innovation model in an overcomplete system.
equivalence by insisting that the wavelet-frame expansion be consistent with the signal.
This leads to the reformulation of (11.41) in synthesis form as
$$
\tilde{z} = \arg\min_{z} \left\{ \tfrac{1}{2}\,\|z - Ay\|_2^2 + \tau\,\Phi(z) \right\} \quad \text{s.t.} \quad AA^{\dagger}z = z, \qquad (11.42)
$$
which is the consistent-cycle-spinning version of denoising. Rather than attempting to solve the constrained optimization problem (11.42) directly, we shall exploit the link with conventional wavelet shrinkage. To that end, we introduce the augmented-Lagrangian penalty function

$$
\mathcal{L}_A(z, x, \lambda; \mu) = \tfrac{1}{2}\,\|z - Ay\|_2^2 + \tau\,\Phi(z) + \tfrac{\mu}{2}\,\|z - Ax\|_2^2 - \lambda^{T}(z - Ax) \qquad (11.43)
$$

with penalty parameter μ ∈ R^+ and Lagrange-multiplier vector λ ∈ R^{MN}. Observe that the minimization of (11.43) over (z, x, λ) is equivalent to solving (11.42). Indeed, the consistency condition z = Ax asserted by (11.43) is equivalent to AA†z = z, while the auxiliary variable x = A†z is the sought-after signal.
The standard strategy in the augmented-Lagrangian method of multipliers is to solve the problem iteratively by first minimizing L_A(z, x, λ; μ) with respect to (z, x) while keeping μ fixed, and then updating λ according to the rule

$$
\lambda^{k+1} = \lambda^{k} - \mu\,(z^{k+1} - A x^{k+1}).
$$

For fixed (x, λ), completing the square shows that the minimization of (11.43) over z amounts to minimizing

$$
\tfrac{1+\mu}{2}\,\|z - \tilde{z}\|_2^2 + \tau\,\Phi(z) + C_0, \quad \text{with} \quad \tilde{z} = \tfrac{1}{1+\mu}\,(Ay + \mu Ax + \lambda),
$$

where C_0 is a term that does not depend on z. Since the underlying cost function is separable, the solution of the minimization of L_A with respect to z is obtained by suitable shrinkage of z̃, leading to

$$
z^{k+1} = \mathrm{prox}_{\Phi}\Big(\tilde{z}^{k+1}; \tfrac{\tau}{1+\mu}\Big) \qquad (11.44)
$$
and involving the same kind of pointwise non-linearity as algorithm (11.34). The converse task of optimizing L_A over x with z = z^{k+1} fixed is a quadratic problem. The required partial derivative is

$$
\frac{\partial \mathcal{L}_A(z, x, \lambda; \mu)}{\partial x} = -\mu A^{T}(z - Ax) + A^{T}\lambda.
$$

This leads to the closed-form solution

$$
x^{k+1} = A^{\dagger} z^{k+1} - \frac{1}{\mu}\, A^{\dagger} \lambda^{k},
$$
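Putting the three updates together yields the following sketch of the CCS iteration (Python/NumPy; the shrinkage, bases, and parameter values are placeholders of ours). Since the multiplier-update rule is not reproduced above, the code assumes the standard ADMM rule λ ← λ − μ(z − Ax):

```python
import numpy as np

def ccs_denoise(y, Ws, prox, tau, mu=1.0, n_iter=50):
    """Consistent cycle spinning by ADMM (sketch).
    Ws: list of orthonormal transforms; prox(v, t): pointwise shrinkage."""
    M, N = len(Ws), y.size
    A = lambda x: np.concatenate([W.T @ x for W in Ws])          # analysis
    Adag = lambda z: sum(W @ z[m*N:(m+1)*N]
                         for m, W in enumerate(Ws)) / M          # A† (synthesis/M)
    x = y.copy()
    lam = np.zeros(M * N)
    Ay = A(y)
    for _ in range(n_iter):
        z_tilde = (Ay + mu * A(x) + lam) / (1.0 + mu)            # quadratic part
        z = prox(z_tilde, tau / (1.0 + mu))                      # shrinkage (11.44)
        x = Adag(z) - Adag(lam) / mu                             # closed-form x-update
        lam = lam - mu * (z - A(x))                              # assumed ADMM rule

    return x

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
rng = np.random.default_rng(3)
N = 8
Ws = [np.linalg.qr(rng.standard_normal((N, N)))[0] for _ in range(2)]
y = rng.standard_normal(N)
print(ccs_denoise(y, Ws, soft, tau=0.5))
```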
The shrinkage function is determined by the statistics of the signal and applied in a consensus fashion.
Its cost per iteration is O(N × M) operations, which is essentially that of the fast wavelet
transform. This makes the method very fast. Since every step is the outcome of an exact
minimization, the cost function decreases monotonically until the algorithm reaches a
fixed point. The convergence to a global optimum is guaranteed when the potential
function is convex.
One may also observe that the CCS denoising algorithm is similar to the MAP estimation method of Section 10.2.4, since both rely on ADMM. Besides the fact that the latter can handle an arbitrary system matrix H, the crucial difference is in the choice of the auxiliary variable: u = Ls (discrete innovation) vs. z = As (redundant wavelet transform). While the two representations have a significant intersection, the tight wavelet frame has the advantage of resulting in a better-conditioned problem (because of the norm-preservation property) and hence a faster convergence to the solution.
The natural idea that we shall test now is to replace the proximal operator prox_Φ by the optimal MMSE shrinkage function dictated by the theory of Section 11.3.2. This change essentially comes for free – the mere substitution in a lookup table of prox_Φ(z; τ) = v_MAP(z; σ) by v_MMSE(z; σ), as defined by (11.20) – but it has a tremendous effect on performance, to the point that the algorithm reaches the best level achievable. While there is not yet a proof that the proposed scheme converges to the true MMSE solution, we shall document the behavior experimentally.
In order to compare the various wavelet-based denoising techniques, we have applied
them to the same series of Lévy processes as in Section 10.4.3: Brownian motion,
Laplace motion, compound-Poisson process with a standard Gaussian distribution and
λ = 0.6, and Lévy flight with Cauchy-distributed increments. Each realization of the
signal of length N = 100 is corrupted by AWGN of variance σ 2 . The signal is then
expanded in a Haar basis and denoised using the following algorithms:
• Soft-thresholding (Φ(z) ∝ |z|) in the Haar wavelet basis with optimized τ (ortho-ST)
• Model-based shrinkage in the orthogonal wavelet basis (ortho-MAP vs. ortho-
MMSE)
• Model-based shrinkage in a tight frame with M = 2 (frame-MAP vs. frame-MMSE)
• Global MAP estimator implemented by consistent cycle spinning (CCS-MAP), as
described in Section 11.4.3
• Model-based consistent cycle spinning with MMSE shrinkage function (CCS-
MMSE)
To simplify the comparison, the depth of the transform is set to I = 1 and the lowpass
coefficients are kept untouched. The experimental conditions are exactly the same as in
Section 10.4.3, with each data point (SNR value) being the result of an average over
500 trials.
The model-based denoisers are derived from the knowledge of the pdf of the wavelet coefficients, which is given by p_{V1}(x) = √2 p_U(√2 x), where p_U is the pdf of the increments of the Lévy process. The rescaling by √2 accounts for the fact that the (redundant) Haar wavelet coefficients are a renormalized version of the increments. For the direct methods, we set τ = σ², which corresponds to a standard wavelet-domain MAP estimator, as described by (11.13). For the iterative CCS-MAP solution, the identification of (11.41) with the standard form (10.46) of the MAP estimator dictates the choice τ = σ² and Φ(z) = Φ_U(√2 z). Similarly, the setting of the shrinkage function for the MMSE version of CCS relies on the property that the noise variance in the wavelet domain is σ² even though the components are no longer independent. By exploiting the analogy with the orthogonal scenario, one then simply replaces the proximal step in Algorithm 3, described in Equation (11.44), by

$$
z^{k+1} = v_{\mathrm{MMSE}}\Big(\tilde{z}^{k+1}; \tfrac{\sigma^2}{1+\mu}\Big).
$$
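In practice, the pointwise MMSE shrinkage can be tabulated numerically from the coefficient prior. A minimal sketch (Python/NumPy; the Laplace prior, the grid, and the test points are our own choices) computes the posterior mean by quadrature:

```python
import numpy as np

def v_mmse(z, sigma, p_v, grid):
    """Pointwise MMSE shrinkage E{V | V + noise = z} under AWGN of std sigma,
    with prior density p_v evaluated on a uniform grid (numerical quadrature)."""
    lik = np.exp(-((z - grid) ** 2) / (2 * sigma ** 2))   # Gaussian likelihood
    w = lik * p_v(grid)
    return np.sum(grid * w) / np.sum(w)                   # posterior mean (dx cancels)

laplace = lambda v: 0.5 * np.exp(-np.abs(v))              # Laplace prior (unit scale)
grid = np.linspace(-30, 30, 4001)
for z in (0.5, 2.0, 5.0):
    print(z, v_mmse(z, 1.0, laplace, grid))               # smooth shrinkage toward 0
```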
Figure 11.9 SNR improvement as a function of the level of noise for Brownian motion. The
wavelet-denoising methods by reverse order of performance are: standard soft-thresholding
(ortho-ST), optimal shrinkage in a wavelet basis (ortho-MAP/MMSE), shrinkage in a redundant
system (frame-MAP/MMSE), and optimal shrinkage with consistent cycle spinning
(CCS-MAP/MMSE).
Figure 11.10 SNR improvement as a function of the level of noise for a Lévy process with
Laplace-distributed increments. The wavelet-denoising methods by reverse order of
performance are: ortho-MAP (equivalent to soft-thresholding with fixed τ ), ortho-MMSE,
frame-MMSE, frame-MAP, CCS-MAP, and CCS-MMSE. The results of CCS-MMSE are
indistinguishable from the ones of the reference MMSE estimator obtained using message
passing (see Figure 10.12).
We first consider Brownian motion, for which the optimal shrinkage functions are linear. The corresponding denoising results are shown in Figure 11.9. They are consistent with our expectations: the model-based approach (ortho-MAP/MMSE) results in a (slight) improvement over basic wavelet-domain soft-thresholding, while the performance gain brought by redundancy (frame-MAP/MMSE) is more substantial, in accordance with Proposition 11.3. The optimal denoising (global MMSE=MAP solution) is achieved by running the CCS version of the algorithm, which produces a linear solution that is equivalent to the Wiener filter (LMMSE estimator).
The same performance hierarchy can also be observed for the other types of signals (see Figures 11.10–11.12), confirming the relevance of the proposed series of refinements. For non-Gaussian processes, ortho-MMSE is systematically better than ortho-MAP, not to mention soft-thresholding, in agreement with the theoretical predictions of Section 11.3. Since the MAP estimator for the compound-Poisson process is useless (identical to zero), the corresponding ortho-MMSE thresholding can be compared
Figure 11.11 SNR improvement as a function of the level of noise for a compound-Poisson
process (piecewise-constant signal). The wavelet-denoising methods by reverse order of
performance are: ortho-ST, ortho-MMSE, frame-MMSE, and CCS-MMSE. The results of
CCS-MMSE are indistinguishable from the ones of the reference MMSE estimator obtained
using message passing (see Figure 10.10).
Figure 11.12 SNR improvement as a function of the level of noise for a Lévy flight with
Cauchy-distributed increments. The wavelet-denoising methods by reverse order of performance
are: ortho-MAP, ortho-MMSE, frame-MMSE, frame-MAP, CCS-MAP, and CCS-MMSE. The
results of CCS-MMSE are indistinguishable from the ones of the reference MMSE estimator
obtained using message passing (see Figure 10.11).
against the optimized soft-thresholding (ortho-ST) where τ is tuned for maximum SNR (see Figure 11.11). This is actually the scenario where this standard sparsity-promoting scheme performs best, which is not too surprising since a piecewise-constant signal is intrinsically sparse, with a large proportion of its wavelet coefficients being zero. Switching to a redundant system (frame) is beneficial in all instances. The only caveat is that frame-MMSE is not necessarily the best design because the corresponding one-step modification of wavelet coefficients typically destroys Parseval's relation (lack of consistency). In fact, one can observe a degradation, with frame-MMSE being worse than frame-MAP in most cases. By contrast, the full power of the Bayesian formulation is reinstated when the denoising is performed iteratively according to the CCS strategy. Here, thanks to the consistency requirement, CCS-MMSE is always better than CCS-MAP, which actually corresponds to a wavelet-based implementation of the true MAP estimator of the signal (see Section 10.4.3). In particular, CCS-MAP under the Laplace hypothesis is equivalent to the standard total-variation denoiser. Finally,
the most important finding is that, in all tested scenarios, the results of CCS-MMSE
are indistinguishable from those of the belief-propagation algorithm of Section 10.4.2
which implements the reference MMSE solution. This leads us to conjecture that the
CCS-MMSE estimator is optimal for the class of first-order processes. The practical
benefit is that CCS-MMSE is much faster than BP, which necessitates the computation
of two FFTs per data point.
While we are still missing a theoretical explanation of the ability of CCS-MMSE to resolve non-Gaussian statistical interdependencies, we believe that the results presented are important conceptually, for they demonstrate the possibility of specifying iterative signal-reconstruction algorithms that minimize the reconstruction error. Moreover, it should be possible, by reverse engineering, to reformulate such schemes in terms of the minimization of a pseudo-MAP cost functional that is tied to the underlying signal model. Whether such ideas and design principles are transposable to higher-order signals and/or more general types of inverse problems is an open question that calls for further investigation.
involve accelerated versions of ISTA or FISTA that capitalize on the specificities of the
underlying system matrices.
Section 11.3
The signal-processing community's response to the publication of Donoho and Johnstone's work on the optimality of wavelet-domain soft-thresholding was a friendly competition to improve denoising performance. The Bayesian reformulation of the basic signal-denoising problem naturally led to the derivation of thresholding functions that are optimal in the MMSE sense [SA96, Sil99, ASS98]. Moulin and Liu presented a mathematical analysis of pointwise MAP estimators, establishing their shrinkage behavior for heavy-tailed distributions [ML99]. In this statistical view of the problem, the thresholding function is determined by the assumed prior distribution of the wavelet coefficients, the most prominent choices being the generalized Gaussian distribution [Mal89, CYV00, PP06] or a mixture of Gaussians with a peak at the origin [CKM97].
While infinite divisibility is not a property that has been emphasized in the image-processing literature, researchers have considered a number of wavelet-domain models that are compatible with the property and therefore part of the general framework investigated in Section 11.3. These include the Laplace distribution (see [HY00, Mar05] for the pointwise MMSE estimator), SαS laws (see [ATB03] and [ABT01, BF06] for pointwise MAP and MMSE estimators, respectively), the Cauchy distribution [BAS07], as well as the symmetric gamma family [FB05]. The latter choice (a.k.a. Bessel K-form) is supported by a constructive model of images that is reminiscent of generalized Poisson processes [GS01]. There is also experimental evidence that this class of models is able to fit the observed transform-domain histograms well over a variety of natural images [MG01, SLG02].
The multivariate version of (11.21) can be found in [Ste81, Equation (3.3)]. This formula is also central to the derivation of Stein's unbiased risk estimator (SURE), which provides a powerful data-driven scheme for adjusting the free parameters of a statistical estimator under the AWGN hypothesis. SURE has been applied to the automatic adjustment of the thresholding parameters of wavelet-based denoising algorithms such as the SURE-shrink [DJ95] and SURELET [LBU07] approaches.
The possibility of defining bivariate shrinkage functions for exploiting inter-scale wavelet dependencies is investigated in [SS02].
Section 11.4
The concept of redundant wavelet-based denoising was introduced by Coifman and Donoho under the name of cycle spinning [CD95]. The fact that this scheme always improves upon non-redundant wavelet-based denoising (see Proposition 11.3) was pointed out by Raphan and Simoncelli [RS08]. A frame is an overcomplete (and stable) generalization of a basis; see for instance [Ald95, Chr03]. For the design and implementation of the operator-like wavelets (including the first-order ones, which are orthogonal), we refer to [KU06].
The concept of consistent cycle spinning was developed by Kamilov et al. [KBU12].
The CCS Haar-denoising algorithm was then modified appropriately to provide the
MMSE estimator for Lévy processes [KKBU12].
12 Conclusion

A Singular integrals

In this appendix, we are concerned with integrals involving functions that are singular at a finite (or at least countable) number of isolated points. Without loss of generality, we consider the singularities to arise at the origin.
Suppose that we are given a function f that is locally integrable in any neighborhood in R^d that excludes the origin, but not if the origin is included. Then, for any test function φ ∈ D(R^d) with 0 ∉ supp(φ), the integral

$$
\langle \varphi, f \rangle = \int_{\mathbb{R}^d} \varphi(\mathbf{r})\, f(\mathbf{r})\,\mathrm{d}\mathbf{r}
$$

converges in the sense of Lebesgue and is continuous in φ for sequences that exclude a neighborhood of the origin. It may also converge for some other, but not all, φ ∈ D. In general, if f grows no faster than some inverse power of |r| as r → 0, then ⟨φ, f⟩ will converge for all φ whose value as well as all derivatives up to some order k vanish at 0. This is the situation that will be of interest to us here.
In some cases, we may be able to continuously extend the bilinear form ⟨φ, f⟩ to all test functions in D(R^d) (or even in S(R^d)). In other words, it may be possible to find a generalized function f̃ such that ⟨φ, f⟩ = ⟨φ, f̃⟩ for all φ ∈ D for which the left-hand side converges in the sense of Lebesgue. f̃ is then called a regularization of f.
Note that regularizations of f, when they exist, are not unique, as they can differ by a generalized function that is supported at r = 0. This also implies that the difference of any two regularizations of f can be written as a finite sum of the form

$$
\sum_{|\mathbf{n}| \le k} c_{\mathbf{n}}\, \delta^{(\mathbf{n})}(\mathbf{r})
$$

for some coefficients c_n.
The first approach to regularization we shall describe is useful for regularizing parametric families of integrals. As an illustration, consider the family of functions x_+^λ = x^λ 1_{[0,∞)}(x) in one scalar variable. For λ ∈ U = {λ ∈ C : Re(λ) > −1}, the integral

$$
F_\varphi(\lambda) = \int_{\mathbb{R}} \varphi(x)\, x_+^{\lambda}\,\mathrm{d}x
$$

is (complex) analytic in λ over its (initial) domain U, as can be seen by differentiation under the integral sign.¹ Additionally, as we shall see shortly, F_φ(λ) has a (necessarily unique) analytic continuation to a larger domain Ũ ⊃ U in the complex plane. Denoting the analytic continuation of F_φ by F̃_φ, we can use it to define a regularization of x_+^λ for λ ∈ Ũ\U by the identity ⟨φ, x_+^λ⟩ = F̃_φ(λ).
In the above example, one can show that the largest domain to which F_φ(λ) = ∫_R φ(x) x_+^λ dx (integration in the sense of Lebesgue) can be extended analytically is the set

$$
\tilde{U} = \mathbb{C} \setminus \{-1, -2, -3, \ldots\}.
$$
Over Ũ, the analytic continuation of F_φ(λ) to −1 ≥ Re(λ) > −2, λ ≠ −1, is found by using the formula

$$
x_+^{\lambda} = \frac{1}{\lambda+1}\,\frac{\mathrm{d}}{\mathrm{d}x}\, x_+^{\lambda+1}
$$

to write

$$
\langle \varphi, x_+^{\lambda} \rangle = \frac{1}{\lambda+1} \Big\langle \varphi, \frac{\mathrm{d}}{\mathrm{d}x} x_+^{\lambda+1} \Big\rangle = \frac{-1}{\lambda+1} \Big\langle \frac{\mathrm{d}\varphi}{\mathrm{d}x}, x_+^{\lambda+1} \Big\rangle,
$$

where the rightmost member is well defined and where we have used duality to transfer the derivative operator to the side of the test function. Similarly, we can find the analytic continuation for −n ≥ Re(λ) > −(n + 1), λ ≠ −n, by successive re-application of the same approach, which leads to

$$
\langle \varphi, x_+^{\lambda} \rangle = \frac{(-1)^n}{(\lambda+1)\cdots(\lambda+n)} \Big\langle \frac{\mathrm{d}^n \varphi}{\mathrm{d}x^n}, x_+^{\lambda+n} \Big\rangle.
$$
¹ We can differentiate under the integral sign with respect to λ due to the compact support of φ, whereby we obtain the integral

$$
\frac{\mathrm{d}}{\mathrm{d}\lambda} F_\varphi(\lambda) = \int_{\mathbb{R}} \varphi(x)\, x_+^{\lambda} \log x \,\mathrm{d}x.
$$
Within the band −n > Re(λ) > −(n + 1), we may also compute ⟨φ, x_+^λ⟩ using the formulas

$$
\langle \varphi, x_+^{\lambda} \rangle = \int_0^{\infty} \Big( \varphi(x) - \sum_{k \le -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)\, x^k}{k!} \Big)\, x^{\lambda}\,\mathrm{d}x \qquad \mathrm{(A.1)}
$$

$$
= \int_1^{\infty} \varphi(x)\, x^{\lambda}\,\mathrm{d}x + \int_0^1 \Big( \varphi(x) - \sum_{k \le -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)\, x^k}{k!} \Big)\, x^{\lambda}\,\mathrm{d}x - \sum_{k \le -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)}{k!} \int_1^{\infty} x^{\lambda+k}\,\mathrm{d}x
$$

$$
= \int_1^{\infty} \varphi(x)\, x^{\lambda}\,\mathrm{d}x + \int_0^1 \Big( \varphi(x) - \sum_{k \le -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)\, x^k}{k!} \Big)\, x^{\lambda}\,\mathrm{d}x + \sum_{k \le -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)}{k!\,(\lambda+k+1)}. \qquad \mathrm{(A.2)}
$$
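A quick numerical sanity check of (A.2) (Python/SciPy; the Gaussian test function and the value λ = −1.5 are our own choices): within the band −2 < Re(λ) < −1, the regularized value must agree with the derivative identity above, whose right-hand side is an ordinary convergent integral.

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2)            # test function (our choice)
dphi = lambda x: -2 * x * np.exp(-x**2)
lam = -1.5                               # band -2 < Re(lambda) < -1, so k = 0 only

# regularization (A.2): one Taylor term phi(0) is subtracted on (0, 1)
val_A2 = (quad(lambda x: phi(x) * x**lam, 1, np.inf)[0]
          + quad(lambda x: (phi(x) - phi(0)) * x**lam, 0, 1)[0]
          + phi(0) / (lam + 1))

# derivative identity: <phi, x_+^lam> = -1/(lam+1) <phi', x_+^{lam+1}>,
# whose right-hand side converges in the ordinary (Lebesgue) sense
val_der = -quad(lambda x: dphi(x) * x**(lam + 1), 0, np.inf)[0] / (lam + 1)

assert np.isclose(val_A2, val_der, rtol=1e-6)
print(val_A2)   # equals Gamma((lam+1)/2)/2 = Gamma(-1/4)/2 for this test function
```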
The effect of subtracting the first n terms of the Taylor expansion of φ(x) in (A.2) is to create a zero of sufficiently high order at 0 to make the singularity of x_+^λ integrable.
We use the above definition of x_+^λ to define x_−^λ = (−x)_+^λ for λ ≠ −1, −2, −3, ..., as well as sign(x)|x|^λ = x_+^λ − x_−^λ and |x|^λ = x_+^λ + x_−^λ. Due to the cancellation of some poles in λ, the generalized functions sign(x)|x|^λ and |x|^λ have an extended domain of definition: they are defined for λ ≠ −2, −4, −6, ... and λ ≠ −1, −3, −5, ..., respectively.
The singular function r^λ in d dimensions is regularized by switching to (hyper)spherical coordinates and applying the definition of x_+^{λ+d−1}. Due to symmetries, the definition obtained in this way is valid for all λ ∈ C with the exception of the individual points −d, −(d + 2), −(d + 4), −(d + 6), .... As was the case in 1-D, we find formulas based on removing terms from the Taylor expansion of φ for computing ⟨φ, r^λ⟩ in bands of the form −(d + 2m) < Re(λ) < −(d + 2m − 2), m = 1, 2, 3, ..., which results in

$$
\langle \varphi, r^{\lambda} \rangle = \int_0^{\infty} \Big( S_\varphi(r) - \sum_{n \le -\operatorname{Re}(\lambda)-d} S_\varphi^{(n)}(0)\, \frac{r^n}{n!} \Big)\, r^{\lambda+d-1}\,\mathrm{d}r
= \int_{\mathbb{R}^d} \Big( \varphi(\mathbf{r}) - \sum_{|\mathbf{k}| \le -\operatorname{Re}(\lambda)-d} \frac{\mathbf{r}^{\mathbf{k}}}{\mathbf{k}!}\, \varphi^{(\mathbf{k})}(0) \Big)\, r^{\lambda}\,\mathrm{d}\mathbf{r}.
$$
where Δ^m is the mth iterate of the Laplacian operator. For m = 0, we can find the result (A.3) directly by taking the limit λ → −d of ⟨φ, ρ^λ⟩. From there, the general result is obtained by iterating the relation

$$
(-\Delta)\,\rho^{\lambda} = -\lambda(\lambda+d-2)\,\rho^{\lambda-2}.
$$

A.2 Fourier transform of homogeneous distributions
The definition of the 1-D generalized functions of Section A.1 extends to Schwartz' space, S(R), and to S(R^d) in multiple dimensions. We can therefore consider these families of generalized functions as members of the space S′ of tempered distributions. In particular, this implies that the Fourier transform of any of these generalized functions will belong to (the complex version of) S′ as well. We recall from Section 3.3.3 that, in general, the (generalized) Fourier transform ĝ of a tempered distribution g is defined as the tempered distribution that makes the following identity hold true for all φ ∈ S:

$$
\langle \hat{g}, \varphi \rangle = \langle g, \hat{\varphi} \rangle. \qquad \mathrm{(A.4)}
$$

The distributions we considered above are all homogeneous of some order λ, in the sense that, using g to denote any one of them, g(a·) is equal to a^λ g for any a > 0. It then follows from the properties of the Fourier transform that ĝ is also homogeneous, albeit of order −(d + λ). By invoking additional symmetry properties of these distributions, one then finds that the Fourier transform of members of each of these families belongs to the same family, with some normalization factor that can be computed; for instance, by plugging a Gaussian function φ (which is its own Fourier transform) into (A.4). We summarize the Fourier transforms found in this way in Table A.1; the interested reader may refer to Gelfand and Shilov [GS68] for the details of the calculations.
332 Singular integrals
(λ+1)
rλ+ , ϕ, r̃λ+
" # (jω)λ+1
−m−1 < Re(λ) < −m i (i)
= 0∞ rλ ϕ(r) − 0≤i≤m−1 r ϕ i! (0) dr
rn sign(r), N/A 2 n!
(jω)n+1
n = 0, 1, 2, . . .
+∞ ϕ(r)−ϕ(−r)
1/r 0 r dr −jπsign(ω)
⎛ ⎞ " #
− Re(λ)−d k
r 2λ+d πd/2 d+λ
rλ , r ∈ Rd , ⎝ϕ(r) − ϕ (k) (0)⎠ rλ dr " # 2 1
Rd k! − λ2 ωλ+d
−(d + 2m) < Re(λ) |k|=0
< −(d + 2m − 2)
Fourier transforms found in this way in Table A.1. The interested reader may refer to
Gelfand and Shilov [GS68] for the details of their calculations.
For the singular integral ∫_{−∞}^{∞} φ(x)/x dx, the principal value is defined as

$$
\mathrm{p.v.}\int_{-\infty}^{\infty} \frac{\varphi(x)}{x}\,\mathrm{d}x
= \lim_{\epsilon \to 0} \left( \int_{\epsilon}^{\infty} \frac{\varphi(x)}{x}\,\mathrm{d}x + \int_{-\infty}^{-\epsilon} \frac{\varphi(x)}{x}\,\mathrm{d}x \right)
= \lim_{\epsilon \to 0} \int_{\epsilon}^{\infty} \frac{\varphi(x)-\varphi(-x)}{x}\,\mathrm{d}x
= \int_0^{\infty} \frac{\varphi(x)-\varphi(-x)}{x}\,\mathrm{d}x,
$$

where the last integral converges in the sense of Lebesgue.
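The equivalence of the symmetric (Lebesgue-convergent) form and the two-sided ε-limit can be checked numerically (Python/SciPy; the test function is our own choice):

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-(x - 1.0)**2)      # asymmetric test function (our choice)

# Lebesgue-convergent symmetric form: integrand tends to 2*phi'(0) at x = 0
pv_sym = quad(lambda x: (phi(x) - phi(-x)) / x, 0, np.inf)[0]

for eps in (1e-2, 1e-3, 1e-4):
    two_sided = (quad(lambda x: phi(x) / x, eps, np.inf)[0]
                 + quad(lambda x: phi(x) / x, -np.inf, -eps)[0])
    print(eps, two_sided)                  # -> approaches pv_sym as eps -> 0
print("p.v. =", pv_sym)
```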
In essence, Cauchy's definition of the principal value relies on the "infinite parts" of the integrals ∫_ε^∞ and ∫_{−∞}^{−ε} cancelling one another out. To generalize this idea, consider the integral

$$
\int_0^{\infty} \varphi(x)\, f(x)\,\mathrm{d}x,
$$

where the function f is assumed to be singular at 0. Let

$$
\Phi(\epsilon) = \int_{\epsilon}^{\infty} \varphi(x)\, f(x)\,\mathrm{d}x
$$

and suppose that, for some pre-chosen family of functions H_k(ε) approaching infinity at 0, we can find an n ∈ Z^+ and coefficients a_k, 1 ≤ k ≤ n, such that

$$
\lim_{\epsilon \to 0^+} \Big( \Phi(\epsilon) - \sum_{k=1}^{n} a_k H_k(\epsilon) \Big) = A < \infty.
$$

A is then called the finite part (in French, partie finie) of the integral ∫_0^∞ φ(x)f(x) dx and is denoted as [EK00]

$$
\mathrm{p.f.}\int_0^{\infty} \varphi(x)\, f(x)\,\mathrm{d}x.
$$
In the cases of interest to us, the family H_k(ε) consists of inverse integer powers of ε and logarithms. With this choice, the finite-part regularization of the singular integrals considered in Section A.1 can be obtained. It is found to coincide with their regularization by analytic continuation (note that all H_k are analytic in ε). But, in addition, we can use the finite part to regularize x_+^λ and related functions in cases where the previous method fails (namely, for λ = −1, −2, −3, ...). Indeed, for λ = −n, n ∈ Z^+, we may write
write
∞ n−1 1
ϕ(x) ϕ (k) (0)xk−n
dx = dx
xn k!
k=0
1 ϕ(x) −
n−1 ϕ (k) (0)xk ∞
k=0 k! ϕ(x)
+ n
dx + dx
x 1 xn
n−2
ϕ (n−1) (0) ϕ (k) (0) −n+k+1
−1
=− log + ·
(n − 1)! k! n−k−1
k=0
n−1 ϕ (k) (0)xk ∞
1 ϕ(x) − k=0 k! ϕ(x)
+ dx + dx.
xn 1 xn
From there, by discarding the logarithm and inverse powers of ε and taking the limit ε → 0 of what remains, we find

$$
\mathrm{p.f.}\int_0^{\infty} \frac{\varphi(x)}{x^n}\,\mathrm{d}x
= \int_1^{\infty} \frac{\varphi(x)}{x^n}\,\mathrm{d}x
+ \int_0^1 \frac{\varphi(x) - \sum_{k=0}^{n-1}\varphi^{(k)}(0)x^k/k!}{x^n}\,\mathrm{d}x
- \sum_{k=0}^{n-2} \frac{\varphi^{(k)}(0)}{k!\,(n-k-1)},
$$

where the two integrals on the right-hand side converge in the sense of Lebesgue.
Using similar calculations, for ⟨φ, x_+^λ⟩ with λ ≠ −1, −2, −3, ..., we find the same regularization as the one given by (A.2).
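As a numerical illustration (Python/SciPy; the Gaussian test function and n = 2 are our own choices), the closed-form finite part above agrees with the limiting definition once the divergent term φ(0)/ε is discarded; the log term vanishes here because φ′(0) = 0 for an even test function:

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2)     # test function; phi'(0) = 0, so no log term
n = 2

# closed-form finite part (formula above, with the single k = 0 correction term)
pf = (quad(lambda x: phi(x) / x**n, 1, np.inf)[0]
      + quad(lambda x: (phi(x) - phi(0)) / x**n, 0, 1)[0]
      - phi(0))                                   # phi(0)/(0!(n-0-1)) = phi(0)

# limiting definition: discard the divergent part phi(0)/eps of Phi(eps)
for eps in (1e-2, 1e-3, 1e-4):
    Phi = (quad(lambda x: phi(x) / x**n, eps, 1)[0]
           + quad(lambda x: phi(x) / x**n, 1, np.inf)[0])
    print(eps, Phi - phi(0) / eps)                # -> converges to pf
print("p.f. =", pf)
```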
In general, a singular function f does not define a distribution in a unique way.
However, in many of the cases that are of interest to us there exists a particular reg-
ularization of f that is considered standard or canonical. For the parametric families
discussed so far, these essentially correspond to the regularization obtained by analytic
continuation. In Table A.1, we have summarized the formulas for canonical regulari-
zation as well as the Fourier transforms of the singular distributions that are of interest
to us.
Finally, we point out that the approaches presented in this appendix to regularize
singularities at the origin generalize in an obvious way to isolated singularities at any
other point and also to a finite (or even countable) number of isolated singularities.
As we noted earlier, the scaling property of the Fourier transform demands that the Fourier transform of a homogeneous distribution of order λ be homogeneous of order −(λ + d). Thus, for Re(λ) ≤ −d, where the original distribution is singular at 0, its Fourier transform is locally integrable everywhere, and vice versa. Consequently, convolutions with homogeneous singular kernels are often easier to evaluate in the Fourier domain by employing the convolution-multiplication rule. Important examples of such convolutions are the Hilbert and Riesz transforms.
The Hilbert transform of a test function φ ∈ S(R) can thus be defined either by the convolution with the singular kernel h(x) = 1/(πx), as

$$
\mathrm{H}\varphi(x) = \mathrm{p.v.}\int_{-\infty}^{+\infty} \varphi(y)\, h(x-y)\,\mathrm{d}y,
$$

or, in the Fourier domain, as

$$
\mathrm{H}\varphi(x) = \mathcal{F}^{-1}\{\hat{h}\,\hat{\varphi}\}(x), \quad \text{with} \quad \hat{h}(\omega) = -\mathrm{j}\,\mathrm{sign}(\omega).
$$

These definitions extend beyond S(R) to φ ∈ L_p(R) for 1 < p < ∞ (the standard reference here is Stein and Weiss [SW71]).
Similarly, the ith component of the Riesz transform of a test function φ ∈ S(R^d) is defined in the spatial domain by the convolution with the kernel h_i(r) = Γ((d + 1)/2) r_i / (π^{(d+1)/2} |r|^{d+1}).
B Positive definiteness

Recall that a function f is positive definite if

$$
\sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \bar{\xi}_n\, f(\omega_m - \omega_n) \ge 0
$$

for every choice of coefficients ξ_1, ..., ξ_N ∈ C and frequencies ω_1, ..., ω_N (Definition B.1). For the Gaussian example f(ω) = e^{−ω²/2}, this property can be verified directly by writing

$$
\sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \bar{\xi}_n\, f(\omega_m - \omega_n)
= \sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \bar{\xi}_n \int_{\mathbb{R}} e^{-\mathrm{j}(\omega_m-\omega_n)x}\, g(x)\,\mathrm{d}x
= \int_{\mathbb{R}} \Big| \sum_{m=1}^{N} \xi_m\, e^{-\mathrm{j}\omega_m x} \Big|^2 g(x)\,\mathrm{d}x \ge 0,
$$

where we have made use of the fact that g(x), the (inverse) Fourier transform of e^{−ω²/2}, is positive. It is not hard to see that the argument above remains valid for any (multidimensional) function f(ω) that is the Fourier transform of some non-negative kernel g(x) ≥ 0. The more impressive result is that the converse implication is also true.
Proof Let g(x) ≥ 0 be the (generalized) density associated with μ, such that μ(E) = ∫_E g(x) dx for any Borel set E. We then write f(ω) = ∫_{R^d} e^{−j⟨ω,x⟩} g(x) dx and perform the same manipulation as for the Gaussian example above, which yields

$$
\sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \bar{\xi}_n\, f(\omega_m - \omega_n)
= \int_{\mathbb{R}^d} \Big| \sum_{m=1}^{N} \xi_m\, e^{-\mathrm{j}\langle\omega_m, x\rangle} \Big|^2 g(x)\,\mathrm{d}x > 0.
$$
The key observation is that the zero set of the sum of exponentials Σ_{m=1}^N ξ_m e^{−j⟨ω_m,x⟩} (which is an entire function) has measure zero. Since the above integral involves positive terms only, the only possibility for it to vanish is that g be identically zero on the complement of this zero set, which contradicts the assumption on the existence of E.
In particular, the latter constraint is verified whenever f(ω) = F{g}(ω), where g is a continuous, non-negative function with a bounded Lebesgue integral; i.e., 0 < ∫_{R^d} g(x) dx < +∞. This kind of result is highly relevant to approximation and learning theory: indeed, the choice of a strictly positive definite interpolation kernel (or radial basis function) ensures that the solution of the generic scattered-data interpolation problem is well defined and unique, no matter how the data centers are distributed [Mic86]. Here too, the prototypical example of a valid kernel is the Gaussian, which is (strictly) positive definite.
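A small numerical check of this statement (Python/NumPy; the centers and data values are random placeholders of ours): the Gram matrix of a Gaussian kernel on arbitrary distinct centers has strictly positive eigenvalues, so the interpolation system is uniquely solvable.

```python
import numpy as np

rng = np.random.default_rng(4)
centers = rng.uniform(-3, 3, size=(12, 2))         # arbitrary data centers in R^2
f = lambda w: np.exp(-np.sum(w**2, axis=-1) / 2)   # Gaussian kernel (strictly p.d.)

G = f(centers[:, None, :] - centers[None, :, :])   # Gram matrix [f(w_m - w_n)]
assert np.all(np.linalg.eigvalsh(G) > 0)           # strictly positive definite

# scattered-data interpolation: coefficients exist and are unique
values = rng.standard_normal(12)
coeffs = np.linalg.solve(G, values)
assert np.allclose(G @ coeffs, values)
```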
There is also an extension of Bochner's theorem for generalized functions that is due to Laurent Schwartz. In a nutshell, the idea is to replace each finite sum Σ_{n=1}^N ξ_n f(ω − ω_n) by an infinite one (an integral), ∫_{R^d} φ(ω′) f(ω − ω′) dω′ = ∫_{R^d} φ(ω − ω′) f(ω′) dω′ = ⟨f, φ(ω − ·)⟩, which amounts to considering appropriate linear functionals of f over Schwartz' class of test functions S(R^d). In doing so, the double sum in Definition B.1 collapses into a scalar product between f and the autocorrelation function of the test function φ ∈ S(R^d), and the characterization takes the form

$$
\langle f, \hat{\varphi} \rangle = \langle \hat{f}, \varphi \rangle = \int_{\mathbb{R}^d} \varphi(x)\,\mu(\mathrm{d}x),
$$

where μ is a non-negative tempered measure. The term "tempered measure" refers to a generic type of mildly singular generalized function that can be defined by the Lebesgue integral ∫_{R^d} φ(x) μ(dx) < ∞ for all φ ∈ S(R^d). Such measures are allowed to exhibit polynomial growth at infinity, subject to the restriction that they remain finite on any compact set.
The fact that the above form implies positive definiteness can be verified by direct substitution and application of Parseval's relation, by which we obtain

$$
\langle f, (\varphi * \varphi^{\vee}) \rangle = \frac{1}{(2\pi)^d}\,\langle \hat{f}, |\hat{\varphi}|^2 \rangle = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\hat{\varphi}(x)|^2\,\mu(\mathrm{d}x) \ge 0,
$$

where the summability property against S(R^d) ensures that the integral is convergent (since |φ̂(x)|² is rapidly decreasing).
The improvement over Theorem B.1 is that μ(Rd ) is no longer constrained to be
finite. While this extension is of no direct help for the specification of characteristic
functions, it happens to be quite useful for the definition of spline-like interpolation
kernels that result in well-posed data fitting/approximation problems. We also note that
the above definitions and results generalize to the infinite-dimensional setting (e.g., the
Minlos–Bochner theorem which involves measures over topological vector spaces).
B.2 Conditionally positive-definite functions

This definition is also extendable to generalized functions using the line of thought that leads to Definition B.2. To keep the presentation reasonably simple and to make the link with the definition of the Lévy exponents in Section 4.2, we now focus on the 1-D case (d = 1). Specifically, we consider the polynomial constraint Σ_{n=1}^N ξ_n ω_n^m = 0 for m ∈ {0, ..., k − 1} and derive the generic form of conditionally positive-definite generalized functions of order k, including the continuous ones, which are of greatest interest to us.
The distributional counterpart of the kth-order constraint for d = 1 is the orthogonality condition ∫_R φ(ω) ω^m dω = 0 for m ∈ {0, ..., k − 1}. It is enforced by restricting the analysis to the class of test functions whose moments up to order (k − 1) are vanishing. Without loss of generality, this is equivalent to considering some alternative test function D^k φ = φ^{(k)}, where D^k is the kth derivative operator.
This extended definition allows for the derivation of the corresponding version of Bochner's theorem, which provides an explicit characterization of the family of conditionally positive-definite generalized functions of order k.
Here, r(x) is a function in S(R) such that (r(x) − 1) has a zero of order (2k + 1) at x = 0, while the a_n are appropriate real-valued constants with the constraint that a_{2k} ≥ 0.
Below, we provide a slightly adapted version of Gelfand and Vilenkin's proof, which is remarkably concise and quite illuminating [GV64, Theorem 1, p. 178], at least if one compares it with the standard derivation of the Lévy–Khintchine formula, which has a much more technical flavor (see [Sat94]) and is ultimately less general.
Proof Since ⟨f, (−1)^k D^{2k}(φ ∗ φ^∨)⟩ = ⟨(−1)^k D^{2k} f, (φ ∗ φ^∨)⟩, we interpret Definition B.4 as the property that (−1)^k D^{2k} f is positive definite. By the Schwartz–Bochner theorem, this is equivalent to the existence of a tempered measure ν such that

$$
\langle (-1)^k \mathrm{D}^{2k} f, \hat{\varphi} \rangle = \langle f, (-1)^k \mathrm{D}^{2k}\hat{\varphi} \rangle = \langle f, \widehat{x^{2k}\varphi} \rangle = \int_{\mathbb{R}} \varphi(x)\,\nu(\mathrm{d}x).
$$
$$
\langle f, \hat{\phi} \rangle = \int_{\mathbb{R}\setminus\{0\}} \phi(x)\,\mu(\mathrm{d}x) + a_{2k}\,\frac{\phi^{(2k)}(0)}{(2k)!}, \qquad \mathrm{(B.2)}
$$

which specifies f on the subset of test functions φ that have a 2kth-order zero at the origin. To extend the representation to the whole space S(R), we associate with every φ ∈ S(R) the corrected function

$$
\phi_c(x) = \varphi(x) - r(x) \sum_{n=0}^{2k-1} \frac{\varphi^{(n)}(0)}{n!}\, x^n, \qquad \mathrm{(B.3)}
$$
with r(x) as specified in the statement of the theorem. By construction, φ_c ∈ S(R) and has the 2kth-order zero that is required for (B.2) to be applicable. By combining (B.2) and (B.3), we find that

$$
\langle f, \hat{\varphi} \rangle = \int_{\mathbb{R}\setminus\{0\}} \phi_c(x)\,\mu(\mathrm{d}x) + a_{2k}\,\frac{\phi_c^{(2k)}(0)}{(2k)!} + \sum_{n=0}^{2k-1} \big\langle f, \widehat{r(x)x^n} \big\rangle\,\frac{\varphi^{(n)}(0)}{n!}.
$$

Next, we identify the constants a_n = ⟨f, (r(x)xⁿ)^∧⟩ and note that φ_c^{(2k)}(0) = φ^{(2k)}(0). The final step is to substitute these, together with the expression (B.3) of φ_c, in the above formula, which yields the desired result.
To prove the sufficiency of the representation, we apply (B.1) to evaluate the functional

$$
\langle f, \hat{\varphi}^{(k)} * (\hat{\varphi}^{(k)})^{\vee} \rangle = \langle f, \widehat{x^{2k}|\varphi(x)|^2} \rangle = \int_{\mathbb{R}} x^{2k}|\varphi(x)|^2\,\mu(\mathrm{d}x) + a_{2k}\,|\varphi(0)|^2 \ge 0,
$$

where we have used the property that the derivatives of x^{2k}|φ(x)|² all vanish at the origin, except the one of order 2k, which equals (2k)! |φ(0)|².
It is important to note that the choice of the function r is arbitrary as long as it fulfills the boundary condition r(x) = 1 + O(|x|^{2k+1}) as x → 0, so as to regularize the potential kth-order singularity of μ at the origin, and decays sufficiently fast to temper the Taylor-series correction in (B.3) at infinity. If we compare the effect of using two different tempering functions r_1 and r_2, the modification is only in the value of the constants a_n, with a_{n,2} − a_{n,1} = ⟨f, ((r_2(x) − r_1(x))xⁿ)^∧⟩. Another way of putting this is that the corresponding distributions f̂_1 and f̂_2 specified by the leading integral in (B.1) will only differ by a (2k − 1)th-order point distribution that is entirely localized at x = 0; that is,

$$
\hat{f}_2(x) - \hat{f}_1(x) = \sum_{n=0}^{2k-1} \frac{a_{n,2} - a_{n,1}}{n!}\,\delta^{(n)}(x),
$$

owing to the property that a_{2k} is common to both scenarios or, equivalently, that the difference of their inverse Fourier transforms f_1 and f_2 is a polynomial of degree (2k − 1).
Thanks to Theorem B.4, it is also possible to derive an integral representation that is
the kth-order generalization of the Lévy–Khintchine formula. For a detailed treatment of
the multidimensional version of the problem, we refer to the works of Madych, Nelson,
and Sun [MN90a, Sun93].
The result is obtained by plugging φ(x) = (1/2π) e^{jωx} ←→ φ̂(·) = δ(· − ω) into (B.1), which is justifiable using a continuity argument. The key is that the corresponding integral is bounded when μ satisfies the admissibility condition, which ensures the continuity of f(ω) (by Lebesgue's dominated-convergence theorem), and vice versa.
We now make the link with the Lévy–Khintchine theorem of statisticians (see Section 4.2.1), which is equivalent to characterizing the functions that are conditionally positive definite of order one. To that end, we rewrite the formula in Corollary B.5 for k = 1, under the additional constraint that f_1(0) = 0 (which fixes the value of a_0), as

$$
f_1(\omega) = \frac{1}{2\pi}\Big( a_0 + a_1 \mathrm{j}\omega - \frac{a_2}{2}\,\omega^2 + \int_{\mathbb{R}\setminus\{0\}} \big( e^{\mathrm{j}\omega x} - r(x) - r(x)\,\mathrm{j}\omega x \big)\,\mu(\mathrm{d}x) \Big)
$$

$$
= b_1 \mathrm{j}\omega - \frac{b_2}{2}\,\omega^2 + \int_{\mathbb{R}\setminus\{0\}} \big( e^{\mathrm{j}\omega x} - 1 - r(x)\,\mathrm{j}\omega x \big)\, v(x)\,\mathrm{d}x,
$$

where v(x) dx = (1/2π) μ(dx), b_n = (1/2π) a_n, r(x) = 1 + O(|x|³) as x → 0, and lim_{x→±∞} r(x) = 0. Clearly, the new form is equivalent to the Lévy–Khintchine formula (4.3), with the slight difference that the bias compensation is achieved by using a bell-shaped, infinitely differentiable function r instead of the rectangular window 1_{|x|<1}(x).
Likewise, we are able to transcribe the generalized Fourier-transform-pair relation (B.1) for the Lévy–Khintchine representation (4.3), which yields

$$
\langle \hat{f}_{\mathrm{L\text{-}K}}, \varphi \rangle = \langle f_{\mathrm{L\text{-}K}}, \hat{\varphi} \rangle
= \int_{\mathbb{R}\setminus\{0\}} \big( \varphi(x) - \varphi(0) - x\,1_{|x|<1}(x)\,\varphi^{(1)}(0) \big)\, v(x)\,\mathrm{d}x + b_1\,\varphi^{(1)}(0) + \frac{b_2}{2}\,\varphi^{(2)}(0). \qquad \mathrm{(B.4)}
$$

The interest of (B.4) is that it uniquely specifies the generalized Fourier transform of a Lévy exponent f_{L-K} as a linear functional of φ.
We can also give a "time-domain" (or pointwise) interpretation of this result by characterizing the generalized function

$$
g(x) = \frac{1}{2\pi}\,\mathcal{F}\{f_{\mathrm{L\text{-}K}}\}(x) = \mathrm{G}\{\delta\}(x),
$$

which also represents the impulse response of the infinitesimal semigroup generator G investigated in Section 9.7. This is achieved by distinguishing between three cases. The underlying principle is that the so-defined generalized function results in the same measurements as (B.4) when applied to the test function φ. In particular, the values of φ^{(n)} at the origin are sampled using the Dirac distribution and its derivatives, in accordance with the rule ⟨φ, δ^{(n)}⟩ = (−1)^n φ^{(n)}(0).
C Special functions

The modified Bessel function of the second kind with order parameter α ∈ R admits the Fourier-based representation [AS72]

$$
K_\alpha(\omega) = \int_{\mathbb{R}} \frac{e^{-\mathrm{j}\omega x}}{(1+x^2)^{|\alpha|}}\,\mathrm{d}x.
$$

It has the property that K_α(x) = K_{−α}(x). A special case of interest is K_{1/2}(x) = √(π/(2x)) e^{−x}.
which is compatible with the recursive definition of the factorial, n! = n(n − 1)!. Another useful result is Euler's reflection formula,

$$
\Gamma(1-z)\,\Gamma(z) = \frac{\pi}{\sin(\pi z)},
$$

which can be recast as

$$
\mathrm{sinc}(z) = \frac{\sin(\pi z)}{\pi z} = \frac{1}{\Gamma(1-z)\,\Gamma(1+z)}, \qquad \mathrm{(C.2)}
$$
which makes an intriguing connection with the sinus cardinalis function. There is a similar link with Euler's beta function,

$$
\mathrm{B}(z_1, z_2) = \int_0^1 t^{z_1-1}(1-t)^{z_2-1}\,\mathrm{d}t = \frac{\Gamma(z_1)\,\Gamma(z_2)}{\Gamma(z_1+z_2)}, \qquad \mathrm{(C.3)}
$$
where γ0 is the Euler–Mascheroni constant. The above allows us to derive the expansion

$$
-\log|\Gamma(z)|^2 = 2\gamma_0 \operatorname{Re}(z) + \log|z|^2 + \sum_{n=1}^{\infty} \Big( \log\Big|1+\frac{z}{n}\Big|^2 - 2\,\frac{\operatorname{Re}(z)}{n} \Big),
$$
which is directly applicable to the likelihood function associated with the Meixner distribution. Also relevant to that context is the integral relation

$$
\int_{\mathbb{R}} \Big| \Gamma\Big(\frac{r}{2} + \mathrm{j}x\Big) \Big|^2 e^{\mathrm{j}zx}\,\mathrm{d}x = 2\pi\,\Gamma(r)\left( \frac{1}{2\cosh\frac{z}{2}} \right)^{\!r}
$$

for r > 0 and z ∈ C, which can be interpreted as a Fourier transform by setting z = −jω.
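The expansion of −log|Γ(z)|² above is straightforward to validate numerically (Python/SciPy; the test point and truncation length are our own choices):

```python
import numpy as np
from scipy.special import loggamma

z = 0.7 + 1.3j
gamma0 = float(np.euler_gamma)

lhs = -2.0 * np.real(loggamma(z))          # -log|Gamma(z)|^2, complex loggamma
n = np.arange(1, 200001)
series = np.sum(np.log(np.abs(1 + z / n)**2) - 2 * z.real / n)
rhs = 2 * gamma0 * z.real + np.log(abs(z)**2) + series
assert np.isclose(lhs, rhs, atol=1e-4)     # truncation error ~ |z|^2 / N
```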
Euler’s digamma function is defined as
d (z)
ψ(z) = log (z) = , (C.5)
dz (z)
dm+1
ψ (m) (z) = log (z) (C.6)
dzm+1
is called the polygamma function of order m.
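Both definitions can be checked against finite differences (Python/SciPy; the step size and test point are our own choices):

```python
import numpy as np
from scipy.special import digamma, gamma, polygamma

z, h = 2.3, 1e-6
# psi(z) = d/dz log Gamma(z)   (C.5), checked by a central difference
num = (np.log(gamma(z + h)) - np.log(gamma(z - h))) / (2 * h)
assert np.isclose(digamma(z), num, atol=1e-6)
# polygamma of order m = 1 differentiates log Gamma twice   (C.6)
assert np.isclose(polygamma(1, z),
                  (digamma(z + h) - digamma(z - h)) / (2 * h), atol=1e-4)
```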
The SαS pdf of degree α ∈ (0, 2] and scale parameter s_0 is best defined via its characteristic function,

$$
p(x; \alpha, s_0) = \int_{\mathbb{R}} e^{-|s_0\omega|^{\alpha}}\, e^{\mathrm{j}\omega x}\,\frac{\mathrm{d}\omega}{2\pi}.
$$

Alpha-stable distributions do not admit closed-form expressions, except for the special cases α = 1 (Cauchy) and α = 2 (Gauss distribution). Moreover, their absolute moments of order p, E{|X|^p}, are unbounded for p > α, which is characteristic of heavy-tailed distributions. We can relate the (symmetric) γth-order moments of their characteristic function to the gamma function by performing the change of variable t = (s_0 ω)^α, which leads to

$$
\int_{\mathbb{R}} |\omega|^{\gamma}\, e^{-|s_0\omega|^{\alpha}}\,\mathrm{d}\omega = \frac{2\, s_0^{-\gamma-1}}{\alpha} \int_0^{\infty} t^{\frac{\gamma-\alpha+1}{\alpha}}\, e^{-t}\,\mathrm{d}t = \frac{2\, s_0^{-\gamma-1}}{\alpha}\,\Gamma\Big(\frac{\gamma+1}{\alpha}\Big). \qquad \mathrm{(C.7)}
$$

By using the correspondence between Fourier-domain moments and time-domain derivatives, we use this result to write the Taylor series of p(x; α, s_0) around x = 0 as

$$
p(x; \alpha, s_0) = \sum_{k=0}^{\infty} (-1)^k\,\frac{s_0^{-2k-1}}{\pi\alpha}\,\Gamma\Big(\frac{2k+1}{\alpha}\Big)\,\frac{|x|^{2k}}{(2k)!}, \qquad \mathrm{(C.8)}
$$

which involves even terms only (because of symmetry). The moment formula (C.7) also yields a simple expression for the slope of the score at the origin, which is given by

$$
\Phi''_X(0) = -\frac{p''_X(0)}{p_X(0)} = \frac{\Gamma\big(\tfrac{3}{\alpha}\big)}{s_0^2\,\Gamma\big(\tfrac{1}{\alpha}\big)}.
$$

Similar techniques are applicable to obtain the asymptotic form of p(x; α, s_0) as x tends to infinity [Ber52, TN95]. To characterize the tail behavior, it is sufficient to consider the first term of the asymptotic expansion

$$
p(x; \alpha, s_0) \sim \frac{1}{\pi}\,\Gamma(\alpha+1)\sin\Big(\frac{\pi\alpha}{2}\Big)\, s_0^{\alpha}\,\frac{1}{|x|^{\alpha+1}} \quad \text{as } x \to \pm\infty, \qquad \mathrm{(C.9)}
$$

which emphasizes the algebraic decay of order (α + 1) at infinity.
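These formulas lend themselves to direct numerical verification (Python/SciPy; the test values are our own choices): the pdf obtained by numerical inversion of the characteristic function matches the closed-form Cauchy density for α = 1, and the tail follows (C.9).

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def sas_pdf(x, alpha, s0=1.0):
    """p(x; alpha, s0) by numerical Fourier inversion of exp(-|s0 w|^alpha)."""
    # semi-infinite Fourier-cosine integral (QAWF routine via weight='cos')
    val = quad(lambda w: np.exp(-(s0 * w)**alpha), 0, np.inf,
               weight='cos', wvar=x)[0]
    return val / np.pi

cauchy = lambda x: 1.0 / (np.pi * (1 + x**2))      # closed form for alpha = 1
assert np.isclose(sas_pdf(2.0, 1.0), cauchy(2.0))

# tail asymptote (C.9) for alpha = 1.5, s0 = 1
alpha, x = 1.5, 25.0
tail = gamma(alpha + 1) * np.sin(np.pi * alpha / 2) / (np.pi * x**(alpha + 1))
print(sas_pdf(x, alpha), tail)                     # comparable magnitudes
```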
References
[BS93] C. Bouman and K. Sauer, A generalized Gaussian image model for edge-
preserving MAP estimation, IEEE Transactions on Image Processing 2 (1993),
no. 3, 296–310.
[BT09a] A. Beck and M. Teboulle, Fast gradient-based algorithms for constrained
total variation image denoising and deblurring problems, IEEE Transactions
on Image Processing 18 (2009), no. 11, 2419–2434.
[BT09b] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (2009), no. 1, 183–202.
[BTU01] T. Blu, P. Thévenaz, and M. Unser, MOMS: Maximal-order interpolation of
minimal support, IEEE Transactions on Image Processing 10 (2001), no. 7,
1069–1080.
[BU03] T. Blu and M. Unser, A complete family of scaling functions: The (α, τ )-
fractional splines, Proceedings of the IEEE International Conference on Acous-
tics, Speech, and Signal Processing (ICASSP’03) (Hong Kong SAR, People’s
Republic of China, April 6–10, 2003), vol. VI, IEEE, pp. 421–424.
[BU07] T. Blu and M. Unser, Self-similarity. Part II: Optimal estimation of fractal processes, IEEE Transactions on Signal Processing 55 (2007), no. 4, 1364–1378.
[BUF07] K. T. Block, M. Uecker, and J. Frahm, Undersampled radial MRI with multiple
coils. Iterative image reconstruction using a total variation constraint, Magnetic
Resonance in Medicine 57 (2007), no. 6, 1086–1098.
[BY02] B. Bru and M. Yor, Comments on the life and mathematical legacy of Wolfgang
Döblin, Finance and Stochastics 6 (2002), no. 1, 3–47.
[Car99] J. F. Cardoso and D. L. Donoho, Some experiments on independent component
analysis of non-Gaussian processes, Proceedings of the IEEE Signal Processing
Workshop on Higher-Order Statistics (SPW-HOS'99) (Caesarea, Israel, June
14–16, 1999), 1999, pp. 74–77.
[CBFAB97] P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud, Deterministic
edge-preserving regularization in computed imaging, IEEE Transactions on
Image Processing 6 (1997), no. 2, 298–311.
[CD95] R. R. Coifman and D. L. Donoho, Translation-invariant de-noising, Wavelets
and statistics, (A. Antoniadis and G. Oppenheim, eds.), Lecture Notes in Statis-
tics, vol. 103, Springer, 1995, pp. 125–150.
[CDLL98] A. Chambolle, R. A. DeVore, N.-Y. Lee, and B. J. Lucier, Nonlinear wave-
let image processing: Variational problems, compression, and noise removal
through wavelet shrinkage, IEEE Transactions on Image Processing 7 (1998), no. 3, 319–335.
[Chr03] O. Christensen, An Introduction to Frames and Riesz Bases, Birkhäuser, 2003.
[CKM97] H. A. Chipman, E. D. Kolaczyk, and R. E. McCulloch, Adaptive Bayesian
wavelet shrinkage, Journal of the American Statistical Association 92 (1997),
no. 440, 1413–1421.
[CLM+ 95] W. A. Carrington, R. M. Lynch, E. D. Moore, G. Isenberg, K. E. Fogarty, and
F. S. Fay, Superresolution three-dimensional images of fluorescence in cells
with minimal light exposure, Science 268 (1995), no. 5216, 1483–1487.
[CNB98] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, Wavelet-based statistical
signal processing using hidden Markov models, IEEE Transactions on Signal
Processing 46 (1998), no. 4, 886–902.
[Com94] P. Comon, Independent component analysis: A new concept, Signal Processing
36 (1994), no. 3, 287–314.
[CP11] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal proces-
sing, Fixed-Point Algorithms for Inverse Problems in Science and Engineering
(H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and
H. Wolkowicz, eds.), vol. 49, Springer New York, 2011, pp. 185–212.
[Cra40] H. Cramér, On the theory of stationary random processes, The Annals of
Mathematics 41 (1940), no. 1, 215–230.
[CSE00] C. Christopoulos, A. S. Skodras, and T. Ebrahimi, The JPEG2000 still image
coding system: An overview, IEEE Transactions on Consumer Electronics 16
(2000), no. 4, 1103–1127.
[CT04] R. Cont and P. Tankov, Financial Modelling with Jump Processes, Chapman &
Hall, 2004.
[CU10] K. N. Chaudhury and M. Unser, On the shiftability of dual-tree complex
wavelet transforms, IEEE Transactions on Signal Processing 58 (2010), no. 1,
221–232.
[CW91] C. K. Chui and J.-Z. Wang, A cardinal spline approach to wavelets, Proceedings
of the American Mathematical Society 113 (1991), no. 3, 785–793.
[CW08] E. J. Candès and M. B. Wakin, An introduction to compressive sampling, IEEE
Signal Processing Magazine 25 (2008), no. 2, 21–30.
[CYV00] S. G. Chang, B. Yu, and M. Vetterli, Spatially adaptive wavelet thresholding
with context modeling for image denoising, IEEE Transactions on Image Pro-
cessing 9 (2000), no. 9, 1522 –1531.
[Dau88] I. Daubechies, Orthogonal bases of compactly supported wavelets, Communi-
cations on Pure and Applied Mathematics 41 (1988), 909–996.
[Dau92] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, 1992.
[dB78] C. de Boor, A Practical Guide to Splines, Springer, 1978.
[dB87] C. de Boor, The polynomials in the linear span of integer translates of a compactly supported function, Constructive Approximation 3 (1987), 199–208.
[dBDR93] C. de Boor, R. A. DeVore, and A. Ron, On the construction of multivariate (pre)
wavelets, Constructive Approximation 9 (1993), 123–166.
[DBFZ+ 06] N. Dey, L. Blanc-Féraud, C. Zimmer, P. Roux, Z. Kam, J.-C. Olivo-Marin, and
J. Zerubia, Richardson-Lucy algorithm with total variation regularization for
3D confocal microscope deconvolution, Microscopy Research and Technique
69 (2006), no. 4, 260–266.
[dBH82] C. de Boor and K. Höllig, B-splines from parallelepipeds, Journal d’Analyse
Mathématique 42 (1982), no. 1, 99–115.
[dBHR93] C. de Boor, K. Höllig, and S. Riemenschneider, Box Splines, Springer, 1993.
[DDDM04] I. Daubechies, M. Defrise, and C. De Mol, An iterative thresholding algorithm
for linear inverse problems with a sparsity constraint, Communications on Pure
and Applied Mathematics 57 (2004), no. 11, 1413–1457.
[DJ94] D. L. Donoho and I. M. Johnstone, Ideal spatial adaptation via wavelet shrin-
kage, Biometrika 81 (1994), 425–455.
[DJ95] D. L. Donoho and I. M. Johnstone, Adapting to unknown smoothness via wavelet shrinkage, Journal of the American Statistical Association 90 (1995), no. 432, 1200–1224.
[Don95] D. L. Donoho, De-noising by soft-thresholding, IEEE Transactions on Infor-
mation Theory 41 (1995), no. 3, 613–627.
[Don06] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory 52 (2006), no. 4, 1289–1306.
[HP76] M. Hamidi and J. Pearl, Comparison of the cosine and Fourier transforms of
Markov-1 signals, IEEE Transactions on Acoustics, Speech and Signal Proces-
sing 24 (1976), no. 5, 428–429.
[HS04] T. Hida and S. Si, An Innovation Approach to Random Fields: Application of
White Noise Theory, World Scientific, 2004.
[HS08] T. Hida and S. Si, Lectures on White Noise Functionals, World Scientific, 2008.
[HSS08] T. Hofmann, B. Schölkopf, and A. J. Smola, Kernel methods in machine learn-
ing, Annals of Statistics 36 (2008), no. 3, 1171–1220.
[HT02] D. W. Holdsworth and M. M. Thornton, Micro-CT in small animal and speci-
men imaging, Trends in Biotechnology 20 (2002), no. 8, S34–S39.
[Hun77] B. R. Hunt, Bayesian methods in nonlinear digital image restoration, IEEE
Transactions on Computers C-26 (1977), no. 3, 219–229.
[HW85] K. M. Hanson and G. W. Wecksung, Local basis-function approach to computed
tomography, Applied Optics 24 (1985), no. 23, 4028–4039.
[HY00] M. Hansen and B. Yu, Wavelet thresholding via MDL for natural images, IEEE
Transactions on Information Theory 46 (2000), no. 5, 1778 –1788.
[Itô54] K. Itô, Stationary random distributions, Kyoto Journal of Mathematics 28
(1954), no. 3, 209–223.
[Itô84] K. Itô, Foundations of Stochastic Differential Equations in Infinite-Dimensional Spaces, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 47, Society for Industrial and Applied Mathematics (SIAM), 1984.
[Jac01] N. Jacob, Pseudo Differential Operators & Markov Processes, Vol. 1: Fourier
Analysis and Semigroups, World Scientific, 2001.
[Jai79] A. K. Jain, A sinusoidal family of unitary transforms, IEEE Transactions on
Pattern Analysis and Machine Intelligence 1 (1979), no. 4, 356–365.
[Jai89] A. K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, 1989.
[JMR01] S. Jaffard, Y. Meyer, and R. D. Ryan, Wavelets: Tools for Science and Techno-
logy, SIAM, 2001.
[JN84] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applica-
tion to Speech and Video Coding, Prentice-Hall, 1984.
[Joh66] S. Johansen, An application of extreme point methods to the representation
of infinitely divisible distributions, Probability Theory and Related Fields 5
(1966), 304–316.
[Kai70] T. Kailath, The innovations approach to detection and estimation theory, Pro-
ceedings of the IEEE 58 (1970), no. 5, 680–695.
[Kal06] W. A. Kalender, X-ray computed tomography, Physics in Medicine and Biology
51 (2006), no. 13, R29.
[Kap62] W. Kaplan, Operational Methods for Linear Systems, Addison-Wesley, 1962.
[KBU12] U. Kamilov, E. Bostan, and M. Unser, Wavelet shrinkage with consistent cycle
spinning generalizes total variation denoising, IEEE Signal Processing Letters
19 (2012), no. 4, 187–190.
[KFL01] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, Factor graphs and the sum-
product algorithm, IEEE Transactions on Information Theory 47 (2001), no. 2,
498–519.
[Khi34] A. Khintchine, Korrelationstheorie der stationären stochastischen Prozesse,
Mathematische Annalen 109 (1934), no. 1, 604–615.
[Khi37a] A. Khintchine, A new derivation of one formula by P. Lévy, Bulletin of Moscow State University, I (1937), no. 1, 1–5.
[Sch99] H. H. Schaefer, Topological Vector Spaces, 2nd edn., Graduate Texts in Mathe-
matics, vol. 3, Springer, 1999.
[SDFR13] J.-L. Starck, D. L. Donoho, M. J. Fadili, and A. Rassat, Sparsity and the Baye-
sian perspective, Astronomy and Astrophysics 552 (2013), A133.
[SF71] G. Strang and G. Fix, A Fourier analysis of the finite element variational
method, Constructive Aspects of Functional Analysis, Edizioni Cremonese,
1971, pp. 793–840.
[Sha93] J. Shapiro, Embedded image coding using zerotrees of wavelet coefficients,
IEEE Transactions on Acoustics, Speech and Signal Processing 41 (1993),
no. 12, 3445–3462.
[Sil99] B. W. Silverman, Wavelets in statistics: Beyond the standard assumptions,
Philosophical Transactions: Mathematical, Physical and Engineering Sciences
357 (1999), no. 1760, 2459–2473.
[SLG02] A. Srivastava, X. Liu, and U. Grenander, Universal analytical forms for model-
ing image probabilities, IEEE Transactions on Pattern Analysis and Machine
Intelligence 24 (2002), no. 9, 1200–1214.
[SLSZ03] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu, On advances in statis-
tical modeling of natural images, Journal of Mathematical Imaging and Vision
18 (2003), 17–33.
[SO01] E. P. Simoncelli and B. A. Olshausen, Natural image statistics and neural repre-
sentation, Annual Review of Neuroscience 24 (2001), 1193–1216.
[Sob36] S. Soboleff, Méthode nouvelle à résoudre le problème de Cauchy pour les équa-
tions linéaires hyperboliques normales, Recueil Mathématique (Matematiceskij
Sbornik) 1(43) (1936), no. 1, 39–72.
[SP08] E. Y. Sidky and X. Pan, Image reconstruction in circular cone-beam computed
tomography by constrained, total-variation minimization, Physics in Medicine
and Biology 53 (2008), no. 17, 4777.
[SPM02] J.-L. Starck, E. Pantin, and F. Murtagh, Deconvolution in astronomy: A review,
Publications of the Astronomical Society of the Pacific 114 (2002), no. 800,
1051–1069.
[SS02] L. Sendur and I. W. Selesnick, Bivariate shrinkage functions for wavelet-based
denoising exploiting interscale dependency, IEEE Transactions on Signal Pro-
cessing 50 (2002), no. 11, 2744–2756.
[ST94] G. Samorodnitsky and M. S. Taqqu, Stable Non-Gaussian Random Processes:
Stochastic Models with Infinite Variance, Chapman & Hall, 1994.
[Ste76] J. Stewart, Positive definite functions and generalizations: An historical survey,
Rocky Mountain Journal of Mathematics 6 (1976), no. 3, 409–434.
[Ste81] C. Stein, Estimation of the mean of a multivariate normal distribution, Annals
of Statistics 9 (1981), no. 6, 1135–1151.
[SU12] Q. Sun and M. Unser, Left inverses of fractional Laplacian and sparse sto-
chastic processes, Advances in Computational Mathematics 36 (2012), no. 3,
399–441.
[Sun93] X. Sun, Conditionally positive definite functions and their application to mul-
tivariate interpolations, Journal of Approximation Theory 74 (1993), no. 2,
159–180.
[Sun07] Q. Sun, Wiener’s lemma for infinite matrices, Transactions of the American
Mathematical Society 359 (2007), no. 7, 3099–3123.
[Uns93] M. Unser, On the optimality of ideal filters for pyramid and wavelet signal approximation, IEEE Transactions on Signal Processing 41 (1993), no. 12, 3591–3596.
[Uns99] M. Unser, Splines: A perfect fit for signal and image processing, IEEE Signal Processing Magazine 16 (1999), no. 6, 22–38.
[Uns00] M. Unser, Sampling: 50 years after Shannon, Proceedings of the IEEE 88 (2000), no. 4, 569–587.
[Uns05] M. Unser, Cardinal exponential splines, Part II: Think analog, act digital, IEEE Transactions on Signal Processing 53 (2005), no. 4, 1439–1449.
[UT11] M. Unser and P. D. Tafti, Stochastic models for sparse and piecewise-smooth
signals, IEEE Transactions on Signal Processing 59 (2011), no. 3, 989–1005.
[UTAK14] M. Unser, P. D. Tafti, A. Amini, and H. Kirshner, A unified formulation of
Gaussian vs. sparse stochastic processes, Part II: Discrete-domain theory, IEEE
Transactions on Information Theory 60 (2014), no. 5, 3036–3051.
[UTS14] M. Unser, P. D. Tafti, and Q. Sun, A unified formulation of Gaussian vs. sparse
stochastic processes, Part I: Continuous-domain theory, IEEE Transactions on
Information Theory 60 (2014), no. 3, 1361–1376.
[VAVU06] C. Vonesch, F. Aguet, J.-L. Vonesch, and M. Unser, The colored revolution of
bioimaging, IEEE Signal Processing Magazine 23 (2006), no. 3, 20–31.
[VBB+ 02] G. M. Viswanathan, F. Bartumeus, S. V. Buldyrev, J. Catalan, U. L. Fulco,
S. Havlin, M. G. E da Luz, M. L. Lyra, E. P. Raposo, and H. E. Stanley, Lévy
flight random searches in biological phenomena, Physica A: Statistical Mecha-
nics and Its Applications 314 (2002), no. 1–4, 208–213.
[VBU07] C. Vonesch, T. Blu, and M. Unser, Generalized Daubechies wavelet families,
IEEE Transactions on Signal Processing 55 (2007), no. 9, 4415–4429.
[VDVBU05] D. Van De Ville, T. Blu, and M. Unser, Isotropic polyharmonic B-splines: Sca-
ling functions and wavelets, IEEE Transactions on Image Processing 14 (2005),
no. 11, 1798–1813.
[VDVFHUB10] D. Van De Ville, B. Forster-Heinlein, M. Unser, and T. Blu, Analytical foot-
prints: Compact representation of elementary singularities in wavelet bases,
IEEE Transactions on Signal Processing 58 (2010), no. 12, 6105–6118.
[vKvVVvdV97] G. M. P. van Kempen, L. J. van Vliet, P. J. Verveer, and H. T. M. van der
Voort, A quantitative comparison of image restoration methods for confocal
microscopy, Journal of Microscopy 185 (1997), no. 3, 345–365.
[VMB02] M. Vetterli, P. Marziliano, and T. Blu, Sampling signals with finite rate
of innovation, IEEE Transactions on Signal Processing 50 (2002), no. 6,
1417–1428.
[VU08] C. Vonesch and M. Unser, A fast thresholded Landweber algorithm for wavelet-
regularized multidimensional deconvolution, IEEE Transactions on Image Pro-
cessing 17 (2008), no. 4, 539–549.
[VU09] C. Vonesch and M. Unser, A fast multilevel algorithm for wavelet-regularized image restoration, IEEE Transactions on Image Processing 18 (2009), no. 3, 509–523.
[Wag09] P. Wagner, A new constructive proof of the Malgrange-Ehrenpreis theorem,
American Mathematical Monthly 116 (2009), no. 5, 457–462.
[Wah90] G. Wahba, Spline Models for Observational Data, Society for Industrial and
Applied Mathematics, 1990.
[Wen05] H. Wendland, Scattered Data Approximations, Cambridge University Press,
2005.
Index

additive white Gaussian noise (AWGN), 254, 269, 283, 301, 321
adjoint operator, 21, 32, 36, 37, 51, 52, 54, 91, 94, 109
alternating direction method of multipliers (ADMM), 262, 263, 319
analysis window
    arbitrary function in Lp, 80, 191, 195, 223
    rectangular, 12, 58, 80
analytic continuation, 329
augmented Lagrangian method, 261, 319
autocorrelation function, 12, 51, 54, 152, 153, 159, 161, 163, 167, 169

B-spline factorization, 136–137
B-splines, 21, 127
    exponential, 125–126, 138–139, 150
    fractional, 139–141, 204
    generalized, 127–142, 160, 197–198
    minimum-support property, 21, 132, 134, 138
    polyharmonic, 141
    polynomial, 8, 137
basis functions
    Faber-Schauder, 13
    Haar wavelets, 12
Bayes' rule, 254
belief propagation, 279–282
beta function, 345
BIBO stability, 93, 94
binomial expansion (generalized), 204
biomedical image reconstruction, 2, 24, 263, 276, 298
    deconvolution microscopy, 265–269, 298–299
    MRI, 269–272
    X-ray CT, 272–276
biorthogonality, 145, 250, 253
Blu, Thierry, 204
boundary conditions, 9, 92, 102–103, 167
bounded operator, 41
    convolution on Lp(Rd), 40–43
    Riesz–Thorin theorem, 41, 94
Brownian motion, 4, 11, 163, 173, 220

cardinal interpolation problem, 5
Cauchy's principal value, 62, 333, 334
causal, 5, 90, 95, 98, 105
central-limit theorem, 245, 246, 312
characteristic function, 45, 192, 206, 225, 240, 251, 337
    SαS, 234
characteristic functional, 46, 47, 85, 154, 156, 158, 164, 188, 194, 225
    domain extension, 195–197
    of Gaussian white noise, 75
    of innovation process, 53, 73
    of sparse stochastic process, 153
compound-Poisson process, 10, 174, 185
compressed sensing, 2, 255, 284, 288
conditional positive definiteness, 62, 339
    of generalized functions, 340
    Schoenberg's correspondence, 69
continuity, 27, 32, 45, 46
    of functional, 85
convex optimization, 2, 263
convolution, 39, 40–43, 89, 92–94, 158–159, 161, 334
    semigroup, 239–242
correlation functional, 11, 50, 81, 153, 155, 188
covariance matrix, 215, 216, 258
cumulants, 209, 213, 237–239
    cumulant-generating function, 238, 239
cycle spinning, 299
    through averaging, 317–318

Döblin, Wolfgang, 17
decay, 132, 235
    algebraic, 132–133, 235
    compact support, 12, 125, 132, 134
    exponential, 132, 235
    supra-exponential, 235
decorrelation, 202, 217
decoupling of sparse processes, 23, 191
    generalized increments, 20, 170–172, 197–205, 211, 251
    increments, 11, 191, 193
    increments vs. wavelets, 191–194
    wavelet analysis, 21, 205–206
    wavelets vs. KLT, 220
denoising, 1, 24, 277, 290
    consistent cycle spinning, 318–320
    iterative MAP, 318–320
    MAP vs. MMSE, 283
    MMSE (gold standard), 281
    wavelet-domain shrinkage/thresholding, 296
    soft-threshold, 290
differential entropy, 215
differential equation, 5, 89
    stable, 95–97
    unstable, 19, 97–103
dilation matrix, 142, 206
Dirac impulse, 6, 9, 34, 35, 83, 121
discrete AR(1) process, 211, 220, 278
discrete convolution, 110
    inverse, 110
discrete cosine transform (DCT), 1, 14, 211, 217, 220
discrete whitening filter, 202
discretization, 194–195
    deconvolution, 268
    MRI, 270
    X-ray CT, 273
dual extension principle, 36
dual space
    algebraic, 32
    continuous, 32, 35, 46
    of topological vector space, 32
duality product, 33–35

estimators
    comparison of, 283–286, 303, 321
    LMMSE, 258
    MAP, 255
    MMSE (or conditional mean), 278
    pointwise MAP, 301
    pointwise MMSE, 301–312
    wavelet-domain MAP, 290, 292–293
expected value, 44

filtered white noise, 19, 53–54, 91, 159
finite difference, 7, 192, 201
finite rate of innovation, 2, 9, 78, 173
fluorescence microscopy, 265
forward model, 249, 264, 270
Fourier central-slice theorem, 273
Fourier multiplier, 40, 93, 119
    Lp characterization theorem, 41
    Mikhlin's theorem, 43
Fourier transform, 1, 34
    basic properties, 37
    of generalized functions, 37
    of homogeneous distributions, 331
    of singular functions, 332
fractional derivative, 104, 119, 176, 204
fractional integrator, 185
fractional Laplacian, 107, 108, 135, 176, 185
frequency response, 5, 7
    factorization, 95
    rational, 39, 95, 162
function, 25
    notion, 28
function spaces, 25–32
    complete-normed, 28–29
    finite-energy (L2), 34
    generalized functions (S′), 34
    Lebesgue (Lp), 29
    nuclear, 29–32
    rapidly decaying (R), 29
    smooth, rapidly decaying (S), 30–31
    topological, 28
functional, 32

gamma function, 344
    properties, 344–345
Gaussian hypothesis, 1, 54, 258–259
Gaussian stationary process, 1, 161–162
generalized compound process, 78, 185–187
generalized functions, 35–40
generalized increment process, 160–161, 195, 198–199
generalized increments, 199–205
    probability law of, 199–200
generalized random field, 47
generalized stochastic process, 47–54, 84–87
    existence of, 86
    isotropic, 154
    linear transform of, 51–52, 57, 84
    self-similar, 154–155
    stationary, 154–155
Gibbs energy minimization, 287
Green's function, 7, 93, 95, 119–120
    reproduction, 129–130

Hölder smoothness, 9, 135
Haar wavelets, 12, 114–116
    analysis of Lévy process, 12–15, 193–194
    synthesis of Brownian motion, 15–16
Haar, Alfréd, 147
Hadamard's finite part, 106, 332–334, 343
Hilbert transform, 42, 105, 334
Hurst exponent, 154, 156, 177, 178, 245

impulse response, 5, 38, 41, 158–160, 162
    first-order system, 90, 150
impulsive Poisson noise, 76–78
increments, 11, 165, 178, 192
independence at every point, 79, 164
independent component analysis (ICA), 218–220