
An Introduction to Sparse Stochastic Processes

Providing a novel approach to sparsity, this comprehensive book presents the theory of
stochastic processes that are ruled by linear stochastic differential equations and that
admit a parsimonious representation in a matched wavelet-like basis.
Two key themes are the statistical property of infinite divisibility, which leads to two
distinct types of behavior – Gaussian and sparse – and the structural link between linear
stochastic processes and spline functions, which is exploited to simplify the mathematical
analysis. The core of the book is devoted to investigating sparse processes, including
a complete description of their transform-domain statistics. The final part develops
practical signal-processing algorithms that are based on these models, with special
emphasis on biomedical image reconstruction.
This is an ideal reference for graduate students and researchers with an interest in
signal/image processing, compressed sensing, approximation theory, machine learning,
or statistics.

MICHAEL UNSER is Professor and Director of the Biomedical Imaging Group at the École
Polytechnique Fédérale de Lausanne (EPFL), Switzerland. He is a member of the Swiss
Academy of Engineering Sciences, a Fellow of EURASIP, and a Fellow of the IEEE.

POUYA D. TAFTI is a data scientist currently residing in Germany, and a former member
of the Biomedical Imaging Group at EPFL, where he conducted research on the theory
and applications of probabilistic models for data.
“Over the last twenty years, sparse representation of images and signals has become a very
important topic in many applications, ranging from data compression, to biological
vision, to medical imaging. The book Sparse Stochastic Processes by Unser and Tafti is
the first work to systematically build a coherent framework for non-Gaussian processes
with sparse representations by wavelets. Traditional concepts such as Karhunen–Loève
analysis of Gaussian processes are nicely complemented by the wavelet analysis of Lévy
processes which is constructed here. The framework presented here has a classical feel
while accommodating the innovative impulses driving research in sparsity. The book
is extremely systematic and at the same time clear and accessible, and can be recommended
both to engineers interested in foundations and to mathematicians interested in
applications.”
David Donoho, Stanford University

“This is a fascinating book that connects the classical theory of generalised functions
(distributions) to the modern sparsity-based view on signal processing, as well as
stochastic processes. Some of the early motivations given by I. Gelfand on the importance
of generalised functions came from physics and, indeed, signal processing and sampling.
However, this is probably the first book that successfully links the more abstract
theory with modern signal processing. A great strength of the monograph is that it
considers both the continuous and the discrete model. It will be of interest to mathematicians
and engineers with an appreciation of the mathematical and stochastic views of signal
processing.”
Anders Hansen, University of Cambridge
An Introduction to Sparse
Stochastic Processes
MICHAEL UNSER and POUYA D. TAFTI
École Polytechnique Fédérale de Lausanne
University Printing House, Cambridge CB2 8BS, United Kingdom

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107058545
© Cambridge University Press 2014
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2014
Printed in the United Kingdom by Clays, St Ives plc
A catalog record for this publication is available from the British Library
Library of Congress Cataloging in Publication Data
Unser, Michael A., author.
An introduction to sparse stochastic processes / Michael Unser and Pouya Tafti, École polytechnique fédérale,
Lausanne.
pages cm
Includes bibliographical references and index.
ISBN 978-1-107-05854-5 (Hardback)
1. Stochastic differential equations. 2. Random fields. 3. Gaussian processes. I. Tafti, Pouya, author.
II. Title.
QA274.23.U57 2014
519.2/3–dc23 2014003923
ISBN 978-1-107-05854-5 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
To Gisela, Lucia, Klaus, and Murray
Contents

Preface page xiii


Notation xv

1 Introduction 1
1.1 Sparsity: Occam’s razor of modern signal processing? 1
1.2 Sparse stochastic models: the step beyond Gaussianity 2
1.3 From splines to stochastic processes, or when Schoenberg meets Lévy 5
1.3.1 Splines and Legos revisited 5
1.3.2 Higher-degree polynomial splines 8
1.3.3 Random splines, innovations, and Lévy processes 9
1.3.4 Wavelet analysis of Lévy processes and M-term approximations 12
1.3.5 Lévy’s wavelet-based synthesis of Brownian motion 15
1.4 Historical notes: Paul Lévy and his legacy 16

2 Roadmap to the book 19


2.1 On the implications of the innovation model 20
2.1.1 Linear combination of sampled values 20
2.1.2 Wavelet analysis 21
2.2 Organization of the book 22

3 Mathematical context and background 25


3.1 Some classes of function spaces 25
3.1.1 About the notation: mathematics vs. engineering 28
3.1.2 Normed spaces 28
3.1.3 Nuclear spaces 29
3.2 Dual spaces and adjoint operators 32
3.2.1 The dual of Lp spaces 33
3.2.2 The duals of D and S 33
3.2.3 Distinction between Hermitian and duality products 34
3.3 Generalized functions 35
3.3.1 Intuition and definition 35
3.3.2 Operations on generalized functions 36
3.3.3 The Fourier transform of generalized functions 37
3.3.4 The kernel theorem 38

3.3.5 Linear shift-invariant operators and convolutions 39


3.3.6 Convolution operators on Lp (Rd ) 40
3.4 Probability theory 43
3.4.1 Probability measures 43
3.4.2 Joint probabilities and independence 44
3.4.3 Characteristic functions in finite dimensions 45
3.4.4 Characteristic functionals in infinite dimensions 46
3.5 Generalized random processes and fields 47
3.5.1 Generalized random processes as collections of random variables 47
3.5.2 Generalized random processes as random generalized functions 49
3.5.3 Determination of statistics from the characteristic functional 49
3.5.4 Operations on generalized stochastic processes 51
3.5.5 Innovation processes 52
3.5.6 Example: filtered white Gaussian noise 53
3.6 Bibliographical pointers and historical notes 54

4 Continuous-domain innovation models 57


4.1 Introduction: from Gaussian to sparse probability distributions 58
4.2 Lévy exponents and infinitely divisible distributions 59
4.2.1 Canonical Lévy–Khintchine representation 60
4.2.2 Deciphering the Lévy–Khintchine formula 64
4.2.3 Gaussian vs. sparse categorization 68
4.2.4 Proofs of Theorems 4.1 and 4.2 69
4.3 Finite-dimensional innovation model 71
4.4 White Lévy noises or innovations 73
4.4.1 Specification of white noise in Schwartz’ space S′ 73
4.4.2 Impulsive Poisson noise 76
4.4.3 Properties of white noise 78
4.5 Generalized stochastic processes and linear models 84
4.5.1 Innovation models 84
4.5.2 Existence and characterization of the solution 84
4.6 Bibliographical notes 87

5 Operators and their inverses 89


5.1 Introductory example: first-order differential equation 90
5.2 Shift-invariant inverse operators 92
5.3 Stable differential systems in 1-D 95
5.3.1 First-order differential operators with stable inverses 96
5.3.2 Higher-order differential operators with stable inverses 96
5.4 Unstable Nth-order differential systems 97
5.4.1 First-order differential operators with unstable shift-invariant
inverses 97
5.4.2 Higher-order differential operators with unstable shift-invariant
inverses 101
5.4.3 Generalized boundary conditions 102

5.5 Fractional-order operators 104


5.5.1 Fractional derivatives in one dimension 104
5.5.2 Fractional Laplacians 107
5.5.3 Lp -stable inverses 108
5.6 Discrete convolution operators 109
5.7 Bibliographical notes 111

6 Splines and wavelets 113


6.1 From Legos to wavelets 113
6.2 Basic concepts and definitions 118
6.2.1 Spline-admissible operators 118
6.2.2 Splines and operators 120
6.2.3 Riesz bases 121
6.2.4 Admissible wavelets 124
6.3 First-order exponential B-splines and wavelets 124
6.3.1 B-spline construction 125
6.3.2 Interpolator in augmented-order spline space 126
6.3.3 Differential wavelets 126
6.4 Generalized B-spline basis 127
6.4.1 B-spline properties 128
6.4.2 B-spline factorization 136
6.4.3 Polynomial B-splines 137
6.4.4 Exponential B-splines 138
6.4.5 Fractional B-splines 139
6.4.6 Additional brands of univariate B-splines 141
6.4.7 Multidimensional B-splines 141
6.5 Generalized operator-like wavelets 142
6.5.1 Multiresolution analysis of L2 (Rd ) 142
6.5.2 Multiresolution B-splines and the two-scale relation 143
6.5.3 Construction of an operator-like wavelet basis 144
6.6 Bibliographical notes 147

7 Sparse stochastic processes 150


7.1 Introductory example: non-Gaussian AR(1) processes 150
7.2 General abstract characterization 152
7.3 Non-Gaussian stationary processes 158
7.3.1 Autocorrelation function and power spectrum 159
7.3.2 Generalized increment process 160
7.3.3 Generalized stationary Gaussian processes 161
7.3.4 CARMA processes 162
7.4 Lévy processes and their higher-order extensions 163
7.4.1 Lévy processes 163
7.4.2 Higher-order extensions of Lévy processes 166
7.4.3 Non-stationary Lévy correlations 167
7.4.4 Removal of long-range dependencies 169

7.4.5 Examples of sparse processes 172


7.4.6 Mixed processes 175
7.5 Self-similar processes 176
7.5.1 Stable fractal processes 177
7.5.2 Fractional Brownian motion through the looking-glass 180
7.5.3 Scale-invariant Poisson processes 185
7.6 Bibliographical notes 187

8 Sparse representations 191


8.1 Decoupling of Lévy processes: finite differences vs. wavelets 191
8.2 Extended theoretical framework 194
8.2.1 Discretization mechanism: sampling vs. projections 194
8.2.2 Analysis of white noise with non-smooth functions 195
8.3 Generalized increments for the decoupling of sample values 197
8.3.1 First-order statistical characterization 199
8.3.2 Higher-order statistical dependencies 200
8.3.3 Generalized increments and stochastic difference equations 201
8.3.4 Discrete whitening filter 202
8.3.5 Robust localization 202
8.4 Wavelet analysis 205
8.4.1 Wavelet-domain statistics 206
8.4.2 Higher-order wavelet dependencies and cumulants 208
8.5 Optimal representation of Lévy and AR(1) processes 210
8.5.1 Generalized increments and first-order linear prediction 211
8.5.2 Vector-matrix formulation 212
8.5.3 Transform-domain statistics 212
8.5.4 Comparison of orthogonal transforms 216
8.6 Bibliographical notes 222

9 Infinite divisibility and transform-domain statistics 223


9.1 Composition of id laws, spectral mixing, and analysis of white noise 224
9.2 Class C and unimodality 230
9.3 Self-decomposable distributions 232
9.4 Stable distributions 234
9.5 Rate of decay 235
9.6 Lévy exponents and cumulants 237
9.7 Semigroup property 239
9.7.1 Gaussian case 241
9.7.2 SαS case 241
9.7.3 Compound-Poisson case 241
9.7.4 General iterated-convolution interpretation 241

9.8 Multiscale analysis 242


9.8.1 Scale evolution of the pdf 243
9.8.2 Scale evolution of the moments 244
9.8.3 Asymptotic convergence to a Gaussian/stable distribution 246
9.9 Notes and pointers to the literature 247

10 Recovery of sparse signals 248


10.1 Discretization of linear inverse problems 249
10.1.1 Shift-invariant reconstruction subspace 249
10.1.2 Finite-dimensional formulation 252
10.2 MAP estimation and regularization 255
10.2.1 Potential function 256
10.2.2 LMMSE/Gaussian solution 258
10.2.3 Proximal operators 259
10.2.4 MAP estimation 261
10.3 MAP reconstruction of biomedical images 263
10.3.1 Scale-invariant image model and common numerical setup 264
10.3.2 Deconvolution of fluorescence micrographs 265
10.3.3 Magnetic resonance imaging 269
10.3.4 X-ray tomography 272
10.3.5 Discussion 276
10.4 The quest for the minimum-error solution 277
10.4.1 MMSE estimators for first-order processes 278
10.4.2 Direct solution by belief propagation 279
10.4.3 MMSE vs. MAP denoising of Lévy processes 283
10.5 Bibliographical notes 286

11 Wavelet-domain methods 290


11.1 Discretization of inverse problems in a wavelet basis 291
11.1.1 Specification of wavelet-domain MAP estimator 292
11.1.2 Evolution of the potential function across scales 293
11.2 Wavelet-based methods for solving linear inverse problems 294
11.2.1 Preliminaries 295
11.2.2 Iterative shrinkage/thresholding algorithm 296
11.2.3 Fast iterative shrinkage/thresholding algorithm 297
11.2.4 Discussion of wavelet-based image reconstruction 298
11.3 Study of wavelet-domain shrinkage estimators 300
11.3.1 Pointwise MAP estimators for AWGN 301
11.3.2 Pointwise MMSE estimators for AWGN 301
11.3.3 Comparison of shrinkage functions: MAP vs. MMSE 303
11.3.4 Conclusion on simple wavelet-domain shrinkage estimators 312
11.4 Improved denoising by consistent cycle spinning 313
11.4.1 First-order wavelets: design and implementation 313
11.4.2 From wavelet bases to tight wavelet frames 315

11.4.3 Iterative MAP denoising 318


11.4.4 Iterative MMSE denoising 320
11.5 Bibliographical notes 324

12 Conclusion 326

Appendix A Singular integrals 328


A.1 Regularization of singular integrals by analytic continuation 329
A.2 Fourier transform of homogeneous distributions 331
A.3 Hadamard’s finite part 332
A.4 Some convolution integrals with singular kernels 334

Appendix B Positive definiteness 336


B.1 Positive definiteness and Bochner’s theorem 336
B.2 Conditionally positive–definite functions 339
B.3 Lévy–Khintchine formula from the point of view of generalized functions 342

Appendix C Special functions 344


C.1 Modified Bessel functions 344
C.2 Gamma function 344
C.3 Symmetric-alpha-stable distributions 346

References 347
Index 363
Preface

In the years since 2000, there has been a significant shift in paradigm in signal processing,
statistics, and applied mathematics that revolves around the concept of sparsity
and the search for “sparse” representations of signals. Early signs of this (r)evolution
go back to the discovery of wavelets, which have now superseded classical Fourier
techniques in a number of applications. The other manifestation of this trend is the
emergence of data-processing schemes that minimize an ℓ1 norm as opposed to the
squared ℓ2 norm associated with the traditional linear methods. A highly popular
research topic that capitalizes on those ideas is compressed sensing. It is the quest for
a statistical framework that would support this change of paradigm that led us to the
writing of this book.
The cornerstone of our formulation is the classical innovation model, which is equivalent
to the specification of stochastic processes as solutions of linear stochastic differential
equations (SDE). The non-standard twist here is that we allow for non-Gaussian
driving terms (white Lévy noise) which, as we shall see, has a dramatic effect on the
type of signal being generated. A fundamental property, hinted at in the title of the book,
is that the non-Gaussian solutions of such SDEs admit a sparse representation in an
adapted wavelet-like basis. While a sizable part of the present material is an outgrowth
of our own research, it is founded on the work of Lévy (1930) and Gelfand (arguably,
the second most famous Soviet mathematician after Kolmogorov), who derived general
functional tools and results that are hardly known by practitioners but, as we argue in
the book, are extremely relevant to the issue of sparsity. The other important source
of inspiration is spline theory and the observation that splines and stochastic processes
are ruled by the same differential equations. This is the reason why we opted for the
innovation approach, which facilitates the transposition of analytical techniques from
one field to the other. While the formulation requires advanced mathematics that are
carefully explained in the book, the underlying model has a strong engineering appeal
since it constitutes the natural extension of the traditional filtered-white-noise interpretation
of a Gaussian stationary process.
The book assumes that the reader has a good understanding of linear systems
(ordinary differential equations, convolution), Hilbert spaces, generalized functions
(i.e., inner products, Dirac impulses, linear operators), the Fourier transform, basic
statistical signal processing, and (multivariate) statistics (probability density and
characteristic functions). By contrast, there is no requirement for prior knowledge of
splines, stochastic differential equations, or advanced functional analysis (function
spaces, Bochner’s theorem, operator theory, singular integrals) since these topics are
treated in a self-contained fashion.
Several people have had a crucial role in the genesis of this book. The idea of defining
sparse stochastic processes originated during the preparation of a talk for Martin
Vetterli’s 50th birthday (which coincided with the anniversary of the launching of Sputnik)
in an attempt to build a bridge between his signals with a finite rate of innovation
and splines. We thank him for his long-time friendship and for convincing us to undertake
this writing project. We are grateful to our former collaborator, Thierry Blu, for
his precious help in the elucidation of the functional link between splines and stochastic
processes. We are extremely thankful to Arash Amini, Julien Fageot, Pedram Pad, Qiyu
Sun, and John-Paul Ward for many helpful discussions and their contributions to mathematical
results. We are indebted to Emrah Bostan, Ulugbek Kamilov, Hagai Kirshner,
Masih Nilchian, and Cédric Vonesch for turning the theory into practice and for running
the signal- and image-processing experiments described in Chapters 10 and 11. We are
most grateful to Philippe Thévenaz for his intelligent editorial advice and his spotting of
multiple errors and inconsistencies, while we take full responsibility for the remaining
ones. We also thank Phil Meyler, Sarah Marsh, and Gaja Poggiogalli from Cambridge
University Press, as well as John King for his careful copy-editing.
The authors also acknowledge very helpful and stimulating discussions with Ben
Adcock, Emmanuel Candès, Volkan Cevher, Robert Dalang, Mike Davies, Christine
De Mol, David Donoho, Pier-Luigi Dragotti, Michael Elad, Yonina Eldar, Jalal Fadili,
Mario Figueiredo, Vivek Goyal, Rémy Gribonval, Anders Hansen, Nick Kingsbury,
Gitta Kutyniok, Stamatis Lefkimmiatis, Gabriel Peyré, Robert Nowak, Jean-Luc Starck,
and Dimitri Van De Ville, as well as a number of other researchers involved in the field.
The European Research Council (ERC) and the Swiss National Science Foundation
provided partial support throughout the writing of the book.
Notation

Abbreviations
ADMM Alternating-direction method of multipliers
AL Augmented Lagrangian
AR Autoregressive
ARMA Autoregressive moving average
AWGN Additive white Gaussian noise
BIBO Bounded input, bounded output
CAR Continuous-time autoregressive
CARMA Continuous-time autoregressive moving average
CCS Consistent cycle spinning
DCT Discrete cosine transform
fBm Fractional Brownian motion
FBP Filtered backprojection
FFT Fast Fourier transform
FIR Finite impulse response
FISTA Fast iterative shrinkage/thresholding algorithm
ICA Independent-component analysis
id Infinitely divisible
i.i.d. Independent identically distributed
IIR Infinite impulse response
ISTA Iterative shrinkage/thresholding algorithm
JPEG Joint Photographic Experts Group
KLT Karhunen–Loève transform
LMMSE Linear minimum-mean-square error
LPC Linear predictive coding
LSI Linear shift-invariant
MAP Maximum a posteriori
MMSE Minimum-mean-square error
MRI Magnetic resonance imaging
PCA Principal-component analysis
pdf Probability density function
PSF Point-spread function
ROI Region of interest

SαS Symmetric-alpha-stable
SDE Stochastic differential equation
SNR Signal-to-noise ratio
WSS Wide-sense stationary

Sets
N, Z+ Non-negative integers, including 0
Z Integers
R Real numbers
R+ Non-negative real numbers
C Complex numbers
Rd d-dimensional Euclidean space
Zd d-dimensional integers

Various notation
j Imaginary unit such that j^2 = −1
⌈x⌉ Ceiling: smallest integer at least as large as x
⌊x⌋ Floor: largest integer not exceeding x
(x_1 : x_n) n-tuple (x_1, x_2, . . . , x_n)
‖f‖ Norm of the function f (see Section 3.1.2)
‖f‖_{L_p} L_p-norm of the function f (in the sense of Lebesgue)
‖a‖_{ℓ_p} ℓ_p-norm of the sequence a
⟨ϕ, s⟩ Scalar (or duality) product
⟨f, g⟩_{L_2} L_2 inner product
f^∨ Reversed signal: f^∨(r) = f(−r)
(f ∗ g)(r) Continuous-domain convolution
(a ∗ b)[n] Discrete-domain convolution
ϕ̂(ω) Fourier transform of ϕ: ∫_{R^d} ϕ(r) e^{−j⟨ω,r⟩} dr
f̂ = F{f} Fourier transform of f (classical or generalized)
f = F^{−1}{f̂} Inverse Fourier transform of f̂
F̄{f}(ω) = F{f}(−ω) Conjugate Fourier transform of f

Signals, functions, and kernels
f, f(·), or f(r) Continuous-domain signal: function R^d → R
ϕ Generic test function in S(R^d)
ψ_L = L^∗φ Operator-like wavelet with smoothing kernel φ
s: ϕ ↦ ⟨ϕ, s⟩ Generalized function S(R^d) → R
μ_h Measure associated with h: ⟨ϕ, h⟩ = ∫_{R^d} ϕ(r) μ_h(dr)
δ Dirac impulse: ⟨ϕ, δ⟩ = ϕ(0)
δ(· − r_0) Shifted Dirac impulse
β_L Generalized B-spline associated with the operator L
ϕ_int Spline interpolation kernel

β_+^n = β_{D^{n+1}} Causal polynomial B-spline of degree n
x_+^n = max(0, x)^n One-sided power function
β_α First-order exponential B-spline with pole α ∈ C
β_{(α_1:α_N)} Nth-order exponential B-spline: β_{α_1} ∗ · · · ∗ β_{α_N}
a, a[·], or a[n] Discrete-domain signal: sequence Z^d → R
δ[n] Discrete Kronecker impulse

Spaces
X, Y Generic vector spaces (normed or nuclear)
L_2(R^d) Finite-energy functions: ∫_{R^d} |f(r)|^2 dr < ∞
L_p(R^d) Functions such that ∫_{R^d} |f(r)|^p dr < ∞
L_{p,α}(R^d) Functions such that ∫_{R^d} (|f(r)| (1 + |r|)^α)^p dr < ∞
D(R^d) Smooth and compactly supported test functions
D′(R^d) Distributions or generalized functions over R^d
S(R^d) Smooth and rapidly decreasing test functions
S′(R^d) Tempered distributions (generalized functions)
R(R^d) Bounded functions with rapid decay
ℓ_2(Z^d) Finite-energy sequences: Σ_{k∈Z^d} |a[k]|^2 < ∞
ℓ_p(Z^d) Sequences such that Σ_{k∈Z^d} |a[k]|^p < ∞

Operators
Id Identity
D = d/dt Derivative
D_d Finite difference (discrete derivative)
D^N Nth-order derivative
∂^n Partial derivative of order n = (n_1, . . . , n_d)
L Whitening operator (LSI)
L̂(ω) Frequency response of L (Fourier multiplier)
ρ_L Green’s function of L
L^∗ Adjoint of L such that ⟨ϕ_1, Lϕ_2⟩ = ⟨L^∗ϕ_1, ϕ_2⟩
L^{−1} Right inverse of L such that LL^{−1} = Id
h(r_1, r_2) Generalized impulse response of L^{−1}
L^{−1∗} Left inverse of L^∗ such that (L^{−1∗})L^∗ = Id
L_d Discrete counterpart of L
N_L Null space of L
P_α First-order differential operator: D − αId, α ∈ C
P_{(α_1:α_N)} Differential operator of order N: P_{α_1} ◦ · · · ◦ P_{α_N}
Δ_α First-order weighted difference
Δ_{(α_1:α_N)} Nth-order weighted differences: Δ_{α_1} ◦ · · · ◦ Δ_{α_N}
∂_τ^γ Fractional derivative of order γ ∈ R_+ and phase τ
(−Δ)^{γ/2} Fractional Laplacian of order γ ∈ R_+
I_p^{γ∗} L_p-stable left inverse of (−Δ)^{γ/2}

Probability
X, Y Generic scalar random variables
P_X Probability measure on R of X
p_X(x) Probability density function (univariate)
Φ_X(x) Potential function: − log p_X(x)
prox_X(x, λ) Proximal operator
p_id(x) Infinitely divisible probability law
E{·} Expected value operator
m_n nth-order moment: E{X^n}
κ_n nth-order cumulant
p̂_X(ω) Characteristic function of X: E{e^{jωX}}
f(ω) Lévy exponent: log p̂_id(ω)
v(a) Lévy density
p_{(X_1:X_N)}(x) Multivariate probability density function
p̂_{(X_1:X_N)}(ω) Multivariate characteristic function
m_n Moment with multi-index n = (n_1, . . . , n_N)
κ_n Cumulant with multi-index n
H(X_1:X_N) Differential entropy
I(X_1, . . . , X_N) Mutual information
D(p‖q) Kullback–Leibler divergence

Generalized stochastic processes
w Continuous-domain white noise (innovation)
⟨ϕ, w⟩ Generic scalar observation of innovation process
f_ϕ(ω) Modified Lévy exponent: log p̂_{⟨ϕ,w⟩}(ω)
v_ϕ(a) Modified Lévy density
s Generalized stochastic process: L^{−1}w
u Generalized increment process: L_d s = β_L ∗ w
W 1-D Lévy process with DW = w
B_H Fractional Brownian motion with Hurst index H
P̂_s(ϕ) Characteristic functional: E{e^{j⟨ϕ,s⟩}}
B_s(ϕ_1, ϕ_2) Correlation functional: E{⟨ϕ_1, s⟩⟨ϕ_2, s⟩}
R_s(r_1, r_2) Autocorrelation function: E{s(r_1)s(r_2)}
1 Introduction

1.1 Sparsity: Occam’s razor of modern signal processing?

The hypotheses of Gaussianity and stationarity play a central role in the standard statistical
formulation of signal processing. They fully justify the use of the Fourier transform
as the optimal signal representation and naturally lead to the derivation of optimal linear
filtering algorithms for a large variety of statistical estimation tasks. This classical view
of signal processing is elegant and reassuring, but it is not at the forefront of research
anymore.
Starting with the discovery of the wavelet transform in the late 1980s [Dau88, Mal89],
researchers in signal processing have progressively moved away from the Fourier
transform and have uncovered powerful alternatives. Consequently, they have ceased
modeling signals as Gaussian stationary processes and have adopted a more deterministic,
approximation-theoretic point of view. The key developments that are presently
reshaping the field, and which are central to the theory presented in this book, are
summarized below.
• Novel transforms and dictionaries for the representation of signals. New redundant
and non-redundant representations of signals (wavelets, local cosine, curvelets) have
emerged since the mid 1990s and have led to better algorithms for data compression,
data processing, and feature extraction. The most prominent example is the wavelet-based
JPEG-2000 standard for image compression [CSE00], which outperforms the
widely used JPEG method based on the DCT (discrete cosine transform). Another
illustration is wavelet-domain image denoising, which provides a good alternative to
more traditional linear filtering [Don95]. The various dictionaries of basis functions
that have been proposed so far are tailored to specific types of signals; there does not
appear to be one that fits all.
• Sparsity as a new paradigm for signal processing. At the origin of this new trend
is the key observation that many naturally occurring signals and images – in particular,
the ones that are piecewise-smooth – can be accurately reconstructed from
a “sparse” wavelet expansion that involves many fewer terms than the original
number of samples [Mal98]. The concept of sparsity has been systematized and
extended to other transforms, including redundant representations (a.k.a. frames); it
is at the heart of recent developments in signal processing. Sparse signals are easy
to compress and to denoise by simple pointwise processing (e.g., shrinkage) in the
transformed domain. Sparsity provides an equally powerful framework for dealing
with more difficult, ill-posed signal-reconstruction problems [CW08, BDE09].
Promoting sparse solutions in linear models is also of interest in statistics: a
popular regression shrinkage estimator is LASSO, which imposes an upper bound on
the ℓ1-norm of the model coefficients [Tib96].
• New sampling strategies with fewer measurements. The theory of compressed sensing
deals with the problem of the reconstruction of a signal from a minimal, but suitably
chosen, set of measurements [Don06, CW08, BDE09]. The strategy there is as follows:
among the multitude of solutions that are consistent with the measurements,
one should favor the “sparsest” one. In practice, one replaces the underlying ℓ0-norm
minimization problem, which is NP-hard, by a convex ℓ1-norm minimization which is
computationally much more tractable (see the sketch after this list). Remarkably,
researchers have shown that this simplification does yield the correct solution under
suitable conditions (e.g., restricted isometry) [CW08]. Similarly, it has been demonstrated
that signals with a finite rate of innovation (the prototypical example being a
stream of Dirac impulses with unknown locations and amplitudes) can be recovered
from a set of uniform measurements at twice the “innovation rate” [VMB02], rather
than twice the bandwidth, as would otherwise be dictated by Shannon’s classical
sampling theorem.
• Superiority of non-linear signal-reconstruction algorithms. There is increasing
empirical evidence that non-linear variational methods (non-quadratic or sparsity-driven
regularization) outperform the classical (linear) algorithms (direct or
iterative) that are being used routinely for solving bioimaging reconstruction problems
[CBFAB97, FN03]. So far, this has been demonstrated for the problem of
image deconvolution and for the reconstruction of non-Cartesian MRI [LDP07].
The considerable research effort in this area has also resulted in the development of
novel algorithms (ISTA, FISTA) for solving convex optimization problems that were
previously considered out of numerical reach [FN03, DDDM04, BT09b].
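
To make the ℓ1-relaxation strategy of the compressed-sensing item above concrete, here is a
minimal numerical sketch. It is not an experiment from any of the cited papers: the problem
sizes, the Gaussian measurement matrix, and the recasting of basis pursuit as a linear
program solved with scipy are all illustrative assumptions.

```python
# Basis pursuit: min ||x||_1 subject to A x = y, written as a linear program
# by splitting x = u - v with u, v >= 0. Sizes are illustrative.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 60, 30, 4                        # ambient dim, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x                                  # m < n linear measurements

c = np.ones(2 * n)                         # objective: sum(u) + sum(v) = ||x||_1
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]
print("max recovery error:", float(np.max(np.abs(x_hat - x))))
```

With these sizes, the ℓ1 solution typically coincides with the true sparse vector, which is
exactly the phenomenon that restricted-isometry-type conditions formalize.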

1.2 Sparse stochastic models: the step beyond Gaussianity

While the recent developments listed above are truly remarkable and have resulted in
significant algorithmic advances, the overall picture and understanding is still far from
being complete. One limiting factor is that the current formulations of compressed sensing
and sparse-signal recovery are fundamentally deterministic. By drawing on the
analogy with the classical linear theory of signal processing, where there is an equivalence
between quadratic energy-minimization techniques and minimum-mean-square-error
(MMSE) estimation under the Gaussian hypothesis, there are good chances that
further progress is achievable by adopting a complementary statistical-modeling point
of view. 1 The crucial ingredient that is required to guide such an investigation is a
sparse counterpart to the classical family of Gaussian stationary processes (GSP). This

1 It is instructive to recall the fundamental role of statistical modeling in the development of traditional
signal processing. The standard tools of the trade are the Fourier transform, Shannon-type sampling, linear
filtering, and quadratic energy-minimization techniques. These methods are widely used in practice: they
are powerful, easy to deploy, and mathematically convenient. The important conceptual point is that they

book focuses on the formulation of such a statistical framework, which may be aptly
qualified as the next step after Gaussianity under the functional constraint of linearity.
In light of the elements presented in the introduction, the basic requirements for a
comprehensive theory of sparse stochastic processes are as follows:
• Backward compatibility. There is a large body of literature and methods based on the
modeling of signals as realizations of GSP. We would like the corresponding identification,
linear filtering, and reconstruction algorithms to remain applicable, even
though they obviously become suboptimal when the Gaussian hypothesis is violated.
This calls for an extended formulation that provides the same control of the correlation
structure of the signals (second-order moments, Fourier spectrum) as the classical
theory does.
• Continuous-domain formulation. The proper interpretation of qualifying terms
such as “piecewise-smooth,” “translation-invariant,” “scale-invariant,” “rotation-invariant”
calls for continuous-domain models of signals that are compatible with
the conventional (finite-dimensional) notion of sparsity. Likewise, if we intend to
optimize or possibly redesign the signal-acquisition system as in generalized sampling
and compressed sensing, the very least is to have a model that characterizes the
information content prior to sampling.
• Predictive power. Among other things, the theory should be able to explain why
wavelet representations can outperform the older Fourier-related types of decompositions,
including the KLT, which is optimal from the classical perspective of variance
concentration.
• Ease of use. To have practical usefulness, the framework should allow for the derivation
of the (joint) probability distributions of the signal in any transformed
domain. This calls for a linear formulation with the caveat that it needs to accommodate
non-Gaussian distributions. In that respect, the best thing beyond Gaussianity is

are justifiable based on the theory of Gaussian stationary processes (GSP). Specifically, one can invoke the
following optimality results:

• The Fourier transform as well as several of its real-valued variants (e.g., DCT) are asymptotically equivalent
to the Karhunen–Loève transform (KLT) for the whole class of GSP. This supports the use of
sinusoidal transforms for data compression, data processing, and feature extraction. The underlying
notion of optimality here is energy compaction, which implies decorrelation. Note that the decorrelation
is equivalent to independence in the Gaussian case only.
• Optimal filters. Given a series of linear measurements of a signal corrupted by noise, one can readily
specify its optimal reconstruction (LMMSE estimator) under the general Gaussian hypothesis. The corresponding
algorithm (Wiener filter) is linear and entirely determined by the covariance structure of the
signal and noise. There is also a direct connection with variational reconstruction techniques since the
Wiener solution can also be formulated as a quadratic energy-minimization problem (Gaussian MAP
estimator) (see Section 10.2.2).
• Optimal sampling/interpolation strategies. While this part of the story is less known, one can also
invoke estimation-theoretic arguments to justify a Shannon-type, constant-rate sampling, which ensures
a minimum loss of information for a large class of predominantly lowpass GSP [PM62, Uns93]. This
is not totally surprising since the basis functions of the KLT are inherently bandlimited. One can also
derive minimum-mean-square-error interpolators for GSP in general. The optimal signal-reconstruction
algorithm takes the form of a hybrid Wiener filter whose input is discrete (signal samples) and whose
output is a continuously defined signal that can be represented in terms of generalized B-spline basis
functions [UB05b].

infinite divisibility, which is a general property of random variables that is preserved
under arbitrary linear combinations.
• Stochastic justification and refinement of current reconstruction algorithms. A
convincing argument for adopting a new theory is that it must be compatible with
the state of the art, while it also ought to suggest new directions of research. In the
present context, it is important to be able to establish the connection with deterministic
recovery techniques such as ℓ1-norm minimization.
The good news is that the foundations for such a theory exist and can be traced
back to the pioneering work of Paul Lévy, who defined a broad family of “additive”
stochastic processes, now called Lévy processes. Brownian motion (a.k.a. the Wiener
process) is the only Gaussian member of this family, and, as we shall demonstrate,
the only representative that does not exhibit any degree of sparsity. The theory that is
developed in this book constitutes the full linear, multidimensional extension of those
ideas where the essence of Paul Lévy’s construction is embodied in the definition of
Lévy innovations (or white Lévy noise), which can be interpreted as the derivative of
a Lévy process in the sense of distributions (a.k.a. generalized functions). The Lévy
innovations are then linearly transformed to generate a whole variety of processes whose
spectral characteristics are controlled by a linear mixing operator, while their sparsity is
governed by the innovations. The latter can also be viewed as the driving term of some
corresponding linear stochastic differential equation (SDE).
Another way of describing the extent of this generalization is to consider the representation
of a general continuous-domain Gaussian process by a stochastic Wiener integral,

    s(t) = ∫_R h(t, τ) dW(τ),    (1.1)

where h(t, τ) is the kernel – that is, the infinite-dimensional analog of the matrix representation
of a transformation in R^n – of a general, L_2-stable linear operator. W is a
random measure which is such that

    W(t) = ∫_0^t dW(τ)

is the Wiener process, where the latter equation constitutes a special case of (1.1) with
h(t, τ) = 1_{t>τ≥0}. Here, 1_E denotes the indicator function of the set E. If h(t, τ) =
h(t − τ) is a convolution kernel, then (1.1) defines the whole class of Gaussian stationary
processes.
The essence of the present formulation is to replace the Wiener measure by a more
general non-Gaussian, multidimensional Lévy measure. The catch, however, is that we
shall not work with measures but rather with generalized functions and generalized
stochastic processes. These are easier to manipulate in the Fourier domain and better
suited for specifying general linear transformations. In other words, we shall rewrite (1.1) as

    s(t) = ∫_R h(t, τ) w(τ) dτ,    (1.2)

where the entity w (the continuous-domain innovation) needs to be given a proper mathematical
interpretation. The main advantage of working with innovations is that they
provide a very direct link with the theory of linear systems, which allows for the use of
standard engineering notions such as the impulse and frequency responses of a system.
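
Before formalizing w, it is worth seeing the innovation model (1.2) at work numerically. The
sketch below is our own illustration (the exponential kernel, its decay rate, and the choice
of a Cauchy law as the heavy-tailed example are assumptions, not prescriptions from the
text): the same convolution operator is driven by a Gaussian and by a heavy-tailed
innovation, producing two processes with the same correlation machinery but very
different sparsity.

```python
# Discretized innovation model s = h * w: one filter, two innovations.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = np.arange(n)
h = np.exp(-0.05 * t)                       # causal convolution kernel h(t - tau)

w_gauss = rng.standard_normal(n)            # Gaussian innovation
w_cauchy = rng.standard_cauchy(n)           # heavy-tailed (sparse) innovation

s_gauss = np.convolve(h, w_gauss)[:n]       # Gaussian stationary process
s_cauchy = np.convolve(h, w_cauchy)[:n]     # sparse counterpart: spiky sample path

print("max |s|, Gaussian innovation:", float(np.abs(s_gauss).max()))
print("max |s|, Cauchy innovation:  ", float(np.abs(s_cauchy).max()))
```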

1.3 From splines to stochastic processes, or when Schoenberg meets Lévy

We shall start our journey by making an interesting connection between splines, which
are deterministic objects with some inherent sparsity, and Lévy processes with a special
focus on the compound-Poisson process, which constitutes the archetype of a
sparse stochastic process. The key observation is that both categories of signals –
namely, deterministic and random – are ruled by the same differential equation.
They can be generated via the proper integration of an “innovation” signal that carries
all the necessary information. The fun is that the underlying differential system is only
marginally stable, which requires the design of a special antiderivative operator. We
then use the close relationship between splines and wavelets to gain insight into the
ability of wavelets to provide sparse representations of such signals. Specifically, we
shall see that most non-Gaussian Lévy processes admit a better M-term representation
in the Haar wavelet basis than in the classical Karhunen–Loève transform (KLT),
which is usually believed to be optimal for data compression. The explanation for this
counter-intuitive result is that we are breaking some of the assumptions that are implicit
in the proof of optimality of the KLT.

1.3.1 Splines and Legos revisited


Splines constitute a general framework for converting series of data points (or samples)
into continuously defined signals or functions. By extension, they also provide a powerful
mechanism for translating tasks that are specified in the continuous domain into
efficient numerical algorithms (discretization).
The cardinal setting corresponds to the configuration where the sampling grid is on
the integers. Given a sequence of sample values f[k], k ∈ Z, the basic cardinal interpolation
problem is to construct a continuously defined signal f(t), t ∈ R that satisfies
the interpolation condition f(t)|_{t=k} = f[k], for all k ∈ Z. Since the general problem is
obviously ill-posed, the solution is constrained to live in a suitable reconstruction subspace
(e.g., a particular space of cardinal splines) whose degrees of freedom are in
one-to-one correspondence with the data points. The most basic concretization of those
ideas is the construction of the piecewise-constant interpolant

    f_1(t) = Σ_{k∈Z} f[k] β_+^0(t − k),    (1.3)

which involves rectangular basis functions (informally described as “Legos”) that are
shifted replicates of the causal 2 B-spline of degree zero:

    β_+^0(t) = 1, for 0 ≤ t < 1;  0, otherwise.    (1.4)

2 A function f (t) is said to be causal if f (t) = 0, for all t < 0.



Figure 1.1 Examples of spline signals. (a) Cardinal spline interpolant of degree zero
(piecewise-constant). (b) Cardinal spline interpolant of degree one (piecewise-linear).
(c) Non-uniform D-spline or compound-Poisson process, depending on the interpretation
(deterministic vs. stochastic).

Observe that the basis functions {β_+^0(t − k)}_{k∈Z} are non-overlapping and orthonormal,
and that their linear span defines the space of cardinal polynomial splines of degree zero.
Moreover, since β_+^0(t) takes the value one at the origin and vanishes at all other integers,
the expansion coefficients in (1.3) coincide with the original samples of the signal.
Equation (1.3) is nothing but a mathematical representation of the sample-and-hold
method of interpolation which yields the type of “Lego-like” signal shown in Figure 1.1a.
A defining property of piecewise-constant signals is that they exhibit “sparse” first-order
derivatives that are zero almost everywhere, except at the points of transition
where differentiation is only meaningful in the sense of distributions. In the case of the
cardinal spline specified by (1.3), we have that

    Df_1(t) = Σ_{k∈Z} a_1[k] δ(t − k),    (1.5)

where the weights of the integer-shifted Dirac impulses δ(· − k) are given by the corresponding
jump size of the function: a_1[k] = f[k] − f[k − 1]. The main point is that
the application of the operator D = d/dt uncovers the spline discontinuities (a.k.a. knots)
which are located on the integer grid: its effect is that of a mathematical A-to-D conversion
since the right-hand side of (1.5) corresponds to the continuous-domain representation
of a discrete signal commonly used in the theory of linear systems. In the
nomenclature of splines, we say that f_1(t) is a cardinal D-spline, 3 which is a special case
3 Other brands of splines are defined in the same fashion by replacing the derivative D by some other
differential operator generically denoted by L.

Figure 1.2 Causal polynomial B-splines. (a) Construction of the B-spline of degree zero,
β_+^0(t) = 1_+(t) − 1_+(t − 1), starting from the causal Green’s function of D. (b) B-splines of
degree n = 0, . . . , 4 (light to dark), which become more bell-shaped (and beautiful) as n
increases.

of a general non-uniform D-spline where the knots can be located arbitrarily (see
Figure 1.1c).
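
The sparsity claim of (1.5) is easy to verify numerically. In the sketch below (jump
locations and amplitudes are arbitrary illustrative choices), the samples f[k] of a D-spline
are synthesized by cumulating a few jumps; the finite differences a_1[k] = f[k] − f[k − 1]
then recover exactly the sparse jump sequence, which is the discrete face of the A-to-D
conversion performed by D.

```python
# Sample-and-hold spline and its sparse finite-difference "derivative".
import numpy as np

rng = np.random.default_rng(0)
a = np.zeros(24)                               # jump weights a1[k], mostly zero
a[[3, 9, 10, 17]] = rng.standard_normal(4)     # four jumps at arbitrary knots
f = np.cumsum(a)                               # samples f[k] of the D-spline

# Piecewise-constant interpolant f1(t) = sum_k f[k] beta0(t - k) on a fine grid
t = np.linspace(0.0, len(f) - 1e-9, 2000)
f1 = f[np.floor(t).astype(int)]                # beta0 holds f[k] on [k, k+1)

a_rec = np.diff(f, prepend=0.0)                # a1[k] = f[k] - f[k-1]
print("nonzero jumps:", np.count_nonzero(a_rec), "/", len(a_rec))
print("jumps recovered exactly:", bool(np.allclose(a_rec, a)))
```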
The next fundamental observation is that the expansion coefficients in (1.5) are
obtained via a finite-difference scheme which is the discrete counterpart of differentiation.
To get some further insight, we define the finite-difference operator

    D_d f(t) = f(t) − f(t − 1).

The latter turns out to be a smoothed version of the derivative

    D_d f(t) = (β_+^0 ∗ Df)(t),

where the smoothing kernel is precisely the B-spline generator for the expansion (1.3).
An equivalent manifestation of this property can be found in the relation

    β_+^0(t) = D_d D^{−1} δ(t) = D_d 1_+(t),    (1.6)

where the unit step 1_+(t) = 1_{[0,+∞)}(t) (a.k.a. the Heaviside function) is the causal
Green’s function 4 of the derivative operator. This formula is illustrated in Figure 1.2a.
Its Fourier-domain counterpart is

    β̂_+^0(ω) = ∫_R β_+^0(t) e^{−jωt} dt = (1 − e^{−jω}) / (jω),    (1.7)

which is recognized as being the ratio of the frequency responses of the operators D_d
and D, respectively.
Thus, the basic Lego component, β_+^0, is much more than a mere building block: it
is also a kernel that characterizes the approximation that is made when replacing a
continuous-domain derivative by its discrete counterpart. This idea (and its generalization
for other operators) will prove to be one of the key ingredients in our formulation
of sparse stochastic processes.
of sparse stochastic processes.
4 We say that ρ(t) is the causal Green’s function of the shift-invariant operator L if ρ is causal and satisfies
Lρ = δ. This can also be written as L^{−1}δ = ρ, meaning that ρ is the causal impulse response of the
shift-invariant inverse operator L^{−1}.

1.3.2 Higher-degree polynomial splines


A slightly more sophisticated model is to select a piecewise-linear reconstruction which
admits the similar B-spline expansion

    f_2(t) = Σ_{k∈Z} f[k + 1] β_+^1(t − k),    (1.8)

where

    β_+^1(t) = (β_+^0 ∗ β_+^0)(t) = t, for 0 ≤ t < 1;  2 − t, for 1 ≤ t < 2;  0, otherwise    (1.9)

is the causal B-spline of degree one, a triangular function centered at t = 1. Note that
the use of a causal generator is compensated by the unit shifting of the coefficients in
(1.8), which is equivalent to recentering the basis functions on the sampling locations.
The main advantage of f_2 in (1.8) over f_1 in (1.3) is that the underlying function is now
continuous, as illustrated in Figure 1.1b.
In an analogous manner, one can construct higher-degree spline interpolants that are
piecewise polynomials of degree n by considering B-spline atoms of degree n obtained
from the (n + 1)-fold convolution of β_+^0(t) (see Figure 1.2b). The generic version of such
a higher-order spline model is

    f_{n+1}(t) = Σ_{k∈Z} c[k] β_+^n(t − k),    (1.10)

with

    β_+^n(t) = (β_+^0 ∗ β_+^0 ∗ · · · ∗ β_+^0)(t)    ((n + 1) factors).

The catch, though, is that, for n > 1, the expansion coefficients c[k] in (1.10) are not
identical to the sample values f[k] anymore. Yet, they are in a one-to-one correspondence
with them and can be determined efficiently by solving a linear system of equations
that has a convenient band-diagonal Toeplitz structure [Uns99].
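
The (n + 1)-fold convolution construction is straightforward to reproduce numerically. In
this sketch (the grid step and degree are illustrative choices), repeated convolution of the
sampled box generates the degree-n B-spline, whose support grows to [0, n + 1] while its
integral stays equal to one, producing the bell shapes of Figure 1.2b:

```python
# Degree-n causal B-spline by repeated convolution of the box beta0.
import numpy as np

dt = 1e-3
beta0 = np.ones(int(round(1.0 / dt)))          # beta0 sampled on [0, 1)

beta = beta0.copy()
for _ in range(3):                             # three more factors -> degree n = 3
    beta = np.convolve(beta, beta0) * dt       # each convolution raises the degree by 1

print("support:", round(len(beta) * dt, 3))           # ~ n + 1 = 4
print("integral:", round(float(beta.sum() * dt), 6))  # ~ 1 for every degree
```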
The higher-order counterparts of relations (1.7) and (1.6) are
 n+1
1 − e −jω
+ (ω) =
β n

and
−(n+1)
β+n (t) = Dn+1
d D δ(t)
Dn+1 n
d (t)+
= (1.11)
n!
n+1  
n + 1 (t − k)n+
k
= (−1)
k n!
k=0
1.3 From splines to stochastic processes 9

with $(t)_+ = \max(0, t)$. The latter explicit time-domain formula follows from the fact that the impulse response of the $(n+1)$-fold integrator (or, equivalently, the causal Green's function of $\mathrm{D}^{n+1}$) is the one-sided power function $\mathrm{D}^{-(n+1)}\delta(t) = \frac{t_+^n}{n!}$. This elegant formula is due to Schoenberg, the father of splines [Sch46]. He also proved that the
polynomial B-spline of degree n is the shortest cardinal Dn+1 -spline and that its integer
translates form a Riesz basis of such polynomial splines. In particular, he showed that
the B-spline representation (1.10) is unique and stable, in the sense that

$$\|f_n\|_{L_2}^2 = \int_{\mathbb{R}} |f_n(t)|^2\, dt \;\le\; \|c\|_{\ell_2}^2 = \sum_{k\in\mathbb{Z}} c[k]^2.$$

Note that the inequality above becomes an equality for n = 0 since the squared
L2 -norm of the corresponding piecewise-constant function is easily converted into a
sum. This also follows from Parseval’s identity because the B-spline basis {β+0 (·−k)}k∈Z
is orthonormal.
One last feature is that polynomial splines of degree n are inherently smooth, in the
sense that they are n-times differentiable everywhere with bounded derivatives – that is,
Hölder continuous of order n. In the cardinal setting, this follows from the property that
$$\mathrm{D}^n \beta_+^n(t) = \mathrm{D}^n \mathrm{D}_d^{n+1} \mathrm{D}^{-(n+1)} \delta(t) = \mathrm{D}_d^n\, \mathrm{D}_d \mathrm{D}^{-1} \delta(t) = \mathrm{D}_d^n\, \beta_+^0(t),$$

which indicates that the nth-order derivative of a B-spline of degree n is piecewise-constant and bounded.
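As a sanity check on the explicit formula (1.11), the following sketch (our addition) compares the alternating sum of one-sided power functions with a brute-force $(n+1)$-fold convolution of the causal box; the degree n = 2 and the grid step are arbitrary illustrative choices.

```python
import numpy as np
from math import comb, factorial

def bspline_formula(t, n):
    """Causal B-spline of degree n via the explicit formula (1.11)."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    for k in range(n + 2):
        out += (-1)**k * comb(n + 1, k) * np.maximum(t - k, 0.0)**n / factorial(n)
    return out

h, n = 1e-3, 2
t = np.arange(0.0, 5.0, h)
box = (t < 1.0).astype(float)          # samples of beta_+^0 on the grid
conv = box.copy()
for _ in range(n):                     # (n+1)-fold convolution of the box
    conv = np.convolve(conv, box)[: t.size] * h

print(np.max(np.abs(conv - bspline_formula(t, n))))  # O(h) discrepancy
```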

1.3.3 Random splines, innovations, and Lévy processes


To make the link with Lévy processes, we now express the random counterpart of
(1.5) as

$$\mathrm{D}s(t) = \sum_n A_n\, \delta(t - t_n) = w(t), \qquad (1.12)$$

where the locations tn of the Dirac impulses are uniformly distributed over the real
line (Poisson distribution with rate parameter λ) and the weights An are independent
identically distributed (i.i.d.) with amplitude distribution $p_A(a)$. For simplicity, we are also assuming that $p_A$ is symmetric with finite variance $\sigma_A^2 = \int_{\mathbb{R}} a^2 p_A(a)\, da$. We shall
refer to w as the innovation of the signal s since it contains all the parameters that are
necessary for its description. Clearly, s is a signal with a finite rate of innovation, a term
that was coined by Vetterli et al. [VMB02].
The idea now is to reconstruct s from its innovation w by integrating (1.12). This
requires the specification of some boundary condition to fix the integration constant.
Since the constraint in the definition of Lévy processes is s(0) = 0 (with probability
one), we first need to find a suitable antiderivative operator, which we shall denote by

$\mathrm{D}_0^{-1}$. In the event when the input function is Lebesgue integrable, the relevant operator is readily specified as

$$\mathrm{D}_0^{-1}\varphi(t) = \int_{-\infty}^{t} \varphi(\tau)\, d\tau - \int_{-\infty}^{0} \varphi(\tau)\, d\tau = \begin{cases} \displaystyle \int_{0}^{t} \varphi(\tau)\, d\tau, & \text{for } t \ge 0 \\ \displaystyle -\int_{t}^{0} \varphi(\tau)\, d\tau, & \text{for } t < 0. \end{cases}$$

It is the corrected version (subtraction of the proper signal-dependent constant) of the


conventional shift-invariant integrator D−1 for which the integral runs from −∞ to t.
The Fourier counterpart of this definition is

$$\mathrm{D}_0^{-1}\varphi(t) = \int_{\mathbb{R}} \frac{e^{j\omega t} - 1}{j\omega}\, \widehat{\varphi}(\omega)\, \frac{d\omega}{2\pi},$$
which can be extended, by duality, to a much larger class of generalized functions (see Chapter 5). This is feasible because the latter expression is a regularized version of an integral that would otherwise be singular, since the division by $j\omega$ is tempered by a proper correction in the numerator: $e^{j\omega t} - 1 = j\omega t + O(\omega^2)$. It is important to note that $\mathrm{D}_0^{-1}$ is scale-invariant (in the sense that it commutes with scaling), but not shift-invariant, unlike $\mathrm{D}^{-1}$. Our reason for selecting $\mathrm{D}_0^{-1}$ over $\mathrm{D}^{-1}$ is actually more fundamental than just imposing the “right” boundary conditions. It is guided by stability considerations: $\mathrm{D}_0^{-1}$ is a valid right inverse of $\mathrm{D}$ in the sense that $\mathrm{D}\mathrm{D}_0^{-1} = \mathrm{Id}$ over a large class of generalized functions, while the use of the shift-invariant inverse $\mathrm{D}^{-1}$ is much more constrained. Other than that, both operators share most of their global properties. In particular, since the finite-difference operator has the convenient property of annihilating the constants that are in the null space of $\mathrm{D}$, we see that

$$\beta_+^0(t) = \mathrm{D}_d \mathrm{D}_0^{-1} \delta(t) = \mathrm{D}_d \mathrm{D}^{-1} \delta(t). \qquad (1.13)$$

Having the proper inverse operator at our disposal, we can apply it to formally solve
the stochastic differential equation (1.12). This yields the explicit representation of the
sparse stochastic process:

$$s(t) = \mathrm{D}_0^{-1} w(t) = \sum_n A_n\, \mathrm{D}_0^{-1}\{\delta(\cdot - t_n)\}(t) = \sum_n A_n \big( 1_+(t - t_n) - 1_+(-t_n) \big), \qquad (1.14)$$

where the second term 1+ (−tn ) in the last parenthesis ensures that s(0) = 0. Clearly,
the signal defined by (1.14) is piecewise-constant (random spline of degree zero) and
its construction is compatible with the classical definition of a compound-Poisson pro-
cess, which is a special type of Lévy process. A representative example is shown in
Figure 1.1c.
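For illustration, the following minimal simulation (our addition, assuming NumPy) implements (1.14) on an interval [0, T], where the correction term $1_+(-t_n)$ vanishes because all jump locations are positive; the Gaussian amplitude law and the parameter values are arbitrary choices.

```python
import numpy as np
rng = np.random.default_rng(0)

def compound_poisson(T=1.0, lam=10.0, sigma_a=1.0, n_grid=4096):
    """Random spline of degree zero: a compound-Poisson path via (1.14)."""
    n_jumps = rng.poisson(lam * T)               # Poisson number of impulses
    t_n = rng.uniform(0.0, T, n_jumps)           # uniform jump locations
    a_n = rng.normal(0.0, sigma_a, n_jumps)      # i.i.d. amplitudes A_n
    t = np.linspace(0.0, T, n_grid)
    s = (a_n[None, :] * (t[:, None] >= t_n[None, :])).sum(axis=1)
    return t, s                                  # s(0) = 0 by construction

t, s = compound_poisson()
```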
It can be shown that the innovation w specified by (1.12), made of random impulses,
is a special type of continuous-domain white noise with the property that

$$\mathbb{E}\{w(t)\, w(t')\} = R_w(t - t') = \sigma_w^2\, \delta(t - t'), \qquad (1.15)$$



[Figure 1.3 shows three panels on the unit interval, each pairing a white noise (innovation) $w(t)$ with its integral $s(t)$: Gaussian noise → Brownian motion; impulsive noise → compound-Poisson process; SαS (Cauchy) noise → stable Lévy process.]
Figure 1.3 Synthesis of different brands of Lévy processes by integration of a corresponding continuous-domain white noise. The alpha-stable excitation in the bottom example is such that the increments of the Lévy process have a symmetric Cauchy distribution.

where $\sigma_w^2 = \lambda \sigma_A^2$ is the product of the Poisson rate parameter $\lambda$ and the variance $\sigma_A^2$ of the amplitude distribution. More generally, we can determine the correlation functional of the innovation, which is given by

$$\mathbb{E}\{\langle \varphi_1, w \rangle \langle \varphi_2, w \rangle\} = \sigma_w^2\, \langle \varphi_1, \varphi_2 \rangle \qquad (1.16)$$

for any real-valued functions $\varphi_1, \varphi_2 \in L_2(\mathbb{R})$, where $\langle \varphi_1, \varphi_2 \rangle = \int_{\mathbb{R}} \varphi_1(t)\, \varphi_2(t)\, dt$.
This suggests that we can apply the same operator-based synthesis to other types of
continuous-domain white noise, as illustrated in Figure 1.3. In doing so, we are able to
generate the whole family of Lévy processes. In the case where w is a white Gaussian noise, the resulting signal is a Brownian motion, whose sample paths are (almost surely) continuous. A more extreme case arises when w is an alpha-stable noise, which yields a stable Lévy process whose sample paths exhibit a few really large jumps and are rougher than those of a Brownian motion.
In the classical literature on stochastic processes, Lévy processes are usually defined
in terms of their increments, which are i.i.d. and infinitely divisible random variables
(see Chapter 7). Here, we shall consider the so-called increment process u(t) = s(t) −
s(t − 1), which has a number of remarkable properties. The key observation is that u,
in its continuous-domain version, is the convolution of a white noise (innovation) with
the B-spline kernel β+0 . Indeed, the relation (1.13) leads to

$$u(t) = \mathrm{D}_d s(t) = \mathrm{D}_d \mathrm{D}_0^{-1} w(t) = (\beta_+^0 * w)(t). \qquad (1.17)$$

This implies, among other things, that u is stationary, while the original Lévy process s is not (since $\mathrm{D}_0^{-1}$ is not shift-invariant). It also suggests that the samples of the increment process u are independent if they are taken at a distance of 1 or more apart, the limit corresponding to the support of the rectangular convolution kernel $\beta_+^0$. When the

autocorrelation function $R_w(\tau)$ of the driving noise is well defined and given by (1.15), we can easily determine the autocorrelation of u as

$$R_u(\tau) = \mathbb{E}\{u(t + \tau)\, u(t)\} = \big(\beta_+^0 * (\beta_+^0)^\vee * R_w\big)(\tau) = \sigma_w^2\, \beta_+^1(\tau + 1), \qquad (1.18)$$

where $(\beta_+^0)^\vee(t) = \beta_+^0(-t)$. It is proportional to the autocorrelation of a rectangle, which is a triangular function (a centered B-spline of degree one).
Of special interest to us are the samples of u on the integer grid, which are characterized for $k \in \mathbb{Z}$ as

$$u[k] = s(k) - s(k-1) = \langle 1_{(k-1,k]}, w \rangle = \langle \beta_+^0(k - \cdot), w \rangle.$$

The relation on the right-hand side can be used to show that the u[k] are i.i.d. because
w is white, stationary, and the supports of the analysis functions β+0 (k − t) are non-
overlapping. We shall refer to {u[k]}k∈Z as the discrete innovation of s. Its determi-
nation involves the sampling of s at the integers and a discrete differentiation (finite
differences), in direct analogy with the generation of the continuous-domain innovation
w(t) = Ds(t).
The discrete innovation sequence u[·] will play a fundamental role in signal proces-
sing because it constitutes a convenient tool for extracting the statistics and charac-
terizing the samples of a stochastic process. It is probably the best practical way of
presenting the information because
• we never have access to the full signal s(t), which is a continuously defined entity,
and
• we cannot implement the whitening operator (derivative) exactly, not to
mention that the continuous-domain innovation w(t) does not admit an interpretation
as an ordinary function of t. For instance, Brownian motion is not differentiable
anywhere in the classical sense.
This points to the fact that the continuous-domain innovation model is a theoretical
construct. Its primary purpose is to facilitate the determination of the joint probability
distributions of any series of linear measurements of a wide class of sparse stochastic
processes, including the discrete version of the innovation which has the property of
being maximally decoupled.
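The decoupling effect described above is easy to observe numerically. The sketch below (our addition) extracts the discrete innovation u[k] = s(k) − s(k − 1) from the integer samples of a Brownian motion, modeled here as a Gaussian random walk purely for illustration, and checks that its empirical autocorrelation is close to a Kronecker delta.

```python
import numpy as np
rng = np.random.default_rng(1)

s = np.concatenate(([0.0], np.cumsum(rng.standard_normal(4096))))
u = np.diff(s)                          # discrete innovation u[k] = s(k) - s(k-1)

ac = np.correlate(u, u, mode="full") / u.size
mid = ac.size // 2
print(ac[mid], ac[mid + 1 : mid + 4])   # peak ~ variance, off-peak values ~ 0
```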

1.3.4 Wavelet analysis of Lévy processes and M-term approximations


Our purpose so far has been to link splines and Lévy processes to the derivative op-
erator D. We shall now exploit this connection in the context of wavelet analysis. To
that end, we consider the Haar basis {ψi,k }i∈Z,k∈Z , which is generated by the Haar
wavelet


$$\psi_{\mathrm{Haar}}(t) = \begin{cases} 1, & \text{for } 0 \le t < \tfrac{1}{2} \\ -1, & \text{for } \tfrac{1}{2} \le t < 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (1.19)$$

Figure 1.4 Dual pair of multiresolution bases where the first kind of functions (wavelets) are the
derivatives of the second (hierarchical basis functions). (a) (Unnormalized) Haar wavelet basis.
(b) Faber–Schauder basis (a.k.a. Franklin system).

The basis functions, which are orthonormal, are given by

$$\psi_{i,k}(t) = 2^{-i/2}\, \psi_{\mathrm{Haar}}\!\left( \frac{t - 2^i k}{2^i} \right), \qquad (1.20)$$

where i and k are the scale (dilation of $\psi_{\mathrm{Haar}}$ by $2^i$) and location (translation of $\psi_{i,0}$ by $2^i k$) indices, respectively. A closely related system is the Faber–Schauder basis $\{\phi_{i,k}\}_{i\in\mathbb{Z}, k\in\mathbb{Z}}$, which is made up of B-splines of degree one in a wavelet-like configuration (see Figure 1.4).
Specifically, the hierarchical triangle basis functions are given by

$$\phi_{i,k}(t) = \beta_+^1\!\left( \frac{t - 2^i k}{2^{i-1}} \right). \qquad (1.21)$$

While these functions are orthogonal within any given scale (because they are non-
overlapping), they fail to be so across scales. Yet, they form a Schauder basis, which is
a somewhat weaker property than being a Riesz basis of L2 (R).
The fundamental observation for our purpose is that the Haar system can be obtained
by differentiating the Faber–Schauder one, up to some amplitude factor. Specifically,
we have the relations

$$\psi_{i,k} = 2^{i/2 - 1}\, \mathrm{D}\phi_{i,k} \qquad (1.22)$$

$$\mathrm{D}_0^{-1} \psi_{i,k} = 2^{i/2 - 1}\, \phi_{i,k}. \qquad (1.23)$$
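As a quick sanity check of (1.22), added here for convenience, consider $i = k = 0$: by (1.21), $\phi_{0,0}(t) = \beta_+^1(2t)$ is the triangle supported on $[0, 1]$ with peak value 1 at $t = \tfrac{1}{2}$, so that

$$2^{0/2-1}\, \mathrm{D}\phi_{0,0}(t) = \tfrac{1}{2} \times \begin{cases} +2, & \text{for } 0 < t < \tfrac{1}{2} \\ -2, & \text{for } \tfrac{1}{2} < t < 1 \\ 0, & \text{otherwise} \end{cases} = \psi_{\mathrm{Haar}}(t),$$

in agreement with (1.19).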

Let us now apply (1.22) to the formal determination of the wavelet coefficients of the Lévy process $s = \mathrm{D}_0^{-1} w$. The crucial manipulation, which will be justified rigorously within the framework of generalized stochastic processes (see Chapter 3), is $\langle s, \mathrm{D}\phi_{i,k} \rangle = \langle \mathrm{D}^* s, \phi_{i,k} \rangle = -\langle w, \phi_{i,k} \rangle$, where we have used the adjoint relation $\mathrm{D}^* = -\mathrm{D}$ and the right-inverse property of $\mathrm{D}_0^{-1}$. This allows us to express the wavelet coefficients as

$$Y_{i,k} = \langle s, \psi_{i,k} \rangle = -2^{i/2 - 1} \langle w, \phi_{i,k} \rangle,$$
which, up to some scaling factors, amounts to a Faber–Schauder analysis of the innova-
tion w = Ds. Since the triangle functions φi,k are non-overlapping within a given scale
and the innovation is independent at every point, we immediately deduce that the cor-
responding wavelet coefficients are also independent. However, the decoupling is not
perfect across scales due to the parent-to-child overlap of the triangle functions. The
residual correlation can be determined from the correlation functional (1.16) of the
noise, according to
  
$$\mathbb{E}\{Y_{i,k}\, Y_{i',k'}\} = 2^{(i + i')/2 - 2}\, \mathbb{E}\big\{\langle w, \phi_{i,k} \rangle \langle w, \phi_{i',k'} \rangle\big\} \propto \langle \phi_{i,k}, \phi_{i',k'} \rangle.$$
Since the triangle functions are non-negative, the residual correlation is zero if and only if $\phi_{i,k}$ and $\phi_{i',k'}$ are non-overlapping, in which case the wavelet coefficients are independent as well. We can also predict that the wavelet transform of a compound-Poisson
process will be sparse (i.e., with many vanishing coefficients) because the random Dirac
impulses of the innovation will intersect only a few Faber–Schauder functions, an effect
that becomes more and more pronounced as the scale gets finer. The level of sparsity
can therefore be expected to be directly dependent upon λ (the density of impulses per
unit length).
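This sparsification is easy to reproduce numerically. The sketch below (our addition) applies a hand-rolled orthonormal Haar transform to a piecewise-constant compound-Poisson path; the jump count and seed are arbitrary, and a power-of-two length is assumed for simplicity.

```python
import numpy as np
rng = np.random.default_rng(4)

def haar_details(x):
    """All detail coefficients of an orthonormal Haar transform
    (the length of x is assumed to be a power of two)."""
    x = np.asarray(x, dtype=float)
    d = []
    while x.size > 1:
        d.append((x[0::2] - x[1::2]) / np.sqrt(2.0))   # wavelet band
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)         # coarser averages
    return np.concatenate(d)

n = 4096
jumps = np.zeros(n)
idx = rng.choice(n, size=20, replace=False)   # ~20 impulses (small lambda)
jumps[idx] = rng.standard_normal(20)
s = np.cumsum(jumps)                          # piecewise-constant path

d = haar_details(s)
print(np.mean(np.abs(d) < 1e-12))   # large fraction of exactly-zero coefficients
```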
To quantify this behavior, we applied Haar wavelets to the compression of sampled
realizations of Lévy processes and compared the results with those of the “optimal” text-
book solution for transform coding. In the case of a Lévy process with finite variance,
the KLT can be determined analytically from the knowledge of the covariance function $\mathbb{E}\{s(t)s(t')\} = C\big( |t| + |t'| - |t - t'| \big)$, where C is an appropriate constant. The KLT is
also known to converge to the discrete cosine transform (DCT) as the size of the signal
increases. The present compression task is to reconstruct a series of 4096-point signals
from their M largest transform coefficients, which is the minimum-error selection rule
dictated by Parseval’s relation. Figure 1.5 displays the graph of the relative quadratic
M-term approximation errors for the three types of Lévy processes shown in Figure 1.3.
We also considered the identity transform as baseline, and the DCT as well, whose
results were found to be indistinguishable from those of the KLT. We observe that the
KLT performs best in the Gaussian scenario, as expected. It is also slightly better than
wavelets at large compression ratios for the compound-Poisson process (piecewise-
constant signal with Gaussian-distributed jumps). In the latter case, however, the
situation changes dramatically as M increases since one is able to reconstruct the signal
perfectly from a fraction of the wavelet coefficients, by reason of the sparse behavior
explained above. The advantage of wavelets over the KLT/DCT is striking for the Lévy
flight (symmetric-alpha-stable, or SαS, distribution with α = 1). While these findings
are surprising at first, they do not contradict the classical theory which tells us that
the KLT has the minimum basis-restriction error for the given class of processes. The
twist here is that the selection of the M largest transform coefficients amounts to some
adaptive reordering of the basis functions, which is not accounted for in the derivation

Figure 1.5 Haar wavelets vs. KLT: M-term approximation errors for different brands of Lévy
processes. (a) Gaussian (Brownian motion). (b) Compound-Poisson with Gaussian jump
distribution and e−λ = 0.9. (c) Alpha-stable (symmetric Cauchy). The results are averages
over 1000 realizations.

of the KLT. The other point is that the KLT solution is not defined for the third type of SαS process, whose theoretical covariances are unbounded – this does not prevent us from applying the Gaussian solution/DCT to a finite-length realization whose $\ell_2$-norm is finite (almost surely). This simple experiment with various stochastic models corroborates the results obtained with image compression, where the superiority of wavelets over the DCT (e.g., JPEG2000 vs. JPEG) is well established.
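A bare-bones version of this experiment is sketched below (our addition) for a single Brownian path; it uses an orthonormal DCT (via SciPy, assumed available) as a stand-in for the KLT, with the identity transform as baseline. The signal length, M, and seed are illustrative.

```python
import numpy as np
from scipy.fft import dct, idct

def m_term_error(x, fwd, inv, M):
    """Relative quadratic error when keeping the M largest-magnitude
    coefficients of an orthonormal transform (minimum-error selection)."""
    c = fwd(x)
    thr = np.sort(np.abs(c))[-M]
    c = np.where(np.abs(c) >= thr, c, 0.0)
    return np.sum((x - inv(c))**2) / np.sum(x**2)

rng = np.random.default_rng(5)
x = np.cumsum(rng.standard_normal(4096))        # sampled Brownian motion

err_dct = m_term_error(x, lambda v: dct(v, norm="ortho"),
                       lambda c: idct(c, norm="ortho"), M=128)
err_id = m_term_error(x, lambda v: v, lambda c: c, M=128)
print(err_dct, err_id)     # the DCT/KLT wins in this Gaussian scenario
```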

1.3.5 Lévy’s wavelet-based synthesis of Brownian motion


We close this introductory chapter by making the connection with a multiresolution
scheme that Paul Lévy developed in the 1930s to characterize the properties of Brown-
ian motion. To do so, we adopt a point of view that is the dual of the one in Section 1.3.4: it essentially amounts to interchanging the analysis and synthesis functions. As a first step, we expand the innovation w in the orthonormal Haar basis and obtain

$$w = \sum_{i\in\mathbb{Z}} \sum_{k\in\mathbb{Z}} Z_{i,k}\, \psi_{i,k} \quad \text{with} \quad Z_{i,k} = \langle w, \psi_{i,k} \rangle.$$

This is acceptable 5 under the finite-variance hypothesis on w. Since the Haar basis is
orthogonal, the coefficients Zi,k in the above expansion are fully decorrelated, but not
necessarily independent, unless the white noise is Gaussian or the corresponding basis
functions do not overlap. We then construct the Lévy process $s = \mathrm{D}_0^{-1} w$ by integrating the wavelet expansion of the innovation, which yields

$$s(t) = \sum_{i\in\mathbb{Z}} \sum_{k\in\mathbb{Z}} Z_{i,k}\, \mathrm{D}_0^{-1}\psi_{i,k}(t) = \sum_{i\in\mathbb{Z}} \sum_{k\in\mathbb{Z}} 2^{i/2 - 1}\, Z_{i,k}\, \phi_{i,k}(t). \qquad (1.24)$$

The representation (1.24) is of special interest when the noise is Gaussian, in which
case the coefficients Zi,k are i.i.d. and follow a standardized Gaussian distribution. The
formula then maps into Lévy’s recursive mid-point method of synthesizing Brownian
motion, which Yves Meyer singles out as the first use of wavelets to be found in the
literature (see [JMR01, pp. 20–24]). The Faber–Schauder expansion (1.24) stands out as
a localized, practical alternative to Wiener’s original construction of Brownian motion,
which involves a sum of harmonic cosines (KLT-type expansion).
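A minimal sketch of Lévy's recursive mid-point synthesis follows (our addition); it is mathematically equivalent to truncating (1.24) at scale $2^{-J}$, with J and the seed chosen arbitrarily.

```python
import numpy as np
rng = np.random.default_rng(3)

def brownian_midpoint(J=12):
    """Brownian motion on [0, 1] by recursive mid-point refinement:
    B((a+b)/2) = (B(a) + B(b))/2 + N(0, (b - a)/4)."""
    b = np.array([0.0, rng.standard_normal()])   # B(0) = 0, B(1) ~ N(0, 1)
    for j in range(J):                           # interval length is 2**(-j)
        mid = (b[:-1] + b[1:]) / 2.0
        mid += rng.normal(scale=np.sqrt(2.0**-(j + 2)), size=mid.size)
        out = np.empty(2 * b.size - 1)
        out[0::2], out[1::2] = b, mid
        b = out
    return b                                     # samples at k / 2**J

b = brownian_midpoint()
```

The conditional mid-point variance $(b - a)/4$ is the standard Brownian-bridge value; it is exactly what the factor $2^{i/2-1}$ multiplying the i.i.d. standardized coefficients in (1.24) encodes.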

1.4 Historical notes: Paul Lévy and his legacy

Paul Lévy is a highly original thinker who ended up being one of the most influen-
tial figures of modern probability theory [Tay75, Loè73]. Among his many contribu-
tions to the field are the introduction of the characteristic function as an analytical
tool, the characterization of the limit of sums of independent random variables with
unbounded variance (stable distributions), and the investigation of infinitely divisible
laws which ultimately led to the specification of the complete family of additive – or
Lévy – processes. In this latter respect, Michel Loève singles out his 1934 report on
the integration/summation of independent random components [Lév34] as one of the
most important probability papers ever published. There, Lévy is bold enough to make
the transition from a discrete to a continuous indexing in a running sum. This results
in the construction of a random function that is a generalization of Brownian motion
and one of the earliest instances of a non-Gaussian stochastic process. If one leaves the

5 The convergence in the sense of distributions is ensured since the wavelet coefficients of a rapidly decaying
test function ϕ are rapidly decaying as well.

mathematical technicalities aside, this is very much in the spirit of (1.2), except for the
presence of the more general weighting kernel h.
During his tenure as professor at the prestigious École Polytechnique in Paris from
1920 to 1959, Paul Lévy supervised only four Ph.D. students. 6 Every one of them turned out to be a brilliant scientist whose work is intertwined with the material presented
in this book.
The first, Wolfgang Döblin (Ph.D. 1938 at age 23; co-advised by Maurice Fréchet), was a German Jew who acquired French citizenship in 1936. Döblin was an extraordinarily gifted mathematician. His short career ended tragically on the front of World
War II when he took his own life to avoid being captured by the German troops entering
France. Yet, during the year he served as a French soldier, he was able to make funda-
mental contributions to the theory of Markov processes and stochastic integration that
predate the work of Itô; these were discovered in 2000 in a sealed envelope deposited
at the French Academy of Sciences – see [BY02] as well as [Pet05] for a romanced
account of Döblin’s life.
Lévy’s second student, Michel Loève (Ph.D. 1941; co-advised by Maurice Fréchet),
is a prominent name in modern probability theory. He was a famous professor at Berke-
ley who is best known for the development of the spectral representation of second-
order stationary processes (the Karhunen–Loève transform).
The third student was Benoit B. Mandelbrot (Ph.D. 1952), who is universally reco-
gnized as the inventor of fractals. In his early work, Mandelbrot introduced the use of
non-Gaussian random walks – that is, Lévy processes – into financial statistics. In parti-
cular, he showed that the rate of change of prices in markets was much better character-
ized by alpha-stable distributions (which are heavy-tailed) than by Gaussians [Man63].
Interestingly, it is also statistics, albeit Gaussian statistics, that led him to the discovery
of fractals when he characterized the class of self-similar processes known as fraction-
al Brownian motion (fBm). While fBm corresponds to a fractional-order integration of
white Gaussian noise, the construction is somewhat technical for it involves the res-
olution of a singular integral. 7 The relevance to the present study is that an important
subclass of sparse processes is made up by the non-Gaussian cousins of fBms and their
multidimensional extension (see Section 7.5).
Lévy’s fourth and last student, Georges Matheron (Ph.D. 1958), was the founding
father of the field of geostatistics. Being interested in the prediction of ore concen-
tration, he developed a statistical method for the interpolation of random fields from
non-uniform samples [Mat63]. His method, called kriging, uses the prior knowledge
that the field is a Brownian motion and determines the interpolant that minimizes the
mean-square estimation error. Interestingly, the solution, which is specified by a space-
dependent regression equation, happens to be a spline function whose type is determined

6 Source: Mathematics Genealogy Project at http://genealogy.math.ndsu.nodak.edu/.


7 Retrospectively, we cannot help observing the striking parallel between the stochastic integral that
defines fBm (see Equation (7.48)) and the Lévy–Khintchine representation of alpha-stable laws (an area
in which Mandelbrot was obviously an expert), which involves the same kind of singularity (see the Lévy
density v(a) ∝ 1/a1+α in the last line of Table 4.1).

by the correlation structure (variogram) of the underlying field. There is also an intimate
link between kriging and data-approximation methods based on radial basis functions
and/or reproducing-kernel Hilbert spaces [Mye92] – in particular, thin-plate splines that
are associated with the Laplace operator [Wah90]. While the Gaussian hypothesis is
implicit to Matheron’s work, it is arguably the earliest link established between splines
and stochastic processes.
2 Roadmap to the book

The writing of this book was motivated by our desire to formalize and extend the ideas
presented in Section 1.3 to a class of differential operators much broader than the de-
rivative D. Concretely, this translates into the investigation of the family of stochastic
processes specified by the general innovation model that is summarized in Figure 2.1.
The corresponding generator of random signals (upper part of the diagram) has two
fundamental components: (1) a continuous-domain noise excitation w, which may be
thought of as being composed of a continuum of independent identically distributed
(i.i.d.) random atoms (innovations), and (2) a deterministic mixing procedure (formally
described by L−1 ) which couples the random contributions and imposes the correlation
structure of the output. The concise description of the model is Ls = w, where L is the
whitening operator. The term “innovation” refers to the fact that w represents the un-
predictable part of the process. When the inverse operator L−1 is linear shift-invariant
(LSI), the signal generator reduces to a simple convolutional system which is charac-
terized by its impulse response (or, equivalently, its frequency response). Innovation
modeling has a long tradition in statistical communication theory and signal proces-
sing; it is the basis for the interpretation of a Gaussian stationary process as a filtered
version of a white Gaussian noise [Kai70, Pap91].
In the present context, the underlying objects are continuously defined. The inno-
vation model then results from defining a stochastic process (or random field when
the index variable r is a vector in Rd ) as the solution of a stochastic differential
equation (SDE) driven by a particular brand of noise. The non-standard aspect here
is that we are considering the innovation model in its greatest generality, allowing
for non-Gaussian inputs and differential systems that are not necessarily stable. We
shall argue that these extensions are essential for making this type of modeling com-
patible with the latest developments in signal processing pertaining to the use of
wavelets and sparsity-promoting reconstruction algorithms. Specifically, we shall
see that it is possible to generate a wide variety of sparse processes by replacing
the traditional Gaussian input by some more general brand of (Lévy) noise, within
the limits of mathematical admissibility. We shall also demonstrate that such pro-
cesses admit a sparse representation in a wavelet basis under the assumption that L is
scale-invariant. The difficulty there is that scale-invariant SDEs are inherently unstable
(due to the presence of poles at the origin); yet, we shall see that they can still result in
a proper specification of fractal-type processes, albeit not within the usual framework
of stationary processes. The non-trivial aspect of these generalizations is that they


Figure 2.1 Innovation model of a generalized stochastic process. The process is generated by
application of the (linear) inverse operator L−1 to a continuous-domain white-noise process w.
The generation mechanism is general in the sense that it applies to the complete family of Lévy
noises, including Gaussian noise as the most basic (non-sparse) excitation. The output process s
is stationary if and only if L−1 is shift-invariant.

necessitate the resolution of instabilities – in the form of singular integrals. This is


required not only at the system level, to allow for non-stationary processes, but also at
the stochastic level because the most interesting sparsity patterns are associated with
unbounded Lévy measures.
Before proceeding with the statistical characterization of sparse stochastic processes,
we shall highlight the central role of the operator L and make a connection with spline
theory and the construction of signal-adapted wavelet bases.

2.1 On the implications of the innovation model

To motivate our approach, we start with an informal discussion, leaving the technicali-
ties aside. The stochastic process s in Figure 2.1 is constructed by applying the (integral)
operator L−1 to some continuous-domain white noise w. In most cases of interest, L−1
has an infinitely supported impulse response which introduces long-range dependen-
cies. If we are aiming at a concise statistical characterization of s, it is essential that we
somehow invert this integration process, the ideal being to apply the operator L which
would give back the innovation signal w that is fully decoupled. Unfortunately, this is
not feasible in practice because we do not have access to the signal s(r) over the entire
domain r ∈ Rd , but only to its sampled values on a lattice or, more generally, to a series
of coefficients in some appropriate basis. Our analysis options are essentially two-fold,
as described in Sections 2.1.1 and 2.1.2.

2.1.1 Linear combination of sampled values


Given the sampled values s(k), k ∈ Zd , the best we can aim at is to implement a dis-
crete version of the operator L, which is denoted by Ld . In effect, Ld will act on the
sampled version of the signal as a digital filter. The corresponding continuous-domain
description of its impulse response is
$$\mathrm{L}_d \delta(r) = \sum_{k\in\mathbb{Z}^d} d_{\mathrm{L}}[k]\, \delta(r - k)$$

with some appropriate weights dL . To fix ideas, Ld may correspond to the numerical


version of the operator provided by the finite-difference method of approximating deri-
vatives.
The interest is now to characterize the (approximate) decoupling effect of this dis-
crete version of the whitening operator. This is quite feasible when the continuous-
domain composition of the operators Ld and L−1 is shift-invariant with impulse
response βL (r) which is assumed to be absolutely integrable (BIBO stability). In
that case, one readily finds that
u(r) = Ld s(r) = (βL ∗ w)(r), (2.1)

where
βL (r) = Ld L−1 δ(r). (2.2)

This suggests that the decoupling effect will be the strongest when the convolution ker-
nel βL is the most localized (minimum support) and closest to an impulse. 1 We call βL
the generalized B-spline associated with the operator L. For a given operator L, the chal-
lenge will be to design the most localized kernel βL , which is the way of approaching
the discretization problem that best matches our statistical objectives. The good news is
that this is a standard problem in spline theory, meaning that we can take advantage of
the large body of techniques available in this area, even though they have hardly been
applied to the stochastic setting so far.
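As a simple illustration of such a discretization, the sketch below (our addition) uses the alternating binomial taps for $\mathrm{L} = \mathrm{D}^n$ (one standard choice of $d_\mathrm{L}$, namely the n-fold unit finite difference) and applies them to the samples of a twice-integrated discrete noise, which recovers the driving sequence exactly in this idealized discrete setting.

```python
import numpy as np
from math import comb

def finite_difference_taps(n):
    """Taps d_L of the n-fold unit finite difference (a discretization of D^n)."""
    return np.array([(-1)**k * comb(n, k) for k in range(n + 1)], dtype=float)

rng = np.random.default_rng(6)
w = rng.standard_normal(1024)
s_samples = np.cumsum(np.cumsum(w))       # discrete double integration of noise
u = np.convolve(s_samples, finite_difference_taps(2), mode="valid")  # L_d s

print(np.max(np.abs(u - w[2:])))          # ~0: exact decoupling in this example
```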

2.1.2 Wavelet analysis


The second option is to analyze the signal s using wavelet-like functions {ψi (·−rk )}. For
that purpose, we assume that we have at our disposal some real-valued “L-compatible”
generalized wavelets which, at a given resolution level i, are such that
ψi (r) = L∗ φi (r). (2.3)

Here, L∗ is the adjoint operator of L and φi is some smoothing kernel with good
localization properties. The interpretation is that the wavelet transform provides some
kind of multiresolution version of the operator L with the effective width of the kernels
φi increasing in direct proportion to the scale; typically, φi (r) ∝ φ0 (r/2i ). Then, the
wavelet analysis of the stochastic process s reduces to
$$\langle s, \psi_i(\cdot - r_0) \rangle = \langle s, \mathrm{L}^*\phi_i(\cdot - r_0) \rangle = \langle \mathrm{L}s, \phi_i(\cdot - r_0) \rangle = \langle w, \phi_i(\cdot - r_0) \rangle = (\phi_i^\vee * w)(r_0), \qquad (2.4)$$

where φi∨ (r) = φi (−r) is the reversed version of φi . The remarkable aspect is that
the effect is essentially the same as in (2.1), so that it makes good sense to develop a
common framework to analyze white noise.

1 One may be tempted to pretend that βL is a Dirac impulse, which amounts to neglecting all discretization effects. Unfortunately, this is incorrect and most likely to result in false statistical conclusions. In fact, we shall see that the localization deteriorates as the order of the operator increases, inducing higher (Markov) orders of dependencies.

This is all nice in principle as long as one can construct “L-compatible” wavelet
bases. For instance, if L is a pure nth-order derivative operator – or by extension, a
scale-invariant differential operator – then the above reasoning is directly applicable to
conventional wavelet bases. Indeed, these are known to behave like multiscale versions
of derivatives due to their vanishing-moment property [Mey90, Dau92, Mal09]. In prior
work, we have linked this behavior, as well as a number of other fundamental wavelet
properties, to the polynomial B-spline convolutional factor that is necessarily present
in every wavelet that generates a multiresolution basis of L2 (R) [UB03]. What is not
so widely known is that the spline connection extends to a much broader variety of
operators – not necessarily scale-invariant – and that it also provides a general recipe
for constructing wavelet-like basis functions that are matched to some given operator L.
This has been demonstrated in one dimension (1-D) for the entire family of ordinary dif-
ferential operators [KU06]. The only significant difference with the conventional theory
of wavelets is that the smoothing kernels φi are not necessarily rescaled versions of each
other.
Note that the “L-compatible” property is relatively robust. For instance, if $\mathrm{L} = \mathrm{L}'\mathrm{L}_0$, then an “L-compatible” wavelet is also $\mathrm{L}'$-compatible with $\phi_i' = \mathrm{L}_0^* \phi_i$. The design chal-
lenge in the context of stochastic modeling is thus to come up with a suitable wavelet
basis such that φi in (2.3) is most localized – possibly, of compact support.

2.2 Organization of the book

The reasoning of Section 2.1 is appealing because of its conceptual simplicity and
generality. Yet, the precise formulation of the theory requires some special care because
the underlying stochastic objects are infinite-dimensional and possibly highly singular.
For instance, we are faced with a major difficulty at the outset because the continuous-
domain input of our model (the innovation w) does not admit a conventional interpre-
tation as a function of the domain variable r. This entity can only be probed indirectly
by forming scalar products with test functions in accordance with Laurent Schwartz’
theory of distributions, so that the use of advanced mathematics is unavoidable.
For the benefit of readers who would be unfamiliar with concepts used in this book,
we provide the relevant mathematical background in Chapter 3, which also serves the
purpose of introducing the notation. The first part is devoted to the definition of the
relevant function spaces, with special emphasis on generalized functions (a.k.a. tem-
pered distributions), which play a central role in our formulation. The second part
reviews the classical, finite-dimensional tools of probability theory and shows how
some concepts (e.g., characteristic function, Bochner’s theorem) are extendable to the
infinite-dimensional setting within the framework of Gelfand’s theory of generalized
stochastic processes [GV64].
Chapter 4 is devoted to the mathematical specification of the innovation model. Since
the theory gravitates around the notion of Lévy exponents, we start with a systema-
tic investigation of such functions, denoted by f(ω), which are fundamental to the
(classical) study of infinitely divisible probability laws. In particular, we discuss their

canonical representation given by the Lévy–Khintchine formula. In Section 4.4, we


make use of the powerful Minlos–Bochner theorem to transfer those representations
to the infinite-dimensional setting. The fundamental result of our theory is that every
admissible continuous-domain innovation for the model in Figure 2.1 belongs to the
so-called family of white Lévy noises. This implies that an innovation process is com-
pletely characterized by its Lévy exponent f(ω). We conclude the chapter with the pre-
sentation of mathematical criteria for the existence of solutions of Lévy-driven SDEs
(stochastic differential equations) and provide the functional tools for the complete sta-
tistical characterization of these processes. Interestingly, the classical Gaussian pro-
cesses are covered by the formulation (by setting f(ω) = − 12 ω2 ), but they turn out
to be the only non-sparse members of the family.
Besides the random excitation w, the second fundamental component of the innova-
tion model in Figure 2.1 is the inverse L−1 of the whitening operator L. It must fulfill
some continuity/boundedness condition in order to yield a proper solution of the under-
lying SDE. The construction of such inverses (shaping filters) is the topic of Chapter 5,
which presents a systematic catalog of the solutions that are currently available, inclu-
ding recent constructs for scale-invariant/unstable SDEs.
In Chapter 6, we review the tools that are available from the theory of splines in
relation to the specification of the analysis kernels in Equations (2.1) and (2.3). The tech-
niques are quite generic and applicable to any operator L that admits a proper inverse
L−1 . Moreover, by writing a generalized B-spline as βL = Ld L−1 δ, one can appreciate
that the construction of a B-spline for some operator L implicitly provides the solution
of two innovation-related problems at once: (1) the formal inversion of the operator L
(for solving the SDE) and (2) the proper discretization of L through a finite-difference
scheme. The leading thread in our formulation is that these two tasks should not be dis-
sociated – this is achieved formally via the identification of βL , which actually results
in simplified and streamlined mathematics. Remarkably, these generalized B-splines are
also the key for constructing wavelet-like basis functions that are “L-compatible.”
In Chapter 7, we apply our framework to the functional specification of a variety of
generalized stochastic processes, including the classical family of Gaussian stationary
processes and their sparse counterparts. We also characterize non-stationary processes
that are solutions of unstable SDEs. In particular, we describe higher-order extensions
of Lévy processes, as well as a whole variety of fractal-type processes.
In Chapter 8, we rely on our functional characterization to obtain a maximally decou-
pled representation of sparse stochastic processes by application of the discretized
version of the whitening operator or by suitable wavelet expansion. Based on the
characteristic form of these processes, we are able to deduce the transform-domain
statistics and to precisely assess residual dependencies. These ideas are illustrated with
examples of sparse processes for which operator-like wavelets outperform the classical
KLT (or DCT) and result in an independent component analysis.
An implicit property of the innovation model is that the statistical distribution of the
inner product between a sparse stochastic process and a particular basis function (e.g.,
wavelet) is uniquely characterized by a “modified” Lévy exponent. Our main point in
Chapter 9 is to use this result to show that the sparsity of the input noise is transferred to

the transformed domain. Apart from a shaping effect that can be quantified, the resulting
probability density function remains within the same family of infinitely divisible laws.
In the final part of the book, we illustrate the use of these stochastic models (and
the corresponding analytical tools) with the formulation of algorithms for the recov-
ery of signals and images from incomplete, noisy measurements. In Chapter 10, we
develop a general framework for the discretization of linear inverse problems in a
B-spline basis, which is analogous to the finite-element method for solving PDEs.
The central element is the “projection” of the continuous-domain stochastic model
onto the (finite-dimensional) reconstruction space in order to specify the prior sta-
tistical distribution of the signal. This naturally yields the maximum a posteriori
solution to the signal-reconstruction problem. The framework is illustrated with the
derivation of practical algorithms for magnetic resonance imaging, deconvolution
microscopy, and tomographic reconstruction. Remarkably, the resulting MAP estimators are compatible with the non-quadratic regularization schemes (e.g., $\ell_1$-minimization, LASSO, and/or non-convex $\ell_p$ relaxation) that are currently in favor
in imaging. To get a handle on the quality of the reconstruction, we then rely on the
innovation model to investigate the extent to which one is able to “optimally” denoise
sparse signals. In particular, we demonstrate the feasibility of MMSE reconstruction
when the signal belongs to the class of Lévy processes, which provides us with a gold
standard against which to compare other algorithms.
In Chapter 11, we present alternative wavelet-based reconstruction methods that are
typically faster than the fixed-scale techniques of Chapter 10. These methods capitalize
on the orthonormality of the wavelet basis, which provides direct control of the norm
of the signal. We show that the underlying optimization task is amenable to iterative
thresholding algorithms (ISTA or FISTA) which are simple to deploy and well suited
for large-scale problems. We also investigate the effect of cycle spinning, which is a
fundamental ingredient for making wavelets competitive in terms of image quality. Our
closing topic is the use of statistical modeling for the improvement of standard wavelet-
based denoising – in particular, the optimization of the wavelet-domain thresholding
functions and the search of a consensus solution across multiple wavelet expansions in
order to minimize the global estimation error.
3 Mathematical context
and background

In this chapter we summarize some of the mathematical preliminaries for the remai-
ning chapters. These concern the function spaces used in the book, duality, generalized
functions, probability theory, and generalized random processes. Each of these topics is
discussed in a separate section.
For the most part, the theory of function spaces and generalized functions can be
seen as an infinite-dimensional generalization of linear algebra (function spaces gener-
alize Rn , and continuous linear operators generalize matrices). Similarly, the theory of
generalized random processes involves the generalization of the idea of a finite random
vector in Rn to an element of an infinite-dimensional space of generalized functions.
To give a taste of what is to come, we briefly compare finite- and infinite-dimensional
theories in Tables 3.1 and 3.2. The idea, in a nutshell, is to replace vectors by (genera-
lized) functions. Formally, this extension amounts to replacing some finite sums (in the
finite-dimensional formulation) by integrals. Yet, in order for this to be mathematically
sound, one needs to properly define the underlying objects as elements of some infinite-
dimensional vector space, to specify the underlying notion(s) of convergence (which is
not an issue in Rn ) while ensuring that some basic continuity conditions are met.
Fundamental to our formulation is the material on infinite-dimensional probability
theory from Section 3.4.4 to the end of the chapter. The mastery of those notions requires
a good understanding of function spaces and generalized functions, which are covered
in the first part of the chapter. The impatient reader who is not directly concerned with
the full mathematical details may skip what follows the tables at first reading and consult
the relevant sections later as needed.

3.1 Some classes of function spaces

By the term “function” we shall mean elements of various function spaces. At a mini-
mum, a function space is a set X along with some criteria for determining, first, whether
or not a given “function” ϕ = ϕ(r) belongs to X (in mathematical notation, ϕ ∈ X )
and, second, given ϕ, ϕ0 ∈ X , whether or not ϕ and ϕ0 describe the same object in X
(in mathematical notation, ϕ = ϕ0 ). Most often, in addition to these, the space X has
additional structure (see below).
In this book we shall deal largely with two types of function spaces: complete normed
spaces such as Lebesgue Lp spaces, and nuclear spaces such as the Schwartz space S

Table 3.1 Comparison of notions of linear algebra with those of functional analysis and the theory of distributions (generalized functions). See Sections 3.1–3.3 for an explanation.

Finite-dimensional theory (linear algebra) ↔ infinite-dimensional theory (functional analysis):
• Euclidean space $\mathbb{R}^N$, complexification $\mathbb{C}^N$ ↔ function spaces such as the Lebesgue space $L_p(\mathbb{R}^d)$ and the space of tempered distributions $\mathcal{S}'(\mathbb{R}^d)$, among others.
• Vector $x = (x_1, \ldots, x_N)$ in $\mathbb{R}^N$ or $\mathbb{C}^N$ ↔ function $f(r)$ in $\mathcal{S}'(\mathbb{R}^d)$, $L_p(\mathbb{R}^d)$, etc.
• Scalar product $\langle x, y \rangle = \sum_{n=1}^{N} x_n y_n$ ↔ bilinear scalar product $\langle \varphi, g \rangle = \int \varphi(r)\, g(r)\, dr$, with $\varphi \in \mathcal{S}(\mathbb{R}^d)$ (test function) and $g \in \mathcal{S}'(\mathbb{R}^d)$ (generalized function), or $\varphi \in L_p(\mathbb{R}^d)$ and $g \in L_q(\mathbb{R}^d)$ with $\frac{1}{p} + \frac{1}{q} = 1$, for instance.
• Equality $x = y \Leftrightarrow x_n = y_n \Leftrightarrow \langle u, x \rangle = \langle u, y \rangle\ \forall u \in \mathbb{R}^N \Leftrightarrow \|x - y\|_2 = 0$ ↔ various notions of equality (depending on the space), such as weak equality of distributions, $f = g \in \mathcal{S}'(\mathbb{R}^d) \Leftrightarrow \langle \varphi, f \rangle = \langle \varphi, g \rangle$ for all $\varphi \in \mathcal{S}(\mathbb{R}^d)$, and almost-everywhere equality, $f = g \in L_p(\mathbb{R}^d) \Leftrightarrow \int_{\mathbb{R}^d} |f(r) - g(r)|^p\, dr = 0$.
• Linear operators $\mathbb{R}^N \to \mathbb{R}^M$, $y = Ax \Rightarrow y_m = \sum_{n=1}^{N} a_{mn} x_n$ ↔ continuous linear operators $\mathcal{S}(\mathbb{R}^d) \to \mathcal{S}'(\mathbb{R}^d)$, $g = A\varphi \Rightarrow g(r) = \int_{\mathbb{R}^d} a(r, s)\, \varphi(s)\, ds$ for some $a \in \mathcal{S}'(\mathbb{R}^d \times \mathbb{R}^d)$ (Schwartz' kernel theorem).
• Transpose $\langle x, Ay \rangle = \langle A^{\mathrm{T}} x, y \rangle$ ↔ adjoint $\langle \varphi, Ag \rangle = \langle A^* \varphi, g \rangle$.

Table 3.2 Comparison of notions of finite-dimensional statistical calculus with the theory of generalized stochastic processes. See Section 3.4 for an explanation.

Finite-dimensional ↔ infinite-dimensional:
• Random variable $X$ in $\mathbb{R}^N$ ↔ generalized stochastic process $s$ in $\mathcal{S}'$.
• Probability measure $\mathscr{P}_X$ on $\mathbb{R}^N$: $\mathscr{P}_X(E) = \mathrm{Prob}(X \in E) = \int_E p_X(x)\, dx$ ($p_X$ is a generalized [i.e., hybrid] pdf), for suitable subsets $E \subset \mathbb{R}^N$ ↔ probability measure $\mathscr{P}_s$ on $\mathcal{S}'$: $\mathscr{P}_s(E) = \mathrm{Prob}(s \in E) = \int_E \mathscr{P}_s(\mathrm{d}g)$, for suitable subsets $E \subset \mathcal{S}'$.
• Characteristic function $\widehat{\mathscr{P}}_X(\omega) = \mathbb{E}\{e^{j\langle \omega, X \rangle}\} = \int_{\mathbb{R}^N} e^{j\langle \omega, x \rangle}\, p_X(x)\, dx$, $\omega \in \mathbb{R}^N$ ↔ characteristic functional $\widehat{\mathscr{P}}_s(\varphi) = \mathbb{E}\{e^{j\langle \varphi, s \rangle}\} = \int_{\mathcal{S}'} e^{j\langle \varphi, g \rangle}\, \mathscr{P}_s(\mathrm{d}g)$, $\varphi \in \mathcal{S}$.

and the space D of compactly supported test functions, as well as their duals S′ and D′, which are spaces of generalized functions. These two categories of spaces (complete-
normed and nuclear) cannot overlap, except in finite dimensions. Since the function
spaces that are of interest to us are infinite-dimensional (they do not have a finite vector-
space basis), the two categories are mutually exclusive.
The structure of each of the aforementioned spaces has two aspects. First, as a vector
space over the real numbers or its complexification, the space has an algebraic structure.
Second, with regard to the notions of convergence and taking of limits, the space has
a topological structure. The algebraic structure lends meaning to the idea of a linear
operator on the space, while the topological structure gives rise to the concept of a
continuous operator or map, as we shall see shortly.
All the spaces considered here have a similar algebraic structure. They are either vec-
tor spaces over R, meaning that, for any ϕ, ϕ0 in the space and any a ∈ R, the operations
of addition ϕ → ϕ + ϕ0 and multiplication by scalars ϕ → aϕ are defined and map the
space (denoted henceforth by X ) into itself. Or, we may take the complexification of a
real vector space X , composed of elements of the form ϕ = ϕr + jϕi , with ϕr , ϕi ∈ X
and j denoting the imaginary unit. The complexification is then a vector space over C. In
the remainder of the book, we shall denote a real vector space and its complexification
by the same symbol. The distinction, when important, will be clear from the context.
For the spaces with which we are concerned in this book, the topological structure
is completely specified by providing a criterion for the convergence of sequences. 1 By
this we mean that, for any given sequence (ϕi ) in X and any ϕ ∈ X , we are equipped
with the knowledge of whether or not ϕ is the limit of (ϕi ). A topological space is a set
X with topological structure. For normed spaces, the said criterion is given in terms of
a norm, while in nuclear spaces it is given in terms of a family of seminorms, as we shall
discuss below. But before that, let us first define linear and continuous operators.
An operator is a mapping from one vector space to another; that is, a rule that asso-
ciates an output function A{ϕ} ∈ Y (also written as Aϕ) with each input ϕ ∈ X .

DEFINITION 3.1 (Linear operator) An operator A : X → Y where X and Y are


vector spaces is called linear if, for any ϕ, ϕ0 ∈ X and a, b ∈ R (or C),

A{aϕ + bϕ0 } = aA{ϕ} + bA{ϕ0 }. (3.1)

DEFINITION 3.2 (Continuous operator) Let X, Y be topological spaces. An operator A : X → Y is called sequentially continuous (with respect to the topologies of X and Y) if, for any convergent sequence (ϕi) in X with limit ϕ ∈ X, the sequence (A{ϕi}) converges to A{ϕ} in Y, that is,

$$\lim_i A\{\varphi_i\} = A\{\lim_i \varphi_i\}.$$

The above definition of continuity coincides with the stricter topological definition for
spaces we are interested in.
1 This is in contrast with those topological spaces where one needs to consider generalizations of the notion
of a sequence involving partially ordered sets (the so-called nets or filters). Spaces in which a knowledge
of sequences suffices are called sequential.

We shall assume that the topological structure of our vector spaces is such that the
operations of addition and multiplication by scalars in R (or C) are continuous. With
this compatibility condition, our object is called a topological vector space.
Having defined the two types of structure (algebraic and topological) and their rela-
tion with operators in abstract terms, let us now show concretely how the topological
structure is defined for some important classes of spaces.

3.1.1 About the notation: mathematics vs. engineering


So far, we have considered a function in abstract terms as an element of a vector space:
ϕ ∈ X . The more conventional view is that of a map ϕ : Rd → R (or C) that associates
a value ϕ(r) with each point r = (r1 , . . . , rd ) ∈ Rd . Following the standard convention
in engineering, we shall therefore also use the notation ϕ(r) [instead of ϕ(·) or ϕ] to
represent the function using r as our generic d-dimensional index variable, the norm of which is denoted by $|r|^2 = \sum_{i=1}^{d} |r_i|^2$. This is to be contrasted with the point values (or
which is denoted by |r|2 = di=1 |ri |2 . This is to be contrasted with the point values (or
samples) of ϕ which will be denoted using subscripted index variables; i.e., ϕ(rk ) stands
for the value of ϕ at r = rk . Likewise, ϕ(r − r0 ) = ϕ(· − r0 ) refers to the function ϕ
shifted by r0 .
A word of caution is in order here. While the engineering notation has the advantage
of being explicit, it can also be felt as being abusive because the point values of ϕ are
not necessarily well defined, especially when the function presents discontinuities, not
to mention the case of generalized functions that do not have a pointwise interpretation.
ϕ(r) should therefore be treated as an alternative notation for ϕ that reminds us of the
domain of the function, and not be interpreted literally.

3.1.2 Normed spaces


A norm on X is a map X → R, usually denoted as $\varphi \mapsto \|\varphi\|$ (with indices used if needed to distinguish between different norms), which fulfills the following properties for all a ∈ R (or C) and ϕ, ϕ0 ∈ X.
• $\|\varphi\| \ge 0$ (non-negativity)
• $\|a\varphi\| = |a|\, \|\varphi\|$ (positive homogeneity)
• $\|\varphi + \varphi_0\| \le \|\varphi\| + \|\varphi_0\|$ (triangular inequality)
• $\|\varphi\| = 0$ implies ϕ = 0 (separation of points).
By relaxing the last requirement we obtain a seminorm.
A normed space is a vector space X equipped with a norm.
A sequence (ϕi) in a normed space X is said to converge to ϕ (in the topology of X), in symbols $\lim_i \varphi_i = \varphi$, if and only if $\lim_i \|\varphi - \varphi_i\| = 0$.

Let (ϕi) be a sequence in X such that, for any $\varepsilon > 0$, there exists an N ∈ N with $\|\varphi_i - \varphi_j\| < \varepsilon$ for all i, j ≥ N.
Such a sequence is called a Cauchy sequence. A normed space X is complete if it does
not have any holes, in the sense that, for every Cauchy sequence in X , there exists a
ϕ ∈ X such that limi ϕi = ϕ (in other words if every Cauchy sequence has a limit in
X ). A normed space that is not complete can be completed by introducing new points
corresponding to the limits of equivalent Cauchy sequences. For example, the real line
is the completion of the set of rational numbers with respect to the absolute-value norm.

Examples
Important examples of complete normed spaces are the Lebesgue spaces. The Lebesgue spaces $L_p(\mathbb{R}^d)$, $1 \le p \le \infty$, are composed of functions whose $L_p(\mathbb{R}^d)$ norm, denoted as $\|\cdot\|_p$, is finite, where

$$\|\varphi\|_{L_p} = \begin{cases} \left( \displaystyle\int_{\mathbb{R}^d} |\varphi(r)|^p\, dr \right)^{1/p}, & \text{for } 1 \le p < \infty \\ \operatorname{ess\,sup}_{r\in\mathbb{R}^d} |\varphi(r)|, & \text{for } p = \infty \end{cases}$$

and where two functions that are equal almost everywhere are considered to be equivalent.
We may also define weighted $L_p$ spaces by replacing the shift-invariant Lebesgue measure (dr) by a weighted measure $w(r)\, dr$ in the above definitions. In that case, w(r) is assumed to be a measurable function that is (strictly) positive almost everywhere. In particular, for $w(r) = 1 + |r|^\alpha$ with $\alpha \ge 0$ (or, equivalently, $w(r) = (1 + |r|)^\alpha$), we denote the associated norms as $\|\cdot\|_{p,\alpha}$, and the corresponding normed spaces as $L_{p,\alpha}(\mathbb{R}^d)$. The latter spaces are useful when characterizing the decay of functions at infinity. For example, $L_{\infty,\alpha}(\mathbb{R}^d)$ is the space of functions that are bounded almost everywhere (a.e.) by a constant multiple of $\frac{1}{1 + |r|^\alpha}$.
Some remarkable inclusion properties of $L_{p,\alpha}(\mathbb{R}^d)$, $1 \le p \le \infty$, $\alpha > 0$, are
• $\alpha > \alpha_0$ implies $L_{p,\alpha}(\mathbb{R}^d) \subset L_{p,\alpha_0}(\mathbb{R}^d)$.
• $L_{\infty, \frac{d}{p} + \epsilon}(\mathbb{R}^d) \subset L_p(\mathbb{R}^d)$ for any $\epsilon > 0$.

Finally, we define the space of rapidly decaying functions, R (Rd ), as the intersection
of all L∞,α (Rd ) spaces, α > 0, or, equivalently, as the intersection of all L∞,α (Rd ) with
α ∈ N. In other words, R (Rd ) contains all bounded functions that essentially decay
faster than 1/|r|α at infinity for all α ∈ R+ . A sequence (fi ) converges in (the topology
of) R (Rd ) if and only if it converges in all L∞,α (Rd ) spaces.
The causal exponential $\rho_\alpha(r) = 1_{[0,\infty)}(r)\, e^{\alpha r}$ with $\mathrm{Re}(\alpha) < 0$ that is central to linear systems theory is a prototypical example of a function included in R(R).
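To verify this membership (a one-line check added for completeness): for every $N \in \mathbb{N}$,

$$\sup_{r \in \mathbb{R}}\, (1 + |r|^N)\, |\rho_\alpha(r)| = \sup_{r \ge 0}\, (1 + r^N)\, e^{\mathrm{Re}(\alpha)\, r} < \infty,$$

since the decaying exponential dominates any polynomial, so that $\rho_\alpha \in L_{\infty,N}(\mathbb{R})$ for all N and hence $\rho_\alpha \in R(\mathbb{R})$.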

3.1.3 Nuclear spaces


Defining nuclear spaces is neither easy nor particularly intuitive. Fortunately, for our
purpose in this book, knowing the definition is not necessary. We shall simply assert

that certain function spaces are nuclear, in order to use certain results that are true for
nuclear spaces (specifically, the Minlos–Bochner theorem; see below). For the sake of
completeness, a general definition of nuclear spaces is given at the end of this section,
but this definition may safely be skipped without compromising the presentation.
Specifically, it will be sufficient for us to know that the spaces D (Rd ) and S (Rd ),
which we shall shortly define, are nuclear, as are the Cartesian products and powers of
nuclear spaces, and their closed subspaces.
To define these spaces, we need to identify their members, as well as the criterion of
convergence for sequences in the space.

The space D (Rd )


The space of compactly supported smooth test functions is denoted by D (Rd ). It consists
of infinitely differentiable functions with compact support in Rd . To define its topology,
we provide the following criterion for convergence in D (Rd ):
A sequence (ϕi ) of functions in D (Rd ) is said to converge (in the topology of
D (Rd )) if

(1) there exists a compact (here, meaning closed and bounded) subset K of Rd such that
all ϕi are supported inside K; and
(2) the sequence (ϕi) converges in all of the seminorms

$$\|\varphi\|_{K,n} = \sup_{r\in K} |\partial^n \varphi(r)| \quad \text{for all } n \in \mathbb{N}^d.$$

Here, $n = (n_1, \ldots, n_d) \in \mathbb{N}^d$ is what is called a multi-index, and $\partial^n$ is shorthand for the partial derivative $\partial_{r_1}^{n_1} \cdots \partial_{r_d}^{n_d}$. We take advantage of the present opportunity also to introduce two other notations: $|n|$ for $\sum_{i=1}^{d} |n_i|$ and $r^n$ for the product $\prod_{i=1}^{d} r_i^{n_i}$.
The space D (Rd ) is nuclear (for a proof, see for instance [GV64]).
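A standard example of an element of D(Rd) is the bump function

$$\varphi(r) = \begin{cases} \exp\!\left( \dfrac{-1}{1 - |r|^2} \right), & \text{for } |r| < 1 \\ 0, & \text{for } |r| \ge 1, \end{cases}$$

which is infinitely differentiable (all of its derivatives vanish as $|r| \to 1$) and is supported in the closed unit ball.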

The Schwartz space S (Rd )


The Schwartz space or the space of so-called smooth and rapidly decaying test func-
tions, denoted as S (Rd ), consists of infinitely differentiable functions ϕ on Rd for which
all of the seminorms defined below are finite:

ϕm,n = sup |rm ∂ n ϕ(r)| for all m, n ∈ Nd .


r∈Rd

In other words, S (Rd ) is populated by functions that, together with all of their deriva-
tives, decay faster than the inverse of any polynomial at infinity.
The topology of S (Rd ) is defined by positing that a sequence (ϕi ) converges in
S (Rd ) if and only if it converges in all of the above seminorms.
The Schwartz space has the remarkable property that its complexification is invari-
ant under the Fourier transform. In other words, the Fourier transform, defined by the
integral

$$\varphi(r) \mapsto \widehat{\varphi}(\omega) = \mathcal{F}\{\varphi\}(\omega) = \int_{\mathbb{R}^d} e^{-j\langle r, \omega \rangle}\, \varphi(r)\, dr$$

and inverted by

$$\widehat{\varphi}(\omega) \mapsto \varphi(r) = \mathcal{F}^{-1}\{\widehat{\varphi}\}(r) = \int_{\mathbb{R}^d} e^{j\langle r, \omega \rangle}\, \widehat{\varphi}(\omega)\, \frac{d\omega}{(2\pi)^d},$$
is a continuous map from S (Rd ) into itself. Our convention here is to use ω ∈ Rd as
the generic Fourier-domain index variable.
In addition, both S (Rd ) and D (Rd ) are closed and continuous under differentiation
of any order and multiplication by polynomials. Lastly, they are included in R (Rd ) and
hence in all the Lebesgue spaces, Lp (Rd ), which do not require any smoothness.
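A standard example of an element of S(Rd) is the Gaussian $\varphi(r) = e^{-|r|^2/2}$: every $r^m \partial^n \varphi(r)$ is a polynomial multiplied by a Gaussian and is therefore bounded. With the sign and normalization conventions adopted above, it satisfies

$$\mathcal{F}\{e^{-|\cdot|^2/2}\}(\omega) = (2\pi)^{d/2}\, e^{-|\omega|^2/2},$$

so that the Gaussian is, up to a constant, its own Fourier transform, which provides a convenient consistency check of the transform pair.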

General definition of nuclear spaces


Defining a nuclear space requires us to define nuclear operators. These are operators
that can be approximated by operators of finite rank in a certain sense (an operator
between vector spaces is of finite rank if its range is finite-dimensional).
We first recall the notation $\ell_p(\mathbb{N})$, $1 \le p < \infty$, for the space of p-summable sequences; that is, sequences $c = (c_i)_{i\in\mathbb{N}}$ for which $\sum_{i\in\mathbb{N}} |c_i|^p$ is finite. We also denote by $\ell_\infty(\mathbb{N})$ the space of all bounded sequences.


In a complete normed space Y, let $(\psi_i)_{i\in\mathbb{N}}$ be a sequence with bounded norm (i.e., $\|\psi_i\| \le M$ for some M ∈ R and all i ∈ N). We then denote by $M_\psi$ the linear operator $\ell_1(\mathbb{N}) \to Y$ which maps a sequence $c = (c_i)_{i\in\mathbb{N}}$ in $\ell_1$ to the weighted sum $\sum_{i\in\mathbb{N}} c_i \psi_i$ in Y (the sum converges in norm by the triangular inequality).


An operator A : X → Y, where X, Y are complete normed spaces, is called nuclear if there exists a continuous linear operator

$$\tilde{A} : X \to \ell_\infty : \varphi \mapsto \big( a_i(\varphi) \big),$$

an operator

$$\Lambda : \ell_\infty \to \ell_1 : (c_i) \mapsto (\lambda_i c_i)$$

where $\sum_i |\lambda_i| < \infty$, and a bounded sequence $\psi = (\psi_i)$ in Y, such that we can write

$$A = M_\psi\, \Lambda\, \tilde{A}.$$

This is equivalent to the following decomposition of A into a sum of rank 1 operators:

A : ϕ → λi ai (ϕ)ψi .
i∈N

The continuous linear operator X → Y : ϕ → λi ai (ϕ)ψi is of rank 1 because it


 maps
X into the 1-D subspace of Y spanned by ψi ; compare (ψi ) with a basis and ai (ϕ)
with the coefficients of Aϕ in this basis.

More generally, given an arbitrary topological vector space X and a complete


normed space Y , the operator A : X → Y is said to be nuclear if there exists a com-
plete normed space X1 , a nuclear operator A1 : X1 → Y , and a continuous operator
B : X → X1 , such that
A = A1 B.
Finally, X is a nuclear space if any continuous linear map X → Y , where Y is a
complete normed space, is nuclear.
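In finite dimensions, every linear operator is nuclear, and the abstract factorization A = Mψ Λ Ã reduces to a familiar matrix decomposition. The following sketch is a loose finite-dimensional analogy only, with the SVD standing in for the abstract construction (the matrix and its size are arbitrary); it writes a matrix as a sum of rank-1 terms λi ai(ϕ) ψi.

import numpy as np

# Finite-dimensional analogy of a nuclear decomposition via the SVD:
#   A x = sum_i  lambda_i * a_i(x) * psi_i,
# with lambda_i = s_i (summable), a_i(x) = <v_i, x>, and psi_i = u_i (bounded).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

x = rng.standard_normal(7)
rank1_sum = sum(s[i] * (Vt[i] @ x) * U[:, i] for i in range(len(s)))
print(np.allclose(A @ x, rank1_sum))    # True: A acts as a sum of rank-1 operators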

3.2 Dual spaces and adjoint operators

Given a space X , a functional on X is a map f that takes X to the scalar field R (or C).
In other words, f takes a function ϕ ∈ X as argument and returns the number f (ϕ).
When X is a vector space, we may consider linear functionals on it, where linearity
has the same meaning as in Definition 3.1. When f is a linear functional, it is customary to use the notation ⟨ϕ, f⟩ in place of f(ϕ).
The set of all linear functionals on X, denoted by X∗, can be given the structure of a vector space in the obvious way by the identity

\[ \langle \varphi,\, a f + b f_0 \rangle = a \langle \varphi, f \rangle + b \langle \varphi, f_0 \rangle, \]

where ϕ ∈ X, f, f0 ∈ X∗, and a, b ∈ R (or C) are arbitrary. The resulting vector space X∗ is called the algebraic dual of X.
The map from X × X∗ to R (or C) that takes the pair (ϕ, f) to their so-called scalar product ⟨ϕ, f⟩ is then bilinear, in the sense that it is linear in each of the arguments ϕ and f. Note that the reasoning about linear functionals works both ways, so that we can also switch the order of the pairing. This translates into the formal commutativity rule ⟨f, ϕ⟩ = ⟨ϕ, f⟩, with a dual interpretation of the two sides of the equality.
Given vector spaces X , Y with algebraic duals X ∗ , Y ∗ and a linear operator A :
X → Y , the adjoint or transpose of A, denoted as A∗ , is the linear operator Y ∗ → X ∗
defined by
A∗ f = f ◦ A

for any linear functional f : Y → R (or C) in Y ∗ , where ◦ denotes composition. The


motivation behind the above definition is to have the identity

\[ \langle \mathrm{A}\varphi, f \rangle = \langle \varphi, \mathrm{A}^* f \rangle \qquad (3.2) \]

hold for all ϕ ∈ X and f ∈ Y ∗ .


If X is a topological vector space, it is of interest to consider the subspace of X ∗
composed of those linear functionals on X that are continuous with respect to the
topology of X . This subspace is denoted as X  and called the topological or conti-
nuous dual of X . Note that, unlike X ∗ , the continuous dual generally depends on the
topology of X . In other words, the same vector space X with different topologies will
generally have different continuous duals.

As a general rule, in this book we shall adopt some standard topologies and only work with the corresponding continuous dual space, which we shall simply call the dual. Also, henceforth, we shall assume the scalar product ⟨·, ·⟩ to be restricted to X × X′. There, the space X may vary but is necessarily paired with its continuous dual.
Following the restrictions of the previous paragraph, we sometimes say that the adjoint of A : X → Y exists, meaning that the algebraic adjoint A∗ : Y∗ → X∗, when restricted to Y′, maps into X′, so that we can write

\[ \langle \mathrm{A}\varphi, f \rangle = \langle \varphi, \mathrm{A}^* f \rangle, \]

where the scalar products on the two sides are now restricted to Y × Y′ and X × X′, respectively.
One can define different topologies on X′ by providing various criteria for convergence. The only one we shall need to deal with is the weak-∗ topology, which specifies (for a sequential space X) that (fi) converges to f in X′ if and only if

\[ \lim_i \langle \varphi, f_i \rangle = \langle \varphi, f \rangle \qquad \text{for all } \varphi \in X. \]

This is precisely the topology of pointwise convergence for all “points” ϕ ∈ X.


We shall now mention some examples.

3.2.1 The dual of Lp spaces


The dual of the Lebesgue space Lp(Rd), 1 ≤ p < ∞, can be identified with the space Lp′(Rd), with 1 < p′ ≤ ∞ satisfying 1/p + 1/p′ = 1, by defining

\[ \langle \varphi, f \rangle = \int_{\mathbb{R}^d} \varphi(r)\, f(r)\, \mathrm{d}r \qquad (3.3) \]

for ϕ ∈ Lp(Rd) and f ∈ Lp′(Rd). In particular, L2(Rd), which is the only Hilbert space of the family, is its own dual.
To see that the linear functionals described by the above formula with f ∈ Lp′ are continuous on Lp, we can rely on Hölder’s inequality, which states that

\[ |\langle \varphi, f \rangle| \le \int_{\mathbb{R}^d} |\varphi(r)\, f(r)|\, \mathrm{d}r \le \|\varphi\|_{L_p}\, \|f\|_{L_{p'}} \qquad (3.4) \]

for 1 ≤ p, p′ ≤ ∞ and 1/p + 1/p′ = 1. The special case of (3.4) for p = 2 yields the Cauchy–Schwarz inequality.
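Hölder’s inequality is easy to probe on a grid. A minimal sketch (the functions, the grid, and the exponents p = 3, p′ = 3/2 are arbitrary illustrative choices):

import numpy as np

# Discretized check of Hoelder's inequality (3.4) with p = 3, p' = 3/2.
r = np.linspace(-5, 5, 2001); dr = r[1] - r[0]
phi = np.exp(-r**2) * np.cos(3 * r)         # a smooth, decaying function
f = 1.0 / (1.0 + r**2)                      # an L_{p'} function

p, q = 3.0, 1.5                             # 1/p + 1/q = 1
lhs = np.sum(np.abs(phi * f)) * dr
rhs = (np.sum(np.abs(phi)**p) * dr)**(1/p) * (np.sum(np.abs(f)**q) * dr)**(1/q)
print(lhs <= rhs, lhs, rhs)                 # the inequality holds on the grid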

3.2.2 The duals of D and S


In this section, we give the mathematical definition of the duals of the nuclear spaces D
and S . A physical interpretation of these definitions is postponed until Section 3.3.
The dual of D(Rd), denoted by D′(Rd), is the so-called space of distributions over Rd (although we shall use the term “distribution” more generally to mean any generalized function in the sense of Section 3.3). Ordinary locally integrable functions 2 (in particular, all Lp functions and all continuous functions) can be identified with elements of D′(Rd) by using (3.3). By this we mean that any locally integrable function f defines a continuous linear functional on D(Rd) where, for ϕ ∈ D(Rd), ⟨ϕ, f⟩ is given by (3.3).
However, not all elements of D′(Rd) can be characterized in this way. For instance, the Dirac functional δ (a.k.a. Dirac impulse), which maps ϕ ∈ D(Rd) to the value ⟨ϕ, δ⟩ = ϕ(0), belongs in D′(Rd) but cannot be written as an integral à la (3.3). Even in this and similar cases, we may sometimes write $\int_{\mathbb{R}^d} \varphi(r)\, f(r)\, \mathrm{d}r$, keeping in mind that the integral is no longer a true (i.e., Lebesgue) integral, but simply an alternative notation for ⟨ϕ, f⟩.
In similar fashion, the dual of S(Rd), denoted by S′(Rd), is defined and called the space of tempered (or Schwartz) distributions. Since D ⊂ S and any sequence that converges in the topology of D also converges in S, it follows that S′(Rd) is (can be identified with) a smaller space (i.e., a subspace) of D′(Rd). In particular, not every locally integrable function belongs in S′. For example, locally integrable functions of exponential growth have no place in S′, as their scalar product with Schwartz test functions via (3.3) is not in general finite (much less continuous). Once again, S′(Rd) contains objects that are not functions on Rd in the true sense of the word. For example, δ also belongs in S′(Rd).

3.2.3 Distinction between Hermitian and duality products



We use the notation  f, g L2 = Rd f (r)g(r) dr to represent the usual (Hermitian-
symmetric) L2 inner product. The latter is defined for f, g ∈ L2 (Rd ) (the Hilbert space
of complex finite-energy functions); it is equivalent to Schwartz’ duality product only
when the second argument is real-valued (due to the presence of complex conjugation).
The corresponding Hermitian adjoint of an operator A is denoted by AH ; it is defined as
AH f, g L2 =  f, Ag L2 =  f, Ag which implies that AH = A∗ . The distinction between
these types of adjoints is only relevant when considering signal expansions or analyses
in terms of complex basis functions.
The classical Fourier transform is defined as


f (ω) = F {f}(ω) = f (r)e−jr,ω dr
Rd

for any f ∈ L1 (Rd ). This definition admits a unique extension, F : L2 (Rd ) → L2 (Rd ),
which is an isometry map (Plancherel’s theorem). The fact that the Fourier transform
preserves the L2 norm of a function (up to a normalization factor) is a direct consequence
of Parseval’s relation,
1 
 f, g L2 = f,
g L2 ,
(2π)d

g = 
whose duality product equivalent is  f, f, g .
2 A function on Rd is called locally integrable if its integral over any closed bounded set is finite.

3.3 Generalized functions

3.3.1 Intuition and definition


We begin with some considerations regarding the modeling of physical phenomena. Let
us suppose that the object of our study is some physical quantity f that varies in relation
to some parameter r ∈ Rd representing space and/or time. We assume that our way of
obtaining information about f is by making measurements that are localized in space–
time using sensors (ϕ, ψ, . . .). We shall denote the measurement of f procured by ϕ as
⟨ϕ, f⟩. 3 Let us suppose that our sensors form a vector space, in the sense that for any two sensors ϕ, ψ and any two scalars a, b ∈ R (or C), there is a real or virtual sensor aϕ + bψ such that

\[ \langle a\varphi + b\psi,\, f \rangle = a \langle \varphi, f \rangle + b \langle \psi, f \rangle. \]

In addition, we may reasonably suppose that the phenomenon under observation has some form of continuity, meaning that

\[ \lim_i \langle \varphi_i, f \rangle = \langle \varphi, f \rangle, \]

where (ϕi ) is a sequence of sensors that tend to ϕ in a certain sense. We denote the set
of all sensors by X . In the light of the above notions of linear combinations and limits
defined in X , mathematically the space of sensors then has the structure of a topological
vector space.
Given the above properties and the definitions of the previous sections, we conclude
that f represents an element of the continuous dual X′ of X. Given that our sensors, as
previously noted, are assumed to be localized in Rd , we may model them as compactly
supported or rapidly decaying functions on Rd , denoted by the same symbols (ϕ, ψ, . . .)
and, in the case where f also corresponds to a function on Rd , relate the observation
⟨ϕ, f⟩ to the functional form of ϕ and f by the identity

\[ \langle \varphi, f \rangle = \int_{\mathbb{R}^d} \varphi(r)\, f(r)\, \mathrm{d}r. \]

We exclude from consideration those functions f for which the above integral is unde-
fined or infinite for some ϕ ∈ X .
However, we are not limited to taking f to be a true function of r ∈ Rd . By requiring
our sensor or test functions to be smooth, we can permit f to become singular; that is,
to depend on the value of ϕ and/or of its derivatives at isolated points/curves inside Rd .
An example of a singular generalized function f, which we have already noted, is the
Dirac distribution (or impulse) δ that measures the value of ϕ at the single point r = 0
(i.e., ⟨ϕ, δ⟩ = ϕ(0)).
Mathematically, we define generalized functions as members of the continuous dual X′ of a nuclear space X of functions, such as D(Rd) or S(Rd).

3 The connection with previous sections should already be apparent from this choice of notation.

Implicit to the manipulation of generalized functions is the notion of weak equality (or equality in the sense of distributions). Concretely, this means that one should interpret the statement f = g with f, g ∈ X′ as

\[ \langle \varphi, f \rangle = \langle \varphi, g \rangle \qquad \text{for all } \varphi \in X. \]

3.3.2 Operations on generalized functions


Following (3.2), any continuous linear operator D → D or S → S can be transposed to define a continuous linear operator D′ → D′ or S′ → S′. In particular, since D(Rd) and S(Rd) are closed under differentiation, we can define derivatives of distributions. First, note that, formally,

\[ \langle \partial^n \varphi, f \rangle = \langle \varphi, \partial^{n*} f \rangle. \]

Now, using integration by parts in (3.3), for ϕ, f in D(Rd) or S(Rd) we see that ∂^{n∗} = (−1)^{|n|} ∂^n. In other words, we can write

\[ \langle \varphi, \partial^n f \rangle = (-1)^{|n|}\, \langle \partial^n \varphi, f \rangle. \qquad (3.5) \]

The idea is then to use (3.5) as the defining formula in order to extend the action of the derivative operator ∂^n to any f ∈ D′(Rd) or S′(Rd).
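For a concrete one-dimensional instance of (3.5), take f(r) = |r|, which has no classical derivative at the origin; its distributional derivative is sign(r). The sketch below (illustrative discretization) evaluates both sides of (3.5) for the test function ϕ(r) = r e^{−r²}, for which both equal 1.

import numpy as np

# Weak derivative via (3.5): <phi, Df> := -<D phi, f> recovers <phi, sign> for f = |.|.
r = np.linspace(-10, 10, 20001); dr = r[1] - r[0]
phi = r * np.exp(-r**2)                     # a rapidly decaying test function
dphi = np.gradient(phi, dr)                 # derivative of the test function
f = np.abs(r)

print(-np.sum(dphi * f) * dr)               # -<phi', f>   ~ 1.0
print(np.sum(phi * np.sign(r)) * dr)        # <phi, sign>  ~ 1.0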
Formulas for scaling, shifting (translation), rotation, and other geometric transforma-
tions of distributions are obtained in a similar manner. For instance, the translation by
r0 of a generalized function f is defined via the identity

\[ \langle \varphi,\, f(\cdot - r_0) \rangle = \langle \varphi(\cdot + r_0),\, f \rangle. \]

More generally, we give the following definition.

DEFINITION 3.3 (Dual extension principle) Given operators U, U∗ : S(Rd) → S(Rd) that form an adjoint pair on S(Rd) × S(Rd), we extend their action to S′(Rd) → S′(Rd) by defining Uf and U∗f so as to have

\[ \langle \varphi, \mathrm{U} f \rangle = \langle \mathrm{U}^* \varphi, f \rangle \]
\[ \langle \varphi, \mathrm{U}^* f \rangle = \langle \mathrm{U} \varphi, f \rangle \]

for all f. A similar definition gives the extension of adjoint pairs D(Rd) → D(Rd) to operators D′(Rd) → D′(Rd).

Examples of operators S (Rd ) → S (Rd ) that can be extended in the above fashion
include derivatives, rotations, scaling, translation, time-reversal, and multiplication by
smooth functions of slow growth in the space–time domain. The other fundamental
operation is the Fourier transform, which is treated in Section 3.3.3.

3.3.3 The Fourier transform of generalized functions


We have already noted that the Fourier transform F is a reversible operator that maps
the (complexified) space S (Rd ) into itself. The additional relevant property is that F
is self-adjoint: ⟨ϕ, Fψ⟩ = ⟨Fϕ, ψ⟩, for all ϕ, ψ ∈ S(Rd). This helps us specify the
generalized Fourier transform of distributions in accordance with the general extension
principle in Definition 3.3.

DEFINITION 3.4 The generalized Fourier transform of a distribution f ∈ S′(Rd) is the distribution f̂ = F{f} ∈ S′(Rd) that satisfies

\[ \langle \varphi, \hat f \rangle = \langle \hat\varphi, f \rangle \]

for all ϕ ∈ S, where ϕ̂ = F{ϕ} is the classical Fourier transform of ϕ given by the integral

\[ \hat\varphi(\omega) = \int_{\mathbb{R}^d} e^{-\mathrm{j}\langle r, \omega \rangle}\, \varphi(r)\, \mathrm{d}r. \]

For example, since we have

\[ \int_{\mathbb{R}^d} \varphi(r)\, \mathrm{d}r = \langle \varphi, 1 \rangle = \hat\varphi(0) = \langle \hat\varphi, \delta \rangle, \]

we conclude that the (generalized) Fourier transform of δ is the constant function 1.
The fundamental property of the generalized Fourier transform is that it maps S′(Rd) into itself and that it is invertible with $\mathcal{F}^{-1} = \frac{1}{(2\pi)^d} \overline{\mathcal{F}}$, where $\overline{\mathcal{F}}\{f\} = \mathcal{F}\{f^\vee\}$. This quasi-self-reversibility – also expressed by the first row of Table 3.3 – implies that any operation on generalized functions that is admissible in the space–time domain has its counterpart in the Fourier domain, and vice versa. For instance, multiplication with a smooth function in the Fourier domain corresponds to a convolution in the signal domain. Consequently, the familiar functional identities concerning the classical Fourier transform, such as the formulas for change of variables and differentiation, among others, also hold true for this generalization. These are summarized in Table 3.3.

Table 3.3 Basic properties of the (generalized) Fourier transform.

    Temporal or spatial domain              Fourier domain
    \hat f(r) = F{f}(r)                     (2\pi)^d f(-\omega)
    f^\vee(r) = f(-r)                       \hat f(-\omega) = \hat f^\vee(\omega)
    \overline{f(r)}                         \overline{\hat f(-\omega)}
    f(A^T r)                                \frac{1}{|\det A|}\, \hat f(A^{-1}\omega)
    f(r - r_0)                              e^{-j\langle r_0, \omega\rangle}\, \hat f(\omega)
    e^{j\langle r, \omega_0\rangle} f(r)    \hat f(\omega - \omega_0)
    \partial^n f(r)                         (j\omega)^n\, \hat f(\omega)
    r^n f(r)                                j^{|n|}\, \partial^n \hat f(\omega)
    (g * f)(r)                              \hat g(\omega)\, \hat f(\omega)
    g(r)\, f(r)                             (2\pi)^{-d}\, (\hat g * \hat f)(\omega)
In addition, the reader can find in Appendix A a table of Fourier transforms of some
important singular generalized functions in one and several variables.

3.3.4 The kernel theorem


The kernel theorem provides a characterization of continuous operators X → X  (with
respect to the nuclear topology on X and the weak-∗ topology on X  ). We shall state a
version of the theorem for X = S (Rd ), which is the one we shall use. The version for
D (Rd ) is obtained by replacing the symbol S with D everywhere in the statement of
the theorem.

T H E O R E M 3.1 (Schwartz’ kernel theorem: first form) Every continuous linear op-
erator A : S(Rd) → S′(Rd) can be written in the form

\[ \varphi(r) \mapsto \mathrm{A}\{\varphi\}(r) = \int_{\mathbb{R}^d} \varphi(s)\, a(r, s)\, \mathrm{d}s, \qquad (3.6) \]

where a(·, ·) is a generalized function in S′(Rd × Rd).

We can interpret the above formula as some sort of continuous-domain matrix-vector


product, where r, s play the role of the row and column indices, respectively (see the
list of analogies in Table 3.1). This characterization of continuous linear operators as
infinite-dimensional matrix-vector products partly justifies our earlier statement that
nuclear spaces “resemble” finite-dimensional spaces in fundamental ways.
The kernel a ∈ S′(Rd × Rd) associated with the linear operator A can be identified as

\[ a(\cdot, s_0) = \mathrm{A}\{\delta(\cdot - s_0)\}, \qquad (3.7) \]

which corresponds to making the formal substitution ϕ = δ(· − s0 ) in (3.6). One can
therefore view a(·, s0 ) as the generalized impulse response of A.
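The finite-dimensional reading of (3.6)–(3.7) can be made completely explicit: on Rn, probing a linear operator with discrete impulses recovers the columns of its matrix. A sketch (the finite-difference operator is an arbitrary example):

import numpy as np

# Sampling the kernel a(., s0) = A{delta(. - s0)} column by column recovers
# the matrix of the operator, so that A acts as a matrix-vector product.
n = 8
A = lambda x: np.convolve(x, [1, -1], mode='same')          # a linear operator on R^n

kernel = np.column_stack([A(np.eye(n)[:, k]) for k in range(n)])
x = np.random.default_rng(2).standard_normal(n)
print(np.allclose(A(x), kernel @ x))        # True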
An equivalent statement of Theorem 3.1 is as follows.

T H E O R E M 3.2 (Schwartz kernel theorem: second form) Every continuous bilinear


form l : S(Rd) × S(Rd) → R (or C) can be written as

\[ l(\varphi_1, \varphi_2) = \int_{\mathbb{R}^d \times \mathbb{R}^d} \varphi_1(r)\, \varphi_2(s)\, a(r, s)\, \mathrm{d}s\, \mathrm{d}r, \qquad (3.8) \]

where the kernel a is some generalized function in S′(Rd × Rd).

One may argue that the signal-domain notation that is used in both (3.6) and (3.8) is
somewhat abusive since A{ϕ} and a do not necessarily have an interpretation as classical
functions (see statement on the notation in Section 3.1.1). The purists therefore prefer
to denote (3.8) as

\[ l(\varphi_1, \varphi_2) = \langle \varphi_1 \otimes \varphi_2,\, a \rangle \qquad (3.9) \]

with (ϕ1 ⊗ ϕ2)(r, s) = ϕ1(r) ϕ2(s) for all ϕ1, ϕ2 ∈ S(Rd).



The connection among the representations (3.6)–(3.9) is clarified by relating the continuous bilinear form l to the underlying continuous linear operator A : S(Rd) → S′(Rd) by means of the identity

\[ l(\varphi_1, \varphi) = \langle \varphi_1, \mathrm{A}\varphi \rangle, \]

where Aϕ ∈ S′(Rd) is the generalized function specified by (3.6) or, equivalently, by the inner “integral” (duality product) with respect to s in (3.8).

3.3.5 Linear shift-invariant operators and convolutions


Let Sr0 denote the shift operator ϕ → ϕ(· − r0 ). We call an operator U shift-invariant
if USr0 = Sr0 U for all r0 ∈ Rd .
As a corollary of the kernel theorem, we have the following characterization of linear
shift-invariant (LSI) operators S → S  (and a similar characterization for D → D  ).
COROLLARY 3.3 Every continuous linear shift-invariant operator S(Rd) → S′(Rd) can be written as a convolution

\[ \varphi(r) \mapsto (\varphi * h)(r) = \int_{\mathbb{R}^d} \varphi(s)\, h(r - s)\, \mathrm{d}s \]

with some generalized function h ∈ S′(Rd).


The idea there is that the kernel (or generalized impulse response) in (3.6) is a func-
tion of the relative displacement only: a(r, s) = h(r − s) (shift-invariance property).
Moreover, in this case we have the convolution-multiplication formula
ϕ
F {h ∗ ϕ} = h. (3.10)
Note that the convolution of a test function and a distribution is in general a distri-
bution. The latter is smooth (and therefore equivalent to an ordinary function), but not
necessarily rapidly decaying. However, ϕ ∗ h will once again belong continuously to S
if 
h, the Fourier transform of h, is a smooth (infinitely differentiable) function with at
most polynomial growth at infinity because the smoothness of  h translates into h having
rapid decay in the spatio-temporal domain, and vice versa. In particular, we note that the
condition is met when h ∈ R (Rd ) (since rn h(r) ∈ L1 (Rd ) for any n ∈ Nd ). A classical
situation in dimension d = 1 where the decay is guaranteed to be exponential is when
the Fourier transform of h is a rational transfer function of the form
M
 (jω − zm )
h(ω) = C0 m=1 N
n=1 (jω − pn )

with no purely imaginary pole (i.e., with Re(pn )  = 0, 1 ≤ n ≤ N). 4


Since any sequence that converges in some Lp space, with 1 ≤ p ≤ ∞, also converges in S′, the kernel theorem implies that any continuous linear operator S(Rd) → Lp(Rd) can be written in the form specified by (3.6).
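The rule (3.10) has an exact discrete-periodic counterpart in which the DFT plays the role of F; the following sketch (arbitrary inputs and length) verifies it for a circular convolution:

import numpy as np

# DFT counterpart of (3.10): the DFT of a circular convolution equals the
# product of the DFTs of the factors.
rng = np.random.default_rng(3)
N = 256
phi, h = rng.standard_normal(N), rng.standard_normal(N)

conv = np.array([np.sum(h * phi[(k - np.arange(N)) % N]) for k in range(N)])
print(np.allclose(np.fft.fft(conv), np.fft.fft(h) * np.fft.fft(phi)))    # True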
In defining the convolution of two distributions, some caution should be exerted. To
be consistent with the previous definitions, we can view convolutions as continuous
4 For M or N = 0, we shall take the corresponding product to be equal to 1.

LSI operators. The convolution of two distributions will then correspond to the com-
position of two LSI operators. To fix ideas, let us take two distributions f and h, with
corresponding operators Af and Ah . We then wish to identify f ∗ h with the composi-
tion Af Ah . However, note that, by the kernel theorem, Af and Ah are initially defined
S → S  . Since the codomain of Ah (the space S  ) does not match the domain of Af
(the space S ), this composition is a priori undefined.
There are two principal situations where we can get around the above limitation. The
first is when the range of Ah is limited to S ⊂ S  (i.e., Ah maps S to itself instead
of the much larger S  ). This is the case for the distributions with a smooth Fourier
transform that we discussed previously.
The second situation where we may define the convolution of f and h is when the range of Ah can be restricted to some space X (i.e., Ah : S → X) and, furthermore, Af has a continuous extension to X; that is, we can extend it as Af : X → S′.
An important example of the second situation is when the distributions in question
belong to the spaces Lp (Rd ) and Lq (Rd ) with 1 ≤ p, q ≤ ∞ and 1/p + 1/q ≤ 1. In this
case, their convolution is well defined and can be identified with a function in Lr (Rd ),
1 ≤ r ≤ ∞, with

\[ 1 + \frac{1}{r} = \frac{1}{p} + \frac{1}{q}. \]

Moreover, for f ∈ Lp(Rd) and h ∈ Lq(Rd), we have

\[ \| f * h \|_{L_r} \le \| f \|_{L_p}\, \| h \|_{L_q}. \]

This result is Young’s inequality for convolutions. An important special case of this identity, most useful in derivations, is obtained for q = 1 and p = r:

\[ \| h * f \|_{L_p} \le \| h \|_{L_1}\, \| f \|_{L_p}. \qquad (3.11) \]

The latter formula indicates that Lp (Rd ) spaces are “stable” under convolution with
elements of L1 (Rd ) (stable filters).
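A quick numerical probe of (3.11), with a Riemann sum standing in for the Lebesgue integral (grid, filter, and signal are illustrative):

import numpy as np

# Discretized check of Young's inequality (3.11) with p = 2:
# ||h * f||_2 <= ||h||_1 ||f||_2.
r = np.linspace(-10, 10, 4001); dr = r[1] - r[0]
h = np.exp(-np.abs(r))                      # an L1 filter
f = np.cos(5 * r) * np.exp(-r**2 / 8)       # an L2 signal

conv = np.convolve(h, f, mode='same') * dr  # continuous-domain convolution, discretized
lhs = np.sqrt(np.sum(conv**2) * dr)
rhs = (np.sum(np.abs(h)) * dr) * np.sqrt(np.sum(f**2) * dr)
print(lhs <= rhs, lhs, rhs)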

3.3.6 Convolution operators on Lp (Rd )


While the condition h ∈ L1 (Rd ) in (3.11) is very useful in practice and plays a central
role in the classical theory of linear systems, it does not cover the entire range of boun-
ded convolution operators on Lp (Rd ). Here we shall be more precise and characterize
the complete class of such operators for the cases p = 1, 2, +∞. In harmonic analysis,
these operators are commonly referred to as Lp Fourier multipliers, using (3.10) as the
starting point for their definition.

DEFINITION 3.5 (Fourier multiplier) An operator T : Lp(Rd) → Lp(Rd) is called an Lp Fourier multiplier if it is continuous on Lp(Rd) and can be represented as Tf = F⁻¹{f̂ H}. The function H : Rd → C is the frequency response of the underlying filter.

The first observation is that the definition guarantees linearity and shift-invariance.
Moreover, since S(Rd) ⊂ Lp(Rd) ⊂ S′(Rd), the multiplier operator can be written as a convolution Tf = h ∗ f (see Corollary 3.3), where h ∈ S′(Rd) is the impulse response of the operator T: h = F⁻¹{H} = Tδ. Conversely, we also have that H = ĥ = F{h}.
Since we are dealing with a linear operator on a normed vector space, we can rely on the equivalence between continuity (in accordance with Definition 3.2) and the boundedness of the operator.

DEFINITION 3.6 (Operator norm) The norm of the linear operator T : Lp(Rd) → Lp(Rd) is given by

\[ \|\mathrm{T}\|_{L_p} = \sup_{f \in L_p(\mathbb{R}^d) \setminus \{0\}} \frac{\|\mathrm{T} f\|_{L_p}}{\|f\|_{L_p}}. \]

The operator is said to be bounded if its norm is finite.

In practice, it is often sufficient to work out bounds for the extreme cases (e.g.,
p = 1, +∞) and to then invoke the Riesz–Thorin interpolation theorem to extend the
results to the p values in between.

THEOREM 3.4 (Riesz–Thorin) Let T be a linear operator that is bounded on Lp1(Rd) as well as on Lp2(Rd), with 1 ≤ p1 ≤ p2. Then, T is also bounded for any p ∈ [p1, p2], in the sense that there exist constants Cp = ‖T‖_{Lp} with min(Cp1, Cp2) ≤ Cp ≤ max(Cp1, Cp2) such that

\[ \|\mathrm{T} f\|_{L_p} \le C_p\, \|f\|_{L_p} \]

for all f ∈ Lp(Rd).

The next theorem summarizes the main results that are available on the characteriza-
tion of convolution operators on Lp (Rd ).

THEOREM 3.5 (Characterization of Lp Fourier multipliers) Let T be a Fourier-multiplier operator with frequency response H : Rd → C and (generalized) impulse response h = F⁻¹{H} = T{δ}. Then, the following statements apply:

(1) The operator T is an L1 Fourier multiplier if and only if there exists a finite complex-valued Borel measure, denoted by μh, such that $H(\omega) = \int_{\mathbb{R}^d} e^{-\mathrm{j}\langle\omega, r\rangle}\, \mu_h(\mathrm{d}r)$.
(2) The operator T is an L∞ Fourier multiplier if and only if H is the Fourier transform of a finite complex-valued Borel measure, as stated in (1).
(3) The operator T is an L2 Fourier multiplier if and only if H = ĥ ∈ L∞(Rd).

The corresponding operator norms are

\[ \|\mathrm{T}\|_{L_1} = \|\mathrm{T}\|_{L_\infty} = \|\mu_h\|_{\mathrm{TV}} \]
\[ \|\mathrm{T}\|_{L_2} = \frac{1}{(2\pi)^{d/2}}\, \|H\|_{L_\infty}, \]

where ‖μh‖_TV is the total variation (TV) of the underlying measure. Finally, T is an Lp Fourier multiplier for the whole range 1 ≤ p ≤ +∞ if the condition on H in (1) or (2) is met, with

\[ \|\mathrm{T}\|_{L_p} \le \|\mu_h\|_{\mathrm{TV}} = \sup_{\|\varphi\|_{L_\infty} \le 1} \langle \varphi, h \rangle. \qquad (3.12) \]

We note that the above theorem is an extension of (3.11), since being a finite Borel measure is a less restrictive condition than h ∈ L1(Rd). To see this, we invoke Lebesgue’s decomposition theorem, stating that a finite measure μh admits a unique decomposition as

\[ \mu_h = \mu_{ac} + \mu_{\mathrm{sing}}, \]

where μac is an absolutely continuous measure and μsing a singular measure whose mass is concentrated on a set whose Lebesgue measure is zero. If μsing = 0, then there exists a unique function h ∈ L1(Rd) – the Radon–Nikodym derivative of μh with respect to the Lebesgue measure – such that

\[ \int_{\mathbb{R}^d} \varphi(r)\, \mu_h(\mathrm{d}r) = \int_{\mathbb{R}^d} \varphi(r)\, h(r)\, \mathrm{d}r = \langle \varphi, h \rangle. \]

We then recall Hölder’s inequality (3.4) with (p, p′) = (∞, 1),

\[ |\langle \varphi, h \rangle| \le \|\varphi\|_{L_\infty}\, \|h\|_{L_1}, \]

to see that the total-variation norm defined by (3.12) reduces to the L1 norm: ‖μh‖_TV = ‖h‖_{L1}. Under those circumstances, there is an equivalence between (3.11) and (3.12).
More generally, when μsing ≠ 0 we can make the same kind of association between μh and a generalized function h which is no longer in L1(Rd). The typical case is when μsing is a discrete measure, which results in a generalized function $h_{\mathrm{sing}} = \sum_k h_k\, \delta(\cdot - r_k)$ that is a sum of Dirac impulses. The total variation of μh is then given by

\[ \|\mu_h\|_{\mathrm{TV}} = \|h_{ac}\|_{L_1} + \sum_k |h_k|. \]
Statement (3) in Theorem 3.5 is a consequence of Parseval’s identity. It is consistent
with the intuition that a “stable” filter should have a bounded frequency response, as
a minimal requirement. The class of convolution kernels that satisfy this condition
are sometimes called pseudo-measures. These are more general entities than measures
because the Fourier transform of a finite measure is necessarily uniformly continuous
in addition to being bounded.
The last result in Theorem 3.5 is obtained by interpolation between Statements (1) and
(2) using the Riesz–Thorin theorem. The extent to which the TV condition ‖μh‖_TV < ∞ can be relaxed for p ≠ 1, 2, ∞ is not yet settled and is considered to be a difficult mathematical problem. A borderline example of a one-dimensional (1-D) convolution operator that is bounded for 1 < p < ∞ (see Theorem 3.6 below), but fails to meet the necessary and sufficient TV condition for p = 1, ∞, is the Hilbert transform. Its frequency response is H_Hilbert(ω) = −j sign(ω), which is bounded since |H_Hilbert(ω)| = 1 (all-pass filter), but which is not uniformly continuous because of the jump at ω = 0. Its impulse response is the generalized function h(r) = 1/(πr), which is not included in L1(R) for two reasons: the singularity at r = 0 and the lack of sufficient decay at infinity.
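The Hilbert transform is straightforward to realize as a discrete Fourier multiplier, which also makes its L2-boundedness tangible. A sketch (NumPy/SciPy; the DFT stands in for F and the test signal is arbitrary):

import numpy as np
from scipy.signal import hilbert

N = 1024
t = np.arange(N) - N // 2
f = np.exp(-(t / 40.0)**2) * np.cos(2 * np.pi * t / 64)

w = np.fft.fftfreq(N)                       # DFT frequencies
H = -1j * np.sign(w)                        # all-pass: |H(w)| = 1 for w != 0
Tf = np.real(np.fft.ifft(H * np.fft.fft(f)))

print(np.sum(Tf**2) <= np.sum(f**2) + 1e-9)       # the L2 norm is not increased
print(np.allclose(Tf, np.imag(hilbert(f))))       # agrees with SciPy's analytic-signal route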
The case of the Hilbert transform is covered by Mikhlin’s multiplier theorem, which
provides a sufficient condition on the frequency response of a filter for Lp -stability.

THEOREM 3.6 (Mikhlin) A Fourier-multiplier operator is bounded in Lp(Rd) for 1 < p < ∞ if its frequency response H : Rd → C satisfies the differential estimate

\[ \left| \omega^n\, \partial^n H(\omega) \right| \le C_n \qquad \text{for all } |n| \le \lfloor d/2 \rfloor + 1. \]

Mikhlin’s condition, which can absorb some degree of discontinuity at the origin, is
easy to check in practice. It is stronger than the minimal boundedness requirement for
p = 2.

3.4 Probability theory

3.4.1 Probability measures


Probability measures are mathematical constructs that permit us to assign numbers (probabilities) between 0 (almost impossible) and 1 (almost sure) to events. An event is modeled by a subset A of the universal set X of all outcomes of a certain experiment X,
which is assumed to be known. The symbol PX (A) then gives the probability that some
element of A occurs as the outcome of experiment X. Note that, in general, we may
assign probabilities only to some subsets of X . We shall denote the collection of all
subsets of X for which PX is defined as SX .
The probability measure PX then corresponds to a function SX → [0, 1]. The triple
(X , SX , PX ) is called a probability space.
Frequently, the collection SX contains open and closed sets, as well as their countable
unions and intersections, collectively known as Borel sets. In this case we call PX a
Borel probability measure.
An important application of the notion of probability is in computing the “average”
value of some (real- or complex-valued) quantity f that depends on the outcome in X .
This quantity, the computation of which we shall discuss shortly, is called the expected
value of f, and is denoted as E{f (X)}.
An important context for probabilistic computations is when the outcome of X can be
encoded as a finite-dimensional numerical sequence, which implies that we can identify
X with Rn (or a subset thereof). In this case, within the proper mathematical setting,
we can find a (generalized) function pX , called the probability distribution 5 or density
function (pdf) of X, such that

\[ P_X(A) = \int_A p_X(x)\, \mathrm{d}x \]

for suitable subsets A of Rn . 6

5 Probability distributions should not be confused with the distributions in the sense of Schwartz (i.e., ge-
neralized functions) that were introduced in Section 3.3. It is important to distinguish the two usages, in
part because, as we describe here, in finite dimensions a connection can be made between probability
distributions and positive generalized functions.
6 In classical probability theory, a pdf is defined as the Radon–Nikodym derivative of a probability measure
with respect to some other measure, typically the Lebesgue measure (as we shall assume). This requires
the probability measure to be absolutely continuous with respect to the latter measure. The definition

More generally, the expected value of f : X → C is here given by

\[ \mathrm{E}\{f(X)\} = \int_{\mathbb{R}^n} f(x)\, p_X(x)\, \mathrm{d}x. \qquad (3.13) \]
We say “more generally” because PX (A) can be seen as the expected value of the
indicator function 1A (X). Since the integral of complex-valued f can be written as the
sum of its real and imaginary parts, without loss of generality we shall consider only
real-valued functions where convenient.
When the outcome of the experiment is a vector with infinitely many coordinates (for
instance a function R → R), it is typically not possible to characterize probabilities with
probability distributions. It is nevertheless still possible to define probability measures
on subsets of X , and also to define the integral (average value) of many a function
f : X → R. In effect, a definition of the integral of f with respect to probability
measure PX is obtained using a limit of “simple” functions (finite weighted sums of
indicator functions) that approximate f. For this general definition of the integral we use
the notation

\[ \mathrm{E}\{f(X)\} = \int_{\mathcal{X}} f(x)\, P_X(\mathrm{d}x), \]

which we may also use, in addition to (3.13), in the case of a finite-dimensional X .


In general, given a function f : X → Y that defines a new outcome y ∈ Y for
every outcome x ∈ X of experiment X, one can see the result of applying f to the
outcome of X as a new experiment Y. The probability of an event B ⊂ Y is the same
as the combined probability of all outcomes of X that generate an outcome in B. Thus,
mathematically,
\[ P_Y(B) = P_X(f^{-1}(B)) = P_X \circ f^{-1}(B), \]

where the inverse image f⁻¹(B) is defined as

\[ f^{-1}(B) = \{ x \in \mathcal{X} : f(x) \in B \}. \]

PY = PX(f⁻¹ ·) is called the push-forward of PX through f.

3.4.2 Joint probabilities and independence


When two experiments X and Y with probabilities PX and PY are considered simultaneously, one can imagine a joint probability space (X_{X,Y}, S_{X,Y}, P_{X,Y}) that supports both X and Y, in the sense that there exist functions f : X_{X,Y} → X_X and g : X_{X,Y} → X_Y such that

\[ P_X(A) = P_{X,Y}(f^{-1}(A)) \quad \text{and} \quad P_Y(B) = P_{X,Y}(g^{-1}(B)) \]

for all A ∈ S_X and B ∈ S_Y.

of the generalized pdf given here is more permissive, and also includes measures that are singular with
respect to the Lebesgue measure (for instance the Dirac measure of a point, for which the generalized pdf
is a Dirac distribution). This generalization relies on identifying measures on the Euclidean space with
positive linear functionals.

The functions f, g above are assumed to be fixed, and the joint event that A occurs for
X and B for Y is given by
f−1 (A) ∩ g−1 (B).

If the outcome of X has no bearing on the outcome of Y and vice versa, then X and Y
are said to be independent. In terms of probabilities, this translates into the probability
factorization rule

PX,Y (f−1 (A) ∩ g−1 (B)) = PX (A) · PY (B) = PX,Y (f−1 (A)) · PX,Y (g−1 (B)).

The above ideas can be extended to any finite collection of experiments X1 , . . . , XM


(and even to infinite ones, with appropriate precautions and adaptations).

3.4.3 Characteristic functions in finite dimensions


In finite dimensions, given a probability measure PX on X = Rn , for any vector ω ∈
Rn we can compute the expected value (integral) of the bounded function x → ejω,x .
This permits us to define a complex-valued function on Rn by the formula


\[ \hat p_X(\omega) = \mathrm{E}\{ e^{\mathrm{j}\langle \omega, x \rangle} \} = \int_{\mathbb{R}^n} e^{\mathrm{j}\langle \omega, x \rangle}\, p_X(x)\, \mathrm{d}x = \mathcal{F}\{p_X\}(\omega), \qquad (3.14) \]

which corresponds to a slightly different definition of the Fourier transform of the (generalized) probability distribution pX. The convention in probability theory is to define the forward Fourier transform with a positive sign for j⟨ω, x⟩, which is the opposite of the convention used in analysis.
One can prove that p̂X, as defined above, is always continuous at 0 with p̂X(0) = 1, and that it is positive definite (see Definition B.1 in Appendix B).
Remarkably, the converse of the above fact is also true. We record the latter result,
which is due to Bochner, together with the former observation, as Theorem 3.7.

THEOREM 3.7 (Bochner) Let p̂X : Rn → C be a function that is positive definite, fulfills p̂X(0) = 1, and is continuous at 0. Then, there exists a unique Borel probability measure PX on Rn such that

\[ \hat p_X(\omega) = \int_{\mathbb{R}^n} e^{\mathrm{j}\langle \omega, x \rangle}\, P_X(\mathrm{d}x) = \mathrm{E}\{ e^{\mathrm{j}\langle \omega, x \rangle} \}. \]

Conversely, the function specified by (3.14) with pX(r) ≥ 0 and $\int_{\mathbb{R}^n} p_X(r)\, \mathrm{d}r = 1$ is positive definite, uniformly continuous, and such that |p̂X(ω)| ≤ p̂X(0) = 1.

The interesting twist (which is due to Lévy) is that the positive definiteness of p̂X and its continuity at 0 imply continuity everywhere (as well as boundedness).
Since, by the above theorem, p̂X uniquely identifies PX, it is called the characteristic function of the probability measure PX (recall that the probability measure PX is related to the density pX by $P_X(E) = \int_E p_X(x)\, \mathrm{d}x$ for sets E in the σ-algebra over Rn).
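As an elementary illustration of (3.14) and Theorem 3.7, a characteristic function can be estimated by Monte Carlo averaging and compared with its closed form; for the unit Laplace density pX(x) = ½e^{−|x|}, one has p̂X(ω) = 1/(1 + ω²). A sketch (sample size arbitrary):

import numpy as np

# Monte Carlo estimate of the characteristic function of a Laplace variable.
rng = np.random.default_rng(4)
X = rng.laplace(size=200_000)               # unit Laplace density exp(-|x|)/2

for w in [0.0, 0.5, 2.0]:
    print(np.mean(np.exp(1j * w * X)).real, 1 / (1 + w**2))
# Note: the estimate is ~1 at w = 0 and stays bounded by 1, as Theorem 3.7 requires.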
The next theorem characterizes weak convergence of measures on Rn in terms of
their characteristic functions.

THEOREM 3.8 (Lévy’s continuity theorem) Let (P_{Xi}) be a sequence of probability measures on Rn with respective sequence of characteristic functions (p̂_{Xi}). If there exists a function p̂X such that

\[ \lim_i \hat p_{X_i}(\omega) = \hat p_X(\omega) \]

pointwise on Rn and if, in addition, p̂X is continuous at 0, then p̂X is the characteristic function of a probability measure PX on Rn. Moreover, P_{Xi} converges weakly to PX, in symbols

\[ P_{X_i} \xrightarrow{\,w\,} P_X, \]

meaning that, for any continuous bounded function f : Rn → R,

\[ \lim_i \mathrm{E}_{X_i}\{f\} = \mathrm{E}_X\{f\}. \]

The reciprocal of the above theorem is also true; namely, if $P_{X_i} \xrightarrow{\,w\,} P_X$, then p̂_{Xi}(ω) → p̂X(ω) pointwise.

3.4.4 Characteristic functionals in infinite dimensions


Given a probability measure PX on the continuous dual X′ of some test function space X, one can define an analog of the finite-dimensional characteristic function, dubbed the characteristic functional of PX and denoted by P̂X, by means of the identity

\[ \widehat{P}_X(\varphi) = \mathrm{E}\{ e^{\mathrm{j}\langle \varphi, X \rangle} \}. \qquad (3.15) \]

Comparing the above definition with (3.14), one notes that Rn, as the domain of the characteristic function p̂X, is now replaced by the space X of test functions.
As was the case in finite dimensions, the characteristic functional fulfills two impor-
tant conditions:
• Positive definiteness: P̂X is positive definite, in the sense that, for any N (test) functions ϕ1, . . . , ϕN, for any N, the N × N matrix with entries p_{ij} = P̂X(ϕi − ϕj) is non-negative definite.
• Normalization: P̂X(0) = 1.

In view of the finite-dimensional result (Bochner’s theorem), it is natural to ask if a


condition in terms of continuity can be given also in the infinite-dimensional case, so
that any functional PX fulfilling this continuity condition, in addition to the above two,
uniquely identifies a probability measure on X  . In the case where X is a nuclear space
(and, in particular, for X = S (Rd ) or D (Rd ); see Section 3.1.3), such a condition is
given by the Minlos–Bochner theorem.
THEOREM 3.9 (Minlos–Bochner) Let X be a nuclear space and let P̂X : X → C be a functional that is positive definite in the sense discussed above, fulfills P̂X(0) = 1, and is continuous X → C. Then, there exists a unique probability measure PX on X′ (the continuous dual of X) such that

\[ \widehat{P}_X(\varphi) = \int_{X'} e^{\mathrm{j}\langle \varphi, x \rangle}\, P_X(\mathrm{d}x) = \mathrm{E}\{ e^{\mathrm{j}\langle \varphi, X \rangle} \}. \]

Conversely, the characteristic functional associated with some probability measure PX on X′ is positive definite, continuous over X, and such that P̂X(0) = 1.
The practical implication of this result is that one can rely on characteristic function-
als to indirectly specify infinite-dimensional measures (most importantly,
probabilities of stochastic processes) – which are difficult to pin down otherwise.
Operationally, the characteristic functional P̂X(ϕ) is nothing but a mathematical rule (e.g., $\widehat{P}_X(\varphi) = e^{-\frac{1}{2}\|\varphi\|_2^2}$) that returns a value in C for any given function ϕ ∈ S. The truly powerful aspect is that this rule condenses all the information about the statistical distribution of some underlying infinite-dimensional random object X. When working with characteristic functionals, we shall see that computing probabilities and deriving various properties of the said processes are all reduced to analytical derivations.

3.5 Generalized random processes and fields

In this section, we present an introduction to the theory of generalized random pro-


cesses, which is concerned with defining probabilities on function spaces, that is,
infinite-dimensional vector spaces with some notion of limit and convergence. We have
made the point before that the theory of generalized functions is a natural extension of
finite-dimensional linear algebra. The same kind of parallel can be drawn between the
theory of generalized stochastic processes and conventional probability calculus (which
deals with finite-dimensional random vector variables). Therefore, before getting into
more detailed explanations, it is instructive to have a look back at Table 3.2, which
provides a side-by-side summary of the primary probabilistic concepts that have been
introduced so far. The reader is then referred to Table 3.4, which presents a comparison
of finite- and infinite-dimensional “innovation models.” To give the basic idea, in finite
dimensions, an “innovation” is a vector in Rn of independent identically distributed
(i.i.d.) random variables. An “innovation model” is obtained by transforming such a
vector by means of a linear operator (a matrix), which embodies the structure of depen-
dencies of the model. In infinite dimensions, the notion of an i.i.d. vector is replaced
by that of a random process with independent values at every point (which we shall
call an “innovation process”). The transformation is achieved by applying a continuous
linear operator which constitutes the generalization of a matrix. The characterization
of such models is made possible by their characteristic functionals, which, as we saw
in Section 3.4.4, are the infinite-dimensional equivalents of characteristic functions of
random variables.

3.5.1 Generalized random processes as collections of random variables


A generalized stochastic process 7 is essentially a randomization of the idea of a gener-
alized function (Section 3.3), in much the same way as an ordinary stochastic process is
a randomization of the concept of a function.

7 We shall use the terms random/stochastic “process” and “field” almost interchangeably. The distinction, in
general, lies in the fact that, for a random process, the parameter is typically interpreted as time, while, for
a field, the parameter is typically multidimensional and interpreted as a spatial or spatio-temporal location.

Table 3.4 Comparison of innovation models in finite- and infinite-dimensional settings. See Sections 4.3–4.5 for a detailed explanation.

    Finite-dimensional                                Infinite-dimensional

    Standard Gaussian i.i.d. vector                   Standard Gaussian white noise w
    W = (W1, . . . , WN)                              P̂w(ϕ) = e^{−‖ϕ‖₂²/2},  ϕ ∈ S
    p̂W(ω) = e^{−|ω|²/2},  ω ∈ R^N

    Multivariate Gaussian vector X                    Gaussian generalized process s
    X = AW                                            s = Aw  (for continuous A : S′ → S′)
    p̂X(ω) = e^{−|A^T ω|²/2}                           P̂s(ϕ) = e^{−‖A∗ϕ‖₂²/2}

    General i.i.d. vector W = (W1, . . . , WN)        General white noise w with Lévy exponent f
    with exponent f                                   P̂w(ϕ) = e^{∫_{Rd} f(ϕ(r)) dr}
    p̂W(ω) = e^{Σ_{n=1}^{N} f(ωn)}

    Linear transformation of general i.i.d.           Linear transformation of general white noise
    random vector W (innovation model)                (innovation model)
    X = AW                                            s = Aw
    p̂X(ω) = p̂W(A^T ω)                                 P̂s(ϕ) = P̂w(A∗ϕ)

At a minimum, the definition of a generalized stochastic process s should permit us to associate probabilistic models with observations made using test functions. In other words, with any test function ϕ in some suitable test-function space X is associated a random variable s(ϕ), also denoted by ⟨ϕ, s⟩. This is to be contrasted with an observation s(t) at time t, which would be modeled by a random variable in the case of an ordinary stochastic process. We shall denote the probability measure of the random variable ⟨ϕ, s⟩ by P_{s,ϕ}. Similarly, to any finite collection of observations ⟨ϕn, s⟩, 1 ≤ n ≤ N, N ∈ N, corresponds a joint probability measure P_{s,ϕ1:ϕN} on R^N (we shall only consider real-valued processes here, and therefore assume the observations to be real-valued).
Moreover, finite families of observations ⟨ϕn, s⟩, 1 ≤ n ≤ N, and ⟨ψm, s⟩, 1 ≤ m ≤ M, need to be consistent or compatible to ensure that all computations of the probability of an event involving finite observations yield the same value for the probability. In modeling physical phenomena, it is also reasonable to assume some weak form of continuity in the probability of ⟨ϕ, s⟩ as a function of ϕ.
Mathematically, these requirements are fulfilled by the kind of probabilistic model
induced by a cylinder-set probability measure. In other words, a cylinder-set probability
measure provides a consistent probabilistic description for all finite sets of observations
of some phenomenon s using test functions ϕ ∈ X. Furthermore, a cylinder-set probability measure can always be specified via its characteristic functional P̂s(ϕ) = E{e^{j⟨ϕ,s⟩}}, which makes it amenable to analytic computations.
The only conceptual limitation of such a probability model is that, at least a priori, it
does not permit us to associate the sample paths of the process with (generalized) func-
tions. Put differently, in this framework, we are not allowed to interpret s as a random

entity belonging to the dual X′ of X, since we have not yet defined a proper probability measure on X′. 8 Doing so involves some additional steps.

8 In fact, X′ may very well be too small to support such a description (while the algebraic dual, X∗, can support the measure – by Kolmogorov’s extension theorem – but is too large for many practical purposes). An important example is that of white Gaussian noise, which one may conceive of as associating a Gaussian random variable with variance ‖ϕ‖₂² to any test function ϕ ∈ L2. However, the “energy” of white Gaussian noise is clearly infinite. Therefore it cannot be modeled as a randomly chosen function in (L2)′ = L2.

3.5.2 Generalized random processes as random generalized functions


Fortunately, the above existence and interpretation problem is fully resolved by taking X to be a nuclear space, thanks to the Minlos–Bochner theorem (Theorem 3.9). This allows for the extension of the underlying cylinder-set probability measure to a proper (by which here we mean countably additive) probability measure on X′ (the topological dual of X).
In this case, the joint probabilities P_{s,ϕ1:ϕN}, ϕ1, . . . , ϕN ∈ X, N ∈ N, corresponding to the random variables ⟨ϕn, s⟩ for all possible choices of test functions, collectively define a probability measure Ps on the infinite-dimensional dual space X′. This means that we can view s as an element drawn randomly from X′ according to the probability law Ps.
In particular, if we take X to be either S(Rd) or D(Rd), then our generalized random process/field will have realizations that are distributions in S′(Rd) or D′(Rd), respectively. We can then also think of ⟨ϕ, s⟩ as the measurement of this random object s by means of some sensor (test function) ϕ in S or D.
Since we shall rely on this fact throughout the book, we reiterate once more that a complete probabilistic characterization of s as a probability measure on the space X′ (dual to some nuclear space X) is provided by its characteristic functional. The truly powerful aspect of the Minlos–Bochner theorem is that the implication goes both ways: any continuous, positive definite functional P̂s : X → C with proper normalization identifies a unique probability measure Ps on X′. Therefore, to define a generalized random process s with realizations in X′, it suffices to produce a functional P̂s : X → C with the noted properties.

3.5.3 Determination of statistics from the characteristic functional


The characteristic functional of the generalized random process s contains complete
information about its probabilistic properties, and can be used to compute all probabili-
ties, and to derive or verify the probabilistic properties related to s.
Most importantly, it can yield the Nth-order joint probability density of any set of
linear observations of s by suitable N-dimensional inverse Fourier transformation. This
follows from a straightforward manipulation in the domain of the (joint) characteristic
function and is recorded for further reference.
PROPOSITION 3.10 Let y = (Y1, . . . , YN) with Yn = ⟨ϕn, s⟩, where ϕ1, . . . , ϕN ∈ X, be a set of linear measurements of the generalized stochastic process s with characteristic functional P̂s(ϕ) = E{e^{j⟨ϕ,s⟩}} that is continuous over the function space X. Then,

\[ \hat p_{(Y_1:Y_N)}(\boldsymbol{\omega}) = \widehat{P}_{s,\varphi_1:\varphi_N}(\boldsymbol{\omega}) = \widehat{P}_s\Big( \sum_{n=1}^{N} \omega_n \varphi_n \Big) \]

and the joint pdf of y is given by

\[ p_{(Y_1:Y_N)}(\boldsymbol{y}) = \mathcal{F}^{-1}\{ \hat p_{(Y_1:Y_N)} \}(\boldsymbol{y}) = \int_{\mathbb{R}^N} \widehat{P}_s\Big( \sum_{n=1}^{N} \omega_n \varphi_n \Big)\, e^{-\mathrm{j}\langle \boldsymbol{y}, \boldsymbol{\omega} \rangle}\, \frac{\mathrm{d}\boldsymbol{\omega}}{(2\pi)^N}, \]

where the observation functions ϕn ∈ X are fixed and ω = (ω1, . . . , ωN) plays the role of the N-dimensional Fourier variable.

Proof The continuity assumption over the function space X (which need not be nuclear) ensures that the manipulation is legitimate. Starting from the definition of the characteristic function of y = (Y1, . . . , YN), we have

\[ \hat p_{(Y_1:Y_N)}(\boldsymbol{\omega}) = \mathrm{E}\big\{ \exp( \mathrm{j} \langle \boldsymbol{\omega}, \boldsymbol{y} \rangle ) \big\} \]
\[ = \mathrm{E}\Big\{ \exp\Big( \mathrm{j} \sum_{n=1}^{N} \omega_n \langle \varphi_n, s \rangle \Big) \Big\} \]
\[ = \mathrm{E}\Big\{ \exp\Big( \mathrm{j} \Big\langle \sum_{n=1}^{N} \omega_n \varphi_n,\, s \Big\rangle \Big) \Big\} \qquad \text{(by linearity of the duality product)} \]
\[ = \widehat{P}_s\Big( \sum_{n=1}^{N} \omega_n \varphi_n \Big). \qquad \text{(by definition of } \widehat{P}_s(\varphi) \text{)} \]

The density p_{(Y1:YN)} is then obtained by inverse (conjugate) Fourier transformation.
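To see Proposition 3.10 at work in the simplest setting, take a single measurement Y = ⟨ϕ, w⟩ of the white Gaussian noise of Section 3.5.6, for which p̂Y(ω) = P̂w(ωϕ) = e^{−ω²‖ϕ‖₂²/2}; inverting this characteristic function numerically recovers the N(0, ‖ϕ‖₂²) density. A sketch (grids and the test function are illustrative):

import numpy as np

# Joint pdf via Proposition 3.10, here with N = 1 and Gaussian white noise.
r = np.linspace(-5, 5, 2001); dr = r[1] - r[0]
phi = np.exp(-r**2); norm2 = np.sum(phi**2) * dr        # ||phi||_2^2

omega = np.linspace(-40, 40, 8001); dw = omega[1] - omega[0]
char_fn = np.exp(-omega**2 * norm2 / 2)                 # p_hat_Y(omega) = P_hat_w(omega*phi)

y = 0.8
pdf_y = np.sum(char_fn * np.exp(-1j * omega * y)).real * dw / (2 * np.pi)
print(pdf_y)                                            # inverse Fourier estimate
print(np.exp(-y**2 / (2 * norm2)) / np.sqrt(2 * np.pi * norm2))   # N(0, ||phi||^2) pdf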

Similarly, the formalism allows one to retrieve all first- and second-order moments of the generalized stochastic process s. To that end, one considers the mean and correlation functionals, defined and computed as

\[ M_s(\varphi) = \mathrm{E}\{ \langle \varphi, s \rangle \} = (-\mathrm{j})\, \frac{\mathrm{d}}{\mathrm{d}\omega} \hat p_{s,\varphi}(\omega) \Big|_{\omega=0} = (-\mathrm{j})\, \frac{\mathrm{d}}{\mathrm{d}\omega} \widehat{P}_s(\omega\varphi) \Big|_{\omega=0} \]

\[ B_s(\varphi_1, \varphi_2) = \mathrm{E}\{ \langle \varphi_1, s \rangle \langle \varphi_2, s \rangle \} = (-\mathrm{j})^2\, \frac{\partial^2}{\partial\omega_1 \partial\omega_2} \hat p_{s,\varphi_1,\varphi_2}(\omega_1, \omega_2) \Big|_{\omega_1,\omega_2=0} = (-\mathrm{j})^2\, \frac{\partial^2}{\partial\omega_1 \partial\omega_2} \widehat{P}_s(\omega_1\varphi_1 + \omega_2\varphi_2) \Big|_{\omega_1,\omega_2=0}. \]

When the space of test functions is nuclear (X = S (Rd ) or D (Rd )) and the above
quantities are well defined, we can find generalized functions ms (the generalized mean)

and Rs (the generalized autocorrelation function) such that

\[ M_s(\varphi) = \int_{\mathbb{R}^d} \varphi(r)\, m_s(r)\, \mathrm{d}r \qquad (3.16) \]

\[ B_s(\varphi_1, \varphi_2) = \int_{\mathbb{R}^d \times \mathbb{R}^d} \varphi_1(r)\, \varphi_2(s)\, R_s(r, s)\, \mathrm{d}s\, \mathrm{d}r. \qquad (3.17) \]

The first identity is simply a consequence of Ms being a continuous linear functional


on X , while the second is an application of Schwartz’ kernel theorem (Theorem 3.2).

3.5.4 Operations on generalized stochastic processes


In constructing stochastic models, it is of interest to separate the essential random-
ness of the models (the “innovation”) from their deterministic structure. Our way of
approaching this objective is by encoding the random part in a characteristic functional
P̂w and the deterministic structure of dependencies in an operator U (or, equivalently,
in its adjoint U∗ ). In the following paragraphs, we first review the mathematics of this
construction, before we come back to, and clarify, the said interpretation. The concepts
presented here in an abstract form are illustrated and made intuitive in the remainder of
the book.
Given a continuous linear operator U : X → Y with continuous adjoint U∗ : Y′ → X′, where X, Y need not be nuclear, and a functional

\[ \widehat{P}_w : Y \to \mathbb{C} \]

that satisfies the three conditions of Theorem 3.9 (continuity, positive definiteness, and normalization), we obtain a new functional

\[ \widehat{P}_s : X \to \mathbb{C} \]

fulfilling the same properties by composing P̂w and U as per

\[ \widehat{P}_s(\varphi) = \widehat{P}_w(\mathrm{U}\varphi) \qquad \text{for all } \varphi \in X. \qquad (3.18) \]

Writing

\[ \widehat{P}_s(\omega\varphi) = \mathrm{E}\{ e^{\mathrm{j}\omega\langle \varphi, s \rangle} \} = \hat p_{\varphi,s}(\omega) \]

and

\[ \widehat{P}_w(\omega \mathrm{U}\varphi) = \mathrm{E}\{ e^{\mathrm{j}\omega\langle \mathrm{U}\varphi, w \rangle} \} = \hat p_{\mathrm{U}\varphi,w}(\omega) \]

for generalized processes s and w, we deduce that the random variables ⟨ϕ, s⟩ and ⟨Uϕ, w⟩ have the same characteristic functions and therefore satisfy

\[ \langle \varphi, s \rangle = \langle \mathrm{U}\varphi, w \rangle \qquad \text{in probability law.} \]

The manipulation that led to Proposition 3.10 shows that a similar relation exists, more generally, for any finite collection of observations ⟨ϕn, s⟩ and ⟨Uϕn, w⟩, 1 ≤ n ≤ N, N ∈ N.

Therefore, symbolically at least, by the definition of the adjoint U∗ : Y′ → X′ of U, we may write

\[ \langle \varphi, s \rangle = \langle \varphi, \mathrm{U}^* w \rangle. \]

This seems to indicate that, in a sense, the random model s, which we have defined using (3.18), can be interpreted as the application of U∗ to the original random model w. However, things are complicated by the fact that, unless X and Y are nuclear spaces, we may not be able to interpret w and s as random elements of Y′ and X′, respectively. Therefore the application of U∗ : Y′ → X′ to w should be understood to be merely a formal construction.
On the other hand, by requiring X to be nuclear and Y to be either nuclear or completely normed, we see immediately that P̂s : X → C fulfills the requirements of the Minlos–Bochner theorem, and thereby defines a generalized random process with realizations in X′.
The previous discussion suggests the following approach to defining generalized random processes: take a continuous, positive definite functional P̂w : Y → C on some (nuclear or completely normed) space Y. Then, for any continuous operator U defined from a nuclear space X into Y, the composition

\[ \widehat{P}_s = \widehat{P}_w(\mathrm{U}\,\cdot) \]

is the characteristic functional of a generalized random process s with realizations in X′.


In subsequent chapters, we shall mostly focus on the situation where U = L⁻¹∗ and U∗ = L⁻¹ for some given (whitening) operator L that admits a continuous inverse in the suitable topology, the typical choice of spaces being X = S(Rd) and Y = Lp(Rd). The underlying hypothesis is that one is able to invert the linear operator U and to recover w from s, which is formally written as w = Ls; that is,

\[ \langle \varphi, w \rangle = \langle \varphi, \mathrm{L} s \rangle \qquad \text{for all } \varphi \in Y. \]
The above ideas are summarized in Figure 3.1.

3.5.5 Innovation processes


In a certain sense, the most fundamental class of generalized random processes we can
use to play the role of w in the construction of Section 3.5.4 are those with independent
values at every point in Rd [GV64, Chapter 4, pp. 273–288]. The reason is that we can
then isolate the spatio-temporal dependency of the probabilistic model in the mixing
operator (U∗ in Figure 3.1), and attribute randomness to independent contributions (in-
novations) at geometrically distinct points in the domain. We call such a construction
an innovation model.
Let us attempt to make the notion of independence at every point more precise in the
context of generalized stochastic processes, where the objects of study are, more accura-
tely, not pointwise observations, but rather observations made through scalar products
with test functions. To qualify a generalized process w as having independent values at every point, we therefore require that the random variables ⟨ϕ1, w⟩ and ⟨ϕ2, w⟩ be independent whenever the test functions ϕ1 and ϕ2 have disjoint supports.

[Figure 3.1 Definition of the linear transformation of generalized stochastic processes using characteristic functionals. In this book, we shall focus on innovation models where w is a white noise process. The operator L = U⁻¹∗ (if it exists) is called the whitening operator of s, since Ls = w.]

Since the joint characteristic function of independent random variables factorizes (is separable), we can formulate the above property in terms of the characteristic functional P̂w of w as

\[ \widehat{P}_w(\varphi_1 + \varphi_2) = \widehat{P}_w(\varphi_1)\, \widehat{P}_w(\varphi_2). \]

An important class of characteristic functionals fulfilling this requirement are those that can be written in the form

\[ \widehat{P}_w(\varphi) = e^{\int_{\mathbb{R}^d} f(\varphi(r))\, \mathrm{d}r}. \qquad (3.19) \]

To have P̂w(0) = 1 (normalization), we require that f(0) = 0. The requirement of positive definiteness narrows down the class of admissible functions f much further, practically to those identified by the Lévy–Khintchine formula. This will be the subject of the greater part of our next chapter.
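The form (3.19) can be probed by simulation. In the Gaussian case f(x) = −x²/2 (anticipating Section 3.5.6), discretizing the innovation on a grid of step h as independent N(0, h) weights gives ⟨ϕ, w⟩ ≈ Σk ϕ(rk) Wk, whose expected complex exponential should approach exp(∫ f(ϕ(r)) dr) = e^{−‖ϕ‖₂²/2}. A sketch (grid and sample size are illustrative):

import numpy as np

# Monte Carlo check of (3.19) for the Gaussian exponent f(x) = -x^2/2.
rng = np.random.default_rng(5)
r = np.linspace(-5, 5, 201); h = r[1] - r[0]
phi = np.exp(-r**2 / 2)

W = rng.normal(scale=np.sqrt(h), size=(10_000, r.size))   # i.i.d. N(0, h) innovations
samples = W @ phi                                          # realizations of <phi, w>
print(np.mean(np.exp(1j * samples)).real)                  # Monte Carlo estimate
print(np.exp(-np.sum(phi**2) * h / 2))                     # exp( int f(phi(r)) dr )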

3.5.6 Example: filtered white Gaussian noise


In the above framework, we can define white Gaussian noise or innovation on Rd as a random element of the space of Schwartz’ generalized functions, S′(Rd), whose characteristic functional is given by

\[ \widehat{P}_w(\varphi) = e^{-\frac{1}{2} \|\varphi\|_2^2}. \]

Note that this functional is a special instance of (3.19) with f(ω) = −ω²/2. The Gaussian appellation is justified by observing that, for any N test functions ϕ1, . . . , ϕN, the random

variables ⟨ϕ1, w⟩, . . . , ⟨ϕN, w⟩ are jointly Gaussian. Indeed, we can apply Proposition 3.10 to obtain the joint characteristic function

\[ \widehat{P}_{\varphi_1:\varphi_N}(\boldsymbol{\omega}) = \exp\Bigg( -\frac{1}{2} \Big\| \sum_{n=1}^{N} \omega_n \varphi_n \Big\|_2^2 \Bigg). \]

By taking the inverse Fourier transform of the above expression, we find that the random variables ⟨ϕn, w⟩, n = 1, . . . , N, have a multivariate Gaussian distribution with mean 0 and covariance matrix with entries

\[ C_{mn} = \langle \varphi_m, \varphi_n \rangle. \]

The independence of ⟨ϕ1, w⟩ and ⟨ϕ2, w⟩ is obvious whenever ϕ1 and ϕ2 have disjoint support. This justifies calling the process white. 9 In this special case, even mere orthogonality of ϕ1 and ϕ2 is enough for independence, since for ϕ1 ⊥ ϕ2 we have Cmn = 0.
From (3.16) and (3.17), we also find that w has 0 mean and “correlation function” Rw(r, s) = δ(r − s), which should also be familiar. In fact, this last expression is sometimes used to formally “define” white Gaussian noise.
A filtered white Gaussian noise is obtained by applying a continuous convolution (i.e., LSI) operator U∗ : S′ → S′ to the Gaussian innovation, in the sense described in Section 3.5.4.
Let us denote the convolution kernel of the operator U : S → S (the adjoint of U∗) by h. 10 The convolution kernel of U∗ : S′ → S′ is then h∨. Following Section 3.5.4, we find the following characteristic functional for the filtered process U∗w = h∨ ∗ w:

\[ \widehat{P}_{\mathrm{U}^* w}(\varphi) = e^{-\frac{1}{2} \| h * \varphi \|_2^2}. \]

In turn, it yields the following mean and correlation functions

\[ m_{\mathrm{U}^* w}(r) = 0 \]
\[ R_{\mathrm{U}^* w}(r, s) = \big( h * h^\vee \big)(r - s), \]

as expected.
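These second-order statistics are easy to reproduce in simulation. The sketch below (a discrete, circular stand-in for the continuous-domain model, with an arbitrary exponential filter) averages empirical autocorrelations of filtered discrete white noise and compares them with (h ∗ h∨):

import numpy as np

# Filtered discrete white Gaussian noise: the empirical autocorrelation of
# s = h * w approaches the circular counterpart of (h * h^v)(r - s).
rng = np.random.default_rng(6)
N, reps = 512, 2000
h = np.exp(-np.arange(N) / 10.0)            # illustrative causal exponential filter
H = np.fft.fft(h)

acf = np.zeros(N)
for _ in range(reps):
    w = rng.standard_normal(N)              # discrete white Gaussian innovation
    s = np.real(np.fft.ifft(H * np.fft.fft(w)))                   # circular filtering
    acf += np.real(np.fft.ifft(np.abs(np.fft.fft(s))**2)) / N     # circular autocorrelation
acf /= reps

theory = np.real(np.fft.ifft(np.abs(H)**2))   # circular (h * h^v)
print(np.max(np.abs(acf - theory)) / theory[0])    # small (Monte Carlo error)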

3.6 Bibliographical pointers and historical notes

Sections 3.1 and 3.2


Recommended references on functional analysis, topological vector spaces, and duality
are the books by Schaefer [Sch99] and Rudin [Rud73].
Much of the theory of nuclear spaces was developed by Grothendieck [Gro55] in
his thesis work under the direction of Schwartz. For detailed information, we refer to
Pietsch [Pie72].
9 Our notion of whiteness in this book goes further than having a white spectrum. By whiteness we mean
that the process is stationary and has truly independent (not merely uncorrelated) values over disjoint sets.
10 Recall that, for the convolution to map back into S , h needs to have a smooth Fourier transform, which
implies rapid decay in the temporal or spatial domain. This is the case, in particular, for any rational
transfer function that lacks purely imaginary poles.

Section 3.3
For a comprehensive treatment of generalized functions, we recommend the books of
Gelfand and Shilov [GS64] and Schwartz [Sch66] (the former being more accessible
while maintaining rigor). The results on Fourier multipliers are covered by Hörmander
[Hör80] and Mikhlin et al. [MP86].
A historical precursor to the theory of generalized functions is the “operational
method” of Heaviside, appearing in his collected works in the last decade of the nine-
teenth century [Hea71]. The introduction of the Lebesgue integral was a major step
that gave a precise meaning to the concept of the almost-everywhere equivalence of
functions. Dirac introduced his eponymous distribution as a convenient notation in the
1920s. Sobolev [Sob36] developed a theory of generalized functions in order to define
weak solutions of partial differential equations. But it was Laurent Schwartz [Sch66]
who put forth the formal and comprehensive theory of generalized functions (distri-
butions) as we use it today (first edition published in 1950). His work was further
developed and exposed by the Russian school of Gelfand et al.

Section 3.4
Kolmogorov is the founding father of the modern axiomatic theory of probability which
is based on measure theory. We still recommend his original book [Kol56] as the main
reference for the material presented here. Newer and more advanced results can be
found in the encyclopedic works of Bogachev [Bog07] and Fremlin [Fre00, Fre01,
Fre02, Fre03, Fre08] on measure theory.
Paul Lévy defined the characteristic function in the early 1920s and is responsible for
turning the Fourier–Stieltjes apparatus into one of the most useful tools of probability
theory [Lév25, Tay75]. The foundation of the finite-dimensional Fourier approach is
Bochner’s theorem, which appeared in 1932 [Boc32].
Interestingly, it was Kolmogorov himself who introduced the characteristic functional
in 1935 as an equivalent (infinite-dimensional) Fourier-based description of a measure
on a Banach space [Kol35]. This tool then lay dormant for many years. The theoretical
breakthrough came when Minlos proved the equivalence between this functional and
the characterization of probability measures on duals of nuclear spaces (Theorem 3.9) –
as hypothesized by Gelfand [Min63, Kol59]. This powerful framework now constitutes
the infinite-dimensional counterpart of the traditional Fourier approach to probability
theory.
What is lesser known is that Laurent Schwartz, who also happened to be Paul
Lévy’s son-in-law, revisited the theory of probability measures on infinite-dimensional
topological vector spaces, including developments from the French school, in the final
years of his career [Sch73b, Sch81b]. These later works are highly abstract, as one may
expect from their author. This makes for an interesting contrast with Paul Lévy, who
had a limited interest in axioms and whose research was primarily guided by an extra-
ordinary intuition.

Section 3.5
The concept of generalized stochastic processes, including the characterization of
continuously defined white noises, was introduced by Gelfand in 1955 [Gel55]. Itô
contributed to the topic by formulating the correlation theory of such processes
[Itô54]; see also [Itô84]. The basic reference for the material presented here is
[GV64, Chapter 3].
The first applications of the characteristic functional to the study of stochastic
processes have been traced back to 1947; they are due to Le Cam [LC47] and Boch-
ner [Boc47], who both appear to have (re)discovered the tool independently. Le Cam
was concerned with the practical problem of modeling the relation between rainfall
and riverflow, while Bochner was aiming at a fundamental characterization of stochas-
tic processes. Another early promoter is Bartlett, who, in collaboration with Kendall,
determined the characteristic functional of several Poisson-type processes that are
relevant to biology and physics [BK51]. The framework was consolidated by Gel-
fand [Gel55] and Minlos in the 1960s. They provided the extension to generalized
functions and also addressed the fundamental issue of the uniqueness and consistency
of this infinite-dimensional description.
4 Continuous-domain innovation models

The stochastic processes that we wish to characterize are those generated by linear trans-
formation of non-Gaussian white noise. If we were operating in the discrete domain and
restricting ourselves to a finite number of dimensions, we would be able to use any
sequence of i.i.d. random variables wn as system input and rely on conventional multi-
variate statistics to characterize the output. This strongly suggests that the specification
of the mixing matrix (L−1 ) and the probability density function (pdf) of the innovation
is sufficient to obtain a complete description of a linear stochastic process, at least in the
discrete setting.
But our goal is more ambitious since we place ourselves in the context of conti-
nuously defined processes. The situation is then not quite as straightforward because: (1)
we are dealing with infinite-dimensional objects, (2) it is much harder to properly define
the notion of continuous-domain white noise, and (3) there are theoretical restrictions
on the class of admissible innovations. While this calls for an advanced mathematical
machinery, the payoff is that the continuous-domain formalism lends itself better to
analytical computations, by virtue of the powerful tools of functional and harmonic
analysis. Another benefit is that the non-Gaussian members of the family are necessarily
sparse as a consequence of the theory which rests upon the powerful characterization
and existence theorems by Lévy–Khintchine, Minlos, Bochner, and Gelfand–Vilenkin.
As in the subsequent chapters, we start by providing some intuition in the first sec-
tion and then proceed with a more formal characterization. Section 4.2 is devoted to
an in-depth investigation of Lévy exponents, which are intimately tied to the family of
infinitely divisible distributions in the classical (scalar) theory of probability. What is
non-standard here and fundamental to our argument is the link that is made between
infinite divisibility and sparsity in Section 4.2.3. In Section 4.3, we apply those results
to the Fourier-domain characterization of a multivariate linear model driven by an infi-
nitely divisible noise vector, which primarily serves as preparation for the subsequent
infinite-dimensional generalization. In Section 4.4, we extend the formulation to the
continuous domain, which results in the proper specification of white Lévy noise w (or
non-Gaussian innovations) as a generalized stochastic process (in the sense of Gelfand
and Vilenkin) with independent “values” at every point. The fundamental result is that a
given brand of noise (or innovations) is uniquely specified by its Lévy exponent f (ω) via
its characteristic functional $\widehat{P}_w(\varphi)$. Finally, in Section 4.5, we characterize the statistical
effect of the mixing operator $\mathrm{L}^{-1}$ (general linear model) and provide mathematical
conditions on $f$ and $\mathrm{L}$ that ensure that the resulting process $s = \mathrm{L}^{-1} w$ is well defined
mathematically.

4.1 Introduction: from Gaussian to sparse probability distributions

Intuitively, a continuous-domain white-noise process is formed by the juxtaposition of


a continuum of i.i.d. random contributions. Since these atoms of randomness are infi-
nitesimal, the realizations (a.k.a. sample paths) of such processes are highly singular
(discontinuous), meaning that they do not admit a classical interpretation as (random)
functions of the index variable r ∈ Rd . Consequently, the random variables associated
with the sample values w(r0 ) are undefined. The only concrete way of observing such
noises is by probing them through some localized analysis window ϕ(· − r0 ) centered
around some location $\boldsymbol{r}_0$. This produces the scalar quantity $X = \langle \varphi(\cdot - \boldsymbol{r}_0), w \rangle$, which is
a conventional random variable with some pdf pX(ϕ) . Note that pX(ϕ) is independent of
the position r0 , which reflects the fact that w is stationary. In order to get some sense of
the variety of achievable random patterns, we propose to convert the continuous-domain
process w into some corresponding i.i.d. sequence Xk (discrete white noise) by selecting
a sequence of non-overlapping rectangular windows:

$$X_k = \langle \mathrm{rect}(\cdot - k), w \rangle.$$

The concept is illustrated in Figure 4.1. The main point that will be made clearer in
what follows is that there is a one-to-one correspondence between the pdf of Xk – the
so-called canonical pdf pid (x) = pX(rect) (x) – and the complete functional description
of w via its characteristic functional, which we shall investigate in Section 4.4. What is
more remarkable (and distinct from the discrete setting) is that this canonical pdf cannot
be arbitrary: the theory dictates that it must be part of the family of infinitely divisible
(id) laws (see Section 4.2).
The prime example of an id pdf is the Gaussian distribution illustrated in Figure 4.1a.
As already mentioned, it is the only non-sparse member of the family. All other distri-
butions exhibit either a mass density at the origin (like the compound-Poisson example
in Figure 4.1c with Prob(x = 0) = e−λ = 0.75 and Gaussian amplitude distribution) or
a slower rate of decay at infinity (heavy-tail behavior). The Laplace probability law of
Figure 4.1b results in the mildest possible form of sparsity – indeed, it can be proven
that there is a gap between the Gaussian and the other members of the family in the
sense that there is no id distribution with $p(x) = e^{-O(|x|^{1+\epsilon})}$ with $0 < \epsilon < 1$. In other
words, a non-Gaussian $p_{id}(x)$ is constrained to decay like $e^{-\lambda|x|}$ or slower – typically,
like $O(1/|x|^{r})$ with $r > 1$ (inverse polynomial/algebraic decay). The sparsest example
in Figure 4.1 is provided by the Cauchy distribution $p_{\mathrm{Cauchy}}(x) = \frac{1}{\pi(x^2+1)}$, which is part
of the symmetric-alpha-stable (S$\alpha$S) family (here, $\alpha = 1$). The S$\alpha$S distributions with
$\alpha \in (0, 2)$ are notorious for their heavy-tail behavior and the fact that their moments
$E\{|x|^p\}$ are unbounded for $p > \alpha$.

Figure 4.1 Examples of canonical, infinitely divisible probability density functions and
corresponding observations of a continuous-domain white-noise process through an array of
non-overlapping rectangular integration windows. (a) Gaussian distribution (not sparse). (b)
Laplace distribution (moderately sparse). (c) Compound-Poisson distribution (finite rate of
innovation). (d) Cauchy distribution (ultra-sparse = heavy-tailed with unbounded variance).
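The sampling experiment of Figure 4.1 is easy to reproduce. The following Python sketch (ours, not from the book; all parameter values are illustrative, with the compound-Poisson rate chosen so that $\mathrm{Prob}(X = 0) = e^{-\lambda} = 0.75$ as in the figure) draws i.i.d. window observations $X_k$ for the four canonical laws:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000                                   # number of unit rect windows

gauss   = rng.standard_normal(n)           # (a) Gaussian
laplace = rng.laplace(0.0, 1.0, n)         # (b) Laplace
lam = -np.log(0.75)                        # (c) compound Poisson: Prob(X=0) = 0.75
counts = rng.poisson(lam, n)               #     number of impulses per window
poisson = np.array([rng.standard_normal(c).sum() for c in counts])
cauchy  = rng.standard_cauchy(n)           # (d) Cauchy (SaS with alpha = 1)

for name, x in [("Gauss", gauss), ("Laplace", laplace),
                ("Poisson", poisson), ("Cauchy", cauchy)]:
    print(f"{name:8s} zeros: {np.mean(x == 0):.2f}   max|x|: {np.max(np.abs(x)):.1f}")
```

Only the compound-Poisson column shows a fraction of exact zeros (close to 0.75), while the Cauchy column exhibits the occasional huge excursions that are characteristic of heavy tails.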

4.2 Lévy exponents and infinitely divisible distributions

The investigation of sparse stochastic processes requires a solid understanding of the


classical notions of Lévy exponents and infinite divisibility, which constitute the pillars
of our formulation. This section provides a self-contained presentation of the required
mathematical background. It also brings out the link with sparsity.
DEFINITION 4.1 (Lévy exponent) A continuous, complex-valued function $f: \mathbb{R} \to \mathbb{C}$
such that $f(0) = 0$ is a valid Lévy exponent if and only if it is conditionally positive
definite of order one, so that
$$\sum_{m=1}^{N} \sum_{n=1}^{N} f(\omega_m - \omega_n)\, \xi_m \overline{\xi}_n \ge 0$$
under the condition $\sum_{m=1}^{N} \xi_m = 0$ for every possible choice of $\omega_1, \ldots, \omega_N \in \mathbb{R}$, $\xi_1, \ldots, \xi_N \in \mathbb{C}$ and $N \in \mathbb{Z}^+$.

The importance of Lévy exponents in mathematical statistics is that they are tightly
linked with the property of infinite divisibility.

DEFINITION 4.2 (Infinite divisibility) A random variable X with generic pdf pid is
infinitely divisible (id) if and only if, for any N ∈ Z+ , there exist i.i.d. random variables
X1 , . . . , XN such that X has the same distribution as X1 + · · · + XN .

The foundation of the theory of such random variables is that their characteristic func-
tions are in one-to-one correspondence with Lévy exponents. While the better-known
formulation of this equivalence is provided by the Lévy–Khintchine theorem (Theo-
rem 4.2), we prefer first to express it in functional terms, building upon the work of
three giants in harmonic analysis: Lévy, Bochner, and Schoenberg.

THEOREM 4.1 (Lévy–Schoenberg) Let $\widehat{p}_{id}(\omega) = E\{e^{j\omega X}\} = \int_{\mathbb{R}} e^{j\omega x}\, p_{id}(x)\, dx$ be the
characteristic function of an infinitely divisible (id) random variable $X$. Then,
$$f(\omega) = \log \widehat{p}_{id}(\omega)$$
is a Lévy exponent in the sense of Definition 4.1. Conversely, if $f$ is a valid Lévy exponent,
then the inverse Fourier integral
$$p_{id}(x) = \int_{\mathbb{R}} e^{f(\omega)}\, e^{-j\omega x}\, \frac{d\omega}{2\pi}$$
yields the pdf of an id random variable.

The proof is given in the supplementary material in Section 4.2.4. As for the latter implication, we observe that the condition $f(0) = 0 \Leftrightarrow \widehat{p}_{id}(0) = 1$ implies that
$\int_{\mathbb{R}} p_{id}(x)\, dx = 1$, while the positive definiteness ensures that $p_{id}(x) \ge 0$ so that it is a
valid pdf.
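The converse statement can be probed numerically. Here is a small sketch (ours; the Laplace exponent $f(\omega) = -\log(1+\omega^2)$ is taken from the Laplace entry of Table 4.1, and the frequency grid is an arbitrary truncation) that evaluates the inverse Fourier integral of Theorem 4.1 by a plain Riemann sum and compares it with the known pdf $\frac{1}{2}e^{-|x|}$:

```python
import numpy as np

f = lambda w: -np.log(1.0 + w**2)          # Laplace Levy exponent, exp(f) = 1/(1+w^2)
w = np.linspace(-1000.0, 1000.0, 1000001)  # truncated frequency grid
dw = w[1] - w[0]
phat = np.exp(f(w))

x = np.linspace(-4.0, 4.0, 9)
# p_id(x) = (1/2pi) * integral of exp(f(w)) exp(-jwx) dw, via a Riemann sum
p = np.array([(phat * np.exp(-1j * w * xx)).sum().real for xx in x]) * dw / (2 * np.pi)

print(np.round(p, 3))                          # numerical inversion
print(np.round(0.5 * np.exp(-np.abs(x)), 3))   # closed form 0.5*exp(-|x|)
```

The two rows match up to the truncation error of the frequency integral, confirming that this particular $f$ yields a bona fide (nonnegative, normalized) pdf.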

4.2.1 Canonical Lévy–Khintchine representation


The second, more explicit statement of the announced equivalence with id distribu-
tions capitalizes on the property that Lévy exponents admit a canonical representation
in terms of a Lévy measure μv or some equivalent density v, which is the notational
choice 1 that we are favoring here.

DEFINITION 4.3 (Lévy measure/density) A (positive) measure $\mu_v$ on $\mathbb{R}\setminus\{0\}$ is called
a Lévy measure if it satisfies the admissibility condition
$$\int_{\mathbb{R}\setminus\{0\}} \min(a^2, 1)\,\mu_v(da) = \int_{\mathbb{R}\setminus\{0\}} \min(a^2, 1)\, v(a)\, da < \infty. \tag{4.1}$$
The corresponding density function $v: \mathbb{R} \to \mathbb{R}^+$, which is such that $\mu_v(da) = v(a)\, da$,
is called the Lévy density.
1 In most mathematical texts, the Lévy–Khintchine decomposition is formulated in terms of a Lévy measure
$\mu_v$ rather than a density. Even though Lévy measures need not always have a density in the sense of the
Radon–Nikodym derivative with respect to the Lebesgue measure (i.e., as an ordinary function), following
Bourbaki we may still identify them with positive linear functionals, which we represent notationally as
integrals against a "generalized" density: $\mu_v(E) = \int_E v(a)\, da$ for any set $E$ in the Borel algebra on $\mathbb{R}\setminus\{0\}$.

We observe that, as in the case of a pdf, the density v is not necessarily an ordinary
function, for it may include isolated Dirac impulses (discrete part of the measure) as
well as a singular component.
THEOREM 4.2 (Lévy–Khintchine) A probability distribution $p_{id}$ is infinitely divisible
(id) if and only if its characteristic function can be written as
$$\widehat{p}_{id}(\omega) = \int_{\mathbb{R}} p_{id}(x)\, e^{j\omega x}\, dx = \exp\bigl(f(\omega)\bigr), \tag{4.2}$$
with
$$f(\omega) = j b_1 \omega - \frac{b_2 \omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \Bigl( e^{j a \omega} - 1 - j a \omega\, \mathbb{1}_{|a|<1}(a) \Bigr)\, v(a)\, da, \tag{4.3}$$
where $b_1 \in \mathbb{R}$ and $b_2 \in \mathbb{R}^+$ are some arbitrary constants, and where $v$ is an admissible
Lévy density; $\mathbb{1}_{|a|<1}(a)$ is the indicator function of the set $\Omega = \{a \in \mathbb{R}: |a| < 1\}$ (i.e.,
$\mathbb{1}_\Omega(a) = 1$ if $a \in \Omega$ and $\mathbb{1}_\Omega(a) = 0$ otherwise).
The admissibility condition (4.1) guarantees that the right-hand integral in (4.3) is
well defined; this follows from the bounds $|e^{ja\omega} - 1 - ja\omega| < a^2\omega^2$ and $|e^{ja\omega} - 1| < \min(|a\omega|, 2)$. An important aspect of the theory is that this allows for (non-integrable) Lévy densities with a singular behavior around the origin; for instance,
$v(a) = O(1/|a|^{2+\epsilon})$ with $\epsilon \in [0, 1)$ as $a \to 0$.
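To make the admissibility condition concrete, here is a quick numerical check (our own illustration, with scipy assumed available) for the $\alpha$-stable-type density $v(a) = |a|^{-(1+\alpha)}$, which is singular at the origin yet Lévy-admissible for $0 < \alpha < 2$:

```python
import numpy as np
from scipy.integrate import quad

def admissibility(alpha):
    """Evaluate condition (4.1) for v(a) = |a|^(-(1+alpha)), split at |a| = 1."""
    v = lambda a: a ** (-(1.0 + alpha))
    near, _ = quad(lambda a: a**2 * v(a), 0, 1)   # a^2 tames the origin singularity
    far, _ = quad(v, 1, np.inf)                   # tail is integrable for alpha > 0
    return 2 * (near + far)                       # v is even: double the half-line

for alpha in (0.5, 1.0, 1.5):
    closed_form = 2 * (1 / (2 - alpha) + 1 / alpha)
    print(alpha, round(admissibility(alpha), 4), round(closed_form, 4))
```

The numerical values agree with the closed form $2\bigl(\tfrac{1}{2-\alpha} + \tfrac{1}{\alpha}\bigr)$, which diverges as $\alpha \to 0$ or $\alpha \to 2$, consistent with the admissible range of the stable family.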
The connection with Theorem 4.1 is that the Lévy–Khintchine expansion (4.3) pro-
vides a complete characterization of the conditionally positive definite functions of or-
der one, as specified in Definition 4.1. This theme is further developed in Appendix B,
which contains the proof of the above statement and also makes interesting links with
theoretical results that are fundamental to machine learning and approximation theory.
In Section 4.4, we shall indicate how id distributions (or, equivalently, Lévy expo-
nents) can be used to specify an extended family of continuous-domain white-noise
processes. In that context, we shall typically require that pid has a well-defined first-
order absolute moment and/or that it is symmetric with respect to the origin, which
leads to the following simplifications of the canonical representation.
COROLLARY 4.3 Let $p_{id}$ be an infinitely divisible pdf whose characteristic function
is given by $\widehat{p}_{id}(\omega) = e^{f(\omega)}$. Then, depending on the properties of $p_{id}$ or, equivalently,
on the Lévy density $v$, the Lévy exponent $f$ admits the following Lévy–Khintchine-type
representations:

(1) $p_{id}$ id symmetric (i.e., $p_{id}(x) = p_{id}(-x)$) if and only if
$$f(\omega) = -\frac{b_2\omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \bigl( \cos(a\omega) - 1 \bigr)\, v(a)\, da \tag{4.4}$$
with $v(a) = v(-a)$.

(2) $p_{id}$ id with $\int_{\mathbb{R}} |x|\, p_{id}(x)\, dx < \infty$ if and only if
$$f(\omega) = jb_1\omega - \frac{b_2\omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \bigl( e^{ja\omega} - 1 - ja\omega \bigr)\, v(a)\, da \tag{4.5}$$
with $\int_{|a|>1} |a|\, v(a)\, da < \infty$.
(3) $p_{id}$ id with $\int_{\mathbb{R}\setminus\{0\}} |a|\, v(a)\, da < \infty$ if and only if
$$f(\omega) = jb_1'\omega - \frac{b_2\omega^2}{2} + \int_{\mathbb{R}\setminus\{0\}} \bigl( e^{ja\omega} - 1 \bigr)\, v(a)\, da \tag{4.6}$$
where $b_1' \in \mathbb{R}$, $b_2 \in \mathbb{R}^+$, $b_1' = b_1 - \int_{\mathbb{R}\setminus\{0\}} a\, v(a)\, da$ and $v(a) \ge 0$ is an admissible
Lévy density.

These are obtained by direct manipulation of (4.3), with $b_1$ replaced by $b_1 + \int_{|a|\ge1} a\, v(a)\, da$.
Equation (4.4) is valid in all generality, provided that we interpret the integral as a Cauchy principal-value limit (see Appendix A) to handle potential (symmetric) singularities
around the origin. The Lévy–Khintchine formulas (4.4) and (4.5) are fundamental because they give an explicit, constructive characterization of the noise functionals that
are central to our formulation. From now on, we rely on Corollary 4.3 to specify admissible Lévy exponents: the parameters $(b_1, b_2, v)$ will be referred to as the Lévy triplet of
$f(\omega)$.
For completeness, we also mention the less classical (but equivalent) representation
of the Lévy exponent as
$$f(\omega) = jb_1\omega - \frac{b_2\omega^2}{2} + \widehat{v}(\omega) - \widehat{v}(0), \tag{4.7}$$
where $\widehat{v}(\omega) = \mathcal{F}\{v\}(\omega)$ is the (conjugate) Fourier transform of $v$ in the sense of generalized functions. The idea here is to rely on the powerful theory of generalized functions
to seamlessly absorb the (potential) singularity 2 of $v$ at $a = 0$. The interested reader can
refer to Appendix A for complementary explanations.
Below is a summary of known criteria for identifying admissible Lévy exponents,
some being more operational than others [GV64, pp. 275–282]. These are all conse-
quences of Bochner’s theorem, which provides a Fourier-domain equivalence between
continuous, positive definite functions and probability density functions (or positive
Borel measures). See Appendix B for an overview and discussion of the functional
notion of positive definiteness and corresponding mathematical tools.

PROPOSITION 4.4 The following statements on $f$ are equivalent:

(1) $f(\omega)$ is a continuous, conditionally positive definite function of order one.
(2) $p_{id}(x) = \mathcal{F}^{-1}\{e^{f(\omega)}\}(x)$ is an infinitely divisible distribution.
(3) $f$ admits a Lévy–Khintchine representation as in Theorem 4.2.
(4) Let $p_{X_\tau}(x) = \mathcal{F}^{-1}\{e^{\tau f(\omega)}\}$ for $\tau \ge 0$. Then, $\{p_{X_\tau}\}_{\tau \in \mathbb{R}^+}$ is a family of valid pdfs; that
is, $p_{X_\tau}(x) \ge 0$ and $\int_{\mathbb{R}} p_{X_\tau}(x)\, dx = 1$ for all $\tau \ge 0$.
(5) $\widehat{p}_{X_\tau}(\omega) = e^{\tau f(\omega)}$ is a continuous, positive definite function of $\omega \in \mathbb{R}$ with $\widehat{p}_{X_\tau}(0) = 1$
for any $\tau \in [0, \infty)$.

Interestingly, it was Schoenberg (the father of splines) who first established the equivalence between Statements (1) and (5) (see proof of the direct part in Section 4.2.4).
2 This corresponds to interpreting $v$ as the generalized function associated with the finite part (p.f., i.e.,
Hadamard's "partie finie") of the classical Lévy measure: $\langle \varphi, v \rangle = \mathrm{p.f.} \int_{\mathbb{R}} \varphi(a)\,\mu_v(da)$.

The equivalence between (4) and (5) follows from Bochner's theorem (Theorem 3.7).
The fact that (2) implies (4) is a side product of the proof in Section 4.2.4, while
the converse implication is a direct consequence of (3). Indeed, if $f(\omega)$ has a Lévy–Khintchine representation, then the same is true for $\tau f(\omega)$, which also implies that the
whole family of pdfs $\{p_{X_\tau}\}_{\tau \in \mathbb{R}^+}$ is infinitely divisible. The latter is uniquely specified by
$f$ and therefore in one-to-one correspondence with the canonical id distribution $p_{id}(x) = p_{X_\tau}(x)\big|_{\tau=1}$. Another important observation is that $\widehat{p}_{X_\tau}(\omega) = e^{\tau f(\omega)} = \bigl(\widehat{p}_{id}(\omega)\bigr)^\tau$, so that
$p_{X_\tau}$ in Statement (4) may be interpreted as the $\tau$-fold (possibly, fractional) convolution
of $p_{id}$.
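The divisibility can be made tangible by simulation. In the Laplace case, $e^{f(\omega)/N} = (1+\omega^2)^{-1/N}$ is the characteristic function of a difference of two independent Gamma$(1/N, 1)$ variables, so summing $N$ such fractional pieces must recover a Laplace sample. A sketch of this check (ours; the Gamma construction is a standard identity, not taken from the book):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 16, 100000
# each summand has characteristic function (1 + w^2)^(-1/N)
parts = rng.gamma(1.0 / N, 1.0, (N, n)) - rng.gamma(1.0 / N, 1.0, (N, n))
x = parts.sum(axis=0)                     # should be Laplace(0, 1)

ref = rng.laplace(0.0, 1.0, n)            # direct Laplace samples, for comparison
for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(q, round(np.quantile(x, q), 3), round(np.quantile(ref, q), 3))
```

The matching quantiles illustrate the $N$-fold decomposition $X = X_1 + \cdots + X_N$ with Lévy exponent $f(\omega)/N$ per component.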
In our work, we sometimes need to limit ourselves to some particular subset of Lévy
exponents.
DEFINITION 4.4 (p-admissibility) A Lévy exponent $f$ with derivative $f'$ is called
$p$-admissible if it satisfies the inequality
$$|f(\omega)| + |\omega|\, |f'(\omega)| \le C\, |\omega|^p \tag{4.8}$$
for some constant $C > 0$ and $0 < p \le 2$.
PROPOSITION 4.5 The generic Lévy exponents

(1) $f_1(\omega) = \int_{\mathbb{R}\setminus\{0\}} \bigl(e^{ja\omega} - 1\bigr)\, v_1(a)\, da$ with $\int_{\mathbb{R}} |a|\, v_1(a)\, da < \infty$
(2) $f_2(\omega) = \int_{\mathbb{R}\setminus\{0\}} \bigl(\cos(a\omega) - 1\bigr)\, v_2(a)\, da$ with $\int_{\mathbb{R}} a^2\, v_2(a)\, da < \infty$
(3) $f_3(\omega) = \int_{\mathbb{R}\setminus\{0\}} \bigl(e^{ja\omega} - 1 - ja\omega\bigr)\, v_3(a)\, da$ with $\int_{\mathbb{R}} a^2\, v_3(a)\, da < \infty$

are $p$-admissible with $p_1 = 1$, $p_2 = 2$, and $p_3 = 2$, respectively.
Proof The first result follows from the bounds $|e^{ja\omega} - 1| \le |a| \cdot |\omega|$ and $\bigl|\frac{d e^{ja\omega}}{d\omega}\bigr| \le |a|$.
The second is based on $|\cos(a\omega) - 1| \le |a\omega|^2$ and $|\sin(a\omega)| \le |a\omega|$. Specifically,
$$|f_2(\omega)| \le \int_{\mathbb{R}} |a\omega|^2\, v_2(a)\, da = |\omega|^2 \int_{\mathbb{R}} a^2\, v_2(a)\, da$$
$$|\omega|\,|f_2'(\omega)| = |\omega|\, \Bigl| \int_{\mathbb{R}\setminus\{0\}} a \sin(a\omega)\, v_2(a)\, da \Bigr| \le |\omega| \int_{\mathbb{R}} |a|\, |a\omega|\, v_2(a)\, da = |\omega|^2 \int_{\mathbb{R}} a^2\, v_2(a)\, da.$$
As for the third exponent, we also use the inequality $|e^{ja\omega} - 1 - ja\omega| \le |a\omega|^2$, which
yields
$$|f_3(\omega)| \le \int_{\mathbb{R}\setminus\{0\}} |a\omega|^2\, v_3(a)\, da = |\omega|^2 \int_{\mathbb{R}} a^2\, v_3(a)\, da$$
$$|\omega|\,|f_3'(\omega)| = |\omega|\, \Bigl| \int_{\mathbb{R}\setminus\{0\}} ja\bigl(e^{ja\omega} - 1\bigr)\, v_3(a)\, da \Bigr| \le |\omega| \int_{\mathbb{R}} |a|\, |a\omega|\, v_3(a)\, da = |\omega|^2 \int_{\mathbb{R}} a^2\, v_3(a)\, da,$$
which completes the proof.

Since the p-admissibility property is preserved through summation, this covers a large
portion of the Lévy exponents specified in Corollary 4.3.
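As a sanity check of inequality (4.8), one can verify $p$-admissibility numerically for a concrete exponent. The sketch below (ours) uses the Laplace exponent $f(\omega) = -\log(1+\omega^2)$, which is symmetric and twice differentiable at the origin and hence expected to be 2-admissible:

```python
import numpy as np

w = np.linspace(1e-6, 100.0, 100000)       # avoid the removable 0/0 at the origin
f = -np.log(1.0 + w**2)                    # Laplace Levy exponent
fp = -2.0 * w / (1.0 + w**2)               # its derivative f'(w)
ratio = (np.abs(f) + w * np.abs(fp)) / w**2
print(round(ratio.max(), 3))               # stays below 3: (4.8) holds with p=2, C=3
```

The bound follows analytically from $\log(1+\omega^2) \le \omega^2$ and $2\omega^2/(1+\omega^2) \le 2\omega^2$, and the computed maximum ($\approx 3$, attained near the origin) is consistent with it.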

Examples
The power law $f_\alpha(\omega) = -|\omega|^\alpha$ with $0 < \alpha \le 2$ is a Lévy exponent that is $\alpha$-admissible; it generates the
S$\alpha$S id distributions [Fel71]. Note that $f_\alpha(\omega)$ fails to be conditionally positive definite
for $\alpha > 2$, meaning that the inverse Fourier transform of $e^{-|\omega|^\alpha}$ exhibits negative values
and is therefore not a valid pdf. The upper acceptable limit is $\alpha = 2$ and corresponds
to the Gaussian law. More generally, a Lévy exponent that is symmetric and twice-differentiable at the origin (which is equivalent to the variance of the corresponding id
distribution being finite) is $p$-admissible with $p = 2$; this follows as a direct consequence
of Corollary 4.3 and Proposition 4.5.
Another fundamental instance, which generates the complete family of id
compound-Poisson distributions, is
$$f_{\mathrm{Poisson}}(\omega) = \lambda \int_{\mathbb{R}\setminus\{0\}} \bigl( e^{ja\omega} - 1 \bigr)\, p_A(a)\, da,$$
where $\lambda > 0$ is the Poisson rate and $p_A(a) \ge 0$ the amplitude pdf with $\int_{\mathbb{R}} p_A(a)\, da = 1$.
In general, $f_{\mathrm{Poisson}}(\omega)$ is $p$-admissible with $p = 1$ provided that $E\{|A|\} = \int_{\mathbb{R}} |a|\, p_A(a)\, da < \infty$ (see Proposition 4.5). If, in addition, $p_A$ is symmetric with a bounded variance, then
the Poisson range of admissibility extends to $p \in [1, 2]$. Further examples of symmetric
id distributions are documented in Table 4.1. Their Lévy exponent is simply obtained
by taking $f(\omega) = \log \widehat{p}_X(\omega)$.
The relevance of id distributions for signal processing is that any linear combination
of independent id random variables is id as well. Indeed, let X1 and X2 be two inde-
pendent id random variables with Lévy exponents f1 and f2 , respectively; then, it is not
difficult to show that a1 X1 + a2 X2 , where a1 and a2 are arbitrary constants, is id with
Lévy exponent f (ω) = f1 (a1 ω) + f2 (a2 ω).
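This closure property is easy to test empirically. The following sketch (ours; the weights $a_1, a_2$ and the Laplace/Gaussian mix are arbitrary illustrative choices) compares the empirical characteristic function of $Y = a_1 X_1 + a_2 X_2$ with $\exp\bigl(f_1(a_1\omega) + f_2(a_2\omega)\bigr)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, a1, a2 = 400000, 0.7, 1.3
y = a1 * rng.laplace(0, 1, n) + a2 * rng.standard_normal(n)

w = np.array([0.25, 0.5, 1.0, 2.0])
emp = np.array([np.mean(np.exp(1j * ww * y)) for ww in w]).real  # empirical cf
# f1(w) = -log(1 + w^2) (Laplace), f2(w) = -w^2/2 (Gaussian)
theory = np.exp(-np.log(1 + (a1 * w)**2) - 0.5 * (a2 * w)**2)
print(np.round(emp, 3))
print(np.round(theory, 3))
```

The agreement (up to Monte Carlo error) illustrates that the id property is preserved under linear combination, which is the key to the transform-domain analysis that follows.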

4.2.2 Deciphering the Lévy–Khintchine formula


From a harmonic-analysis perspective, the Lévy–Khintchine representation is closely related to Bochner's theorem, which states that a positive definite function $g$ can
always be expressed as the Fourier transform of a positive finite Borel measure; i.e.,
$g(\omega) = \int_{\mathbb{R}} e^{j\omega a}\, v(a)\, da$ with $v(a) \ge 0$ and $g(0) = \int_{\mathbb{R}} v(a)\, da < \infty$. Here, the additional
requirement is that $f(0) = 0$ (conditional positive definiteness of order one), which is
enforced by proper subtraction of a linear correction in $ja\omega$ within the integral, the latter
being partly compensated by the addition of the component $j\omega b_1$. The side benefit of
this regularization is that it enlarges the set of admissible densities to those satisfying
$\int_{\mathbb{R}} \min(a^2, 1)\, v(a)\, da < \infty$, which allows for a singular behavior around the origin.
As for the linear and quadratic terms outside the integral, they map into the singular
point distribution $b_1 \delta'(a) + b_2 \delta''(a)$ (weighted derivatives of Dirac impulses) that is
concentrated at the origin $a = 0$ and excluded from the (classical) Lebesgue integral.
For the complete details, we refer the reader to Appendix B. The proposed treatment
relies on Gelfand and Vilenkin's distributional characterization of conditionally
positive definiteness of order $n$ in Theorem B.4. Despite the greater generality of
the result, we find its proof more enlightening and of a less technical nature than
the traditional derivation of the Lévy–Khintchine formula, which is summarized in
Section 4.2.4, for completeness.

[Table 4.1 appears here: Primary families of symmetric, infinitely divisible probability laws. The special functions $K_\alpha(x)$, $\Gamma(z)$, and $B(x, y)$ are defined in Appendix C.]
From a statistical perspective, the exponent $f$ specified by the Lévy–Khintchine formula is the logarithm of the characteristic function of an id random variable. This means
that breaking $f$ into additive subparts is in fact equivalent to factorizing the pdf into
convolutional factors. Specifically, let $\widehat{p}_X(\omega) = e^{\sum_{n=1}^{N} f_n(\omega)}$ be the characteristic function of a (compound) id distribution where the $f_n$ are valid Lévy exponents. Then,
$\widehat{p}_X(\omega) = \prod_{n=1}^{N} \widehat{p}_{X_n}(\omega)$ with $\widehat{p}_{X_n}(\omega) = e^{f_n(\omega)} = E\{e^{j\omega X_n}\}$, which translates into the convolution relation
$$p_X(x) = \bigl( p_{X_1} * p_{X_2} * \cdots * p_{X_N} \bigr)(x).$$

The statistical interpretation is that X = X1 + · · · + XN where the Xn are independent


with id pdf pXn . The infinitely divisible property simply translates into the fact that, for
a given f (ω) and any N > 0, X can always be broken down into N i.i.d. components
with Lévy exponent f (ω)/N. Indeed, it is easy to see from the Lévy representation

that the admissibility of f (ω) implies that τ f (ω) is a valid Lévy exponent as well for
any τ ≥ 0.
To further our understanding of id distributions, it is instructive to characterize the
atoms of the Lévy–Khintchine representation. Focusing on the simplest form (4.6),
we identify three types of elementary constituents, with the third type being motivated
by the decomposition of a (continuous) Lévy density into a weighted "sum" of Dirac
impulses: $v(a) = \int_{\mathbb{R}} v(\tau)\,\delta(a - \tau)\, d\tau \approx \sum_n \lambda_n\, \delta(a - \tau_n)$ with $\lambda_n = v(\tau_n)(\tau_n - \tau_{n-1})$:
(1) Linear term $f_1(\omega) = jb_1\omega$. This corresponds to the (degenerate) pdf of a constant
$X_1 = b_1$ with $p_{X_1}(x) = \delta(x - b_1)$.
(2) Quadratic term $f_2(\omega) = -\frac{b_2\omega^2}{2}$. As already mentioned, this leads to the centered
Gaussian with variance $b_2$ given by
$$p_{X_2}(x) = \mathcal{F}^{-1}\{e^{-b_2\omega^2/2}\}(x) = \frac{1}{\sqrt{2\pi b_2}} \exp\left( -\frac{x^2}{2 b_2} \right).$$
(3) Exponential (or Poisson) term $f_3(\omega) = \lambda(e^{j\tau\omega} - 1)$, which is associated with the
elementary Lévy triplet $\bigl(0, 0, \lambda\delta(a - \tau)\bigr)$. Based on the Taylor-series expansion
$\widehat{p}_{X_3}(\omega) = e^{\lambda(z-1)} = e^{-\lambda} \sum_{m=0}^{+\infty} \frac{(\lambda z)^m}{m!}$ with $z = e^{j\omega\tau}$, we readily obtain the pdf by
(generalized) inverse Fourier transformation:
$$p_{X_3}(x) = \sum_{m=0}^{\infty} \frac{e^{-\lambda} \lambda^m}{m!}\, \delta(x - m\tau).$$
This formula coincides with the continuous-domain representation of a Poisson distribution 3 with Poisson parameter $\lambda$ and gain factor $\tau$; that is, $\mathrm{Prob}(X_3 = \tau m) = \frac{e^{-\lambda}\lambda^m}{m!}$.

More generally, when $v(a) = \lambda p(a)$ where $p(a) \ge 0$ is some arbitrary pdf with
$\int_{\mathbb{R}} p(a)\, da = 1$, we can make a compound-Poisson 4 interpretation with
$$f_{\mathrm{Poisson}}(\omega) = \lambda \int_{\mathbb{R}} (e^{ja\omega} - 1)\, p(a)\, da = \lambda\bigl( \widehat{p}(\omega) - 1 \bigr),$$
where $\widehat{p}(\omega) = \int_{\mathbb{R}} e^{ja\omega}\, p(a)\, da$ is the characteristic function of $p = p_A$. Using the fact
that $\widehat{p}(\omega)$ is bounded, we apply the same type of Taylor-series argument and express the
characteristic function as
$$e^{f_{\mathrm{Poisson}}(\omega)} = e^{-\lambda} \sum_{m=0}^{\infty} \frac{\bigl(\lambda\, \widehat{p}(\omega)\bigr)^m}{m!} = \widehat{p}_Y(\omega).$$

3 The standard form of the discrete Poisson probability model is $\mathrm{Prob}(N = n) = \frac{e^{-\lambda}\lambda^n}{n!}$ with $n \in \mathbb{N}$. It provides
the probability of a given number of independent events ($n$) occurring in a fixed space–time interval when
the average rate of occurrence is $\lambda$. The Poisson parameter is equal to the expected value of $N$, but also to
its variance: $\lambda = E\{N\} = \mathrm{Var}\{N\}$.
4 The compound-Poisson probability model has two components. The first is a random variable $N$ that follows
a Poisson distribution with parameter $\lambda$, and the second a series of i.i.d. random variables $A_1, A_2, A_3, \ldots$
with pdf $p_A$ which are drawn at each trial of $N$. Then, $Y = \sum_{n=1}^{N} A_n$ is a compound-Poisson random
variable with Poisson parameter $\lambda$ and amplitude pdf $p_A$. Its mean and variance are given by $E\{Y\} = \lambda E\{A\}$
and $\mathrm{Var}\{Y\} = \lambda E\{A^2\}$, respectively.

Finally, by using the property that $\widehat{p}(\omega)^m$ is the characteristic function of the $m$-fold
convolution of $p$, we get the general formula for the compound-Poisson pdf with Poisson parameter $\lambda$ and amplitude distribution $p$:
$$p_Y(x) = e^{-\lambda}\left( \delta(x) + \frac{\lambda}{1!}\, p(x) + \frac{\lambda^2}{2!}\, (p * p)(x) + \frac{\lambda^3}{3!}\, (p * p * p)(x) + \cdots \right). \tag{4.9}$$

Thus, in essence, the Lévy–Khintchine formula is a description of the Fourier transform of a distribution that is the convolution of three components: an impulse $\delta(\cdot - b_1)$
(shifting), a Gaussian of variance $b_2$ (smoothing), and a compound-Poisson distribution
(spreading). The effect of the first term is a simple recentering of the pdf around $b_1$.
The third compound-Poisson component is itself obtained via a suitable composition of
$m$-fold convolutions of some primary pdf $p$. It is much more concentrated at the origin
than the Gaussian, because of the presence of the Dirac distribution with weight $e^{-\lambda}$,
but also heavier-tailed because of the spreading effect of the $m$-fold convolution.
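Series (4.9) is directly computable when the $m$-fold convolutions are available in closed form. A short sketch (ours; the standard Gaussian amplitude law and the rate $\lambda = -\log 0.75$ mirror the Figure 4.1c setting, where $p^{*m}$ is simply $\mathcal{N}(0, m)$):

```python
import numpy as np

lam = -np.log(0.75)          # atom weight: Prob(Y = 0) = exp(-lam) = 0.75
x = np.linspace(-4, 4, 9)

p_cont = np.zeros_like(x)    # continuous part of (4.9); the atom exp(-lam)*delta(x) is separate
term = np.exp(-lam)
for m in range(1, 30):
    term *= lam / m                                       # exp(-lam) * lam^m / m!
    p_cont += term * np.exp(-x**2 / (2 * m)) / np.sqrt(2 * np.pi * m)  # N(0, m) density

print("atom at 0:", round(np.exp(-lam), 3))
print(np.round(p_cont, 4))
```

The truncation at $m = 30$ is harmless here since the Poisson weights decay factorially; the output makes the mass-at-zero-plus-broad-mixture structure of the compound-Poisson law explicit.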
The additional linear correction terms in (4.3) and (4.5) allow for a wider variety
of distributions that have the common property of being limits of compound-Poisson
distributions.

PROPOSITION 4.6 Every id distribution is the weak limit of a sequence of compound-Poisson
distributions.

Proof Let $\widehat{p}_{id}$ be the characteristic function of some id distribution $p_{id}$ and consider an
arbitrary sequence $\tau_n \downarrow 0$. Then,
$$\widehat{p}_{X_n}(\omega) = \exp\Bigl( \tau_n^{-1} \bigl( \widehat{p}_{id}(\omega)^{\tau_n} - 1 \bigr) \Bigr)$$
is the characteristic function of a compound-Poisson distribution with parameter $\lambda = \tau_n^{-1}$ and amplitude distribution $p(x) = \mathcal{F}^{-1}\{\widehat{p}_{id}(\omega)^{\tau_n}\}(x)$. Moreover, we have that
$$\widehat{p}_{X_n}(\omega) = \exp\Bigl( \tau_n^{-1} \bigl( e^{\tau_n \log \widehat{p}_{id}(\omega)} - 1 \bigr) \Bigr) = \exp\bigl( \log \widehat{p}_{id}(\omega) + O(\tau_n) \bigr)$$
for every $\omega$ as $n \to \infty$. Hence, $\widehat{p}_{X_n}(\omega) \to \exp\bigl( \log \widehat{p}_{id}(\omega) \bigr) = \widehat{p}_{id}(\omega)$ so that $p_{X_n}$
converges weakly to $p_{id}$ by Lévy's continuity theorem (Theorem 3.8).

The id pdfs for which $v \notin L_1(\mathbb{R})$ are generally smoother than the compound-Poisson
ones, for they do not display a singularity (Dirac impulse) at the origin, unlike (4.9). Yet,
depending on the degree of concentration (or singularity) of v around the origin, they
will typically exhibit a peaky behavior around the mean. While this class of distributions
is responsible for the additional level of complication in the Lévy–Khintchine formula –
as compared to the simpler Poisson version (4.6) – we argue that it is highly relevant
for applications because of the many possibilities that it offers. Somewhat surprisingly,
there are many families of id distributions with singular Lévy density that are more
tractable mathematically than their compound-Poisson cousins found in Table 4.1, at
least when considering their pdf.

4.2.3 Gaussian vs. sparse categorization


The family of id distributions allows for a range of behaviors that varies between the
purely Gaussian and sparse extremes. In the context of Lévy processes, these are often
referred to as the diffusive and jump modes. To make our point, we consider two distinct
scenarios.

Finite-variance case 
We first assume that the second moment $m_2 = \int_{\mathbb{R}} a^2\, v(a)\, da$ of the Lévy density is finite,
which also implies that $\int_{|a|>1} |a|\, v(a)\, da < \infty$ because of the admissibility condition.
Hence, the corresponding Lévy–Khintchine representation is (4.5). An interesting non-Poisson example of infinitely divisible probability laws that falls into this category (with
non-integrable $v$) is the Laplace density with Lévy triplet $\bigl(0, 0, v(a) = \frac{e^{-|a|}}{|a|}\bigr)$ and $p_{id}(x) = \frac{1}{2}e^{-|x|}$ (see Figure 4.1b). This model is particularly relevant in the context of sparse
signal recovery because it provides a tight connection between Lévy processes and total-variation regularization [UT11, Section VI].
Now, if the Lévy density is Lebesgue integrable, we can pull the linear correction
out of the Lévy–Khintchine integral and represent $f$ using (4.6) with $v(a) = \lambda p_A(a)$ and
$\int_{\mathbb{R}} p_A(a)\, da = 1$. This implies that we can decompose $X$ into the sum of two independent
Gaussian and compound-Poisson random variables. The variances of the Gaussian and
Poisson components are $\sigma^2 = b_2$ and $\lambda E\{A^2\}$, respectively. The Poisson component is
sparse because its probability density function exhibits the mass distribution $e^{-\lambda}\delta(x)$ at
the origin shown in Figure 4.1c, meaning that the chances, for a continuous amplitude
distribution, of getting zero are overwhelmingly higher than of getting any other value, especially
for smaller values of $\lambda > 0$. It is therefore justifiable to use $0 \le e^{-\lambda} < 1$ as our Poisson
sparsity index.

Infinite-variance case
We now turn our attention to the case where the second moment of the Lévy density
is unbounded, which we like to label as "super-sparse." To justify this terminology, we
invoke the Ramachandran–Wolfe theorem [Ram69, Wol71], which states that the $p$th
moment $E\{|x|^p\}$ with $p \in \mathbb{R}^+$ of an infinitely divisible distribution is finite if and only
if $\int_{|a|>1} |a|^p\, v(a)\, da < \infty$ (see Section 9.5). For $p \ge 2$, the latter is equivalent to
$\int_{\mathbb{R}} |a|^p\, v(a)\, da < \infty$ because of the Lévy admissibility condition. It follows that the cases
that are not covered by the previous scenario (including the Gaussian + Poisson model) necessarily give rise to distributions whose moments of order $p$ are unbounded for $p \ge 2$. The
prototypical representatives of such heavy-tail distributions are the alpha-stable ones (see
Figure 4.1d) or, by extension, the broad family of infinitely divisible probability laws
that are in their domain of attraction. It has been shown that these distributions precisely
fulfill the requirement for $\ell_p$-compressibility [AUM11], which is a stronger form of sparsity than the presence of a probability mass at the origin.

4.2.4 Proofs of Theorems 4.1 and 4.2


For completeness, we end this section on Lévy exponents with the proofs of the two
key theorems in the theory of infinitely divisible distributions. The Lévy–Schoen-
berg theorem is central to our formulation because it makes the link between the
id property and the fundamental notion of positive definiteness. In the case of the
Lévy–Khintchine theorem, we have opted for a sketch of proof which is adapted
from the literature. The main intent there was to provide additional insights into the
nature of the singularities of the Lévy density and their effect on the form of the
exponent.

Proof of Theorem 4.1 (Lévy–Schoenberg)
Let $\widehat{p}_{id}(\omega) = \int_{\mathbb{R}} e^{j\omega x}\, p_{id}(x)\, dx$ be the characteristic function of an id random variable.
Then, by definition, $\widehat{p}_{id}(\omega)^{1/n}$ is a valid characteristic function for any $n \in \mathbb{Z}^+$.
Since the convolution of two pdfs is a pdf, we can also take integer powers, which
results in $\widehat{p}_{id}(\omega)^{m/n}$ being a characteristic function. For any irrational number
$\tau > 0$, we can specify a sequence of rational numbers $m/n$ that converges to $\tau$ so
that $\widehat{p}_{id}(\omega)^{m/n} \to \widehat{p}_{id}(\omega)^\tau$ with the limit function being continuous. This implies
that $\widehat{p}_{X_\tau}(\omega) = \widehat{p}_{id}(\omega)^\tau$ is a characteristic function for any $\tau \ge 0$ by Lévy's continuity
theorem (Theorem 3.8). Moreover, $\widehat{p}_{X_\tau}(\omega) = \widehat{p}_{X_{\tau/s}}(\omega)^s$ must be non-zero for any finite
$\tau$, owing to the fact that $\lim_{s\to\infty} \widehat{p}_{X_{\tau/s}}(\omega) = 1$. In particular, $\widehat{p}_{id}(\omega) = \widehat{p}_{X_\tau}(\omega)\big|_{\tau=1} \neq 0$
so that we can write it as $\widehat{p}_{id}(\omega) = e^{f(\omega)}$, where $f(\omega)$ is continuous with $\mathrm{Re}\, f(\omega) \le 0$
and $f(0) = 0$. Hence, $\widehat{p}_{X_\tau}(\omega) = \widehat{p}_{id}(\omega)^\tau = e^{\tau f(\omega)} = \int_{\mathbb{R}} e^{j\omega x}\, p_{X_\tau}(x)\, dx$, where $p_{X_\tau}(x)$
is a valid pdf for any $\tau \in [0, \infty)$, which is Statement (4) in Proposition 4.4. By
Bochner's theorem (Theorem 3.7), this is equivalent to $e^{\tau f(\omega)}$ being positive definite for any $\tau \ge 0$ with $f(\omega)$ continuous and $f(0) = 0$. The first part of Theorem 4.1 then follows as a corollary of the next fundamental result, which is due to
Schoenberg.

THEOREM 4.7 (Schoenberg correspondence) The function $f(\omega)$ is conditionally positive definite of order one if and only if $e^{\tau f(\omega)}$ is positive definite for any $\tau > 0$.

Proof We only give the easier part (if statement) and refer to [Sch38, Joh66] for the
complete details. The property that $e^{\tau f(\omega)}$ is positive definite for every $\tau > 0$ is expressed as
$$\sum_{m=1}^{N} \sum_{n=1}^{N} \xi_m \overline{\xi}_n\, e^{\tau f(\omega_m - \omega_n)} \ge 0,$$
for every possible choice of $\omega_1, \ldots, \omega_N \in \mathbb{R}$, $\xi_1, \ldots, \xi_N \in \mathbb{C}$, and $N \in \mathbb{Z}^+$. In the more
restricted setup of Definition 4.1 where $\sum_{n=1}^{N} \xi_n = 0$, this can also be restated as
$$\frac{1}{\tau} \sum_{m=1}^{N} \sum_{n=1}^{N} \xi_m \overline{\xi}_n \bigl( e^{\tau f(\omega_m - \omega_n)} - 1 \bigr) \ge 0.$$

The next step is to take the limit
$$\lim_{\tau \to 0} \sum_{m=1}^{N} \sum_{n=1}^{N} \xi_m \overline{\xi}_n\, \frac{e^{\tau f(\omega_m - \omega_n)} - 1}{\tau} = \sum_{m=1}^{N} \sum_{n=1}^{N} \xi_m \overline{\xi}_n\, f(\omega_m - \omega_n) \ge 0,$$
which implies that $f(\omega)$ is conditionally positive definite of order one.


This also makes the second part of Theorem 4.1 easy because  pid (ω) = ef (ω) can be
factorized into a product of N identical positive definite subparts with Lévy exponent
1
N f (ω).

Sketch of proof of Theorem 4.2 (Lévy–Khintchine)
We start from the equivalence of the id property and Statement (4) in Proposition 4.4
established above. This result is restated as
$$\frac{e^{\tau f(\omega)} - 1}{\tau} = \int_{\mathbb{R}} (e^{j\omega x} - 1)\, \frac{p_{X_\tau}(x)}{\tau}\, dx,$$
the limit of which as $\tau \to 0$ exists and is equal to $f(\omega)$. Next, we define the measure
$K_\tau(dx) = \frac{x^2}{x^2+1}\, \frac{p_{X_\tau}(x)}{\tau}\, dx$, which is bounded for all $\tau > 0$ because $\frac{x^2}{x^2+1} \le 1$ and $p_{X_\tau}$ is a
valid pdf. We then express the Lévy exponent as
$$f(\omega) = \lim_{\tau\to0} \frac{e^{\tau f(\omega)} - 1}{\tau} = \lim_{\tau\to0} \int_{\mathbb{R}} (e^{j\omega x} - 1)\, \frac{p_{X_\tau}(x)}{\tau}\, dx$$
$$= \lim_{\tau\to0} \int_{\mathbb{R}} \left( e^{j\omega x} - 1 - \frac{jx\omega}{1+x^2} \right) \frac{x^2+1}{x^2}\, K_\tau(dx) + j\omega \lim_{\tau\to0} a(\tau),$$
where
$$a(\tau) = \int_{\mathbb{R}} \frac{x}{1+x^2}\, \frac{p_{X_\tau}(x)}{\tau}\, dx.$$
The technical part of the work, which is quite tedious and is not included here, is to
show that the above integrals are bounded and that the two limits are well defined in the
sense that $a(\tau) \to a_0$ and $K_\tau \to K$ (weakly) as $\tau \downarrow 0$ with $K$ being a finite measure.
This ultimately yields Khintchine's canonical representation,
$$f(\omega) = j\omega a_0 + \int_{\mathbb{R}} \left( e^{j\omega x} - 1 - \frac{jx\omega}{1+x^2} \right) \frac{x^2+1}{x^2}\, K(dx),$$
where $a_0 \in \mathbb{R}$ and $K$ is some bounded Borel measure. A potential advantage of Khintchine's representation is that the corresponding measure $K$ is not singular. The connection with the standard Lévy–Khintchine formula is $b_2 = K(0^+) - K(0^-)$ and $v(x)\, dx = \frac{x^2+1}{x^2}\, K(dx)$ for $x \neq 0$. It is also possible to work out a relation between $a_0$ and $b_1$, which
depends upon the type of linear compensation in the canonical representation.
The above manipulation shows that the coefficients of the linear and quadratic terms
of the Lévy–Khintchine formula (4.3) are primarily due to the non-integrable part of
$g(x) = \lim_{\tau\downarrow0} \frac{p_{X_\tau}(x)}{\tau} = \frac{x^2+1}{x^2}\, k(x)$ where $k(x)\, dx = K(dx)$.
By convention, the classical Lévy density $v$ is assumed to be zero at the origin so that
it differs from $g$ by a point distribution that is concentrated at the origin. By invoking
a basic theorem in distribution theory stating that a distribution entirely localized at
the origin can always be expressed as a linear combination of the Dirac impulse and
its derivatives, we can write that $g(x) - v(x) = b_0 \delta(x) + b_1 \delta'(x) + b_2 \delta''(x)$, where the
higher-order derivatives of $\delta$ are excluded because of the admissibility condition.
For the indirect part of the proof, we start from the integral of the Lévy–Khintchine
formula and consider the sequence of distributions whose exponent is
$$f_n(\omega) = \int_{|a|>1/n} \bigl( e^{ja\omega} - 1 - ja\omega\, \mathbb{1}_{|a|<1}(a) \bigr)\, v(a)\, da$$
$$= -j\omega \underbrace{\int_{1/n<|a|<1} a\, v(a)\, da}_{a_n} + \int_{|a|>1/n} \bigl( e^{ja\omega} - 1 \bigr)\, v(a)\, da.$$
Since the leading constant $a_n$ is finite and $\int_{|a|>1/n} v(a)\, da < \infty$ for any fixed $n$ (due
to the admissibility condition on $v$), this corresponds to the exponent of a shifted
compound-Poisson distribution whose characteristic function is $e^{f_n(\omega)}$. This allows us to
deduce that $\widehat{p}_n(\omega) = e^{jb_1\omega - \frac{b_2}{2}\omega^2 + f_n(\omega)}$ is a valid characteristic function for any $n \in \mathbb{Z}^+$.
Finally, we have the convergence of the sequence $\widehat{p}_n(\omega) \to e^{f(\omega)}$ as $n \to \infty$ where
$f(\omega)$ is given by (4.3). Since $f(\omega)$ – and therefore $e^{f(\omega)}$ – is continuous around $\omega = 0$,
we infer that $e^{f(\omega)}$ is a valid characteristic function (by Lévy's continuity theorem). The continuity of
$f$ is established by bounding the Lévy–Khintchine integral and invoking Lebesgue's
dominated convergence theorem. The id part is obvious.

4.3 Finite-dimensional innovation model

To set the stage for the infinite-dimensional extension to come in Section 4.4, it is instructive to investigate the structure of a purely discrete innovation model whose input is
the random vector $\boldsymbol{u} = (U_1, \ldots, U_N)$ of i.i.d. infinitely divisible random variables. The
generic $N$th-order pdf of the discrete innovation variable $\boldsymbol{u}$ is
$$p_{(U_1:U_N)}(u_1, \ldots, u_N) = \prod_{n=1}^{N} p_{id}(u_n), \tag{4.10}$$
where $p_{id}(x) = \mathcal{F}^{-1}\{e^{f(\omega)}\}(x)$ and $f$ is the Lévy exponent of the underlying id distribution. Since $p_{(U_1:U_N)}(u_1, \ldots, u_N)$ is separable due to the independence assumption, we
can write its characteristic function as the product of individual id factors:
$$\widehat{p}_{(U_1:U_N)}(\boldsymbol{\omega}) = E_{(U_1:U_N)}\{e^{j\langle\boldsymbol{\omega}, \boldsymbol{u}\rangle}\} = \prod_{n=1}^{N} e^{f(\omega_n)} = \exp\left( \sum_{n=1}^{N} f(\omega_n) \right), \tag{4.11}$$
where $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_N)$ is the frequency variable. The $N$-dimensional output signal
$\boldsymbol{x} = (X_1, \ldots, X_N)$ is then specified as the solution of the matrix-vector innovation
equation
$$\boldsymbol{u} = \mathbf{L}\boldsymbol{x},$$
where the $N \times N$ whitening matrix $\mathbf{L}$ is assumed to be invertible. This implies that
$\boldsymbol{x} = \mathbf{A}\boldsymbol{u}$ is a linear transformation of the excitation noise with $\mathbf{A} = \mathbf{L}^{-1}$. Its $N$th-order
characteristic function is obtained by simple (linear) change of variable:
$$\widehat{p}_{(X_1:X_N)}(\boldsymbol{\omega}) = E_{(U_1:U_N)}\{e^{j\langle\boldsymbol{\omega}, \mathbf{A}\boldsymbol{u}\rangle}\} = E_{(U_1:U_N)}\{e^{j\langle\mathbf{A}^T\boldsymbol{\omega}, \boldsymbol{u}\rangle}\} = \widehat{p}_{(U_1:U_N)}(\mathbf{A}^T \boldsymbol{\omega}) = \exp\left( \sum_{n=1}^{N} f\bigl( [\mathbf{A}^T \boldsymbol{\omega}]_n \bigr) \right). \tag{4.12}$$

Based on this equation, we can determine any marginal distribution by setting the
appropriate frequency variables to zero. For instance, we find that the first-order pdf
of $X_n$, the $n$th component of $\boldsymbol{x}$, is given by
$$p_{X_n}(x) = \mathcal{F}^{-1}\bigl\{ e^{f_n(\omega)} \bigr\}(x),$$
where
$$f_n(\omega) = \sum_{m=1}^{N} f\bigl( a_{nm}\, \omega \bigr)$$
with weighting coefficients $a_{nm} = [\mathbf{A}]_{nm} = [\mathbf{L}^{-1}]_{nm}$. The key observation here is that
$f_n$ is an admissible Lévy exponent, which implies that the underlying distribution is
infinitely divisible (by Theorem 4.1), with the same being true for all the marginals
and, by extension, the distribution of any linear measurement(s) of $\boldsymbol{x}$. While this provides a general mechanism for characterizing the probability law(s) of the discrete
signal $\boldsymbol{x}$ within the classical framework of multivariate statistics, it is a priori difficult
to perform the required computations (matrix inverse and inverse Fourier transforms)
analytically, except in the Gaussian case, where the exponent is quadratic. Indeed, in
this latter situation, (4.12) simplifies to $\widehat{p}_{(X_1:X_N)}(\boldsymbol{\omega}) = e^{-\frac{1}{2}\|\mathbf{A}^T\boldsymbol{\omega}\|_2^2}$, which is the Fourier
transform of a multivariate Gaussian distribution with zero mean and covariance matrix
$E\{\boldsymbol{x}\boldsymbol{x}^T\} = \mathbf{A}\mathbf{A}^T$.
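The whole discrete model fits in a few lines of code. The sketch below (ours; the first-order difference-type whitening matrix is a hypothetical illustrative choice) draws a sparse realization from a Laplace innovation and verifies the Gaussian-case covariance identity $E\{\boldsymbol{x}\boldsymbol{x}^T\} = \mathbf{A}\mathbf{A}^T$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 64
L = np.eye(N) - 0.95 * np.eye(N, k=-1)       # assumed AR(1)-type whitening matrix
A = np.linalg.inv(L)                         # mixing matrix A = L^{-1}

x_sparse = A @ rng.laplace(0, 1, N)          # one realization of the sparse model

n_trials = 20000
U = rng.standard_normal((N, n_trials))       # Gaussian innovation vectors
X = A @ U
C_emp = X @ X.T / n_trials                   # empirical covariance of x
C_th = A @ A.T
print(np.abs(C_emp - C_th).max() / np.abs(C_th).max())  # small relative error
```

In the non-Gaussian (Laplace) case the same mixing applies, but the output law is only accessible through the exponent $f_n(\omega) = \sum_m f(a_{nm}\omega)$, which is exactly the difficulty that motivates the continuous-domain formalism.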
As we shall see in the next two sections, these results are transposable to the infinite-
dimensional setting (see Table 3.4). While this may look like an unnecessary complica-
tion at first sight, the payoff is a theory that lends itself better to an analytical treatment
using the powerful tools of harmonic analysis. The essence of the generalization is to
replace the frequency variable ω by a generic test function ϕ ∈ S (Rd ), the sums in
(4.11) and (4.12) by Lebesgue integrals, and the matrix inverses by Green’s functions
which can often be specified explicitly. To make an analogy, it is conceptually and prac-
tically easier to formulate a comprehensive (deterministic) theory of linear systems by
using Fourier analysis and convolution operators than by relying on linear algebra, with
the same applying here. At the end of the exercise, it is still possible to come back to
a finite-dimensional signal representation by projecting the continuous-domain model


onto a suitable set of basis functions, as will be shown in Chapter 10.

4.4 White Lévy noises or innovations

Having gained a solid understanding of Lévy exponents, we can now move to the speci-
fication of a corresponding family of continuous-domain white-noise processes to drive
the innovation model in Figure 2.1. To that end, we rely on Gelfand’s theory of gen-
eralized stochastic processes [GV64], which was briefly summarized in Section 3.5.
This powerful formalism allows for the complete and remarkably concise description
of a generalized stochastic process by its characteristic functional. While the latter is
not widely used in the standard formulation of stochastic processes, it lends itself quite
naturally to the specification of generalized white-noise processes in terms of Lévy ex-
ponents, in direct analogy with what we have done before for id distributions.
DEFINITION 4.5 (White Lévy noise or innovation) A generalized stochastic process
$w$ in $\mathcal{D}'(\mathbb{R}^d)$ is called a white Lévy noise (or innovation) if its characteristic functional is
given by
$$\widehat{P}_w(\varphi) = E\{e^{j\langle\varphi, w\rangle}\} = \exp\left( \int_{\mathbb{R}^d} f\bigl(\varphi(\boldsymbol{r})\bigr)\, d\boldsymbol{r} \right), \tag{4.13}$$
where $f$ is a valid Lévy exponent and $\varphi$ is a generic test function in $\mathcal{D}(\mathbb{R}^d)$ (the space of
infinitely differentiable functions of compact support).
Equation (4.13) is very similar to (4.2) and its multivariate extension (4.11). The key
differences are that the frequency variable is now replaced by the generic test function $\varphi \in \mathcal{D}(\mathbb{R}^d)$ (which is a more general infinite-dimensional entity) and that the sum
inside the exponential in (4.11) is replaced by an integral over the domain of $\varphi$. The
fundamental point is that $\widehat{P}_w(\varphi)$ is a continuous, positive definite functional on $\mathcal{D}(\mathbb{R}^d)$
with the key property that $\widehat{P}_w(\varphi_1 + \varphi_2) = \widehat{P}_w(\varphi_1)\,\widehat{P}_w(\varphi_2)$ whenever $\varphi_1$ and $\varphi_2$ have
non-overlapping support (i.e., $\varphi_1(\boldsymbol{r})\varphi_2(\boldsymbol{r}) = 0$). The first part of the statement ensures
that these generalized processes are well defined (by the Minlos–Bochner theorem),
while the separability property implies that they take independent values at all points,
which partially justifies the "white noise" nomenclature. Remarkably, Gelfand and Vilenkin have shown that there is also a converse implication [GV64, Theorem 6, p. 283]:
Equation (4.13) specifies a continuous, positive definite functional on $\mathcal{D}(\mathbb{R}^d)$ (and hence
defines an admissible white-noise process in $\mathcal{D}'(\mathbb{R}^d)$) if and only if $f$ is a Lévy exponent.
This ensures that the Lévy family constitutes the broadest possible class of acceptable
white-noise inputs for our innovation model.

4.4.1 Specification of white noise in Schwartz' space $\mathcal{S}'$


In the present work, which relies heavily on convolution operators and Fourier analysis,
we find it more convenient to define generalized stochastic processes with respect to
test functions in the nuclear space S (Rd ), rather than the smaller space D (Rd ) used
by Gelfand and Vilenkin. This requires a minimal restriction on the class of admissible
Lévy densities in reference to Definition 4.3 to compensate for the lack of compact
support of the functions in S (Rd ).

THEOREM 4.8 (Lévy–Schwartz admissibility) A white Lévy noise specified by (4.13)
with $\varphi \in \mathcal{S}(\mathbb{R}^d)$ is a generalized stochastic process in $\mathcal{S}'(\mathbb{R}^d)$ provided that $f(\cdot)$ is
characterized by the Lévy–Khintchine formula (4.3) with Lévy triplet $(b_1, b_2, v)$, where
the Lévy density $v(a) \ge 0$ satisfies
$$\int_{\mathbb{R}} \min(a^2, |a|^\epsilon)\, v(a)\, da < \infty \quad \text{for some } \epsilon > 0. \tag{4.14}$$

Proof By the Minlos–Bochner theorem (Theorem 3.9), it suffices to show that $\widehat{P}_w(\varphi)$
is a continuous, positive definite functional over $\mathcal{S}(\mathbb{R}^d)$ with $\widehat{P}_w(0) = 1$, where the
latter follows trivially from $f(0) = 0$. The positive definiteness is a direct consequence
of the exponential nature of the characteristic functional and the conditional positive
definiteness of the Lévy exponent (see Section 4.4.3 and the paragraph following Equation (4.20)).
The only delicate part is to prove continuity in the topology of $\mathcal{S}(\mathbb{R}^d)$. To that end,
we consider a series of functions $\varphi_n$ that converge to $\varphi$ in $\mathcal{S}(\mathbb{R}^d)$. First, we recall that
$\mathcal{S}(\mathbb{R}^d) \subset L_p(\mathbb{R}^d)$ for all $0 < p \le +\infty$. Moreover, the convergence in $\mathcal{S}(\mathbb{R}^d)$ implies
the convergence in all the $L_p$ spaces. Indeed, if we select $k > 0$ such that $kp > 1$ and
$\epsilon_0 > 0$, we have, for $n$ sufficiently large,
$$|\varphi_n(\boldsymbol{r}) - \varphi(\boldsymbol{r})| \le \frac{\epsilon_0}{(\|\boldsymbol{r}\|^2 + 1)^k},$$
which implies that
$$\|\varphi_n - \varphi\|_{L_p}^p \le \epsilon_0^p \int \frac{d\boldsymbol{r}}{(\|\boldsymbol{r}\|^2 + 1)^{kp}}.$$
Since the right-hand-side integral is convergent and independent of $n$, we conclude that
$\lim_{n\to\infty} \|\varphi_n - \varphi\|_{L_p} = 0$.
Since continuity is preserved through the composition of continuous maps and the
exponential function is continuous, we only need to establish the continuity of $F(\varphi) = \log \widehat{P}_w(\varphi)$. The continuity of the Gaussian part is obvious since $\int \varphi_n\, d\boldsymbol{r} \to \int \varphi\, d\boldsymbol{r}$ and
$\|\varphi_n\|_{L_2} \to \|\varphi\|_{L_2}$. We therefore concentrate on the functional
$$G(\varphi) = \int_{\mathbb{R}^d} \int_{\mathbb{R}\setminus\{0\}} \Bigl( e^{ja\varphi(\boldsymbol{r})} - 1 - ja\varphi(\boldsymbol{r})\, \mathbb{1}_{|a|\le1}(a) \Bigr)\, v(a)\, da\, d\boldsymbol{r}$$
that corresponds to the non-Gaussian part of the Lévy–Khintchine representation of $f$.
Next, we write
$$|G(\varphi_n) - G(\varphi)| \le \int_{\mathbb{R}^d} \int_{|a|>1} \bigl| e^{ja\varphi_n(\boldsymbol{r})} - e^{ja\varphi(\boldsymbol{r})} \bigr|\, v(a)\, da\, d\boldsymbol{r}$$
$$+ \int_{\mathbb{R}^d} \int_{0<|a|\le1} \bigl| e^{ja\varphi_n(\boldsymbol{r})} - e^{ja\varphi(\boldsymbol{r})} - ja\bigl(\varphi_n(\boldsymbol{r}) - \varphi(\boldsymbol{r})\bigr) \bigr|\, v(a)\, da\, d\boldsymbol{r}$$
$$= (1) + (2).$$

To bound the first integral, we use the inequality
$$\bigl| e^{jx} - e^{jy} \bigr| = \bigl| e^{jy}(e^{j(x-y)} - 1) \bigr| \le \min(2, |x - y|) \le 2\, \Bigl| \frac{x-y}{2} \Bigr|^\epsilon$$
under the (non-restrictive) condition that $\epsilon \le 1$. This yields
$$(1) \le 2^{1-\epsilon} \int_{\mathbb{R}^d} \int_{|a|>1} |a|^\epsilon\, |\varphi_n(\boldsymbol{r}) - \varphi(\boldsymbol{r})|^\epsilon\, v(a)\, da\, d\boldsymbol{r} = 2^{1-\epsilon} \left( \int_{|a|>1} |a|^\epsilon\, v(a)\, da \right) \|\varphi_n - \varphi\|_{L_\epsilon}^\epsilon.$$
As for the second integral, we use the bound
$$\bigl| e^{jx} - e^{jy} - j(x - y) \bigr| = \bigl| e^{jy}\bigl( e^{j(x-y)} - 1 - j(x-y) \bigr) + j(x-y)(e^{jy} - 1) \bigr|$$
$$\le \bigl| e^{j(x-y)} - 1 - j(x-y) \bigr| + \bigl| (x-y)(e^{jy} - 1) \bigr| \le (x - y)^2 + |x - y| \cdot |y|.$$
Therefore,
$$(2) \le \int_{\mathbb{R}^d} \int_{0<|a|\le1} a^2 \bigl( \varphi_n(\boldsymbol{r}) - \varphi(\boldsymbol{r}) \bigr)^2\, v(a)\, da\, d\boldsymbol{r} + \int_{\mathbb{R}^d} \int_{0<|a|\le1} a^2\, |\varphi_n(\boldsymbol{r}) - \varphi(\boldsymbol{r})| \cdot |\varphi(\boldsymbol{r})|\, v(a)\, da\, d\boldsymbol{r}$$
$$\le \left( \int_{0<|a|\le1} a^2\, v(a)\, da \right) \Bigl( \|\varphi_n - \varphi\|_{L_2}^2 + \|\varphi_n - \varphi\|_{L_2}\, \|\varphi\|_{L_2} \Bigr).$$

Since $\|\varphi_n - \varphi\|_{L_p} \to 0$ for all $p > 0$ as $\varphi_n$ converges to $\varphi$ in $\mathcal{S}(\mathbb{R}^d)$, we conclude that
$\lim_{n\to\infty} |G(\varphi_n) - G(\varphi)| = 0$, which proves the continuity of $\widehat{P}_w(\varphi)$.
Note that (4.14), which will be referred to as Lévy–Schwartz admissibility, is a very
slight restriction on the classical condition ($\epsilon = 0$) for id laws (see (4.1) in Definition 4.3). The fact that $\epsilon$ can be chosen arbitrarily small reflects the property that the
functions in $\mathcal{S}$ have a faster-than-algebraic decay. Another equivalent formulation of
Lévy–Schwartz admissibility is
$$E\{|\langle \varphi, w \rangle|^\epsilon\} < \infty \quad \text{for some } \epsilon > 0, \tag{4.15}$$
which follows from (9.10) and Proposition 9.10 in Chapter 9. Along the same lines, it
can be shown that the finiteness of the $\epsilon$th-order moment in (4.15) for any non-trivial
$\varphi_0$ implies that the same holds true for all $\varphi \in \mathcal{S}(\mathbb{R}^d)$.
From now on, we implicitly assume that the Lévy–Schwartz admissibility condition
is met.
To exemplify the procedure, we select a quadratic exponent, which is trivially admissible (since $v(a) = 0$). This results in
$$\widehat{P}_{w_{\mathrm{Gauss}}}(\varphi) = \exp\left( -\frac{b_2}{2} \|\varphi\|_{L_2}^2 \right),$$
which is the functional that completely characterizes the white Gaussian noise of the
classical theory of stationary processes.

4.4.2 Impulsive Poisson noise


We have already alluded to the fact that the continuous-domain white-noise processes
w are highly singular and generally too rough to admit an interpretation as conventional
functions of the index variable r ∈ Rd . The realizations (or sample paths) are genera-
lized functions that can only be probed indirectly through their scalar products ϕ, w
with test functions or observation windows, as illustrated in Section 4.1. While the use
of such an indirect approach is unavoidable in the mathematical formulation, it is pos-
sible to provide an explicit pointwise description of a noise realization in the special case
where the Lévy exponent f is associated with a compound-Poisson distribution [UT11].
The corresponding impulsive Poisson noise model is
$$w(\boldsymbol{r}) = \sum_{k\in\mathbb{Z}} A_k\, \delta(\boldsymbol{r} - \boldsymbol{r}_k), \tag{4.16}$$

where the $\boldsymbol{r}_k$ are random point locations in $\mathbb{R}^d$ and where the $A_k$ are i.i.d. random variables
with pdf $p_A$. The random events are indexed by $k$ (using some arbitrary ordering); they
are mutually independent and follow a spatial Poisson distribution. Specifically, let $\Omega$
be any finite-measure subset of $\mathbb{R}^d$; then the probability of observing $N(\Omega) = n$ events
in $\Omega$ is
$$\mathrm{Prob}\bigl(N(\Omega) = n\bigr) = \frac{e^{-\lambda \mathrm{Vol}(\Omega)}\, \bigl(\lambda \mathrm{Vol}(\Omega)\bigr)^n}{n!},$$
where $\mathrm{Vol}(\Omega)$ is the measure (or spatial volume) of $\Omega$. This is to say that the Poisson
parameter $\lambda$ represents the average number of random impulses per unit hyper-volume.
The link with the formal specification of Lévy noise in Definition 4.5 is as follows.
THEOREM 4.9 The characteristic functional of the impulsive Poisson noise specified
by (4.16) is
$$\widehat{P}_{w_{\mathrm{Poisson}}}(\varphi) = E\{e^{j\langle\varphi, w\rangle}\} = \exp\left( \int_{\mathbb{R}^d} f_{\mathrm{Poisson}}\bigl(\varphi(\boldsymbol{r})\bigr)\, d\boldsymbol{r} \right) \tag{4.17}$$
with
$$f_{\mathrm{Poisson}}(\omega) = \lambda \int_{\mathbb{R}} (e^{ja\omega} - 1)\, p_A(a)\, da = \lambda\bigl( \widehat{p}_A(\omega) - 1 \bigr), \tag{4.18}$$
where $\lambda$ is the Poisson density parameter, $p_A$ the amplitude pdf of the Dirac impulses,
and $\widehat{p}_A$ the corresponding characteristic function.
Proof We select an arbitrary test function $\varphi \in \mathcal{D}(\mathbb{R}^d)$ of compact support, with its support included in the centered cube $\Omega_\varphi = [-c_\varphi, c_\varphi]^d$. We denote by $N_{w,\varphi}$ the number of
Poisson points of $w$ in $\Omega_\varphi$; by definition, it is a Poisson random variable with parameter
$\lambda \mathrm{Vol}(\Omega_\varphi)$. The restriction of $w$ to $\Omega_\varphi$ corresponds to the random sum
$$\sum_{n=1}^{N_{w,\varphi}} A_n\, \delta(\boldsymbol{r} - \boldsymbol{r}_n),$$
where we have used an appropriate relabeling of the variables $\{(A_k, \boldsymbol{r}_k)\}_{\boldsymbol{r}_k \in \Omega_\varphi}$ in
(4.16). Correspondingly, we have $\langle \varphi, w \rangle = \sum_{n=1}^{N_{w,\varphi}} A_n\, \varphi(\boldsymbol{r}_n)$.
By the order-statistics property of Poisson processes, the $\boldsymbol{r}_n$ are independent and all
equivalent in distribution to a random variable $\boldsymbol{r}'$ that is uniform on $\Omega_\varphi$.
Using the law of total expectation, we expand the characteristic functional of $w$,
$\widehat{P}_w(\varphi) = E\{e^{j\langle\varphi,w\rangle}\}$, as
$$\widehat{P}_w(\varphi) = E\Bigl\{ E\bigl\{ e^{j\langle\varphi,w\rangle} \,\big|\, N_{w,\varphi} \bigr\} \Bigr\}$$
$$= E\left\{ E\left\{ \prod_{n=1}^{N_{w,\varphi}} e^{jA_n \varphi(\boldsymbol{r}_n)} \,\Big|\, N_{w,\varphi} \right\} \right\}$$
$$= E\left\{ \prod_{n=1}^{N_{w,\varphi}} E\bigl\{ e^{jA \varphi(\boldsymbol{r}')} \bigr\} \right\} \quad \text{(by independence)}$$
$$= E\left\{ \prod_{n=1}^{N_{w,\varphi}} E\Bigl\{ E\bigl\{ e^{jA\varphi(\boldsymbol{r}')} \,\big|\, A \bigr\} \Bigr\} \right\} \quad \text{(total expectation)}$$
$$= E\left\{ \prod_{n=1}^{N_{w,\varphi}} \frac{\int_{\Omega_\varphi} e^{jA\varphi(\boldsymbol{r})}\, d\boldsymbol{r}}{\mathrm{Vol}(\Omega_\varphi)} \right\} \quad \text{(as } \boldsymbol{r}' \text{ is uniform in } \Omega_\varphi\text{)}$$
$$= E\left\{ \prod_{n=1}^{N_{w,\varphi}} \frac{\int_{\Omega_\varphi} \int_{\mathbb{R}} e^{ja\varphi(\boldsymbol{r})}\, p_A(a)\, da\, d\boldsymbol{r}}{\mathrm{Vol}(\Omega_\varphi)} \right\}. \tag{4.19}$$

The last expression has the inner expectation expanded in terms of the pdf $p_A$ of the
random variable $A$. Defining the auxiliary functional
$$M(\varphi) = \int_{\Omega_\varphi} \int_{\mathbb{R}} e^{ja\varphi(\boldsymbol{r})}\, p_A(a)\, da\, d\boldsymbol{r},$$
we rewrite (4.19) as
$$E\left\{ \prod_{n=1}^{N_{w,\varphi}} \frac{M(\varphi)}{\mathrm{Vol}(\Omega_\varphi)} \right\} = E\left\{ \left( \frac{M(\varphi)}{\mathrm{Vol}(\Omega_\varphi)} \right)^{N_{w,\varphi}} \right\}.$$

Next, we use the fact that $N_{w,\varphi}$ is a Poisson random variable to compute the above
expectation directly:
$$E\left\{ \left( \frac{M(\varphi)}{\mathrm{Vol}(\Omega_\varphi)} \right)^{N_{w,\varphi}} \right\} = \sum_{n\ge0} \left( \frac{M(\varphi)}{\mathrm{Vol}(\Omega_\varphi)} \right)^n \frac{e^{-\lambda\mathrm{Vol}(\Omega_\varphi)}\, \bigl(\lambda\mathrm{Vol}(\Omega_\varphi)\bigr)^n}{n!}$$
$$= e^{-\lambda\mathrm{Vol}(\Omega_\varphi)} \sum_{n\ge0} \frac{\bigl(\lambda M(\varphi)\bigr)^n}{n!} = e^{-\lambda\mathrm{Vol}(\Omega_\varphi)}\, e^{\lambda M(\varphi)} \quad \text{(Taylor)}$$
$$= \exp\Bigl( \lambda\bigl( M(\varphi) - \mathrm{Vol}(\Omega_\varphi) \bigr) \Bigr).$$

 
We now replace $M(\varphi)$ by its integral equivalent, noting also that $\mathrm{Vol}(\Omega_\varphi) = \int_{\Omega_\varphi} \int_{\mathbb{R}} 1 \times p_A(a)\, da\, d\boldsymbol{r}$, whereupon we obtain the expression
$$\widehat{P}_w(\varphi) = \exp\left( \lambda \int_{\Omega_\varphi} \int_{\mathbb{R}} (e^{ja\varphi(\boldsymbol{r})} - 1)\, p_A(a)\, da\, d\boldsymbol{r} \right).$$
As $(e^{ja\varphi(\boldsymbol{r})} - 1)$ vanishes outside the support of $\varphi$ (and, therefore, outside $\Omega_\varphi$), we may
enlarge the domain of the inner integral to all of $\mathbb{R}^d$, which yields (4.17). Finally, we
use the fact that the derived Poisson functional is part of the Lévy family and invoke
Theorem 4.8 to extend the domain of $\widehat{P}_w$ from $\mathcal{D}(\mathbb{R}^d)$ to $\mathcal{S}(\mathbb{R}^d)$.

The interest of this result is twofold. First, it gives a concrete meaning to the
compound-Poisson scenario in Figure 4.1c, allowing for a description in terms of
conventional point processes. In the same vein, we can propose a physical analogy for
the elementary Poisson term $f_3(\omega) = \lambda(e^{ja_0\omega} - 1)$ in Section 4.2.2 with $p_A(a) = \delta(a - a_0)$:
the counting of photons impinging on the detectors of a CCD camera with the photon
density being constant over $\mathbb{R}^d$ and the integration time proportional to $\lambda$. The corresponding process is usually termed "photon noise" in optical imaging. Second, the
explicit noise model (4.16) suggests a practical mechanism for generating generalized
Poisson processes as a weighted sum of shifted Green's functions of $\mathrm{L}$, each Dirac
impulse being replaced by the response of the inverse operator in the innovation model
in Figure 2.1.
Note that the above description of generalized compound-Poisson processes is compatible with the usual definition of finite-rate-of-innovation signals. Yet, this is by far
not the whole story since the impulsive Poisson noise is the only member of the Lévy
family whose "innovation rate," as measured by $\lambda$, is finite.
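The generative description (4.16) translates directly into a simulation recipe. A minimal one-dimensional sketch (ours; Gaussian amplitudes and the interval $\Omega = [0, T]$ are illustrative assumptions) draws the impulse count, locations, and amplitudes, then observes the noise through unit rect windows:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, T = 0.3, 10000.0
n_imp = rng.poisson(lam * T)                 # total number of impulses in [0, T]
r_k = rng.uniform(0, T, n_imp)               # uniform impulse locations
a_k = rng.standard_normal(n_imp)             # i.i.d. amplitudes with pdf p_A

# X_k = <rect(. - k), w> = sum of the amplitudes falling in the bin [k, k+1)
X = np.zeros(int(T))
np.add.at(X, r_k.astype(int), a_k)

print("fraction of exact zeros:", round(np.mean(X == 0), 3),
      "vs exp(-lam) =", round(np.exp(-lam), 3))
```

Windows that capture no impulse return exactly zero, so the empirical fraction of zeros reproduces the mass $e^{-\lambda}$ of the compound-Poisson law, in agreement with Proposition 4.12 below.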

4.4.3 Properties of white noise


To emphasize the parallel with the scalar formulation in Section 4.2, we start by intro-
ducing the functional counterpart of Definition 4.1.
DEFINITION 4.6 (Generalized Lévy exponent) A continuous complex-valued functional F on the nuclear space S(R^d) such that F(0) = 0 and $F(\varphi) = \overline{F(-\varphi)}$ is called a generalized Lévy exponent if it is conditionally positive definite of order one; i.e.,

$$\sum_{m=1}^{N}\sum_{n=1}^{N} F(\varphi_m - \varphi_n)\,\xi_m\overline{\xi_n} \ge 0$$

under the condition ∑_{n=1}^{N} ξ_n = 0 for every possible choice of ϕ₁, …, ϕ_N ∈ S(R^d), ξ₁, …, ξ_N ∈ C, and N ∈ Z₊.

This definition is motivated by the infinite-dimensional counterpart of Schoenberg's correspondence theorem (Theorem 4.7) [PR70].

THEOREM 4.10 (Prakasa Rao) Let F be a complex-valued functional on the nuclear space S(R^d) such that F(0) = 0 and $F(\varphi) = \overline{F(-\varphi)}$. Then, the following conditions are equivalent.

(1) The functional F is conditionally positive definite of order one.
(2) For every choice of ϕ₁, …, ϕ_N ∈ S(R^d), ξ₁, …, ξ_N ∈ C, τ > 0, and N ∈ Z₊,
$$\sum_{m=1}^{N}\sum_{n=1}^{N} \exp\big(\tau F(\varphi_m - \varphi_n)\big)\,\xi_m\overline{\xi_n} \ge 0.$$
(3) For every choice of ϕ₁, …, ϕ_N ∈ S(R^d), ξ₁, …, ξ_N ∈ C, and N ∈ Z₊,
$$\sum_{m=1}^{N}\sum_{n=1}^{N} \big(F(\varphi_m - \varphi_n) - F(\varphi_m) - F(-\varphi_n)\big)\,\xi_m\overline{\xi_n} \ge 0.$$
In the white-Lévy-noise scenario of Definition 4.5, we have that

$$F(\varphi) = \int_{\mathbb{R}^d} f\big(\varphi(r)\big)\,\mathrm{d}r. \qquad (4.20)$$

It then comes as no surprise that the generalized Lévy exponent F(ϕ) inherits the relevant properties of f, including conditional positive definiteness, with

$$\sum_{m=1}^{N}\sum_{n=1}^{N} F(\varphi_m - \varphi_n)\,\xi_m\overline{\xi_n} = \int_{\mathbb{R}^d} \underbrace{\sum_{m=1}^{N}\sum_{n=1}^{N} f\big(\varphi_m(r) - \varphi_n(r)\big)\,\xi_m\overline{\xi_n}}_{\ge 0}\,\mathrm{d}r \ge 0$$

subject to the constraint ∑_{n=1}^{N} ξ_n = 0.
The simple additive nature of the mapping (4.20) between generalized Lévy expo-
nents and the classical ones translates into the following white-noise properties which
are central to our formulation.

(1) Independent atoms and stationarity


PROPOSITION 4.11 A white Lévy noise is stationary and independent at every point.

Proof The stationarity property is expressed by $\widehat{\mathscr{P}}_w(\varphi) = \widehat{\mathscr{P}}_w\big(\varphi(\cdot - r_0)\big)$ for all r₀ ∈ R^d. It is established by a simple change of variable in the defining integral. To investigate the independence at every point, we determine the joint characteristic function of the random variables X₁ = ⟨ϕ₁, w⟩ and X₂ = ⟨ϕ₂, w⟩, which is given by p̂_{X₁,X₂}(ω₁, ω₂) = exp(F(ω₁ϕ₁ + ω₂ϕ₂)), where F is defined by (4.20). When ϕ₁ and ϕ₂ have non-overlapping support, we use the fact that f(0) = 0 and decompose the exponent as

$$F(\omega_1\varphi_1 + \omega_2\varphi_2) = F(\omega_1\varphi_1) + F(\omega_2\varphi_2),$$

which implies that

$$\widehat{p}_{X_1,X_2}(\omega_1, \omega_2) = \exp\big(F(\omega_1\varphi_1)\big)\times\exp\big(F(\omega_2\varphi_2)\big) = \widehat{p}_{X_1}(\omega_1)\times\widehat{p}_{X_2}(\omega_2),$$

where p̂_X(ω) = exp(F(ωϕ)); this proves that X₁ and X₂ are independent. The independence at every point follows from the fact that one can consider contracting, Dirac-like sequences of functions ϕ₁ and ϕ₂ that are non-overlapping and whose support gets arbitrarily small.

(2) Infinite divisibility


PROPOSITION 4.12 A white Lévy noise is uniquely specified by a canonical id distribution $p_{\mathrm{id}}(x) = \int_{\mathbb{R}} e^{f(\omega) - j\omega x}\,\frac{\mathrm{d}\omega}{2\pi}$, where f is the defining Lévy exponent in (4.20). The latter corresponds to the pdf of the observation X = ⟨rect(· − r₀), w⟩ through a rectangular window at some arbitrary location r₀.

The first part is just a restatement of the functional equivalence between Lévy noises and id distributions on the one hand, and Lévy exponents on the other. As for the second part, we recall that the characteristic function of the variable X = ⟨ϕ, w⟩ is given by

$$\mathbb{E}\{e^{j\omega X}\} = \mathbb{E}\{e^{j\omega\langle\varphi, w\rangle}\} = \mathbb{E}\{e^{j\langle\omega\varphi, w\rangle}\} = \widehat{\mathscr{P}}_w(\omega\varphi) = \exp\big(F(\omega\varphi)\big).$$

By choosing ϕ(r) = rect(r − r₀) with rect(r) = 1 for r ∈ (−1/2, 1/2]^d and zero otherwise, we formally resolve the integral ∫_{R^d} f(ω rect(r)) dr = f(ω); this implies that p̂_id(ω) = exp(f(ω)), which is the desired result.
Along the same line, we can show that the use of an arbitrary, non-rectangular analysis window does not fundamentally change the situation, in the sense that the pdf remains infinitely divisible.

PROPOSITION 4.13 The observation X = ⟨ϕ, w⟩ of a white Lévy noise with Lévy exponent f through an arbitrary observation window ϕ – not necessarily in S(R^d) – yields an infinitely divisible random variable X whose characteristic function is $\widehat{p}_X(\omega) = e^{f_\varphi(\omega)}$, where $f_\varphi(\omega) = \int_{\mathbb{R}^d} f\big(\omega\varphi(r)\big)\,\mathrm{d}r$. The validity of f_ϕ requires some (mild) technical condition on f when the window ϕ ∈ L_p(R^d) is not rapidly decaying.

This property is investigated in depth in Chapter 9 and exploited for deriving transform-domain statistics. The precise statement of this id result for ϕ ∈ L_p(R^d) is given in Theorem 9.1.
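As a numerical illustration of Proposition 4.12 (a sketch of ours, with arbitrary grid parameters), one can tabulate the canonical id pdf by discretizing the Fourier inversion integral of e^{f(ω)} with an FFT. Taking the Cauchy exponent f(ω) = −|ω| makes the result checkable against the closed form 1/(π(1 + x²)):

```python
import numpy as np

# Tabulate p_id(x) = (1/2pi) * integral exp(f(w) - j*w*x) dw by an FFT.
f = lambda w: -np.abs(w)               # Cauchy Levy exponent (alpha-stable, alpha = 1)

M, W = 2**14, 200.0                    # number of samples and total frequency span
w = (np.arange(M) - M // 2) * (W / M)  # uniform grid on [-W/2, W/2)
dw = W / M

phat = np.exp(f(w))                    # characteristic function exp(f(w))
# Riemann sum of the inversion integral, evaluated with an FFT
p = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(phat))).real * dw / (2 * np.pi)
x = (np.arange(M) - M // 2) * (2 * np.pi / W)

exact = 1.0 / (np.pi * (1.0 + x**2))   # Cauchy pdf, for comparison
print("max abs discretization error:", np.abs(p - exact).max())
```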
Another manifestation of the id property is that a continuous-domain Lévy noise can always be broken down into an arbitrary number of i.i.d. components.

PROPOSITION 4.14 A white Lévy noise w is infinitely divisible in the sense that it can be decomposed as w = w₁ + ⋯ + w_N for any N ∈ Z₊, where the w_n are i.i.d. white-noise processes.

This simply follows from the property that the characteristic functional of the sum of two independent processes is the product of their individual characteristic functionals. Specifically, we can write

$$\widehat{\mathscr{P}}_w(\varphi) = \Big[\widehat{\mathscr{P}}_{w_N}(\varphi)\Big]^{N},$$

where $\widehat{\mathscr{P}}_{w_N}(\varphi) = \exp\Big(\int_{\mathbb{R}^d} f\big(\varphi(r)\big)/N\,\mathrm{d}r\Big)$ is the characteristic functional of a Lévy noise. The justification is that f(ω)/N is a valid Lévy exponent for any N ≥ 1. In the impulsive Poisson case, this simply translates into the Poisson density parameter λ being divided by N.

Interestingly, there is also a converse to the statement in Proposition 4.14 [PR70]: a generalized process s in S′(R^d) is infinitely divisible if and only if its characteristic functional can be written as $\widehat{\mathscr{P}}_s(\varphi) = \exp\big(F(\varphi)\big)$, where F(ϕ) is a continuous, conditionally positive definite functional over S(R^d) (or generalized Lévy exponent) as specified in Definition 4.6 [PR70, main theorem]. While this general characterization is nice conceptually, it is hardly discriminative since the underlying notion of infinite divisibility applies to all concrete families of generalized stochastic processes that are known to us. In particular, it does not require the "whiteness" property that is fundamental for defining proper innovations.

(3) Flat power spectrum


Strictly speaking, the properties of stationarity and independence at every point are not sufficient for specifying "white" noise. There is also the implicit idea of enforcing a flat Fourier spectrum. A simple example that satisfies the first two properties but fails to meet the latter is the weak derivative of a Lévy noise, whose generalized power spectrum (when defined) is not flat but proportional to |ω|².
The notion of power spectrum is based on second-order moments and only makes sense when the stochastic process is stationary with a well-defined autocorrelation. In Gelfand's theory, the second-order dependencies are captured by the correlation functional B_w(ϕ₁, ϕ₂) = E{⟨ϕ₁, w⟩ · ⟨ϕ₂, w⟩}, where it is assumed that the generalized noise process w is real-valued and that its second-order moments are well defined. The latter second-order requirement is equivalent to imposing that the Lévy exponent f be twice-differentiable at the origin or, equivalently, that the canonical id distribution p_id of the process have a finite second-order moment.

PROPOSITION 4.15 Let w be a (second-order) white Lévy noise with Lévy exponent f such that f′(0) = 0 (zero-mean assumption) and σ_w² = −f″(0) < +∞ (finite-variance assumption). Then,

$$B_w(\varphi_1, \varphi_2) = \sigma_w^2\,\langle\varphi_1, \varphi_2\rangle. \qquad (4.21)$$

Formally, this corresponds to the statement that the autocorrelation of a second-order⁵ Lévy noise is a Dirac impulse; i.e.,

$$R_w(r) = \mathbb{E}\{w(r_0)\,w(r_0 - r)\} = B_w\big(\delta(\cdot - r_0),\,\delta(\cdot - r_0 + r)\big) = \sigma_w^2\,\delta(r),$$

where r₀ ∈ R^d can be arbitrary, as a consequence of the stationarity property. This also means that the generalized power spectrum of a second-order white Lévy noise is flat:

$$\Phi_w(\omega) = \mathscr{F}\{R_w\}(\omega) = \sigma_w^2.$$

We recall that the term “white” is used in reference to white light, whose electromag-
netic spectrum is distributed over the visible band in a way that stimulates all color
receptors of the eye equally. This is in opposition to “colored” noise, whose spectral
content is not equally distributed.
⁵ In the statistical literature, a second-order process usually designates a stochastic process whose second-order moments are all well defined. In the case of generalized processes, the property refers to the existence of the correlation functional.

Proof We have that B_w(ϕ₁, ϕ₂) = E{X₁X₂}, where X₁ = ⟨ϕ₁, w⟩ and X₂ = ⟨ϕ₂, w⟩ are real-valued random variables with joint characteristic function $\widehat{p}_{X_1,X_2}(\boldsymbol{\omega}) = \exp\big(F(\omega_1\varphi_1 + \omega_2\varphi_2)\big)$ with ω = (ω₁, ω₂). We then invoke the moment-generating property of the Fourier transform, which translates into

$$B_w(\varphi_1, \varphi_2) = \mathbb{E}\{X_1 X_2\} = (-\mathrm{j})^2\,\frac{\partial^2 \widehat{p}_{X_1,X_2}(\boldsymbol{\omega})}{\partial\omega_1\,\partial\omega_2}\bigg|_{\omega_1=0,\,\omega_2=0}.$$

By applying the chain rule twice, we obtain

$$\frac{\partial^2 \widehat{p}_{X_1,X_2}(\boldsymbol{\omega})}{\partial\omega_1\,\partial\omega_2} = e^{f_{X_1,X_2}(\boldsymbol{\omega})}\bigg(\frac{\partial f_{X_1,X_2}(\boldsymbol{\omega})}{\partial\omega_1}\,\frac{\partial f_{X_1,X_2}(\boldsymbol{\omega})}{\partial\omega_2} + \frac{\partial^2 f_{X_1,X_2}(\boldsymbol{\omega})}{\partial\omega_1\,\partial\omega_2}\bigg),$$

where

$$f_{X_1,X_2}(\boldsymbol{\omega}) = \log \widehat{p}_{X_1,X_2}(\omega_1, \omega_2) = \int_{\mathbb{R}^d} f\big(\omega_1\varphi_1(r) + \omega_2\varphi_2(r)\big)\,\mathrm{d}r$$

is the cumulant generating function of $\widehat{p}_{X_1,X_2}$. The required first derivative with respect to ω_i, i = 1, 2, is given by

$$\frac{\partial f_{X_1,X_2}(\boldsymbol{\omega})}{\partial\omega_i} = \int_{\mathbb{R}^d} f'\big(\omega_1\varphi_1(r) + \omega_2\varphi_2(r)\big)\,\varphi_i(r)\,\mathrm{d}r,$$

which, when evaluated at the origin, simplifies to

$$\frac{\partial f_{X_1,X_2}(\mathbf{0})}{\partial\omega_i} = f'(0)\int_{\mathbb{R}^d}\varphi_i(r)\,\mathrm{d}r = \mathrm{j}\,\mathbb{E}\{X_i\}. \qquad (4.22)$$

Similarly, we get

$$\frac{\partial^2 f_{X_1,X_2}(\mathbf{0})}{\partial\omega_1\,\partial\omega_2} = f''(0)\int_{\mathbb{R}^d}\varphi_1(r)\,\varphi_2(r)\,\mathrm{d}r.$$

By combining those results and using the property that f_{X₁,X₂}(0) = 0, we conclude that

$$\mathbb{E}\{X_1 X_2\} = -f''(0)\,\langle\varphi_1, \varphi_2\rangle - \big(f'(0)\big)^2\,\langle\varphi_1, 1\rangle\,\langle\varphi_2, 1\rangle,$$

which is equivalent to (4.21) under the hypothesis that f′(0) = 0. It is also clear from (4.22) that this latter condition is equivalent to the zero-mean property of the noise; that is, E{⟨ϕ, w⟩} = 0 for all ϕ ∈ S(R^d). Finally, we note that (4.21) is compatible with the more general cumulant formula (9.22) if we set n = (1, 1), n = 2, and κ₂ = (−j)²f″(0).

Since f(0) = 0 by definition, another way of writing the hypotheses in Proposition 4.15 is $f(\omega) = -\frac{\sigma_w^2}{2}\,\omega^2 + O(|\omega|^3)$, which expresses an asymptotic equivalence with the symmetric Gaussian scenario (purely quadratic Lévy exponent). This second-order assumption ensures that the noise has zero mean and a finite variance σ_w², so that its correlation functional (4.21) is well defined. In what follows, it is made implicit whenever we are talking of correlations or power spectra.
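A short empirical check (our sketch; grid sizes arbitrary) that whiteness is not tied to Gaussianity: the averaged periodogram of a discretized Laplace-distributed (heavy-tailed) white noise of unit variance is flat at the level σ_w² = 1:

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 1024, 400
spec = np.zeros(n)
for _ in range(trials):
    # Laplace with scale b has variance 2*b**2; b = 1/sqrt(2) gives variance 1
    w = rng.laplace(scale=1 / np.sqrt(2), size=n)
    spec += np.abs(np.fft.fft(w))**2 / n     # periodogram of one realization
spec /= trials

print("averaged periodogram  mean/min/max:",
      spec.mean().round(3), spec.min().round(3), spec.max().round(3))
```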

(4) Stochastic counterpart of the Dirac impulse


From an engineering perspective, white noise is often viewed as the stochastic analog
of the Dirac distribution δ, whose spectrum is flat in the literal sense (i.e., F {δ} = 1).
The fundamental difference, of course, is that the generalized function δ is a determi-
nistic entity. The simplest way of introducing randomness is by considering a shifted
and weighted impulse Aδ(· − r0 ) whose location r0 is uniformly distributed over some
compact subset of Rd and whose amplitude A is a random variable with pdf pA . A richer
form of excitation is obtained through the summation of such i.i.d. elementary contri-
butions, which results in the construction of impulsive Poisson noise, as specified by
(4.16). Theorem 4.9 ensures that this explicit way of representing noise is legitimate in
the case where the Lévy density v = λpA is integrable and the Gaussian part absent.
We shall now see that this constructive approach can be pushed to the limit for the
non-Poisson brands of innovations, including the Gaussian ones.

P R O P O S I T I O N 4.16 A white Lévy noise is the limit of a sequence of impulsive


Poisson-noise processes in the sense of the weak convergence of the underlying infinite-
dimensional measures.

Proof The technical part of the proof uses an infinite-dimensional generalization of Lévy's continuity theorem and will be reported elsewhere. The key idea is to consider the following sequence of Lévy exponents:

$$f_n(\omega) = n\Big(e^{\frac{1}{n}f(\omega)} - 1\Big) = f(\omega) + O\bigg(\frac{f^2(\omega)}{n}\bigg),$$

which are of the compound-Poisson type with λ_n = n and $\widehat{p}_{A_n}(\omega) = e^{\frac{1}{n}f(\omega)}$, and which converge to f(ω) as n goes to infinity. This suggests forming the corresponding sequence of characteristic functionals

$$\widehat{\mathscr{P}}_{w_n}(\varphi) = \exp\bigg(n\int_{\mathbb{R}^d}\Big(e^{\frac{1}{n}f(\varphi(r))} - 1\Big)\,\mathrm{d}r\bigg),$$

which are expected to converge to $\widehat{\mathscr{P}}_w(\varphi) = \exp\big(\int_{\mathbb{R}^d} f\big(\varphi(r)\big)\,\mathrm{d}r\big)$ as n → +∞. The main point for the argument is that these are all of the impulsive Poisson type for n fixed (see Theorem 4.9). The crux of the proof is to control the convergence by specifying some appropriate bound and to verify that some basic equicontinuity conditions are met.

The result is interesting because it gives us some insight into the nature of continuous-domain noise. The limit process involves random Dirac impulses that get denser as n increases. When the variance of the noise σ_w² = −f″(0) is finite, the increase of the average number of impulses per unit volume λ_n = O(n) is compensated by a decrease of the variance of the amplitude distribution in inverse proportion: Var{A_n} = σ_w²/n. While any of the generalized noise processes in the sequence is as rough as a Dirac impulse, this picture suggests that the degree of singularity of the limit object in the non-Poisson scenario can be potentially reduced due to the accumulation of impulses and the fact that the variance of their amplitude distribution converges to zero. The particular example that we have in mind is Gaussian white noise, which can be obtained as a limit of compound-Poisson processes with contracting Gaussian amplitude distributions.
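A toy Monte Carlo experiment (ours; all parameters are illustrative) makes this limit tangible: observing the n-th compound-Poisson noise of the proof over a unit window, with λ_n = n and Gaussian amplitudes of variance σ²/n, yields a random variable whose excess kurtosis decays like 3/n toward the Gaussian value of 0.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, trials = 1.0, 200_000

def observe(n):
    # Number of impulses over a unit window: N ~ Poisson(lambda_n), lambda_n = n
    N = rng.poisson(n, size=trials)
    # Sum of N i.i.d. Gaussian amplitudes of variance sigma2/n is, conditionally
    # on N, Gaussian with variance N * sigma2 / n
    return rng.standard_normal(trials) * np.sqrt(N * sigma2 / n)

for n in (1, 10, 100, 1000):
    X = observe(n)
    kurt = np.mean(X**4) / np.var(X)**2 - 3.0   # 0 for a Gaussian law
    print(f"n = {n:4d}:  var = {np.var(X):.3f},  excess kurtosis = {kurt:+.3f}")
```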

4.5 Generalized stochastic processes and linear models

As already mentioned, the class of generalized stochastic processes that are of interest
to us are those defined through the generic innovation model Ls = w (linear stochastic
differential equation) where the differential operator L is shift-invariant and where the
driving term w is a continuous-domain white Lévy noise. Having made sense of the
latter, we can now proceed with the specification of the class of admissible whitening
operators L. The key requirement here is that the model be invertible (in the sense
of generalized functions) which, by duality, translates into some continuity constraint
on the adjoint operator L−1∗ . For the time being, we shall limit ourselves to making
some general statements about L and its inverse that ensure existence, while deferring
to Chapter 5 for concrete examples of admissible operators.

4.5.1 Innovation models


The interpretation of the above continuous-domain linear model in the sense of
generalized functions is

$$\forall\varphi\in\mathscr{S}(\mathbb{R}^d), \qquad \langle\varphi, \mathrm{L}s\rangle = \langle\varphi, w\rangle. \qquad (4.23)$$

The generalized stochastic process s = L−1 w is generated by solving this equation,


which amounts to a linear transformation of the Lévy innovation w. Formally, this trans-
lates into

$$\forall\varphi\in\mathscr{S}(\mathbb{R}^d), \qquad \langle\varphi, s\rangle = \langle\varphi, \mathrm{L}^{-1}w\rangle = \langle\mathrm{L}^{-1*}\varphi, w\rangle, \qquad (4.24)$$

where L−1 is an appropriate right inverse of L. The above manipulation obviously only
makes sense if the action of the adjoint operator L−1∗ is well defined over Schwartz’
class S (Rd ) of test functions – ideally, a continuous mapping from S (Rd ) into itself
or, possibly, Lp (Rd ) (or some variant) if one imposes suitable restrictions on the Lévy
exponent f to maintain continuity.
We like to refer to (4.23) as the analysis statement of the model, and to (4.24) – or
its shorthand, s = L−1 w – as the synthesis description. Of course, this will only work
properly if we have an exact equivalence, meaning that there is a proper and unique
definition of L−1 . The latter will need to be made explicit on a case-by-case basis with
the possible help of boundary conditions.

4.5.2 Existence and characterization of the solution


We shall now see that, under suitable conditions on L−1∗ (see Theorem 4.17 below),
one can completely specify such processes via their characteristic functional and ensure
their existence as solutions of (4.23). We recall that the characteristic functional of a generalized stochastic process s is defined as

$$\widehat{\mathscr{P}}_s(\varphi) = \mathbb{E}\{e^{j\langle\varphi,s\rangle}\} = \int_{\mathscr{S}'(\mathbb{R}^d)} e^{j\langle\varphi,s\rangle}\,\mathscr{P}_s(\mathrm{d}s),$$

where the latter expression involves an abstract infinite-dimensional integral over the space of tempered distributions and provides the connection with the defining measure 𝒫_s on S′(R^d). $\widehat{\mathscr{P}}_s$ is a functional S(R^d) → C that associates the complex number $\widehat{\mathscr{P}}_s(\varphi)$ with each test function ϕ ∈ S(R^d) and which is endowed with three fundamental properties: positive definiteness, continuity, and normalization (i.e., $\widehat{\mathscr{P}}_s(0) = 1$). It can also be specified using the more concrete formula

$$\widehat{\mathscr{P}}_s(\varphi) = \int_{\mathbb{R}} e^{jy}\,\mathrm{d}P_{\langle\varphi,s\rangle}(y), \qquad (4.25)$$

which involves a classical Stieltjes integral with respect to the probability law $P_{Y=\langle\varphi,s\rangle} = \mathrm{Prob}(Y < y)$, where Y = ⟨ϕ, s⟩ is a conventional scalar random variable, once ϕ is fixed.
For completeness, we also recall the meaning of the underlying terminology in the
context of a generic (normed or nuclear) space X of test functions.

DEFINITION 4.7 (Positive definite functional) A complex-valued functional G : X → C defined over the function space X is said to be positive definite if

$$\sum_{m=1}^{N}\sum_{n=1}^{N} G(\varphi_m - \varphi_n)\,\xi_m\overline{\xi_n} \ge 0$$

for every possible choice of ϕ₁, …, ϕ_N ∈ X, ξ₁, …, ξ_N ∈ C, and N ∈ N₊.

DEFINITION 4.8 (Continuous functional) A functional G : X → R (or C) is said to be continuous (with respect to the topology of the function space X) if, for any convergent sequence (ϕᵢ) in X with limit ϕ ∈ X, the sequence G(ϕᵢ) converges to G(ϕ); that is,

$$\lim_i G(\varphi_i) = G(\lim_i \varphi_i).$$

An essential element of our formulation is that Schwartz' space of test functions S(R^d) is nuclear (see Section 3.1.3), as required by the Minlos–Bochner theorem (Theorem 3.9). The latter expresses the one-to-one correspondence (in the form of an infinite-dimensional Fourier pair) between the characteristic functional $\widehat{\mathscr{P}}_s : \mathscr{S}(\mathbb{R}^d) \to \mathbb{C}$ and the measure 𝒫_s on S′(R^d) that uniquely characterizes the generalized process s. The truly powerful aspect of the theorem is that it suffices to check that $\widehat{\mathscr{P}}_s$ satisfies the three defining conditions – positive definiteness, continuity over S(R^d), and normalization – to prove that it is a valid characteristic functional, which then automatically ensures the existence of the process since the corresponding measure over S′(R^d) is well defined.
The formulation of our generative model (4.24) in that context is

$$\widehat{\mathscr{P}}_s(\varphi) = \widehat{\mathscr{P}}_{\mathrm{L}^{-1}w}(\varphi) = \widehat{\mathscr{P}}_w(\mathrm{L}^{-1*}\varphi), \qquad (4.26)$$

where $\widehat{\mathscr{P}}_w(\varphi)$ is the characteristic functional of the innovation process w.

THEOREM 4.17 (Generalized innovation model) Let U = L^{-1*} be a linear operator that satisfies the two conditions

(1) Left-inverse property: UL*ϕ = ϕ for all ϕ ∈ S(R^d), where L* is the adjoint of some given (whitening) operator L;
(2) Stability: U is a continuous linear map from S(R^d) into itself or, by extension, S(R^d) → R(R^d).

Then, the generalized stochastic process s that is characterized by $\mathbb{E}\{e^{j\langle\varphi,s\rangle}\} = \widehat{\mathscr{P}}_s(\varphi) = \exp\Big(\int_{\mathbb{R}^d} f\big(\mathrm{L}^{-1*}\varphi(r)\big)\,\mathrm{d}r\Big)$ is well defined in S′(R^d) and satisfies the innovation model Ls = w, where w is a Lévy innovation with exponent f.

When f is p-admissible (see Definition 4.4) with p ≥ 1, the second condition can be replaced by the weaker requirement that U be a continuous linear map from S(R^d) into L_p(R^d).

Proof First, we prove that s is a bona fide generalized stochastic process in S′(R^d) by showing that $\widehat{\mathscr{P}}_s(\varphi)$ is a continuous, positive definite functional on S(R^d) such that $\widehat{\mathscr{P}}_s(0) = 1$ (by the Minlos–Bochner theorem).

The Lévy noise functional $\widehat{\mathscr{P}}_w(\varphi) = \exp\big(\int_{\mathbb{R}^d} f\big(\varphi(r)\big)\,\mathrm{d}r\big)$ is continuous over S(R^d) by construction (see Theorem 4.8). This, together with the assumption that L^{-1*} is a continuous operator on S(R^d), implies that the composed functional $\widehat{\mathscr{P}}_s(\varphi) = \widehat{\mathscr{P}}_w(\mathrm{L}^{-1*}\varphi)$ is continuous on S(R^d). The reasoning is also applicable when L^{-1*} is a continuous operator S(R^d) → L_p(R^d) and $\widehat{\mathscr{P}}_w(\varphi)$ is continuous over L_p(R^d) – see the triangular diagram in Figure 3.1 with X = S(R^d) and Y = L_p(R^d). This latter scenario is covered by Theorem 8.2, which establishes the positive definiteness and continuity of $\widehat{\mathscr{P}}_w$ over L_p(R^d) when f is p-admissible (see Section 8.2). The case where L^{-1*} is a continuous operator S(R^d) → R(R^d) is handled in the same fashion by invoking Proposition 8.1.

Next, for any given set of functions ϕ₁, …, ϕ_N ∈ S(R^d) and coefficients ξ₁, …, ξ_N ∈ C, we have

$$\begin{aligned}
\sum_{m=1}^{N}\sum_{n=1}^{N} \widehat{\mathscr{P}}_s(\varphi_m - \varphi_n)\,\xi_m\overline{\xi_n}
&= \sum_{m=1}^{N}\sum_{n=1}^{N} \widehat{\mathscr{P}}_w\big(\mathrm{L}^{-1*}(\varphi_m - \varphi_n)\big)\,\xi_m\overline{\xi_n}\\
&= \sum_{m=1}^{N}\sum_{n=1}^{N} \widehat{\mathscr{P}}_w(\mathrm{L}^{-1*}\varphi_m - \mathrm{L}^{-1*}\varphi_n)\,\xi_m\overline{\xi_n} \qquad\text{(by linearity)}\\
&\ge 0 \qquad\text{(by the positivity of } \widehat{\mathscr{P}}_w\text{)},
\end{aligned}$$

which shows that $\widehat{\mathscr{P}}_s$ is positive definite on S(R^d). Finally, $\widehat{\mathscr{P}}_s(0) = \widehat{\mathscr{P}}_w(\mathrm{L}^{-1*}0) = \widehat{\mathscr{P}}_w(0) = 1$, which completes the first part of the proof.

The above result and stability conditions ensure that the action of the inverse operator L^{-1} is well defined over the relevant subset of tempered distributions, which justifies the formal manipulation made in (4.24). Now, if L^{-1*} is a proper left inverse of L*, we have that

$$\langle\varphi, w\rangle = \Big\langle\underbrace{\mathrm{L}^{-1*}\mathrm{L}^{*}}_{\mathrm{Id}}\varphi,\, w\Big\rangle = \Big\langle \mathrm{L}^{*}\varphi,\, \underbrace{\mathrm{L}^{-1}w}_{s}\Big\rangle = \langle\varphi, \mathrm{L}s\rangle,$$

which proves that the generalized process s = L^{-1}w satisfies (4.23).


The next chapters are devoted to the investigation of specific instances of this model
and to making sure that the conditions for existence in Theorem 4.17 are met. We shall
also discuss the conceptual connection with splines and wavelets. This connection is
fundamental to our purpose. In Chapter 9, we shall then use the generalized innovation
model to show that the primary statistical features of the input noise are essentially
transferred to the signal as well as to the transform domain. The main point is that
the marginal distributions are all part of the same infinitely divisible family as long as
the composition of the mixing procedure (L−1 ) and the signal analysis remains linear.
On the other hand, the amount of coupling and level of interdependence will strongly
depend on the nature of the transformation.

4.6 Bibliographical notes

Sections 4.2 and 4.3


Infinitely divisible distributions were introduced by de Finetti in 1929, and their pri-
mary properties established by Kolmogorov, Lévy, Khintchine, and Feller in the 1930s
[BDR02, SVH03, MR06]. They constitute a classical topic in probability theory in tight
connection with the central-limit theorem [GK68, Fel71]. The general expression (4.3)
for the exponent of the characteristic function of an infinitely divisible random variable
was given by Lévy [Lév34]. Shortly after, Khintchine provided a purely analytical deri-
vation [Khi37b,Khi37a]. The sketch of proof in Section 4.2.4 is adapted from [Khi37a],
whose translation is given in [MR06].
We have chosen to name Theorem 4.1 after Lévy and Schoenberg because it essen-
tially results from the combination of two fundamental theorems in harmonic analysis
named after these authors [Lév34, Sch38]. While their groundwork dates back to the
1930s, it took until the late 1960s to reformulate the Lévy–Khintchine characterization
in terms of the (conditional) positive definiteness of the exponent [Joh66].
The p-admissibility condition was introduced in [UTS14] in order to simplify the
derivation of bounds and continuity properties related to Lévy exponents. The argument
concerning the compatibility of infinite divisibility with the notion of sparsity is also
adapted from this paper.
For additional information on id laws, we refer the reader to [SVH03, Sat94, CT04];
these works also contain the ground material for the specification of the symmetric id
88 Continuous-domain innovation models

distributions in Table 4.1. Further distributional properties relating to decay and the
effect of repeated convolution are exposed in Chapter 9.

Section 4.4
The specification of white Lévy noise by means of its characteristic functional (see
Definition 4.5) is based on a series of theorems by Gelfand and Vilenkin [GV64].
Interestingly, the generic form (4.13) is not only sufficient for defining a (stationary)
innovation, as proven by these authors, but also necessary if one adds the observability
constraint that X_id = ⟨rect, w⟩ is a well-defined random variable [AU14]. The restric-
tion of the family to the space of tempered distributions was investigated by Fageot
et al. [Fag14]. Theorem 4.9 is adapted from [UT11].
The abstract characterization of infinite divisibility and the full generalization of the
Lévy–Khintchine formula for measures over topological vector spaces is covered in the
works of Fernique and Prakasa Rao [Fer67, PR70].

Section 4.5
The innovation or filtered-white-noise model has a long tradition in communication and
statistical signal processing in relation to time-series analysis [BS50, WM57, Kai70].
The classical assumption is that the excitation noise (innovation) is Gaussian and that
the shaping filter is causal with a causal inverse. The innovation, as defined by Wiener
and Masani, then corresponds to the unpredictable part of the signal; that is, the differ-
ence between the value of the signal at given time t and the optimal linear forecast of that
value based on the information available prior to t. Thanks to the Gaussian hypothesis,
one can then formulate a coherent correlation theory of such processes, using standard
Fourier and Hilbert-space techniques, in which continuous-domain white noise only
intervenes as a formal entity; that is, a Gaussian process whose power spectrum is a
constant. This purely spectral description of white noise is consistent with the Wiener–
Khintchine theorems,⁶ which explains its popularity among engineers [Pap91, Yag86].
The non-Gaussian extension of the innovation model that is presented in this chapter
is conceptually similar, but relies on the more elaborate definition of continuous-domain
white noise and the functional tools that were developed by Gelfand to formulate his
theory of generalized stochastic processes [Gel55, GV64]. A slightly more restrictive
version of the model with Gaussian and/or impulsive Poisson excitation was presented
in [UT11]. The original statement of Theorem 4.17 for d = 1 can be found in [UTS14].
While the level of generality of this result is sufficient for our purpose, we must warn
the reader that the framework cannot directly handle non-linear transformations because
the underlying objects are generalized functions, which are intrinsically linear. For com-
pleteness, we mention the existence of an extended theory of white noise, due to Hida,
which is aimed at overcoming this limitation [HKPS93,HS04,HS08]. This theory gives
a meaning to certain classes of non-linear white-noise functionals – in analogy with Itô’s
calculus – but it is mathematically quite involved and beyond the scope of this book.

⁶ The Wiener–Khintchine theorem states that the autocorrelation function of a second-order stationary process is the inverse Fourier transform of its power spectrum.
5 Operators and their inverses

In this chapter we review three classes of linear shift-invariant (LSI) operators: convo-
lution operators with stable LSI inverses, operators that are linked with ordinary differ-
ential equations, and fractional operators.
The first class, considered in Section 5.2, is composed of the broad family of mul-
tidimensional operators whose inverses are stable convolution operators – or filters.
Convolution operators play a central role in signal processing. They are easy to charac-
terize mathematically via their impulse response. The corresponding generative model
for stochastic processes amounts to LSI filtering of a white noise, which automatically
yields stationary processes.
Our second class is the 1-D family of ordinary differential operators with constant
coefficients, which is relevant to a wide range of modeling applications. In the “stable”
scenario, reviewed in Section 5.3, these operators admit stable LSI inverses on S′ and
are therefore included in the previous category. On the other hand, when the differential
operators have one or more zeros on the imaginary axis (the marginally stable/unstable
case), they find a non-trivial null space in S′, which consists of (exponential) polynomials. This implies that they are no longer unconditionally invertible on S′, and
that we can at best identify left- or right-side inverses, which should additionally fulfill
appropriate “boundedness” requirements in order to be usable in the definition of sto-
chastic processes. However, as we shall see in Section 5.4, obtaining an inverse with
the required boundedness properties is feasible but requires giving up shift-invariance.
As a consequence, stochastic processes defined by these operators are generally non-
stationary.
The third class of LSI operators, investigated in Section 5.5, consists of fractional
derivatives and/or Laplacians in one and several dimensions. Our focus is on the family
of linear operators on S that are simultaneously homogeneous (scale-invariant up to
a scalar coefficient) and invariant under shifts and rotations. These operators are inti-
mately linked to self-similar processes and fractals [BU07, TVDVU09]. Once again,
finding a stable inverse operator to be used in the definition of self-similar processes
poses a mathematical challenge since the underlying system is inherently unstable. The
difficulty is evidenced by the fact that statistical self-similarity is generally not compa-
tible with stationarity, which means that a non-shift-invariant inverse operator needs to
be constructed. Here again, a solution may be found by extending the approach used for
the previous class of operators. From our first example in Section 5.1, we shall actually

see that one is able to reconcile the classical theory of stationary processes with that of
self-similar ones by viewing the latter as a limit case of the former.
Before we begin our discussion of operators, let us formalize some notions of
invariance.

D E F I N I T I O N 5.1 (Translation-invariance) An operator T is shift- (or translation-)


invariant if and only if, for any function ϕ in its domain and any r0 ∈ Rd ,

T{ϕ(· − r0 )}(r) = T{ϕ}(r − r0 ).

D E F I N I T I O N 5.2 (Scale-invariance) An operator T is scale-invariant (homogeneous)


of order γ if and only if, for any function ϕ in its domain,

T{ϕ}(r/a) = |a|γ T{ϕ(·/a)}(r),

where a ∈ R+ is the dilation factor.

An alternative version of the scale-invariance condition is

T{ϕ(a·)}(r) = |a|γ T{ϕ}(ar), (5.1)

where a now represents a contraction factor.

D E F I N I T I O N 5.3 (Rotation-invariance) An operator T is scalarly rotation-invariant


if and only if, for any function ϕ in its domain,

$$\mathrm{T}\{\varphi\}(\mathbf{R}^{T}r) = \mathrm{T}\{\varphi(\mathbf{R}^{T}\cdot)\}(r),$$

where R is any orthogonal matrix in Rd×d (by using orthogonal matrices in the defini-
tion, we take into account both proper and improper rotations, with respective determi-
nants 1 and −1).

5.1 Introductory example: first-order differential equation

To fix ideas, let us consider the generic first-order differential operator L = D − αId with α ∈ C, where D = d/dr and Id are the derivative and identity operators, respectively. Clearly, L is LSI, but generally not scale-invariant unless α = 0. The corresponding linear system with (deterministic or stochastic) output s and input w is defined by the differential equation (d/dr)s(r) − αs(r) = w(r). Under the classical stability assumption Re(α) < 0 (pole in the left half of the complex plane), its impulse response is given by

$$\rho_\alpha(r) = \mathscr{F}^{-1}\bigg\{\frac{1}{j\omega - \alpha}\bigg\}(r) = \mathbb{1}_+(r)\,e^{\alpha r} \qquad (5.2)$$

and is rapidly decaying. This provides us with an explicit characterization of the inverse operator, which reduces to a simple convolution with a decreasing causal exponential:

$$(\mathrm{D} - \alpha\mathrm{Id})^{-1}\varphi = \rho_\alpha * \varphi.$$

Figure 5.1 Comparison of antiderivative operators. (a) Input signal. (b) Result of the shift-invariant integrator and its adjoint. (c) Result of the scale-invariant integrator I₀ and its L_p-stable adjoint I₀*; the former yields a signal that vanishes at the origin, while the latter enforces the decay of the output as t → −∞ at the cost of a jump discontinuity at the origin.

Likewise, it is easy to see that the corresponding adjoint inverse operator L^{-1*} is specified by

$$(\mathrm{D} - \alpha\mathrm{Id})^{-1*}\varphi = \rho_\alpha^\vee * \varphi,$$

where ρ_α^∨(r) = ρ_α(−r) is the time-reversed version of ρ_α. Thanks to its rapid decay, ρ_α^∨ defines a continuous linear translation-invariant map from S(R) into itself.
This allows us to express the solution of the first-order SDE as a filtered version of the input noise: s_α = ρ_α ∗ w. It follows that s_α is a stationary process that is completely specified by its characteristic form $\mathbb{E}\{e^{j\langle\varphi,s_\alpha\rangle}\} = \exp\big(\int_{\mathbb{R}} f\big((\rho_\alpha^\vee * \varphi)(r)\big)\,\mathrm{d}r\big)$, where f is the Lévy exponent of the innovation w (see Section 4.5).
Let us now focus our attention on the limit case α = 0, which yields an operator L = D that is scale-invariant. Here too, it is possible to specify the LSI inverse (integrator)

$$\mathrm{D}^{-1}\varphi(r) = \int_{-\infty}^{r} \varphi(\tau)\,\mathrm{d}\tau = (\mathbb{1}_+ * \varphi)(r),$$

whose output is well defined pointwise when ϕ ∈ S(R). The less favorable aspect is that the classical LSI integrator does not fulfill the usual stability requirement due to the

non-integrability of its impulse response 1+ ∈ / L1 (R). This implies that D−1∗ ϕ = 1∨+ ∗ϕ
is generally not in Lp (R). Thus, we are no longer fulfilling the admissibility condition in
−1∗
 4.17. The source of the problem is the lack of decay of D ϕ(r) as r → −∞
Theorem
when R ϕ(τ ) dτ =  ϕ (0)  = 0 (see Figure 5.1b). Fortunately, this can be compensated
by defining the modified antiderivative operator

I∗0 ϕ(r) = ϕ (0) = (1∨
ϕ(τ ) dτ − 1+ (−r) + ∗ ϕ)(r) − ϕ(τ ) dτ 1∨
+ (r)
r R
−1∗
=D ϕ(r) − (D−1∗ ϕ)(−∞)1∨
+ (r), (5.3)

which happens to be the only left inverse of D∗ = −D that is both scale-invariant and
Lp -stable for any p > 0. The adjoint of I∗0 specifies the adjusted 1 integrator
r
I0 ϕ(r) = ϕ(τ ) dτ = D−1 ϕ(r) − (D−1 ϕ)(0), (5.4)
0

which is the correct scale-invariant inverse of D for our formulation, to be applied to


elements of S  (R). It follows that the solution of the corresponding (unstable) SDE
can be expressed as s = I0 w, which is a well-defined stochastic process as long as the
input noise is Lévy p-admissible with p ≥ 1 (by Theorem 4.17). We note that the price
to pay for the stabilization of the solution is to give up on shift-invariance. Indeed, the
adjusted integrator is such that it imposes the boundary condition s(0) = (I0 w)(0) = 0
(see Figure 5.1c), which is incompatible with stationarity, but a necessary condition for
self-similarity.
 The so-constructed processes are fully specified by their characteristic
form exp( R f (I0 ϕ(r))dt), where f is a Lévy exponent. Based on this representation, we
can show that these are equivalent to the Lévy processes that are usually defined for
r ≥ 0 only [Sat94]. While this connection with the classical theory of Lévy processes is
already remarkable, it turns out that the underlying principle is quite general and appli-
cable to a much broader family of operators, provided that we can ensure Lp -stability.
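This connection is easy to visualize with a crude discretization (our sketch; step size and increment laws are illustrative): the process is simulated as the running sum of i.i.d. id increments, re-anchored at the origin in analogy with s = I₀w, so that the boundary condition s(0) = 0 holds for the Gaussian (Brownian) and sparse (compound-Poisson) excitations alike.

```python
import numpy as np

rng = np.random.default_rng(2)
dt = 1e-3
t = np.arange(-4.0, 4.0, dt)            # time grid straddling the origin
i0 = np.searchsorted(t, 0.0)            # index of t = 0

def levy_path(increments):
    """Discrete analog of s = I_0 w: running sum re-anchored so that s(0) = 0."""
    s = np.cumsum(increments)
    return s - s[i0]

# Gaussian increments -> Brownian motion
brownian = levy_path(np.sqrt(dt) * rng.standard_normal(t.size))
# Sparse increments: ~Poisson(lambda*dt) impulses per bin with Laplace amplitudes
# (bins holding 2+ impulses share one amplitude; negligible for small dt)
compound = levy_path(rng.poisson(5 * dt, t.size) * rng.laplace(size=t.size))

print("boundary condition s(0) =", brownian[i0], compound[i0])
```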

5.2 Shift-invariant inverse operators

The association of LSI operators with convolution integrals will be familiar to most readers. In effect, we saw in Section 3.3.5 that, as a consequence of Schwartz' kernel theorem, every continuous LSI operator L : S(R^d) → S′(R^d) corresponds to a convolution

$$\mathrm{L} : \varphi \mapsto \mathrm{L}\{\delta\} * \varphi$$

with a kernel (impulse response) L{δ} ∈ S′(R^d). Moreover, by the convolution-product rule, we may also characterize L by a multiplication in the Fourier domain:

$$\mathrm{L} : \varphi \mapsto \mathscr{F}^{-1}\{\widehat{L}(\omega)\,\widehat{\varphi}(\omega)\}.$$

¹ Any two valid right inverses can only differ by a component (constant) that is in the null space of the operator. The scale-invariant solution is the one that forces the output to vanish at the origin.

We call L̂ the Fourier multiplier or symbol associated with the operator L.
From the Fourier-domain characterization of L, we see that, if L̂ is smooth and does not grow too fast, then L maps S(R^d) back into S(R^d). This is in particular true if L{δ} (the impulse response) is an ordinary locally integrable function with rapid decay. It is also true if L is a 1-D linear differential operator with constant coefficients, in which case L{δ} is a finite sum of derivatives of the Dirac distribution and the corresponding Fourier multiplier L̂(ω) is a polynomial in jω.
For operators with smooth Fourier multipliers that are nowhere zero in R^d and not decaying (or decaying slowly) at ∞, we can define the inverse L^{-1} : S(R^d) → S(R^d) of L by

$$\mathrm{L}^{-1} : \varphi \mapsto \mathscr{F}^{-1}\bigg\{\frac{\widehat{\varphi}(\omega)}{\widehat{L}(\omega)}\bigg\}.$$

This inverse operator is also linear and shift-invariant, and has the convolution kernel

$$\rho_L = \mathscr{F}^{-1}\bigg\{\frac{1}{\widehat{L}(\omega)}\bigg\}, \qquad (5.5)$$

which is in effect the Green's function of the operator L. Thus, we may write

$$\mathrm{L}^{-1} : \varphi \mapsto \rho_L * \varphi.$$

For the cases in which L̂(ω) vanishes at some points, its reciprocal 1/L̂(ω) is not in general a locally integrable function, but even in the singular case we may still be able to regularize the singularities at the zeros of L̂(ω) and obtain a singular "generalized function" whose inverse Fourier transform, per (5.5), once again yields a convolution kernel ρ_L that is a Green's function of L. The difference is that, in this case, for an arbitrary ϕ ∈ S(R^d), ρ_L ∗ ϕ may no longer belong to S(R^d).
As in our introductory example, the simplest scenario occurs when the inverse operator L^{-1} is shift-invariant with an impulse response ρ_L that has sufficient decay for the system to be BIBO-stable (bounded input, bounded output).

PROPOSITION 5.1 Let $\mathrm{L}^{-1}\varphi(r) = (\rho_L * \varphi)(r) = \int_{\mathbb{R}^d} \rho_L(r')\,\varphi(r - r')\,\mathrm{d}r'$ with ρ_L ∈ L₁(R^d) (or, more generally, where ρ_L is a complex-valued Borel measure of bounded variation). Then, L^{-1} and its adjoint specified by $\mathrm{L}^{-1*}\varphi(r) = (\rho_L^\vee * \varphi)(r) = \int_{\mathbb{R}^d} \rho_L(-r')\,\varphi(r - r')\,\mathrm{d}r'$ are both L_p-stable in the sense that

$$\|\mathrm{L}^{-1}\varphi\|_{L_p} \le \|\rho_L\|_{L_1}\,\|\varphi\|_{L_p}$$
$$\|\mathrm{L}^{-1*}\varphi\|_{L_p} \le \|\rho_L\|_{L_1}\,\|\varphi\|_{L_p}$$

for all p ≥ 1. In particular, this ensures that L^{-1*} continuously maps S(R^d) → L_p(R^d).

The result follows from Theorem 3.5. For the sake of completeness, we shall establish
the bound based on the two extreme cases p = 1 and p = +∞.
Proof To obtain the L₁ bound, we manipulate the norm of the convolution integral as

$$\begin{aligned}
\|\rho_L * \varphi\|_{L_1} &= \int_{\mathbb{R}^d}\bigg|\int_{\mathbb{R}^d} \rho_L(r')\,\varphi(r - r')\,\mathrm{d}r'\bigg|\,\mathrm{d}r\\
&\le \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} |\rho_L(r')\,\varphi(s)|\,\mathrm{d}r'\,\mathrm{d}s \qquad\text{(change of variable } s = r - r')\\
&= \int_{\mathbb{R}^d} |\rho_L(r')|\,\mathrm{d}r' \int_{\mathbb{R}^d} |\varphi(s)|\,\mathrm{d}s = \|\rho_L\|_{L_1}\,\|\varphi\|_{L_1},
\end{aligned}$$

where the exchange of integrals is justified by Fubini's theorem. The corresponding L∞ bound follows from the simple pointwise estimate

$$|(\rho_L * \varphi)(r)| \le \|\varphi\|_{L_\infty} \int_{\mathbb{R}^d} |\rho_L(r')|\,\mathrm{d}r' = \|\rho_L\|_{L_1}\,\|\varphi\|_{L_\infty}$$

for any r ∈ R^d. The final step is to invoke the Riesz–Thorin theorem (Theorem 3.4), which yields Young's inequality for L_p functions (see (3.11)) and hence proves the desired result for 1 ≤ p ≤ ∞. The inequality also applies to the adjoint operator since the latter amounts to a convolution with the reversed impulse response ρ_L^∨(r) = ρ_L(−r) ∈ L₁(R^d). The final statement simply follows from the fact that the convergence of a sequence of functions in the (strong) topology of S(R^d) implies convergence in all L_p norms for p > 0.
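A quick numerical sanity check of these bounds (our sketch; the kernel ρ_L and the test function are arbitrary choices), using Riemann-sum approximations of the norms for a sampled causal exponential:

```python
import numpy as np

dt = 1e-3
r = np.arange(0.0, 20.0, dt)
alpha = -1.5                           # stable pole, Re(alpha) < 0
rho = np.exp(alpha * r)                # causal exponential kernel, rho in L1

x = np.linspace(-10.0, 10.0, 20001)    # same 1e-3 spacing as the kernel grid
phi = np.cos(3 * x) * np.exp(-x**2)    # rapidly decaying test function

conv = np.convolve(phi, rho) * dt      # Riemann-sum approximation of rho * phi

for p in (1.0, 2.0, 4.0):
    lhs = (np.sum(np.abs(conv)**p) * dt)**(1 / p)
    rhs = (np.sum(np.abs(rho)) * dt) * (np.sum(np.abs(phi)**p) * dt)**(1 / p)
    print(f"p = {p}:  ||rho*phi||_p = {lhs:.4f}  <=  ||rho||_1 ||phi||_p = {rhs:.4f}")
```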

Note that the L1 condition in Proposition 5.1 is the standard hypothesis that is made in
the theory of linear systems to ensure the BIBO stability of an analog filter. It is slightly
stronger than the total variation (TV) condition in Theorem 3.5, which is necessary and
sufficient for both BIBO (p = ∞) and L1 stabilities.
If, in addition, ρL (r) decays faster than any polynomial (e.g., is compactly supported
or decays exponentially), then we can actually ensure S -continuity so that there is no
restriction on the class of corresponding stochastic processes.

PROPOSITION 5.2 Let $\mathrm{L}^{-1}\varphi(r) = (\rho_L * \varphi)(r)$ with $|\rho_L(r)| \le \frac{C_n}{(1 + \|r\|)^n}$ for all n ∈ Z₊ and r ∈ R^d. Then, L^{-1} and L^{-1*} are S-continuous in the sense that ϕ ∈ S ⇒ L^{-1}ϕ, L^{-1*}ϕ ∈ S, with both operators being bounded in an appropriate sequence of seminorms.

The key here is that the convolution with ρL preserves the rapid decay of the test
function ϕ. The degree of smoothness of the output is not an issue because, for non-
constant functions, the convolution operation commutes with differentiation.
The good news is that the entire class of stable 1-D differential systems with rational
transfer functions and poles in the left half of the complex plane falls into the category
of Proposition 5.2. The application of such operators provides us with a convenient
mechanism for solving ordinary differential equations, as detailed in Section 5.3.
The S-continuity property is important for our formulation. It also holds for all shift-invariant differential operators whose impulse response is a point distribution, e.g., (D − Id){δ} = δ′ − δ. It is preserved under convolution, which justifies the factorization of operators into simpler constituents.

5.3 Stable differential systems in 1-D

The generic form of a linear shift-invariant differential equation in 1-D with (deterministic or random) output s and driving term w is

$$\sum_{n=0}^{N} a_n \mathrm{D}^n s = \sum_{m=0}^{M} b_m \mathrm{D}^m w, \qquad (5.6)$$

where the a_n and b_m are arbitrary complex coefficients with the normalization constraint a_N = 1. Equation (5.6) thus covers the general 1-D case of Ls = w, where L is a shift-invariant operator with the rational transfer function

$$\widehat{L}(\omega) = \frac{(j\omega)^N + a_{N-1}(j\omega)^{N-1} + \cdots + a_1(j\omega) + a_0}{b_M(j\omega)^M + \cdots + b_1(j\omega) + b_0} = \frac{p_N(j\omega)}{q_M(j\omega)}. \qquad (5.7)$$

The poles of the system, which are the roots of the characteristic polynomial p_N(ζ) = ζ^N + a_{N−1}ζ^{N−1} + ⋯ + a₀ with Laplace variable ζ ∈ C, are denoted by {α_n}_{n=1}^N. In the standard causal-stable scenario where Re(α_n) < 0 for n = 1, …, N, the solution is obtained as

$$s(r) = \mathrm{L}^{-1}w(r) = (\rho_L * w)(r),$$

where ρ_L is the causal Green's function of L specified by (5.5).


In practice, the determination of ρL is based on the factorization of the transfer func-
tion of the system as

1 $N
1
= qM (jω) (5.8)

L(ω) jω − αn
n=1
M
(jω − γm )
= bM m=1
N
, (5.9)
n=1 (jω − αn )

which is then broken into simple constituents, either by serial composition of first-order
factors or by decomposition into simple partial fractions. We are providing the fully
factorized form (5.9) of the transfer function to recall the property that a stable Nth-
order system is completely characterized by its poles {αn }N n=1 and zeros {γm }m=1 , up to
M

the proportionality factor bM .


Since the S-continuity property is preserved through the composition of convolution operators, we shall rely on (5.9) to factorize L^{-1} into elementary operators. To that end, we shall study the effect of simple constituents (first-order differential operators with stable inverses) before considering their composition into higher-order operators. We shall also treat the leading polynomial factor q_M(jω) in (5.8) separately because it corresponds to a convolution operator whose impulse response is the point distribution $\sum_{m=0}^{M} b_m\,\delta^{(m)}$. The latter is S-continuous, irrespective of the choice of coefficients b_m (or, equivalently, of the zeros γ_m in (5.9)).
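Numerically, the factorization (5.8)–(5.9) amounts to root-finding on the two polynomials in (5.7); a minimal sketch (ours, with arbitrary coefficients):

```python
import numpy as np

# L-hat(w) = p_N(jw) / q_M(jw) with p_N(z) = z^2 + 3z + 2 and q_M(z) = 2z + 1
a = [1.0, 3.0, 2.0]          # characteristic polynomial coefficients (a_N = 1)
b = [2.0, 1.0]               # numerator-side coefficients b_M, ..., b_0

poles = np.roots(a)          # alpha_n: roots of p_N  -> [-2, -1]
zeros = np.roots(b)          # gamma_m: roots of q_M  -> [-0.5]
stable = np.all(poles.real < 0)
print("poles:", poles, " zeros:", zeros, " causal-stable:", stable)
```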

5.3.1 First-order differential operators with stable inverses


The first-order differential operator

$$\mathrm{P}_\alpha = \mathrm{D} - \alpha\mathrm{Id}$$

corresponds to the convolution kernel (impulse response) (δ′ − αδ) and Fourier multiplier (jω − α). For Re(α) ≠ 0 (the stable case in signal-processing parlance), the reciprocal of the Fourier multiplier, (jω − α)^{-1}, is non-singular, with the well-defined inverse Fourier transform

$$\rho_\alpha(r) = \mathscr{F}^{-1}\bigg\{\frac{1}{j\omega - \alpha}\bigg\}(r) = \begin{cases} e^{\alpha r}\,\mathbb{1}_{[0,\infty)}(r) & \text{if } \operatorname{Re}(\alpha) < 0\\ -e^{\alpha r}\,\mathbb{1}_{(-\infty,0]}(r) & \text{if } \operatorname{Re}(\alpha) > 0. \end{cases}$$

Clearly, in either the causal (Re(α) < 0) or the anti-causal (Re(α) > 0) case, these functions decay rapidly at infinity, and convolutions with them map S(R) back into itself. Moreover, a convolution with ρ_α inverts P_α : S(R) → S(R) from both the left and the right, for which we can write

$$\mathrm{P}_\alpha^{-1}\varphi = \rho_\alpha * \varphi \qquad (5.10)$$

for Re(α) ≠ 0.
Since P_α and P_α^{-1}, Re(α) ≠ 0, are both S-continuous (continuous from S(R) into S(R)), their action can be transferred to the space S′(R) of Schwartz distributions by identifying the adjoint operator P*_α with (−P_{−α}), in keeping with the identity

$$\langle \mathrm{P}_\alpha\varphi, \phi\rangle = -\langle\varphi, \mathrm{P}_{-\alpha}\phi\rangle$$

on S(R). We recall that, in this context, ⟨ϕ, φ⟩ denotes the bilinear form ∫_R ϕ(r)φ(r) dr, not the Hermitian product.

5.3.2 Higher-order differential operators with stable inverses


As noted earlier, the preservation of continuity by composition facilitates the study of differential systems by permitting us to decompose higher-order operators into first-order factors. Specifically, let us consider the equivalent factorized representation of the Nth-order differential equation (5.6) given by

$$\mathrm{P}_{\alpha_1}\cdots\mathrm{P}_{\alpha_N}\{s\}(r) = q_M(\mathrm{D})\{w\}(r), \qquad (5.11)$$

where {α_n}_{n=1}^N are the poles of the system and where $q_M(\mathrm{D}) = \sum_{m=0}^{M} b_m\mathrm{D}^m$ is the Mth-order differential operator acting on the right-hand side of (5.6). Under the assumption that Re(α_n) ≠ 0 (causal or anti-causal stability), we can invert the operators acting on the left-hand side of (5.11), which allows us to express the solution of the differential equation as

$$s(r) = \underbrace{\mathrm{P}_{\alpha_N}^{-1}\cdots\mathrm{P}_{\alpha_1}^{-1}\,q_M(\mathrm{D})}_{\mathrm{L}^{-1}}\{w\}(r). \qquad (5.12)$$
This translates into the following formulas for the corresponding inverse operator L^{-1}:

$$\mathrm{L}^{-1} = \mathrm{P}_{\alpha_N}^{-1}\cdots\mathrm{P}_{\alpha_1}^{-1}\,q_M(\mathrm{D}) = b_M\,\mathrm{P}_{\alpha_N}^{-1}\cdots\mathrm{P}_{\alpha_1}^{-1}\,\mathrm{P}_{\gamma_1}\cdots\mathrm{P}_{\gamma_M},$$

which are consistent with (5.8) and (5.9), respectively. These operator-based manipulations are legitimate since all the elementary constituents in (5.11) and (5.12) are S-continuous convolution operators. We also recall that all L_p-stable and, a fortiori, S-continuous convolution operators satisfy the properties of commutativity, associativity, and distributivity, so that the ordering of the factors in (5.11) and (5.12) is immaterial. Interestingly, this divide-and-conquer approach to the problem of inverting a differential operator is also extendable to the unstable scenarios (with Re(α_n) = 0 for some values of n), the main difference being that the ordering of operators becomes important (partial loss of commutativity).
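In a discrete-time simulation, this factorization translates into a cascade of first-order recursive filters, one per stable pole. The sketch below (ours) uses a simple exponential-Euler discretization of each P_α^{-1}; the step size and the pole locations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
dt, n = 1e-2, 5000
w = rng.standard_normal(n) / np.sqrt(dt)     # discretized white noise

def first_order_inverse(x, alpha, dt):
    """Apply P_alpha^{-1}, i.e., solve s' - alpha*s = x, via the causal
    recursion s[k] = exp(alpha*dt) * s[k-1] + dt * x[k] (stable for Re(alpha) < 0)."""
    s = np.zeros_like(x)
    a = np.exp(alpha * dt)
    for k in range(1, x.size):
        s[k] = a * s[k - 1] + dt * x[k]
    return s

# Second-order system with two left-half-plane poles, applied as a cascade
s = first_order_inverse(first_order_inverse(w, -1.0, dt), -3.0, dt)
print("sample variance of the stationary output:", s.var())
```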

5.4 Unstable Nth-order differential systems

Classically, a differential system is categorized as being unstable when some of its poles
are in the right complex half-plane, including the imaginary axis. Mathematically, there
is no fundamental reason for excluding the cases Re(αn ) > 0 because one can simply
switch to an anti-causal configuration which preserves the exponential decay of the
response, as we did in defining these inverses in Section 5.3.1.
The only tricky situation occurs for purely imaginary poles of the form αm = jω0 with
ω0 ∈ R, to which we now turn our attention.

5.4.1 First-order differential operators with unstable shift-invariant inverses


Once again, we begin with first-order operators. Unlike the case of Re(α) ≠ 0, for purely imaginary α = jω₀, P_{jω₀} is not a surjective operator from S(R) to S(R) (meaning it maps S(R) into a proper subset of itself, not onto the entire space), because it introduces a frequency-domain zero at ω₀. This implies that we cannot find an operator U : S(R) → S(R) that is a right inverse of P_{jω₀}. In other words, we cannot fulfill P_{jω₀}Uϕ = ϕ for all ϕ ∈ S(R) subject to the constraint that Uϕ ∈ S(R).
If we now consider P_{jω₀} as an operator on S′(R) (as the adjoint of (−P_{−jω₀})), then this operator is not one-to-one. More precisely, it has a non-trivial null space consisting of the multiples of the complex sinusoid e^{jω₀r}. Consequently, on S′(R), P_{jω₀} does not have a left inverse U : S′(R) → S′(R) with UP_{jω₀}f = f for all f ∈ S′(R).
The main conclusion of concern to us is that P_{jω₀} : S(R) → S(R) and its adjoint P*_{jω₀} : S′(R) → S′(R) are not invertible in the usual sense of the word (i.e., from both sides). However, we are able to properly invert P_{jω₀} on its image (range), as we discuss now.
Let S_{jω₀} denote the image of S(R) under P_{jω₀}. This is the same as the subspace of S(R) consisting of functions ϕ for which²

$$\int_{-\infty}^{+\infty} e^{-j\omega_0 r}\,\varphi(r)\,\mathrm{d}r = 0.$$

In particular, for jω₀ = 0, we obtain S₀, the space of Schwartz test functions with vanishing zeroth-order moment. We may then view P_{jω₀} as an operator S(R) → S_{jω₀}, and this operator now has an inverse P_{jω₀}^{-1} from S_{jω₀} → S(R) defined by

$$\mathrm{P}_{j\omega_0}^{-1}\varphi(r) = (\rho_{j\omega_0} * \varphi)(r), \qquad (5.13)$$

where

$$\rho_{j\omega_0}(r) = \mathscr{F}^{-1}\bigg\{\frac{1}{j\omega - j\omega_0}\bigg\}(r) = \frac{1}{2}\operatorname{sign}(r)\,e^{j\omega_0 r}. \qquad (5.14)$$

Specifically, this LSI operator satisfies the right- and left-inverse relations

$$\mathrm{P}_{j\omega_0}\mathrm{P}_{j\omega_0}^{-1}\varphi = \varphi \quad\text{for all } \varphi \in \mathscr{S}_{j\omega_0}$$
$$\mathrm{P}_{j\omega_0}^{-1}\mathrm{P}_{j\omega_0}\varphi = \varphi \quad\text{for all } \varphi \in \mathscr{S}(\mathbb{R}).$$

In order to be able to use such inverse operators for defining stochastic processes, we need to extend P_{jω₀}^{-1} to all of S(R).
Note that, unlike the case of P_α^{-1} with Re(α) ≠ 0, here the extension of P_{jω₀}^{-1} to an operator S(R) → S′(R) is in general not unique. For instance, P_{jω₀}^{-1} may also be specified as

$$\mathrm{P}_{j\omega_0,+}^{-1}\varphi(r) = (\rho_{j\omega_0,+} * \varphi)(r), \qquad (5.15)$$

with causal impulse response

$$\rho_{j\omega_0,+}(r) = e^{j\omega_0 r}\,\mathbb{1}_+(r), \qquad (5.16)$$

which defines the same operator on S_{jω₀} but not on S(R). In fact, we could as well have taken any impulse response of the form ρ_{jω₀}(r) + p₀(r), where p₀(r) = c₀e^{jω₀r} is an oscillatory component that is in the null space of P_{jω₀}. By contrast, the L_p-continuous inverses that we define below remain the same for all of these extensions. To convey the idea, we shall first consider the extension based on the causal operator P_{jω₀,+}^{-1} defined by (5.15). Its adjoint is denoted by P_{jω₀,+}^{-1*} and amounts to an (anti-causal) convolution with ρ^∨_{jω₀,+}.
To solve the stochastic differential equation P_{jω₀}s = w, we need to find a left inverse of the adjoint P*_{jω₀} acting on the space of test functions, which maps S(R) into the required L_p space (Theorem 4.17). The problem with the "shift-invariant"³ extensions of the inverse defined above is that their image inside S′(R) is not contained in arbitrary L_p spaces. For this reason, we now introduce a different, "corrected" extension of the
² To see this, note that (D − jω₀Id)ϕ(r) = e^{jω₀r} D{e^{−jω₀r}ϕ(r)}.
³ These operators are shift-invariant because they are defined by means of convolutions.

inverse of P*_{jω₀} that maps S(R) to R(R) – therefore, a fortiori, also into all L_p spaces. This corrected left inverse, which we shall denote by I*_{ω₀}, is constructed as

$$\mathrm{I}_{\omega_0}^{*}\varphi(r) = \mathrm{P}_{j\omega_0,+}^{-1*}\varphi(r) - \Big(\lim_{y\to-\infty}\mathrm{P}_{j\omega_0,+}^{-1*}\varphi(y)\Big)\,\rho_{j\omega_0,+}^{\vee}(r) = (\rho_{j\omega_0,+}^{\vee} * \varphi)(r) - \widehat{\varphi}(-\omega_0)\,\rho_{j\omega_0,+}^{\vee}(r), \qquad (5.17)$$

in direct analogy with (5.3). As in our introductory example, the role of the second term is to remove the tail of P_{jω₀,+}^{-1*}ϕ(r). This ensures that the output decays fast enough to belong to R(R). It is not difficult to show that the definition of I*_{ω₀} given by (5.17) does not depend on the specific choice of the impulse response within the class of admissible LSI inverses of P_{jω₀} on S_{jω₀}. Then, we may simplify the notation by writing

$$\mathrm{I}_{\omega_0}^{*}\varphi(r) = (\rho_{j\omega_0}^{\vee} * \varphi)(r) - \widehat{\varphi}(-\omega_0)\,\rho_{j\omega_0}^{\vee}(r), \qquad (5.18)$$

for any Green's function ρ_{jω₀} of the operator P_{jω₀}. While the left-inverse operator I*_{ω₀} fixes the decay, it fails to be a right inverse of P*_{jω₀} unless ϕ ∈ S_{−jω₀} or, equivalently, ϕ̂(−ω₀) = 0.
The corresponding right inverse of P_{jω₀} is provided by the adjoint of I*_{ω₀}. It is identified via the scalar-product manipulation

$$\begin{aligned}
\langle\varphi, \mathrm{I}_{\omega_0}^{*}\phi\rangle &= \langle\varphi, \mathrm{P}_{j\omega_0}^{-1*}\phi\rangle - \widehat{\phi}(-\omega_0)\,\langle\varphi, \rho_{j\omega_0}^{\vee}\rangle \qquad\text{(by linearity)}\\
&= \langle\mathrm{P}_{j\omega_0}^{-1}\varphi, \phi\rangle - \langle e^{j\omega_0 r}, \phi\rangle\,(\mathrm{P}_{j\omega_0}^{-1}\varphi)(0) \qquad\text{(using (5.13))}\\
&= \langle\mathrm{P}_{j\omega_0}^{-1}\varphi, \phi\rangle - \langle e^{j\omega_0 r}\,(\mathrm{P}_{j\omega_0}^{-1}\varphi)(0), \phi\rangle.
\end{aligned}$$

Since the above is equal to ⟨I_{ω₀}ϕ, φ⟩ by definition, we find that

$$\mathrm{I}_{\omega_0}\varphi(r) = \mathrm{P}_{j\omega_0}^{-1}\varphi(r) - e^{j\omega_0 r}\,(\mathrm{P}_{j\omega_0}^{-1}\varphi)(0) = (\rho_{j\omega_0} * \varphi)(r) - e^{j\omega_0 r}\,(\rho_{j\omega_0} * \varphi)(0), \qquad (5.19)$$

where ρ_{jω₀} is defined by (5.14). The specificity of I_{ω₀} is to impose the boundary condition s(0) = 0 on the output s = I_{ω₀}ϕ, irrespective of the input function ϕ. This is achieved by the addition of a component that is in the null space of P_{jω₀}. This also explains why we may replace ρ_{jω₀} in (5.19) by any other Green's function of P_{jω₀}, including the causal one given by (5.16).
In particular, for jω₀ = 0 (that is, for P_{jω₀} = D), we have

$$\mathrm{I}_0\varphi(r) = \int_{-\infty}^{r}\varphi(t)\,\mathrm{d}t - \int_{-\infty}^{0}\varphi(t)\,\mathrm{d}t = \begin{cases}\int_0^r \varphi(t)\,\mathrm{d}t & r \ge 0\\ -\int_r^0 \varphi(t)\,\mathrm{d}t & r < 0,\end{cases}$$

while the adjoint is given by

$$\mathrm{I}_0^{*}\varphi(r) = \int_r^{\infty}\varphi(t)\,\mathrm{d}t - \mathbb{1}_{(-\infty,0]}(r)\int_{-\infty}^{\infty}\varphi(t)\,\mathrm{d}t = \begin{cases}\int_r^{\infty}\varphi(t)\,\mathrm{d}t & r \ge 0\\ -\int_{-\infty}^{r}\varphi(t)\,\mathrm{d}t & r < 0.\end{cases}$$

These are equivalent to the solution described in Section 5.1 (see Figure 5.1).
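The following discretized sketch (ours; grid and test function arbitrary) evaluates both operators for ω₀ = 0 on a test function of nonzero mean. It confirms that I₀ϕ vanishes at the origin, and that the correction term in I₀*ϕ restores the decay at −∞ that the plain anti-causal integral ∫_r^∞ ϕ(t) dt lacks:

```python
import numpy as np

dx = 1e-3
x = np.arange(-8.0, 8.0, dx)
phi = np.exp(-(x - 1.0)**2)              # test function with nonzero mean
i0 = np.searchsorted(x, 0.0)             # index of the origin

# I_0 phi(r) = int_0^r phi(t) dt : antiderivative re-anchored at the origin
D_inv = np.cumsum(phi) * dx
I0_phi = D_inv - D_inv[i0]

# I*_0 phi(r) = int_r^inf phi - 1_{(-inf,0]}(r) * int_R phi : corrected adjoint
mass = np.sum(phi) * dx
tail = mass - np.cumsum(phi) * dx        # approximates int_r^infty phi(t) dt
I0star_phi = tail - (x <= 0) * mass

print("boundary condition       I_0  phi(0)  =", I0_phi[i0])
print("decay of corrected adjoint I*_0 phi(-8) =", I0star_phi[0])
print("no decay without the correction: tail(-8) =", tail[0])
```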

The Fourier-domain counterparts of (5.19) and (5.18) are

$$\mathrm{I}_{\omega_0}\varphi(r) = \int_{\mathbb{R}} \widehat{\varphi}(\omega)\,\frac{e^{j\omega r} - e^{j\omega_0 r}}{j\omega - j\omega_0}\,\frac{\mathrm{d}\omega}{2\pi} \qquad (5.20)$$

$$\mathrm{I}_{\omega_0}^{*}\varphi(r) = \int_{\mathbb{R}} \frac{\widehat{\varphi}(\omega) - \widehat{\varphi}(-\omega_0)}{-j\omega - j\omega_0}\,e^{j\omega r}\,\frac{\mathrm{d}\omega}{2\pi}, \qquad (5.21)$$

respectively. One can observe that the form of the numerator in both integrals is such that it tempers the singularity of the denominator at ω = ω₀ (respectively, at ω = −ω₀). The relevance of these corrected inverse operators for the construction of stochastic processes is due to the following theorem.
THEOREM 5.3 The operator I*_{ω₀} defined by (5.18) is continuous from S(R) to R(R) and extends continuously to a linear operator R(R) → R(R). It is the dual of the operator I_{ω₀} defined by (5.19) in the sense that ⟨I_{ω₀}φ, ϕ⟩ = ⟨φ, I*_{ω₀}ϕ⟩. Moreover,

$$\begin{aligned}
\mathrm{I}_{\omega_0}\varphi(0) &= 0 &&\text{(zero boundary condition)}\\
\mathrm{I}_{\omega_0}^{*}(\mathrm{D} - j\omega_0\mathrm{Id})^{*}\varphi &= \varphi &&\text{(left-inverse property)}\\
(\mathrm{D} - j\omega_0\mathrm{Id})\,\mathrm{I}_{\omega_0}\phi &= \phi &&\text{(right-inverse property)}
\end{aligned}$$

for all ϕ, φ ∈ S(R).

The first part of the theorem follows from Proposition 5.4 below, which indicates that I*_{ω₀} preserves rapid decay. The statements in the second part have already been discussed. The left-inverse property, for instance, follows from the fact that (D − jω₀Id)*ϕ ∈ S_{−jω₀}, which is the subspace of S(R) on which all the inverses of P*_{jω₀} = −P_{−jω₀} are equivalent. The right-inverse property of I_{ω₀} is easily verified by applying P_{jω₀} to (5.19).
To qualify the rate of decay of functions, we rely on the L_{∞,α} norm, defined as

$$\|\varphi\|_{L_{\infty,\alpha}} = \operatorname*{ess\,sup}_{r\in\mathbb{R}}\big|\varphi(r)\,(1 + |r|)^{\alpha}\big|. \qquad (5.22)$$

Hence, the inclusion ϕ ∈ L_{∞,α}(R) is equivalent to

$$|\varphi(r)| \le \frac{\|\varphi\|_{L_{\infty,\alpha}}}{(1 + |r|)^{\alpha}} \quad \text{a.e.},$$

which is to say that ϕ has an algebraic decay of order α. We also recall that R(R) is the space of rapidly decaying functions, which is the intersection of all L_{∞,α}(R) spaces with α ≥ 0. The relevant embedding relations are S(R) ⊂ R(R) ⊂ L_{∞,α}(R) ⊂ L_p(R) for any α > 1 and p ≥ 1. Moreover, since S(R) has the strictest topology in the chain, a sequence that converges in S(R) is also convergent in R(R), L_{∞,α}(R), or L_p(R).
PROPOSITION 5.4 Let I*_{ω₀} be the linear operator defined by (5.17). Then, for all ϕ ∈ L_{∞,α}(R) with α > 1, there exists a constant C such that

$$\|\mathrm{I}_{\omega_0}^{*}\varphi\|_{L_{\infty,\alpha-1}} \le C\,\|\varphi\|_{L_{\infty,\alpha}}.$$

Hence, I*_{ω₀} is a continuous operator from L_{∞,α}(R) into L_{∞,α−1}(R) and, by restriction of its domain, from R(R) → R(R) or S(R) → R(R).
Proof For r < 0, we rewrite (5.17) as

$$\begin{aligned}
\mathrm{I}_{\omega_0}^{*}\varphi(r) &= \mathrm{P}_{j\omega_0}^{-1*}\varphi(r) - e^{-j\omega_0 r}\,\widehat{\varphi}(-\omega_0)\\
&= \int_r^{+\infty} e^{-j\omega_0(r-\tau)}\,\varphi(\tau)\,\mathrm{d}\tau - e^{-j\omega_0 r}\int_{-\infty}^{\infty} e^{j\omega_0\tau}\,\varphi(\tau)\,\mathrm{d}\tau\\
&= -e^{-j\omega_0 r}\int_{-\infty}^{r} e^{j\omega_0\tau}\,\varphi(\tau)\,\mathrm{d}\tau.
\end{aligned}$$

This implies that

$$|\mathrm{I}_{\omega_0}^{*}\varphi(r)| = \bigg|\int_{-\infty}^{r} e^{j\omega_0\tau}\,\varphi(\tau)\,\mathrm{d}\tau\bigg| \le \int_{-\infty}^{r} |\varphi(\tau)|\,\mathrm{d}\tau \le \int_{-\infty}^{r} \frac{\|\varphi\|_{L_{\infty,\alpha}}}{(1 + |\tau|)^{\alpha}}\,\mathrm{d}\tau \le C\,\frac{\|\varphi\|_{L_{\infty,\alpha}}}{(1 + |r|)^{\alpha-1}}$$

for all r < 0. For r > 0, $\mathrm{I}_{\omega_0}^{*}\varphi(r) = \int_r^{\infty} e^{-j\omega_0(r-\tau)}\,\varphi(\tau)\,\mathrm{d}\tau$, so that the above upper bound remains valid.
While I*_{ω₀} is continuous over R(R), it is not shift-invariant. Moreover, it will generally spoil the global smoothness of the functions in S(R) to which it is applied, owing to the discontinuity at the origin that is introduced by the correction. By contrast, its adjoint I_{ω₀} preserves the smoothness of the input but fails to return functions that are rapidly decaying at infinity. This lack of shift-invariance and the slow growth of the output at infinity are the price to pay for being able to solve unstable differential systems.

5.4.2 Higher-order differential operators with unstable shift-invariant inverses


Given that the operators I*_{ω₀}, ω₀ ∈ R, defined in Section 5.4.1 are continuous R(R) → R(R), they may be composed to obtain higher-order continuous operators R(R) → R(R) that serve as left inverses of the corresponding higher-order differential operators in the sense of Section 5.4.1. More precisely, given (ω₁, …, ω_K) ∈ R^K, we define the composite integration operator

$$\mathrm{I}_{(\omega_1:\omega_K)} = \mathrm{I}_{\omega_1}\circ\cdots\circ\mathrm{I}_{\omega_K}, \qquad (5.23)$$

whose adjoint is given by

$$\mathrm{I}_{(\omega_1:\omega_K)}^{*} = \big(\mathrm{I}_{\omega_1}\circ\cdots\circ\mathrm{I}_{\omega_K}\big)^{*} = \mathrm{I}_{\omega_K}^{*}\circ\cdots\circ\mathrm{I}_{\omega_1}^{*}. \qquad (5.24)$$

I*_{(ω₁:ω_K)}, which maps R(R) (and therefore S(R) ⊂ R(R)) continuously into R(R), is then a left inverse of

$$\mathrm{P}_{(j\omega_K:j\omega_1)}^{*} = \mathrm{P}_{j\omega_1}^{*}\circ\cdots\circ\mathrm{P}_{j\omega_K}^{*},$$

with P_α = D − αId and P*_α = −P_{−α}. Conversely, I_{(ω_K:ω₁)} is a right inverse of

$$\mathrm{P}_{(j\omega_1:j\omega_K)} = \mathrm{P}_{j\omega_1}\circ\cdots\circ\mathrm{P}_{j\omega_K}.$$
102 Operators and their inverses

Putting everything together, with the definitions of Section 5.3.2, we arrive at the
following corollary of Theorem 5.3:

C O R O L L A RY 5.5 For α = (α1 , . . . , αM ) ∈ CM with Re(αn )  = 0 and (ω1 , . . . , ωK ) ∈


RK , the (M+K)th-order operator L−1∗ = P−1∗ ∗
(αM :α1 ) I(ωK :ω1 ) maps S (R) continuously into
R (R) ⊆ Lp (R) for any p > 0. It is a left inverse of L∗ = P∗(jω :jω ) P∗(α :αM ) in the sense
1 K 1
that
P−1∗ ∗ ∗ ∗
(αM :α1 ) I(ωK :ω1 ) P(jω1 :jωK ) P(α1 :αM ) ϕ = ϕ

for all ϕ ∈ S (R).

We shall now use this result to solve the differential equation (5.11) in the non-stable
scenario. To that end, we order the poles in such a way that the unstable ones come last
with αN−K+m = jωm , 1 ≤ m ≤ K, where K is the number of purely imaginary poles. We
thus specify the right-inverse operator

L−1 = I(ωK :ω1 ) P−1


(αN−K :α1 ) qM (D),

which we then apply to w to obtain the solution s = L−1 w. In effect , by applying I(ωK :ω1 )
last, we are also enforcing the K linear boundary conditions


⎪ s(0) = 0



⎨ PjωK {s}(0) = 0
.. (5.25)



⎪ .


Pjω2 · · · PjωK {s}(0) = 0.

To show that s = L−1 w is a consistent solution, we proceed by duality and write

ϕ, P(α1 :αN−K ) P(jω1 :jωK ) s = ϕ, P(α1 :αN−K ) P(jω1 :jωK ) L−1 w
= P∗(jω1 :jωK ) P∗(α1 :αN−K ) ϕ, I(ωK :ω1 ) P−1
(αN−K :α1 ) qM (D)w
= P−1∗

∗ ∗ ∗
:α ) I(ω :ω ) P(jω :jω ) P(α1 :αN−K ) ϕ, qM (D)w
 N−K 1 K 1  1 K 
Id
= ϕ, qM (D)w ,

where we have made use of Corollary 5.5. This proves that s satisfies the differential
equation (5.11) with driving term w, subject to the boundary conditions (5.25).

5.4.3 Generalized boundary conditions


In the resolution method presented so far, the inverse operator Iω0 was designed to
impose zero boundary conditions at the origin. In more generality, one may consider
inverse operators Iω0 ,ϕ0 that incorporate conditions of the form

ϕ0 , Iω0 ,ϕ0 w = 0 (5.26)


5.4 Unstable Nth-order differential systems 103

on the solution s = Iω0 ,ϕ0 w. This leads to the definition of the right-inverse operator

ρjω0 ∗ ϕ, ϕ0
Iω0 ,ϕ0 ϕ(r) =(ρjω0 ∗ ϕ)(r) − ejω0 r , (5.27)

ϕ0 (−ω0 )

where ρjω0 is a Green’s function of Pjω0 and ϕ0 is some given rapidly decaying func-
tion such that ϕ0 (−ω0 )  = 0. In particular, if we set ω0 = 0 and ϕ0 = δ, we reco-
ver the scale-invariant integrator I0 = I0,δ that was used in our introductory example
(Section 5.1) to provide the connection with the classical theory of Lévy processes. The
Fourier-domain counterpart of (5.27) is

 
ϕ0 (−ω) 
ejωr − ejω0 r 
ϕ0 (−ω0 ) dω
Iω0 ,ϕ0 ϕ(r) = 
ϕ (ω) . (5.28)
R j(ω − ω0 ) 2π

The above operator is well defined pointwise for any ϕ ∈ L1 (R). Moreover, it is a right
inverse of (D − jω0 Id) on S (R) because the regularization in the numerator amounts
to a sinusoidal correction that is in the null space of the operator. The adjoint of Iω0 ,ϕ0
is specified by the Fourier-domain integral

 
ϕ (−ω0 ) 

ϕ (ω) − ϕ0 (−ω0 ) 
 ϕ0 (ω) jωr dω
I∗ω0 ,ϕ0 ϕ(r) = e , (5.29)
R −j(ω + ω0 ) 2π

which is non-singular too, thanks to the regularization in the numerator. The beneficial
effect of this adjustment is that I∗ω0 ,ϕ0 is R -continuous and Lp -stable, unlike its more
conventional shift-invariant counterpart P−1∗ jω0 . The time-domain counterpart of (5.29) is

ϕ (−ω0 ) 
 
I∗ω0 ,ϕ0 ϕ(r) =(ρjω

∗ ϕ)(r) − ∨
ϕ0 ∗ ρjω (r), (5.30)
0 
ϕ0 (−ω0 ) 0

where ρjω0 is a Green’s function of Pjω0 . This relation is very similar to (5.18), with
the notable difference that the second term is convolved by ϕ0 . This suggests that we
can restore the smoothness of the output by picking a kernel ϕ0 with a sufficient degree
of differentiability. In fact, by considering a sequence of such kernels in S (R) that
converge to the Dirac distribution (in the weak sense), we can specify a left-inverse
operator that is arbitrarily close to I∗ω0 and yet S -continuous.
While the imposition of generalized boundary conditions of the form (5.26) has some
significant implications for the statistical properties of the signal (non-stationary beha-
vior), it is less of an issue for signal processing because of the use of analysis tools
(wavelets, finite-difference operators) that stationarize these processes – in effect, fil-
tering out the null-space components – so that the traditional tools of the trade remain
applicable. Therefore, to simplify the presentation, we shall only consider boundary
conditions at zero in what follows, and work with the operators I∗ω0 and Iω0 .
104 Operators and their inverses

5.5 Fractional-order operators

5.5.1 Fractional derivatives in one dimension


In one dimension, we consider the general class of all LSI operators that are also scale-
invariant. To motivate their definition, let us recall that the nth-order derivative Dn cor-
responds to the Fourier multiplier (jω)n . This suggests the following fractional exten-
sion, going back to Liouville, whereby the exponent n is replaced by a non-negative real
number γ :

Dγ ϕ(r) = (jω)γ 
ϕ (ω)ejωr . (5.31)
R 2π
This definition is further generalized in the next proposition, which gives a complete
characterization of scale-invariant convolution operators in 1-D.
P R O P O S I T I O N 5.6 (see [UB07, Proposition 2]) The complete family of 1-D scale-
γ
invariant convolution operators of order γ ∈ R reduces to the fractional derivative ∂τ
whose Fourier-based definition is
γ γ dω
∂τγ ϕ(r) = (jω) 2 +τ (−jω) 2 −τ 
ϕ (ω)ejωr ,
R 2π
where ϕ is the 1-D Fourier transform of the input function ϕ under the implicit assump-
tion that the inverse Fourier integral on the right-hand side is convergent.
While the above representation is appealing, it needs to be treated with caution since
the underlying Fourier multipliers are unbounded (at infinity or at zero), which is incom-
patible with Lp -stability (see Theorem 3.5). The next theorem shows that the fractional-
γ
derivative operators ∂τ are well defined over S (R) for γ > −1 and τ ∈ R, but that
they have a tendency to spoil the decay of the functions to which they are applied.
γ
T H E O R E M 5.7 The differential operator ∂τ is continuous from S (R) to Lp (R) for
γ
γ > 1p − 1. Moreover, the fractional derivative ∂τ ϕ of a test function ϕ ∈ S (R)
remains indefinitely differentiable, but its decay is constrained by
 γ  Const
∂ ϕ(r) ≤ .
τ
1 + |r|γ +1
This is to be contrasted with the effect of Dn for n ∈ N, which maps S (R) into itself
and hence preserves rapid decay. To explain the effect, we observe that a fractional
differentiation is equivalent to a convolution. Since ϕ decreases rapidly, the decay of
the output is imposed by the tail of the impulse response. For instance, in the case of the
operator Dγ (with γ non-integer), we have that
−γ −1
r+
Dγ {δ}(r) = , (5.32)
(−γ )
which, for γ > −1, is a generalized function that decays like 1/|r|γ +1 . Note that the
(apparent) singularity at r = 0 is not damaging since it is tempered by the finite-part
integral of the definition (see Table A.1 and (5.33) below).
5.5 Fractional-order operators 105

The Fourier-domain characterization in Proposition 5.6 implies that these operators


γ γ
are endowed with a semigroup property: they satisfy the composition rule ∂τ ∂τ  =
γ +γ 
∂τ +τ  for γ  , γ + γ  ∈ (−1, +∞) and τ , τ  ∈ R. The parameter τ is a phase factor
γ
that yields a progressive transition between the purely causal derivative Dγ = ∂γ /2 and
its anti-causal counterpart Dγ ∗ = ∂−γ /2 , which happens to be the adjoint of the for-
γ

mer. We also note that ∂τ0 is equivalent to the fractional Hilbert transform operator H τ
γ γ
investigated in [CU10]. A special case of the semigroup property is ∂τ = ∂γ /2 ∂τ0−γ /2 =
Dγ H τ −γ /2 ,which indicates that the fractional derivatives of order γ are all related to
Dγ via a fractional Hilbert transform. The latter is a unitary operator (all-pass filter) that
essentially acts as a shift operator on the oscillatory part of a wavelet.
γ γ
The property that ∂τ can be factorized as ∂τ = Dn ∂τα , with n ∈ N, α = γ − n,
τ  = τ − n/2, and Dn :S (R) → S (R), has important consequences for the theory. In
rλ+
particular, it suggests several equivalent descriptions of the generalized function (λ+1) ,
as in
rλ+ r+λ+n
ϕ, = ϕ, Dn { }
(λ + 1) (λ + n + 1)
r+λ+n
= Dn∗ ϕ,
(λ + n + 1)
(−1)n ∞
= rλ+n ϕ (n) (r) dr. (5.33)
(λ + n + 1) 0
The last equality of (5.33) with n = min(0, −λ) provides an operational definition that
reduces to a conventional integral.
γ
In principle, we can obtain the shift-invariant inverse of the derivative operator ∂τ
with γ ≥ 0 by taking the order to be negative (fractional integrator) and reversing
the sign of τ . Yet, based on (5.32), which is valid for γ ∈ R\N, we see that this is
problematic because the impulse response becomes more delocalized as γ decreases.
As in the case of the ordinary derivative Dn , this calls for a stabilized version of the
inverse.

THEOREM 5.8 The fractional integration operator


γ + p1 −1 
ϕ (k) (0)ωk
−γ ∗ 1 
ϕ (ω) − k=0 k!
∂−τ ,p ϕ(r) = γ γ ejωr dω (5.34)
2π R (−jω) 2 −τ (jω) 2 +τ

continuously maps S (R) into Lp (R) for p > 0, τ ∈ R, and γ ∈ R+ , subject to the
restriction γ + 1p  = 1, 2, . . .. It is a linear operator that is scale-invariant and is a left
γ γ∗ −γ
inverse of ∂−τ = ∂τ . The adjoint operator ∂−τ ,p is given by
γ + p1 −1 (jωr)k
−γ 1 ejωr − k=0 k!
∂−τ ,p ϕ(r) = γ γ 
ϕ (ω) dω
2π R (−jω) 2 −τ (jω) 2 +τ

γ
and is the proper scale-invariant right inverse of ∂τ to be applied to generalized
functions.
106 Operators and their inverses

The stabilization in (5.34) amounts to an adjustment of the Fourier transform of the


input (subtraction of an adequate number of terms of its Taylor series at the origin)
to counteract the frequency-domain division by zero. This correction has the desirable
features of being linear with respect to the input and of preserving scale-invariance,
which is essential for our purpose.

Proof We only give a sketch of the proof for p ≥ 1, leaving out the derivation of
Proposition 5.9, which is somewhat technical. The first observation is that the operator
−γ ∗
can be factorized as ∂−τ ,p = ∂τα (I∗0 )np with α = np − γ and τ  = τ − np /2, where I∗0
is the corrected (adjoint) integrator defined by (5.21) with ω0 = 0. The integer order of
pre-integration is np = γ + 1p  − 1 + 1 = γ + 1p , which implies that the residual degree
of differentiation α is constrained according to
1
p − 1 < α = np − γ < 1p .

−γ ∗
This allows us to write ∂−τ ,p ϕ = ∂τα φ, where φ = (I∗0 )np ϕ is rapidly decaying by
Corollary 5.5. The required ingredient to complete the proof is a result analogous to
Theorem 5.7 which would ensure that ∂τα φ ∈ Lp (R) for α > 1p − 1. The easy scenario is
when ϕ ∈ S (R) has np vanishing moments, in which case φ = (I∗0 )np ϕ ∈ S (R) so that
Theorem 5.7 is directly applicable. In general, however, φ is (only) rapidly decreasing;
this is addressed by the following extension.

PROPOSITION 5.9 The fractional operator ∂τα I∗0 is continuous from S (R) to Lp (R)
for p > 0, τ ∈ R, and 1p − 1 < α < 1p . It also admits a continuous extension R (R) →
Lp (R) for p ≥ 1.

The reason for including the operator I∗0 in the statement is to avoid making explicit
hypotheses about the derivative of φ, which  is rapidly
 decaying but also exhibits a
Dirac impulse at the origin with a weight −  ∗
ϕ (0) . Since I0 :R (R) → R (R) (by
Proposition 5.4), the global continuity result for p ≥ 1 then follows from the chaining
of these elementary operators.
Similarly, we establish the left-inverse property by considering the factorization
−α
∂−τ = (D∗ )np ∂−τ
γ


and by recalling that I∗0 is a left inverse of D∗ = −D. The result then follows from
−α
the identity ∂τα ∂−τ = Id, which is a special case of the semigroup property of scale-
invariant LSI operators, under the implicit assumption that the underlying operations
are well defined in the Lp sense.

A final observation that gives insight into the design of Lp -stable inverse operators
is that the form of (5.34) for p = 1 coincides with the finite-part definition (see Appen-
dix A) of the generalized function

ejωr

gr (ω) = γ γ ∈ S  (R).
(−jω) 2 −τ (jω) 2 +τ
5.5 Fractional-order operators 107

Specifically, by using the property that 


ϕ ∈ S (R), we have that
−γ ∗ 1
∂−τ ,1 ϕ(r) = 
ϕ ,
gr

1 ejωr
= p.f. 
ϕ (ω) γ γ dω
2π R (−jω) 2 −τ (jω) 2 +τ
⎛ ⎞
γ  (k)
1 ⎝ 
ϕ (0)ω k
⎠ ejωr
= ϕ (ω) − γ γ dω,
2π R
k=0
k! (−jω) 2 −τ (jω) 2 +τ

where the finite-part regularization in the latter integral is the same as in the definition
in (A.1) of the generalized function xλ+ with − Re(λ) − 1 = γ . The catch with (5.34) is
that the number of regularization terms np = γ + 1p  is not solely dependent upon γ ,
but also on 1/p.

5.5.2 Fractional Laplacians


The fractional Laplacian of order γ ≥ 0 is defined by the inverse Fourier integral
γ dω
(− ) 2 ϕ(r) = ωγ 
ϕ (ω)ejω,r ,
Rd (2π)d
where  ϕ (ω) is the d-dimensional Fourier transform of ϕ(r). For γ = 2, this characteri-

zation coincides with the classical definition of the negative Laplacian: − = − ∂r2i .
In slightly more generality, we obtain a complete characterization of homogeneous
shift- and rotation-invariant operators and their inverses in terms of convolutions with
homogeneous rotation-invariant distributions, as given in Theorem 5.10. The idea and
definitions may be traced back to [Duc77, GS68, Hör80].

T H E O R E M 5.10 ( [Taf11, Corollary 2.ac]) Any continuous linear operator S (Rd ) →


S  (Rd ) that is simultaneously shift- and rotation-invariant, and homogeneous or scale-
invariant of order γ in the sense of Definition 5.2, is a multiple of the operator
d  
Lγ : ϕ → ρ −γ −d ∗ ϕ = (2π) 2 F −1 ρ γ 
ϕ ,

where ρ γ , γ ∈ C, is the distribution defined by


ωγ
ρ γ (ω) = γ   (5.35)
2  γ 2+d
2

with the property that


d
F {ρ γ } = (2π) 2 ρ −γ −d

( denotes the gamma function.)

Note that Lγ is self-adjoint, with Lγ ∗ = Lγ .


As is clear from the definition of ρ γ , for γ  = −d, −d − 2, −d − 4, . . ., Lγ is
simply a renormalized version of the fractional Laplacian introduced earlier. For
γ = −d − 2m, m = 0, 1, 2, 3, . . ., where the gamma function in the denominator of
108 Operators and their inverses

ρ γ has a pole, ρ γ is proportional to (− )m δ, while the previous definition of the


fractional Laplacian without normalization
 does not define a scale-invariant
 γ operator.
Also note that when Re(γ ) > −d Re(γ ) ≤ −d, respectively , ρ (its Fourier trans-
form, respectively) is singular at the origin. This singularity is resolved in the man-
ner described in Appendix A, which is equivalent to the analytical continuation of the
formula

ρ γ = γρ γ −2 ,

initially valid for Re(γ ) − 2 > −d, in the variable γ . For the details of the previous two
observations we refer the reader to Appendix A and Tafti [Taf11, Section 2.2]. 4
Finally, it is important to remark that, unlike integer-order operators, unless γ is a
positive even integer the image of S (Rd ) under Lγ is not contained in S (Rd ). Specifi-
cally, while for any ϕ ∈ S , Lγ ϕ is always an infinitely differentiable regular function,
in the case of γ  = 2m, m = 1, 2, . . ., it may have slow (polynomial) decay or growth, in
direct analogy with the general 1-D scenario characterized by Theorem 5.7.
Put more simply, the fractional Laplacian of a Schwartz function is not in general a
Schwartz function.

5.5.3 Lp -stable inverses


From the identity

1
ρ γ (ω)ρ −γ (ω) =  d+γ   d−γ  for ω  = 0,
 2  2

we conclude that up to normalization, L−γ is the inverse of Lγ on the space of Schwartz’


test functions with vanishing moments, in particular for Re(γ ) > −d. 5 This inverse
for Re(γ ) > −d can be further extended to a shift-invariant left inverse of Lγ acting
on Schwartz functions. However, as was the case in Section 5.4.1, this shift-invariant
inverse generally does not map S (Rd ) into a given Lp (Rd ) space, and is therefore not
suitable for defining generalized random fields in S  (Rd ).
The problem exposed in the previous paragraph is once again overcome by defining a
“corrected” left inverse. Here again, unlike the simpler scenario of ordinary differential
operators, it is not possible to have a single left-inverse operator that maps S (Rd ) into
the intersection of all Lp (Rd ) spaces, p > 0. Instead, we shall need to define a separate
left-inverse operator for each Lp (Rd ) space we are interested in. Under the constraints
of scale- and rotation-invariance, such “non-shift-invariant” left inverses are identified
in the following theorem.

d
4 The difference in factors of (2π) 2 and (2π)d between the formulas given here and in Tafti [Taf11] is due
to different normalizations used in the definition of the Fourier transform.
5 Here we exclude the cases where the gamma functions in the denominator have poles, namely where their
argument is a negative integer. For details see [Taf11, Section 2.2.2.]
5.6 Discrete convolution operators 109

THEOREM 5.11 ( [Taf11], Theorem 2.aq and Corollary 2.am) The operator
∂k ρ γ −d
Lp−γ ∗ : ϕ → ρ γ −d ∗ ϕ − yk ϕ(y) dy (5.36)
k! Rd
|k|≤Re(γ )+ pd −d

with Re(γ ) + d
p  = 1, 2, 3, . . . is rotation-invariant and homogeneous of order (−γ ) in
the sense of Definition 5.2. It maps S (Rd ) continuously into Lp (Rd ) for p ≥ 1.
−γ ∗
The adjoint of Lp is given by
∂k L−γ ϕ(0)
L−γ
p : ϕ → ρ
γ −d
∗ϕ− rk .
k!
|k|≤Re(γ )+ pd −d

If we exclude the cases where the denominator of (5.35) has a pole, we may normalize
the above operators to find left and right inverses corresponding to the fractional Lapla-
γ
cian (− ) 2 . The next theorem gives an equivalent Fourier-domain characterization of
these operators.
γ∗
THEOREM 5.12 (see [SU12, Theorem 3.7]) Let Ip with p ≥ 1 and γ > 0 be the
isotropic fractional integral operator defined by
∂ k
ϕ (0)ωk

ϕ (ω) −
k!
|k|≤γ + pd −d dω
Iγp ∗ ϕ(r) = ejω,r . (5.37)
Rd ωγ (2π)d
γ∗
Then, under the condition that γ  = 2, 4, . . . and γ + d
p  = 1, 2, 3, . . ., Ip is the unique
γ
left inverse of (− ) 2 that continuously maps S (Rd ) into Lp (Rd ) for p ≥ 1 and is scale-
γ
invariant. The adjoint operator Ip , which is the proper scale-invariant right inverse of
γ
(− ) 2 , is given by
j|k| rk ωk
ejω,r −
k!
|k|≤γ + pd −d dω
Iγp ϕ(r) = 
ϕ (ω) . (5.38)
Rd ωγ (2π)d
γ γ∗
The fractional integral operators Ip and Ip are both scale-invariant of order (−γ ),
but they are not shift-invariant.

5.6 Discrete convolution operators

We conclude this chapter by providing a few basic results on discrete convolution ope-
rators and their inverses. These will turn out to be helpful for establishing the existence
of certain spline interpolators which are required for the construction of operator-like
wavelet bases in Chapter 6 and for the representation of autocorrelation functions in
Chapter 7.
110 Operators and their inverses

The convention in this book is to use square brackets to index sequences. This allows
one to distinguish them from functions of a continuous variable. In other words, h(·)
or h(r) stands for a function defined on a continuum, while h[·] or h[k] denotes some
discrete sequence. The notation h[·] is often simplified to h when the context is clear.
The discrete convolution between two sequences h[·] and a[·] over Zd is defined as
(h ∗ a)[n] = h[m]a[n − m]. (5.39)
m∈Zd

This convolution describes how a digital filter with discrete impulse response h acts on
some input sequence a. If h ∈ 1 (Zd ), then the map a[·] → (h ∗ a)[·] is a continuous
operator p (Zd ) → p (Zd ). This follows from Young’s inequality for sequences,
h ∗ ap ≤ h1 ap , (5.40)
where
 1
n∈Zd |a[n]|p p for 1 ≤ p < ∞
ap =
supn∈Zd |a[n]| for p = ∞.

Note that the condition h ∈ 1 (Zd ) is the discrete counterpart of the (more involved)
TV condition in Theorem 3.5. As in the continuous formulation, it is necessary and suf-
ficient for stability in the extreme cases p = 1, +∞. A simplifying aspect of the discrete
setting is that the boundedness of the operator for p = ∞ implies all the other forms of
p -stability because of the embedding p (Zd ) ⊂ q (Zd ) for any 1 ≤ p < q ≤ ∞. The
latter property is a consequence of the basic norm inequality
ap ≥ aq ≥ a∞
for all a ∈ p (Zd ) ⊂ q (Zd ) ⊂ ∞ (Zd ).
In the Fourier domain, (5.39) maps into the multiplication of the discrete Fourier
transforms of h and a as
b[n] = (h ∗ a)[n] ⇔ B(ω) = H(ω)A(ω),
where we are using capital letters to denote the discrete-time Fourier transforms of the
underlying sequences. Specifically, we have that
B(ω) = Fd {b}(ω) = b[k]e−jω,k ,
k∈Zd

which is 2π-periodic, while the corresponding inversion formula is



b[n] = Fd−1 {B} [n] = B(ω)ejω,n .
[−π,π]d (2π)d
The stability condition h ∈ 1 (Zd ) ensures that the frequency response H(ω) of the
digital filter h is bounded and continuous.
The task of specifying discrete inverse filters is greatly facilitated by a theorem,
known as Wiener’s lemma, which ensures that the inverse convolution operator is
p -stable whenever the frequency response of the original filter is non-vanishing.
5.7 Bibliographical notes 111


THEOREM 5.13 (Wiener’s lemma) Let H(ω) = k∈Zd h[k]e−jω,k , with h ∈ 1 (Zd ),
be a stable discrete Fourier multiplier such that H(ω)  = 0 for ω ∈ [−π, π]d . Then,
G(ω) = 1/H(ω) has the same property in the sense that it can be written as 1/H(ω) =
 −jω,k with g ∈  (Zd ).
k∈Zd g[k]e 1

The so-defined filter g identifies a stable convolution inverse of h with the property
that
(g ∗ h)[n] = (h ∗ g)[n] = δ[n],
where

1 for n = 0
δ[n] =
0 for n ∈ Zd \ {0}
is the Kronecker unit impulse.

5.7 Bibliographical notes

Section 5.2
The specification of Lp -stable convolution operators is a central topic in harmonic ana-
lysis [SW71, Gra08]. The basic result in Proposition 5.1 relies on Young’s inequality
with q = r = 1 [Fou77]. The complete class of functions that result in S -continuous
convolution kernels is provided by the inverse Fourier transform of the space of smooth
slowly increasing Fourier multipliers, which play a crucial role in the theory of gener-
alized functions [Sch66]. They extend R (Rd ) in the sense that they also contain point
distributions such as δ and its derivatives.

Section 5.3
The operational calculus that is used for solving ordinary differential equations (ODEs)
can be traced to Heaviside, who also introduced the symbol D for the derivative op-
erator. It was initially met with skepticism because Heaviside’s exposition lacked rigor.
Nowadays, the preferred method for solving ODEs is based on the Laplace transform
or the Fourier transform. The operator-based formalism that is presented in Section 5.3
is a standard application of distribution theory and Green’s functions [Kap62, Zem10].

Section 5.4
The extension of the operational calculus for solving unstable ODE/SDEs is a more
recent development. It was initiated in [BU07] in an attempt to link splines and frac-
tals. The material presented in this section is based on [UTS14], for the most part. The
generalized boundary conditions of Section 5.4.3 were introduced in [UTAK14].

Section 5.5
Fractional derivatives and Laplacians play a central role in the theory of splines, the
primary reason being that these operators are scale-invariant [Duc77, UB00]. The proof
112 Operators and their inverses

of Proposition 5.6 can be found in [UB07, Proposition 2], while Theorem 5.7 is a slight
extension of [UB07, Theorem 3]. The derivation of the corresponding left and right
inverses for p = 2 is carried out in the companion paper [BU07]. These results were
extended to the multivariate setting for both the Gaussian [TVDVU09] and the gener-
alized Poisson setting [UT11]. A detailed investigation of the Lp -stable left inverses
of the fractional Laplacian, together with some unicity results, is provided in [SU12].
Further results and proofs can be found in [Taf11].

Section 5.6
Discrete convolution is a central topic in digital signal processing [OSB99]. The discrete
version of Young’s inequality can be established by using the same technique as for
the proof of Proposition 5.1. In that respect, the condition h ∈ 1 (Zd ) is the standard
criterion for stability in the theory of discrete-time linear systems [OSB99, Lat98]. It is
necessary and sufficient for BIBO stability (p = ∞) and for the preservation of absolute
summability (p = 1). Wiener stated his famous lemma in 1932 [Wie32, Lemma IIe].
Other relevant references are [New75, Sun07].
6 Splines and wavelets

A fundamental aspect of our formulation is that the whitening operator L is naturally


tied to some underlying B-spline function, which will play a crucial role in what follows.
The spline connection also provides a strong link with wavelets [UB03].
In this chapter, we review the foundations of spline theory and show how one can
construct B-spline basis functions and wavelets that are tied to some specific opera-
tor L. The chapter starts with a gentle introduction to wavelets that exploits the analogy
with Lego blocks. This naturally leads to the formulation of a multiresolution analysis
of L2 (R) using piecewise-constant functions and a de visu identification of Haar wave-
lets. We then proceed in Section 6.2 with a formal definition of our generalized brand of
splines – the cardinal L-splines – followed by a detailed discussion of the fundamental
notion of the Riesz basis. In Section 6.3, we systematically cover the first-order opera-
tors with the construction of exponential B-splines and wavelets, which have the conve-
nient property of being orthogonal. We then address the theory in its full generality and
present two generic methods for constructing B-spline basis functions (Section 6.4) and
semi-orthogonal wavelets (Section 6.5). The pleasing aspect is that these results apply
to the whole class of shift-invariant differential operators L whose null space is finite-
dimensional (possibly trivial), which are precisely those that can be safely inverted to
specify sparse stochastic processes.

6.1 From Legos to wavelets

It is instructive to get back to our introductory example of piecewise-contant splines in


Chapter 1 (§1.3) and to show how these are naturally connected to wavelets. The funda-
mental idea in wavelet theory is to construct a series of fine-to-coarse approximations of
a function s(r) and to exploit the structure of the multiresolution approximation errors,
which are orthogonal across scale. Here, we shall consider a series of approximating
signals {si }i∈Z , where si is a piecewise-constant spline with knots positioned on 2i Z.
These multiresolution splines are represented by their B-spline expansion

si (r) = ci [k]φi,k (r), (6.1)


k∈Z
114 Splines and wavelets

where the B-spline basis functions (rectangles) are dilated versions of the cardinal ones
by a factor of 2i :
⎧ , 
 i k  ⎨ 1, for r ∈ 2i k, 2i (k + 1)
0 r − 2
φi,k (r) = β+ = (6.2)
2i ⎩ 0, otherwise.

The variable i is the scale index that specifies the resolution (or knot spacing) a = 2i ,
while the integer k encodes for the spatial location. The B-spline of degree zero, φ =
φ0,0 = β+0 , is the scaling function of the representation. Interestingly, it is the identi-
fication of a proper scaling function that constitutes the most fundamental step in the
construction of a wavelet basis of L2 (R).
D E F I N I T I O N 6.1 (Scaling function) φ ∈ L2 (R) is a valid scaling function if and only
if it satisfies the following three properties:
• Two-scale relation:
φ(r/2) = h[k]φ(r − k), (6.3)
k∈Z
where the sequence h ∈ 1 (Z) is the so-called refinement mask
• Partition of unity:
φ(r − k) = 1 (6.4)
k∈Z
• The set of functions {φ(· − k)}k∈Z forms a Riesz basis.
In practice, a given brand of (orthogonal) wavelets (e.g., Daubechies or splines) is
often summarized by its refinement filter h since the latter uniquely specifies φ, subject
to the admissibility constraints (6.4) and φ ∈ L2 (R). In the case of the B-spline of degree
zero, we have that h[k] = δ[k] + δ[k − 1], where

⎨ 1, for k = 0
δ[k] =
⎩ 0, otherwise

is the discrete Kronecker impulse. This translates into what we jokingly refer to as the
Lego–Duplo relation 1
β+0 (r/2) = β+0 (r) + β+0 (r − 1). (6.5)
The fact that β+0
satisfies the partition of unity is obvious. Likewise, we already observed
in Chapter 1 that β+0 generates an orthogonal system which is a special case of a Riesz
basis.
By considering the rescaled version of such a basis, we specify the subspace of splines
at scale i as
 
Vi = si (r) = ci [k]φi,k (r) : ci ∈ 2 (Z) ⊂ L2 (R),
k∈Z
1 Duplos are the larger-scale versions of Lego building blocks and are more suitable for smaller children to
play with. The main point of the analogy with wavelets is that Legos and Duplos are compatible; they can
be combined to build more complex shapes. The enabling property is that a Duplo is equivalent to two
smaller Legos placed next to each other, as expressed by (6.5).
6.1 From Legos to wavelets 115

4 s0(r) Ã(r=2)
3 Wavelet:
2
1

0 2 4 6 8 2
+ 1
4 s1(r)
3 2 4 6 8
2
- –1
–2
1 ri(r) = si−1(r)−si(r)
2
0 2 4 6 8
4 + 1
3 s2(r)
2 4 6 8
2 - –1
1 –2
2
0 2 4 6 8
4 + 1
3 s3(r)
2 4 6 8
2 - –1
1 –2

0 2 4 6 8

Figure 6.1 Multiresolution signal analysis using piecewise-constant splines with a dyadic scale
progression. Left: multiresolution pyramid. Right: error signals between two successive levels of
the pyramid.

which, in our example, contains


, all the finite-energy functions that are piecewise-
constant on the intervals 2i k, 2i (k + 1) with k ∈ Z. The two-scale relation (6.3)
implies that the basis functions at scale i = 1 are contained in V0 (the original space
of cardinal splines) and, by extension, in Vi for i ≤ 0. This translates into the general
inclusion property Vi ⊂ Vi for any i > i, which is fundamental to the theory. A subtler
-
point is that the closure of i∈Z Vi is equal to L2 (R), which follows from the fact that
any finite-energy function can be approximated arbitrarily well by a piecewise-constant
spline when the sampling step 2i tends to zero (i → −∞). The necessary and sufficient
condition for this asymptotic convergence is the partition of unity (6.4), which ensures
that the representation is complete.
Having set the notation and specified the underlying hierarchy of function spaces, we
now proceed with the multiresolution approximation procedure starting from the fine-
scale signal s0 (x), as illustrated in Figure 6.1. Given the sequence c0 [·] of fine-scale
coefficients, our task is to construct the best spline approximation at scale 1 which is
specified by its B-spline coefficients c1 [·] in (6.1) with i = 1. It is easy to see that the
minimum-error solution (orthogonal projection of s0 onto V1 ) is obtained by taking the
mean of two consecutive samples. The procedure is then repeated at the next-coarser
scale and so forth until one reaches the bottom of the pyramid, as shown on the left-
hand side of Figure 6.1. The description of this coarsening algorithm is

1 1  
ci [k] = ci−1 [2k] + ci−1 [2k + 1] = ci−1 ∗ h̃ [2k]. (6.6)
2 2
116 Splines and wavelets

It is run recursively for i = 1, . . . , imax where imax denotes the bottom level of the
pyramid. The outcome is a multiresolution analysis of our input signal s0 .
In order to uncover the wavelets, it is enlightening to look at the residual signals
ri (r) = si−1 (r) − si (r) ∈ Vi−1 on the right side of Figure 6.1. While these are splines
that live at the same resolution as si−1 , they actually have half the apparent degrees of
freedom. These error signals exhibit a striking sign-alternation pattern due to the fact
that two consecutive samples (ci−1 [2k], ci−1 [2k + 1]) are at an equal distance from their
mean value (ci [k]). This suggests rewriting the residuals more concisely in terms of
oscillating basis functions (wavelets) at scale i, like

ri (r) = si−1 (r) − si (r) = di [k]ψi,k (r), (6.7)


k∈Z

where the (non-normalized) Haar wavelets are given by


 
r − 2i k
ψi,k (r) = ψHaar ,
2i
with the Haar wavelet being defined by (1.19). The wavelet coefficients di [·] are given
by the consecutive half-differences
1 1  
di [k] = ci−1 [2k] − ci−1 [2k + 1] = ci−1 ∗ g̃ [2k]. (6.8)
2 2
More generally, since the wavelet template at scale i = 1, ψ1,0 ∈ V0 , we can write

ψ(r/2) = g[k]φ(r − k), (6.9)


k∈Z

which is the wavelet counterpart of the two-scale relation (6.3). In the present example,
we have g[k] = (−1)k h[k], which is a general relation that is characteristic of an or-
thogonal design. Likewise, in order to gain in generality, we have chosen to express
the decomposition algorithms (6.6) and (6.8) in terms of discrete convolution (filter-
ing) and downsampling operations where the corresponding Haar analysis filters are
h̃[k] = 12 h[−k] and g̃[k] = 12 (−1)k h[−k]. The Hilbert-space interpretation of this ap-
proximation process is that ri ∈ Wi , where Wi is the orthogonal complement of Vi in
Vi−1 ; that is, Vi−1 = Vi +Wi with Vi ⊥ Wi (as a consequence of the orthogonal-projection
theorem).
Finally, we can close the loop by observing that
imax
 
s0 (r) = simax (r) + si−1 (r) − si (r)
  
i=1 ri (r)
imax
= cimax [k]φimax ,k (r) + di [k]ψi,k (r), (6.10)
k∈Z i=1 k∈Z

which provides an equivalent, one-to-one representation of the signal in an orthogonal


wavelet basis, as illustrated in Figure 6.2.
6.1 From Legos to wavelets 117

Figure 6.2 Decomposition of a signal into orthogonal scale components. The error signals
ri = si−1 − si between two successive signal approximations are expanded using a series of
properly scaled wavelets.

More generally, we can push the argument to the limit and apply the decomposition
to any finite-energy function
∀s ∈ L2 (R), s= di [k]ψi,k , (6.11)
i∈Z k∈Z

where di [k] = s, ψ̃i,k L2 and {ψ̃i,k }(i,k)∈Z2 is a suitable (bi-)orthogonal wavelet basis
with the property that ψ̃i,k , ψi ,k L2 = δk−k ,i−i .
Remarkably, the whole process described above – except the central expressions in
(6.6) and (6.8), and the equations explicitly involving β+0 – is completely generic and
applicable to any other wavelet basis of L2 (R) that is specified in terms of a wavelet
filter g and a scaling function φ (or, equivalently, an admissible refinement filter h). The
bottom line is that the wavelet decomposition and reconstruction algorithm is fully des-
cribed by the four digital filters (h, g, h̃, g̃) that form a perfect reconstruction filterbank.
The Haar transform is associated with the shortest possible filters. Its less favorable as-
pects are that the basis functions are discontinuous and that the scale-truncated error
decays only like the first power of the sampling step a = 2i (first order of approxima-
tion).
The fundamental point of our formulation is that the Haar wavelet is matched to
d
the pure derivative operator D = dr , which goes hand-in-hand with Lévy processes (see
Chapter 1). In that respect, the critical observations relating to spline and wavelet theory
are as follows:
• All piecewise-constant functions can be interpreted as D-splines.
• The Haar wavelet acts as a smoothed version of the derivative in the sense that
Haar = Dφ, where φ is an appropriate kernel (triangle function).
118 Splines and wavelets

• The B-spline of degree zero can be expressed as β+0 = βD = Dd D−1 δ, where the
finite-difference operator Dd is the discrete counterpart of D.
We shall now show how these ideas are extendable to a much broader class of differen-
tial operators L.

6.2 Basic concepts and definitions

6.2.1 Spline-admissible operators


Let L : S (Rd ) → S  (Rd ) be a generic Fourier-multiplier operator with frequency
response  L(ω). We further assume that L has a continuous extension L : X (Rd ) →
S  (Rd ) to some larger space of functions X (Rd ) with S (Rd ) ⊂ X (Rd ).
The null space of L is denoted by NL and defined as
NL = {p0 (r) : Lp0 (r) = 0}.

The immediate consequence of L being LSI is that NL is shift-invariant as well, in the


sense that p0 (r) ∈ NL ⇔ p0 (r − r0 ) ∈ NL for any r0 ∈ Rd . We shall use this property
to argue that NL generally consists of generalized functions whose Fourier transforms
are point distributions. In the space domain, they correspond to modulated polynomials,
which are linear combinations of exponential monomials of the form ejω0 ,r rn with ω0 ∈
Rd and multi-index n = (n1 , . . . , nd ) ∈ Nd . It actually turns out that the existence of a
single such element in NL has direct implications for the structure and dimensionality
of the underlying function space.
PROPOSITION 6.1 (Characterization of null space) If L is LSI and pn (r) = ez0 ,r rn ∈
NL with z0 ∈ Cd , then NL does necessarily include all exponential monomials of the
form pm (r) = ez0 ,r rm with m ≤ n. In addition, if NL is finite-dimensional, it can only
consist of atoms of that particular form.
Proof The LSI property implies that pn (r − r0 ) ∈ NL for any r0 ∈ Rd . To make our
point about the inclusion of the lower-order exponential polynomials in NL , we start by
expanding the scalar term (ri − r0,i )ni as
ni  
ni m ni −m ni !
(ri − r0,i )ni = r (−1)ni −m r0,i = (−1)k rm k
i r0,i .
m i m! k!
m=0 m+k=ni

By proceeding in a similar manner with the other monomials and combining the results,
we find that
n!
(r − r0 )n = (−1)|k| rm rk0
m! k!
m+k=n

= bm (r0 )rm ,
m≤n

with polynomial coefficients bm (r0 ) that depend upon the multi-index m and the shift
r0 . Finally, we note that the exponential factor ez0 ,r can be shifted by r0 by simple
6.2 Basic concepts and definitions 119

multiplication with a constant (see (6.12) below). These facts taken together establish
the structure of the underlying vector space. As for the last statement, we rely on the
theory of Lie groups that tells us that the only finite-dimensional collection of functions
that are translation-invariant is made of exponential polynomials. The pure exponentials
ez0 ,r (with n = 0) are special in that respect: they are the eigenfunctions of the shift
operator in the sense that
ez0 ,r−r0 = λ(r0 ) ez0 ,r (6.12)
with the (complex) eigenvalue λ(r0 ) = ez0 ,r0 , and hence the only elements that specify
shift-invariant subspaces of dimension 1.
Since our formulation relies on the theory of generalized functions, we shall focus on
the restriction of NL to S  (Rd ). This rules out the exponential factors z0 = α 0 + jω0
in Proposition 6.1 with α 0 ∈ Rd \ {0}, for which the Fourier-multiplier operator is not
necessarily well defined. We are then left with null-space atoms of the form ejω0 ,r rn
with ω0 ∈ Rd , which are functions of slow growth.
The next important ingredient is the Green’s function ρL of the operator L. Its defining
property is LρL = δ, where δ is the d-dimensional Dirac distribution. Since there are
many equivalent Green’s functions of the form ρL + p0 where p0 ∈ NL is an arbitrary
component of the null space, we resolve the ambiguity by defining the (primary) Green’s
function of L as
* +
1
ρL (r) = F −1 (r), (6.13)

L(ω)
with the requirement that ρL ∈ S  (Rd ) is an ordinary function of slow growth. In other
words, we want ρL (r) to be defined pointwise for any r ∈ Rd and to grow no faster
than a polynomial. The existence of the generalized inverse Fourier transform (6.13)
imposes some minimal continuity and decay conditions on 1/ L(ω) and also puts
 some
restrictions on the number and nature of its singularities e.g., the zeros of 
L(ω) .
D E F I N I T I O N 6.2 (Spline admissibility) The Fourier-multiplier operator L : X (Rd ) →
S  (Rd ) with frequency response  L(ω) is called spline admissible if (6.13) is well de-
fined and ρL (r) is an ordinary function of slow growth.
An important characteristic of spline-admissible operators is the rate of growth of
their frequency response at infinity.
DEFINITION 6.3 (Order of a Fourier multiplier) The Fourier multiplier  L(ω) is of
(asymptotic) order γ ∈ R+ if there exists a radius R ∈ R+ and a constant C such that
C|ω|γ ≤ |
L(ω)| (6.14)
for all |ω| ≥ R, where γ is critical in the sense that the condition fails for any larger
value.
The order is in direct relation with the degree of smoothness of the Green’s function
ρL . In the case of a scale-invariant operator, it also coincides with the scaling order (or
degree of homogeneity) of  L(ω). For instance, the fractional-derivative operator Dγ ,
120 Splines and wavelets

which is defined via the Fourier multiplier (jω)γ , is of order γ . Its Green’s function is
given by (see Table A.1)
* + γ −1
1 r+
ρDγ (r) = F −1 (r) = , (6.15)
(jω)γ (γ )
γ −1
where  is Euler’s gamma function (see Appendix C) and r+ = max(0, r)γ −1 . Clearly,
the latter is a function of slow growth. It has a single singularity at the origin whose
Hölder exponent is (γ − 1), and is infinitely differentiable everywhere else. It follows
that ρDγ is uniformly Hölder-continuous of degree (γ −1). This is one less than the order
of the operator. On the other hand, the null space of Dγ consists of the polynomials of
n (jω)γ
degree N = γ − 1 since d dω n ∝ (jω)γ −n is vanishing at the origin up to order N
with (γ − 1) ≤ N < γ (see argument in Section 6.4.1).
A fundamental result is that all partial differential operators with constant coefficients
are spline-admissible. This follows from the Malgrange–Ehrenpreis theorem, which
guarantees the existence of their Green’s function [Mal56, Wag09]. The generic form
of such operators is

LN = an ∂ n
|n|<N

n1 +···+n
with an ∈ Rd , where ∂ n is the multi-index notation for ∂n1 d
n . The corresponding
∂r1 ··· ∂rdd

Fourier multiplier is LN (ω) = |n|<N an j|n| ωn , which is a polynomial of degree N =
|n|. The operator is elliptic if 
LN (ω) vanishes at the origin and nowhere else. More
generally, it is called quasi-elliptic of order γ if 
LN (ω) fulfills the growth condition
in Definition 6.3. For d = 1, it is fairly easy to determine ρL using standard Fourier-
inversion techniques (see Chapter 5). Moreover, the condition for quasi-ellipticity of
order N is automatically satisfied. When moving to higher dimensions, the study of
partial differential operators and the determination of their Green’s functions becomes
more challenging because of the absence of a general multidimensional factorization
mechanism. Yet, it is possible to treat special cases in full generality, such as the scale-
invariant operators (with homogeneous, but not necessarily, rotation-invariant, Fourier
multipliers) and the class of rotation-invariant operators that are polynomials of the
Laplacian (− ).

6.2.2 Splines and operators


The foundation of our formulation is the direct correspondence between a spline-
admissible operator L and a particular brand of splines.

D E F I N I T I O N 6.4 (Cardinal L-spline) A function s(r) (possibly of slow growth) is


called a cardinal L-spline if and only if

Ls(r) = a[k]δ(r − k).


k∈Zd
6.2 Basic concepts and definitions 121

The location of the Dirac impulses specifies the spline discontinuities (or knots). The
term “cardinal” refers to the particular setting where these are located on the Cartesian
grid Zd .
The remarkable aspect of this definition is that the operator L has the role of a
mathematical A-to-D converter since it maps a continuously defined signal s into a
discrete sequence a = (a[k]). Also note that the weighted sum of Dirac impulses on
the right-hand side of the above equation can be interpreted as the continuous-domain
representation of the discrete signal a – it is a hybrid-type representation that is com-
monly used in the theory of linear systems to model ideal sampling (multiplication with
a train of Dirac impulses).
The underlying concept of a spline is fairly general and it naturally extends to non-
uniform grids.

D E F I N I T I O N 6.5 (Non-uniform spline) Let {rk }k∈S be a set of points (not necessarily
finite) that specifies a (non-uniform) grid in Rd . Then, a function s(r) (possibly of slow
growth) is a non-uniform L-spline with knots {rk }k∈S if and only if

Ls(r) = ak δ(r − rk ).
k∈S

The direct implication of this definition is that a (non-uniform) L-spline with knots
{rk } can generally be expressed as

s(r) = p0 (r) + ak ρL (r − rk ), (6.16)


k∈S

where ρL = L−1 δ is the Green’s function of L and p0 ∈ NL is an appropriate null-space


component that is typically selected to fulfill some boundary conditions.
In the case where the grid is uniform, it is usually more convenient to express splines
in terms of localized B-spline functions which are shifted replicates of a simple template
βL , or some other equivalent generator. An important requirement is that the set of
B-spline functions constitutes a Riesz basis.

6.2.3 Riesz bases


To quote Ingrid Daubechies [Dau92], a Riesz basis is the next-best thing after an ortho-
gonal basis. The reason for not enforcing orthogonality is to leave more room for other
desirable features such as simplicity of construction, maximum localization of the basis
function (e.g., compact support), and, last but not least, fast computational solutions.

DEFINITION 6.6 (Riesz basis) A sequence of functions {φk (r)}k∈Z in L2 (Rd ) forms a
Riesz basis if and only if there exist two constants A and B such that

Ac2 ≤  ck φk (r)L2 (Rd ) ≤ Bc2


k∈Z
122 Splines and wavelets

for any sequence c = (ck ) ∈ 2 . More generally, the basis is Lp -stable if there exist two
constants Ap and Bp such that

Ap cp ≤  ck φk (r)Lp (Rd ) ≤ Bp cp .


k∈Z

This definition imposes an equivalence between the L2 (Lp , resp.) norm of the conti-

nuously defined function s(r) = k∈Z ck φk (r) and the 2 (p , resp.) norm of its expan-
sion coefficients (ck ). This ensures that the representation is stable in the sense that a
small perturbation of the expansion coefficients results in a perturbation of comparable
magnitude on s(r) and vice versa. Also note that the lower inequality implies that the
functions {φk } are linearly independent (by setting s(r) = 0), which is the defining pro-
perty of a basis in finite dimensions – but which, on its own, does not ensure stability in
infinite dimensions. When A = B = 1, we have a perfect norm equivalence, which trans-
lates into the basis being orthonormal (Parseval’s relation). Finally, we point out that
the existence of the bounds A and B ensures that the (infinite) Gram matrix is positive
definite so that it can be readily diagonalized to yield an equivalent orthogonal basis.
In the (multi-)integer shift-invariant case where the basis functions are given by
φk (r) = φ(r − k), k ∈ Zd , there is a simpler equivalent reformulation of the Riesz
basis requirement of Definition 6.6.
THEOREM 6.2 Let φ(r) ∈ L2 (Rd ) be a B-spline-like generator whose Fourier trans-

form is denoted by φ(ω). Then, {φ(r − k)}k∈Zd forms a Riesz basis with Riesz bounds A
and B if and only if
0 < A2 ≤  + 2πn)|2 ≤ B2 < ∞
|φ(ω (6.17)
n∈Zd

for almost every ω. Moreover, the basis is Lp -stable for all 1 ≤ p ≤ +∞ if, in addition,

sup |φ(r − k)| = A2,∞ < +∞. (6.18)


r∈[0,1]d k∈Zd

Under such condition(s), the induced function space


⎧ ⎫
⎨ ⎬
Vφ = s(r) = c[k]φ(r − k) : c ∈ p (Zd )
⎩ d

k∈Z

is a closed subspace of Lp (Rd ), including the standard case p = 2.


Observe that the central quantity in (6.17) corresponds to the discrete-domain Fourier
transform of the Gram sequence aφ [k] = φ(· − k), φ L2 . Indeed, we have that

Aφ (ejω ) = aφ [k]e−jω,k =  + 2πn)|2 ,


|φ(ω (6.19)
k∈Zd n∈Zd

where the right-hand side follows from Poisson’s summation formula applied to the

sampling at the integers of the autocorrelation function (φ ∗ φ)(r). Equation (6.19)
is especially advantageous in the case of compactly supported B-splines, for which the
autocorrelation is often known explicitly (as a B-spline of twice the order), since it
6.2 Basic concepts and definitions 123

reduces the calculation to a finite sum over the support of the Gram sequence (discrete-
domain Fourier transform).
Theorem 6.2 is a fundamental result in sampling and approximation theory [Uns00].
It is instructive here to briefly run through the L2 part of the proof, which also serves as
a refresher on some of the standard properties of the Fourier transform. In particular, we
emphasize the interaction between the continuous and discrete aspects of the problem.

Proof We start by computing the Fourier transform of s(r) = k∈Zd c[k]φ(r − k),
which gives

F {s}(ω) = 
c[k] e−jω,k φ(ω) (by linearity and shift property)
k∈Zd

= C(ejω ) · φ(ω),

where C(ejω ) is recognized as the discrete-domain Fourier transform of c[·]. Next, we


invoke Parseval’s identity and manipulate the Fourier-domain integral as follows:
 
 jω 2  2 dω
s2L2 = C(e ) φ(ω)
Rd (2π)d
 
 j(ω+2πn) 2  2 dω
= C(e ) φ(ω + 2πn)
[0,2π]d (2π)d
n∈Zd
   2 dω
 jω 2 
= C(e ) φ(ω + 2πn)
[0,2π]d (2π)d
n∈Zd
 
 jω 2 dω
= C(e ) Aφ (ejω ) .
[0,2π]d (2π)d

Here, we have used the fact that C(ejω ) is 2π-periodic and the non-negativity of the
integrand to interchange the summation and the integral (Fubini). This naturally leads
to the inequality

inf Aφ (ejω ) · c22 ≤ s2L2 ≤ sup Aφ (ejω ) · c22 ,


ω∈[0,2π]d ω∈[0,2π]d

where we are now making use of Parseval’s identity for sequences, so that
 
 j(ω 2 dω
c22 = C(e ) .
[0,2π]d (2π)d
The final step is to show that these bounds are sharp. This can be accomplished through
the choice of some particular (bandlimited) sequence c[·].

Note that the “almost everywhere” part of (6.17) can be dropped when φ ∈ L1 (Rd )
because the Fourier transform of such a function is continuous (Riemann–Lebesgue
lemma).
While the result of Theorem 6.2 is restricted to the classical Lp spaces, there is no fun-
damental difficulty in extending it to wider classes of weighted (with negative powers)
Lp spaces by imposing some stricter condition than (6.18) on the decay of φ. For ins-
tance, if φ has exponential decay, then the definition of the function space Vφ can be
124 Splines and wavelets

extended for all sequences c that are growing no faster than a polynomial. This happens
to be the appropriate framework for sampling generalized stochastic processes which
do not live in the Lp spaces since they are not decaying at infinity.

6.2.4 Admissible wavelets


The other important tool for analyzing stochastic processes is the wavelet transform,
whose basis functions must be “tuned” to the object under investigation.
D E F I N I T I O N 6.7 A wavelet function ψ is called L-admissible if it can be expressed
as ψ = LH φ with φ ∈ L1 (Rd ).

Observe that we are now considering the Hermitian transpose operator LH = L ,
which is distinct from the adjoint operator L∗ when the impulse response has some
imaginary component. The reason for this is that the wavelet-analysis step involves
a Hermitian inner product ·, · L2 whose definition differs by a complex conjugation
from that of the distributional scalar product ·, · used in our formulation of stochastic
processes
 when the second argument is complex-valued; specifically, f, g L2 = f, g =
R d f (r)g(r) dr.
The best-matched wavelet is the one for which the wavelet kernel φ is the most
localized – ideally, the shortest possible support assuming that it is at all possible to
construct a compactly supported wavelet basis. The very least is that φ should be
concentrated around the origin and exhibit a sufficient rate of decay; for instance,
|φ(r)| ≤ 1+rC
α for some α > d.
A direct implication of Definition 6.7 is that the wavelet ψ will annihilate all the
components (e.g., polynomials) that are in the null space of L because ψ(· − r0 ), p0 =
φ(· − r0 ), Lp0 = 0, for all p0 ∈ NL and r0 ∈ Rd . In conventional wavelet theory,
this behavior is achieved by designing “Nth-derivative-like” wavelets with vanishing
moments up to polynomial degree N − 1.

6.3 First-order exponential B-splines and wavelets

Rather than aiming for the highest level of generality right away, we propose to first
examine the 1-D first-order scenario in some detail. First-order differential models are
important theoretically because they go hand-in-hand with the Markov property. In that
respect, they constitute the next level of generalization beyond the Lévy processes.
Mathematically, the situation is still quite comparable to that of the derivative operator
in the sense that it leads to a nice, self-contained construction of (exponential) B-splines
and wavelets. The interesting aspect, though, is that the underlying basis functions are
no longer conventional wavelets that are dilated versions of a single prototype: they
now fall into the lesser-known category of non-stationary 2 wavelets.
2 In the terminology of wavelets, the term “non-stationary” refers to the property that the shape of the wavelet
changes with scale, but not with respect to the location, as the more usual statistical meaning of the term
would suggest.
6.3 First-order exponential B-splines and wavelets 125

Figure 6.3 Comparison of basis functions related to the first-order differential operator
Pα = D − αI for α = 0, −1, −2, −4 (dark to light). (a) Green’s functions ρα (r). (b) Exponential
B-splines βα (r). (c) Augmented spline interpolators ϕint (r). (d) Orthonormalized versions of the
exponential spline wavelets ψα (r) = P∗α ϕint (r).

The (causal) Green’s function of our canonical first-order operator Pα = (D − αId)


is identical to the impulse response ρα of the corresponding differential system, while
the (1-D) null space of the operator is given by Nα = {a0 eαr : a0 ∈ R}. Some examples
of such Green’s functions are shown in Figure 6.3. The case α = 0 (dark curve) is the
classical one already treated in Section 6.1.

6.3.1 B-spline construction


The natural discrete approximation of the differential operator Pα = (D − αId) is the
first-order weighted difference operator

α s(r) = s(r) − eα s(r − 1). (6.20)


Observe that α annihilates the exponentials a0 eαr so that its null space includes Nα .
The corresponding B-spline is obtained by applying α to ρα , which yields
  ⎧
1 − eα e−jω ⎨ eαr , for 0 ≤ r < 1
βα (r) = F −1 (r) = (6.21)
jω − α ⎩ 0, otherwise.

In effect, the localization by α results in a “chopped-off” version of the causal Green’s


function that is restricted to the interval [0, 1) (see Figure 6.3b). Importantly, the scheme
remains applicable in the unstable scenario Re(α) ≥ 0. It always results in a well-
defined Fourier transform due to the convenient pole–zero cancellation in the central
expression of (6.21). The marginally unstable case α = 0 results in the rectangular
function shown in Figure 6.3b, which is the standard basis function for representing
126 Splines and wavelets

piecewise-constant signals. Likewise, βα generates an orthogonal basis for the space of


cardinal Pα -splines in accordance with Definition 6.4. This allows us to specify our pro-
totypical exponential spline space as V0 = span{βα (· − k)}k∈Z with knot spacing 20 = 1.

6.3.2 Interpolator in augmented-order spline space


The second important ingredient is the interpolator for the “augmented-order” spline

space generated by the autocorrelation (β α ∗ βα )(r) of the B-spline. Constructing it is
especially easy in the first-order case because it involves the simple normalization
1 ∨
ϕint,α (r) = ∨ (β α ∗ βα )(r). (6.22)
(β α ∗ βα )(0)
Specifically, ϕint,α is the unique cardinal PH α Pα -spline function that vanishes at all the
integers, except at the origin where it takes the value one (see Figure 6.3c). Its classical
use is to provide a sinc-like kernel for the representation of the corresponding family of
splines, and also for the reconstruction of spline-related signals, including special brands
of stochastic processes, from their integer samples [UB05b]. Another remarkable and
lesser-known property is that this function provides the proper smoothing kernel for
defining an operator-like wavelet basis.

6.3.3 Differential wavelets


In the generalized spline framework, instead of specifying a hierarchy of multiresolution
subspaces of L2 (R) (the space of finite-energy functions) via the dilation of a scaling
function, one considers the fine-to-coarse sequence of L-spline spaces

Vi = {s(r) ∈ L2 (R) : Ls(r) = ai [k]δ(r − 2i k)},


k∈Z

where the embedding Vi ⊇ Vj for i ≤ j is obvious from the (dyadic) hierarchy of spline
knots, so that sj ∈ Vj implies that sj ∈ Vi with an appropriate subset of its coefficients
ai [k] being zero.
We now detail the construction of a wavelet basis at resolution 1 such that W1 =
span{ψ1,k }k∈Z with W1 ⊥ V1 and V1 + W1 = V0 = span{βα (· − k)}k∈Z . The recipe is to
take ψ1,k (r) = ψα (r − 1 − 2k)/ψα L2 , where ψα is the mother wavelet given by

α ϕint,α (r) ∝
ψα (r) = PH H
α βα (r).

Here, H α is the Hermitian adjoint of the finite-difference operator α . Examples of


such exponential-spline wavelets are shown in Figure 6.3d, including the classical Haar
wavelet (up to a sign change) which is obtained for α = 0. The basis functions ψ1,k are
shifted versions of ψα that are centered at the odd integers and normalized to have a
unit norm. Since these wavelets are non-overlapping, they form an orthonormal basis.
Moreover, the basis is orthogonal to the coarser spline space V1 as a direct consequence
of the interpolating property of ϕint,α (Proposition 6.6 in Section 6.5). Finally, based on
6.4 Generalized B-spline basis 127

the fact that ψ1,k ∈ V0 for all k ∈ Z, one can show that these wavelets span W1 , which
translates into
$$W_1 = \Big\{ v(r) = \sum_{k\in\mathbb{Z}} v_1[k]\,\psi_{1,k}(r) : v_1 \in \ell_2(\mathbb{Z}) \Big\}.$$

This method of construction extends to the other wavelet subspaces $W_i$ provided that the interpolating kernel $\varphi_{\mathrm{int},\alpha}$ is replaced by its proper counterpart at resolution $a = 2^{i-1}$ and the sampling grid adjusted accordingly. Ultimately, this results in a wavelet basis of $L_2(\mathbb{R})$ whose members are all $\mathrm{P}_\alpha$-splines – that is, piecewise-exponential with parameter $\alpha$ – but not dilates of the same prototype unless $\alpha = 0$. Otherwise, the corresponding
decomposition is not fundamentally different from a conventional wavelet expansion.
The basis functions are equally well localized and the scheme admits the same type of
fast reversible filterbank algorithm, albeit with scale-dependent filters [KU06].
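For the marginally unstable case alpha = 0, the construction above reduces to the Haar system, which lends itself to a simple numerical sanity check (a sketch under our own discretization choices): the wavelet shifted to the odd integer 1 is orthogonal to every piecewise-constant spline of V1 with knots on 2Z.

import numpy as np

h = 1e-3
r = np.arange(-4, 4, h)
# Haar-type mother wavelet centered at the odd integer 1 (cf. psi_{1,k} above)
psi1 = np.where((r >= 1) & (r < 2), 1.0, 0.0) \
     - np.where((r >= 0) & (r < 1), 1.0, 0.0)

# a random element of V_1: piecewise-constant with knots at the even integers
rng = np.random.default_rng(1)
a = rng.standard_normal(4)                 # one coefficient per interval of length 2
s1 = a[((r + 4) // 2).astype(int)]
print(np.sum(s1 * psi1) * h)               # = 0: the wavelet is orthogonal to V_1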

6.4 Generalized B-spline basis

The procedure of Section 6.3.1 remains applicable for the broad class of spline-
admissible operators (see Definition 6.2) in one or multiple dimensions. The two
ingredients for constructing a generalized B-spline basis are: (1) the knowledge of the
Green’s function ρL of the operator L, and (2) the availability of a discrete approxima-
tion (finite-difference-like) of the operator of the form
$$\mathrm{L_d}s(r) = \sum_{k\in\mathbb{Z}^d} d_\mathrm{L}[k]\,s(r-k) \tag{6.23}$$
with $d_\mathrm{L} \in \ell_1(\mathbb{Z}^d)$ that fulfills the null-space matching constraint³
$$\mathrm{L_d}p_0(r) = \mathrm{L}p_0(r) = 0 \quad \text{for all } p_0 \in \mathcal{N}_\mathrm{L}. \tag{6.24}$$
The generalized B-spline associated with the operator L is then given by
$$\beta_\mathrm{L}(r) = \mathrm{L_d}\rho_\mathrm{L}(r) = \mathcal{F}^{-1}\left\{\frac{\sum_{k\in\mathbb{Z}^d} d_\mathrm{L}[k]\,e^{-j\langle k,\omega\rangle}}{\hat{L}(\omega)}\right\}(r), \tag{6.25}$$
where the numerator and denominator in the right-hand expression correspond to the frequency responses of $\mathrm{L_d}$ and L, respectively. The null-space matching constraint is especially helpful for the unstable cases where $\rho_\mathrm{L} \notin L_1(\mathbb{R}^d)$: it ensures that the zeros of $\hat{L}(\omega)$ (singularities) are cancelled by some corresponding zeros of $\hat{L}_d(\omega)$ so that the Fourier transform of $\beta_\mathrm{L}$ remains bounded.
DEFINITION 6.8 The function $\beta_\mathrm{L}$ specified by (6.25) is an admissible B-spline for L if and only if (1) $\beta_\mathrm{L} \in L_1(\mathbb{R}^d) \cap L_2(\mathbb{R}^d)$ and (2) it generates a Riesz basis of the space of cardinal L-splines.
³ We want the null space of $\mathrm{L_d}$ to include $\mathcal{N}_\mathrm{L}$ and to remain as small as possible. In that respect, it is worth noting that the null space of a discrete operator will always be much larger than that of its continuous-domain counterpart. For instance, the derivative operator D suppresses constant signals, while its finite-difference counterpart annihilates all 1-periodic functions, including the constants.

In light of Theorem 6.2, the latter property requires the existence of the two Riesz bounds A and B such that
$$0 < A^2 \le \sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_\mathrm{L}(\omega + 2\pi n)\big|^2 = \sum_{n\in\mathbb{Z}^d} \frac{\big|\sum_{k\in\mathbb{Z}^d} d_\mathrm{L}[k]\,e^{-j\langle k,\omega\rangle}\big|^2}{|\hat{L}(\omega + 2\pi n)|^2} \le B^2. \tag{6.26}$$

A direct consequence of (6.25) is that
$$\mathrm{L}\beta_\mathrm{L}(r) = \sum_{k\in\mathbb{Z}^d} d_\mathrm{L}[k]\,\delta(r-k), \tag{6.27}$$

so that βL is itself a cardinal L-spline in accordance with Definition 6.4. The bottom
line in Definition 6.8 is that any cardinal L-spline admits a unique representation in the
B-spline basis {βL (· − k)}k∈Zd as

$$s(r) = \sum_{k\in\mathbb{Z}^d} c[k]\,\beta_\mathrm{L}(r-k), \tag{6.28}$$

where the c[k] are the B-spline coefficients of s.


While (6.25) provides us with a nice recipe for constructing B-splines, it does not
guarantee that the Riesz-basis condition (6.26) is satisfied. This needs to be established
on a case-by-case basis. The good news for the present theory of stochastic processes
is that B-splines are available for virtually all the operators that have been discussed
so far.
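The recipe (6.25) can also be exercised numerically. The sketch below (our own illustration, with an assumed FFT size and sampling step) samples the ratio of the frequency responses for L = D-squared together with the squared finite difference, inverts it with an FFT, and recovers the familiar triangle B-spline of degree 1 up to discretization error:

import numpy as np

N, T = 4096, 0.01                        # FFT size and sampling step (assumed)
w = 2 * np.pi * np.fft.fftfreq(N, d=T)   # frequency grid
w[0] = 1e-12                             # avoid the removable 0/0 at omega = 0

Ld = (1 - np.exp(-1j * w)) ** 2          # discrete operator: squared finite difference
L = (1j * w) ** 2                        # L = D^2 with Fourier multiplier (j omega)^2
beta = np.real(np.fft.ifft(Ld / L)) / T  # samples of beta_L on the grid k*T

r = np.arange(N) * T                     # compare with the causal triangle on [0, 2]
tri = np.where(r < 1, r, np.where(r < 2, 2 - r, 0.0))
m = int(2.5 / T)
print(np.max(np.abs(beta[:m] - tri[:m])))  # small (limited by aliasing of the tails)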

6.4.1 B-spline properties


To motivate the use of B-splines, we shall first restrict our attention to the space VL of
cardinal L-splines with finite energy, which is formally defined as
$$V_\mathrm{L} = \Big\{ s(r) \in L_2(\mathbb{R}^d) : \mathrm{L}s(r) = \sum_{k\in\mathbb{Z}^d} a[k]\,\delta(r-k) \Big\}. \tag{6.29}$$

The foundation of spline theory is that there are two complementary ways of representing splines using different types of basis functions: Green's functions vs. B-splines. The first representation follows directly from Definition 6.4 (see also (6.16)) and is given by
$$s(r) = p_0(r) + \sum_{k\in\mathbb{Z}^d} a[k]\,\rho_\mathrm{L}(r-k), \tag{6.30}$$

where $p_0 \in \mathcal{N}_\mathrm{L}$ is a suitable element of the null space of L and where $\rho_\mathrm{L} = \mathrm{L}^{-1}\delta$ is the Green's function of the operator. The functions $\rho_\mathrm{L}(\cdot - k)$ are non-local and very far from being orthogonal. In many cases, they are not even part of $V_\mathrm{L}$, which raises fundamental issues concerning the $L_2$ convergence of the infinite sum⁴ in (6.30) and the conditions that must be imposed upon the expansion coefficients $a[\cdot]$. The second type of B-spline expansion (6.28) does not have such stability problems. This is the primary reason why it is favored by practitioners.
⁴ Without further assumptions on $\rho_\mathrm{L}$ and $a$, (6.30) is only valid in the weak sense of distributions.

Stable representation of cardinal L-splines


The equivalent B-spline specification of the space $V_\mathrm{L}$ of cardinal splines is
$$V_\mathrm{L} = \Big\{ s(r) = \sum_{k\in\mathbb{Z}^d} c[k]\,\beta_\mathrm{L}(r-k) : c[\cdot] \in \ell_2(\mathbb{Z}^d) \Big\},$$

where the generalized B-spline $\beta_\mathrm{L}$ satisfies the conditions in Definition 6.8. The Riesz-basis property ensures that the representation is stable in the sense that, for all $s \in V_\mathrm{L}$, we have that
$$A\,\|c\|_{\ell_2} \le \|s\|_{L_2} \le B\,\|c\|_{\ell_2}. \tag{6.31}$$
Here, $\|c\|_{\ell_2} = \big(\sum_{k\in\mathbb{Z}^d} |c[k]|^2\big)^{1/2}$ is the $\ell_2$-norm of the B-spline coefficients $c$. The fact that the underlying functions are cardinal L-splines is a simple consequence of the atoms being splines themselves. Moreover, we can easily make the link with (6.30) by using (6.27), which yields
$$\mathrm{L}s(r) = \sum_{k\in\mathbb{Z}^d} c[k]\,\mathrm{L}\beta_\mathrm{L}(r-k) = \sum_{k\in\mathbb{Z}^d} \underbrace{(c * d_\mathrm{L})[k]}_{a[k]}\,\delta(r-k).$$

The less obvious aspect, which is implicit in the definition of the B-spline, is the com-
pleteness of the representation in the sense that the B-spline basis spans the space VL
defined by (6.29). We shall establish this by showing that the B-splines are capable of
reproducing ρL as well as any component p0 ∈ NL in the null space of L. The impli-
cation is that any function of the form (6.30) admits a unique expansion in a B-spline
basis. This is also true when the function is not in $L_2(\mathbb{R}^d)$, in which case the B-spline coefficients $c$ are no longer in $\ell_2(\mathbb{Z}^d)$, due to the discrete–continuous norm equivalence (6.31).

Reproduction of Green’s functions


The reproduction of Green’s functions follows from the special form of (6.25). To reveal
it, we consider the inverse L−1
d of the discrete localization operator Ld specified by
(6.23), whose continuous-domain impulse response is written as
* +
−1 −1 1
Ld δ(r) = p[k]δ(r − k) = F  −jk,ω
.
d
k∈Z k∈Zd dL [k]e

The sequence $p$, which can be determined by generalized inverse Fourier transform, is of slow growth with the property that $(p * d_\mathrm{L})[k] = \delta[k]$. The Green's-function reproduction formula is then obtained by applying $\mathrm{L_d^{-1}}$ to the B-spline $\beta_\mathrm{L}$ and making use of the left-inverse property of $\mathrm{L_d^{-1}}$. Thus,
$$\mathrm{L_d^{-1}}\beta_\mathrm{L}(r) = \mathrm{L_d^{-1}}\mathrm{L_d}\rho_\mathrm{L}(r) = \rho_\mathrm{L}(r)$$
results in
$$\rho_\mathrm{L}(r) = \sum_{k\in\mathbb{Z}^d} p[k]\,\beta_\mathrm{L}(r-k). \tag{6.32}$$

To illustrate the concept, let us get back to our introductory example in Section 6.3.1, with $\mathrm{L} = \mathrm{P}_\alpha = (\mathrm{D} - \alpha\mathrm{Id})$ where $\mathrm{Re}(\alpha) < 0$. The frequency response of this first-order operator is
$$\hat{P}_\alpha(\omega) = j\omega - \alpha,$$
while its Green's function is given by
$$\rho_\alpha(r) = \mathbb{1}_{+}(r)\,e^{\alpha r} = \mathcal{F}^{-1}\left\{\frac{1}{j\omega - \alpha}\right\}(r).$$
On the discrete side of the picture, we have the finite-difference operator $\Delta_\alpha$ with
$$\hat{\Delta}_\alpha(\omega) = 1 - e^{\alpha - j\omega},$$
and its inverse $\Delta_\alpha^{-1}$ whose expansion coefficients are
$$p_\alpha[k] = \mathbb{1}_{+}[k]\,e^{\alpha k} = \mathcal{F}_d^{-1}\left\{\frac{1}{1 - e^{\alpha}e^{-j\omega}}\right\}[k],$$

where $\mathcal{F}_d^{-1}$ denotes the discrete-domain inverse Fourier transform.⁵ The application of (6.32) then yields the exponential-reproduction formula
$$\mathbb{1}_{+}(r)\,e^{\alpha r} = \sum_{k=0}^{\infty} e^{\alpha k}\,\beta_\alpha(r-k), \tag{6.33}$$

where βα is the exponential B-spline defined by (6.21). Note that the range of applica-
bility of (6.33) extends to Re(α) ≤ 0.
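The reproduction formula (6.33) is exact and easy to verify numerically; the following sketch (with illustrative values of alpha and of the grid, chosen by us) reconstructs the causal Green's function from integer shifts of the exponential B-spline:

import numpy as np

alpha, h = -0.7, 1e-3
r = np.arange(0, 6, h)
rho = np.exp(alpha * r)   # causal Green's function sampled on r >= 0

def beta_alpha(r):
    return np.where((r >= 0) & (r < 1), np.exp(alpha * r), 0.0)

recon = sum(np.exp(alpha * k) * beta_alpha(r - k) for k in range(8))
print(np.max(np.abs(rho - recon)))   # ~ machine precision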

Reproduction of null-space components


A fundamental property of B-splines is their ability to reproduce the components that
are in the null space of their defining operator. In the case of our working example, we
can simply extrapolate (6.33) for negative indices, which yields

$$e^{\alpha r} = \sum_{k\in\mathbb{Z}} e^{\alpha k}\,\beta_\alpha(r-k).$$

It turns out that this reproduction property is induced by the matching null-space
constraint (6.24) that is imposed upon the localization filter. While the reproduction of
exponentials is interesting in its own right, we shall focus here on the important case
of polynomials and provide a detailed Fourier-based analysis. We start by recalling that
the general form of a multidimensional polynomial of total degree N is

$$q_N(r) = \sum_{|n|\le N} a_n\,r^n,$$
⁵ Our definition of the inverse discrete Fourier transform in 1-D is $\mathcal{F}_d^{-1}\{H(e^{j\omega})\}[k] = \frac{1}{2\pi}\int_{-\pi}^{\pi} H(e^{j\omega})\,e^{j\omega k}\,\mathrm{d}\omega$ with $k \in \mathbb{Z}$.

using the multi-index notation with $n = (n_1,\dots,n_d) \in \mathbb{N}^d$, $r^n = r_1^{n_1}\cdots r_d^{n_d}$, and $|n| = n_1 + \cdots + n_d$. The generalized Fourier transform of $q_N \in \mathcal{S}'(\mathbb{R}^d)$ (see Table 3.3 and entry $r^n f(r)$ with $f(r) = 1$) is given by
$$\hat{q}_N(\omega) = (2\pi)^d \sum_{|n|\le N} a_n\,j^{|n|}\,\partial^n\delta(\omega),$$

where $\partial^n\delta$ denotes the $n$th partial derivative of the multidimensional Dirac impulse $\delta$. Hence, the Fourier multiplier $\hat{L}$ will annihilate the polynomials of order N if and only if $\hat{L}(\omega)\,\partial^n\delta(\omega) = 0$ for all $|n| \le N$. To understand when this condition is met, we expand $\hat{L}(\omega)\,\partial^n\delta(\omega)$ in terms of $\partial^k\hat{L}(0)$, $|k| \le |n|$, by using the general product rule for the manipulation of Dirac impulses and their derivatives, given by
$$f(r)\,\partial^n\delta(r-r_0) = \sum_{k+l=n} (-1)^{|n|+|l|}\,\frac{n!}{k!\,l!}\,\partial^k f(r_0)\,\partial^l\delta(r-r_0).$$
The latter follows from Leibniz' rule for partially differentiating a product of functions,
$$\partial^n(f\varphi) = \sum_{k+l=n} \frac{n!}{k!\,l!}\,\partial^k f\,\partial^l\varphi,$$
and the adjoint relation $\langle\varphi, f\,\partial^n\delta(\cdot - r_0)\rangle = \langle\partial^{n*}(f\varphi), \delta(\cdot - r_0)\rangle$ with $\partial^{n*} = (-1)^{|n|}\partial^n$.
This allows us to conclude that the necessary and sufficient condition for the inclusion of the polynomials of order N in the null space of L is
$$\partial^n\hat{L}(0) = 0, \quad \text{for all } n \in \mathbb{N}^d \text{ with } |n| \le N, \tag{6.34}$$
which is equivalent to $\hat{L}(\omega) = O(\|\omega\|^{N+1})$ around the origin. Note that this behavior is prototypical of scale-invariant operators such as fractional derivatives and Laplacians. The same condition has obviously to be imposed upon the localization filter $\hat{L}_d$ in order for the Fourier transform of the B-spline in (6.25) to be non-singular at the origin. Since $\hat{L}_d(\omega)$ is $2\pi$-periodic, we have that
$$\partial^n\hat{L}_d(2\pi k) = 0, \quad k \in \mathbb{Z}^d,\ n \in \mathbb{N}^d \text{ with } |n| \le N. \tag{6.35}$$

For practical convenience, we shall assume that the B-spline $\beta_\mathrm{L}$ is normalized to have a unit integral⁶ so that $\hat{\beta}_\mathrm{L}(0) = 1$. Based on (6.35) and $\hat{\beta}_\mathrm{L}(\omega) = \hat{L}_d(\omega)/\hat{L}(\omega)$, we find that
$$\begin{cases} \hat{\beta}_\mathrm{L}(0) = 1 \\ \partial^n\hat{\beta}_\mathrm{L}(2\pi k) = 0, \quad k \in \mathbb{Z}^d\backslash\{0\},\ n \in \mathbb{N}^d \text{ with } |n| \le N, \end{cases} \tag{6.36}$$
which are the so-called Strang–Fix conditions of order N. Recalling that $j^{|n|}\partial^n\hat{\beta}_\mathrm{L}(\omega)$ is the Fourier transform of $r^n\beta_\mathrm{L}(r)$ and that periodization in the signal domain corresponds to a sampling in the Fourier domain, we finally deduce that
$$\sum_{k\in\mathbb{Z}^d} (r-k)^n\,\beta_\mathrm{L}(r-k) = j^{|n|}\partial^n\hat{\beta}_\mathrm{L}(0) = C_n, \quad n \in \mathbb{N}^d \text{ with } 0 < |n| \le N, \tag{6.37}$$
⁶ This is always possible thanks to (6.24), which ensures that $\hat{\beta}_\mathrm{L}(0) \neq 0$ due to a proper cancellation of poles and zeros in the right-hand side of (6.25).

with the implicit assumption that $\beta_\mathrm{L}$ has a sufficient order of algebraic decay for the above sums to be convergent. The special case of (6.37) with $n = 0$ reads
$$\sum_{k\in\mathbb{Z}^d} \beta_\mathrm{L}(r-k) = 1 \tag{6.38}$$
and is called the partition of unity. It reflects the fact that $\beta_\mathrm{L}$ reproduces the constants. More generally, (6.37) or (6.36) is equivalent to the existence of sequences $p_n$ such that
$$r^n = \sum_{k\in\mathbb{Z}^d} p_n[k]\,\beta_\mathrm{L}(r-k) \quad \text{for all } |n| \le N, \tag{6.39}$$

which is a more direct statement of the polynomial-reproduction property. For instance, (6.37) with $n = (1,\dots,1)$ implies that
$$r\,\underbrace{\Big(\sum_{k\in\mathbb{Z}^d} \beta_\mathrm{L}(r-k)\Big)}_{p_{(0,\dots,0)}[k]\,=\,1} - \sum_{k\in\mathbb{Z}^d} k\,\beta_\mathrm{L}(r-k) = C_{(1,\dots,1)},$$
from which one deduces that $p_{(1,\dots,1)}[k] = k + C_{(1,\dots,1)}$. The other sequences $p_n$, which are polynomials in $k$, may be determined in a similar fashion by proceeding recursively.
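These reproduction properties are easily checked in code. The sketch below (our own example) verifies the partition of unity (6.38) and the reproduction of the monomial r via (6.39) for the causal triangle B-spline of degree 1, for which the sequence p1[k] = k + 1 does the job in 1-D:

import numpy as np

def tri(r):  # causal triangle B-spline beta_+^1, supported on [0, 2]
    return np.where((r >= 0) & (r < 1), r,
                    np.where((r >= 1) & (r < 2), 2 - r, 0.0))

r = np.linspace(3, 7, 1001)   # stay away from the truncation boundary
ks = np.arange(0, 12)
ones = sum(tri(r - k) for k in ks)            # partition of unity (6.38)
ramp = sum((k + 1) * tri(r - k) for k in ks)  # p_1[k] = k + 1 reproduces r
print(np.max(np.abs(ones - 1)), np.max(np.abs(ramp - r)))  # both ~ 0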
Another equivalent way of stating the Strang–Fix conditions of order N is that the sums
$$\sum_{k\in\mathbb{Z}^d} k^n\,\beta_\mathrm{L}(r-k) = \sum_{l\in\mathbb{Z}^d} j^{|n|}\,\partial_\omega^n\Big[e^{-j\langle\omega,r\rangle}\,\hat{\beta}_\mathrm{L}(-\omega)\Big]\Big|_{\omega=2\pi l}$$
are polynomials with leading term $r^n$ for all $|n| \le N$. The left-hand-side expression follows from Poisson's summation formula⁷ applied to the function $f(x) = x^n\beta_\mathrm{L}(r-x)$ with $r$ being considered as a constant shift.

Localization
The guiding principle for designing B-splines is to produce basis functions that are
maximally localized on Rd . Ideally, B-splines should have the smallest possible sup-
port, which is the property that makes them so useful in applications. When it is not
possible to construct compactly supported basis functions, the B-spline should at least
be concentrated around the origin and satisfy some decay bound with the tightest pos-
sible constants. The primary types of spatial localization, by order of preference, are:
(1) Compact support: $\beta_\mathrm{L}(r) = 0$ for all $r \notin \Omega$, where $\Omega \subset \mathbb{R}^d$ is a bounded set with the smallest possible Lebesgue measure
(2) Exponential decay: $|\beta_\mathrm{L}(r)| \le C\exp(-\alpha|r|)$ with the largest possible $\alpha \in \mathbb{R}^+$
(3) Algebraic decay: $|\beta_\mathrm{L}(r)| \le C\,\frac{1}{1+\|r\|^{\alpha}}$ with the largest possible $\alpha \in \mathbb{R}^+$.

By relying on the classical relations that link spatial decay to the smoothness of the
Fourier transform, one can get a good estimate of spatial decay based on the knowledge
⁷ The standard form of Poisson's summation formula is $\sum_{k\in\mathbb{Z}^d} f(k) = \sum_{l\in\mathbb{Z}^d} \hat{f}(2\pi l)$. It is valid for any Fourier pair $f, \hat{f} = \mathcal{F}\{f\} \in L_1(\mathbb{R}^d)$ with sufficient decay for the two sums to be convergent.

of the Fourier transform $\hat{\beta}_\mathrm{L}(\omega) = \hat{L}_d(\omega)/\hat{L}(\omega)$ of the B-spline. Since the localization filter $\hat{L}_d(\omega)$ acts by compensating the (potential) singularities of $\hat{L}(\omega)$, the guiding principle is that the rate of decay is essentially determined by the degree of differentiability of $\hat{L}(\omega)$.
Specifically, if $\hat{\beta}_\mathrm{L}(\omega)$ is differentiable up to order N, then the B-spline $\beta_\mathrm{L}$ is guaranteed to have an algebraic decay of order N. To show this, we consider the Fourier-transform pair $r^n\beta_\mathrm{L}(r) \leftrightarrow j^{|n|}\partial^n\hat{\beta}_\mathrm{L}$ subject to the constraint that $\partial^n\hat{\beta}_\mathrm{L} \in L_1(\mathbb{R}^d)$ for all $|n| < N$. From the definition of the inverse Fourier integral, it immediately follows that
$$\big|r^n\beta_\mathrm{L}(r)\big| \le \frac{1}{(2\pi)^d}\,\big\|\partial^n\hat{\beta}_\mathrm{L}\big\|_{L_1},$$
which, when properly combined over all multi-indices $|n| < N$, yields an algebraic-decay estimate with $\alpha = N$. By pushing the argument to the limit, we see that exponential decay (which is faster than any order of algebraic decay) requires that $\hat{\beta}_\mathrm{L} \in C^\infty(\mathbb{R}^d)$ (infinite order of differentiability), which is only possible if $\hat{L}(\omega) \in C^\infty(\mathbb{R}^d)$ as well.
The ultimate limit in Fourier-domain regularity is when $\hat{\beta}_\mathrm{L}$ has an analytic extension that is an entire function.⁸ In fact, by the Paley–Wiener theorem (Theorem 6.3 below), one achieves compact support of $\beta_\mathrm{L}$ if and only if $\hat{\beta}_\mathrm{L}(\zeta)$ is an entire function of exponential type. To explain this concept, we focus on the 1-D case where the B-spline $\beta_\mathrm{L}$ is supported in the finite interval $[-A,+A]$. We then consider the holomorphic Fourier (or Fourier–Laplace) transform of the B-spline, given by
$$\hat{\beta}_\mathrm{L}(\zeta) = \int_{-A}^{+A} \beta_\mathrm{L}(r)\,e^{-\zeta r}\,\mathrm{d}r \tag{6.40}$$

with $\zeta = \sigma + j\omega \in \mathbb{C}$, which formally amounts to replacing $j\omega$ by $\zeta$ in the expression for the Fourier transform of $\beta_\mathrm{L}$. In order to obtain a proper analytic extension, we need to verify that $\hat{\beta}_\mathrm{L}(\zeta)$ satisfies the Cauchy–Riemann equation. We shall do so by applying a dominated-convergence argument. To that end, we construct the exponential bound
$$\big|\hat{\beta}_\mathrm{L}(\zeta)\big| \le e^{A|\zeta|}\int_{-A}^{+A} |\beta_\mathrm{L}(r)|\,\mathrm{d}r \le e^{A|\zeta|}\,\sqrt{\int_{-A}^{+A} 1\,\mathrm{d}r}\;\sqrt{\int_{-A}^{+A} |\beta_\mathrm{L}(r)|^2\,\mathrm{d}r} = e^{A|\zeta|}\,\sqrt{2A}\,\|\beta_\mathrm{L}\|_{L_2},$$

where we have applied the Cauchy–Schwarz inequality to derive the second inequality. Since $e^{-\zeta r}$ for fixed $r$ is itself an entire function and (6.40) is convergent over the whole complex plane, the conclusion is that $\hat{\beta}_\mathrm{L}(\zeta)$ is entire as well, in addition to being a function of exponential type A as indicated by the bound. The whole strength of the Paley–Wiener theorem is that the implication also works the other way around.

⁸ An entire function is a function that is analytic over the whole complex plane $\mathbb{C}$.

THEOREM 6.3 (Paley–Wiener) Let $f \in L_2(\mathbb{R})$. Then, $f$ is compactly supported in $[-A, A]$ if and only if its Laplace transform
$$F(\zeta) = \int_{\mathbb{R}} f(r)\,e^{-\zeta r}\,\mathrm{d}r$$
is an entire function of exponential type A, meaning that there exists a constant C such that
$$|F(\zeta)| \le C\,e^{A|\zeta|}$$
for all $\zeta \in \mathbb{C}$.
The result implies that one can deduce the support of $f$ from its Laplace transform. We can also easily extend the result to the case where the support is not centered around the origin by applying the Paley–Wiener theorem to the autocorrelation function $(f * f^\vee)(r)$. The latter is supported in the interval $[-2A, 2A]$, which is twice the size of the support of $f$, irrespective of its center. This suggests the following expression for the determination of the support of a B-spline:
$$\mathrm{support}(\beta_\mathrm{L}) = \limsup_{R\to\infty} \frac{\log\Big(\sup_{|\zeta|\le R}\big|\hat{\beta}_\mathrm{L}(\zeta)\,\hat{\beta}_\mathrm{L}(-\zeta)\big|\Big)}{R}. \tag{6.41}$$
It returns twice the exponential type of the recentered B-spline, which gives $\mathrm{support}(\beta_\mathrm{L}) = 2A$. While this formula is only strictly valid when $\hat{\beta}_\mathrm{L}(\zeta)$ is an entire function, it can be used otherwise as an operational measure of localization when the underlying B-spline is not compactly supported. Interestingly, (6.41) provides a measure that is additive with respect to convolution and proportional to the order $\gamma$. For instance, the support of an (exponential) B-spline associated with an ordinary differential operator of order N is precisely N, as a consequence of the factorization property of such B-splines (see Sections 6.4.2 and 6.4.4).
To get some insight into (6.41), let us consider the case of the polynomial B-spline of order one (or degree zero) with $\beta_\mathrm{D}(r) = \mathbb{1}_{[0,1)}(r)$ and Laplace transform
$$\hat{\beta}_\mathrm{D}(\zeta) = \frac{1 - e^{-\zeta}}{\zeta}.$$
The required product in (6.41) is
$$\hat{\beta}_\mathrm{D}(\zeta)\,\hat{\beta}_\mathrm{D}(-\zeta) = \frac{-e^{\zeta} + 2 - e^{-\zeta}}{\zeta^2},$$
which is analytic over the whole complex plane because of the pole–zero cancellation at $\zeta = 0$. For R sufficiently large, we clearly have that
$$\max_{|\zeta|\le R}\big|\hat{\beta}_\mathrm{D}(\zeta)\,\hat{\beta}_\mathrm{D}(-\zeta)\big| = \frac{e^R + 2 + e^{-R}}{R^2} \to \frac{e^R}{R^2}.$$
By plugging the above expression into (6.41), we finally get
$$\mathrm{support}(\beta_\mathrm{D}) = \limsup_{R\to\infty} \frac{R - 2\log R}{R} = 1,$$

which is the desired result. While the above calculation may look like overkill for the determination of the already-known support of $\beta_\mathrm{D}$, it becomes quite handy for making predictions for higher-order operators. To illustrate the point, we now consider the B-spline of order $\gamma$ associated with the (possibly fractional) derivative operator $\mathrm{D}^\gamma$ whose Fourier–Laplace transform is
$$\hat{\beta}_{\mathrm{D}^\gamma}(\zeta) = \left(\frac{1 - e^{-\zeta}}{\zeta}\right)^{\gamma}.$$
We can then essentially replicate the previous manipulation while moving the order out of the logarithm to deduce that
$$\mathrm{support}(\beta_{\mathrm{D}^\gamma}) = \limsup_{R\to\infty} \frac{\gamma R - 2\gamma\log R}{R} = \gamma.$$
This shows that the "support" of the B-spline is equal to its order, with the caveat that the underlying Fourier–Laplace transform $\hat{\beta}_{\mathrm{D}^\gamma}(\zeta)$ is only analytic (and entire) when the order $\gamma$ is a positive integer. This points to the fundamental limitation that a B-spline associated with a fractional operator – that is, when $\hat{L}(\zeta)$ is not an entire function – cannot be compactly supported.
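The operational support measure (6.41) can also be evaluated numerically. In the sketch below (our own crude estimate, with an assumed angular sampling of the circle of radius R), the growth rate of the product for the degree-zero B-spline tends to the expected support of 1 as R increases:

import numpy as np

def max_on_circle(R, n=2000):
    # max over |zeta| = R of |beta_D(zeta) beta_D(-zeta)|, beta_D(zeta) = (1 - e^{-zeta})/zeta
    zeta = R * np.exp(1j * np.linspace(0, 2 * np.pi, n, endpoint=False))
    f = (1 - np.exp(-zeta)) / zeta
    g = (1 - np.exp(zeta)) / (-zeta)
    return np.max(np.abs(f * g))

for R in (50, 100, 200, 400):
    print(R, np.log(max_on_circle(R)) / R)   # approaches support(beta_D) = 1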

Smoothness
The smoothness of a B-spline refers to its degree of continuity and/or differentiability.
Since a B-spline is a linear combination of shifted Green’s functions, its smoothness is
the same as that of ρL .
Smoothness descriptors come in two flavors – Hölder continuity vs. Sobolev differentiability – depending on whether the analysis is done in the signal or Fourier domain. Due to the duality between Fourier decay and order of differentiation, the smoothness of $\beta_\mathrm{L}$ may be predicted from the growth of $\hat{L}(\omega)$ at infinity, without need for the explicit calculation of $\rho_\mathrm{L}$. To that end, one considers the Sobolev spaces $W_2^\alpha$, which are defined as
$$W_2^\alpha(\mathbb{R}^d) = \left\{ f : \int_{\mathbb{R}^d} (1 + \|\omega\|^2)^\alpha\,|\hat{f}(\omega)|^2\,\mathrm{d}\omega < \infty \right\}.$$
Since the partial differentiation operator $\partial^n$ corresponds to a Fourier-domain multiplication by $(j\omega)^n$, the inclusion of $f$ in $W_2^\alpha(\mathbb{R}^d)$ requires that its (partial) derivatives be well defined in the $L_2$ sense up to order $\alpha$. The same is also true for the "Bessel-potential" operators $(\mathrm{Id} - \Delta)^{\alpha/2}$ of order $\alpha$, or, alternatively, the fractional Laplacians $(-\Delta)^{\alpha/2}$ with Fourier multiplier $\|\omega\|^\alpha$.

PROPOSITION 6.4 Let $\beta_\mathrm{L}$ be an admissible B-spline that is associated with a Fourier multiplier $\hat{L}(\omega)$ of order $\gamma$. Then, $\beta_\mathrm{L} \in W_2^\alpha(\mathbb{R}^d)$ for any $\alpha < \gamma - d/2$.
Proof Because of Parseval's identity, the statement $\beta_\mathrm{L} \in W_2^\alpha(\mathbb{R}^d)$ is equivalent⁹ to $\beta_\mathrm{L} \in L_2(\mathbb{R}^d)$ and $(-\Delta)^{\alpha/2}\beta_\mathrm{L} \in L_2(\mathbb{R}^d)$. Since the first inclusion is part of the definition,
⁹ The underlying Fourier multipliers are of comparable size in the sense that there exist two constants $c_1$ and $c_2$ such that $c_1(1 + \|\omega\|^2)^\alpha \le 1 + \|\omega\|^{2\alpha} \le c_2(1 + \|\omega\|^2)^\alpha$.

it is sufficient to check for the second. To that end, we recall the stability conditions $\beta_\mathrm{L} \in L_1(\mathbb{R}^d)$ and $d_\mathrm{L} \in \ell_1(\mathbb{Z}^d)$, which are implicit to the B-spline construction (6.25). These, together with the order condition (6.14), imply that
$$|\hat{L}_d(\omega)| \le \|d_\mathrm{L}\|_{\ell_1}$$
and
$$|\hat{\beta}_\mathrm{L}(\omega)| = \left|\frac{\hat{L}_d(\omega)}{\hat{L}(\omega)}\right| \le \min\left(\|\beta_\mathrm{L}\|_{L_1},\ C\,\frac{\|d_\mathrm{L}\|_{\ell_1}}{\|\omega\|^{\gamma}}\right).$$
This latter bound allows us to control the $L_2$ norm of $(-\Delta)^{\alpha/2}\beta_\mathrm{L}$ by splitting the spectral range of integration as
$$\begin{aligned}
\big\|(-\Delta)^{\alpha/2}\beta_\mathrm{L}\big\|_{L_2}^2 &= \int_{\mathbb{R}^d} \|\omega\|^{2\alpha}\,\big|\hat{\beta}_\mathrm{L}(\omega)\big|^2\,\frac{\mathrm{d}\omega}{(2\pi)^d} \\
&= \int_{\|\omega\|<R} \|\omega\|^{2\alpha}\,\big|\hat{\beta}_\mathrm{L}(\omega)\big|^2\,\frac{\mathrm{d}\omega}{(2\pi)^d} + \int_{\|\omega\|>R} \|\omega\|^{2\alpha}\,\big|\hat{\beta}_\mathrm{L}(\omega)\big|^2\,\frac{\mathrm{d}\omega}{(2\pi)^d} \\
&\le \|\beta_\mathrm{L}\|_{L_1}^2 \underbrace{\int_{\|\omega\|<R} \|\omega\|^{2\alpha}\,\frac{\mathrm{d}\omega}{(2\pi)^d}}_{I_1} + C^2\,\|d_\mathrm{L}\|_{\ell_1}^2 \underbrace{\int_{\|\omega\|>R} \|\omega\|^{2\alpha-2\gamma}\,\frac{\mathrm{d}\omega}{(2\pi)^d}}_{I_2}.
\end{aligned}$$

The first integral $I_1$ is finite due to the boundedness of the domain. As for $I_2$, it is convergent provided that the rate of decay of the argument is faster than $d$, which corresponds to the critical Sobolev exponent $\alpha = \gamma - d/2$.
As the final step of the analysis, we invoke the Sobolev embedding theorems to infer that $\beta_\mathrm{L}$ is Hölder-continuous of order $r$ with $r < \alpha - \frac{d}{2} = (\gamma - d)$, which essentially means that $\beta_\mathrm{L}$ is differentiable up to order $r$ with bounded derivatives. One should keep in mind, however, that the latter estimate is a lower bound on Hölder continuity, unlike the Sobolev exponent in Proposition 6.4, which is sharp. For instance, in the case of the 1-D Fourier multiplier $(j\omega)^\gamma$, we find that the corresponding (fractional) B-spline – if it exists – should have a Sobolev smoothness $(\gamma - \frac{1}{2})$, and a Hölder regularity $r < (\gamma - 1)$. Note that the latter is arbitrarily close (but not equal) to the true estimate $r_0 = (\gamma - 1)$ that is readily deduced from the Green's function (6.15).

6.4.2 B-spline factorization


A powerful aspect of spline theory is that it is often possible to exploit the factorization
properties of differential operators to recursively generate whole families of B-splines.
Specifically, if the operator can be decomposed as L = L1 L2 , where the B-splines asso-
ciated with L1 and L2 are already known, then βL = βL1 ∗ βL2 is the natural choice of
B-spline for L.
PROPOSITION 6.5 Let $\beta_{\mathrm{L}_1}, \beta_{\mathrm{L}_2}$ be admissible B-splines for the operators $\mathrm{L}_1$ and $\mathrm{L}_2$, respectively. Then, $\beta_\mathrm{L}(r) = (\beta_{\mathrm{L}_1} * \beta_{\mathrm{L}_2})(r)$ is an admissible B-spline for $\mathrm{L} = \mathrm{L}_1\mathrm{L}_2$ if and only if there exists a constant $A > 0$ such that
$$\sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_{\mathrm{L}_1}(\omega + 2\pi n)\,\hat{\beta}_{\mathrm{L}_2}(\omega + 2\pi n)\big|^2 \ge A > 0$$

for all $\omega \in [0, 2\pi]^d$. When $\mathrm{L}_1 = \mathrm{L}_2^\gamma$ for $\gamma \ge 0$, then the auxiliary condition is automatically satisfied.

Proof Since $\beta_{\mathrm{L}_1}, \beta_{\mathrm{L}_2} \in L_1(\mathbb{R}^d)$, the same holds true for $\beta_\mathrm{L}$ (by Young's inequality). From the Fourier-domain definition (6.25) of the B-splines, we have
$$\hat{\beta}_{\mathrm{L}_i}(\omega) = \frac{\sum_{k\in\mathbb{Z}^d} d_{\mathrm{L}_i}[k]\,e^{-j\langle k,\omega\rangle}}{\hat{L}_i(\omega)} = \frac{\hat{L}_{d,i}(\omega)}{\hat{L}_i(\omega)},$$
which implies that
$$\beta_\mathrm{L} = \mathcal{F}^{-1}\left\{\frac{\hat{L}_{d,1}(\omega)\,\hat{L}_{d,2}(\omega)}{\hat{L}_1(\omega)\,\hat{L}_2(\omega)}\right\} = \mathcal{F}^{-1}\left\{\frac{\hat{L}_d(\omega)}{\hat{L}(\omega)}\right\} = \mathrm{L_d}\mathrm{L}^{-1}\delta$$
with $\mathrm{L}^{-1} = \mathrm{L}_2^{-1}\mathrm{L}_1^{-1}$ and $\mathrm{L_d} = \mathrm{L}_{d,1}\mathrm{L}_{d,2}$. This translates into the combined localization operator $\mathrm{L_d}s(r) = \sum_{k\in\mathbb{Z}^d} d_\mathrm{L}[k]\,s(r-k)$ with $d_\mathrm{L}[k] = (d_1 * d_2)[k]$, which is factorizable by construction. To establish the existence of the upper Riesz bound for $\beta_\mathrm{L}$, we perform the manipulation

$$\begin{aligned}
A_{\beta_\mathrm{L}}(\omega) &= \sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_\mathrm{L}(\omega+2\pi n)\big|^2 \\
&= \sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_{\mathrm{L}_1}(\omega+2\pi n)\big|^2\,\big|\hat{\beta}_{\mathrm{L}_2}(\omega+2\pi n)\big|^2 \\
&\le \Big(\sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_{\mathrm{L}_1}(\omega+2\pi n)\big|\,\big|\hat{\beta}_{\mathrm{L}_2}(\omega+2\pi n)\big|\Big)^2 \\
&\le \sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_{\mathrm{L}_1}(\omega+2\pi n)\big|^2 \sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_{\mathrm{L}_2}(\omega+2\pi n)\big|^2 \\
&\le B_1^2\,B_2^2 < +\infty,
\end{aligned}$$

where the third line follows from the norm inequality $\|a\|_{\ell_2} \le \|a\|_{\ell_1}$ and the fourth from Cauchy–Schwarz; $B_1$ and $B_2$ are the upper Riesz bounds of $\beta_{\mathrm{L}_1}$ and $\beta_{\mathrm{L}_2}$, respectively. The additional condition in the proposition takes care of the lower Riesz bound.

6.4.3 Polynomial B-splines


The factorization property is directly applicable to the construction of the polynomial
B-splines (we use the equivalent notation β+n = βDn+1 in Section 1.3.2) via the iterated
convolution of a B-spline of degree zero. Specifically,

$$\beta_{\mathrm{D}^{n+1}}(r) = (\beta_\mathrm{D} * \beta_{\mathrm{D}^n})(r) = \underbrace{(\beta_\mathrm{D} * \cdots * \beta_\mathrm{D})}_{n+1 \text{ factors}}(r)$$
with $\beta_\mathrm{D} = \beta_+^0 = \mathbb{1}_{[0,1)}$ and the convention that $\beta_{\mathrm{D}^0} = \beta_{\mathrm{Id}} = \delta$.



6.4.4 Exponential B-splines


More generally, one can consider a generic $N$th-order differential operator of the form $\mathrm{P}_{\boldsymbol\alpha} = \mathrm{P}_{\alpha_1}\cdots\mathrm{P}_{\alpha_N}$ with parameter vector $\boldsymbol\alpha = (\alpha_1,\dots,\alpha_N) \in \mathbb{C}^N$ and $\mathrm{P}_{\alpha_n} = \mathrm{D} - \alpha_n\mathrm{Id}$. The corresponding basis function is an exponential B-spline of order $N$ with parameter vector $\boldsymbol\alpha$, which can be decomposed as
$$\beta_{\boldsymbol\alpha}(r) = (\beta_{\alpha_1} * \beta_{\alpha_2} * \cdots * \beta_{\alpha_N})(r), \tag{6.42}$$
where $\beta_\alpha = \beta_{\mathrm{P}_\alpha}$ is the first-order exponential spline defined by (6.21). The Fourier-domain counterpart of (6.42) is
$$\hat{\beta}_{\boldsymbol\alpha}(\omega) = \prod_{n=1}^{N} \frac{1 - e^{\alpha_n}e^{-j\omega}}{j\omega - \alpha_n}, \tag{6.43}$$
which also yields
$$\beta_{\boldsymbol\alpha}(r) = \Delta_{\boldsymbol\alpha}\rho_{\boldsymbol\alpha}(r), \tag{6.44}$$
where $\Delta_{\boldsymbol\alpha} = \Delta_{\alpha_1}\cdots\Delta_{\alpha_N}$ with $\Delta_\alpha$ defined by (6.20) is the corresponding $N$th-order localization operator (weighted differences) and $\rho_{\boldsymbol\alpha}$ the causal Green's function of $\mathrm{P}_{\boldsymbol\alpha}$.
Note that the complex parameters αn , which are the roots of the characteristic poly-
nomial of Pα , are the poles of the exponential B-spline, as seen in (6.43). The actual
recipe for localization is that each pole is cancelled by a corresponding (2π-periodic)
zero in the numerator.
Based on the above equations, one can infer the following properties of the exponen-
tial B-splines (see [UB05a] for a complete treatment of the topic):
• They are causal, bounded, and compactly supported in [0, N], simply because all
elementary constituents in (6.42) are bounded and supported in [0, 1).
• They are piecewise-exponential with joining points at the integers and a maximal
degree of smoothness (spline property). The first part follows from (6.44) using the
well-known property that the causal Green’s function of an Nth-order ordinary differ-
ential operator is an exponential polynomial restricted to the positive axis. As for the
statement about smoothness, the B-splines are Hölder-continuous of order (N − 1).
In other words, they are differentiable up to order (N − 1) with bounded derivatives.
This follows from the fact that $\mathrm{P}_{\alpha_n}\beta_{\alpha_n}(r) = \delta(r) - e^{\alpha_n}\delta(r-1)$, which implies that every additional elementary convolution factor in (6.42) improves the differentiability of the resulting B-spline by one.
• They are the shortest elementary constituents of exponential splines (maximally localized kernels) and they each generate a valid Riesz basis (by integer shifting) of the spaces of cardinal $\mathrm{P}_{\boldsymbol\alpha}$-splines if and only if $\alpha_n - \alpha_m \neq j2\pi k$, $k \in \mathbb{Z}$, for all distinct, purely imaginary poles.
• They reproduce the exponential polynomials that are in the null space of the operator
Pα , as well as any of its Green’s functions ρα , which all happen to be special types of
Pα -splines (with a minimum number of singularities).
6.4 Generalized B-spline basis 139

• For $\boldsymbol\alpha = (0,\dots,0)$, one recovers Schoenberg's classical polynomial B-splines of degree $(N-1)$ [Sch46, Sch73a], as expressed by the notational equivalence
$$\beta_+^n(r) = \beta_{\mathrm{D}^{n+1}}(r) = \beta_{\underbrace{(0,\dots,0)}_{n+1}}(r).$$

The system-theoretic interpretation is that the classical polynomial spline of degree $n$ has a pole of multiplicity $(n+1)$ at the origin: it corresponds to an (unstable) linear system that is an $(n+1)$-fold integrator.
There is also a corresponding B-spline calculus whose main operations are
• Convolution by concatenation of parameter vectors:
$$(\beta_{\boldsymbol\alpha_1} * \beta_{\boldsymbol\alpha_2})(r) = \beta_{(\boldsymbol\alpha_1 : \boldsymbol\alpha_2)}(r)$$
• Mirroring by sign change:
$$\beta_{\boldsymbol\alpha}(-r) = \Big(\prod_{n=1}^{N} e^{\alpha_n}\Big)\,\beta_{-\boldsymbol\alpha}(r + N)$$
• Complex conjugation:
$$\overline{\beta_{\boldsymbol\alpha}(r)} = \beta_{\overline{\boldsymbol\alpha}}(r)$$
• Modulation by parameter shifting:
$$e^{j\omega_0 r}\,\beta_{\boldsymbol\alpha}(r) = \beta_{\boldsymbol\alpha + \mathbf{j}\omega_0}(r),$$
with the convention that $\mathbf{j} = (j,\dots,j)$.
Finally, we point out that exponential B-splines can be computed explicitly on a case-
by-case basis using the mathematical software described in [Uns05, Appendix A].

6.4.5 Fractional B-splines


The fractional splines are an extension of the polynomial splines for all non-integer degrees $\alpha > -1$. The most notable members of this family are the causal fractional B-splines $\beta_+^\alpha$ whose basic constituents are piecewise-power functions of degree $\alpha$ [UB00]. These functions are associated with the causal fractional-derivative operator $\mathrm{D}^{\alpha+1}$, whose Fourier-based definition is
$$\mathrm{D}^\gamma\varphi(r) = \int_{\mathbb{R}} (j\omega)^\gamma\,\hat{\varphi}(\omega)\,e^{j\omega r}\,\frac{\mathrm{d}\omega}{2\pi}$$
in the sense of generalized functions. The causal Green's function of $\mathrm{D}^\gamma$ is the one-sided power function of degree $(\gamma - 1)$ specified by (6.15). One constructs the corresponding B-splines through a localization process similar to the classical one, replacing finite differences by the fractional differences defined as
$$\Delta_+^\gamma\varphi(r) = \int_{\mathbb{R}} (1 - e^{-j\omega})^\gamma\,\hat{\varphi}(\omega)\,e^{j\omega r}\,\frac{\mathrm{d}\omega}{2\pi}. \tag{6.45}$$

In this respect, it is important to note that $(1 - e^{-j\omega})^\gamma = (j\omega)^\gamma + O(|\omega|^{2\gamma})$, which justifies this particular choice. By applying (6.25), we readily obtain the Fourier-domain representation of the fractional B-splines,
$$\hat{\beta}_+^\alpha(\omega) = \frac{\big(1 - e^{-j\omega}\big)^{\alpha+1}}{(j\omega)^{\alpha+1}}, \tag{6.46}$$
which can then be inverted to provide the explicit time-domain formula
$$\beta_+^\alpha(r) = \frac{\Delta_+^{\alpha+1}\,r_+^{\alpha}}{\Gamma(\alpha+1)} = \sum_{m=0}^{\infty} (-1)^m \binom{\alpha+1}{m}\,\frac{(r-m)_+^{\alpha}}{\Gamma(\alpha+1)}, \tag{6.47}$$
where the generalized fractional binomial coefficients are given by
$$\binom{\alpha+1}{m} = \frac{\Gamma(\alpha+2)}{\Gamma(m+1)\,\Gamma(\alpha+2-m)} = \frac{(\alpha+1)!}{m!\,(\alpha+1-m)!}.$$

What is remarkable about this construction is the way in which the classical B-spline
formulas of Section 1.3.2 carry over to the fractional case almost literally, by merely
replacing n by α. This is especially striking when we compare (6.47) to (1.11), as well
as the expanded versions of these formulas given below, which follow from the (gener-
alized) binomial expansion of (1 − e−jω )α+1 .
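For the reader who wishes to experiment, the series (6.47) can be evaluated directly; the following sketch relies on SciPy's gamma and generalized binomial functions and uses truncation and grid parameters of our own choosing:

import numpy as np
from scipy.special import binom, gamma

def frac_bspline(r, alpha, n_terms=30):
    # evaluate beta_+^alpha via (6.47); only terms with m <= r actually contribute
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    for m in range(n_terms):
        x = np.where(r - m > 0, r - m, 0.0)
        out += (-1) ** m * binom(alpha + 1, m) * x ** alpha / gamma(alpha + 1)
    return out

r = np.linspace(0, 6, 601)
b = frac_bspline(r, alpha=0.5)   # piecewise-power pieces of degree 1/2
# sanity check: for integer alpha, the formula reduces to the classical B-spline
tri = np.where(r < 1, r, np.where(r < 2, 2 - r, 0.0))
print(np.max(np.abs(frac_bspline(r, alpha=1.0) - tri)))  # ~ 0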
Likewise, it is possible to construct the $(\alpha,\tau)$ extension of these B-splines. They are associated with the operators $\mathrm{L} = \partial_\tau^{\alpha+1} \longleftrightarrow (j\omega)^{\frac{\alpha+1}{2}+\tau}(-j\omega)^{\frac{\alpha+1}{2}-\tau}$ with $\tau \in \mathbb{R}$ [BU03]. This family covers the entire class of translation- and scale-invariant operators in 1-D (see Proposition 5.6).
The fractional B-splines share virtually all the properties of the classical B-splines,
including the two-scale relation, and can also be used to define fractional wavelet bases
with an order γ = α + 1 that varies continuously. They only lack positivity and compact
support. Their most notable properties are summarized below.
• Generalization. For α integer, they are equivalent to the classical polynomial splines.
The fractional B-splines interpolate the polynomial ones in very much the same way
as the gamma function interpolates the factorials.
• Stability. All brands of fractional B-splines satisfy the Riesz-basis condition in Theo-
rem 6.2.
• Regularity. The fractional splines are α-Hölder continuous; their critical Sobolev
exponent (degree of differentiability in the L2 sense) is α + 1/2 (see Proposition 6.4).
• Polynomial reproduction. The fractional B-splines reproduce the polynomials of
degree N = α that are in the null space of the operator Dα+1 (see Section 6.2.1).
• Decay. The fractional B-splines decay at least like $|r|^{-\alpha-2}$; the causal ones are compactly supported for $\alpha$ integer.
• Order of approximation. The fractional splines have the non-integer order of approxi-
mation α + 1, a property that is rather unusual in approximation theory.

• Fractional derivatives. Simple formulas are available for obtaining the fractional de-
rivatives of B-splines. In addition, the corresponding fractional spline wavelets be-
have essentially like fractional-derivative operators.

6.4.6 Additional brands of univariate B-splines


To be complete, we briefly mention some additional types of univariate B-splines that
have been investigated systematically in the literature:
• The generalized exponential B-splines of order N that cover the whole class of dif-
ferential operators with rational transfer functions [Uns05]. These are parameterized
by their poles and zeros. Their properties are very similar to those of the exponential
B-splines of Section 6.4.4, which are included as a special case.
• The Matérn splines of (fractional) order γ and parameter α ∈ R+ with L = (D +
αId)γ ←→ (jω + α)γ [RU06]. These constitute the fractionalization of the exponen-
tial B-spline with a single pole of multiplicity N.
In principle, it is possible to construct even broader families via the convolution of exis-
ting components. The difficulty is that it may not always be possible to obtain explicit
signal-domain formulas, especially when some of the constituents are fractional.

6.4.7 Multidimensional B-splines


While the construction of B-splines is well understood and covered systematically in
1-D, the task becomes more challenging in multiple dimensions because of the inherent
difficulty of imposing compact support. Apart from the easy cases where the operator L
can be decomposed in a succession of 1-D operators (tensor-product B-splines and box
splines), the available collection of multidimensional B-splines is much more restricted
than in the univariate case. The construction of B-splines is still considered an art where
the ultimate goal is to produce the most localized basis functions. The primary families
of multidimensional B-splines that have been investigated so far are:
• the polyharmonic B-splines of (fractional) order $\gamma$ with $\mathrm{L} = (-\Delta)^{\gamma/2} \longleftrightarrow \|\omega\|^\gamma$ [MN90b, Rab92a, Rab92b, VDVBU05]
• the box splines of multiplicity $N \ge d$ with $\mathrm{L} = \mathrm{D}_{u_1}\cdots\mathrm{D}_{u_N} \longleftrightarrow \prod_{n=1}^{N} j\langle\omega, u_n\rangle$ with $\|u_n\| = 1$, where $\mathrm{D}_{u_n} = \langle\nabla, u_n\rangle$ is the directional derivative along $u_n$ [dBHR93]. The box splines are compactly supported functions in $L_1(\mathbb{R}^d)$ if and only if the set of orientation vectors $\{u_n\}_{n=1}^{N}$ forms a frame of $\mathbb{R}^d$.

We encourage the reader who finds the present list incomplete to work on expanding it.
The good news for the present study is that the polyharmonic B-splines are particularly
relevant for image-processing applications because they are associated with the class
of operators that are scale- and rotation-invariant. They naturally come into play when
considering isotropic fractal-type random fields.
The principal message of this section is that B-splines – no matter the type – are localized functions with an equivalent width that increases in proportion to the order. In general, the fractional brands and the non-separable multidimensional ones are not compactly supported. The important issue of localization and decay is not yet fully resolved in higher dimensions. Also, since $\mathrm{L_d}s = \beta_\mathrm{L} * \mathrm{L}s$, it is clear that the search for a "good" B-spline $\beta_\mathrm{L}$ is intrinsically related to the problem of finding an accurate numerical approximation $\mathrm{L_d}$ of the differential operator L. Looking at the discretization issue from the B-spline perspective leads to new insights and sometimes to non-conventional solutions. For instance, in the case of the Laplacian $\mathrm{L} = \Delta$, the continuous-domain localization requirement points to the choice of the 2-D discrete operator $\Delta_d$ described by the $3\times 3$ filter mask
$$\text{Isotropic discrete Laplacian:}\quad \frac{1}{6}\begin{pmatrix} -1 & -4 & -1 \\ -4 & 20 & -4 \\ -1 & -4 & -1 \end{pmatrix},$$
which is not the standard version used in numerical analysis. This particular set of
weights produces a much nicer, bell-shaped polyharmonic B-spline than the conven-
tional finite-difference mask, which induces significant directional artifacts, especially
when one starts iterating the operator [VDVBU05].
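As a small illustration (our own, using SciPy's 2-D convolution), the quoted mask annihilates constants (its coefficients sum to zero) and reproduces the negated Laplacian exactly on quadratic polynomials; the sign convention follows the mask as printed:

import numpy as np
from scipy.signal import convolve2d

mask = np.array([[-1, -4, -1],
                 [-4, 20, -4],
                 [-1, -4, -1]]) / 6.0   # the isotropic discrete Laplacian above (~ -Delta)

x, y = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
f = x ** 2 + 3.0 * y ** 2               # continuous Laplacian of f equals 2 + 6 = 8
out = convolve2d(f, mask, mode="valid")
print(out.min(), out.max())             # both equal -8: the mask acts like -Delta here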

6.5 Generalized operator-like wavelets

In direct analogy with the first-order scenario in Section 6.3.3, we shall now take
advantage of the general B-spline formalism to construct a wavelet basis that is
matched to some generic operator L.

6.5.1 Multiresolution analysis of L2 (Rd )


The first step is to lay out a fine-to-coarse sequence of (multidimensional) L-spline
spaces in essentially the same way as in our first-order example. Specifically,
$$V_i = \Big\{ s(r) \in L_2(\mathbb{R}^d) : \mathrm{L}s(r) = \sum_{k\in\mathbb{Z}^d} a_i[k]\,\delta(r - \mathbf{D}^i k) \Big\},$$
where $\mathbf{D}$ is a proper dilation matrix with integer entries (e.g., $\mathbf{D} = 2\mathbf{I}$ in the standard dyadic configuration). These spline spaces satisfy the general embedding relation $V_i \supseteq V_j$ for $i \le j$.
The reference space ($i = 0$) is the space of cardinal L-splines, which admits the standard B-spline representation
$$V_0 = \Big\{ s(r) = \sum_{k\in\mathbb{Z}^d} c[k]\,\beta_\mathrm{L}(r-k) : c \in \ell_2(\mathbb{Z}^d) \Big\},$$
where $\beta_\mathrm{L}$ is given by (6.25). Our implicit assumption is that each $V_i$ admits a similar B-spline representation
$$V_i = \Big\{ s(r) = \sum_{k\in\mathbb{Z}^d} c_i[k]\,\beta_{\mathrm{L},i}(r - \mathbf{D}^i k) : c_i \in \ell_2(\mathbb{Z}^d) \Big\},$$
which involves the multiresolution generators $\beta_{\mathrm{L},i}$ described in Section 6.5.2.

6.5.2 Multiresolution B-splines and the two-scale relation


In direct analogy with (6.25), the multiresolution B-splines $\beta_{\mathrm{L},i}$ are localized versions of the Green's function $\rho_\mathrm{L}$ with respect to the grid $\mathbf{D}^i\mathbb{Z}^d$. Specifically, we have that
$$\beta_{\mathrm{L},i}(r) = \sum_{k\in\mathbb{Z}^d} d_i[k]\,\rho_\mathrm{L}(r - \mathbf{D}^i k) = \mathrm{L}_{d,i}\mathrm{L}^{-1}\delta(r),$$
where $\mathrm{L}_{d,i}$ is the discretized version of L on the grid $\mathbf{D}^i\mathbb{Z}^d$. The Fourier-domain counterpart of this equation is
$$\hat{\beta}_{\mathrm{L},i}(\omega) = \frac{\sum_{k\in\mathbb{Z}^d} d_i[k]\,e^{-j\langle\omega, \mathbf{D}^i k\rangle}}{\hat{L}(\omega)}. \tag{6.48}$$

The implicit requirement for the multiresolution decomposition scheme to work is that $\beta_{\mathrm{L},i}$ generates a Riesz basis. This needs to be asserted on a case-by-case basis.
A particularly favorable situation occurs when the operator L is scale-invariant with $\hat{L}(a\omega) = |a|^\gamma\hat{L}(\omega)$. Let $i' > i$ be two multiresolution levels of the pyramid such that $\mathbf{D}^{i'-i} = m\mathbf{I}$, where $m$ is a proportionality constant. It is then possible to relate the B-spline at resolution $i'$ to the one at the finer level $i$ via the simple dilation relation
$$\beta_{\mathrm{L},i'}(r) \propto \beta_{\mathrm{L},i}(r/m).$$
This is shown by considering the Fourier transform of $\beta_{\mathrm{L},i}(r/m)$, which is written as
$$|m|^d\,\hat{\beta}_{\mathrm{L},i}(m\omega) = |m|^d\,\frac{\sum_{k\in\mathbb{Z}^d} d_i[k]\,e^{-j\langle\omega, m\mathbf{D}^i k\rangle}}{\hat{L}(m\omega)} = |m|^{d-\gamma}\,\frac{\sum_{k\in\mathbb{Z}^d} d_i[k]\,e^{-j\langle\omega, \mathbf{D}^{i'-i}\mathbf{D}^i k\rangle}}{\hat{L}(\omega)} = |m|^{d-\gamma}\,\frac{\sum_{k\in\mathbb{Z}^d} d_i[k]\,e^{-j\langle\omega, \mathbf{D}^{i'} k\rangle}}{\hat{L}(\omega)}$$
and found to be compatible with the form of $\hat{\beta}_{\mathrm{L},i'}(\omega)$ given by (6.48) by taking $d_{i'}[k] \propto d_i[k]$. The prototypical scenario is the dyadic configuration $\mathbf{D} = 2\mathbf{I}$ for which the
B-splines at level i are all constructed through the dilation of the single prototype
βL = βL,0 , subject to the scale-invariance constraint on L. This happens, for instance, for
the classical polynomial splines which are associated with the Fourier multipliers (jω)N .
A crucial ingredient for the fast wavelet-transform algorithm is the two-scale relation that links the B-spline basis functions at two successive levels of resolution. Specifically, we have that
$$\beta_{\mathrm{L},i+1}(r) = \sum_{k\in\mathbb{Z}^d} h_i[k]\,\beta_{\mathrm{L},i}(r - \mathbf{D}^i k),$$

where the sequence $h_i$ specifies the scale-dependent refinement filter. The frequency response of $h_i$ is obtained by taking the ratio of the Fourier transforms of the corresponding B-splines as
$$\hat{h}_i(\omega) = \frac{\hat{\beta}_{\mathrm{L},i+1}(\omega)}{\hat{\beta}_{\mathrm{L},i}(\omega)} \tag{6.49}$$
$$= \frac{\sum_{k\in\mathbb{Z}^d} d_{i+1}[k]\,e^{-j\langle\omega,\mathbf{D}^{i+1}k\rangle}}{\sum_{k\in\mathbb{Z}^d} d_i[k]\,e^{-j\langle\omega,\mathbf{D}^{i}k\rangle}}, \tag{6.50}$$
which is $2\pi(\mathbf{D}^T)^{-i}$-periodic and hence defines a valid digital filter with respect to the spatial grid $\mathbf{D}^i\mathbb{Z}^d$.
To illustrate those relations, we return to our introductory example in Section 6.1: the Haar wavelet transform, which is associated with the Fourier multipliers $j\omega$ (derivative) and $(1 - e^{-j\omega})$ (finite-difference operator). The dilation matrix is $\mathbf{D} = 2$ and the localization filter is the same at all levels because the underlying derivative operator is scale-invariant. By plugging those entities into (6.48), we obtain the Fourier transform of the corresponding B-spline at resolution $i$ as
$$\hat{\beta}_{\mathrm{D},i}(\omega) = 2^{-i/2}\,\frac{1 - e^{-j2^i\omega}}{j\omega},$$
where the normalization by $2^{-i/2}$ is included to standardize the norm of the B-splines. The application of (6.49) then yields
$$\hat{h}_i(\omega) = \frac{1}{\sqrt{2}}\,\frac{1 - e^{-j2^{i+1}\omega}}{1 - e^{-j2^{i}\omega}} = \frac{1}{\sqrt{2}}\,\big(1 + e^{-j2^{i}\omega}\big),$$
which, up to the normalization by $\sqrt{2}$, is the expected refinement filter with coefficients proportional to $(1, 1)$ that are independent of the scale.
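The two-scale relation is transparent in this Haar case and can be confirmed with a one-line check (an illustration of ours, ignoring the scale normalization): the rect of width 2 is the sum of two unit rects,

import numpy as np

h = 1e-3
r = np.arange(0, 4, h)
rect = lambda t, a: ((t >= 0) & (t < a)).astype(float)

lhs = rect(r, 2.0)                        # B-spline at the coarser scale (width 2)
rhs = rect(r, 1.0) + rect(r - 1.0, 1.0)   # refinement with coefficients (1, 1)
print(np.max(np.abs(lhs - rhs)))          # 0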

6.5.3 Construction of an operator-like wavelet basis


To keep the notation simple, we concentrate on the specification of the wavelet basis at the scale $i = 1$, with $W_1 = \mathrm{span}\{\psi_{1,k}\}_{k\in\mathbb{Z}^d\backslash\mathbf{D}\mathbb{Z}^d}$ such that $W_1 \perp V_1$ and $V_0 = V_1 + W_1$, where $V_0 = \mathrm{span}\{\beta_\mathrm{L}(\cdot - k)\}_{k\in\mathbb{Z}^d}$ is the space of cardinal L-splines.
The relevant smoothing kernel is the interpolation function $\varphi_{\mathrm{int}} = \varphi_{\mathrm{int},0}$ for the space of cardinal $\mathrm{L}^H\mathrm{L}$-splines, which is generated by $(\beta_\mathrm{L}^\vee * \beta_\mathrm{L})(r)$ (autocorrelation of the generalized B-spline). This interpolator is best described in the Fourier domain using the formula
$$\varphi_{\mathrm{int}}(r) = \mathcal{F}^{-1}\left\{\frac{|\hat{\beta}_\mathrm{L}(\omega)|^2}{\sum_{n\in\mathbb{Z}^d} |\hat{\beta}_\mathrm{L}(\omega + 2\pi n)|^2}\right\}(r), \tag{6.51}$$

where $\hat{\beta}_\mathrm{L}$ (resp., $\hat{\beta}_\mathrm{L}^\vee$) is the Fourier transform of the generalized B-spline $\beta_\mathrm{L}$ (resp., $\beta_\mathrm{L}^\vee$). It satisfies the fundamental interpolation property
$$\varphi_{\mathrm{int}}(k) = \delta[k] = \begin{cases} 1, & \text{for } k = 0 \\ 0, & \text{for } k \in \mathbb{Z}^d\backslash\{0\}. \end{cases} \tag{6.52}$$

The existence of such a function is guaranteed whenever $\beta_\mathrm{L} = \beta_{\mathrm{L},0}$ is an admissible B-spline. In particular, the Riesz-basis condition (6.26) implies that the denominator of $\hat{\varphi}_{\mathrm{int}}(\omega)$ in (6.51) is non-vanishing.
The sought-after wavelets are then constructed as $\psi_{1,k}(r) = \psi_\mathrm{L}(r-k)/\|\psi_\mathrm{L}\|_{L_2}$, where the operator-like mother wavelet $\psi_\mathrm{L}$ is given by
$$\psi_\mathrm{L}(r) = \mathrm{L}^H\varphi_{\mathrm{int}}(r), \tag{6.53}$$
where $\mathrm{L}^H$ is the adjoint of L with respect to the Hermitian-symmetric $L_2$ inner product. Also, note that we are removing the functions located on the next-coarser resolution grid $\mathbf{D}\mathbb{Z}^d$ associated with $V_1$ (critically sampled configuration).
The proof of the following result is illuminating because it relies heavily on the notion
of duality, which is central to our whole argument.
PROPOSITION 6.6 The operator-like wavelet $\psi_\mathrm{L} = \mathrm{L}^H\varphi_{\mathrm{int}}$ satisfies the property $\langle s_1, \psi_\mathrm{L}(\cdot - k)\rangle_{L_2} = \langle s_1, \psi_\mathrm{L}(\cdot - k)\rangle = 0$, $\forall k \in \mathbb{Z}^d\backslash\mathbf{D}\mathbb{Z}^d$, for any spline $s_1 \in V_1$. Moreover, it can be written as $\psi_\mathrm{L}(r) = \mathrm{L}_d^H\tilde{\beta}_\mathrm{L}(r) = \sum_{k\in\mathbb{Z}^d} d_\mathrm{L}[-k]\,\tilde{\beta}_\mathrm{L}(r-k)$, where $\{\tilde{\beta}_\mathrm{L}(\cdot - k)\}_{k\in\mathbb{Z}^d}$ is the dual basis of $V_0$ such that $\langle\beta_\mathrm{L}(\cdot - k), \tilde{\beta}_\mathrm{L}(\cdot - k')\rangle_{L_2} = \delta[k - k']$. This implies that $W_1 = \mathrm{span}\{\psi_{1,k}\}_{k\in\mathbb{Z}^d\backslash\mathbf{D}\mathbb{Z}^d} \subset V_0$ and $W_1 \perp V_1$.
Proof We pick an arbitrary spline $s_1 \in V_1$ and perform the inner-product manipulation
$$\begin{aligned}
\langle s_1, \psi_\mathrm{L}(\cdot - k_0)\rangle_{L_2} &= \langle s_1, \mathrm{L}^*\{\varphi_{\mathrm{int}}(\cdot - k_0)\}\rangle && \text{(by shift-invariance of L)} \\
&= \langle \mathrm{L}s_1, \varphi_{\mathrm{int}}(\cdot - k_0)\rangle && \text{(by duality)} \\
&= \Big\langle \sum_{k\in\mathbb{Z}^d} a_1[k]\,\delta(\cdot - \mathbf{D}k),\ \varphi_{\mathrm{int}}(\cdot - k_0)\Big\rangle && \text{(by definition of } V_1\text{)} \\
&= \sum_{k\in\mathbb{Z}^d} a_1[k]\,\varphi_{\mathrm{int}}(\mathbf{D}k - k_0). && \text{(by definition of } \delta\text{)}
\end{aligned}$$
Due to the interpolation property of $\varphi_{\mathrm{int}}$, the kernel values in the sum are vanishing if $\mathbf{D}k - k_0 \in \mathbb{Z}^d\backslash\{0\}$ for all $k \in \mathbb{Z}^d$, which proves the first part of the statement.
As for the second claim, we consider the Fourier-domain expression for $\psi_\mathrm{L}$:
$$\hat{\psi}_\mathrm{L}(\omega) = \overline{\hat{L}(\omega)}\,\hat{\varphi}_{\mathrm{int}}(\omega) = \overline{\hat{L}(\omega)}\,\overline{\hat{\beta}_\mathrm{L}(\omega)}\,\hat{\tilde{\beta}}_\mathrm{L}(\omega),$$
where
$$\hat{\tilde{\beta}}_\mathrm{L}(\omega) = \frac{\hat{\beta}_\mathrm{L}(\omega)}{\sum_{n\in\mathbb{Z}^d} |\hat{\beta}_\mathrm{L}(\omega + 2\pi n)|^2}$$
is the Fourier transform of the dual B-spline $\tilde{\beta}_\mathrm{L}$. The above factorization implies that $\varphi_{\mathrm{int}}(r) = (\tilde{\beta}_\mathrm{L} * \beta_\mathrm{L}^\vee)(r)$, which ensures the biorthonormality $\langle\tilde{\beta}_\mathrm{L}(\cdot - k), \beta_\mathrm{L}(\cdot - k')\rangle_{L_2} = \delta[k - k'] = \varphi_{\mathrm{int}}(k - k')$ of the basis functions.

Finally, by replacing $\hat{\beta}_\mathrm{L}(\omega)$ by its explicit expression (6.25), we show that $\hat{L}(\omega)\,\hat{\beta}_\mathrm{L}(\omega) = \hat{L}_d(\omega)$, where $\hat{L}_d(\omega) = \sum_{k\in\mathbb{Z}^d} d_\mathrm{L}[k]\,e^{-j\langle k,\omega\rangle}$ is the frequency response of the (discrete) operator $\mathrm{L_d}$. This implies that $\hat{\psi}_\mathrm{L} = \overline{\hat{L}_d}\,\hat{\tilde{\beta}}_\mathrm{L}$, which is the Fourier equivalent of $\psi_\mathrm{L} = \mathrm{L}_d^H\tilde{\beta}_\mathrm{L}$.

This interpolation-based method of construction is applicable to all the wavelet subspaces $W_i$ and leads to the specification of operator-like Riesz bases of $L_2(\mathbb{R}^d)$ under relatively mild assumptions on L [KUW13]. Specifically, we have that $W_i = \mathrm{span}\{\psi_{i,k}\}_{k\in\mathbb{Z}^d\backslash\mathbf{D}\mathbb{Z}^d}$ with
$$\psi_{i,k}(r) \propto \mathrm{L}^H\varphi_{\mathrm{int},i-1}(r - \mathbf{D}^{i-1}k), \tag{6.54}$$
where $\varphi_{\mathrm{int},i-1}$ is the $\mathrm{L}^H\mathrm{L}$-spline interpolator on the grid $\mathbf{D}^{i-1}\mathbb{Z}^d$. The fact that the interpolator is specified with respect to the grid of the next-finer spline space $V_{i-1} = \mathrm{span}\{\beta_{\mathrm{L},i-1}(\cdot - \mathbf{D}^{i-1}k)\}_{k\in\mathbb{Z}^d}$ is essential to ensure that $W_i \subset V_{i-1}$. This kernel satisfies the fundamental interpolation property
$$\varphi_{\mathrm{int},i-1}(\mathbf{D}^{i-1}k) = \delta[k], \tag{6.55}$$
which results in $W_i$ being orthogonal to $V_i = \mathrm{span}\{\beta_{\mathrm{L},i}(\cdot - \mathbf{D}^i k)\}_{k\in\mathbb{Z}^d}$ (the reasoning is the same as in the proof of Proposition 6.6, which covers the case $i = 1$). For completeness, we also provide the general expression for the Fourier transform of $\varphi_{\mathrm{int},i}$,
$$\begin{aligned}
\hat{\varphi}_{\mathrm{int},i}(\omega) &= |\det(\mathbf{D})|^i\,\frac{|\hat{\beta}_{\mathrm{L},i}(\omega)|^2}{\sum_{n\in\mathbb{Z}^d} \big|\hat{\beta}_{\mathrm{L},i}\big(\omega + 2\pi(\mathbf{D}^T)^{-i}n\big)\big|^2} \\
&= |\det(\mathbf{D})|^i\,\frac{\big|\hat{L}_{d,i}(\omega)\big|^2 / \big|\hat{L}(\omega)\big|^2}{\big|\hat{L}_{d,i}(\omega)\big|^2\,\sum_{n\in\mathbb{Z}^d} \big|\hat{L}\big(\omega + 2\pi(\mathbf{D}^T)^{-i}n\big)\big|^{-2}} \\
&= |\det(\mathbf{D})|^i\,\frac{|\hat{L}(\omega)|^{-2}}{\sum_{n\in\mathbb{Z}^d} \big|\hat{L}\big(\omega + 2\pi(\mathbf{D}^T)^{-i}n\big)\big|^{-2}},
\end{aligned} \tag{6.56}$$
which can be used to show that $\mathrm{L}^H\varphi_{\mathrm{int},i}(\cdot - \mathbf{D}^i k) \propto \mathrm{L}_{d,i}^H\tilde{\beta}_{\mathrm{L},i}(\cdot - \mathbf{D}^i k) \in V_i$ for any $k \in \mathbb{Z}^d$.

While we have seen that this scheme produces an orthonormal basis for the first-order operator $\mathrm{P}_\alpha$ in Section 6.3.3, the general procedure only guarantees semi-orthogonality. More precisely, it ensures the orthogonality between the wavelet subspaces $W_i$. If necessary, one can always fix the intra-scale orthogonality a posteriori by
forming appropriate linear combinations of wavelets at a given resolution. The resulting
orthogonal wavelets will still be L-admissible in the sense of Definition 6.7. However,
for d > 1, intra-scale orthogonalization is likely to spoil the simple, convenient struc-
ture of the above construction, which uses a single generator per scale irrespective of
the number of dimensions. Indeed, the examples of multidimensional orthogonal wave-
let transforms that can be found in the literature – either separable or non-separable –
systematically involve M = (det(D) − 1) distinct wavelet generators per scale. More-
over, unlike the present operator-like wavelets, they generally do not admit an explicit
analytical description.

In summary, wavelets generally behave like differential operators and it is possible


to match them to a given class of stochastic processes. The wavelet transforms that are
currently most widely used in applications act as multiscale derivatives or Laplacians.
They are therefore best suited for the representation of fractal-type stochastic processes
that are defined by scale-invariant SDEs [TVDVU09].
The general theme that emerges is that a signal transform will behave appropriately
if it has the ability to suppress the signal components (polynomial or sinusoidal trends)
that are in the null space of the whitening operator L. This will result in a stationariz-
ing effect that is well documented in the Gaussian context [Fla89, Fla92]. This is the
fundamental reason why vanishing moments are so important.

6.6 Bibliographical notes

Section 6.1
Alfréd Haar constructed the orthogonal Haar system as part of his Ph.D. thesis, which
he defended in 1909 under the supervision of David Hilbert [Haa10]. From then on,
the Haar system remained relatively unnoticed until it was revitalized by the discov-
ery of wavelets nearly one century later. Stéphane Mallat set the foundation of the
multiresolution theory of wavelets in [Mal89] with the help of Yves Meyer, while
Ingrid Daubechies constructed the first orthogonal family of compactly supported wave-
lets [Dau88]. Many of the early constructions of wavelets are based on splines [Mal89,
CW91, UAE92, UAE93]. The connection with splines is actually quite fundamental in
the sense that all multiresolution wavelet bases, including the non-spline brands such
as Daubechies’, necessarily include a B-spline as a convolution factor – the latter is
responsible for their primary mathematical properties such as vanishing moments, dif-
ferentiability, and order of approximation [UB03]. Further information on wavelets can
be found in several textbooks [Dau92, Mey90, Mal09].

Section 6.2
Splines constitute a beautiful topic of investigation in their own right, with hundreds
of papers specifically devoted to them. The founding father of the field is Schoenberg,
who, during wartime, was asked to develop a computational solution for constructing
an analytic function that fits a given set of equidistant noisy data points [Sch88]. He
came up with the concept of spline interpolation and proved that polynomial spline
functions have a unique expansion in terms of B-splines [Sch46]. While splines can also
be specified for non-uniform grids and extended in a variety of ways [dB78,Sch81a], the
cardinal setting is especially pleasing because it lends itself to systematic treatment with
the aid of the Fourier transform [Sch73a]. The relation between splines and differential
operators was recognized early on and led to the generalization known as L-splines
[SV67].
The classical reference on partial differential operators and Fourier multipliers is
[Hör80]. A central result of the theory is the Malgrange–Ehrenpreis theorem [Mal56,

Ehr54], and its extension stating that the convolution with a compactly supported gener-
alized function is invertible [Hör05].
The concept of a Riesz basis is standard in functional analysis and approximation
theory [Chr03]. The special case where the basis functions are integer translates of a
single generator is treated in [AU94]. See also [Uns00] for a review of such representa-
tions in the context of sampling theory.

Section 6.3
The first-order illustrative example is borrowed from [UB05a, Figure 1] for the
construction of the exponential B-spline, and from [KU06, Figure 1] for the wave-
let part of the story.

Section 6.4
The 1-D theory of cardinal L-splines for ordinary differential operators with constant
coefficients is due to Micchelli [Mic76]. In the present context, we are especially con-
cerned with ordinary differential equations, which go hand-in-hand with the extended
family of cardinal exponential splines [Uns05]. The properties of the relevant B-splines
are investigated in full detail in [UB05a], which constitutes the ground material for
Section 6.4. A key property of B-splines is their ability to reproduce polynomials. It is
ensured by the Strang–Fix conditions (6.37) which play a central role in approximation
theory [dB87, SF71]. While there is no fundamental difficulty in specifying cardinal-
spline interpolators in multiple dimensions, it is much harder to construct compactly
supported B-splines, except for the special cases of the box splines [dBH82, dBHR93]
and exponential box splines [Ron88]. For elliptic operators such as the Laplacian, it is
possible to specify exponentially decaying B-splines, with the caveat that the construc-
tion is not unique [MN90b,Rab92a,Rab92b]. This calls for some criterion to identify the
most localized solution [VDVBU05]. B-splines, albeit non-compactly supported ones,
can also be specified for fractional operators [UB07]. This line of research was initiated
by Unser and Blu with the construction of the fractional B-splines [UB00]. As sugges-
ted by the name, the (Gaussian) stochastic counterparts of these B-splines are Mandel-
brot’s fractional Brownian motions [MVN68], as we shall see in Chapters 7 and 8. The
association is essentially the same as the connection between the B-spline of degree
zero (rect) and Brownian motion, or, by extension, the whole family of Lévy processes
(see Section 1.3).

Section 6.5
de Boor et al. were among the first to extend the notion of multiresolution analysis
beyond the idea of dilation and to propose a general framework for constructing “non-
stationary” wavelets [dBDR93]. Khalidov and Unser proposed a systematic method for
constructing wavelet-like basis functions based on exponential splines and proved that
these wavelets behave like differential operators [KU06]. The material in Section 6.5
is an extension of those ideas to the case of a generic Fourier-multiplier operator in
multiple dimensions; the full technical details can be found in [KUW13]. Operator-like
wavelets have also been specified within the framework of conventional multiresolution

analysis; in particular, for the Laplacian and its iterates [VDVBU05, TVDVU09] and
for the various brands of 1-D fractional derivatives [VDVFHUB10], which have the
common property of being scale-invariant. Finally, we mention that each exponential-
spline wavelet has a compactly supported Daubechies counterpart that is orthogonal
and operator-like in the sense of having the same vanishing exponential moments
[VBU07].
7 Sparse stochastic processes

Having dealt with the technicalities of defining acceptable inverse operators, we can
now apply the framework to characterize – and also generate – relevant families of
sparse processes. As in the previous chapters, we start with a simple example to expose
the key ideas. Then, in Section 7.2, we develop the generalized version of the innovation
model that covers the complete spectrum of Gaussian and sparse stochastic processes.
We characterize the solution(s) in full generality, while pinpointing the conditions under
which the so-defined processes are stationary or self-similar. In Section 7.3, we provide
a complete description of the stationary processes, including the CARMA (continuous-
time autoregressive moving average) family which constitutes the non-Gaussian exten-
sion of the classical ARMA processes. In Section 7.4, we turn our attention to non-
stationary signals and characterize the important class of Lévy-type processes that are
defined by unstable linear SDEs. Finally, in Section 7.5, we investigate fractal-type pro-
cesses (not necessarily Gaussian) that are solutions of fractional, scale-invariant SDEs.

7.1 Introductory example: non-Gaussian AR(1) processes

A Lévy-driven AR(1) process with parameter α ∈ C is defined by a first-order SDE


with L = Pα = (D − αId), as given by

Pα sα = w,

where $w$ is a white Lévy noise excitation. We have already seen that the solution for $\mathrm{Re}(\alpha) < 0$ is given by $s_\alpha = \mathrm{P}_\alpha^{-1}w = \rho_\alpha * w$, where $\rho_\alpha$ is the impulse response of the underlying system. We have also shown in Section 5.3.1 that the concept remains applicable for $\mathrm{Re}(\alpha) > 0$ using the extended definition (5.10) of $\mathrm{P}_\alpha^{-1}$. Since $\mathrm{P}_\alpha^{-1}$ is an $\mathcal{S}'$-continuous convolution operator, this results in a well-defined stationary process, the Gaussian version of which is often referred to as the Ornstein–Uhlenbeck process.
To make the connection with splines, we observe that the first-order impulse response
can be written as a sum of exponential B-splines,

$$\rho_\alpha(r) = \sum_{k\ge 0} e^{\alpha k}\,\beta_\alpha(r-k), \tag{7.1}$$

as illustrated in Figure 7.1a. The B-spline generator βα (r) is defined by (6.21) and is
supported in the interval [0, 1). A key observation is that the B-spline coefficients eαk

Figure 7.1 Spline-based representation of the impulse response and autocorrelation function of a
differential system with a pole at α = −1. (a) The impulse response (dashed line) is decomposed
as a linear combination of the integer shifts of an exponential B-spline (solid line). (b) The
autocorrelation is synthesized by interpolating its sample values at the integers (discrete
autocorrelation); the corresponding (second-order) spline interpolation kernels are represented
using solid lines.

in (7.1) correspond to the impulse response of the digital filter $\Delta_\alpha^{-1}$ described by the transfer function $\frac{1}{1 - e^{\alpha}z^{-1}}$, which is the natural discrete counterpart of $\mathrm{P}_\alpha^{-1}$. The operator version of (7.1) therefore reads $\rho_\alpha = \mathrm{P}_\alpha^{-1}\delta = \Delta_\alpha^{-1}\beta_\alpha$, which makes an interesting connection between the analog and discrete versions of a first-order operator.
We have shown in prior work that this type of relation is fundamental to the theory of
linear systems and that it carries over for higher-order systems [Uns05].
Since the driving term is white, the correlation structure of the process (second-order
statistics) is fully characterized by the (Hermitian-symmetric) autocorrelation of the
impulse response. In the case where α is real-valued and negative, we get
 
$$R_{\rho_\alpha}(r) = \langle\rho_\alpha(\cdot + r), \rho_\alpha\rangle = (\rho_\alpha^\vee * \rho_\alpha)(r) \propto e^{\alpha|r|} = \sum_{k\in\mathbb{Z}} e^{\alpha|k|}\,\varphi_{\mathrm{int},\alpha}(r - k), \tag{7.2}$$

which can also be expanded in terms of (augmented-order) B-splines. For simplicity,


we have left out the proportionality factor so that the right-hand side can be identified

as the normalized autocorrelation function of the AR(1) process
$$c_{s_\alpha}(r) = \frac{\mathbb{E}\{s_\alpha(\cdot + r)\,s_\alpha(\cdot)\}}{\mathbb{E}\{|s_\alpha|^2\}} = e^{\alpha|r|}.$$
The rightmost sum in (7.2) expresses the fact that the continuous-domain correlation function can be reconstructed from the discrete-domain one, $c_{s_\alpha}[k] = c_{s_\alpha}(r)\big|_{r=k} = e^{\alpha|k|}$ (the sampled version of the former), by using a Shannon-type interpolation formula, which involves the "augmented-order" spline kernel $\varphi_{\mathrm{int},\alpha}$ introduced in Section 6.3.2.
Let σw2 < +∞ denote the variance of a zero-mean white input noise w observed
through a normalized observation window ϕ/ϕ. Then, it is well known (Wiener–
Khintchine theorem) that the power spectrum of sα is given by

sα (ω) = σw2 F {ρ ∨


α ∗ ρα }(ω) = σw |
2
ρα (ω)|2 ,

α (ω) = jω−α
with ρ 1
. We note, however, that the power spectrum provides a complete
characterization of the process only when the input noise is Gaussian.
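
The preceding relations are easy to probe numerically. The following sketch is ours rather than part of the text: it simulates a Lévy-driven AR(1) process with Gaussian innovation by a first-order Euler recursion on a fine grid (assuming Python with NumPy and SciPy) and compares the empirical autocorrelation at integer lags with e^{α|k|}; the pole α = −1 and the step dt are illustrative choices.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
alpha, dt = -1.0, 1e-3
n = 2_000_000
w = rng.standard_normal(n) * np.sqrt(dt)           # Gaussian innovation, variance dt per bin
# Euler step s[k] = (1 + dt*alpha) s[k-1] + w[k], realized as a recursive filter
s = lfilter([1.0], [1.0, -(1.0 + dt * alpha)], w)

step = int(1 / dt)                                 # fine-grid samples per unit lag
var = np.mean(s * s)
for k in range(5):                                 # normalized autocorrelation at integer lags
    c = np.mean(s[: n - k * step] * s[k * step:]) / var
    print(k, c, np.exp(alpha * k))                 # the two values should roughly agree

Replacing the Gaussian draws by heavy-tailed ones (see the samplers given further below) leaves the recursion untouched, which is precisely the point of the innovation model: the operator part is common to the whole family.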

7.2 General abstract characterization

The generic solution of the general innovation model (4.23) is given by s = L^{−1}w, where w is a particular brand of white noise with Lévy exponent f (see Definition 4.1) and L^{−1} a proper right inverse of L. The theoretical results of Section 4.5.2 guarantee the existence of this solution as a generalized stochastic process over S′(R^d) provided that L^{−1∗} (the adjoint of L^{−1}) and f satisfy some joint regularity conditions. The three configurations of interest that balance the range of acceptable innovations are listed below for further reference.

DEFINITION 7.1 (Conditions for existence)
• Condition S: L^{−1∗} is a continuous operator S(R^d) → S(R^d) and f is Lévy–Schwartz admissible (see Theorem 4.8)
• Condition R: L^{−1∗} is a continuous operator S(R^d) → R(R^d) and f is Lévy–Schwartz admissible
• Condition L_p: L^{−1∗} is a continuous operator S(R^d) → L_p(R^d) and f is p-admissible (see Definition 4.4) for some p ∈ [1, 2].
When any one of these conditions is satisfied, the pair (L^{−1∗}, f) is said to be admissible.

Note that the conditions of Definition 7.1 are given in order of increasing level of complexity in the inversion of the whitening operator L when the problem is ill-posed over S(R^d).
Now, if (L^{−1∗}, f) is admissible, then the underlying generalized stochastic processes, or random fields when d > 1, are well defined and completely specified by their characteristic functional (see Theorem 4.17)

P̂_s(ϕ) = E{e^{j⟨ϕ,s⟩}} = exp( ∫_{R^d} f(L^{−1∗}ϕ(r)) dr ).   (7.3)

In order to represent second-order dependencies, Gelfand and Vilenkin define the correlation functional of a generalized (complex-valued) process s as

B_s(ϕ_1, ϕ_2) = E{⟨ϕ_1, s⟩ ⟨ϕ_2, s⟩*},   (7.4)

where ⟨ϕ_2, s⟩* stands for the complex conjugate of ⟨ϕ_2, s⟩.
These two description modes are easy to relate when the process s is real-valued (see Section 3.5.2). To that end, we observe that B_s(ϕ_1, ϕ_2) = E{X_1 X_2}, where X_1 = ⟨ϕ_1, s⟩ and X_2 = ⟨ϕ_2, s⟩* = ⟨ϕ_2, s⟩ are (conventional) real-valued random variables. We then invoke the moment-generating property of the joint characteristic function of X_1 and X_2, E{e^{j(ω_1 X_1 + ω_2 X_2)}} = P̂_s(ω_1 ϕ_1 + ω_2 ϕ_2), which yields

B_s(ϕ_1, ϕ_2) = − ∂²P̂_s(ω_1 ϕ_1 + ω_2 ϕ_2)/∂ω_1 ∂ω_2 |_{ω_1=0, ω_2=0}.   (7.5)

We have already used this mechanism in Section 4.4.3, Proposition 4.15 to determine the covariance of a white Lévy noise under the standard zero-mean and finite-variance assumptions. Specifically, we showed that the correlation/covariance form of the innovation process w is given by

B_w(ϕ_1, ϕ_2) = σ_w² ⟨ϕ_1, ϕ_2⟩,   (7.6)

with σ_w² = −f″(0). Here, we rely on duality (i.e., ⟨ϕ, L^{−1}w⟩ = ⟨L^{−1∗}ϕ, w⟩) to transfer this result to the output of the general innovation model (4.23) as

B_s(ϕ_1, ϕ_2) = B_{L^{−1}w}(ϕ_1, ϕ_2)
  = B_w(L^{−1∗}ϕ_1, L^{−1∗}ϕ_2)
  = σ_w² ⟨L^{−1}L^{−1∗}ϕ_1, ϕ_2⟩,   (7.7)

which is consistent with (7.5) under the implicit assumptions that σ_w² = −f″(0) and f′(0) = 0. Finally, we recover the autocorrelation function of s by making the substitution ϕ_1 = δ(· − r_1) and ϕ_2 = δ(· − r_2) in (7.7), which leads to

R_s(r_1, r_2) = E{s(r_1) s(r_2)}
  = B_s(δ(· − r_1), δ(· − r_2))
  = σ_w² ⟨L^{−1}L^{−1∗}δ(· − r_1), δ(· − r_2)⟩.   (7.8)

This is justified by the kernel theorem (see Section 3.3.4), which allows one to express the correlation functional as

B_s(ϕ_1, ϕ_2) = ∫_{R^d} ∫_{R^d} ϕ_1(r_1) R_s(r_1, r_2) ϕ_2(r_2) dr_1 dr_2,

where R_s(r_1, r_2) ∈ S′(R^d × R^d) is the (generalized) correlation function of the generalized stochastic process s, by definition.

The bottom line is that the correlation structure of the process is entirely determined by the impulse response of the Hermitian-symmetric operator L^{−1}L^{−1∗}, which may or may not be shift-invariant.
For formalization purposes, it is useful to categorize stochastic processes based on
whether or not they are invariant to the elementary coordinate transformations. Inva-
riance here is not meant literally, but probabilistically, in the sense that the application
of a given spatial transformation (translation, rotation, or scaling) leaves the probability
laws of the process unchanged.
However, since the objects of interest are generalized functions, we need to properly define the underlying notions. The translation by r_0 ∈ R^d of a generalized function φ ∈ S′(R^d) is denoted by φ(· − r_0), while its scaling by a is written as φ(·/a). The definition of these operations (see Section 3.3.2) is

⟨ϕ, φ(· − r_0)⟩ = ⟨ϕ(· + r_0), φ⟩   (translation by r_0 ∈ R^d)
⟨ϕ, φ(T^{−1}·)⟩ = |det(T)| ⟨ϕ(T·), φ⟩   (affine transformation)

for any pair (ϕ, φ) ∈ S(R^d) × S′(R^d), where it is implicitly assumed that the (d × d) coordinate-transformation matrix T is invertible. The scaling by a > 0 is obtained by selecting T = aI, whose determinant is a^d.

DEFINITION 7.2 (Stationary process) A generalized stochastic process s is stationary if it has the same probability laws as any of its translated versions s(· − r_0) or, equivalently, if the characteristic functional P̂_s(ϕ) = E{e^{j⟨ϕ,s⟩}} satisfies

P̂_s(ϕ) = P̂_s(ϕ(· + r_0))   (7.9)

for any ϕ ∈ S(R^d) and any r_0 ∈ R^d.

DEFINITION 7.3 (Isotropic process) A generalized stochastic process s is isotropic if it has the same probability laws as any of its rotated versions s(R^T·) or, equivalently, if its characteristic functional P̂_s(ϕ) = E{e^{j⟨ϕ,s⟩}} satisfies

P̂_s(ϕ) = P̂_s(ϕ(R·))

for any ϕ ∈ S(R^d) and any (d × d) rotation matrix R.

DEFINITION 7.4 (Self-similar process) A generalized stochastic process s is self-similar of scaling order H if it has the same probability laws as any of its scaled and renormalized versions a^H s(·/a) or, equivalently, if P̂_s(ϕ) = E{e^{j⟨ϕ,s⟩}} satisfies

P̂_s(ϕ) = P̂_s(a^{H+d} ϕ(a·))

for any ϕ ∈ S(R^d) and any dilation factor a > 0.

The scaling order H is also called the Hurst exponent of the process. Here, the relevant adjoint relation is ⟨ϕ, a^H s(·/a)⟩ = ⟨a^H |a|^d ϕ(a·), s⟩, which follows from the definition of the affine transformation and the linearity of the duality product.

One can readily check that all Lévy noises are stationary and isotropic. Self-similarity,
by contrast, is a more restrictive property that is only shared by the stable members of
the family for which the exponent f is a homogeneous function of degree α.
Similarly, one can also define weaker forms of invariance by considering the effect of
a transformation on the first- and second-order moments only. This leads to the notions
of wide-sense stationarity, isotropy, and self-similarity under the implicit assumption
that the variances are finite (second-order process).

DEFINITION 7.5 (Wide-sense stationarity) A generalized stochastic process s is wide-sense stationary (WSS) if

E{⟨ϕ, s⟩} = E{⟨ϕ(· + r_0), s⟩}
B_s(ϕ_1, ϕ_2) = B_s(ϕ_1(· + r_0), ϕ_2(· + r_0))

for any ϕ, ϕ_1, ϕ_2 ∈ S(R^d) and any r_0 ∈ R^d, or, equivalently, if its (generalized) mean E{s(r)} is constant and its (generalized) autocorrelation is a function of the relative displacement only; that is,

E{s(r_1)s(r_2)} = R_s(r_1 − r_2).

If, in addition, R_s(r_1 − r_2) = R_s(|r_1 − r_2|), then the process is WSS isotropic.

For the innovation model s = L^{−1}w, we have E{s(r)} = 0 whenever f′(0) = 0, which is a property that is shared by all symmetric Lévy exponents (and all concrete examples considered in this book). This zero-mean assumption facilitates the treatment of second-order processes (with minimal loss in generality).

DEFINITION 7.6 (Wide-sense self-similarity) A generalized stochastic process s (with zero mean) is wide-sense self-similar with scaling order H if it has the same second-order moments as its scaled and renormalized version a^H s(·/a) with a > 0. The condition is met if the correlation functional satisfies

B_s(ϕ_1, ϕ_2) = a^{2H+2d} B_s(ϕ_1(a·), ϕ_2(a·))   (7.10)

for any ϕ_1, ϕ_2 ∈ S(R^d) or, equivalently, if the (generalized) autocorrelation function is such that

a^{2H} R_s(r_1/a, r_2/a) = R_s(r_1, r_2).

Note that wide-sense self-similarity with H ≠ 0 implies that R_s(0, 0) = E{s²(0)} is either zero or infinite, so the property is incompatible with wide-sense stationarity (unless s = 0).
In the case of our generalized innovation model, the invariance properties of the sto-
chastic process are primarily determined by the choice of the operator L. The precise
statement is given in the theorem below, which also makes the link with the more
conventional specification of a linear process in terms of a stochastic integral against
some kernel h.

THEOREM 7.1 Let s = L^{−1}w be a generalized stochastic process whose characteristic functional is given by (7.3), where (L^{−1∗}, f) is an admissible pair in reference to Definition 7.1. We also define the kernel

h(·, r′) = L^{−1}{δ(· − r′)} ∈ S′(R^d × R^d).   (7.11)

Then, depending on the characteristics of L^{−1} (or, equivalently, L^{−1∗}), the process s enjoys the following properties.
(1) If L^{−1} is linear shift-invariant, then s is stationary and

h(r, r′) = h(r − r′, 0) = ρ_L(r − r′),

where ρ_L = L^{−1}{δ} is the Green's function of L.
(2) If L^{−1} is shift- and rotation-invariant, then s is stationary isotropic and h(r, r′) = ρ_L(|r − r′|), where ρ_L(|r|) = L^{−1}{δ}(r) is a purely radial function.
(3) If L^{−1∗} is scale-invariant of order (−γ) and σ_w² = −f″(0) < ∞, then s is wide-sense self-similar with Hurst exponent H = γ − d/2.
(4) If L^{−1∗} is scale-invariant of order (−γ) and f is homogeneous¹ of degree 0 < α ≤ 2, then s is self-similar with Hurst exponent H = γ − d + d/α.
Moreover, if h is an ordinary function of R^d × R^d with h(r, ·) ∈ R(R^d) (resp., h(r, ·) ∈ L_p(R^d)) for r ∈ R^d, then s admits the pointwise representation

s(r) = ⟨h(r, ·), w⟩.   (7.12)

¹ The class of such admissible Lévy exponents are the α-stable ones; the symmetric members of the family are f_α(ω; s_0) = −|s_0 ω|^α.

Proof The existence of the generalized stochastic process s is ensured by Theorem 4.17. Moreover, the kernel theorem (Theorem 3.1) allows one to express the measurement X = ⟨ϕ, L^{−1}w⟩ = ⟨L^{−1∗}ϕ, w⟩ as

X = ⟨ϕ, s⟩ = ⟨ ∫_{R^d} h(r, ·)ϕ(r) dr, w ⟩,

where h(·, ·) ∈ S′(R^d × R^d) is the kernel defined by (7.11).
Statement (1): If L^{−1} is LSI, then the same holds true for L^{−1∗}, whose impulse response is L^{−1∗}{δ}(r) = h(0, r) = ρ_L(−r). By duality, this is equivalent to L^{−1}{δ}(r) = h(r, 0) = ρ_L(r) with the property that LL^{−1}{δ} = Lρ_L = δ, so that ρ_L is the Green's function of L. From the definition of translation-invariance, we have that L^{−1∗}{ϕ(· + r_0)}(r) = L^{−1∗}{ϕ}(r + r_0), which is then used to show that ∫_{R^d} f(L^{−1∗}{ϕ(· + r_0)}(r)) dr = ∫_{R^d} f(L^{−1∗}{ϕ}(x)) dx with the change of variable x = r + r_0. Hence the condition in Definition 7.2 is fulfilled.
Statement (2): This is a special case of Statement (1) where L^{−1} is also rotation-invariant, which translates into ρ_L(r) = ρ_L(|r|). The condition in Definition 7.3 is established by a simple change of variable in the exponent of the characteristic functional.
Statement (3): Our strategy there is to verify (7.10) starting from (7.7). The fact that L^{−1} is scale-invariant of order (−γ) implies that both L^{−1} and L^{−1∗} enjoy the same property, as expressed by (5.1). Therefore,

B_s(ϕ_1(a·), ϕ_2(a·)) = σ_w² ⟨L^{−1∗}{ϕ_1(a·)}, L^{−1∗}{ϕ_2(a·)}⟩
  = σ_w² ⟨a^{−γ} L^{−1∗}{ϕ_1}(a·), a^{−γ} L^{−1∗}{ϕ_2}(a·)⟩   (by scale-invariance)
  = σ_w² a^{−2γ} ⟨L^{−1∗}{ϕ_1}(a·), L^{−1∗}{ϕ_2}(a·)⟩   (by bilinearity)
  = a^{−2γ} a^{−d} σ_w² ⟨L^{−1∗}ϕ_1, L^{−1∗}ϕ_2⟩   (change of variable)
  = a^{−2γ−d} B_s(ϕ_1, ϕ_2),

so that the condition in Definition 7.6 is satisfied if and only if 2H + 2d = 2γ + d, which we rewrite as H = γ − d/2.
Statement (4): The result is obtained by evaluating the required characteristic exponent, where P̂_s(ϕ) is specified by (7.3) with f such that f(aω) = a^α f(ω) (α-homogeneous). Specifically,

log P̂_s(a^{H+d} ϕ(a·)) = ∫_{R^d} f(a^{H+d} L^{−1∗}{ϕ(a·)}(r)) dr   (by linearity of L^{−1∗})
  = ∫_{R^d} f(a^{H+d} a^{−γ} L^{−1∗}{ϕ}(ar)) dr   (by scale-invariance)
  = a^{−d} ∫_{R^d} f(a^{H+d−γ} L^{−1∗}{ϕ}(x)) dx   (change of variable)
  = a^{αH+αd−αγ−d} ∫_{R^d} f(L^{−1∗}ϕ(r)) dr.   (α-homogeneity of f)

The latter expression is equal to log P̂_s(ϕ) if and only if αH + αd − αγ − d = 0, which yields H = γ − d + d/α.
As for the final result, we consider the observation X_0 = ⟨ϕ_0, w⟩ where ϕ_0 = h(r_0, ·) with ϕ_0 ∈ R(R^d) (resp., ϕ_0 ∈ L_p(R^d)). Since P̂_w admits a continuous extension over R(R^d) (see Proposition 8.1 in Chapter 8), we can specify the characteristic function of X_0 as p̂_{X_0}(ω) = E{e^{jωX_0}} = P̂_w(ωϕ_0). Then, p̂_{X_0} is a continuous, positive-definite function of ω, so that we can invoke Bochner's theorem (Theorem 3.7), which ensures that X_0 = ⟨L^{−1∗}{δ(· − r_0)}, w⟩ = ⟨δ(· − r_0), L^{−1}w⟩ = s(r_0) is a well-defined (conventional) random variable. The result also extends to ϕ_0 ∈ L_p(R^d) whenever f is p-admissible (see Theorem 8.2, which will be proven later on).

Since the Lévy innovations w are all intrinsically stationary, there is no distinction in this model between stationarity and WSS, except for the fact that the latter requires the variance σ_w² to be finite. This is not so for self-similarity, which is a more demanding property. In that respect, we note that there is no contradiction between Statements (3) and (4) in Theorem 7.1 because the second-order moments of α-stable processes (for which f is homogeneous of order α) are undefined for α < 2 (due to the unbounded variance of the noise). The intersection occurs for α = 2 (Gaussian scenario), while larger homogeneity indices (α > 2) are excluded by the Lévy admissibility condition.
The last result in Theorem 7.1 is fundamental, for it tells us when s(r) can be interpreted as a conventional stochastic process; that is, as a random function of the index variable r. With a slight abuse of notation, we may rewrite (7.12) as

s(r) = ∫_{R^d} h(r, r′) w(r′) dr′ = ∫_{R^d} h(r, r′) W(dr′),   (7.13)

which shows the connection with conventional stochastic integration (Itô calculus). Here, W is a random measure over R^d that is formally defined as W(E) = ⟨1_E, w⟩ for any Borel set E ⊆ R^d.
While we have already pointed out the incompatibility between stationarity and self-
similarity, there is a formal way to bypass this limitation by enforcing stationarity selec-
tively through the test functions whose moments are vanishing (up to a certain order).
Specifically, we shall specify in Section 7.5 processes that fulfill this quasi-stationarity
condition with the help of the Lp -stable, scale-invariant inverse of L∗ defined in Sec-
tion 5.5. This construction results in self-similar processes with stationary increments,
the prototypical example being fractional Brownian motion. But before that we shall
investigate other concrete examples of generalized stochastic processes, starting with
the simpler stationary ones.

7.3 Non-Gaussian stationary processes

If the inverse operator L^{−1} is shift-invariant with generic impulse response ρ_L ∈ L_1(R^d), then (7.12) is equivalent to a convolutional system with s(r) = (ρ_L ∗ w)(r). We can then apply (7.3) in conjunction with Proposition 5.1 to obtain the characteristic functional of this process, which reads

P̂_s(ϕ) = exp( ∫_{R^d} f((ρ_L^∨ ∗ ϕ)(r)) dr ).   (7.14)

More generally, we may consider generalized processes that are obtained by LSI filtering of an innovation process w and are not necessarily solutions of a stochastic differential equation.

PROPOSITION 7.2 (Generalized stationary processes) Let s = h ∗ w, where ‖μ_h‖_{TV} < ∞ (bounded variation) (resp., h is rapidly decaying) and w is a white-noise process over S′(R^d) whose Lévy exponent f is p-admissible (resp., Lévy–Schwartz admissible). Then, s is a generalized stochastic process in S′(R^d) that is stationary and completely specified by the characteristic functional P̂_s(ϕ) = exp( ∫_{R^d} f((h^∨ ∗ ϕ)(r)) dr ). In general, the process is non-Gaussian unless f(ω) = −(σ_w²/2)|ω|².

The proof is the same as that of Statement (1) in Theorem 7.1. As for the existence of the process when f is p-admissible, we rely on the convolution inequality ‖h^∨ ∗ ϕ‖_{L_p} = ‖h ∗ ϕ^∨‖_{L_p} ≤ ‖μ_h‖_{TV} ‖ϕ‖_{L_p}, which ensures that the L_p condition in Definition 7.1 is satisfied. In that respect, we note that the bounded-variation hypothesis on h (which is less stringent than h ∈ L_1) is the minimal requirement for stability when p = 1, while it can be weakened to ‖ĥ‖_{L_∞} < ∞ for p = 2 (see Statements (1) and (3) in Theorem 3.5).
In addition, when h ∈ L_p(R^d) (resp., h ∈ R(R^d)), we can invoke the last part of Theorem 7.1 to show that the point values s(r) of the process are well defined, so that s also admits a classical interpretation.

7.3.1 Autocorrelation function and power spectrum


Next, we determine the autocorrelation function of the stochastic process s and make a link with splines by following essentially the same path as for our introductory AR(1) example. Our basic requirement is that the whitening operator L is spline-admissible in the sense of Definition 6.8, which implies the existence of the generalized B-spline β_L and the interpolation function φ_int specified by (6.51).
Let L̂_d(ω) = Σ_{k∈Z^d} d_L[k] e^{−j⟨ω,k⟩} with d_L ∈ ℓ_1(Z^d) be the transfer function of L_d (the discrete version of the whitening operator L). Under the condition that L̂_d(ω) ≠ 0 for ω ∈ [−π, π]^d, we invoke Wiener's lemma (Theorem 5.13), which ensures that the (discrete) inverse operator L_d^{−1} is ℓ_p-stable. This naturally leads to the generalized B-spline reproduction formula for the impulse response of the system,

ρ_L(r) = L_d^{−1}β_L(r) = Σ_{k∈Z^d} p[k] β_L(r − k),   (7.15)

with p[k] = ∫_{[−π,π]^d} (e^{j⟨ω,k⟩}/L̂_d(ω)) dω/(2π)^d ∈ ℓ_1(Z^d). The concept also generalizes for the specification of the second-order moments.

PROPOSITION 7.3 The autocorrelation function of the stationary stochastic process s(r) = (ρ_L ∗ w)(r), where w is a white Lévy noise with variance σ_w², is given by

R_s(r) = E{s(· + r) s(·)} = σ_w² (ρ_L ∗ ρ_L^∨)(r).

Moreover, it satisfies the Shannon-like interpolation formula

R_s(r) = σ_w² (ρ_L ∗ ρ_L^∨)(r) = Σ_{k∈Z^d} R_s[k] φ_int(r − k),

where the interpolation function is defined by (6.51) and where the expansion coefficients R_s[k] = E{s(· + k) s(·)} = R_s(r)|_{r=k} correspond to the discrete-domain version of the correlation.

The power spectrum of the process is the Fourier transform of the autocorrelation function,

Φ_s(ω) = F{R_s}(ω) = σ_w² / |L̂(ω)|²,   (7.16)

an expression that is consistent with the interpretation of the signal as a filtered white noise.

Proof The first part of the statement is obtained from (7.8) as

E{s(r_1) s(r_2)} = σ_w² ⟨L^{−1}L^{−1∗}δ(· − r_1), δ(· − r_2)⟩
  = σ_w² ⟨L^{−1}L^{−1∗}δ, δ(· + r_1 − r_2)⟩   (by shift-invariance)
  = σ_w² (ρ_L ∗ ρ_L^∨)(−r_1 + r_2)   (by definition of δ)
  = σ_w² (ρ_L ∗ ρ_L^∨)(r_1 − r_2).   (by Hermitian symmetry)

As expected, this yields a correlation function that only depends on the relative difference r = (r_1 − r_2) of the index variables. To establish the validity of the interpolation formula, we consider (6.51) and manipulate the Fourier-domain expression for φ_int as

φ̂_int(ω) = |β̂_L(ω)|² / Σ_{n∈Z^d} |β̂_L(ω + 2πn)|²
  = (|L̂_d(ω)|²/|L̂(ω)|²) / Σ_{n∈Z^d} (|L̂_d(ω + 2πn)|²/|L̂(ω + 2πn)|²)   (from definition of generalized B-spline)
  = (1/|L̂(ω)|²) / Σ_{n∈Z^d} (1/|L̂(ω + 2πn)|²)
  = Φ_s(ω) / Σ_{n∈Z^d} Φ_s(ω + 2πn),

where the simplification in the third line results from the property that L̂_d(ω) (the transfer function of a digital filter) is 2π-periodic. The final formula is the ratio of Φ_s(ω) (the continuous-domain power spectrum of s given by (7.16)) and its discrete-domain counterpart Σ_{k∈Z^d} R_s[k] e^{−j⟨ω,k⟩} = Σ_{n∈Z^d} Φ_s(ω + 2πn) (by Poisson's summation formula), which proves the desired result.
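
As a quick consistency check of (7.16), one can recover the AR(1) autocorrelation by numerically inverting the power spectrum. The sketch below is our own illustration (Python with NumPy assumed; the grid extent and the pole α = −1 are arbitrary choices); for L = P_α, the theory predicts R_s(r) = σ_w² e^{α|r|}/(2|α|).

import numpy as np

alpha, sigma2 = -1.0, 1.0
w = np.linspace(-200.0, 200.0, 2**15)              # frequency grid (rad/s)
Phi = sigma2 / (w**2 + alpha**2)                   # sigma_w^2 / |j w - alpha|^2
r = np.arange(-4.0, 4.5, 0.5)                      # lags at which to evaluate R_s
# inverse continuous Fourier transform, R(r) = (1/2 pi) int Phi(w) e^{j w r} dw
R = np.array([np.trapz(Phi * np.cos(w * rr), w) for rr in r]) / (2 * np.pi)
R_theory = sigma2 / (2 * abs(alpha)) * np.exp(alpha * np.abs(r))
print(np.max(np.abs(R - R_theory)))                # small truncation/discretization error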

7.3.2 Generalized increment process


A standard side-effect of innovation models is the induction of long-range signal depen-
dencies due to the non-compact nature (IIR) of the impulse response of the system L−1 .
We have already pointed out that those could be partially suppressed by application of
the discrete form of the whitening operator Ld . The good news is that the resulting out-
put (generalized increment process) will not only be approximately decoupled but also
stationary, irrespective of the properties of the input process.

PROPOSITION 7.4 (Generalized increment process) Let s = L^{−1}w be a generalized stochastic process (possibly non-stationary) where the whitening operator L is spline-admissible with generalized B-spline β_L ∈ R(R^d) such that β_L(· − r_0) = L_d L^{−1}δ(· − r_0) for all r_0 ∈ R^d. Then, u = L_d s is stationary with characteristic functional P̂_u(ϕ) = exp( ∫_{R^d} f((β_L^∨ ∗ ϕ)(r)) dr ), where f is the Lévy exponent of the innovation w. Its autocorrelation function is E{u(· + r) u(·)} = σ_w² (β_L ∗ β_L^∨)(r).

Note that we can also write an extended version of this result for the class of spline-admissible operators L with β_L ∈ L_1(R^d), which requires that f be p-admissible for some p ∈ [1, 2].

Proof The characterization of the B-spline in the statement of the proposition ensures that the composition of L_d and L^{−1} is LSI with impulse response β_L = L_d L^{−1}δ. Since L_d is LSI, the condition is trivially satisfied when L^{−1} is shift-invariant as well. Generally, the condition will also be met in the non-stationary scenarios considered later in this chapter (see Proposition 7.6), the fundamental reason being that L_d must annihilate the components that are in the null space of L (see Section 6.4 on the construction of generalized B-splines). This property allows us to write u = L_d s = L_d L^{−1}w = β_L ∗ w. The result then follows as a direct consequence of Propositions 7.2 and 7.3.

An important implication of Proposition 7.4 is that the quality of the decoupling is


solely dependent upon the localization properties of βL . This theme is further developed
in Sections 7.4.4 and 8.3.
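
To make the decoupling concrete, the sketch below (an illustration of ours, not code from the text) forms the generalized increments u[k] = s[k] − e^{α}s[k−1] of a simulated AR(1) process sampled at the integers. Since β_{LL∗} is supported in (−1, 1) for a first-order operator, the empirical autocorrelation of u should essentially vanish at lags |k| ≥ 1; Python with NumPy/SciPy and all parameter values are assumptions on our part.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
alpha, dt = -1.0, 1e-3
m = int(1 / dt)                                    # fine-grid samples per unit interval
w = rng.standard_normal(2000 * m) * np.sqrt(dt)
s = lfilter([1.0], [1.0, -(1.0 + dt * alpha)], w)  # Euler solution of P_alpha s = w
sk = s[::m]                                        # samples at the integers
u = sk[1:] - np.exp(alpha) * sk[:-1]               # generalized increments u = L_d s
u = u - u.mean()
for k in range(4):
    print(k, np.mean(u[: len(u) - k] * u[k:]))     # lag 0 dominates; lags >= 1 near zero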

7.3.3 Generalized stationary Gaussian processes


In light of Proposition 7.2, we observe that the complete class of stationary Gaussian processes is specifiable via the basic convolutional model s = h ∗ w_Gauss, where w_Gauss is a zero-mean Gaussian innovation process and h an L_2-stable convolution operator with Fourier multiplier H(ω) = ĥ(ω) ∈ L_∞(R^d). This follows from the p = 2 admissibility of the Gaussian Lévy exponent f_Gauss(ω) = −σ_w²|ω|²/2, the necessity and sufficiency of the condition H(ω) ∈ L_∞(R^d) for the convolution operator to be continuous over L_2(R^d) (see Theorem 3.5), and the last existence result in Theorem 4.17.
The resulting generalized Gaussian process is uniquely specified by its autocorrelation function

R_{s_Gauss}(r) = σ_w² (h ∗ h^∨)(r),

or, equivalently, by its spectral density

Φ_{s_Gauss}(ω) = F{R_{s_Gauss}}(ω) = σ_w² |H(ω)|².   (7.17)

Its characteristic functional is given by

P̂_{s_Gauss}(ϕ) = exp( −(σ_w²/2) ‖h^∨ ∗ ϕ‖²_2 ),   (7.18)

which, by using Parseval's identity, can also be rewritten as

P̂_{s_Gauss}(ϕ) = exp( −(1/2) ∫_{R^d} Φ_{s_Gauss}(ω) |ϕ̂(ω)|² dω/(2π)^d ).

The necessary and sufficient condition for the existence of such generalized Gaussian processes is that Φ_{s_Gauss}(ω) be bounded almost everywhere. This is less restrictive than the requirement Φ_{s_Gauss} ∈ L_1(R^d) of the classical formulation, which ensures that the variance of the process E{|s_Gauss(r_0)|²} is finite. The simplest example that fits the generic filtered-white-noise model but does not meet the latter condition is w_Gauss with h = δ.
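
In practice, such stationary Gaussian processes are conveniently synthesized in the Fourier domain. The sketch below is our own illustration (Python with NumPy assumed): it filters discretized white Gaussian noise with a first-order frequency response H(ω) = 1/(jω + 1) chosen for the example, so that the averaged periodogram of the output approaches σ_w²|H(ω)|², in line with (7.17).

import numpy as np

rng = np.random.default_rng(2)
n, dt = 2**16, 1e-2
w = 2 * np.pi * np.fft.fftfreq(n, d=dt)            # angular-frequency grid
H = 1.0 / (1j * w + 1.0)                           # illustrative lowpass, pole at -1
noise = rng.standard_normal(n) / np.sqrt(dt)       # discretized white noise, flat spectrum
s = np.fft.ifft(np.fft.fft(noise) * H).real        # filtered (colored) Gaussian process
periodogram = np.abs(np.fft.fft(s))**2 * dt / n    # crude estimate of the power spectrum
# averaging 'periodogram' over independent realizations makes the match with |H|^2 apparent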

7.3.4 CARMA processes


The acronym ARMA traditionally refers to discrete Gaussian processes that are solutions of stable Nth-order difference equations driven by discrete white Gaussian noise. The corresponding discrete system is characterized by a rational transfer function whose denominator determines the Nth-order AR (autoregressive) part of the filter and the numerator the MA (moving-average) part. The Gaussian CARMA(N, M) (continuous-ARMA) processes are the continuous-domain counterparts of these discrete processes. They are solutions of the generic Nth-order differential equation

p_N(D)s(t) = q_M(D)w(t),   (7.19)

with defining polynomials

p_N(ζ) = ζ^N + a_{N−1}ζ^{N−1} + ··· + a_0 = Π_{n=1}^{N} (ζ − α_n)
q_M(ζ) = b_M ζ^M + b_{M−1}ζ^{M−1} + ··· + b_0 = b_M Π_{m=1}^{M} (ζ − γ_m),

where a_n, b_m ∈ R, M < N, and D is the derivative operator. Traditionally, the driving term w is a Gaussian white noise with variance σ_w². The underlying linear system is characterized by its poles α = (α_1, ..., α_N) and zeros γ = (γ_1, ..., γ_M) and is known to be causal-stable if and only if Re(α_n) < 0 for n = 1, ..., N. Under those conditions, the solution of (7.19) is given by

s_α(t) = (h_{α,γ} ∗ w)(t)

with h_{α,γ}(t) = F^{−1}{H_{α,γ}(ω)}(t), where

H_{α,γ}(ω) = q_M(jω)/p_N(jω) = b_M Π_{m=1}^{M} (jω − γ_m) / Π_{n=1}^{N} (jω − α_n)

is the frequency response of the underlying system. The link with the operator formalism of Section 5.3.2 is

h_{α,γ}(t) = b_M P_{γ_1} ··· P_{γ_M} P_{α_N}^{−1} ··· P_{α_1}^{−1} δ(t)

with P_α = D − αId and P_α^{−1}δ(t) = 1_+(t)e^{αt}; the cascade of the N elementary inverse operators plays the role of L^{−1}. Moreover, since M < N, we can decompose H_{α,γ}(ω) into simple partial fractions and obtain an expression for the impulse response of the system as a sum of elementary components that decay exponentially, which shows that h_{α,γ}(t) is rapidly decreasing. We are therefore meeting the least constraining condition h_{α,γ} ∈ R(R) of Proposition 7.2. This ensures that we can apply the framework to specify not only Gaussian CARMA processes, but also a whole variety of non-Gaussian variants associated with more general Lévy innovations. These extended CARMA processes are stationary and completely characterized by the characteristic functional (7.14) with ρ_L = h_{α,γ} and d = 1, without any restriction on the Lévy exponent f. Moreover, since the underlying kernel h(t_0, τ) = h_{α,γ}(t_0 − τ) for t_0 fixed is bounded and exponentially decreasing, they admit an interpretation as an ordinary stochastic process (by Theorem 7.1).
The autocorrelation function of the process is defined under the additional second-order hypotheses f″(0) = −σ_w² and f′(0) = 0. It is given by

R_{s_α}(t) = σ_w² (h_{α,γ} ∗ h_{α,γ}^∨)(t),

which is a sum of symmetric exponentials with parameters α. The corresponding power spectrum is

Φ_{s_α}(ω) = σ_w² b_M² Π_{m=1}^{M} |jω − γ_m|² / Π_{n=1}^{N} |jω − α_n|² = σ_w² |q_M(jω)/p_N(jω)|²,

which is consistent with (7.17).
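
Since the power spectrum of a CARMA process is determined entirely by its poles and zeros, it is straightforward to evaluate numerically. The helper below is a sketch of ours (the function name and default values are illustrative; the pole pair is borrowed from Example 4 of Section 7.4.5).

import numpy as np

def carma_spectrum(w, poles, zeros=(), b_M=1.0, sigma2_w=1.0):
    """Evaluate sigma_w^2 |q_M(jw)|^2 / |p_N(jw)|^2 from the poles and zeros."""
    num = b_M * np.ones_like(w, dtype=complex)
    for g in zeros:
        num = num * (1j * w - g)
    den = np.ones_like(w, dtype=complex)
    for a in poles:
        den = den * (1j * w - a)
    return sigma2_w * np.abs(num)**2 / np.abs(den)**2

poles = (-0.05 + 0.5j * np.pi, -0.05 - 0.5j * np.pi)   # CAR(2), resonance near |w| = pi/2
w = np.linspace(-2 * np.pi, 2 * np.pi, 1001)
Phi = carma_spectrum(w, poles)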

7.4 Lévy processes and their higher-order extensions

Lévy processes and their extensions can also be defined in the introduced framework,
but their specification is more delicate due to the fact that their underlying SDE is un-
stable. This requires the use of the “regularized” bounded inverse operators that we
presented in Section 5.4. As this constitutes a significant departure from the traditional
shift-invariant setting, we shall detail the construction in Section 7.4.1 and provide the
connection with the classical theory.

7.4.1 Lévy processes


In the standard time-domain formulation of stochastic processes, the solution of a linear differential equation is usually expressed as the stochastic integral

s(t) = ∫_0^{+∞} h(t, τ) dW(τ),

where W is a random measure which is a Brownian motion or, by extension, a Lévy process. In keeping with the above notation, we shall now show that a Lévy process W(t) can be generated via the integration of white noise as W(t) = ∫_0^t w(τ) dτ = ∫_0^t dW(τ), which is consistent with the Lévy innovation w = Ẇ being the derivative of W (in the "weak" sense of generalized functions).
To establish this connection, we recall the classical definition of Lévy processes.

DEFINITION 7.7 (Lévy process) The stochastic process W = {W(t) : t ∈ R^+} is a Lévy process if it fulfills the following requirements:
(1) W(0) = 0 almost surely.
(2) Given 0 ≤ t_1 < t_2 < ... < t_n, the increments W(t_2) − W(t_1), W(t_3) − W(t_2), ..., W(t_n) − W(t_{n−1}) are mutually independent.
(3) For any given step T, the increment process δ_T W(t), where δ_T is the operator that associates W(t) with (W(t) − W(t − T)), is stationary.
In addition, one typically requires the process to fulfill some form of probabilistic continuity.
In our framework, the equivalent of the above processes is obtained as a solution of the stochastic differential equation

DW = Ẇ = w,   (7.20)

subject to the boundary condition W(0) = 0, where D = d/dt is the derivative operator and W is to be defined as a random element of S′(R). The driving term w in (7.20) is a 1-D Lévy innovation, as defined in Section 4.4, with characteristic functional

E{e^{j⟨ϕ,w⟩}} = P̂_w(ϕ) = exp( ∫_R f(ϕ(r)) dr ).   (7.21)

We recall that w has the property of independence at every point, meaning that any pair of random variables ⟨ϕ, w⟩ and ⟨ψ, w⟩, for test functions ϕ and ψ of disjoint support, are independent. In terms of the characteristic functional, this property translates into having P̂_w(ω_1ϕ + ω_2ψ) factorize as

P̂_w(ω_1ϕ + ω_2ψ) = P̂_w(ω_1ϕ) P̂_w(ω_2ψ)   for disjointly supported ϕ and ψ.   (7.22)

To say that a generalized random process W fulfills (7.20) is, for us, to have

⟨D^∗ϕ, W⟩ = ⟨ϕ, w⟩ for all ϕ ∈ S(R),   (7.23)

where D^∗ = −D is the adjoint of D. For W to be fully characterized as a random element of S′(R), we need to give a consistent definition of ⟨ϕ, W⟩ for all ϕ ∈ S(R). We next show that we find a particular solution of (7.20) by defining

⟨ϕ, W⟩ = ⟨I_0^∗ϕ, w⟩,   (7.24)

where I_0^∗ is the left inverse S(R) → R(R) of D^∗ specified in Section 5.4.1. In view of (7.24), ⟨ϕ, W⟩ is probabilistically characterized by the functional

P̂_W(ϕ) = P̂_w(I_0^∗ϕ).   (7.25)

To see that (7.24) implies (7.23), note, first, that for any ϕ′ ∈ S(R) that can be written as D^∗ϕ for some ϕ ∈ S(R), we have

⟨ϕ′, W⟩ = ⟨D^∗ϕ, W⟩ = ⟨I_0^∗ D^∗ϕ, w⟩.

Now, since I_0^∗ is a left inverse S(R) → R(R) of D^∗ : S(R) → S(R), we find

⟨D^∗ϕ, W⟩ = ⟨ϕ, w⟩

(where ϕ can be arbitrarily chosen), which is the same as (7.23).



We symbolically represent the particular solution W defined by (7.24) and (7.25) as

W = I_0 w,   (7.26)

where I_0 is the adjoint of I_0^∗. For completeness, we also determine the corresponding kernel h(t, τ) in (7.11), which takes the form

I_0{δ(· − τ)}(t) = 1_+(t − τ) − 1_+(−τ).   (7.27)

It is the "transpose" of the generalized impulse response of I_0^∗ given by

I_0^∗{δ(· − τ)}(t) = 1_+(τ − t) − 1_+(−t) = { 1_{(0,τ]}(t),  for τ ≥ 0
                                            { −1_{(τ,0]}(t), for τ < 0,   (7.28)

which follows from (5.17) with ω_0 = 0 and ϕ being replaced by δ(· − τ). While (7.27) and (7.28) are equivalent identities with the role of the variables t and τ being interchanged, the main point is that the kernel on the right of (7.28) for τ fixed is compactly supported. This allows us to invoke Theorem 7.1 to show that the point values of the process, W(t_n) = ⟨1_{(0,t_n]}, w⟩, are ordinary random variables.
Having defined a particular solution of (7.20) as I_0 w, let us now show that it is consistent with the axiomatic definition of a Lévy process given by Definition 7.7. The zero boundary condition at the origin (Property (1) in Definition 7.7) is imposed by the operator I_0 (see Theorem 5.3 with ω_0 = 0). As for the other two properties, we recall that, for t ≥ 0,

I_0^∗ϕ(t) = ∫_t^{+∞} ϕ(τ) dτ,

from which we deduce

I_0^∗ δ_T^∗ ϕ(t) = ∫_t^∞ (ϕ(τ) − ϕ(τ + T)) dτ = ∫_t^{t+T} ϕ(τ) dτ = (1_{[−T,0)} ∗ ϕ)(t).

From there, we see that, for the increment process δ_T W,

⟨ϕ, δ_T W⟩ = ⟨δ_T^∗ϕ, W⟩ = ⟨I_0^∗ δ_T^∗ ϕ, w⟩ = ⟨1_{[−T,0)} ∗ ϕ, w⟩,

which is equivalent to

δ_T W = W(·) − W(· − T) = 1_{(0,T]} ∗ w

because 1_{(0,T]} = 1_{[−T,0)}^∨. Now, since w is stationary, we have that

X_t = ⟨ϕ(· − t), δ_T W⟩ = ⟨1_{[−T,0)} ∗ ϕ(· − t), w⟩ = ⟨1_{[−T,0)} ∗ ϕ, w⟩ = ⟨ϕ, δ_T W⟩ = X_0

for all t ∈ R, where the middle equality holds in probability law by the stationarity of w. This proves that δ_T W is stationary.
Finally, by writing W(t_m) − W(t_{m−1}) = ⟨1_{(t_{m−1},t_m]}, w⟩ and using Proposition 3.10 in combination with (7.22), we see that the joint characteristic function of the increments U_1 = W(t_2) − W(t_1), U_2 = W(t_3) − W(t_2), ..., U_{n−1} = W(t_n) − W(t_{n−1}) with 0 ≤ t_1 < t_2 < ... < t_n separates as

p̂_{(U_1:U_{n−1})}(ω_1, ..., ω_{n−1}) = P̂_w(ω_1 1_{(t_1,t_2]}) P̂_w(ω_2 1_{(t_2,t_3]}) ··· P̂_w(ω_{n−1} 1_{(t_{n−1},t_n]}),

which implies their independence (second property in Definition 7.7).


We close this section with a demonstration of the use of (7.25) for the determination of the statistical distribution of the sample values of a Lévy process. In our formalism, the sampling operation is expressed as

W(t_1) = ⟨δ(· − t_1), W⟩ = ⟨I_0^∗δ(· − t_1), w⟩ = ⟨1_{(0,t_1]}, w⟩.

This allows us to deduce the characteristic function of W(t_1) by direct substitution² of ϕ = ω_1 δ(· − t_1) in (7.25) as

p̂_{W(t_1)}(ω_1) = E{e^{jω_1 W(t_1)}} = P̂_w(ω_1 1_{(0,t_1]}) = exp( ∫_R f(ω_1 1_{(0,t_1]}(τ)) dτ ).

Recalling that f(0) = 0, we then use the fact that 1_{(0,t_1]}(t) is equal to 1 on its support of size t_1 and zero otherwise to simplify the above integral, which yields

p̂_{W(t_1)}(ω_1) = e^{t_1 f(ω_1)}.   (7.29)

This final result is the celebrated Lévy–Khintchine characterization of a Lévy process, which shows that the pdf of W(t_1) is infinitely divisible and that W is non-stationary because the underlying Lévy exponent t_1 f(ω_1) is dependent upon the sampling instant t_1.

² This is equivalent to plugging ϕ = ω_1 1_{(0,t_1]} into P̂_w(ϕ), which is legitimate since the latter is a continuous, positive-definite functional over R(R) (see Proposition 8.1).
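
The result (7.29) also suggests a direct recipe for simulation: the increments of a Lévy process over disjoint bins of width dt are i.i.d. with characteristic function e^{dt f(ω)}, so a path is obtained by accumulating such increments. The sketch below is our own (Python with NumPy assumed); it covers the three innovation families used later in this chapter, relies on the Chambers–Mallows–Stuck sampler for the SαS case (valid for α ≠ 1), and all parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(3)
n, dt, lam, alpha = 10_000, 1e-2, 1.0, 1.2

dW_gauss = np.sqrt(dt) * rng.standard_normal(n)    # Brownian-motion increments
N = rng.poisson(lam * dt, n)                       # number of jumps per bin
dW_poisson = np.sqrt(N) * rng.standard_normal(n)   # compound Poisson, Gaussian amplitudes
U = rng.uniform(-np.pi / 2, np.pi / 2, n)          # Chambers-Mallows-Stuck, symmetric case
E = rng.exponential(1.0, n)
X = (np.sin(alpha * U) / np.cos(U)**(1 / alpha)
     * (np.cos((1 - alpha) * U) / E)**((1 - alpha) / alpha))
dW_sas = dt**(1 / alpha) * X                       # stable scaling of the increments

W_gauss, W_poisson, W_sas = (np.cumsum(d) for d in (dW_gauss, dW_poisson, dW_sas))
# W[k] realizes W(k dt) = <1_{(0, k dt]}, w>, whose characteristic function is (7.29)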

7.4.2 Higher-order extensions of Lévy processes


We saw in (7.20) that Lévy processes can be characterized as solutions of first-order SDEs involving a simple derivative operator. Naturally, it is of interest to extend this definition to unstable higher-order SDEs of the same form as (7.19) with K poles on the imaginary axis. To that end, we adopt the operator notation of Section 5.4.2 and an ordering of the poles where the purely imaginary ones come last with α_{N−K+k} = jω_k, 1 ≤ k ≤ K. We then factorize p_N(D) into first-order terms, separating out the unstable ones, which allows us to rewrite (7.19) as

(P_{α_1} ··· P_{α_{N−K}}) (P_{jω_1} ··· P_{jω_K}) s_α = q_M(D) w,

where P_{α_n} = D − α_n Id and q_M(D) is a polynomial in D of degree M. The above SDE can be solved by inverting each of the first-order operators on the left individually, in the manner described in Section 5.4.2. The formal solution is thus given by

s_α = (I_{ω_K} ··· I_{ω_1}) (P_{α_{N−K}}^{−1} ··· P_{α_1}^{−1}) q_M(D) w = I_{(ω_K:ω_1)} T_LSI w,   (7.30)

where the elementary inverse operators P_{α_n}^{−1} and I_{ω_k} are defined by (5.10) and (5.19), respectively. Note that the proposed method of resolution involves the two composite

operators I_{(ω_K:ω_1)} = I_{ω_K} ··· I_{ω_1} and T_LSI = P_{(α_1:α_{N−K})}^{−1} q_M(D), the last of which is linear shift-invariant and S-continuous. While the ordering of the factors of T_LSI is immaterial (due to the commutativity of convolution), this is not so for the Kth-order integration operator I_{(ω_K:ω_1)}, which is shift-variant and directly responsible for the boundary conditions. Specifically, the prescribed ordering of the imaginary poles imposes the K linear boundary conditions at the origin

s_α(0) = 0
P_{jω_K} s_α(0) = 0
⋮
P_{jω_2} ··· P_{jω_K} s_α(0) = 0,   (7.31)

which are part of the definition of the underlying generalized Lévy process s_α.
Finally, we invoke Corollary 5.5 and Theorem 4.17 to get the full characterization of the Nth-order generalized Lévy process in terms of its characteristic functional

P̂_{s_α}(ϕ) = exp( ∫_R f(T_LSI^∗ I_{(ω_1:ω_K)}^∗ ϕ(t)) dt ),   (7.32)

where f is the Lévy–Schwartz exponent of the innovation, I_{(ω_1:ω_K)}^∗ is the composite stabilized integration operator defined by (5.24) and (5.18), and T_LSI is the convolution operator corresponding to the stable part of the system with Fourier multiplier

T̂_LSI(ω) = q_M(jω) Π_{k=1}^{K} (jω − jω_k) / p_N(jω) = q_M(jω) / Π_{n=1}^{N−K} (jω − α_n),

where Re(α_n) ≠ 0 for 1 ≤ n ≤ N − K. Clearly, the extended filtered-white-noise model (7.30) is compatible with the definition of CARMA processes of Section 7.3.4 if we simply set K = 0; it also yields the classical Lévy processes when N = K = 1 and ω_1 = 0. In that respect, we observe that the derived process P_{jω_1} ··· P_{jω_K} s_α = T_LSI w is stationary (by Proposition 7.2 and the right-inverse property of I_{(ω_K:ω_1)}), so that we may interpret K as the order of stationary deficiency.

7.4.3 Non-stationary Lévy correlations


In Section 7.3.1, we showed that there is a simple relation between the autocorrelation function of a second-order stationary process and the Green's function ρ_L of the whitening operator L. We also derived a spline interpolation formula that connects the continuous- and discrete-domain versions of the autocorrelation function (see Proposition 7.3). In principle, it is possible to obtain similar results for the higher-order extensions of the Lévy processes from Section 7.4.2, but the correlation formulas are more involved and somewhat harder to get to. We shall illustrate the concept by considering the case of an Nth-order process with a single order of non-stationarity (K = 1 and α_N = jω_0) subject to the boundary condition s_α(0) = 0.
PROPOSITION 7.5 (Correlation structure of processes with one order of non-stationarity) Let s_α(t) be an Nth-order generalized Lévy process with whitening operator L that meets the boundary condition s_α(0) = 0 imposed by the presence of the pole jω_0 on the imaginary axis. The non-stationary correlation of the continuous- and discrete-domain versions of the process is fully specified by

R_{s_α}(t, t′) = E{s_α(t) s_α(t′)} = v_{s_α}(t′ − t) − e^{jω_0 t} v_{s_α}(t′) − e^{−jω_0 t′} v_{s_α}(−t)
R_{s_α}[k, k′] = E{s_α[k] s_α[k′]} = v_{s_α}[k′ − k] − e^{jω_0 k} v_{s_α}[k′] − e^{−jω_0 k′} v_{s_α}[−k],

where v_{s_α}(t) = σ_w² ρ_{LL∗}(t) and where v_{s_α}[k] = v_{s_α}(t)|_{t=k}. These two entities are linked through the exponential-spline interpolation formula

v_{s_α}(t) = Σ_{k∈Z} v_{s_α}[k] φ_int(t − k),

where φ_int(t) is specified by (6.51). Moreover, v_{s_α}(t) = O(|t|) is a function of slow growth whose generalized Fourier transform is given by

v̂_{s_α}(ω) = σ_w² / |L̂(−ω)|² = V_{s_α}(e^{jω}) φ̂_int(ω),

where L̂(ω) is the frequency response of L, which exhibits a zero at α_N = jω_0.

Proof From (7.8), we have that R_{s_α}(t_1, t_2) = σ_w² L^{−1}L^{−1∗}{δ(· − t_1)}(t_2). Since the system has a single singularity on the imaginary axis at jω_0, (7.32) implies that L^{−1∗} = T_LSI^∗ I_{ω_0}^∗, where T_LSI is a BIBO-stable LSI system whose transfer function is T̂_LSI(ω) = j(ω + ω_0)/L̂(ω). Using the Fourier-domain formula (5.21) of I_{ω_0}^∗ and the fact that T_LSI^∗ is a standard convolution operator, we find that the Fourier transform of T_LSI^∗ I_{ω_0}^∗{δ(· − t_1)} is given by

T̂_LSI(−ω) (e^{−jωt_1} − e^{jω_0 t_1}) / (−j(ω + ω_0)).

Likewise, we have that L^{−1} = I_{ω_0} T_LSI, which, by considering the complex-conjugate counterpart of (5.20) for I_{ω_0}, yields

L^{−1}L^{−1∗}{δ(· − t_1)}(t) = ∫_R |T̂_LSI(−ω)|² (e^{−jωt_1} − e^{jω_0 t_1})(e^{jωt} − e^{−jω_0 t}) / |ω + ω_0|² dω/2π
  = ∫_R |T̂_LSI(−ω)|² (e^{jω(t−t_1)} − e^{jω_0 t_1} e^{jωt} − e^{−jω_0 t} e^{−jωt_1} + e^{−jω_0(t−t_1)}) / |ω + ω_0|² dω/2π   (7.33)
  = ρ_{LL∗}(t − t_1) − e^{jω_0 t_1} ρ_{LL∗}(t) − e^{−jω_0 t} ρ_{LL∗}(−t_1).

The critical step in this derivation is the evaluation of the integral in (7.33) which, contrary to appearances, is non-singular, due to the presence of the fourth term in the numerator. To make this explicit, we recall that the proper specification of the inverse Fourier transform of 1/|L̂(−ω)|², which has a second-order singularity at ω = −ω_0, involves a finite-part integral (see Appendix A) that can be resolved as follows:

ρ_{LL∗}(t) = F^{−1}{ 1/|L̂(−ω)|² }(t)
  = p.f. ∫_R |T̂_LSI(−ω)|² e^{jωt} / |ω + ω_0|² dω/2π
  = ∫_R |T̂_LSI(−ω)|² (e^{jωt} − e^{−jω_0 t}[1 + j(ω + ω_0)t]) / |ω + ω_0|² dω/2π,

where the numerator is corrected by subtracting the first two terms of the Taylor series of e^{jωt} around ω = −ω_0.
This means that, in order to neutralize the singularity in the denominator of (7.33), we need to correct the three leading exponentials in the numerator by subtracting their first-order development around ω = −ω_0. This is possible by rewriting the numerator as

Numerator = ( e^{jω(t−t_1)} − e^{−jω_0(t−t_1)}[1 + j(ω + ω_0)(t − t_1)] )
  − e^{jω_0 t_1} ( e^{jωt} − e^{−jω_0 t}[1 + j(ω + ω_0)t] )
  − e^{−jω_0 t} ( e^{−jωt_1} − e^{jω_0 t_1}[1 − j(ω + ω_0)t_1] )
  = e^{jω(t−t_1)} − e^{jω_0 t_1} e^{jωt} − e^{−jω_0 t} e^{−jωt_1} + e^{−jω_0(t−t_1)},

which shows that the terms that are required to make the inverse-Fourier integral non-singular precisely add up to e^{−jω_0(t−t_1)}.

In particular, for N = 1 and ω_0 = 0, we find that

R_Wiener(t, t′) = (σ_w²/2) (|t| + |t′| − |t′ − t|),

which is the well-known autocorrelation function of Brownian motion (a.k.a. the Wiener process).
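
This closed form lends itself to a quick Monte Carlo verification. The sketch below is ours (Python with NumPy assumed, illustrative parameters): it estimates E{W(t)W(t′)} over many discretized Brownian paths and compares it with (1/2)(|t| + |t′| − |t − t′|) = min(t, t′) for t, t′ ≥ 0 and σ_w = 1.

import numpy as np

rng = np.random.default_rng(4)
n_paths, n_steps, dt = 50_000, 200, 0.01
W = np.cumsum(np.sqrt(dt) * rng.standard_normal((n_paths, n_steps)), axis=1)
i, j = 59, 149                                     # W[:, 59] ~ W(0.6), W[:, 149] ~ W(1.5)
print(np.mean(W[:, i] * W[:, j]))                  # approximately min(0.6, 1.5) = 0.6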

7.4.4 Removal of long-range dependencies


The characteristic functional (7.32) provides a complete description of the Lévy processes and their extensions. While the formula is elegant conceptually, it is not quite as favorable for the derivation of the joint statistics of these processes. Even in the simplest case of correlations, the calculations can get quite involved, as we just saw in Section 7.4.3. The source of the difficulty is the non-stationary nature of the operator I_{(ω_1:ω_K)}^∗. Another confounding factor is the induction of long-range dependencies due to the lack of decay of the Green's function ρ_L in the non-stable scenario. Fortunately, these dependencies can be suppressed, for the most part, by applying the decoupling procedure outlined in Proposition 7.4. The practical benefit is that this produces an equivalent signal – the generalized-increment process – that is stationary and maximally decoupled, which greatly simplifies the statistical analysis.

For an Nth-order generalized Lévy process, it is possible to characterize all relevant quantities explicitly by taking advantage of the functional link with exponential B-splines. Specifically, given the poles α ∈ C^N of the system, one defines the following spline-related entities:
(1) The (causal) exponential B-spline with parameter α (poles)

β_α(t) = ∫_R e^{jωt} Π_{n=1}^{N} (1 − e^{α_n − jω})/(jω − α_n) dω/2π   (7.34)

(2) The Nth-order (causal) finite-difference (or localization) operator

Δ_α ϕ(t) = Δ_{α_1} ··· Δ_{α_N} ϕ(t),   (7.35)

which is obtained by cascading the first-order operators Δ_{α_n}ϕ(t) = ϕ(t) − e^{α_n}ϕ(t − 1) with n = 1, ..., N.
The important point for our argument is that the exponential B-spline β_α(t) is causal with a compact support of size N and differentiable up to order (N − 1) (see Section 6.4.4).
We recall that the generalized Lévy process s_α satisfies the differential equation (5.6), which is associated with the whitening operator L whose frequency response in the sense of generalized functions is

L̂(ω) = p_N(jω)/q_M(jω) = Π_{n=1}^{N} (jω − α_n) / ( b_M Π_{m=1}^{M} (jω − γ_m) ).

This suggests considering the generalized exponential B-spline

β_L(t) = Δ_α ρ_L(t) = q_M(D)β_α(t),   (7.36)

where ρ_L = F^{−1}{1/L̂(ω)} is the Green's function of L. The crucial property here is that β_L ∈ R(R) with the shortest possible support. Indeed, β_L has the same (minimal) support as β_α and is bounded (because q_M(D) is a differential operator of order M < N), independently of the location of the poles in the complex plane. Our next proposition ensures that the B-spline in (7.36) and the finite-difference operator Δ_α are the appropriate ingredients for decoupling generalized Lévy processes.
PROPOSITION 7.6 (Approximate inversion by finite differences) Let L^{−1} be the Nth-order inverse operator defined by

L^{−1} = I_{ω_K} ··· I_{ω_1} P_{α_{N−K}}^{−1} ··· P_{α_1}^{−1} q_M(D),

with Re(α_n) ≠ 0 for 1 ≤ n ≤ N − K and α_{N−K+k} = jω_k for 1 ≤ k ≤ K, where I_{ω_k} and P_{α_n}^{−1} are specified by (5.19) and (5.10), respectively. Then, the generalized B-spline β_L defined by (7.36) and (7.34) has the following properties:

Δ_α L^{−1} ϕ = β_L ∗ ϕ
L^{−1∗} Δ_α^∗ ϕ = β_L^∨ ∗ ϕ

for all ϕ ∈ S(R).

Since β_L is maximally localized and close to an impulse, the interpretation is that Δ_α yields an approximate inverse of L^{−1}. In other words, Δ_α acts as a discrete proxy for the continuous-domain whitening operator L.

Proof The key is to rely on the factorization property (6.42) of the exponential B-spline β_α and to consider the elementary factors one at a time. To that end, we first establish that

Δ_α P_α^{−1} ϕ = β_α ∗ ϕ   (7.37)
Δ_{jω_k} I_{ω_k} ϕ = β_{jω_k} ∗ ϕ,   (7.38)

as well as the adjoint counterparts of these relations. The first identity is a direct consequence of the definition of the first-order exponential B-spline

β_α(t) = Δ_α ρ_α(t) = F^{−1}{ (1 − e^{α−jω})/(jω − α) }(t),

where ρ_α = F^{−1}{1/(jω − α)} is the Green's function of the operator P_α or, equivalently, the impulse response of the inverse operator P_α^{−1}. As for the second identity, we apply the time-domain definition (5.19) of I_{ω_k}, which yields

Δ_{jω_k} I_{ω_k} ϕ = Δ_{jω_k}{ρ_{jω_k} ∗ ϕ} − (ρ_{jω_k} ∗ ϕ)(0) Δ_{jω_k}{e^{jω_k r}}   (the last factor vanishes)
  = Δ_{jω_k}{ρ_{jω_k}} ∗ ϕ
  = β_{jω_k} ∗ ϕ,

where we have used the fact that Δ_{jω_k} annihilates the sinusoidal components that are in the null space of P_{jω_k} and the associativity of convolution operators such as Δ_{jω_k}.
By applying (7.37) and (7.38) recursively and making use of the commutativity of S-continuous LSI operators, we find that

Δ_α L^{−1} ϕ = Δ_{(α_1:α_{N−1})} Δ_{jω_K} I_{ω_K} I_{(ω_{K−1}:ω_1)} P_{α_{N−K}}^{−1} ··· P_{α_1}^{−1} q_M(D){ϕ}
  = β_{jω_K} ∗ ( Δ_{(α_1:α_{N−2})} Δ_{jω_{K−1}} I_{ω_{K−1}} I_{(ω_{K−2}:ω_1)} P_{α_{N−K}}^{−1} ··· P_{α_1}^{−1} q_M(D){ϕ} )
  ⋮
  = β_{jω_K} ∗ ··· ∗ β_{jω_1} ∗ β_{α_{N−K}} ∗ ··· ∗ β_{α_1} ∗ q_M(D){ϕ}
  = q_M(D){β_{α_1} ∗ ··· ∗ β_{α_N}} ∗ ϕ
  = q_M(D){β_α} ∗ ϕ
  = β_L ∗ ϕ.

The adjoint counterpart of this identity is obtained by applying the same divide-and-conquer strategy.

The idea is now to apply the localization operator Δ_α to s_α in order to partially decouple the process and, more importantly, to suppress its non-stationary components. This yields the generalized-increment process u_α = Δ_α s_α, which has a much simpler statistical structure than s_α. Indeed, by combining the result of Proposition 7.6 with (7.30), we show that

⟨ϕ, u_α⟩ = ⟨ϕ, Δ_α s_α⟩ = ⟨Δ_α^∗ϕ, L^{−1}w⟩ = ⟨L^{−1∗}Δ_α^∗ϕ, w⟩ = ⟨β_L^∨ ∗ ϕ, w⟩

for all ϕ ∈ S(R). This is equivalent to

u_α = Δ_α s_α = β_L ∗ w,   (7.39)

which is a form that is much more convenient than s_α = L^{−1}w because the convolution with β_L preserves stationarity. The other favorable aspect is that the generalized B-spline β_L (which has a compact support) is much better localized than the Green's function ρ_L, especially in the non-stable scenario where ρ_L exhibits polynomial growth. We are now in the position to invoke Proposition 7.4 to get the complete statistical characterization of u_α, including its correlation function, which is proportional to β_{LL∗}(t) = (β_L ∗ β_L^∨)(t), where β_L is the generalized exponential B-spline defined by (7.36).

7.4.5 Examples of sparse processes


Examples of realizations of Gaussian vs. sparse stochastic processes are shown in Figures 7.2 to 7.5. These signals were generated using three types of driving innovations: Gaussian (panel b), impulsive Poisson (panel c), and symmetric-alpha-stable (SαS) with α = 1.2 (panel d).

Figure 7.2 Example 1: generation of generalized stochastic processes with whitening operator L = D (pole vector α = (0)). (a) B-spline functions β_L(t) = rect(t − 1/2) and β_{LL∗}(t) = tri(t). (b) Brownian motion. (c) Compound-Poisson process with λ = 1/32 and Gaussian amplitude distribution p_A(a) = (2π)^{−1/2} e^{−a²/2}. (d) SαS Lévy motion with α = 1.2.

Figure 7.3 Example 2: generation of generalized stochastic processes with whitening operator L = D² (pole vector α = (0, 0)). (a) B-spline functions β_L(t) = tri(t) and β_{LL∗}(t) (cubic B-spline). (b) Gaussian process. (c) Generalized Poisson process with λ = 1/32 and Gaussian amplitude distribution p_A(a) = (2π)^{−1/2} e^{−a²/2}. (d) Generalized SαS process with α = 1.2.
The relevant operators are:
• Example 1: L = D (Lévy process)
• Example 2: L = D² (second-order extension of Lévy process)
• Example 3: L = (D − α_1 Id)(D − α_2 Id) with α = (j3π/4, −j3π/4) (generalized Lévy process)
• Example 4: L = (D − α_1 Id)(D − α_2 Id) with α = (−0.05 + jπ/2, −0.05 − jπ/2) (CAR(2) process).
The corresponding B-splines (β_L and β_{LL∗}) are shown in the upper-left panel of each figure.
The signals that are displayed side-by-side share the same whitening operator, but they differ in their sparsity patterns, which come in three flavors: none (Gaussian), finite rate of innovation (Poisson), and heavy-tailed statistics (SαS). The Gaussian signals are uniformly textured, while the generalized Poisson processes are piecewise-smooth by construction.
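
Realizations of this kind can be reproduced with a few lines of code. The sketch below uses a unit-grid discretization of our own (not the generator used for the figures): it draws innovations of the three flavors and applies discrete counterparts of L^{−1} — running sums for Examples 1 and 2, and a second-order recursion whose poles e^{α_1}, e^{α_2} approximate Example 4. The values λ = 1/32 and α = 1.2 mirror the captions; everything else is an illustrative assumption.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)

def innovation(kind, n, lam=1 / 32, alpha=1.2):
    if kind == "gauss":
        return rng.standard_normal(n)
    if kind == "poisson":                          # impulsive Poisson, Gaussian amplitudes
        N = rng.poisson(lam, n)
        return np.sqrt(N) * rng.standard_normal(n)
    U = rng.uniform(-np.pi / 2, np.pi / 2, n)      # SaS via Chambers-Mallows-Stuck
    E = rng.exponential(1.0, n)
    return (np.sin(alpha * U) / np.cos(U)**(1 / alpha)
            * (np.cos((1 - alpha) * U) / E)**((1 - alpha) / alpha))

n = 1024
a1, a2 = -0.05 + 0.5j * np.pi, -0.05 - 0.5j * np.pi
c1, c2 = (np.exp(a1) + np.exp(a2)).real, (np.exp(a1 + a2)).real
for kind in ("gauss", "poisson", "sas"):
    w = innovation(kind, n)
    s1 = np.cumsum(w)                              # Example 1: L = D
    s2 = np.cumsum(s1)                             # Example 2: L = D^2
    s4 = lfilter([1.0], [1.0, -c1, c2], w)         # Example 4: AR(2) with poles e^{a1}, e^{a2}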

Lowpass processes
The classical Lévy processes (Figure 7.2) are obtained by integration of white Lévy noise. They go hand-in-hand with the B-spline of degree zero (rect) and its autocorrelation (triangle function), which is a B-spline of degree 1. The Gaussian version (Figure 7.2b) is a Brownian motion. It is quite rough and nowhere differentiable in the classical sense.
Yet, it is mean-square continuous due to the presence of the single pole at the origin.
The Poisson version (compound-Poisson process) is piecewise-constant, each jump cor-
responding to the occurrence of a Dirac impulse. The SαS Lévy motion exhibits local
fluctuations punctuated by large (but rare) jumps, as is characteristic for this type of
process [ST94, App09]. Overall, it is the jump behavior that dominates, making it even
sparser than its Poisson counterpart.
The example in Figure 7.3 (second-order extension of a Lévy process) corresponds to
one additional level of integration, which yields smoother signals (i.e., one-time diffe-
rentiable in the classical sense). The corresponding Poisson process is piecewise-linear,
while the SαS version looks globally smoother than the Gaussian one, except for a few
sharp discontinuities in its slope. The basic B-spline here is a triangle, while βLL∗ is
a cubic B-spline. The signals in Figures 7.2 and 7.3 are non-stationary; the underlying
processes have the remarkable property of being self-similar (fractals) due to the scale-
invariance of the pure-derivative operators. The Gaussian and SαS processes are strictly
self-similar in the sense that the statistics are preserved through rescaling. By contrast,
the scaling of the Poisson processes necessitates some corresponding adjustment of the rate parameter λ [UT11].

Figure 7.4 Example 3: generation of generalized stochastic processes with operator L = (D − α_1 Id)(D − α_2 Id) and α = (j3π/4, −j3π/4). (a) B-spline functions β_L and β_{LL∗}. (b) Gaussian process. (c) Generalized Poisson process with λ = 1/32 and Gaussian amplitude distribution. (d) Generalized SαS process with α = 1.2.

Bandpass processes
The second-order signals in Figure 7.4 are non-stationary as well, but no longer low-
pass nor self-similar. They are real-valued, and C1 -continuous almost everywhere (pair
of complex-conjugate poles in the left complex plane). They constitute some kind
of modulated (or bandpass) counterpart of the Lévy processes, which appears to be


much better suited for the modeling of acoustic signals. As in the other examples, the
Gaussian version is looking cluttered. The Poisson signal is somewhat stereotyped
(stretches of pure oscillating regime) and not quite as realistic-looking as its SαS
counterpart.
As soon as the poles are moved away from the imaginary axis, the processes become
stationary. This is illustrated in Figure 7.5 with some CAR(2) (continuous autoregres-
sive) examples, the non-Gaussian versions of which have a marked tendency to exhibit
characteristic bursts associated with the impulse response of the system. These latter
processes are part of the stationary CARMA family characterized in Section 7.3.4.

Figure 7.5 Example 4: generation of generalized stochastic processes with whitening operator L = (D − α_1 Id)(D − α_2 Id) and α = (−0.05 + jπ/2, −0.05 − jπ/2). (a) B-spline functions β_L and β_{LL∗}. (b) Gaussian AR(2) process. (c) Generalized Poisson process with λ = 1/32 and Gaussian amplitude distribution. (d) SαS AR(2) process with α = 1.2.

7.4.6 Mixed processes


One can also construct signals with a more intricate structure by simple addition of independent elementary processes. This results in the mixed process s_mix = s_1 + ··· + s_M whose characteristic functional is the product of the characteristic functionals of the individual constituents. It is given by

P̂_{s_mix}(ϕ) = Π_{m=1}^{M} P̂_{s_m}(ϕ) = exp( Σ_{m=1}^{M} ∫_R f_m(L_m^{−1∗}ϕ(t)) dt ),

where s_m is some elementary process with whitening operator L_m and Lévy function f_m(ω). As a demonstration of the concept, we have synthesized some acoustic samples
by mixing random signals associated with elementary musical notes (pair of poles at the
corresponding frequency). These can be downloaded from http://www.sparseprocesses.
org/. The Gaussian versions are diffuse, cluttered, and boring to listen to. The general-
ized Poisson and SαS samples are more interesting perceptually – reminiscent of chimes
– with the latter sounding less dry and more realistic. Note that mixing does not gain us
anything in the Gaussian case because the resulting signal is still part of the traditional
family of Gaussian ARMA processes (this follows from Parseval’s relation and the fact
 σw2
that M m=1  is expressible as an equivalent rational power spectrum). This is
|Lm(−ω)|2

not so for the non-Gaussian members of the family, which are not decomposable in gen-
eral, meaning that the mixing of sparse processes opens up new modeling perspectives.
Interestingly, the Gaussian acoustic samples are almost impossible to compress using
mp3/AAC, while the generalized Poisson and SαS ones can be faithfully reproduced at
a much lower bit rate.

7.5 Self-similar processes

In order to construct self-similar stochastic processes in our framework, we seek operators L that are scale-invariant, and which have scale-invariant inverses, in the sense of Definition 5.2. This narrows down the class of suitable operators to the fractional derivatives ∂_τ^γ in 1-D and the fractional Laplacians (−Δ)^{γ/2} in multiple dimensions if rotation-invariance is required as well. This suggests that self-similar processes can be specified as solutions of fractional SDEs. Once again, the definition of these processes requires the use of regularized bounded inverses which, in this case, were introduced in Section 5.5. A difference with the Lévy processes is that, for fractional-order operators, the definition of the inverse depends not only on the forward operator but also on the innovation – more precisely, on the domain of the characteristic functional of the innovation.
An alternative non-isotropic multidimensional solution is a separable operator of the form L = ∂_{r_1,τ_1}^γ ··· ∂_{r_d,τ_d}^γ, where ∂_{r_n,τ}^γ denotes the (γ, τ)-derivative along the coordinate axis r_n (see Section 5.5.1). The simplest example is the partial derivative operator L = ∂_{r_1} ··· ∂_{r_d}, which results in the definition of the Mondrian process (see Section 7.5.3).
We now describe in more detail the first class of self-similar and isotropic processes and fields. The generation of these processes, as suggested above, is achieved by applying the L_p-continuous inverses defined in Section 5.5.3 to self-similar and rotation-invariant innovations.
For the processes that are defined using these operators to be self-similar in the strict sense (see Definition 7.4), the innovation to which the operator is applied needs to be self-similar as well, in the sense that ⟨ϕ, w⟩ and c⟨ϕ, w(·/a)⟩ have the same statistics for some appropriate normalization factor c. This narrows down the class of innovations to α-stable noises. More generally, however, we may apply a scale-invariant operator to a Lévy noise in order to obtain a process/field that is wide-sense self-similar (see Definition 7.6).

In Sections 7.5.1 and 7.5.2, we review genuine self-similar models³ that result from the application of scale-invariant operators to SαS innovations, and then devote Section 7.5.3 to the case of Poisson innovations, which yield wide-sense self-similar models.

7.5.1 Stable fractal processes


The class of self-similar and rotation-invariant innovations used in our definition consists of symmetric-alpha-stable white noises defined (in one and several dimensions) by the characteristic functional

P̂_{w_{α,s_0}}(ϕ) = e^{−s_0 ‖ϕ‖_α^α}   (7.40)

with α ∈ (0, 2], where ‖·‖_α^α denotes the αth power of the L_α (quasi-)norm and s_0 is an arbitrary positive normalization constant. Note that the largest domain of definition of P̂_{w_{α,s_0}} on which it remains finite is the Lebesgue space L_α(R^d). The characteristic functional P̂_{w_{α,s_0}} is also continuous with respect to the L_α topology (and, a fortiori, with respect to any finer topology such as those of D, S ⊂ L_α).
The innovation
 d wα,s  0 defined by (7.40) is stationary, isotropic, and self-similar with
scaling index α − d in the following sense:

ϕ, wα,s0 (RT ·) = ϕ(R·), wα,s0


= ϕ, wα,s0 in probability law (rotation-invariance)
−1
ϕ, wα,s0 (a ·) = a ϕ(a·), wα,s0
d

d
= ad− α ϕ, wα,s0 . in probability law (self-similarity).

This, in combination with the rotation-invariance and scale-invariance of degree


−γ ∗
(−γ ) of the inverse Laplacian operator Lα introduced in Theorem 5.11, shows that
the generalized random process/field
−H−d+ αd
sα,H = Lα wα,s0

defined by the characteristic functional


d
P w (L−H−d+
s (ϕ) = P α
α∗
ϕ) (7.41)
α,H α,s0

is isotropic and self-similar with Hurst exponent
 H for H  = 0, 1, 2, . . . see Theorem 7.1,
Statement (4) with γ = H + d − d/α . Finally, we note that the existence of these
−γ ∗
processes is guaranteed since the operator Lα used above correctly maps S (Rd ) into
d
Lα (R ), per Theorem 5.11.
The random processes/fields sα,H are the so-called fractional stable motions, which
generalize fractional Brownian motions corresponding to the Gaussian case (α = 2).
These processes have been widely applied in stochastic modeling, especially in their
3 These models are studied in more detail in the thesis of P. D. Tafti [Taf11], which is entirely devoted to the
study of self-similar fields, with emphasis on vectorial extensions (flow fields).
178 Sparse stochastic processes

H=.5
H=.75
H=1.25
H=1.5
Figure 7.6 Gaussian vs. sparse fractal-like processes: comparison of fractional Brownian motion
(left column) and generalized Poisson (right column) stochastic processes as the order increases.
The processes that are side-by-side have the same order γ = H + 12 and identical second-order
statistics.

early Gaussian variety known as fractional Brownian motion, as discussed by Man-


delbrot and Van Ness [MVN68] (some examples are shown in the left column of
Figure 7.6). They are remarkable due to their fractal nature and long-range dependen-
cies. Their most significant statistical properties are listed below.
(1) Self-similarity. The process sα,H is equivalent in probability law to its scaled version
aH sα,H ( a· ). The property is established by showing that sα,H and aH sα,H ( a· ) have
the same characteristic functional (see proof of Statement (4) in Theorem 7.1). It
also justifies its parameterization by H, the Hurst or self-similarity exponent of the
process.
(2) Isotropy. The process sα,H is equivalent in probability law to sα,H (RT ·), where R is
an arbitrary orthonormal matrix in Rd×d . The proof is similar to that of the previous
property.
(3) Non-stationarity and stationary (n + 1)th increments. The process sα,H is non-
−γ
stationary since it is the result of applying the non-shift-invariant operator Lα
with γ = H + d − d/α to a stationary white-noise process. However, despite its non-
stationarity, any finite increment of sα,H of order H+1, as defined in the following
theorem, is stationary.
THEOREM 7.7 Let Y = {y0 , . . . , yn } be a set of n + 1 vectors in Rd and denote by
δY the finite-difference operator defined as the composition of the operators

δyi : f → f − f (· − yi ).

Then, for H ≤ n, the random process/field

δY∗ sα,H

is stationary.
7.5 Self-similar processes 179

Proof We first observe that the (n+1)th-order finite-difference operator δY is shift-


invariant and annihilates the moments of a function ϕ up to degree n, so that

yk δY ϕ(y) dy = 0
Rd
for |k| ≤ n (this is proved easily by induction on n or by differentiation in the
Fourier domain). From there, according to (5.36) we have
Lα−γ ∗ δY ϕ = ρ γ −d ∗ (δY ϕ)
−γ ∗
because the correction term for δY ϕ that normally distinguishes Lα from the shift-
invariant inverse is zero in this case. Consequently,
ϕ(· − h), δY∗ sα,H = δY ϕ(· − h), sα,H
−H−d+ αd ∗
= Lα δY ϕ(· − h), wα
−H−d+ αd ∗
= Lα δY ϕ, wα (· + h) by shift-invariance
−H−d+ αd ∗
= Lα δY ϕ, wα by the stationarity of wα
= ϕ, δY∗ sα,H , in probability law
which proves that δY∗ sα,H is stationary.
(4) Variogram and covariance in the finite-variance case (with 0 < H < 1). For α < 2,
fractional stable motions have infinite variance. But for α = 2, the covariance struc-
ture of fractional Brownian fields can be derived from the characteristic functional
of its (n + 1)th-order increments. Here, we show this derivation for 0 < H < 1.
P R O P O S I T I O N 7.8 (Self-similar variograms) The variogram and the covariance
function of a fractional Brownian field with 0 < H < 1 are given by
2γs2,H (r, s) = E{|s2,H (r) − s2,H (s)|2 } ∝ 2ρ 2H (r − s)
and
Rs2,H (r, s) = E{s2,H (r)ss,H (s)} ∝ 2ρ 2H (r) − 2ρ 2H (r − s) + 2ρ 2H (s),
respectively, where ρ γ is the γ -homogeneous distribution defined in Theorem 5.10.
Proof Let us fix r and s and temporarily denote by u the increment process
u(h) = s2,H (r + h) − s2,H (s + h) = δr−s s2,H (r + h).
Then, the variogram of s2,H corresponds to the variance of u at 0. To compute it, we
first observe that
−H−d/2
ϕ, δr−s s2,H (· + r) = ϕ, δr−s L2 w(· + r)
−H−d/2∗ ∗
= L2 {δr−s ϕ(· + r)}, w
= ρ ∗ ϕ(· + r) − ρ
H−d/2 H−d/2
∗ ϕ(· + s), w
 H−d/2 
= ρ (· + r) − ρ H−d/2
(· + s) ∗ ϕ, w .
180 Sparse stochastic processes

This shows that, for fixed r and s, u is a filtered Gaussian white noise with general-
ized covariance function
 H−d/2 ∨  
ρ (·+r)−ρ H−d/2 (·+s) ∗ ρ H−d/2 (·+r)−ρ H−d/2 (·+s) ∝ 2ρ 2H (r−s)−2ρ 2H (·),
using the symmetry and convolution properties of ρ γ . In particular, for the variance
at 0 of u (which is the same as the variance everywhere, due to stationarity), we
have
2γs2,H (r, s) = E{|s2,H (r) − s2,H (s)|2 } ∝ 2ρ 2H (r − s).
By developing the above result, we then find the generalized covariance
Rs2,H (r, s) = E{s2,H (r)ss,H (s)} ∝ 2ρ 2H (r) − 2ρ 2H (r − s) + 2ρ 2H (s).

The result of Proposition 7.8 generalizes to all wide-sense self-similar processes


obtained by replacing the α-stable innovation in (7.41) with a general finite-variance
stationary innovation.

7.5.2 Fractional Brownian motion through the looking-glass


To demonstrate the analytical power of the formalism, we now consider the special
case d = 1, α = 2, and H ∈ (0, 1). The corresponding self-similar Gaussian process is
fractional Brownian motion (fBm), commonly denoted by BH . The evaluation of (7.41)
then yields
  2 
1 ϕ (ω) − 
ϕ (0)  dω
E{e
jϕ,BH 
} = PBH (ϕ) = exp −   (7.42)
2 R  (−jω)γ  2π

with γ = H + 12 , where we have applied (5.38) and used Parseval’s relation to rewrite
−γ ∗
L2 ϕ22 in the Fourier domain. It is important to understand that (7.42) completely
characterizes fBm. While there are several equivalent ways of writing the denominator
in the Fourier-domain integral, we have chosen the form |ω|2γ = |(−jω)γ |2 . In terms of
operators, this translates into
" #
PB (ϕ) = exp − 1 Iγ ∗ {ϕ}L ,
H 2 0,2 2

γ∗
where I0,2 is the canonical left inverse of the fractional-derivative operator Dγ ∗ (see
definitions in Table 7.1). This, in turn, implies that fBm is the solution of the fractional
stochastic differential equation
Dγ BH = w, (7.43)
where w is a normalized Gaussian white noise.
γ∗
By writing the explicit form of I0,2 with γ ∈ (0.5, 1.5), we obtain

γ∗ 
ϕ (ω) − 
ϕ (0) jωτ dω
I0,2 {ϕ}(τ ) = e (7.44)
R (−jω)γ 2π
= (D−γ +1 )∗ I∗0 {ϕ}(τ ), (7.45)
7.5 Self-similar processes 181

Table 7.1 Causal scale-invariant operators and their adjoint.

Definition Properties

Fractional derivatives of order γ ∈ (−1, +∞)



Dγ {ϕ}(τ ) = ϕ (ω)(jω)γ ejωτ
 Causal, LSI
R 2π

Dγ ∗ {ϕ}(τ ) = ϕ (ω)(−jω)γ ejωτ
 Anti-causal, LSI
R 2π

Fractional integrators of order γ ∈ (1 − 1p , +∞), γ + 1p  = 1, 2, 3, . . .


γ −1+ p1  (jω)k
γ ejωτ − k=0 k!  dω γ
I0,p {ϕ}(τ ) = γ ϕ (ω) I0,p {ϕ}(0) = 0
R (jω) 2π
γ −1+ p1  
ϕ (k) (0)ωk
γ∗ 
ϕ (ω) − k=0 k! dω
I0,p {ϕ}(τ ) = ejωτ Lp -stable
R (−jω)γ 2π

where I∗0 = I1∗0,2 is the regularized integrator that we have already encountered during
our investigation of Lévy processes. One can also verify that (7.44) coincides with the
−γ ∗
Lp -stable left inverse ∂τ ,2 of Theorem 5.8 with τ = −γ /2 and p = 2. The interest of
(7.45) is that it suggests a possible representation of fBm as
BH = I0 D−H+1/2 w,
where I0 imposes the boundary condition BH (0) = 0.
To determine the underlying kernel denoted by hγ ,2 (t, τ ), we recall that h(t, τ ) =
L−1 {δ(· − τ )}(t) = L−1∗ {δ(· − t)}(τ ). Specifically, by inserting the Fourier transform of
δ(· − t) into (7.44), we find that
e−jωt − 1 jωτ dω
hγ ,2 (t, τ ) = γ
e
R (−jω) 2π
1 "  γ −1 #
γ −1
= − (τ − t) + − (−τ )+
(γ )
1 " γ −1 γ −1
#
= (t − τ )+ − (−τ )+ , (7.46)
(γ )
which is consistent with the relation
* +
−1 1 tα+
F (t) =
(jω)α+1 (α + 1)
for α ∈
/ Z (see Table A.1).
Examples of such kernel functions with t = t1 = 1 (fixed) are shown in Figure 7.7 –
the ones of interest with γ ∈ (0.5, 1.5) have their area shaded. These are representatives

of the two main regimes where the functions are either bounded γ ∈ [1, 3/2) or
unbounded γ ∈ (1/2, 1) with two (square-integrable) singularities at τ = 0 and τ = t1 .
While hγ ,2 (t, τ ) is made up of individual atoms (one-sided power functions) whose
182 Sparse stochastic processes

1 1

−2 −1 1 2 −2 −1 1 2
(a) (b)

−2 −1 1 2
(c)

Figure 7.7 Comparison of the standard and higher-order fBm-related kernels hγ1 ,2 (1, τ ) (shaded)
vs. hγ2 ,2 (1, τ ) (dashed line) with γ2 = γ1 + 1: (a) (γ1 , γ2 ) = (0.8, 1.8); (b) (γ1 , γ2 ) = (1, 2);
(c) (γ1 , γ2 ) = (1.2, 2.2).

energy is infinite, the remarkable twist is that the combination in (7.46) yields a function
of τ that is square-integrable.

P R O P O S I T I O N 7.9 The function hγ ,2 (t1 , ·), where t1 ∈ R is a fixed parameter, belongs


to L2 (R) for γ ∈ (0.5, 1.5). More generally, the right-hand side of (7.46) defines the
kernel hγ ,p (t1 , ·) ∈ Lp (R) with p > 0 if and only if 1 − 1p < γ < 2 − 1p . On the other
hand, none of these kernels is rapidly decaying, except for h1,2 (t1 , ·) = 1(0,t1 ] (Brownian
motion).

Proof We consider the scenario t1 ≥ 0. First, we note that (7.47) with γ = 1 and t = t1
simplifies to

h1,2 (t1 , ·) = I∗0 {δ(· − t1 )}(τ ) = 1(0,t1 ] (τ ), (7.47)

which is compactly supported and hence rapidly decaying (see Figure 7.7b). For the
other values of γ , we split the range of integration into two parts in order to handle the
singularities separately from the decay of the tail.
(1) τ ∈ [−t1 , t1 ] with t1 > 0 finite. 1
This part of the Lp -integral will be bounded if 0 τ p(γ −1) dτ < ∞. This happens
for 1 − 1p < γ , which is always satisfied for p = 2 since γ = H + 12 > 12 .
(2) τ ∈ (−∞, −t1 ).
Here, we base our derivation on the factorized representation I0,2 = I0 D−γ +1 , which
γ

follows from (7.45). The operator D−γ +1 is a shift-invariant operator. Its impulse
7.5 Self-similar processes 183

response is given by
* + γ −2
−1 1 t+
ρDγ −1 (t) = F (t) = ,
(jω)γ −1 (γ − 1)

which is also the Green’s function of Dγ −1 . Using (7.47), we then obtain

hγ ,2 (t1 , τ ) = (D−γ +1 )∗ I∗0 {δ(· − t1 )}(τ ) = (ρD∨γ −1 ∗ 1(0,t1 ] )(τ ),

which implies that hγ ,2 (t1 , τ ) decays like 1/|τ |2−γ as τ → −∞ (unless γ = 1, in


which case ρId = δ). It follows that the tail of |hγ ,2 (t1 , τ )|p is integrable provided
that (2 − γ )p > 1 or, equivalently, γ < 2 − 1/p, which corresponds to the whole
range H ∈ (0, 1) for p = 2.

Taking advantage of Proposition 7.9, we invoke Theorem 7.1 to show that the gen-
eralized process BH (t) = I0 D−H+1/2 w(t) admits a classical interpretation as a random
function of t. The theorem also yields the stochastic integral representation

BH (t) = hH+1/2,2 (t, ·), w


1 " γ −1 γ −1
#
= (t − τ )+ − (−τ )+ dW(τ )
R (γ )
 0 " # t 
1
= (t − τ )γ −1 − (−τ )γ −1 dW(τ ) + (t − τ )γ −1 dW(τ ) ,
(γ ) −∞ 0
(7.48)

where the last equation with γ = H + 1/2 is the one that was originally used by Man-
delbrot and Van Ness to define fBm. Here, W(τ ) = B1/2 (τ ) is the standard Brownian
motion whose derivative in the sense of generalized functions is w.
While this result is reassuring, the real power of the distributional approach is that it
naturally lends itself to generalizations. For instance, we may extend the definition to
larger values of H by applying an additional number of integrals. The relevant adjoint
operator is In∗ ∗ n
0 = (I0 ) , whose Fourier-domain expression is

 ϕ (k) (0)ωk
ϕ (ω) − nk=0 
 dω
(I∗0 )n {ϕ}(τ ) = k!
ejωτ . (7.49)
R (−jω)n 2π
γ∗
This formula, which is consistent with the form of I2,0 in Table 7.1 with γ = n, follows
from the definition of I∗0 and the property that the Fourier transform of ϕ ∈ S (R) admits

ϕ (ω) = ∞
k
the Taylor series representation  k=0 
ϕ (k) (0) ωk! , which yields the required limits
at the origin. This allows us to consider the higher-order version of (7.43) for γ = H + 12
with H ∈ R+ \N, the solution of which is formally expressed as
1
BH = In D−(H−n)+ 2 w,
184 Sparse stochastic processes

where n = H. This construction translates into the extended version of fBm specified
by the characteristic functional
⎛  2 ⎞
 H  ϕ (k) (0)ωk 
ϕ (ω) − k=0  dω ⎟
E{e
jϕ,BH
}=P B (ϕ) = exp ⎜
⎝−
1  k! 
H
2 R  H+1/2  2π ⎠ , (7.50)
(−jω) 

where H ∈ R+ \N is the Hurst exponent of the process.


The relevant kernels are determined by using the same inversion technique as before
and are given by
γ
hγ ,p (t, τ ) = I0,p {δ(· − t)}(τ )
γ −1+1/p k γ −1−k
1 γ −1 t (−τ )+
= (t − τ )+ − , (7.51)
(γ ) k! (γ − k)
k=0

where the case of interest for fBm is p = 2 (see examples in Figure 7.7). For γ = n ∈ N,
the expression simplifies to

⎨ 1 (τ ) (t−τ )n , for 0 < t
(0,t]
hn,p (t, τ ) = In∗ {δ(· − t)}(τ ) = n!
0 ⎩ −1[t,0) (τ ) (t−τ )n , for t < 0,
n!

which, once again, is compactly supported.


The final ingredient for the theory is the extension of Proposition 7.9, which is proven
in exactly the same fashion.

PROPOSITION 7.10 The kernels defined by (7.51) can be decomposed as


 
∨ (t − ·)n
hγ ,p (t, τ ) = ρDγ −n ∗ 1(0,t] (τ ),
n!
where n = γ − 1 + 1/p. Moreover, hγ ,p (t1 , ·) ∈ Lp (R) for any fixed value t = t1 and
γ − 1 + 1p ∈ R+ \N.

For p = 2, this covers the whole range of non-integer Hurst exponents H = γ − 1/2
so that the classical interpretation of BH (t) remains applicable. Moreover, since In0 is a
right inverse of Dn , these processes satisfy the extended boundary conditions

Dk BH (0) = BH−k (0) = 0

for k = 0, . . . , H − 1.
Similarly, we can fractionally integrate the SαS noise wα to generate stable self-
similar processes that are inherently sparse for α < 2. The operation is acceptable as
γ
long as I0,p meets the Lp -stability requirement of Theorem 5.8 with p = α, which we
restate as

Hα = γ − 1 + 1
α ∈ R+ \N.

Interestingly enough, this is the same condition as in Proposition 7.10. This ensures that
the so-constructed fractional stable motions (fSm) are well defined in the generalized
7.5 Self-similar processes 185

sense (by the Minlos-Bochner theorem), as well as in the ordinary sense (by Theo-
rem 7.1). Statement (4) of Theorem 7.1 also indicates that the parameter Hα actually
represents the Hurst exponent of these processes, which is consistent with the analysis
of Section 7.5.1 (in particular, (7.41) with d = 1).
It is also possible to define stable fractal processes by considering any other variant
γ
∂τ of fractional derivative within the family of scale-invariant operators (see Proposi-
tion 5.6). Unlike in the Gaussian case where the Fourier phase is irrelevant, the fractio-
γ
nal SDE ∂τ sα,H = wα will specify a wider variety of self-similar processes with the
same overall properties.

7.5.3 Scale-invariant Poisson processes


While not strictly self-similar, the realizations of random fields obtained by solv-
ing a stochastic PDE involving a scale-invariant (homogeneous) operator and a
compound-Poisson innovation give rise to interesting patterns. After revisiting the
classical compound-Poisson process, we review Poisson fields and give two examples
of such processes and fields here, namely, those associated with the fractional Laplacian
operator and with the partial derivative ∂r1 · · · ∂rd .

Compound-Poisson processes and fractional extensions


The compound-Poisson process is a Lévy process per (7.26) with a compound-Poisson
innovation as the driving term of its SDE. Its restriction to the positive half of the real
axis finds a simple representation as
Ak 1r≥0 (r − rk ), (7.52)
k∈Z

where the sequence {rk } forms a Poisson point process in R+ and the amplitudes Ak
are i.i.d. Clearly, realizations of the above process correspond to random polynomial
splines of degree zero with random knots (see Figure 1.1). This characterization on the
half-axis is consistent with the one given on the full axis in Equation (1.14).
More generally, we may define fractional processes with a compound-Poisson inno-
vation and the fractional integrators identified in Section 5.5.1, which give rise to ran-
dom fractional splines. The subclass of these generalized Poisson processes associated
with causal fractional integrators finds the following representation as a piecewise frac-
tional polynomial on the positive half-axis:
γ −1
Ak (r − rk )+ .
k∈Z
They are whitened by the causal fractional-derivative operator Dγ whose Fourier
symbol is (jω)γ . Some examples of such processes are given in the right column of
Figure 7.6.

Scale-invariant Poisson fields


As we saw in Section 4.4.2, a compound-Poisson innovation w can be modeled as an
infinite collection of Dirac impulses with random amplitudes, with two important prop-
erties. First, the number of Diracs in any given compact neighborhood  is a Poisson
186 Sparse stochastic processes

H = 0.5 H = 0.75 H = 1.25 H = 1.75

Figure 7.8 Gaussian fBm (top row) vs. sparse generalized Poisson images (bottom row) in 2-D.
The fields in each column share the same operator and second-order statistics.

variable with parameter λVol(), where λ is a fixed density parameter and Vol()
is the volume (Lebesgue measure) of . Second, fixing  and conditioning on the
number of Diracs therein, the distribution of the position of each Dirac is uniform in 
and independent of all the other Diracs (their amplitudes are also independent). We can
represent a realization of w symbolically as

w(r) = Ak δ(r − rk ),
k∈Z

where Ak are the i.i.d. amplitudes with some generalized probability density pA , the
rk are the positions coming from a Poisson innovation (i.e., fulfilling the above require-
ments) independent from the Ak , and the ordering (the index k) is insignificant. We recall
that, by Theorem 4.9, the characteristic functional of the compound-Poisson process is
related to the density of the said amplitudes by
 " # 

PwPoisson (ϕ) = exp λ e jaϕ(r)
− 1 pA (a) da dr .
Rd R

We shall limit ourselves here to amplitude distributions that have, at a minimum, a finite
first moment.
The combination of P w
Poisson with the operators noted previously involves no addi-
tional subtleties compared to the case of α-stable innovations, at least for the processes
with finite mean that we are considering here. It is, however, noteworthy to recall that
the compound-Poisson Lévy exponents are p-admissible with p = 1 and, in the special
case of symmetric and finite-variance amplitude distributions, for any p ∈ [1, 2] (see
Proposition 4.5). The inverse operator with which P w
Poisson is composed therefore needs
to be chosen/constructed such that it maps S (R ) continuously into Lp (Rd ). Such op-
d

erators were identified in Section 5.4.2 for integer-order SDEs and Sections 5.5.3, 5.5.1
for fractional-order derivatives and Laplacians.
7.6 Bibliographical notes 187

The generic form of such Poisson processes is

sPoisson (r) = p0 (r) + Ak ρL (r − rk ),


k∈Z

where ρL is the Green’s function of L and p0 (r) ∈ NL is a component in the null space
of the whitening operator. The reason for the presence of p0 is to satisfy the boundary
conditions at the origin that are imposed upon the process.
We provide examples of the outcome of applying inverse fractional Laplacians to a
Poisson field in Figure 7.8. The fact that the Green’s function of the underlying fractio-
nal Laplacian is proportional to rH−1 explains the change of appearance of these pro-
cesses when switching from the singular mode with H < 1 to the continuous one with
H > 1. As H increases further, the generalized Poisson processes become smoother and
more similar visually to their fractal Gaussian counterparts (fractional Brownian field),
which are displayed on the upper row of Figure 7.8 for comparison. Next, we shall look
at another interesting example, the Mondrian processes.

The Mondrian process


The Mondrian process is the name by which we refer to the process associated with
the innovation model with operator L = ∂r1 · · · ∂rd and a sparse innovation – typically, a
compound-Poisson one [UT11]. The instance of primary interest to us is the 2-D variety
with L = ∂r1 ∂r2 . In this case, the Green’s function of L is the indicator function of the
positive quarter-plane, given by

ρL (r) = 1r1 ,r2 ≥0 (r),

which we may use to define an admissible inverse of L in the positive quarter-plane.


Similar to the compound-Poisson processes from Section 7.5.3, the Mondrian process
finds a simple description in the positive quarter-plane as the random sum

Ak 1r1 ,r2 ≥0 (r − rk ),
k∈Z

in direct parallel to (7.52), where the rk are distributed uniformly in any neighborhood
in the positive quarter-plane (Poisson point distribution).
A sample realization of this process, which bears some resemblance to paintings by
Piet Mondrian, is shown in Figure 7.9.

7.6 Bibliographical notes

Section 7.1
The first-order processes considered in Section 7.1 are often referred to as non-Gaussian
Ornstein–Uhlenbeck processes; one of their privileged areas of application is financial
modeling [BNS01]. The representation of the autocorrelation in terms of B-splines is
discussed in [KMU11, Section II.B], albeit in the purely Gaussian context. The fact
188 Sparse stochastic processes

Figure 7.9 Realization of the Mondrian process with λ = 1/30.

that exponential B-splines provide the formal connection between ordinary continuous-
domain differential operators and their discrete-domain finite-difference counterpart
was first made explicit in [Uns05].

Section 7.2
Our formulation relies heavily on Gelfand and Vilenkin’s theory of generalized sto-
chastic processes, in which stationarity plays a predominant role [GV64]. The main
analytical tools are the characteristic and the correlation functionals which have been
used with great effect by Russian probabilists in the 1970s and 1980s (e.g., [Yag86]),
but which are lesser known in Western circles. The probable reason is that the domi-
nant paradigm for the investigation of stochastic processes is measure-theoretic – as
opposed to distributional – meaning that processes are defined through stochastic inte-
grals (with the help of Itô’s calculus and its non-Gaussian variants) [Øks07, Pro04].
Both approaches have their advantages and limitations. On one hand, the Itô calculus
can handle certain non-linear operations on random processes that cannot be dealt with
in the distributional approach. Gelfand’s framework, on the other hand, is ideally suited
for performing any kind of linear operations, including some, such as fractional deriva-
tives, which are much more difficult to define in the other framework. Theorem 7.1 is
fundamental in that respect because it provides the higher-level elements for performing
the translation between the two modes of representation.
7.6 Bibliographical notes 189

Section 7.3
The classical theory of stationary processes was developed for the most part in the
1930s. The concept of (second-order) stationarity is due to Khintchine, who established
the correlation properties of such processes [Khi34]. Other key contributors include
Wiener [Wie30], Doob [Doo37], Cramér [Cra40], and Kolmogorov [Kol41]. Classical
textbooks on stochastic processes are [Bar61,Doo90,Wie64]. The manual by Papoulis is
very popular among engineers [Pap91]. The processes described by Proposition 7.2 are
often called linear processes (under the additional finite-variance hypothesis) [Bar61].
Their conventional representation as a stochastic integral is
t
s(t) = h(t − τ ) dW(τ ) = h(t − τ ) dW(τ ), (7.53)
R −∞

where W(τ ) = 1[0,τ ) , w is a stationary finite-variance process with independent incre-


ments (that is, a second-order Lévy process) and where h ∈ L1 (R) is such that h(t) = 0,
for t < 0 (causality). Bartlett provides an expression for their characteristic functional
that is formally equivalent to the result given in Proposition 7.2 [Bar61, Section 5.2,
Equations (24)–(25)]. His result can be traced back to 1951 [BK51, Equation (34)] but
does not address the issue of existence. It involves the auxiliary function
*  1 +
KW (ω) = log E exp jω dW(τ ) ,
0

which coincides with the Lévy exponent f of our formulation (see Proposition 4.12). A
special case of the model is the so-called shot noise, which is obtained by taking w to
be a pure Poisson noise [Ric77, BM02].
Implicit in Property 7.3 is the fact that the autocorrelation of a second-order stationary
process is a cardinal L∗ L-spline. This observation is the key to proving the equivalence
between smoothing-spline techniques and linear mean-square estimation in the sense of
Wiener [UB05b]. It also simplifies the identification of the whitening operator L from
sampled data [KMU11].
The Gaussian CARMA processes are often referred to as Gaussian stationary
processes with rational spectral density. They are the solutions of ordinary sto-
chastic differential equations and are treated in most books on stochastic processes
[Bar61, Doo90, Pap91]. They are usually defined in terms of a stochastic integral such
as (7.53) where h is the impulse response of the underlying Nth-order differential
system and W(t) a standard Brownian motion. Alternatively, one can solve the SDE
using state-space techniques; this is the approach taken by Brockwell to characterize
the extended family of Lévy-driven CARMA processes [Bro01]. Brockwell makes
the assumption that E{|W(1)| } < ∞ for some  > 0, which is equivalent to our
Lévy–Schwartz admissibility condition (see (9.10) and surrounding text) and makes the
formulations perfectly compatible. The CARMA model can also be extended to N ≤ M,
in which case it yields stationary processes that are no longer defined pointwise. The
Gaussian theory of such generalized CARMA processes is given in [BH10]. In light of
what has been said, such results are also transposable to the more general Lévy setting,
190 Sparse stochastic processes

since the underlying convolution operator remains S -continuous (under the stability
assumption that there is no pole on the imaginary axis).

Section 7.4
The founding paper that introduces Lévy processes – initially called additive processes –
is the very same that uncovers the celebrated Lévy–Khintchine formula and specifies the
family of α-stable laws [Lév34]. We can therefore only concur with Loève’s statement
about its historical importance (see Section 1.4). Besides Lévy’s classical monograph
[Lév65], other recommended references on Lévy processes are [Sat94, CT04, App09].
The book by Cont and Tankov is a good starting point, while Sato’s treatise is remark-
able for its precision and completeness.
The operator-based method of resolution of unstable stochastic differential equa-
tions that is deployed in this section was developed by the authors with the help of
Qiyu Sun [UTS14]. The approach provides a rigorous backing for statements such as
“a Lévy process is the integral of a non-Gaussian white noise,” “a continuous-domain
non-Gaussian white noise is the (weak) derivative of a Lévy process,” or “a Lévy pro-
cess is a non-Gaussian process with a 1/ω spectral behavior,” which are intuitively
appealing, but less obviously related to the conventional definition. The main advantage
of the framework is that it allows for the direct transposition of deterministic methods
for solving linear differential equations to the stochastic setting, irrespective of any
stability considerations. Most of the examples of sparse stochastic processes are taken
from [UTAK14].

Section 7.5
Fractional Brownian motion (fBm) with Hurst exponent 0 < H < 1 was introduced
by Mandelbrot and Van Ness in 1968 [MVN68]. Interestingly, there is an early paper
by Kolmogorov that briefly mentions the possibility of defining such stochastic objects
[Kol40]. The multidimensional counterparts of these processes are fractional Brownian
fields [Man01], which are also solution of the fractional SDE (− )γ /2 s = w where w
is a Gaussian white noise [TVDVU09]. The family of stable self-similar processes with
0 < H < 1 is investigated in [ST94], while their higher-order extensions are briefly
considered in [Taf11].
The formulation of fBms as generalized stochastic processes was initiated by Blu
and Unser in order to establish an equivalence between MMSE estimation and spline
interpolation [BU07]. A by-product of this study was the derivation of (7.42) and (7.50).
The latter representation is compatible with the higher-order generalization of fBm for
H ∈ R+ \Z+ proposed by Perrin et al. [PHBJ+ 01], with the underlying kernel (7.51)
being the same.
The scale-invariant Poisson processes of Section 7.5.3 were introduced by the authors
[UT11]; they are the direct stochastic counterparts of spline functions where the knots
and the strength of the singularities are assigned in a random fashion. The examples,
including the Mondrian process, are taken from that paper on sparse stochastic
processes.
8 Sparse representations

In order to obtain an uncoupled representation of a sparse process s = L−1 w of the


type described in the previous chapters, it is essential that we somehow invert the
integral operator L−1 . The ideal scenario would be to apply the differential operator
L = (L−1 )−1 to uncover the innovation w that is independent at every point. Unfortu-
nately, this is not feasible in practice because we do not have access to the signal s(r)
over the entire domain r ∈ Rd , but only to its sampled values on a lattice or, more
generally, to a series of coefficients in some appropriate basis. Our analysis options, as
already alluded to in Chapter 2, are essentially twofold: the application of a discrete
version of the operator, or an operator-like wavelet analysis.
This chapter is devoted to the in-depth investigation of these two modes of represen-
tation. As with the other chapters, we start with a concrete example (Lévy process) to lay
down the key ideas in Section 8.1. Our primary tool for deriving the transform-domain
pdfs is the characteristic functional, reviewed in Section 8.2 and properly extended so
that it can handle arbitrary analysis functions ϕ ∈ Lp (Rd ). In Section 8.3, we investi-
gate the decoupling ability of finite-difference-type operators and determine the statis-
tical distribution of the resulting generalized increments. In Section 8.4, we show how
a sparse process can be expanded in a matched-wavelet basis and provide the complete
multivariate description of the transform-domain statistics, including general formulas
for the wavelet cumulants. Finally, in Section 8.5, we apply these methods to the rep-
resentation of first-order processes. In particular, we focus on the sampled version of
these processes and adopt a matrix-vector formalism to make the link with traditional
transform-domain signal-processing techniques. Our main finding is that operator-like
wavelets generally outperform the classical sinusoidal transforms (DCT and KLT) for
the decoupling of sparse AR(1) and Lévy processes and that they essentially provide an
independent-component analysis (ICA).

8.1 Decoupling of Lévy processes: finite differences vs. wavelets

To expose the reader to the concepts, it is instructive to return to our introductory


example in Chapter 1: the Lévy process denoted by W(t). As we already saw, the fun-
damental property of W(t) is that its increments are independent (see Definition 7.7).
192 Sparse representations

When the Lévy process is sampled uniformly on the integer grid, we have access to its
equispaced increments

u[k] = W(t)|t=k − W(t)|t=k−1 , (8.1)

which are i.i.d. Moreover, due to the property that W(0) = 0 (almost surely), the relation
can be inverted as
k
W(k) = u[n], (8.2)
n=1

which provides a convenient recipe for generating Lévy processes.


To see how this fits the formalism of Chapter 7, we recall that the corresponding
whitening operator is D = dtd = P0 . Its discrete counterpart is the finite-difference
operator Dd = 0 . This notation is consistent with the pole vector of a (basic) Lévy
process being α = (0).
According to the generic procedure outlined in Section 7.4.4, we then specify the
increment process associated with W(t), t ∈ R as

u0 (t) = 0 W(t) = (β0 ∗ w)(t) = β0∨ (· − t), w , (8.3)

where β0 = βD = β+0 is the rectangle B-spline defined by (1.4) (or (6.21) with α = 0) and
w is an innovation process with Lévy exponent f. Based on Proposition 7.4, we deduce
that u0 (t), which
 is defined
 for all t ∈ R, is stationary with characteristic functional
Pu (ϕ) = P w β ∨ ∗ϕ , where Pw is defined by (7.21). Let us now consider the random
0 0
variable U = δ(· − k), s = u[k]. To obtain its characteristic function, we rely on the
theoretical framework of Section 8.2 and simply substitute ϕ = ωδ(· − k) in P u (ϕ),
0
which yields
   

pU (ω) = Pw ωβ ∨ (· − k) = P w ωβ ∨ (by stationarity)
0 0
 
 ∨ 
= exp f ωβ0 (t) dt
R
= ef (ω) = 
pid (ω). (8.4)

The disappearance of the integral results from the binary nature of β0∨ and the fact that
f (0) = 0. This shows that the increments of the Lévy process are infinitely divisible (id)
with canonical pdf pU = pid , which corresponds to the observation of the innovation
through a unit rectangular window (see Proposition 4.12). Likewise, we find that the
joint characteristic function of U1 = β0∨ (· − k1 ), w = u[k1 ] and U2 = β0∨ (· − k2 ), w =
u[k2 ] for any |k1 − k2 | ≥ 1 factorizes as
 
 w ω1 β ∨ (· − k1 ) + ω2 β ∨ (· − k2 )
pU1 ,U2 (ω1 , ω2 ) = P 0 0
pU (ω1 ) 
= pU (ω2 ),

pU is given by (8.4). Here, we have used the fact that the supports of β0∨ (· − k1 )
where 

and β0 (· − k2 ) are disjoint, together with the independence at every point of w
(Proposition 4.11). This implies that the random variables U1 and U2 are independent
8.1 Decoupling of Lévy processes 193

for all pairs of distinct indices (k1 , k2 ), which proves that the sequence of Lévy incre-
ments {u[k]}k∈Z is i.i.d. with pdf pU = pid . The bottom line is that the decoupling of the
samples of a Lévy process afforded by (8.1) is perfect and that this transformation is
reversible.
The alternative is to expand the Lévy process in a wavelet basis that is matched to the
operator L = D. To that end, we revisit the Haar analysis of Section 1.3.4 by applying
the generalized wavelet framework of Section 6.5.3 with d = 1 and (scalar) dilation
D = 2. The required ingredient is the D∗ D-interpolator with respect to the grid 2i−1 Z:
ϕint,i−1 (t) = β(0,0) (t/2i−1 + 1), which is a symmetric triangular function of unit height
and support [−2i−1 , 2i−1 ] (piecewise-linear B-spline). The specialized form of (6.54)
then yields the derivative-like wavelets

ψi,k (t) = 2i/2−1 D∗ ϕint,i−1 (t − 2i−1 k)


= D∗ φi (t − 2i−1 k),

where φi (t) = 2i/2−1 ϕint,i−1 (t) ∝ φ0 (t/2i ) is the normalized triangular kernel at reso-
lution level i and k ∈ Z \ 2Z. The difference with the classical Haar-wavelet formulas
(1.19) and (1.20) is that the polarity is reversed (since D∗ = −D) and the location index
k restricted to the set of odd integers. Apart from this change in indexing convention,
required by the general (multidimensional) formulation, the two sets of basis functions
are equivalent (see Figure 1.4). Next, we recall that the Lévy process can be represented
as s = I0 w, where I0 is the integral operator defined by (5.4). This allows us to express
its Haar wavelet coefficients as

Vi [k] = ψi,k , s
= D∗ φi (· − 2i−1 k), I0 w
= I∗0 D∗ φi (· − 2i−1 k), w (by duality)
= φi (· − 2 i−1
k), w , (left-inverse property of I∗0 )

which is very similar to (8.3), except that the rectangular smoothing kernel β0∨ is
replaced by a triangular one. By considering the continuous version Vi (t) = φi (·−t), w
of the wavelet transform at scale i, we then invoke Proposition 7.2 to show that Vi (t)
is stationary with characteristic functional P V (ϕ) = P w (φi ∗ ϕ). Moreover, since
i
the smoothing kernels φi (· − 2 k) for i fixed and odd indices k are not overlapping,
i−1

we deduce that the sequence of Haar-wavelet coefficients {Vi [k]}k∈Z\2Z is i.i.d. with
characteristic function  w (ωφi ).
pVi (ω) = E{ejωφi ,w } = P
If we now compare the wavelet situation for i = 1 with that of the Lévy increments,
we observe that the wavelet analysis involves one more layer of smoothing of the inno-
vation with β0 (since φ1 = ϕint,0 = β0 ∗ β0∨ ), which slightly complicates the statistical
calculations. For i > 1, there is an additional coarsening effect which has some interest-
ing statistical implications (see Section 9.8).
While the smoothing effect on the innovation is qualitatively the same in both sce-
narios, there are fundamental differences too. In the wavelet case, the underlying dis-
crete transform is orthogonal, but the coefficients are not fully decoupled because of
194 Sparse representations

the unavoidable inter-scale dependencies, as we shall see in Section 8.4.1. Conversely,


the finite-difference transform (8.1) is optimal for decoupling Lévy processes, but the
representation is not orthogonal. In this chapter, we show that this dual way of expand-
ing signals remains applicable for the complete family of sparse processes under mild
conditions on the whitening operator L. While the consideration of higher-order opera-
tors makes us gain in generality, the price to pay is a slight loss in the decoupling ability
of the transforms, because longer basis functions (and B-splines) tend to overlap more.
Those limitations notwithstanding, we shall see that the derivation of the transform-
domain statistics is in all points analogous to the Lévy scenario. In addition, we shall
provide the tools for assessing the residual dependencies that are unavoidable when the
random processes are non-Gaussian.

8.2 Extended theoretical framework

Fundamentally, a generalized stochastic process s is only observable indirectly through


its inner products Xn = ϕn , s with some family {ϕn } of test functions. Conversely, any
linear measurement of a conventional or generalized stochastic process s is expressible
as Y = ψ, s with some suitable kernel ψ ∈ S  (Rd ). We also recall that a generalized
stochastic process on S  (Rd ) is completely characterized by its characteristic func-
tional Ps (ϕ) = E{ejϕ,s } with ϕ ∈ S (Rd ) (see Minlos–Bochner Theorem 3.9). The
combination of these two ingredients suggests a simple, powerful mechanism for the
determination of the characteristic function of the random variable Y = ψ, s by formal
substitution of ϕ = ωψ in the characteristic functional of s. This leads to

pY (ω) = E{ejωY } = E{ejωψ,s } (by definition)
= E{e }
jωψ,s
(by linearity)
=Ps (ωψ), (by identification)
where ψ is fixed and ω ∈ R plays the role of the Fourier-domain variable. The same
approach is extendable to the determination of the joint pdf of any finite collection of
linear measurements {Yn = ψn , s }Nn=1 (see Proposition 3.10).
This is very nice in principle, except that most of the kernels ψn considered in this
chapter are non-smooth and almost systematically outside of S (Rd ). To properly handle
this issue, we need to extend the domain of P s to a larger class of (generalized) func-
tions, as explained in the next two sections.

8.2.1 Discretization mechanism: sampling vs. projections


For practical purposes, it is desirable to represent a stochastic process s by a finite num-
ber of coefficients that can be stored and processed in a computer. The discretization
methods can be divided in two categories. The first is a representation of s in terms of
values that are sampled uniformly, which is formally described as
s[k] = δ(· − k), s .
8.2 Extended theoretical framework 195

The implicit assumption here is that the process s has a sufficient degree of continuity
for its sample values to be well defined. Since s = L−1 w, it is actually sufficient that the
operator L−1 satisfies some mild regularity conditions. The simplest scenario is when
L−1 is LSI with its impulse response ρL belonging to some function space X ⊆ L1 (Rd ).
Then,

s[k] = δ(· − k), s = δ(· − k), ρL ∗ w = ρL∨ (· − k), w .

Now, if the characteristic functional of the noise admits a continuous extension from
S (Rd ) to X (specifically, X = R (Rd ) or X = Lp (Rd ), as shown in Section 8.2.2), we
w (ϕ).
obtain the characteristic function of s[k] by replacing ϕ ∈ X by ωρL∨ (· − k) in P

The continuity and positive definiteness of Pw ensure that the sample values of the
process have a well-defined probability distribution (by Bochner’s Theorem 3.7) so that
they can be interpreted as conventional random variables. The argument carries over to
the non-shift-invariant scenario as well, by replacing the samples by their generalized
increments. This is equivalent to sampling the generalized increment process u = Ld s,
which is stationary by construction (see Proposition 7.4).
The second discretization option is to expand s in a suitable basis whose expansion
coefficients (or projections) are given by

Yn = ψn , s ,

where the ψn are appropriate analysis functions (typically, wavelets). The argument
concerning the existence of the underlying expansion coefficients as conventional ran-
dom variables is based on the following manipulation:

Yn = ψn , s = ψn , L−1 w = L−1∗ ψn , w .

The condition for admissibility, which is essentially the same as in the previous
w admits an
scenario, is that φn = L−1∗ ψn ∈ X , subject to the constraint that P
extension that is continuous and positive definite over X . This is consistent with
the notion of L-admissible wavelets (Definition 6.7), which asks that ψ = L∗ φ with
φ ∈ X ⊆ L1 (Rd ).

8.2.2 Analysis of white noise with non-smooth functions


Having motivated the necessity of extending the domain of P w (ϕ) to the largest pos-
sible class of functions X , we now proceed with the mathematics. In particular, it is
crucial to remove the stringent smoothness requirement associated with S (Rd ) (infinite
order of differentiability), which is never met by the analysis functions used in prac-
tice. To that end, we extend the basic continuity result of Theorem 4.8 to the class of
functions with rapid decay, which can be done without restriction on the Lévy exponent
f. This takes care, in particular, of the cases where ϕ = ψn is bounded with compact
support, which is typical for an Nth-order B-spline or a wavelet basis function in 1-D.
196 Sparse representations

P R O P O S I T I O N 8.1 Let f be a Lévy–Schwartz exponent as specified in Theorem 4.8.


Then, the Lévy noise functional
 
 
w (ϕ) = exp
P f ϕ(r) dr
Rd

is continuous and positive definite over R (Rd ).

The proof is the same as that of Theorem 4.8. It suffices to replace all occurrences of
S (Rd ) by R (Rd ) and to restate the pointwise inequalities of the proof in the “almost
everywhere” sense. The reason is that the only property that matters in the proof is the
decay of ϕ, while its smoothness is irrelevant to the argument.
As we show next, reducing the constraints on the decay of the analysis functions is
possible too, by imposing additional conditions on f.

T H E O R E M 8.2 If f is a p-admissible Lévy exponent (see Definition 4.4) for some


p ≥ 1, then the Lévy-noise functional
 
 
w (ϕ) = exp
P f ϕ(r) dr
Rd

is continuous and positive definite over Lp (Rd ).

Proof Since the exponential function is continuous with exp(0) = 1, it is sufficient to


show that the functional

 
w (ϕ) =
F(ϕ) = log P f ϕ(r) dr
Rd

is continuous over Lp (Rd ) and conditionally positive definite of order one (see
Definition 4.6). To that end, we first observe that F(ϕ) is well defined for all ϕ ∈ Lp (Rd )
with
  
|F(ϕ)| ≤ f ϕ(r)  dr ≤ Cϕpp ,
Rd
 
due to the p-admissibility of f. This assumption also implies that |ω|f  (ω) ≤ C|ω|p ,
which translates into
 ω2 
 
|f (ω2 ) − f (ω1 )| =  f  (ω) dω
ω1
 ω2 
 
≤ C ωp−1 dω ≤ C max(|ω1 |p−1 , |ω2 |p−1 ) |ω2 − ω1 |
ω1

≤ C (|ω1 |p−1 + |ω2 − ω1 |p−1 ) |ω2 − ω1 |,


8.3 Generalized increments 197

where we have used the fact that max(a, b) ≤ |a| + |b − a|. Next, we consider a
convergent sequence {ϕn }∞ d
n=1 in Lp (R ) whose limit is denoted by ϕ. We then have
that
  " #

F(ϕn ) − F(ϕ) ≤ C |ϕ(r)|p−1 |ϕn (r) − ϕ(r)| + |ϕn (r) − ϕ(r)|p dr
"R
d
#
≤ C ϕp ϕn − ϕp + ϕn − ϕp ,
p−1 p

p
(by Hölder’s inequality with q = p−1 )

where the right-hand side converges to zero as limn→∞ ϕn − ϕp = 0, which proves
the continuity of F on Lp (Rd ). The second part of the statement is a direct conse-
quence of the conditional positive definiteness of f. Indeed, for every choice ϕ1 , . . . ,
ϕN ∈ Lp (Rd ), ξ1 , . . . , ξN ∈ C, and N ∈ Z+ , we have that

N N N N
 
F(ϕm − ϕn )ξm ξ n = f ϕm (r) − ϕn (r) ξm ξ n dr ≥ 0
m=1 n=1 Rd m=1 n=1
  
≥0

N
subject to the constraint n=1 ξn = 0.

8.3 Generalized increments for the decoupling of sample values

We recall that the generalized increment process associated with the stochastic process
s = L−1 w is defined as

u(r) = Ld s(r) = dL [k]s(r − k),


k∈Zd

where Ld is the discrete-domain version of the operator L. It is a finite-difference-like


operator that involves a suitable sequence of weights dL ∈ 1 (Zd ).
The motivation for computing u = Ld s is to partially decouple s and to remove its non-
stationary components in the event that the inverse operator L−1 is not shift-invariant.
The implicit requirement for producing stationary increments is that the continuous-
domain composition of the operators Ld and L−1 is shift-invariant with impulse res-
ponse βL ∈ L1 (Rd ) (see Proposition 7.4). Our next result provides a general criterion
for checking that this condition is met. It ensures that the scheme is applicable when-
ever the operator L is spline-admissible (see Definition 6.2) with associated generalized
B-spline βL , the construction of which is detailed in Section 6.4. The key, of course, is
in the selection of the appropriate “localization” operator Ld .

PROPOSITION 8.3 Let ρL = F −1 {1/ L(ω)} ∈ S  (Rd ) be the Green’s function of


a spline-admissible operator L with frequency response 
L(ω) and associated B-spline
198 Sparse representations


βL = Ld ρL ∈ L1 (Rd ). Then, the operator L−1 : ϕ(r) → Rd h(r, r )ϕ(r ) dr is a valid
right inverse of L if its kernel is of the form
h(r, r ) = L−1 {δ(· − r )}(r) = ρL (r − r ) + p0 (r; r ),
where p0 (r; r ) with r fixed is included in the null space NL of L. Moreover, we have
that
Ld L−1 ϕ = βL ∗ ϕ (8.5)
L−1∗ L∗d ϕ = βL∨ ∗ϕ (8.6)
for all ϕ ∈ S (Rd ).
Proof The equivalent kernel (see kernel Theorem 3.1) of the composed operator
LL−1 is
 
LL−1 {δ(· − r )} = L h(·, r )
 
= L ρL (· − r ) + p0 (·; r )
= L{ρL (· − r )} + L{p0 (·; r )}
  
0

= δ(· − r ),
which proves that LL−1 = Id, and hence that L−1 is a valid right inverse of L. Next, we
consider the composed operator Ld L−1 and apply the same procedure to show that
Ld L−1 {δ(· − r )} = βL (· − r ),
where we have used the property that Ld p0 (·; r ) = 0 since NL is included in the null
space of Ld (see (6.24) and text immediately below). It follows that

Ld L−1 {ϕ} = βL (· − r )ϕ(r ) dr = βL ∗ ϕ,


Rd

which is a well-defined convolution operation since βL ∈ L1 (Rd ). Finally, Equation


(8.6), which involves the corresponding left inverse L−1∗ of L∗ , is simply the adjoint of
the latter relation.
A nice illustration of the first part of Proposition 8.3 is the specific form (7.51) of the
kernel associated with fractional Brownian motion and its non-Gaussian variants (see
Section 7.5.2).
As far as generalized B-splines are concerned, the proposition implies that the formal
definition βL = Ld L−1 δ is invariant to the actual choice of the right-inverse operator
L−1 when NL ∩ S  (R d
 ) is non-trivial.
 This general form is an extension of the usual
definition βL = Ld ρL see (6.25) that involves the (unique) shift-invariant right inverse
of L whose impulse response is ρL .
From now on, we shall therefore assume that the inverse operator L−1 that acts on
the innovation w fulfills the conditions in Proposition 8.3. We can then apply (8.5) to
derive the convolution formula
u = Ld s = βL ∗ w. (8.7)
8.3 Generalized increments 199

Since this relation is central to our argument, we want to detail the explicit steps of its
derivation. Specifically, for all ϕ ∈ S (Rd ), we have that
ϕ, u = ϕ, Ld s = ϕ, Ld L−1 w
= L−1∗ L∗d ϕ, w (by duality)
= βL∨ ∗ ϕ, w . (from (8.6))
This in turn implies that the characteristic functional of u = Ld s is given by
u (ϕ) = P
P w (β ∨ ∗ ϕ). (8.8)
L

Consequently, the resulting generalized increment process is stationary even when


the original process s = L−1 w is not, which happens when L−1 fails to be shift-
invariant. The intuition for this property is that Ld annihilates the signal components
that are in the null space of L. Indeed,the form of the corrected left-inverse operators of
Chapter 5 e.g., the integrator of (5.4) suggests that the non-stationary part of the signal
corresponds to the null-space components (exponential polynomials) that are added to
the solution during the inversion to fulfill the boundary conditions when the underlying
SDE is unstable. The result also indicates that the decoupling effect will be strongest
when the convolution kernel βL (r) is the most localized and closest to an impulse. The
challenge is therefore to select Ld such that βL is the most concentrated around the
origin. This is precisely what the designer of “good” B-splines tries to achieve, as we
saw in Chapter 6.
The final step is the discretization where u = Ld s is sampled uniformly, which yields
the sequence of generalized increments
u[k] = Ld s(r)|r = k = (dL ∗ s)[k].
The practical interest of the last formula is that {u[k]}k∈Zd can be computed from the
sampled values s[k] = δ(· − k), s , k ∈ Zd , of the initial process s via a simple discrete
convolution with dL (digital filtering). In most cases, the mapping
 from
 s[·] to u[·] can
also be inverted, as we saw in the case of the Lévy process see (8.2) .

8.3.1 First-order statistical characterization


To determine the pointwise statistical description of the generalized increment process
u, we consider the random variable U = βL∨ , w . Its characteristic function  pU (ω) is
obtained by plugging ϕ = ωδ into the characteristic functional (8.8). In this way, we are
able to specify its first-order pdf via the inverse Fourier transformation
  
∨ dω
pU (x) = e Rd f ωβL (r) dr e−jωx .
R 2π
This results in an id pdf with the modified Lévy exponent
 
fβL (ω) = f ωβL (−r) dr
Rd
 
= f ωβL (r) dr. (by change of variable)
Rd
200 Sparse representations

For instance, in the case of a symmetric-alpha-stable (SαS) innovation with f (ω) =


− α!
1
|s0 ω|α and dispersion parameter s0 , we find that pU is SαS as well, with new disper-
sion parameter sU = s0 βL Lα , where βL Lα is the (pseudo) Lα -norm of the B-spline.
In particular, for α = 2 this shows that the generalized increments of a Gaussian process
are Gaussian as well, with a variance that is proportional to the squared L2 -norm of the
B-spline associated with the whitening operator L.

8.3.2 Higher-order statistical dependencies


As already mentioned, the decoupling afforded by Ld is not perfect. In the spatial (or
temporal) domain, the remaining convolution of w with βL induces dependencies over
the spatial neighborhood that intersects with the support of the B-spline. When this
support is greater than the sampling step (T = 1), this introduces residual coupling
among the elements of u[·]. This effect can be accounted for exactly by considering
a K-point neighborhood K (k0 ) of k0 on the Cartesian lattice Zd and by defining the
corresponding K-dimensional vector u[k0 ] = (U1 , . . . , UK ) whose components are the
values u[k] with k ∈ K (k0 ). The Kth-order joint pdf of u[k0 ] is denoted by p(U1 :UK ) (u).
Since the generalized-increment process u = Ld s is stationary, p(U1 :UK ) (u) does not
depend on the (absolute) location k0 . We determine its K-dimensional Fourier transform

(Kth-order characteristic function) by making the substitution ϕ = k∈K ωk δ(· − k) in
(8.8), which yields
⎛ ⎞
 
p(U1 :UK ) (ω) = E{ejω,u } = exp ⎝
 f ωk βL∨ (r − k) dr⎠ ,
Rd k∈K

where ω is the K-dimensional frequency variable with components ωk where k ∈


K = K (0). In addition, when the innovation fulfills the second-order conditions of
Proposition 4.15, the autocorrelation function of the generalized-increment process is
given by
% &

RLd s (r) = E u(· + r)u(·) = σw2 (βL ∗ β L )(r). (8.9)

Hence, the correlation between the discretized increments is simply


  ∨
E u[k1 ]u[k2 ] = σw (βL ∗ β L )(k1 − k2 ).
2

This points once more to the importance of selecting the discrete operator Ld such that
the support of βL = Ld ρL is minimal or decays as fast as possible when a compact
support is not achievable.
This analysis clearly shows that the support of the B-spline governs the range of
dependency of the generalized increments with the property that u[k1 ] and u[k2 ] are
independent whenever |k1 − k2 | ≥ support(βL ). In particular, this implies that the
sequence u[·] is i.i.d. if and only if support(βL ) ≤ 1, which is precisely the case for
the first-order B-splines βα with α ∈ C, which go hand-in-hand with the Lévy (α = 0)
and AR(1) processes.
8.3 Generalized increments 201

8.3.3 Generalized increments and stochastic difference equations


To illustrate the procedure, we now focus on the extended family of generalized Lévy
processes sα of Section 7.4.2. These are solutions of ordinary differential equations
with rational transfer functions specified by their poles α = (α1 , . . . , αN ) and zeros
γ = (γ1 , . . . , γM ) with M < N. The extension over the classical CARMA framework
is that the poles can be arbitrarily located in the complex plane
 so that the underlying
system may be unstable. The associated B-spline is given by see (7.36)
 
$
N
1 − eαn −jω dω
jωt
βL (t) = qM (D)βα (t) = e qM (jω) , (8.10)
R jω − αn 2π
n=1

where qM (ζ ) = bM M m=1 (ζ − γm ) and βα is the exponential B-spline defined by (7.34).
The generalized-increment process is then given by

uα (t) = α sα (t) = (βL ∗ w)(t), (8.11)

where α is the finite-difference operator defined by (7.35). A more explicit charac-


terization of α is through its weights dα , whose representation in the z-transform
domain is
N $
N
Dα (z) = dα [k]z−k = (1 − eαn z−1 ). (8.12)
k=0 n=1

By sampling (8.11) at the integers, we obtain


N
uα [k] = uα (t)|t=k = dα [n]sα [k − n]. (8.13)
n=0

In light of this relation, it is tempting to investigate whether it is possible to decorre-


late uα [·] even further in order to describe sα [·] through a discrete ARMA-type model.
Ideally, we would like to come up with an equivalent discrete-domain innovation model
that is easier to exploit numerically than the defining stochastic differential equation
(7.19). To that end, we perform the spectral factorization of the sampled augmented
B-spline kernel
N
BL (z) = βLL∗ (k)z−k = B+L (z)B−
L (z), (8.14)
k=−N
N−1 +
where B+L (z) = k=0 bL [k]z−k = B− −1
L (z ) specifies a causal finite-impulse-response

(FIR) filter of size N.  The crucial point for the argument below is that B+L (ejω ) or,
equivalently, BL (ejω ) is non-vanishing. To that end, we invoke the Riesz-basis property
of an admissible B-spline (see Definition 6.8) together with (6.19), which implies that

BL (ejω ) = L (ω + 2πn)|2 ≥ A2 > 0.



n∈Z
202 Sparse representations

P R O P O S I T I O N 8.4 (Stochastic difference equation) The sampled version sα [·] of a


generalized second-order Lévy process with pole vector α and associated B-spline βL
satisfies the discrete ARMA-type whitening equation
N N−1
dα [n]sα [k − n] = b+L [m]e[k − m],
n=0 m=0

where dα and b+L are defined by (8.12) and (8.14), respectively. The driving term e[k]
is a discrete stationary white noise (“white” meaning fully decorrelated or with a flat
power spectrum). However, e[k] is a valid innovation sequence with i.i.d. samples only
if the corresponding continuous-domain process is Gaussian, or, in full generality (i.e.,
in the non-Gaussian case), if it is a first-order Markov or Lévy-type process with N = 1.
Proof Since $|B_L^{+}(e^{j\omega})|^2 = B_L(e^{j\omega})$ is non-vanishing and is a trigonometric polynomial
of $e^{j\omega}$ whose roots are inside the unit circle, we have the guarantee that the inverse
filter whose frequency response is $\frac{1}{B_L^{+}(e^{j\omega})}$ is causal-stable. It follows that the power spectrum of e is

$$\Phi_e(e^{j\omega}) = \frac{\sum_{n\in\mathbb{Z}} |\hat{\beta}_L(\omega + 2\pi n)|^2}{B_L(e^{j\omega})}\, \sigma_w^2 = \sigma_w^2,$$

which proves the first part of the statement. As for the second
part, we recall that decorrelation is equivalent to independence in the Gaussian case
only. In the non-Gaussian case, the only way to ensure independence is by restricting
ourselves to a first-order process, which results in an AR(1)-type equation with
$e[n] = u_{\alpha_1}[n]$. Indeed, since the corresponding B-spline has support of size 1, $u_{\alpha_1}[\cdot]$ is i.i.d.,
which implies that $p_U\big(u_{\alpha_1}[k] \,\big|\, \{u_{\alpha_1}[k-m]\}_{m\in\mathbb{Z}^+}\big) = p_U\big(u_{\alpha_1}[k]\big)$. This is equivalent to
$s_{\alpha_1}[\cdot]$ having the Markov property since $p_S\big(s_{\alpha_1}[k] \,\big|\, \{s_{\alpha_1}[k-m]\}_{m\in\mathbb{Z}^+}\big) = p_U\big(u_{\alpha_1}[k]\big) = p_S\big(s_{\alpha_1}[k] \,\big|\, s_{\alpha_1}[k-1]\big)$.
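In practice, the factorization (8.14) can be carried out by rooting the Laurent polynomial BL(z), as in the following sketch (our own minimal illustration; the Riesz-basis condition guarantees that no root lies on the unit circle). The example recovers the classical factorization for the triangle B-spline, whose sampled autocorrelation samples are (. . . , 1/6, 2/3, 1/6, . . .).

```python
# A hedged sketch of the spectral factorization (8.14) by root splitting.
import numpy as np

def spectral_factor(c):
    """c = [c_0, c_1, ..., c_{N-1}] with B(z) = sum_{|k|<N} c_{|k|} z^{-k}
    symmetric and positive on the unit circle; returns the causal factor B+."""
    full = np.concatenate([c[::-1], c[1:]])        # coefficients of z^{N-1} B(z)
    roots = np.roots(full)
    inside = roots[np.abs(roots) < 1]              # minimum-phase half of the roots
    bp = np.real(np.poly(inside))                  # monic causal factor
    gain = np.sqrt(np.sum(full)) / np.sum(bp)      # fix the gain via B(1) = B+(1)^2
    return gain * bp                               # assumes B+(1) > 0 for the sign

print(spectral_factor(np.array([2/3, 1/6])))       # approx. [0.7887, 0.2113]
```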

8.3.4 Discrete whitening filter


The specification of a discrete-domain whitening filter that fully decorrelates the sampled
version s[·] of a second-order stochastic process s(·) is actually feasible in all generality
under the assumption that the operator L is spline-admissible. However, it is only
really useful for Gaussian processes, since this is the only scenario where decorrelation
is synonymous with statistical independence. The multidimensional frequency response
of the corresponding discrete whitening filter is

$$\hat{L}_d^{G}(\boldsymbol\omega) = \frac{\hat{L}_d(\boldsymbol\omega)}{\sqrt{\sum_{\boldsymbol n\in\mathbb{Z}^d} \big|\hat{\beta}_L(\boldsymbol\omega + 2\pi \boldsymbol n)\big|^2}}.\qquad(8.15)$$

The Riesz-basis property of the B-spline ensures that the denominator of (8.15) is non-vanishing,
so that we may invoke Wiener's lemma (Theorem 5.13) to show that the filter
is stable. Generally, the support of its impulse response is infinite, unless the support of
the B-spline is unity (Markov property).
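For a concrete feel, the response (8.15) can be tabulated numerically by truncating the aliasing sum. The sketch below is our illustration for the 1-D first-order operator L = D − αId (an assumption made so that β̂L has the explicit single-factor form of (8.10)).

```python
# Minimal numerical sketch of (8.15) for L = D - alpha*Id in 1-D.
import numpy as np

alpha, K = -0.5, 15                          # pole; K truncates the aliasing sum
omega = np.linspace(-np.pi, np.pi, 513)

beta_hat = lambda w: (1 - np.exp(alpha - 1j * w)) / (1j * w - alpha)
Ld_hat = 1 - np.exp(alpha) * np.exp(-1j * omega)      # discrete operator L_d

denom = sum(np.abs(beta_hat(omega + 2 * np.pi * n)) ** 2 for n in range(-K, K + 1))
Ld_G = Ld_hat / np.sqrt(denom)                        # whitening filter of (8.15)
```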

8.3.5 Robust localization


We end our discussion of generalized increments with a description of a robust variant
that is obtained by replacing the canonical localization operator Ld by some shorter
filter L̃d. This option is especially useful for decoupling fractal-type processes that
are associated with fractional whitening operators whose discrete counterparts have an
infinite support. The use of ordinary finite-difference operators, in particular, is motivated
by Theorem 7.7 because of their stationarizing effect.

The guiding principle is to select a localization filter L̃d that has a compact support
and is associated with some “augmented” operator L̃ = L₀L, where L₀ is a suitable differential
operator. The natural candidate for L̃ is a (non-fractional) differential operator
of integer order γ̃ ≥ γ whose null space is identical to that of L and whose B-spline βL̃
has a compact support. This latter function is given by

$$\beta_{\tilde{L}} = \tilde{L}_d\, \rho_{\tilde{L}},$$

where $\rho_{\tilde{L}} = \mathcal{F}^{-1}\{1/\hat{\tilde{L}}(\boldsymbol\omega)\}$ is the Green's function of L̃, whose Fourier symbol is $\hat{\tilde{L}}(\boldsymbol\omega) = \hat{L}_0(\boldsymbol\omega)\,\hat{L}(\boldsymbol\omega)$.
The subsequent application of L₀ then yields the smoothing kernel

$$\begin{aligned} \phi(\boldsymbol r) &= \mathrm{L}_0\, \beta_{\tilde L}(\boldsymbol r) \qquad&(8.16)\\ &= \tilde{\mathrm{L}}_d\, \mathrm{L}_0\, \rho_{\tilde L}(\boldsymbol r) = \tilde{\mathrm{L}}_d\, \rho_L(\boldsymbol r) \qquad&(8.17)\\ &= \sum_{\boldsymbol k\in\mathbb{Z}^d} d_{\tilde L}[\boldsymbol k]\, \rho_L(\boldsymbol r - \boldsymbol k). \end{aligned}$$

This shows that φ is a (finite) linear combination of the integer shifts of ρL (the Green's
function of the original operator L) and hence is a cardinal L-spline (see Definition 6.4).
To describe the decoupling effect of L̃d on the signal s = L⁻¹w, we observe that
(8.17) is equivalent to $\phi = \tilde{\mathrm{L}}_d\, \mathrm{L}^{-1}\delta$, which yields

$$\tilde u = \tilde{\mathrm{L}}_d\, s = \tilde{\mathrm{L}}_d\, \mathrm{L}^{-1} w = \phi * w.$$

This equation is the same as (8.7), except that we have replaced the original B-spline
βL by the smoothing kernel defined by (8.16). The procedure is acceptable whenever
φ ∈ Lp(ℝ^d) and its decay at infinity is comparable to that of βL. We call such a scheme
a robust localization because its qualitative effect is the same as that of the canonical
operator Ld. For instance, we can rely on the results of Chapter 9 to show that the
statistical distributions of ũ and u have the same global properties (sparsity pattern).
The price to pay is a slight loss in decoupling power because the localization of φ is
worse than that of βL, the latter being the best that can be achieved within the given spline
space (B-spline property).

To assess the remaining dependencies, it is instructive to determine the corresponding
autocorrelation function

$$R_{\tilde u}(\boldsymbol r) = \mathbb{E}\big\{\tilde u(\cdot + \boldsymbol r)\, \tilde u(\cdot)\big\} = \sigma_w^2\, (\mathrm{L}_0 \mathrm{L}_0^*)\big\{\beta_{\tilde L} * \beta_{\tilde L}^{\vee}\big\}(\boldsymbol r),\qquad(8.18)$$

which primarily depends upon the size of the augmented-order B-spline βL̃. Now, in the
special case where L₀ is an all-pass operator with $|\hat{L}_0(\boldsymbol\omega)|^2 = 1$, we have that
$|\hat{\beta}_{\tilde L}(\boldsymbol\omega)|^2 = |\hat{\beta}_L(\boldsymbol\omega)|^2$, so that the autocorrelation functions of u = Ld s and ũ = L̃d s are identical. This
implies that the decorrelation effects of the localization filters Ld and L̃d are equivalent,
which justifies replacing one by the other.

Example of fractional derivatives


The fractional derivatives ∂τ^γ, which are characterized in Proposition 5.6, are scale-invariant
by construction. Since the family is complete, it implicitly specifies the
broadest class of 1-D self-similar processes that are solutions of fractional SDEs (see
Section 7.5). Here, we shall focus on the symmetric versions of these derivatives with
τ = 0 and Fourier multiplier L̂(ω) = |ω|^γ. The corresponding symmetric fractional B-splines
of degree α = (γ − 1) are specified by the Fourier-domain formula (see [UB00])

$$\hat{\beta}_0^{\alpha}(\omega) = \frac{\big|1 - e^{-j\omega}\big|^{\alpha+1}}{|\omega|^{\alpha+1}}.\qquad(8.19)$$
To determine the time-domain counterpart of this equation, we rely on a generalized
version of the binomial expansion, which is due to Thierry Blu.

THEOREM 8.5 (Blu's generalized binomial expansion) Let u, v ∈ ℝ with u + v > −½.
Then, for any z = e^{jθ} on the unit circle,

$$(1+z)^u (1+z^{-1})^v = \sum_{k\in\mathbb{Z}} \binom{u+v}{u+k}\, z^{-k},$$

with generalized binomial coefficients

$$\binom{u}{v} = \frac{u!}{v!\,(u-v)!} = \frac{\Gamma(u+1)}{\Gamma(v+1)\,\Gamma(u-v+1)}.$$
We then apply Theorem 8.5 with u = v = (α+1)/2 to expand the numerator of (8.19) and
compute the inverse Fourier transform (see Table A.1). This yields

$$\beta_0^{\alpha}(t) = \beta_{\partial_0^{\alpha+1}}(t) = \sum_{k\in\mathbb{Z}} d_{\alpha,0}[k]\, \rho_{\alpha,0}(t-k),\qquad(8.20)$$

where

$$d_{\alpha,0}[k] = (-1)^k \binom{\alpha+1}{\frac{\alpha+1}{2}+k}\qquad(8.21)$$

and

$$\rho_{\alpha,0}(t) = \mathcal{F}^{-1}\left\{\frac{1}{|\omega|^{\alpha+1}}\right\}(t) = \begin{cases} \dfrac{(-1)^{n+1}}{\pi\,(2n)!}\, t^{2n}\log|t|, & \text{for } \alpha = 2n \in 2\mathbb{N} \\[2mm] \dfrac{-1}{2\,\Gamma(\alpha+1)\sin\left(\frac{\pi}{2}\alpha\right)}\, |t|^{\alpha}, & \text{for } \alpha \in \mathbb{R}^+ \setminus 2\mathbb{N}. \end{cases}$$

Observe that ρα,0 is the Green's function of the fractional-derivative operator ∂₀^{α+1},
while the dα,0[k] are the coefficients of the corresponding (canonical) localization filter.
The simplest instance occurs for α = 1, where (8.20) reduces to

$$\beta_0^1(t) = \tfrac{1}{2}|t+1| - |t| + \tfrac{1}{2}|t-1|,$$

which is the triangular function supported in [−1, 1] (symmetric B-spline of degree one)
shown in Figure 8.1c. In general, however, the fractional B-splines β₀^α(t) with α ∈ ℝ⁺
are not compactly supported, unless α is an odd integer. In fact, they can be shown to
decay like O(1/|t|^{α+2}), which is a behavior that is characteristic of fractional operators.
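The coefficients (8.21) are straightforward to evaluate with gamma functions; the short sketch below (an illustration of ours, not the book's code) also checks the announced O(1/|k|^{α+2}) decay numerically.

```python
# Hedged sketch: fractional localization weights (8.21) and their decay.
import numpy as np
from scipy.special import binom          # generalized binomial via gamma functions

def d_alpha0(alpha, K):
    k = np.arange(-K, K + 1)
    return k, (-1.0) ** k * binom(alpha + 1.0, (alpha + 1.0) / 2.0 + k)

k, d = d_alpha0(0.5, 50)
tail = np.abs(d[k > 0][-5:]) * k[k > 0][-5:] ** 2.5   # |d[k]| * k^(alpha+2)
print(tail)                               # roughly constant: O(1/|k|^{2.5}) decay
```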

Figure 8.1 Comparison of kernels related to the symmetric fractional derivative ∂₀^{1/2+1} with
Fourier multiplier |ω|^{3/2}. (a) Fractional B-spline β₀^{1/2}(t). (b) Localization kernel
φ_{1/2,1}(t) = ∂₀^{1/2}β₀¹(t). (c) Augmented-order B-spline β₀¹(t).

As far as the implementation of the localization operator Ld is concerned, the
most favorable scenario is when α = 2n + 1 is odd, in which case d_{2n+1,0}[·] is a
(modulated) binomial sequence of length 2n + 2 that corresponds to the (2n + 2)th
centered finite-difference operator. Otherwise, dα,0[·] has an infinite length and decays
like O(1/|k|^{α+2}), which makes the numerical implementation of the filter impractical.
This suggests switching to a robust localization where d_{α,0} is replaced by d_{2n+1,0}, with
(2n − 1) < α ≤ (2n + 1).
To be specific, we now consider the case 0 < α ≤ 1 with L = ∂₀^{α+1} and the
augmented-order operator L̃ = D² = ∂₀^{1−α}∂₀^{α+1}. The corresponding smoothing kernel
is then given by

$$\phi_{\alpha,1}(t) = \partial_0^{1-\alpha}\beta_0^1(t) = \frac{1}{\Gamma(\alpha+1)\sin\left(\frac{\pi}{2}\alpha\right)} \left( \tfrac{1}{2}|t+1|^{\alpha} - |t|^{\alpha} + \tfrac{1}{2}|t-1|^{\alpha} \right).$$

This function decays like O(1/|t|^{2−α}) since the finite-difference operator asymptotically
acts as a second-order derivative. The relevant functions for α = 1/2 are shown in
Figure 8.1. The fractional-derivative operator ∂₀^{1/2} is a non-local operator. Its application
to the compactly supported B-spline β₀¹ produces a kernel with algebraically decaying
tails. While β₀^{1/2} and φ_{1/2,1} are qualitatively similar, the former function is better localized
with smaller secondary lobes, which reflects the superior decoupling performance
of the canonical scheme. Yet, this needs to be balanced against the fact that the robust
version uses a three-tap filter (second-order finite difference), as opposed to the solution
(8.21) of infinite support dictated by the theory.
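The closed form of φα,1 makes this trade-off easy to probe numerically. The following sketch (our illustration, valid for 0 < α ≤ 1) evaluates the kernel and confirms its algebraic O(1/|t|^{2−α}) tail through the ratio φ(2t)/φ(t) → 2^{α−2}.

```python
# Hedged sketch of the robust smoothing kernel phi_{alpha,1} and its tail.
import numpy as np
from scipy.special import gamma

def phi_robust(t, alpha):
    c = 1.0 / (gamma(alpha + 1) * np.sin(np.pi * alpha / 2))
    return c * (0.5 * np.abs(t + 1) ** alpha - np.abs(t) ** alpha
                + 0.5 * np.abs(t - 1) ** alpha)

t = np.array([2.0, 4.0, 8.0, 16.0])
print(phi_robust(2 * t, 0.5) / phi_robust(t, 0.5))   # -> 2**(-1.5) ~ 0.3536
```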

8.4 Wavelet analysis

The other option for uncoupling the information is to analyze the signal s(r) using
wavelet-like functions. The implicit assumption for the following properties is that we
have a real-valued wavelet basis available that is matched to the operator L. Specifically,
the structure of the transform must be such that the basis functions ψ_{i,k}^{(n)} at scale i are
translated versions of a fixed number N₀ (typically, N₀ = 1 or N₀ = det(D) − 1) of
normalized “mother” wavelets of the form

$$\psi_i^{(n)}(\boldsymbol r) = \mathrm{L}^* \phi_i^{(n)}(\boldsymbol r),$$

where the φ_i^{(n)} with n = 1, . . . , N₀ are scale-dependent smoothing kernels whose width is
proportional to the scale a_i = det(D)^{i/d}, where D is the underlying dilation matrix (e.g.,
D = 2I for a standard dyadic scale progression). In the traditional wavelet transform,
L is scale-invariant and the wavelets at resolution i are dilated versions of the primary
ones at level i = 0 with φ_i^{(n)}(r) ∝ φ_0^{(n)}(D^{−i}r).

Here, for simplicity of notation, we shall focus on the general operator-like design
of Section 6.5 with N₀ = 1, which has the advantage of involving the single mother
wavelet

$$\psi_i(\boldsymbol r) = \mathrm{L}^* \phi_i(\boldsymbol r),$$

while the complete set of basis functions is represented by

$$\psi_{i,\boldsymbol k} = \mathrm{L}^* \phi_i(\cdot - \mathbf{D}^{i-1}\boldsymbol k) = \mathrm{L}^* \phi_{i,\boldsymbol k}\qquad(8.22)$$

with i ∈ ℤ and k ∈ ℤ^d \ Dℤ^d. The technical assumption for the wavelet coefficients
Y_i[k] = ⟨ψ_{i,k}, s⟩ to be well defined is that L^{−1∗}ψ_{i,k} = φ_{i,k} ∈ X, where X = R(ℝ^d) or,
in the event that f is p-admissible, X = Lp(ℝ^d) with p ∈ [1, 2].

8.4.1 Wavelet-domain statistics


The key observation is that the wavelet coefficients Y_{i,k} = ⟨ψ_{i,k}, s⟩ of the random signal
s at a given scale i can be obtained by first correlating the signal with the wavelet
ψ_i = L*φ_i (continuous wavelet transform) and sampling thereafter. Similar to the case
of the generalized increments, this has a stationarizing and decoupling effect.

PROPOSITION 8.6 (Wavelet-domain pdf) Let v_i(r) = ⟨ψ_i(· − r), s⟩ with ψ_i = L*φ_i
be the ith channel of the continuous wavelet transform of the generalized (stationary
or non-stationary) stochastic process s = L⁻¹w, where w is an innovation process with
Lévy exponent f. Then, v_i is a generalized stationary process with characteristic functional
$\hat{\mathscr{P}}_{v_i}(\varphi) = \hat{\mathscr{P}}_w(\phi_i * \varphi)$, where $\hat{\mathscr{P}}_w$ is defined by (4.13). Moreover, the characteristic
function of the (discrete) wavelet coefficient Y_i[k] = ⟨ψ_{i,k}, s⟩ = v_i(D^{i−1}k) is given by
$\hat{p}_{Y_i}(\omega) = \hat{\mathscr{P}}_w(\omega\phi_i) = e^{f_{\phi_i}(\omega)}$ and is infinitely divisible with Lévy exponent

$$f_{\phi_i}(\omega) = \int_{\mathbb{R}^d} f\big(\omega\phi_i(\boldsymbol r)\big)\, d\boldsymbol r.$$

Proof Recalling that s = L⁻¹w, we get

$$v_i(\boldsymbol r) = \langle \psi_i(\cdot - \boldsymbol r), s\rangle = \langle \mathrm{L}^*\phi_i(\cdot - \boldsymbol r), \mathrm{L}^{-1}w\rangle = \langle \mathrm{L}^{-1*}\mathrm{L}^*\phi_i(\cdot - \boldsymbol r), w\rangle = \langle \phi_i(\cdot - \boldsymbol r), w\rangle = \big(\phi_i^{\vee} * w\big)(\boldsymbol r),$$

where we have used the fact that L^{−1∗} is a valid (continuous) left inverse of L∗. Since
φ_i ∈ X, we can invoke Proposition 7.2 to prove the first part.

For the second part, we also use the fact that Y_{i,k} = ⟨ψ_{i,k}, s⟩ = ⟨φ_{i,k}, w⟩. Based on the
definition of the characteristic functional $\hat{\mathscr{P}}_w(\varphi) = \mathbb{E}\{e^{j\langle \varphi, w\rangle}\}$, we then obtain

$$\hat{p}_{Y_i}(\omega) = \mathbb{E}\{e^{j\omega Y_{i,\boldsymbol k}}\} = \mathbb{E}\{e^{j\langle \omega\phi_{i,\boldsymbol k},\, w\rangle}\} = \mathbb{E}\{e^{j\langle \omega\phi_i,\, w\rangle}\} \quad\text{(by stationarity and linearity)}$$
$$= \hat{\mathscr{P}}_w(\omega\phi_i) = \exp\left(\int_{\mathbb{R}^d} f\big(\omega\phi_i(\boldsymbol r)\big)\, d\boldsymbol r\right),$$

where we have inserted the explicit form given in (4.13). The result then follows by
identification. Since $\hat{\mathscr{P}}_w : X \to \mathbb{C}$ is a continuous, positive-definite functional with
$\hat{\mathscr{P}}_w(0) = 1$, we conclude that the function $f_{\phi_i}$ is conditionally positive of order one
(by Schoenberg's correspondence Theorem 4.7), so that it is a valid Lévy exponent
(see Definition 4.1). This proves that the underlying pdf is infinitely divisible (by
Theorem 4.1).
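Proposition 8.6 is constructive: once f and φi are specified, the wavelet-domain pdf can be computed numerically. The sketch below is our own illustration (the Laplace-type exponent f(ω) = −log(1 + ω²) and the triangular kernel are arbitrary choices): it evaluates f_{φi} by quadrature and inverts the resulting characteristic function by a cosine transform.

```python
# Hedged numerical illustration of Proposition 8.6.
import numpy as np

f = lambda w: -np.log1p(w ** 2)                 # Laplace-type Levy exponent (assumed)
r = np.linspace(-1.0, 1.0, 401)
phi = 1.0 - np.abs(r)                           # triangular smoothing kernel (assumed)

w = np.linspace(0.0, 60.0, 1201)                # one-sided frequency grid
f_phi = np.trapz(f(np.outer(w, phi)), r, axis=1)   # f_phi(w) = int f(w*phi(r)) dr
p_hat = np.exp(f_phi)                           # id characteristic function of Y_i[k]

pdf = lambda x: np.trapz(p_hat * np.cos(w * x), w) / np.pi   # symmetric inversion
print(pdf(0.0), pdf(1.0), pdf(3.0))             # heavier-than-Gaussian tail expected
```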

We determine the joint characteristic function of any two wavelet coefficients Y₁ =
⟨ψ_{i₁,k₁}, s⟩ and Y₂ = ⟨ψ_{i₂,k₂}, s⟩ with indices (i₁, k₁) and (i₂, k₂) using a similar technique.

PROPOSITION 8.7 (Wavelet dependencies) The joint characteristic function of the
wavelet coefficients Y₁ = Y_{i₁}[k₁] = ⟨ψ_{i₁,k₁}, s⟩ and Y₂ = Y_{i₂}[k₂] = ⟨ψ_{i₂,k₂}, s⟩ of the
generalized stochastic process s in Proposition 8.6 is given by

$$\hat{p}_{Y_1,Y_2}(\omega_1, \omega_2) = \exp\left(\int_{\mathbb{R}^d} f\big(\omega_1 \phi_{i_1,\boldsymbol k_1}(\boldsymbol r) + \omega_2 \phi_{i_2,\boldsymbol k_2}(\boldsymbol r)\big)\, d\boldsymbol r\right),$$

where f is the Lévy exponent of the innovation process w. The coefficients are independent
if the kernels φ_{i₁,k₁} and φ_{i₂,k₂} have disjoint support. Their correlation is given by

$$\mathbb{E}\{Y_1 Y_2\} = \sigma_w^2\, \langle \phi_{i_1,\boldsymbol k_1}, \phi_{i_2,\boldsymbol k_2}\rangle\qquad(8.23)$$

under the assumption that the variance $\sigma_w^2 = -f''(0)$ of w is finite.

Proof The first formula is obtained by substitution of ϕ = ω₁ψ_{i₁,k₁} + ω₂ψ_{i₂,k₂} in
$\mathbb{E}\{e^{j\langle \varphi, s\rangle}\} = \hat{\mathscr{P}}_w(\mathrm{L}^{-1*}\varphi)$ and simplification using the left-inverse property of L^{−1∗}. The
statement about independence follows from the exponential nature of the characteristic
function and the property that f(0) = 0, which allows for the factorization of the characteristic
function when the supports of the kernels are distinct (independence of the noise
at every point). The correlation formula is obtained by direct application of (7.7) with
ϕ₁ = ψ_{i₁,k₁} = L*φ_{i₁,k₁} and ϕ₂ = ψ_{i₂,k₂} = L*φ_{i₂,k₂}.

These results provide a complete characterization of the statistical distribution of
sparse stochastic processes in some matched-wavelet domain. They also indicate that
the representation is intrinsically sparse since the transform-domain statistics are infinitely
divisible. Practically, this translates into the wavelet-domain pdfs having heavier
tails than a Gaussian (unless the process is Gaussian) – see the general argument and
results of Chapter 9.

It should be noted, however, that the quality of the decoupling depends strongly on the
spread of the smoothing kernels φ_i, which should be chosen to be maximally localized
for best performance. In the case of a first-order system (see the example in Section 6.3.3
and the wavelets in Figure 6.3d), the basis functions for fixed i are non-overlapping,
which implies that the wavelet coefficients within a given scale are independent. This is
not so across scales because of the cone-shaped region where the supports of the kernels
φ_{i₁} and φ_{i₂} overlap. Incidentally, the inter-scale correlation of wavelet coefficients is
often exploited in practice to improve coding performance [Sha93] and signal reconstruction
by imposing joint sparsity constraints [CNB98].

8.4.2 Higher-order wavelet dependencies and cumulants


To describe higher-order interactions, we use K-dimensional multi-index vectors of the
form n = (n₁, . . . , n_K) whose entries n_k are non-negative integers. We then define the
following multi-index operations and operators:

• Sum of components: $|\boldsymbol n| = \sum_{k=1}^{K} n_k = n$
• Factorial: $\boldsymbol n! = n_1!\, n_2! \cdots n_K!$ with the convention that 0! = 1
• Exponentiation of a vector z = (z₁, . . . , z_K) ∈ ℂ^K: $\boldsymbol z^{\boldsymbol n} = z_1^{n_1} \cdots z_K^{n_K}$
• Higher-order partial derivative of a function f(ω), ω = (ω₁, . . . , ω_K) ∈ ℝ^K:
$$\partial^{\boldsymbol n} f(\boldsymbol\omega) = \frac{\partial^{|\boldsymbol n|} f(\boldsymbol\omega)}{\partial\omega_1^{n_1} \cdots \partial\omega_K^{n_K}}.$$
The notation allows for the concise description of the multinomial theorem given by

$$\left(\sum_{k=1}^{K} j\omega_k\right)^{\! n} = \sum_{|\boldsymbol n| = n} \frac{n!}{\boldsymbol n!}\, j^n \boldsymbol\omega^{\boldsymbol n},$$

which involves a summation over $\binom{n+K-1}{K-1}$ distinct monomials of the form $\boldsymbol\omega^{\boldsymbol n} = \omega_1^{n_1} \cdots \omega_K^{n_K}$
with n₁ + · · · + n_K = n. It also yields a compact formula for the Nth-order Taylor
series of a multidimensional function f(ω) with well-defined derivatives up to order
N + 1,

$$f(\boldsymbol\omega_0 + \boldsymbol\omega) = \sum_{n=|\boldsymbol n|=0}^{N} \frac{\partial^{\boldsymbol n} f(\boldsymbol\omega_0)}{\boldsymbol n!}\, \boldsymbol\omega^{\boldsymbol n} + O\big(\|\boldsymbol\omega\|^{N+1}\big).\qquad(8.24)$$

Let $\hat{p}_{(X_1:X_K)}(\boldsymbol\omega)$, ω ∈ ℝ^K, be the multidimensional characteristic function associated
with the Kth-order joint pdf p_{(X₁:X_K)}(x) whose polynomial moments are all assumed to
be finite. Then, if the function $f(\boldsymbol\omega) = \log \hat{p}_{(X_1:X_K)}(\boldsymbol\omega)$ is well defined, we can write the
full multidimensional Taylor-series expansion

$$f(\boldsymbol\omega) = \log \hat{p}_{(X_1:X_K)}(\boldsymbol\omega) = \sum_{n=|\boldsymbol n|=0}^{\infty} \frac{\partial^{\boldsymbol n} f(\boldsymbol 0)}{\boldsymbol n!}\, \boldsymbol\omega^{\boldsymbol n} = \sum_{n=|\boldsymbol n|=0}^{\infty} \underbrace{(-j)^n\, \partial^{\boldsymbol n} f(\boldsymbol 0)}_{\kappa_{\boldsymbol n}}\, \frac{j^n \boldsymbol\omega^{\boldsymbol n}}{\boldsymbol n!},\qquad(8.25)$$

where the internal summation is over all multi-indices whose sum is |n| = n. By
definition, the cumulants of p_{(X₁:X_K)}(x) are the coefficients of this expansion, so that

$$\kappa_{\boldsymbol n} = (-j)^{|\boldsymbol n|}\, \partial^{\boldsymbol n} f(\boldsymbol 0).$$

The interest of these quantities is that they are in one-to-one relation with the (multidimensional)
moments of p_{(X₁:X_K)} defined by

$$m_{\boldsymbol n} = \int_{\mathbb{R}^K} \boldsymbol x^{\boldsymbol n}\, p_{(X_1:X_K)}(\boldsymbol x)\, d\boldsymbol x = \int_{\mathbb{R}^K} x_1^{n_1} \cdots x_K^{n_K}\, p_{(X_1:X_K)}(\boldsymbol x)\, d\boldsymbol x,$$

which also happen to be the coefficients of the Taylor-series expansion of $\hat{p}_{(X_1:X_K)}(\boldsymbol\omega)$
around ω = 0. Since id laws are specified in terms of their Lévy exponent f, it is often
easier to determine their cumulants than their moments. Another advantage of cumulants
is that they provide a direct measure of the deviation from a Gaussian, whose
cumulants are zero for orders greater than two (since a Gaussian Lévy exponent is
necessarily quadratic).

PROPOSITION 8.8 (Higher-order wavelet dependencies) Let {(i_k, k_k)}_{k=1}^{K} be a set of
indices and {Y_k = ⟨ψ_{i_k,k_k}, s⟩}_{k=1}^{K} with ψ_{i_k,k_k} = L*φ_{i_k,k_k} be the corresponding wavelet
coefficients of the generalized stochastic process s in Proposition 8.6. Then, the joint
characteristic function of (Y₁, . . . , Y_K) is given by

$$\hat{p}_{(Y_1:Y_K)}(\omega_1, \ldots, \omega_K) = \exp\left(\int_{\mathbb{R}^d} f\big(\omega_1\phi_{i_1,\boldsymbol k_1}(\boldsymbol r) + \cdots + \omega_K\phi_{i_K,\boldsymbol k_K}(\boldsymbol r)\big)\, d\boldsymbol r\right),$$

where f is the Lévy exponent of the innovation process w. The coefficients are independent
if the kernels φ_{i_k,k_k} with k = 1, . . . , K have disjoint support. The wavelet cumulants
with multi-index n are given by

$$\kappa_{\boldsymbol n}\{Y_1 : Y_K\} = \kappa_n \int_{\mathbb{R}^d} \phi_{i_1,\boldsymbol k_1}^{n_1}(\boldsymbol r) \cdots \phi_{i_K,\boldsymbol k_K}^{n_K}(\boldsymbol r)\, d\boldsymbol r,\qquad(8.26)$$

under the assumption that the innovation cumulants of order $n = \sum_{k=1}^{K} n_k$,

$$\kappa_n = (-j)^n \left.\frac{\partial^n f(\omega)}{\partial\omega^n}\right|_{\omega=0},$$

are finite.

Proof Since Y_k = ⟨ψ_{i_k,k_k}, L⁻¹w⟩ = ⟨φ_{i_k,k_k}, w⟩, the first part is obtained by direct substitution
of ϕ = ω₁φ_{i₁,k₁} + · · · + ω_Kφ_{i_K,k_K} in the characteristic functional of the innovation,
$\mathbb{E}\{e^{j\langle \varphi, w\rangle}\} = \hat{\mathscr{P}}_w(\varphi) = \exp\left(\int_{\mathbb{R}^d} f\big(\varphi(\boldsymbol r)\big)\, d\boldsymbol r\right)$. To prove the second part, we start from the
Taylor-series expansion of f, which reads

$$f(\omega) = \sum_{n=0}^{\infty} \kappa_n\, \frac{(j\omega)^n}{n!},$$

where the κ_n are the cumulants of the canonical innovation pdf p_id. Next, we consider
the multidimensional wavelet Lévy exponent $f_Y(\boldsymbol\omega) = \log \hat{p}_{(Y_1:Y_K)}(\boldsymbol\omega)$ and expand it as

$$\begin{aligned} f_Y(\boldsymbol\omega) &= \int_{\mathbb{R}^d} f\big(\omega_1\phi_{i_1,\boldsymbol k_1}(\boldsymbol r) + \cdots + \omega_K\phi_{i_K,\boldsymbol k_K}(\boldsymbol r)\big)\, d\boldsymbol r \\ &= \sum_{n=0}^{\infty} \kappa_n \int_{\mathbb{R}^d} \frac{\big(j\omega_1\phi_{i_1,\boldsymbol k_1}(\boldsymbol r) + \cdots + j\omega_K\phi_{i_K,\boldsymbol k_K}(\boldsymbol r)\big)^n}{n!}\, d\boldsymbol r \quad\text{(1-D Taylor expansion of } f\text{)} \\ &= \sum_{n=0}^{\infty} \kappa_n \int_{\mathbb{R}^d} \frac{1}{n!} \sum_{|\boldsymbol n|=n} \frac{n!}{\boldsymbol n!}\, \big(j\omega_1\phi_{i_1,\boldsymbol k_1}(\boldsymbol r)\big)^{n_1} \cdots \big(j\omega_K\phi_{i_K,\boldsymbol k_K}(\boldsymbol r)\big)^{n_K}\, d\boldsymbol r \quad\text{(multinomial expansion of the inner sum)} \\ &= \sum_{n=|\boldsymbol n|=0}^{\infty} \underbrace{\kappa_n \int_{\mathbb{R}^d} \phi_{i_1,\boldsymbol k_1}^{n_1}(\boldsymbol r) \cdots \phi_{i_K,\boldsymbol k_K}^{n_K}(\boldsymbol r)\, d\boldsymbol r}_{\kappa_{\boldsymbol n}\{Y_1:Y_K\}}\, \frac{j^n \boldsymbol\omega^{\boldsymbol n}}{\boldsymbol n!}. \end{aligned}$$

The formula for the cumulants of (Y₁, . . . , Y_K) is then obtained by identification with
(8.25).

We note that, for K = 2, we recover Proposition 8.7. For instance, under the second-order
hypothesis, we have that

$$\kappa_1 = -j\, f'(0) = 0$$
$$\kappa_2 = (-j)^2 f''(0) = \sigma_w^2,$$

so that (8.26) with n = (1, 1) and n = 2 is equivalent to the wavelet-correlation formula
(8.23).
Finally, we mention that the cumulant formula in Proposition 8.8 is directly transposable
to the generalized increments of Section 8.3 by merely replacing φ_{i_k,k_k} by
β_L^∨(· − k_k) in (8.26), in accordance with the identity u[k_k] = ⟨β_L^∨(· − k_k), w⟩. In fact,
the result of Proposition 8.8 applies for any linear transform of s that involves a collection
of basis functions {ψ_k}_{k=1}^{K} such that φ_k = L^{−1∗}ψ_k ∈ R(ℝ^d) and, more generally,
φ_k ∈ Lp(ℝ^d) when f is p-admissible.
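The cumulant formula is also easy to validate by simulation. In the sketch below (a hedged illustration with parameters of our choosing), the innovation is compound Poisson with standard-Gaussian jumps, so that κ₄ = 3λ, and the scalar case of (8.26), κ₄{Y} = κ₄∫φ⁴(r)dr, is checked by Monte Carlo.

```python
# Hedged Monte Carlo check of the scalar case of (8.26).
import numpy as np

rng = np.random.default_rng(0)
lam, n_trials = 20.0, 100_000
phi = lambda r: np.cos(np.pi * r)               # arbitrary kernel on [0, 1]

Y = np.empty(n_trials)
for i in range(n_trials):
    n_imp = rng.poisson(lam)                    # number of noise impulses in [0, 1]
    Y[i] = np.sum(rng.standard_normal(n_imp) * phi(rng.random(n_imp)))  # <phi, w>

r = np.linspace(0.0, 1.0, 10_001)
print(3 * lam * np.trapz(phi(r) ** 4, r),       # theory: kappa_4 * int phi^4 = 22.5
      np.mean(Y ** 4) - 3 * np.var(Y) ** 2)     # empirical 4th cumulant (zero mean)
```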

8.5 Optimal representation of Lévy and AR(1) processes

We conclude the chapter with a detailed investigation of the effect of such signal
decompositions on first-order processes. We are especially interested in the evaluation
of performance for data reduction and the comparison with the “optimal” solutions provided
by the Karhunen–Loève transform (KLT) and independent-component analysis
(ICA). While ICA is usually determined empirically by running a suitable algorithm,
the good news is that it can be worked out analytically for this particular class of signal
models and used as a gold standard.
The Gaussian AR(1) model is of special relevance since it has been put forward in the
past to justify two popular source-encoding algorithms: linear predictive coding (LPC)

and DCT-based transform coding [JN84]. LPC, on the one hand, is used for voice com-
pression in the GSM standard for mobile phones (2G cellular network). It is also part
of the FLAC lossless audio codec. The DCT, on the other hand, was introduced as an
approximation of the KLT of an AR(1) process [Ahm74]. It forms the core of the widely
used JPEG method of lossy compression for digital pictures. Our primary interest here
is to investigate the extent to which these classical tools of signal processing remain
relevant when switching to the non-Gaussian regime.

8.5.1 Generalized increments and first-order linear prediction


The generalized CAR(1) processes of interest to us are solutions of the first-order SDE
(see Section 7.3.4)

$$(\mathrm{D} - \alpha_1 \mathrm{Id})\, s(t) = w(t)$$

with a single scalar coefficient α₁ ∈ ℝ. The underlying system is causal-stable for
α₁ < 0 and results in a stationary output s, which may be Gaussian or not, depending
on the type of innovation w. The unstable scenario α₁ → 0 corresponds to the Lévy
processes described in Section 7.4.1.

If we sample s(·) at the integers, we obtain the discrete AR(1) process s[·] that satisfies
the first-order difference equation

$$s[k] - a_1 s[k-1] = u[k]\qquad(8.27)$$

with a₁ = e^{α₁}. From the result in Section 8.3.1 and Proposition 8.4, we know that u[·]
is i.i.d. with an infinitely divisible distribution characterized by the (modified) Lévy
exponent

$$f_U(\omega) = \int_{\mathbb{R}} f\big(\omega\beta_{\alpha_1}(t)\big)\, dt = \int_0^1 f\big(\omega e^{\alpha_1 t}\big)\, dt.$$

Here, β_{α₁}(t) = 1_{[0,1)}(t)e^{α₁t} is the exponential B-spline associated with the first-order
whitening operator L = D − α₁Id (see Section 6.3), and f is the exponent of the continuous-domain
innovation w.
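The exponent f_U is a 1-D integral that is easy to evaluate; the sketch below (our illustration) does so by quadrature for an SαS driving term, for which the integral even has a closed form that serves as a check.

```python
# Hedged sketch: the modified Levy exponent f_U(w) = int_0^1 f(w e^{alpha_1 t}) dt.
import numpy as np

def f_U(w, alpha1, f, n=1001):
    t = np.linspace(0.0, 1.0, n)
    return np.trapz(f(w * np.exp(alpha1 * t)), t)

a, alpha1 = 1.2, -0.5
f = lambda w: -np.abs(w) ** a                    # SalphaS exponent (assumed)
closed = -np.abs(2.0) ** a * (np.exp(a * alpha1) - 1) / (a * alpha1)
print(f_U(2.0, alpha1, f), closed)               # the two values should agree
```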
To make the link with predictive coding and classical estimation theory, we observe
that s̃[k] = a₁s[k − 1] is the best (minimum-error) linear predictor of s[k] given the
past of the signal {s[k − n]}_{n=1}^{∞}. This suggests interpreting the generalized increments
u[k] = s[k] − s̃[k] as prediction errors.
For the benefit of the reader, we recall that the main idea behind linear predictive
coding (LPC) is to sequentially transmit the prediction error u[k], rather than the sample
values of the signal, which are inherently correlated. One also typically uses higher-
order AR models to better represent the effect of the sound-production system. A re-
finement for real-world signals is to re-estimate the prediction coefficients from time to
time in order to readapt the model to the data.

8.5.2 Vector-matrix formulation


The common practice in signal processing is to describe finite-dimensional signal transforms
within the framework of linear algebra. To that end, one considers a series of
N consecutive samples X_n = s[n] that are concatenated into the signal vector x =
(X₁, . . . , X_N), which, from now on, will be treated as a multivariate random variable.
One also imposes some boundary conditions for the signal values that are outside the
observation window, such as, for instance, the periodic extension s[k] = s[k + N]. Another
example is s[0] = 0, which is consistent with the definition of a Lévy process.

In the present scenario, the underlying signal is specified by the discrete AR(1) innovation
model (8.27) and the Lévy exponent f_U of its driving term u[·]. The transcription
of this model in matrix-vector form is

$$\mathbf{L}\boldsymbol x = \boldsymbol u,$$

where u = (U₁, . . . , U_N) with U_n = u[n] and L is a Toeplitz matrix with non-zero entries
[L]_{n,n} = 1 and [L]_{n,n−1} = −a₁ = −e^{α₁}.

Assuming that L is invertible,¹ we then solve this linear system of equations, which
yields

$$\boldsymbol x = \mathbf{L}^{-1}\boldsymbol u,\qquad(8.28)$$

where u is i.i.d., in direct analogy with the continuous-domain representation of the signal
s = L⁻¹w. Since (8.28) is a special instance of the general finite-dimensional innovation
model, we can refer to the formulas of Section 4.3 for the complete multivariate statistical
description. For instance, the joint characteristic function of the signal is given by
(4.12) with A = L⁻¹ and f = f_U.

¹ This property is dependent upon a proper choice of discrete boundary conditions (e.g., s[0] = 0).
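A minimal realization of this finite-dimensional model (a sketch under the s[0] = 0 convention mentioned above) is given below; note that α₁ = 0 (a₁ = 1) turns L⁻¹ into a cumulative-sum matrix, i.e., a sampled Lévy process.

```python
# Hedged sketch of the matrix-vector innovation model (8.28).
import numpy as np

N, a1 = 64, np.exp(-0.1)
L = np.eye(N) - a1 * np.eye(N, k=-1)      # [L]_{n,n} = 1, [L]_{n,n-1} = -a1

rng = np.random.default_rng(1)
u = rng.standard_cauchy(N)                # sparse SalphaS (alpha = 1) excitation
x = np.linalg.solve(L, u)                 # AR(1) samples: x[k] = a1 x[k-1] + u[k]
```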

8.5.3 Transform-domain statistics


The primary motivation for applying a linear transform to x is to produce the equivalent
signal representation

$$\boldsymbol y = (Y_1, \ldots, Y_N) = \mathbf{T}\boldsymbol x\qquad(8.29)$$

whose components Y_n can be processed or analyzed individually, which is justifiable
when the Y_n are (approximately) decoupled. A transform is thereby characterized by
an (N × N) matrix T = [t₁ · · · t_N]ᵀ which is assumed to be invertible. If no further
constraint is imposed, then the optimal transform for the present model is T ∝ L⁻¹,
which results in a perfect decoupling, as in LPC. However, there are many applications,
such as denoising and coding, where it is important to preserve the norm of the signal, so
that one constrains the transform matrix T to be orthonormal. In what follows, we shall
describe and compare the solutions that are available for that purpose and quantify the
penalty that is incurred by imposing the orthonormality condition.

But, before that, let us determine the transform-domain statistics of the signal in order
to qualify the effect of T. Generally, if we know the Nth-order pdf of x, we can readily
deduce the joint pdf of the transform-domain coefficients y = Tx as

$$p_{(Y_1:Y_N)}(\boldsymbol y) = \frac{1}{|\det(\mathbf{T})|}\, p_{(X_1:X_N)}(\mathbf{T}^{-1}\boldsymbol y).\qquad(8.30)$$

The Fourier equivalent of this relation is

$$\hat{p}_{(Y_1:Y_N)}(\boldsymbol\omega) = \hat{p}_{(X_1:X_N)}(\mathbf{T}^{T}\boldsymbol\omega),$$

where $\hat{p}_{(Y_1:Y_N)}$ and $\hat{p}_{(X_1:X_N)}$ are the characteristic functions of y and x, respectively. In
the case of the innovation model (8.28), we can be completely explicit and obtain closed
formulas, including the complete multivariate characterization of the transform-domain
cumulants. To that end, we rewrite (8.29) as y = Au, where

$$\mathbf{A} = \mathbf{T}\mathbf{L}^{-1} = [\boldsymbol a_1 \cdots \boldsymbol a_N]^T = [\boldsymbol b_1 \cdots \boldsymbol b_N]\qquad(8.31)$$

is the composite matrix that combines the effect of noise shaping (innovation model)
and the linear transformation of the data. The row vectors of A are a_m = (a_{m,1}, . . . , a_{m,N})
with a_{m,n} = [A]_{m,n}, while its column vectors are denoted by b_n = (a_{1,n}, . . . , a_{N,n}).

PROPOSITION 8.9 Let y = Au, where A = [b₁ · · · b_N] is an invertible matrix of
size N and u is i.i.d. with an infinitely divisible pdf with Lévy exponent f_U. Then, the joint
characteristic function of y = (Y₁, . . . , Y_N) is given by

$$\hat{p}_{(Y_1:Y_N)}(\boldsymbol\omega) = \exp\left(\sum_{n=1}^{N} f_U\big([\mathbf{A}^T\boldsymbol\omega]_n\big)\right) = \exp\left(\sum_{n=1}^{N} f_U\big(\langle \boldsymbol b_n, \boldsymbol\omega\rangle\big)\right).\qquad(8.32)$$

The corresponding multivariate cumulants with multi-index m = (m₁, . . . , m_N) are

$$\kappa_{\boldsymbol m}\{Y_1 : Y_N\} = \kappa_m\{U\} \sum_{n=1}^{N} a_{1,n}^{m_1} \cdots a_{N,n}^{m_N}\qquad(8.33)$$
$$= \kappa_m\{U\} \sum_{n=1}^{N} \boldsymbol b_n^{\boldsymbol m},\qquad(8.34)$$

where $a_{m,n} = [\mathbf{A}]_{m,n} = [\boldsymbol b_n]_m$ and $\kappa_m\{U\} = (-j)^m\, \partial^m f_U(0)$ is the (scalar) cumulant
of order $m = \sum_{n=1}^{N} m_n$ of the innovation. Finally, the marginal distribution of Y_n =
⟨t_n, x⟩ = ⟨a_n, u⟩ is infinitely divisible with Lévy exponent

$$f_{Y_n}(\omega) = \sum_{m=1}^{N} f_U\big(a_{n,m}\,\omega\big).\qquad(8.35)$$
m=1

Proof Since u = (U₁, . . . , U_N) is i.i.d. with characteristic function $\hat{p}_U(\omega) = e^{f_U(\omega)}$, we
have that

$$\hat{p}_{(U_1:U_N)}(\boldsymbol\omega) = \prod_{n=1}^{N} e^{f_U(\omega_n)} = \exp\left(\sum_{n=1}^{N} f_U(\omega_n)\right).$$

Moreover,

$$\hat{p}_{(Y_1:Y_N)}(\boldsymbol\omega) = \mathbb{E}\big\{e^{j\langle \boldsymbol y, \boldsymbol\omega\rangle}\big\} = \mathbb{E}\big\{e^{j\langle \mathbf{A}\boldsymbol u, \boldsymbol\omega\rangle}\big\} = \mathbb{E}\big\{e^{j\langle \boldsymbol u, \mathbf{A}^T\boldsymbol\omega\rangle}\big\} = \hat{p}_{(U_1:U_N)}(\mathbf{A}^T\boldsymbol\omega).$$

Combining these two formulas yields (8.32).

The second part is obtained by adapting the proof of Proposition 8.8. In essence, the
idea is to replace the integral over ℝ^d by a sum over n, and the basis functions by the
row vectors of A.

As for the last statement, the characteristic function $\hat{p}_{Y_n}(\omega)$ is obtained by the replacement
of ω by ωe_n in (8.32), where e_n is the nth canonical unit vector. Indeed, setting one
of the frequency-domain variables to zero is equivalent to performing the corresponding
marginalization (integration) of the joint pdf. We thereby obtain

$$\hat{p}_{Y_n}(\omega) = \exp\left(\sum_{m=1}^{N} f_U\big(a_{n,m}\,\omega\big)\right),$$

whose exponent is the sought-after quantity. Implicit to this result is the fact that the
infinite-divisibility property is preserved when performing linear combinations of id
variables. Specifically, let f₁ and f₂ be two valid Lévy exponents. Then, it is not hard to
see that f(ω) = f₁(a₁ω) + f₂(a₂ω), where a₁, a₂ ∈ ℝ are arbitrary constants, is a valid
Lévy exponent as well (in reference to Definition 4.1).

Proposition 8.9, in conjunction with (8.31), provides us with the complete characterization
of the transform-domain statistics. To illustrate its usage, we shall now deduce
the expression for the transform-domain covariances. Specifically, the covariance between
Y_{n₁} and Y_{n₂} is given by the second-order cumulant with multi-index m = e_{n₁} + e_{n₂},
whose expression (8.33) with m = 2 simplifies to

$$\mathbb{E}\Big\{\big(Y_{n_1} - \mathbb{E}\{Y_{n_1}\}\big)\big(Y_{n_2} - \mathbb{E}\{Y_{n_2}\}\big)\Big\} = \kappa_2\{U\} \sum_{n=1}^{N} a_{n_1,n}\, a_{n_2,n} = \sigma_0^2\, \langle \boldsymbol a_{n_1}, \boldsymbol a_{n_2}\rangle,\qquad(8.36)$$

where $\sigma_0^2 = -f_U''(0)$ is the variance of the innovation and where a_{n₁} and a_{n₂} are the
n₁th and n₂th row vectors of the matrix A = [a₁ · · · a_N]ᵀ, respectively. In particular, for
n₁ = n₂ = n, we find that the variance of Y_n is given by

$$\mathrm{Var}\{Y_n\} = \sigma_0^2\, \|\boldsymbol a_n\|^2.$$



Covariance matrix
An equivalent way of expressing second-order moments is to use covariance matrices.
Specifically, the covariance matrix of the random vector x ∈ ℝ^N is defined as

$$\mathbf{C}_{X} = \mathbb{E}\Big\{\big(\boldsymbol x - \mathbb{E}\{\boldsymbol x\}\big)\big(\boldsymbol x - \mathbb{E}\{\boldsymbol x\}\big)^T\Big\}.\qquad(8.37)$$

This is an (N × N) symmetric, positive definite matrix. A standard manipulation then
yields

$$\mathbf{C}_{Y} = \mathbf{T}\mathbf{C}_{X}\mathbf{T}^T = \mathbf{A}\mathbf{C}_{U}\mathbf{A}^T = \sigma_0^2\, \mathbf{A}\mathbf{A}^T,\qquad(8.38)$$

where A is defined by (8.31). The reader can easily verify that this result is equivalent
to (8.36). Likewise, the second-order transcription of the innovation model is

$$\mathbf{C}_{X} = \mathbf{L}^{-1}\mathbf{C}_{U}\mathbf{L}^{-T} = \sigma_0^2\, (\mathbf{L}^T\mathbf{L})^{-1}.\qquad(8.39)$$

Differential entropy
A final important theoretical quantity is the differential entropy of the random vector
x = (X₁, . . . , X_N), which is defined as

$$H_{(X_1:X_N)} = \mathbb{E}\big\{-\log p_{(X_1:X_N)}(\boldsymbol x)\big\} = -\int_{\mathbb{R}^N} p_{(X_1:X_N)}(\boldsymbol x)\, \log p_{(X_1:X_N)}(\boldsymbol x)\, d\boldsymbol x.\qquad(8.40)$$

For instance, the differential entropy of the N-dimensional multivariate Gaussian pdf
with mean x₀ and covariance matrix C_X,

$$p_{\mathrm{Gauss}}(\boldsymbol x) = \frac{1}{\sqrt{(2\pi)^N \det(\mathbf{C}_X)}}\, \exp\left(-\tfrac{1}{2}(\boldsymbol x - \boldsymbol x_0)^T \mathbf{C}_X^{-1}(\boldsymbol x - \boldsymbol x_0)\right),$$

is given by

$$H(p_{\mathrm{Gauss}}) = \frac{1}{2}\big(N + N\log(2\pi) + \log\det(\mathbf{C}_X)\big) = \frac{1}{2}\log\big((2\pi e)^N \det(\mathbf{C}_X)\big).\qquad(8.41)$$

The special relevance of this expression is that the Gaussian distribution is known to
have the maximum differential entropy among all densities with a given covariance C_X.
This leads to the inequality

$$-H_{(X_1:X_N)} + \frac{1}{2}\log\big((2\pi e)^N \det(\mathbf{C}_X)\big) \ge 0,\qquad(8.42)$$

where the (non-negative) quantity on the left is called the negentropy.

Table 8.1 Table of differential entropies.ᵃ

  Probability density function p(x)                          Differential entropy $-\int_{\mathbb{R}} p(x)\log p(x)\, dx$

  Gaussian:  $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/(2\sigma^2)}$                  $\frac{1}{2}\big(1 + \log(2\pi\sigma^2)\big)$
  Laplace:   $p(x) = \frac{1}{2\lambda}\, e^{-|x|/\lambda}$                                 $1 + \log(2\lambda)$
  Cauchy:    $p(x) = \frac{s_0}{\pi}\, \frac{1}{s_0^2 + x^2}$                               $\log(4\pi s_0)$
  Student:   $p(x) = \frac{1}{B(r,\frac{1}{2})}\left(\frac{1}{x^2+1}\right)^{r+\frac{1}{2}}$  $\big(r + \tfrac{1}{2}\big)\big(\psi(r + \tfrac{1}{2}) - \psi(r)\big) + \log B\big(r, \tfrac{1}{2}\big)$

ᵃ B(r, s) and ψ(z) = (d/dz) log Γ(z) are Euler's beta and digamma functions, respectively (see Appendix C).

To quantify the effect of the linear transformation T, we calculate the entropy of
(8.30), which, upon a change of variables, yields

$$H_{(Y_1:Y_N)} = -\int_{\mathbb{R}^N} p_{(Y_1:Y_N)}(\boldsymbol x)\, \log p_{(Y_1:Y_N)}(\boldsymbol x)\, d\boldsymbol x = H_{(X_1:X_N)} + \log|\det(\mathbf{T})|\qquad(8.43)$$
$$= H_{(U_1:U_N)} + \log|\det(\mathbf{A})| = N \cdot H_U + \log|\det(\mathbf{T})| - \log|\det(\mathbf{L})|,\qquad(8.44)$$

where $H_U = -\int_{\mathbb{R}} p_U(x)\log p_U(x)\, dx$ is the differential entropy of the (scalar) innovation.
Equation (8.43) implies that the differential entropy is invariant to orthonormal
transformations (since |det(T)| = 1), while (8.44) shows that it is primarily determined
by the probability law of the innovation. Some specific formulas for the differential
entropy of id laws are given in Table 8.1.

8.5.4 Comparison of orthogonal transforms


While the obvious candidate for expanding sparse AR(1) processes is the operator-
like transform of Section 6.3.3, we shall also consider two classical data analysis
methods – principal-component analysis (PCA) and independent-component analysis
(ICA) – which can be readily transposed to our framework.

Discrete Karhunen–Loève transform
The discrete KLT is the model-based version of PCA. It relies on prior knowledge of
the covariance matrix of the signal (see (8.37)). Specifically, the KLT matrix T_KLT =
[h₁ · · · h_N]ᵀ is built from the eigenvectors h_n of C_X, with the eigenvalues λ₁ ≥ · · · ≥
λ_N being ordered by decreasing magnitude. This results in a representation that is perfectly
decorrelated, with

$$\mathbf{C}_{\mathrm{KLT}} = \mathbf{T}_{\mathrm{KLT}}\mathbf{C}_X\mathbf{T}_{\mathrm{KLT}}^T = \mathrm{diag}(\lambda_1, \ldots, \lambda_N).$$

This implies that the solution is optimal in the Gaussian scenario, where decorrelation
is equivalent to independence. PCA is essentially the same technique, except that it
replaces C_X by a scatter matrix that is estimated empirically from the data.
In addition to the decorrelation property, the KLT minimizes any criterion of the form
(see [Uns84, Appendix A])

$$R(\mathbf{T}) = \sum_{n=1}^{N} G\big(\mathrm{Var}\{Y_n\}\big),\qquad(8.45)$$

where G : ℝ⁺ → ℝ is an arbitrary, continuous, monotonically decreasing convex
(or monotonically increasing concave) function and Var{Y_n} = t_nᵀC_X t_n is the variance of Y_n = ⟨t_n, x⟩.
Under the finite-variance hypothesis, the covariance matrix of the AR(1) process is
given by (8.39), where L⁻¹ is the inverse of the prediction filter and σ₀² is the variance
of the discrete innovation u. There are several instances of the model for which the KLT
can be determined analytically, based on the fact that the eigenvectors of σ₀²(LᵀL)⁻¹
are the same as those of (LᵀL). Specifically, h is an eigenvector of C_X with eigenvalue
λ if and only if

$$(\mathbf{L}^T\mathbf{L})\,\boldsymbol h = \frac{\sigma_0^2}{\lambda}\, \boldsymbol h.$$

For the AR(1) model, LᵀL is tridiagonal. This can then be converted into a set of
second-order difference equations that may be solved recursively. In particular, for
α₁ = 0 (Lévy process), the eigenvectors for n = 1, . . . , N correspond to the sinusoidal
sequences

$$h_n[k] = \frac{2}{\sqrt{2N+1}}\, \sin\left(\pi\, \frac{2n-1}{2N+1}\, k\right).\qquad(8.46)$$

Depending on the boundary conditions, one can obtain similar formulas for the eigenvectors
when α₁ ≠ 0. The bottom line is that the solutions are generally sinusoids, with
minor variations in phase and frequency. This is consistent with the fact that the correlation
matrix C_X is very close to being circulant, and that circulant matrices are diagonalized
by the discrete Fourier transform (DFT). Another “universal” transform that provides an
excellent approximation of the KLT of an AR(1) process (see [Ahm74]) is the discrete
cosine transform (DCT), whose basis vectors for n = 1, . . . , N are

$$g_n[k] = \frac{2}{\sqrt{2N}}\, \cos\left(\frac{\pi n}{N}\left(k - \tfrac{1}{2}\right)\right).\qquad(8.47)$$

A key property is that the DCT is asymptotically optimal in the sense that its performance
is equivalent to that of the KLT when the block size N tends to infinity.
Remarkably, this is a result that holds for the complete class of wide-sense-stationary
processes [Uns84], which may explain why this transform performs so well in practice.
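The eigenvector formula (8.46) can be verified directly, as in this hedged sketch of ours: we diagonalize C_X ∝ (LᵀL)⁻¹ for the Lévy case and compare the dominant eigenvector with the predicted sinusoid (up to an arbitrary sign).

```python
# Hedged numerical check of (8.46) for the Levy process (a1 = 1, s[0] = 0).
import numpy as np

N = 64
L = np.eye(N) - np.eye(N, k=-1)                 # finite-difference matrix
lam, H = np.linalg.eigh(np.linalg.inv(L.T @ L)) # eigenvalues in ascending order

k = np.arange(1, N + 1)
h1 = 2 / np.sqrt(2 * N + 1) * np.sin(np.pi * (2 * 1 - 1) * k / (2 * N + 1))
print(min(np.linalg.norm(H[:, -1] - h1),        # dominant eigenvector vs. (8.46)
          np.linalg.norm(H[:, -1] + h1)))       # ~1e-13 up to a sign flip
```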

Independent-component analysis
The decoupling afforded by the KLT emphasizes decorrelation, which is not necessarily
appropriate when the underlying model is non-Gaussian. Instead, one would rather
like to obtain a representation that maximizes statistical independence, which is the
goal pursued by independent-component analysis (ICA). Unlike the KLT, there is no
single ICA but a variety of numerical solutions that differ in terms of the criterion being
optimized [Com94]. The preferred measure of independence is mutual information
(MI), with the caveat that it is often difficult to estimate reliably from the data. Here, we
take advantage of our analytic framework to bypass this estimation step, which is the
empirical part of ICA.

Specifically, let

$$\boldsymbol z = (Z_1, \ldots, Z_N) = \mathbf{T}_{\mathrm{ICA}}\,\boldsymbol x$$

with $\mathbf{T}_{\mathrm{ICA}}^T\mathbf{T}_{\mathrm{ICA}} = \mathbf{I}$. By definition, the mutual information of the random vector (Z₁, . . . ,
Z_N) is given by

$$I(Z_1, \ldots, Z_N) = \sum_{n=1}^{N} H_{Z_n} - H_{(Z_1:Z_N)} \ge 0,\qquad(8.48)$$

where $H_{Z_n} = -\int_{\mathbb{R}} p_{Z_n}(z)\log p_{Z_n}(z)\, dz$ is the differential entropy of the component
Z_n (which is computed from the marginal distribution p_{Z_n}), and H_{(Z₁:Z_N)} is the Nth-order
differential entropy of z (see (8.40)). The relevance of this criterion is that
I(Z₁, . . . , Z_N) = 0 if and only if the variables Z₁, . . . , Z_N are statistically independent.
The other important property is that H_{(Z₁:Z_N)} = H_{(Y₁:Y_N)} = H_{(X₁:X_N)}, meaning that the
joint differential entropy does not depend upon the choice of transformation as long as
|det(T)| = 1 (see (8.43)). Therefore, the ICA transform that minimizes I(Z₁, . . . , Z_N)
subject to the orthonormality constraint is also the one that minimizes the sum of the
transform-domain entropies. We call this optimal solution the min-entropy ICA.

The implicit understanding with ICA is that the components of z are (approximately)
independent, so that it makes good sense to approximate the joint pdf of z by the product
of the marginals given by

$$p_{(Z_1:Z_N)}(\boldsymbol z) \approx \prod_{n=1}^{N} p_{Z_n}(z_n) = q_{(Z_1:Z_N)}(\boldsymbol z).\qquad(8.49)$$

As it turns out, the min-entropy ICA minimizes the Kullback–Leibler divergence between
p_{(Z₁:Z_N)} and its separable approximation q_{(Z₁:Z_N)}. Indeed, the Kullback–Leibler
divergence between two N-dimensional probability density functions p and q is defined
as

$$D\big(p\,\|\,q\big) = \int_{\mathbb{R}^N} \log\left(\frac{p(\boldsymbol z)}{q(\boldsymbol z)}\right) p(\boldsymbol z)\, d\boldsymbol z.$$

In the present context, this simplifies to

$$\begin{aligned} D\left(p_{(Z_1:Z_N)}\,\Big\|\, \prod_{n=1}^{N} p_{Z_n}\right) &= -H_{(Z_1:Z_N)} - \int_{\mathbb{R}^N} p_{(Z_1:Z_N)}(\boldsymbol z) \sum_{n=1}^{N} \log p_{Z_n}(z_n)\, d\boldsymbol z \\ &= -H_{(Z_1:Z_N)} - \sum_{n=1}^{N} \int_{\mathbb{R}} p_{Z_n}(z)\log p_{Z_n}(z)\, dz \\ &= -H_{(Z_1:Z_N)} + \sum_{n=1}^{N} H_{Z_n} = I(Z_1, \ldots, Z_N) \ge 0, \end{aligned}$$

with equality to zero if and only if p_{(Z₁:Z_N)}(z) and its product approximation q_{(Z₁:Z_N)}(z)
in (8.49) are equal.
In the special case where the innovation follows an SαS distribution with f_U(ω) =
−|s₀ω|^α, we can derive a form of the entropy criterion that is directly amenable to
numerical optimization. Indeed, based on Proposition 8.9, we find that the characteristic
function of Y_n is given by

$$\hat{p}_{Y_n}(\omega) = \exp\left(-\sum_{m=1}^{N} |s_0\, a_{n,m}\, \omega|^{\alpha}\right) = \exp\left(-s_0^{\alpha}\, \|\boldsymbol a_n\|_{\alpha}^{\alpha}\, |\omega|^{\alpha}\right) = e^{-|s_n\omega|^{\alpha}},\qquad(8.50)$$

from which we deduce that Y_n is SαS with dispersion parameter

$$s_n = s_0\, \|\boldsymbol a_n\|_{\alpha} = s_0 \left(\sum_{m=1}^{N} |a_{n,m}|^{\alpha}\right)^{\!1/\alpha}.$$

This implies that p_{Y_n} is a rescaled version of p_U, so that its entropy is given by

$$H_{Y_n} = H_U + \log \|\boldsymbol a_n\|_{\alpha}.$$

Hence, minimizing the mutual information (or, equivalently, the sum of the entropies of
the transformed coefficients) is equivalent to the minimization of

$$I(\mathbf{T}; \alpha) = \sum_{n=1}^{N} \log \|\boldsymbol a_n\|_{\alpha} = \frac{1}{\alpha} \sum_{n=1}^{N} \log\left(\sum_{m=1}^{N} \big|[\mathbf{T}\mathbf{L}^{-1}]_{n,m}\big|^{\alpha}\right).\qquad(8.51)$$

Based on (8.48), (8.44), and the property that |det(L)| = 1, we can actually verify that
(8.51) is the exact formula for the mutual information I(Y₁, . . . , Y_N). In particular, for
α = 2 (Gaussian case), we find that

$$I(\mathbf{T}; 2) = \frac{1}{2} \sum_{n=1}^{N} \log \|\boldsymbol a_n\|_2^2 = \frac{1}{2} \sum_{n=1}^{N} \log\big(\mathrm{Var}\{Y_n\}/\sigma_0^2\big),$$

which is a special instance of (8.45). It follows that, for α = 2, ICA is equivalent to the
KLT, as expected.
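The criterion (8.51) is cheap to evaluate for any candidate transform. The sketch below is our illustration of the recipe behind the experiments that follow (only the identity and an orthonormal DCT are included; the Haar and operator-like wavelets of Figure 8.2 are handled in exactly the same way): it computes the average mutual information for a Lévy process.

```python
# Hedged sketch: evaluating (8.51) for the Levy case (a1 = 1).
import numpy as np
from scipy.fft import dct

def mutual_info(T, Linv, alpha):                 # I(T; alpha) of (8.51)
    A = T @ Linv
    return np.sum(np.log(np.sum(np.abs(A) ** alpha, axis=1))) / alpha

N = 64
Linv = np.tril(np.ones((N, N)))                  # inverse finite-difference matrix
T_dct = dct(np.eye(N), norm='ortho', axis=0)     # orthonormal DCT matrix
for alpha in (0.8, 1.2, 2.0):
    print(alpha, mutual_info(np.eye(N), Linv, alpha) / N,
          mutual_info(T_dct, Linv, alpha) / N)   # average MI: identity vs. DCT
```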

Experimental results and discussion


To quantify the performance of the various transforms, we considered a series of signal
vectors of size N = 64, which are sampled versions of Lévy processes and CAR(1)
signals with SαS innovations. The motivation for this particular setup is twofold. First,
the use of a stable excitation lends itself to a complete analytical treatment due to the
preservation of the SαS property (see (8.50)). Second, the model offers direct control
over the degree of sparsity via the adjustment of α ∈ (0, 2]. The classical Gaussian
configuration is achieved with α = 2, while the distribution becomes more and more
heavy-tailed as α decreases, with p_{Y_n}(x) = O(1/x^{α+1}) (see Appendix C).

Our quality index is the average mutual information of the transform-domain coefficients,
given by (1/N) I(Y₁, . . . , Y_N) ≥ 0. A value of zero indicates that the transform coefficients
are completely independent. The baseline transformation is the identity, which is
expected to yield the worst results. The optimal performance is achieved by ICA, which
is determined numerically for each brand of signals based on the optimization of (8.51).

In Figure 8.2, we compare the performance of the DCT, the Haar wavelet transform,
and the ICA gold standard for the representation of Lévy processes (with a₁ = e⁰ = 1).
We verified that the difference in performance between the DCT and the KLT associated
with Brownian motion was marginal, so that the DCT curve is truly representative of the
best that is achievable within the realm of “classical” sinusoidal transforms. Interestingly,
the Haar wavelet transform is indistinguishable from the optimal solution for values
of α below 1, and better than the DCT up to some transition point. The change of regime
of the DCT for larger values of α is explained by the property that it converges to the
optimal solution for α = 2, the single point at which ICA and the KLT are equivalent.


Figure 8.2 Decoupling ability of various transforms of size N = 64 for the representation of SαS
Lévy processes as a function of the stability/sparsity index α. The criterion is the average mutual
information. The Gaussian case α = 2 corresponds to a Brownian motion.

We also examined the basis functions of ICA and found that they were very similar to
Haar wavelets. In particular, it appeared that the ICA training algorithm would almost
systematically uncover the N/2 shorter Haar wavelets, which is not overly surprising
since these are basis vectors also shared by the discrete whitening filter L.
Remarkably, this statistical model supports the (quasi)-optimality of wavelets. Since
mutual information is in direct relation to the transform-domain entropies which are
predictors of coding performance, these results also provide an explanation of the
superiority of wavelets for the M-term approximation of Lévy processes reported in
Section 1.3.4 (see Figure 1.5).
The graph in Figure 8.3 provides the matching results for AR(1) signals with a1 = 0.9.
Also included is a comparison between the Haar transform and the operator-like wavelet
transform that is matched to the underlying AR(1) model. The conclusions are essen-
tially the same as before. Here, too, the DCT closely replicates the performance of the
KLT associated with the Gaussian brand of these processes. While the Haar transform
is superior to the DCT for the sparser processes (small values of α), it is generally
outperformed by the operator-like wavelet transform, which confirms the relevance of
applying matched basis functions.
The finding that a fixed set of basis functions (operator-like wavelet transform) is
capable of essentially replicating the performance of ICA is good news for applications.
Indeed, the computational cost of the wavelet algorithm is O(N), as opposed to O(N2 )
for ICA, not to mention the price of the training procedure (iterative optimization) which
is even more demanding (O(N2 ) per iteration of a gradient-based scheme). A further
conceptual advantage is that the operator-like wavelets are known in analytic form (see
Section 6.3.3), while ICA can, at best, only be determined numerically by running a
suitable optimization algorithm.


Figure 8.3 Decoupling ability of various transforms for the representation of stable AR(1) signals
of size N = 64 with correlation a1 = 0.9 as a function of the stability/sparsity index α. The
criterion is the average mutual information. The DCT is known to be asymptotically optimal in
the Gaussian case (α = 2).

8.6 Bibliographical notes

Sections 8.1–8.4
The property that finite differences decouple Lévy processes is well known – in fact, it
is the starting point of the definition of such “additive” processes [Lév65]. By contrast,
the observation that Haar wavelets have the same kind of ability (on a scale-by-scale
basis) is much more recent [UTS14].
A crucial theoretical aspect is the extension of the domain of the characteristic func-
tional that was carried out in Section 8.2. The proof of Theorem 8.2 is adapted from
[UTS14, Theorem 3], while more general results for arbitrary Lévy exponents can be
found in [Fag14].
The material in Section 8.3 is an extension of the results presented in [UTAK14].
In that respect, we note that the correspondence between continuous-time and discrete-
time ARMA models (the Gaussian part of Proposition 8.4) is a classical result in the
theory of Gaussian stationary processes [Doo90]. The localization of the fractional-
derivative operators in Section 8.3.5 is intimately linked to the construction of fractional
B-splines with the help of the generalized binomial expansion (see [BU03, UB00]).
The theoretical results on wavelet-domain statistics in Section 8.4 are an extension
of [UTS14, Section V.D]. In particular, the general cumulant formulas (Proposition 8.8)
are new, to the best of our knowledge.

Section 8.5
The use of linear transforms for the decorrelation of signals is a classical topic in signal
and image processing [Pra91, JN84, Jai89]. The DCT was introduced by Ahmed et al.
as a practical substitute for the KLT of an AR(1) process [Ahm74]. As part of his thesis
work, Unser proved its asymptotic equivalence with the KLT for the complete class
of Gaussian stationary processes [Uns84]. The AR(1) model has frequently been used
for comparing linear transforms using various performance metrics derived from the
Gaussian hypothesis [PAP72, HP76, Jai79, JN84]. While such measures clearly point
to the superiority of sinusoidal transforms, they lose their relevance in the context of
sparse processes. The derivation of the KLT of a Lévy process (see (8.46)) can be found
in [KPAU13, Appendix II].
Two classical references on ICA are [Com94, HO00]. In practice, ICA is determined
from the data based on the minimization of a suitable contrast that favors independence
or, by extension, non-Gaussianity/sparsity. There is some empirical evidence of a link
between ICA and wavelets. The first is a famous experiment by Olshausen and Field,
who computed ICA from a large collection of natural image patches and pointed out
the similarity between the extracted factors and a directional wavelet/Gabor analysis
[OF96a]. In 1999, Cardoso and Donoho reported a numerical experiment involving a
(non-Gaussian) sawtooth process that resulted in ICA basis functions that were remark-
ably similar to wavelets [Car99]. The characterization of ICA for SαS processes that is
presented in Section 8.5.4, and the demonstration of the connection with operator-like
wavelets, are based on the more recent work of Pad and Unser [PU13].
The source for the calculation of the differential entropies in Table 8.1 is [LR78].
9 Infinite divisibility and transform-domain statistics

As we saw in Chapter 8, we have at our disposal two primary tools to analyze/


characterize sparse signals: (1) the construction of the generalized-increment process
by application of the discrete version of the whitening operator (localization), and (2)
a wavelet analysis using operator-like basis functions. The goal shared by the two
approaches is to approximately recover the innovations by numerical inversion of the
signal-generation model. While the localization option appears to give the finest level
of control, the wavelet option is interesting as well because of its greater robustness to
modeling errors. Indeed, performing a wavelet transform (especially if it is orthogonal)
is a much stabler operation than taking a discrete derivative, which tends to amplify
high-frequency perturbations.
In this chapter, we take advantage of our functional framework to gain a better
understanding of the true nature of the transform-domain statistics. Our investigation
revolves around the fact that the marginal pdfs are infinitely divisible (id) and that they
can be determined explicitly via the analysis of a common white noise (innovation). We
make use of this result to examine a number of relevant properties of the underlying id
distributions and to show that they are, for the most part, invariant to the actual shape
of the analysis window. This implies, among other things, that the qualitative effect of
operator-like filtering is the same in both analysis scenarios, in the sense that the sparsity
pattern of the innovation is essentially preserved.
The chapter is organized as follows. In Section 9.1, we interpret as spectral mixing the
observation of white Lévy noise through some general analysis function ϕ ∈ Lp(ℝ^d).
We prove that the resulting pdf is infinitely divisible under mild conditions on ϕ and f.
The key idea is that the pdf of X_ϕ = ⟨ϕ, w⟩ is completely characterized by its modified
Lévy exponent

$$f_{\varphi}(\omega) = \int_{\mathbb{R}^d} f\big(\omega\varphi(\boldsymbol r)\big)\, d\boldsymbol r,\qquad(9.1)$$

where f is the Lévy exponent of the continuous-domain innovations. This also translates
into a correspondence between the Lévy density v_ϕ(a) of f_ϕ and the canonical v(a) that
characterizes the innovation (see Theorem 4.2).

Sections 9.2, 9.3, and 9.4, respectively, while the characterization of the decay and the
determination of the moments are carried out in Sections 9.5–9.6. The conclusion that
follows is that the shape of the analysis window does not fundamentally change the
nature of the transform-domain pdfs. This is good news for practical applications since
it allows us to do some relatively robust modeling by sticking to a particular family of id
distributions (such as the Student, symmetric gamma, or SαS laws in Table 4.1) whose
dispersion and decay parameters can be tuned to fit the statistics of a given type of sig-
nal. These findings also suggest that the transform-domain statistics are only mildly
dependent upon the choice of a particular wavelet basis as long as the analysis wavelets
are matched to the whitening operator of the process in the sense that ψi = L∗ φi with
φi ∈ Lp (Rd ).
In the case where the operator is scale-invariant (fractal process), we can be more
specific and obtain a precise characterization of the evolution of the wavelet statistics
across scale. Our mathematical analysis hinges upon the semigroup properties of id
laws, which are reviewed in Section 9.7. These results are then used in Section 9.8 to
show that the wavelet-domain pdfs are ruled by a diffusion-like equation. This allows
us to prove that the wavelet-domain pdfs converge to Gaussians – or, more generally, to
stable distributions – as the scale gets coarser.

9.1 Composition of id laws, spectral mixing, and analysis of white noise

The fundamental aspect of infinite divisibility is that the property is conserved
through basic convolution-based compositions of probability density functions. Specifically,
let p_{X₁} and p_{X₂} be two id pdfs. Then, the id property is maintained for

$$p_Y(x) = \left(\tfrac{1}{|s_1|}\, p_{X_1}\big(\tfrac{\cdot}{s_1}\big) * \tfrac{1}{|s_2|}\, p_{X_2}\big(\tfrac{\cdot}{s_2}\big)\right)(x) \;\Leftrightarrow\; \hat{p}_Y(\omega) = \hat{p}_{X_1}(s_1\omega)\cdot \hat{p}_{X_2}(s_2\omega),$$

where p_Y can be interpreted as the pdf of the linear combination Y = s₁X₁ + s₂X₂ of independent
random variables X₁ and X₂. Indeed, the resulting Lévy triplet is

$$\left(s_1 m_1 + s_2 m_2,\;\; s_1^2 b_1 + s_2^2 b_2,\;\; \tfrac{1}{|s_1|}\, v_1\big(\tfrac{\cdot}{s_1}\big) + \tfrac{1}{|s_2|}\, v_2\big(\tfrac{\cdot}{s_2}\big)\right),$$

where (m_i, b_i, v_i), i = 1, 2, are the Lévy triplets of p_{X₁} and p_{X₂}. Along the same lines,
we have the implication that, for any finite τ ≥ 0,

$$f(\omega) \text{ (p-)admissible} \;\Rightarrow\; \tau f(\omega) \text{ is a (p-)admissible Lévy exponent}.$$

The Fourier-domain interpretation here is $\hat{p}_{X_\tau}(\omega) = \big(\hat{p}_{X_1}(\omega)\big)^{\tau}$, which translates into a
τ-fold (fractional) convolution of p_{X₁}.
In this work, we rely heavily on the property that the composition of id probability
laws can undergo a limit process by considering a mixing that involves a continuum of
independent random contributions (integration of white noise). For instance, we have
already emphasized in Section 4.4.3 that the observation of a white Lévy noise through
a rectangular window results in a reference random variable Xid = rect, w with a cano-
nical id pdf pid (x) (see Proposition 4.12). The important question that we are addressing
here is to describe the effect of applying a non-rectangular analysis window, as in (2.4)
or (2.1).
The first step is to determine the pdf of the random variable X_ϕ = ⟨ϕ, w⟩ for an
arbitrary ϕ. We have already alluded to the fact that p_{X_ϕ} is infinitely divisible. Indeed,
the characteristic function of X_ϕ is obtained by making the substitution ϕ → ωϕ in the
characteristic functional (4.13) of the Lévy noise, which yields

$$\hat{p}_{X_\varphi}(\omega) = \hat{\mathscr{P}}_w(\omega\varphi) = e^{f_{\varphi}(\omega)},$$

where f_ϕ is defined by (9.1).


The next result shows that f_ϕ is indeed an admissible Lévy exponent – that is,
a conditional, positive-definite function of order one admitting a Lévy–Khintchine
decomposition with some modified Lévy density v_ϕ.

THEOREM 9.1 Let f(ω) be a valid Lévy exponent with Lévy triplet (b₁, b₂, v(a)) subject
to the constraint that $\int_{\mathbb{R}} \min(a^2, |a|^p)\, v(a)\, da < \infty$ for some p ∈ [0, 2]. Then, for
any ϕ ∈ Lp(ℝ^d) ∩ L₂(ℝ^d) ∩ L₁(ℝ^d), $f_{\varphi}(\omega) = \int_{\mathbb{R}^d} f\big(\omega\varphi(\boldsymbol r)\big)\, d\boldsymbol r$ is an admissible Lévy
exponent with Gaussian parameter

$$b_{2,\varphi} = b_2\, \|\varphi\|_{L_2}^2$$

and modified Lévy density

$$v_{\varphi}(a) = \int_{\Omega_\varphi} \frac{1}{|\varphi(\boldsymbol r)|}\, v\big(a/\varphi(\boldsymbol r)\big)\, d\boldsymbol r,\qquad(9.2)$$

where Ω_ϕ ⊆ ℝ^d denotes the domain over which ϕ(r) ≠ 0. Moreover, f_ϕ(ω) (resp., v_ϕ)
satisfies the same type of admissibility condition as f(ω) (resp., v).

Note that the restriction ϕ ∈ L₁(ℝ^d) is only required for the non-centered scenarios,
where $\int_{\mathbb{R}} a\, v(a)\, da \neq 0$ and b₁ ≠ 0. The corresponding Lévy parameter b_{1,ϕ} then
depends upon the type of Lévy–Khintchine formula. In the case of the fully compensated
representation (4.5), we have $b_{1,\varphi} = b_1 \int_{\mathbb{R}^d} \varphi(\boldsymbol r)\, d\boldsymbol r$.
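When ϕ takes finitely many values, the integral (9.2) collapses to a weighted sum over the amplitude distribution of ϕ, which makes the modified density easy to tabulate. The following sketch is our illustration (the gamma-type density v(a) = e^{−|a|}/|a| and the Haar-like window are arbitrary choices).

```python
# Hedged sketch of the modified Levy density (9.2) for a piecewise-constant phi.
import numpy as np

def v_phi(a, levels, measures, v):
    """v_phi(a) = sum_i (|Omega_i|/|theta_i|) v(a/theta_i), phi = theta_i on Omega_i."""
    out = np.zeros_like(a)
    for theta, m in zip(levels, measures):
        out += m / abs(theta) * v(a / theta)
    return out

v = lambda a: np.exp(-np.abs(a)) / np.abs(a)      # gamma-type Levy density (assumed)
a = np.linspace(0.1, 5.0, 5)
print(v_phi(a, levels=[1.0, -1.0], measures=[0.5, 0.5], v=v))  # Haar-like window
```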

LEMMA 9.2 Let v(a) ≥ 0 be a valid Lévy density such that $\int_{\mathbb{R}} \min(a^2, |a|^p)\, v(a)\, da <$
∞ for some p ∈ [0, 2]. Then, for any given ϕ ∈ Lp(ℝ^d) ∩ L₂(ℝ^d), the modified density
v_ϕ(a) specified by (9.2) is Lévy-admissible with

$$\int_{\mathbb{R}} \min(a^2, |a|^p)\, v_{\varphi}(a)\, da < 2\|\varphi\|_{L_2}^2 \int_{|a|<1} a^2 v(a)\, da + 2\|\varphi\|_{L_p}^p \int_{|a|\ge 1} |a|^p v(a)\, da.\qquad(9.3)$$

If, in addition, $\int_{\mathbb{R}} |a|^p v(a)\, da < \infty$ for some p ≥ 0, then the result holds for any ϕ ∈
Lp(ℝ^d) and v_ϕ satisfies the inequality

$$\int_{\mathbb{R}} |a|^p v_{\varphi}(a)\, da < 2\|\varphi\|_{L_p}^p \int_{\mathbb{R}} |a|^p v(a)\, da.\qquad(9.4)$$

The limit case p = 0 is covered by using the convention that L₀(ℝ^d) is the space of bounded,
compactly supported functions with $\|\varphi\|_{L_0} = \int_{\Omega_\varphi} d\boldsymbol r$, where Ω_ϕ = {r : ϕ(r) ≠ 0}.

Proof of Lemma 9.2 The density v_ϕ(a) is non-negative by construction. To prove that it satisfies
the required bound, we rewrite it as

$$v_{\varphi}(a) = \int_{-\infty}^{\infty} \frac{1}{|\theta|}\, v\big(a/\theta\big)\, \mu_{\varphi}(d\theta),$$

where μ_ϕ is the measure describing the amplitude distribution of ϕ(r) within the range
θ ∈ ℝ, with zero contribution at θ = 0 (to avoid dividing by zero). For further convenience,
we also define the measure μ_{|ϕ|} that specifies the amplitude distribution of |ϕ(r)|.
To check the finiteness of $\int_{|a|>1} |a|^p v_{\varphi}(a)\, da$, we first consider the contribution I₁ of the
positive values:

$$\begin{aligned} I_1 = \int_1^{\infty} |a|^p v_{\varphi}(a)\, da &= \int_1^{\infty} \int_{-\infty}^{\infty} \frac{|a|^p}{|\theta|}\, v\big(a/\theta\big)\, \mu_{\varphi}(d\theta)\, da \\ &= \int_{-\infty}^{\infty} \int_1^{\infty} \frac{|a|^p}{|\theta|}\, v\big(a/\theta\big)\, da\; \mu_{\varphi}(d\theta) \\ &= \int_0^{\infty} \int_{1/\theta}^{\infty} |a'\theta|^p\, v(a')\, da'\; \mu_{\varphi}(d\theta) \\ &\le \int_0^{\infty} \left(\int_{\min(1,1/\theta)}^{1} |a'\theta|^p\, v(a')\, da' + \int_1^{\infty} |a'\theta|^p\, v(a')\, da'\right) \mu_{\varphi}(d\theta), \end{aligned}$$

where we are relying on Tonelli's theorem to interchange the integrals and where we
have made the reverse change of variable a′ = a/θ. The crucial step is to note that
a′θ ≥ 1 for the range of values within the first inner integral, which yields

$$\begin{aligned} I_1 &\le \int_0^{\infty} \left(\int_{\min(1,1/\theta)}^{1} |a'\theta|^2\, v(a')\, da' + \int_1^{\infty} |a'\theta|^p\, v(a')\, da'\right) \mu_{\varphi}(d\theta) \\ &\le \int_{-\infty}^{\infty} \theta^2\, \mu_{\varphi}(d\theta) \int_0^1 a^2 v(a)\, da + \int_{-\infty}^{\infty} |\theta|^p\, \mu_{\varphi}(d\theta) \int_1^{\infty} a^p v(a)\, da. \end{aligned}$$

Proceeding in the same fashion for the negative values and recalling that $\int_{\mathbb{R}} |\theta|^p\, \mu_{\varphi}(d\theta) = \|\varphi\|_{L_p}^p$, we find that

$$\int_{|a|\ge 1} |a|^p v_{\varphi}(a)\, da \le \|\varphi\|_{L_2}^2 \int_{|a|<1} a^2 v(a)\, da + \|\varphi\|_{L_p}^p \int_{|a|\ge 1} |a|^p v(a)\, da < \infty,$$

where we have used $\int_{\mathbb{R}} \min(a^2, |a|^p)\, v(a)\, da = \int_{|a|<1} a^2 v(a)\, da + \int_{|a|\ge 1} |a|^p v(a)\, da$. As
for the quadratic part (I₂) of the admissibility condition, we consider the integral

$$\begin{aligned} I_2 = \int_0^1 a^2 v_{\varphi}(a)\, da &= \int_0^1 \int_{-\infty}^{\infty} \frac{a^2}{|\theta|}\, v\big(a/\theta\big)\, \mu_{\varphi}(d\theta)\, da \\ &= \int_0^{\infty} \int_0^{1/\theta} (a'\theta)^2\, v(a')\, da'\; \mu_{\varphi}(d\theta) \end{aligned}$$

with the change of variable a′ = a/θ. Since a′ < 1/θ within the bound of the inner
integral, we have

$$\begin{aligned} I_2 &\le \int_0^{\infty} \left(\int_0^1 (a'\theta)^2\, v(a')\, da' + \int_1^{\max(1,1/\theta)} |a'\theta|^p\, v(a')\, da'\right) \mu_{\varphi}(d\theta) \\ &\le \int_{-\infty}^{\infty} \theta^2\, \mu_{\varphi}(d\theta) \int_0^1 a^2 v(a)\, da + \int_{-\infty}^{\infty} |\theta|^p\, \mu_{\varphi}(d\theta) \int_1^{\infty} a^p v(a)\, da. \end{aligned}$$

Using the same technique for the negative values, we get

$$\int_{|a|<1} a^2 v_{\varphi}(a)\, da \le \|\varphi\|_{L_2}^2 \int_{|a|<1} a^2 v(a)\, da + \|\varphi\|_{L_p}^p \int_{|a|\ge 1} |a|^p v(a)\, da < \infty,$$

which is then combined with the previous result to yield (9.3). The announced Lp
inequality is obtained in a similar fashion, without the necessity of splitting the integrals
into subparts.

L E M M A 9.3 Let v(a) ≥ 0 be a Lévy density such that R min(a2 , |a|p )v(a) da < ∞
for some p ∈ [0, 2]. Then, the non-Gaussian Lévy exponent
" #
g(ω) = ejaω − 1 − jaω1|a|<1 (a) v(a) da
R\{0}

is bounded by
|g(ω)| ≤ C2 |ω|2 + Cq |ω|q ,
 
with C2 = |a|<1 |a|2 v(a) da, q = min(1, p), and Cq = 2 |a|≥1 |a|q v(a) da.

Proof of Lemma 9.3. We rewrite g(ω) as


" # " #
g(ω) = ejaω − 1 − jaω v(a) da + ejaω − 1 v(a) da.
|a|<1 |a|≥1

Next, we observe that the condition in the statement implies that

|a|q v(a) da < ∞


|a|>1

for all 0 ≤ q ≤ p, and, in particular, q = min(1, p). We then consider the inequalities
 
 jx 
e − 1 − jx ≤ x2

and
 
 jx 
e − 1 ≤ min(|x|, 2)
≤ 2 min(|x|, 1)
≤ 2|x|q ,
228 Infinite divisibility and transform-domain statistics

where the restriction q ≤ 1 ensures that |x|q ≥ |x| for |x| ≤ 1. By combining these
elements, we construct the upper bound
   
 jaω   jaω 
|g(ω)| ≤ e − 1 − jaω v(a) da + e − 1 v(a) da
|a|<1 |a|≥1

≤ |aω|2 v(a) da + 2 |ωa|q v(a) da


|a|<1 |a|≥1

≤ |ω|2 |a|2 v(a) da + 2|ω|q |a|q v(a) da.


|a|<1 |a|≥1

Proof of Theorem 9.1. First, we use Lemma 9.3 together with the Lévy–Khintchine
formula (4.3) to show that the modified Lévy exponent fϕ is a well-defined function of
ω. This is achieved by establishing the upper bound
 
fϕ (ω) ≤ A1 |ω| + A2 |ω|2 + Aq |ω|q , (9.5)
"  #
where q = min(1, p), A1 = |b1 |ϕL1 , A2 = b22 + |a|<1 |a|2 v(a) da ϕ2L2 and A3 =
 q
2 |a|≤1 |a|q v(a) da ϕLq .

To lay out the technique of proof, we first assume that |a|>1 |a| v(a) da < ∞ and
consider the Lévy–Khintchine representation (4.5) of f. This yields

ω2
fϕ (ω) = jb1 ϕ(r)ω dr − b2 |ϕ(r)|2 dr + gϕ (ω)
Rd Rd 2
ω2
= j b1 ϕ(r) dr ω − b2 ϕ2L2 + gϕ (ω),
 Rd
     2
b2,ϕ
b1,ϕ

where
" #
gϕ (ω) = ejaϕ(r)ω − 1 − jaϕ(r)ω v(a) da dr,
Rd R\{0}

with the property that gϕ (0) = 0. Next, we identify vϕ (a) by making the change of
variable a (r) = aϕ(r), while restricting the domain of integration to the subregion of
Rd over which the argument is non-zero:
"  # 1  

gϕ (ω) = eja ω − 1 − ja ω v a /ϕ(r) da dr
ϕ R\{0} |ϕ(r)|
"  # 1   
= eja ω − 1 − ja ω v a /ϕ(r) dr da ,
R\{0} ϕ |ϕ(r)|
  
vϕ (a )

where the interchange of integrals is legitimate thanks to (9.5) (by Fubini). Lemma 9.2
ensures that vϕ is admissible in accordance with Definition 4.3.
9.1 id laws, spectral mixing, and analysis of white noise 229


The scenario |a|>1 |a| v(a) da = ∞ is trickier because it calls for a more careful
compensation of the singularity of the Lévy density. The classical Lévy–Khintchine
formula (4.4) leads to an integral of the form
"  #
gϕ (ω) = ejaϕ(r)ω − 1 − jaϕ(r)ω h a, ϕ(r) v(a) da dr
Rd R\{0}
 
with
 h a, ϕ(r) = 1|a|<1 (a). It turns out that the exact form of the compensation
h a, ϕ(r) is not important as long as it stabilizes the integral by introducing an
appropriate linear bias which results in a modification of the constant b 1 . Instead

of the canonical
  solution, we propose an alternative regularization with h a, ϕ(r) =
1|aϕ(r)|<1 a , which is compatible with the change of variable a = a/ϕ(r). The ratio-
nale is that this particular choice is guaranteed to lead to a convergent integral, as a
consequence of Lemma 9.2, so that the remainder of the proof is the same as in the
previous case: change of variable and interchange of integrals justified by Fubini’s
theorem. This also leads to some modified constant b1,ϕ which is necessarily finite
since both gϕ (ω) and fϕ (ω) are bounded.
To carry out the proof of Lemma 9.2, we have exploited the fact  that the integration
of f against a function ϕ amounts to a spectral mixing f (ω; μϕ ) = R f (θ ω) μϕ (dθ ). This
results in an admissible
 Lévy exponent provided that the pth absolute moment of the
measure μ is finite: R |θ |p μϕ (dθ ) < ∞. The equivalence with fϕ (ω) is obtained by
defining μϕ ((−∞, θ ]) = Meas{r : ϕ(r) < θ and ϕ(r)  = 0}, which specifies the ampli-
 R . To gain further insight into the mixing process,
tude distribution of ϕ(r) as r ranges over d

we like to view the Lebesgue integral R f (θ ω) μϕ (dθ ) as the limit of a sequence of Lévy

exponents fN (ω) = N n=1 f (θn,N ω)τn , each corresponding to a characteristic function of
N  
the form e fN (ω) = n=1  pX (sn ω)τn with sn = θn,N and τn = μϕ [θn−1,N , θn,N ) . The latter
is interesting because it provides a convolutional interpretation of the mixing process,
and also because it shows that all that matters is the amplitude distribution of ϕ, and not
its actual spatial structure.
C O R O L L A RY 9.4 (Symmetric Lévy exponents) Let f be an admissible Lévy exponent
and let ϕ ∈ Lp (R) for p ≥ 1 be a function  such
 that its
 amplitude
 distribution (or his-
togram) μϕ is symmetric. Then, fϕ (ω) = Rd f ωϕ(r) dr = R f (θ ω) μϕ (dθ ) is a valid
symmetric Lévy exponent
 that admits the canonical representation (4.4) with modified
Lévy parameters bϕ , vϕ (a) = vϕ (−a) , as specified in Theorem 9.1. Conversely, if f is
symmetric to start with, then fϕ stays symmetric for any ϕ, irrespective of its amplitude
distribution.
Proof Based on the Lévy–Khintchine representation of f and by relying on Fubini’s
theorem to justify the interchange of integrals, we get
b2  jaθω 
fϕ (ω) = − ω2 θ 2 μϕ (dθ ) + e − 1 − jaθ ω v(a) da μϕ (dθ )
2
 R   R R
d

ϕ2
b2 2 ∞  
=− ω ϕ2 + 2 cos(aθ ω) − 1 μϕ (dθ )v(a) da,
2 R 0
230 Infinite divisibility and transform-domain statistics

where we have made use of the symmetry assumption μϕ (E) = μϕ (−E). Since the
above formula is symmetric in ω and fϕ (ω) is a valid Lévy exponent (by Theorem 9.1),
we can invoke Corollary 4.3, which yields the desired result. The converse part of
Corollary 9.4 is obvious.
The practical relevance of Corollary 9.4 is that we can restrict ourselves to a sym-
metric model of noise without any loss of generality as long as the analysis function
(typically, a wavelet) has a symmetric amplitude distribution μϕ . This is equivalent to
all odd moments of μϕ being zero, including the mean.
One of the cornerstones of our formulation is that the mixing (white-noise integration)
does not fundamentally affect the key properties of the Lévy exponent, as will be made
clear in what follows.

9.2 Class C and unimodality

D E F I N I T I O N 9.1 An id distribution is said to be of class C if its Lévy density v(a) is


unimodal with mode 0; more precisely, if v (a) ≤ 0 for a > 0, and v (a) ≥ 0 for a < 0.
This is again a property that is conserved through basic composition: linear
combination of random variables and Fourier-domain exponentiation (fractional convo-
lution of pdf). Less obvious is the fact that the class-C property
τ is a necessary condition
for the family of distributions pXτ with  pXτ (ω) =  pX1 (ω) to be unimodal for all
τ ∈ R+ [Wol78]. Conversely, if pX1 is symmetric and of class C, then the whole family
pXτ is symmetric unimodal. The class-C unimodality equivalence, however, is only
partial. In particular, there exist non-symmetric distributions of class C that are not
unimodal, and some symmetric unimodal id distributions that are not in the class C for
which pXτ (x) is unimodal only for some subrange of τ [Wol78, Section 3]. The key
result for our purpose is as follows.
9.5 If the Lévy
C O R O L L A RY " density of an id distribution#is symmetric and unimodal,

then its pdf pX (x) = R exp − b22 ω2 + R cos(aω)v(a) da e−jxω 2dω π is symmetric and
unimodal as well. Moreover, the property is preserved through the white-noise integra-
tion process with some arbitrary bounded function ϕ ∈ L2 (Rd ).
Proof The first part is a restatement of a classical result in the theory of infinitely divi-
sible distributions [Wol78, Theorem 1]. As for the second part, we invoke Theorem 9.1,
which ensures that the modified Lévy exponent vϕ given by (9.2) is admissible. Since
v(a) = v(−a), vϕ is automatically symmetric as well. We prove its unimodality by com-
puting the derivative for a > 0 as
d 1
vϕ (a) = v(a/|ϕ(r)|) dr
da ϕ |ϕ(r)|
1
= v (a/|ϕ(r)|) dr ≤ 0,
ϕ |ϕ(r)|2
which follows from the condition v (a) ≤ 0 for a > 0 (unimodality of v(a)).
9.2 Class C and unimodality 231

The pleasing observation is that all the examples of id distributions in Table 4.1 are
in this category.
P R O P O S I T I O N 9.6 (Alternative statement) Let f (ω) = f (−ω) be a real-valued ad-
missible Lévy exponent of class C associated with a symmetric unimodal distribution
and ϕ a d-dimensional
 function such that ϕLp < ∞ for p ∈ R+ . Then, fϕ (ω) =
Rd f ωϕ(r) dr is a valid Lévy exponent that retains the symmetry and unimodality
properties.
The interest of the above is more in the constructive proof given below than in the
statement, which is slightly less general than Corollary 9.5.
Proof of Proposition 9.6 The result is obtained as a corollary of two theorems
in [GK68]: (i) the convolution of two symmetric 1 unimodal distributions is unimo-
dal, and (ii) if a sequence of unimodal distributions converges to a distribution, then the
limit function is unimodal as well. The idea is then to view the Lebesgue integral as the
n2 −1  
limit as n goes to infinity of the sequence of sums k=1 f ωsk,n μk,n , where  μ k,n is the

measure of the set Ek,n = {r : nk ≤ |ϕ(r)| < k+1 } and s k,n = arg min ϕ(r):r∈E
 f ωϕ(r) ,
n n,k
recalling that the Lévy exponent f is continuous.
 f (ωs )Each
μk,n individual term corresponds to
a characteristic function e μk,n f (ωsk,n ) = e k,n (Fourier-domain convolutional
factor) that is id (thanks to the rescaling and exponentiation property), symmetric, and
unimodal. Finally, we rely on the admissibility condition and Lebesgue’s dominated-
convergence theorem to show that the sequence converges to the limit fϕ (ω) = fϕ (−ω)
which specifies a valid symmetric id distribution.
This proof also suggests that the class-C property is the tightest possible condition in
the mixing scenario with arbitrary ϕ. Indeed, class C plus symmetry is necessary and
sufficient for pXτ (x) to be unimodal for all τ ∈ R+ [Wol78].
Another interesting class of id distributions are those that are completely monotonic.
We recall that a function (or density) q(x) is completely monotonic on R+ if it is of class
C∞ with alternating derivatives, so that
dn q(x)
(−1)n ≥ 0 for n ∈ N, x ≥ 0.
dxn
Thanks to Bernstein’s theorem, the symmetric version of completely monotonic distri-
butions can be expressed as mixtures of Laplacians; that is,
+∞ 1 −λ|x|
q(x) = λe pZ (λ) dλ (9.6)
0 2
for some mixing density pZ (λ) on R+ . By making the change of variable θ = 1/λ (scale
of the exponential distribution), we may also express q(x) as
+∞ 1 −|x|/θ ∞
q(x) = e p1/Z (θ ) dθ = pY (x|θ )p1/Z (θ ) dθ
0 2θ 0
1 While translating Gnedenko and Kolmogorov’s book into English, K. L. Chung realized that Lapin’s theo-
rem on the convolution of unimodal distributions, which does not impose the symmetry condition, is
generally not true. The result for symmetric distributions goes back to Wintner in 1936 (see [Wol78] for a
historical account).
232 Infinite divisibility and transform-domain statistics

with pZ (λ) dλ = p1/Z (θ ) dθ , where pY (x|θ ) is the Laplace distribution with standard
deviation θ . The probabilistic interpretation of the above expansion is that of an expo-
nential scale mixture: q = pX is the pdf of the ratio X = Y/Z of two independent random
variables Y and Z with pdfs pY (standardized Laplacian with λ = 1) and pZ (under the
constraint that Z is positive), respectively.
Complete monotonicity is one of the few simple criteria that ensures that a distribu-
tion is infinitely divisible [SVH03, Theorem 10.1]. Moreover, one can readily show that
a symmetric completely monotonic distribution is log-convex, since
  2
d2 log q(x) q (x) q (x)
2
= − ≥ 0. (9.7)
dx q(x) q(x)
Indeed, based on the canonical form (9.6), we have that
 +∞ 2
 −λ|x| λ
2
q (x) = −λe pZ (λ) dλ
0 2
 +∞   +∞ 
λ 2 −λ|x| λ
≤ e−λ|x| pZ (λ) dλ λ e pZ (λ) dλ = q(x)q (x),
0 2 0 2
where we have made use of the Cauchy–Schwarz inequality applied to the inner product
1 1
u, v Z = EZ { λ2 uv} with u(λ) = e− 2 λ|x| and v(λ) = −λe− 2 λ|x| .
Note that, unlike monotonicity, complete monotonicity is generally not preserved
through spectral mixing.
A related property at the other end of the spectrum is log-concavity. It is equivalent
to the convexity of the log-likelihood potential X (x) = − log pX (x), which is advan-
tageous for optimization purposes and for designing MAP estimators (see Chapter 10).
The id laws in Table 4.1 that are log-concave are the Gaussian, Laplace, and hyper-
bolic secant distributions. It should be kept in mind, however, that most id distributions
are not log-concave since the property is incompatible with a slower-than-exponential
decay. The limit case is the symmetric unimodal Laplace distribution, which is both
log-concave and log-convex.

9.3 Self-decomposable distributions

DEFINITION 9.2 A random variable X is said to be self-decomposable if, for every


λ ∈ (0, 1), X = λX + Xλ in law, where X and Xλ are independent random variables
with pdf pX and pXλ , respectively. The corresponding pX (resp., 
pX ) is also said to be
self-decomposable.

The self-decomposability of X is therefore equivalent to the requirement that its cha-


racteristic function 
pX (ω) = E{ejωX } can be factorized as


pX (ω) = 
pX (λω) · 
pXλ (ω), (9.8)

where 
pXλ (ω) is a valid characteristic function for any λ ∈ (0, 1).
9.3 Self-decomposable distributions 233

All the examples of id distributions in Table 4.1 are self-decomposable. For instance,
in the case of the symmetric gamma (sym gamma) distribution, we have that
 r
pX (ω) 1

pλ (ω) = = λ + (1 − λ )
2 2
.

pX (λω) 1 + ω2

p0 (ω))r , where p0 (x) = λ2 δ(x) + (1 − λ2 ) 12 e−|x| is a valid


The latter is of the form (
(id) pdf for any λ ∈ (0, 1). The key properties of self-decomposable distributions are
[GK68, SVH03]:
• All self-decomposable distributions are infinitely divisible and of class C.
• Self-decomposability is preserved through basic composition: multiplication, resca-
ling, and exponentiation of characteristic functions.
• A self-decomposable distribution is necessarily unimodal [SVH03, Theorem 6.23].
In particular, if pX is symmetric, then max{pX } = pX (0) and pX (x) ≤ 0 for x > 0.
• If 
pX (ω) is self-decomposable then  pXλ (ω) = ppXX(λω)
(ω)
is self-decomposable as well for
any λ ∈ (0, 1). The converse is true with the weaker requirement that  pXλ (ω) need
only be a valid characteristic function.
• A distribution is self-decomposable if and only if its Lévy density v is such that av(a)
is non-increasing on (−∞, 0) and on (0, +∞) [SVH03, Theorem 6.12].
The last property provides a simple operational criterion for testing self-decomposa-
bility when the Lévy density is known explicitly. The requirement is that v(a) should
either be zero or decrease monotonically at least as fast as 1/a.

PROPOSITION 9.7 Let f (ω) be a (p)-admissible Lévy exponent  associated


  with a
self-decomposable distribution and let ϕ ∈ Lp (R ). Then, fϕ (ω) = Rd f ωϕ(r) dr is an
d

admissible Lévy exponent that retains the self-decomposability property.

Proof First, we invoke Theorem 9.1, which ensures that fϕ is a proper Lévy exponent
to start with. The self-decomposability condition (9.8) is equivalent to

f (ω) = f (λω) + fλ (ω),

where fλ (ω) is a valid Lévy exponent for any λ ∈ (0, 1). Inserting ϕ and taking the
integral on both sides gives

   
fϕ (ω) = f λωϕ(r) dr + fλ ωϕ(r) dr
Rd Rd
= fϕ (λω) + fλ,ϕ (ω),
  
where fλ,ϕ (ω) = Rd fλ ωϕ(r) dr. Finally, since | fλ (ω)| ≤ | f (ω)|+| f (λω)|, we are gua-
ranteed that the integration with respect to ϕ yields an acceptable Lévy exponent. This
implies that both e fϕ (ω) and e fλ,ϕ (ω) are valid characteristic functions, which establishes
the self-decomposability property.
234 Infinite divisibility and transform-domain statistics

9.4 Stable distributions

Stability is the most stringent distributional property in the chain since all stable distri-
butions are self-decomposable and hence unimodal of class C.

DEFINITION 9.3 A random variable X is called (strictly) stable if, for every n ∈ Z+ ,
there exists cn such that X = cn (X1 + · · · + Xn ) in law, where the Xi are i.i.d. with pdf
pX (x).

The definition clearly implies infinite divisibility. On the side of the Lévy exponent,
the condition translates into the homogeneity requirement f (aω) = aα f (ω), with α ∈
(0, 2] being the stability index. One can then exploit this property to get the complete
parameterization of the family of stable distributions. If we add the symmetry constraint,
then the class of homogeneous Lévy exponents reduces to f (ω) ∝ −|ω|α . These expo-
nents are associated with the symmetric-alpha-stable (SαS) laws whose characteristic
function is specified by

p(ω; α, s0 ) = e−|s0 ω| = e fα (s0 ω)


α


with α ∈ (0, 2] and s0 ∈ R+ .


The fundamental point is that stability is preserved through linear combinations, as
a consequence of the definition. This remains true when we switch to a continuum by
integrating stable white noise against some analysis function ϕ.

PROPOSITION 9.8 Let f (ω) = −|s0 ω|α be a Lévy exponent associated


  with
 an SαS
distribution of order α. Then, the SαS property is preserved for Rd f ωϕ(r) dr, provi-
ded that ϕLα < ∞ .

Specifically, we have that


 
fα s0 ωϕ(r) dr = −|s0 ωϕ(r)|α dr
Rd Rd
 
= − sα0 |ϕ(r)|α dr |ω|α dr
Rd
  
sα
0

= −|s0 ω|α

with s0 = s0 ϕLα .


The result also holds in the non-symmetric scenario where the distributions are para-
 requires the additional assumption that ϕ ∈ L1 (R) (for the
meterized by (α, β, τ ). It then
non-zero-mean case) and Rd |ϕ(r)| log |ϕ(r)| dr (in the special asymmetric case where
α = 1 and β  = 0).
Stable laws play a crucial role in the generalized formulation of the central-limit
theorem, the standard (finite-variance) case being α = 2 (Gaussian distribution).
9.5 Rate of decay 235

9.5 Rate of decay

A fundamental property of infinitely divisible distributions is that they cannot decay


faster than a Gaussian (heavy-tail behavior). The range of behaviors (from faster to
slower decay) can be categorized as follows:
Gaussian with pid (x) = e−O(|x |)
2

• Supra-exponential with pid (x) = e−O(|x| log(|x|+1))
• Exponential with pid (x) = e−O(|x|) (e.g., the Laplace distribution)
• Exponential polynomial with pid (x) = O(xθ1 e−θ2 (|x|) ) for some θ1 > 0 and θ2 > 0
(e.g., the beta and Meixner distributions)
• Algebraic or inverse polynomial with pid (x) = O(1/|x|θ ) (e.g., SαS and Student) for
some θ > 1.
The most convenient way of probing decay  is through the determination of some
generalized moments of the form mX,θ = R gθ (x)pX (x) dx = E{gθ (X)}, where gθ (x) =
gθ (−x) denotes some suitable family of symmetric functions that are increasing away
from the origin. The typical choices for gθ are eθ|x| (increasing exponential with θ > 0),
|x|θ1 eθ2 x (exponential polynomial with θ1 ≥ 0, θ2 > 0), and |x|θ (monomial with θ > 0).
We also recall that the conventional (polynomial) moments with gn (x) = xn and n integer
can be readily obtained from the Taylor series of the characteristic function e f (ω) at the
origin, since

1 dn pX (ω) 
E{X } =
n n
a pX (a) da = n . (9.9)
R j dωn ω=0
An interesting theoretical aspect that is further developed in Section 9.6 is that these
quantities are in direct correspondence with the polynomial moments of the Lévy den-
sity v, which are closely linked to the cumulants of pX . In particular, we have the follow-
ing relations for the mean and variance of the distribution:

E{X} = xpX (x) dx = b1 − av(a) da


R |a|≥1

Var{X} = (x − E{X})2 pX (x) dx = b2 + a2 v(a) da.


R R
In order to characterize the decay of the distribution, it is actually sufficient to check
whether a given type of generalized moment is finite and to determine the range of
exponents θ over which E{gθ (X)} is not well defined (i.e., infinite). Once more, the good
news is that this information can be deduced directly from the Lévy density v(a) [Sat94].
D E F I N I T I O N 9.4 A weighting function g(x) on R is called submultiplicative if there
exists a constant C ≥ 1 such that

g(x + y) ≤ C g(x) g(y) for x, y ∈ R,

and if it is bounded and measurable on any compact subset.


236 Infinite divisibility and transform-domain statistics

T H E O R E M 9.9 (Kruglov [Kru70]) For an id distribution pX with Lévy density v, we


have the following equivalences for the existence of generalized moments:

|x|θ pX (x) dx < ∞ ⇔ |a|θ v(a) da < ∞


R |a|>1

|x|θ1 eθ2 |x| pX (x) dx < ∞ ⇔ |a|θ1 eθ2 |a| v(a) da < ∞
R |a|>1

gθ (x)pX (x) dx < ∞ ⇔ gθ (a)v(a) da < ∞,


R |a|>1

for any family of submultiplicative weighting functions gθ .

The first two equivalences are deduced as special cases of the last one by considering
the weighting sequences gθ (x) = max(1, |x|θ ) and gθ1 ,θ2 (x) = max(1, |x|θ1 )eθ2 x , which
are submultiplicative for θ , θ2 > 0 and θ1 ≥ 0.
As an application of Theorem 9.9, we can restate the Lévy–Schwartz admissibility
condition in Theorem 4.8 as

E{|1[0,1) , w | } < ∞ for some  > 0.



(9.10)

The key here is that Xid = 1[0,1) , w (the canonical observation of the innovation w)
is an id random variable whose characteristic function is  pid (ω) = e f (ω) (see Proposi-
tion 4.12).
The direct implication is that pX will have the same decay behavior as v, keeping in
mind that it can be no better than supra-exponential (e.g., when v is compactly suppor-
ted), unless we are in the purely Gaussian scenario with v(a) = 0. This also means that,
from a decay point of view, supra-exponential and compact support are to be placed in
the same category.
The fundamental reason for the equivalences in Theorem 9.9 is that the corresponding
types of decay (exponential vs. polynomial) are preserved through convolution. In the
latter case, this can be quantified quite precisely using a generalized version of the
Young inequality for weighted Lp -spaces (see [AG01]):

gθ (h ∗ q)Lp ≤ Cθ gθ hL1 · gθ qLp

for any p ≥ 1 and any family of submultiplicative weighting functions gθ (x). The
canonical example of weighting function that is used to characterize algebraic decay is
gα (x) = (1 + |x|)α with αR+ , which is submultiplicative with constant Cα = 1.
In light of Theorem 9.9, our next result implies that the rate of polynomial decay and
the moment-related properties of pX are preserved through the white-noise integration
process.

P R O P O S I T I O N 9.10 Let v be a Lévy density with a p0 th order of algebraic decay at


p 
infinity for some p0 ≥ 1 (or better) and ϕ a function such that ϕLp = Rd |ϕ(r)|p

dr < ∞ for all p ≥ p0 − 1, with the formal convention that ϕ0L0 = ϕ dr. Then, the
transformed density vϕ given by (9.2) will retain the decay of v and belong to the same
9.6 Lévy exponents and cumulants 237

general class (e.g., finite-variance vs. infinite-variance density, Poisson, Gaussian or


alpha-stable). In particular, its pth-order absolute moment for any p ≥ 0 is given by

p
mvϕ ,p = |a|p vϕ (a) da = ϕLp mv,p (9.11)
R

and is well defined whenever mv,p , the pth-order absolute moment of v, is bounded.
Likewise, the integration against ϕ will preserve the finiteness of all absolute moments
of the underlying distribution, including those for p ≤ 2.

Proof For the admissibility of vϕ , we refer to Theorem 9.1. To show that the integrand
in (9.2) is well defined when ϕ(r) tends to zero and that vϕ has the same decay properties
at infinity as v, we consider the bound v(a) < C|a|−p0 with p0 ≥ 1, which implies
v(a/ϕ(r)) ≤ C|ϕ(r)|p0 −1 |a|−p0 . It follows that vϕ (a) ≤ CϕLp |a|−p0 with
1 p
that |ϕ(r)|
p = p0 − 1, meaning that the rate of decay of v is preserved. To check the integrability
of vϕ , we make the reverse change of variable a = ϕ(r)a and obtain

1
v(a/ϕ(r)) dr da = v(a ) dr da
R ϕ |ϕ(r)| R ϕ

= dr v(a ) da = ϕ0L0 v(a ) da .


ϕ R R
 p 
Likewise, we can readily show that R |a|p vϕ (a) da = ϕLp R |a|p v(a) da, assum-
ing that these latter quantities are well defined. For the absolute moments of order
p < 2, we can refer to Lemma 9.2 and the conservation of the admissibility condition
2 , |a|p )v (a) da < ∞.
R min(a ϕ

In particular, when v is of the Poisson type, we have that R v(a) da = λ < +∞,
and it follows that vϕ will be of the Poisson type as well, provided that ϕ is compactly
supported. The fact that the Poisson parameter λ increases in proportion to the support
of ϕ is consistent with the window intersecting more Dirac impulses (components of
the white Poisson noise).
We shall now see that a relation similar to (9.11) holds for the standard integer-order
moments of vϕ and v; for p > 2, these actually happen to be the cumulants of the
underlying probability distributions.

9.6 Lévy exponents and cumulants

The fact is that, with id distributions, it is often simpler to work with cumulants than
with ordinary moments, the idea being that cumulants are to the Lévy exponent what
moments are to the characteristic
 function.
Specifically, let 
pX (ω) = R pX (x)ejωx dx be the characteristic function of a pdf (not
necessarily id) whose moments mp = R xp pX (x) dx < ∞ are assumed to be well defined
for any p ∈ N. The boundedness of the moments implies that  pX (ω) (the conjugate
238 Infinite divisibility and transform-domain statistics

Fourier transform of pX ) is infinitely differentiable at the origin. Hence, it is possible to


represent 
pX , as well as its logarithm, by the Taylor series expansions

(jω)p

pX (ω) = mp , (9.12)
p!
p=0

(jω)n
pX (ω) =
log κn , (9.13)
n!
n=1

where the second formula provides the definition of the κn , which are the cumulants
of pX . This clearly shows that pX is uniquely characterized by its moments (by way
of its characteristic function), or, equivalently, in terms of its cumulants by way of its
cumulant generating function log pX (ω). Another equivalent way of expressing this cor-
respondence is

1 dn logpX (ω) 
κn {X} = n  ,
j dωn ω=0
which is the direct counterpart of (9.9). By equating the Taylor series of the exponential
of the right-hand sum in (9.13) with (9.12), one can derive a direct relation between the
moments and the cumulants

p−1  
p−1
mp = mn κp−n ,
n
n=0

with κ0 = 0 and m0 = 1. There is also a converse version of this formula. In prac-


tice, however, it is often more convenient to compute the cumulants from the centered
moments μp = E{(x − E{x})p } of the distribution. For reference purposes, we provide
the relevant formulas up to order 6:

κ2 = μ2
κ3 = μ3
κ4 = μ4 − 3μ22
κ5 = μ5 − 10μ3 μ2
κ6 = μ6 − 15μ4 μ2 − 10μ23 + 30μ32 ,

while κ1 = m1 is simply the mean of the distribution.


In the case of interest to us where pX is id, logpX (ω) = f (ω), which translates into
⎧ 
 ⎪
⎪ b1 − |a|≥1 av(a) da, n = 1
n
1 d f (ω)  ⎨ 
κn {X} = n = b2 + R a2 v(a) da, n=2
j dωn ω=0 ⎪ ⎪
⎩  n
R a v(a) da, n > 2.
This relation is fundamental for it relates the cumulants of the id distribution to the
moments of the Lévy density. It also implies that κ4 {X} > 0 whenever v(a)  = 0, which
translates into the property that all non-Gaussian id distributions are leptokurtic. This is
9.7 Semigroup property 239

consistent with id distributions being more peaky than a Gaussian around the mean and
exhibiting fatter tails.
A further theoretical motivation for using cumulants is that they provide a direct
measure of the deviation from Gaussianity since the cumulants of a Gaussian are neces-
2
sarily zero for n > 2 (because log gσ (ω) = − σ2 ω2 ).
A final practical advantage of cumulants is that they offer a convenient means of
quantifying – and possibly inverting – the effect of the white-noise integration process.
P R O P O S I T I O N 9.11 Let f be an admissible Lévy exponent and let ϕ ∈ Lp (R) for all
  
p ≥ 1. Then, the cumulants of fϕ (ω) = Rd f ωϕ(r) dr are related to those of f (ω) by
 n
κn,ϕ = κn ϕ(r) dr.
Rd
Proof We start by writing the Taylor series expansion of f as

(jω)n
f (ω) = κn ,
n!
n=1
where the quantities κn are, by definition, the cumulants of the density function pX (x) =
−1
F {e f (ω) }(x). Next, we replace ω by ωϕ(r) and integrate over Rd , which gives

  (jω)n  n
f ωϕ(r) dr = κn ϕ(r) dr = fϕ (ω).
Rd n! Rd
n=1
The last step is to equate the above expression to the expansion of the cumulant gen-
erating function fϕ

(jω)n
fϕ (ω) = κn,ϕ ,
n!
n=1
which yields the desired result.
A direct implication is that the odd-order cumulants of ϕ, w are zero whenever the
analysis kernel ϕ has a symmetric amplitude distribution. This results in a modified
Lévy exponent fϕ that is real-valued symmetric, which is consistent with Corollary 9.4.
In particular, this condition is fulfilled when the analysis kernel has an axis of antisym-
metry; that is, when there exists r0 such that ϕ(r0 + r) = −ϕ(r0 − r), ∀r ∈ Rd .
As for the non-zero even-order cumulants, we can use the relation
κ2m,ϕ
κ2m =
ϕ2m
L (Rd ) 2m

to recover the cumulants of w from the moments of the observed random variable ϕ, w .

9.7 Semigroup property

It turns out that every id pdf is embedded in a semigroup that plays a central role in
the classical theory of Lévy processes. These convolution semigroups define natural
240 Infinite divisibility and transform-domain statistics

families of id pdfs such as the sym gamma and Meixner distributions that are extending
the Laplace and hyperbolic secant distributions, respectively. In Section 9.8, we put
the semigroup property to good use to characterize the behavior of the wavelet-domain
statistics across scales.
Let us recall the exponentiation property:
 τ pX1 is id with characteristic function
if
pXτ (ω) = eτ f (ω) = 
e f (ω) then pXτ with  pX1 (ω) is id as well for any τ ∈ R+ . A direct
implication is the convolution relation that relates the pdfs at scale τ + τ0 and τ0 with
τ > 0, expressed as
 
pXτ +τ0 (x) = pXτ ∗ pXτ0 (x).
This suggests that the family of pdfs {pXτ : τ ∈ [0, ∞)} is endowed with a semigroup-
like structure. To specify the semigroup properties and spell out the implications, we
introduce the family of linear convolution operators Qτ with τ ≥ 0,
 
Qτ q(x) = pXτ ∗ q (x)
for any q ∈ L1 (R). Clearly, the family {Qτ : τ ≥ 0} is bounded over L1 (R) with Qτ  =
supqL =1 Qτ qL1 ≤ 1 (as a consequence of Young’s inequality) and is such that: (1)
1
Qτ1 Qτ2 = Qτ1 +τ2 for τ1 , τ2 ∈ [0, ∞), (2) Q0 = Id, and (3) limτ ↓0 Qτ q−qL1 = 0 for any
q ∈ L1 (R). It therefore satisfies all the properties of a strongly continuous contraction
semigroup. Such semigroups are entirely characterized by their infinitesimal generator
G, which is defined as
Qτ q(x) − q(x)
Gq(x) = lim . (9.14)
τ ↓0 τ
Based on this generator, the members of the group can be represented via the exponen-
tial map
1
Qτ = eτ G = Id + τ G + τ 2 G2 + · · · .
2!
Likewise, we may also write
pXτ + τ (x) − pXτ (x)
lim = GpXτ (x), (9.15)
τ →0 τ
which implies that pXτ (x) = p(x, τ ) is the solution of the partial differential equation

p(x, τ ) = Gp(x, τ ),
∂τ
with initial condition p(x, 0) = pX0 (x) = δ(x). In the present case where Qτ = eτ G is
shift-invariant, we have a direct correspondence with the frequency response eτ f (ω) . By
transposing Definition (9.14) into the Fourier domain, we identify G as the LSI operator
specified by
eτ f (ω) − 1 −jωx dω
Gq(x) = lim 
q(ω) e
τ ↓0 R τ 2π

=  q(ω)f (ω)e−jωx , (9.16)
R 2π
9.7 Semigroup property 241

where q(ω) is the (conjugate) Fourier transform of q(x). The fact that the Lévy exponent
f (ω) is the frequency response of G has some pleasing consequences for the interpreta-
tion of the PDE that rules the evolution of pXτ (x) as a function of τ .

9.7.1 Gaussian case


When f (ω) = − b22 ω2 , the operator G boils down to a second derivative along x. This
results in the diffusion-type evolution equation
∂ b2 ∂ 2
p(x, τ ) = p(x, τ ),
∂τ 2 ∂x2
∂ 2
where ∂x 2 = x is the 1D equivalent of the Laplacian. The solution is a Gaussian with

standard deviation τ b2 .

9.7.2 SαS case


When f (ω) = −sα0 |ω|α , the operator is proportional to the fractional Laplacian of order
α/2, denoted by (− x )α/2 . Hence, we can write down the fractional diffusion equation

p(x, τ ) = −sα0 (− x )α/2 p(x, τ ),
∂τ
which is quite similar to the Gaussian case, except that the impulse response of G is
no longer a point distribution. Here, the solution is an SαS distribution of order α and
dispersion sτ = τ 1/α s0 .

9.7.3 Compound-Poisson case


Under the assumption that v = λpA ∈ L1 (R) and b1 = b2 = 0,  we express the
Lévy
 exponent as f (ω) = R (e − 1)v(a) da = λ 
jaω pA (ω) − pA (0) , where 
pA (ω) =
p (x)ejωx dx is the characteristic function of p (x) ≥ 0 with the normalization
R A A
constraint pA (0) = 1. It follows that the generator G is a filter characterized by the
−1
impulse response g(x) = F {f}(x) = λpA (x) − λδ(x). The corresponding evolution
equation is
∂  
p(x, τ ) = λ pA ∗ p(·, τ ) (x) − λp(x, τ )
∂τ
and involves a convolution with the Poisson amplitude distribution pA .

9.7.4 General iterated-convolution interpretation


By following up on the Poisson example, we propose to describe the action of G in full
generality as a convolution with some impulse response g(x) = G{δ}(x). This leads to
the equivalent form of the evolution equation
∂  
p(x, τ ) = g ∗ p(·, τ ) (x), (9.17)
∂τ
242 Infinite divisibility and transform-domain statistics

which is quite
 attractive from
 an engineering perspective. In the case where f (ω) =
2
−b2 ω2 + R ejaω −1−jaω v(a) da, we obtain the explicit form of the impulse response
by the formal inverse (conjugate) Fourier transformation
b2   
g(x) = G{δ}(x) = δ (x) + δ(x − a) − δ(x) + aδ  (x) v(a) da,
2 R

where δ  and δ  are the first and second derivatives of the Dirac impulse. Some fur-
ther simplification is possible if we can split the second component of g(x) into parts,
although this requires some special care because all non-Poisson Lévy densities are sin-
gular around the origin. A first simplification occurs when R |a|v(a) da  < ∞, which
allows us to pull δ  out of the integral with its weight being m1 = R av(a) da. To
bypass the singularity issue, we consider the sequence of non-singular Lévy densities
v1/n (a) = v(a) for |a| > 1/n and zero otherwise, which converges to v(a) as n goes to
infinity. Using the fact that v1/n is Lebesgue-integrable (as a consequence of the admissi-
bility condition), we can perform a standard (conjugate) Fourier inversion, which yields
 
 b2 
g(x) = −m1 δ (x) + δ (x) + lim v1/n (x) − δ(x) v(a) da ,
2 n→∞ |a|<1/n

with the limit component limn→∞ v1/n (x) = v(x) being the original Lévy density. The
above representation is enlightening because the impulse response is now composed of
two terms: a distribution that is completely localized at the origin (linear combination
of Dirac impulse and its derivatives up to order two) plus a smoothing component that
converges to the initial Lévy density v. The Dirac correction is actually crucial because
it converts an essentially lowpass filter (convolution with v(x) ≥ 0) into a highpass one
which is consistent with the requirement that f (0) = 0.
We can also use this result to describe the evolution of pXτ for some small increment
τ , like
*  +
−1 1
pXτ + τ (x) = F pXτ (ω) 1 + τ f (ω) + τ 2 f2 (ω) + · · · (x)
2!
1  
= pXτ (x) + τ (g ∗ pXτ )(x) + τ 2 g ∗ g ∗ pXτ (x) + O( τ 3 ),
2!
where g = G{δ} is the impulse response of the semigroup generator. Since g is represen-
table as the sum of a point distribution concentrated at the origin and a Lévy density v,
this partly explains why pXτ will be endowed with the properties of v that are preserved
through convolution; in particular, the rate of decay (exponential vs. polynomial), sym-
metry, and unimodality.

9.8 Multiscale analysis

The semigroup property is especially relevant for describing the evolution across scale
of the wavelet-domain statistics of a sparse stochastic process. The premise for the
9.8 Multiscale analysis 243

validity of such an analysis is that the whitening operator L is scale-invariant of order γ


and that the mother wavelet ψ can be written as

ψ(r) = L∗ φ(r),

where φ ∈ L1 (Rd ) is a suitable “smoothing” kernel. We also assume that the wave-
lets are normalized to have a constant L2 norm across scales. In the framework of the
continuous wavelet transform, the wavelet at scale a > 0 is therefore given by

ψa (r) = a−d/2 ψ(r/a)


= a−d/2 L∗ {φ(·)}(r/a)
% &
= L∗ a−d/2+γ φ(·/a) (r)
= L∗ φa (r) (9.18)

with

φa (r) = aγ −d/2 φ(r/a), (9.19)

where we have made use of the scale-invariance property of L (see Definition 5.2 in
Chapter 5). The modified Lévy exponent at scale a can therefore be determined to be
 
fφa (ω) = f ωaγ −d/2 φ(r/a) dr
Rd
 
= f ωaγ −d/2 φ(r ) |a|d dr (change of variable r = r/a)
Rd
 
= a fφ aγ −d/2 ω .
d
(9.20)

Thus, we see that there are two mechanisms at play that determine the evolution of
the wavelet distribution across scale. The first is a simple change of amplitude 2 of the
wavelet coefficients with their standard deviation being divided by the factor aγ −d/2
that multiplies ω in (9.20). The second is the multiplication of the Lévy exponent by
τ = ad , which induces the kind of convolution semigroup investigated in Section 9.7.

9.8.1 Scale evolution of the pdf


To make the link with the previous results on semigroups explicit, we introduce the
family of id pdfs
−1
% &
pid (x; τ , fφ ) = F eτ fφ (ω) (x)

2 Let Y = bX where X is a random variable with pdf p and b a fixed scaling constant. Then, p (x) =
X Y
|b|−1
 pX (x/b),
2
 Var(Y) = b Var(X), and pY (ω) = pX (bω). In the present case, X is id with 
pX (ω) =
exp fφ (ω) .
244 Infinite divisibility and transform-domain statistics

that are tied to the smoothing kernel φ and indexed by the parameter τ ∈ R+ . Next, we
consider the random variables associated with the wavelet coefficients of the stochastic
process s(r) at some fixed location r0 and scale a:

Va = ψa (· − r0 ), s = φa (· − r0 ), w .

Since the Lévy noise w is stationary, it follows that Va has an id distribution with modi-
fied Lévy exponent fφa , as specified by (9.20). This allows us to express pVa , the pdf of
the wavelet coefficients Va , as
 
pVa (x) = pid x; ad , fφ (b·)
 
= |b|−1 pid x/b; ad , fφ

with b = aγ −d/2 . Instead of rescaling the argument of fφ or the pdf, we can also consider
the renormalized wavelet coefficients Ya = Va /aγ −d/2 whose distribution is character-
ized by

pYa (x) = pid (x; ad , fφ ),

which indicates that the evolution across scale is part of the same extended id family.
This connection allows us to transpose the results of Section 9.7 into the wavelet domain
and to infer the corresponding evolution equation. These findings are summarized as
follows.

PROPOSITION 9.12 Let s be a sparse process with scale-invariant whitening opera-


tor L of order γ and Lévy exponent f. The continuous wavelet transform at scale a of s
is given by va (r) = ψa (· − r), s with the wavelet ψ = L∗ φ being L-compatible. Then,
va (r) is stationary for all a > 0 and its first-order pdf pVa is infinitely divisible with Lévy
exponent given by (9.20).
Moreover, let pYτ denote the pdf of the scale-normalized wavelet coefficients yτ =
a d/2−γ va with τ = ad . Then, pYτ (x) = pY (x; τ ) satisfies the differential evolution equa-
tion
∂  
pY (x, τ ) = gφ ∗ pY (·, τ ) (x), (9.21)
∂τ
where gφ is the generalized
 function that is the inverse (conjugate) Fourier transform
of fφ (ω) = Rd f ωφ(r) dr.

9.8.2 Scale evolution of the moments


p
Equation (9.21) suggests that the evolution of the wavelet moments E{Va } is dependent
upon two effects: the first is the dilation of the pdf by aγ −d/2 , which translates into
a simple moment proportionality factor (aq(γ −d/2) ), while the second is the fractional
convolution (multiplication of the Lévy exponent by ad ), which induces some additional
spreading of the pdf. For the case where p is an integer, we are able to derive an explicit
formula by considering cumulants rather than moments.
9.8 Multiscale analysis 245

Since we know the wavelet-domain Lévy exponent fφa (ω) = log pVa (ω), we apply
the technique of Section 9.6 to determine the evolution of the cumulants across scale.
We obtain

1 dn log pVa (ω) 
κn {Va } = n 
j dωn

ω=0

1 dn ad fφ aγ −d/2 ω 
= n 
j dωn 
ω=0

" #n 1 dn fφ ω 

= ad aγ −d/2  (chain rule of differentiation)
jn dω n
ω=0
= ad an(γ −d/2) κn {V1 }. (9.22)

In the case of the variance (i.e., κ2 {Va } = Var{Va }), this simplifies to

Var{Va } = a2γ Var{V1 }. (9.23)

Not too surprisingly, the result is compatible with the scaling law of the variance of the
wavelet coefficients of a fractional Brownian field: Var{Va } = σ02 a2H+d , where H = γ − d2
is the Hurst exponent of the process. The latter corresponds to the special Gaussian
version of the theory with κn {Va } = 0 for n > 2.
The implication of (9.22) is that the evolution of the wavelet cumulants across scale
is linear in a log-log plot. Specifically, we have that

loga κn {Va } = loga κn {V1 } + d + n(γ − d/2),

which suggests a simple regression scheme for estimating γ from the moments of the
wavelet coefficients of a self-similar process.
Based on (9.22), we relate the evolution of the kurtosis to the scale as
κ4 {Va }
η4 (a) = = a−d η4 (1). (9.24)
κ22 {Va }
This implies that the kurtosis (if initially well defined) converges to zero as the scale
gets coarser. We also see that the rate of convergence is universal (independent of the
order γ ) and that it is faster in higher dimensions.
The normalization ratio for the other cumulants is given by
κm {Va } 1
ηm (a) = m/2
= ηm (1), (9.25)
κ2 {Va } ad(m/2−1)

which shows that the relative contributions of the higher-order cumulants (with m > 2)
tend to zero as the scale increases. This implies that the limit distribution converges to
a Gaussian under the working hypothesis that the moments (or cumulants) of pid are
well defined. This asymptotic behavior happens to be a manifestation of a generalized
version of the central-limit theorem, the idea being that the dilation of the observation
window has an effect that is equivalent to the summation of an increasing number of
i.i.d. random contributions.
246 Infinite divisibility and transform-domain statistics

9.8.3 Asymptotic convergence to a Gaussian/stable distribution


Let us first investigate the case of an SαS innovation model, which is quite instructive
and also fundamental for the asymptotic theory. The corresponding form of the Lévy
exponent is f (ω) = −|s0 ω|α with α ∈ (0, 2) and dispersion parameter s0 ∈ R+ . By
substituting this expression in (9.20), we get
 α
 γ −d/2 
fφa (ω) = − s0 a ωφ(r) ad dr
R d
 
 γ −d/2 d/α α
= −  s0 a a ω |φ(r)|α dr
Rd
 α
= − sa,φ ω , (9.26)

where

sa,φ = s0 aγ −d/2+d/α φLα . (9.27)

This confirms the fact that the stability property is conserved in the wavelet
√ domain.
In the particular case of a Gaussian process with α = 2 and s0 = σ0 / 2, we obtain
sa,φ = s0 aγ φL2 , which is compatible with (9.23).
The remarkable aspect is that the combination of (9.26) and (9.27) specifies the limit
distribution of the wavelet pdf for a sufficiently large under very general conditions,
where α ≤ 2 is the critical exponent of the underlying distribution. The parameter s0 is
related to the αth moment of the canonical pdf, where α is the largest possible exponent
for this moment to be finite, the standard case being α = 2.
Since the derivation of this kind of limit result for α < 2 (generalized version of
the central-limit theorem) is rather technical, we concentrate now on the finite-variance
case (α = 2), which can be established under rather general conditions using a basic
Taylor-series argument.
For simplicity, we consider a centered scenario withmodified Lévy triplet b1,φ = 0,
b2,φ = b2 φ2L2 , and vφ as specified by (9.2) such that R tvϕ (t) dt = 0. Note that these
conditions are automatically satisfied when the variance of the innovations is finite and
f is real-valued symmetric
 (see Corollary 9.4). Due to the finite-variance hypothesis, we
have that m2,φ = R t2 vφ (t) dt < ∞. This allows us to write the Taylor series expansion
of fφa (ω/aγ ) as
 
fφa (ω/aγ ) = ad fφ aγ −d/2 ω/aγ
 
= ad fφ a−d/2 ω
b2,φ ω2 m2,φ 2
=− − ω + O(a−d ω4 ),
2 2
which corresponds to the Lévy exponent of the normalized variable Za = Va /aγ . This
implies that: (1) the variance of Za is given by E{Z2a } = b2,φ + m2,φ = Var{V1 } and is
b +m
independent of a, and (2) lima→+∞ fφa (ω/aγ ) = − 2,φ 2 2,φ ω2 , which indicates that the
limit distribution of Za is a centered Gaussian (central-limit theorem).
9.9 Notes and pointers to the literature 247

Practically, this translates into the Gaussian approximation of the pdf of the wavelet
coefficients given by

 
pVa (x) ≈ Gauss 0, a2γ Var{V1 } ,
which becomes more and more accurate as the scale a increases. We note the simplicity
of the asymptotic model and the fact that it is consistent with (9.23) which specifies the
general evolution of the variance across scale.

9.9 Notes and pointers to the literature

While most of the results in this chapter are specific to sparse stochastic processes, the
presentation relies heavily on standard results from the theory of infinitely divisible
laws. Much of this theory gravitates around the one-to-one relation that exists between
the pdf (pid ) and the Lévy density (v) that appears in the Lévy–Khintchine representation
(4.2) of the exponent f (ω), the general idea being to translate the properties from one
domain to the other. Two classical references on this subject are the textbooks by Sato
[Sat94] and Steutel and Van Harn [SVH03].
The novel twist here is that the observation of white noise through an analysis win-
dow ϕ results in a modified Lévy exponent fϕ and hence a modified Lévy density vϕ ,
as specified by (9.2). The technical aspect is to make sure that both fϕ and vϕ are ad-
missible, which is the topic of Section 9.1. Determining the properties of the pdf of
Xϕ = ϕ, w then reduces to the investigation of vϕ , which can be carried out using
the classical tools of the theory of id laws. An alternative proof of Theorem 9.1 as
well as some complementary results on infinite divisibility and tail decay are reported
in [AU14].
Many of the basic concepts such as stability and self-decomposability go back to
the groundwork of Lévy [Lév25, Lév54]. The general form of Theorem 9.9 is due to
Kruglov [Kru70] (see also [Sat94, pp. 159–160]), but there are antecedents from Rama-
chandran [Ram69] and Wolfe [Wol71].
The use of semigroups and potential theory is a fruitful approach to the study of
continuous-time Markov processes, including Lévy processes. The concept appears to
have been pioneered by William Feller [Fel71, see Chapters 9 and 10]. Suggested read-
ings on this fascinating topic are [Sat94, Chapter 8], [App09, Chapter 3], and [Jac01].
The transposition of these tools in Section 9.8 for characterizing the evolution of the
wavelet statistics across scale is new, to the best of our knowledge; an expanded version
of this material will be published elsewhere.
10 Recovery of sparse signals

In this chapter, we apply the theory of sparse stochastic processes to the reconstruction
of signals from noisy measurements. The foundation of the approach is the specifi-
cation of a corresponding (finite-dimensional) Bayesian framework for the resolution
of ill-posed inverse problems. Given some noisy measurement vector y ∈ RM pro-
duced by an imaging or signal acquisition device (e.g., optical or X-ray tomography,
magnetic resonance), the problem is to reconstruct the unknown object (or signal) s as
a d-dimensional function of the space-domain variable r ∈ Rd based on the accurate
physical modeling of the imaging process (which is assumed to be linear).
The non-standard aspect here is that the reconstruction problem is stated in the conti-
nuous domain. A practical numerical scheme is obtained by projecting the solution onto
some finite-dimensional reconstruction space. Interestingly, the derived MAP estima-
tors result in optimization problems that are very similar to the variational formulations
that are in use today in the field of biomaging, including Tikhonov regularization and
1 -norm minimization. The proposed framework provides insights of a statistical nature
and also suggests novel computational schemes and solutions.
The chapter is organized as follows. In Section 10.1, we present a general method for
the discretization of a linear inverse problem in a shift-invariant basis. The correspon-
ding finite-dimensional statistical characterization of the signal is obtained by suitable
“projection” of the innovation model onto the reconstruction space. This information is
then used in Section 10.2 to specify the maximum a posteriori (MAP) reconstruction
of the signal. We also develop an iterative optimization scheme that alternates between
a classical linear reconstructor and a shrinkage estimator that is specified by the signal
prior. In Section 10.3, we apply these techniques to the reconstruction of biomedical
images. After reviewing the physical principles of image formation, we derive practical
MAP estimators for the deconvolution of fluorescence micrographs, for the reconstruc-
tion of magnetic resonance images, and for X-ray computed tomography. We present
illustrative examples and discuss the connections with several reconstruction algorithms
currently in favor. In Section 10.4, we investigate the extent to which such variational
methods approximate the minimum-mean-square-error (MMSE) solution for the sim-
pler problem of signal denoising. To that end, we present a direct algorithm for the
MMSE denoising of Lévy processes that is based on belief propagation. This optimal
solution is then used as reference for assessing the performance of non-Gaussian MAP
estimators.
10.1 Discretization of linear inverse problems 249

10.1 Discretization of linear inverse problems

The proposed discretization approach is inspired by the classical method of resolution


of PDEs using finite elements. The underlying principle is to reconstruct a continuously
defined approximation of the original signal that lives in some finite-dimensional sub-
space spanned by basis functions located on a reconstruction grid with step size h. The
advantages of this strategy are twofold. First, it provides a clean analytical solution
to the discretization problem via the specification of basis functions (splines or wave-
lets). Second, it offers the same type of error control as finite-element methods: in other
words, one can make the discretization error arbitrarily small by selecting a reconstruc-
tion grid that is sufficiently fine.
To discretize the problem, one represents the unknown signal s as a weighted sum

of basis functions s(r) = k∈ s[k]βk (r) with card()  = K, and specifies the linear
(noise-free) forward model y0 = Hs with s = s[k] k∈ , where the (M × K) system
matrix H accounts for the image-formation physics. The reconstruction problem then
essentially boils down to inverting this system of equations, which is typically very
large and ill-conditioned (underdetermined or unstable). Another confounding factor
is that the measurements are often corrupted by measurement noise. In practice, the
ill-posedness of the reconstruction problem is dealt with by introducing regularization
constraints that favor certain types of solutions.
Here, we assume that the underlying signal s(r) is a realization of a sparse stochastic
process and take advantage of the continuous-domain innovation model Ls = w to
specify the joint pdf pS (s) of the discrete representation of the signal. We then use this
prior information to derive solutions to the reconstruction problem that are optimal in
some well-defined statistical sense. We emphasize the MAP solution (i.e., the signal that
best explains the measured data) and propose some practical reconstruction algorithms.
We will also show that MMSE estimators are accessible under certain conditions (e.g.,
Gaussianity or Markov property).

10.1.1 Shift-invariant reconstruction subspace


Since the original specification of the problem is in the continuous domain, its proper
discretization requires a scheme by which the solution can be expressed in some recons-
truction space with minimal loss of information. For practical purposes, we would also
like to associate the discrete representation of the signal with its equidistant samples and
control the quality of the discretization through the use of a simple resolution parameter:
the sampling step h. This is achieved within the context of generalized sampling using
“shift-invariant” reconstruction spaces [Uns00].
To simplify the presentation and to set aside the technicalities associated with bound-
ary conditions, we start by considering signals that are defined over the infinite domain
Rd . Our reconstruction space at resolution h is then defined as
⎧ ⎫
⎨   ⎬
r − hk
Vh = sh (r) = ch [k]β : ch [k] ∈ ∞ (Zd ), r ∈ Rd ,
⎩ d
h ⎭
k∈Z
250 Recovery of sparse signals

which involves a family of shifted basis functions specified by a single B-spline-like


generator β(r). The basis functions are rescaled to match the resolution (dilation by h)
and positioned on the reconstruction grid hZd . For computational reasons, it is important
that the signal representation in terms of its coefficients ch [k] be stable and unambiguous
(Riesz-basis property). Moreover, for the discretization procedure to be acceptable, we
require that the error between the original signal s(r) and its projection onto Vh vanishes
as the sampling step h tends to zero. These properties are fulfilled if and only if β
satisfies the following conditions, which are standard in sampling and approximation
theory [Uns00]:

(1) Riesz-basis condition: for all ω ∈ Rd , 0 < A ≤ n∈Zd |β(ω  + 2πn)|2 ≤ B < ∞
where β  = F {β} is the d-dimensional Fourier transform of β (see Section 6.2.3).

(2) Lp -stability for p ≥ 1: k∈Zd |β(r − k)| < ∞, for all r ∈ Rd

(3) Partition of unity: k∈Zd β(r − k) = 1, for all r ∈ Rd .

We note that these conditions are met by the polynomial B-splines. These functions
are known to offer the best cost–quality tradeoff among the known families of interpo-
lation kernels [TBU00]. The computational cost is typically proportional to the support
of a basis function, while the quality is determined by its approximation order (asymp-
totic rate of decay of the approximation error). The approximation order of a B-spline
of degree n is (n + 1), which is the maximum that is achievable with a support of size
n + 1 [BTU01].
In practice, it makes good sense to choose β compactly supported so that condition
(2) is automatically satisfied. Under such a hypothesis, we can extend the representation
for sequences ch [·] with polynomial growth, which may be required for handling non-
stationary signals such as Lévy processes and fractals.
Our next ingredient is a biorthogonal (generalized) function β̃ such that

β(· − k), β̃(· − k ) = δk−k , (10.1)

where β̃ is not necessarily included in Vh . Based on this biorthogonality property, we


specify a signal projector onto the reconstruction space Vh as

   
51 · − hk 6 r − hk
PVh s(r) = β̃ ,s β .
d
hd h h
k∈Z

Using (10.1), it is not hard to verify that PVh is idempotent (i.e., PVh PVh s = PVh s) and
hence a valid (linear) projection operator. Moreover, the partition of unity guarantees
that the error of approximation decays like s − PVh sLp = O(h) (or even faster if β has
an order of approximation greater than one), so that the discretization error becomes
negligible for h sufficiently small.
To simplify the notation, we shall assume from here on that h = 1 and that the control
of the discretization error is adequate. (If not, we can decrease the sampling step further,
10.1 Discretization of linear inverse problems 251

which may also be achieved through an appropriate rescaling of the system of coordi-
nates in which r is expressed.) Hence, our signal reconstruction model is

s1 (r) = PV1 s(r) = s[k]β(r − k) (10.2)


k∈Zd

with

s[k] = β̃(· − k), s .

The main point here is that s1 (r) – the discretized


 version of s(r) – is uniquely de-
scribed by its expansion coefficients s[k] k∈Zd , so that we can reformulate the recons-
truction problem in terms of those quantities. The clear advantage of working with such
a hybrid representation is that it is statistically tractable (countable set of parameters)
and yet continuously defined so that it lends itself to a proper modeling of the signal-
acquisition/measurement process.

Discrete innovation and its statistical characterization


In Chapter 8, we have shown how to obtain the discrete-domain counterpart of the
innovation model Ls = w by applying to the samples of the signal the discrete version
Ld of the whitening operator L. In the present scenario, where the continuous-domain
process s is prefiltered by β̃ ∨ prior to sampling, we construct a variant of the generalized
increments as


u[k] = Ld (β̃ ∨ ∗ s)(r) = (dL ∗ s)[k],
r=k

where the right-hand convolution between dL and s[·] is discrete. Specifically, dL is 

the discrete-domain impulse response of Ld i.e., Ld {δ}(r) = k∈Zd dL [k]δ(r − k) ,
while the signal coefficients are given by s[k] in (10.2). Based on the defining properties
s = L−1 w and Ld L−1 δ = βL , where βL is the B-spline associated with L, we express the
discrete innovation u as

u[k] = β̃(· − k), Ld L−1 w = β̃(· − k), βL ∗ w


= (βL∨ ∗ β̃)(· − k), w .

This implies that u is stationary and that its statistical properties are in direct relation
with those of the white Lévy noise w (continuous-domain innovation). Recalling that w
is specified by its characteristic form Pw (ϕ) = E{ejϕ,w }, we obtain the joint charac-
 
teristic function of the innovation values in a K-point neighborhood, u = u[k] k∈ ,
  
∨ (r − k), where ω = ω
  K
by direct substitution of ϕ(r) = k∈K ω k β̃ ∗ β L k k∈K
is
the corresponding Fourier-domain indexing vector. Specifically,  pU (ω), which is the
(conjugate) Fourier transform of the joint pdf pU (u), is given by
⎛ ⎞
 
 w ⎝
pU (ω) = P ωk β̃ ∗ βL∨ (· − k)⎠ , (10.3)
k∈K

w (ϕ) is determined by its Lévy exponent f through Equation (4.13).


where P
252 Recovery of sparse signals

Since Ld is the best discrete approximation of L, we can expect the samples of u


to be approximately decoupled. Yet, in light of (10.3), we also see that the quality of
the decoupling is dependent upon the extent of the kernel β̃ ∗ βL∨ , which should be as
concentrated as possible. Since βL has minimal support by design, we can improve the
situation only by taking β̃ to be a point distribution. Practically, this suggests the choice
of a discretization model that uses basis functions that are interpolating with β = ϕint .
Indeed, the interpolation condition is equivalent to

ϕint , δ(· − k) = δk

so that the corresponding biorthogonal analysis function is β̃(r) = δ(r). This is the
solution that minimizes the overlap of the basis functions in (10.3).

Decoupling simplification
To obtain an innovation-domain description that is more directly applicable in practice,

we make the decoupling simplification  pU (ω) ≈ k∈K  pU (ωk ), which is equivalent
to assuming that the discrete innovation sequence u[·] is i.i.d. This means that the Kth-
order joint pdf of the discrete innovation can be factorized as
$  
pU (u) = pU u[k] , (10.4)
k∈K

with the following explicit formula for its first-order pdf:


%
−1 
" #&
pU (x) = F Pw ω(β̃ ∗ βL∨ ) (x). (10.5)

Note that the latter comes as a special case of (10.3) with K = 1. The important practical
point is that pU (x) is infinitely divisible, with modified Lévy exponent
 
pU (ω) = fβ̃∗β ∨ (ω) =
log f ω(β̃ ∗ βL∨ )(r) dr. (10.6)
L
Rd

This makes it fall within the general family of distributions investigated in Chapter 9
(by setting ϕ = β ∗ βL∨ ). Equivalently, we can write the corresponding log-likelihood
function
 
− log (pU (u)) = U u[k] (10.7)
k∈K
 
with U u = − log pU (u).

10.1.2 Finite-dimensional formulation


To turn the formulation of Section 10.1.1 into a practical scheme that is numerically
useful, we take a finite section of the signal model (10.2) by restricting ourselves to a
subset of K basis functions with k ∈  over some region of interest (ROI). In practice,
one often introduces problem-specific boundary conditions (e.g., truncation of the sup-
port of the signal, periodization, mirror symmetric extension) that can be enforced by
10.1 Discretization of linear inverse problems 253

suitable modification of the basis functions that intersect the boundaries of the ROI. The
corresponding signal representation then reads

s1 (r) = s[k]βk (r), (10.8)


k∈

where βk (r) is the basis function corresponding to β(r − k) in (10.2) up to the modi-
fications
 at the boundaries. This model is specified by the K-dimensional
  signal vector
s = s[k] k∈ that is related to the discrete innovation vector u = u[k] k∈ by

u = Ls,

where L is the (K× K) matrix representation of Ld , the discrete version of the whitening
operator L.
The general form of a linear, continuous-domain measurement model is

ym = s1 (r)ηm (r) dr + n[m] = s1 , ηm + n[m], (m = 1, . . . , M)


Rd

with sampling/imaging functions {ηm }M m=1 and additive measurement noise n[·]. The
sampling function ηm (r) represents the spatio-temporal response of the mth detector of
an imaging/acquisition device. For instance, it can be a 3-D point-spread function in
deconvolution microscopy, a line integral across a 2-D or 3-D specimen in computed
tomography, or a complex exponential (or a localized version thereof) in the case of
magnetic resonance imaging. This measurement model is conveniently described in
matrix-vector form as

y = y0 + n = Hs + n, (10.9)

where
• y = (y1 , .. . , yM ) is the M-dimensional measurement vector
• s = s[k] k∈ is the K-dimensional signal vector
• H is the (M × K) system matrix with

[H]m,k = ηm , βk = ηm (r)βk (r) dr (10.10)


Rd

• y0 = Hs is the noise-free measurement vector


• n is an additive i.i.d. noise component with pdf pN .
The final ingredient is the set of biorthogonal analysis functions {β̃k }k∈ , which are
such that β̃k , βk = δk−k , with k, k ∈ . We recall that the biorthogonality with the
synthesis functions {βk }k∈ is essential for determining the projection of the stochastic
process s(r) onto the reconstruction space. The other important point is that the choice
β̃ = δ, which corresponds to interpolating basis functions, facilitates the determination
of the joint pdf pU (u) of the discrete innovation vector u from (10.3) or (10.4).
The reconstruction task is now to recover the unknown signal vector s given the
noisy measurements y. The statistical approaches to such problems are all based on the
254 Recovery of sparse signals

determination of the posterior distribution pS|Y , which depends on the prior distribution
pS and the underlying noise model. Using Bayes’ rule, we have that
 
pY|S (y|s)pS (s) pN y − Hs pS (s)
pS|Y (s|y) = =
pY (y) pY (y)
1
= pN (y − Hs)pS (s),
Z
where the proportionality factor Z is not essential to the estimation procedure because it
only depends on the input data y, which is a known quantity.
 We also note that Z can be
recalculated by imposing the normalization constraint RK pS|Y (s|y) ds = 1, which is the
way it is handled in message-passing algorithms (see Section 10.4.2). The next step is
to introduce the discrete innovation variable u = Ls, whose pdf pU (u) has been derived
explicitly. If the linear mapping between u and s is one-to-one, 1 we clearly have that

pS (s) ∝ pU (Ls).

Using this relation together with the decoupling simplification (10.4), we find that
   $  
pS|Y (s|y) ∝ pN y − Hs pU (Ls) ≈ pN y − Hs pU [Ls]k , (10.11)
k∈

where pU is specified by (10.6) and solely depends on the Lévy exponent f of the
continuous-domain innovation w and the B-spline kernel βL = Ld L−1 δ associated with
the whitening operator L.
In the standard additive white Gaussian noise scenario (AWGN), we find that
 
y − Hs2 $  
pS|Y (s|y) ∝ exp − 2
pU [Ls]k ,

k∈

where σ 2 is the variance of the discrete measurement noise.


Conceptually, the best solution to the reconstruction problem is the MMSE estimator,
which is given by the mean of the posterior distribution

sMMSE (y) = E{s|y}.

The estimate sMMSE (y) is optimal in that it is closest (in the mean-square sense)
 to the
(unknown) noise-free signal s; i.e., E{|sMMSE (y)−s|2 } = min E{|s̃(y) − s|2 } among all
signal estimators s̃(y). The downside of this estimator is that it is difficult to compute in
practice, except for special cases such as those discussed in Sections 10.2.2 and 10.4.2.

1 A similar proportionality relation can be established for the cases where L has a non-empty null space via
the imposition of the boundary conditions of the SDE. While the rigorous formulation can be carried out, it
is often not worth the effort because it will only result in a very slight modification of the solution such that
s1 (r) satisfies some prescribed boundary conditions (e.g., s1 (0) = 0 for a Lévy process), which are artificial
anyway. The pragmatic approach, which better suits real-world applications, is to ignore this technical issue
by adopting the present stationary formulation, and to let the optimization algorithm adjust the null-space
component of the signal to produce the solution that is maximally consistent with the data.
10.2 MAP estimation and regularization 255

10.2 MAP estimation and regularization

Among the statistical estimators that incorporate prior information, the most popular
is the MAP solution, which extracts the mode of the posterior distribution. Its use can
be justified by the fact that it produces the signal estimate that best explains the ob-
served data. While MAP does not necessarily yield the estimator with the best average
performance, it has the advantage of being tractable numerically.
Here, we make use of the prior information that the continuous-domain signal s satis-
fies the innovation model Ls = w, where w is a white Lévy noise. The finite-dimensional
transposition of this model (under the decoupling simplification) is that the discrete
innovation vector Ls = u can be assumed to be i.i.d., 2 where L is the matrix counter-
part of the whitening operator L. For a given set of noisy measurements y = Hs + n
with AWGN of variance σ 2 , we obtain the MAP estimator through the maximization of
(10.11). This results in
 
1
sMAP = arg min y − Hs2 + σ
2 2
U ([Ls]k ) , (10.12)
s∈RK 2
k∈

with U (x) = − log pU (x), where pU (x) is given by (10.5). Observe that the cost functio-
nal in (10.12) has two components: a data term 12 y−Hs22 that enforces the consistency
between the data and the simulated, noise-free measurements Hs, and a second regulari-
zation term that favors likeliest solutions in reference to the prior stochastic model. The
balancing factor is the variance σ 2 which amplifies the influence of the prior informa-
tion as the data get noisier. The specificity of the present formulation is that the poten-
tial function is given by the log-likelihood of the infinitely divisible random variable U,
which has strong theoretical implications, as discussed in Sections 10.2.1 and 10.2.3.
For the time being, we observe that the general form of the estimator (10.12) is com-
patible with the standard variational approaches used in signal processing. The three
cases of interest that correspond to valid id (infinitely divisible) log-likelihood func-
tions (see Table 4.1) are:
√ 1 e−x /(2σ0 )
2 2
(1) Gaussian: pU (x) = ⇒ U (x) = 1 2
x + C1
2πσ0 2σ02
λ −λ|x|
(2) Laplace: pU (x) = 2e ⇒ U (x) = λ|x| + C2
 r+ 1
1 1 2  1
(3) Student: pU (x) = " # ⇒ U (x) = r + log(x2 + 1) + C3 ,
B r, 12 x2 + 1 2

where the constants C1 , C2 , and C3 can be ignored since they do not affect the solu-
tion. The first quadratic potential leads to the classical Tikhonov regularizer, which
yields a stabilized linear solution. The second absolute-value potential produces an
1 -type regularizer; it is the preferred solution for solving deterministic compressed-
sensing and sparse-signal-recovery problems. If L is a first-order derivative operator,
2 The underlying discrete innovation sequence u[k] = L s(r)

d r=k is stationary and therefore identically distri-
buted. It is also independent for Markov and/or Gaussian processes, but only approximatively so otherwise.
To justify the decoupling simplification, we like to invoke the minimum-support property of the B-spline
βL = Ld L−1 δ ∈ L1 (Rd ), where Ld is the discrete counterpart of the whitening operator L.
256 Recovery of sparse signals

then (10.12) maps into total-variation (TV) regularization, which is widely used in appli-
cations [ROF92]. The third log-based potential is interesting as well because it relates to
the limit on an p -relaxation scheme when p tends to zero [WN10]. The latter has been
proposed by several authors as a practical “debiasing” method for improving the spar-
sity of the solution of a compressed-sensing problem [CW08]. The connection between
log and p norm relaxation is provided by the limit
x2p − 1
log x2 = lim ,
p→0 p
which is compatible with the Student prior for x2 ' 1.

10.2.1 Potential function


In the present Bayesian framework, the potential function U (x) = − log pU (x) is deter-
mined by the Lévy exponent f (ω) of the continuous-domain innovation w or, equi-
valently, by the canonical noise pdf pid (x) in Proposition 4.12. Specifically, pU (x) is
infinitely divisible with modified Lévy exponent fβ̃∗β ∨ (ω) given by (10.7). While the
L
exact form of pU (x) also depends on the B-spline kernel βL , a remarkable aspect of
the theory is that its global characteristics remain qualitatively very similar to those of
pid (x). Indeed, based on the analysis of Chapter 9 with ϕ = β̃ ∗ βL∨ ∈ Lp (Rd ), we can
infer the following properties:
• If pid (x) is symmetric and unimodal, then the same is true for pU (x) (by
Corollary 9.5). These are very desirable properties for they ensure that U (x) =
U (|x|) is symmetric and increases monotonically away from zero. The preser-
vation of these features is important because most image-processing practitioners
would be reluctant to apply a regularization scheme that does not fulfill these basic
monotonicity constraints and cannot be interpreted as a penalty.
• In general, pU (x) will not be Gaussian unless pid (x) is Gaussian to start with, in which
case U (x) is quadratic.
• If pid (x) is stable (e.g., SαS) then the property is preserved for pU (x) (by Proposi-
tion 9.8). This corresponds to a sparse scenario as soon as α  = 2.
• In general, pU (x) will display the same sparsity patterns as pid (x). In particular, if
pid (x) = O(1/|x|p ) with p > 1 (heavy-tailed behavior), then the  same holds true for
pU (x). Likewise, its pth-moment will be finite if and only if R |x| pid (x) dx < ∞
p

(see Proposition 9.10).


• The qualitative behavior of pU (x) around the origin will be the same as that of pid (x),
regarding properties such as symmetry and order of differentiability (or lack thereof).
An important remark concerning the last point is in order. We have already seen
that compound-Poisson distributions exhibit a Dirac delta impulse at the origin and
that the Poisson property is preserved through the noise-integration process when βL
is compactly supported (set p = 0 in Proposition 9.10). Since taking the logarithm
of δ(x) is not an admissible operation, we cannot define a proper potential function
in such cases (finite rate of innovation scenario). Another way to put this is that the
10.2 MAP estimation and regularization 257

Table 10.1 Asymptotic behavior of the potential function X (x ) for the infinitely divisible distributions
in Table 4.1.a

a (z) and ψ (1) (r) are Euler’s gamma and first-order polygamma functions, respectively (see Appendix C).

compound-Poisson MAP estimator would correspond to u = 0 because the probability


of getting zero is overwhelmingly larger than that of observing any other value.
These limitations notwithstanding, we can rely on the properties of infinitely divi-
sible laws to make some general statements about the asymptotic form of the potential
function. First, U (x) will typically exhibit a Gaussian (i.e., quadratic) behavior near
the origin. Indeed, when pU (x) is symmetric and twice-differentiable at the origin, we
can write the Taylor series of the potential as
U (0) 2
U (x) = U (0) + x + O(|x|4 ), (10.13)
2
with

d2 log pU (x)  p (0)
U (0) = − 2  =− U .
dx x=0 pU (0)
By contrast, unless U is Gaussian, the rate of growth of U decreases progressively
away from the origin and becomes less than quadratic. Indeed, by using the tail proper-
ties of id distributions, one can prove that a non-Gaussian id potential cannot grow faster
than x log(x) as x →  ∞. Amore typical asymptotic trend is O(x) when pU (x) has expo-
nential decay, or O log(x) when pU (x) has algebraic decay. These behaviors are exem-
plified in Table 10.1. For the precise specification of these potential functions, including
the derivation of their asymptotics, we refer to Appendix C. In particular, we note that
the families of sym gamma and Meixner distributions give rise to an intermediate range
258 Recovery of sparse signals

of behaviors with potential functions falling in between the linear solution (Laplace
and hyperbolic secant) and the log form that is characteristic of the heavier-tailed laws
(Student and SαS).

10.2.2 LMMSE/Gaussian solution


Before addressing the general MAP estimation problem, it is instructive to investigate
the Gaussian scenario, which admits a closed-form solution. Moreover, under the Gaus-
sian hypothesis, we can compute a perfectly decoupled representation by using a modi-
fied discrete filter LG that performs an exact whitening 3 of the signal in the discrete
domain (see Section 8.3.4). In the present setting, the Gaussian MAP estimator is the
minimizer of the quadratic cost functional
1 1
C 2 (s, y) = y − Hs22 + σ 2 LG s22
2 2
sMAP (y) = arg min C 2 (s, y)
s∈RK
−1/2
with LG = Css , where Css = E{ssT } is the (K×K) symmetric covariance matrix of the
signal. The implicit assumption here is that Css is invertible and E{s} = 0 (zero-mean
signal). The gradient of C 2 (s, y) is given by
∂ C 2 (s, y)
= −HT (y − Hs) + σ 2 LTG LG s,
∂s
and the MAP estimator is found by equating it to zero. This yields the classical linear
solution
" #−1
sMAP (y) = HT H + σ 2 LTG LG HT y (10.14)
" #−1
= HT H + σ 2 C−1ss HT y.

Note that the term σ 2 C−1


ss , which is bounded from above and below (due to the finite-
variance and invertibility hypotheses), acts as a stabilizer. It ensures that the matrix
inverse in (10.14) is well defined, and hence that the solution is unique.
Alternatively, it is also possible to derive the LMMSE estimator (or Wiener filter)
which takes the standard form
" #−1
sLMMSE (y) = Css HT HCss HT + Cnn y, (10.15)

where Cnn = σ 2 I is the (N × N) covariance matrix of the noise. We note that the
LMMSE solution is also valid for the non-Gaussian, finite-variance scenarios: it pro-
vides the MMSE solution among all linear estimators, irrespective of the type of signal
model.
3 We recommend the use of the classical discrete whitening filter L = C−1/2 as a substitute for L in the
G ss
Gaussian case because it results in an exact formulation. However, we do not advise one to do so for
non-Gaussian models because it may induce undesirable long-range dependencies, the difficulty being that
decorrelation alone is no longer synonymous with statistical independence.
10.2 MAP estimation and regularization 259

It is well known from estimation theory that the Gaussian MAP and LMMSE solu-
tions are equivalent. This can be seen by considering the following sequence of equiva-
lent matrix identities:
HT HCss HT + σ 2 C−1 T T T
ss Css H = H HCss H + σ H
2 T
" # " #
HT H + σ 2 C−1
ss Css HT
= HT
HCss H T
+ σ 2
I
" #−1 " #−1
Css HT HCss HT + σ 2 I = HT H + σ 2 C−1ss HT ,
where have used the hypothesis that the covariance matrix Css is invertible.
The availability of the closed-form solution (10.14) or (10.15) is nice conceptually,
but it is not necessarily applicable for large-scale problems because the system matrix
is too large to be stored in memory and inverted explicitly. The usual numerical
approach is to solve the corresponding system of linear equations iteratively using the
conjugate-gradient (CG) method. The convergence speed of CG can often be improved
by applying some problem-specific preconditioner. A particularly favorable situation is
when the matrix HT H + σ 2 LTG LG is block-Toeplitz (or circulant) and is diagonalized by
the Fourier transform. The signal reconstruction can then be computed very efficiently
with the help of an FFT-based inversion. This strategy is applicable for the basic fla-
vors of deconvolution, computed tomography, and magnetic resonance imaging. It also
makes the link with the classical methods of Wiener filtering, filtered backprojection,
or backprojection filtering, which result in direct image reconstruction.

10.2.3 Proximal operators


The optimization problem (10.12) becomes more challenging when the potential func-
tion U is non-quadratic. To get a better handle on the effect of such a regularization, it
is instructive to consider a simplified scalar version of the problem. Conceptually, this
is equivalent to treating the components of the problem as if they were decoupled. To
that end, we define the proximal operator with weight σ 2 as
1
proxU (y; σ 2 ) = arg min |y − u|2 + σ 2 U (u), (10.16)
u∈R 2
which is tied to the underlying stochastic model. Since U (u) = − log pU (u) is boun-
ded from below and cannot grow faster that O(|u|2 ), the solution of (10.16) exists for
any y ∈ R. When the minimizer is not unique, we can establish arbitrary preference (for
instance, pick the smallest solution) and specify ũ = proxU (y; σ 2 ) as the proximal esti-
mator of y. This function returns the closest approximation of y, subject to the penalty
induced by σ 2 U (u). In practice, σ 2 is fixed and the mapping ũ = proxU (y; σ 2 ) can be
specified either analytically or, at least, numerically, in the form of a 1-D lookup table
that maps y to ũ (shrinkage function).
The proximal operator induces a perturbation of the identity map that is more or less
pronounced, depending upon the magnitude of σ 2 and the characteristics of U (u). At
the points u where is U (u) is differentiable, ũ satisfies
−y + ũ + σ 2 U (ũ) = 0, (10.17)
260 Recovery of sparse signals

where U (u) = ddu U (u)


. In particular if U (u) is twice-differentiable with

1 + σ U (u) ≥ 0, then the above mapping is one-to-one, which implies that ũ(y)
2

is the inverse function of y(ũ) = prox−1 2 2 


U (ũ; σ ) = ũ + σ U (ũ). The quantity
  p  (u)
−U (u) = pUU (u) is sometimes called the score. It is related to the Fisher information
E{|U (U)|2 } ≥ 1/Var{U}, which is minimum in the Gaussian case.
Equation (10.17) can be used to derive the closed-form representation of the two
better-known examples of proximal maps,
" #
T1 (y) = prox1 (y; σ 2 ) = sign(y) |y| − λσ 2 (10.18)
+
σ02
T2 (y) = prox2 (y; σ 2 ) = y, (10.19)
σ02 + σ 2

with 1 (u) = λ|u| (Laplace law) and 2 (u) = |u|2 /(2σ02 ) (Gaussian). The first is a soft-
threshold (shrinkage operator) and the second a linear scaling (scalar Wiener filter).
While not all id laws lend themselves to such an analytic treatment, we can nevertheless
determine the asymptotic form of their proximal operator. For instance, if U is sym-
metric and twice-differentiable at the origin, which is the case for most 4 examples in
Table 10.1, then
y
proxU (y; σ 2 ) = as y → 0.
1 + σ U (0)
2

The result is established by using a basic first-order Taylor series argument: U (u) =
U (0)u + O(u2 ). The required slope parameter is
 2
 pU (0) ωpU (ω) dω
U (0) = − = R , (10.20)
pU (0) R
pU (ω) dω
which is computable from the moments of the characteristic function.
To determine the larger-scale behavior, one has to distinguish between the (non-
Gaussian) intermediate scenarios, where the asymptotic trend of the potential is predo-
minantly linear (e.g., Laplace, hyperbolic secant, sym gamma, and Meixner families),
and the distributions with algebraic decay, where it is logarithmic (Student and SαS
stable), as made explicit in Table 10.1. (This information can readily be extracted
from the special-function formulas in Appendix C.) For the first exponential and sub-
exponential category, we have that limu→∞ U (u) = b1 + O(1/u) with b1 > 0, which
implies that
proxU (y; σ 2 ) ∼ y − σ 2 b1 as y → +∞.

This corresponds to a shrinkage-type estimator.


The limit behavior in the second heavy-tailed category is limu→∞ U (u) = b2 /u with
b2 > 0, which translates into the asymptotic identity-like behavior

proxU (y; σ 2 ) ∼ y as y → ∞.

4 Only the Laplace law and its sym gamma variants with r < 3/2 fail to meet the differentiability
requirement.
10.2 MAP estimation and regularization 261

15
30
25 10
20
5
15
10 0
5
-5
-30 -20 -10 0 10 20 30 -10
(a)
-15
-20 -10 0 10 20
(b)

Figure 10.1 Normalized Student potentials for r = 2, 4, 8, 16, 32 (dark to light) and corresponding
proximity maps. (a) Student potentials Student (y) with unit signal variance. (b) Shrinkage
operator proxStudent (y; σ 2 ) for σ 2 = 1.

Some examples of normalized potential functions for symmetric Student distribu-


tions with increasing asymptotic decay are shown in Figure 10.1, together with their
corresponding proximity maps for a unit signal-to-noise ratio. The proximity maps fall
nicely in between the best linear (or 2 ) solution (solid black line) and the identity map
(dashed black line). Since the Student distributions are maximal at y = 0, symmetric,
and infinitely differentiable, the thresholding functions are linear around the origin – as
predicted by the theory – and remarkably close to the pointwise Wiener (LMMSE) solu-
tion. For larger values of the input, the estimation progressively switches to an identity,
which is consistent with the algebraic decay of the Student pdfs, the transition being
faster for the heavier-tailed distributions (r small).
In summary, the proximal maps associated with id distributions have two predomi-
nant regimes: (1) a linear, Gaussian mode (attenuation) around the origin and (2) a
shrinkage mode as one moves away from the origin. The amount of asymptotic shrink-
age depends on the rate of decay of the pdf – it converges to zero (identity map) when
the decay is algebraic, as opposed to exponential.

10.2.4 MAP estimation


We have now all the elements in hand to derive a generic MAP estimation algorithm.
The basic idea is to consider the innovation vector u as an auxiliary variable and to
reformulate the MAP estimation as a constrained optimization problem,
 
1  
sMAP = arg min y − Hs22 + σ 2 U u[k] subject to u = Ls.
s∈RK 2
k∈
Instead of addressing the constrained optimization heads-on, we impose the constraint
by adding a quadratic penalty term μ2 Ls − u22 with μ sufficiently large. To make
the scheme more efficient, we rely on the augmented-Lagrangian (AL) method.
The corresponding AL functional is
1   μ
L A (s, u, α) = y − Hs22 + σ 2 U u[k] − α T (Ls − u) + Ls − u22 ,
2 2
k∈
262 Recovery of sparse signals

where the vector α = (αk )k∈ represents the Lagrange multipliers for the desired
constraint.
To find the solution, we handle the optimization task sequentially according to the
alternating-direction method of multipliers (ADMM) [BPCE10]. Specifically, we consi-
der the unknown variables s and u in succession, minimizing LA (s, u, α) with respect
to each of them, while keeping α and the other one fixed. This is combined with an
update of α to refine the current estimate of the Lagrange multipliers. By using the
index k to denote the iteration number, the algorithm cycles through the following steps
until convergence:

sk+1 ← arg min LA (s, uk , α k )


s∈RN
 
α k+1 = α k − μ Lsk+1 − uk
uk+1 ← arg min LA (sk+1 , u, α k+1 ).
u∈RN

We now look into the details of the optimization. The first step amounts to the mini-
mization of the quadratic form in s given by
1 μ
L A (s, uk , α k ) = y − Hs22 − (Ls − uk )T α k + Ls − uk 22 + C1 , (10.21)
2 2
where C1 = C1 (uk ) is a constant that does not depend on s. By setting the gradient of
the above expression to zero, as in
∂ LA (s, uk , α k )  
= −HT (y − Hs) − LT α k − μ(Ls − uk ) = 0, (10.22)
∂s
we obtain the intermediate linear estimate,
" #−1 " #
sk+1 = HT H + μLT L HT y − zk+1 , (10.23)
 
with zk+1 = LT α k + μuk . Remarkably, this is essentially the same result as the
Gaussian solution (10.14) with a slight adjustment of the data term and regularization
strength. Note that the condition Ker(H) ∩ Ker(L) = {0} is required for this linear
solution to be well defined and unique.
To justify the update of the Lagrange multipliers, we note that (10.22) can be
rewritten as
∂ LA (s, uk , α k )
= −HT (y − Hs) − LT α k+1 = 0,
∂s
which is consistent with the global optimality conditions

Ls − u = 0,

−HT (y − Hs ) − LT α  = 0,

where (s , u , α  ) is a stationary point of the optimization problem. This ensures that
α k+1 gets closer to the correct vector of Lagrange multipliers α  as uk converges to
u = Ls .
10.3 MAP reconstruction of biomedical images 263

As for the third step, we define ũ[k] = [Lsk+1 ]k and rewrite the AL criterion as
  μ
L A (sk+1 , u, α k+1 ) = C2 + σ 2 U u[k] + αk u[k] + (ũ[k] − u[k])2
2
k∈
  2
μ αk  
= C3 + ũ[k] − − u[k] + σ 2 U u[k] ,
2 μ
k∈

where C2 and C3 are constants that do not depend on u. This shows that the optimization
problem is decoupled and that the update can be obtained by direct application of the
proximal operator (10.16) in a coordinate-wise fashion, so that
 σ2 
uk+1 = proxU Lsk+1 − μ1 α k+1 ; . (10.24)
μ
A few remarks are in order. While there are other possible numerical approaches
to the present MAP-estimation problem, the proposed algorithm is probably the simp-
lest to deploy because it makes use of two very basic modules: a linear solver (akin
to a Wiener filter) and a model-specific proximal operator that can be implemented as
a pointwise non-linearity. The approach provides a powerful recipe for improving on
some prior linear solution by reapplying the solver sequentially and embedding it into a
proper computational loop. The linear solver needs to be carefully engineered because
it has a major impact on the efficiency of the method. The adaptation to a given sparse
stochastic model amounts to a simple adjustment of the proximal map (lookup table).
The ADMM is guaranteed to converge to the global optimum when the cost func-
tional is convex. Unfortunately, convexity is not necessarily observed in our case (see
Table 10.1). But, since each step of the algorithm involves an exact minimization, the
cost functional is guaranteed to decrease so that the method remains applicable in non-
convex situations. There is the risk, though, that it gets trapped into a local optimum. A
way around the difficulty is to consider a warm start that may be obtained by running
the 2 (Gauss) or 1 (Laplace) version of the method.

10.3 MAP reconstruction of biomedical images

In this section, we apply the MAP-estimation paradigm to the reconstruction of biome-


dical images. We concentrate on three imaging modalities: deconvolution microscopy,
magnetic resonance imaging (MRI), and X-ray computed tomography (CT). In each
case, we briefly recall the underlying linear image-formation model and then proceed
with the determination of the system matrix H in accordance with the discretization
principle presented in Section 10.1. An important practical aspect is the selection of an
appropriate reconstruction basis. This selection is guided by approximation-theoretic
and computational considerations. The outcome for each modality is a generic iterative
reconstruction algorithm whose regularization parameters (whitening/regularization
operator, potential function) can be tuned to the characteristics of the underlying class
of biomedical images. Since our primary intent here is illustrative, we have fixed the
regularization operator to the most popular choice in the field (i.e., magnitude of the
264 Recovery of sparse signals

gradient) and are presenting practical reconstruction results that highlight the influence
of the potential function.

10.3.1 Scale-invariant image model and common numerical setup


A well-documented observation is that natural images tend to exhibit a Fourier spec-
trum that is mostly isotropic and roughly decaying like 1/ωγ with 1/2 ≤ γ ≤ 2. The
same holds true for many biomedical images, in both 2-D and 3-D. This is consistent
with the idea of rotation- and scale-invariance since the objects and elementary struc-
tures in an image can appear at arbitrary orientations and magnifications. In the bio-
medical context, it can be argued that natural growth often induces fractal-like patterns
whose appearance is highly irregular despite the simplicity of the underlying generative
rules. Prominent examples of such structures are (micro)vascular networks, dendritic
trees, trabecular bone, and cellular scaffolds. The other important observation is that the
wavelet-domain statistics of images are typically non-Gaussian, with a small proportion
of large coefficients (typically corresponding to contours) and the majority being close
to zero (in smooth image regions). This is the primary reason why sparsity has become
such an important topic in signal and image processing (see Introduction).
Our justification for the present model-based approach is that these qualitative pro-
perties of images are consistent with the stochastic models investigated in Chapter 7
if we take the whitening operator to be (− )γ /2 , the fractional Laplacian of order γ .
While it could make sense to fit such a model very precisely and apply the correspond-
ing (fractional) isotropic localization operator to uncouple the information, we prefer to
rely on a robust scheme that involves shorter filters, as justified by the analysis of Sec-
tion 8.3.5. A natural refinement over the series of coordinate-wise differences suggested
by Theorem 7.7 is to consider the magnitude of the discrete gradient. This makes the
regularization rotation-invariant without adding to the computational cost, and provides
an efficient scheme for obtaining a robust estimate of the absolute value of the dis-
crete increment process. Indeed, working with the absolute value is legitimate as long
as the potential function is symmetric, which is always the case for the priors used in
practice.
Before moving to the specifics of the various imaging modalities, we briefly describe
the common numerical setup used for the experiments. Image acquisition is simula-
ted numerically by applying the forward model (matrix H) to the noise-free image s
and subsequently adding white Gaussian noise of variance σ 2 . The images are then
reconstructed from the noisy measurements y by applying the algorithm described in
Section 10.2.4, with L being the discrete gradient operator. The latter is implemen-
ted using forward finite differences while the corresponding LT L is the (scalar) dis-
crete Laplacian. Our discretization assumes that the images are extended using periodic
boundary conditions. This lends itself to the use of FFT-based techniques for the imple-
mentation of the matrix multiplications required by the linear step of the algorithm:
Equation (10.23) or its corresponding gradient updates.
For each set of measurements, we are presenting three reconstruction results to com-
pare the effect of the primary types of potential functions U .
10.3 MAP reconstruction of biomedical images 265

• Gaussian prior: Gauss (x) = Ax2 . In this case, the algorithm implements a classical
linear reconstruction. This solution is representative of the level of performance of the
reconstruction methods that are currently in use and that do not impose any sparsity
on the solution.
• Laplace prior: Laplace (x) = B|x|. This configuration is at the limit of convexity and
imposes some medium level of sparsity. It corresponds to an 1 -type minimization
and is presently very popular for the recovery of sparse signals in the context of
compressed sensing.
• Student prior: Student (x) = C log(x2 + ). This log-like penalty is non-convex and
allows for the kind of heavy-tail behavior associated with the sparsest processes.
The regularization constants A, B, C are optimized for each experiment by comparison
of the solution with the noise-free reference (oracle) to obtain the highest possible SNR.
The proximal step of the algorithm described by (10.24) is adapted to handle the dis-
crete gradient operator by merely shrinking its magnitude. This is the natural vectorial
extension dictated by Definition (10.16). The shrinkage function for the Laplace prior
is a soft-threshold see (10.18) , while Student’s solution with  small is much closer
to a hard threshold and favors sparser solutions. The reconstruction is initialized in a
systematic fashion: the solution of the Gaussian estimator is used as initialization for
the Laplace estimator and the output of the Laplace estimator is used as initial guess for
Student’s estimator. The parameter for Student’s estimator is set to  = 10−2 .

10.3.2 Deconvolution of fluorescence micrographs


The signal of interest in fluorescence microscopy is the 3-D spatial density of the fluo-
rescent labels that are embedded in a sample [VAVU06]. A molecular label is a fluoro-
phore which, upon illumination at a specific wavelength, has the ability to re-emit light
at another (typically) longer wavelength. A notable representative is the green fluores-
cence protein (GFP) that has its primary excitation peak at 395 nm and its emission peak
at 509 nm.

Physical model of a diffraction-limited microscope


In a standard wide-field microscope, the fluorophores are excited by applying a uniform
beam of light. The optical system includes an excitation filter that is tuned to the exci-
tation wavelength of the fluorophore and an emission filter that collects the re-emitted
light. Due to the random nature of photon re-emission, each point in the sample contri-
butes independently to the light intensity in the image space, so that the overall system
is linear. The microscope makes use of lenses to obtain a magnified view of the object.
When the optical axis is properly aligned (paraxial configuration), the lateral 5 transla-
tion of a point source produces a corresponding (magnified) shift in the image plane. If,
on the other hand, the point source moves axially out of focus, one observes a progres-
sive spreading of the light (blurring), the effect being stationary with respect to its lateral
5 In microscopy, “lateral” refers to the x-y focal plane, while “axial” refers to the z coordinate (depth), which
is along the optical axis.
266 Recovery of sparse signals

position. This implies that the fluorescence microscope acts as a linear shift-invariant
system. It can therefore be described by the convolution equation

g(x, y, z) = (h3D ∗ s)(x, y, z), (10.25)

where h3D is the 3-D impulse response, also known as the point-spread function
(PSF) of the microscope. A high-performance microscope can be modeled as a perfect
aberration-free optical system whose only limiting factor is the finite size of the pupil
of the objective. The PSF is then given by
 " x y z #2
 
h3D (x, y, z) = I0 pλ , ,  , (10.26)
M M M2
where I0 is a constant gain, M is the magnification factor, and pλ is the coherent dif-
fraction pattern of an ideal point source with emission wavelength λ that is induced by
the pupil. The square modulus accounts for the fact that the quantity measured at the
detector is the (incoherent) light intensity, which is the energy of the electric field. Also
note that the rescaling is not the same along the z dimension. The specific form of pλ ,
as provided by the Fraunhofer theory of diffraction, is
   
ω12 + ω22 xω1 + yω2
pλ (x, y, z) = P(ω1 , ω2 ) exp j2πz exp −j2π dω1 dω2 ,
R2 2λf20 λf0
(10.27)
where f0 is the focal length of the objective and P(ω1 , ω2 ) = 1ω<R0 is the pupil func-
tion. The latter is an indicator function that describes the circular aperture of radius
R0 in the so-called Fourier plane. The ratio R0 /f0 is a good predictor of the numerical
aperture 6 (NA), the optical parameter that is used in microscopy to specify the resolu-
tion of an objective through Abbe’s law.
The 3-D PSF given by (10.26) is shown in Figure 10.2. We observe that h3D is cir-
cularly symmetric with respect to the origin (x, y) = (0, 0) in the planes perpendicular
to the optical axis z and that it exhibits characteristic diffraction rings. It is narrowest
in the focal x-y plane with z = 0. The focal spot in Figure 10.2a is the Airy pattern
that determines the lateral resolution of the microscope (see (10.29) and the discus-
sion below). By contrast, the PSF is significantly broader in the axial direction. It also
spreads out linearly along the lateral dimension as one moves away from the focal plane.
The latter represents the effect of defocusing, with the external cone-shaped envelope in
Figure 10.2c being consistent with the simplified behavior predicted by ray optics. This
shows that, besides the fundamental limit on the lateral resolution that is imposed by the
pupil function, the primary source of blur in wide-field microscopy is along the optical
axis and is due to the superposition of the light contributions coming from the neighbor-
ing planes that are out of focus. The good news is that these effects can be partly com-
pensated through the use of 3-D deconvolution techniques. In practical deconvolution
6 The precise definition is NA = n sin θ , where n is the index of refraction of the operating medium (e.g., 1.0
for air, 1.33 for pure water, and 1.51 for immersion oil) and θ is the half-angle of the cone of light entering
the objective. The small-angle approximation for a normal use in air is NA ≈ R0 /f0 . A finer analysis that
takes into account the curvature of the lens shows that this simple ratio formula remains accurate even at
large numerical apertures in a well-corrected optical system.
10.3 MAP reconstruction of biomedical images 267

x
y
−2
0
−1

1
0
1
2

−2 −1 0 1 2

(a) z

−2 −2

−1 −1

0 0

1 1

2 2

−2 −1 0 1 2 −2 −1 0 1 2

(b) (c)

Figure 10.2 Visualization of the 3-D PSF of a wide-field microscope in a normalized coordinate
system. (a) Cut through the lateral x-y plane with z = 0 (in focus). (b) Cut through a lateral x-y
plane with z = 1 (out of focus). (c) Cut through the axial x-z plane with y = 0.

microscopy, one typically acquires a focal series of images (called a z-stack) which are
then deconvolved in 3-D by using a PSF that is derived either experimentally, through
the imaging of fluorescent nano beads, or theoretically see (10.26) and (10.27) , based
on the optical parameters of the microscope (e.g., NA, λ, M). Note that there are also
optical solutions, such as confocal or light-sheet microscopy, for partially suppressing
the out-of-focus light, but that these require more sophisticated instrumentation and lon-
ger acquisition times. These modalities may also benefit from deconvolution, but to a
lesser extent.
Here, for simplicity, we shall concentrate on the simpler 2-D scenario where one is
imaging a thin horizontal specimen s whose fluorescence emitters are confined to the
focal plane z = 0. The image formation then simplifies to the purely 2-D convolution

g(x, y) = (h2D ∗ s)(x, y), (10.28)

where the PSF is now given by


 
 J1 (r/r0 ) 2

h2D (x, y) = I0 2  (10.29)
r/r0 
2 λf0
with r = x2 + y2 , r0 = 2π R0 , and J1 (r) is the first-order Bessel function. This charac-
teristic pattern is called the Airy disc, while J1 (r)/r is the inverse Fourier transform of
a circular pupil of unit diameter (see (10.27) with z = 0).
The Fourier transform of the 2-D PSF of a microscope is called the modulation
transfer function. Because of the squaring in (10.29), it is equal to the Fourier-domain
268 Recovery of sparse signals

autocorrelation function of the circular pupil function P(ω1 , ω2 ). The calculation of this
convolution yields
⎧  7 

⎪ " # " #2
⎨ ω ω ω
π arccos ω0 − ω0 1 − ω0 , for 0 ≤ ω < ω0
2

h2D (ω) = (10.30)


⎩ 0, otherwise,

where
2R0 π 2NA
ω0 = = ≈
λf0 r0 λ
is the Rayleigh frequency. This shows that a microscope is a radially symmetric low-
pass filter whose cutoff frequency ω0 imposes a fundamental limit on resolution. A
Shannon-type sampling argument suggests that the ultimate resolution is r0 , assuming
that one is able to deconvolve the image. Alternatively, one can apply Rayleigh’s space-
domain criterion, which stipulates that the minimum distance at which two point sources
can be separated is when the first diffraction minimum coincides with the maximum
response of the second source. Since the function (J1 (r)/r)2 reaches its first zero at
r = 3.8317, this corresponds to a resolution limit at dRayleigh = 0.61λ × (f0 /R0 ). This
is consistent with Abbe’s celebrated formula for the diffraction limit of a microscope:
dAbbe = λ/(2NA). The better objectives have a large numerical aperture with the cur-
rent manufacturing limit being NA < 1.45 (with oil immersion). This puts the resolu-
tion limit to about one-third the wavelength λ (or, one-half the wavelength for the more
typical value of NA = 1). Sometimes in the literature, the PSF is approximated by a 2-D
isotropic Gaussian. The standard deviation that provides the closest fit with the physical
model (10.29) is σ0 = 0.42λ(f0 /R0 ) = 0.84/ω0 .

Discretization
From now on, we assume that ω0 ≤ π so that we can sample the data on the integer
grid while meeting the Nyquist criterion. The corresponding analysis functions, which
are indexed by m = (m1 , m2 ), are therefore given by

ηm (x, y) = h2D (x − m1 , y − m2 ).

In order to discretize the system, we select a sinc basis {sinc(x − k)}k∈Z2 with

sinc(x, y) = sinc(x)sinc(y),

where sinc(x) = sin(πx)/(πx). The entries of the system matrix in (10.9) are then ob-
tained as

[H]m,k = ηm , sinc(· − k)


= h2D (· − m), sinc(· − k)
 
= sinc ∗ h2D (m − k) = h2D (m − k).

In effect, this is equivalent to constructing the system matrix from the samples of the
PSF since h2D is already band-limited as a result of the imaging physics (diffraction-
limited microscope).
10.3 MAP reconstruction of biomedical images 269

(a) (b) (c)

Figure 10.3 Images used in deconvolution experiments. (a) Stem cells surrounded by goblet cells.
(b) Nerve cells growing around fibers. (c) Artery cells.

An important aspect for the implementation of the signal-recovery algorithm is that H


is a discrete convolution matrix which is diagonalized by the discrete Fourier transform.
The same is true for the regularization operator L as well as for any linear combination,
product, or inverse of such convolution matrices. This allows us to convert (10.23) to a
simple Fourier-domain multiplication which yields a fast and direct implementation of
the linear step of the algorithm. The computational cost is essentially that of two FFTs
(one forward and one backward Fourier transform).

Experimental results
The reference data are provided by the three microscopic images in Figure 10.3, which
display different types of cells. The input images of size (512 × 512) are blurred with
a Gaussian PSF of support (9 × 9) and standard deviation σ0 = 4 to simulate the effect
of a wide-field microscope with a low-NA objective. The measurements are degraded
with additive white Gaussian noise so as to meet some prescribed blurred SNR (BSNR),
defined as BSNR = var(Hs)/σ 2 .
For deconvolution, the algorithm is run for a maximum of 500 iterations, or until the
absolute relative error between the successive iterates is less than 5×10−6 . The results are
summarized in Table 10.2. The first observation is that the standard linear deconvolution
(MAP estimator based on a Gaussian prior) performs remarkably well for the image in
Figure 10.3a, which is heavily textured. The MAP estimator based on the Laplace prior,
on the other hand, yields the best performance for images having sharp edges with a
moderate amount of texture, such as those in Figures 10.3b,c. This confirms the general
claim that it is possible to improve the reconstruction performance through the promotion
of sparse solutions. However, as the application of the Student prior to images typically
encountered in microscopy demonstrates, exaggeration in the enforcement of sparsity is
a distinct risk. Finally, we note that the Gaussian and Laplace versions of the algorithm
are compatible with the methods commonly used in the field; for instance, 2 -Tikhonov
regularization [PMC93] and 1 /TV regularization [DBFZ+ 06].

10.3.3 Magnetic resonance imaging


Magnetic resonance refers to the property of atomic nuclei in a static magnetic field to
absorb and re-emit electromagnetic radiation. This energy is re-emitted at a resonance
270 Recovery of sparse signals

Table 10.2 Deconvolution performance of MAP estimators based on different prior distributions. The
best results are shown in boldface.

BSNR (dB) Estimation performance (SNR in dB)

Gaussian Laplace Student

Stem cells 20 14.43 13.76 11.86


30 15.92 15.77 13.15
40 18.11 18.11 13.83
Nerve cells 20 13.86 15.31 14.01
30 15.89 18.18 15.81
40 18.58 20.57 16.92
Artery cells 20 14.86 15.23 13.48
30 16.59 17.21 14.92
40 18.68 19.61 15.94

frequency that is proportional to the strength of the magnetic field. The basic idea of
magnetic resonance imaging (MRI) is to induce a space-dependent variation of the fre-
quency of resonance by imposing spatial magnetic gradients. The specimen is then exci-
ted by applying pulsed radio waves that cause the nuclei (or spins) in the specimen to
produce a rotating magnetic field detectable by the receiving coil(s) of the scanner.
Here, we shall focus on 2-D MRI, where the excitation is confined to a single plane. In
effect, by applying a proper sequence of magnetic gradient fields, one is able to sample
the (spatial) Fourier transform of the spin density s(r) with r ∈ R2 . Specifically, the mth
(noise-free) measurement is given by


s(ωm ) = s(r)e−jωm ,r dr,
R2

where the sampling occurs according to some predefined k-space trajectory (the conven-
tion in MRI is to use k = ωm as the spatial frequency variable). This is to say that the
underlying analysis functions are the complex exponentials ηm (r) = e−jωm ,r .
The basic problem in MRI is then to reconstruct s(r) based on the partial knowledge
of its Fourier coefficients which are also corrupted by noise. While the reconstruction in
the case of a dense Cartesian sampling amounts to a simple inverse Fourier transform,
it becomes more challenging for other trajectories, especially as the sampling density
decreases.
For simplicity, we discretize the forward model by using the same sinc basis functions
as for the deconvolution problem of Section 10.3.2. This results in the system matrix

[H]m,n = ηm , sinc(· − n)


= e−jωm ,· , sinc(· − n) = e−jωm ,n
10.3 MAP reconstruction of biomedical images 271

Table 10.3 MR reconstruction performance of MAP estimators based on different prior distributions.

Radial lines Estimation performance (SNR in dB)

Gaussian Laplace Student

Wrist 20 8.82 11.8 5.97


40 11.30 14.69 13.81
Angiogram 20 4.30 9.01 9.40
40 6.31 14.48 14.97

(a) (b) (c)

Figure 10.4 Data used in MR reconstruction experiments. (a) Cross section of a wrist. (b)
Angiography image. (c) k-space sampling pattern along 40 radial lines.

under the assumption that ωm ∞ ≤ π. The clear advantage of using the sinc basis
is that H reduces to a discrete Fourier-like matrix, with the caveat that the frequency
sampling is not necessarily uniform.
A convenient feature of this imaging model is that the matrix HT H is circulant so that
the linear iteration step of the algorithm can be computed in exact form using the FFT.

Experimental results
To illustrate the method, we consider the reconstruction of the two MR images of size
(256 × 256) shown in Figure 10.4: a cross section of a wrist and a MR angiogram. The
Fourier-domain measurements are simulated using the type of radial sampling pattern
shown in Figure 10.4c. The reconstruction algorithm is run with the same stopping
criteria as in Section 10.3.2. The reconstruction results for two sampling scenarios are
quantified in Table 10.3.
The first observation is that the estimator based on the Laplace prior generally outper-
forms the Gaussian solution, which corresponds to the traditional type of linear recons-
truction. The Laplace prior is a clear winner for the wrist image, which has sharp edges
and some amount of texture. While this is similar to the microscopy scenario, the ten-
dency appears to be more systematic for MRI: we were unable to find a single MR scan
in our database for which the Gaussian solution performs best. Yet, the supremacy of
the 1 solution is not universal, as illustrated by the reconstruction of the angiogram, for
272 Recovery of sparse signals

)
(t
s}
θ{
y

R
r

θ
x

(a)

θ
(b) (c)

Figure 10.5 X-ray tomography and the Radon transform. (a) Imaging geometry. (b) 2-D
reconstruction of a tomogram. (c) Its Radon transform (sinogram).

which the Student prior yields the best results, because this image is inherently sparse
and composed of piecewise-smooth components. Similarly to the microscopy modality,
we note that the present MAP formulation is compatible with the deterministic schemes
used for practical MRI reconstruction; in particular, the methods that rely on total varia-
tion [BUF07] and log-based regularization [TM09], which are in direct correspondence
with the Laplace and Student priors, respectively.

10.3.4 X-ray tomography


X-ray computed tomography (CT) aims at the reconstruction of an object (X-ray absorp-
tion map) from its projections (or line integrals) taken along different directions. The
projections are obtained by integrating the function along a set of parallel rays. To spe-
cify the imaging geometry, we refer to Figure 10.5a. Letting x = (x, y) ∈ R2 be the
spatial coordinates of the input function, we have that
x = tθ + rθ ⊥ ,
where θ = (cos θ , sin θ ) is the vector defining the t axis and θ ⊥ = (− sin θ , cos θ ) is
the unit vector that points along the direction of integration (r axis). The mathemati-
cal model of conventional CT is based on the Radon transform, which is an invertible
10.3 MAP reconstruction of biomedical images 273

operator from L2 (R2 ) → L2 (R × [−π, π]). It is defined as

Rθ {s(x)}(t) = s(tθ + rθ ⊥ )dr (10.31)


R

= s(x)δ(t − x, θ ) dx, (10.32)


R2

with index variables t ∈ R and θ ∈ [−π, π]. An example of a Radon transform corres-
ponding to a cross section of the human thorax is shown in Figure 10.5c. The resulting
family of functions gθ (t) = Rθ {s}(t) is called the sinogram, owing to the property that
the trajectory of a point (x0 , y0 ) = (R0 cos φ0 , R0 sin φ0 ) in Radon space is a sinusoidal
curve given by t0 (θ ) = x0 cos θ + y0 sin θ = R0 cos(θ − φ0 ).
In practice, the measurements correspond to the sampled values of the Radon trans-
form of the absorption map s(x) at a series of points (tm , θm ), m = 1, . . . , M. From
(10.32), we deduce that the analysis functions are
 
ηm (x) = δ tm − x, θ m ,

which represent a series of idealized lines in R2 perpendicular to θ m = (cos θm , sin θm ).

Discretization
For discretization purposes, we represent the absorption distribution as the weighted
sum of separable B-spline-like basis functions:

s(x) = s[k]β(x − k) ,
k

with β(x) = β(x)β(y), where β(x) is a suitable symmetric kernel (typically, a poly-
nomial B-spline of degree n). The constraint here is that β ought to have a short support
to reduce computations, which rules out the use of the sinc basis.
In order to determine the system matrix, we need to compute the Radon transform
of the basis functions. The properties of the Radon transform that are helpful for that
purpose are
(1) Projected translation-invariance

Rθ {ϕ(· − x0 )}(t) = Rθ {ϕ}(t − x0 , θ ) (10.33)

(2) Pseudo-distributivity with respect to convolution

Rθ {ϕ1 ∗ ϕ2 }(t) = (Rθ {ϕ1 } ∗ Rθ {ϕ2 }) (t) (10.34)

(3) Fourier central-slice theorem

Rθ {ϕ}(t)e−jωt dt = 
ϕ (ω)|ω=ωθ . (10.35)
R

The first result is obtained by a simple change of variable in (10.32). The second is
a direct consequence of (10.35). The Fourier central-slice theorem states that the 1-D
Fourier transform of the projection of ϕ at angle θ is equal to a corresponding central
274 Recovery of sparse signals

cut of the 2-D Fourier transform of the function. The result is easy to establish for θ = 0,
in which case the t and x axes coincide. For this particular configuration, we have that
 
−jωx
R0 {ϕ}(x)e dx = ϕ(x, y) dy e−jωx dx
R R R

= ϕ(x, y)e−jωx dx dy
R2
=
ϕ (ω, 0), (by definition)

where the interchange of integrals in the second line requires that ϕ ∈ L1 (R2 ) (Fubini).
The angle-dependent formula (10.35) is obtained by rotating the system of coordinates
and invoking the rotation property of the Fourier transform.
Next we show that the Radon transform of the basis functions can be obtained through
the convolution of two rescaled 1-D kernels.

PROPOSITION 10.1 The Radon transform of the separable function ϕ(x − x0) where ϕ(x) = ϕ1(x)ϕ2(y) is given by

Rθ{ϕ(· − x0)}(t) = ϕθ(t − t0),

where t0 = ⟨x0, θ⟩ and

ϕθ(t) = ( (1/|cos θ|) ϕ1(·/cos θ) ∗ (1/|sin θ|) ϕ2(·/sin θ) )(t),

with the convention that

lim_{a→0} (1/|a|) ϕ(x/a) = δ(x) ∫_R ϕ(x) dx.

Proof Since ϕ is separable, its 2-D Fourier transform is given by

ϕ̂(ω) = ϕ̂1(ω1) ϕ̂2(ω2),

where ϕ̂1 and ϕ̂2 are the 1-D Fourier transforms of ϕ1 and ϕ2, respectively. The Fourier central-slice theorem then implies that

ϕ̂θ(ω) = F{Rθ ϕ}(ω) = ϕ̂1(ω cos θ) ϕ̂2(ω sin θ).

Next, we note that ϕ̂1(ω cos θ) and ϕ̂2(ω sin θ) are the 1-D Fourier transforms of (1/|cos θ|) ϕ1(t/cos θ) and (1/|sin θ|) ϕ2(t/sin θ), respectively. The final result then follows from (10.33) and the property that the Fourier-domain product maps into a time-domain convolution.

This allows us to write the entries of the system matrix as

[H]m,k = ⟨ δ(tm − ⟨·, θm⟩), β(· − k) ⟩
       = Rθm{β(· − k)}(tm) = βθm( tm − ⟨k, θm⟩ ),

where βθm(t) is the projection of β(x) = β(x)β(y) along the direction θm, as specified in Proposition 10.1.

We shall now apply the result of Proposition 10.1 to determine the Radon transform
of a symmetric tensor-product polynomial spline of degree n. The relevant 1-D formula
for β(x) is
" #n
n+1   x − k + n+1
n+1 2 +
β n (x) = (−1)k ,
k n!
k=0

which is the recentered version of (1.11). Next, by making use of the distributivity of
convolution and the relation
( t_+^{n1} / n1! ) ∗ ( t_+^{n2} / n2! ) = t_+^{n1+n2+1} / (n1 + n2 + 1)!,
we find that

Rθ{ β^n(x) β^n(y) }(t) = Σ_{k=0}^{n+1} Σ_{k′=0}^{n+1} (−1)^{k+k′} (n+1 choose k)(n+1 choose k′) ( t + ((n+1)/2 − k) cos θ + ((n+1)/2 − k′) sin θ )_+^{2n+1} / ( |cos θ|^{n+1} |sin θ|^{n+1} (2n+1)! ),    (10.36)
which provides an explicit formula for the Radon transform of a B-spline of any degree n for θ ∉ {0, ±π/2, ±π}; when the projection is along a coordinate axis, the Radon transform is simply β^n(t). This result is a special case of the general box-spline calculus described in [ENU12].
For the present experiments, we select n = 1. This corresponds to piecewise-bilinear basis functions whose Radon transforms are the (non-uniform) cubic splines specified by (10.36) for θ ∉ {0, ±π/2, ±π}, or simple triangle functions otherwise. The Radon profiles are stored in a lookup table to speed up computations. In essence, the forward matrix H amounts to a “standard” projection with angle-dependent interpolation weights given by βθ, while H^T is the corresponding backprojection. For a parallel geometry, their computational complexity is O(N × Mθ × (n + 1)), where N is the number of pixels (or B-spline coefficients) of the reconstructed image, Mθ the number of distinct angular directions, and n the degree of the B-spline.
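For concreteness, here is a direct transcription of (10.36) as a sketch (ours, with hypothetical names): it tabulates the angle-dependent profile βθ for arbitrary degree n, falling back to β^n(t) for axis-aligned projections. These profiles are precisely the interpolation weights one would store in the lookup table.

```python
import numpy as np
from math import comb, factorial, cos, sin

def bspline_1d(t, n):
    """Symmetric B-spline of degree n (the recentered version of (1.11))."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    for k in range(n + 2):
        out += (-1)**k * comb(n + 1, k) * \
               np.maximum(t - k + (n + 1) / 2, 0)**n / factorial(n)
    return out

def radon_bspline_profile(t, theta, n=1):
    """Radon transform of beta^n(x)*beta^n(y), evaluated with (10.36)."""
    c, s = cos(theta), sin(theta)
    if abs(c) < 1e-12 or abs(s) < 1e-12:        # projection along an axis
        return bspline_1d(t, n)
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    for k in range(n + 2):
        for kp in range(n + 2):
            arg = t + ((n + 1) / 2 - k) * c + ((n + 1) / 2 - kp) * s
            out += (-1)**(k + kp) * comb(n + 1, k) * comb(n + 1, kp) * \
                   np.maximum(arg, 0)**(2 * n + 1)
    return out / (abs(c)**(n + 1) * abs(s)**(n + 1) * factorial(2 * n + 1))

# sanity check: every profile has unit area since beta integrates to one
t = np.linspace(-3, 3, 2001)
print(np.trapz(radon_bspline_profile(t, 0.3, n=1), t))   # ~ 1.0
```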

Experimental results
We consider the two images shown in Figure 10.6. The first is the Shepp–Logan (SL)
phantom of size (256 × 256), while the second is a real CT reconstruction of the cross
section of the lung of size (750 × 750). In the simulations of the forward model, we
use a standard parallel geometry with an angular sampling that is matched to the size
of the images. Specifically, the projections are taken along Mθ = 180, 360 equiangular
directions for the lung image and Mθ = 120, 180 directions for the SL phantom. The
measurements are degraded with Gaussian noise with a signal-to-noise ratio of 20 dB.
For the reconstruction, we solve the quadratic minimization problem (10.21) itera-
tively by using 50 conjugate-gradient (inner) iterations. The reconstruction results are
reported in Table 10.4.

Table 10.4 Reconstruction results of X-ray computed tomography using different estimators.

                             Estimation performance (SNR in dB)
              Directions     Gaussian     Laplace     Student
SL phantom    120            16.8         17.53       18.76
SL phantom    180            18.13        18.75       20.34
Lung          180            22.49        21.52       21.45
Lung          360            24.38        22.47       22.37

Figure 10.6 Images used in X-ray tomographic reconstruction experiments. (a) The Shepp–Logan (SL) phantom. (b) Cross section of a human lung.

We observe that the imposition of the strong level of sparsity brought by Student
priors is advantageous for the SL phantom. This is not overly surprising given that the
SL phantom is an artificial construct composed of piecewise-constant regions (ellipses).
For the realistic lung image (true CT), we find that the Gaussian solution outperforms
the others. Similarly to the deconvolution and MRI problems, the present MAP esti-
mators are in line with the Tikhonov-type [WLLL06] and TV [XQJ05] reconstructions
used for X-ray CT.

10.3.5 Discussion
During our investigation of real image-reconstruction problems, we have highlighted
the similarity between deterministic sparsity-promoting methods and MAP estimators
for sparse stochastic processes. The experiments we conducted with different imaging
modalities confirm the importance of sparse modeling in the reconstruction of biomed-
ical images. We found that imposing a medium level of sparsity, as afforded by the
Laplace prior (1 -norm minimization), is beneficial in most instances. Heavier-tailed
priors are available too, but they are helpful only for a limited class of images that
are inherently sparse. At the other end of the palette is the “classical” linear type of

reconstruction (Gaussian prior), which performs remarkably well for images whose
content is more diffuse/textured or when the inverse problem is well conditioned. This
confirms that the efficiency of a potential function depends strongly on the type of image
being considered. In our model, this is related to the Lévy exponent of the underlying
continuous-domain innovation process w, which is in direct relationship with the signal
prior.
As far as the relevance of the underlying model is concerned, we like to view the
present set of techniques and continuous-domain stochastic models as a conceptual
framework for deriving and refining state-of-the-art algorithms in a principled fashion.
The reassuring aspect is that the approach gives support to several algorithms that are
presently used in the field.
The next step, of course, would be to determine how to best fit the model to the
data. However, the inherent difficulty with this Bayesian view of the problem is that
there is actually no guarantee that (non-Gaussian) MAP estimation performs best for
the class of signals for which it is designed. There is even evidence that a slight model
mismatch (e.g., modification of the MAP criterion) can be beneficial in some instances
(see Section 10.4.3 for explicit illustrations of this statement).
The current challenge is to take full advantage of the statistical model and to find a
proper way of constraining the solution. One possible approach is to specify recons-
truction methods that are (sub)optimal in the MMSE sense for particular classes of
stochastic processes. While it is still not clear how this can be achieved in full general-
ity, Section 10.4 demonstrates how to proceed for the simpler signal-denoising scenario
where the system matrix is the identity.

10.4 The quest for the minimum-error solution

In the Bayesian framework where the prior distribution of the signal is known, the
optimal (MMSE) reconstruction is the conditional expectation of the signal given the
measurements. Unfortunately, the direct computation of the MMSE solution, which is
specified by an N-dimensional integral, is intractable numerically for the (non-Gaussian)
cases of interest to us. This is to be contrasted with the previous MAP formulation which
translates into the “Gibbs energy” minimization problem (10.12) that can be solved
numerically using standard optimization techniques. Since the algorithms favored by
practitioners are based on similar variational principles, a key issue is to characterize
their degree of (sub)optimality and, in the case of deficiency of the MAP criterion, to
understand how the energy functional should be modified in order to improve the quality
of the reconstruction.
In this section, we investigate the problem of the denoising of Lévy processes for
which questions regarding optimality can be answered to a large extent. Specifically, we
shall define the corresponding MMSE signal estimator, derive a computational solution
based on belief propagation, and use the latter as gold standard to assess the performance
of the primary types of MAP estimators previously considered.

10.4.1 MMSE estimators for first-order processes


We now focus on the problem of the recovery of non-Gaussian AR(1) and Lévy pro-
cesses from their noisy sampled values. The corresponding statistical measurement
model is
p(Y1:YN | X1:XN)(y|x) = ∏_{n=1}^{N} pY|X(yn | xn),

which assumes that the noise contributions are independent and characterized by the
conditional pdf pY|X . For instance, in the case of AWGN, we have that pY|X (y|x) =
gσ (y − x), where gσ is a centered Gaussian distribution with standard deviation σ .
Given the measurements y = (y1 , . . . , yN ), the problem is to recover the unknown
signal vector x = (x1 , . . . , xN ) based on the knowledge that the latter is a realization
of a sparse first-order process of the type characterized in Section 8.5.2. This prior
information is summarized by the stochastic difference equation

un = xn − a1 xn−1 ,

where (un ) is an i.i.d. sequence with infinitely divisible pdf pU , with the implicit
convention that x0 = 0 (or, alternatively, x0 = xN if we are applying circular boundary
conditions). This model covers the cases of the non-Gaussian AR(1) processes (when
|a1 | < 1) and of the Lévy processes for a1 = 1. The posterior distribution of the signal
is therefore given by

p(X1:XN | Y1:YN)(x|y) = (1/Z) ∏_{n=1}^{N} pY|X(yn | xn) ∏_{n=1}^{N} pU(xn − a1 xn−1),    (10.37)

where Z is a proper normalization constant. We can then formally specify the optimal
signal estimate as

xMMSE(y) = E{x|y} = ∫_{R^N} x p(X1:XN | Y1:YN)(x|y) dx.    (10.38)

This is to be contrasted with the MAP estimator, which is defined as

xMAP(y) = arg max_{x∈R^N} p(X1:XN | Y1:YN)(x|y).    (10.39)

For completeness, we now briefly show that the conditional mean (10.38) minimizes
the mean-square estimation error among all signal estimators. An estimator x̃ = x̃(y) is
a specific function of the measurement vector y and its performance is measured by the
(conditional) mean-square estimation error

E{ ‖x − x̃‖² | y } = ∫_{R^N} ‖x − x̃‖² p(X1:XN | Y1:YN)(x|y) dx.    (10.40)

Since y is fixed, we can minimize this expression by annihilating its partial derivatives
with respect to x̃. This gives
∂E{ ‖x − x̃‖² | y } / ∂x̃ = −∫_{R^N} 2(x − x̃) p(X1:XN | Y1:YN)(x|y) dx = 0

⇒ x̃opt = ∫_{R^N} x p(X1:XN | Y1:YN)(x|y) dx,

which proves that (10.38) is the MMSE solution. The implicit assumption here is that ∫_{R^N} ‖x‖^n pX|Y(x|y) dx < ∞ for n = 1, 2, so that (10.40) is well defined and so that we can safely differentiate under the integral sign (by Lebesgue’s dominated-convergence theorem).
Finally, we note that the MMSE estimator provided by (10.38) has the following
properties:
• It is unbiased with E{xMMSE(y)} = E{x}.
• It satisfies the statistical “orthogonality” principle

  E{ x̃(y) ( x − xMMSE(y) )^T } = 0

  for any estimator x̃(y) : R^N → R^N with E{‖x̃(y)‖²} < ∞ that is a (non-linear) function of the measurements.
• It is typically non-linear unless y and x are jointly Gaussian.
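These properties are easy to check by simulation. The toy verification below (our construction, not from the book) uses a scalar jointly Gaussian pair, for which xMMSE(y) = y/2 is known in closed form, and tests unbiasedness and the orthogonality principle against an arbitrary non-linear competitor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = rng.standard_normal(n)            # prior: x ~ N(0, 1)
y = x + rng.standard_normal(n)        # AWGN measurement with sigma = 1

x_mmse = 0.5 * y                      # E{x|y} = y/2 for this Gaussian pair
x_tilde = np.tanh(y)                  # an arbitrary (non-linear) estimator

print(np.mean(x_mmse), np.mean(x))       # both ~ 0   (unbiasedness)
print(np.mean(x_tilde * (x - x_mmse)))   # ~ 0        (orthogonality)
```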

10.4.2 Direct solution by belief propagation


The first approach that we consider is an explicit calculation of (10.38) based on a
recursive evaluation of the required integrals. The algorithm relies on belief propaga-
tion (BP). BP is a general graph-based technique for computing the marginal distribu-
tions of a high-dimensional posterior pdf that admits a decomposition into a product of
simple low-order factors [KFL01]. It operates by passing messages along the edges of
a factor graph (see Figure 10.7). The role of these messages is twofold: (1) to perform the summation (or integration) of the factors of the pdf with respect to a given node variable (e.g., xn in (10.37)); and (2) to progressively combine the factors into more global entities by forming appropriate products.

Example: estimation of a three-point Lévy process


The best way of explaining BP is to detail a simple example. To that end, we consider
the signal x = (x1 , x2 , x3 ) corresponding to three consecutive samples of a Lévy process.
The posterior distribution of the signal given the vector of noisy measurements y =
(y1 , y2 , y3 ) is obtained from (10.37) with N = 3 and a1 = 1. The complete factorized
expression, which is represented by the factor graph in Figure 10.7, is

p(X1 :X3 |Y1 :Y3 ) (x|y) ∝ pU (x1 ) pY|X (y1 |x1 ) pU (x2 − x1 )×
pY|X (y2 |x2 ) pU (x3 − x2 ) pY|X (y3 |x3 ). (10.41)

Figure 10.7 Factor-graph representation of the posterior distribution (10.41) or (10.37) in greater generality. The boxed nodes represent the factors of the pdf and the circled nodes the unknown variables. The presence of an edge between a factor and a variable node indicates that the latter variable is active within the factor. The functions μ−X2(x) and μ+X2(x) represent the beliefs at the variable node X2; they condense all the statistical information coming from the left and right of the graph, respectively.

The crucial step for exact Bayesian inference is to marginalize p(X1 :XN |Y1 :YN ) (x|y) with
respect to xn . In short, we need to integrate the posterior pdf over all other variables.
For instance, in the case of x2 , we get

p(X2|Y1:Y3)(x2|y) = ∫_R ∫_R p(X1:X3|Y1:Y3)(x|y) dx1 dx3

∝ ( ∫_R pU(x1) pY|X(y1|x1) pU(x2 − x1) dx1 ) · pY|X(y2|x2) · ( ∫_R pU(x3 − x2) pY|X(y3|x3) · 1 dx3 ),

where pU(x1) plays the role of the initial belief μ−X1(x1), the first parenthesized integral defines the left belief μ−X2(x2), and the second parenthesized integral defines the right belief μ+X2(x2), its trailing factor 1 being the boundary belief μ+X3(x3).

To evaluate the marginal with respect to x3, we take advantage of the previous integration over x1, which is encoded in the “belief” function μ−X2(x2), and proceed as follows:

p(X3|Y1:Y3)(x3|y) = ∫_R ∫_R p(X1:X3|Y1:Y3)(x|y) dx1 dx2

∝ ( ∫_R μ−X2(x2) pY|X(y2|x2) pU(x3 − x2) dx2 ) · pY|X(y3|x3) · 1,

where the parenthesized integral is the left belief μ−X3(x3) and the trailing 1 is μ+X3(x3).

The emerging pattern is that the marginal distribution of the variable xn can be expressed
as a product of three terms:

p(Xn|Y1:YN)(xn|y) = μ−Xn(xn) · pY|X(yn|xn) · μ+Xn(xn),

where the so-called belief functions μ−Xn and μ+Xn condense the statistical information carried by the variables with indices below n and above n, respectively. This aggregation mechanism is summarized graphically with arrows in Figure 10.7.

BP for Lévy and non-Gaussian AR(1) processes


The fundamental idea for extending the scheme to a larger number of samples is that the beliefs μ−Xn(x) and μ+Xn(x) can be updated recursively. The algorithm below is a generalization that is applicable to the MMSE denoising of the broad class of Markov-1 signals. Since the underlying factor graph has no loops, it computes exact marginals and terminates after one forward and backward sweep of message passing.

• Initialization. Set

  μ−X1(x) = pU(x)
  μ+XN(x) = 1

• Forward message recursion. For n = 2 to N, compute

  μ−Xn(x) ∝ ∫_R μ−Xn−1(z) pY|X(yn−1 | z) pU(x − a1 z) dz    (10.42)

• Backward message recursion. For n = (N − 1) down to 1, compute

  μ+Xn(x) ∝ ∫_R pU(z − a1 x) pY|X(yn+1 | z) μ+Xn+1(z) dz    (10.43)

• Results. For n = 1 to N, compute

  p(Xn|Y1:YN)(x|y) ∝ μ−Xn(x) · pY|X(yn | x) · μ+Xn(x)

  [xMMSE]n = ∫_R x p(Xn|Y1:YN)(x|y) dx.    (10.44)

The symbol ∝ denotes a renormalization such that the resulting function integrates to one. The critical part of this algorithm is the evaluation of the convolution-like integrals (10.42) and (10.43). The scalar belief functions {μ−Xn(x), μ+Xn(x)}_{n=1}^{N} that result from these calculations also need to be stored, which presupposes some form of discretization.
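A minimal discretization of these recursions reads as follows (our sketch, in Python/NumPy; all names are ours): the beliefs are sampled on a uniform grid, the integrals become Riemann sums, and AWGN is assumed for the likelihood pY|X(y|x) = gσ(y − x). Passing, e.g., p_u = lambda u: 1/(np.pi*(1+u**2)) with a1 = 1 recovers the MMSE denoiser for a Lévy flight.

```python
import numpy as np

def bp_mmse(y, p_u, x_grid, sigma, a1=1.0):
    """MMSE denoising of a Markov-1 signal via the forward/backward
    recursions (10.42)-(10.44), discretized on the uniform grid x_grid."""
    N, K = len(y), len(x_grid)
    dx = x_grid[1] - x_grid[0]
    lik = np.exp(-(y[:, None] - x_grid[None, :])**2 / (2 * sigma**2))
    ker = p_u(x_grid[:, None] - a1 * x_grid[None, :])   # p_U(x - a1*z)
    mu_f, mu_b = np.zeros((N, K)), np.zeros((N, K))
    mu_f[0], mu_b[-1] = p_u(x_grid), 1.0                # initialization
    for n in range(1, N):                               # forward (10.42)
        g = mu_f[n - 1] * lik[n - 1]
        mu_f[n] = ker @ g * dx
        mu_f[n] /= mu_f[n].sum() * dx                   # the 'prop-to'
    for n in range(N - 2, -1, -1):                      # backward (10.43)
        g = mu_b[n + 1] * lik[n + 1]
        mu_b[n] = ker.T @ g * dx
        mu_b[n] /= mu_b[n].sum() * dx
    post = mu_f * lik * mu_b                            # marginals
    post /= post.sum(axis=1, keepdims=True) * dx
    return post @ x_grid * dx                           # means (10.44)
```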

Fourier-based version of BP for Lévy processes


There are two practical reasons for transcribing the above BP estimation algorithm into
the Fourier domain. The first is that closed-form expressions are not available for all
infinitely divisible pdfs. The preferred mode of description is the characteristic function p̂U(ω) = e^{fU(ω)}, where fU is the Lévy exponent of the innovation, as has been made clear in Chapters 4 and 9. The second reason is that, for a1 = 1 (Lévy process), (10.42) as well as (10.43) may be rewritten as the convolution

μ−Xn(x) ∝ ∫_R g(z) pU(x − z) dz = (pU ∗ g)(x) = F^{−1}{ e^{fU(ω)} ĝ(ω) }(x),

where g(z) = μ−Xn−1(z) pY|X(yn−1 | z). In order to obtain a cost-effective implementation,
we suggest evaluating the various formulas by relying only on pointwise products, switching back and forth between the “time” domain (to compute g(z)) and the Fourier domain (to evaluate the convolution). The corresponding algorithm is summarized below. For simplicity, we are assuming that the underlying pdfs are symmetric; therefore, the presence of complex conjugation and the differences in convention from the Fourier transform as used by statisticians are inconsequential.

• Initialization. Set

  μ̂−X1(ω) = e^{fU(ω)}
  μ̂+XN(ω) = δ(ω)

• Forward message recursion. For n = 2 to N, compute

  ĝ(ω) = Fz{ pY|X(yn−1 | z) F^{−1}{μ̂−Xn−1}(z) }(ω)
  μ̂−Xn(ω) ∝ ĝ(ω) p̂U(ω)

• Backward message recursion. For n = (N − 1) down to 1, compute

  ĝ(ω) = Fz{ pY|X(yn+1 | z) F^{−1}{μ̂+Xn+1}(z) }(ω)
  μ̂+Xn(ω) ∝ p̂U(ω) ĝ(ω)

• Results. For n = 1 to N, compute

  p̂(Xn|Y1:YN)(ω) ∝ F{ F^{−1}{μ̂−Xn}(x) pY|X(yn | x) F^{−1}{μ̂+Xn}(x) }(ω)

  [xMMSE]n = j (d/dω) p̂(Xn|Y1:YN)(ω) |_{ω=0}.    (10.45)

The symbol ∝ denotes a renormalization such that the value of the Fourier transform at
the origin is one. We have taken advantage of the moment-generating property of the
Fourier transform to establish (10.45).
The conventional and Fourier-based versions of the BP algorithm yield the exact
MMSE estimator for our problem. However, both involve continuous mathematics
(integrals and/or Fourier transforms) and neither one can be implemented in the given
form. The simplest and most practical generic solution is to represent the belief func-
tions by their samples on a uniform grid with a sampling step that is sufficiently fine
and to truncate their support while maintaining the error within an acceptable bound.
Integrals are then approximated by Riemann sums and the Fourier transform is imple-
mented using the FFT.
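The one ingredient that deserves care is the passage from the Lévy exponent to a sampled pdf. A sketch of that step (ours; symmetric fU assumed, so that the result is real):

```python
import numpy as np

def pdf_from_levy_exponent(f_u, K=1024, x_max=20.0):
    """Sample the pdf whose characteristic function is exp(f_U(omega)),
    by approximating the Fourier-inversion integral with a DFT."""
    dx = 2 * x_max / K
    x = (np.arange(K) - K // 2) * dx
    omega = 2 * np.pi * np.fft.fftfreq(K, d=dx)
    phi = np.exp(f_u(omega))                  # characteristic function
    # the phase factor recenters the grid at the origin; result is real
    p = np.fft.fft(phi * np.exp(1j * omega * x_max)).real / (K * dx)
    p = np.maximum(p, 0)                      # clip tiny negative ripples
    return x, p / (p.sum() * dx)

# example: f_U(w) = -|w| yields the Cauchy pdf 1/(pi*(1 + x^2))
x, p = pdf_from_levy_exponent(lambda w: -np.abs(w))
```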

Figure 10.8 Examples of Lévy processes with increasing degree of sparsity. (a) Brownian motion (with Gaussian increments). (b) Lévy–Laplace motion. (c) Compound-Poisson process. (d) Lévy flight with Cauchy-distributed increments.

10.4.3 MMSE vs. MAP denoising of Lévy processes


We now present a series of denoising experiments with the goal of comparing the
various estimation techniques introduced so far. We have considered signals associated
with the four Lévy processes displayed in Figure 10.8. These were generated by discrete integration of sequences of i.i.d. increments (see (8.2)) with statistical distributions as indicated below.
• Brownian motion. (un ) follows the standard Gaussian distribution pU (x) = g1 (x).
• Laplace motion. (un ) follows a Laplace distribution with dispersion parameter 1/λ =
1.
• Compound-Poisson process. (un ) follows a compound-Poisson distribution with
parameters (λ = 0.6, pA ) such that pA = g1 is a standard Gaussian distribution and
Prob(un = 0) = e−λ = 0.549.
• Lévy flight. (un) follows a Cauchy distribution (SαS with α = 1 or, equivalently, Student with r = 1/2) with dispersion s0 = 1.
The reader can refer to Table 4.1 for the precise definition of the underlying pdfs.
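Synthesizing these test signals takes only a few lines; a sketch (ours, with our own parameter names) follows, in which each path is the cumulative sum of i.i.d. increments. The ΔSNR criterion used below is included for completeness.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100

u_gauss = rng.standard_normal(N)                # Brownian motion
u_laplace = rng.laplace(scale=1.0, size=N)      # Laplace motion
counts = rng.poisson(0.6, N)                    # lambda = 0.6 jumps per step
u_poisson = np.array([rng.standard_normal(c).sum() for c in counts])
u_cauchy = rng.standard_cauchy(N)               # Levy flight

x = np.cumsum(u_cauchy)                         # discrete integration, cf. (8.2)
y = x + rng.standard_normal(N)                  # AWGN measurements

def snr_improvement(x_hat, x, y):
    """The Delta-SNR of the text: 10*log10(||y - x||^2 / ||x_hat - x||^2)."""
    return 10 * np.log10(np.sum((y - x)**2) / np.sum((x_hat - x)**2))
```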
Each realization x of the signal was corrupted with additive white Gaussian noise of variance σ² to yield the measurement vector y. Based on (10.12) with H = I and s = x, the relevant MAP estimators are therefore given by

xMAP(y) = arg min_{x∈R^N} { (1/2)‖y − x‖₂² + τ Σ_{n=1}^{N} ΦU([Lx]n) },    (10.46)

where τ ∝ σ² is a regularization parameter and ΦU is one of the following potential functions:
• ΦGauss(x) = (1/2)|x|² (Gaussian prior), which is known to yield the MMSE solution for Brownian motion and the best linear estimator otherwise (Wiener filter or LMMSE solution).
• ΦLaplace(x) = |x| (Laplace prior), also termed TV, because it provides the signal estimate whose discrete total variation is minimum. The corresponding estimator, which involves the minimization of an ℓ1-norm, is illustrative of the kind of recovery techniques used in compressed sensing.
• ΦStudent(x) = log(|x|² + 1) (Student or Cauchy prior), an instance of a non-convex potential function that promotes sparsity even further than ΦLaplace.
These MAP estimators are computed iteratively according to the generic procedure des-
cribed in Section 10.2.4 using the LMMSE solution as warm start. The performance
of an estimator x̃(y) is measured by the signal-to-noise ratio improvement, which is
given by

SNR(x̃, x) = 10 log10( ‖y − x‖² / ‖x̃(y) − x‖² ).

The signal length was set to N = 100. Each data point in a graph is an average resulting
from 500 realizations of the denoising experiment. The MMSE estimator (Fourier-based
implementation of belief propagation) relies on the correct prior signal model, while the
regularization parameter τ of each of the other three estimators is kept constant for a
given noise level, and adjusted to yield the lowest collective estimation error.
We show in the graph of Figure 10.9 the signal-to-noise improvement of the various
algorithms for the denoising of Brownian motion. The first observation is that the results
of the BP MMSE estimator and the Wiener filter (LMMSE=MAP Gaussian) are indis-
tinguishable and that these methods perform best over the whole range of experimenta-
tion, in agreement with the theory. The worst results are obtained for TV regularization,
while the Log penalty gives intermediate results. A possible explanation of the latter
finding is that the Log potential is quadratic around the origin, so that it can replicate
the behavior of ℓ2, but only over some limited range of input values.
A similar scenario is repeated in Figure 10.10 for the compound-Poisson process. We
note that the corresponding MAP estimate is a constant signal since the probability of
its increments being zero is overwhelmingly larger (Dirac mass at the origin) than any
other acceptable value. This trivial estimator is excluded from the comparison. At low
noise levels where the sparsity of the source dictates the structure, the performance of
the TV estimator is very close to that of the MMSE denoiser, which can be considered
as gold standard. Yet, the relative performance of the TV estimator deteriorates with
increasing noise, so much so that TV ends up being the worst at the other end of the scale. One
can observe a reverse trend for the LMMSE estimator, which progressively converges
to the MMSE solution as the variance of the noise increases. Here, the explanation is
that the statistics of the noisy signal are dominated by the Gaussian constituent, which is
favorable to the LMMSE estimator.

Figure 10.9 SNR improvement as a function of the level of noise for Brownian motion. The
denoising methods by order of decreasing performance are: MMSE estimator (which is
equivalent to the LMMSE and MAP estimators), Log regularization, and TV regularization.

Figure 10.10 SNR improvement as a function of the level of noise for a piecewise-constant signal
(compound-Poisson process). The denoising methods are: MMSE estimator, Log regularization,
TV regularization, and LMMSE estimator.

The more challenging case of a Lévy flight, which is the sparsest process in the series,
is documented in Figure 10.11. Here, the MAP estimator (Log potential) performs well
over the whole range of experimentation. A possible explanation is that the corrupted
signal looks sparse even at large noise powers since the convolution of a heavy-tailed
pdf with a Gaussian remains heavy-tailed. The dominance of the non-Gaussian regime
also explains why the LMMSE performs so poorly. The main limitation of the LMMSE
algorithm is that it fails to preserve the sharp edges that are characteristic of this type of
signal.
The last example, in Figure 10.12, is particularly telling because the results go against
our initial expectation, especially at higher noise levels. While the MAP (TV) estimator
performs well in the low-noise regime, it progressively falls behind all other estimators
as the variance of the noise increases. Particularly surprising is the good behavior of the
LMMSE algorithm, which matches the MMSE solution at higher noise levels. Apart
from the MMSE denoiser, there is no single estimator that outperforms the others over
the whole range of noise. The possible reason for the poor performance of MAP is that
the underlying signal is at the very low end of sparsity, with its general appearance being rather similar to Brownian motion (see Figures 10.8a,b). This finding suggests that one should be cautious with the Bayesian justification of ℓ1-norm minimization techniques

Figure 10.11 SNR improvement as a function of the level of noise for a Lévy flight with
Cauchy-distributed increments. The denoising methods by order of decreasing performance are:
MMSE estimator, MAP estimator (Log regularization), TV regularization, and LMMSE
estimator.

Figure 10.12 SNR improvement as a function of the level of noise for a Lévy process with
Laplace-distributed increments. The denoising methods are: MMSE estimator, LMMSE
estimator, MAP estimator (TV regularization), and Log regularization.

based on Laplace priors since MAP-TV does not necessarily perform so well for the
model to which it is matched – in fact, often worse than the classical Wiener solution
(LMMSE).
While this series of experiments stands as a warning against a strict application of
the MAP principle, it also shows that it is possible to specify variational estimators that
approximate the MMSE solution well. The caveat is that the best performing potential is
not necessarily the prior log-likelihood function associated with the probability model,
which calls for further investigation.

10.5 Bibliographical notes

Image reconstruction is a classical topic in image processing and biomedical imaging


[Nat84, Pra91]. In the variational formulation, the reconstruction task is cast as an opti-
mization problem. The idea of including a special penalty term (regularization) to stabil-
ize ill-posed problems can be traced back to Tikhonov [Tik63, TA77]. Classical Tikho-
nov regularization involves quadratic terms only; it results in a linear reconstruction

(regularized least-squares solution) and is equivalent to ridge regression in statistics


[HK70]. An informative discussion of such techniques in the context of image restor-
ation and their connection with statistical modeling is given in [KV90]. The Bayesian
interpretation of this type of algorithm suggests a natural refinement which is to combine
the quadratic regularization with a (non-quadratic) log-likelihood data term that incor-
porates the knowledge of the statistics of the noise [Hun77, TH79]. Still, it remains the
case that quadratic regularization has a tendency to weaken signal discontinuities, which
has led researchers to look for alternatives. A significant step forward was the introduc-
tion of potential functions that have a better ability to preserve edges [GR92,CBFAB97].
Total variation is a popular instance of such a non-quadratic regularizer that has the
important property of being convex [ROF92]. The connection between (Gibbs) energy
minimization and MAP estimation is well understood and has been exploited to make
the link with discrete Markov random fields [GG84, BS93].

Sections 10.1 and 10.2


While researchers have spent a significant effort on the discrete stochastic modeling of
images for defining suitable non-Gaussian priors [GG84, BS93], there is little work on
the use of continuous-domain stochastic models, except in the Gaussian case where the
MMSE solution is linear and intimately linked to smoothing splines [KW70] and hybrid
Wiener filtering [UB05b,RVDVBU08]. The primary reference for the model-based dis-
cretization procedure of Sections 10.1–10.2 is [BKNU13]. The generic algorithm pre-
sented in Section 10.2.4 is inspired by the work of Ramani and Fessler, who proposed
an ADMM algorithm for the reconstruction of parallel MRI [RF11]. Other relevant
works on iterative reconstruction algorithms are [WYYZ08, BT09a, BT09b, ABDF11].
See also [CP11] for the general definition and application of proximal operators.

Section 10.3
Self-similar probability models are commonly used as prior knowledge for image pro-
cessing [PPLV02]. The property of scale-invariance is supported by empirical obser-
vations of the power spectrum of natural images [Fie87, RB94, OF96b, SO01]; it is
also motivated by physics and biology [Man82, Rud97, MG01]. The non-Gaussian cha-
racter of images is well documented too [SLSZ03]. The wavelet-domain histograms,
in particular, are typically leptokurtotic (with a peak at the origin) and heavy-tailed
[Fie87,Mal89,SLG02,PSWS03]. The same holds true for the pdfs of image derivatives,
which are often exponential or even subexponential. Grenander and Srivastava have
shown that this behavior can be induced by a simple generative model that involves the
random superposition of a fixed collection of templates [GS01]. Mumford and Gidas
have introduced a scale-invariant adaptation of this model that takes the form of a sto-
chastic wavelet expansion with a random placement and scaling of the atoms [MG01].
This random wavelet model has the same phenomenological characteristics – infinite
divisibility and wide-sense self-similarity – as the sparse stochastic processes being
considered here. However, it does not lend itself as well to statistical inference because
of the lack of an underlying innovation model.

The primary areas of application of deconvolution are astronomy and fluorescence


microscopy, where the dominant source of noise is photon counting (Poisson statistics)
[MKCC99, SPM02]. Deconvolution is particularly effective in wide-field fluorescence
microscopy because it can be deployed in full 3-D [AS83]. The traditional deconvo-
lution method is the Richardson–Lucy (RC) algorithm: a multiplicative gradient-based
technique that maximizes a Poisson-likelihood criterion [Ric72, Luc74]. Note that the
level of smoothing of RC is controlled through the number of iterations since the cost
function does not include a regularization term. While this type of algorithm may seem
primitive by modern standards, it is the workhorse of many applications, with some of
its variants performing remarkably well [vKvVVvdV97]. A significant advantage of RC
is that it results in a positive solution. In that respect, we note that the iterative algorithm
of Carrington et al. optimizes a quadratic Tikhonov-type criterion subject to the positiv-
ity constraint [CLM+ 95]. By contrast, the use of unconstrained penalized least-squares
methods, as in Section 10.3.2, is slightly more academic, but typical of the evaluation of
deconvolution algorithms [BT09b, BDF07]. The MAP formulation can also be refined
by adopting the Poisson-likelihood term of RC and complementing it with a suitable
penalty such as total variation [DBFZ+ 06, FBD10].
Iterative reconstruction methods became relevant for MRI with the invention of paral-
lel imaging (SENSE) by Pruessmann et al. [PWSB99]. The first-generation algorithms
were linear, with the optimization being typically performed by the conjugate gradient
method [Pru06]. Non-linear techniques based on TV-regularizer were introduced later
as a natural extension [BUF07] and, more significantly, as one of the first practical
demonstrations of the “compressed-sensing” (CS) methodology [LDP07].
X-ray computed tomography is a classical medical imaging modality that relies on
the numerical inversion of the Radon transform [Nat84, Kal06]. Tomographic recons-
truction is also of interest in biology for the imaging of small animals using micro-
scanners [HT02] and for the determination of 3-D molecular structure from cryoelectron
micrographs [Fra92, LFB05]. When the number of projections is large with a uniform
angular distribution, the tomogram is reconstructed by filtered backprojection (FBP)
[RL71, PSV09]. For less ideal acquisition conditions (e.g., noisy and/or missing data,
non-even angular distribution of the projections), better results are obtained using iter-
ative algorithms such as the algebraic reconstruction technique (ART) [GBH70]. ART
is a basic linear reconstructor that computes the solution of a least-squares minimiza-
tion problem [HL76]; it may also be interpreted as a Bayesian estimator with a simple
Gaussian prior [HHLL79]. The recent trend is to consider non-quadratic regularizers
imposed upon the gradient, with a marked preference for total variation [PBE01, SP08,
RBFK11]. A precursor to this type of algorithm is Bouman and Sauer’s MAP estimator
that involves a Markov random-field model with generalized Gaussian priors [BS93].
The discretization of the tomography reconstruction problem using local basis func-
tions (including square pixels and low-order B-splines) was pioneered by Hanson et al.
[HW85]. Many iterative algorithms use isotropic band-limited basis functions, which
facilitates the implementation of the forward model since the Radon transform of a
basis function does not depend on the direction of integration [Lew90]. The spline-based

discretization that is used in the present experiments is slightly more involved but has
better approximation properties [ENU12].

Section 10.4
The connection between TV denoising and the MAP estimation of Lévy processes was
pointed out in [UT11]. The direct solution for the MMSE denoising of Lévy processes,
which is based on belief propagation, was proposed by Kamilov et al. [KPAU13]. For
a general presentation of the methods of belief propagation and message passing, we
refer to the articles of Loeliger et al. [KFL01, Loe04]. The paper by Amini et al.
[AKBU13] provides the basic tools for the proper statistical formulation of signal recov-
ery for higher-order sparse stochastic models. It also includes the type of experimental
comparison presented in Section 10.4.3. The term “Lévy flight” was coined by Mandel-
brot [Man82]. This stochastic model induces a chaotic behavior with random displace-
ments interspersed with sudden jumps. It is characteristic of the path followed by birds
and other animals when searching for food [VBB+ 02].
Several authors have identified deficiencies of non-Gaussian MAP estimation tech-
niques [Nik07, Gri11, SDFR13]. Conversely, Gribonval has shown that, in the AWGN
scenario, there exists a penalized least-squares estimator with an appropriate penalty
that is equivalent to the MMSE solution, and that this modified penalty may not coincide
with the prior log-likelihood function associated with the underlying statistical model.
11 Wavelet-domain methods

A simple and surprisingly effective approach for removing noise in images is to expand
the signal in an orthogonal wavelet basis, apply a soft-threshold to the wavelet coef-
ficients, and reconstruct the “denoised” image by inverse wavelet transformation. The
classical justification for the algorithm is that i.i.d. noise is spread out uniformly in the
wavelet domain while the signal gets concentrated in a few significant coefficients (spar-
sity property) so that the smaller values can be primarily attributed to noise and easily
suppressed.
In this chapter, we take advantage of our statistical framework to revisit such wave-
let-based reconstruction methods. Our first objective is to present some alternative
dictionary-based techniques for the resolution of general inverse problems based on
the same stochastic models as in Chapter 10. Our second goal is to take advantage
of the orthogonality of wavelets to get a deeper understanding of the effect of proxi-
mal operators while investigating the possibility of optimizing shrinkage/thresholding
functions for better performance. Finally, we shall attempt to bridge the gap between
operator-based regularization, as discussed in Sections 10.2–10.3, and the imposition
of sparsity constraints in the wavelet domain. Fundamentally, this relates to the dicho-
tomy between an analysis point of view of the problem (typically in the form of the
minimization of an energy functional with a regularization term) vs. a synthesis point
of view, where a signal is represented as a sum of elementary constituents (wavelets).
The chapter is composed of two main parts. The first is devoted to inverse problems
in general. Specifically, in Section 11.1 we apply our general discretization and model-
ing paradigm to the derivation of wavelet-domain MAP estimators for the resolution of
linear inverse problems. One of the key differences from the innovation-based formula-
tion of Chapter 10 is the presence of scale-dependent potential functions whose form is
specified by the stochastic model. We then address practical issues in Section 11.2 with
the presentation of the two primary iterative thresholding algorithms (ISTA and FISTA).
These methods are illustrated with the deconvolution of fluorescence micrographs.
The second part of the chapter focuses on the denoising problem, with the aim
of improving upon simple soft-thresholding and wavelet-domain MAP estimation.
Section 11.3 presents a detailed investigation of shrinkage functions in relation to infi-
nitely divisible laws, with the emphasis on pointwise estimators that are optimal in the
MMSE sense. In Section 11.4, we show how the performance of wavelet denoising
can be boosted even further through the use of redundant representations (tight wavelet
frames). In particular, we describe the concept of consistent cycle spinning, which

provides a conceptual bridge with the optimal estimation techniques of Section 10.4.
We then close the circle in Section 11.4.4 by combining all ingredients – tight operator-
like wavelet frames, MMSE shrinkage functions, and consistent cycle spinning – and
present an iterative wavelet-based algorithm that converges empirically to the reference
MMSE solution of Section 10.4.2.

11.1 Discretization of inverse problems in a wavelet basis

As an alternative to the shift-invariant formulation presented in Chapter 10, we may


choose to discretize a linear inverse problem in a wavelet basis. To that end, we consider
a biorthogonal wavelet system of the type investigated in Chapter 8 that is matched to
the whitening operator L. The underlying signal representation is the wavelet counter-
part of (10.2) in Section 10.1. It is given by

s1(r) = Σ_{i≥1} Σ_{k∈Z^d\DZ^d} vi[k] ψi,k(r) = Σ_{k∈Z^d} s[k] βL(r − k),    (11.1)

where the wavelet coefficients are obtained as

vi[k] = ⟨ψ̃i,k, s⟩.

The mathematical requirement is that the family of analysis/synthesis functions (ψ̃i,k, ψi,k) forms a biorthonormal wavelet basis.
Observe that the central wavelet expansion in (11.1) excludes the finer-scale wavelet
coefficients with i < 1, so that the signal approximation s1 (r), which is the projection
of s(r) onto the reference space V0 , can also be represented as a linear combination of
the integer shifts of the scaling function βL .
The crucial ingredient for our formulation (see Section 6.5.3) is that the analysis
wavelets are such that

ψ̃i,k(r) = L∗φ̃i(r − D^{i−1}k),    (11.2)

where φ̃i ∈ L1 (Rd ) is some suitable (possibly, scale-dependent) smoothing kernel and
D is the dilation matrix that specifies the multiresolution decomposition. Recalling that
s = L−1 w, this implies that

vi[k] = ⟨ψ̃i,k, s⟩ = ⟨L∗φ̃i(· − D^{i−1}k), L^{−1}w⟩
      = ⟨φ̃i(· − D^{i−1}k), w⟩,

so that it is possible to derive any finite-dimensional joint pdf of the wavelet coefficients
vi [·] by using the general white-noise analysis exposed in Chapters 8 and 9. In particular,
Proposition 8.6 tells us that pVi, the pdf of the wavelet coefficients at scale i, is infinitely divisible with modified Lévy exponent fφ̃i(ω) = ∫_{R^d} f(ω φ̃i(r)) dr.

11.1.1 Specification of wavelet-domain MAP estimator


To obtain a practical reconstruction model, we adopt the same strategy as in
Section 10.1.2: we truncate the signal over a spatial region Ω and introduce problem-specific boundary conditions that are enforced by suitable modifications of the basis functions. This yields the finite-dimensional signal model

s1(r) = Σ_{i=1}^{Imax} Σ_{k∈Ωi} vi[k] ψi,k(r) = Σ_{k∈Ω} s[k] βL,k(r),    (11.3)

where Ωi denotes the wavelet-domain index set corresponding to the region of interest (ROI) Ω. Note that the above expansion spans the same signal space as (10.8), provided
that we select β = βL as the scaling function of the wavelet system {ψi,k }.
The signal in (11.3) is uniquely specified by an N-dimensional vector v of pooled
wavelet coefficients vi[k], k ∈ Ωi, i = 1, . . . , Imax. The right-hand side of (11.3) also
indicates that there is a linear, one-to-one correspondence between the sequence of
wavelet coefficients vi [·] and the discrete signal s[·]. This mapping specifies the discrete
wavelet transform which admits a fast filterbank implementation. In vector notation,
this translates into
v = W̃s ⇔ s = Wv

with W = W̃^{−1}, where the entries of the (N × N) wavelet matrices W̃ and W are given by

[W̃](i,k),k′ = ⟨ψ̃i,k, βL,k′⟩
[W]k′,(i,k) = ⟨β̃L,k′, ψi,k⟩,

respectively. Also note that the wavelet basis is orthonormal if and only if ψ̃i,k = ψi,k ,
which translates into W̃ = WT being an orthonormal matrix; this latter property presup-
poses that the underlying scaling functions are orthogonal too.
With the above convention, we write the wavelet version of the measurement equa-
tion (10.9) as
y = Hwav v + n,
with wavelet-domain system matrix Hwav whose entries are given by
[Hwav]m,(i,k) = ⟨ηm, ψi,k⟩,    (11.4)
where ηm is the analysis function corresponding to the mth measurement. The link with
(10.10) in Section 10.1.2 is Hwav = HW with the proper choice of analysis function
β̃ = β̃L .
For the purpose of simplification and mathematical tractability, we now make the
same kind of decoupling simplification as in Section 10.1.2, treating the wavelet com-
ponents as if they were independent. 1 Using Bayes’ rule, we get the corresponding
1 While this approximation is legitimate within a given scale for sufficiently well-localized wavelets, it is less so between scales because the wavelet smoothing kernels φ̃i and φ̃i′ typically overlap. (A more refined probabilistic model should take those inter-scale dependencies into consideration.)

expression for the posterior probability distribution as

pV|Y(v|y) ∝ exp( −‖y − Hwav v‖² / (2σ²) ) pV(v)
          ≈ exp( −‖y − Hwav v‖² / (2σ²) ) ∏_{i=1}^{Imax} ∏_{k∈Ωi} pVi( vi[k] ),

where pVi is the (conjugate) inverse Fourier transform of p̂Vi(ω) = e^{fφ̃i(ω)}. By maximizing pV|Y, we derive the wavelet-domain version of the MAP estimator

vMAP(y) = arg min_v { (1/2)‖y − Hwav v‖₂² + σ² Σ_i Σ_{k∈Ωi} ΦVi( vi[k] ) },    (11.5)

which is similar to (10.12), except that it now involves the series of wavelet potentials ΦVi(x) = −log pVi(x).

The specificity of the present MAP formulation is that the potential functions ΦVi are scale-dependent and tied to the Lévy exponent f of the continuous-domain innovation w. Since the pdfs pVi of the wavelet coefficients are infinitely divisible with Lévy exponent fφ̃i, we can determine the exact form of the potentials as

ΦVi(x) = −log ∫_R exp( fφ̃i(ω) − jωx ) dω/(2π)    (11.6)

with

fφ̃i(ω) = ∫_{R^d} f( ω φ̃i(r) ) dr,

where φ̃i is the wavelet smoothing kernel at resolution i in (11.2). Moreover, we can rely on the theoretical analysis of id potentials in Section 10.2.1, which remains valid in the wavelet domain, to extract the global characteristics of ΦVi. The general trend that emerges is that these characteristics are mostly insensitive to the exact shape of φ̃i, and hence to the choice of a particular wavelet basis.
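Equation (11.6) is straightforward to tabulate numerically with the same FFT machinery as in Section 10.4.2. The sketch below (ours) does so for the illustrative choice f(ω) = −|ω| with d = 1 and γ = 1, for which the effective exponent follows from (11.7) in closed form; it reuses the pdf_from_levy_exponent helper sketched earlier.

```python
import numpy as np

def potential_from_exponent(f_eff, K=2048, x_max=30.0):
    """Tabulate Phi(x) = -log p(x), where p has characteristic function
    exp(f_eff(omega)); p is obtained with pdf_from_levy_exponent above."""
    x, p = pdf_from_levy_exponent(f_eff, K=K, x_max=x_max)
    return x, -np.log(np.maximum(p, 1e-300))

for i in range(3):
    # (11.7) with d = 1, gamma = 1, and f = -|.| (Cauchy-type innovation)
    f_i = lambda w, i=i: -(2.0**i) * np.abs(2.0**(0.5 * i) * w)
    x, phi_i = potential_from_exponent(f_i)
    # phi_i broadens with i: the potentials drift toward a quadratic shape
```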

11.1.2 Evolution of the potential function across scales


In a conventional wavelet analysis, the basis functions are dilated versions of a small
number (N0 = det(D) − 1, typically) of mother wavelets. For simplicity of notation, we
consider the case of a single mother wavelet ψ̃ and a dyadic dilation matrix D = 2I. The
analysis wavelets at scale a = 2^i (or resolution i) are shifted versions of

ψ̃i(r) = 2^{−id/2} ψ̃(r/2^i) = L∗φ̃i(r),

where L is scale-invariant of order γ and

φ̃i(r) = 2^{i(γ−d/2)} φ̃(r/2^i),

in accordance with (9.19).

The assumption that the underlying signal s(r) is (second-order) self-similar has
direct repercussions on the form of the potentials and their evolution across scale. Based
on the analysis in Section 9.8, we find that the wavelet-domain pdfs pVi are members of
the same class. Specifically, their Lévy exponent at resolution i is given by

fφ̃i(ω) = 2^{id} fφ̃( 2^{i(γ−d/2)} ω ).    (11.7)

It follows that the wavelet potential at resolution i can be written as

ΦVi(x) = i log b1 + Φ( x/b1^i ; 2^{id} )    (11.8)

with b1 = 2^{γ−d/2} and

Φ(x, τ) = −log F^{−1}{ e^{τ fφ̃(ω)} }(x).

The main point is that, up to a dilation by (b1)^i, the wavelet potentials are part of the parametric family Φ(x, τ), which corresponds to the natural semigroup extension of the wavelet pdf at scale i = 0.
Interestingly, we can also provide an iterated-convolution interpretation of this result by considering the pdfs of the scale-normalized wavelet coefficients zi = vi/(b1)^i. To see this, we express the characteristic function of zi as

p̂Zi(ω) = p̂Vi(ω/b1^i) = exp( 2^{id} fφ̃(ω) ) = ( p̂Zi−1(ω) )^{2^d} = ( p̂Z0(ω) )^{2^{id}},

which indicates that pZi is the 2^{id}-fold convolution of pZ0 = pV0, which is itself the pdf of the wavelet coefficients at resolution 0. In 1-D, this translates into the recursive relation

pZi(x) = ( pZi−1 ∗ pZi−1 )(x),    (11.9)
which we like to view as the probabilistic counterpart of the two-scale relation of the
multiresolution theory of the wavelet transform. Incidentally, this iterated convolution
relation also explains why pZi spreads out and converges to a Gaussian as the scale
increases. Observe that the effect is more pronounced in higher dimensions since the
number of elementary convolution factors in the probabilistic two-scale relation grows
exponentially with d.
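The probabilistic two-scale relation (11.9) and the announced Gaussianization can be verified directly by iterating a discrete self-convolution; a short numerical check (ours, d = 1, starting from a Laplace pdf whose excess kurtosis should halve at every scale):

```python
import numpy as np

x = np.linspace(-40, 40, 4001)          # odd length keeps 'same' centered
dx = x[1] - x[0]
p = np.exp(-np.abs(x)) / 2              # p_{Z_0}: Laplace pdf

def excess_kurtosis(x, p, dx):
    m2 = np.sum(x**2 * p) * dx
    m4 = np.sum(x**4 * p) * dx
    return m4 / m2**2 - 3

for i in range(1, 5):
    p = np.convolve(p, p, mode="same") * dx   # (11.9): p_{Z_i} = p_{Z_{i-1}} * p_{Z_{i-1}}
    p /= p.sum() * dx                         # absorb discretization error
    print(i, excess_kurtosis(x, p, dx))       # 1.5, 0.75, ... -> 0 (Gaussian)
```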

11.2 Wavelet-based methods for solving linear inverse problems

Having specified the statistical reconstruction problem in a wavelet basis, we now des-
cribe numerical methods of solution. To that end, we consider the general optimization
problem
min_s { (1/2)‖y − Hs‖₂² + τ Φ(W^T s) },    (11.10)

where Φ(v) = Σ_{n=1}^{N} Φn(vn) is a separable potential function and W = (W^T)^{−1} is an orthonormal transform matrix. The qualitative effect of the second term in (11.10) is to favor solutions that admit a sparse wavelet expansion; the strength of this “regularization” constraint is controlled by the parameter τ ∈ R+. Clearly, the solution of (11.10) is equivalent to the MAP estimator (11.5) if we set τ = σ² and Φ(v) = Σ_i Σ_{k∈Ωi} ΦVi(vi[k]).
While a possible approach for solving (11.10) is to apply the ADMM algorithm of
Section 10.2.4 with the replacement of L by WT and a slight adjustment for scale-
dependent potentials, we shall present two alternative techniques (ISTA and FISTA)
that capitalize on the orthogonality of the matrix W. The second algorithm (FISTA) is
a modification of the first one that results in faster convergence.

11.2.1 Preliminaries
To exploit the separability of the potential function Φ, we restate the reconstruction problem in terms of the wavelet coefficients v = (v1, . . . , vN) = W^T s as the minimization of the cost functional

C(v) = (1/2)‖y − Hwav v‖₂² + τ Σ_{n=1}^{N} Φn(vn),    (11.11)

where Hwav = HW. In order to gain insights into the algorithmic components of ISTA,
we first investigate two extreme cases for which the solution can be written down
explicitly.

Least-squares estimation
For τ = 0, the minimization of (11.11) reduces to a classical least-squares estimation
problem, and there is no advantage in expressing the signal in terms of wavelets. The
solution of the reconstruction problem is given by

sLS = (HT H)−1 HT y

under the assumption that HT H is invertible. When the underlying matrix is too large
to be inverted numerically, the corresponding linear system of equations is solved itera-
tively. The simplest iterative reconstruction method is the Landweber algorithm

s^{k+1} = s^k + μ H^T( y − Hs^k )    (11.12)

with μ ∈ R+ , which progressively builds up the solution by applying a steepest-descent


update. It is a first-order optimizer whose efficiency depends on the step size μ and
the conditioning of H. A classical result is that this iterative scheme will converge to
the solution provided that 0 < μ < 2/L, where L = λmax (HT H) is the spectral radius
of the iteration matrix A = HT H.
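A minimal sketch of (11.12) (ours), with L estimated by a few power iterations so that the step size satisfies the convergence condition:

```python
import numpy as np

def landweber(H, y, n_iter=200, seed=0):
    A, a = H.T @ H, H.T @ y
    e = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(50):              # power iterations for lambda_max
        e = A @ e
        e /= np.linalg.norm(e)
    L = e @ A @ e                    # Rayleigh quotient ~ spectral radius
    mu = 1.0 / L                     # any 0 < mu < 2/L converges
    s = np.zeros(A.shape[0])
    for _ in range(n_iter):
        s = s + mu * (a - A @ s)     # steepest-descent update (11.12)
    return s
```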

Simple denoising problem


When both s and y are expressed in the wavelet basis and H = I, (11.10) reduces to a
separable denoising problem. Specifically, by defining z = WT y, we get
ṽ = arg min_v { (1/2)‖y − Wv‖₂² + τ Φ(v) }    (11.13)

  = arg min_v { (1/2)‖z − v‖₂² + τ Σ_{n=1}^{N} Φn(vn) },    (by Parseval)

so that

ṽ = prox_Φ(z; τ) = ( prox_{Φ1}(z1; τ), …, prox_{ΦN}(zN; τ) )^T,    (11.14)
where the definition of the underlying proximal operators (vectorial and scalar) is
consistent with the formulation of Section 10.2.3. Hence, the solution ṽ can be com-
puted by applying a series of component-wise shrinkage/thresholding functions to the
wavelet coefficients of y. This is the model-based version of the standard denoising
algorithm mentioned in the introduction. The relation between prox_{Φn} and the under-
lying probability model is investigated in more detail in Section 11.3. The bottom line
is that these are scale-dependent non-linear maps (see examples in Figure 11.5) that
can be precomputed and stored in a lookup table, which makes the denoising procedure
very efficient.
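The lookup-table construction can be as simple as a brute-force minimization of the scalar criterion (z − x)²/2 + τΦ(x) on a grid, followed by linear interpolation (our sketch; the Student-type potential is used as an example):

```python
import numpy as np

def make_prox_lut(phi, tau, x_max=10.0, K=2001):
    """Tabulate prox(z; tau) = argmin_x (z - x)^2/2 + tau*phi(x)."""
    x = np.linspace(-x_max, x_max, K)      # candidate minimizers
    z = np.linspace(-x_max, x_max, K)      # input (noisy) values
    cost = 0.5 * (z[:, None] - x[None, :])**2 + tau * phi(x)[None, :]
    return z, x[np.argmin(cost, axis=1)]

z_grid, prox_vals = make_prox_lut(lambda x: np.log(1 + x**2), tau=1.0)
shrink = lambda z: np.interp(z, z_grid, prox_vals)   # fast pointwise map
```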

11.2.2 Iterative shrinkage/thresholding algorithm


The idea behind the iterative shrinkage/thresholding algorithm (ISTA) is to solve (11.10)
iteratively by alternatively switching between a simple Landweber update and a denoi-
sing step.
ISTA produces a sequence v^k that converges to the minimizer v* of (11.11) when Φ is convex. At each step, it minimizes a simpler auxiliary cost C′(v, v^k) that depends on the current estimate v^k. The design constraint is that C′(v, v^k) ≥ C(v), with equality when v = v^k. This guarantees that the cost functional decreases monotonically with the iteration number k. The standard choice is

C′(v, v^k) = C(v) + (L/2)‖v − v^k‖₂² − (1/2)‖Hwav(v − v^k)‖₂²,    (11.15)

where the sum of the two added terms is nonnegative provided that L satisfies

L‖e‖² ≥ ‖Hwav e‖²

for all e ∈ R^N. The critical value of L is λmax(H^T_wav Hwav), which is the same L as in the Landweber algorithm of Section 11.2.1 since W is unitary. The derivation of ISTA is based on the rewriting of (11.15) as

C′(v, v^k) = (L/2)‖v − z^k‖₂² + τ Φ(v) + C0(v^k, y),    (11.16)

where C0(v^k, y) is a term that does not depend on v and where the auxiliary variable z^k is given by

z^k = v^k + (1/L) H^T_wav( y − Hwav v^k )
    = W^T( s^k + (1/L) H^T( y − Hs^k ) ).    (11.17)

The crucial point is that the minimization of (11.16) with respect to v is equivalent to the denoising problem (11.13). This implies that

arg min_v C′(v, v^k) = prox_Φ( z^k ; τ/L ),

which corresponds to a shrinkage/thresholding of the wavelet coefficients of the signal. The form of the update equation (11.17) is also highly suggestive, for it boils down to a Landweber iteration (see (11.12)) followed by a wavelet transform. The resulting ISTA is summarized in Algorithm 1.

Algorithm 1: ISTA, which solves s* = arg min_s { (1/2)‖y − Hs‖₂² + τ Φ(W^T s) }

  input: A = H^T H, a = H^T y, s^0, τ, and L
  set: k ← 0
  repeat
    s^{k+1} ← s^k + (1/L)( a − As^k )    (Landweber step)
    v^{k+1} ← prox_Φ( W^T s^{k+1} ; τ/L )    (wavelet-domain denoising)
    s^{k+1} ← W v^{k+1}    (inverse wavelet transform)
    k ← k + 1
  until stopping criterion
  return s^k

The remarkable aspect is that this simple sequence of Landweber updates and
wavelet-domain thresholding operations converges to the solution of (11.10). The only
subtle point is that the strength of the thresholding (τ/L) is tied to the step size of the
gradient update.
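A literal transcription of Algorithm 1 as a Python sketch (ours): W and Wt stand for any orthonormal synthesis/analysis pair supplied as callables, and the soft-threshold shown is the proximal map associated with Φ(v) = ‖v‖1.

```python
import numpy as np

def soft(v, t):
    """Soft-threshold, the proximal operator of t*|.|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0)

def ista(A, a, W, Wt, tau, L, n_iter=500, prox=soft):
    """Algorithm 1, with A = H^T H and a = H^T y precomputed."""
    s = np.zeros_like(a)
    for _ in range(n_iter):
        s = s + (a - A @ s) / L       # Landweber step
        v = prox(Wt(s), tau / L)      # wavelet-domain denoising
        s = W(v)                      # inverse wavelet transform
    return s

# identity "wavelets" turn this into plain proximal-gradient iteration:
# s_hat = ista(H.T @ H, H.T @ y, lambda v: v, lambda s: s, tau, L)
```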

11.2.3 Fast iterative shrinkage/thresholding algorithm


While ISTA converges to a (possibly local) minimum, it may do so rather slowly since
the amount of error reduction at each step is dictated by the Landweber update. The
latter, which is a basic first-order technique, is known to be quite inefficient when the
system matrix is poorly conditioned.
When Φ is convex, it is possible to characterize the convergence behavior of ISTA. Specifically, Beck and Teboulle [BT09b, Theorem 3.1] have shown that, for any k > 1,

C(v^k_ISTA) − C(v*) ≤ ( L / 2k ) ‖v^0 − v*‖₂²,

which indicates that the gap in the cost function decreases in proportion to 1/k with the iteration number k.

In the same paper, these authors have proposed a refinement of the scheme, called
the fast iterative shrinkage/thresholding algorithm (FISTA), which improves the rate of
convergence by one order. This is achieved via a controlled over-relaxation that utilizes
the previous iterates to produce a better guess for the next update. A possible imple-
mentation of FISTA is shown in Algorithm 2.

Algorithm 2: FISTA, which solves s* = arg min_s { (1/2)‖y − Hs‖₂² + τ Φ(W^T s) }

  input: A = H^T H, a = H^T y, s^0, τ, and L
  set: k ← 0, w^0 ← W^T s^0, t^0 ← 0
  repeat
    w^{k+1} ← prox_Φ( W^T( s^k + (1/L)(a − As^k) ) ; τ/L )    (ISTA step)
    t^{k+1} ← ( 1 + sqrt(1 + 4 (t^k)²) ) / 2
    v^{k+1} ← w^{k+1} + ((t^k − 1)/t^{k+1}) ( w^{k+1} − w^k )
    s^{k+1} ← W v^{k+1}
    k ← k + 1
  until stopping criterion
  return s^k

The only difference from ISTA is the update of vk+1 , which is an extrapolation of the
two previous ISTA computations wk+1 and wk . The variable tk controls the strength of
the over-relaxation, which increases with k up to some asymptotic limit.
The theoretical justification for FISTA (see [BT09b, Theorem 4.4]) is that the scheme improves the convergence such that, for any k > 1,

C(v^k_FISTA) − C(v*) ≤ ( 2L / (k + 1)² ) ‖v^0 − v*‖₂².
Practically, switching from a linear to a quadratic convergence rate can translate to
a spectacular speed improvement over ISTA, with the advantage that this change of
regime essentially comes for free. FISTA therefore constitutes the method of choice for
wavelet-based regularization; it typically delivers state-of-the-art performance for the
kind of large-scale optimization problems encountered in imaging.
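The corresponding change to the iteration loop is tiny; a sketch with the same conventions as the ISTA sketch above, reusing its soft function (we initialize the relaxation variable at t = 1, the common convention, so that the first step is a plain ISTA update):

```python
import numpy as np

def fista(A, a, W, Wt, tau, L, n_iter=500, prox=soft):
    """Algorithm 2: ISTA plus over-relaxation of the wavelet iterates."""
    s = np.zeros_like(a)
    w_old = Wt(s)
    t = 1.0
    for _ in range(n_iter):
        w = prox(Wt(s + (a - A @ s) / L), tau / L)   # ISTA step
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        v = w + ((t - 1) / t_new) * (w - w_old)      # extrapolation
        s, w_old, t = W(v), w, t_new
    return s
```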

11.2.4 Discussion of wavelet-based image reconstruction


Iterative shrinkage/thresholding algorithms can be applied to the reconstruction of
images for a whole variety of biomedical imaging modalities in the same way as
we saw in Section 10.3. For illustration purposes, we have applied ISTA and FISTA
to the deconvolution of the fluorescence micrographs of Section 10.3.2. In order to
mimic the regularizing effect of the gradient operator, we have selected 2-D Haar
wavelets, which act qualitatively as (smoothed) first-order derivatives. We have also

Figure 11.1 Comparison of the convergence properties of ISTA (light) and FISTA (dark) for the image in Figure 11.2(b): cost functional (logarithmic scale) as a function of the iteration index (0 to 500).

used the same type of potential functions: ΦGauss(x) = Ai|x|², ΦLaplace(x) = Bi|x|, and ΦStudent(x) = Ci log(x² + ε), where Ai, Bi, and Ci are some proper scale-dependent constants. As in the previous experiments, the overall regularization strength τ was tuned for best performance (maximum SNR with respect to the reference). Here, we are presenting the results for the image of nerve cells (see Figure 10.3b) with the use of ℓ1 wavelet-domain regularization.
The plot in Figure 11.1 documents the evolution of the cost functional (11.11) as a
function of the iteration index for both ISTA and FISTA. It illustrates the faster conver-
gence rate of FISTA, in agreement with Beck and Teboulle’s prediction. As far as qual-
ity is concerned, a general observation is that the output of the basic version of the
wavelet-based reconstruction algorithm is not on a par with the results of Section 10.3.
The main problem (see Figure 11.2f) is that the reconstructed images suffer from arti-
facts (in the form of wavelet footprints) that are typically the consequence of the lack
of shift-invariance of the wavelet representation. Fortunately, there is a simple remedy
to correct for this effect via a mechanism called cycle spinning. 2 The approach is to
randomly shift the signal back and forth during the course of iterations, which is equi-
valent to cycling through a family of shifted wavelet transforms, as will be described in
Section 11.4.2. Incorporating cycle spinning in ISTA does not increase the compu-
tational cost but improves the SNR of the reconstruction significantly, as shown in
Figure 11.2e. Hence, we end up with a result that is comparable in quality to the output
of the MAP reconstruction algorithm of Section 10.2 (see Figure 11.2c). This trend per-
sists with other images and across imaging modalities. Combining cycle spinning with
FISTA is feasible as well, with the advantage that the convergence rate of the latter is
typically superior to that of the ADMM technique.
While averaging across shifts appears to be essential for making wavelets competi-
tive, we are left with the conceptual problem that the cycle-spun version of ISTA does
not rigorously fit our statistical formulation. It converges to a solution that is not a strict
2 Cycle spinning is used almost systematically for the wavelet-based reconstructions showcased in the liter-
ature. However, the method is rarely accounted for in the accompanying theory.
[Figure 11.2: six image panels (a)–(f) comparing Ref, TV, ISTA, and ISTA-CS results.]
Figure 11.2 Results of deconvolution experiment: (a) Blurry and noisy input of the deconvolution
algorithm (BSNR = 20 dB). (b) Ground-truth image (nerve cells). (c) Result of MAP
deconvolution with TV regularization (SNR = 15.23 dB). (d) Result of wavelet-based
deconvolution (SNR = 12.73 dB). (e) Result of wavelet-based deconvolution with cycle spinning
(SNR = 15.18 dB). (f) Zoomed comparison of results for the region marked in (b).

minimizer of (11.10) but, rather, to some kind of average over a family of “shifted”
wavelet transforms. While this description is largely empirical, there is a theoretical
explanation of the phenomenon for the simpler signal-denoising problem. Specifically,
in Section 11.4, we shall demonstrate that cycle spinning necessarily improves denois-
ing performance (see Proposition 11.3) and that it can be seen as an alternative means
of computing the “exact” MAP estimators of Section 10.4.3. In other words, cycle spin-
ning somehow compensates for the inter-scale dependencies of wavelet coefficients that
were neglected when writing (11.5).
The most favorable aspect of wavelet-domain processing is that it offers direct control
over the reconstruction error, thanks to Parseval’s relation. In particular, it allows for
a more refined design of thresholding functions based on the minimum-mean-square-
error (MMSE) principle. This is the reason why we shall now investigate non-iterative
strategies for improving simple wavelet-domain denoising.

11.3 Study of wavelet-domain shrinkage estimators

In the remainder of the chapter, we concentrate on the problem of signal denoising with
H = I (identity) or, equivalently, $H_{\mathrm{wav}} = W$, under the assumption that the transform
matrix W is orthonormal. The latter ensures that any reduction of the quadratic error
achieved in the wavelet domain is automatically transferred to the signal domain.
In this particular setting, we can address the important issue of the dependency be-
tween the wavelet-domain thresholding functions and the prior probability model. Our
practical motivation is to improve the standard algorithm by identifying the solution that
minimizes the mean-square estimation error. To specify the underlying scalar estimation
problem, we transpose the measurement equation y = s + n into the wavelet domain as

$$z = W^T s + W^T n = v + n \quad\Longleftrightarrow\quad z_i[k] = v_i[k] + n_i[k], \tag{11.18}$$

where $v_i$ and $n_i$ are the wavelet coefficients of the noise-free signal s and of the AWGN
n, respectively. Since the wavelet transform is orthonormal, the transformed noise $W^T n$
remains white, so that $n_i$ is Gaussian i.i.d. with variance $\sigma^2$. Now, when the wave-
let coefficients vi are statistically independent as has been assumed so far, the denoising
can be performed in a separable fashion by considering the wavelet coefficients indivi-
dually. The estimation problem is then to recover v from the noisy coefficient z = v + n,
where we have dropped the wavelet indices to simplify the notation. Irrespective of the
statistical criterion used (MAP vs. MMSE), the estimator ṽ(z) will be a function of the
(scalar) noisy input z, in agreement with the standard wavelet-denoising procedure.
Next, we develop the theory associated with the statistical wavelet-based estimators.
The prior information is provided by the wavelet-domain pdfs pVi , which are known
to be infinitely divisible (see Proposition 8.6). We then make use of those results to
characterize and compare the shrinkage/thresholding functions associated with the id
distributions of Table 4.1.

11.3.1 Pointwise MAP estimators for AWGN


Our baseline is the MAP solution to the denoising problem given by (11.14). For later
reference, we give the scalar formulation of this estimator
$$v_{\mathrm{MAP}}(z) = \arg\min_{v\in\mathbb{R}} \left\{\frac{1}{2}(z - v)^2 + \sigma^2\,\Phi_{V_i}(v)\right\} = \operatorname{prox}_{\Phi_{V_i}}\!\left(z;\,\sigma^2\right), \tag{11.19}$$

which involves a scale-specific proximity operator of the type investigated in
Section 10.2.3. Explicit formulas and graphs of $v_{\mathrm{MAP}}(z)$ for the primary types of probability
models/sparsity patterns are presented in Section 11.3.3.

11.3.2 Pointwise MMSE estimators for AWGN


From a mean-square-error point of view, performing denoising in the wavelet domain is
equivalent to signal-domain processing since the 2 -error is preserved. This makes the
use of MMSE shrinkage functions highly relevant, even when the wavelet coefficients
302 Wavelet-domain methods

are only approximately independent. The MMSE estimator of $v_i$, given the noisy
coefficient z, is provided by the posterior mean
$$v_{\mathrm{MMSE}}(z) = \mathbb{E}\{V\,|\,Z = z\} = \int_{\mathbb{R}} v\, p_{V|Z}(v|z)\,\mathrm{d}v, \tag{11.20}$$
where $p_{V|Z}(v|z) = \frac{p_{Z|V}(z|v)\, p_{V_i}(v)}{p_Z(z)}$ by Bayes' rule. In the present context of AWGN, we
have that pZ|V (z|v) = gσ (z − v) and pZ = gσ ∗ pVi , where gσ is a centered Gaussian
distribution with standard deviation σ . Moreover, we can bypass the integration step in
(11.20) by taking advantage of the Miyasawa–Stein formula for the posterior mean of a
random variable corrupted by Gaussian noise [Miy61, Ste81], which states that

$$v_{\mathrm{MMSE}}(z) = z - \sigma^2\,\Phi_Z'(z), \tag{11.21}$$
where $\Phi_Z'(z) = -\frac{\mathrm{d}}{\mathrm{d}z}\log p_Z(z) = -\frac{p_Z'(z)}{p_Z(z)}$. This classical formula, which capitalizes on
special properties of the Gaussian distribution, is established as follows:

$$\begin{aligned}
\sigma^2 p_Z'(z) &= \sigma^2\,(g_\sigma' * p_{V_i})(z)\\
&= \int_{\mathbb{R}} -(z - v)\, g_\sigma(z - v)\, p_{V_i}(v)\,\mathrm{d}v\\
&= -z\int_{\mathbb{R}} g_\sigma(z - v)\, p_{V_i}(v)\,\mathrm{d}v + p_Z(z)\int_{\mathbb{R}} v\,\frac{g_\sigma(z - v)\, p_{V_i}(v)}{p_Z(z)}\,\mathrm{d}v\\
&= -z\, p_Z(z) + p_Z(z)\, v_{\mathrm{MMSE}}(z).
\end{aligned}$$

This means that we can derive the explicit form of vMMSE (z) for any given pVi via the
evaluation of the Gaussian convolution integrals
$$p_Z(z) = (g_\sigma * p_{V_i})(z) = \mathcal{F}^{-1}\left\{\mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\,\hat{p}_{V_i}(\omega)\right\}(z) \tag{11.22}$$
$$p_Z'(z) = (g_\sigma' * p_{V_i})(z) = \mathcal{F}^{-1}\left\{\mathrm{j}\omega\,\mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\,\hat{p}_{V_i}(\omega)\right\}(z). \tag{11.23}$$

These can be calculated in either the time or frequency domain. The frequency-domain
formulation offers more convenience for the majority of id distributions and is also
directly amenable to numerical computation with the help of the FFT. Likewise, we use
(11.21) to infer the general asymptotic behavior of this estimator.
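As an illustration of this recipe, the following sketch (ours; the helper name mmse_shrinkage is hypothetical) evaluates (11.21) numerically from the characteristic function $\hat{p}_{V_i}$ of a symmetric prior. For a symmetric pdf, $\hat{p}_{V_i}$ is real and even, so the inverse transforms (11.22)–(11.23) reduce to cosine and sine integrals that can be discretized directly:

import numpy as np

def mmse_shrinkage(z, p_hat, sigma, w_max=60.0, n=20001):
    # v(z) = z - sigma^2 * Phi_Z'(z), Eq. (11.21), evaluated from the
    # characteristic function p_hat of a *symmetric* prior via (11.22)-(11.23).
    z = np.atleast_1d(np.asarray(z, dtype=float))
    w = np.linspace(0.0, w_max, n)                    # p_hat real/even: w >= 0 suffices
    spec = np.exp(-0.5 * (sigma * w)**2) * p_hat(w)   # Gaussian-windowed spectrum
    # p_Z(z)  =  (1/pi) * integral over w >= 0 of spec(w) cos(wz)
    # p_Z'(z) = -(1/pi) * integral over w >= 0 of w spec(w) sin(wz)
    pZ = np.trapz(spec * np.cos(np.outer(z, w)), w, axis=1) / np.pi
    dpZ = -np.trapz(w * spec * np.sin(np.outer(z, w)), w, axis=1) / np.pi
    return z + sigma**2 * dpZ / pZ                    # z - sigma^2 * (-p_Z'/p_Z)

# Example: hyperbolic-secant prior with sigma_0 = pi/4 (the setting of Figure 11.3c)
v = mmse_shrinkage(np.linspace(-4, 4, 9), lambda w: 1.0 / np.cosh(w * np.pi / 4), sigma=1.0)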

THEOREM 11.1 Let z = v + n, where v is infinitely divisible with symmetric pdf $p_V$ and
n is Gaussian-distributed with variance $\sigma^2$. Then, the MMSE estimator of v given z has
the linear behavior around the origin given by
$$v_{\mathrm{MMSE}}(z) = z\left(1 - \sigma^2\,\Phi_Z''(0)\right) + O(z^3), \tag{11.24}$$
where
$$\Phi_Z''(0) = \frac{\int_{\mathbb{R}} \omega^2\, \mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\,\hat{p}_{V_i}(\omega)\,\mathrm{d}\omega}{\int_{\mathbb{R}} \mathrm{e}^{-\frac{\omega^2\sigma^2}{2}}\,\hat{p}_{V_i}(\omega)\,\mathrm{d}\omega} > 0. \tag{11.25}$$
If, in addition, $p_{V_i}$ is unimodal and does not decay faster than an exponential, then
$$v_{\mathrm{MMSE}}(z) \sim v_{\mathrm{MAP}}(z) \sim z - \sigma^2 b_1 \quad\text{as } z\to\infty,$$
where $b_1 = \lim_{x\to\infty}\Phi_Z'(x) = \lim_{x\to\infty}\Phi_{V}'(x) \ge 0$.
Proof Since the Gaussian kernel $g_\sigma$ is infinitely differentiable, the same holds true
for $p_Z = p_{V_i} * g_\sigma$ even if $p_{V_i}$ is not necessarily smooth to start with (e.g., it is a
compound-Poisson or Laplace distribution). This implies that the second-order Taylor
series $\Phi_Z(z) = -\log(p_Z(z)) = \Phi_Z(0) + \frac{1}{2}\Phi_Z''(0)z^2 + O(z^4)$ is well defined, which yields
(11.24). The expression for $\Phi_Z''(0)$ follows from (10.20) with $\hat{p}_Z(\omega) = \mathrm{e}^{-\omega^2\sigma^2/2}\,\hat{p}_{V_i}(\omega)$.
We also note that the Fourier-domain moments that appear in (11.25) are positive and
finite because $\hat{p}_{V_i}(\omega) = \mathrm{e}^{f_{\tilde{\phi}_i}(\omega)} \ge 0$ is tempered by the Gaussian window. Next, we recall
that the Gaussian is part of the family of strongly unimodal functions, which have the
remarkable property of preserving the unimodality of the functions they are convolved
with [Sat94, pp. 394–399]. The second part then follows from the fact that the convolution
with $g_\sigma$, which decays much faster than $p_{V_i}$, does not modify the decay at the tail
of the distribution.
Several remarks are in order. First, the linear approximation (11.24) is exact in the
Gaussian case. It actually yields the classical linear (LMMSE) estimator
$$v_{\mathrm{LMMSE}}(z) = \frac{\sigma_i^2}{\sigma_i^2 + \sigma^2}\, z,$$
where $\sigma_i^2$ is the variance of the signal contribution in the ith wavelet channel. Indeed,
when $p_{V_i}$ is a Gaussian distribution, we have that $\Phi_Z(z) = \frac{z^2}{2(\sigma_i^2 + \sigma^2)}$, which, upon
substitution in (11.21), yields the $v_{\mathrm{LMMSE}}$ estimator.
Second, by applying Parseval's relation, we can express the slope of the MMSE
estimator at the origin as the ratio of time-domain integrals
$$1 - \sigma^2\,\Phi_Z''(0) = 1 - \sigma^2\,\frac{\int_{\mathbb{R}} \frac{\sigma^2 - x^2}{\sigma^4}\, \mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x}{\int_{\mathbb{R}} \mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x} = \frac{\int_{\mathbb{R}} x^2\, \mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x}{\sigma^2\int_{\mathbb{R}} \mathrm{e}^{-\frac{x^2}{2\sigma^2}}\, p_{V_i}(x)\,\mathrm{d}x}, \tag{11.26}$$
which may be simpler to evaluate for some id distributions.
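As a sanity check of this Parseval step, the two slope expressions can be compared numerically. The sketch below (ours) does so for a Laplace prior with λ = 2 and σ = 1, evaluating the frequency-domain ratio built from (11.25) against the time-domain ratio (11.26):

import numpy as np

lam, sigma = 2.0, 1.0

# Frequency-domain slope 1 - sigma^2 * Phi_Z''(0), from (11.25)
w = np.linspace(-60.0, 60.0, 200001)
p_hat = lam**2 / (lam**2 + w**2)                 # Laplace characteristic function
gw = np.exp(-0.5 * (sigma * w)**2)
slope_freq = 1.0 - sigma**2 * np.trapz(w**2 * gw * p_hat, w) / np.trapz(gw * p_hat, w)

# Time-domain slope, right-hand side of (11.26)
x = np.linspace(-40.0, 40.0, 200001)
p = 0.5 * lam * np.exp(-lam * np.abs(x))         # Laplace pdf
gx = np.exp(-x**2 / (2.0 * sigma**2))
slope_time = np.trapz(x**2 * gx * p, x) / (sigma**2 * np.trapz(gx * p, x))

print(slope_freq, slope_time)                    # the two values coincide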

11.3.3 Comparison of shrinkage functions: MAP vs. MMSE


In order to gain practical insights and to make the connection with existing methods,
we now investigate solutions that are tied to specific id distributions. We consider the
prior models listed in Table 4.1, which cover a broad range of sparsity behaviors. The
common feature is that these pdfs are symmetric and unimodal with tails fatter than a
Gaussian. Their practical relevance is that they may be used to fit the wavelet-domain
304 Wavelet-domain methods

statistics of real-world signals or to derive corresponding families of parametric
algorithms. Unless stated otherwise, the graphs that follow display series of comparable
estimators with a normalized signal input ($\mathrm{SNR}_0 = 1$).

Laplace distribution
The Laplace distribution with parameter λ is defined as

$$p_{\mathrm{Laplace}}(x;\lambda) = \frac{\lambda}{2}\,\mathrm{e}^{-\lambda|x|}.$$
Its variance is given by $\sigma_0^2 = \frac{2}{\lambda^2}$. The Lévy exponent is $f_{\mathrm{Laplace}}(\omega;\lambda) = \log\hat{p}_{\mathrm{Laplace}}(\omega;\lambda) = \log\left(\frac{\lambda^2}{\lambda^2 + \omega^2}\right)$, which is p-admissible with p = 2. The Laplacian potential is
$$\Phi_{\mathrm{Laplace}}(x;\lambda) = \lambda|x| - \log(\lambda/2).$$

Since the second term of $\Phi_{\mathrm{Laplace}}$ does not depend on x, this translates into a MAP
estimator that minimizes the $\ell_1$-norm in the corresponding wavelet channel. It is well
known that the solution of this optimization problem yields the soft-threshold estimator
(see [Tib96, CDLL98, ML99])
$$v_{\mathrm{MAP}}(z;\lambda) = \begin{cases} z - \lambda, & z > \lambda\\ 0, & z \in [-\lambda,\lambda]\\ z + \lambda, & z < -\lambda. \end{cases}$$

By applying the time-domain versions of (11.22) and (11.23), one can also derive the
analytical form of the corresponding MMSE estimator in AWGN. For reference purposes,
we give its normalized version with $\sigma^2 = 1$ as
$$v_{\mathrm{MMSE}}(z;\lambda) = z - \lambda\,\frac{\operatorname{erf}\!\left(\frac{z-\lambda}{\sqrt{2}}\right) - \mathrm{e}^{2\lambda z}\operatorname{erfc}\!\left(\frac{\lambda+z}{\sqrt{2}}\right) + 1}{\operatorname{erf}\!\left(\frac{z-\lambda}{\sqrt{2}}\right) + \mathrm{e}^{2\lambda z}\operatorname{erfc}\!\left(\frac{\lambda+z}{\sqrt{2}}\right) + 1},$$
where $\operatorname{erfc}(t) = 1 - \operatorname{erf}(t)$ denotes the complementary (Gaussian) error function; this
result can be traced back to [HY00, Proposition 1].
for the Laplace distribution with λ = 2 and unit noise variance is given in Figure 11.3b.
While the graph of the MMSE estimator has a smoother appearance than that of the
soft-thresholding function, it also exhibits two distinct regimes that are well repre-
sented by first-order polynomials: behavior around the origin vs. behavior at ±∞. How-
ever, the transition between the two regimes is much more progressive in the MMSE
case. Asymptotically, the MAP and MMSE estimators are equivalent, as predicted by
Theorem 11.1. The key difference occurs around the origin, where the MMSE estimator
is linear (in accordance with Theorem 11.1) and quite distinct from a thresholding func-
tion. This means that the MMSE estimator will never annihilate a wavelet coefficient,
which somewhat contradicts the predominant paradigm for recovering sparse signals.
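Both Laplace estimators are simple enough to code directly. The sketch below (ours) implements the soft-threshold MAP rule and the closed-form MMSE expression for unit noise variance; it can be used to reproduce the curves of Figure 11.3b with λ = 2:

import numpy as np
from scipy.special import erf, erfc

def v_map_laplace(z, lam):
    # Soft-threshold: Laplace MAP estimator for sigma = 1
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def v_mmse_laplace(z, lam):
    # Closed-form Laplace MMSE estimator for sigma = 1 (cf. [HY00]);
    # note that exp(2*lam*z) may overflow for very large lam*z.
    a = erf((z - lam) / np.sqrt(2.0))
    b = np.exp(2.0 * lam * z) * erfc((lam + z) / np.sqrt(2.0))
    return z - lam * (a - b + 1.0) / (a + b + 1.0)

z = np.linspace(-4.0, 4.0, 401)
v_map, v_mmse = v_map_laplace(z, 2.0), v_mmse_laplace(z, 2.0)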
[Figure 11.3: three panels of potential/estimator curves over the range −4 to 4.]
Figure 11.3 Comparison of potential functions $\Phi(z)$ and pointwise estimators v(z) for signals
with matched Laplace and sech distributions corrupted by AWGN with σ = 1. (a) Laplace
(dashed) and sech (solid) potentials. (b) Laplace MAP estimator (light), MMSE estimator (dark),
and its first-order equivalent (dot-dashed line) for λ = 2. (c) Sech MAP (light) and MMSE (dark)
estimators for $\sigma_0 = \pi/4$.

Hyperbolic secant distribution
The hyperbolic secant (reciprocal of the hyperbolic cosine) is a classical example of id
distribution [Fel71]. It seems to us as interesting a candidate for regularization as the
Laplace distribution. Its generic version with standard deviation $\sigma_0$ is given by
$$p_{\mathrm{sech}}(x;\sigma_0) = \frac{\operatorname{sech}\!\left(\frac{\pi x}{2\sigma_0}\right)}{2\sigma_0} = \frac{1}{\sigma_0\left(\mathrm{e}^{\frac{\pi x}{2\sigma_0}} + \mathrm{e}^{-\frac{\pi x}{2\sigma_0}}\right)}.$$
Remarkably, its characteristic function is part of the same class of distributions, with
$$\hat{p}_{\mathrm{sech}}(\omega;\sigma_0) = \operatorname{sech}(\omega\sigma_0).$$
The hyperbolic secant potential function is
$$\Phi_{\mathrm{sech}}(x;\sigma_0) = -\log p_{\mathrm{sech}}(x;\sigma_0) = \log\left(\mathrm{e}^{\frac{\pi x}{2\sigma_0}} + \mathrm{e}^{-\frac{\pi x}{2\sigma_0}}\right) + \log\sigma_0,$$
which is convex and increasing for x ≥ 0. Indeed, the second derivative of the potential
function is
$$\Phi_{\mathrm{sech}}''(x) = \frac{\pi^2}{4\sigma_0^2}\operatorname{sech}^2\!\left(\frac{\pi x}{2\sigma_0}\right),$$
which is positive. Note that, for large absolute values of x, $\Phi_{\mathrm{sech}}(x) \sim \frac{\pi}{2\sigma_0}|x| + \log\sigma_0$,
suggesting that it is essentially equivalent to the $\ell_1$-type Laplace potential (see
Figure 11.3a). However, unlike the latter, it is infinitely differentiable everywhere,
with a quadratic behavior around the origin.
The corresponding MAP and MMSE estimators, with a parameter value that is
matched to the Laplace example, are shown in Figure 11.3c. An interesting observation
is that the sech MAP thresholding functions are very similar to the MMSE Laplacian
ones over the whole range of values. This would suggest using hyperbolic-secant-
penalized least-squares regression as a practical substitute for the MMSE Laplace
solution.

Symmetric Student family
We define the symmetric Student distribution with standard deviation $\sigma_0$ and algebraic
decay parameter r > 1 as
$$p_{\mathrm{Student}}(x; r, \sigma_0) = A_{r,\sigma_0}\left(\frac{1}{C_{r,\sigma_0} + x^2}\right)^{r+\frac{1}{2}} \tag{11.27}$$
with $C_{r,\sigma_0} = \sigma_0^2(2r - 2) > 0$ and normalizing constant $A_{r,\sigma_0} = \frac{(C_{r,\sigma_0})^r}{B(r,\frac{1}{2})}$, where $B(r,\frac{1}{2})$
is the beta function (see Appendix C.2). Despite the widespread use of this distribution
in statistics, it took until the late 1970s to establish its infinite divisibility [SVH03]. The
interest for signal processing is that the Student model offers fine control of the behavior
of the tail, which conditions the level of sparsity of signals. The Student potential is
logarithmic:
$$\Phi_{\mathrm{Student}}(x; r, \sigma_0) = a_0 + \left(r + \frac{1}{2}\right)\log\left(C_{r,\sigma_0} + x^2\right) \tag{11.28}$$
with $a_0 = -\log A_{r,\sigma_0}$.

[Figure 11.4: two panels of MMSE estimator curves.]
Figure 11.4 Examples of pointwise MMSE estimators $v_{\mathrm{MMSE}}(z)$ for signals corrupted by
AWGN with σ = 1 and fixed SNR = 1. (a) Student priors with r = 2, 4, 8, 16, 32, +∞ (dark to
light). (b) Compound-Poisson priors with $\lambda_i$ = 1/8, 1/4, 1/2, 1, 2, 4, ∞ (dark to light) and
Gaussian amplitude distribution.
The Student MAP estimator is specified by a third-order polynomial equation that can
be solved explicitly. This results in the thresholding functions shown in Figure 10.1b.
We have also observed experimentally that the Student MAP and MMSE estimators are
rather close to each other, with linear trends around the origin that become
indistinguishable as r increases; this can be verified by comparing Figure 10.1b and Figure 11.4a.
This finding is also consistent with the distributions becoming more Gaussian-like for
larger r.
Note that Definition (11.27) remains valid in the super-sparse regimes with r ∈ (0, 1],
provided that the normalization constant C > 0 is no longer tied to r and $\sigma_0$. The catch
is that the variance of the signal is unbounded for r ≤ 1, which tends to flatten the
shrinkage function around the origin, but maintains continuity since $\Phi_{\mathrm{Student}}$ is infinitely
differentiable.

Compound-Poisson family
We have already mentioned that the Poisson case results in pdfs that exhibit a Dirac
distribution at the origin and are therefore unsuitable for MAP estimation. A compound-
Poisson variable is typically generated by integration of a random sequence of Dirac
impulses with some amplitude distribution pA and a density parameter λ corresponding
to the average number of impulses within the integration window. The generic form of
a compound-Poisson pdf is given by (4.9). It can be written as pPoisson (x) = e−λ δ(x) +
(1 − e−λ )pA,λ (x), where the pdf pA,λ describes the distribution of the non-zero values.
The determination of the MMSE estimator from (11.21) requires the computation
of $\Phi_Z'(z) = -p_Z'(z)/p_Z(z)$. The most convenient approach is to evaluate the required
factors using the right-hand expressions in (11.22) and (11.23), where $\hat{p}_{V_i}$ is specified
by its Poisson parameters as in Table 4.1. This leads to
$$\hat{p}_{V_i}(\omega) = \exp\left(\lambda_i\left(\hat{p}_{A_i}(\omega) - 1\right)\right),$$

where $\lambda_i \in \mathbb{R}^+$ and $\hat{p}_{A_i}: \mathbb{R}\to\mathbb{C}$ are the Poisson rate and the characteristic function
of the Poisson amplitude distribution at resolution i, respectively. Moreover, due to the
multiscale structure of the analysis, the wavelet-domain Poisson parameters are related
to each other by

$$\lambda_i = \lambda_0\, 2^{id}, \qquad \hat{p}_{A_i}(\omega) = \hat{p}_{A_0}\!\left(2^{i(\gamma - d/2)}\,\omega\right),$$

which follows directly from (11.7). The first formula highlights the fact that the sparseness
of the wavelet distributions, as measured by the proportion $\mathrm{e}^{-\lambda_i}$ of zero
coefficients, decreases substantially as the scale gets coarser. Also note that the strength of this
effect increases with the number of dimensions.
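In the special case of a standard-Gaussian amplitude distribution, $\hat{p}_{A}(\omega) = \mathrm{e}^{-\omega^2/2}$, so this prior can be fed directly to the numerical routine sketched after (11.23); the snippet below (ours) reuses the hypothetical mmse_shrinkage helper defined there:

import numpy as np

# Compound-Poisson prior with rate lam and standard-Gaussian amplitudes
lam = 0.5
p_hat = lambda w: np.exp(lam * (np.exp(-0.5 * w**2) - 1.0))
v = mmse_shrinkage(np.linspace(-10.0, 10.0, 21), p_hat, sigma=1.0)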
Some examples of MMSE thresholding functions corresponding to a sequence
of compound-Poisson signals with Gaussian amplitude distributions are shown in
Figure 11.4b. Not too surprisingly, the smaller λ (dark curve), the stronger the thre-
sholding behavior at the origin. In that experiment, we have considered a wavelet-like
progression of the rate parameter λ, while keeping the signal-to-noise ratio constant to
facilitate the comparison. For larger values of λ (light), the estimator converges to the
LMMSE solution (thin black line), which is consistent with the fact that the distribution
becomes more and more Gaussian-like.

Evolution of the estimators across wavelet scales
The increase of $\lambda_i$ in Figure 11.4b is consistent with the one predicted for a wavelet-domain
analysis. Nonetheless, this graph only accounts for part of the story because
we enforced a constant signal-to-noise ratio. In a realistic wavelet-domain scenario,
another effect that predominates as the scale gets coarser must also be accounted for:
the amplification of the quadratic signal-to-noise ratio that follows from (9.23) and that
results in
$$\mathrm{SNR}_i = \frac{\operatorname{Var}(Z_i)}{\sigma^2} = \left(2^{2\gamma}\right)^i\,\mathrm{SNR}_0,$$
where γ is the scaling order of the stochastic process. Consequently, the potential
function Φ is dilated by $b_i = (2^{\gamma - d/2})^i$. The net effect is to make the estimators more
identity-like as i increases, both around the origin and at infinity because of the
corresponding decrease of the magnitude of $\Phi''(0)$ and $\lim_{x\to\infty}\Phi'(x)$, respectively. This

[Figure 11.5: three panels of estimator curves.]
Figure 11.5 Sequence of wavelet-domain estimators $v_i(z)$ for a Laplace-type Lévy process
corrupted by AWGN with σ = 1 and wavelet resolutions i = 0 (dark) to 4 (light). (a) LMMSE
(or Brownian-motion MMSE) estimators. (b) Sym gamma MAP estimators. (c) Sym gamma
MMSE estimators. The reference (fine-scale) parameters are $\sigma_0^2 = 1$ ($\mathrm{SNR}_0 = 1$) and $r_0 = 1$
(Laplace distribution). The scale progression is dyadic.
progressive convergence to the identity map is easiest to describe for a Gaussian signal
where the sequence of estimators is linear – this is illustrated in Figure 11.5a for γ = 1
and d = 1 (Brownian motion) so that $b_1 = 2^{1/2}$.
In the non-Gaussian case, the sequence of wavelet-domain estimators will be part of
some specific family that is completely determined by pV0 , the wavelet pdf at scale 0, as
described in Section 11.1.2. When the variance of the signal is finite, the implication of
the underlying semigroup structure and iterated convolution relations is that the MAP
and MMSE estimators both converge to the Gaussian solution (LMMSE estimator) as
the scale gets coarser (light curves). Thus, the global picture remains very similar to
the Gaussian one, as illustrated in Figure 11.5b,c. Clearly, the most significant non-
linearities can be found at the finer scale (dark curves), where the sparsity effect is
prominent.
The example shown (wavelet analysis of a Lévy process in a Haar basis) was set up
so that the fine-level MAP estimator is a soft-threshold. As discussed next, the wavelet-
domain estimators are all part of the sym gamma family, which is the semigroup exten-
sion of the Laplace distribution. An interesting observation is that the thresholding
behavior fades with a coarsening of the scale. Again, this points to the fact that the
non-Gaussian effects (non-linearities) are the most significant at the finer levels of the
wavelet analysis where the signal-to-noise ratio is also the least favorable.

Symmetric gamma and Meixner distributions
Several authors have proposed representing wavelet statistics using Bessel K-forms
[SLG02, FB05]. The Bessel K-forms are in fact equivalent to the symmetric gamma
(sym gamma) distributions specified by Table 4.1. Here, we would like to emphasize a
further theoretical advantage: the semigroup structure of this family ensures compatibility
across wavelet scales, provided that one properly links the distribution parameters.
The sym gamma distribution with bandwidth λ and order parameter r is best specified
in terms of its characteristic function
$$\hat{p}_{\mathrm{gamma}}(\omega;\lambda,r) = \left(\frac{\lambda^2}{\lambda^2 + \omega^2}\right)^r.$$
The inverse Fourier transform of this expression yields a Bessel function of the second
kind (a.k.a. Bessel K-form), as discussed in Appendix C. The variance of the distribution
is $2r/\lambda^2$. We also note that $p_{\mathrm{gamma}}(x;\lambda,1)$ is equivalent to the Laplace pdf and that
$p_{\mathrm{gamma}}(x;\lambda,r)$ is the r-fold convolution of the former (semigroup property). In the case
of a dyadic wavelet analysis, we invoke (11.7) to show that the evolution of the sym
gamma parameters across scales is given by
$$\lambda_i = \frac{\lambda_0}{\left(2^{\gamma - d/2}\right)^i}, \qquad r_i = r_0\left(2^d\right)^i,$$
where $(\lambda_0, r_0)$ are the parameters at resolution i = 0. These relations underlie the
generation of the graphs in Figure 11.5b,c with γ = 1, d = 1, $\lambda_0 = \sqrt{2}$, and $r_0 = 1$.
[Figure 11.6: four panels of estimator curves; panel labels: Gamma MAP, Meixner MAP, Gamma MMSE, Meixner MMSE.]
Figure 11.6 Comparison of MAP and MMSE estimators v(z) for a series of sym gamma and
Meixner-distributed random variables with r = 1/4, 1, 4, 64, +∞ (dark to light) corrupted by
white Gaussian noise of the same power as the signal (SNR = 1). (a) Sym gamma MAP
estimators. (b) Meixner MAP estimators. (c) Sym gamma MMSE estimators. (d) Meixner
MMSE estimators.

Some further examples of sym gamma MAP and MMSE estimators over a range of
orders are shown in Figure 11.6 under constant signal-to-noise ratio to highlight the
differences in sparsity behavior. We observe that the MAP estimators have a hard-to-
soft-threshold behavior for r < 3/2, which is consistent with the discontinuity of the
potential at the origin. For larger values of r, the trend becomes more linear. By contrast,
the MMSE estimator is much closer to the LMMSE (thin black line) around the origin.
For larger signal values, both estimators result in a more or less progressive transition
between the two extreme lines of the cone (identity and LMMSE) that is controlled by
r – the smaller values of r correspond to the sparser scenarios with vMMSE being closer
to identity.
The Meixner family in Table 4.1 with order r > 0 and scale parameter s0 ∈ R+
provides the same type of extension for the hyperbolic secant distribution with essen-
tially the same functionality. Mathematically, it is closely linked to the gamma function
whose relevant properties are summarized in Appendix C. As shown in Table 10.1,
the Meixner potential has the same asymptotic behavior as the sym gamma potential
at infinity, with the advantage of being much smoother (infinitely differentiable) at the
origin. This implies that the curves of the gamma and Meixner estimators are globally
11.3 Study of wavelet-domain shrinkage estimators 311

quite similar. The main difference is that the Meixner MAP estimator is guaranteed to
be linear around the origin, irrespective of the value of r, and in better agreement with
the MMSE solution than its gamma counterpart.

Cauchy distribution
The prototypical example of a heavy-tail distribution is the symmetric Cauchy
distribution with dispersion parameter $s_0$, which is given by
$$p_{\mathrm{Cauchy}}(x;s_0) = \frac{s_0}{\pi\left(s_0^2 + x^2\right)}. \tag{11.29}$$
It is a special case of a SαS distribution (with α = 1) as well as a symmetric Student
with r = 1/2.
Since the Cauchy distribution is stable, we can invoke Proposition 9.8, which ensures
that the wavelet coefficients of a Cauchy process are Cauchy-distributed too. For
illustration purposes, we consider the analysis of a stable Lévy process (a.k.a. Lévy flight)
in an orthonormal Haar wavelet basis with $\psi = \mathrm{D}^*\phi$, where φ is a triangular smoothing
kernel. The corresponding wavelet-domain Cauchy parameters may be determined from
(9.27) with γ = 1, d = 1, and α = 1, which yields $s_i = s_0(2\sqrt{2})^i$.
While the variance of the Cauchy distribution is unbounded, an analytical
characterization of the corresponding MAP estimator can be obtained by solving a cubic equation.
The MMSE solution is then described by a cumbersome formula that involves
exponentials and the error function erf. In particular, we can evaluate (11.24) to linearize its
behavior around the origin as
$$v_{\mathrm{MMSE}}(z_i; s_i) = z_i\left(1 - \sigma^2\left(1 + s_i^2 - \sqrt{\tfrac{2}{\pi}}\,\frac{s_i\,\mathrm{e}^{-\frac{s_i^2}{2}}}{\operatorname{erfc}\!\left(\frac{s_i}{\sqrt{2}}\right)}\right)\right) + O(z_i^3). \tag{11.30}$$

The corresponding MAP and MMSE shrinkage functions with $s_0 = \frac{1}{4}$ and resolution
levels i = 0, . . . , 4 are shown in Figure 11.7. The difference between the two types of
estimator is striking around the origin and is much more dramatic at finer scales (i = 0

[Figure 11.7: two panels of estimator curves over the range −4 to 4.]
Figure 11.7 Comparison of pointwise wavelet-domain estimators v(z) with
$\{s_i\}_{i=0}^{4} = \{\frac{1}{4}, \frac{1}{\sqrt{2}}, 2, 4\sqrt{2}, 16\}$ (dark to light) for a Cauchy–Lévy process corrupted by AWGN
with σ = 1. (a) Cauchy MAP estimators. (b) Cauchy MMSE estimators.
(dark) and i = 1). As expected, all estimators converge to the identity map for large
input values, due to the slow (algebraic) decay of the Cauchy distribution. We observe
that the effect of processing (deviation from identity) becomes less and less significant
at coarser scales (light curves). This is consistent with the relative increase of the signal
contribution while the power of the noise remains constant across wavelet channels.

11.3.4 Conclusion on simple wavelet-domain shrinkage estimators


The general conclusions that can be drawn from the present statistical analysis are as
follows:
• The thresholding functions should be tuned to the wavelet-domain statistics, which
necessarily involve infinitely divisible distributions. In particular, this excludes the
generalized Gaussian models with 1 < p < 2, which have been invoked in the past to
justify $\ell_p$-minimization algorithms. The specification of wavelet-domain estimators
can be carried out explicitly – at least, numerically – for the primary families of sparse
processes characterized by their Lévy exponent f(ω) and scaling order γ.
• Pointwise MAP and MMSE estimators can differ quite substantially, especially for
small input values. The MAP estimator sometimes acts as a soft-threshold, setting
small wavelet coefficients to zero, while the MMSE solution is always linear around
the origin where it essentially replicates the traditional Wiener solution. On the other
hand, the two estimators are indistinguishable for large input values: they exhibit a
shrinkage behavior with an offset that depends on the decay (exponential vs. alge-
braic) of the canonical id distribution.
• Wavelet-domain shrinkage functions must be adapted to the scale. In particular, the
present statistical formulation does not support the use of a single, universal denoising
function – such as the fixed soft-threshold dictated by a global $\ell_1$-minimization
argument – that could be applied to all coefficients in a non-discriminative manner.
• Two effects come into play as the scale gets coarser. The first is a progressive increase
of the signal-to-noise ratio, which results in a shrinkage function that becomes more
and more identity-like. This justifies the heuristic strategy to leave the coarser-scale
coefficients untouched. The second is a Gaussianization of the wavelet-domain statis-
tics (under the finite-variance hypothesis) due to the summation of a large number of
random components (generalized version of the central-limit theorem.) Concretely,
this means that the estimator ought to progressively switch to a linear regime when
the scale gets coarser, which is not what is currently done in practice.
• The present analysis did not take into account the statistical dependencies of wavelet
coefficients across scales. While these dependencies affect neither the performance
of pointwise estimators nor the present conclusions, their existence clearly suggests
that the basic application of wavelet-domain shrinkage functions is suboptimal. A
possible refinement is to specify higher-order estimators (e.g., bivariate shrinkage
functions.) Such designs could benefit from a tree-like structure where each wavelet
coefficient is statistically linked to its parents. The other alternative is the algorith-
mic solution described next, which constitutes a promising mechanism for turning a
suboptimal solution into an optimal one.
11.4 Improved denoising by consistent cycle spinning

A powerful strategy for improving the performance of the basic wavelet-based denoisers
described in Section 11.3 is through the use of an overcomplete representation. Here,
we formalize the idea of cycle spinning by expanding the signal in a wavelet frame. In
essence, this is equivalent to considering a series of “shifted” orthogonal wavelet trans-
forms in parallel. The denoising task thereby reduces to finding a consensus solution.
We show that this can be done either through simple averaging or by constructing a
solution that is globally consistent by way of an iterative refinement procedure.
To demonstrate the concept and the virtues of an optimized design, we concentrate
on the model-based scenario of Section 10.4. The first important ingredient is the proper
choice of basis functions, which is discussed in Section 11.4.1. Then, in Section 11.4.2,
we switch to a redundant representation (tight wavelet frame) with a demonstration
of its benefits for noise reduction. In Section 11.4.3, we introduce the idea of consistent
cycle spinning, which results in an iterative variant of the basic denoising algorithm. The
impact of each of these refinements, including the use of the MMSE shrinkage functions
of Section 11.3, is evaluated experimentally in Section 11.4.4. The final outcome is
an optimized wavelet-based algorithm that is able to replicate the MMSE results of
Chapter 10.

11.4.1 First-order wavelets: design and implementation


In line with the results of Sections 8.5 and 10.4, we focus on the first-order (or Markov)
processes, which lend themselves to an analytical treatment. The underlying statistical
model is characterized by the first-order whitening operator L = D − α1 Id with α1 ∈ R
and the Lévy exponent f of the innovation. We then apply the design procedure of Sec-
tion 6.5 to determine the operator-like wavelet at resolution level i = 1, which is given
by ψα1 ,1 (t) = L∗ ϕint (t − 1) where L∗ = −D − α1 Id. Here, ϕint is the unique interpolant
in the space of cardinal L∗ L-splines which is calculated as
$$\varphi_{\mathrm{int}}(t) = \frac{\left(\beta_{\alpha_1}^{\vee} * \beta_{\alpha_1}\right)(t)}{\left(\beta_{\alpha_1}^{\vee} * \beta_{\alpha_1}\right)(0)} = \begin{cases} \dfrac{\mathrm{e}^{\alpha_1|t|} - \mathrm{e}^{2\alpha_1 - \alpha_1|t|}}{1 - \mathrm{e}^{2\alpha_1}}, & \text{for } t \in [-1,1] \text{ and } \alpha_1 \ne 0\\[1ex] 1 - |t|, & \text{for } t \in [-1,1] \text{ and } \alpha_1 = 0\\[0.5ex] 0, & \text{otherwise,} \end{cases}$$
where βα1 is the first-order exponential spline defined by (6.21).
Examples of the functions βα1 ∝ βα1 ,0 (B-spline), ϕint = φ (wavelet smoothing
kernel), and ψα1 ,1 (operator-like wavelet) are shown in Figure 11.8. The B-spline βα1 ,1
in Figure 11.8b is an extrapolated version of βα1 ; it generates the coarser-resolution
space V1 = span{βα1 ,1 (·−2k)}k∈Z which is such that V0 = span{βα1 (·−k)}k∈Z = V1 +W1
with W1 = span{ψα1 ,1 (· − 2k)}k∈Z and W1 ⊥ V1 . A key property of the first-order
model is that these basis functions are orthogonal and non-overlapping, as a result of
the construction (B-spline of unit support).
[Figure 11.8: four panels of B-spline and wavelet graphs.]
Figure 11.8 Operator-like wavelets and exponential B-splines for the first-order operator
L = D − $\alpha_1$Id with $\alpha_1$ = 0 (light) and $\alpha_1$ = −1 (dark). (a) Fine-level exponential B-splines
$\beta_{\alpha_1,0}(t)$. (b) Coarse-level exponential B-splines $\beta_{\alpha_1,1}(t)$. (c) Wavelet smoothing kernels
$\varphi_{\mathrm{int}}(t-1)$. (d) Operator-like wavelets $\psi_{\alpha_1,1}(t) = \mathrm{L}^*\varphi_{\mathrm{int}}(t-1)$.

The fundamental ingredient for the implementation of the wavelet transform is that
the scaling function (exponential B-splines) and wavelets at resolution i satisfy the two-
scale relation
$$\begin{bmatrix} \beta_{\alpha_1,i}(t - 2^i k)\\ \psi_{\alpha_1,i}(t - 2^i k) \end{bmatrix} \propto \begin{bmatrix} 1 & \mathrm{e}^{2^{(i-1)}\alpha_1}\\ -\mathrm{e}^{2^{(i-1)}\alpha_1} & 1 \end{bmatrix}\cdot\begin{bmatrix} \beta_{\alpha_1,i-1}(t - 2^i k)\\ \beta_{\alpha_1,i-1}(t - 2^i k - 2^{i-1}) \end{bmatrix}, \tag{11.31}$$

which involves two filters of length 2 (row vectors of the (2 × 2) transition matrix)
since the underlying B-splines and wavelets are non-overlapping for distinct values
of k. In particular, for i = 1 and k = 0, we get that $\beta_{\alpha_1,1}(t) \propto \beta_{\alpha_1}(t) + a_1\beta_{\alpha_1}(t-1)$ and
$\psi_{\alpha_1,1}(t) \propto -a_1\beta_{\alpha_1}(t) + \beta_{\alpha_1}(t-1)$ with $a_1 = \mathrm{e}^{\alpha_1}$. These relations can be visualized in
Figures 11.8b and 11.8d, respectively. Also, for $\alpha_1 = 0$, we recover the Haar system
for which the underlying filters h and g (sum and difference with $\mathrm{e}^{2^i\alpha_1} = 1 = a_1$) do
not depend upon the scale i (see (6.6) and (6.7) in Section 6.1). Finally, we note
that the proportionality factor in (11.31) is set by renormalizing the basis func-
tions on both sides such that their norm is unity, which results in an orthonormal
transform.
The corresponding fast wavelet-transform algorithm is derived by assuming that the
fine-scale expansion of the input signal is $s(t) = \sum_{k\in\mathbb{Z}} s[k]\,\beta_{\alpha_1,0}(t-k)$. To specify the
first iteration of the algorithm, we observe that the support of $\psi_{\alpha_1,1}(\cdot - 2k)$ (resp.,
$\beta_{\alpha_1,1}(\cdot - 2k)$) overlaps with the fine-scale B-splines at locations 2k and 2k + 1 only.
Due to the orthonormality of the underlying basis functions, this results in
$$\begin{bmatrix} s_1[k]\\ v_1[k] \end{bmatrix} = \begin{bmatrix} \langle s, \beta_{\alpha_1,1}(\cdot - 2k)\rangle\\ \langle s, \psi_{\alpha_1,1}(\cdot - 2k)\rangle \end{bmatrix} = \frac{1}{\sqrt{1 + |\mathrm{e}^{\alpha_1}|^2}} \begin{bmatrix} 1 & \mathrm{e}^{\alpha_1}\\ -\mathrm{e}^{\alpha_1} & 1 \end{bmatrix}\cdot\begin{bmatrix} s[2k]\\ s[2k+1] \end{bmatrix}, \tag{11.32}$$
which is consistent with (11.31) and i = 1. The reconstruction algorithm is obtained by
straightforward matrix inversion:
$$\begin{bmatrix} s[2k]\\ s[2k+1] \end{bmatrix} = \frac{1}{\sqrt{1 + |\mathrm{e}^{\alpha_1}|^2}} \begin{bmatrix} 1 & -\mathrm{e}^{\alpha_1}\\ \mathrm{e}^{\alpha_1} & 1 \end{bmatrix}\cdot\begin{bmatrix} s_1[k]\\ v_1[k] \end{bmatrix}. \tag{11.33}$$

The final key observation is that the computation of the first level of wavelet coef-
ficients is analogous to the determination of the discrete increment process u[k] =
s[k] − a1 s[k − 1] (see Section 8.5.1) in the sense that v1 [k] ∝ u[2k + 1] is a subsampled
version of the latter.
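For concreteness, the analysis/synthesis pair (11.32)–(11.33) takes only a few lines of code. The sketch below (ours, not from the text) processes an even-length signal and reduces to the Haar transform for $\alpha_1 = 0$:

import numpy as np

def espline_analysis(s, alpha):
    # One level of the orthonormal exponential-spline transform, Eq. (11.32)
    a = np.exp(alpha)
    c = 1.0 / np.sqrt(1.0 + a**2)
    even, odd = s[0::2], s[1::2]
    s1 = c * (even + a * odd)        # lowpass (approximation) coefficients
    v1 = c * (-a * even + odd)       # highpass (wavelet) coefficients
    return s1, v1

def espline_synthesis(s1, v1, alpha):
    # Inverse transform, Eq. (11.33): transpose of the orthonormal 2x2 block
    a = np.exp(alpha)
    c = 1.0 / np.sqrt(1.0 + a**2)
    s = np.empty(2 * len(s1))
    s[0::2] = c * (s1 - a * v1)
    s[1::2] = c * (a * s1 + v1)
    return s

# Round-trip check with alpha_1 = -1, as in Figure 11.8
s = np.random.randn(16)
assert np.allclose(espline_synthesis(*espline_analysis(s, -1.0), -1.0), s)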

11.4.2 From wavelet bases to tight wavelet frames


From now on, we shall pool the computed wavelet and approximation coefficients
of a signal $s \in \mathbb{R}^N$ in the wavelet vector v and formally represent the
decomposition/reconstruction process of Equations (11.32) and (11.33) by their vector-matrix
counterparts $v = W^T s$ and $s = Wv$, respectively. Moreover, since the choice of the origin of
the signal is arbitrary, we shall consider a series of “m-shifted” versions of the wavelet
transform matrix $W_m = Z^m W$, where Z (resp., $Z^m$) is the unitary matrix that circularly
shifts the samples of the vector to which it is applied by one (resp., by m) to the left.
With this convention, the solution to the denoising problem (11.13) in the orthogonal
basis $W_m$ is given by
$$\tilde{v}_m = \arg\min_{v}\left\{\frac{1}{2}\|v - W_m^T y\|_2^2 + \tau\,\Phi(v)\right\} = \operatorname{prox}_{\Phi}\!\left(W_m^T y;\,\tau\right), \tag{11.34}$$
which amounts to a component-wise shrinkage of the wavelet coefficients. For reference,
we also give the equivalent signal-domain (or analysis) formulation of the algorithm:
$$\tilde{s}_m = \arg\min_{s}\left\{\frac{1}{2}\|s - y\|_2^2 + \tau\,\Phi(W_m^T s)\right\} = W_m \operatorname{prox}_{\Phi}\!\left(W_m^T y;\,\tau\right). \tag{11.35}$$
Next, instead of a single orthogonal wavelet transform, we shall consider a wavelet
frame expansion which is built from the concatenation of M shifted orthonormal
transforms. The corresponding (MN × N) transformation matrix is denoted by
$$A = \begin{bmatrix} W_1^T\\ \vdots\\ W_M^T \end{bmatrix} \tag{11.36}$$
while the augmented wavelet vector is
$$z = As = \begin{bmatrix} v_1\\ \vdots\\ v_M \end{bmatrix},$$
where $v_m = W_m^T s$.

PROPOSITION 11.2 The transformation matrix $A: \mathbb{R}^N \to \mathbb{R}^{MN}$, which is formed
from the concatenation of M orthonormal matrices $W_m$ as in (11.36), defines a tight
frame of $\mathbb{R}^N$ in the sense that
$$\|Ax\|^2 = M\|x\|^2$$
for all $x \in \mathbb{R}^N$. Moreover, its pseudo-inverse $A^\dagger: \mathbb{R}^{MN} \to \mathbb{R}^N$ is given by
$$A^\dagger = \frac{1}{M}\begin{bmatrix} W_1 & \cdots & W_M \end{bmatrix} = \frac{1}{M}A^T,$$
with the property that
$$\arg\min_{x\in\mathbb{R}^N}\left\{\|z - Ax\|^2\right\} = A^\dagger z$$
for all $z \in \mathbb{R}^{MN}$ and $A^\dagger A = I$.

Proof The frame expansion of x is $z = Ax = (v_1, \dots, v_M)$. The energy preservation
then follows from
$$\|z\|_2^2 = \sum_{m=1}^{M}\|v_m\|_2^2 = \sum_{m=1}^{M}\|W_m^T x\|_2^2 = M\,\|x\|_2^2,$$
where the equality on the right-hand side results from the application of Parseval's
identity for each individual basis. Next, we express the quadratic error between an arbitrary
vector $z = (z_1,\dots,z_M) \in \mathbb{R}^{MN}$ and its approximation by Ax as
$$\|z - Ax\|^2 = \sum_{m=1}^{M}\|z_m - W_m^T x\|^2 = \sum_{m=1}^{M}\|W_m z_m - x\|^2. \quad\text{(by Parseval)}$$
This error is minimized by setting its gradient with respect to x to zero; that is,
$$\frac{\partial}{\partial x}\|z - Ax\|^2 = -2\sum_{m=1}^{M}\left(W_m z_m - x\right) = 0,$$
which yields
$$x_{\mathrm{LS}} = \frac{1}{M}\sum_{m=1}^{M} W_m z_m = A^\dagger z.$$
Finally, we check the left-inverse property,
$$A^\dagger A = \frac{1}{M}\sum_{m=1}^{M} W_m W_m^T = I,$$
which follows from the orthonormality of the matrices $W_m$.
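The statements of Proposition 11.2 are easy to verify numerically. The sketch below (ours) builds A from M = 2 circularly shifted single-level Haar transforms and checks the tight-frame and left-inverse identities:

import numpy as np

def haar_matrix(N):
    # Orthonormal single-level Haar analysis matrix W^T (N even)
    WT = np.zeros((N, N))
    c = 1.0 / np.sqrt(2.0)
    for k in range(N // 2):
        WT[k, 2*k], WT[k, 2*k + 1] = c, c                   # lowpass rows
        WT[N//2 + k, 2*k], WT[N//2 + k, 2*k + 1] = -c, c    # highpass rows
    return WT

N, M = 8, 2
WT = haar_matrix(N)
Z = np.roll(np.eye(N), -1, axis=0)        # one-sample circular shift matrix
A = np.vstack([WT @ np.linalg.matrix_power(Z, m) for m in range(M)])  # (MN x N)

x = np.random.randn(N)
assert np.isclose(np.sum((A @ x)**2), M * np.sum(x**2))   # ||Ax||^2 = M ||x||^2
A_pinv = A.T / M                                          # pseudo-inverse (1/M) A^T
assert np.allclose(A_pinv @ A, np.eye(N))                 # left inverse: A† A = I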

In practice, when the wavelet expansion is performed over I resolution levels, the
number of distinct shifted wavelet transforms is at most $M = 2^I$. In direct analogy with
(11.13), the transposition of the wavelet-denoising problem to the context of wavelet
frames is then
$$\tilde{z} = \arg\min_{z}\Bigg\{\frac{1}{2}\Big\|\underbrace{A^\dagger z}_{s} - y\Big\|_2^2 + \frac{\tau}{M}\,\Phi(z)\Bigg\} \tag{11.37}$$
$$= \arg\min_{z}\left\{\frac{1}{2}\|z - Ay\|_2^2 + \tau\,\Phi(z)\right\} \quad\text{(due to the tight-frame property)}$$
$$= \operatorname{prox}_{\Phi}(Ay;\,\tau) = (\tilde{v}_1,\dots,\tilde{v}_M),$$

which simply amounts to performing M basic wavelet-denoising operations in parallel
since the cost function is separable. We refer to (11.37) as the synthesis-with-cycle-spinning
formulation of the wavelet-denoising problem. The corresponding signal
reconstruction is given by
$$\tilde{s} = A^\dagger \tilde{z} = \frac{1}{M}\sum_{m=1}^{M}\tilde{s}_m, \tag{11.38}$$

which is the average of the solutions (11.34) obtained with each individual wavelet
basis.
A remarkable property is that the cycle-spun version of wavelet denoising is guaran-
teed to improve upon the non-redundant version of the algorithm.

PROPOSITION 11.3 Let y = s + n be the samples of a signal s corrupted by zero-mean
i.i.d. noise n and $\tilde{s}_m$ the corresponding signal estimates given by the wavelet-based
denoising algorithm (11.35) with m = 1, . . . , M. Then, under the assumption that the
mean-square errors of the individual wavelet denoisers are equivalent, the averaged
signal estimate (11.38) satisfies
$$\mathbb{E}\{\|\tilde{s} - s\|^2\} \le \mathbb{E}\{\|\tilde{s}_m - s\|^2\}$$
for any m = 1, . . . , M.

Proof The residual noise in the orthonormal wavelet basis $W_m$ is $(\tilde{v}_m - \mathbb{E}\{v_m\})$, where
$\tilde{v}_m = W_m^T \tilde{s}_m$ and $\mathbb{E}\{v_m\} = W_m^T\mathbb{E}\{y\} = W_m^T s$, because of the assumption of zero-mean
noise. This allows us to express the total noise power over the M wavelet bases as
$$\|\tilde{z} - As\|^2 = \sum_{m=1}^{M}\|\tilde{v}_m - \mathbb{E}\{v_m\}\|^2.$$
The favorable aspect of considering a redundant representation is that the inverse frame
operator $A^\dagger$ is an orthogonal projector onto the signal space $\mathbb{R}^N$ with the property that
$$\|A^\dagger w\|^2 \le \frac{1}{M}\|w\|^2$$
for all $w \in \mathbb{R}^{MN}$. This follows from the Pythagorean relation $\|w\|^2 = \|AA^\dagger w\|^2 + \|(I - AA^\dagger)w\|^2$ (projection theorem) and the tight-frame property, which is equivalent
to $\|AA^\dagger w\|^2 = M\|A^\dagger w\|^2$. By applying this result to $w = \tilde{z} - As$, we obtain
$$\|A^\dagger(\tilde{z} - As)\|^2 = \|\tilde{s} - s\|^2 \le \frac{1}{M}\sum_{m=1}^{M}\|\tilde{v}_m - \mathbb{E}\{v_m\}\|^2. \tag{11.39}$$
Next, we take the statistical expectation of (11.39), which yields
$$\mathbb{E}\{\|\tilde{s} - s\|^2\} \le \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\left\{\|\tilde{v}_m - W_m^T s\|^2\right\}. \tag{11.40}$$
The final result then follows from Parseval's relation (norm preservation of individual
wavelet transforms) and the weak stationarity hypothesis (MSE equivalence of shifted-wavelet
denoisers).
Note that this general result does not depend on the type of wavelet-domain proces-
sing – MAP vs. MMSE, or even scalar vs. vectorial – as long as the non-linear mapping
ṽm = f (vm ) is fixed and applied in a consistent fashion. The inequality in Proposition
11.3 also suggests that one can push the denoising performance further by optimizing
the MSE globally in the signal domain, which is not the same as minimizing the error
for each individual wavelet denoiser. The only downside of the redundant synthesis
formulation (11.37) is that the underlying cost function loses its statistical interpretation
(e.g., MAP criterion) because of the inherent coupling that results from considering
multiple series of wavelet coefficients. The fundamental limitation there is that it is
impossible to specify a proper innovation model in an overcomplete system.

11.4.3 Iterative MAP denoising


The alternative way of making use of wavelet frames is the dual analysis formulation
of the denoising problem
$$\tilde{s} = \arg\min_{s}\left\{\frac{1}{2}\|s - y\|_2^2 + \frac{\tau}{M}\,\Phi(As)\right\}. \tag{11.41}$$
The advantage there is that the cost function is compatible with the statistical innovation-
based formulation of Section 10.4.3, provided that the weights and components of
the potential function are properly chosen. In light of the comment at the end of
Section 11.4.1, the equivalence with MAP estimation is exact if we perform a single
level of wavelet decomposition with M = 2 and if we do not apply any penalty to
the lowpass coefficients s1 . Yet, the price to pay is that we are now facing a harder
optimization problem.
The difficulty stems from the fact that we no longer benefit from Parseval's norm
equivalence between the signal and wavelet domains. A workaround is to reinstate the
equivalence by insisting that the wavelet-frame expansion be consistent with the signal.
This leads to the reformulation of (11.41) in synthesis form as
$$\tilde{z} = \arg\min_{z}\left\{\frac{1}{2}\|z - Ay\|^2 + \tau\,\Phi(z)\right\} \quad\text{s.t.}\quad AA^\dagger z = z, \tag{11.42}$$
which is the consistent cycle-spinning version of denoising. Rather than attempting to
solve the constrained optimization problem (11.42) directly, we shall exploit the link
with conventional wavelet shrinkage. To that end, we introduce the augmented Lagrang-
ian penalty function
$$\mathcal{L}_A(z, x, \lambda; \mu) = \frac{1}{2}\|z - Ay\|_2^2 + \tau\,\Phi(z) + \frac{\mu}{2}\|z - Ax\|_2^2 - \lambda^T(z - Ax) \tag{11.43}$$

with penalty parameter μ ∈ R+ and Lagrangian multiplier vector λ ∈ RMN . Observe that
the minimization of (11.43) over (z, x, λ) is equivalent to solving (11.42). Indeed, the
consistency condition z = Ax asserted by (11.43) is equivalent to AA† z = z, while the
auxiliary variable x = A† z is the sought-after signal.
The standard strategy in the augmented-Lagrangian method of multipliers is to solve
the problem iteratively by first minimizing LA (z, x, λ; μ) with respect to (z, x) while
keeping μ fixed and updating λ according to the rule

$$\lambda^{k+1} = \lambda^k - \mu\left(z^{k+1} - Ax^{k+1}\right).$$

Here, the task is simplified by applying the alternating-direction method of multipliers;
that is, by first minimizing $\mathcal{L}_A(z, x, \lambda; \mu)$ with respect to z with x fixed and then the
other way around. The link with conventional wavelet denoising is obtained by rewriting
(11.43) as
$$\mathcal{L}_A(z, x, \lambda; \mu) = \frac{1+\mu}{2}\|z - \tilde{z}\|_2^2 + \tau\,\Phi(z) + C_0(x, \lambda; \mu),$$
where
$$\tilde{z} = \frac{1}{1+\mu}\left(Ay + \mu Ax + \lambda\right)$$
and $C_0$ is a term that does not depend on z. Since the underlying cost function is
separable, the solution of the minimization of $\mathcal{L}_A$ with respect to z is obtained by suitable
shrinkage of $\tilde{z}$, leading to
$$z^{k+1} = \operatorname{prox}_{\Phi}\!\left(\tilde{z}^{k+1};\,\frac{\tau}{1+\mu}\right) \tag{11.44}$$

and involving the same kind of pointwise non-linearity as algorithm (11.34). The
converse task of optimizing LA over x with z = zk+1 fixed is a quadratic problem. The
required partial derivatives are obtained as
∂ LA (z, x, λ; μ)
= −μAT (z − Ax) − AT λ.
∂x
This leads to the closed-form solution
1 † k
xk+1 = A† zk+1 − A λ,
μ
where we have taken advantage of the tight-frame/pseudo-inverse property $A^\dagger A = I$
with $A^\dagger = \frac{1}{M}A^T$.
The complete CCS (consistent cycle spinning) denoising procedure is summarized
in Algorithm 3. It is an iterative variant of wavelet shrinkage where the thresholding

Algorithm 3: CCS denoising solves Problem (11.41), where A is a tight-frame matrix
input: $y$, $s^0 \in \mathbb{R}^N$, $\tau$, $\mu \in \mathbb{R}^+$
set: $k = 0$, $\lambda^0 = 0$, $u = Ay$;
repeat
    $z^{k+1} = \operatorname{prox}_{\Phi}\!\left(\frac{1}{1+\mu}\left(u + \mu A s^k + \lambda^k\right);\,\frac{\tau}{1+\mu}\right)$
    $s^{k+1} = A^\dagger\!\left(z^{k+1} - \frac{1}{\mu}\lambda^k\right)$
    $\lambda^{k+1} = \lambda^k - \mu\left(z^{k+1} - A s^{k+1}\right)$
    $k = k + 1$
until stopping criterion
return $s = s^k$

function is determined by the statistics of the signal and applied in a consensus fashion.
Its cost per iteration is O(N × M) operations, which is essentially that of the fast wavelet
transform. This makes the method very fast. Since every step is the outcome of an exact
minimization, the cost function decreases monotonically until the algorithm reaches a
fixed point. The convergence to a global optimum is guaranteed when the potential
function  is convex.
One may also observe that the CCS denoising algorithm is similar to the MAP esti-
mation method of Section 10.2.4 since both rely on ADMM. Besides the fact that the
latter can handle an arbitrary system matrix H, the crucial difference is in the choice of
the auxiliary variable u = Ls (discrete innovation) vs. z = As (redundant wavelet trans-
form). While the two representations have a significant intersection, the tight wavelet
frame has the advantage of resulting in a better conditioned problem (because of the
norm-preservation property) and hence a faster convergence to the solution.
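To fix ideas, here is a compact implementation of Algorithm 3 (our sketch, not code from the text). It uses M = 2 shifted single-level Haar transforms as the tight frame and a soft-threshold as the shrinkage ($\ell_1$/Laplace potential); for brevity it also thresholds the lowpass coefficients, unlike the experiments reported below:

import numpy as np

def soft(z, t):
    # Soft-threshold: proximal operator of t*|.|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def frame_analysis(s, M=2):
    # z = A s: stack of M circularly shifted single-level Haar transforms (len(s) even)
    c = 1.0 / np.sqrt(2.0)
    blocks = []
    for m in range(M):
        x = np.roll(s, -m)
        blocks.append(np.concatenate([c * (x[0::2] + x[1::2]),      # lowpass
                                      c * (x[1::2] - x[0::2])]))    # highpass
    return np.concatenate(blocks)

def frame_synthesis(z, N, M=2):
    # x = A† z = (1/M) A^T z
    c = 1.0 / np.sqrt(2.0)
    x = np.zeros(N)
    for m in range(M):
        lo, hi = z[m*N : m*N + N//2], z[m*N + N//2 : (m+1)*N]
        y = np.empty(N)
        y[0::2] = c * (lo - hi)
        y[1::2] = c * (lo + hi)
        x += np.roll(y, m)
    return x / M

def ccs_denoise(y, tau, mu=1.0, M=2, n_iter=50):
    # Algorithm 3 (consistent cycle spinning) with a soft-threshold shrinkage
    N = len(y)
    u = frame_analysis(y, M)
    s, lam = y.copy(), np.zeros(M * N)
    for _ in range(n_iter):
        z = soft((u + mu * frame_analysis(s, M) + lam) / (1.0 + mu), tau / (1.0 + mu))
        s = frame_synthesis(z - lam / mu, N, M)
        lam = lam - mu * (z - frame_analysis(s, M))
    return s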

11.4.4 Iterative MMSE denoising


CCS denoising constitutes an attractive alternative to the more traditional iterative MAP
estimators described in Chapter 10. The scheme is appealing conceptually because it
bridges the gap between finite-difference and wavelet-based schemes.
The fact that the algorithm cycles through a series of orthogonal wavelet represen-
tations facilitates the Bayesian interpretation of the procedure. Fundamentally, there
are two complementary mechanisms at play: the first is a wavelet-domain shrinkage
whose first iteration is identical to the direct wavelet-denoising methods investigated
in Section 11.3. The second is the cycling through the “shifted” transforms which
results in the progressive refinement of the solution. Clearly, it is the search for a
signal that is globally consistent that constitutes the major improvement upon simple
11.4 Improved denoising by consistent cycle spinning 321

averaging. The natural idea that we shall test now is to replace the proximal operator
$\operatorname{prox}_{\Phi}$ by the optimal MMSE shrinkage function dictated by the theory of Section
11.3.2. This change essentially comes for free – the mere substitution in a lookup table
of $\operatorname{prox}_{\Phi}(z;\tau) = v_{\mathrm{MAP}}(z;\sigma)$ by $v_{\mathrm{MMSE}}(z;\sigma)$ as defined by (11.20) – but it has a tre-
mendous effect on performance, to the point that the algorithm reaches the best level
achievable. While there is not yet a proof that the proposed scheme converges to the
true MMSE solution, we shall document the behavior experimentally.
In order to compare the various wavelet-based denoising techniques, we have applied
them to the same series of Lévy processes as in Section 10.4.3: Brownian motion,
Laplace motion, compound-Poisson process with a standard Gaussian distribution and
λ = 0.6, and Lévy flight with Cauchy-distributed increments. Each realization of the
signal of length N = 100 is corrupted by AWGN of variance σ 2 . The signal is then
expanded in a Haar basis and denoised using the following algorithms:
• Soft-thresholding ($\Phi(z) \propto |z|$) in the Haar wavelet basis with optimized τ (ortho-ST)
• Model-based shrinkage in the orthogonal wavelet basis (ortho-MAP vs. ortho-
MMSE)
• Model-based shrinkage in a tight frame with M = 2 (frame-MAP vs. frame-MMSE)
• Global MAP estimator implemented by consistent cycle spinning (CCS-MAP), as
described in Section 11.4.3
• Model-based consistent cycle spinning with MMSE shrinkage function (CCS-
MMSE)
To simplify the comparison, the depth of the transform is set to I = 1 and the lowpass
coefficients are kept untouched. The experimental conditions are exactly the same as in
Section 10.4.3, with each data point (SNR value) being the result of an average over
500 trials.
The model-based denoisers are derived from the knowledge of the pdf of the wavelet
coefficients, which is given by $p_{V_1}(x) = \sqrt{2}\,p_U(\sqrt{2}\,x)$, where $p_U$ is the pdf of the
increments of the Lévy process. The rescaling by $\sqrt{2}$ accounts for the fact that the
(redundant) Haar wavelet coefficients are a renormalized version of the increments. For
the direct methods, we set $\tau = \sigma^2$, which corresponds to a standard wavelet-domain
MAP estimator, as described by (11.13). For the iterative CCS-MAP solution, the
identification of (11.41) with the standard form (10.46) of the MAP estimator dictates the
choice $\tau = \sigma^2$ and $\Phi(z) = \Phi_U(\sqrt{2}\,z)$. Similarly, the setting of the shrinkage function for
the MMSE version of CCS relies on the property that the noise variance in the wavelet
domain is σ 2 even though the components are no longer independent. By exploiting
the analogy with the orthogonal scenario, one then simply replaces the proximal step in
Algorithm 3, described in Equation (11.44), by
$$z^{k+1} = v_{\mathrm{MMSE}}\!\left(\tilde{z}^{k+1};\,\frac{\sigma^2}{1+\mu}\right),$$
which amounts to the component-wise application of the theoretical MMSE estimator
that is derived from the wavelet-domain statistics (see (11.20) and (11.21)).
In the case of Brownian motion for which the wavelet coefficients are Gaussian-
distributed, there is no distinction between the MAP and MMSE shrinkage functions,
Figure 11.9 SNR improvement as a function of the level of noise for Brownian motion. The
wavelet-denoising methods by reverse order of performance are: standard soft-thresholding
(ortho-ST), optimal shrinkage in a wavelet basis (ortho-MAP/MMSE), shrinkage in a redundant
system (frame-MAP/MMSE), and optimal shrinkage with consistent cycle spinning
(CCS-MAP/MMSE).

Figure 11.10 SNR improvement as a function of the level of noise for a Lévy process with
Laplace-distributed increments. The wavelet-denoising methods by reverse order of
performance are: ortho-MAP (equivalent to soft-thresholding with fixed τ ), ortho-MMSE,
frame-MMSE, frame-MAP, CCS-MAP, and CCS-MMSE. The results of CCS-MMSE are
indistinguishable from the ones of the reference MMSE estimator obtained using message
passing (see Figure 10.12).

which are linear. The corresponding denoising results are shown in Figure 11.9. They
are consistent with our expectations: the model-based approach (ortho-MAP/MMSE)
results in a (slight) improvement over basic wavelet-domain soft-thresholding while the
performance gain brought by redundancy (frame-MAP/MMSE) is more substantial, in
accordance with Proposition 11.3. The optimal denoising (global MMSE=MAP solu-
tion) is achieved by running the CCS version of the algorithm, which produces a linear
solution that is equivalent to the Wiener filter (LMMSE estimator).
The same performance hierarchy can also be observed for the other types of signals
(see Figures 11.10–11.12), confirming the relevance of the proposed series of refine-
ments. For non-Gaussian processes, ortho-MMSE is systematically better than ortho-
MAP, not to mention soft-thresholding, in agreement with the theoretical predictions
of Section 11.3. Since the MAP estimator for the compound-Poisson process is use-
less (identical to zero), the corresponding ortho-MMSE thresholding can be compared
Figure 11.11 SNR improvement as a function of the level of noise for a compound-Poisson
process (piecewise-constant signal). The wavelet-denoising methods by reverse order of
performance are: ortho-ST, ortho-MMSE, frame-MMSE, and CCS-MMSE. The results of
CCS-MMSE are indistinguishable from the ones of the reference MMSE estimator obtained
using message passing (see Figure 10.10).

Figure 11.12 SNR improvement as a function of the level of noise for a Lévy flight with
Cauchy-distributed increments. The wavelet-denoising methods by reverse order of performance
are: ortho-MAP, ortho-MMSE, frame-MMSE, frame-MAP, CCS-MAP, and CCS-MMSE. The
results of CCS-MMSE are indistinguishable from the ones of the reference MMSE estimator
obtained using message passing (see Figure 10.11).

against the optimized soft-thresholding (ortho-ST) where τ is tuned for maximum SNR
(see Figure 11.11). This is actually the scenario where this standard sparsity-promoting
scheme performs best, which is not too surprising since a piecewise-constant signal
is intrinsically sparse with a large proportion of its wavelet coefficients being zero.
Switching to a redundant system (frame) is beneficial in all instances. The only caveat
is that frame-MMSE is not necessarily the best design because the corresponding one-
step modification of wavelet coefficients typically destroys Parseval’s relation (lack of
consistency). In fact, one can observe a degradation, with frame-MMSE being worse
than frame-MAP in most cases. By contrast, the full power of the Bayesian formu-
lation is reinstated when the denoising is performed iteratively according to the CCS
strategy. Here, thanks to the consistency requirement, CCS-MMSE is always better
than CCS-MAP which actually corresponds to a wavelet-based implementation of the
true MAP estimator of the signal (see Section 10.4.3). In particular, CCS-MAP under
the Laplace hypothesis is equivalent to the standard total-variation denoiser. Finally,
the most important finding is that, in all tested scenarios, the results of CCS-MMSE
are indistinguishable from those of the belief-propagation algorithm of Section 10.4.2
which implements the reference MMSE solution. This leads us to conjecture that the
CCS-MMSE estimator is optimal for the class of first-order processes. The practical
benefit is that CCS-MMSE is much faster than BP, which necessitates the computation
of two FFTs per data point.
While we are still missing a theoretical explanation of the ability of CCS-MMSE to resolve non-Gaussian statistical interdependencies, we believe that the results presented are important conceptually, for they demonstrate the possibility of specifying iterative signal-reconstruction algorithms that minimize the reconstruction error. Moreover, it should be possible, by reverse engineering, to reformulate such schemes in terms of the minimization of a pseudo-MAP cost functional that is tied to the underlying signal model. Whether such ideas and design principles are transposable to higher-order signals and/or more general types of inverse problems is an open question that calls for further investigation.
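To make the ortho-ST baseline of these experiments concrete, the following is a minimal sketch of our own (not the implementation behind the reported figures), assuming the third-party PyWavelets package (pywt): soft-threshold the coefficients of an orthonormal Haar transform, with the threshold τ tuned by oracle grid search for maximum SNR, as in the protocol described above.

```python
# Minimal ortho-ST sketch (ours, not the implementation behind the figures):
# soft-threshold orthonormal wavelet coefficients and tune tau by oracle search.
import numpy as np
import pywt  # assumption: the third-party PyWavelets package is available

def soft(w, tau):
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def snr_db(ref, est):
    return 10 * np.log10(np.sum(ref**2) / np.sum((ref - est) ** 2))

def ortho_st_oracle(y, x_true, wavelet="haar", taus=np.linspace(0, 2, 81)):
    coeffs = pywt.wavedec(y, wavelet)
    best = (-np.inf, 0.0)
    for tau in taus:
        den = [coeffs[0]] + [soft(c, tau) for c in coeffs[1:]]
        x_hat = pywt.waverec(den, wavelet)[: len(y)]
        best = max(best, (snr_db(x_true, x_hat), tau))
    return best  # (maximal achievable SNR in dB, corresponding tau)

# Piecewise-constant (compound-Poisson-like) test signal plus AWGN
rng = np.random.default_rng(0)
x = np.repeat(rng.standard_normal(16), 64)
y = x + 0.5 * rng.standard_normal(x.size)
print(ortho_st_oracle(y, x))
```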

11.5 Bibliographical notes

The earliest instance of wavelet-based denoising is a soft-thresholding algorithm that was developed for magnetic resonance imaging [WYHJC91]. The same algorithm was
discovered independently by Donoho and Johnstone for the reconstruction of signals
from noisy samples [DJ94]. The key contribution of these authors was to establish the
statistical optimality of the procedure in a minimax sense as well as its smoothness-
preservation properties [Don95,DJ95]. This series of papers triggered the interest of the
statistical and signal-processing communities and got researchers working on extending
the technique and applying it to a variety of image-reconstruction problems.

Sections 11.1 and 11.2


The discovery of the connection between soft-thresholding and $\ell_1$-minimization [Tib96,
CDLL98] was a significant breakthrough. It opened the door to a variety of novel
methods for the recovery of sparse signals based on non-quadratic wavelet-domain
regularization, while also providing a link with statistical estimation techniques. For
instance, Figueiredo and Nowak [FN03] developed an approach to image restoration
based on the maximization of a likelihood criterion that is equivalent to (11.5) with
a Laplacian prior. These authors also introduced an expectation-maximization algo-
rithm that is one of the earliest incarnations of ISTA. The algorithm was brought into
prominence when Daubechies et al. were able to establish its convergence for gen-
eral linear inverse problems [DDDM04]. This motivated researchers to improve the
convergence speed of ISTA through the use of appropriate preconditioning and/or over-
relaxation [BDF07, VU08, FR08, BT09b]. A favored algorithm is FISTA because of its
ease of deployment and superior convergence guarantees [BT09b]. Biomedical-imaging
applications of wavelet-based image reconstruction include image restoration [BDF07],
3-D deconvolution microscopy [VU09], and parallel MRI [GKHPU11]. These works
involve accelerated versions of ISTA or FISTA that capitalize on the specificities of the
underlying system matrices.

Section 11.3
The signal-processing community’s response to the publication of Donoho and Johns-
tone’s work on the optimality of wavelet-domain soft-thresholding was a friendly com-
petition to improve denoising performance. The Bayesian reformulation of the basic
signal-denoising problem naturally led to the derivation of thresholding functions that
are optimal in the MMSE sense [SA96, Sil99, ASS98]. Moulin and Liu presented a
mathematical analysis of pointwise MAP estimators, establishing their shrinkage beha-
vior for heavy-tailed distributions [ML99]. In this statistical view of the problem, the
thresholding function is determined by the assumed prior distribution of the wavelet
coefficients, the most prominent choices being the generalized Gaussian distribution
[Mal89, CYV00, PP06] or a mixture of Gaussians with a peak at the origin [CKM97].
While infinite divisibility is not a property that has been emphasized in the image-
processing literature, researchers have considered a number of wavelet-domain models
that are compatible with the property and therefore part of the general framework inves-
tigated in Section 11.3. These include the Laplace distribution (see [HY00, Mar05] for
pointwise MMSE estimator), SαS laws (see [ATB03] and [ABT01, BF06] for point-
wise MAP and MMSE estimators, respectively), the Cauchy distribution [BAS07], as
well as the sym gamma family [FB05]. The latter choice (a.k.a. Bessel K-form) is
supported by a constructive model of images that is reminiscent of generalized Pois-
son processes [GS01]. There is also experimental evidence that this class of models
is able to fit the observed transform-domain histograms well over a variety of natural
images [MG01, SLG02].
The multivariate version of (11.21) can be found in [Ste81, Equation (3.3)]. This for-
mula is also central to the derivation of Stein’s unbiased risk estimator (SURE), which
provides a powerful data-driven scheme for adjusting the free parameters of a statistical
estimator under the AWGN hypothesis. SURE has been applied to the automatic adjust-
ment of the thresholding parameters of wavelet-based denoising algorithms such as the
SURE-shrink [DJ95] and SURELET [LBU07] approaches.
The possibility of defining bivariate shrinkage functions for exploiting inter-scale
wavelet dependencies is investigated in [SS02].

Section 11.4
The concept of redundant wavelet-based denoising was introduced by Coifman under
the name of cycle spinning [CD95]. The fact that this scheme always improves upon
non-redundant wavelet-based denoising (see Proposition 11.3) was pointed out by
Raphan and Simoncelli [RS08]. A frame is an overcomplete (and stable) generalization
of a basis; see for instance [Ald95, Chr03]. For the design and implementation of the
operator-like wavelets (including the first-order ones which are orthogonal), we refer
to [KU06].
The concept of consistent cycle spinning was developed by Kamilov et al. [KBU12].
The CCS Haar-denoising algorithm was then modified appropriately to provide the
MMSE estimator for Lévy processes [KKBU12].
12 Conclusion

We have presented a mathematical framework that results in the specification of the broadest possible class of linear stochastic processes. The remarkable aspect is that
these continuous-domain processes are either Gaussian or sparse, as a direct conse-
quence of the theory. While the formulation relies on advanced mathematical concepts
(distribution theory, functional analysis), the underlying principles are simple and very
much in line with the traditional methods of statistical signal processing. The main point
is that one can achieve a whole variety of sparsity patterns by combining relatively
simple building blocks: non-Gaussian white-noise excitations and suitable integral op-
erators. This results in non-Gaussian processes whose properties are compatible with the
modern sparsity-based paradigm for signal processing. Yet, the proposed class of mod-
els is also backward-compatible with the linear theory of signal processing (LMMSE =
linear minimum mean square estimation) since the correlation structure of the processes
remains the same as in the traditional Gaussian case – the processes are ruled by the
same stochastic differential equations and it is only the driving terms (innovations) that
differ in their level of sparsity. On the theoretical front, we have highlighted the crucial
role of the generalized B-spline function βL – in one-to-one relation with the whitening
operator L – that provides the functional link between the continuous-domain specifica-
tion of the stochastic model and the discrete-domain handling of the sample values. We
have also shown that these processes admit a sparse wavelet decomposition whenever
the wavelet is matched to the whitening operator.
Possible applications and directions of future research include:
• The generation of sparse stochastic processes. These may be useful for testing algorithms or for providing artificial sounds or textures (a minimal generation sketch is given after this list).
• The development of identification procedures for estimating: (1) the whitening op-
erator L and (2) the noise parameters (Lévy exponent). The former problem may be
addressed by suitable adaptation of well-established Gaussian estimation techniques.
The second task is less standard and potentially more challenging. One possibility is
to estimate the noise parameters in the transformed domain (generalized increments
or wavelet coefficients) using cumulants and higher-order statistical methods.
• The design of optimal denoising and restoration procedures that expand upon the first-
order techniques described in Sections 10.4 and 11.4. The challenge is to develop
efficient algorithms for MMSE signal estimation using the present class of sparse
prior models. This should provide a principled approach for tackling sparse signal
recovery problems and possibly result in higher-quality reconstructions.
• The formulation of a model-based approach for regularization and wavelet-domain
processing with the derivation of optimal thresholding estimators.
• The definition and investigation of richer classes of sparse processes – especially
in higher dimensions – by a mixture of elementary ones associated with individual
operators $\mathrm{L}_i$.
• The theory predicts that, for the proposed class of models, the transform-domain statistics should be infinitely divisible. This needs to be tested on real signals and images. One also needs to design appropriate estimators for capturing the tail behavior of the pdf, which is the most important indicator of sparsity.
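As announced in the first item of the list, here is a minimal generation sketch (our own, not part of any accompanying software): sample paths of the simplest sparse processes, namely Lévy processes, are obtained by cumulative summation of i.i.d. increments drawn from an infinitely divisible law; the jump rate below is an arbitrary, illustrative value.

```python
# Minimal sketch (ours): sample paths of first-order sparse processes (Levy
# processes) via cumulative sums of i.i.d. infinitely divisible increments.
import numpy as np

rng = np.random.default_rng(1)
n = 1024  # unit-spaced samples

gaussian = np.cumsum(rng.standard_normal(n))  # Brownian motion (non-sparse)
cauchy = np.cumsum(rng.standard_cauchy(n))    # Levy flight, SaS with alpha = 1

# Compound-Poisson increments: Poisson-many Gaussian jumps per unit interval,
# which yields a piecewise-constant (intrinsically sparse) sample path.
lam = 0.1  # jump rate; an assumed, illustrative value
poisson = np.cumsum([rng.standard_normal(k).sum() for k in rng.poisson(lam, n)])
```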
Appendix A Singular integrals

In this appendix, we are concerned with integrals involving functions that are singu-
lar at a finite (or at least countable) number of isolated points. Without further loss of
generality, we consider the singularities to arise at the origin.
Suppose that we are given a function f that is locally integrable in any neighborhood
in Rd that excludes the origin, but not if the origin is included. Then, for any test function
$\varphi \in \mathcal{D}(\mathbb{R}^d)$ with $0 \notin \operatorname{support}(\varphi)$, the integral
$$\langle \varphi, f \rangle = \int_{\mathbb{R}^d} \varphi(\boldsymbol{r})\, f(\boldsymbol{r})\, \mathrm{d}\boldsymbol{r}$$
converges in the sense of Lebesgue and is continuous in ϕ for sequences that exclude a
neighborhood of the origin. It may also converge for some other, but not all, $\varphi \in \mathcal{D}$. In general, if $f$ grows no faster than some inverse power of $|\boldsymbol{r}|$ as $\boldsymbol{r} \to 0$, then $\langle \varphi, f \rangle$ will converge for all $\varphi$ whose value as well as all derivatives up to some order $k$ vanish at $0$. This is the situation that will be of interest to us here.
In some cases we may be able to continuously extend the bilinear form $\langle \varphi, f \rangle$ to all test functions in $\mathcal{D}(\mathbb{R}^d)$ (or even in $\mathcal{S}(\mathbb{R}^d)$). In other words, it may be possible to find a generalized function $\tilde f$ such that $\langle \varphi, f \rangle = \langle \varphi, \tilde f \rangle$ for all $\varphi \in \mathcal{D}$ for which the left-hand side converges in the sense of Lebesgue. $\tilde f$ is then called a regularization of $f$.
Note that regularizations of $f$, when they exist, are not unique, as they can differ by a generalized function that is supported at $\boldsymbol{r} = 0$. This also implies that the difference of any two regularizations of $f$ can be written as a finite sum of the form
$$\sum_{|\boldsymbol{n}| \leq k} c_{\boldsymbol{n}}\, \delta^{(\boldsymbol{n})}(\boldsymbol{r}),$$
where $\delta^{(\boldsymbol{n})}$ denotes the $\boldsymbol{n}$th (partial) derivative of Dirac's $\delta$ distribution.
We proceed to present some of the standard ways to regularize singular integrals,
and close the appendix with a table of standard or canonical regularizations of some
common singular functions. The reader who is interested in a more complete account
may refer to the books by Gelfand and Shilov [GS68], Hörmander [Hör05], Mikhlin
and Prössdorf [MP86], and Estrada and Kanwal [EK00], among others, where all of the
examples and explanations given below and more may be found.
A.1 Regularization of singular integrals by analytic continuation

The first approach to regularization we shall describe is useful for regularizing parametric families of integrals. As an illustration, consider the family of functions $x_+^\lambda = x^\lambda \mathbb{1}_{[0,\infty)}(x)$ in one scalar variable. For $\lambda \in U = \{\lambda \in \mathbb{C} : \operatorname{Re}(\lambda) > -1\}$, the integral
$$\langle \varphi, x_+^\lambda \rangle = \int_{\mathbb{R}} \varphi(x)\, x_+^\lambda \,\mathrm{d}x$$
converges for all $\varphi \in \mathcal{D}(\mathbb{R})$. Moreover, the function
$$F_\varphi(\lambda) = \int_{\mathbb{R}} \varphi(x)\, x_+^\lambda \,\mathrm{d}x$$
is (complex) analytic in $\lambda$ over its (initial) domain $U$, as can be seen by differentiation under the integral sign.1 Additionally, as we shall see shortly, $F_\varphi(\lambda)$ has a (necessarily unique) analytic continuation to a larger domain $\tilde U \supset U$ in the complex plane. Denoting the analytic continuation of $F_\varphi$ by $\tilde F_\varphi$, we can use it to define a regularization of $x_+^\lambda$ for $\lambda \in \tilde U \backslash U$ by the identity
$$\langle \varphi, x_+^\lambda \rangle = \tilde F_\varphi(\lambda) \quad \text{for } \lambda \in \tilde U \backslash U.$$

In the above example, one can show that the largest domain to which $F_\varphi(\lambda) = \int_{\mathbb{R}} \varphi(x)\, x_+^\lambda \,\mathrm{d}x$ (integration in the sense of Lebesgue) can be extended analytically is the set
$$\tilde U = \mathbb{C} \backslash \{-1, -2, -3, \ldots\}.$$
Over $\tilde U$, the analytic continuation of $F_\varphi(\lambda)$ to $-1 \geq \operatorname{Re}(\lambda) > -2$, $\lambda \neq -1$, is found by using the formula
$$x_+^\lambda = \frac{1}{\lambda+1}\, \frac{\mathrm{d}}{\mathrm{d}x}\, x_+^{\lambda+1}$$
to write
$$\langle \varphi, x_+^\lambda \rangle = \frac{1}{\lambda+1} \left\langle \varphi, \frac{\mathrm{d}}{\mathrm{d}x} x_+^{\lambda+1} \right\rangle = \frac{-1}{\lambda+1} \left\langle \frac{\mathrm{d}}{\mathrm{d}x}\varphi, x_+^{\lambda+1} \right\rangle,$$
where the rightmost member is well defined and where we have used duality to transfer the derivative operator to the side of the test function. Similarly, we can find the analytic continuation for $-n \geq \operatorname{Re}(\lambda) > -(n+1)$, $\lambda \neq -n$, by successive re-application of the same approach, which leads to
$$\langle \varphi, x_+^\lambda \rangle = \frac{(-1)^n}{(\lambda+1)\cdots(\lambda+n)} \left\langle \frac{\mathrm{d}^n}{\mathrm{d}x^n}\varphi, x_+^{\lambda+n} \right\rangle.$$

1 We can differentiate under the integral sign with respect to $\lambda$ due to the compact support of $\varphi$, whereby we obtain the integral
$$\frac{\mathrm{d}}{\mathrm{d}\lambda} F_\varphi(\lambda) = \int_{\mathbb{R}} \varphi(x)\, x_+^\lambda \log x \,\mathrm{d}x,$$
which also converges for $\lambda \in U$.



Within the band $-n > \operatorname{Re}(\lambda) > -(n+1)$, we may also compute $\langle \varphi, x_+^\lambda \rangle$ using the formulas
$$\langle \varphi, x_+^\lambda \rangle = \int_0^\infty \Big( \varphi(x) - \sum_{k \leq -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)\,x^k}{k!} \Big)\, x^\lambda \,\mathrm{d}x \tag{A.1}$$
$$= \int_1^\infty \varphi(x)\, x^\lambda \,\mathrm{d}x + \int_0^1 \Big( \varphi(x) - \sum_{k \leq -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)\,x^k}{k!} \Big)\, x^\lambda \,\mathrm{d}x - \sum_{k \leq -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)}{k!} \int_1^\infty x^{\lambda+k} \,\mathrm{d}x$$
$$= \int_1^\infty \varphi(x)\, x^\lambda \,\mathrm{d}x + \int_0^1 \Big( \varphi(x) - \sum_{k \leq -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)\,x^k}{k!} \Big)\, x^\lambda \,\mathrm{d}x + \sum_{k \leq -\operatorname{Re}(\lambda)-1} \frac{\varphi^{(k)}(0)}{k!\,(\lambda+k+1)}. \tag{A.2}$$
The effect of subtracting the first $n$ terms of the Taylor expansion of $\varphi(x)$ in (A.2) is to create a zero of sufficiently high order at $0$ to make the singularity of $x_+^\lambda$ integrable.
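As a sanity check, the regularization (A.2) can be probed numerically. The sketch below (our own, for a Gaussian test function) verifies that (A.2) agrees with the duality formula $\langle \varphi, x_+^\lambda \rangle = \frac{-1}{\lambda+1}\langle \varphi', x_+^{\lambda+1} \rangle$ in the band $-2 < \operatorname{Re}(\lambda) < -1$, where a single Taylor term is subtracted.

```python
# Numerical check (ours) of (A.2) against the duality-based continuation for
# phi(x) = exp(-x^2) and lambda = -1.5 (only the k = 0 Taylor term is removed).
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2)
dphi = lambda x: -2.0 * x * np.exp(-x**2)
lam = -1.5

lhs = (quad(lambda x: phi(x) * x**lam, 1, np.inf)[0]
       + quad(lambda x: (phi(x) - phi(0)) * x**lam, 0, 1)[0]
       + phi(0) / (lam + 1.0))                       # formula (A.2)
rhs = -quad(lambda x: dphi(x) * x**(lam + 1), 0, np.inf)[0] / (lam + 1.0)
print(lhs, rhs)  # both approximate Gamma((lam+1)/2)/2 = -2.4509...
```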
We use the above definition of $x_+^\lambda$ to define $x_-^\lambda = (-x)_+^\lambda$ for $\lambda \neq -1, -2, -3, \ldots$, as well as $\operatorname{sign}(x)|x|^\lambda = x_+^\lambda - x_-^\lambda$ and $|x|^\lambda = x_+^\lambda + x_-^\lambda$. Due to the cancellation of some poles in $\lambda$, the generalized functions $\operatorname{sign}(x)|x|^\lambda$ and $|x|^\lambda$ have an extended domain of definition: they are defined for $\lambda \neq -1, -3, -5, \ldots$ and $\lambda \neq -2, -4, -6, \ldots$, respectively.
The singular function $r^\lambda$ in $d$ dimensions is regularized by switching to (hyper)spherical coordinates and applying the definition of $x_+^{\lambda+d-1}$. Due to symmetries, the definition obtained in this way is valid for all $\lambda \in \mathbb{C}$ with the exception of the individual points $-d, -(d+2), -(d+4), -(d+6), \ldots$ As was the case in 1-D, we find formulas based on removing terms from the Taylor expansion of $\varphi$ for computing $\langle \varphi, r^\lambda \rangle$ in bands of the form $-(d+2m) < \operatorname{Re}(\lambda) < -(d+2m-2)$, $m = 1, 2, 3, \ldots$, which results in
$$\langle \varphi, r^\lambda \rangle = \int_0^\infty \Big( S_\varphi(r) - \sum_{n \leq -\operatorname{Re}(\lambda)-d} \frac{r^n}{n!}\, S_\varphi^{(n)}(0) \Big)\, r^{\lambda+d-1} \,\mathrm{d}r$$
$$= \int_{\mathbb{R}^d} \Big( \varphi(\boldsymbol{r}) - \sum_{|\boldsymbol{k}| \leq -\operatorname{Re}(\lambda)-d} \frac{\boldsymbol{r}^{\boldsymbol{k}}}{\boldsymbol{k}!}\, \varphi^{(\boldsymbol{k})}(0) \Big)\, r^\lambda \,\mathrm{d}\boldsymbol{r}.$$

The first equality is obtained by considering (hyper)spherical coordinates; $S_\varphi(r)$ therein denotes the integral of $\varphi(\boldsymbol{r})$ over the (hyper)sphere of radius $r$. The second equality is obtained by rewriting the first in Cartesian coordinates. In the second formula we use standard multi-index notation: $\boldsymbol{r}^{\boldsymbol{k}} = x_1^{k_1} \cdots x_d^{k_d}$, $\boldsymbol{k}! = k_1! \cdots k_d!$, and $\varphi^{(\boldsymbol{k})} = \partial_1^{k_1} \cdots \partial_d^{k_d} \varphi$.
By normalizing $r^\lambda$ as
$$\rho^\lambda(\boldsymbol{r}) = \frac{r^\lambda}{2^\lambda\, \Gamma\big(\frac{\lambda+d}{2}\big)},$$
where the gamma function $\Gamma\big(\frac{\lambda+d}{2}\big)$ has poles in $\lambda$ over the same set of points $-d, -(d+2), -(d+4), -(d+6), \ldots$, we obtain a generalized function that is well defined for all $\lambda \in \mathbb{C}$. Moreover, it is possible to show that, for $\lambda = -(d+2m)$, $m = 0, 1, 2, 3, \ldots$, we have
$$\rho^\lambda(\boldsymbol{r}) \equiv (-\Delta)^m \delta(\boldsymbol{r}), \tag{A.3}$$
where $\Delta^m$ is the $m$th iteration of the Laplacian operator. For $m = 0$, we can find the result (A.3) directly by taking the limit $\lambda \to -d$ of $\langle \varphi, \rho^\lambda \rangle$. From there, the general result is obtained by iterating the relation
$$(-\Delta)\rho^\lambda = \lambda\, \rho^{\lambda-2},$$
which is true for all $\lambda$.

A.2 Fourier transform of homogeneous distributions

The definition of the 1-D generalized functions of Section A.1 extends to Schwartz' space, $\mathcal{S}(\mathbb{R})$, and to $\mathcal{S}(\mathbb{R}^d)$ in multiple dimensions. We can therefore consider these families of generalized functions as members of the space $\mathcal{S}'$ of tempered distributions. In particular, this implies that the Fourier transform of any of these generalized functions will belong to (the complex version of) $\mathcal{S}'$ as well. We recall from Section 3.3.3 that, in general, the (generalized) Fourier transform $\widehat{g}$ of a tempered distribution $g$ is defined as the tempered distribution that makes the following identity hold true for all $\varphi \in \mathcal{S}$:
$$\langle g, \widehat{\varphi} \rangle = \langle \widehat{g}, \varphi \rangle. \tag{A.4}$$

The distributions we considered above are all homogeneous, in the sense that, using $g$ to denote any one of them, $g(a\,\cdot)$ for some $a > 0$ is equal to $a^\lambda g$. It then follows from the properties of the Fourier transform that $\widehat{g}$ is also homogeneous, albeit of order $-(d+\lambda)$. By invoking additional symmetry properties of these distributions, one then finds that the Fourier transform of members of each of these families belongs to the same family, with some normalization factor that can be computed, for instance, by plugging a Gaussian function $\varphi$ (which is its own Fourier transform) into (A.4). We summarize the Fourier transforms found in this way in Table A.1.
Table A.1 Canonical regularizations of some singular functions, and their Fourier transforms. The one-sided power function is $r_+^\lambda = \frac{1}{2}\big(|r|^\lambda + \operatorname{sign}(r)\,|r|^\lambda\big)$, $m = 1, 2, 3, \ldots$, and $\Gamma$ denotes the gamma function. Derivatives of $\delta$ are also included for completeness.

Singular function | Canonical regularization | Fourier transform

$r_+^\lambda$, $-m-1 < \operatorname{Re}(\lambda) < -m$ | $\langle \varphi, \tilde r_+^\lambda \rangle = \int_0^\infty r^\lambda \big( \varphi(r) - \sum_{0 \leq i \leq m-1} \frac{r^i \varphi^{(i)}(0)}{i!} \big)\,\mathrm{d}r$ | $\frac{\Gamma(\lambda+1)}{(\mathrm{j}\omega)^{\lambda+1}}$

$r_+^n$, $n = 0, 1, 2, \ldots$ | N/A | $\mathrm{j}^n \pi \delta^{(n)}(\omega) + \frac{n!}{(\mathrm{j}\omega)^{n+1}}$

$|r|^\lambda$, $-2m-2 < \operatorname{Re}(\lambda) < -2m$ | $\langle \varphi, |\tilde r|^\lambda \rangle = \int_0^\infty r^\lambda \big( \varphi(r) + \varphi(-r) - 2 \sum_{0 \leq i \leq m-1} \frac{r^{2i} \varphi^{(2i)}(0)}{(2i)!} \big)\,\mathrm{d}r$ | $\frac{-2\sin(\frac{\pi}{2}\lambda)\,\Gamma(\lambda+1)}{|\omega|^{\lambda+1}}$

$|r|^\lambda \operatorname{sign}(r)$, $-2m-1 < \operatorname{Re}(\lambda) < -2m+1$ | $\langle \varphi, |\tilde r|^\lambda \operatorname{sign}(r) \rangle = \int_0^\infty r^\lambda \big( \varphi(r) - \varphi(-r) - 2 \sum_{0 \leq i \leq m-1} \frac{r^{2i+1} \varphi^{(2i+1)}(0)}{(2i+1)!} \big)\,\mathrm{d}r$ | $\frac{-2\mathrm{j}\cos(\frac{\pi}{2}\lambda)\,\Gamma(\lambda+1)}{|\omega|^{\lambda+1}}\operatorname{sign}(\omega)$

$r^n$, $n = 0, 1, 2, \ldots$ | N/A | $\mathrm{j}^n 2\pi \delta^{(n)}(\omega)$

$r^n \operatorname{sign}(r)$, $n = 0, 1, 2, \ldots$ | N/A | $\frac{2\,n!}{(\mathrm{j}\omega)^{n+1}}$

$1/r$ | $\int_0^{+\infty} \frac{\varphi(r)-\varphi(-r)}{r}\,\mathrm{d}r$ | $-\mathrm{j}\pi \operatorname{sign}(\omega)$

$r^\lambda$, $\boldsymbol{r} \in \mathbb{R}^d$, $-(d+2m) < \operatorname{Re}(\lambda) < -(d+2m-2)$ | $\int_{\mathbb{R}^d} \big( \varphi(\boldsymbol{r}) - \sum_{|\boldsymbol{k}|=0}^{-\operatorname{Re}(\lambda)-d} \frac{\boldsymbol{r}^{\boldsymbol{k}}}{\boldsymbol{k}!} \varphi^{(\boldsymbol{k})}(0) \big)\, r^\lambda \,\mathrm{d}\boldsymbol{r}$ | $\frac{2^{\lambda+d}\pi^{d/2}\,\Gamma\big(\frac{d+\lambda}{2}\big)}{\Gamma\big(-\frac{\lambda}{2}\big)}\,\frac{1}{\omega^{\lambda+d}}$

$\boldsymbol{r}^{\boldsymbol{n}}$, $\boldsymbol{r} \in \mathbb{R}^d$, $\boldsymbol{n} \in \mathbb{N}^d$ | N/A | $\mathrm{j}^{|\boldsymbol{n}|}(2\pi)^d \delta^{(\boldsymbol{n})}(\boldsymbol{\omega})$
The interested reader may refer to Gelfand and Shilov [GS68] for the details of their calculations.

A.3 Hadamard’s finite part

A second approach to normalizing singular integrals is known as Hadamard's finite part, and can be considered a generalization of Cauchy's principal value. We recall that, for
A.3 Hadamard’s finite part 333

∞ ϕ(x)
the singular integral −∞ x dx, the principal value is defined as
∞ ϕ(x) ϕ(x) ∞ − ϕ(x)
p.v. dx = lim dx + dx
−∞ x →0  x −∞ x
∞ ϕ(x) − ϕ(−x)
= lim dx
→0  x
∞ ϕ(x) − ϕ(−x)
= dx
0 x
where the last integral converges in the sense of Lebesgue.
In essence, Cauchy's definition of principal value relies on the "infinite parts" of the integrals $\int_0^\infty$ and $\int_{-\infty}^0$ cancelling one another out. To generalize this idea, consider the integral
$$\int_0^\infty \varphi(x)\, f(x)\,\mathrm{d}x,$$
where the function $f$ is assumed to be singular at $0$. Let
$$\Phi(\epsilon) = \int_{\epsilon}^{\infty} \varphi(x)\, f(x)\,\mathrm{d}x$$
and suppose that, for some pre-chosen family of functions $H_k(\epsilon)$ approaching infinity at $0$, we can find an $n \in \mathbb{Z}_+$ and coefficients $a_k$, $1 \leq k \leq n$, such that
$$\lim_{\epsilon \to 0^+} \Big( \Phi(\epsilon) - \sum_{k=1}^{n} a_k H_k(\epsilon) \Big) = A < \infty.$$
$A$ is then called the finite part (in French, partie finie) of the integral $\int_0^\infty \varphi(x) f(x)\,\mathrm{d}x$ and is denoted as [EK00]
$$\mathrm{p.f.} \int_0^\infty \varphi(x)\, f(x)\,\mathrm{d}x.$$
In cases of interest to us, the family $H_k(\epsilon)$ consists of inverse integer powers of $\epsilon$ and logarithms. With this choice, the finite-part regularization of the singular integrals considered in Section A.1 can be obtained. It is found to coincide with their regularization by analytic continuation (note that all $H_k$ are analytic in $\epsilon$). But, in addition, we can use the finite part to regularize $x_+^\lambda$ and related functions in cases where the previous method fails (namely, for $\lambda = -1, -2, -3, \ldots$). Indeed, for $\lambda = -n$, $n \in \mathbb{Z}_+$, we may write
$$\int_{\epsilon}^{\infty} \frac{\varphi(x)}{x^n}\,\mathrm{d}x = \sum_{k=0}^{n-1} \int_{\epsilon}^{1} \frac{\varphi^{(k)}(0)\, x^{k-n}}{k!}\,\mathrm{d}x + \int_{\epsilon}^{1} \frac{\varphi(x) - \sum_{k=0}^{n-1} \frac{\varphi^{(k)}(0)\,x^k}{k!}}{x^n}\,\mathrm{d}x + \int_{1}^{\infty} \frac{\varphi(x)}{x^n}\,\mathrm{d}x$$
$$= -\frac{\varphi^{(n-1)}(0)}{(n-1)!}\log\epsilon + \sum_{k=0}^{n-2} \frac{\varphi^{(k)}(0)}{k!}\cdot\frac{\epsilon^{-n+k+1}-1}{n-k-1} + \int_{\epsilon}^{1} \frac{\varphi(x) - \sum_{k=0}^{n-1} \frac{\varphi^{(k)}(0)\,x^k}{k!}}{x^n}\,\mathrm{d}x + \int_{1}^{\infty} \frac{\varphi(x)}{x^n}\,\mathrm{d}x.$$

From there, by discarding the logarithm and inverse powers of $\epsilon$ and taking the limit $\epsilon \to 0$ of what remains, we find
$$\mathrm{p.f.}\int_0^\infty \frac{\varphi(x)}{x^n}\,\mathrm{d}x = \int_1^\infty \frac{\varphi(x)}{x^n}\,\mathrm{d}x + \int_0^1 \frac{\varphi(x) - \sum_{k=0}^{n-1}\frac{\varphi^{(k)}(0)\,x^k}{k!}}{x^n}\,\mathrm{d}x - \sum_{k=0}^{n-2} \frac{\varphi^{(k)}(0)}{k!\,(n-k-1)},$$
where the two integrals of the right-hand side converge in the sense of Lebesgue.
Using similar calculations, for $\langle \varphi, x_+^\lambda \rangle$ with $\lambda \neq -1, -2, -3, \ldots$, we find the same regularization as the one given by (A.2).
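The construction can again be checked numerically. The sketch below (ours) takes $n = 1$ and confirms that $\Phi(\epsilon) = \int_\epsilon^\infty \varphi(x)/x\,\mathrm{d}x$ diverges like $-\varphi(0)\log\epsilon$, and that removing this infinite part leaves exactly the finite-part value given by the formula above (for $n = 1$, the final sum is empty).

```python
# Numerical illustration (ours) of Hadamard's finite part for n = 1.
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.exp(-x**2)
fp = (quad(lambda x: phi(x) / x, 1, np.inf)[0]
      + quad(lambda x: (phi(x) - phi(0)) / x, 0, 1)[0])  # p.f. value

for eps in [1e-2, 1e-4, 1e-6]:
    Phi = quad(lambda x: phi(x) / x, eps, np.inf)[0]
    print(eps, Phi + phi(0) * np.log(eps))  # converges to fp as eps -> 0
print("finite part:", fp)
```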
In general, a singular function f does not define a distribution in a unique way.
However, in many of the cases that are of interest to us there exists a particular reg-
ularization of f that is considered standard or canonical. For the parametric families
discussed so far, these essentially correspond to the regularization obtained by analytic
continuation. In Table A.1, we have summarized the formulas for canonical regulari-
zation as well as the Fourier transforms of the singular distributions that are of interest
to us.
Finally, we point out that the approaches presented in this appendix to regularize
singularities at the origin generalize in an obvious way to isolated singularities at any
other point and also to a finite (or even countable) number of isolated singularities.

A.4 Some convolution integrals with singular kernels

As we noted earlier, the scaling property of the Fourier transform demands that the
Fourier transform of a homogeneous distribution of order λ be homogeneous of order
−(λ + d). Thus, for Re(λ) ≤ −d where the original distribution is singular at 0, its Fou-
rier transform is locally integrable everywhere and vice versa. Consequently, convo-
lutions with homogeneous singular kernels are often easier to evaluate in the Fourier
domain by employing the convolution-multiplication rule. Important examples of such
convolutions are the Hilbert and Riesz transforms.
The Hilbert transform of a test function $\varphi \in \mathcal{S}(\mathbb{R})$ can thus be defined either by the convolution with the singular kernel $h(x) = 1/(\pi x)$ as
$$\mathrm{H}\varphi(x) = \mathrm{p.v.} \int_{-\infty}^{+\infty} \varphi(y)\, h(x-y)\,\mathrm{d}y,$$
which involves a principal-value limit, or by the Fourier-domain formula
$$\mathrm{H}\varphi(x) = \mathcal{F}^{-1}\big\{\widehat{h}\,\widehat{\varphi}\big\}(x),$$
with $\widehat{h}(\omega) = -\mathrm{j}\operatorname{sign}(\omega)$. These definitions extend beyond $\mathcal{S}(\mathbb{R})$ to $\varphi \in L_p(\mathbb{R})$ for $1 < p < \infty$ (the standard reference here is Stein and Weiss [SW71]).
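For a periodic discrete signal, the Fourier-domain formula translates directly into an FFT-based multiplier; the following is a minimal sketch of our own, assuming the conventions of numpy.fft.

```python
# Discrete Hilbert transform (ours) via the multiplier -j*sign(omega).
import numpy as np

def hilbert_multiplier(x):
    X = np.fft.fft(x)
    omega = np.fft.fftfreq(x.size)  # signed frequencies (zero at DC)
    return np.real(np.fft.ifft(-1j * np.sign(omega) * X))

# Sanity check: the Hilbert transform maps cos to sin for a pure tone.
t = np.arange(256)
c, s = np.cos(2 * np.pi * 8 * t / 256), np.sin(2 * np.pi * 8 * t / 256)
print(np.allclose(hilbert_multiplier(c), s, atol=1e-10))  # True
```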

Similarly, the $i$th component of the Riesz transform of a test function $\varphi \in \mathcal{S}(\mathbb{R}^d)$ is defined in the spatial domain by the convolution with the kernel $h_i(\boldsymbol{r}) = \Gamma\big(\frac{d+1}{2}\big)\, r_i \big/ \big(\pi^{\frac{d+1}{2}}\, \|\boldsymbol{r}\|^{d+1}\big)$ as
$$\mathrm{R}_i\varphi(\boldsymbol{r}) = \mathrm{p.v.} \int_{\mathbb{R}^d} \varphi(\boldsymbol{t})\, h_i(\boldsymbol{r}-\boldsymbol{t})\,\mathrm{d}\boldsymbol{t},$$
which is equivalent to the Fourier integral
$$\mathrm{R}_i\varphi(\boldsymbol{r}) = \mathcal{F}^{-1}\big\{\widehat{R}_i\,\widehat{\varphi}\big\}(\boldsymbol{r})$$
with $\widehat{R}_i(\boldsymbol{\omega}) = -\mathrm{j}\,\frac{\omega_i}{\|\boldsymbol{\omega}\|}$. Once again, these definitions extend to $\varphi \in L_p(\mathbb{R}^d)$ for $1 < p < \infty$ (see Mikhlin's Theorem 3.6).
Appendix B Positive definiteness

Positive–definite functions play a central role in statistics, approximation theory [Mic86, Wen05], and machine learning [HSS08]. They allow for a convenient Fourier-
domain specification of characteristic functions, autocorrelation functions, and inter-
polation/approximation kernels (e.g., radial basis functions) with the guarantee that
the underlying approximation problems are well posed, irrespective of the location of
the data points. In this appendix, we provide the basic definitions of positive defini-
teness and conditional positive definiteness in the multidimensional setting, together
with a review of corresponding mathematical results. We distinguish between contin-
uous functions on the one hand and generalized functions on the other. We also give
a self-contained derivation of Gelfand and Vilenkin’s characterization of condition-
ally positive definite generalized functions in 1-D and discuss its connection with the
celebrated Lévy–Khintchine formula of statisticians. For a historical account of the rich
topic of positive definiteness, we refer to [Ste76].

B.1 Positive definiteness and Bochner’s theorem

D E F I N I T I O N B.1 A continuous, complex-valued function $f$ of the vector variable $\boldsymbol{\omega} \in \mathbb{R}^d$ is said to be positive semidefinite if and only if
$$\sum_{m=1}^{N} \sum_{n=1}^{N} \xi_m \overline{\xi_n}\, f(\boldsymbol{\omega}_m - \boldsymbol{\omega}_n) \geq 0$$
for every choice of $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_N \in \mathbb{R}^d$, $\xi_1, \ldots, \xi_N \in \mathbb{C}$, and $N \in \mathbb{N}$. Such a function is called positive definite in the strict sense if the quadratic form is greater than $0$ for all $\xi_1, \ldots, \xi_N \in \mathbb{C}\backslash\{0\}$.

In what follows, we shall abbreviate "positive semidefinite" by positive–definite. This property is equivalent to the requirement that the $N \times N$ matrix $\mathbf{F}$ whose elements are given by $[\mathbf{F}]_{m,n} = f(\boldsymbol{\omega}_m - \boldsymbol{\omega}_n)$ is positive semidefinite (or, equivalently, non-negative definite), for all $N$, no matter how the $\boldsymbol{\omega}_n$ are chosen.
B.1 Positive definiteness and Bochner’s theorem 337

The prototypical example of a positive–definite function is the Gaussian kernel $\widehat{g}(\omega) = \mathrm{e}^{-\omega^2/2}$. To establish the property, we express this Gaussian as the Fourier transform of $g(x) = \frac{1}{\sqrt{2\pi}}\,\mathrm{e}^{-\frac{x^2}{2}}$:
$$\sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \overline{\xi_n}\, \widehat{g}(\omega_m - \omega_n) = \sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \overline{\xi_n} \int_{\mathbb{R}} \mathrm{e}^{-\mathrm{j}(\omega_m-\omega_n)x}\, g(x)\,\mathrm{d}x = \int_{\mathbb{R}} \Big| \sum_{m=1}^{N} \xi_m \mathrm{e}^{-\mathrm{j}\omega_m x} \Big|^2\, g(x)\,\mathrm{d}x \geq 0,$$
where we have made use of the fact that $g(x)$, the (inverse) Fourier transform of $\mathrm{e}^{-\omega^2/2}$, is positive. It is not hard to see that the argument above remains valid for any (multidimensional) function $f(\boldsymbol{\omega})$ that is the Fourier transform of some non-negative kernel $g(\boldsymbol{x}) \geq 0$. The more impressive result is that the converse implication is also true.
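The argument can also be rendered numerically: the sketch below (ours) builds the matrix $\mathbf{F}$ of Definition B.1 for the Gaussian kernel at randomly chosen frequency nodes and verifies that its eigenvalues are non-negative.

```python
# Numerical rendering (ours) of Definition B.1 for the Gaussian kernel.
import numpy as np

f = lambda w: np.exp(-w**2 / 2)         # positive-definite prototype
rng = np.random.default_rng(2)
omega = rng.uniform(-5, 5, 12)          # arbitrary nodes omega_1, ..., omega_N
F = f(omega[:, None] - omega[None, :])  # [F]_{m,n} = f(omega_m - omega_n)
print(np.linalg.eigvalsh(F).min() >= -1e-12)  # True: F is positive semidefinite
```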

THEOREM B.1 (Bochner’s theorem) Let f be a bounded continuous function on Rd .


Then, f is positive definite if and only if it is the (conjugate) Fourier transform of a
non-negative and finite Borel measure μ:

f (ω) = ejω,x μ(dx).


Rd

In particular, Bochner’s theorem implies that f is a valid characteristic function – that


is, f (ω) = E{ejω,x } = Rd ejω,x PX (dx) where PX is some probability measure on
Rd – if and only if f is continuous, positive definite with f (0) = 1 (see Section 3.4.3 and
Theorem 3.7).
Bochner’s theorem is also fundamental to the theory of scattered data interpolation,
although it requires a very slight restriction on the Fourier transform of f to ensure
positive definiteness in the strict sense [Wen05].

THEOREM B.2 A function $f : \mathbb{R}^d \to \mathbb{C}$ that is the (inverse) Fourier transform of a non-negative, finite Borel measure $\mu$ is positive definite in the strict sense if there exists an open set $E \subseteq \mathbb{R}^d$ such that $\mu(E) > 0$.

Proof Let $g(\boldsymbol{x}) \geq 0$ be the (generalized) density associated with $\mu$ such that $\mu(E) = \int_E g(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}$ for any Borel set $E$. We then write $f(\boldsymbol{\omega}) = \int_{\mathbb{R}^d} \mathrm{e}^{-\mathrm{j}\langle\boldsymbol{\omega},\boldsymbol{x}\rangle}\, g(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}$ and perform the same manipulation as for the Gaussian example above, which yields
$$\sum_{m=1}^{N}\sum_{n=1}^{N} \xi_m \overline{\xi_n}\, f(\boldsymbol{\omega}_m - \boldsymbol{\omega}_n) = \int_{\mathbb{R}^d} \Big| \sum_{m=1}^{N} \xi_m \mathrm{e}^{-\mathrm{j}\langle\boldsymbol{\omega}_m,\boldsymbol{x}\rangle} \Big|^2\, g(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x} > 0.$$
The key observation is that the zero set of the sum of exponentials $\sum_{m=1}^{N} \xi_m \mathrm{e}^{-\mathrm{j}\langle\boldsymbol{\omega}_m,\boldsymbol{x}\rangle}$ (which is an entire function) has measure zero. Since the above integral involves positive terms only, the only possibility for it to be vanishing is that $g$ be identically zero on the complement of this zero set, which contradicts the assumption on the existence of $E$.
In particular, the latter constraint is verified whenever $f(\boldsymbol{\omega}) = \mathcal{F}\{g\}(\boldsymbol{\omega})$, where $g$ is a continuous, non-negative function with a bounded Lebesgue integral; i.e., $0 < \int_{\mathbb{R}^d} g(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x} < +\infty$. This kind of result is highly relevant to approximation and learning theory: indeed, the choice of a strictly positive definite interpolation kernel (or radial basis function) ensures that the solution of the generic scattered-data interpolation problem is well defined and unique, no matter how the data centers are distributed [Mic86]. Here too, the prototypical example of a valid kernel is the Gaussian, which is (strictly) positive definite.
There is also an extension of Bochner's theorem for generalized functions that is due to Laurent Schwartz. In a nutshell, the idea is to replace each finite sum $\sum_{n=1}^{N} \xi_n f(\boldsymbol{\omega} - \boldsymbol{\omega}_n)$ by an infinite one (integral) $\int_{\mathbb{R}^d} \varphi(\boldsymbol{\omega}')\, f(\boldsymbol{\omega} - \boldsymbol{\omega}')\,\mathrm{d}\boldsymbol{\omega}' = \int_{\mathbb{R}^d} \varphi(\boldsymbol{\omega} - \boldsymbol{\omega}')\, f(\boldsymbol{\omega}')\,\mathrm{d}\boldsymbol{\omega}' = \langle f, \varphi(\cdot - \boldsymbol{\omega}) \rangle$, which amounts to considering appropriate linear functionals of $f$ over Schwartz' class of test functions $\mathcal{S}(\mathbb{R}^d)$. In doing so, the double sum in Definition B.1 collapses into a scalar product between $f$ and the autocorrelation function of the test function $\varphi \in \mathcal{S}(\mathbb{R}^d)$, the latter being written as
$$(\varphi * \varphi^{\vee})(\boldsymbol{\omega}) = \int_{\mathbb{R}^d} \varphi(\boldsymbol{\omega}')\, \overline{\varphi(\boldsymbol{\omega}' - \boldsymbol{\omega})}\,\mathrm{d}\boldsymbol{\omega}'.$$

DEFINITION B.2 A generalized function $f \in \mathcal{S}'(\mathbb{R}^d)$ is said to be positive–definite if and only if, for all $\varphi \in \mathcal{S}(\mathbb{R}^d)$,
$$\langle f, (\varphi * \varphi^{\vee}) \rangle \geq 0.$$
It can be shown that this is equivalent to Definition B.1 in the case where $f(\boldsymbol{\omega})$ is continuous.
THEOREM B.3 (Schwartz–Bochner theorem) A generalized function $f \in \mathcal{S}'(\mathbb{R}^d)$ is positive definite if and only if it is the generalized Fourier transform of a non-negative tempered measure $\mu$; that is,
$$\langle f, \widehat{\varphi} \rangle = \langle \widehat{f}, \varphi \rangle = \int_{\mathbb{R}^d} \varphi(\boldsymbol{x})\, \mu(\mathrm{d}\boldsymbol{x}).$$

The term "tempered measure" refers to a generic type of mildly singular generalized function that can be defined by the Lebesgue integral $\int_{\mathbb{R}^d} \varphi(\boldsymbol{x})\,\mu(\mathrm{d}\boldsymbol{x}) < \infty$ for all $\varphi \in \mathcal{S}(\mathbb{R}^d)$. Such measures are allowed to exhibit polynomial growth at infinity subject to the restriction that they remain finite on any compact set.
The fact that the above form implies positive definiteness can be verified by direct substitution and application of Parseval's relation, by which we obtain
$$\langle f, (\varphi * \varphi^{\vee}) \rangle = \frac{1}{(2\pi)^d}\, \big\langle \widehat{f}, |\widehat{\varphi}|^2 \big\rangle = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\widehat{\varphi}(\boldsymbol{x})|^2\, \mu(\mathrm{d}\boldsymbol{x}) \geq 0,$$
where the summability property against $\mathcal{S}(\mathbb{R}^d)$ ensures that the integral is convergent (since $|\widehat{\varphi}(\boldsymbol{x})|^2$ is rapidly decreasing).
The improvement over Theorem B.1 is that μ(Rd ) is no longer constrained to be
finite. While this extension is of no direct help for the specification of characteristic
functions, it happens to be quite useful for the definition of spline-like interpolation
kernels that result in well-posed data fitting/approximation problems. We also note that
the above definitions and results generalize to the infinite-dimensional setting (e.g., the
Minlos–Bochner theorem which involves measures over topological vector spaces).

B.2 Conditionally positive–definite functions

D E F I N I T I O N B.3 A continuous, complex-valued function $f$ of the vector variable $\boldsymbol{\omega} \in \mathbb{R}^d$ is said to be conditionally positive–definite of (integer) order $k \geq 0$ if and only if
$$\sum_{m=1}^{N} \sum_{n=1}^{N} \xi_m \overline{\xi_n}\, f(\boldsymbol{\omega}_m - \boldsymbol{\omega}_n) \geq 0$$
under the condition
$$\sum_{n=1}^{N} \xi_n\, p(\boldsymbol{\omega}_n) = 0, \quad \text{for all } p \in \Pi_{k-1}(\mathbb{R}^d),$$
for all possible choices of $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_N \in \mathbb{R}^d$, $\xi_1, \ldots, \xi_N \in \mathbb{C}$, and $N \in \mathbb{N}$, where $\Pi_{k-1}(\mathbb{R}^d)$ denotes the space of multidimensional polynomials of degree $(k-1)$.

This definition is also extendable to generalized functions using the line of thought that leads to Definition B.2. To keep the presentation reasonably simple and to make the link with the definition of the Lévy exponents in Section 4.2, we now focus on the 1-D case ($d = 1$). Specifically, we consider the polynomial constraint $\sum_{n=1}^{N} \xi_n \omega_n^m = 0$, $m \in \{0, \ldots, k-1\}$, and derive the generic form of conditionally positive definite generalized functions of order $k$, including the continuous ones which are of greatest interest to us.
The distributional counterpart of the $k$th-order constraint for $d = 1$ is the orthogonality condition $\int_{\mathbb{R}} \varphi(\omega)\, \omega^m \,\mathrm{d}\omega = 0$ for $m \in \{0, \ldots, k-1\}$. It is enforced by restricting the analysis to the class of test functions whose moments up to order $(k-1)$ are vanishing. Without loss of generality, this is equivalent to considering some alternative test function $\mathrm{D}^k \varphi = \varphi^{(k)}$, where $\mathrm{D}^k$ is the $k$th derivative operator.

DEFINITION B.4 A generalized function $f \in \mathcal{S}'(\mathbb{R})$ is said to be conditionally positive definite of order $k$ if and only if, for all $\varphi \in \mathcal{S}(\mathbb{R})$,
$$\big\langle f, \big(\varphi^{(k)} * \varphi^{(k)\vee}\big) \big\rangle = \big\langle f, (-1)^k \mathrm{D}^{2k} (\varphi * \varphi^{\vee}) \big\rangle \geq 0.$$

This extended definition allows for the derivation of the corresponding version
of Bochner’s theorem which provides an explicit characterization of the family of
conditionally positive–definite generalized functions, together with their generalized Fourier transform.

THEOREM B.4 (Gelfand–Vilenkin) A generalized function $f \in \mathcal{S}'(\mathbb{R})$ is conditionally positive definite of order $k$ if and only if it admits the following representation over $\mathcal{S}(\mathbb{R})$:
$$\langle f, \widehat{\varphi} \rangle = \langle \widehat{f}, \varphi \rangle = \int_{\mathbb{R}\backslash\{0\}} \Big( \varphi(x) - r(x) \sum_{n=0}^{2k-1} \frac{\varphi^{(n)}(0)}{n!}\, x^n \Big)\, \mu(\mathrm{d}x) + \sum_{n=0}^{2k} a_n \frac{\varphi^{(n)}(0)}{n!}, \tag{B.1}$$
where $\mu$ is a positive tempered Borel measure on $\mathbb{R}\backslash\{0\}$ satisfying
$$\int_{|x|<1} |x|^{2k}\, \mu(\mathrm{d}x) < \infty.$$
Here, $r(x)$ is a function in $\mathcal{S}(\mathbb{R})$ such that $(r(x)-1)$ has a zero of order $(2k+1)$ at $x = 0$, while the $a_n$ are appropriate real-valued constants with the constraint that $a_{2k} \geq 0$.

Below, we provide a slightly adapted version of Gelfand and Vilenkin's proof, which is remarkably concise and quite illuminating [GV64, Theorem 1, p. 178], at least if one compares it with the standard derivation of the Lévy–Khintchine formula, which has a much more technical flavor (see [Sat94]) and is ultimately less general.
Proof Since $\big\langle f, (-1)^k \mathrm{D}^{2k}(\varphi * \varphi^{\vee}) \big\rangle = \big\langle (-1)^k \mathrm{D}^{2k} f, (\varphi * \varphi^{\vee}) \big\rangle$, we interpret Definition B.4 as the property that $(-1)^k \mathrm{D}^{2k} f$ is positive definite. By the Schwartz–Bochner theorem, this is equivalent to the existence of a tempered measure $\nu$ such that
$$\big\langle (-1)^k \mathrm{D}^{2k} f, \widehat{\varphi} \big\rangle = \big\langle f, (-1)^k \mathrm{D}^{2k} \widehat{\varphi} \big\rangle = \big\langle f, \widehat{x^{2k}\varphi} \big\rangle = \int_{\mathbb{R}} \varphi(x)\, \nu(\mathrm{d}x).$$
By defining $\phi(x) = x^{2k}\varphi(x)$, this can be rewritten as
$$\langle f, \widehat{\phi} \rangle = \int_{\mathbb{R}} \frac{\phi(x)}{x^{2k}}\, \nu(\mathrm{d}x) = \langle \widehat{f}, \phi \rangle,$$
where $\phi$ is a test function that has a zero of order $2k$ at the origin. In particular, this implies that $\lim_{\epsilon \downarrow 0} \int_{|x|<\epsilon} \frac{\phi(x)}{x^{2k}}\, \nu(\mathrm{d}x) = \frac{\phi^{(2k)}(0)}{(2k)!}\, a_{2k}$, where $a_{2k} \geq 0$ is the $\nu$-measure at the point $x = 0$. Introducing the new measure $\mu(\mathrm{d}x) = \nu(\mathrm{d}x)/x^{2k}$, we then decompose the Lebesgue integral as
$$\langle \widehat{f}, \phi \rangle = \int_{\mathbb{R}\backslash\{0\}} \phi(x)\, \mu(\mathrm{d}x) + \frac{\phi^{(2k)}(0)}{(2k)!}\, a_{2k}, \tag{B.2}$$
which specifies $\widehat{f}$ on the subset of test functions that have a $2k$th-order zero at the origin.
To extend the representation to the whole space $\mathcal{S}(\mathbb{R})$, we associate with every $\varphi \in \mathcal{S}(\mathbb{R})$ the corrected function
$$\phi_{\mathrm{c}}(x) = \varphi(x) - r(x) \sum_{n=0}^{2k-1} \frac{\varphi^{(n)}(0)}{n!}\, x^n, \tag{B.3}$$

with $r(x)$ as specified in the statement of the theorem. By construction, $\phi_{\mathrm{c}} \in \mathcal{S}(\mathbb{R})$ and has the $2k$th-order zero that is required for (B.2) to be applicable. By combining (B.2) and (B.3), we find that
$$\langle \widehat{f}, \varphi \rangle = \int_{\mathbb{R}\backslash\{0\}} \phi_{\mathrm{c}}(x)\, \mu(\mathrm{d}x) + \frac{\phi_{\mathrm{c}}^{(2k)}(0)}{(2k)!}\, a_{2k} + \sum_{n=0}^{2k-1} \frac{\varphi^{(n)}(0)}{n!}\, \langle \widehat{f}, r(x)\, x^n \rangle.$$
Next, we identify the constants $a_n = \langle \widehat{f}, r(x)\, x^n \rangle$ and note that $\phi_{\mathrm{c}}^{(2k)}(0) = \varphi^{(2k)}(0)$. The final step is to substitute these together with the expression (B.3) of $\phi_{\mathrm{c}}$ in the above formula, which yields the desired result.
To prove the sufficiency of the representation, we apply (B.1) to evaluate the functional
$$\big\langle f, \big(\widehat{\varphi}^{(k)} * \widehat{\varphi}^{(k)\vee}\big) \big\rangle = \big\langle \widehat{f}, x^{2k}|\varphi(x)|^2 \big\rangle = \int_{\mathbb{R}} x^{2k}|\varphi(x)|^2\, \mu(\mathrm{d}x) + a_{2k}\,|\varphi(0)|^2 \geq 0,$$
where we have used the property that the derivatives of $x^{2k}|\varphi(x)|^2$ are all vanishing at the origin, except the one of order $2k$, which equals $(2k)!\,|\varphi(0)|^2$ at $x = 0$.

It is important to note that the choice of the function $r$ is arbitrary as long as it fulfills the boundary condition $r(x) = 1 + O(|x|^{2k+1})$ as $x \to 0$, so as to regularize the potential $k$th-order singularity of $\mu$ at the origin, and that it decays sufficiently fast to temper the Taylor-series correction in (B.3) at infinity. If we compare the effect of using two different tempering functions $r_1$ and $r_2$, the modification is only in the value of the constants $a_n$, with $a_{n,2} - a_{n,1} = \big\langle \widehat{f}, \big(r_2(x) - r_1(x)\big) x^n \big\rangle$. Another way of putting this is that the corresponding distributions $\widehat{f}_1$ and $\widehat{f}_2$ specified by the leading integral in (B.1) will only differ by a $(2k-1)$th-order point distribution that is entirely localized at $x = 0$; that is, $\widehat{f}_2(x) - \widehat{f}_1(x) = \sum_{n=0}^{2k-1} \frac{a_{n,2} - a_{n,1}}{n!}\, \delta^{(n)}(x)$, owing to the property that $a_{2k}$ is common to both scenarios, or, equivalently, that the difference of their inverse Fourier transforms $f_1$ and $f_2$ is a polynomial of degree $(2k-1)$.
Thanks to Theorem B.4, it is also possible to derive an integral representation that is
the kth-order generalization of the Lévy–Khintchine formula. For a detailed treatment of
the multidimensional version of the problem, we refer to the works of Madych, Nelson,
and Sun [MN90a, Sun93].

C O R O L L A R Y B.5 Let $f(\omega)$ be a continuous function of $\omega \in \mathbb{R}$. Then, $f$ is conditionally positive–definite of order $k$ if and only if it can be represented as
$$f(\omega) = \frac{1}{2\pi}\left( \int_{\mathbb{R}\backslash\{0\}} \Big( \mathrm{e}^{\mathrm{j}\omega x} - r(x) \sum_{n=0}^{2k-1} \frac{(\mathrm{j}\omega x)^n}{n!} \Big)\, \mu(\mathrm{d}x) + \sum_{n=0}^{2k} a_n \frac{(\mathrm{j}\omega)^n}{n!} \right),$$
where $\mu$ is a positive Borel measure on $\mathbb{R}\backslash\{0\}$ satisfying
$$\int_{\mathbb{R}} \min\big(|x|^{2k}, 1\big)\, \mu(\mathrm{d}x) < \infty,$$
and where $r(x)$ and the $a_n$ are as in Theorem B.4.



The result is obtained by plugging $\varphi(x) = \frac{1}{2\pi}\,\mathrm{e}^{\mathrm{j}\omega x} \longleftrightarrow \widehat{\varphi}(\cdot) = \delta(\cdot - \omega)$ into (B.1), which is justifiable using a continuity argument. The key is that the corresponding integral is bounded when $\mu$ satisfies the admissibility condition, which ensures the continuity of $f(\omega)$ (by Lebesgue's dominated-convergence theorem), and vice versa.

B.3 Lévy–Khintchine formula from the point of view of generalized functions

We now make the link with the Lévy–Khintchine theorem of statisticians (see Section 4.2.1), which is equivalent to characterizing the functions that are conditionally positive definite of order one. To that end, we rewrite the formula in Corollary B.5 for $k = 1$ under the additional constraint that $f_1(0) = 0$ (which fixes the value of $a_0$) as
$$f_1(\omega) = \frac{1}{2\pi}\left( a_0 + a_1\mathrm{j}\omega - \frac{a_2}{2}\omega^2 + \int_{\mathbb{R}\backslash\{0\}} \big( \mathrm{e}^{\mathrm{j}\omega x} - r(x) - r(x)\,\mathrm{j}\omega x \big)\, \mu(\mathrm{d}x) \right)$$
$$= b_1\mathrm{j}\omega - \frac{b_2}{2}\omega^2 + \int_{\mathbb{R}\backslash\{0\}} \big( \mathrm{e}^{\mathrm{j}\omega x} - 1 - r(x)\,\mathrm{j}\omega x \big)\, v(x)\,\mathrm{d}x,$$
where $v(x)\,\mathrm{d}x = \frac{1}{2\pi}\,\mu(\mathrm{d}x)$, $b_n = \frac{1}{2\pi}\,a_n$, $r(x) = 1 + O(|x|^3)$ as $x \to 0$, and $\lim_{x\to\pm\infty} r(x) = 0$. Clearly, the new form is equivalent to the Lévy–Khintchine formula (4.3), with the slight difference that the bias compensation is achieved by using a bell-shaped, infinitely differentiable function $r$ instead of the rectangular window $\mathbb{1}_{|x|<1}(x)$.
Likewise, we are able to transcribe the generalized Fourier-transform-pair relation (B.1) for the Lévy–Khintchine representation (4.3), which yields
$$\langle f_{\mathrm{L-K}}, \widehat{\varphi} \rangle = \langle \widehat{f}_{\mathrm{L-K}}, \varphi \rangle = \int_{\mathbb{R}\backslash\{0\}} \big( \varphi(x) - \varphi(0) - x\,\mathbb{1}_{|x|<1}(x)\,\varphi^{(1)}(0) \big)\, v(x)\,\mathrm{d}x + b_1\,\varphi^{(1)}(0) + \frac{b_2}{2}\,\varphi^{(2)}(0). \tag{B.4}$$
The interest of (B.4) is that it uniquely specifies the generalized Fourier transform of a Lévy exponent $f_{\mathrm{L-K}}$ as a linear functional of $\varphi$.
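For an integrable (compound-Poisson) Lévy density, this machinery can be verified by direct quadrature. In the sketch below (our own; the Gaussian amplitude law and the rate λ are illustrative assumptions), the bias-correction term plays no role because the chosen $v$ is symmetric, and the exponent reduces to the closed form $\lambda(\mathrm{e}^{-\omega^2/2}-1)$.

```python
# Numerical check (ours) of the Levy-Khintchine exponent for an integrable,
# symmetric Levy density v = lam * N(0,1); the jwx correction vanishes by parity.
import numpy as np
from scipy.integrate import quad

lam = 2.0
v = lambda x: lam * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def exponent(w):
    # real part of int (exp(jwx) - 1) v(x) dx; the imaginary part is zero here
    return quad(lambda x: (np.cos(w * x) - 1.0) * v(x), -np.inf, np.inf)[0]

for w in [0.5, 1.0, 3.0]:
    print(exponent(w), lam * (np.exp(-w**2 / 2) - 1.0))  # the two columns agree
```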
We can also give a "time-domain" (or pointwise) interpretation of this result by characterizing the generalized function
$$g(x) = \frac{1}{2\pi}\,\mathcal{F}\{f_{\mathrm{L-K}}\}(x) = \mathrm{G}\{\delta\}(x),$$
which also represents the impulse response of the infinitesimal semigroup generator $\mathrm{G}$ investigated in Section 9.7. This is achieved by distinguishing between three cases:

(1) Lebesgue-integrable Lévy density
In this simpler scenario, we are able to split the leading integral in (B.4) into its subparts, which results in
$$g(x) = v(x) - \delta(x) \int_{\mathbb{R}} v(a)\,\mathrm{d}a - \delta'(x)\Big( b_1 - \int_{|a|<1} a\, v(a)\,\mathrm{d}a \Big) + \frac{b_2}{2}\,\delta''(x).$$

The underlying principle is that the so-defined generalized function will result in the same measurements as (B.4) when applied to the test function $\varphi$. In particular, the values of $\varphi^{(n)}$ at the origin are sampled using the Dirac distribution and its derivatives in accordance with the rule $\langle \varphi, \delta^{(n)} \rangle = (-1)^n \varphi^{(n)}(0)$.

(2) Non-integrable Lévy density with finite absolute moment
To ensure that the integral in (B.4) is convergent, we need to retain the zero-order correction. Yet, when $\int_{\mathbb{R}} |a|\, v(a)\,\mathrm{d}a < \infty$, we can still pull out the third term, which results in the interpretation
$$g(x) = \mathrm{p.f.}\{v\}(x) - \delta'(x)\Big( b_1 - \int_{|a|<1} a\, v(a)\,\mathrm{d}a \Big) + \frac{b_2}{2}\,\delta''(x),$$
where $\mathrm{p.f.}$ stands for the finite-part operator that implicitly implements the Taylor-series adjustment that stabilizes the scalar-product integral $\langle v, \varphi \rangle$ (see Appendix A).

(3) Non-integrable Lévy density and unbounded absolute moment
Here, we cannot split the integral anymore. However, when $\int_{|a|>1} |a|\, v(a)\,\mathrm{d}a < \infty$, we can stabilize the integral by applying a full first-order Taylor-series correction. This leads to the finite-part interpretation
$$g(x) = \mathrm{p.f.}\{v\}(x) - b_1\,\delta'(x) + \frac{b_2}{2}\,\delta''(x),$$
which is the direct counterpart of (4.5). For $\int_{|a|>1} |a|\, v(a)\,\mathrm{d}a = \infty$, the proper pointwise interpretation becomes more delicate and it is safer to stick to the distributional definition (B.4).
Appendix C Special functions

C.1 Modified Bessel functions

The modified Bessel function of the second kind with order parameter $\alpha \in \mathbb{R}$ admits the Fourier-based representation [AS72]
$$K_\alpha(\omega) = \frac{\Gamma\big(|\alpha|+\frac{1}{2}\big)}{2\sqrt{\pi}}\Big(\frac{2}{\omega}\Big)^{|\alpha|} \int_{\mathbb{R}} \frac{\mathrm{e}^{-\mathrm{j}\omega x}}{(1+x^2)^{|\alpha|+\frac{1}{2}}}\,\mathrm{d}x.$$
It has the property that $K_\alpha(x) = K_{-\alpha}(x)$. A special case of interest is $K_{\frac{1}{2}}(x) = \sqrt{\frac{\pi}{2x}}\,\mathrm{e}^{-x}$.

The small-scale behavior of $K_\alpha(x)$ is $K_\alpha(x) \sim \frac{\Gamma(\alpha)}{2}\big(\frac{2}{x}\big)^{\alpha}$ as $x \to 0$. In order to determine the form of the variance-gamma distribution around the origin, we can rely on the following expansion, which includes a few more terms:
$$K_\alpha(x) = x^{-\alpha}\Big( 2^{\alpha-1}\Gamma(\alpha) - \frac{2^{\alpha-3}\Gamma(\alpha)\,x^2}{\alpha-1} + O\big(x^4\big) \Big) + x^{\alpha}\Big( 2^{-\alpha-1}\Gamma(-\alpha) + \frac{2^{-\alpha-3}\Gamma(-\alpha)\,x^2}{\alpha+1} + O\big(x^4\big) \Big).$$
At the other end of the scale, its asymptotic behavior is
$$K_\alpha(x) \sim \sqrt{\frac{\pi}{2x}}\,\mathrm{e}^{-x} \quad \text{as } x \to +\infty.$$
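Both limiting forms are easy to confirm numerically, for instance with scipy.special.kv (a sketch of ours):

```python
# Numerical confirmation (ours) of the small- and large-argument behavior of K_alpha.
import numpy as np
from scipy.special import gamma, kv

alpha = 0.7
x0, x1 = 1e-3, 30.0
print(kv(alpha, x0), gamma(alpha) / 2 * (2 / x0) ** alpha)     # x -> 0
print(kv(alpha, x1), np.sqrt(np.pi / (2 * x1)) * np.exp(-x1))  # x -> +inf
```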

C.2 Gamma function

Euler’s gamma function constitutes an analytic extension of the factorial function n! =


(n + 1) to the complex plane. It is defined by the integral
+∞
(z) = tz−1 e−t dt,
0

which is convergent for Re(z) > 0. Specific values are (1) = 1 and (1/2) = π. The
gamma function satisfies the functional equation

(z + 1) = z(z), (C.1)



which is compatible with the recursive definition of the factorial $n! = n\,(n-1)!$. Another useful result is Euler's reflection formula,
$$\Gamma(1-z)\,\Gamma(z) = \frac{\pi}{\sin(\pi z)}.$$
By combining the above with (C.1), we obtain
$$\operatorname{sinc}(z) = \frac{\sin(\pi z)}{\pi z} = \frac{1}{\Gamma(1-z)\,\Gamma(1+z)}, \tag{C.2}$$

which makes an intriguing connection with the sinus cardinalis function. There is a similar link with Euler's beta function,
$$\mathrm{B}(z_1, z_2) = \int_0^1 t^{z_1-1}(1-t)^{z_2-1}\,\mathrm{d}t \tag{C.3}$$
$$= \frac{\Gamma(z_1)\,\Gamma(z_2)}{\Gamma(z_1+z_2)},$$
with $\operatorname{Re}(z_1), \operatorname{Re}(z_2) > 0$.
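These identities are convenient to check with standard numerical libraries; the following sketch (ours, using scipy.special) verifies (C.1)-(C.3):

```python
# Quick verification (ours) of (C.1), (C.2), and (C.3) with scipy.special.
import numpy as np
from scipy.special import beta, gamma

z = 0.3 + 0.2j
print(gamma(z + 1), z * gamma(z))                                          # (C.1)
print(np.sin(np.pi * z) / (np.pi * z), 1 / (gamma(1 - z) * gamma(1 + z)))  # (C.2)
print(beta(2.5, 1.7), gamma(2.5) * gamma(1.7) / gamma(2.5 + 1.7))          # (C.3)
```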


$\Gamma(z)$ also admits the well-known product decomposition
$$\Gamma(z) = \frac{\mathrm{e}^{-\gamma_0 z}}{z} \prod_{n=1}^{\infty} \Big(1 + \frac{z}{n}\Big)^{-1} \mathrm{e}^{z/n}, \tag{C.4}$$
where $\gamma_0$ is the Euler–Mascheroni constant. The above allows us to derive the expansion
$$-\log|\Gamma(z)|^2 = 2\gamma_0 \operatorname{Re}(z) + \log|z|^2 + \sum_{n=1}^{\infty} \Big( \log\Big|1 + \frac{z}{n}\Big|^2 - 2\,\frac{\operatorname{Re}(z)}{n} \Big),$$

which is directly applicable to the likelihood function associated with the Meixner dis-
tribution. Also relevant to that context is the integral relation
 r 2  r
  jzx 1
( + jx) e dx = 2π(r)
R 2 2 cosh 2z

for r > 0 and z ∈ C, which can be interpreted as a Fourier transform by setting z = −jω.
Euler’s digamma function is defined as

d   (z)
ψ(z) = log (z) = , (C.5)
dz (z)

while its mth-order derivative

dm+1
ψ (m) (z) = log (z) (C.6)
dzm+1
is called the polygamma function of order m.

C.3 Symmetric-alpha-stable distributions

The S$\alpha$S pdf of degree $\alpha \in (0, 2]$ and scale parameter $s_0$ is best defined via its characteristic function,
$$p(x; \alpha, s_0) = \int_{\mathbb{R}} \mathrm{e}^{-|s_0\omega|^{\alpha}}\, \mathrm{e}^{\mathrm{j}\omega x}\, \frac{\mathrm{d}\omega}{2\pi}.$$
Alpha-stable distributions do not admit closed-form expressions, except for the special cases $\alpha = 1$ (Cauchy) and $\alpha = 2$ (Gauss distribution). Moreover, their absolute moments of order $p$, $\mathbb{E}\{|X|^p\}$, are unbounded for $p > \alpha$, which is characteristic of heavy-tailed distributions. We can relate the (symmetric) $\gamma$th-order moments of their characteristic function to the gamma function by performing the change of variable $t = (s_0\omega)^{\alpha}$, which leads to
$$\int_{\mathbb{R}} |\omega|^{\gamma}\, \mathrm{e}^{-|s_0\omega|^{\alpha}}\,\mathrm{d}\omega = \frac{2\,s_0^{-\gamma-1}}{\alpha} \int_0^{\infty} t^{\frac{\gamma-\alpha+1}{\alpha}}\, \mathrm{e}^{-t}\,\mathrm{d}t = \frac{2\,s_0^{-\gamma-1}}{\alpha}\, \Gamma\Big(\frac{\gamma+1}{\alpha}\Big). \tag{C.7}$$
By using the correspondence between Fourier-domain moments and time-domain derivatives, we use this result to write the Taylor series of $p(x; \alpha, s_0)$ around $x = 0$ as
$$p(x; \alpha, s_0) = \sum_{k=0}^{\infty} \frac{s_0^{-2k-1}}{\pi\alpha}\, \Gamma\Big(\frac{2k+1}{\alpha}\Big)\, (-1)^k\, \frac{|x|^{2k}}{(2k)!}, \tag{C.8}$$

which involves even terms only (because of symmetry). The moment formula (C.7) also
yields a simple expression for the slope of the score at the origin, which is given by
" #
p (0)  α3
X (0) = − X = " #.
pX (0) s2  1
0 α

Similar techniques are applicable to obtain the asymptotic form of $p(x; \alpha, s_0)$ as $x$ tends to infinity [Ber52, TN95]. To characterize the tail behavior, it is sufficient to consider the first term of the asymptotic expansion
$$p(x; \alpha, s_0) \sim \frac{1}{\pi}\, \Gamma(\alpha+1)\, \sin\Big(\frac{\pi\alpha}{2}\Big)\, \frac{s_0^{\alpha}}{|x|^{\alpha+1}} \quad \text{as } x \to \pm\infty, \tag{C.9}$$
which emphasizes the algebraic decay of order $(\alpha+1)$ at infinity.
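Since the defining integral is a one-dimensional Fourier inversion, the S$\alpha$S pdf can be evaluated by direct quadrature. The sketch below (ours) does so and checks the closed-form Cauchy case $\alpha = 1$, for which $p(x; 1, s_0) = \frac{s_0}{\pi(x^2 + s_0^2)}$.

```python
# SaS pdf by numerical Fourier inversion (ours), checked against the Cauchy case.
import numpy as np
from scipy.integrate import quad

def sas_pdf(x, alpha, s0):
    # even integrand: integrate cos(w x) exp(-(s0 w)^alpha) over [0, inf)
    return quad(lambda w: np.exp(-((s0 * w) ** alpha)) * np.cos(w * x),
                0, np.inf, limit=200)[0] / np.pi

s0 = 1.5
for x in [0.0, 1.0, 4.0]:
    print(sas_pdf(x, 1.0, s0), s0 / (np.pi * (x**2 + s0**2)))  # columns agree
```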
References

[ABDF11] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, An augmented Lagrangian approach to the constrained optimization formulation of imaging
inverse problems, IEEE Transactions on Image Processing 20 (2011), no. 3,
681–695.
[ABT01] A. Achim, A. Bezerianos, and P. Tsakalides, Novel Bayesian multiscale method
for speckle removal in medical ultrasound images, IEEE Transactions on
Medical Imaging 20 (2001), no. 8, 772–783.
[AG01] A. Aldroubi and K. Gröchenig, Nonuniform sampling and reconstruction in
shift-invariant spaces, SIAM Review 43 (2001), 585–620.
[Ahm74] N. Ahmed, Discrete cosine transform, IEEE Transactions on Communications
23 (1974), no. 1, 90–93.
[AKBU13] A. Amini, U. S. Kamilov, E. Bostan, and M. Unser, Bayesian estimation for
continuous-time sparse stochastic processes, IEEE Transactions on Signal Pro-
cessing 61 (2013), no. 4, 907–920.
[Ald95] A. Aldroubi, Portraits of frames, Proceedings of the American Mathematical
Society 123 (1995), no. 6, 1661–1668.
[App09] D. Appelbaum, Lévy Processes and Stochastic Calculus, 2nd edn., Cambridge
University Press, 2009.
[AS72] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions,
National Bureau of Standards, 1972.
[AS83] D. A. Agard and J. W. Sedat, Three-dimensional architecture of a polytene
nucleus, Nature 302 (1983), no. 5910, 676–681.
[ASS98] F. Abramovich, T. Sapatinas, and B. W. Silverman, Wavelet thresholding via a
Bayesian approach, Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 60 (1998), no. 4, 725–749.
[ATB03] A. Achim, P. Tsakalides, and A. Bezerianos, SAR image denoising via Baye-
sian wavelet shrinkage based on heavy-tailed modeling, IEEE Transactions on
Geoscience and Remote Sensing 41 (2003), no. 8, 1773–1784.
[AU94] A. Aldroubi and M. Unser, Sampling procedures in function spaces and asymp-
totic equivalence with Shannon’s sampling theory, Numerical Functional Ana-
lysis and Optimization 15 (1994), no. 1–2, 1–21.
[AU14] A. Amini and M. Unser, Sparsity and infinite divisibility, IEEE Transactions on
Information Theory 60 (2014), no. 4, 2346–2358.
[AUM11] A. Amini, M. Unser, and F. Marvasti, Compressibility of deterministic and ran-
dom infinite sequences, IEEE Transactions on Signal Processing 59 (2011),
no. 11, 5193–5201.
[Bar61] M. S. Bartlett, An Introduction to Stochastic Processes, with Special Reference to Methods and Applications, Cambridge University Press, 1961.
[BAS07] M. I. H. Bhuiyan, M. O. Ahmad, and M. N. S. Swamy, Spatially adaptive
wavelet-based method using the Cauchy prior for denoising the SAR images,
IEEE Transactions on Circuits and Systems for Video Technology 17 (2007),
no. 4, 500–507.
[BDE09] A. M. Bruckstein, D. L. Donoho, and M. Elad, From sparse solutions of systems
of equations to sparse modeling of signals and images, SIAM Review 51 (2009),
no. 1, 34–81.
[BDF07] J. M. Bioucas-Dias and M. A. T. Figueiredo, A new twist: Two-step iterative
shrinkage/thresholding algorithms for image restoration, IEEE Transactions on
Image Processing 16 (2007), no. 12, 2992–3004.
[BDR02] A. Bose, A. Dasgupta, and H. Rubin, A contemporary review and bibliography
of infinitely divisible distributions and processes, Sankhya: The Indian Journal
of Statistics: Series A 64 (2002), no. 3, 763–819.
[Ber52] H. Bergström, On some expansions of stable distributions, Arkiv für Mathema-
tik 2 (1952), no. 18, 375–378.
[BF06] L. Boubchir and J. M. Fadili, A closed-form nonparametric Bayesian estima-
tor in the wavelet domain of images using an approximate alpha-stable prior,
Pattern Recognition Letters 27 (2006), no. 12, 1370–1382.
[BH10] P. J. Brockwell and J. Hannig, CARMA(p,q) generalized random processes,
Journal of Statistical Planning and Inference 140 (2010), no. 12, 3613–3618.
[BK51] M. S. Bartlett and D. G. Kendall, On the use of the characteristic functional
in the analysis of some stochastic processes occurring in physics and biology,
Mathematical Proceedings of the Cambridge Philosophical Society 47 (1951),
65–76.
[BKNU13] E. Bostan, U. S. Kamilov, M. Nilchian, and M. Unser, Sparse stochastic pro-
cesses and discretization of linear inverse problems, IEEE Transactions on
Image Processing 22 (2013), no. 7, 2699–2710.
[BM02] P. Brémaud and L. Massoulié, Power spectra of general shot noises and Hawkes
point processes with a random excitation, Advances in Applied Probability 34
(2002), no. 1, 205–222.
[BNS01] O. E. Barndorff-Nielsen and N. Shephard, Non-Gaussian Ornstein-Uhlenbeck-
based models and some of their uses in financial economics, Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 63 (2001), no. 2,
167–241.
[Boc32] S. Bochner, Vorlesungen über Fouriersche Integrale, Akademische Verlagsge-
sellschaft, 1932.
[Boc47] Stochastic processes, Annals of Mathematics 48 (1947), no. 4, 1014–1061.
[Bog07] V. I. Bogachev, Measure Theory, Vol. I, II, Springer, 2007.
[BPCE10] S. Boyd, N. Parikh, E. Chu, and J. Eckstein, Distributed optimization and sta-
tistical learning via the alternating direction method of multipliers, Information
Systems Journal 3 (2010), no. 1, 1–122.
[Bro01] P. J. Brockwell, Lévy-driven CARMA processes, Annals of the Institute of Sta-
tistical Mathematics 53 (2001), 113–124.
[BS50] H. W. Bode and C. E. Shannon, A simplified derivation of linear least square
smoothing and prediction theory, Proceedings of the IRE 38 (1950), no. 4,
417–425.

[BS93] C. Bouman and K. Sauer, A generalized Gaussian image model for edge-
preserving MAP estimation, IEEE Transactions on Image Processing 2 (1993),
no. 3, 296–310.
[BT09a] A. Beck and M. Teboulle, Fast gradient-based algorithms for constrained
total variation image denoising and deblurring problems, IEEE Transactions
on Image Processing 18 (2009), no. 11, 2419–2434.
[BT09b] A fast iterative shrinkage-thresholding algorithm for linear inverse problems,
SIAM Journal on Imaging Sciences 2 (2009), no. 1, 183–202.
[BTU01] T. Blu, P. Thévenaz, and M. Unser, MOMS: Maximal-order interpolation of
minimal support, IEEE Transactions on Image Processing 10 (2001), no. 7,
1069–1080.
[BU03] T. Blu and M. Unser, A complete family of scaling functions: The (α, τ )-
fractional splines, Proceedings of the IEEE International Conference on Acous-
tics, Speech, and Signal Processing (ICASSP’03) (Hong Kong SAR, People’s
Republic of China, April 6–10, 2003), vol. VI, IEEE, pp. 421–424.
[BU07] Self-similarity. Part II: Optimal estimation of fractal processes, IEEE Tran-
sactions on Signal Processing 55 (2007), no. 4, 1364–1378.
[BUF07] K. T. Block, M. Uecker, and J. Frahm, Undersampled radial MRI with multiple
coils. Iterative image reconstruction using a total variation constraint, Magnetic
Resonance in Medicine 57 (2007), no. 6, 1086–1098.
[BY02] B. Bru and M. Yor, Comments on the life and mathematical legacy of Wolfgang
Döblin, Finance and Stochastics 6 (2002), no. 1, 3–47.
[Car99] J. F. Cardoso and D. L. Donoho, Some experiments on independent component
analysis of non-Gaussian processes, Proceedings of the IEEE Signal Processing
Workshop on Higher-Order Statistics (SPW-HOS'99) (Caesarea, Israel, June
14–16, 1999), 1999, pp. 74–77.
[CBFAB97] P. Charbonnier, L. Blanc-Féraud, G. Aubert, and M. Barlaud, Deterministic
edge-preserving regularization in computed imaging, IEEE Transactions on
Image Processing 6 (1997), no. 2, 298–311.
[CD95] R. R. Coifman and D. L. Donoho, Translation-invariant de-noising, Wavelets
and statistics (A. Antoniadis and G. Oppenheim, eds.), Lecture Notes in Statis-
tics, vol. 103, Springer, 1995, pp. 125–150.
[CDLL98] A. Chambolle, R. A. DeVore, N.-Y. Lee, and B. J. Lucier, Nonlinear wave-
let image processing: Variational problems, compression, and noise removal
through wavelet shrinkage, IEEE Transactions on Image Processing 7 (1998),
no. 33, 319–335.
[Chr03] O. Christensen, An Introduction to Frames and Riesz Bases, Birkhäuser, 2003.
[CKM97] H. A. Chipman, E. D. Kolaczyk, and R. E. McCulloch, Adaptive Bayesian
wavelet shrinkage, Journal of the American Statistical Association 92 (1997),
no. 440, 1413–1421.
[CLM+ 95] W. A. Carrington, R. M. Lynch, E. D. Moore, G. Isenberg, K. E. Fogarty, and
F. S. Fay, Superresolution three-dimensional images of fluorescence in cells
with minimal light exposure, Science 268 (1995), no. 5216, 1483–1487.
[CNB98] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, Wavelet-based statistical
signal processing using hidden Markov models, IEEE Transactions on Signal
Processing 46 (1998), no. 4, 886–902.
[Com94] P. Comon, Independent component analysis: A new concept, Signal Processing
36 (1994), no. 3, 287–314.

[CP11] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal proces-
sing, Fixed-Point Algorithms for Inverse Problems in Science and Engineering
(H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and
H. Wolkowicz, eds.), vol. 49, Springer New York, 2011, pp. 185–212.
[Cra40] H. Cramér, On the theory of stationary random processes, The Annals of
Mathematics 41 (1940), no. 1, 215–230.
[CSE00] C. Christopoulos, A. S. Skodras, and T. Ebrahimi, The JPEG2000 still image
coding system: An overview, IEEE Transactions on Consumer Electronics 16
(2000), no. 4, 1103–1127.
[CT04] R. Cont and P. Tankov, Financial Modelling with Jump Processes, Chapman &
Hall, 2004.
[CU10] K. N. Chaudhury and M. Unser, On the shiftability of dual-tree complex
wavelet transforms, IEEE Transactions on Signal Processing 58 (2010), no. 1,
221–232.
[CW91] C. K. Chui and J.-Z. Wang, A cardinal spline approach to wavelets, Proceedings
of the American Mathematical Society 113 (1991), no. 3, 785–793.
[CW08] E. J. Candès and M. B. Wakin, An introduction to compressive sampling, IEEE
Signal Processing Magazine 25 (2008), no. 2, 21–30.
[CYV00] S. G. Chang, B. Yu, and M. Vetterli, Spatially adaptive wavelet thresholding
with context modeling for image denoising, IEEE Transactions on Image Pro-
cessing 9 (2000), no. 9, 1522 –1531.
[Dau88] I. Daubechies, Orthogonal bases of compactly supported wavelets, Communi-
cations on Pure and Applied Mathematics 41 (1988), 909–996.
[Dau92] Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics,
1992.
[dB78] C. de Boor, A Practical Guide to Splines, Springer, 1978.
[dB87] The polynomials in the linear span of integer translates of a compactly sup-
ported function, Constructive Approximation 3 (1987), 199–208.
[dBDR93] C. de Boor, R. A. DeVore, and A. Ron, On the construction of multivariate (pre)
wavelets, Constructive Approximation 9 (1993), 123–166.
[DBFZ+ 06] N. Dey, L. Blanc-Féraud, C. Zimmer, P. Roux, Z. Kam, J.-C. Olivo-Marin, and
J. Zerubia, Richardson-Lucy algorithm with total variation regularization for
3D confocal microscope deconvolution, Microscopy Research and Technique
69 (2006), no. 4, 260–266.
[dBH82] C. de Boor and K. Höllig, B-splines from parallelepipeds, Journal d’Analyse
Mathématique 42 (1982), no. 1, 99–115.
[dBHR93] C. de Boor, K. Höllig, and S. Riemenschneider, Box Splines, Springer, 1993.
[DDDM04] I. Daubechies, M. Defrise, and C. De Mol, An iterative thresholding algorithm
for linear inverse problems with a sparsity constraint, Communications on Pure
and Applied Mathematics 57 (2004), no. 11, 1413–1457.
[DJ94] D. L. Donoho and I. M. Johnstone, Ideal spatial adaptation via wavelet shrin-
kage, Biometrika 81 (1994), 425–455.
[DJ95] Adapting to unknown smoothness via wavelet shrinkage, Journal of the Ame-
rican Statistical Association 90 (1995), no. 432, 1200–1224.
[Don95] D. L. Donoho, De-noising by soft-thresholding, IEEE Transactions on Infor-
mation Theory 41 (1995), no. 3, 613–627.
[Don06] Compressed sensing, IEEE Transactions on Information Theory 52 (2006),
no. 4, 1289–1306.

[Doo37] J. L. Doob, Stochastic processes depending on a continuous parameter, Transactions of the American Mathematical Society 42 (1937), no. 1, 107–140.
[Doo90] Stochastic Processes, John Wiley & Sons, 1990.
[Duc77] J. Duchon, Splines minimizing rotation-invariant semi-norms in Sobolev
spaces, Constructive Theory of Functions of Several Variables (W. Schempp
and K. Zeller, eds.), Springer, 1977, pp. 85–100.
[Ehr54] L. Ehrenpreis, Solutions of some problems of division I, American Jourmal of
Mathematics 76 (1954), 883–903.
[EK00] R. Estrada and R. P. Kanwal, Singular Integral Equations, Birkhäuser, 2000.
[ENU12] A. Entezari, M. Nilchian, and M. Unser, A box spline calculus for the discre-
tization of computed tomography reconstruction problems, IEEE Transactions
on Medical Imaging 31 (2012), no. 8, 1532–1541.
[Fag14] J. Fageot, A. Amini, and M. Unser, On the continuity of characteristic functionals and sparse stochastic modeling, Preprint (2014), arXiv:1401.6850 [math.PR].
[FB05] J. M. Fadili and L. Boubchir, Analytical form for a Bayesian wavelet estima-
tor of images using the Bessel K form densities, IEEE Transactions on Image
Processing 14 (2005), no. 2, 231–240.
[FBD10] M. A. T. Figueiredo and J. Bioucas-Dias, Restoration of Poissonian images
using alternating direction optimization, IEEE Transactions on Image Proces-
sing 19 (2010), no. 12, 3133–3145.
[Fel71] W. Feller, An Introduction to Probability Theory and its Applications, vol. 2,
John Wiley & Sons, 1971.
[Fer67] X. Fernique, Lois indéfiniment divisibles sur l’espace des distributions, Inven-
tiones Mathematicae 3 (1967), no. 4, 282–292.
[Fie87] D. J. Field, Relations between the statistics of natural images and the response
properties of cortical cells, Journal of the Optical Society of America A: Optics
Image Science and Vision 4 (1987), no. 12, 2379–2394.
[Fla89] P. Flandrin, On the spectrum of fractional Brownian motions, IEEE Transac-
tions on Information Theory 35 (1989), no. 1, 197–199.
[Fla92] Wavelet analysis and synthesis of fractional Brownian motion, IEEE Tran-
sactions on Information Theory 38 (1992), no. 2, 910–917.
[FN03] M. A. T. Figueiredo and R. D. Nowak, An EM algorithm for wavelet-based
image restoration, IEEE Transactions on Image Processing 12 (2003), no. 8,
906–916.
[Fou77] J. Fournier, Sharpness in Young’s inequality for convolution, Pacific Journal of
Mathematics 72 (1977), no. 2, 383–397.
[FR08] M. Fornasier and H. Rauhut, Iterative thresholding algorithms, Applied and
Computational Harmonic Analysis 25 (2008), no. 2, 187–208.
[Fra92] J. Frank, Electron Tomography: Three-Dimensional Imaging with the Trans-
mission Electron Microscope, Springer, 1992.
[Fre00] D. H. Fremlin, Measure Theory, Vol. 1, Torres Fremlin, 2000.
[Fre01] Measure Theory, Vol. 2, Torres Fremlin, 2001.
[Fre02] Measure Theory, Vol. 3, Torres Fremlin, 2002.
[Fre03] Measure Theory, Vol. 4, Torres Fremlin, 2003.
[Fre08] Measure Theory, Vol. 5, Torres Fremlin, 2008.
[GBH70] R. Gordon, R. Bender, and G. T. Herman, Algebraic reconstruction techniques
(ART) for three-dimensional electron microscopy and X-ray photography, Jour-
nal of Theoretical Biology 29 (1970), no. 3, 471–481.
352 References

[Gel55] I. M. Gelfand, Generalized random processes, Doklady Akademii Nauk SSSR


100 (1955), no. 5, 853–856, in Russian.
[GG84] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images, IEEE Transactions on Pattern Analysis and
Machine Intelligence 6 (1984), no. 6, 721–741.
[GK68] B. V. Gnedenko and A. N. Kolmogorov, Limit Distributions for Sums of Inde-
pendent Random Variables, Addison-Wesley, 1968.
[GKHPU11] M. Guerquin-Kern, M. Häberlin, K. P. Pruessmann, and M. Unser, A fast
wavelet-based reconstruction method for magnetic resonance imaging, IEEE
Transactions on Medical Imaging 30 (2011), no. 9, 1649–1660.
[GR92] D. Geman and G. Reynolds, Constrained restoration and the recovery of dis-
continuities, IEEE Transactions on Pattern Analysis and Machine Intelligence
14 (1992), no. 3, 367–383.
[Gra08] L. Grafakos, Classical Fourier Analysis, Springer, 2008.
[Gri11] R. Gribonval, Should penalized least squares regression be interpreted as maxi-
mum a posteriori estimation?, IEEE Transactions on Signal Processing 59
(2011), no. 5, 2405–2410.
[Gro55] A. Grothendieck, Produits tensoriels topologiques et espaces nucléaires,
Memoirs of the American Mathematical Society 16 (1955).
[GS64] I. M. Gelfand and G. Shilov, Generalized Functions, Vol. 1: Properties and
Operations, Academic Press, 1964.
[GS68] Generalized Functions, Vol. 2: Spaces of Fundamental and Generalized
Functions, Academic Press, New York, 1968.
[GS01] U. Grenander and A. Srivastava, Probability models for clutter in natural
images, IEEE Transactions on Pattern Analysis and Machine Intelligence 23
(2001), no. 4, 424–429.
[GV64] I. M. Gelfand and N. Ya. Vilenkin, Generalized Functions, Vol. 4: Applications
of Harmonic Analysis, Academic Press, New York, 1964.
[Haa10] A. Haar, Zur Theorie der orthogonalen Funktionensysteme, Mathematische
Annalen 69 (1910), no. 3, 331–371.
[Hea71] O. Heaviside, Electromagnetic Theory: Including an Account of Heaviside’s
Unpublished Notes for a Fourth Volume, Chelsea Publishing, 1971.
[HHLL79] G. T. Herman, H. Hurwitz, A. Lent, and H. P. Lung, On the Bayesian
approach to image reconstruction, Information and Control 42 (1979), no. 1,
60–71.
[HKPS93] T. Hida, H.-H. Kuo, J. Potthoff, and L. Streit, White Noise: An Infinite Dimen-
sional Calculus, Kluver, 1993.
[HK70] A. E. Hoerl and R. W. Kennard, Ridge regression: Biased estimation for non-
orthogonal problems, Technometrics 12 (1970), no. 1, 55–67.
[HL76] G. T. Herman and A. Lent, Quadratic optimization for image reconstruction. I,
Computer Graphics and Image Processing 5 (1976), no. 3, 319–332.
[HO00] A. Hyvärinen and E. Oja, Independent component analysis: Algorithms and
applications, Neural Networks 13 (2000), no. 4, 411–430.
[Hör80] L. Hörmander, The Analysis of Linear Differential Operators I: Distribution
Theory and Fourier Analysis, 2nd edn., Springer, 1980.
[Hör05] The Analysis of Linear Partial Differential Operators II. Differential Opera-
tors with Constant Coefficients. Classics in mathematics, Springer, 2005.
References 353

[HP76] M. Hamidi and J. Pearl, Comparison of the cosine and Fourier transforms of
Markov-1 signals, IEEE Transactions on Acoustics, Speech and Signal Proces-
sing 24 (1976), no. 5, 428–429.
[HS04] T. Hida and S. Si, An Innovation Approach to Random Fields: Application of
White Noise Theory, World Scientific, 2004.
[HS08] Lectures on White Noise Functionals, World Scientific, 2008.
[HSS08] T. Hofmann, B. Schölkopf, and A. J. Smola, Kernel methods in machine learn-
ing, Annals of Statistics 36 (2008), no. 3, 1171–1220.
[HT02] D. W. Holdsworth and M. M. Thornton, Micro-CT in small animal and speci-
men imaging, Trends in Biotechnology 20 (2002), no. 8, S34–S39.
[Hun77] B. R. Hunt, Bayesian methods in nonlinear digital image restoration, IEEE
Transactions on Computers C-26 (1977), no. 3, 219–229.
[HW85] K. M. Hanson and G. W. Wecksung, Local basis-function approach to computed
tomography, Applied Optics 24 (1985), no. 23, 4028–4039.
[HY00] M. Hansen and B. Yu, Wavelet thresholding via MDL for natural images, IEEE
Transactions on Information Theory 46 (2000), no. 5, 1778 –1788.
[Itô54] K. Itô, Stationary random distributions, Kyoto Journal of Mathematics 28
(1954), no. 3, 209–223.
[Itô84] Foundations of Stochastic Differential Equations in Infinite-Dimensional
Spaces, CBMS-NSF Regional Conference Series in Applied Mathematics,
vol. 47, Society for Industrial and Applied Mathematics (SIAM), 1984.
[Jac01] N. Jacob, Pseudo Differential Operators & Markov Processes, Vol. 1: Fourier
Analysis and Semigroups, World Scientific, 2001.
[Jai79] A. K. Jain, A sinusoidal family of unitary transforms, IEEE Transactions on
Pattern Analysis and Machine Intelligence 1 (1979), no. 4, 356–365.
[Jai89] Fundamentals of Digital Image Processing, Prentice-Hall, 1989.
[JMR01] S. Jaffard, Y. Meyer, and R. D. Ryan, Wavelets: Tools for Science and Techno-
logy, SIAM, 2001.
[JN84] N. S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applica-
tion to Speech and Video Coding, Prentice-Hall, 1984.
[Joh66] S. Johansen, An application of extreme point methods to the representation
of infinitely divisible distributions, Probability Theory and Related Fields 5
(1966), 304–316.
[Kai70] T. Kailath, The innovations approach to detection and estimation theory, Pro-
ceedings of the IEEE 58 (1970), no. 5, 680–695.
[Kal06] W. A. Kalender, X-ray computed tomography, Physics in Medicine and Biology
51 (2006), no. 13, R29.
[Kap62] W. Kaplan, Operational Methods for Linear Systems, Addison-Wesley, 1962.
[KBU12] U. Kamilov, E. Bostan, and M. Unser, Wavelet shrinkage with consistent cycle
spinning generalizes total variation denoising, IEEE Signal Processing Letters
19 (2012), no. 4, 187–190.
[KFL01] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, Factor graphs and the sum-
product algorithm, IEEE Transactions on Information Theory 47 (2001), no. 2,
498–519.
[Khi34] A. Khintchine, Korrelationstheorie der stationären stochastischen Prozesse,
Mathematische Annalen 109 (1934), no. 1, 604–615.
[Khi37a] A new derivation of one formula by P. Lévy, Bulletin of Moscow State Uni-
versity, I (1937), no. 1, 1–5.
354 References

[Khi37b] Zur Theorie der unbeschränkt teilbaren Verteilungsgesetze, Recueil Mathé-


matique (Matematiceskij Sbornik) 44 (1937), no. 2, 79–119.
[KKBU12] A. Kazerouni, U. S. Kamilov, E. Bostan, and M. Unser, Bayesian denoising:
From MAP to MMSE using consistent cycle spinning, IEEE Signal Processing
Letters 20 (2012), no. 3, 249–252.
[KMU11] H. Kirshner, S. Maggio, and M. Unser, A sampling theory approach for
continuous ARMA identification, IEEE Transactions on Signal Processing 59
(2011), no. 10, 4620–4634.
[Kol35] A. N. Kolmogoroff, La transformation de Laplace dans les espaces linéaires,
Comptes Rendus de l’Académie des Sciences 200 (1935), 1717–1718, Note de
A. N. Kolmogoroff, présentée par Jacques Hadamard.
[Kol40] Wienersche Spiralen und einige andere interessante Kurven im Hilbertschen
Raum, Comptes Rendus (Doklady) de l’Académie des Sciences de l’URSS 26
(1940), no. 2, 115–118.
[Kol41] A. N. Kolmogorov, Stationary sequences in Hilbert space, Vestnik Moskovskogo
Universiteta, Seriya 1: Matematika, Mekhanika 2 (1941), no. 6, 1–40.
[Kol56] Foundations of the Theory of Probability, 2nd English edn., Chelsea Publi-
shing, 1956.
[Kol59] A note on the papers of R. A. Minlos and V. Sazonov, Theory of Probability
and Its Applications 4 (1959), no. 2, 221–223.
[KPAU13] U. S. Kamilov, P. Pad, A. Amini, and M. Unser, MMSE estimation of sparse
Lévy processes, IEEE Transactions on Signal Processing 61 (2013), no. 1,
137–147.
[Kru70] V. M. Kruglov, A note on infinitely divisible distributions, Theory of Probability
and Its Applications 15 (1970), no. 2, 319–324.
[KU06] I. Khalidov and M. Unser, From differential equations to the construction of
new wavelet-like bases, IEEE Transactions on Signal Processing 54 (2006),
no. 4, 1256–1267.
[KUW13] I. Khalidov, M. Unser, and J. P. Ward, Operator-like bases of L2 (Rd ), Journal
of Fourier Analysis and Applications 19, (2013), no. 6, 1294–1322.
[KV90] N. B. Karayiannis and A. N. Venetsanopoulos, Regularization theory in image
restoration: The stabilizing functional approach, IEEE Transactions on Acous-
tics, Speech and Signal Processing 38 (1990), no. 7, 1155–1179.
[KW70] G. Kimeldorf and G. Wahba, A correspondence between Bayesian estimation
on stochastic processes and smoothing by splines, Annals of Mathematical Sta-
tistics 41 (1970), no. 2, 495–502.
[Lat98] B. P. Lathy, Signal Processing and Linear Systems, Berkeley-Cambridge Press,
1998.
[LBU07] F. Luisier, T. Blu, and M. Unser, A new SURE approach to image denoising:
Interscale orthonormal wavelet thresholding, IEEE Transactions on Image Pro-
cessing 16 (2007), no. 3, 593–606.
[LC47] L. Le Cam, Un instrument d’étude des fonctions aléatoires: la fonctionnelle
caractéristique, Comptes Rendus de l’Académie des Sciences 224 (1947), no. 3,
710–711.
[LDP07] M. Lustig, D. L. Donoho, and J. M. Pauly, Sparse MRI: The application of
compressed sensing for rapid MR imaging, Magnetic Resonance in Medicine
58 (2007), no. 6, 1182–1195.
References 355

[Lév25] P. Lévy, Calcul des Probabilités, Gauthier-Villars, 1925.


[Lév34] Sur les intégrales dont les éléments sont des variables aléatoires indépen-
dantes, Annali della Scuola Normale Superiore di Pisa: Classe di Scienze 3
(1934), no. 3–4, 337–366.
[Lév54] Le Mouvement Brownien, Mémorial des Sciences Mathématiques, vol.
CXXVI, Gauthier–Villars, 1954.
[Lév65] Processus Stochastiques et Mouvement Brownien, 2nd edn., Gauthier-Villars,
1965.
[Lew90] R. M. Lewitt, Multidimensional digital image representations using generalized
Kaiser Bessel window functions, Journal of the Optical Society of America A 7
(1990), no. 10, 1834–1846.
[LFB05] V. Lučić, F. Förster, and W. Baumeister, Structural studies by electron tomo-
graphy: From cells to molecules, Annual Review of Biochemistry 74 (2005),
833–865.
[Loè73] M. Loève, Paul Lévy, 1886-1971, Annals of Probability 1 (1973), no. 1, 1–8.
[Loe04] H.-A. Loeliger, An introduction to factor graphs, IEEE Signal Processing
Magazine 21 (2004), no. 1, 28–41.
[LR78] A. V. Lazo and P. Rathie, On the entropy of continuous probability distributions,
IEEE Transactions on Information Theory 24 (1978), no. 1, 120–122.
[Luc74] L. B. Lucy, An iterative technique for the rectification of observed distributions,
The Astronomical Journal 6 (1974), no. 6, 745–754.
[Mal56] B. Malgrange, Existence et approximation des solutions des équations aux dé-
rivées partielles et des équations de convolution, Annales de l’Institut Fourier
6 (1956), 271–355.
[Mal89] S. G. Mallat, A theory of multiresolution signal decomposition: The wavelet
representation, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 11 (1989), no. 7, 674–693.
[Mal98] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 1998.
[Mal09] A Wavelet Tour of Signal Processing: The Sparse Way, 3rd edn., Academic
Press, 2009.
[Man63] B. B. Mandelbrot, The variation of certain speculative prices, The Journal of
Business 36 (1963), no. 4, 393–413.
[Man82] The Fractal Geometry of Nature, Freeman, 1982.
[Man01] Gaussian Self-Affinity and Fractals, Springer, 2001.
[Mar05] R. Martin, Speech enhancement based on minimum mean-square error estima-
tion and supergaussian priors, IEEE Transactions on Speech and Audio Proces-
sing 13 (2005), no. 5, 845–856.
[Mat63] G. Matheron, Principles of geostatistics, Economic Geology 58 (1963), no. 8,
1246–1266.
[Mey90] Y. Meyer, Ondelettes et Opérateurs I: Ondelettes, Hermann, 1990.
[MG01] D. Mumford and B. Gidas, Stochastic models for generic images, Quarterly of
Applied Mathematics 59 (2001), no. 1, 85–112.
[Mic76] C. Micchelli, Cardinal L-splines, Studies in Spline Functions and Approxima-
tion Theory (S. Karlin, C. Micchelli, A. Pinkus, and I. Schoenberg, eds.), Aca-
demic Press, 1976, pp. 203–250.
[Mic86] Interpolation of scattered data: Distance matrices and conditionally positive
definite functions, Constructive Approximation 2 (1986), no. 1, 11–22.
356 References

[Min63] R. A. Minlos, Generalized Random Processes and Their Extension to a Mea-


sure, Selected Translations in Mathematical Statististics and Probability, vol. 3,
American Mathematical Society, 1963, pp. 291–313.
[Miy61] K. Miyasawa, An empirical Bayes estimator of the mean of a normal popula-
tion, Bulletin de l’Institut International de Statistique 38 (1961), no. 4, 181–188.
[MKCC99] J. G. McNally, T. Karpova, J. Cooper, and J. A. Conchello, Three-dimensional
imaging by deconvolution microscopy, Methods 19 (1999), no. 3, 373–385.
[ML99] P. Moulin and J. Liu, Analysis of multiresolution image denoising schemes
using generalized Gaussian and complexity priors, IEEE Transactions on
Information Theory 45 (1999), no. 3, 909–919.
[MN90a] W. R. Madych and S. A. Nelson, Multivariate interpolation and conditionally
positive definite functions. II, Mathematics of Computation 54 (1990), no. 189,
211–230.
[MN90b] Polyharmonic cardinal splines, Journal of Approximation Theory 60 (1990),
no. 2, 141–156.
[MP86] S. G. Mikhlin and S. Prössdorf, Singular Integral Operators, Springer, 1986.
[MR06] F. Mainardi and S. Rogosin, The origin of infinitely divisible distributions:
From de Finetti’s problem to Lévy-Khintchine formula, Mathematical Methods
in Economics and Finance 1 (2006), no. 1, 7–55.
[MVN68] B. B. Mandelbrot and J. W. Van Ness, Fractional Brownian motions, fractional
noises and applications, SIAM Review 10 (1968), no. 4, 422–437.
[Mye92] D. E. Myers, Kriging, cokriging, radial basis functions and the role of positive
definiteness, Computers and Mathematics with Applications 24 (1992), no. 12,
139–148.
[Nat84] F. Natterer, The Mathematics of Computed Tomography, John Wiley & Sons,
1984.
[New75] D. J. Newman, A simple proof of Wiener’s 1/f theorem, Proceedings of the
American Mathematical Society 48 (1975), no. 1, 264–265.
[Nik07] M. Nikolova, Model distortions in Bayesian MAP reconstruction, Inverse Pro-
blems and Imaging 1 (2007), no. 2, 399–422.
[OF96a] B. A. Olshausen and D. J. Field, Emergence of simple-cell receptive field
properties by learning a sparse code for natural images, Nature 381 (1996),
no. 6583, 607–609.
[OF96b] Natural image statistics and efficient coding, Network: Computation in Neu-
ral Systems 7 (1996), no. 2, 333–339.
[Øks07] B. Øksendal, Stochastic Differential Equations, 6th edn., Springer, 2007.
[OSB99] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Proces-
sing, Prentice Hall, 1999.
[PAP72] J. Pearl, H. C. Andrews, and W. Pratt, Performance measures for transform data
coding, IEEE Transactions on Communications 20 (1972), no. 3, 411–415.
[Pap91] A. Papoulis, Probability, Random Variables, and Stochastic Processes,
McGraw-Hill, 1991.
[PBE01] M. Persson, D. Bone, and H. Elmqvist, Total variation norm for three-
dimensional iterative reconstruction in limited view angle tomography, Physics
in Medicine and Biology 46 (2001), no. 3, 853.
[Pet05] M. Petit, L’équation de Kolmogoroff: Vie et Mort de Wolfgang Doeblin, un
Génie dans la Tourmente Nazie, Gallimard, 2005.
References 357

[PHBJ+ 01] E. Perrin, R. Harba, C. Berzin-Joseph, I. Iribarren, and A. Bonami, nth-order


fractional Brownian motion and fractional Gaussian noises, IEEE Transactions
on Signal Processing 49 (2001), no. 5, 1049–1059.
[Pie72] A. Pietsch, Nuclear Locally Convex Spaces, Springer, 1972.
[PM62] D. P. Petersen and D. Middleton, Sampling and reconstruction of wave-number-
limited functions in n-dimensional Euclidean spaces, Information and Control
5 (1962), no. 4, 279–323.
[PMC93] C. Preza, M. I. Miller, and J.-A. Conchello, Image reconstruction for 3D light
microscopy with a regularized linear method incorporating a smoothness prior,
Proceedings of the IS&T/SPIE Symposium on Electronic Imaging Science and
Technology (San Jose, CA, January 1993), vol. 1905, SPIE 1993, pp. 129–139.
[PP06] A. Pizurica and W. Philips, Estimating the probability of the presence of a
signal of interest in multiresolution single- and multiband image denoising,
IEEE Transactions on Image Processing 15 (2006), no. 3, 654–665.
[PPLV02] B. Pesquet-Popescu and J. Lévy Véhel, Stochastic fractal models for image pro-
cessing, IEEE Signal Processing Magazine 19 (2002), no. 5, 48–62.
[PR70] B. L. S. Prakasa Rao, Infinitely divisible characteristic functionals on locally
convex topological vector spaces, Pacific Journal of Mathematics 35 (1970),
no. 1, 221–225.
[Pra91] W. K. Pratt, Digital Image Processing, John Wiley & Sons, 1991.
[Pro04] P. Protter, Stochastic Integration and Differential Equations, Springer, 2004.
[Pru06] K. P. Pruessmann, Encoding and reconstruction in parallel MRI, NMR in Bio-
medicine 19 (2006), no. 3, 288–299.
[PSV09] X. Pan, E. Y. Sidky, and M. Vannier, Why do commercial CT scanners still
employ traditional, filtered back-projection for image reconstruction?, Inverse
Problems 25 (2009), no. 12, 123009.
[PSWS03] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli, Image denoising
using scale mixtures of Gaussians in the wavelet domain, IEEE Transactions
on Image Processing 12 (2003), no. 11, 1338–1351.
[PU13] P. Pad and M. Unser, On the optimality of operator-like wavelets for sparse
AR(1) processes, Proceedings of the 2013 IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP’13) (Vancouver, BC,
Canada, May 26–31, 2013), IEEE, 2013, pp. 5598–5602.
[PWSB99] K. P. Pruessmann, M. Weiger, M. B. Scheidegger, and P. Boesiger, SENSE:
Sensitivity encoding for fast MRI, Magnetic Resonance in Medicine 42 (1999),
no. 5, 952–962.
[Rab92a] C. Rabut, Elementary m-harmonic cardinal B-splines, Numerical Algorithms 2
(1992), no. 2, 39–61.
[Rab92b] High level m-harmonic cardinal B-splines, Numerical Algorithms 2 (1992),
63–84.
[Ram69] B. Ramachandran, On characteristic functions and moments, Sankhya: The
Indian Journal of Statistics, Series A 31 (1969), no. 1, 1–12.
[RB94] D. L. Ruderman and W. Bialek, Statistics of natural images: Scaling in the
woods, Physical Review Letters 73 (1994), no. 6, 814–818.
[RBFK11] L. Ritschl, F. Bergner, C. Fleischmann, and M. Kachelrieß, Improved total
variation-based CT image reconstruction applied to clinical data, Physics in
Medicine and Biology 56 (2011), no. 6, 1545.
358 References

[RF11] S. Ramani and J. A. Fessler, Parallel MR image reconstruction using augmented


Lagrangian methods, IEEE Transactions on Medical Imaging 30 (2011), no. 3,
694–706.
[Ric72] W. Richardson, Bayesian-based iterative method of image restoration, Journal
of the Optical Society of America 62 (1972), no. 1, 55–59.
[Ric77] J. Rice, On generalized shot noise, Advances in Applied Probability 9 (1977),
no. 3, 553–565.
[RL71] G. N. Ramachandran and A. V. Lakshminarayanan, Three-dimensional recons-
truction from radiographs and electron micrographs: Application of convolu-
tions instead of Fourier transforms, Proceedings of the National Academy of
Sciences 68 (1971), no. 9, 2236–2240.
[ROF92] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise
removal algorithms, Physica D 60 (1992), no. 1–4, 259–268.
[Ron88] A. Ron, Exponential box splines, Constructive Approximation 4 (1988), no. 1,
357–378.
[RS08] M. Raphan and E. P. Simoncelli, Optimal denoising in redundant representa-
tions, IEEE Transactions on Image Processing 17 (2008), no. 8, 1342–1352.
[RU06] S. Ramani and M. Unser, Matérn B-splines and the optimal reconstruction of
signals, IEEE Signal Processing Letters 13 (2006), no. 7, 437–440.
[Rud73] W. Rudin, Functional analysis, McGraw-Hill Series in Higher Mathematics,
McGraw-Hill, 1973.
[Rud97] D. L. Ruderman, Origins of scaling in natural images, Vision Research 37
(1997), no. 23, 3385–3398.
[RVDVBU08] S. Ramani, D. Van De Ville, T. Blu, and M. Unser, Nonideal sampling and
regularization theory, IEEE Transactions on Signal Processing 56 (2008), no. 3,
1055–1070.
[SA96] E. P. Simoncelli and E. H. Adelson, Noise removal via Bayesian wavelet
coring, Proceedings of the IEEE International Conference on Image Processing
(ICIP2010), (Lausanne, Switzerland September 16–19, 1996), vol. 1, IEEE,
1996, pp. 379–382.
[Sat94] K.-I. Sato, Lévy Processes and Infinitely Divisible Distributions, Chapman &
Hall, 1994.
[Sch38] I. J. Schoenberg, Metric spaces and positive definite functions, Transactions of
the American Mathematical Society 44 (1938), no. 3, 522–536.
[Sch46] Contribution to the problem of approximation of equidistant data by analytic
functions, Quarterly of Applied Mathematics 4 (1946), 45–99, 112–141.
[Sch66] L. Schwartz, Théorie des Distributions, Hermann, 1966.
[Sch73a] I. J. Schoenberg, Cardinal Spline Interpolation, Society of Industrial and
Applied Mathematics, 1973.
[Sch73b] L. Schwartz, Radon Measures on Arbitrary Topological Spaces and Cylindri-
cal Measures, Studies in Mathematics, vol. 6, Tata Institute of Fundamental
Research, Bombay Oxford University Press, 1973.
[Sch81a] L. L. Schumaker, Spline Functions: Basic Theory, John Wiley & Sons, 1981.
[Sch81b] L. Schwartz, Geometry and Probability in Banach Spaces, Lecture Notes in
Mathematics, vol. 852, Springer, 1981.
[Sch88] I. J. Schoenberg, A Brief Account of My Life and Work, Selected papers
(C. de Boor, ed.), vol. 1, Birkhäuser, 1988, pp. 1–10.
References 359

[Sch99] H. H. Schaefer, Topological Vector Spaces, 2nd edn., Graduate Texts in Mathe-
matics, vol. 3, Springer, 1999.
[SDFR13] J.-L. Starck, D. L. Donoho, M. J. Fadili, and A. Rassat, Sparsity and the Baye-
sian perspective, Astronomy and Astrophysics 552 (2013), A133.
[SF71] G. Strang and G. Fix, A Fourier analysis of the finite element variational
method, Constructive Aspects of Functional Analysis, Edizioni Cremonese,
1971, pp. 793–840.
[Sha93] J. Shapiro, Embedded image coding using zerotrees of wavelet coefficients,
IEEE Transactions on Acoustics, Speech and Signal Processing 41 (1993),
no. 12, 3445–3462.
[Sil99] B. W. Silverman, Wavelets in statistics: Beyond the standard assumptions,
Philosophical Transactions: Mathematical, Physical and Engineering Sciences
357 (1999), no. 1760, 2459–2473.
[SLG02] A. Srivastava, X. Liu, and U. Grenander, Universal analytical forms for model-
ing image probabilities, IEEE Transactions on Pattern Analysis and Machine
Intelligence 24 (2002), no. 9, 1200–1214.
[SLSZ03] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu, On advances in statis-
tical modeling of natural images, Journal of Mathematical Imaging and Vision
18 (2003), 17–33.
[SO01] E. P. Simoncelli and B. A. Olshausen, Natural image statistics and neural repre-
sentation, Annual Review of Neuroscience 24 (2001), 1193–1216.
[Sob36] S. Soboleff, Méthode nouvelle à résoudre le problème de Cauchy pour les équa-
tions linéaires hyperboliques normales, Recueil Mathématique (Matematiceskij
Sbornik) 1(43) (1936), no. 1, 39–72.
[SP08] E. Y. Sidky and X. Pan, Image reconstruction in circular cone-beam computed
tomography by constrained, total-variation minimization, Physics in Medicine
and Biology 53 (2008), no. 17, 4777.
[SPM02] J.-L. Starck, E. Pantin, and F. Murtagh, Deconvolution in astronomy: A review,
Publications of the Astronomical Society of the Pacific 114 (2002), no. 800,
1051–1069.
[SS02] L. Sendur and I. W. Selesnick, Bivariate shrinkage functions for wavelet-based
denoising exploiting interscale dependency, IEEE Transactions on Signal Pro-
cessing 50 (2002), no. 11, 2744–2756.
[ST94] G. Samorodnitsky and M. S. Taqqu, Stable Non-Gaussian Random Processes:
Stochastic Models with Infinite Variance, Chapman & Hall, 1994.
[Ste76] J. Stewart, Positive definite functions and generalizations: An historical survey,
Rocky Mountain Journal of Mathematics 6 (1976), no. 3, 409–434.
[Ste81] C. Stein, Estimation of the mean of a multivariate normal distribution, Annals
of Statistics 9 (1981), no. 6, 1135–1151.
[SU12] Q. Sun and M. Unser, Left inverses of fractional Laplacian and sparse sto-
chastic processes, Advances in Computational Mathematics 36 (2012), no. 3,
399–441.
[Sun93] X. Sun, Conditionally positive definite functions and their application to mul-
tivariate interpolations, Journal of Approximation Theory 74 (1993), no. 2,
159–180.
[Sun07] Q. Sun, Wiener’s lemma for infinite matrices, Transactions of the American
Mathematical Society 359 (2007), no. 7, 3099–3123.
360 References

[SV67] M. H. Schultz and R. S. Varga, L-splines, Numerische Mathematik 10 (1967),


no. 4, 345–369.
[SVH03] F. W. Steutel and K. Van Harn, Infinite Divisibility of Probability Distributions
on the Real Line, Marcel Dekker, 2003.
[SW71] E. M. Stein and G. Weiss, Introduction to Fourier Analysis on Euclidean
Spaces, Princeton University Press, 1971.
[TA77] A. N. Tikhonov and V. Y. Arsenin, Solution of Ill-Posed Problems, Winston &
Sons, 1977.
[Taf11] P. D. Tafti, Self-similar vector fields, unpublished Ph.D. thesis, École Polytech-
nique Fédérate de Lausanne (2011).
[Tay75] S. J. Taylor, Paul Lévy, Bulletin of the London Mathematical Society 7 (1975),
no. 3, 300–320.
[TBU00] P. Thévenaz, T. Blu, and M. Unser, Image interpolation and resampling, Hand-
book of Medical Imaging, Processing and Analysis (I. N. Bankman, ed.), Aca-
demic Press, 2000, pp. 393–420.
[TH79] H. J. Trussell and B. R. Hunt, Improved methods of maximum a posteriori res-
toration, IEEE Transactions on Computers 100 (1979), no. 1, 57–62.
[Tib96] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the
Royal Statistical Society, Series B 58 (1996), no. 1, 265–288.
[Tik63] A. N. Tikhonov, Solution of incorrectly formulated problems and the regulari-
zation method, Soviet Mathematics 4 (1963), 1035–1038.
[TM09] J. Trzasko and A. Manduca, Highly undersampled magnetic resonance image
reconstruction via homotopic 0 -minimization, IEEE Transactions on Medical
Imaging 28 (2009), no. 1, 106–121.
[TN95] G. A. Tsihrintzis and C. L. Nikias, Performance of optimum and suboptimum
receivers in the presence of impulsive noise modeled as an alpha-stable process,
IEEE Transactions on Communications 43 (1995), no. 234, 904 –914.
[TVDVU09] P. D. Tafti, D. Van De Ville, and M. Unser, Invariances, Laplacian-like wave-
let bases, and the whitening of fractal processes, IEEE Transactions on Image
Processing 18 (2009), no. 4, 689–702.
[UAE92] M. Unser, A. Aldroubi, and M. Eden, On the asymptotic convergence of
B-spline wavelets to Gabor functions, IEEE Transactions on Information
Theory 38 (1992), no. 2, 864–872.
[UAE93] A family of polynomial spline wavelet transforms, Signal Processing 30
(1993), no. 2, 141–162.
[UB00] M. Unser and T. Blu, Fractional splines and wavelets, SIAM Review 42 (2000),
no. 1, 43–67.
[UB03] Wavelet theory demystified, IEEE Transactions on Signal Processing 51
(2003), no. 2, 470–483.
[UB05a] Cardinal exponential splines, Part I: Theory and filtering algorithms, IEEE
Transactions on Signal Processing 53 (2005), no. 4, 1425–1449.
[UB05b] Generalized smoothing splines and the optimal discretization of the Wiener
filter, IEEE Transactions on Signal Processing 53 (2005), no. 6, 2146–2159.
[UB07] Self-similarity, Part I: Splines and operators, IEEE Transactions on Signal
Processing 55 (2007), no. 4, 1352–1363.
[Uns84] M. Unser, On the approximation of the discrete Karhunen-Loève transform for
stationary processes, Signal Processing 7 (1984), no. 3, 231–249.
References 361

[Uns93] On the optimality of ideal filters for pyramid and wavelet signal approxima-
tion, IEEE Transactions on Signal Processing 41 (1993), no. 12, 3591–3596.
[Uns99] Splines: A perfect fit for signal and image processing, IEEE Signal Proces-
sing Magazine 16 (1999), no. 6, 22–38.
[Uns00] Sampling: 50 years after Shannon, Proceedings of the IEEE 88 (2000), no. 4,
569–587.
[Uns05] Cardinal exponential splines, Part II: Think analog, act digital, IEEE Tran-
sactions on Signal Processing 53 (2005), no. 4, 1439–1449.
[UT11] M. Unser and P. D. Tafti, Stochastic models for sparse and piecewise-smooth
signals, IEEE Transactions on Signal Processing 59 (2011), no. 3, 989–1005.
[UTAK14] M. Unser, P. D. Tafti, A. Amini, and H. Kirshner, A unified formulation of
Gaussian vs. sparse stochastic processes, Part II: Discrete-domain theory, IEEE
Transactions on Information Theory 60 (2014), no. 5, 3036–3051.
[UTS14] M. Unser, P. D. Tafti, and Q. Sun, A unified formulation of Gaussian vs. sparse
stochastic processes, Part I: Continuous-domain theory, IEEE Transactions on
Information Theory 60 (2014), no. 3, 1361–1376.
[VAVU06] C. Vonesch, F. Aguet, J.-L. Vonesch, and M. Unser, The colored revolution of
bioimaging, IEEE Signal Processing Magazine 23 (2006), no. 3, 20–31.
[VBB+ 02] G. M. Viswanathan, F. Bartumeus, S. V. Buldyrev, J. Catalan, U. L. Fulco,
S. Havlin, M. G. E da Luz, M. L. Lyra, E. P. Raposo, and H. E. Stanley, Lévy
flight random searches in biological phenomena, Physica A: Statistical Mecha-
nics and Its Applications 314 (2002), no. 1–4, 208–213.
[VBU07] C. Vonesch, T. Blu, and M. Unser, Generalized Daubechies wavelet families,
IEEE Transactions on Signal Processing 55 (2007), no. 9, 4415–4429.
[VDVBU05] D. Van De Ville, T. Blu, and M. Unser, Isotropic polyharmonic B-splines: Sca-
ling functions and wavelets, IEEE Transactions on Image Processing 14 (2005),
no. 11, 1798–1813.
[VDVFHUB10] D. Van De Ville, B. Forster-Heinlein, M. Unser, and T. Blu, Analytical foot-
prints: Compact representation of elementary singularities in wavelet bases,
IEEE Transactions on Signal Processing 58 (2010), no. 12, 6105–6118.
[vKvVVvdV97] G. M. P. van Kempen, L. J. van Vliet, P. J. Verveer, and H. T. M. van der
Voort, A quantitative comparison of image restoration methods for confocal
microscopy, Journal of Microscopy 185 (1997), no. 3, 345–365.
[VMB02] M. Vetterli, P. Marziliano, and T. Blu, Sampling signals with finite rate
of innovation, IEEE Transactions on Signal Processing 50 (2002), no. 6,
1417–1428.
[VU08] C. Vonesch and M. Unser, A fast thresholded Landweber algorithm for wavelet-
regularized multidimensional deconvolution, IEEE Transactions on Image Pro-
cessing 17 (2008), no. 4, 539–549.
[VU09] A fast multilevel algorithm for wavelet-regularized image restoration, IEEE
Transactions on Image Processing 18 (2009), no. 3, 509–523.
[Wag09] P. Wagner, A new constructive proof of the Malgrange-Ehrenpreis theorem,
American Mathematical Monthly 116 (2009), no. 5, 457–462.
[Wah90] G. Wahba, Spline Models for Observational Data, Society for Industrial and
Applied Mathematics, 1990.
[Wen05] H. Wendland, Scattered Data Approximations, Cambridge University Press,
2005.
362 References

[Wie30] N. Wiener, Generalized harmonic analysis, Acta Mathematica 55 (1930), no. 1,


117–258.
[Wie32] Tauberian theorems, Annals of Mathematics 33 (1932), no. 1, 1–100.
[Wie64] Extrapolation, Interpolation and Smoothing of Stationary Time Series with
Engineering Applications, MIT Press, 1964.
[WLLL06] J. Wang, T. Li, H. Lu, and Z. Liang, Penalized weighted least-squares approach
to sinogram noise reduction and image reconstruction for low-dose X-ray com-
puted tomography, IEEE Transactions on Medical Imaging 25 (2006), no. 10,
1272–1283.
[WM57] N. Wiener and P. Masani, The prediction theory of multivariate stochastic pro-
cesses, Acta Mathematica 98 (1957), no. 1, 111–150.
[WN10] D. Wipf and S. Nagarajan, Iterative reweighted and methods for finding sparse
solutions, IEEE Journal of Selected Topics in Signal Processing 4 (2010), no. 2,
317 –329.
[Wol71] S. J. Wolfe, On moments of infinitely divisible distribution functions, The
Annals of Mathematical Statistics 42 (1971), no. 6, 2036–2043.
[Wol78] On the unimodality of infinitely divisible distribution functions, Probability
Theory and Related Fields 45 (1978), no. 4, 329–335.
[WYHJC91] J. B. Weaver, X. Yansun, D. M. Healy Jr., and L. D. Cromwell, Filtering noise
from images with wavelet transforms, Magnetic Resonance in Medicine 21
(1991), no. 2, 288–295.
[WYYZ08] Y. L. Wang, J. F. Yang, W. T. Yin, and Y. Zhang, A new alternating minimization
algorithm for total variation image reconstruction, SIAM Journal on Imaging
Sciences 1 (2008), no. 3, 248–272.
[XQJ05] X. Q. Zhang and F. Jacques, Constrained total variation minimization and
application in computerized tomography, Energy Minimization Methods in
Computer Vision and Pattern Recognition, Lecture Notes in Computer Science,
vol. 3757, Springer, 2005, pp. 456–472.
[Yag86] A. M. Yaglom, Correlation Theory of Stationary and Related Random Func-
tions I: Basic Results, Springer, 1986.
[Zem10] A. H. Zemanian, Distribution Theory and Transform Analysis: An Introduction
to Generalized Functions, with Applications, Dover, 2010.
Index
additive white Gaussian noise (AWGN), 254, 269, 283, 301, 321
adjoint operator, 21, 32, 36, 37, 51, 52, 54, 91, 94, 109
alternating direction method of multipliers (ADMM), 262, 263, 319
analysis window
  arbitrary function in Lp, 80, 191, 195, 223
  rectangular, 12, 58, 80
analytic continuation, 329
augmented Lagrangian method, 261, 319
autocorrelation function, 12, 51, 54, 152, 153, 159, 161, 163, 167, 169

B-spline factorization, 136–137
B-splines, 21, 127
  exponential, 125–126, 138–139, 150
  fractional, 139–141, 204
  generalized, 127–142, 160, 197–198
  minimum-support property, 21, 132, 134, 138
  polyharmonic, 141
  polynomial, 8, 137
basis functions
  Faber-Schauder, 13
  Haar wavelets, 12
Bayes' rule, 254
belief propagation, 279–282
beta function, 345
BIBO stability, 93, 94
binomial expansion (generalized), 204
biomedical image reconstruction, 2, 24, 263, 276, 298
  deconvolution microscopy, 265–269, 298–299
  MRI, 269–272
  X-ray CT, 272–276
biorthogonality, 145, 250, 253
Blu, Thierry, 204
boundary conditions, 9, 92, 102–103, 167
bounded operator, 41
  convolution on Lp(Rd), 40–43
  Riesz–Thorin theorem, 41, 94
Brownian motion, 4, 11, 163, 173, 220

cardinal interpolation problem, 5
Cauchy's principal value, 62, 333, 334
causal, 5, 90, 95, 98, 105
central-limit theorem, 245, 246, 312
characteristic function, 45, 192, 206, 225, 240, 251, 337
  SαS, 234
characteristic functional, 46, 47, 85, 154, 156, 158, 164, 188, 194, 225
  domain extension, 195–197
  of Gaussian white noise, 75
  of innovation process, 53, 73
  of sparse stochastic process, 153
compound-Poisson process, 10, 174, 185
compressed sensing, 2, 255, 284, 288
conditional positive definiteness, 62, 339
  of generalized functions, 340
  Schoenberg's correspondence, 69
continuity, 27, 32, 45, 46
  of functional, 85
convex optimization, 2, 263
convolution, 39, 40–43, 89, 92–94, 158–159, 161, 334
  semigroup, 239–242
correlation functional, 11, 50, 81, 153, 155, 188
covariance matrix, 215, 216, 258
cumulants, 209, 213, 237–239
cumulant-generating function, 238, 239
cycle spinning, 299
  through averaging, 317–318

Döblin, Wolfgang, 17
decay, 132, 235
  algebraic, 132–133, 235
  compact support, 12, 125, 132, 134
  exponential, 132, 235
  supra-exponential, 235
decorrelation, 202, 217
decoupling of sparse processes, 23, 191
  generalized increments, 20, 170–172, 197–205, 211, 251
  increments, 11, 191, 193
  increments vs. wavelets, 191–194
  wavelet analysis, 21, 205–206
  wavelets vs. KLT, 220
denoising, 1, 24, 277, 290
  consistent cycle spinning, 318–320
  iterative MAP, 318–320
  MAP vs. MMSE, 283
  MMSE (gold standard), 281
  wavelet-domain
    shrinkage/thresholding, 296
    soft-threshold, 290
differential entropy, 215
differential equation, 5, 89
  stable, 95–97
  unstable, 19, 97–103
dilation matrix, 142, 206
Dirac impulse, 6, 9, 34, 35, 83, 121
discrete AR(1) process, 211, 220, 278
discrete convolution, 110
  inverse, 110
discrete cosine transform (DCT), 1, 14, 211, 217, 220
discrete whitening filter, 202
discretization, 194–195
  deconvolution, 268
  MRI, 270
  X-ray CT, 273
dual extension principle, 36
dual space
  algebraic, 32
  continuous, 32, 35, 46
  of topological vector space, 32
duality product, 33–35

estimators
  comparison of, 283–286, 303, 321
  LMMSE, 258
  MAP, 255
  MMSE (or conditional mean), 278
  pointwise MAP, 301
  pointwise MMSE, 301–312
  wavelet-domain MAP, 290, 292–293
expected value, 44

filtered white noise, 19, 53–54, 91, 159
finite difference, 7, 192, 201
finite rate of innovation, 2, 9, 78, 173
fluorescence microscopy, 265
forward model, 249, 264, 270
Fourier central-slice theorem, 273
Fourier multiplier, 40, 93, 119
  Lp characterization theorem, 41
  Mikhlin's theorem, 43
Fourier transform, 1, 34
fractional derivative, 104, 119, 176, 204
fractional integrator, 185
fractional Laplacian, 107, 108, 135, 176, 185
frequency response, 5, 7
  factorization, 95
  rational, 39, 95, 162
function, 25
  notion, 28
function spaces, 25–32
  complete-normed, 28–29
  finite-energy (L2), 34
  generalized functions (S′), 34
  Lebesgue (Lp), 29
  nuclear, 29–32
  rapidly decaying (R), 29
  smooth, rapidly decaying (S), 30–31
  topological, 28
functional, 32

gamma function, 344
  properties, 344–345
Gaussian hypothesis, 1, 54, 258–259
Gaussian stationary process, 1, 161–162
generalized compound process, 78, 185–187
generalized functions, 35–40
generalized increment process, 160–161, 195, 198–199
generalized increments, 199–205
  probability law of, 199–200
generalized random field, 47
generalized stochastic process, 47–54, 84–87
  existence of, 86
  isotropic, 154
  linear transform of, 51–52, 57, 84
  self-similar, 154–155
  stationary, 154–155
Gibbs energy minimization, 287
Green's function, 7, 93, 95, 119–120
  reproduction, 129–130

Hölder smoothness, 9, 135
Haar wavelets, 12, 114–116
  analysis of Lévy process, 12–15, 193–194
  synthesis of Brownian motion, 15–16
Haar, Alfréd, 147
Hadamard's finite part, 106, 332–334, 343
Hilbert transform, 42, 105, 334
Hurst exponent, 154, 156, 177, 178, 245

impulse response, 5, 38, 41, 158–160, 162
infinite divisibility, 24, 60, 80
  heavy-tail behavior, 58, 235
  Lévy–Khintchine representation, 61
  link with sparsity, 59, 68
  semigroup property, 239
infinitesimal generator, 240
inner product, 34, 194
innovation model, 11–12, 19–22, 48, 57
  finite-dimensional, 71–73
  generalized, 84–87
innovation process, 4, 9, 52–54, 73–75
integrator, 10, 91
  shift-invariant, 91
  stabilized adjoint, 92, 165
  stabilized, scale-invariant, 92
interpolator, 3, 126, 144, 152, 159
inverse operator, 19, 89, 91
  left vs. right inverse, 84, 86, 100
  shift-invariant, 92–97
  stabilized, 98–101, 102
  stabilized, scale-invariant, 108–109
iterative MAP reconstruction, 261–263
iterative solver
  conjugate gradient (CG), 259
  Landweber algorithm, 295
  steepest descent, 295
iterative thresholding, 24, 290, 295, 298
  FISTA, 297–298
  ISTA, 296–297

joint characteristic function, 49, 200, 207, 209, 213
Joint Photographic Experts Group
  JPEG 2000, 1
  JPEG compression standard, 1, 211

Karhunen–Loève transform (KLT), 3, 14, 216, 220
kriging, 18

L-spline, 120–121
ℓ1-norm minimization, 2, 248, 255, 265, 276, 284
left inverse, 92, 98, 99, 101, 102, 105, 108, 109
Lévy density, 60
  modified, 223, 225
Lévy exponent, 22, 59, 60–64, 293
  p-admissible, 63, 152, 196
  Gaussian, 66
  Lévy triplet, 62
  Lévy–Schwartz admissible, 74, 152, 196
  modified, 223, 256, 291
  Poisson family, 64
  SαS family, 64
Lévy process, 4, 11, 163–166, 278, 283
  alpha-stable, 11, 174, 220
  classical definition of, 163–164
  higher-order extensions, 166–167
Lévy, Paul Pierre, 4, 16–18
  continuity theorem, 46
  synthesis of Brownian motion, 15–16
Lévy–Khintchine formula, 61, 64–67, 70–71, 225, 228, 342–343
  Gaussian term, 66
  Poisson atoms, 66
linear inverse problem, 248
  Bayesian formulation, 252
  discretization, 249–254
  wavelet-based formulation, 291
  wavelet-based solution, 294–298
linear least-squares, 295
linear measurements, 194, 253
linear methods, 1, 2, 258, 263
linear predictive coding (LPC), 211
linear transform, 212
  decoupling ability, 218, 220, 221
  transform-domain statistics, 212–216
localization, 21, 125, 129, 132–135, 203
Loève, Michel, 17

M-term approximation
  wavelets vs. KLT, 14
magnetic resonance, 269
Mandelbrot, Benoit B., 17
  fractional Brownian motion, 178, 180–184
marginal distribution, 213
Matheron, Georges, 17
maximum a posteriori (MAP), 24, 255–256, 293
Minlos–Bochner theorem, 46, 74, 86
modified Bessel functions, 344
modulation transfer function, 267
moments, 50, 235–238
  generalized, 235, 236
Mondrian process, 187
multi-index notation, 30, 208
multiresolution analysis, 113, 142–144
mutual information, 218–220

negentropy, 215
non-Gaussian stationary process, 158–159
norm, 28
null space, 118–119

operator, 20, 89
  continuous, 27
  derivative, 6, 111, 164
  differential, 19, 96, 125, 166, 191
  discrete, 20, 197
  factorization, 96, 97, 101–102, 162, 166
  fractional, 89, 104, 107
  linear, 27
  linear shift-invariant, 90, 92, 156
  partial differential, 120
  rotation-invariant, 90, 107
  scale-invariant, 90, 104–109, 143, 156
  spline-admissible, 118–120
operator-like wavelets, 124, 142
  fast algorithm, 314
  first-order, 126–127, 221, 313
  general construction, 144–146
  generalized, 142–147
Ornstein–Uhlenbeck process, 150

Paley–Wiener theorem, 134
Parseval's identity, 34, 161, 300, 316, 318
partial differential equation, 240
  diffusion, 241
  fractional diffusion, 241
partial differential operator
  elliptic, 120
partition of unity, 114, 132
point-spread function (PSF), 266
  Airy pattern, 266
  diffraction-limited model, 266
  resolution limit, 266, 268
Poisson summation formula, 132
poles and zeros, 95, 141, 162, 166
polygamma function, 257, 345
polynomial reproduction, 130–132
  Strang-Fix conditions, 131
positive definiteness, 45, 336
  Bochner's theorem, 45, 337–338
  of functional, 85
  of generalized function, 338
potential function, 256–258, 299
  wavelet-domain, 293–294
power spectrum, 81, 152, 159, 163
principal component analysis (PCA), 216
probability density function (pdf), 43
probability law, 65
  Cauchy, 58, 311
  compound-Poisson, 58, 64, 67, 307
  Gaussian, 58
  hyperbolic secant, 305
  Laplace, 58, 304
  Meixner, 310
  Poisson, 66
  Student, 306
  symmetric gamma (sym gamma), 309
  symmetric-alpha-stable (SαS), 64, 234, 346
probability measure
  finite-dimensional, 43–44, 45
  infinite-dimensional, 46, 47
proximal operator, 259, 263, 296
  as MAP estimator, 301

Radon transform, 272
  of polynomial B-spline, 275
random field, 19
reconstruction subspace, 249
regularization, 2, 249, 255, 256, 286
  parameter, 263, 284
  quadratic, 287
  total-variation, 284, 287
  wavelet-domain, 295, 298–299
regularizing singular integrals, 107, 329–331
ridge regression, 287
Riesz basis, 9, 114, 121, 122, 129, 250
Riesz transform, 335
right inverse, 99, 101, 103, 105, 109

sampling, 3
  constant-rate, 3
  generalized, 249
  in Fourier domain, 270, 271
sampling step, 249
scaling function, 114
Schoenberg, Isaac Jacob, 9, 147
  B-spline formula, 8
  correspondence theorem, 69
Schwartz, Laurent
  kernel theorem, 38–39
  Schwartz–Bochner theorem, 338
  space of test functions (S), 30
  tempered distributions (S′), 34
score, 260
self-decomposable distribution, 232
self-similar process, 154, 176–180
  wide-sense, 155, 176
seminorm, 28
sequence spaces (ℓp), 110
Shepp–Logan phantom, 275
shrinkage function, 259, 290, 300–313
  asymptotic behavior, 260, 302
  for Cauchy distribution, 311
  for compound-Poisson, 307
  for hyperbolic secant, 306
  for Laplace distribution, 260, 304
  for Meixner family, 310
  for Student distribution, 306
  for symmetric gamma (sym gamma), 309
  linear, 260, 303
  MAP vs. MMSE, 302
  soft-threshold, 260
signal
  as element of a function space, 28
  continuous-domain, 3
  discrete (or sampled), 6, 195
singular integral, 20, 61, 62, 100, 108, 328–335
sinogram, 273
Sobolev smoothness, 135–136
sparse representations, 191
sparse signal recovery, 248, 255
sparse stochastic process, 3, 23, 150–187
  bandpass, 174
  CARMA, 162–163, 201
  Lévy and extension, 163–175
  lowpass, 173
  mixed, 175
  non-Gaussian AR(1), 150–152, 211
  self-similar, 23, 176–187
sparsity, 1, 68
spectral mixing, 224, 229
spline knots, 6, 121
splines, 23, 113
  cardinal, 5, 120
  definition of, 120–121
  non-uniform, 7, 121
  piecewise-constant, 5–7, 113
stable distribution, 234, 246
stationary process, 79, 154, 158–160, 163
  wide-sense (WSS), 155
stochastic difference equation, 201–202, 211
stochastic differential equation (SDE), 4, 19, 84, 164
  Nth-order, 162, 166
  first-order, 91, 150, 166, 211
  fractional, 176
stochastic integral, 4, 163
stochastic modeling of images, 264
structure
  algebraic, 27
  topological, 27
submultiplicative function, 235
system matrix, 253, 259, 263, 292

Taylor series, 208, 238, 239
Tikhonov regularization, 248, 255, 286
topological vector space, 28
total variation, 42, 284, 287, 288
transform-domain statistics, 223
  cumulants, 239
  infinite divisibility, 224
  rate of decay, 235
  self-decomposability, 232
  stability, 234
  unimodality, 230, 231
two-scale relation, 114, 143–144, 314

unimodal, 230, 231, 232, 256, 303

vanishing moments, 147
variogram, 179
vector-matrix notation, 72, 212
Vetterli, Martin, 9

wavelet frame, 1, 315–318
  improved denoising, 317
  pseudo-inverse, 316
  union of ortho bases, 316
wavelet-domain statistics, 206
  correlations, 207
  cumulants, 208–210, 244, 245
  evolution across scale, 242–247
  probability laws, 206–207
wavelets, 1, 113, 290
  admissible, 124
  Haar, 12, 114–116, 220
  Haar 2-D, 298
  orthogonal transform, 13, 116, 301
  semi-orthogonal transform, 146
white Lévy noise, 4, 10, 73–84
  as limit of random impulses, 83–84
  canonical id distribution, 80
  definition, 73
  infinite divisibility, 80–81
whitening operator, 19, 23, 113, 152
Wiener, Norbert
  Wiener filter, 3, 258
  Wiener's lemma, 111
  Wiener–Khintchine theorem, 88, 152

Young's inequality, 40, 94, 110