Gaussian Sequence Model
Gaussian Sequence Model
Iain M. Johnstone
iii
c
2013.
Iain M. Johnstone
Contents
ix
xiii
xiv
xix
List of illustrations
List of tables
(working) Preface
List of Notation
1
1.1
1.2
1.3
1.4
1.5
1.6
1.7
Introduction
A comparative example
A first comparison of linear methods, sparsity and thresholding
A game theoretic model and minimaxity
The Gaussian Sequence Model
Why study the sequence model?
Plan of the book
Notes
Exercises
1
1
6
9
12
15
16
17
18
2
2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
19
20
22
23
30
34
37
41
44
46
50
51
3
3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
57
58
61
64
70
72
74
78
79
iv
Contents
3.9
3.10
3.11
3.12
81
88
92
98
99
4
4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.11
4.12
105
106
108
111
112
115
116
121
126
129
131
132
134
135
5
5.1
5.2
5.3
5.4
5.5
5.6
138
139
140
145
148
152
155
156
6
6.1
6.2
6.3
6.4
6.5
6.6
158
159
159
162
164
170
178
179
7
7.1
7.2
7.3
7.4
7.5
7.6
7.7
7.8
181
182
189
191
194
194
201
206
208
vi
Contents
Exercises
208
8
8.1
8.2
8.3
8.4
8.5
8.6
8.7
8.8
8.9
8.10
8.11
209
210
212
219
224
228
230
235
237
240
240
242
243
9
9.1
9.2
9.3
9.4
9.5
9.6
9.7
9.8
9.9
9.10
9.11
9.12
247
248
251
253
254
256
257
265
267
269
273
276
277
278
10
10.1
10.2
10.3
10.4
10.5
10.6
10.7
10.8
10.9
10.10
281
282
284
285
287
288
290
294
297
299
300
300
11
11.1
11.2
11.3
301
302
305
308
Contents
vii
11.4
11.5
11.6
11.7
11.8
312
317
319
320
322
322
12
12.1
12.2
12.3
12.4
12.5
12.6
324
325
328
332
334
339
342
343
13
13.1
13.2
13.3
13.4
13.5
13.6
13.7
13.8
345
346
347
350
355
358
363
364
365
366
14
14.1
14.2
14.3
14.4
14.5
14.6
14.7
14.8
368
368
369
370
371
373
374
375
377
377
15
15.1
15.2
15.3
15.4
15.5
378
378
381
383
385
388
389
16
Epilogue
390
viii
Contents
Appendix A
Appendix: The Minimax Theorem
A.1 A special minimax theorem for thresholding
391
399
Appendix B
401
Appendix C
Background Material
425
Appendix D
To Do List
440
Bibliography
441
(working) Preface
This is a book about some of the theory of nonparametric function estimation. The premise
is that much insight can be gained even if attention is confined to a Gaussian sequence model
yi D i C zi ;
i 2 I;
(0.1)
where I is finite or countable, fi g is fixed and unknown, fzi g are i.i.d. N.0; 1/ noise variables and is a known noise level. If I is finite, this is an old friend, the multivariate normal
means model, with independent co-ordinates and known variance. It is the centerpiece of
parametric statistics, with many important, beautiful, and even surprising results whose influence extends well beyond the formal model into the practical, approximate world of data
analysis.
It is perhaps not so obvious that the infinite sequence model could play a corresponding
role in nonparametric statistics. For example, problems of nonparametric regression, density
estimation and classification are typically formulated in terms of unknown functions, rather
than sequences of parameters. Secondly, the additive white Gaussian noise assumption may
seem rather remote.
There are several responses to these objections. First, the model captures many of the
conceptual issues associated with non-parametric estimation, with a minimum of technical
complication. For example, non-parametrics must grapple with the apparent impossibility
of trying to estimate an infinite-dimensional object a function on the basis of a finite
p
amount n of noisy data. With a calibration D 1= n; this challenge is plain to see in model
(0.1). The broad strategy is to apply various methods that one understands in the multivariate
normal model to finite submodels, and to argue that often not too much is lost by ignoring
the (many!) remaining parameters.
Second, models and theory are always an idealisation of practical reality. Advances in
size of datasets and computing power have enormously increased the complexity of both
what we attempt to do in data analysis and the algorithms that we invent to carry out our
goals. If one aim of theory is to provide clearly formulated, generalizable insights that might
inform and improve our computational efforts, then we may need to accept a greater degree
of idealisation in our models than was necessary when developing theory for the estimation
of one, two or three parameters from modest numbers of observations.
Thirdly, it turns out that model (0.1) is often a reasonable approximation, in large samples, to other nonparametric settings. In parametric statistics, the central limit theorem and
asymptotic normality of estimators extends the influence of multivariate normal theory to
generalized linear models and beyond. In nonparametric estimation, it has long been observed that similar features are often found in spectrum, density and regression estimation.
ix
(working) Preface
Relatively recently, results have appeared connecting these problems to model (0.1) and
thereby providing some formal support for these observations.
Model (0.1) and its justifications have been used and understood for decades, notably
by Russian theoretical statisticians, led by I. A. Ibragimov and R. Z. Khasminskii. It was
somewhat slower to receive wide discussion in the West. However, it received a considerable
impetus when it was observed that (0.1) was a natural setting in which to understand the
estimation of signals, functions and images in wavelet orthonormal bases. In turn, wavelet
bases made it possible to give a linked theoretical and methodological account of function
estimation that responded appropriately to spatial inhomogeneties in the data, such as (in an
extreme form) discontinuities and cusps.
The goal of this book is to give an introductory account of some of the theory of estimation
in the Gaussian sequence model that reflects these ideas.
Estimators are studied and compared using the tools of statistical decision theory, which
for us means typically (but not always) comparison of mean squared error over appropriate
classes of sets supposed to contain the unknown vector . The best-worst-case or minimax
principle is used, though deliberately more often in an approximate way than exactly. Indeed,
we look for various kinds of approximate adaptive minimaxity, namely estimators that are
able to come close to the minimax criterion simultaneously over a class of parameter sets.
A basic theme is that the geometric characteristics of the parameter sets, which themselves
often reflect assumptions on the type of smoothness of functions, play a critical role.
In the larger first part of the book, Chapters 1- 9, an effort is made to give equal time
to some representative linear and non-linear estimation methods. Linear methods, of which
kernel estimators, smoothing splines, and truncated series approaches are typical examples,
are seen to have excellent properties when smoothness is measured in a sufficiently spatially uniform way. When squared error loss is used, this is geometrically captured by the
use of hyperrectangles and ellipsoids. Non linear methods, represented here primarily by
thresholding of data in a wavelet transform domain, come to the fore when smoothness of a
less uniform type is permitted. To keep the account relatively self-contained, introductions
to topics such as Gaussian decision theory, wavelet bases and transforms, and smoothness
classes of functions are included. A more detailed outline of topics appears in Section 1.6
after an expanded introductory discussion. Starred sections contain more technical material
and can be skipped on a first reading.
The second part of the book, Chapters 10 15, is loosely organized as a tour of various
types of asymptotic optimality in the context of estimation in the sequence model. Thus,
one may be satisfied with optimality up to log terms, or up to constants or with exact
constants. One might expect that as the demands on quality of optimality are ratcheted up,
so are the corresponding assumptions, and that the tools appropriate to the task change. In
our examples, intended to be illustrative rather than exhaustive, this is certainly the case.
The other organizing theme of this second part is a parallel discussion of results for simple
or monoresolution models (which need have nothing to do with wavelets) and conclusions
specifically for multiresolution settings.
We often allow the noise level in (0.1) to depend on the index ia small enough change
to be easily accommodated in many parts of the theory, but allowing a significant expansion
in models that are fairly directly convertible to sequence form. Thus, many linear inverse
problems achieve diagonal form through a singular value or wavelet-vaguelette decompo-
(working) Preface
xi
sition, and problems with correlated Gaussian noise can be diagonalized by the principal
compoent or Karhunen-Lo`eve transformation.
Of course much is omitted. To explain some of the choices, we remark that the project
began over ten years ago as an account of theoretical properties of wavelet shrinkage estimators based largely on work with David Donoho, Gerard Kerkyacharian and Dominique
Picard. Much delay in completion ensued, due to other research and significant administrative distractions. This history has shaped decisions on how to bring the book to light after
so much elapsed time. First among the choices has been to cast the work more as a graduate text and less as a current research monograph, which is hopefully especially apparent
in the earlier chapters. Second, and consistent with the first, the book does not attempt to
do justice to related research in recent years, including for example the large body of work
on non-orthogonal regression, sparse linear models and compressive sensing. It is hoped,
however, that portions of this book will provide helpful background for readers interested in
these areas as well.
The intended readership, then, includes graduate students and others who would like an
introduction to this part of the theory of Gaussian estimation, and researchers who may
find useful a survey of a part of the theory. Helpful background for reading the book would
be familiarity with mathematical statistics at the level of a first year doctoral course in the
United States.
The exercises, which are concentrated in the earlier chapters, are rather variable in complexity and difficulty. Some invite verifications of material in the text, ranging from the
trivial to the more elaborate, while others introduce complementary material.
xii
(working) Preface
from reviewers commissioned by John Kimmel and Lauren Cowles, especially Anirban Das
Gupta, Sam Efromovich and Martin Wainwright.
For the final push, I wish to specially thank Tony Cai, whose encouragement to complete
the book took the concrete form of insightful counsel along with organizing further helpful
comments from our colleagues Weidong Liu, Mark Low, Lie Wang, Ming Yuan and Harry
Zhou. Michael Martin and Terry ONeill at the Australian National University, and Marta
Sanz at the University of Barcelona, hosted a sabbatical leave which enabled the challenging
task of imposing final discipline on a protracted project.
Thanks also to the John Simon Guggenheim Memorial Foundation for a Fellowship during which the first draft was written, and to the National Science Foundation and National
Institutes of Health, which have supported much of my own research and writing, and to the
Australian National University and University of Barcelona which provided space and time
for writing.
Chapter dependency graph. A heavy solid line indicates a more than incidental scientific dependence of the higher numbered chapter on the lower numbered one. A dotted line
indicates a weaker formal dependence, perhaps at the level of motivation. A more specific
indication of cross-chapter dependence at the level of sections can then be found below.
In the first part, Chapters 2 and 3 provide basic material for the book as a whole, while
the decision theory of Chapter 4 is important for virtually everything that follows. The linear
estimation results, Chapters 5 and 6, form one endpoint in themselves. Chapters 8 and 9 on
thresholding and properties of wavelet shrinkage form the other main endpoint in Part I; the
wavelet primer Chapter 7 prepares the way.
In the second part, with numbers shown in Courier font, there some independence of the
chapters at a formal level: while they lean heavily on Part I, the groups f10g; f11; 12g and
f13; 14; 15g can be read separately of one another. The first chapter in each of these three
groups (for Ch. 10, the first half) does not require any wavelet/multiresolution ideas.
xiii
(working) Preface
7
6
10
3
11
2
12
15
13
14
List of Notation
Standard notations.
xC , positive part; x, fractional part; bxc, largest previous integer; dxe, smallest following integer.
?, convolution; #fg, cardinality; a ^ b D min.a; b/; a _ b D max.a; b/; log, base e
logarithm.
R, real numbers; C; complex numbers; Z; integers, N D f1; 2; : : :g; N0 D f0; 1; 2; : : :g
natural numbers.
Rn ; Cn n-dimensional real, complex Euclidean space
R1 ; (countably infinite) sequences of reals, 58
Sets, Indicator (functions). Ac complement; At dilation; (REF?) IA ; IA .t / indicator
function; sign.x/ sign function.
Derivatives. Univariate: g 0 ; g 00 or g .r/ , or D r g, rth derivative; Partial: @=@t , @=@x; Multivariate: Di g or @g=@xi ; Divergence: r T , (2.56).
Matrices. In identity, n n; AT transpose; trA trace; rank.A/ rank; A 1 inverse;
%1 .A/ : : : %n .A/ eigenvalues; A1=2 non-negative definite square root; jAj D .AT A/1=2
p. 36
Vectors. D .i / in sequence model; f D f .tl / in time domain, 12; ek has 1 in kth place,
0s elsewhere. Indices in sequence model: i for generic sequences, k for specific concrete
bases.
P
Inner Products. uT v D hu; vi D i ui vi , Euclidean inner product; h; in normalized, p.
12; h; i% weighted, p. 80.
P
Norms. Vectors. k k, when unspecified, Euclidean norm . u2i /1=2 ; k k2;n normalized
Euclidean p. 12; k kp , `p norm, (1.5)
Matrices. k kHS Hilbert-Schmidt, p. 2.8, (3.10), (C.5).
Functions. kf kp Fill in!; kgk2;I ; kgk1;I restricted to interval I , p. 67, (3.27).
Function spaces.
L2 0; 1
App Ref.
`2 D `2 .N/
App Ref.
`2;% D `2 .N; .%i 2 // App Ref.
Distributions, Expectations. (2.7), 4.1. Joint P.d; dy/; E;
Conditional on : P .dyj/; P .dy/, E ;
Conditional on y: P .djy/; Py .d /, Ey ;
Marginals: for y, P .dy/; EP ; for ; .d /; E ;
xiv
List of Notation
xv
Collections: P ; supported on , P ./ before (4.19); convolutions P after (4.21); substochastic PC .R/, C.19; moment constrained, M; M.C /, 4.11.
Random Variables. Y (vector of) observations; Z mean zero; iid = independent and
D
identically distributed; D equality in distribution.
Stochastic operators. E, expectation; Var; variance; Cov; covariance; Med; median,
Exercise 2.4; Bias; bias, p. 35;
O
Estimators. O D .y/,
general sequence estimator at noise level ; O D .x/,
O
general
O
O
estimator at noise level 1, i ; j k ith, or .j; k/th component of ;
O , Bayes estimator for prior , (2.9), p. 106; y , posterior mean (2.14); O , minimax
estimator,
Specific classes of estimators: OC ; linear estimators with matrices C , (2.45); Oc diagonal linear esimators with shrinkage constants c (3.11); O threshold estimator, (1.7), (2.5),
(2.6); except in Chapter 3, where it is a regularization (spline) estimator, (??); O truncation
estimator, (3.14); Oh kernel estimator, (3.30);
fO.t/; fOh ; fQh ; fO , function estimator (3.18), (3.41).
Estimators with superscripts. O JSC Positive part James-Stein estimator. (2.70) [to
index?]
Decision theory: Loss function L.a; /, 2.3, randomized decision rule, .Ajy/ (A.10).
Distance between statistical problems, d .P0 ; P1 /; .P0 ; P1 /, (3.84). L1 -distance, L1 .P0 ; P1 /,
(3.85).
Risk functions For estimators.
O /,
r.;
r.; /
rL .%; /
rS .; /; rH .; /
2.5;
of randomized rule, (A.10)
of linear shrinkage rule, (2.49)
of soft (2.7, 8.2) and hard (8.2) thresholding.
For priors.
O /,
B.;
B./; B.; /,
B.P /; B.P ; /
Minimax risks.
RN ./; RN .; /
RN .; /
Rn D RN .Rn ; /
RN .F ; /
RL ./,
RDL .; /,
N .; /; L .; /
P .; /,
1
Introduction
And hither am I come, a Prologue armed,... to tell you, fair beholders, that our play leaps
oer the vaunt and firstlings of those broils, beginning in the middle; starting thence away
to what may be digested in a play. (Prologue, Troilus and Cressida William Shakespeare.)
The study of linear methods, non-linear thresholding and sparsity in the special but central
setting of Gaussian data is enlightened by statistical decision theory. This overture chapter
introduces these themes and the perspective to be adopted.
Section 1.1 begins with two data examples, in part to emphasize that while this is a theoretical book, the motivation for the theory comes from describing and understanding the
properties of commonly used methods of estimation.
A first theoretical comparison follows in Section 1.2, using specially chosen cartoon examples of sparse signals. In order to progress from constructed cases to a plausible theory,
Section 1.3 introduces, still in a simple setting, the formal structures of risk function, Bayes
rules and minimaxity that are used throughout.
The signal in Gaussian white noise model, the main object of study, makes its appearance
in Section 1.4, in both continuous and sequence forms, along with informal connections to
finite regression models and spline smoothing estimators. Section 1.5 explains briefly why
it is our guiding model; but it is the goal of the book to flesh out the story, and with some of
the terms now defined, Section 1.6 provides a more detailed roadmap of the work to follow.
l D 1; : : : ; n:
(1.1)
The observation Yl is the minimum temperature at a fixed time period tl , here equally spaced,
with n D 366, f .t/ is an unknown mean temperature function, while Zl is a noise term,
1
Introduction
20
15
10
10
50
100
150
200
250
300
350
days in 2008
Figure 1.1 Spline smoothing of Canberra temperature data. Solid line: original
spline fit, Dashed line: periodic spline, as described in text.
assumed to have mean zero, and variance onesince the standard deviation is shown
explicitly.
Many approaches to smoothing could be taken, for example using local averaging with
a kernel function or using local (linear) regression. Here we briefly discuss two versions of
smoothing splines informallySection 1.4 has formulas and a little more detail. The choice
of splines here is merely for definiteness and conveniencewhat is important, and shared
by other methods, is that the estimators are linear in the data Y, and depend on a tuning or
bandwidth parameter .
A least squares
seek an estimator fO to minimize a residual sum of squares
Pn approach would
1
2
f .tl / . In nonparametric estimation, in which f is unconstrained,
S.f / D n
lD1 Yl
this would lead to an interpolation, fO.tl / D Yl , an overfitting which would usually be
too rough to use as a Rsummary. The spline approach brings in a penalty for roughness,
for example P .f / D .f 00 /2 in terms of the squared second derivative of f . The spline
estimator is then chosen to minimize S.f / C P .f /, where the regularization parameter
adjusts the relative importance of the two terms.
As both S and P are quadratic functions, it is not surprising (and verified in Section 1.4)
that the minimizing fO is indeed linear in the data Y for a given value of . As increases
from 0 to 1, the solution will pass from rough (interpolating the data) to smooth (the linear
least squares fit). A subjective choice of was made in Figure 1.1, but it is often desirable
to have an automatic or data-driven choice specified by some algorithm.
Depending on whether ones purpose is to obtain a summary for a given year, namely
2008, or to obtain an indication of an annual cycle, one may or may not wish to specifically
require f and fO to be periodic. In the periodic case, it is natural to do the smoothing using
Fourier series. If yk and k denote the kth Fourier coefficient of the observed data and
unknown function respectively, then the periodic linear spline smoother takes on the simple
coordinatewise linear form Ok D yk =.1Cwk / for certain known constants wk that increase
with frequency like k 4 .
Interestingly, in the temperature example, the periodic and nonperiodic fits are similar,
differing noticeably only within a short distance of the year boundaries. This can be understood in terms of an equivalent kernel form for spline smoothing, Section 3.5.
To understand the properties of linear estimators such as fO , we will later add assumptions
that the noise variables Zl are Gaussian and independent. A probability plot of residuals in
fact shows that these temperature data are reasonably close to Gaussian, though not independent, since there is a clear lag-one sample autocorrelation. However the dependence appears
to be short-range and appropriate adjustments for it could be made in a detailed analysis of
this example.
The NMR data. Figure 1.2 shows a noisy nuclear magnetic resonance (NMR) signal sampled at n D 2J D 1024 points. Note the presence both of sharp peaks and baseline noise.
The additive regression model (1.1) might again be appropriate, this time with tl D l=n and
perhaps with f substantially less smooth than in the first example.
The right hand panel shows the output of wavelet denoising. We give a brief description
of the method using the lower panels of the figuremore detail is found in Chapter 7.
The noisy signal is transformed, via an orthogonal discrete wavelet transform, into wavelet
coefficients yj k , organized by scale (shown vertically, from coarsest level j D 4 to finest
level j D J 1 D 9) and by location, shown horizontally, with coefficients located at
k2 j for k D 1; : : : ; 2j . Correspondingly, the unknown function values f .tl / transform
into unknown wavelet coefficients j k . In this transform domain, we obtain estimates Oj k
by performing a hard thresholding
(
p
yj k if jyj k j > O 2 log n;
O
j k D
0
otherwise
to retain only the large coefficients, settingp
all others to zero. Here O is a robust estimate
of the error standard deviation1 . The factor 2 log n reflects the likely size of the largest
of n independent zero mean standard normal random variablesChapter 8 has a detailed
discussion.
The thresholded coefficients, shown in the lower right panel, are then converted back to
the time domain by the inverse discrete wavelet transform, yielding the estimated signal
fO.tl / in the top right panel. The wavelet denoising seems to be effective at removing
nearly all of the baseline noise, while preserving much of the structure of the sharp peaks.
By contrast, the spline smoothing approach cannot accomplish both these tasks at the
same time. The right panel of Figure 1.3 shows a smoothing spline estimate with an automatically chosen2 value of . Evidently, while the peaks are more or less retained, the spline
estimate has been unable to remove all of the baseline noise.
An intuitive explanation for the different behaviors of the two estimatesP
can be given
using the idea of kernel averaging, in which a function estimate fO.t / D n 1 l wl .t /Yl is
obtained by averaging the data Yl with a weight function
wl .t / D h 1 K.h 1 .t
1
2
tl //;
using the median absolute deviation MADfyJ 1;k g=0:6745, explained in Section 7.5
chosen to minimize an unbiased estimate of mean squared error, Mallows CL , explained in Section 6.4
(1.2)
Introduction
1 (a) NMR Spectrum
40
40
30
30
20
20
10
10
10
0.2
0.4
0.6
0.8
10
4
0
0.2
0.4
0.6
0.8
0.4
0.6
0.8
10
0.2
0.2
0.4
0.6
0.8
Figure 1.2 Wavelet thresholding of the NMR signal. Data originally via Chris
Raphael from the laboratory of Andrew Maudsley, then at UCSF. Signal has
n D 1024 points, discrete wavelet transform using Symmlet6
filter in Wavelab,
p
coarse scale L D 4, hard thresholding with threshold O 2 log n as in the text.
for a suitable kernel function K, usually non-negative and integrating to 1. The parameter h
is the bandwidth, and controls the distance over which observations contribute to the estimate at point t. (Section 3.3 has more detail.) The spline smoothing estimator, for equally
spaced data, can be shown to have approximately this form, with a one-to-one correspondence between h and described in Chapter 6.4. A key property of the spline estimator is
that the value of h does not vary with t.
By contrast, the kernel average view of the wavelet threshold estimate in Figure 1.2 shows
that h D h.t/ depends on t stronglythe bandwidth is small in a region of sharp transients,
and much larger in a zone of stationary behavior in which the noise dominates. This is
shown schematically in Figure 1.3, but can be given a more precise form, as is done in
Section 7.5.
One of the themes of this book is to explore the reasons for the difference in performance
of splines and wavelet thresholding in these examples. An important ingredient can be seen
by comparing the lower panels in Figure 1.2. The true signalassuming that we can speak
45
40
40
35
35
30
30
25
25
20
20
15
15
10
10
0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
Figure 1.3 Schematic comparison of averaging kernels: The baseline dashed bell
curves give qualitative indications of the size of the bandwidth h in (1.2), the
equivalent kernel. In the left panel, corresponding to wavelet thresholding, the
equivalent kernel depends on position, h D h.tl /, whereas in the right panel, for
spline smoothing, it is translation invariant.
of such a thingappears to be concentrated in a relatively small number of wavelet coefficients, while the noise is scattered about globally and at an apparently constant standard
deviation within and across levels. Thus the thresholding can literally clean out most of the
noise while leaving the bulk of the signal energy, concentrated as it is in a few coefficients,
largely undisturbed. This sparsity of representation of the signal in the wavelet transform
domain is an essential property.
The example motivates a number of questions:
what are the properties of thresholding? Can we develop expressions for, say, mean
squared error and understand how to choose the value of the threshold?
when is it effective e.g. better than linear shrinkage? Can we compare the mean squared
error of linear estimators and thresholding over various classes of functions, representing
different amounts and types of smoothness?
what is the role of sparsity? Can we develop quantitative measures of sparsity of representation and describe how they affect the possible mean squared error?
are optimality statements possible? Can we identify assumptions on classes of functions
for which it is possible to assert that linear, or threshold, estimators are, in an appropriate
sense, nearly best?
are extensions to other settings possible? Are there other nonparametric estimation problems, such as density estimation or linear inverse problems, in which similar phenomena
appear?
Our goal will be to develop some theoretical definitions, tools and results to address these
issues. A key technique throughout will be to use sequence models, in which our methods,
hypotheses and results are phrased in terms of the coefficients, k or j k , that appear when
the function f is expanded in an orthogonal basis. In the NMR example, the (wavelet)
Introduction
coefficients are those in the bottom panels of Figure 1.2, while in the weather data, in the
periodic form, they are the Fourier coefficients.
In the next sections we turn to a first discussion of these questions in the simplest sequence
model. Exactly why sequence models repay detailed study is taken up in Section 1.5.
iid
zk N.0; 1/;
(1.3)
which may be obtained by taking coefficients in any orthonormal basis. We might call this a
monoresolution model when we wish to think of what is going on at a single level in the
wavelet transform domain, as in the bottom panels of Figure 1.2.
Assume now that the k are random, being drawn independently from a Gaussian prior
distribution N.0; 2 /. The posterior distribution of k given the data y is also Gaussian, and
the Bayes estimator is given by the posterior mean
Ok D
yk ;
C1
D
2
:
2
(1.4)
The constant is the squared signal-to-noise ratio. The estimator, sometimes called the
Wiener filter, is optimal in the sense of minimizing the posterior expected squared error.
This analysis has two important features. First, the assumption of a Gaussian prior distribution produces an optimal estimator which is a linear function of the data y. Second, the
estimator does not depend on the choice of orthonormal basis: both the model (1.3) and the
Gaussian prior are invariant under orthogonal changes of basis, and so the optimal rule has
the same linear shrinkage in all coordinate systems.
In contrast, sparsity has everything to do with the choice of bases. Informally, sparsity
conveys the idea that most of the signal strength is concentrated in a few of the coefficients.
Thus a spike signal
.1; 0; : : : ; 0/ is much sparser than a comb vector
.n 1=2 ; : : : ; n 1=2 /
even though both have the same energy, or `2 norm: indeed these could be representations of
the same vector in two different bases. In contrast, noise, almost by definition, is not sparse
in any basis. Thus, among representations of signals in various bases, it is the ones that are
sparse that will be most easily denoised.
Figure 1.4 shows part of a reconstructed signal represented in two different bases: panel a)
is a subset of 27 wavelet coefficients W, while panel b) is a subset of 27 Fourier coefficients
F . Evidently W has a much sparser representation than does F :
The sparsity of the coefficients in a given basis may be quantified using `p norms 4
!1=p
n
X
p
k kp D
jk j
;
(1.5)
kD1
3
4
The use of in place of the more common already betrays a later focus on low noise asympotics!
in fact, only a quasi-norm for p < 1, Appendix C.1.
20
15
15
10
10
-5
-5
-1 0
-1 0
-1 5
-1 5
-2 0
20
40
60
80
100
120
-2 0
20
40
60
80
100
120
Figure 1.4 Panel (a): kW D level 7 of estimated NMR reconstruction fO of Figure
1.2, while in panel (b): kF D Fourier coefficients of fO at frequencies 65 : : : 128,
both real and imaginary parts shown. While these do not represent exactly the same
projections of f; the two overlap and k F k2 D 25:3 23:1 D k W k2 .
which track sparsity for p < 2, with smaller p giving more stringent measures. Thus, while
the `2 norms of our two representations are roughly equal:
k F k2 D 25:3 23:1 D k W k2 ;
the `1 norm of the sparser representation W is smaller by a factor of 6:5:
k F k1 D 246:5 37:9 D k W k1 :
P
Figure 1.5 shows that the `p -norm level sets f W n1 jk jp C p g become progressively
smaller and clustered around the co-ordinate axes as p decreases. Thus, the only way for
a signal in an `p ball to have large energy (i.e. `2 norm) is for it to consist of a few large
components, as opposed to many small components of roughly equal magnitude. Put another
way, among all signals with a given energy, the sparse ones are precisely those with small
`p norm.
p=1
p=1
p small
p=2
Introduction
Thus, we will use sets fk kp C g as quantitative models for a priori constraints that the
signal has an approximately sparse representation in the given basis.
How might we exploit this sparsity information in order to estimate better: in other
words, can we estimate W better than F We quantify the quality of estimator O .y/ using
Mean Squared Error (MSE):
E kO
k D
n
X
E.Ok
k /2 ;
(1.6)
kD1
in which the expectation averages over the distribution of y given , and hence over the
noise z D .zk / in (1.3).
Figure 1.6 shows an idealized case in which all k are zero except for two spikes, each of
p
p D C D 1: it is thus
size 1=2: Assume,
Pn for simplicity here, that D n D 1= n and that
supposed that 1 jk j 1: Consider the class of linear estimators Oc .y/ D cy, which have
per co-ordinate variance c 2 n2 and squared bias .1 c/2 k2 . Consequently, the mean squared
error (1.6)
(
n
X
1
cD1
MSE D
c 2 n2 C .1 c/2 k2 D c 2 C .1 c/2 =2 D
1=2 c D 0:
1
The upper right panel shows the unbiased estimate with c D 1; this has no bias and only
variance. The lower left panels shows c D 0 with no variance and only bias. The MSE
calculation shows that no value of c leads to a linear estimate with much better errorthe
minimum MSE
P is 1/3 at c D 1=3. As an aside, if we were interested instead in the absolute,
or `1 error k jOk k j, we could visualize it using the vertical linesagain this is relatively
large for all linear estimates.
In the situation of Figure 1.6, thresholding is natural. As in the preceding section, define
the hard threshold estimator by its action on coordinates:
(
yk if jyk j n ;
O
;k .y/ D
(1.7)
0
otherwise:
The lower right panel of Figure 1.6 uses a threshold of n D 2:4n D 0:3: For the particular
configuration of true means k shown there, the data from the two spikes pass the threshold
unchanged, and so are essentially unbiased estimators. Meanwhile, in all other coordinates,
the threshold correctly sets all coefficients to zero except for the small fraction of noise that
exceeds the threshold.
As is verified in more detail in Exercise 1.2, the MSE of O consists essentially of two
variance contributions each of n2 from the two spikes, and n 2 squared bias contributions of
2
2n2 ./ from the zero components, where ./ D .2/ 1=2 e =2 denotes the standard
Gaussian density. Hence, in the two spike setting,
E kO
2/n2 ./
(1.8)
C 2./ 0:139
when n D 64 and D 2:4. This mean squared error is of course much better than for any
of the linear estimators.
0.4
0.4
0.2
0.2
0.2
0.2
0
20
40
60
0.4
0.2
0.2
0.2
0.2
20
40
40
60
Threshold Estimate
0.4
20
60
20
40
60
Figure 1.6 (a) Visualization of model (1.3): open circles are unknown values k ,
crosses are observed data yk . In the other panels, solid circles show various
O for k D 1; : : : ; n D 64: Horizontal lines are thresholds at
estimators ,
D 2:4n D 0:3: (b) Vertical lines indicate absolute errors jO1;k k j made by
leaving the data alone: O1 .y/ D y: (c) Corresponding absolute errors for the zero
estimator O0 .y/ D 0: (d) Much smaller errors due to hard thresholding at D 0:3:
10
Introduction
k22 :
(1.9)
(Here, E denotes expectation over y given , and E expectation over , Section 4.1.)
Of course, the Statistician tries to minimize the risk and Nature to maximize it.
Classical work in statistical decision theory (Wald, 1950; Le Cam, 1986), Chapter 4 and
Appendix A, shows that the minimax theorem of von Neumann can be adapted to apply
here, and that the game has a well defined value, the minimax risk:
Rn D inf sup B.O ; / D sup inf B.O ; /:
O
O
(1.10)
risk B
An estimator O attaining the left hand infimum in (1.10) is called a minimax strategy
or estimator for player I, while a prior distribution attaining the right hand supremum
is called least favorable and is an optimal strategy for player II. Schematically, the pair of
optimal strategies .O ; / forms a saddlepoint, Figure 1.7: if Nature uses , the best the
Statistician can do is to use O . Conversely, if the Statistician uses O , the optimal strategy
for Nature is to choose .
prior
estimator of
Figure 1.7 Left side lower axis: strategies for Nature. Right side lower axis:
strategies O for the Statistician. Vertical axis: payoff B.O ; / from the Statistician to
Nature. The saddlepoint indicates a pair .O ; / of optimal strategies.
It is the structure of these optimal strategies, and their effect on the minimax risk Rn that
is of chief statistical interest.
While these optimal strategies cannot be exactly evaluated for finite n, informative asymptotic approximations are available. Indeed, as will be seen in Section 13.5, an approximately
least favorable distribution is given by drawing the individual coordinates k ; k D 1; : : : ; n
11
Prior Constraint
traditional (`2 ) sparsity (`1 )
minimax estimator
linear
thresholding
least favorable
Gaussian
sparse
minimax MSE
D 1=2
log n
n
Table 1.1 Comparison of structure of optimal strategies in the monoresolution game under
traditional and sparsity assumptions.
(1.11)
This amounts to repeated tossing of a coin highly biasedptowards zero. Thus, in n draws, we
expect to see a relatively small number, namely nn D n= log n of non-zero components.
The size of these non-zero values is such that they are
p hard to distinguish from the larger
values among the remaining, more numerous, n
n= log n observations that are pure
noise. Of course, what makes this distribution difficult for Player I, the Statistician, is that
the locations of the non-zero components are random as well.
It can also be shown, Chapter 13, that an approximately minimax estimator for this setting
is given by
p the hard thresholding rule described earlier, but with threshold given roughly by
n D n log.n log n/. This estimate asymptotically achieves the minimax value
p
Rn log n=n
for MSE. [Exercise 1.3 bounds the risk B.On ; /; (1.9), for this prior, hinting at how this
minimax value arises]. It can also be verified that no linear estimator can achieve a risk less
than 1=2 if Nature chooses a suitably uncooperative probability distribution for , Theorem
9.5 and (9.28).
p
In the setting of the previous
section with n D 64 and n D 1= n, we find that the
p
non-zero magnitudes n log n D 0:255 and the expected non-zero number nn D 3:92.
p
Finally, the threshold value n log.n log n/ D :295:
Thisand anystatistical decision problem make a large number of assumptions, including values of parameters that typically are not known in practice. We will return later
to discuss the virtues and vices of the minimax formulation. For now, it is perhaps the qualitative features of this solution that most deserve comment. Had we worked with simply a
signal to noise constraint, E k k22 1, say, we would have obtained a Gaussian prior distribution N.0; n2 / as being approximately least favorable and the linear Wiener filter (1.4)
with n2 D n2 D 1=n as an approximately minimax estimator. As may be seen from the summary in Table 1.1, the imposition of a sparsity constraint E kk1 1 reflects additional a
priori information and yields great improvements in the quality of possible estimation, and
produces optimal strategies that take us far away from Gaussian priors and linear methods.
12
Introduction
yi D i C %i zi ;
zi N.0; 1/;
i 2 I:
(1.12)
The index set will typically be a singleton, I D f1g, finite I D f1; : : : ; ng, or infinite
I D N. Multidimensional index sets, such as f1; : : : ; ngd or Nd are certainly allowed, but
will appear only occasionally. The scale parameter sets the level of the noise, and in some
settings will be assumed to be small.
In particular, we often focus on the model with I D f1; : : : ; ng. Although this model
is finite dimensional, it is actually non-parametric in character since the dimension of the
unknown parameter equals that of the data. In addition, we often consider asymptotics as
n ! 1.
We turn to a first discussion of models motivating, or leading to, (1.12)further examples
and details are given in Chapters 2 and 3.
Nonparametric regression. In the previous two sections, was a vector with no necessary relation among its components. Now we imagine an unknown function f .t /. The
independent variable t is thought of as low dimensional (1 for signals, 2 for images, 3 for
volumetric fields etc.); indeed we largely confine attention to functions of a single variable,
say time, in a bounded interval, say 0; 1. In a sampled-data model, we might have points
0 t1 tn 1, and
Yl D f .tl / C Zl ;
i id
Zl N.0; 1/:
(1.13)
This is the model for the two examples of Section 1.1 with the i.i.d. Gaussian assumption
added.
We can regard Y , Z and f D .f .tl // as vectors in Rn , viewed
Pn as the time domain and
endowed with a normalized inner product ha; bin D .1=n/ lD1 al bl , and corresponding
norm k k2;n . Let f'i g be an arbitrary orthonormal basis with respect to h; in . For example,
if the tl were equally spaced, this might be the discrete Fourier basis of sines and cosines. In
general, form the inner products
p
(1.14)
yk D hY; 'k in ;
k D hf; 'k in
zk D nhZ; 'k in :
One can check easily that under model (1.13), the zk are iid N.0; 1/, so that .yk / satisfies
p
(1.3) with D = n.
We illustrate the reduction to sequence form with the smoothing spline estimator used in
Section 1.1, and so we suppose that an estimator fO of f in (1.13) is obtained by minimizing
the penalized sum of squares S.f / C P .f /, or more explicitly
Z 1
n
X
Q.f / D n 1
Yl f .tl /2 C
.f 00 /2 :
(1.15)
lD1
13
The account here is brief; for much more detail see Green and Silverman (1994) and the
chapter notes.
It turns out that a unique minimizer exists and belongs to the space S of natural cubic
splinestwice continuously differentiable functions that are formed from cubic polynomials
on each interval tl ; tlC1 and are furthermore linear on the outermost intervals 0; t1 and
tn ; 1. Equally remarkably, the space S has dimension exactly n, and possesses a special
orthonormal basis, the Demmler-Reinsch basis. This basis consists of functions 'k .t /and
associated vectors 'k D .'k .tl //that are simultaneously orthogonal both on the set of
sampling points and on the unit interval:
Z 1
h'j ; 'k in D j k
and
'j00 'k00 D wk j k :
(1.16)
0
k / C
n
X
wk k2 :
(1.17)
(Exercise 1.4.) The charm is that this can now readily be minimized term by term to yield
the sequence model expression for the smoothing spline estimate OS S :
OS S;k D ck yk D
1
yk :
1 C wk
(1.18)
The estimator is thus linear in the data and operates co-ordinatewise. It achieves its smoothing aspect by shrinking the higher frequencies by successively larger amounts dictated by
the increasing weights wk . In the original time domain,
X
X
Of D
ck yk 'k :
(1.19)
OS S;k 'k D
k
There is no shrinkage on the constant and linear terms: c1 D c2 D 1, but for k 3,
the shrinkage factor ck < 1 and decreases with increasing frequency. Large values of
smoothing parameter lead to greater attenuation of the data, and hence greater smoothing
in the estimate.
To represent the solution in terms of the original data, gather the basis functions into an
p
p
p
n n orthogonal matrix U D '1 ; : : : ; 'n = n: Then Y D nUy and f D nU , and so
p
Of D nU O D Uc U T Y;
c D diag .ck /:
(1.20)
Notice that the change of basis matrix U does not depend on : Thus, many important
14
Introduction
aspects of the spline smoothing problem, such as the issue of choosing well from data, can
be studied in the diagonal sequence form that the quasi-Fourier basis provides.
Software packages, such as spline.smooth in R, may use other bases, such as B splines,
to actually compute the spline estimate. However, because there is a unique solution to the
optimization problem, the estimate computed in practice must coincide, up to numerical
error, with (1.20).
We have so far emphasized structure that exists whether or not the points tl are equally
spaced. If, however, tl D l=n and it is assumed that f is periodic, then everything in the
approach above has an explicit form in the Fourier basisSection 3.4.
Continuous Gaussian white noise model. Instead of sampling a function at a discrete set
of points, we might suppose that it can be observedwith noise!throughout the entire
interval. This leads to the central model to be studied in this book:
Z t
f .s/ds C W .t /;
0 t 1;
(1.21)
Y.t/ D
0
0 t 1:
(1.22)
The observational noise consists of a standard Brownian motion W, scaled by the known
noise level : For an arbitrary square integrable function g on 0; 1, we therefore write
Z 1
Z 1
Z 1
g.t/d Y.t / D
g.t /f .t /dt C
g.t /d W .t /:
(1.23)
0
The third integral features a deterministic function g and a Brownian increment d W and is
known as a Wiener integral. We need only a few properties of standard Brownian motion
and Wiener integrals, which are recalled in Appendix C.13.
The function Y is observed, and we seek to recover the unknown function f , assumed to
be square integrable: f 2 L2 0; 1, for example using the integrated squared error loss
Z 1
kfO f k2L2 D
.fO f /2 :
0
To rewrite the model in sequence form, we may take any orthonormal basis f'i .t /g for
L2 0; 1. Examples include the Fourier basis, or any of the classes of orthonormal wavelet
bases to be discussed later. To set notation for the coefficients, we write
Z 1
Z 1
Z 1
yi D Y.'i / D
'i d Y;
i D hf; 'i i D
f 'i ;
zi D W .'i / D
'i d W:
0
(1.24)
From the stationary and independent increments properties of Brownian motion, the Wiener
integrals zi are Gaussian variables that have mean 0 and are uncorrelated:
Z 1
hZ 1
i Z 1
Cov.zi ; zj / D E
'i d W
'j d W D
'i 'j dt D ij :
0
As a result, the continuous Gaussian model is entirely equivalent to the constant variance
15
sequence model (1.3). The Parseval relation, (C.1), converts squared error in the function
domain to the analog in the sequence setting:
Z 1
X
.fO f /2 D
.Oi i /2 :
(1.25)
0
Linking regression and white noise models. Heuristically, the connection between (1.13)
and (1.21) arises by forming the partial sum process of the discrete data, now assumed to be
equally spaced, tl D l=n:
nt
nt
nt
1X
1 X
1X
Yl D
f .l=n/ C p p
Zl :
(1.26)
n 1
n 1
n n 1
Rt
1 P
The signal term is a Riemann sum approximating 0 f , and the error term n 2 nt Zl
converges weakly to standard Brownian motion as n ! 1. Making the calibration D
p
.n/ D = n; and writing Y.n/ for the process in (1.21), we see that, informally, the processes Y.n/ .t/ and Yn .t/ merge as n ! 1. A formal statement and proof of this result is
given in Chapter 3.11, using the notion of asymptotic equivalence of statistical problems,
which implies closeness of risks for all decision problems with bounded loss. Here we simply observe that heuristically there is convergence of mean average squared errors. Indeed,
for fixed functions fO and f 2 L2 0; 1:
Z 1
n
X
fO.l=n/ f .l=n/2 !
kfO f k22;n D n 1
fO f 2 :
Yn .t/ D
Non white noise models. So far we have discussed only the constant variance subclass
of models (1.12) in which i 1. The scope of (1.12) is considerably broadened by allowing unequal i > 0. Here we make only a few remarks, deferring further discussion and
examples to Chapters 2 and 3.
When the index set I is finite, say f1; : : : ; ng, two classes of multivariate Gaussian models
lead to (1.12):
(i) Y N.; 2 /, by transforming to an orthogonal basis that diagonalizes , so that
2
.%i / are the eigenvalues of , and
P
(ii) Y N.A; 2 I /, by using the singular value decomposition of A D i bi ui viT and
setting yi D bi 1 Yi , so that %i D bi 1 are the inverse singular values.
When the index set I is countably infinite, case (i) corresponds to a Gaussian process
with unknown mean function f and the sequence form is obtained from the KarhunenLo`eve transform (Section 3.10). Case (ii) corresponds to observations in a linear inverse
problem with additive noise, Y D Af C Z, in which we do not observe f but rather its
image Af after the action of a linear operator A, representing some form of integration,
smoothing or blurring. The conversion to sequence form is again obtained using a singular
value decomposition, cf. Chapter 3.
16
Introduction
1.7 Notes
17
The focus then turns to the phenomena of sparsity and non-linear estimation via coordinatewise thresholding. To set the stage, Chapter 7 provides a primer on orthonormal
wavelet bases and wavelet thresholding estimation. Chapter 8 focuses on the properties of
thresholding estimators in the sparse normal means model: y Nn .; 2 I / and the unknown vector is assumed to be sparse. Chapter 9 explores the consequences of these
thresholding results for wavelet shrinkage estimation, highlighting the connection between
sparsity, non-linear approximation and statistical estimation.
Part II is structured around a theme already implicit in Chapters 8 and 9: while wavelet
bases are specifically designed to analyze signals using multiple levels of resolution, it is
helpful to study initially what happens with thresholding etc. at a single resolution scale
both for other applications, and before assembling the results across several scales to draw
conclusions for function estimation.
Thus Chapters 1014 are organized around two strands: the first strand works at a single
or mono-resolution level, while the second develops the consequences in multiresolution
models. Except in Chapter 10, each strand gets its own chapter. Three different approachs
are exploredeach offers a different tradeoff between generality, sharpness of optimality,
and complexity of argument. We consider in turn
(i) optimal recovery and universal thresholds (Ch. 10)
(ii) penalized model selection (Chs. 11, 12)
(iii) minimax-Bayes optimal methods (Chs. 13, 14)
The Epilogue, Chapter 15 has two goals. The first is to provide some detail on the comparison between discrete and continuous models. The second is to mention some recent related
areas of work not covered in the text. The Appendices collect background material on the
minimax theorem, functional classes, smoothness and wavelet decompositions.
1.7 Notes
1.4. Although our main interest in the Demmler and Reinsch (1975) [DR] basis lies in its properties,
for completeness,
we provide a little more information on its construction. More detail for our penalty
R
P .f / D .f 00 /2 appears in Green and Silverman (1994) [GS]; here we make more explicit the connection
between the two discussions. Indeed, [GS] describe tridiagonal matrices Q and R, built respectively from
divided differences and from inner products of linear B-splines. The Demmler-Reinsch weights wk and
basis vectors 'k are given respectively by eigenvalues and vectors of the matrix K D QR 1 QT . The
functions 'k .t/ are derived from 'k using the natural interpolating spline (AC 'k in DR) given in [GS]
Section 2.4.2.
Related books and monographs. The book of Ibragimov and Hasminskii (1981), along with their many
research papers has had great influence in establishing the central role of the signal in Gaussian noise model.
Textbooks on nonparametric estimation include Efromovich (1999) and Tsybakov (2009), which include
coverage of Gaussian models but range more widely, and Wasserman (2006) which is even broader, but
omits proofs.
Closer to the research level are the St. Flour courses by Nemirovski (2000) and Massart (2007). Neither are primarily focused on the sequence model, but do overlap in content with some of the chapters of
this book. Ingster and Suslina (2003) focuses largely on hypothesis testing in Gaussian sequence models.
References to books focusing on wavelets and statistics are collected in the notes to Chapter 7.
18
Introduction
Exercises
1.1
D lim
p!1
p!0
1.2
X
n
jk jp
1=p
;
kD1
n
X
jk jp :
(1.27)
(1.28)
kD1
[kk1 is a legitimate norm on Rn , while kk0 is not: note the absence of the pth root in the
limit. Nevertheless it is often informally called the `0 norm.]
(Approximate MSE of thresholding for two spike signal.) Suppose that y Nn .; 2 I /, compare (1.3), and that O D .O;k / denotes hard thresholding, (1.7).
(a) Verify the MSE decomposition
E.O;k
k /2 D Ef.yk
(1.29)
1.3
(1.30)
(c) When k is large relative to n , show that the MSE is approximately E.yk k /2 D n2 :
(d) Conclude that (1.8) holds.
(Risk bound for two point prior.) Let y Nn .; n2 I / and O denote the hard thresholding
rule (1.7). Let r.; k I n / D E.;k k /2 denote the risk (mean squared error) in a single
co-ordinate.
(i) for the two point prior given in (1.11), express the Bayes risk B.O ; / D E E kO k22
in terms of the risk function ! r.; I n /.
(ii) Using (1.29), derive the bound
r.; n I n / .1 C 2 /n2 :
p
(iii) Using also (1.30), verify that for D n log.n log n/,
p
B.O ; / log n=n .1 C o.1//:
1.4
[This gives the risk for a typical configuration of drawn from the least favorable prior (1.11).
It does not yet show that the minimax risk Rn satisfies this bound. For a simple, but slightly
suboptimal, bound see Theorem 8.1; for the actual argument, Theorems 13.7, 13.9 and 13.17].
(Sequence form of spline penalized sum of squares.) Take as given the fact that the minimizer of
P
(1.15) belongs to the space S and hence has a representation f .t / D nkD1 k 'k .t / in terms of
the Demmler-Reinsch basis f'k .t /gnkD1 . Use the definitions (1.14) and orthogonality relations
(1.16) to verify that
P
P
(i) f D .f .tl // equals k k 'k and kY fk22;n D nkD1 .yk k /2 .
R 00 2 Pn
(ii) .f / D 1 wk k2 and hence that Q.f / D Q. / given by (1.17).
2
The multivariate normal distribution
We know not to what are due the accidental errors, and precisely because we do not
know, we are aware they obey the law of Gauss. Such is the paradox. (Henri Poincare,
The Foundations of Science.)
Estimation of the mean of a multivariate normal distribution, y Nn .; 02 I /, is the elemental estimation problem of the theory of statistics. In parametric statistics it is sometimes
plausible as a model in its own right, but more often occursperhaps after transformationas
a large sample approximation to the problem of estimating a finite dimensional parameter
governing a smooth family of probability densities.
In nonparametric statistics, it serves as a building block for the study of the infinite dimensional Gaussian sequence model and its cousins, to be introduced in the next chapter.
Indeed, a recurring theme in this book is that methods and understanding developed in the
finite dimensional Gaussian location model can be profitably transferred to nonparametric
estimation.
It is therefore natural to start with some definitions and properties of the finite Gaussian
location model for later use. Section 2.1 introduces the location model itself, and an extension to known diagonal covariance that later allows a treatment of certain correlated noise
and linear inverse problem models.
Two important methods of generating estimators, regularization and Bayes rules, appear
in Sections 2.2 and 2.3. Although both approaches can yield the same estimators, the distinction in point of view is helpful. Linear estimators arise from quadratic penalties/Gaussian
priors, and the important conjugate prior formulas are presented. Non-linear estimators arise
from `q penalties for q < 2, including the soft and hard thresholding rules, and from sparse
mixture priors that place atoms at 0, Section 2.4.
Section 2.5 begins the comparative study of estimators through their mean squared error
properties. The bias and variance of linear estimators are derived and it is shown that sensible
linear estimators in fact must shrink the raw data. The James-Stein estimator explodes any
hope that we can get by with linear methods, let alone the maximum likelihood estimator.
Its properties are cleanly derived using Steins unbiased estimator of risk; this is done in
Section 2.6.
Soft thresholding consists of pulling each co-ordinate yi towards, but not past, 0 by a
threshold amount . Section 2.7 develops some of its properties, including a simple oracle
inequality which already shows that thresholding outperforms James-Stein shrinkage on
sparse signals, while James-Stein can win in other dense settings.
19
20
Section 2.8 turns from risk comparison to probability inequalities on the tails of Lipschitz
functions of a multivariate normal vector. This concentration inequality is often useful in
high dimensional estimation theory; the derivation given has points in common with that of
Steins unbiased risk estimate.
Section 2.9 makes some remarks on more general linear models Y D A C e with correlated Gaussian errors e, and how some of these can be transformed to diagonal sequence
model form.
i D 1; : : : ; n:
(2.1)
Here .yi / represents the observed data. The signal .i / is unknownthere are n unknown
parameters. The .zi / are independent N.0; 1/ noise or error variables, and is the noise
level, which for simplicity we generally assume to be known. The model is called white
because the noise level is the same at all indices, which often represent increasing frequencies. Typically we will be interested in estimation of .
Equation (2.1) can also be written in the multivariate normal mean form y Nn .; 2 I /
that is the central model for classical parametric
Q statistical theoryone justifications is recalled in Exercise 2.26. We write .y / D i .yi i / for the joint density of .yi / with
2 1=2
respect to Lebesgue measure. The
expf yi2 =2 2 g.
R y univariate densities .yi / D .2 /
We put D 1 and .y/ D 1 .s/ds for the standard normal density and cumulative
distribution function.
Two generalizations considerably extend the scope of the finite sequence model. In the
first, corresponding to indirect or inverse estimation,
yi D i i C zi ;
i D 1; : : : ; n;
(2.2)
the constants i are known and positive. In the second, relevant to correlated noise,
yi D i C %i zi ;
i D 1; : : : ; n:
(2.3)
Here again the constants %i are known and positive. Of course these two models are equivalent in the sense that dividing by i in the former and setting %i D 1=i and yi0 D yi =i
yields the latter. In this sense, we may regard (2.3) as describing the general case. In Section
2.9, we review some Gaussian linear models that can be reduced to one of these sequence
forms.
Among the issues to be addressed are
(i) we imagine .i / to be high dimensional. In particular, as decreases, the number of parameters n D n./ may increase. This makes the problem fundamentally nonparametric.
(ii) what are the effects of .i / or .%i /, i.e. the consequences of indirect estimation, or correlated noise, on the ability to recover ?
(iii) asymptotic behavior as ! 0. This corresponds to a low-noise (or large sample size)
limit.
21
(iv) optimality questions: can one describe bounds for minimum attainable error of estimation
and estimators that (more or less) achieve these bounds?
Before starting in earnest, we briefly introduce the Stein effect, a phenomenon mentioned
already in Section 1.5 as basic to high-dimensional estimation, as motivation for much of
the work of this chapter.
Perhaps the obvious first choice of estimator of in model (2.1) is OI .y/ D y. It is the
least squares and maximum likelihood estimator. It is unbiased, E OI D , and its mean
squared error, (1.6), is constant: E kOI k2 D n 2 D Rn ; say.
However it is easy to greatly improve on the MLE when the dimension n is large. Consider
first the linear shrinkage estimators Oc .y/ D cy for c < 1, introduced in Section 1.2: we
saw that the MSE
E kOc k2 D c 2 n 2 C .1 c/2 kk2 :
This MSE is less than Rn if k k2 <
c Rn for
c D .1 C c/=.1 c/ and can be much smaller
at D 0, compare Figure 2.1.
^
I
Rn=n
^
c
JS
^
jjjj
Figure 2.1 Schematic comparison of mean squared error functions for the unbiased
estimator (MLE) OI , a linear shrinkage estimator Oc and James-Stein estimator O JS .
k2 2 2 C
.n 2/ 2 k k2
:
.n 2/ 2 C kk2
22
Thus, like the linear shrinkage estimators, O JS offers great MSE improvement near 0, but
unlike the linear estimator, the improvement persists, albeit of small magnitude, even if kk
is large. This is summarized qualitatively in Figure 2.1.
These improvements offered by linear and James-Stein estimators, along with those of the
threshold estimators introduced in Section 1.3, motivate the more systematic study of wide
classes of estimators using shrinkage and thresholding in the sequence models (2.1) (2.3).
Ak22 C P . /:
The reason for the names regularize and penalty function becomes clearer in the general
linear model setting, Section 2.9. Here we explore the special consequences of diagonal
structure. Indeed, since A is diagonal, the data term is a sum of individual
components
P
and so it is natural to assume that the penalty also be additive: P . / D pi .i /, so that
X
.yi i i /2 C pi .i /:
Q. / D
i
P
Two simple and commonly occurring penalty functions are quadratic:
P
.
/
D
!i i2 for
P
n
q
some non-negative constants !i , and q t h power: P . / D k kq D i D1 ji jq :
The crucial regularization parameter determines the relative weight given to the sum of
squared error and penalty terms: much more will be said about this later. As varies from
0 to C1, we may think of the penalized estimates O as forming a path from the roughest,
least squares solution vector O0 D .yi =i / to the smoothest solution vector O1 D 0:
Since Q./ has an additive structure, it can be minimized term by term, leading to a
univariate optimization for each coefficient estimate Oi . This minimization can be done explicitly in each of three important cases.
(i) `2 penalty: pi .i / D !i i2 : By differentiation, we obtain a co-ordinatewise linear
shrinkage estimator
i
yi :
(2.4)
Oi .y/ D 2
i C !i
(ii) `1 penalty: p.i / D 2ji j: We take i 1 here for convenience. Considering only a
single co-ordinate and dropping subscripts i, we have
Q. / D .y
/2 C 2jj:
23
(
D
.y / > 0
.y C / < 0
is piecewise linear with positive slope except for an upward jump of 2 at D 0. Hence
Q0 ./ has exactly one sign change (from negative to positive) at a single point D O which
must therefore be the minimizing value of Q. /. Depending on the value of y, this crossing
point is positive, zero or negative, indeed
8
y>
<y
O
.y/ D 0
(2.5)
jyj
:
yC
y < :
This is called soft thresholding at threshold . As is evident from Figure 2.2, the estimator
O is characterized by a threshold zone y 2 ; , in which all data is set to 0, and by
shrinkage toward 0 by a fixed amount whenever y lies outside the threshold zone: jyj > .
The thresholding is called soft as it is a continuous function of input data y. When applied
to vectors y D .yi /, it typically produces sparse fits, with many co-ordinates O;i D 0, with
larger values of producing greater sparsity.
(iii) `0 penalty. p.i / D I fi 0g. The total penalty counts the number of non-zero
coefficients:
X
P . / D
p.i / D #fi W i 0g:
i
(Exercise 1.1 explains the name `0 -penalty). Again considering only a single coordinate,
and writing the regularization parameter as 2 ,
Q. / D .y
/2 C 2 I f 0g:
By inspection,
min Q. / D minfy 2 ; 2 g;
(2.6)
This is called hard thresholding at threshold : The estimator keeps or kills the data y
according as it lies outside or inside the threshold zone ; . Again O produces sparse
fits (especially for large ), but with the difference that there is no shrinkage of retained
coefficients. In particular, the estimate is no longer a continuous function of the data.
24
^(y)
Figure 2.2 Left panel: soft thresholding at , showing threshold zone and
shrinkage by towards 0 outside threshold zone. Dashed line is 45 degree line.
Right panel: hard thresholding, with no shrinkage outside the threshold zone.
between the form of Bayes estimators and the penalized estimators of the last section. The
more decision theretic detail, is postponed to Chapter 4.
Suppose we have a prior probability distribution .d / on Rn , and a family of sampling
distributions P .dyj/, namely a collection of probability measures indexed by on the sample space Y D Rn . Then there is a joint distribution P, say, on Y and two factorizations
into marginal and conditional distributions:
P.d; dy/ D .d /P .dyj / D P .dy/.djy/:
(2.7)
Here P .dy/ is the marginal distribution of y and .djy/ the posterior for given y.
Now suppose that all sampling distributions have densities with respect to Lebesgue measure, P .dyj/ D p.yj/dy: Then the marginal distribution also has a density with respect
to Lebesgue measure, P .dy/ D p.y/dy, with
Z
p.y/ D p.yj /.d /;
(2.8)
and we arrive at Bayes formula for the posterior distribution
.djy/ D
p.yj /.d /
:
p.y/
In part, this says that the posterior distribution .djy/ is absolutely continuous with respect
to the prior .d/, and applies equally well whether the prior is discrete (for example, as
at (2.26) below) or continuous. We denote
by Ey expectation with respect to the posterior
R
distribution given y; thus Ey h. / D h. /.djy/.
A loss function associates a loss L.a; / 0 with each pair .a; / in which a 2 Rn
denotes an action, or estimate, chosen by the statistician, and 2 Rn denotes the true
parameter value. Typically L.a; / D w.a / is a function w./ of a . Our main
examples here will be quadratic and qth power losses:
w.t/ D t T Qt;
w.t / D ktkqq D
n
X
i D1
jti jq :
25
Here Q is assumed to be a positive definite matrix. Given a prior distribution and observed
data y, the posterior expected loss (or posterior risk)
Z
Ey L.a; / D L.a; /.djy/
is a function of a (and y). The Bayes estimator corresponding to loss function L is obtained
by minimizing the posterior expected loss:
O .y/ D argmina Ey L.a; /:
(2.9)
For now, we assume that a unique minimum exists, and ignore measure theoretic questions
(see the Chapter Notes).
The Bayes risk corresponding to prior is the expected valuewith respect to the marginal
distribution of yof the posterior expected loss of O :
B./ D EP Ey L.O .y/; /:
(2.10)
Remark. The frequentist definition of risk function begins with the first factorization in
(2.7), thus
r.O ; / D E L.O .y/; /:
(2.11)
This will be taken up in Section 2.5 and beyond, and also in Chapter 4, where it is seen to
lead to an alternate, but equivalent definition of the Bayes rule O in (2.9).
Example 1. Quadratic loss and posterior mean. Suppose that L.a; / D .a /T Q.a /
for some positive definite matrix Q. Then a ! Ey L.a; / has a unique minimum, given by
the zero of
ra Ey L.a; / D 2Qa
Ey ;
and so the Bayes estimator for a quadratic loss function is just the posterior mean
O .y/ D Ey D E.jy/:
(2.12)
Note, in particular, that this result does not depend on the particular choice of Q > 0. The
posterior expected loss of O is given by
EL.O ; /jy D E
E.jy/T Q
E.jy/ D trQCov.jy/:
(2.13)
Conjugate priors for the multivariate normal. Suppose that the sampling distribution
P .dyj/ is multivariate Gaussian Nn .; / and that the prior distribution .d / is also
Gaussian: Nn .0 ; T / 1 . Then the marginal distribution P .dy/ is Nn .0 ; C T / and the
posterior distribution .djy/ is also multivariate normal Nn .y ; y /this is the conjugate
prior property. Perhaps most important are the formulas for the posterior mean and covariance matrix:
y D .
1
CT
/ 1 . 1 y C T
0 /;
y D .
CT
(2.14)
26
y D T
T .T C / 1 T:
(2.15)
2 /
12 221 21 :
Apply this to the joint distribution that is implied by the assumptions on sampling distribution and prior, after noting that Cov.; y/ D T ,
0
T
T
N
;
y
0
T T C
which yields formulas (2.15) for the posterior mean and variance, after noting that
I
T .T C /
D .T C / 1 :
Formulas (2.14) may then be recovered by matrix algebra, using the identity
T
T .T C / 1 T D .T
C 1/ 1:
27
1979). In addition, the constancy of posterior variance characterizes Gaussian priors, see
Exercise 2.3.
Product priors and posteriors. Suppose that the components
Q of the prior are independent, so that we may form the product measure .d / D i i .di /, and suppose
that the sampling
Q distributions are independent, each depending on only one i , so that
P .dyj/ D i P .dyi ji /. Then from Bayes formula the posterior distribution factorizes
also:
Y
.djy/ D
.di jyi /:
(2.16)
i
In this situation, then, calculations can be done co-ordinatewise, and are hence generally
much simpler.
Additive Loss Functions take the special form
X
L.a; / D
`.ai ; i /:
(2.17)
Under the assumption of product joint distributions, we have just seen that the posterior
distribution factorizes. In this case, the i th component of the posterior expected loss
Z
Ey `.ai ; i / D `.ai ; i /.di jyi /
can be computed based on .ai ; yi / alone. As a result, the posterior expected loss Ey L.a; /
can be minimized term by term, and so the Bayes estimator
X
O .y/ D argmin.ai / Ey
`.ai ; i / D .Oi .yi //
(2.18)
i
28
(2.19)
1
2
i C i
if
if
i2 i2
:
i2 C i2
(2.20)
i2 i2 ;
i2 i2 ;
corresponding to very concentrated and very vague prior information about respectively.
Remark on notation. Formulas are often simpler in the case of unit noise, D 1, and we
reserve a special notation for this setting: x Nn .; I /, or equivalently
xi D i C zi ;
i id
zi N.0; 1/;
(2.21)
for i D 1; : : : ; n. It is usually easy to recover the formulas for general by rescaling. Thus,
if y D x and D , then y Nn .; 2 I / and so if O D ,
O then for example
EkO .y/
k2 D 2 Ek.x/
O
k2 :
(2.22)
Examples. 1. There is a useful analytic expression for the posterior mean in the Gaussian
shift model x Nn .; I /. First we remark
R that in this case the marginal density (2.8) has the
convolution form p.x/ D ? .x/ D .x /.d/. Since p.x/ is finite everywhere
it has integral 1 and is continuousit follows from a standard exponential family theorem
(Lehmann and Romano, 2005, Theorem 2.7.1) that p.x/ is actually an analytic function of
x, and so in particular is infinitely differentiable everywhere.
Now, the Bayes estimator can be written
Z
O .x/ D .x /.d/=p.x/:
29
xi .x/;
x/, we arrive at
O .x/ D x C
rp.x/
D x C r log p.x/;
p.x/
(2.23)
which represents the Bayes rule as the perturbation of the maximum likelihood estimator
O 0 .x/ D x by a logarithmic derivative of the marginal density of the prior.
We illustrate how this representation allows one to deduce shrinkage properties of the
estimator from assumptions on the prior. Suppose that the prior .d/ D
./d has a
continuously differentiable density that satisfies, for all ,
kr log
./k :
(2.24)
This forces the prior tails to be at least as heavy as exponential: it is easily verified that
.0/e
kk
(2.26)
The posterior also concentrates on f ; g, but with posterior probabilities given by
.f gjx/ D
1
.x
2
1
.x
2
/
/ C 12 .x C /
e x
e x
Ce
x
(2.27)
so that
.f gjx/ > .f gjx/
(2.28)
2
E .jx/ jx D
2
;
cosh2 x
(2.29)
30
B. / D e
1
1
.x/dx
:
cosh x
(2.30)
(2.31)
Thus a (large) fraction 1 w of co-ordinates are 0, while a (small) fraction w are drawn
from a prior probability distribution
. Such sparse mixture priors will occur in several later
chapters. Later we will consider simple discrete priors for
.d/, but for now we assume
that
.d/ has a density
./d which is symmetric about 0 and unimodal.
In this section, our main interest in these priors is that their posterior medians generate
threshold rules in which the threshold zone depends naturally on the sparsity level w.
Proposition 2.1 Suppose that the prior has mixture form (2.31) for w > 0 and that the
non-zero density
./ is symmetric and unimodal. The posterior median .x/
O
D O .x/ is
(a) monotone in x and antisymmetric: .
O x/ D .x/,
O
(b) a shrinkage rule: 0 .x/
O
x for x 0,
(c) a threshold rule: there exists t .w/ > 0 such that
.x/
O
D 0 if and only if
jxj t .w/:
(d) Finally, the threshold t.w/, as a function of w, is continuous and strictly decreasing
from t D 1 at w D 0 to t D 0 at w D 1.
Some remarks: we focus on the posterior median since it turns out (REF?) that the posterior mean must be a smooth, even analytic, function of x, and so cannot have a threshold
zone. Unimodality means that
./ is decreasing in for 0; this assumption facilitates
the proof that the posterior median O is a shrinkage rule.
The behavior of t.w/ with w is intuitive: with smaller w, a greater fraction of the data
xi D i C zi are pure noise, and so we might seek a higher threshold t .w/ in order to screen
out that noise, knowing that there is a smaller chance of falsely screening out true signal.
Compare Figure 2.3.
Before beginning the proof, we explore the structure of the posterior corresponding to
prior (2.31). First, the marginal density for x is
Z
p.x/ D .x /.d/ D .1 w/.x/ C wg.x/;
R
where the convolution density g.x/ D ?
.x/ D .x /
./d: For later use, it is
helpful to split up g.x/ into parts gp .x/ and gn .x/ corresponding to integrals over > 0
and < 0 respectively. Note that gp .x/ and gn .x/ respectively satisfy
Z 1
2
.gp/n =/.x/ D
e x =2
./d;
(2.32)
0
31
^ (x)
t(w)
{t(w)
t(w)
Figure 2.3 Left: posterior median estimator O .x/ showing threshold zone
x 2 t .w/; t .w/. Right: Threshold t .w/ decreases as w increases.
Turning now to the form of the posterior, we note that since the prior has an atom at 0, so
must also the posterior, and hence
Z
.Ajx/ D .f0gjx/IA .0/ C
.jx/d;
(2.33)
A
with
.f0gjx/ D
.1
w/.x/
;
p.x/
.jx/ D
w
./.x
p.x/
/
(2.34)
/=p.x/;
(2.35)
Proof of Proposition 2.1. (a) Since
is assumed unimodal and symmetric about 0, its support is an interval a; a for some a 2 .0; 1. Consequently, the posterior density has
support a; a and .jx/ > 0 for 2 . a; a/ and all x. In particular, the posterior
median O .x/ is uniquely defined.
We will show that x < x 0 implies that for m 2 R ,
. > mjx/ < . > mjx 0 /;
(2.36)
from which it follows that the posterior distribution is stochastically increasing and in particular that the posterior median is increasing in x. The product form representation (2.35)
suggests an argument using ratios: if < 0 then cancellation and properties of the Gaussian
density yield
.0 jx 0 /.jx/
D expf.0 /.x 0 x/g > 1:
.jx 0 /.0 jx/
Now move the denominator to the right side and integrate with respect to the dominating
measure over 0 2 R D .m; 1/ and 2 Rc D . 1; m to get
.Rjx 0 /.Rc jx/ > .Rc jx 0 /.Rjx/;
32
Since p.x/ > wg.x/, clearly a sufficient condition for . > xjx/ < 1=2 is that
Z 1
Z x
.x /
./d
.x /
./d;
x
or equivalently that
Z
. /
.x C /d
0
.0 / .x
0 /d0
which indeed follows from the unimodality hypothesis (combined with symmetry for the
case when 0 > x).
For later use, we use (2.34) and the definition of gp to write
. > 0jx/ D
wgp .x/
:
w/.x/ C wg.x/
.1
(2.37)
If x < 0, then gp .x/ < g.x/=2 using the symmetry of
, and so . > 0jx/ < 1=2 and
hence the posterior median .x/
O
0: By antisymmetry, we conclude that O .x/ 0 for
x 0.
(c) Now we turn to existence of the threshold zone. If w < 1, we have .f0gjx/ > 0 and
by symmetry . < 0 j x D 0/ D . > 0 j x D 0/, so it must be that
. < 0 j x D 0/ <
1
2
< . 0 j x D 0/
(2.38)
so that O .0/ D 0, which is also clear by reason of symmetry. More importantly, the functions x ! . > 0jx/ and . 0jx/ are continuous (e.g. from (2.37)) and strictly
increasing (from (2.36) proved above). Consequently, (2.38) remains valid on an interval:
t.w/ x t.w/, which is the threshold zone property. Compare Figure 2.4.
(d) From Figure 2.4, the threshold t D t .w/ satsifies . > 0jt / D 1=2, and rearranging
(2.37) we get the equation
2wgp .t / D .1
w/.t / C wg.t /:
Since the right side is continuous and monotone in t , we conclude that w is a continuous
and strictly decreasing function of t, from w D 1 at t D 0 to w D 0 at t D 1.
33
{t(w)
t(w)
Figure 2.4 The threshold zone arises for the set of x for which both
. 0jx/ 1=2 and . > 0jx/ 1=2.
The tails of the prior density
have an important influence on the amount of shrinkage of
the posterior median. Consider the following univariate analog of (2.24):
.log
/./ is absolutely continuous, and j.log
/0 j a.e.
(2.39)
xj t .w/ C C 2:
(2.40)
Remark. The condition (2.39) implies, for u > 0, that log
.u/ log
.0/ u and so,
for all u, that
.u/
.0/e juj . Hence, for bounded shrinkage, the assumption requires
the tails of the prior to be exponential or heavier. Gaussian priors do not satisfy (2.40), and
indeed the shrinkage is then proportional to x for large x. Heuristically, this may be seen
by arguing that the effect of the atom at 0 is negligible for large x, so that the posterior is
essentially Gaussian, so that the posterior median equals the posterior mean, and is given,
from (2.19) by
2 y=. 2 C 1/ D y
y=. 2 C 1/:
For actual calculations, it is useful to have a more explicit expression for the posterior
median. From (2.34) and the succeeding discussion, we may rewrite
.jx/ D w.x/
.jx/:
R1
Q jx/
Let .
Q
D Q
.jx/d: If x t .w/, then the posterior median O D O .x/ is defined
by the equation
Q jx/
w.x/.
O
D 1=2:
(2.41)
Example. A prior suited to numerical calculation in software is the Laplace density
a ./ D 21 ae
ajj
which satisfies (2.39). The following formulas may be verified (Exercise 2.6 fills in some
34
a/ C .x C a/
a
1D
.x
2
g.x/
.x/ D
.x/
1:
a/w
1 .z0 /g;
(2.42)
.x/ 12 a=.x
z0 21 ;
a/;
.x/
O
x
a:
(2.43)
In particular, we see the bounded shrinkage propertyfor large x, the data is pulled down
by about a. The threshold t D t .w/ and the weight w D w.t / are related by
w.t /
D a.=/.t
a/
.t /:
(2.44)
4.5
3.5
t(w)
2.5
1.5
0.5
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Figure 2.5 Threshold t .w/ as a function of non-zero prior mass w for the Laplace
density for three values of scale parameter a: Dash-dot: a D 0:1, Solid: a D 0:5,
Dashed a D 2. Increasing sparsity (smaller w) corresponds to larger thresholds.
n h
X
k D E
Oi .y/
2
i
i2
iD1
Let us begin with the sequence model y Nn .; 2 I / and the class of linear estimators
OC .y/ D Cy
(2.45)
35
for some n n matrix C . The class of linear estimators includes smoothing splines, seen in
Chapter 1, kernel estimators (Chapter 3) and other frequently used methods.
For any estimator O with a finite variance, linear or not, the mean square error splits into
variance and (squared) bias terms, yielding the variance-bias decomposition:
EkO
O 2 C kE O
k2 D EkO E k
D var.O / C bias2 .O /:
O 2 D tr.O
E k
E O /.O
k2
(2.46)
E O /T , we have
var.O / D trCov.O /:
For linear estimators OC , clearly Cov.Cy/ D 2 C C T and so
var.OC / D 2 tr C C T D 2 tr C T C:
The bias E OC
D .C
C /k2 :
(2.47)
[Note that only second order distributional assumptions are used here, namely that Ez D 0
and Cov.z/ D I .]
The mean squared error is a quadratic function of , and the squared bias term is unbounded except in the case C D I . In this case OI .y/ D y is the maximum likelihood
estimator (MLE), it is exactly unbiased for and the MSE of the MLE is constant,
r.OI ; / n 2 :
Thus, with linear estimators we already see the fundamental issue: there is no single estimator with uniformly best mean squared error, compare Figure 2.1.
One way to exclude poor estimators is through the notion of admissibility. We say that
estimator O is inadmissible if there exists another estimator O 0 such that R.O 0 ; / R.O ; /
for all , with strict inequality occurring for some . Such an estimator O 0 is said to dominate
O . And if no such dominating O 0 exists, then the original estimator O is called admissible.
Admissibility itself is a rather weak notion of optimality, but the concept is useful because
in principle, if not always in practiceone would not want to use an inadmissible estimator.
Thus, typically, inadmissibility results are often of more interest than admissibility ones.
The most important (and surprising) fact about admissibilty is that the MLE OI is itself inadmissible exactly when n 3. Indeed, as indicated in Section 2.1, the JamesStein estimator O JS dominates the MLE everywhere: r.O JS ; / < n 2 D r.OI ; / for all
2 Rn ; n 3. A short proof is given in the next section.
We can now describe a nice result on inadmissibility for linear estimators. We saw in
Chapter 1.4 that cubic smoothing splines shrink all frequencies except for a two dimensional
subspace on which no shrinkage occurs. This turns out to be admissible, and In fact, all
reasonable, i.e. admissible, linear estimators must behave in this general manner.
Theorem 2.3 Suppose that y Nn .; 2 I /. The linear estimator OC .y/ D Cy is admissible (for squared error loss) if and only if C
(i) is symmetric,
36
D/T .I
C j2 D .I
D/ D jI
C /T .I
C /;
the two estimators have the same (squared) bias. Turning to the variance terms, write
tr D T D D tr I
2tr.I
D/ C tr.I
D/T .I
D/:
(2.48)
Comparing with the corresponding variance term for OC , we see that tr D T D < tr C T C if
and only if
tr.I
D/ D trjI
C j > tr.I
C/
k2 :
(2.50)
37
Here the infimum is taken over all estimators, linear or non-linear. We take up the systematic
study of minimaxity in Chapter 4. For now, we mention the classical fact that the MLE
OI .y/ D y is minimax:
Rn D n 2 D sup E ky
k2 :
(2.51)
2Rn
(This is proved, for example, using Corollary 4.10 and Proposition 4.16).
Mallows CL and Cp . There is a simple and useful unbiased estimate of the MSE of linear
estimators OC . To derive it, observe that the residual y OC D .I C /y; and that the mean
residual sum of squares (RSS) satisfies
Eky
OC k2 D Ek.I
C /. C z/k2 D 2 tr .I
C /T .I
C / C k.I
C / k2 : (2.52)
OC k2
n 2 C 2 2 tr C
k2 :
C./yk2
n 2 C 2 2 tr C./:
(2.53)
If C D PK represents orthogonal projection onto the subspace spanned by the coordinates in a subset K f1; : : : ; ng of cardinality nK , then trPK D nK . One might then
choose the subset to minimize
UK .y/ D ky
PK yk2 C 2 2 nK
n 2 :
(2.54)
This version of the criterion is called Mallows Cp . In applying it, one may wish to restrict
the class of subsets K, for example to initial segments f1; : : : ; kg for 1 k n.
38
(2.55)
E .X
(2.57)
Regularity conditions do need attention here: some counterexamples are given below. It
is, however, enough in (2.55) and (2.57) to assume that g is weakly differentiable: i.e. that g
is absolutely continuous on all line segments parallel to the co-ordinate axes, and its partial
derivatives (which consequently exist almost everywhere) are integrable on compact sets.
Appendix C.22 gives the conventional definition of weak differentiability and the full proof
of (2.57) and the following important consequence.
Proposition 2.4 Suppose that g W Rn ! Rn is weakly differentiable, that X Nn .; I /
and that for i D 1; : : : ; n, E jXi gi .X /j C EjDi gi .X /j < 1. Then
E kX C g.X/
(2.58)
n C 2tr C C k.I
C /xk2 :
2. Soft thresholding satisfies the weak differentiability condition. Indeed, writing O S .x/ D
x C gS .x/, we see from (2.5) that
8
< xi >
gS;i .x/ D
(2.59)
xi jxi j
:
xi <
is absolutely continuous as a function of each xi , with derivative bounded by 1.
3. By contrast, hard thresholding has O H .x/ D x C gH .x/ which is not even continuous,
gH;i .x/ D xi I fjxi j g, and so the unbiased risk formula cannot be applied.
4. Generalization to noise level and more generally to Y Nn .; V / is straightforward
(see Exercise 2.8).
The James-Stein estimate. For X Nn .; I /; the James-Stein estimator is defined by
n 2
x;
(2.60)
O JS .x/ D 1
kxk2
39
and was used by James and Stein (1961) to give a more explicit demonstration of the inadmissibility of the maximum likelihood estimator O MLE .x/ D x in dimensions n 3:
[The MLE is known to be admissible for n D 1; 2, see e.g. Lehmann and Casella (1998,
Ch. 5, Example 2.5 and Problem 4.5).] Later, Stein (1981) showed that the inadmissibility may be verified immediately from the unbiased risk formula (2.58). Indeed, if n 3,
g.x/ D .n 2/kxk 2 x is weakly differentiable, and
2xi2
1
Di gi .x/ D .n 2/
kxk2 kxk4
so that r T g.x/ D
.n
2/2 kxk
U.x/ D n
.n
2/2 kxk 2 :
r.O JS ; / D n
.n
2/2 E kX k 2 ;
Consequently
which is finite and everywhere smaller than r.O MLE ; / D E kX
n 3: We need n 3 for finiteness of EkXk 2 , see (2.64) below.
(2.61)
k2 n so long as
Remarks. 1. The James-Stein rule may be derived from a linear Bayes shrinkage estimator
by estimating the shrinkage constant from the data. This empirical Bayes interpretation,
due to Efron and Morris (1973), is given in Exercise 2.11.
2. Where does the factor n 2 come from? A partial explanation: the estimator .x/
O
D
.1 =kxk2 /x has unbiased risk estimate U .x/ D n f2.n 2/ 2 g=kxk2 ; and this
quantity is minimized, for each x, by the choice D n 2: Note that D 2.n 2/ gives
the same risk as the MLE.
3. The positive part James-Stein estimator
n 2
JSC
x
(2.62)
O
.x/ D 1
kxk2 C
has necessarily even better MSE than O JS (Exercise 2.12), and hence better than O MLE .
The unbiased risk estimate leads to an informative bound on the mean squared error of
the James-Stein rule.
Proposition 2.5 If X Nn .; I /, then the James-Stein rule satisfies
E kO JS
k2 2 C
.n 2/kk2
:
.n 2/ C kk2
(2.63)
Proof For general ; the sum of squares kX k2 follows a non-central chisquared distribution with non-centrality parameter kk2 : The non-central distribution may be realized as a
mixture of central chi-squared distributions 2nC2N ; where N is a Poisson variate with mean
kk2 =2: (cf. e.g. Johnson and Kotz (1970, p. 132)). Recall also the formula
E 1=2n D 1=.n
2/:
(2.64)
Hence, by conditioning first on N , and then using (2.64) and Jensens inequality,
E1=2nC2N D E1=.n
2 C 2N / 1=.n
2 C kk2 /:
40
.n 2/2
;
n 2 C kk2
2/
n=80
80
60
Bound
JS
JS+
MSE
MSE
n=8
8
0
0
Bound
JS
JS+
40
20
0
0
10
10
||||
20
||||
30
40
Figure 2.6 Exact risk functions of James-Stein rule O JS (dashed) and positive part
James-Stein O JSC (solid) compared with upper bound from right side of (2.63). In
the right panel (n D 80) the three curves are nearly indistinguishable.
Figure 2.6 illustrates several important aspects of the risk of the James-Stein estimator.
First, the improvement offered by James-Stein relative to the MLE can be very large. For
D 0, we see from (2.61) and (2.64) that r.O JS ; 0/ D 2 while r.O MLE ; / n.
Second, the region of significant savings can be quite large as well. For kk2 n, the
upper bound (2.63) is less than .1 C n/=.1 C / so that, for example, if kk2 4n, then
the savings is (roughly) at least 20 %. See also Exercise 2.14.
Third, the additional improvement offered by the positive part estimator can be significant
when both kk and n are small, but otherwise the simple upper bound (2.63) gives a picture
of the risk behavior that is accurate enough for most purposes.
Remarks. Exercise 2.18 provides details on the exact risk formulas for O JSC used in
Figure 2.6. It is known, e.g. Lehmann and Casella (1998, Example 5.7.3), that the positive
part James-Stein rule cannot be admissible. While dominating estimators have been found,
(Shao and Strawderman, 1994), the actual amount of improvement over O JSC seems not to
be of practical importance.
Direct use of Jensens inequality in (2.61) yields a simpler bound inferior to (2.63), Exercise 2.13.
Corollary 2.6 Let O c .x/ D cx be a linear shrinkage estimate. Then
r.O JS ; / 2 C inf r.O c ; /:
c
(2.65)
Proof
41
EkcX
c/2 kk2 :
(2.66)
In an idealized situation in which kk is known, the ideal shrinkage factor c D c IS ./
would be chosen to minimize this MSE, so that
c IS ./ D
kk2
;
n C kk2
(2.67)
and
inf r.O c ; / D
c
.n 2/kk2
nkk2
;
n C kk2
n 2 C kk2
(2.68)
(2.69)
the risk of a bona fide estimator O JS is bounded by the risk of the ideal estimator O IS .x/ D
c IS ./x, (unrealizable in practice, of course) plus an additive constant. If one imagines the
ideal shrinkage factor c IS ./ as being provided by an oracle with supernatural knowledge,
then (2.68) says that the James-Stein estimator can almost mimic the oracle.
In high dimensions, the constant 2 is small in comparison with the risk of the MLE, which
is everywhere equal to n: On the other hand the bound (2.69) is sharp: at D 0; the unbiased
risk equality (2.61) shows that r.O JS ; 0/ D 2, while the ideal risk is zero.
The James-Stein estimator O JS can be interpreted as an adaptive linear estimator, that is,
an estimator that while itself not linear, is derived from a linear estimator by estimation of a
tuning parameter, in this case the shrinkage constant. The ideal shrinkage constant c IS ./ D
1 n=.n C kk2 / and we can seek to estimate this using X. Indeed, EkX k2 D n C kk2
and so EkX k 2 1=.n C kk2 /, with approximate equality for large n. Consider therefore
estimates of the form c.x/
O
D 1 =kxk2 and note that we may determine by observing
that for D 0, we have E cO D 1 =.n 2/ D 0. Hence D n 2, and in this way, we
recover precisely the James-Stein estimator.
For use in the next section, we record a version of (2.69) for arbitrary noise level. Define
.n 2/ 2
JSC
O
y
(2.70)
.y/ D 1
kyk2
C
Corollary 2.7 Let Y Nn .; 2 I /. The James-Stein estimate O JS C .y/ in (2.70) satisfies
EkO JS C
k2 2 2 C
n 2 kk2
:
n 2 C k k2
42
1 : Initially we adopt the unit noise setting, X Nn .; I / and evaluate Steins unbiased
risk estimate for O .x/ D x C gS .x/, where the form of gS .x/ for soft thresholding was
given in (2.59). We have .@gS;i =@xi /.x/ D I fjxi j g a.e. and so
k2 D E U .x/
E kO .x/
U .x/ D n
n
X
I fjxi j g C
n
X
min.xi2 ; 2 /:
(2.71)
Since U .x/ depends only on and the observed x, it is natural to consider minimizing
U .x/ over to get a threshold estimate O S URE .
2 : Consider the one dimensional case with X N.; 1/. Let the (scalar) risk function rS .; / D E O .x/ 2 . By inserting the definition of soft thresholding and then
changing variables to z D x , we obtain
Z
Z 1
Z
.z C /2 .z/dz:
rS .; / D 2
.z/dz C
.z /2 .z/dz C
Several useful properties follow from this formula. First, after some cancellation, one
finds that
@
rS .; / D 2. / . / 2;
(2.72)
@
which shows in particular that the risk function is monotone increasing for 0 (and of
course is symmetric about D 0).
The risk at D 0 has a simple form
Z 1
Q
rS .; 0/ D 2
.z /2 .z/dz D 2.2 C 1/./
2./
Q
and, using the bound for Mills ratio ./
1 ./ valid for > 0, (C.14),
2
rS .; 0/ 2 1 ./ e =2 ;
:
with the final inequality true for > 2.0/ D 0:8.
Hence the risk increases from a typically small value at D 0, to its value at D 1,
rS .; 1/ D 1 C 2 ;
(which follows, for example, by inspection of (2.71)). See Figure 2.7.
3 : Some useful risk bounds are now easy consequences. Indeed, from (2.72) we have
rS .; / rS .; 0/ 2 : Using also the bound at 1, we get
rS .; / r.; 0/ C min.1 C 2 ; 2 /:
p
Making a particular choice of threshold, U D 2 log n, and noting that rS .U ; 0/
2
e U =2 D 1=n; we arrive at
rS .U ; / .1=n/ C .2 log n C 1/ min.2 ; 1/:
Returning to noise level , and a vector observation Y Nn .; 2 I /, and adding over the
n coordinates, we can summarize our conclusions.
43
1+2
(smaller)
rS(;)
1
(smaller)
Figure 2.7 Qualitative behavior of risk function for soft thresholding. Arrows show
how the risk function changes as the threshold is decreased.
p
Lemma 2.8 Let Y Nn .; 2 I / and O denote soft thresholding with D 2 log n.
Then for all ,
EkO
k C
n
X
min.i2 ; .1 C 2 / 2 /
(2.73)
i D1
2
C .2 log n C 1/
n
X
min.i2 ; 2 /:
i D1
;
1
min.
i2 ; n 2 /:
2
n 2 C k k2
For thresholding, looking at the main term in Lemma 2.8, we see that thresholding dominates
(in terms of mean squared error) if
X
X
.2 log n/
min.i2 ; 2 / min
i2 ; n 2 :
i
For example, with D 1= n, and if is highly sparse, as for example in the case of a spike
such as D .1; 0; : : : ; 0/, then the left side equals .2 log n/=n which is much smaller than
the right side, namely 1.
Conversely, James-Stein dominates if all ji j are nearly equalrecall, for example, the
comb D .1; : : : ; 1/, where now the left side equals .2 log n/ n 2 which is now larger
than the right side, namely n 2 D 1.
While thresholding has a smaller risk by a factor proportional to log n=n in our example,
44
note that
be more
a multiplicative
factor 2 log n worse than James-Stein,
P it can2never
Pthan
2
2
2
since min.i ; / min
i ; n (Exercise 2.19).
f .y/j Lkx
yk
for all x; y 2 Rn . Here kxk is the usual Euclidean norm on Rn . If f is differentiable, then
we can take L D sup krf .x/k.
Proposition 2.9 If Z Nn .0; I /, and f W Rn ! R is Lipschitz.L/, then
P ff .Z/ Ef .Z/ C tg e
P ff .Z/ Medf .Z/ C tg
t 2 =.2L2 /
2
2
1
e t =.2L / :
2
(2.74)
(2.75)
This property is sometimes expressed by saying that the tails of the distribution of a
Lipschitz function of a Gaussian vector are subgaussian.
Note that the dimension n plays a very weak role in the inequality, which is sometimes
said to be infinite-dimensional. The phrase concentration of measure refers at least in
part to the fact that the distribution of a Lipschitz(1) function of n variables is concentrated
about its mean, in the sense that the tails are no heavier than those of a univariate standard
Gaussian, regardless of the value of n!
Some statistically relevant examples of Lipschitz functions include
(i) Order statistics. If z.1/ z.2/ z.n/ are the order statistics of a data vector z,
then f .z/ D z.k/ has Lipschitz constant L D 1. The same is true for the absolute values
jzj.1/ jzj.n/ . Section 8.9 has results on the maxima of Gaussian noise variates.
(ii) Ordered eigenvalues of symmetric matrices. Let A be an n n symmetric matrix with
eigenvalues %1 .A/ %2 .A/ %n .A/. If E is also symmetric, then from Weyls
inequality (e.g. (Golub and Van Loan, 1996, p. 56 and 396))
j%k .A C E/
%k .A/j kEkHS ;
2
where kEk2HS D i;j ei;j
denotes the square of the Hilbert-Schmidt, or Frobenius norm,
which is the Euclidean norm on n n matrices. This is of statistical relevance, for example,
if A is a sample covariance matrix, in which case %1 .A/ is the largest principal component
variance.
(iii) Orthogonal projections. If S is a linear subspace of Rn , then f .z/ D kPS zk has
D
Lipschitz constant 1. If dim S D k, then kPS zk2 D 2.k/ and so
p
EkPS zk fEkPS zk2 g1=2 D k
p
kCt ge
t 2 =2
(2.76)
45
These bounds play a key role in the oracle inequalities of Chapter 11.3.
P
(iv) Linear combinations of 2 variates. Suppose that i 0. Then f .z/ D . i zi2 /1=2
is differentiable and Lipschitz: krf .z/k2 kk1 . Then a fairly direct consequence of
(2.74) is the tail bound
X
Pf
j .Zj2 1/ > tg expf t 2 =.32kk1 kk1 /g
(2.77)
for 0 P
< t kk1 (Exercise 2.23). This is used for Pinskers theorem in Chapter 5.4. The
form j .Zj2 1/ also arises as the limiting distribution of degenerate U -statistics of order
1, e.g. Serfling (1980, Sec. 5.5.2).
P
(v) Exponential sums. The function f .z/ D log n1 exp.zk / is Lipschitz./. It appears,
for example, in the study of Gaussian likelihood ratios of sparse signals, Section ??.
The two concentration inequalities of Proposition 2.9 have a number of proofs. We give
an analytic argument for the first that builds on Steins integration by parts identity (2.55).
For the second, we shall only indicate how the result is reduced to the isoperimetric property
of Gaussian measuresee e.g. Ledoux (2001) for a more complete discussion.
We begin with a lemma that bounds covariances in terms of derivatives. Let
D
n
denote the canonical Gaussian measure on Rn corresponding to Z Nn .0; 1/.
Lemma 2.10 Assume that Y; Z Nn .0; I / independently and set Y D Y cos C Z sin
for 0 =2. Suppose that f and g are differentiable real valued functions on Rn with
rf and rg 2 L2 .
/. Then
Z =2
Covff .Y /; g.Y /g D
Erf .Y /T rg.Y / sin d:
(2.78)
0
(2.79)
Proof We may assume that Eg.Y / D 0, since replacing g.y/ by g.y/ Eg.Y / changes
neither side of the equation. Now, since Y and Z are independent, the covariance may be
written Ef .Y /g.Y / g.Z/. We exploit the path Y from Y0 D Y to Y=2 D Z, writing
Z =2
g.Y / g.Z/ D
.d=d /g.Y /d:
0
The vectors Y and Z are independent and Nn .0; I /, being a rotation through angle of
the original Y and Z, Lemma C.11. Inverting this rotation, we can write Y D Y cos
Z sin . Considering for now the ith term in the inner product in (2.80), we therefore have
Ef .Y /Z;i Di g.Y / D Ef .Y cos
D
46
where the second equality uses Steins identity (2.55) applied to the .n C i /th component
of the 2n-dimensional spherical Gaussian vector .Y ; Z /. Adding over the n co-ordinates i
and inserting into (2.80), we recover the claimed covariance formula.
Proof of Concentration inequality (2.74). This uses an exponential moment method. By
rescaling and centering, we may assume that L D 1 and that Ef .Y / D 0. We will first show
that for all t > 0,
Ee tf .Y / e t
=2
(2.81)
Make the temporary additional assumption that f is differentiable. The Lipschitz bound
on f entails that krf k 1. We are going to apply the identity of Lemma 2.10 with the
functions f and g D e tf . First, observe that
Erf .Y /T rg.Y / D tEe tf .Y / rf .Y /T rf .Y / tEe tf .Y / :
Introduce the notation e u.t / D Ee tf .Y / , differentiate with respect to t and then use (2.78)
along with the previous inequality:
Z =2
0
u.t /
tf .Y /
u .t/e
D Ef .Y /e
t e u.t / sin d D t e u.t / :
0
Hence u .t/ t for t > 0 and u.0/ D 0, from which we get u.t / t 2 =2 and so (2.81). The
assumption that f is differentiable can be removed by smoothing: note that the sequence
fn D f ? 1=n is Lipschitz.1/ and converges to f uniformly (Exercise 2.24), so that (2.81)
follows by Fatous lemma.
Now we conclude by using Markovs inequality and (2.81). For each t > 0,
0
P .f .X / u/ D P .e tf .X / e t u /
e
tu
Ee tf .X / e
t uCt 2 =2
t 2 =2
47
Some models that reduce to sequence form. A fairly general Gaussian linear model
for estimation of means in correlated noise might be described in vector notation as Y D
A C Z, or equivalently Y N.A; 2 /. Some frequently occurring subclasses of this
model can be reduced to one of the three sequence forms (2.1) - (2.3).
First, when Y Nn .; 2 I /, one can take co-ordinates in any orthonormal basis fui g for
Rn , yielding
yi D hY; ui i;
i D h; ui i;
zi D hZ; ui i:
(2.82)
An essentially equivalent situation arises when Y Nn .A; 2 I /, and the matrix A itself
has orthogonal columns: ATA D mIn . The columns of A might be orthogonal polynomials
or other systems of functions, or orthogonal contrasts in the design of experiments, and so
on. Specific examples include weighing designs, Hadamard and Fourier transforms (as in
magnetic resonance imaging). The model can be put in the form (2.1) simply by premultiplying by m 1 AT : define y D m 1 AT Y; z D m 1=2 AT Z; and note especially the noise
p
calibration D = m:
While this formulation appears parametric, formally it also covers the setting of nonparametric regression on a fixed equi-spaced design. Thus, the model
Yl D f .l=n/ C Zl ;
l D 1; : : : ; n
(2.83)
i id
with Zl N.0; 1/ becomes an example of (2.1) if one uses as design matrix an inverse disp
crete orthogonal wavelet (or Fourier) transform W T to express f D .f .l=n// D nW T :
p
Thus here A D nW T and z D W Z. The components of y and are wavelet (or Fourier)
coefficients of Y and f respectively. Compare the discussion around (1.20) and (7.22).
If we drop the requirement (2.83) that the errors be normally distributed, keeping only the
first and second moment requirements that Z have mean 0 and covariance I , then the same
will bePtrue of the transformed errors z. If the matrix W is in some sense dense, so that
zi D k wi k Zk has many non-zero terms of similar size, then by a central limit theorem
for independent summands such as Lyapunovs or Lindebergs, the zi will be approximately
normally distributed.
Second, assume that Y N.A; 2 I /, with A an N M matrix.
PnThis can Tbe converted
into model (2.2) using the singular value decomposition A D
i D1 i ui vi ; where we
assume that i > 0 for i D 1; : : : ; n D rank.A/. We obtain
X
i i ui ;
i D hvi ; i;
(2.84)
A D
i
48
Thirdly, assume that Y N.; 2 /, with positive definite covariance matrix , with
eigenvalues and eigenvectors
ui D %2i ui ;
%i > 0;
so that with definitions (2.82), we recover the third sequence model (2.3), after noting that
Cov.yi ; yj / D 2 ui uj D 2 %2i ij . Here ij D 1 if i D j and 0 otherwise is the usual
Kronecker delta. This version arises as the limiting Gaussian model in the large sample
local asymptotic normal approximation to a smooth parametric model of fixed dimension,
Exercise 2.26.
Infinite sequence model analogs of the last two models are discussed in Sections 3.9 and
3.10 respectively.
In the most general setting Y N.A; 2 /, however, a simple sequence version will
typically only be possible if ATA and have the same sets of eigenvectors (including multiplicities). This does occur, for example, if ATA and are circulant matrices 2 , and so are
diagonalized by the discrete Fourier transform, (e.g. Gray (2006, Ch. 3)), or more generally
if AT A and commute.
Penalization and regularization. The least squares estimate of is found by minimizing
! kY Ak22 . When Y N.A; 02 I /, this is also the maximum likelihood estimate. If
is high dimensional, or if A has a smoothing character with many small singular values i ,
then the least squares solution for is often ill-determined. See below for a simple example,
and Section 3.9 for more in the setting of linear inverse problems.
A commonly used remedy is to regularize the solution by introducing a penalty function
P ./, and minimizing instead the penalized least squares criterion
Q./ D kY
Ak22 C P ./:
(2.85)
A matrix C is circulant if each row is obtained by cyclically shifting the previous row to the right by one; it is
thus determined by its first row.
49
O
as forming a path from the roughest, least squares solution .0/
D OLS to the smoothest
O
solution .1/ which necessarily belongs to ker P .
We consider three especially important examples. First, the quadratic penalty P ./ D
T is nice because it allows explicit solutions. The penalized criterion is itself quadratic:
Q./ D T .ATA C /
2Y T A C Y T Y:
Let us assume, for convenience, that at least one of ATA and is positive definite. In that
case, @2 Q=@ 2 D 2.ATA C / is positive definite and so there is a unique minimizer
O
./
D .ATA C / 1 AT Y:
(2.86)
This is the classical ridge regression or Tikhonov regularization estimate (see chapter notes
for some references), with ridge matrix . For each , the estimate is a linear function
S./Y of the data, with smoother matrix S./ D .AT A C / 1 AT : The trajectory !
O
O
O
./
shrinks from the least squares solution .0/
D .ATA/ 1 AT Y down to .1/
D 0:
Second, consider `1 penalties, which are used to promote sparsity in the solution. If the
penalty isPimposed after transformation to a sequence form such as (2.2) or (2.3), so that
P ./ D ji j, then the co-ordinatewise thresholding interpretation
of Section 2.1 is availP
able. When imposed in the original variables, so that P ./ D n1 ji j, the resulting estimator is known as the lasso for least absolute selection and shrinkage operator, introduced
by Tibshirani (1996), see also Chen et al. (1998). There is no explicit solution, but the optimization problem is convex and many algorithms and a huge literature exist. See for example
Buhlmann and van de Geer (2011) and Hastie et al. (2012).
Third, the `0 penalty P ./ D kk0 D #fi W i 0g also promotes sparsity by penalizing
the number of non-zero coefficients in the solution. As this penalty function is not convex,
the solution is in general difficult to compute. However, in sufficiently sparse settings, the
`0 and `1 solutions can coincide, and in certain practical settings, successful heuristics exist.
(e.g. Donoho and Huo (2001), Cand`es and Romberg (2007), Buhlmann and van de Geer
(2011)).
Example. Convolution furnishes a simple example of ill-posed inversion and the advantages of regularization. Suppose that A D .ak j ; 1 j; k n/ so that A D a ?
represents convolution with the sequence .ak /. Figure 2.8 shows a simple example in which
a0 D 1; a1 D 1=2 and all other ak D 0. Although A is formally invertible, it is nearly
:
singular, since for osc D .C1; 1; C1; : : : ; 1/, we have Aosc D 0, indeed the entries are
exactly zero except at the boundaries. The instability of A 1 can be seen in the figure: the
left panel shows both y D A and y 0 D A C Z for a given signal and a small added
noise with D :005 and Z being a draw from Nn .0; I /. Although the observations y and
y 0 are nearly identical, the least squares estimator OLS D .ATA/ 1 AT y D A 1 y is very
0
different from OLS
D A 1 y 0 . Indeed A is poorly conditioned, its smallest singular value is
:
:
n D 0:01, while the largest singular value 1 D 2.
Regularization with the squared second difference penalty P2 removes the difficulty: with
O
D 0:01, the reconstruction ./
from (2.86) is visually indistinguishable from the true .
This may be understood in the sequence domain. If the banded matrices A and are
lightly modified in their bottom left and top right corners to be circulant matrices, then
both are diagonalized by the (orthogonal) discrete Fourier transform, and in the Fourier
50
0.5
0.7
0.4
0.6
0.3
0.5
0.2
0.4
0.1
0.3
0.2
0.1
0.1
0.2
6
Figure 2.8 Left: Observed data y D A, solid line, and y 0 D A C Z, dashed
line, for l D .tl /, the standard normal density with tl D .l=n/ 6 and
n D 13; D 0:005 and Z a draw from Nn .0; I /. Right: reconstructions
O
OLS D A 1 y, dashed line, and regularized ./,
solid line, from (2.86) with
D 0:01 D 2 D 2 .
2.10 Notes
Some of the material in this chapter is classical and can be found in sources such as Lehmann and Casella
(1998).
2. The connection between regularization with the `1 penalty and soft thresholding was exploited in
Donoho et al. (1992), but is likely much older.
The soft thresholding estimator is also called a limited translation rule by Efron and Morris (1971,
1972).
3. Measure theoretic details may be found, for example, in Lehmann and Romano (2005). Measurability
of estimators defined by extrema, such as the Bayes estimator (2.9), requires care: see for example Brown
and Purves (1973).
The characterizations of the Gaussian distribution noted in Remark 2 are only two of many, see for
example DasGupta (2011a, p157) and the classic treatise on characterizations Kagan et al. (1973).
Identity (2.23) is sometimes called Tweedies formula (by Efron (2011) citing Robbins (1956)), and
more usually Browns formula, for the extensive use made of it in Brown (1971). The posterior variance
formula of Exercise 2.3 appears in Srinivasan (1973).
4. Priors built up from sparse mixture priors such as (2.31) are quite common in Bayesian variable
selection problems; see for example George and McCulloch (1997). The connection with posterior median
thresholding and most of the results of this section come by specialization from Johnstone and Silverman
(2004a), which also has some remarks on the properties of the posterior mean for these priors. Full details
51
Exercises
of the calculations for the Laplace example and a related quasi-Cauchy prior may be found in Johnstone
and Silverman (2005a, 6).
5. Basic material on admissibility is covered in Lehmann and Casella (1998, Ch. 5). Inadmissibility
of the MLE was established in the breakthrough paper of Stein (1956). The James-Stein estimator and
positive part version were introduced in James and Stein (1961), for more discussion of the background
and significance of this paper see Efron (1993). Inadmissibility of the best invariant estimator of location
in n 3 dimensions is a very general phenomenon, see e.g. Brown (1966). Theorem 2.3 on eigenvalues of
linear estimators and admissibility is due to Cohen (1966). Mallows CL and its relative Cp are discussed
in Mallows (1973).
6. Stein (1981) presented the unbiased estimate of risk, Proposition 2.4 and, among much else, used it
to give the quick proof of dominance of the MLE by the James Stein estimator presented here. The Stein
identity characterizes the family of normal distributions: for example, if n D 1 and (2.57) holds for C 1
functions of compact support, then necessarily X N.; 1/, (Diaconis and Zabell, 1991).
Many other estimators dominating the MLE have been foundone classic paper is that of Strawderman
(1971). There is a large literature on extensions of the James-Stein inadmissibility result to spherically
symmetric distributions and beyond, one example is Evans-Stark (1996).
The upper bound for the risk of the James-Stein estimator, Proposition 2.5 and Corollary 2.6 are based
on Donoho and Johnstone (1995).
We have also not discussed confidence sets for example, Brown (1966) and Joshi (1967) show inadmissibility of the usual confidence set and Hwang and Casella (1982) show good properties for recentering
the usual set at the positive part James-Stein estimate.
7. The unbiased risk estimate for soft thresholding was exploited in Donoho and Johnstone (1995),
while Lemma 2.8 is from Donoho and Johnstone (1994a).
8. The median version of the Gaussian concentration inequality (2.75) is due independently to Borell
(1975) and Sudakov and Cirel0 son (1974). The expectation version (2.74) is due to Cirelson et al. (1976).
Systematic accounts of the (not merely Gaussian) theory of concentration of measure are given by Ledoux
(1996, 2001).
Our approach to the analytic proof of the concentration inequality is borrowed from Adler and Taylor
(2007, Ch. 2.1), who in turn credit Chaumont and Yor (2003, Ch. 3.10), which has further references. The
proof of Lemma 2.10 given here is lightly modified from Chatterjee (2009, Lemma 5.3) where it is used to
prove central limit theorems by Steins method. Tao (2011) gives a related but simpler proof of a weaker
version of (2.74) with 12 replaced by a smaller value C . An elegant approach via the semi-group of the
Ornstein-Uhlenbeck process is described in Ledoux (1996, Ch. 2), this also involves an integration by parts
formula.
Sharper bounds than (2.76) for the tail of 2 random variables are available (Laurent and Massart (1998),
Johnstone (2001), Birge and Massart (2001), [CHECK!]) . The constant 32 in bound (2.77) can also be
improved to 8 by working directly with the chi-squared distribution.
The Hermite polynomial proof of the univariate Gaussian-Poincare inequality (2.79) was written down
by Chernoff (1981); for some historical remarks on and extensions of the inequality, see Beckner (1989).
9. For more on Hadamard matrices and weighing designs, see for example Hedayat and Wallis (1978)
and its references and citing articles.
Traditional references for ridge regression and Tikhonov regularization are, respectively, Hoerl and Kennard (1970) and Tikhonov and Arsenin (1977). A more recent text on inverse problems is Vogel (2002).
Exercises
2.1
(Gaussian priors.) Suppose that Nn .0 ; T / and that yj Nn .; I /: Let p.; y/ denote
the joint density of .; y/. Show that
2 log p.; y/ D T B
2.2
2 T C r.y/:
Identify B and
, and conclude that jy N.y ; y / and evaluate y and y .
(Posterior variance formula.) Suppose that x N.; 1/ and that is drawn from a prior
.d/ with marginal density p.x/ D . ? /.x/. We saw that the posterior mean O .x/ D
52
2.3
2.4
d
O .x/ D 1 C .log p/00 .x/:
dx
(Some characterizations of Gaussian priors.) Again suppose that x N.; 1/ and that is
drawn from a proper prior .d/ with marginal density p.x/ D . ? /.x/.
(a) Suppose that p is log-quadratic, specifically that .log p/.x/ D x 2 C x C
with 0.
Show that necessarily > 0 and that .d/ is Gaussian.
(b) Instead, suppose that the posterior mean E.jx/ D cx C b for all x and some c > 0. Show
that .d/ is Gaussian.
(c) Now, suppose only that the posterior variance Var.jx/ D c > 0 for all x. Show that .d/
is Gaussian.
(Minimum L1 property of median.) Let F be an arbitrary probability distribution function on
R. A median of F is any point a0 for which
F . 1; a0
1
2
F a0 ; 1/ 12 :
and
2.5
ja
jdF . /
.g=/.x/
1 w
.g=/.x/
.g=/.t / D
:
.g=/.x 2/
w
.g=/.x 2/
2.6
.t
x 2
/dt exp.2t C 2/ 2;
2.7
Use these expressions to verify the posterior median formula (2.42) and the threshold relation (2.44).
(Properties of jAj.) Let A be a square matrix and jAj D .AT A/1=2 .
(i) Show how the polar decomposition A D U jAj, for suitable orthogonal U , can be constructed
from the SVD (2.84) of A.
53
Exercises
(ii) Let .i ; ei / be eigenvalues and eigenvectors of jAj. Show that tr A tr jAj.
(iii) If equality holds in (ii), show that Aei D jAjei for each i , and so that A must be symmetric.
2.8
(Unbiased risk estimator for correlated data.) (i) Suppose that Y Nd .; V /. For a linear
estimator OC .y/ D Cy, show that
r.OC ; / D tr C V C T C k.I
C /k2 :
(ii) If, in addition, g W Rn ! Rn is smooth and satisfies EfjYi gi .Y /j C jDi gj .Y /jg < 1 for
all i; j , show that
E kY C g.Y /
2.9
(2.87)
(iii) Suppose that all variances are equal, Vi i D 2 . Show that the unbiased risk estimate for
soft thresholding in this correlated case is still given by (2.71), after inserting a rescaling to
noise level .
P 2
(A large class of estimators dominating the MLE.) Suppose that X Np .; I /, S D
Xi ,
and consider estimators of the form
.S /.p 2/
X:
O
.X / D 1
S
2.11 (An empirical Bayes derivation of James-Stein.) Suppose that yi ji N.i ; 02 / for i D
1; : : : ; n with 02 known. Let i be drawn independently from a N.; 2 / prior.
(i) Show that the (posterior mean) Bayes estimator is
Oi D C .yi /;
D 2 =.02 C 2 /:
P
P
(ii) Let yN D n 1 n1 yi and S 2 D n1 .yi y/
N 2 : Show that yN and .n 3/02 =S 2 are unbiased
for and 1 in the marginal distribution of yi given hyperparameters and 2 .
(iii) This yields the unknown-mean form of the James-Stein estimator:
.n 3/02
JS
O
.yi y/:
N
i D yN C 1
S2
Explain how to modify the argument to recover the known-mean form (2.60).
2.12 (Positive part helps.) Show that the positive part James-Stein estimator (2.62) has MSE smaller
than the original rule (2.60): EkO JSC k2 < EkO JS k2 for all 2 Rn :
2.13 (A simpler, but weaker, version of (2.63).) Use Jensens inequality in (2.61) to show that
r.O JS ; / 4 C nkk2 =.n C kk2 /:
2.14 (Probable risk improvements under a Gaussian prior.)
Suppose that Nn .0; 2 I /. Show that if kk2 E.kk2 / C kSD.kk2 /, then
r.O JS ; / 2 C
p
2
.n C k 2n/:
2
1C
54
1
2 fd .w/
(2.88)
fd C2 .w/;
(2.89)
d / 1=2:
(2.90)
P .2d d / 12 .
1 1
d=2
(2.91)
1 1 d=2
P .2d d / D
e d=2
.d=2/
.u d=2/ d=2 1
du:
d=2
For the two bounds use Stirlings formula and (2.90) respectively.]
2.16 (Poisson mixture representation for noncentral 2 .) Let X Nd .; I / and define the noncentrality parameter D kk2 . The noncentral 2d ./ distribution refers to the law of Wd D kXk2 .
This exercise offers two verifications of the representation of its density f;d as a Poisson.=2/
mixture of central 2d C2j distributions:
f;d .w/ D
1
X
(2.92)
j D0
where p .j / D e
j =j
Ee s.ZC
p 2
/
D .1
2s/
(iv) Consider the difference operator fd D fd C2 fd , and define the operator exponential,
P
as usual, via e D j 0 ./j =j . Show that (2.92) can be rewritten as
f;d .w/ D e =2 fd .w/:
(2.93)
2.17 (Noncentral 2 facts.) Let S 2 2d ./ be a noncentral 2 variate, having density f;d .w/ and
distribution function F;d .w/ Use (2.93) to show that
@
f;d .w/ D
@
@
f;d C2 .w/ D 12 f;d C2 .w/
@w
f;d .w/;
(2.94)
2f;d C2 .w/:
(2.95)
(2.96)
55
Exercises
2.18 (Exact MSE for the positive part James-Stein estimator.)
(i) Show that the unbiased risk estimator for O JSC is
(
n .n 2/2 kxk 2 ; kxk > n
U.x/ D
kxk2 n;
kxk < n
(ii) Let F .t I k/ D P .2k t / and FQ .tI k/ D 1
2
2:
2/
F .t I k
2/:
(iii) If X Nn .; I /, then let K Poisson.kk2 =2/ and D D n C 2K. Show that
r.O JS ; / D n
r.O JSC ; / D n
E .n 2/2 =.D 2/
n .n 2/2
FQ .n 2I D
E
D 2
C 2nF .n
2I D/
2/
DF .n
o
2I D C 2/ :
[which can be evaluated using routines for F .tI k/ available in many software packages.]
2.19 (Comparison of shrinkage and thresholding.) As in Section 2.7, use ideal risk n 2 kk2 =.n 2 C
kk2 / as a proxy for the risk of the James-Stein estimator. Show that
n
X
min.i2 ; 2 / 2
iD1
n 2 kk2
;
n 2 C k k2
and identify sequences .i / for which equality occurs. Thus, verify the claim in the last paragraph of Section 2.7.
2.20 (Shrinkage and thresholding for an approximate form of sparsity.) Suppose that D n 1=2
and p < 2. Compare p
the the large n behavior of the MSE of James-Stein estimation and soft
thresholding at D 2 log n on the weak-`p -extremal sequences
k D k
1=p
k D 1; : : : ; n:
Q /D
2.21 (Simple Gaussian tail bounds.) (a) Let .t
R1
t
Q / .t /=t:
.t
(b) By differentiating e t
=2 .t
Q /,
(2.97)
t 2 =2
(2.98)
2.22 (Median and mean for maxima.) If Z Nn .0; I / and Mn equals either maxi Zi or maxi jZi j,
then use (2.74) to show that
p
jEMn MedMn j 2 log 2:
(2.99)
(Massart (2007)).
2.23 (Chi-squared tail bound.) Use the inequality .1 C x/1=2 1 C x=4 for 0 x 1 to verify
(2.77).
56
2.24 (Easy approximate delta function results.) (a) Let denote the Nn .0; 2 I / density in Rn .
Suppose that f is Lipschitz.L/ and define f D f ? . By writing the convolution in two ways,
show that f is still Lipschitz.L/, but also differentiable, even C 1 , and that f .x/ ! f .x/
uniformly on Rn as ! 0.
R
(b) Now suppose that
0 is a C 1 function with
D 1 and support contained in the
unit ball fx 2 Rn W kxk 1g. [Such functions exist and are known as mollifiers.] Let D
n .x=/. Suppose that f is continuous on Rn and define f D f ? . Use the same
arguments as in (a) to show that f is differentiable, even C 1 ), and that f ! f uniformly on
compact sets in Rn .
[Part (a) is used in the proof of the concentration inequality, Proposition (2.74), while part (b)
is a key component of the proof of the approximation criterion for weak differentiability used
in the proof of Steins unbiased risk estimate, Proposition 2.4 see C.22]
2.25 (Regularization and Bayes rule.) Suppose that Y Nn .A; 2 I /. Show that the minimizer
O of the penalized least squares criterion (2.85) can be interpreted as the posterior mode of a
suitable prior .d/ and identify .
ind
2.26 (Local asymptotic normality and the Gaussian model.) Suppose X1 ; : : : ; Xn f .x/.dx/
for 2 Rp . Let the loglikelihood for a single observation be ` D log f .x/: Write @i for
@=@i and set `P D .@i ` / and `R D .@ij ` /. Under common regularity conditions, the Fisher
information matrix I D E `P `PT D E `R . The following calculations make it plausible that
the models Pn D .Pn Ch=pn ; h 2 Rp / and Q D .N.h; I01 /; h 2 Rp / have similar statistical
0
properties for n large. Full details may be found, for example, in van der Vaart (1998, Ch. 7).
(i) Show that
Y
log .f0 Ch=pn =f0 /.Xl / D hT n;0 12 hT In;0 h C op .1/
(2.100)
and that under Pn0 ,
n;0 D n
1=2
In;0 D
(ii) Let gh .y/ denote the density of N.h; I01 / and show that
log.gh =g0 /.y/ D hT I0 y
1 T
2 h I0 h:
(2.101)
3
The infinite Gaussian sequence model
It was agreed, that my endeavors should be directed to persons and characters supernatural, or at least romantic, yet so as to transfer from our inward nature a human interest and
a semblance of truth sufficient to procure for these shadows of imagination that willing
suspension of disbelief for the moment, which constitutes poetic faith. (Samuel Taylor
Coleridge, Biographia Literaria, 1817)
For the first few sections, we focus on the infinite white Gaussian sequence model
yi D i C zi
i 2 N:
(3.1)
For some purposes and calculations this is an easy extension of the finite model of Chapter
2, while in other respects important new issues emerge. For example, the unbiased estimaO
tor .y/
D y has infinite mean squared error, and bounded parameter sets are no longer
necessarily compact, with important consequences that we will see.
Right away, it must be remarked that we are apparently attempting to estimate an infinite
number of parameters on the basis of what must necessarily be a finite amount of data. This
calls for a certain suspension of disbelief which the theory attempts to reward.
Essential to the effort is some assumption that most of the i are small in some sense.
In this chapter we require to belong to an ellipsoid. In terms of functions expressed in
a Fourier basis, this corresponds to mean-square smoothness. This and some consequences
for mean squared error of linear estimators over ellipsoids are developed in Section 3.2,
along with a first rate of convergence result, for a truncation estimator that ignores all high
frequency information.
We have seen already in the introductory Section 1.4 that (3.1) is equivalent to the continuous Gaussian white noise model. This connection, along with the heuristics also sketched
there, allow us to think of this model as approximating the equispaced nonparametric regression model Yl D f .l=n/ C Zl , compare (1.13). This opens the door to using (3.1)
to gain insight into frequently used methods of nonparametric estimation. Thus, kernel and
smoothing spline estimators are discussed in Sections 3.3 and 3.4 respectively, along with
their bias and variance properties. In fact, a smoothing spline estimator is a kernel method
in disguise and in the sequence model it is fairly easy to make this explicit, so Section 3.5
pauses for this detour.
Mean squared error properties return to the agenda in Section 3.6. The worst case MSE of
a given smoothing spline over an ellipsoid (i.e. smoothness class) is calculated. This depends
on the regularization parameter of the spline estimator, which one might choose to minimize
57
58
the worst case MSE. With this choice, standard rate of convergence results for smoothing
splines can be derived.
The rest of the chapter argues that the splendid simplicity of the sequence model (3.1)
actually extends to a variety of other settings. Two approaches are reviewed: transformation
and approximation. The transformation approach looks at models that can be put into the independent Gaussian sequence form yi D i C %i zi for i 2 N and known positive constants
%i . This can be done for linear inverse problems with white Gaussian noise via the singular
value decomposition, Section 3.9, and for processes with correlated Gaussian noise via the
Karhunen-Lo`eve transform (aka principal components), Section 3.10.
The approximation approach argues that with sufficient data, more concrete nonparametric function estimation problems such as density and spectral density estimation and flexible
regression models look like the Gaussian sequence model. Methods and results can in
principle, and sometimes in practice, be transferred from the simple white noise model to
these more applications oriented settings. Section 3.11 gives a brief review of these results,
in order to provide further motivation for our detailed study of the Gaussian sequence model
in later chapters.
59
k22 :
This can be expressed in terms of functions and in the continuous time domain using the
Parseval relation (1.25), yielding r.fO; f /.
Suppose that is restricted to lie in a parameter space `2 and compare estimators
through their worst case risk over : A particular importance attaches to the best possible
worst-case risk, called the minimax risk over :
RN .; / D inf sup E L.O .y/; /:
O 2
(3.2)
The subscript N is a mnemonic for non-linear estimators, to emphasise that no restricO One is often interested also in the minimax risk
tion is placed on the class of estimators .
when the estimators are restricted to a particular class E defined by a property such as linearity. In such cases, we write RE for the E -minimax risk. Note also that we will often drop
explicit reference to the noise level , writing simply RN ./ or RE ./.
This is an extension of the notion of minimax risk over Rn , introduced in Section 2.5.
Indeed, in (3.2) we are forced to consider proper subsets of `2 .N/this is a second new
feature of the infinite dimensional model. To see this, recall the classical minimax result
quoted at (2.51), namely that RN .Rn ; / D n 2 . Since Rn `2 .N/ for each n, it is apparent
that we must have RN .`2 .N/; / D 1; and in particular for any estimator O
sup E kO
k22 D 1:
2`2 .N/
(3.3)
1
X
ak2 k2 C 2 g;
(3.4)
(3.5)
We will see that each class can be used to encode different types of smoothness for functions
f 2 L2 0; 1. For now, we record criteria for compactness, with the proofs as Exercise 3.1.
60
Lemma 3.2 The ellipsoid .a; C / is `2 -compact if and only if ak > 0 and ak ! 1:
P
The hyperrectangle ./ is `2 -compact if and only if k2 < 1:
Compactness is not necessary for finiteness of the minimax risk, as the classical finitedimensional result (2.51) already shows. Indeed, Lemma 3.1 extends to sets of direct product
form D Rr 0 , where r < 1 and 0 is compact. We will need this in the next paragraph;
the easy proof is Exercise 3.2. The argument of the Lemma can also be extended to show
that RN .; / < 1 if L.a; / D w.ka k/ with w continuous and being k k-compact.
Ellipsoids and mean square smoothness. Ellipsoids furnish one of the most important
and interpretable classes of examples of parameter spaces . Consider first the continuous
form of the Gaussian white noise model (1.21). For the moment, we restrict attention to the
subspace L2;per 0; 1 of square integrable periodic functions on 0; 1. For integer 1; let
f ./ denote the th derivative of f and
n
F D F .; L/ D f 2 L2;per 0; 1 Wf . 1/ is absolutely continuous, and
Z 1
(3.6)
o
f ./ .t /2 dt L2 :
0
Thus, the average L2 norm of f ./ is required not merely to be finite, but also to be less than
a quantitative bound Las we will shortly see, this guarantees finiteness of risks. Historically,
considerable statistical interest focused on the behavior of the minimax estimation risk
Z 1
RN .F ; / D inf sup E
fO f 2
(3.7)
fO f 2F
in the low noise limit as ! 0: For example, what is the dependence on the parameters
describing F W namely .; L/? Can one describe minimax estimators, and in turn, how do
they depend on .; L; /?
The ellipsoid interpretation of the parameter spaces F .; L/ comes from the sequence
form of the white noise model. Consider the orthonormal trigonometric basis for L2 0; 1,
in which
(
p
k D 1; 2; : : :
'2k 1 .t / D 2 sin 2k t
p
(3.8)
'0 .t/ 1;
'2k .t / D 2 cos 2k t:
R1
Let k D hf; 'k i D 0 f 'k denote the Fourier coefficients of f . Let 2 .C / denote the
ellipsoid (3.4) with semiaxes
a0 D 0;
a2k
D a2k D .2k/ :
(3.9)
Thus 2 .C / has the form R 0 with 0 compact. Furthermore, it exactly captures the
notion of smoothness in mean square.
Lemma 3.3 Suppose 2 N. For f 2 L2;per 0; 1, let D f 2 `2 denote coefficients
in the Fourier basis (3.8). Let F .; L/ be given by (3.6) and the 2 .C / the ellipsoid with
semiaxes (3.9). Then f 2 F .; C / if and only if 2 2 .C /.
61
Proof Outline (For full details, see e.g. Tsybakov (2009, pp. 196-8)). Differentiation takes
0
a simple form in the Fourier basis: if l D 2k 1 or 2k, then 'l./
P D .2k/ 'l., with
0
0
l D 2k 1 or 2k also, and l D l iff is even. Hence, if f D
k 'k ; and f 1/ is
absolutely continuous then
Z
X
ak2 k2 ;
f ./ 2 D 2
so
if (3.6) holds, then 2 2 .C /. For the converse, one shows first that finiteness of
P that
ak2 k2 implies that f . 1/ exists and is absolutely continuous. Then for 2 2 .C /, the
previous display shows that (3.6) holds.
The statistical importance of this result is that the function space minimax risk problem
(3.7) is equivalent to a sequence space problem (3.2) under squared `2 loss. In the sequence
version, the parameter space is an ellipsoid. Its simple geometric form was exploited by
Pinsker (1980) to give a complete solution to the description of minimax risk and estimators.
We shall give Pinskers solution in Chapter 5 as an illustration of tools that will find use for
other parameter sets in later chapters.
Remark. The ellipsoid representation (3.4)(3.9) of mean-square smoothness extends to
non-integer
of smoothness. Sometimes we put, more simply, just ak D k : FiniteP degrees
2 2
ness of
k k can then be taken as a definition of finiteness of the Sobolev seminorm
kf ./ k2 even for non-integer . Appendices B and C.25 contain further details and references.
1
X
cij2 < 1:
(3.10)
i;j D1
Thus, C must be a bounded linear operator on `2 with square summable singular values.
In particular, in the infinite sequence case, the maximum likelihood estimator C D I
must be excluded! Hence the bias term is necessarily unbounded over all of `2 , namely
sup 2`2 r.OC ; / D 1; as is expected anyway from the general result (3.3).
The convergence in (3.10) implies that most cij will be small, corresponding at least
62
heuristically, to the notion of shrinkage. Thus, familiar smoothing methods such as the
Wiener filter and smoothing splines are indeed linear shrinkers except possibly for a low
dimensional subspace on which no shrinkage is done. Recall, for example, formula (1.18)
for the smoothing spline estimator in the Demmler-Reinsch basis from Section 1.4, in which
w1 D w2 D 0 and wk increases for k 3: This shrinks all co-ordinates but the first two.
More generally, in the infinite sequence model it is again true that reasonable linear estimators must shrink in all but at most two eigendirections. Indeed Theorem 2.3 extends to the
infinite sequence model (3.1) in the natural way: a linear estimator OC .y/ D Cy is admissible for squared error loss if and only if C is symmetric with finite Hilbert-Schmidt norm
(3.10) and eigenvalues %i .C / 2 0; 1 with at most two %i .C / D 1 (Mandelbaum, 1984).
A particularly simple class of linear estimators is given by diagonal shrinkage, C D
diag.ck / for a sequence of constants c D .ck /. In this case, we write Oc and the MSE
decomposition simplifies to
X
X
r.Oc ; / D 2
ck2 C
.1 ck /2 k2 :
(3.11)
k
This form is easy to study because it is additive in the co-ordinates. Thus, it can be desirable
that the basis f'k g be chosen so that the estimators of interest have diagonal shrinkage form.
We will see how this can happen with kernel and spline estimators in Sections 3.3 and 3.4.
Maximum risk over ellipsoids. We illustrate by deriving an expression for the maximum
risk of a diagonal linear estimator over an ellipsoid.
Lemma 3.4 Assume
homoscedastic white noise model yk D k C zk . Let D
P 2 the
.a; C / D f W
ak k2 C 2 g and consider a diagonal linear estimator Oc .y/ D .ck yk /:
Then the maximum risk
X
r.
N Oc I / D sup r.Oc ; / D 2
ck2 C C 2 sup ak 2 .1 ck /2 :
2
Proof The diagonal linear estimator has variance-bias decomposition (3.11). The worst
case risk over has a corresponding form
r.
N Oc I / D sup r.Oc ; / D VN ./ C BN 2 ./:
(3.12)
2
P
The max variance term VN ./ D
2 k ck2 does not depend on . On the other hand, the
P
max bias term BN 2 ./ D sup k .1 ck /2 k2 does not depend on the noise level : It does
depend on ; but can be easily evaluated on ellipsoids.
2 2
2
linear function .sk / !
PIndeed, make new variables sk D ak k =C and note that the P
dk sk has maximum value sup dk over the non-negative simplex sk 1. Hence,
BN 2 ./ D C 2 sup ak 2 .1
ck /2 :
(3.13)
63
E.O;i
i /2 D 2 C
i2 :
i >
This follows from the MSE formula (3.11) for diagonal estimators, by noting that c corresponds to a sequence beginning with ones followed by zeros.
Of course is unknown, but adopting the minimax approach, one might suppose that a
particular ellipsoid .a; C / is given, and then find that value of which minimizes the maximum MSE over that ellipsoid. Using Lemma 3.4, for an ellipsoid with k ! ak2 increasing,
we have for the maximum risk
r.
N O I / WD
sup
2.a;C /
2
r.O ; / D 2 C C 2 aC1
:
Now specialize further to the mean-square smoothness classes 2 .C / in the trigonometric basis (3.8) in which the semi-axes ai follow the polynomial growth (3.9). If we truncate
at frequency k, then D 2k C 1 (remember the constant term!) and
r.
N O I / D .2k C 1/ 2 C C 2 .2k C 2/
As the cut-off frequency k increases, there is a trade-off of increasing variance with decreasing biashere and quite generally, this is a characteristic feature of linear smoothers
indexed by a model size parameter. The maximum risk function is convex in k, and the
optimal value is found by differentiation1 :
2k C 2 D .2C 2 = 2 /1=.2C1/ :
Substituting this choice into the previous display and introducing the rate of convergence
1
We ignore the fact that k should be an integer: as ! 0, it turns out that using say k would add a term
of only O. 2 /, which will be seen to be negligible. See also the discussion on page 75.
64
22 .C /
D .2/1=.2C1/ C 2.1
b C 2.1
r/ 2r
C C 2 .2C 2 = 2 /
C O. 2 /
r/ 2r
;
(3.15)
.2/ 1=2 e t =2
Gaussian
<.1=2/I
Uniform
1;1 .t /
K.t/ D
(3.16)
2
.3=4/.1
t
/I
.t
/
Quadratic/Epanechnikov
1;1
:.15=16/.1 t 2 /2 I
Biweight:
1;1 .t /
These are all symmetric and non-negative; all but the first also have compact support.
2
it would be more correct to write rate index to refer to r, but here and throughout we simply say rate.
65
For simplicity we assume that K has compact support t0 ; t0 , which guarantees convergence of (3.17), though the results hold more generally. With observations in the continuous
white noise model, d Y.t/ D f .t /dt C d W .t /; 0 t 1, the kernel estimator of f is
Z 1
KV h .s t /d Y .t /:
(3.18)
fOh .s/ D
0
The integral is interpreted as in (1.23) and it follows from the compact support assumption
that .s; t/ ! KV h .s t/ is a square integrable kernel, so that fOh .s/ has finite variance and
belongs to L2 0; 1 almost surelyfor further details see C.13.
Remark. To help in motivating this estimator, we digress briefly to consider the nonparametric regression model Yl D f .tl / C Zl , for ordered tl in 0; 1 and l D 1; : : : ; n. A
locally weighted average about s would estimate f .s/ via
.X
X
wl .s/Yl
wl .s/:
(3.19)
fO.s/ D
l
A typical choice of weights might use a kernel K.u/ and set wl .s/ D Kh .s tl /: This
is sometimes called the Nadaraya-Watson estimator, see the Chapter Notes for references.
A difficulty with (3.19) is that it runs out of data on one side of s when s is near the
boundaries 0 and 1. Since we assume that f is periodic, we may handle this by extending
the data periodically:
YlCj n D Yl
for j 2 Z:
(3.20)
We also simplify by supposing that the design points tl are equally spaced, tl D l=n. Then
we may make an integral approximation to the denominator in (3.19),
Z
X
X
:
Kh .s l=n/ D n Kh .s t /dt D n:
wl .s/ D
l
Kh .s
tl /Yl :
(3.21)
l2Z
Now use the assumed periodicity (3.20) to rewrite the right hand sum as
n X
X
lD1 j 2Z
Kh .s
l=n/Yl D
n
X
KV h .s
l=n/Yl :
lD1
To link this with the continuous form (3.18), we recall the partial sum process approximation Yn .t/ from (1.26), and view Yl as the scaled increment given by n.1=n Yn /.l=n/ D
66
nYn .l=n/
Yn ..l
1/=n/. We obtain
fOh;n .s/ D
n
X
KV h .s
Now approximate Yn .t/ by the limiting process Y .t / and the sum by an integral to arrive at
formula (3.18) for fOh .s/.
Now we return to the white noise model and derive the bias and variance properties of fOh .
Lemma 3.5 In the continuous white noise model on 0; 1, assume that f is periodic. Let
fOh , (3.18), denote convolution with the periodized kernel (3.17). Then
Z 1
E fOh .s/ D
Kh .s t /f .t /dt D .Kh f /.s/:
1
If also
supp.K/ t0 ; t0 ;
and
h < 1=.2t0 /;
(3.22)
then
VarfOh .s/ D 2
1
1
Kh2 .t /dt D
2
kKk22 :
h
From the first formula, one sees that fOh estimates a smoothed version of f given by
convolution with the kernel of bandwidth h. The smaller the value of h, the more narrowly
peaked is the kernel Kh and so the local average of f more closely approximates f .s/. One
calls Kh an approximate delta-function. Thus as h decreases so does the bias E fOh .s/
f .s/, but inevitably at the same time the variance increases, at order O.1= h/,
Here we use operator notation Kh f for the convolution Kh ? f over R, consistent with
later use in the book, compare e.g. Section 3.9 and (C.6).
Proof
u/f .u/du:
For the second, we use formula (1.23) and the Wiener integral identity ?? to write
Z 1
2
O
Varfh .s/ D
KV h2 .t /dt:
0
This yields the first equality for VarfOh .s/ and the second follows by rescaling.
j 0 / do not
67
Local MSE. The mean squared error of fOh as an estimator of f at the point s has a
decomposition into variance and squared bias terms in the same manner as (2.46):
EfOh .s/
f .s/2 :
f .s/2 D
2
kKk22 C k.I Kh /f k22;I :
(3.24)
h
Again this is a decomposition into variance and bias terms. The result holds even without
(3.22) if we replace K by KV on the right side.
Notice the similarity of this mean squared error expression to (2.47) for a linear estimator
in the sequence model. This is no surprise, given the sequence form of fOh to be described
later in this section.
EkfOh
f k22;I D
pD0
Z 1
<1
p D
(3.25)
v p K.v/dv D 0
p D 1; : : : ; q 1
1
:
qcq 0
p D q:
Observe that if K is symmetric about zero then necessarily q 2. However, if K is symmetric and non-negative, then c2 > 0 and so q D 2: We will see shortly that to obtain fast
rates of convergence, kernels of order q > 2 are required. It follows that such kernels must
necessarily have negative sidelobes.
To see the bias reduction afforded by a q th order kernel, assume that f has q continuous
derivatives on 0; 1. Then the Taylor series approximation to f at s takes the form
f .s
q 1
X
. hv/q .q/
. hv/p .p/
hv/ D f .s/ C
f .s/ C
f .s.v//;
p
q
pD1
(3.27)
68
Thus, other things being equal, (which they may not be, see Section 6.5), higher order
kernels might seem preferable due to their bias reduction properties for smooth functions.
Exercise 3.7 has an example of an infinite order kernel. We will see this type of argument in
studying the role of vanishing moments for wavelets in Chapter 7.
In summary, if K is a q th order kernel, if (3.22) holds, and if f is C q , then as h ! 0
we have the local and global MSE approximations
EfOh .s/
EkfOh
2
kKk22 C cq2 h2q D q f .s/2 1 C o.1/
h
Z
2
f k22 D kKk22 C cq2 h2q .D q f /2 1 C o.1/:
h
f .s/2 D
(3.28)
The Variance-Bias Lemma. The approximate MSE expressions just obtained have a characteristic form, with a variance term decreasing in h balanced by a bias term that grows with
h. The calculation to find the minimizing value of h occurs quite frequently, so we record it
here once and for all.
Lemma 3.6 (Variance-Bias) The function G.h/ D vh 1 C bh2 ; defined for h 0 and
positive constants v; b and ; has minimizing value and location given by
G.h / D e H.r/ b 1 r v r
and
h D r
r log r
H.r/
.v=b/1 r :
For example, with kernel estimates based on a kernel K of order q, (3.28), shows that h
can be thought of as a bandwidth and v as aR variance factor (such as n 1 or 2 ), while b is a
bias factor (for example involving c.K; q/ .D q f /2 ; with D q).)
The proof is straightforward calculus, though the combination of the two terms in G.h/
to yield the multiplier e H.r/ is instructive: the variance and bias terms contribute in the ratio
1 to .2/ 1 at the optimum, so that in the typical case > 12 , the bias contribution is the
smaller of the two at the optimum h :
Sequence space form of kernel estimators. Our kernel estimators are translation invariant, Kh .s; t/ D Kh .s t/, and so in the Fourier basis they should correspond to diagonal
shrinkage introduced in the last section and to be analyzed further in later chapters. To describe this, let 'k .s/ denote the trigonometric basis (3.8), and recall that the correspondence
between the continuous model (1.21) and sequence form (3.1) is given by formulas (1.24)
for yk ; k etc. Thus we have
fOh .t / F
1
X
Oh;k 'k .t /;
(3.29)
kD0
where we have used notation F (read has Fourier series expansion) in place of equality
in order to recognize that the left side is not periodic while the right hand series necessarily
is. Of course the right hand series converges in mean square to fOh , but not necessarily at the
endpoints 0 and 1.
69
with
ch;2k
b
D ch;2k D K.2kh/:
(3.30)
Thus the diagonal shrinkage constants in estimator Oh are given by the Fourier transb
form of kernel K, and their behavior for small bandwidths is determined by that of K
b
near zero. Indeed, the r Rth derivative of K./ at zero involves the r th moment of K,
b .r/ .0/ D . i/r t r K.t /dt. Hence an equivalent description of a q th order kernamely K
nel, (3.25), says
b
K./
D 1 bq q C o. q /
as ! 0
(3.31)
where bq D . i/q cq 0. Typically bq > 0, reflecting the fact that the estimator usually
shrinks coefficients toward zero.
For the first three kernels listed at (3.16), we have
8 2
=2
Gaussian
<e
b
(3.32)
K./ D sin =
Uniform
:
2
.3= /.sin = cos / Quadratic/Epanechnikov:
Proof We begin with the orthobasis of complex exponentials 'kC .s/ D e 2 i ks for k 2 Z.
The complex Fourier coefficients of the kernel estimator fOh are found by substituting the
periodized form of (3.18) and interchanging orders of integration:
Z 1
Z 1
Z 1
fOh .s/e 2 i ks ds D
KV h .s t /e 2 i k.s t / ds
e 2 i k t d Y .t /:
0
C
C
In other words, we have the diagonal form Oh;k
.y/ D
h;k
ykC for k 2 Z. Now using first the
periodicity of KV h , and then its expression (3.17) in terms of K, we find that
Z 1
Z 1
C
h;k
D
KV h .u/e 2 i ku du D
Kh .u/e 2 i ku du
(3.33)
0
1
ch .2k/ D K.2kh/:
b
DK
C
C
b / D K./
b
Observe that since K is symmetric we have K.
and so
h;
k D
h;k :
It remains to convert this to the real trigonometric basis. The relation between Fourier
coefficients ffkC ; k 2 Zg in the complex exponential basis and real coefficients ffk ; k 0g
in trigonometric basis (3.8) is given by
p
p
f2k 1 D .1= i 2/.fkC f Ck /:
f2k D .1= 2/.fkC C f Ck /;
3
O fO denoting
The reader should note an unfortunate clash of two established conventions: the hats in ;
b denoting Fourier transform!
estimators should not be confused with the wider ones in K
70
C
The desired diagonal form (3.30) now follows from this and (3.33) since
h;
C
D
h;k
:
Yl D f .l=n/ C Zl ;
Zl N.0; 1/
(3.34)
n
X
Yl
.D m f /2 :
f .tl / C cm
0
lD1
Here we allow a more general mth derivative penalty; the constant cm is specified below.
The discrete sine and cosine vectors will be 'k D .'k .tl //. The key point is that for the
Fourier basis, the vectors 'k are discrete orthogonal on f1=n; : : : ; .n 1/=n; 1g and at the
same time the functions 'k .t/ are continuous orthonormal on 0; 1, see Exercise 3.6 and
Appendix C.9. Using the properties of differentiation in the Fourier basis, as in the proof of
Lemma 3.3, these double orthogonality relations take the form
n
n
X
Z
'j .tl /'k .tl / D j k
D m 'j D m 'k D 2m wk j k ;
and
0
lD1
w2k
D w2k D .2k/2m :
(3.35)
P
We now convert the objective Qn .f / into sequence space form. Let yk D n 1 nlD1 Yl 'k .tl /
be the empirical Fourier coefficients of fYl g, for k D 0; : : : ; n 1. Then, using the double
71
n 1
X
.yk
k /2 C
kD0
2m
n 1
X
wk k2 :
kD0
for k D 0; : : : ; n
(3.36)
n 1
X
O;k 'k .t /;
(3.37)
kD0
might be called a periodic smoothing spline estimator based on fY1 ; : : : ; Yn g. The periodic
spline problem therefore has many of the qualitative features of general spline smoothing
seen in Section 1.4, along with a completely explicit description.
Remark. It is not true that the minimizer of Q.f / over all functions lies in Sn , as was the
case with cubic splines. The problem lies with aliasing: the fact that when 0 < r n and
l 2 N; we have 'r D 'rC2ln when restricted to t1 ; : : : ; tn : See Exercise 3.8.
Infinite sequence model. The periodic spline estimate (3.37) in the finite model (3.34) has
a natural analogue in the infinite white noise model, which we recall has the dual form
Z t
Yt D
f .s/ds C Wt
t 2 0; 1;
(3.38)
0
,
yk D k C zk
k D 0; 1; : : :
with the second row representing coefficients in the trigonometric basis (3.8) (so that the
index k begins at 0). The smoothing spline estimate O in the infinite sequence model is the
minimizer4 of
1
1
X
X
Q./ D
.yk k /2 C
wk k2 :
(3.39)
0
We
use weights wk given by (3.35) corresponding to the mth order penalty P .f / D
R again
m
2
.D f / . The estimator O D .O;k / minimizing Q. / has components again given by
(3.36), only now the index k D 0; 1; : : :. The corresponding estimate of f is
fO .t/ D
1
X
0
O;k 'k .t / D
1
X
c;k yk 'k .t /
(3.40)
Studying fO in the infinite model rather than fO;n in the finite model amounts to ignoring
discretization, which does not have a major effect on the principal results (we return to this
point in Chapter 15).
We may interpret the m-th order smoothing spline as a Bayes estimator. Indeed, if the
4
P1
In the Gaussian white noise model, 0 .yk k /2 D 1 with probability one, but this apparent obstacle
P 2
Q
may be evaded by minimizing the equivalent criterion Q./
D Q./
yk .
72
prior makes the co-ordinates k independently N.0; k2 / with k2 D b=wk bk
the posterior mean, according to (2.19), is linear with shrinkage factor
ck D
bk
bk
2m
2m
C 2
2m
, then
1
;
1 C k 2m
after adopting the calibration D 2 =b. Section 3.10 interprets this prior in terms of .m 1/fold integrated Brownian motion.
Some questions we aim to address using O include
(a) what is the MSE of O , or rather the worst case MSE of O over mean square smoothness classes such as 2 .C /?
(b) what is the best (i.e. minimax) choice of regularization parameter , and how does it
and the resulting minimax MSE depend on ; C and ?
After a digression that relates spline and kernel estimators, we take up these questions in
Section 3.6.
Kernel
fOh;n .s/ D n
Spline
P 1
fO;n .s/ D nkD0
c;k yk 'k .s/
lD1 Kh .s
tl /Yl
R1
KV h .s t /d Y .t /
m
P
fO .s/ D 1
kD0 c;k yk 'k .s/
0
Table 3.1 The analogy between spline smoothing and regression goes via versions of each method in
the infinite sequence model.
As we have just seen, in terms of functions, the spline estimate is given by the series in
the lower row of Table 3.1, with shrinkage constants c;k given by (3.36).
We can now derive theR kernel representation
P of the infinite sequence spline estimate.
Substituting (1.24), yk D 'k d Y into fO D k c;k yk 'k , we get
Z
1
X
O
f .s/ F
C .s; t /d Y .t /;
C .s; t / D
c;k 'k .s/'k .t /:
(3.41)
0
As at (3.29), the notation F indicates that fO is the Fourier series representation of the
integral on the right side.
Now specialize to the explicit weights for periodic splines in (3.35). Then c;2k 1 D
c;2k , and from (3.8) and the addition formula for sines and cosines,
'2k
1 .s/'2k 1 .t /
t /:
73
1
X
2 cos 2ku
:
1 C .2k/2m
1
But we can describe C more explicitly! First, recall from the previous section that a function
P
f on R can be made periodic with period 1 by periodizing: fV.t / D j 2Z f .t C j /:
Theorem 3.8 The periodic smoothing spline is the Fourier series of a kernel estimator:
Z 1
O
f .s/ F
C .s t /d Y .t /:
0
With D h2m , kernel C .u/ D KV h .u/ is the periodized version of Kh .u/ D .1= h/K.u= h/.
The equivalent kernel is given for m D 1 by K.u/ D .1=2/e juj and for m D 2 by
juj
p
1
K.u/ D e juj= 2 sin p C
:
(3.42)
2
4
2
For general m, K is a .2m/ th order kernel, described at (3.45) below.
The kernel Kh has exponential decay, and is essentially negligible for juj 8h for m D
1; 2 and for juj 10h for m D 3; 4compare Figure 3.1. The wrapped kernel KV h is therefore
effectively identical with Kh on 12 ; 12 when h is small: for example h < 1=16 or h < 1=20
respectively will do.
Thus in the infinite sequence model, periodic spline smoothing is identical with a particular kernel estimate. One may therefore interpret finite versions of periodic splines (and by
analogy even B-spline estimates for unequally spaced data) as being approximately kernel
smoothers. The approximation argument was made rigorous by Silverman (1984), who also
showed that for unequally spaced designs, the bandwidth h varies with the fourth root of the
design density.
Proof One approach is to use the Poisson summation formula, recalled at (C.12) and
(C.13). Thus, inserting D h2m we may rewrite C as
X
X
e 2 i ku
Kh .u C j /
(3.43)
Ch2m .u/ D
D
1 C .2kh/2m
j
k2Z
where the second equality uses the Poisson formula (C.13). Since equality holds for all u,
we read off that Kh has Fourier transform
ch ./ D .1 C h2m 2m / 1 :
K
(3.44)
ch ./ D K.h/.
b
Corresponding to the rescaling Kh .u/ D .1= h/K.u= h/, we have K
From
2 1
juj
b
K./ D .1 C / one can verify that K.u/ D e =2, and a little more work yields
the m D 2 result. More generally, from Erdelyi et al. (1954, (Vol. 1), p.10), with rk D
.2k 1/=.2m/,
K.u/ D .2m/
m
X
juj sin rk
sin.juj cos rk C rk /;
kD1
(3.45)
74
0.5
0.4
0.3
0.2
0.1
0.1
15
10
10
15
Figure 3.1 Equivalent kernels for spline smoothing: dashed lines show m D 1; 3
and solid lines m D 2; 4. Only m D 1 is non-negative, the side lobes are more
pronounced for increasing m.
Remark. We mention two other approaches. Exercise 3.9 outlines a direct derivation via
contour integration. Alternately, by successively differentiating (3.43), it is easily seen that
X
l ;
(3.46)
h4 Kh.4/ C Kh D
l
75
1
X
ck /2 D VN ./ C BN 2 ;
ck2 C C 2 sup ak 2 .1
kD0
k0
say. In order to evaluate these maximum variance and bias terms for the spline choice ck D
.1 C wk / 1 , we pause for some remarks on approximating the sums and minima involved.
Aside on discretization approximations. Often a simpler expression results by replacing
a sum by its (Riemann) integral approximation, or by replacing a minimization over nonnegative integers by an optimization over a continuous variable in 0; 1/. We use the special
:
notation D to denote the approximate inequality in such cases. For example, the sum
S./ D
1
X
k p .1 C k q /
:
D
D .p C 1/=q;
(3.48)
/./=.q.r//:
(3.49)
(3.50)
kD0
k2N
x>0
with N D e H.=
/ and N D =
. The final equality uses, for example, the Variance-Bias
Lemma 3.6 with v D ; h D x , etc. Again we have asymptotic equality, SN ./=
N 1 N !
1 as ! 0.
The errors in these discretization approximations6 are quadratic in the size of the discretization step, and so can be expected often to be fairly small. Briefly, for the integral
5
since k should be replaced by an even integer, being whichever of the two values closest to k leads to the
larger squared bias.
Actually, we work in the reverse direction, from discrete to continuous!
76
R 1 00
approximation,
if
G
has,
for
example,
G.0/
D
0
and
0 jG j < 1, then the difference beR1
P1
2
tween kD0 G.k/ and 0 G.x/dx is O. /, as follows from the standard error analysis
for the trapezoid rule. Similarly, if G is C 2 , then the difference between mink2N G.k/ and
infx>0 G.x/ is O. 2 /, as follows from the usual Taylor expansion bounds.
Finally, for later use we record, for p 0 and a > 1, that
K
X
:
k p D .p C 1/ 1 K p ;
and
kD1
K
X
:
a k D
a a K
(3.51)
kD1
which means for the first sum, that the relative error in the integral approximation is O.K 1 /.
In the second sum,
a D a=.a 1/, and the relative error is geometrically small, O.a K /.
Continuation of proof of Proposition 3.9. For the variance term, use the integral approximation (3.48) with p D 0; q D 2m and r D 2:
VN ./ D 1 C 2
1
X
1 C .2k/2m
kD1
: X
D
.1 C k 2m /
:
D vm
1=2m
kD0
m /.m / D .1
m /=sinc.m /;
(3.52)
2m
g 1:
2H.=2m/
D .2m/ 2 =m .2m
/2
=m
(3.53)
Note that bm D 1 if D 2m. Combining the variance and bias terms yields (3.47).
Remarks. 1. The exponent of in the bias term shows that high smoothness, namely
2m, has no effect on the worst-case mean squared error.
2. The degrees of freedom of the smoother O D S y is approximately
X
X
:
tr S D
ck D
.1 C k 2m / 1 D c 1=.2m/ :
k
In the equivalent kernel of the Section 3.5, we saw that corresponded to h2m , and so the
77
sup
2 .C /
r/ 2r
;
(3.54)
as ! 0, with
c2 .; m/. 2 =C 2 /2m.1
r/
Proof Formula (3.47) shows that there is a variance-bias tradeoff, with small corresponding to small bandwidth h and hence high variance and low bias, with the converse being
true for large . To find the optimal , apply the Variance-Bias lemma with the substitutions
h D 1=.2m/ ;
v D vm 2 ;
b D bm C 2 ;
D 2m ^ ;
where vm and bm are the variance and bias constants (3.52) and (3.53) respectively.
From the Variance-Bias lemma, one can identify the constants explicitly:
1 r r
c1 .; m/ D e H.r/ bm
vm ;
(3.55)
In particular, for cubic splines over ellipsoids of twice differentiable functionsR in mean
square, we get that .v2 2 =C 2 /4=5 : For a fixed function f , recall that f 002 D
78
ak2 k2 : Thus, if f is known (as for example
in simulation studies), and a reasonable
R 002
2
4
value of is desired, one might set C D
f to arrive at the proposal
p
4 6 2 2 4=5
R
:
D
2
f 002
P
4
5. We may compare the minimax- MSE for splines given in (3.54) with the minimax-
MSE for truncation estimators (3.15). By comparing the constant terms, it can be verified
for m > 1=2 and 0 2m that the spline estimators have asymptotically at least as good
MSE performance as the truncation estimator, Exercise 3.11. We will see in Section 5.2 just
how close to optimal the spline families actually come.
6. We have assumed periodic f , equispaced sampling points tl and Gaussian errors to
allow a concrete analysis. All of these assumptions can be relaxed, so long as the design
points tl are reasonably regularly spaced. A selection of relevant references includes Cox
(1983); Speckman (1985); Cox (1988); Carter et al. (1992).
sup
h 2 .C /
2
r/ 2r
;
where the exact form of c;K is given in Exercise 3.12: it has a structure entirely analogous
to the periodic spline case.
Local polynomial regression. Consider the finite equispaced regression model (3.34) for
periodic f , with data extended periodically as in (3.20). Let K be a kernel of compact
support and let O minimize
X
Yl
l2Z
p
X
j .tl
t /j
2
Kh .tl
t /:
(3.56)
j D0
Then the local polynomial estimator of degree p puts fOp;h .t / D O0 . This may be motivated
by the Taylor expansion of f .s/ about s D t, in which j D f .j / .t /=j . The kernel Kh
serves to localize the estimation in fOp;h .t / to data within a distance of order h of t.
It can be shown, Exercise 3.13, that the local polynomial regression estimator has an
79
Kh .tl
t /Yl ;
l2Z
compare (3.21), where the equivalent kernel K is a kernel of order (at least) p C 1, even if
the starting kernel K has order 2. Consequently, the rates of convergence results described
for higher order kernel estimators also apply to local polynomial regression. A comprehensive discussion of local polynomial regression is given by Fan and Gijbels (1996).
i 2 N;
(3.57)
where the zi are again i.i.d. N.0; 1/, but the %i are known positive constants.
In the next two sections, we explore two large classes of Gaussian models which can be
transformed into (3.57). These two classes parallel those discussed for the finite model in
Section 2.9. The first, linear inverse problems, studies models of the form Y D Af C Z,
where A is a linear operator, and the singular value decomposition (SVD) of A is needed to
put the model into sequence form. The second, correlated data, considers models of the form
Y D f C Z, where Z is a correlated Gaussian process. In this setting, it is the KarhunenLo`eve transform (KLT, also called principal component analysis) that puts matters into sequence form (3.57). The next two sections develop the SVD and KLT respectively, along
with certain canonical examples that illustrate the range of possibilities for .%i /.
First, some preliminaries about model (3.57). For example, when is it well defined? We
pause to recall the elegant Kakutani dichotomy for product measures (e.g. Williams (1991,
Ch. 14), Durrett (2010, Ch. 5)). Let P and Q be probability measures on a measurable
space .X ; B /; absolutely continuous with respect to a probability measure : (For example,
D .P C Q/=2:) Write p D dP =d and q D dQ=d: The Hellinger affinity
Z
p
pqd
(3.58)
.P; Q/ D
does not depend on the choice of : Now let fPi g and fQi g be two sequences of probability
measures on R: Define
product measures
on sequence space R1 ; with the product Borel
Q
Q
Q field, by P D Pi and Q D Qi . The affinity behaves well for products: .P; Q/ D
.Pi ; Qi /.
Kakutanis dichotomy says that if the components Pi Qi for i D 1; 2; : : : then the
products P and Q are either equivalent or orthogonal. And there is an explicit criterion:
P Q
if and only if
1
Y
.Pi ; Qi / > 0:
i D1
Q
In either case L1 D lim n1 dPi =dQi exists
Q1 Q-a.s. And when P Q; the likelihood ratio
dP =dQ is given by the product L1 D i D1 dPi =dQi ; whereas if P is orthogonal to Q,
then Q.L1 D 0/ D 1.
80
The criterion is easy to apply for Gaussian sequence measures. A little calculation shows
that the univariate affinity
.N.; 2 /; N. 0 ; 2 // D expf .
0 /2 =.8 2 /g:
Let P denote the product measure corresponding to (3.57). The dichotomy says that for
two different mean vectors and 0 , the measures P and P 0 are equivalent or orthogonal.
[See Exercise 3.15 for an implication for statistical classification]. The product affinity
X
.P ; P 0 / D expf D 2 =8g;
D2 D
.i i0 /2 =.%i /2 :
(3.59)
i
P
Thus P is absolutely continuous relative to P0 if and only ifP i2 =%2i < 1; in which
case the density is given in terms of the inner product hy; i% D i yi i =%2i by
(
)
dP
hy; i% k k2%
.y/ D exp
:
dP0
2
2 2
Here i =.%i / might be interpreted as the signal-to-noise ratio of the i-th co-ordinate.
We will again be interested in evaluating the quality of estimation of that is possible
in model (3.57). An important question raised by the extended sequence model is the effect
of the constants .%i / on quality of estimationif %i increases with i, we might expect, for
example, a decreased rate of convergence as ! 0:
We will also be interested in the comparison of linear and non-linear estimators in model
(3.57). For now, let us record the natural extension of formula (2.47) for the mean squared
error of a linear estimator OC .y/ D Cy: Let R D diag.%2i /, then
r.OC ; / D 2 tr CRC T C k.C
I / k2 :
(3.60)
Hellinger and L1 distances. We conclude this section by recording some facts about
distances between (Gaussian) measures for use in Section 3.11 on asymptotic equivalence.
A more systematic discussion may be found in Lehmann and Romano (2005, Ch. 13.1).
Let P and Q be probability measures on .X ; B / and a dominating measure, such as
P C Q. Let p and q be the corresponding densities. The Hellinger distance H.P; Q/ and
L1 or total variation distance between P and Q are respectively given by 7
Z
p
p 2
2
1
H .P; Q/ D 2 . p
q/ d;
Z
kP Qk1 D jp qjd:
Neither definition depends on the choice of . Expanding the square in the Hellinger distance, we have H 2 .P; Q/ D 1 .P; Q/, where is the affinity (3.58). The Hellinger
distance is statistically useful because the affinity behaves well for products (i.e. independence), as we have seen. The L1 distance has a statistical interpretation in terms of the sum
of errors of the likelihood ratio test between P and Q:
1
7
1
kP
2
81
as is easily checked directly. Thus the sum of errors is small if and only if the L1 distance is
large. The measures are related (Lehmann and Romano, 2005, Th. 13.1.2) by
H 2 .P; Q/ 12 kP
Qk1 1
2 .P; Q/1=2 :
(3.61)
P0 k1 D 21
Q
2.D=2/;
(3.62)
R1
Q
where as usual .u/
D u is the right Gaussian tail probability. We can now compare
the quantities in (3.61) assuming that D is small. Indeed
H 2 .P ; P0 / D 2 =8;
1
kP
2
P0 k1 .0/D;
and 1
In the continuous Gaussian white noise model (%i 1), we can re-interpret D 2 using
Parsevals identity, so that kPf PfN k1 is given by (3.62) with
Z 1
D2 D 2
.f fN/2 :
(3.63)
0
(3.64)
Y . / D Af; C Z. /;
and Z D fZ. /g is a Gaussian process with mean zero and covariance function
Z
0
0
Cov.Z. /; Z. // D
d2 :
U
(3.65)
(3.66)
82
A
k;
D b k 'k :
D bk hf; 'k i:
(3.67)
Suppose
P now that A is one-to-one, so that f'k g is an orthonormal basis for H. We can expand
f D k k 'k where k D hf; 'k i. The representer equations (3.67) show that f can be
represented in terms of Af via Af; k .
As observables, introduce Yk D Y . k /. From our model (3.65), Yk D Af; k CZ. k /.
The representer equations say that Af; k D bk hf; 'k i D bk k , so that Yk =bk is unbiased
for k . Now introduce zk D Z. k /; from the covariance formula (3.66) and orthogonality
in L2 .U /, it is evident that the zk are i.i.d. N.0; 1/. We arrive at the sequence representation
Yk D bk k C zk :
(3.68)
As with the regression model of Chapter 2, we set yk D Yk =bk and %k D =bk to recover
our basic sequence model (1.12). After next describing some examples, we return later in
this section to the question of building an estimator of f from the fYk g, or equivalently the
fyk g. However, it is already clear that the rate of variation inflation, i.e. the rate of decrease
of bk with k, plays a crucial role in the analysis.
Examples
(i) Deconvolution. Smoothing occurs by convolution with a known integrable function a:
Z 1
Af .u/ D .a ? f /.u/ D
a.u t /f .t /dt;
0
and the goal is to reconstruct f . The two-dimensional analog is a natural model for image
blurring.
The easiest case for describing the SVD occurs when a is a periodic function on R with
83
period 1. It is then natural to use the Fourier basis for H D K D L2 0; 1. In the complex
form, 'k .u/ D e 2ki u , and with fk D hf; 'k i; ak D ha; 'k i, the key property is
.a ? f /k D ak fk ;
A'k D ak 'k :
If a is also even, a. t/ D a.t /, then ak is real valued, the singular values bk D jak j, and
the singular functions k D sign.ak /'k .
If a.u/ D I fjuj u0 g is the boxcar blurring function, then ak D sin.2ku0 /=.k/,
so that the singular values bk O.k 1 /. For a smooth, say with r continuous derivatives,
then bk D O.k r / (e.g. Katznelson (1968, Ch. 1.4)).
(ii) Differentiation. We observe Y D g C Z, with g assumed to be 1-periodic and seek
to estimateR the derivative f D g 0 . We can express g as the output of integration: g.u/ D
u
Af .u/ D 0 f .t/dt Cc.f /, where c.f / is the arbitrary constant of integration. We suppose
that H D K D L2 0; 1. Roughly speaking, we can take the singular functions 'k and k to
be the trigonometric basis functions, and the singular values bk 1=jkj D O.k 1 /.
A little more carefully, consider for example the real trignometric basis (3.8). Choose the
constants of integration c.'k / so that A'2k D .2k/ 1 '2k 1 and A'2k 1 D .2k/ 1 '2k .
Since the observed function is assumed periodic, it is reasonable to set A'0 D 0. So now
A is well defined on L2 0; 1 and one checks that A D A and hence, for k 0 that the
singular values b2k 1 D b2k D 1=j2kj.
More generally, we might seek to recover f D g .m/ , so that, properly interpreted, g is the
m-th iterated integral of f . Now the singular values bk jkj m D O.k m / for k 0:
(iii) The Abel equation Af D g has
1
.Af /.u/ D p
Z
0
f .t /
dt
p
u t
and goes back to Abel (1826), see Keller (1976) for an engaging elementary discussion and
Gorenflo and Vessella (1991) for a list of motivating applications, including Abels original
tautochrone problem.
The singular value decomposition in this and the next example is not so standard, and the
derivation is outlined in Exercise 3.17. To describe thep
result, let H D L2 0; 1 with f'k g
given by normalized Legendre polynomials 'k .u/ D 2k C 1Pk .1 2u/. On the other
side, let
p
2= sin.k C 12 /;
x D sin2 .=2/
k .u/ D
for 0 . Setting Q k ./ D k .u/, the functions Q k are orthonormal in L2 0; (and
p 1=2; 1=2
uPk
.1 2u/, see
k .u/ can be expressed in terms of modified Jacobi polynomials
(3.71) below). It is shown in Exercise 3.17 that A'k D bk k with singular values
bk D .k C 1=2/
1=2
84
(iii) Wicksell problem. Following Wicksell (1925) and Watson (1971), suppose that
spheres are embedded in an opaque medium and one seeks to estimate the density of the
sphere radii, pS , by taking a planar cross-section through the medium and estimating the
density pO of the observed circle radii.
Assume that the centers of the spheres are distributed at random according to a homogeneous Poisson process. Then pO and pS are related by (Watson, 1971, eq. (5))
Z b
Z
y b pS .s/
spS .s/ds:
(3.69)
ds;
D
pO .y/ D
p
y
s2 y2
0
We may put this into Abel equation form. Suppose, by rescaling, that b D 1 and work on
the scale of squared radii, letting g be the density of u D 1 y 2 and p be the density of
p
t D 1 s 2 . Setting D 2= , we get
Z u
1
1
p.t /
g.u/ D
dt D .Ap/.u/:
p
2 0
x t
Thus we can use observations on g and the SVD of A to estimate f D p=. To obtain an
estimate of p we can proceed as follows. Since '0 1 and p is a probability density, we
have hp; '0 i D 1. Thus from (3.67)
1 D hf; '0 i D b0 1 Af;
and so D b0 =g;
and hence
p D f D
X b0 g;
bk g;
k
k
0
'k
k .
dt D .f ? /.u/
(3.70)
1
where .x/ D xC
=./ and xC D max.x; 0/. Gelfand and Shilov (1964, 5.5) explain how convolution with and hence operator
A can be interpreted as integration of
Ru
(fractional) order . Of course, .A1 f /.u/ D 0 f .t /dt is ordinary integration and D 1=2
yields the Abel operator.
The SVD of A can be given in terms of Jacobi polynomials Pka;b .1 2x/ and their
normalization constants ga;bIk , Appendix C.31 and Exercise 3.17:
p
'k .u/ D 2k C 1Pk .1 2u/
on L2 .0; 1; du/
k .u/
D g;1 Ik u Pk; .1
bk D ..k
2u/
on L2 .0; 1; u .1
1=2
C 1/= .k C C 1//
u/ du/;
(3.71)
as k ! 1:
85
heat in a rod. If u.x; t/ denotes the temperature at position x in the rod at time t , then in
appropriate units, u satisfies the equation 8
@2
@
u.x; t / D 2 u.x; t /:
(3.72)
@t
@x
For our discussion here, we will assume that the initial temperature profile u.x; 0/ D f .x/
is unknown, and that the boundary conditions are periodic: u.0; t / D u.1; t /. We make noisy
observations on the temperature in the rod at a time T > 0:
Y .x/ D u.x; T / C Z.x/;
and it is desired to estimate the initial condition f .x/. See Figure 3.2.
The heat equation (3.72) is a linear partial differential equation, having a unique solution
which is a linear transform of the intial data f :
u.x; T / D .AT f /.x/:
This can be expressed in terms of the Gaussian heat kernel, but we may jump directly to
the SVD of AT by recalling that (3.72) along with the given boundary conditions can be
solved by separation of variables. If we assume that the unknown, periodic f has Fourier
sine expansion
f .x/ D
1
p X
k sin kx;
2
kD1
then it is shown in introductory books on partial differential equations that the solution
u.x; T / D
1
p X
k e
2
2 k2 T
sin kx:
kD1
p
2 2
Thus 'k .x/ D k .x/ D 2 sin kx, and the singular values bk D e k T :
The very rapid decay of bk shows that the heat equation is extraordinarily ill-posed.
(vi) Radon transform and 2-d computed tomography (CT). In a two-dimensional idealization, this is the problem of reconstructing a function from its line integrals. Thus, let T D D
be the unit disc in R2 , and suppose that the unknown f 2 H D L2 .D; 1 dx/.
A line at angle from the vertical and distance s from the origin is given by t !
.s cos t sin ; s sin C t cos / and denoted by Ls; , compare Figure 3.2. The corresponding line integral is
.Af /.s; / D Ave f jLs; \ D
Z p1 s 2
1
f .s cos
D p
p
2 1 s2
1 s2
.s; / 2 R:
In this and the next example, we use notation conventional for these settings.
86
Y(x)
Ls;
Y(s;)
D
0
f(x)
x
1
f(x)
Figure 3.2 Left panel: domain for the heat equation. We observe u.x; T / plus
noise (top line) and wish to recover the initial data f .x/ D u.x; 0/, (bottom line).
Right panel: domain for computed tomography example. We observe line integrals
.Af /.s; / along lines Ls; plus noise, and wish to recover f .x/; x 2 D.
The SVD of A was derived in the optics and tomography literatures (Marr, 1974; Born and
Wolf, 1975); we summarize it here as a two-dimensional example going beyond the Fourier
basis. There is a double index set N D f.l; m/ W m D 0; 1; : : : I l D m; m 2; : : : ; mg;
where m is the degree and l the order. For D .l; m/, the singular functions are
p
jlj
i l
' .r; / D m C 1Zm
.r/e i l ;
;
.s; / D Um .s/e
p
and the singular values b D 1= m C 1. Here Um .cos / D sin.m C 1/= sin are Chebychev polynomials of the
kind, and the Zernike polynomials are characterized by the
R 1second
k
k
orthogonality relation 0 ZkC2s
.r/ZkC2t
.r/rdr D ..k C 2s C 1/=2/st .
The main point here is that the singular values b decay slowly: the reconstruction problem is only mildly ill-posed, consistent with the now routine use of CT scanners in medicine.
where the shrinkage constants ck 2 0; 1 are chosen to counteract the variance inflation
effects of the small singular values bk . Examples that fall within this class include
(a) Truncated SVD, also known as a projection or spectral cut-off estimator:
(
b 1 Yk k ;
O;k D k
0
k > :
87
bk2
bk
Yk :
C wk
In this case the shrinkage constants ck D bk2 =.bk2 C wk /. For direct estimation, bk 1,
with wk D k 2m , this reduces to the mth order smoothing spline estimator. In the general
case, it arises from a penalized least squares problem
min kY
f
Af k2 C kf k2 ;
if the singular functions 'k also satisfy 'k D wk 'k . This occurs, for example, in the
trigonometric basis, with D D m , namely mth order differentiation.
Rates of convergence. We use the example of the truncated SVD to give a brief discussion of the connection between the decay of the singular values bk and rates of convergence,
following the approach of Section 3.2. A similar analysis is possible for the regularized
Tikhonov estimates, along the lines of Section 3.6.
We carry out the analysis in model (3.57), so that %k D 1=bk . Then the truncated SVD
estimator is identical to (3.14) and the analysis of maximum risk over ellipsoids 2 .C / can
be patterned after (3.11). As was done there, let O be given by (3.14). Its mean squared error
r.O ; / D EkO
k22 D 2
X
%2k C
kD1
k2 ;
k>
r/ 2r
:
(3.73)
This rate r is in fact optimal, as is shown in Proposition 4.23more refined results at the
level of constants come with Pinskers theorem in Chapter 5. We see that the effect of the
degree of decay of the singular values is to degrade the rate of convergence of maximum
MSE from 2=.2 C 1/ to 2=.2 C 2 C 1/. Thus the faster the decay of the singular
values, the slower the rate of convergence, and also the smaller the frequency at which
the data is truncated. For this reason is sometimes called an index of the ill-posedness.
P
2
Proof For any ellipsoid .a; C / we have r.
N O I / D 2 kD1 %2k C C 2 aC1
, so long as ak2
88
Remark. If %k D 1=bk grows exponentially fast, or worse, then the inversion problem
might be called severely ill-posed. In this case, the attainable rates of convergence over
Sobelev smoothness classes 2 .C / are much slower: Exercise 3.19, for the heat equation,
derives rates that are algebraic in log 1 . One can recover rates of convergence that are
polynomial in by assuming much greater smoothness, for example by requiring to be
an ellipsoid of analytic functions, see Section 5.2 and Exercise 5.2.
f .s/Z.s/ds 0:
hRf; f i D
f .s/Cov.Z.s/; Z.t //f .t /dsdt D Var
Under these conditions it follows (Appendix C.4 has some details and references) that R
is a compact operator on L2 .T /, and so it has, by the Hilbert-Schmidt theorem, a complete
orthonormal basis f'k g of eigenfunctions with eigenvalues %2k 0,
Z
R.s; t /'k .t /dt D %2k 'k .s/;
s 2 T:
In addition, by Mercers theorem, Thm. C.5, the series
X
R.s; t / D
%2k 'k .s/'k .t /
converges uniformly and in mean square on T T .
Define Gaussian variables (for k such that %k > 0)
Z
zk D %k 1 'k .t /Z.t /dt:
The zk are i.i.d. N.0; 1/: this follows from the orthonormality of eigenfunctions:
Z
Z
Z
Cov
'k Z; 'k 0 Z D
'k R'k 0 D h'k ; R'k 0 i D %2k kk 0 :
(3.74)
T T
The sum
Z.t / D
%k zk 'k .t /
P
converges in mean-square
on L2 .T /: Indeed, for a tail sum rmn D nm hZ; 'k i'k we have,
P
2
using (3.74), Ermn
D niDm %2k 'k2 .t / ! 0 as m; n ! 1 by Mercers theorem.
89
If the eigenfunctions 'k corresponding to %k > 0 are not complete, then we may add an
orthonormal basis for the orthogonal complement of the closure of the range of R in L2 .T /
and thereby obtain an orthobasis for L2 .T /: Since R is symmetric, these 'k correspond to
%k D 0:
Now suppose that Z.t/ is observed with an unknown drift function added:
Y.t/ D .t / C Z.t /;
t 2 T:
Such a model occurs commonly in functional data analysis, in which Y .t / models a smooth
curve and there is smoothness also in the noise process due to correlation. See for example
Ramsay and Silverman (2005); Hall and Hosseini-Nasab (2006), and Exercise 3.15.
If 2 L2 .T /, then we may take coefficients in the orthonormal set f'k g W
yk D hY; 'k i;
k D h; 'k i;
to obtain exactly the sequence model (3.57). [Of course, co-ordinates corresponding to %k D
0 are observed perfectly, without noise.] From
P the discussion in Section 3.8, it follows that
model P is equivalent to P0 if and only if k k2 =%2k < 1.
To summarize: for our purposes, the Karhunen-Lo`eve transform gives (i) a diagonalization of the covariance operator of a mean-square continuous Gaussian process and (ii) an
example of the Gaussian sequence model. As hinted at in the next subsection it also provides
a way to think about and do computations with Gaussian priors in the sequence model.
Connection to Principal Components Analysis. Constructing the KLT is just the stochastic
process analog of finding the principal components of a sample covariance matrix. Indeed,
suppose that
Pthe sample data is fxij g for i D 1; : : : ; n cases and j D 1; : : : ; p variables. Let
xN j D n 1 i xij denote the sample mean for variable j . Set zij D xij xN j and make the
correspondence Z.!; t/ $ zij , identifying the realization ! with i, and the time t with
j . Then R.t1P
; t2 / D EZ.t1 /Z.t2 / corresponds to an entry in the sample covariance matrix
Sj1 j2 D n 1 i .xij1 xN j1 /.xij2 xN j2 /.
m
X1
j D0
j
tj
0
C Zm
.t /:
j
90
Wahba (1978, 1983, 1990) has advocated the use of Zm
as a prior distribution for Bayesian
estimation in the context of smoothing splinesactually, she recommends using ! 1;
for reasons that will be apparent. She showed (Wahba, 1990, Th. 1.5.3), for the nonparametric
setting (1.13), that the smoothing spline based on the roughness penalty
R m regression
.D f /2 arises as the limit of posterior means calculated from the Zm
priors as ! 1:)
This prior distribution has some curious features, so we explore its Karhunen-Lo`eve transform. The key conclusion: for each 0; and in the ! 1 limit, the eigenvalues satisfy
%k .k/
as k ! 1:
Recall that the spline estimator (??) in the Gaussian sequence model arises as a Bayes estimator for the prior with independent N.0; k2 / co-ordinates with k / k m .
0
We discuss only the cases m D 1; 2 here. For general m, the same behavior (for Zm
) is
established by Gao et al. (2003).
It is simpler to discuss the m D 1 situation first, with Z1 .t / D 0 CW .t /, and covariance
kernel R .s; t/ D Cov .Z1 .s/; Z1 .t // D 2 C s ^ t. The eigenvalue equation R ' D %2 '
becomes
Z 1
Z s
Z 1
2
'.t/dt C
t '.t /dt C s
'.t /dt D %2 '.s/:
(3.76)
0
(3.77)
and differentiating a second time yields the second order ordinary differential equation
'.s/ D %2 ' 00 .s/
0 s 1:
(3.78)
The homogeneous equation %2 ' 00 C ' D 0 has two linearly independent solutions given by
trigonometric functions
'.t / D a sin.t =%/ C b cos.t =%/:
(3.79)
The equations (3.76) and (3.77) impose boundary conditions which non-zero eigenfunctions
must satisfy:
' 0 .1/ D 0;
[The first condition is Revident from (3.77) while the second follows by combining the two
equations: %2 ' 0 .0/ D ' D %2 '.0/= 2 :
Let us look first at the ! 1 limit advocated by Wahba. In this case the boundary
conditions become simply ' 0 .0/ D ' 0 .1/ D 0: Substituting into (3.79), the first condition
implies that a D 0 and the second that sin.1=%/ D 0: Consequently the eigenvalues and
eigenfunctions are given by
p
n D 1; 2; : : : :
%k D 1=k;
'k .s/ D 2 cos ks;
91
Eigenvalues
D1
%k 1 D k
D0
%k 1 D .k C 12 /
0< <1
Periodic
' 0 .0/ D
' 0 .1/ D 0
2 '.0/;
'.0/ D '.1/;
' 0 .0/ D ' 0 .1/
%k 1 2 .k; .k C 12 //
%2k1
D %2k1 D 2k
Eigenfunctions
p
2 cos k t
p
2 sin.k C 21 / t
ck sin %k 1 t C : : :
ck 2 %k 1 cos %k 1 t
p
p2 sin 2k t;
2 cos 2k t
Table 3.2 Effect of Boundary Conditions for the vibrating string equation
' 0 .0/ D
and so the standard periodic boundary conditions and the usual sine and cosine eigenfunctions, as used in Section 3.4, emerge in the ! 1 limit. See the final row of Table 3.2.
In all cases summarized in Table 3.2, the eigenfunctions show increasing oscillation with
increasing n, as measured by sign crossings, or frequency. This is a general phenomenon
for such boundary value problems for second order differential equations (Sturm oscillation
theorem - see e.g. Birkhoff and Rota (1969, Sec 10.7)). Note also that in the periodic case,
the eigenvalues have multiplicity two both sines and cosines of the given frequency but
in all cases the asymptotic behavior of the eigenvalues is the same: %k 1 k:
The analysis of the integrated Wiener prior (3.75), corresponding to cubic smoothing
splines, then proceeds along the same lines, with most details given in Exercise 3.10 (see also
9
p
In this case, the eigenfunctions 2 sin.k C 12 / t happen to coincide with the left singular functions of the
Abel transform of the previous section.
92
Freedman (1999, Sec. 3) ). The eigenvalue equation is a fourth order differential equation:
'.s/ D %2 ' .4/ .s/:
This equation is associated with the vibrating
rod (Courant and Hilbert, 1953, Secs IV.10.2
R
and V.4) indeed, the roughness penalty f 002 corresponds to the potential energy of deformation of the rod. It is treated analogously to the vibrating string equation. In particular, the
(four!) boundary conditions for the D 1 limit become
' 00 .0/ D ' 000 .0/ D 0;
0 t 1;
(3.80)
0 t 1;
(3.81)
l D 1; : : : ; n:
(3.82)
In problem .PN n /, the function fNn is a step function approximation to f , being piecewise
constant on intervals .l 1/=n; l=n/. Qn is the nonparametric regression problem (1.13)
with sample size n, while Pn is the continuous Gaussian white noise model at noise level
p
D = n. We will define a distance .Pn ; Qn / between statistical problems and show
that it converges to zero in two steps. First, problems Pn and P n are on the same sample
93
rL .; / D
L.a; /.dajy/P .dy/;
(3.83)
compare (A.10) and the surrounding discussion for more detail. If .jy/ is a point mass at
O
.y/,
then this definition reduces to (2.11).
Now consider two regular statistical problems P0 ; P1 with sample spaces Y0 ; Y1 but the
same parameter space . Let the two corresponding families of distributions be denoted
by fPi; ; 2 g for i D 0; 1. The deficiency d .P0 ; P1 / of P0 with respect to P1 is the
smallest number 2 0; 1 such that for every arbitrary loss function L with 0 L.a; / 1
and every decision rule 1 in problem P1 , there is a decision rule 0 in problem P0 such that
r0;L .0 ; / r1;L .1 ; / C for all 2 . To obtain a distance on statistical problems, we
symmetrize and set
.P0 ; P1 / D maxfd .P0 ; P1 /; d .P1 ; P0 /g:
(3.84)
The definition of distance is quite elaborate because it requires that performance in the
two problems be similar regardless of the choice of estimand (action space) and measure
of performance (loss function). In particular, since the loss functions need not be convex,
randomized decision rules must be allowed (cf. (A.10)(A.13) in Appendix A).
A simplification can often be achieved when the problems have the same sample space.
Proposition 3.12 If Y0 D Y1 and P0 and P1 have a common dominating measure , then
.P0 ; P1 / L1 .P0 ; P1 /;
where the maximum L1 distance is defined by
Z
L1 .P0 ; P1 / D sup jp0; .y/
p1; .y/j.dy/:
(3.85)
2
Proof In the definition of deficiency, when the sample spaces agree, we can use the same
decision rule in P0 as in P1 , and if we write kLk1 D sup jL.a; /j, then from (3.83)
Z
jr0;L .; / r1;L .; /j kLk1 jp0; .y/ p1; .y/j.dy/:
In the definition of deficiency, we only consider loss functions with kLk1 1. Maximizing
10
Regular means that it is assumed that the sample space Y is a complete separable metric space, equipped
with the associated Borel -field, and that the family fP g is dominated by a -finite measure. These
assumptions hold for all cases we consider.
94
over shows that r0;L .; / r1;L .; / C L1 .P0 ; P1 /. Repeating the argument with the
roles of P0 and P1 reversed completes the proof.
A sufficient statistic causes no loss of information in this sense.
Proposition 3.13 Let P be a regular statistical problem with sample space Y . Suppose
that S W Y ! S is a sufficient statistic, and let Q D fQ I 2 g denote the problem in
which S D S.Y / is observed. Then .P ; Q/ D 0.
Proof Since S D S.Y / is sufficient
R for Y , there is a kernel K.C js/ defined for (Borel)
subsets C Y such that P .C / D K.C js/Q .ds/. This formalizes11 the notion that the
distribution of Y given S isR free of . Given a decision rule for problem P , we define a
rule 0 for Q by 0 .Ajs/ D .Ajy/K.dyjs/. By chasing the definitions, it is easy to verify,
given a loss function L, that rQ;L . 0 ; / D rP;L .; /, where the subscripts indicate the
statistical problem. Hence d .Q; P / D 0. Since a rule for Q is automatically a rule for P ,
we trivially have also d .P ; Q/ D 0, and hence .P ; Q/ D 0.
We are now ready to formulate and prove a special case of the Brown-Low theorem.
Consider parameter spaces of Holder continuous functions of order . The case 0 < < 1
is of most interest hereAppendix C gives the definitions for 1. We set
H .C / D ff 2 C.0; 1/ W jf .x/
f .y/j C jx
(3.86)
Theorem 3.14 Let Pn and Qn denote the continuous Gaussian white noise model (3.80)
and the discrete regression model (3.82) respectively. Let the parameter space for both
models be the Holder function class H .C /. Then, so long as > 1=2, the two problems
are asymptotically equivalent:
.Pn ; Qn / ! 0:
Proof We pursue the two step approach outlined earlier. Given a function f 2 H .C /,
define a piecewise constant step function approximation to it from the values f .l=n/. Set
fNn .t/ D f .l=n/
if .l
and put fNn .1/ D f .1/. [This type of interpolation from sampled values occurs again in
Chapter 15.] As indicated at (3.81), let P n denote the statistical problem in which fNn is observed in continuous white noise. Since both Pn and P n have sample space Y D C.0; 1/
and are dominated, for example by P0 , the distribution of Yn under f D 0, we have
.Pn ; P n / L1 .Pn ; P n /. The L1 distance between Pf and PfNn can be calculated fairly
easily; indeed from (3.62) and (3.63),
Q n .f /=2/;
PfNn k1 D 21 2.D
Z 1
2 Dn2 .f / D n
fNn .t / f .t /2 dt:
kPf
f .l=n/j C jt
l=nj for t 2 .l
The existence of such a kernel, specifically a regular conditional probability distribution, is guaranteed for a
regular statistical problem, see. e.g. Schervish (1995, Appendix B.3) or Breiman (1968).
95
Q
and this holds uniformly for all f 2 H .C /. Since 1 2./
2.0/ as ! 0, we
conclude that L1 .Pn ; P n / ! 0 so long as > 1=2.
For the second step, reduction by sufficiency, define
l D 1; : : : ; n:
(3.87)
Sn;l .YNn / D n YNn .l=n/ YNn ..l 1/=n/ ;
The variables Sn;l are independent Gaussians with mean f .l=n/ and variance 2 . Hence
the vector Sn D .Sn;l / is an instance of statistical problem Qn . In addition, Sn D Sn .YNn /
is sufficient for f 2 in problem P n (Exercise 3.18 prompts for more detail), and so
.P n ; Qn / D 0. Combining the two steps using the triangle inequality for metric , we
obtain .Pn ; Qn / ! 0.
Remarks. 1. Let us describe how to pass from a procedure in one problem to a corresponding procedure in the other. Given a rule n in regression problem Qn , we define a rule
n0 .Yn / in the white noise problem Pn simply by forming Sn .Yn / as in (3.87) and setting
n0 .Yn / D n .Sn /. In the other direction we use the construction in the proof of Proposition
3.13. Given a rule n in white noise problem Pn , we may equally well use it in problem P n
which has the same sample space as Pn . So we may define n0 in the regression problem by
n0 .Ajsn / D E n .AjYNn / j Sn .YNn / D sn :
The conditional expectation is well defined as an estimator (free of f ) by sufficiency, though
of course it may in general be hardR to evaluate. The evaluation is easy however in the case
1
of a linear estimator n .Yn /.u/ D 0 c.u; t /d Yn .t /: one can check that
Z l=n
n
X
cnl .u/Sn;l ;
cnl .u/ D
c.u; t /dt:
n0 .Sn /.u/ D
lD1
.l 1/=n
2. Theorem 3.14 extends to a regression model with unequally spaced and heteroscedastic
observations: instead of (3.82), suppose that Qn becomes
Yl D f .tnl / C .tnl /Zl ;
l D 1; : : : ; n:
If tnl D H 1 .l=.n C 1// for a strictly increasing and absolutely continuous distribution
function H and if .t/ is well-behaved, then after suitably modifying the definition (3.87),
Brown and Low (1996a) show that Qn is still asymptotically equivalent to Pn .
p
3. An example shows that equivalence fails when D 1=2. Define n .t / D t on
0; 1=.2n/ and then reflect it about 1=.2n/ to extend to 1=.2n/; 1=n. Then extend n by
translation to each interval .l 1/=n; l=n so as to obtain a sawtooth-like
function on 0; 1
p
p R1
which is Holder continuous with D 1=2, and for which n 0 n D 2=3: Now conR1
sider estimation of the linear functional Lf D 0 f .t /dt . In problem Pn , the normalp
ized difference n.Yn .1/ Lf / N.0; 1/ exactly for all f and n. However, in model
Qn , the observation vector Y D .Yl / has the same distribution whether f D f0 0 or
f D f1n D n , since n .l=n/ D 0. Thus there can be no estimator n .Y / in Qn for which
96
1=2
n.n .Y/ Lf / !
p N.0; 1/ in distribution uniformly over f 2 H .1/; since
p
while nLf1n D 2=3.
nLf0 D 0
1
f .t /dt C p d W .t /;
2 n
0 t 1:
(3.88)
p
Nussbaum (1996). The appearance of the root density f is related to the square root
variance stabilizing transformation
p for Poisson data, which is designed to lead to the constant
variance term. Note also that f is square integrable with L2 norm equal to 1
Here is a heuristic argument, in the spirit of (1.26), that leads to (3.88). Divide the unit
interval into mn D o.n/ equal intervals of width hn D 1=mn : Assume also that mn ! 1
so that hn ! 0: Write Ik n for the kth such interval, which at stage n extends from tk D
k=mn to tkC1 : First the Poissonization trick: draw a random number Nn of observations
X1 ; : : : ; XNn of i.i.d. from f; with Nn Poisson.n/: Then, because of the Poisson thinning
property,
the number of observations falling in the kth bin Ik n will be Poisson with mean
R
n Ik n f nf .tk /hn : The square root transformation is variance stabilizing for the Poisson
p
p
N
family and
so
Y
n .Ik n / N. f .tk /nhn ; 1=4/ approximately for large n. Thus
k n WD
p
p
Yk n f .tk / nhn C 12 Zk n with Zk n independent and approximately standard Gaussian.
p
Now form a partial sum process as in (1.26), and premultiply by hn =n to obtain
r
Yn .t/ D
mn t
mn t
mn t
p
1 X
1
hn X
1 X
Yk n
f .tk / C p p
Zk n :
n 1
mn 1
2 n mn 1
This makes it plausible that the process Yn .t /, based on the density estimation model, merges
in large samples with the Gaussian white noise process of (3.88).
A non-constructive proof of equivalence was given by Nussbaum (1996) under the assumption that f is Holder continuous for > 1=2, (3.86), and uniformly bounded below,
f .t/ > 0. A constructive argument was given by Brown et al. (2004) under a variety of
smoothness conditions, including the Holder condition with > 1=2. While the heuristic
argument given above can be formalized for > 1, Brown et al. (2004) achieve > 1=2
via a conditional coupling argument that can be traced back to Komlos et al. (1975).
Nonparametric Generalized Linear Models. This is an extension of model (3.82) to errors
drawn from an exponential family. Indeed count data with time varying Poisson intensities
and dichotomous or categorical valued series with time varying cell probabilities occur naturally in practice (e.g. Kolaczyk (1997); Stoffer (1991)). We suppose that the densities in
the family may be written P .dx/ D p .x/.dx/ with p .x/ D e U.x/ . / : Thus is the
canonical parameter,
U.x/ the sufficient statistic, .dx/ the dominating measure on R and
R
./ D log e U.x/ .dx/ the cumulant generating function. (Lehmann and Casella (1998,
97
Ch. 1) or Brown (1986) have more background on exponential families). All the standard examples Poisson, Bernoulli, Gaussian mean, Gaussian variance, exponential are included.
We will describe a form of the equivalence result in the mean value parameterization, given
by ./ D 0 ./ D E U.X/. Let tl D l=n, l D 1; : : : ; n and g be a sufficiently smooth
function, typically with Holder smoothness greater than 1/2. Assume that we have observations .tl ; Xl / in which Xl is drawn from Pl .dx/ with l D .l / D g.tl /; call this model
Pn . [In the usual generalised linear model setting with canonical link function, one models
D .l / D X in terms of p predictors with coefficients 1 ; : : : pP
. If the predictor had
the form of an expansion in (say)
polynomials
in
t,
so
that
.X/
D
l
k k pk .tl /, then we
P
would be replacing l D . k k pk .tl // by the nonparametric g.tl /.]
Recall that 00 ./ D Var U.X /; and let
p V ./ be the variance stabilizing transformation
00 . /. Then Grama and Nussbaum (1998) show
for fP g defined through V 0 .. // D 1=
that this experiment is asymptotically equivalent to
.Qn /
1=2
0 t 1;
p
in the sense that .Pn ; Qn / ! 0. The Poisson case, with V ./ D 2 , is closely related to
the density estimation setting discussed earlier. For a second example, if Xl are independent
N.0; g.tl //, then we are in the Gaussian scale family and the corresponding exponential
family form for N.0; 2 / has natural parameter D 1= 2 , mean parameter . / D
1=.2/ and variance stabilising transformation V ./ D 2 1=2 log . So the corresponding
white noise problem has d Y.t/ D 2 1=2 log g.t / C n 1=2 d W .t /, for t 2 0; 1.
d W .t /;
Spectral density estimation. Suppose that X n D .X1 ; : : : ; Xn / is a sample from a stationary Gaussian random process with mean zero and spectral density function fP
./ on ; ;
1
1
i k
related to the covariance function
.k/ D EXj Xj Ck via f ./ D .2/
.k/:
1e
Estimation of the spectral density f was the first nonparametric function estimation model
to be studied asymptotically see for example Grenander and Rosenblatt (1957).
Observe that X n N.0; n .f // where the covariance matrix is Toeplitz: n .f /j k D
.k j /. A classical approximation in time series analysis replaces the Toeplitz covariance matrix by a circulant matrix Q n .f / in which the rows are successive shifts by one of
a single periodic function on f0; 1; : : : ; n 1g. 12 The eigenvalues of a circulant matrix are
given by the discrete Fourier transform of the top row, and so the eigenvalues of Q n .f / are
approximately f .j / where j are equispaced points on ; . After an orthogonal transformation to diagonalize Q n .f /, one can say heuristically that the model X n N.0; n .f //
is approximately equivalent to
Zj N.0; f .j //;
j D 1; : : : ; n:
This is the Gaussian scale model discussed earlier, and so one expects that both statistical
problems will be asymptotically equivalent with
dZ./ D log f ./ C 2 1=2 n
1=2
d W ./;
2 ;
(3.89)
for f in a suitable function class, such as the Holder function class H .C / on ; with
12
98
> 1=2 and restricted also to bounded functions f ./ 1=. Full proofs are given in
Golubev et al. (2010).
This by no means exhausts the list of examples where asymptotic equivalence has been
established; one might add random design nonparametric regression and estimation in diffusion processes. For further references see the bibliography of Carter (2011).
Some cautions are in order when interpreting these results. First, there are significant regularity conditions, for example concerning the smoothness of the unknown
R f: Thus, Efromovich and Samarov (1996) have a counterexample for estimation of f 2 at very low
smoothness. Meaningful error measures for spectral densities may not translate into, say,
squared error loss in the Gaussian sequence model. See Cai and Zhou (2009a) for some
progress with unbounded loss functions. Nevertheless, the asymptotic equivalence results
lend further strength to the idea that the Gaussian sequence model is the fundamental setting for nonparametric function estimation, and that theoretical insights won there will have
informative analogs in the more concrete practical problems of curve estimation.
3.12 Notes
1. Defining Gaussian measures on infinite dimensional spaces is not completely straightforward and we
refer to books by Kuo (1975) and Bogachev (1998) for complete accounts. For the sequence model (3.1)
with I D N, the subtleties can usually be safely ignored. For the record, as sample space for model (3.1)
we take R1 , the space of sequences in the product topology of pointwise convergence, under which it
is complete, separable and metrizable. [Appendix C.16 recalls the topological definitions.] It is endowed
with the Borel field, and as dominating measure, we take P0 D P0; , the centered Gaussian Radon
measure (see Bogachev (1998, Example 2.3.5)) defined as the product of a countable number of copies
of the N.0; 2 / measure on R. Bogachev (1998, Theorem 3.4.4) shows that in a certain, admittedly weak,
sense all infinite dimensional Gaussian measures are isomorphic to the sequence measure P0 :
One can formally extend the infinitesimal representation (1.22) to a compact set D Rn if t ! Wt is
d parameter Brownian sheet (Hida, 1980). If 'i is an orthonormal basis for L2 .D/; then the operations
(1.24) again yield data in the for of model (3.1).
For more discussion (and citations) for kernel nonparametric regression estimators such as NadarayaWatson and Priestley-Chao, and discussion of the effect of boundaries in the nonperiodic case, we refer to
books on nonparametric smoothing such as Wand and Jones (1995); Simonoff (1996).
There is some discussion of orthogonal series methods in Hart (1997), though the emphasis is on lackof-fit tests. Eubank (1999) has a focus on spline smoothing.
Rice and Rosenblatt (1981) show that in the non-periodic case, the rate of convergence of the MSE is
determined by the boundary behavior of f .
Cogburn and Davis (1974) derived the equivalent kernel corresponding to periodic smoothing splines
for equally spaced data, see also Cox (1983).
We have not discussed the important question of data-determined choices of the regularization parameter
in spline-like smoothers, in part because a different approach based on the James-Stein estimator is
studied in Chapter 6. Some popular methods include Cp , (generalized) cross validation, and (generalized)
maximum likelihood. Some entries into the literature looking at theoretical properties include Speckman
(1985); Wahba (1985); Efron (2001).
There is a large literature on the matching of posterior and frequentist probabilities in parametric models
- the Bernstein-von Mises phenomenon. The situation is more complicated for non-parametric models.
Some simple examples are possible with Gaussian sequence models and Gaussian priorsJohnstone (2010)
develops three examples to illustrate some possibilities.
Exercises
99
We have given only a brief introduction to linear inverse problems in statistics, with a focus on the singular value decomposition. A broader perspective is given in the lecture notes by Cavalier (2011), including
other ways of imposing smoothness such as through source conditions. Other books.
L2 boundedness of the fractional integration operator A is a consequence of classical results of Hardy
and Littlewood (1928), see also Gorenflo and Vessella (1991, pp. 6467). Indeed, for 1=2, the operator
A is bounded from L2 0; 1 to Ls 0; 1 for a value s D s./ > 2, while for > 1=2, it is bounded from
L2 0; 1 to C 1=2 .0; 1/.
There is a large literature on the Wicksell problem representative examples include Hall and Smith
(1988), which introduces the transformation to squared radii, and Groeneboom and Jongbloed (1995), who
study an isotonic estimation. See also Feller (1971, Ch. 3.11) and the lectures on inverse problems by
Groeneboom (1996).
For upper and lower bounds on rates of convergence in statistical inverse problems see Koo (1993),
and...
For more on the singular value decomposition of the Radon transform given in (v), see Johnstone and
Silverman (1990).
Exercises
3.1
(Compactness criteria.) Here `2 denotes square summable sequences with the norm k k2 D
P 2
i :
P
2 2
2
(a) The ellipsoid D f W
k1 ak k C g is `2 compact if and only if ak > 0 and
ak ! 1:
P
Q
(b) The hyperrectangle D k1 k ; k is `2 compact if and only if k1 k2 < 1:
3.2
3.3
(Affinity and L1 distance for Gaussians.) (a) Let denote Hellinger affinity (3.58) and show
.N.1 ; 12 /; N.2 ; 22 / D
2 1=2
n .
2 /2 o
1 2
1
exp
:
12 C 22
4.12 C 22 /
(Equivalence for marginals?) In the Gaussian sequence model yk D k C zk ; consider priors
k N.0; k2 /; independently with k2 D bk 2m : Under what conditions on m is the marginal
distribution P .dy/ equivalent to P0 .dy/; the distribution conditional on D 0
3.5
3.6
(Discrete orthogonality relations). Let ek denote the vector in Cn obtained by sampling the
k-th complex exponential at tj D j=n. Thus ek D fexp.2 i kj=n/; j D 0; 1; : : : ; n 1g: For
P
f; g 2 Cn , use the usual inner product hf; gin D n 1 n1 fk gN k . Show that for k; l 2 Z;
(
1 if k l 2 nZ
hek ; el in D
0 otherwise:
Turn now to the real case. For k 0; let ck D fcos.2kj=n/; j D 0; 1; : : : ; n 1g and define sk
analogously using the k th sine frequency. If n D 2mC1 is odd, then take fc0 ; s1 ; c1 ; : : : sm ; cm g
as the basis Bn for Rn : If n D 2m C 2 is even, then adjoin cn=2 to the previous set to form Bn :
100
1
kl ;
2
hck ; sl in D 0;
(Infinite order kernels.) Let hc ./ D 1=.jj c/2 and show that the function e h0 ./ I f 0g is
C 1 . Define
8
if jj c
<1
b./ D expf bh1 ./ exp. bhc .//g if c jj 1
K
:
0
if jj 1
R
b./d is a C 1 kernel of infinite order (i.e. satisfies
and show that K.s/ D .2/ 1 e i s K
(3.25) with q D 1) that decays faster than jsj m for any m > 0. (McMurry and Politis, 2004)
3.8
(Aliasing example.) Consider equispaced model (3.34) with n D 5, and as in Section 3.4, let
S5 be the linear span of trigonometric polynomials of degree nd D 2. Let f minimize Q.f /
given below (3.35), and let f D f C .'11 '1 /. Show, under an appropriate condition, that
Q.f / < Q.f / for small. Hence the minimum of Q.f / does not lie in S5 .
3.9
(Evaluation of equivalent kernel.) If 2 C belongs to the upper half plane, show by contour
integration that
(
Z 1 i
x
e i
if
> 0
e
1
dx D
2 i
0
if
< 0:
1 x
Use the partial fraction expansion
r
Y
kD1
.x
k /
r
X
ck .x
k /
1=ck D
Y
j k
kD1
O
to compute the equivalent kernel L.t / given that L./
D .1 C 4 /
1:
Z
D 1 C 2 t C
.t
u/d W .u/;
.k
j /;
101
Exercises
four times, show that ' satisfies
'.s/ D %2 ' .4/ .s/;
with boundary conditions
' 00 .0/ D
2 0
'.0/
With D 0; show that the boundary conditions imply the equation cos % 1=2 cosh % 1=2 D 1
for the eigenvalues. In the D 1 limit, show that the corresponding equation is cos % 1=2 cosh %
1: In either case, show that the eigenvalues satisfy, for large n
%n
1
.n C 12 /2 2
1=2
1
:
n2 2
(c) Let the right side of the previous display be ra .hI /. Show that for r D 2=.2 C 1/ and
q that
inf ra .hI / D c;K C 2.1
h
r/ 2r
;
3.13 (Local polynomial regression and its equivalent kernel.) Consider the finite equispaced regression model (3.34) for periodic f , with data extended periodically as in (3.20). Let K be a
kernel of compact support and let fOp;h .t / be the local polynomial estimator of degree p defined
at (3.56).
(a) Show that O can be written in the weighted least squares form
O D .XT WX/ 1 XT WY:
R
(b) Let the moments of the kernel k D v k K.v/dv. Define the moment matrix S D .Sj k /j;kD0;:::;p
by Sj k D j Ck , and write S 1 D .S j k /. Show that the local polynomial estimator, for large
n, has the approximate form
X
fO.t / D n 1
Kh .tl t /Yl ;
l2Z
102
p
X
S 0k t k K.t /:
kD0
K
satisfies
Z
v r K .v/dv D 0r ;
0 r p:
f2 .t / D .e 4t
t /.1
t /2 ;
z D 1; : : : ; n;
with D 1 and zi N.0; 1/ chosen i.i.d. Let fOS S; and fOPER; denote the solutions to
Z 1
X
min Q.f / D n 1
Yi f .i=n/2 C
f 002
0
among cubic splines and trignometric polynomials respectively. Note that fOS S; can be computed in S-PLUS using smooth.spline(). For fOPER; , youll need to use the discrete
Fourier transform fft(), with attention to the real and imaginary parts. For ; use the value
suggested by the ellipsoid considerations in class:
Z
p
D .=2/4 .6 2/4=5 .n f 002 / 4=5 :
Run experiments with R D 100 replications at n D 50; 200 and 1000 to compare the estimates
fOS S; and fOPER; obtained for f1 and f2 : Make visual comparisons on selected replications
chosen in advance, as well as computing averages over replications such as
ave kfOS S fOPER k22
:
ave kfOS S f k22
3.15 (Perfect classification.) Consider the two class classification problem in which y is observed
in the heteroscedastic sequence model (3.57) and it is desired to decide whether D 0 or
D 1 obtains. Then consider a loss function L.a; / D I fa g with a 2 f 0 ; 1 g and a
prior distribution putting mass 0 on 0 and 1 D 1 0 on 1 .
(a) Let D 0 1 and show that the optimal classifier (i.e. the Bayes rule in the above set
up) is the Fisher linear discriminant, using T .y/ D hy 1 ; i% kk2% =2.
(b) Show that perfect classificationi.e. incurring zero error probabilitiesoccurs if and only if
P
D 2 D 2i =.%i /2 D 1.
(Modified from (Delaigle and Hall, 2011).)
3.16 (Maximum risk of the Pinsker estimator.) Consider a slightly different family of shrinkage rules,
to appear in Pinskers theorem, and also indexed by a positive parameter:
O;k .y/ D .1
k m =/C yk ;
k 2 N:
103
Exercises
Show that the maximum risk over a Sobolev ellipsoid 2 .C / is approximated by
r.
N O I / vN m 2 1=m C C 2
2 min.=m;1/
where
vN m D 2m2 =.m C 1/.2m C 1/:
If D m; show that the maximum MSE associated with the minimax choice of is given by
.2mC 2 =vN m 2 /r=2
r.
N O I / e H.r/ C 2
2r
(3.90)
.vN m 2 /r :
3.17 (SVD for fractional integration.) Let A be the operator of fractional order integration (3.70).
This exercise outlines the derivation of the singular value decomposition for a class of domain
spaces, based on identites for Gauss hypergeometric function and Jacobi polynomials that are
recalled in Appendix C.30. Let n .a; / D .a C n C 1/= .a C C n C 1/ n as n ! 1.
(a) Interpret identities (C.32) and (C.33) in terms of the operator A and Jacobi polynomials:
A w a Pna;b .1
.1
2x/:
(b) Let ga;bIn denote the normalizing constants for Jacobi polynomials in (C.34); show that
1
'a;bIn .x/ WD ga;bIn
x a Pna;b .1 2x/
are orthonormal in H 2a;b WD L2 0; 1; x a .1 x/b dx .
(c) Verify that the singular value decomposition of A W H 2a;b ! H 2a
'n D 'a;bIn ;
D 'aC;b
In ;
;b
; / n
is given by
n ! 1:
Pn
.x/ D
x D cos
3.19 (Rates of convergence in a severely ill-posed problem.) Assume model (3.57) with %k D e
k .
[In the case of the heat equation (3.72)
D 2 T .] Let O be the truncation estimator (3.14)
and choose to approximately minimize r.
N O / D sup22 .C / r.O ; /. Show that, as ! 0,
RN .2 .C /; / r.
N O / c; C 2 log.C =/
.1 C o.1//:
0t 1
a s b;
104
4
Gaussian decision theory
In addition to those functions studied there are an infinity of others, and unless some
principle of selection is introduced we have nothing to look forward to but an infinity of
test criteria and an infinity of papers in which they are described. (G. E. P. Box, discussion
in J. R. S. S. B., 1956)
In earlier chapters we have formulated the Gaussian sequence model and indicated our
interest in comparisons of estimators through their maximum risks, typically mean squared
error, over appropriate parameter spaces. It is now time to look more systematically at questions of optimality.
Many powerful tools and theorems relevant to our purpose have been developed in classical statistical decision theory, often in far more general settings than used here. This chapter
introduces some of these ideas, tailored for our needs. We focus on comparison of properties
of estimators rather than the explicit taking of decisions, so that the name decision theory
is here of mostly historical significance.
Our principle of selectioncomparison, reallyis minimaxity: look for estimators whose
worst case risk is (close to) as small as possible for the given parameter space, often taken
to encode some relevant prior information. This principle is open to the frequent and sometimes legitimate criticism that the worst case may be an irrelevant case. However, we aim to
show that by appropriate choice of parameter space, and especially of families of parameter spaces, that sensible estimators emerge both blessed and enlightened from examination
under the magnifying glass of the minimax prinicple.
A minimax estimator is exactly or approximately a Bayes estimator for a suitable least
favorable prior. 1 It is then perhaps not surprising that the properties of Bayes rules and
risks play a central role in the study of minimaxity. Section 4.1 begins therefore with Bayes
estimators, now from a more frequentist viewpoint than in Chapter 2. Section 4.2 goes more
deeply than Chapter 2 into some of the elegant properties and representations that appear for
squared error loss in the Gaussian model.
The heart of the chapter lies in the development of tools for evaluating, or approximating
RN ./, the minimax risk when the parameter is assumed to belong to . Elementary lower
bounds to minimax risk can often be derived from Bayes rules for priors supported on the
parameter space, as discussed in Section 4.3. For upper bounds and actual evaluation of the
minimax risk, the minimax theorem is crucial. This is stated in Section 4.4, but an overview
of its proof, even in this Gaussian setting, must be deferred to Appendix A.
1
105
106
Statistical independence and product structure of parameter spaces plays a vital role in
lifting minimax results from simpler component spaces to their products, as shown in
Section 4.5.
A theme of this book is that conclusions about function estimation can sometimes be
built up from very simple, even one dimensional, parametric constituents. As an extended
example of the techniques introduced, we will see this idea at work in Sections 4.6 - 4.8.
We start with minimaxity on a bounded interval in a single dimension and progress through
hyperrectanglesproducts of intervalsto ellipsoids and more complex quadratically convex sets in `2 .N/:
Byproducts include conclusions on optimal (minimax) rates of convergence on Holder, or
uniform, smoothness classes, and the near mean square optimality of linear estimators over
all quadratically convex sets.
With notation and terminology established and some examples already discussed, Section
4.10 gives an overview of the various methods used for obtaining lower bounds to minimax
risks throughout the book.
A final Section 4.11 outlines a method for the exact asymptotic evaluation of minimax
risks using classes of priors with appropriately simple structure. While this material is used
on several later occasions, it can be omitted on first reading.
(4.1)
(4.2)
An estimator O that minimizes B.O ; / for a fixed prior is called a Bayes estimator for
, and the corresponding minimum value is called the Bayes risk B./I thus
B./ D inf B.O ; /:
O
(4.3)
Of course B./ D B.; / also depends on the noise level , but again this will not always
be shown explicitly.
107
Remark 4.1 One reason for using integrated risks is that, unlike the ordinary risk function
O /; the mapping ! B.O ; / is linear. This is useful for the minimax theorem,
! r.;
Appendix A. Representation (4.3) also shows that the Bayes risk B./ is a concave function
of , which helps in studying least favorable distributions (e.g. Proposition 4.14).
The decidedly frequentist definition of Bayes estimators fortunately agrees with the Bayesian
definition given at (2.9), under mild regularity conditions (see the Chapter Notes). We saw
that the joint distribution P of the pair .; y/ may be decomposed two ways:
P.d; dy/ D .d /P .dyj / D P .dy/.djy/;
where P .dy/ is the marginal distribution of y and .djy/ is the posterior distribution of
given y. The integrated risk of (4.2), which uses the first decomposition, may be written
using the second, posterior decomposition as
B.O ; / D EP Ey L.O .y/; /:
Here, EP denotes expectation with respect to the marginal distribution P .dy/ and Ey denotes expectation with respect to the posterior .djy/. Thus one sees that O .y/ is indeed
obtained by minimizing the posterior expected loss (2.9), O .y/ D argmina Ey L.a; /.
As seen in Chapter 2.3, this formula often leads to explicit expressions for the Bayes rules.
In particular, if L.a; / D ka k22 ; the Bayes estimator is simply given by the mean of the
posterior distribution, O .y/ D E .jy/.
If the loss function L.a; / is strictly convex in a, then the Bayes estimator O is unique
(a.e. P for each ) if both B./ < 1; and P .A/ D 0 implies P .A/ D 0 for each .
(Lehmann and Casella, 1998, Corollary 4.1.4)
Remark 4.2 Smoothness of risk functions. We digress briefly to record for later use some
O For y
information about the smoothness of the risk functions of general estimators .
2
O
Nn .; /, with diagonal and quadratic loss, the risk function ! r. ; / is analytic,
i.e. has a convergent power series expansion, on the interior of the set on which it is finite. This follows,
for example, from Lehmann and Romano (2005, Theorem 2.7.1), since
R
O
O
r.; / D k.y/ k2 . 1=2 .y //dy can be expressed in terms of Laplace transforms.
Example. Univariate Gaussian. We revisit some earlier calculations to illustrate the two
perspectives on Bayes risk. If yj N.; 2 / and the prior .d / sets N.0; 2 /
then the posterior distribution .djy/ was found in Section 2.3 to be Gaussian with mean
O .y/ D 2 y=. 2 C 2 /, which is linear in y, and constant posterior variance 2 =. 2 C 2 /.
Turning now to the frequentist perspective,
B. / D inf B.O ; /:
O
Suppose that, for the moment arbitrarily, we restrict the minimization to linear estimators
Oc .y/ D cy: Formula (2.47) showed that the risk of Oc
r.Oc ; / D c 2 2 C .1
c/2 2
c/2 2 :
108
Minimizing this over c yields the linear minimax choice cLIN D 2 =. 2 C 2 / and value
B. / D 2 2 =. 2 C 2 /, which agrees with the result of the posterior calculation.
Remark 4.3 If yj N.; 2 /, then the univariate MLE O1 .y/ D y is admissible for
squared error loss. As promised in Theorem 2.3, we indicate a proof: the result is also used at
Corollary 4.10 below. It suffices to take D 1. The argument is by contradiction: supposing
O1 to be inadmissible, we can find a dominating estimator Q and a parameter value 0 so that
Q / 1 for all , with r.;
Q 0 / < 1. The risk function r.Q ; / is continuous by Remark
r.;
4.2, so there would exist > 0 and an interval I of length L > 0 containing 0 for which
Q / 1 when 2 I . Now bring in the conjugate priors . From the example above,
r.;
1 B. / 2 as ! 1. However, the definition (4.2) of integrated risk implies that
1 B.Q ; / .I / c0 1
p
as ! 1, with c0 D L= 2. Consequently, for large, we must have B.Q ; / < B. /,
contradicting the very definition of the Bayes risk B. /. Hence O1 must be admissible.
/2 : For
(4.4)
Proof Write E for expectation with respect to the joint distribution of .; y/ when .
The left side above can be rewritten
E.O
/2
E.O
/2 D E.O
2 /:
As we have seen, with squared error loss the Bayes estimator is given by the posterior mean
O .y/ D E.jy/ and so by conditioning on y,
E.O
O / D E.O
O /O :
Substitute this into the previous display and (4.4) falls out.
We apply Browns identity and some facts about Fisher information, reviewed here and in
Appendix C.20 to obtain some useful bounds on Bayes risks. If P is a probability measure
on R with absolutely continuous density p.y/dy, the Fisher information is defined by
Z 0 2
p .y/
I.P / D
dy:
p.y/
This agrees with the definition of Fisher information for parametric families when p.yI / D
109
(4.5)
with equality if and only if P is Gaussian. [For a location family, this is the Cramer-Rao
bound.] The proof is just the Cauchy-Schwarz inequality. Indeed, we may suppose that
I.P / < 1, which entails that the density p of P exists and is absolutely continuous, and
permits integration by parts in the following chain:
Z
Z
Z
Z
1 D p.y/dy D
.y /p 0 .y/dy .y /2 p.y/dy p 0 .y/2 =p.y/dy;
with equality if and only if .p 0 =p/.y/ D .log p/0 .y/ D c.y /; so that p is Gaussian.
Now we return to Browns identity. We also need the Tweedie/Brown formula (2.23) for
a Bayes estimator, which for noise level and dimension n D 1 takes the form
O .y/ D y C 2 p 0 .y/=p.y/:
(4.6)
Recalling the unbiased estimator O0 .y/ D y, we might write this as O0 O D 2 .p 0 =p/.y/.
Of course, B.O0 ; / D E E .y /2 D 2 ; regardless of the prior . If now in (4.4), we
insert O0 for O , we have
Z 0 2
p .y/
p.y/dy:
2 B./ D 4
p.y/2
Since p is the absolutely continuous density of the marginal distribution ? , we arrive
at a formula that is also sometimes called Browns identity:
Proposition 4.5 (Brown) For y N.; 2 / and squared error loss,
B.; / D 2 1
2 I. ? /:
(4.7)
2 Var
;
2 C Var
(4.8)
(4.9)
We give two further applications of Browns identity: to studying continuity and (directional) derivatives of Bayes risks.
110
Continuity of Bayes risks. This turns out to be a helpful property in studying Bayes minimax risks, e.g. in Section 8.7.
Lemma 4.7 If n converges weakly to , then B.n / ! B./.
Note that definition (4.3) itself implies only upper semicontinuity for B./.
R
Proof It suffices to consider unit noise D 1. Let pn .y/ D .y / d n and define
p.y/ correspondingly from . From (4.7), it is enough to show that
Z 02
Z 02
p
pn
!
D I. ? /:
(4.10)
I.n ? / D
pn
p
R
R
Weak convergence says that g d n ! g d for every g bounded and continuous, and
so pn , pn0 and hence pn0 2 =pn converge respectively to p, p 0 and p 0 2 =p pointwise in R. We
construct functions Gn and G such that
0
pn0 2
Gn ;
pn
0
p0 2
G;
p
R
R
and Gn ! G, and use the extended version of the dominated convergence theorem,
Theorem C.7, to conclude (4.10). Indeed, from representation (4.6)
pn0
.y/ D On .y/
pn
y D En
yjy;
/ n .d /:
A corresponding bound holds with n and pn replaced by and p and yields a bounding
function G.y/. To complete the verification, note also that
Z
Z
Gn .y/ dy D
.y /2 .y / dy n .d / D 1 D G.y/dy:
Remark. The smoothing effect of the Gaussian density is the key to the convergence (4.10).
Indeed, in general Fisher information is only lower semicontinuous: I./ lim inf I.n /,
see also Appendix C.20. For a simple example in which continuity fails, take discrete measures n converging weakly to , so that I.n / is infinite for all n.
Derivatives of Bayes Risk. Browns identity also leads to an interesting formula for the
directional or Gateaux derivative for the Bayes risk. We use it later, Proposition 4.13, to
exhibit saddle points.
Lemma 4.8 Given priors 0 and 1 , let t D .1
d
B.t /jt D0 D B.O0 ; 1 / B.0 /:
(4.11)
dt
Formula (4.11), which involves a change of prior, should be compared with (4.4), which
involves a change of estimator, from O to O .
111
Proof Write Pt D ? t for the marginal distributions. Since I.Pt / < 1, the density
pt .y/ of Pt exists, along with its derivative pt0 D .d=dy/pt a.e. Introduce
O0 .y/;
0 .y/
where the final equality uses the Bayes estimator representation (4.6).
R 02From Browns identity
(4.7), .d=dt/B.t / D .d=dt/I.Pt /. Differentiating I.Pt / D pt =pt under the integral
sign, we obtain (see Appendix C.20 for details)
Z
d
I.Pt /jtD0 D 2 0 p10 C 02 p1 dy C I.P0 /:
(4.12)
dt
R
.y /.y /1 .d /,
Since p1 D ? 1 is the marginal density of 1 and p10 .y/ D
we can write the previous integral as
2.y O0 /.y / C .y O0 /2 .y /1 .d /dy
D
O0 /2 D
1 C E1 E .
1 C B.O0 ; 1 /:
Recalling that B.0 / D 1 I.P0 / by Browns identity (4.7), we arrive at formula (4.11).
We begin with an elementary, but very useful, lower bound for RN ./ that may be derived
using Bayes risks of priors supported in . Indeed, if supp , then
Z
O / D
B.;
r.O ; /.d / sup r.O ; /:
2
O for sup
O
O
We sometimes write r.
N /
2 r. ; /. Minimizing over ; we have
B./ inf sup r.O ; / D RN ./:
O
(4.13)
(4.14)
2P
A prior that attains the supremum will be called least favorable. A sequence of priors for
which B.n / approaches the supremum is called a least favorable sequence. Letting supp P
denote the union of all supp for in P , we obtain the lower bound
supp P
H)
RN ./ B.P /:
(4.15)
112
Proposition 4.9 An estimator O0 is minimax if there exists a sequence of priors n with
B.n / ! rN D sup .O0 ; /.
Proof
Indeed, from (4.13) we have rN RN ./, which says that O0 is minimax.
Corollary 4.10 If yj N.; 2 /, then O1 .y/ D y is minimax for squared error loss. In
addition, O1 is the unique minimax estimator.
Proof Indeed, using the conjugate priors , we have r.
N O1 / D 2 D lim!1 B. /. To
0
O
establish uniqueness, suppose that 1 is another minimax estimator with P .O1 O10 / > 0
for some and hence every . Then strict convexity of the loss function implies that the new
estimator Q D .O1 C O10 /=2 satisfies, for all , r.Q ; / < .r.O1 ; / C r.O10 ; //=2 2 which
contradicts the admissibility of O1 , Remark 4.3.
Example 4.11 Bounded normal mean. Suppose that y N.; 1/ and that it is known a
priori that jj , so that D ; . This apparently very special problem will be an
important building block later in this chapter. We use the notation N .; 1/ for the minimax
risk RN ./ in this case, in order to highlight the interval endpoint and the noise level,
here equal to 1.
Let V denote the prior on ; with density .3=.2 3 //. jj/2C ; from the discussion
above N .; 1/ B.V /.
N .; 1/ D inf sup E.O
O 2 ;
/2 B.V /:
We use the van Trees inequality (4.9), along with I.V / D I.V1 /= 2 to conclude that
N .; 1/
1
2
D 2
:
1 C I.V /
C I.V1 /
(4.16)
From this one learns that N .; 1/ % 1 as ! 1, indeed at rate O.1= 2 /. An easy
calculation shows that I.V1 / D 12. For the exact asymptotic behavior of 1 N .; 1/, see
the remark following (4.43) in Section 4.6.
113
2P O
(4.17)
2
and so minimizing over all estimators O and using the minimax theorem (4.17) gives an
upper bound on minimax risk that we will use frequently:
RN ./ B.P /:
(4.18)
The bound is useful because the Bayes-minimax risk B.P / is often easier to evaluate than
the minimax risk RN ./. We can often (see Section 4.11) show that the two are comparable
in the low noise limit as ! 0:
RN .; / B.P ; /:
3. In some cases, we may combine the lower and upper bounds (4.15) and (4.18). For
example, if P D P ./ D f W supp g; then
RN ./ D B.P .//:
(4.19)
In this case, if O is minimax for (4.17), then it is minimax for ordinary risk:
sup r.O ; / RN ./:
2
Example 4.11 continued. In the bounded normal mean problem of the last section, we
have D ; and so
N .; 1/ D supfB./ W supp ; g:
(4.20)
114
4. The role of lower semicontinuity is described more fully in Appendix A, but one can
give an illustrative example even in dimension one with quadratic loss and D 1. One
2
can check that the (otherwise absurd) estimator O .y/ D e y =4 =.1 C y/I fy > 0g has a
risk function which is discontinuous at 0, but still lower semicontinuous. The assumption of
lower semicontinuity allows all estimators to be included in statements such as (4.17).
5. It is easy to check that the loss functions ka kpp are lower semicontinuous in a : if
ai.n/ ! ai.1/ for all i, then ka.1/ kpp lim infn ka.n/ kpp : See also Exercise 4.6.
inf I.P /;
P 2P ?
(4.21)
for all 2 P ;
(4.22)
115
if and only if .d=dt/B.t /jt D0 0 for each 1 2 P . The desired inequality (4.22) is now
immediate from the Gateaux derivative formula (4.11).
(4.24)
Independence is less favorable. Here is a trick that often helps in finding least favorable
priors. Let be an arbitrary prior, so that the j are not necessarily independent. Denote
by jQthe marginal distribution of j . Build a new prior N by making the j independent:
N D j j : This product prior is more difficult, as measured in terms of Bayes risk.
Lemma 4.15 B./
N B./:
Proof Because of the independence structure, the N posterior distribution of j given y in
fact depends only on yj compare (2.16). Hence the Bayes
N
rule is separable: O;j
N .y/ D
Oj .yj /. From the additivity of losses and independence of components given , (4.23),
X
r.ON ; / D
r.O;j
N ; j /:
j
The -average of the rightmost term therefore depends only the marginals j ; so
Z
Z
r.ON ; /.d / D r.ON ; /.d
N
/ D B./:
N
The left side is just B.ON ; /, which is at least as large as B./ by definition.
To see more intuitively why the product marginal prior N is harder than , consider
squared error loss: conditioning on all of y has to be betterlower variancethan conditioning on just yj :
E E .j jy/
j 2 :
116
Indeed, the inequality above follows from the identity VarX D EVar.X jY /CVarE.X jY /
using for X the conditional distribution of j jyj and for Y the set fyk W k j g.
Product Spaces. Suppose that `2 .I / is a product space D j 2J j : Again the
index j may refer to individual coordinates of `2 .I / or to a cluster of coordinates. If the loss
function is additive and convex, then the minimax risk for can be built from the minimax
risk for each of the subproblems j :
P
Proposition 4.16 Suppose that D j 2J j and L.a; / D Lj .aj ; j /. Suppose that
aj ! Lj .aj ; j / is convex and lower semicontinuous for each j : Then
X
RN .j j ; / D
RN .j ; /:
(4.25)
j
If j .yj / is separately minimax for each j , then .y/ D j .yj / is minimax for :
Remarks: 1. There is something to prove here: among estimators O competing in the left
side of (4.25), each coordinate Oj .y/ may depend on all components yj ; j 2 J . The result
says that a minimax estimator need not have such dependencies: j .y/ depends only on yj .
2. The statement of this result does not involve prior distributions, and yet the simplest
proof seems to need priors and the minimax theorem. A direct proof without priors is possible, but is more intricateExercise 4.8.
Proof
where P ./ denotes the collection of all probability measures supported in : Given any
such prior , construct a new prior N as the product of the marginal distributions j of j
under : Lemma 4.15 shows that N is more difficult than W B./
N B./: Because of
the product structure of ; each j is supported in j and N still lives on : Thus the
maximization can be restricted to priors with independent coordinates. Bayes risk is then
additive, by (4.24), so the optimization can be term-by-term:
X
X
RN ./ D
supfB.j / W j 2 P .j /g D
RN .j /:
j
The verification that separately minimax j .yj / combine to yield a minimax .y/ can now
be left to the reader.
117
/2
(4.26)
than N .1; / D 2 and than the corresponding linear minimax risk L .; / obtained by
restricting O to linear estimators of the form Oc .y/ D cy?
Linear Estimators. Applying the variance-biase decomposition of MSE, (2.46), to a linear estimator Oc .y/ D cy; we obtain E.Oc /2 D c 2 2 C .1 c/2 2 : If the parameter is
known to lie in a bounded interval ; ; then the maximum risk occurs at the endpoints:
sup E.Oc
c/2 2 D r.Oc ; /:
/2 D c 2 2 C .1
(4.27)
2 ;
The minimax linear estimator is thus found by minimizing the quadratic function c !
r.Oc ; /. It follows that
2 2
:
(4.28)
L .; / D inf r.Oc ; / D 2
c
C 2
The minimizer c D 2 =. 2 C 2 / 2 .0; 1/ and the corresponding minimax linear estimator
OLIN .y/ D
2
y:
2 C 2
(4.29)
Thus, if the prior information is that 2 2 ; then a large amount of linear shrinkage is
indicated, while if 2 , then essentially the unbiased estimator is to be used.
Of course, OLIN is also Bayes for the prior .d / D N.0; 2 / and squared error loss.
Indeed, from (2.19) we see that the posterior is Gaussian, with mean (4.29) and variance
equal to the linear minimax risk (4.28). Note that this prior is not concentrated on . /:
only a moment statement is possible: E 2 D 2 :
There is a simple but important scale invariance relation
L .; / D 2 L .=; 1/:
Writing D = for the signal-to-noise ratio, we have
(
2
L .; 1/ D 2 =.1 C 2 /
1
(4.30)
!0
! 1:
(4.31)
These results, however simple, are nevertheless a first quantitative indication of the importance of prior information, here quantified through , on possible quality of estimation.
Projection Estimators. Orthogonal projections form an important and simple subclass
of linear estimators. The particular case of projections orthogonal to the co-ordinate axes
was defined and discussed in Section 3.2. In one dimension the situation is almost trivial,
with only two possibilities. Either O0 .y/ 0 with risk r.O0 ; / D 2 the pure bias case,
or O1 .y/ D y; with risk r.O1 ; / D 2 the case of pure variance. Nevertheless, one can
usefully define and evaluate the minimax risk over D ; for projection estimators
P .; / D inf
sup E.Oc
c2f0;1g 2 ;
/2 D min. 2 ; 2 /:
(4.32)
118
The choice is to keep or kill: if the signal to noise ratio = exceeds 1, use O .y/ D y;
O
otherwise use .y/
D 0: The inequalities
1
2
min. 2 ; 2 /
22
min. 2 ; 2 /
2 C 2
(4.33)
imply immediately that 12 P .; / L .; / P .; /, so that the best projection estimator is always within a factor of 2 of the best linear estimator.
Non-linear estimators. The non-linear minimax risk N .; /, (4.26), cannot be evaluated
analytically in general. However, the following properties are easy enough:
N .; / L .; /;
(4.34)
(4.35)
N .; / is increasing in ;
(4.36)
lim N .; / D :
(4.37)
!1
Indeed, (4.34) is plain since more estimators are allowed in the nonlinear competition,
while (4.35) follows by rescaling, and (4.36) is obvious. Turning to (4.37), we recall that
the classical result (2.51) says that the minimax risk for unconstrained to any interval,
N .1; / D 2 : Thus (4.37) asserts continuity as increases without boundand this follows immedately from the example leading to (4.16): N .; 1/ 2 =. 2 C I.V1 //.
In summary so far, we have the bounds N L P , as illustrated in Figure 4.1, from
which we might guess that the bounds are relatively tight, as we shall shortly see.
1
P =2 ^ 1
L = 2=(2+1)
N(;1)
1
L .; /
1:25:
N .; /
(4.38)
119
Thus, regardless of signal bound and noise level , linear rules are within 25% of optimal
for mean squared error. The bound < 1 is due to Ibragimov and Khasminskii (1984).
The extra worksome numericalneeded to obtain the essentially sharp bound 1.25 is
outlined in Donoho et al. (1990) along with references to other work on the same topic.
Proof We show only a weaker result: that is finite and bounded by 2.22, which says that
linear rules are within 122% of optional. For the extra work to get the much better bound
of 1.25 we refer to Donoho et al. (1990). Our approach uses projection estimators and the
Bayes risk identity (2.30) for the symmetric two point priors D .1=2/. C / to give a
short and instructive demonstration that 1=B.1 /: Numerical evaluation of the integral
(2.30) then the latter bound to be approximately 2.22.
First, it is enough to take D 1, in view of the scaling invariances (4.30) and (4.35). We
may summarize the argument by the inequalities:
2 ^ 1
1
L .; 1/
:
N .; 1/
N .; 1/
B.1 /
(4.39)
Indeed, the first bound reflects a reduction to projection estimators, (4.32). For the second
inequality, consider first 1, and use monotonicity (4.36) and the minimax risk lower
bound (4.15) to obtain
N .; 1/ N .1; 1/ B.1 /:
For 1, again N .; 1/ B. / and then from (2.30) 2 =B. / is increasing in .
An immediate corollary, using also (4.28) and (4.33), is a bound for N :
.2 /
(4.40)
The proof also gives sharper information for small and large : indeed, the linear minimax
risk is then essentially equivalent to the non-linear minimax risk:
./ D L .; 1/=N .; 1/ ! 1
as ! 0; 1:
(4.41)
Indeed, for small ; the middle term of (4.39) is bounded by =B. /, which approaches 1,
as may be seen from (2.30). For large ; the same limit results from (4.37). Thus, as ! 0,
O0 .y/ D 0 is asymptotically optimal, while as ! 1; O .y/ D y is asymptotically best.
These remarks will play a role in the proof of Pinskers theorem in the next chapter.
120
Proof Property (4.42) implies that the set K D f 2 K./ W r. / D r g has -probability
equal to 1. Since K is a closed set, it follows from the definition (C.15) that the support of
is contained in K. Now we recall, e.g. C.8, that if the set of zeros of an analytic function,
here r./ r , has an accumulation point 0 inside its domain D, then the function it is
identically zero on the connected component of D containing 0 .
Now to the minimax rules. Let r.
N O / D maxj j r.O ; /. Given a prior distribution , let
M./ denote the set of points where the Bayes rule for attains its maximum risk:
M./ D 2 ; W r.O ; / D r.
N O / :
Proposition 4.19 For the non-linear minimax risk N .; / given by (4.26), a unique least
favorable distribution exists and .O ; / is a saddlepoint. The distribution is symmetric, supp. / M. / and M. / is a finite set. Conversely, if a prior satisfies
supp./ M./, then O is minimax.
Proof We apply Propositions 4.13 and 4.14 to the symmetric set P of probability measures supported on ; , which is weakly compact. Consequently a unique least favorable
distribution 2 P exists, it is symmetric, and the corresponding Bayes rule O satisfies
Z
O
r. ; / B. / D r.O ; / .d /;
as we see by considering the point masses D for 2 ; .
The risk function ! r.O ; / is finite and hence analytic on R, Remark 4.2 of Section
2.5, and not constant (Exercise 4.1). The preceding lemma shows that supp. / M. /,
which can have no points of accumulation and (being also compact) must be a finite set.
Finally, if supp./ M./, then r.O ; / D r.
N O / and so O must be minimax:
r.
N O / D B.O ; / D inf B.O ; / inf r.
N O /:
O
O
In general, this finite set and the corresponding minimax estimator can only be determined
numerically, see Kempthorne (1987); Donoho et al. (1990); Gourdin et al. (1994). Nevertheless, one can still learn a fair amount about these least favorable distributions. Since the
posterior distribution of must also live on this finite set, and since the root mean squared
error of O must be everywhere less than , one guesses heuristically that the support points
of will be spaced at a distance on the scale of the noise standard deviation . See Exercise
4.12 and Figure 4.2.
For small , then, one expects that there will be only a small number of support points,
and this was shown explicitly by Casella and Strawderman (1981). Their observation will be
important for our later study of the least favorable character of sparse signal representations,
so we outline the argument. Without loss of generality, set D 1:
1. Proposition 4.19 says that the symmetric two point prior D .1=2/. C / is
minimax if f ; g M. /: For this two point prior, the posterior distribution and mean
O were given in Chapter 2, (2.27) (2.29), and we recall that the Bayes risk satisfies (2.30).
2. Since the posterior distribution concentrates on , one guesses from monotonicity
and symmetry considerations that M. / f ; 0; g for all . The formal proof uses a
sign change argument linked to total positivity of the Gaussian distribution see Casella
and Strawderman (1981).
121
4.7 Hyperrectangles
3. A second sign change argument shows that there exists 2 such that for jj < 2 ;
r.O ; 0/ < r.O ; /:
Thus supp./ D f ; g D M. / and so O is minimax for jj < 2 , and numerical work
:
shows that 2 D 1:057:
This completes the story for symmetric two point priors. In fact, Casella and Strawderman
go on to show that for 2 j j < 3 ; an extra atom of the prior distribution appears at 0,
and has the three-point form
D .1
/0 C .=2/. C /:
r (^
;)
O()
Figure 4.2 as the interval ; grows, the support points of the least favorable
prior spread out, and a risk function reminiscent of a standing wave emerges.
As j j increases, prior support points are added successively and we might expect a picture
such as Figure 4.2 to emerge. Numerical calculations may be found in Gourdin et al. (1994).
An interesting phenomenon occurs as gets large: the support points become gradually
more spaced out. Indeed, if the least favorable distributions are rescaled to 1; 1 by
setting .A/ D .A/; then Bickel (1981) derives the weak limit ) 1 , with
1 .ds/ D cos2 .s=2/ds;
for jsj 1, and shows that N .; 1/ D 1
2 = 2 C o.
(4.43)
/ as ! 1.
4.7 Hyperrectangles
In this section, we lift the results for intervals to hyperrectangles, and obtain some direct
consequences for nonparametric estimation over Holder classes of functions.
The set `2 .I / is said to be a hyperrectangle if
Y
D ./ D f W ji j i for all i 2 I g D
i ; i :
i
122
k 1; > 0; C > 0;
(4.44)
jk j C e
ak
k 1; a > 0; C > 0:
(4.45)
We suppose that data y from the heteroscedastic Gaussian model (3.57) is observed, but
for notational ease here, we set i D %i , so that
yi D i C i zi ;
i 2 I:
(4.46)
We seek to compare the linear and non-linear minimax risks RN .. /; / RL .. /; /.
The notation emphasizes the dependence on scale parameter , for later use in asymptotics.
Proposition 4.16 says that the non-linear minimax risk over a hyperrectangle decomposes
into the sum of the one-dimensional component problems:
X
RN .. /; / D
N .i ; i /:
(4.47)
Minimax linear estimators have a similar structure:
Proposition 4.20 (i) If OC .y/ D Cy is minimax linear over a hyperrectangle . /, then
necessarily C must be diagonal. (ii) Consequently,
X
RL .. /; / D
L .i ; i /
(4.48)
i
Before proving this, we draw an immediate and important consequence: by applying Theorem 4.17 term by term, L .i ; i / N .i ; i /; it follows that the Ibragimov-Hasminski
theorem lifts from intervals to hyperrectangles:
Corollary 4.21 In model (4.46),
RL .. /; / RN .. /; /:
(4.49)
Proof of Proposition 4.20. First note that a diagonal linear estimator Oc .y/ D .ci yi / has
mean squared error of additive form:
X
r.Oc ; / D
i2 ci2 C .1 ci /2 i2 :
(4.50)
i
Let r.
N OC / D supfr.OC ; /; 2 . /g and write d.C / D diag.C / for the matrix (which is
infinite if I is infinite) obtained by setting the off-diagonal elements to 0. We show that this
always improves the estimator over a hyperrectangle:
r.
N Od.C / / r.
N OC /:
(4.51)
Recall formula (3.60) for the mean squared error of a linear estimator. The variance term is
easily boundedwith D diag.i2 /, we have, after dropping off-diagonal terms,
X
X
tr C TC D
cij2 i2
ci2i i2 D tr d.C /Td.C /:
ij
For the bias term, k.C I / k2 , we employ a simple but useful random signs technique.
Let 2 ./ and denote by V . / the vertex set of the corresponding hyperrectange . /:
123
4.7 Hyperrectangles
I / k2 E T .C
I /T .C
I /
2V . /
D tr .C
I /.C
I /T k.d.C /
I /k2 ;
where D diag.k2 / and the final inequality simply drops off-diagonal terms in evaluating
the trace.
The risk of a diagonal linear estimator is identical at all the vertices of V . /compare
(4.50)and so for all vertex sets V . / we have shown that
sup r.OC ; / sup r.Od.C / ; /:
2V . /
2V . /
Since 2 ./ is arbitrary, we have established (4.51) and hence part (i).
Turning to part (ii), we may use this reduction to diagonal linear estimators to write
X
RL .. /; / D inf sup
E.ci yi i /2 :
.ci / 2. /
Now, by the diagonal form ci yi and the product structure of . /, the infimum and the
supremum can be performed term by term. Doing the supremum first, and using (4.27),
RL .. /; / D inf r.Oc ; /:
c
(4.52)
124
analyticity conditions are less often used in nonparametric theory than are constraints on a
finite number of derivatives.
From this perspective, the situation is much better for wavelet bases, to be discussed in
Chapter 7 and Appendix B, since Holder smoothness is exactly characterized by hyperrectangle conditions, at least for non-integer .
To describe this, we introduce doubly indexed vectors .j k / and hyperrectangles
1 .C / D f.j k / W jj k j C 2
.C1=2/j
; j 2 N; k D 1; : : : ; 2j g:
(4.53)
r/ 2r
;
r D 2=.2 C 2 C 1/:
(4.54)
The notation shows the explicit dependence on both C and . The expression a./ b./
means that there exist positive constants
1 <
2 depending only on , but not on C or ,
such that for all , we have
1 a./=b./
2 . The constants
i may not be the same at
each appearance of .
While the wavelet interpretation is not needed to state and prove this result (which is why
it can appear in this chapter!) its importance derives from the smoothness characterization.
Indeed, this result exhibits the same rate of convergence as we saw for mean square smoothness, i.e. for an ellipsoid, in the upper bound of (3.15). Note that here we also have a
lower bound.
The noise level j D 2j is allowed to depend on level j , but not on kthe parameter
corresponds to that in the ill-posed linear inverse problems of Section 3.9, and appears
again in the discussion of the wavelet-vaguelette decomposition and the correlated levels
model of Section 12.4.
Proof Using (4.47), we can reduce to calculations based on the single bounded normal
mean problem:
X
RN .; / D
2j N .C 2 .C1=2/j ; 2j /:
j
.C1=2/j
D 2j :
For j < j , the variance term 22j 2 is active in the bound for N , while for j > j it is
the squared bias term C 2 .2C1/j which is the smaller. Hence, with j0 D j ,
X
X
2j 22j 2 C C 2
2 2j :
RN .; /
j j0
j >j0
125
4.7 Hyperrectangles
These geometric sums are dominated by their leading terms, multiplied by constants depending only on . Consequently,
RN .; / 2.2 C1/j 2 C C 2 2
2j
r/ 2r
;
m
X
N .; k / D
m
X
min. 2 ; k2 /;
kD1
m
X
k 2 c 2 m2 C1 ;
2
m
m
X
1
2
ak2 D m
m
X
k 2 c 2 .m C 1/2C2 C1 C 2 :
Let m1 be the largest integer such that c .m C 1/2C2 C1 C 2 = 2 and m0 2 R the solution
to c m02C2 C1 D C 2 = 2 . It is easy to check that if m0 4, say, then m1 =m0 1=2. We
may therefore conclude from the previous two displays that for =C d ,
C1
C1
RN .; / RN ..m1 /; /
c 2 m2
c 2 m2
c 2 .C 2 = 2 /1 r :
1
0
126
Suppose again that yi N.i ; 2 / for i D 1; : : : ; n and consider the product prior
ind
i 12 .i C
i /:
We take a brief break from squared error loss functions to illustrate the discussion of product
priors, additive loss functions and posterior modes of discrete priors (cf. Section 2.3) in the
context of three related discrete loss functions
X
L0 .a; / D
I fai i g;
i
N.a; / D
I fsgn ai sgn i g
and
I fsgn yi sgn i g
127
(4.55)
2
For linear estimation, we show that the minimax risk for may be found among diagonal
linear estimators.
Lemma 4.24 Let Oc .y/ D .ci yi / denote a diagonal linear estimator with c 2 `2 .N; .i2 //,
Suppose that is solid and orthosymmetric. Then
RL ./ D inf sup r.Oc ; /:
c 2
(4.56)
Proof Indeed, we first observe that according to the proof of Proposition 4.20, the maximum risk of any linear estimator OC over any hyperrectangle can be reduced by discarding
off-diagonal terms:
sup r.OC ; / sup r.Odiag.C / ; /:
2. /
2. /
C 2 2. /
C 2
128
Just as in (4.55) the linear minimax risk over is clearly bounded below by that of the
hardest rectangular subproblem. However, for quadratically convex , and squared error
loss (to which it is adapted), the linear difficulties are actually equal:
Theorem 4.25 (Donoho et al., 1990) Consider the heteroscedastic Gaussian sequence
model (4.46). If is compact, solid, orthosymmetric and quadratically convex, then
RL ./ D sup RL .. //:
(4.57)
2
Combining (4.57), (4.49) and (4.55), we immediately obtain a large class of sets for which
the linear minimax estimator is almost as good as the non-linear minimax rule.
Corollary 4.26 If is compact, solid, orthosymmetric and quadratically convex, then
RL ./ RN ./:
This collection includes `p bodies for p 2 and so certainly ellipsoids, solid spheres,
etc. and the Besov bodies just discussed.
Remark 4.27 The results of preceding Theorem and Corollary extend easily, Exercise 4.15,
to parameter spaces with a Euclidean factor: D Rk 0 , where 0 is compact (and solid,
orthosymmetric and quadratically convex). This brings in useful examples such as Sobolev
ellipsoids in the Fourier basis, 2 .C /, recall (3.9).
Proof of Theorem 4.25. First we observe that (4.57) can be formulated as a minimax theorem. Indeed, (4.56) displays the left side as an inf sup. From (4.52) we see that the right side
of (4.57) equals sup2 infc r.Oc ; /. Thus, we need to show that
inf sup r.Oc ; / D sup inf r.Oc ; /:
c 2
2 c
(4.58)
.i2 /.
and s D
Now we verify that we may apply the Kneser-Kuhn minimax theorem,
Corollary A.4, to f .c; s/. Clearly f is convex-concave indeed, even linear in the second argument. By Remark 1 following Proposition 4.20, we may assume that the vector
c 2 `2 .N; .i2 // \ 0; 11 , while s 2 2C `1 : The latter set is convex by assumption and
`1 -compact by the assumption that is `2 -compact. (Check this, using the Cauchy-Shwarz
inequality.) Finally, f .c; s/ is trivially `1 -continuous in s for fixed c in 0; 11 .
P
Example. Let n;2 .C / denote an `2 ball of radius C in Rn : f W n1 i2 C 2 g: Theorem
4.25 says, in the homoscedastic case i , that
X
n
n
X
i2
2
2
2
RL .n;2 .C /; / D sup
W
i C ;
2 C i2
1
1
and since s ! s=.1 C s/ is concave, it is evident that the maximum is attained at the vector
with symmetric components i2 D C 2 =n: Thus,
RL .n;2 .C /; / D n 2
C2
;
n 2 C C 2
(4.59)
129
which grows from 0 to the unrestricted minimax risk n 2 as the signal-to-noise ratio C 2 =n 2
increases from 0 to 1.
While the norm ball in infinite sequence space, 2 .C / D f 2 `2 W kk2 C g is not
compact, the preceding argument does yield the lower bound
RL .2 .C /; / C 2 ;
which already shows that no linear estimate can be uniformly consistent as ! 0 over all
of 2 .C /. Section 5.5 contains an extension of this result.
Remark. We pause to preview how the various steps taken in this chapter and the next
can add up to a result of some practical import. Let OPS; denote the periodic smoothing
spline with regularization parameter in the white noise model, Section 3.4. If it is agreed
to compare estimators over the mean square smoothness classes D 2 .C /, Section 3.1,
(3.9), it will turn out that one cannot improve very much over smoothing splines from the
worst-case MSE point of view.
Indeed, borrowing some results from the next chapter (5.1, 5.2), the best mean squared
error for such a smoothing spline satisfies
RPS . ; / D inf sup r.OPS; ; I / .1 C c.; // RL . ; /;
2
along with the bound lim!0 c.; / 0:083 if 2. In combination with this chapters result bounding linear minimax risk by a small multiple of non-linear minimax risk,
Corollary 4.26, we can conclude that
RPS .2 .C /; / .1:10/.1:25/ RN .2 .C /; /
for all 2 and at least all sufficiently small . Thus even arbitrarily complicated nonlinear esimators cannot have worst-case mean squared error much smaller than that of the
relatively humble linear smoothing spline.
i 2 N;
Cov.z/ D ;
(4.60)
in which the components zi may be correlated. In contrast to Section 3.10, we may not
necessarily wish to work in a basis that exactly diagonalizes . This will be of interest
in the later discussion of linear inverse problems with a wavelet-vaguelette decomposition,
Chapter 12.
Make the obvious extensions to the definition of minimax risk among all non-linear and
among linear estimators. Thus, for example, RN .; / D infO sup2 EL.O .y/; / when
y follows (4.60). For suitable classes of priors P , we similarly obtain Bayes minimax risks
B.P ; /. The first simple result captures the idea that adding independent noise can only
make estimation harder. Recall the non-negative definite ordering of covariance matrices or
operators: 0 means that 0 is non-negative definite.
130
Lemma 4.28 Consider two instances of model (4.60) with 0 . Suppose that the loss
function a ! L.a; / is convex. Then
RN .; 0 / RN .; /;
and
RL .; 0 / RL .; /:
(4.61)
Of course, RDL .; / D RDL .; d / since RDL involves only the variances of z. Finally,
let the correlation matrix corresponding to be
%./ D d 1=2 d 1=2 :
Proposition 4.30 Suppose that y follows the correlated Gaussian model (4.60). Let %min
denote the smallest eigenvalue of %./. Suppose that is orthosymmetric and quadratically
convex. Then
1
RL .; / RDL .; / %min
RL .; /:
If is diagonal, then %min D 1 and RDL D RL . This happens, for example, in the
Karhunen-Lo`eve basis, Section 3.10. If is near-diagonalin a sense to be made more
precise in Chapter 12then not much is lost with diagonal estimators. For general , it can
happen that %min is small and the bound close to sharp, see the example below.
Proof Only the right hand side bound needs proof. It is easily verified that %min d
and that %min 1 and hence using Lemma 4.28 that
RL .; / RL .; %min d / %min RL .; d /:
131
(4.62)
132
yield bounds for all . See, for example, Sections 4.7 and 9.3 in which 0 is a hypercube.
This can be enough if all that is sought is a bound on the rate of convergence.
(e) Containment with optimization. Given a family of spaces
, optimize the
choice of
: RN ./ sup
RN .
/. This is used for Besov bodies (Section 9.9), `p balls
(Section 11.6), and in the Besov shell method of Section 10.8.
(f) Comparison of loss functions or models. If L.a; / L0 .a; / either everywhere or
with -probability one for suitable priors , then it may be easier to develop bounds using
L0 , for example via Bayes risk calculations. This strategy is used in Section 10.4 with `q
loss functions and in Section ??. A variant of the comparison strategy appears in Section
4.9 in which an ordering of covariance matrices implies an ordering of risks, and in Section
15.3 in which a discrete sampling model is compared with a continuous white noise model.
(g) Generic reduction to testing/classification. In this approach, a finite subset F
is chosen so that every pair of points satisfies ki j k 2: If w./ is increasing, then
inf sup E w.kQ
Q
and the estimation problem has been reduced to a classification error problem, which might,
for example, be bounded by a version of Fanos lemma (e.g. Cover and Thomas (1991)).
This common strategy is used here only in Section 5.5where more details are givenwhere
there is no special structure on .
(4.63)
2M
We call the right side the Bayes-minimax risk. Often M is defined by constraints on marginal
moments and
M will not be supported on . For example,
Pin2general
P if 2.C /2is the 2ellipsoid
2
2
defined by ai i C , then we might use M.C / D f.d / W ai E i C g.
The idea is that a judiciously chosen relaxation of the constraints defining may make
the problem easier to evaluate, and yet still be asymptotically equivalent to as ! 0:
The main task, then, is to establish that RN .; / B.M; / as ! 0:
(a) Basic Strategy. Suppose that one can find a sequence supported in , that is
nearly least favorable: B. / B.M; /: Then asymptotic equivalence would follow from
the chain of inequalities
B. / RN .; / B.M; / B. /:
(4.64)
(b) Asymptotic Concentration. Often it is inconvenient to work directly with priors supported on . Instead, one may seek a sequence 2 M that is both asymptotically least
133
(4.65)
If one then constructs the conditioned prior D . j/ and additionally shows that
B. / B. /;
(4.66)
then asymptotic equivalence follows by replacing the last similarity in (4.64) by B.M; /
B. / B. /.
There are significant details to fill in, which vary with the specific application. We try to
sketch some of the common threads of the argument here, noting that some changes may
be needed in each setting. There is typically a nested family of minimax problems with
parameter space .C / depending on C , so that C < C 0 implies that .C / .C 0 /.
Often, but not always, C will be a scale parameter: .C / D C .1/: We assume also
that the corresponding prior family is similarly nested. Let R.C; / B.C; / denote the
frequentist and Bayes minimax risks over .C / and M.C / respectively. We exploit the
nesting structure by taking as the least favorable prior for B.
C; / for some
< 1:
Although will typically not live on .
C /, it often happens that it is asymptotically
concentrated on the larger set .C /:
We now give some of the technical details needed to carry out this heuristic. The setting
is `2 loss, but the argument can easily be generalized, at least to other additive norm based
loss functions. Since C remains fixed, set D .C /: Let be a prior distribution with
B. /
B.
C; / and ./ > 0. Set D .j/, and let O be the Bayes estimator
of for the conditioned prior . The task is to relate B. / to B. /: From the frequentist
definition of Bayes risk B. / B.O ; /, and so
B. / E jjO jj2 j ./ C E jjO jj2 ; c
B. / ./ C 2E jjO jj2 C jj jj2 ; c :
Here and below we use E to denote expectation over the joint distribution of .; y/ when
prior is used. Since is concentrated on , B. / R.C; /; and on putting everything
together, we have
B.
C; /.1 C o.1// B. / R.C; / ./ C 2E fkO k2 C jjjj2 ; c g:
In summary, we now have a lower bound for the minimax risk.
Lemma 4.32 Suppose that for each
< 1 one chooses 2 M.
C / such that, as ! 0,
B. /
B.
C; /.1 C o.1//;
(4.67)
./ ! 1;
(4.68)
(4.69)
(4.70)
134
Often the function B.
C; / will have sufficient regularity that one can easily show
lim lim inf
%1
!0
B.
C; /
D 1:
B.C; /
(4.71)
See, for example, Exercise 4.7 for the scale family case. In general, combining (4.70) with
(4.71), it follows that R.C; / B.C; /:
Remark 4.33 Versions of this approach appear
1.
2.
3.
4.
4.12 Notes
Brown (1971) cites James and Stein (1961) for identity (4.4) but it is often called Browns identity for the
key role it plays in the former paper.
Aside: The celebrated paper of Brown (1971) uses (4.4) and (2.23) (the n-dimensional version of (4.6))
to show that statistical admissibility of O is equivalent to the recurrence of the diffusion defined by
dXt D r log p.Xt /dt C 2d Wt : In particular the classical and mysterious Stein phenomenon, namely the
O
inadmissibility of the maximum likelihood estimator .y/
D y in exactly dimensions n 3, is identified
n
with the transience of Brownian motion in R ; n 3: See also Srinivasan (1973).
A careful measure theoretic discussion of conditional distributions is given in Schervish (1995, Appendix
B.3). Broad conditions for the Borel measurability of Bayes rules found by minimizing posterior expected
loss are given in Brown and Purves (1973).
Brown et al. (2006) gives an alternative proof of the Bayes risk lower bound (4.9), along with many other
connections to Steins identity (2.58). Improved bounds on the Bayes risk are given by Brown and Gajek
(1990).
4. The Bayes minimax risk B.P/ introduced here is also called the -minimax risk (where refers to
the class of prior distributions) in an extensive literature; overviews and further references may be found in
Berger (1985) and Ruggeri (2006).
The primary reference for the second part of this chapter is Donoho et al. (1990), where Theorems 4.17,
4.25 and 9.5 (for the case i ) may be found. The extension to the heteroscedastic setting given here is
straightforward. The short proof of Theorem 4.17 given here relies on a minimax theorem; Donoho et al.
(1990) give a direct argument.
More refined bounds in the spirit of the Ibragimov-Hasminskii bound of Theorem 4.17, valid for all
> 0, were derived and applied by Levit (2010a,b).
[J and MacGibbon?] A Bayesian version of the I-H bound is given by Vidakovic and DasGupta
(1996), who show that the linear Bayes minimax risk for all symmetric and unimodal priors on ; as at
most 7:4% worse than the exact minimax rule. [make exercise?]
It is curious that the limiting least favorable
p distribution (4.43) found by Bickel (1981), after the transformation x D sin.s=2/, becomes .2=/ 1 x 2 dx, the Wigner semi-circular limiting law for the
(scaled) eigenvalues of a real symmetric matrix with i.i.d. entries (e.g. Anderson et al. (2010, Ch. 2)).
Local repulsionof prior support points, and of eigenvaluesis a common feature.
Least favorable distributions subject to moment constraints for the single normal mean with known
variance were studied by Feldman (1991) and shown to be either normal or discrete.
Levit (1980, 1982, 1985) and Berkhin and Levit (1980) developed a more extensive theory of second
order asymptotic minimax estimation of a d -dimensional Gaussian mean. Quite generally, they showed
that the second order coefficient (here 2 ), could be interpreted as twice the principal eigenvalue of the
Laplacian (here D 2d 2 =dt 2 / on the fundamental domain (here 1; 1), with the asymptotically least
Exercises
135
favorable distribution having density the square of the principal eigenfunction, here !.t/ D cos. t=2/: We
do not delve further into this beautiful theory since it is essentially parametric in nature: in the nonparametric
settings to be considered in these notes, we are still concerned with understanding the first order behaviour
of the minimax risk with noise level or sample size n:
The overview of lower bound methods for nonparametric estimation in Tsybakov (2009, Ch. 2) is accompanied by extensive historical bibliography.
Exercises
4.1
(Qualitative features of risk of proper Bayes rules.) Suppose that y N.; 2 /, that has a
proper prior distribution , and that O is the squared error loss Bayes rule.
(a) Show that r.O ; / cannot be constant for 2 R. [Hint: Corollary 4.10.]
(b) If E jj < 1, then r.O ; / is at most quadratic in : there exist constants a; b so that
r.O ; / a C b 2 . [Hint: apply the covariance inequality (C.8) to E j xj. x/.
(c) Suppose in addition that is supported in a bounded interval I . Show that P .O 2 I / D 1
for each and hence that r.O ; / is unbounded in , indeed r.O ; / c 2 for suitable c > 0.
4.2
(Proof of van Trees inequality.) Suppose that X N.; 1/ and that the prior has density
p./d . Let E denote expectation with respect to the joint distribution of .x; /. Let A D
O
.y/
I and B D .@=@/log .y /p. /: Show that EAB D 1, and then use the CauchySchwarz inequality to establish (4.9). (Belitser and Levit, 1995)
(Fisher information for priors on an interval.) (a) Consider the family of priors .d / D
c .1 j j/ . For what values of is I. / 1?
(b) What is the minimum value of I. /?
(c) Show that 1 in (4.43) minimizes I./ among probability measures supported on 1; 1.
4.3
4.4
4.5
4.6
4.7
136
4.8
(Direct argument for minimaxity on products). In the setting of Proposition 4.16, suppose that
.Oj ; j / is a saddle-point in the j -th problem. Let O .y/ D .Oj .yj // and D .j /. Show
without using priors that .O ; / is a saddle-point in the product problem.
4.9
(Taking the interval constraint literally.) Recall that if Y N.; 2 /, we defined L .; / D
infO sup 2 ; EOc .Y / 2 , for linear estimators Oc .Y / D cY . An awkward colleague
c
complains it is nonsensical to study L .; / since no estimator Oc in the class is sure to satisfy
the constraint O 2 ; . How might one reply?
4.10 (Bounded normal mean theory for L1 loss.) Redo the previous question for L.; a/ D j
In particular, show that
aj:
Q /;
O .y/ D sgn y;
and
B. / D 2 .
R1
Q
where, as usual ./
D .s/ds: In addition, show that
1
L .; /
:
D 1=:32 < 1:
B.1 /
; N .; /
p
Hint: show that L .; 1/ P .; 1/ D min.; 2=/:
4.11 (Continued.) For L1 loss, show that (a) N .; /
p D N .=; 1/ is increasing in ; and (b)
lim!1 N .; / D
0 ; where
0 D E0 jzj D 2=:
[Hint: for (b) consider the uniform prior on ; .]
4.12 (Discrete prior spacing and risk functions.) This exercise provides some direct support for the
claim before Figure 4.2 that a risk function bounded by 2 forces a discrete prior to have atoms
spaced at most O./ apart. To simplify, consider D 1.
(a) Show that for any estimator .x/
O
D x C g.x/ that if jg.x/j M for x 2 K, then
p
r.;
O / M 2 P .K/ 2 M :
D sup
(b) Again for simplicity consider a two point prior, which may as well be taken as .d/ D
0 0 C 1 0 with 1 D 1 0 . Show that the posterior mean
O .x/ D 0
0 e 0 x 1 e
0 e 0 x C 1 e
0 x
0 x
(c) Consider first 0 D 1=2 and argue that there exists a > 0 such that if 0 is large, then for
some jj < 0
r.O ; / > a20 :
(4.73)
(d) Now suppose that 0 D 1=2
and show that (4.73) still holds for a D a.
/.
P
4.13 (Hyperrectangles, exponential decay and domain of analyticity.) Suppose f .t / D 11 k e 2 i kt
P1
k
2 i t .
and consider the associated function g.z/ D
1 k z of the complex variable z D re
jkj
If jk j D O.e
/, show that g is analytic in the annulus A D fz W e < jzj < e g. A near
converse also holds: if g is analytic in a domain containing A ; then jk j D O.e jkj /: Thus,
the larger the value of , the greater the domain of analyticity.
137
Exercises
4.14 (Minimax affine implies diagonal.) An affine estimator has the form OC;b .y/ D Cy C b. Prove
the following extension of Proposition 4.20: if OC;b is minimax among affine estimators over a
hyperrectangle ./, then necessarily b D 0 and C must be diagonal.
4.15 (Linear minimaxity on products with a Euclidean factor.) Adopt the setting of Section 4.8: the
model yi D i C i zi , (4.46), with 2 orthosymmetric and with squared error loss.
(a) Suppose first that D 0 with both and 0 being solid orthosymmetric. Show
that RL ./ D RL . / C RL .0 /. [Hint: start from (4.56).]
(b) If 0 satisfies the assumptions of Theorem 4.25, i.e. is compact, solid, orthosymmetric and
quadratically convex, then show that the conclusion of that theorem applies to D Rk 0 :
namely RL ./ D sup2 RL .. //.
4.16 (Translation invariance implies diagonal Fourier optimality.) Signals and images often are
translation invariant. To make a simplified one-dimensional model, suppose that we observe,
in the time domain, xk D
k C k for k D 1; : : : ; n: To avoid boundary effects, assume
that x;
and are extended to periodic functions of k 2 Z, that is x.k C n/ D x.k/; and so
on. Define the shift of
by .S
/k D
kC1 : The set is called shift-invariant if
2 implies
S
2 . Clearly, then, S l
2 for all l 2 Z:
P
(a) Show that D f
W nkD1 j
k
k 1 j < C g is an example of a shift-invariant set. Such
sets are said to have bounded total variation.
Now rewrite the model in the discrete Fourier domain. Let e D e 2 i=n and note that the discrete
Fourier transform y D F x can be written
yk D
n
X1
ekl xl ;
k D 0; : : : ; n
1:
lD0
2V . /
Thus, on a translation invariant set , an estimator that is minimax among affine estimators
must have diagonal linear form when expressed in the discrete Fourier basis.
4.17 (No minimax result for projection estimators.) Show by example that the equality (4.58) fails
if c is restricted to f0; 1gI , the class of projections onto subsets of the co-ordinates.
4.18 (Linear and diagonal minimax risk in intra-class model.)
Consider the setting of Example 4.31.
(a) Show that in the basis of the Karhunen-Lo`eve transform, the variances are
"21 D p 2 C 1;
"2k D 1; k 2:
P
(b) Show that RL ..// D i i2 2 =.i2 C 2 /, and RDL .. // D p.1C 2 / 2 =.1C 2 C 2 /.
(c) Derive conclusion (4.62).
5
Linear Estimators and Pinskers Theorem
Compared to what an ellipse can tell us, a circle has nothing to say. (E. T. Bell).
Under appropriate assumptions, linear estimators have some impressive optimality properties. This chapter uses the optimality tools we have developed to study optimal linear estimators over ellipsoids, which as we have seen capture the notion of mean-square smoothness
of functions. In particular, the theorems of Pinsker (1980) are notable for several reasons.
The first gives an exact evaluation of the linear minimax risk in the Gaussian sequence model
for quadratic loss over general ellipsoids in `2 : The second shows that in the low noise limit
! 0; the non-linear minimax risk is actually equivalent to the linear minimax risk: in
other words, there exist linear rules that are asymptotically efficient. The results apply to ellipsoids generally, and thus to all levels of Hilbert-Sobolev smoothness, and also to varying
noise levels in the co-ordinates, and so might be considered as a crowning result for linear
estimation.
The linear minimax theorem can be cast as a simple Lagrange multiplier calculation,
Section 5.1. Section 5.2 examines some examples: in the white noise model, ellipsoids of
mean square smoothness and of analytic function, leading to very different rates of convergence (and constants!). Fractional integration is used as an example of the use of the linear
minimax theorem for inverse problems. Finally, a concrete comparison shows that the right
smoothing spline is actually very close in performance to linear minimax rule.
Section 5.3 states the big theorem on asymptotic minimax optimality of linear estimators among all estimators in the low noise limit. In this section we give a proof for the white
noise model with polynomial ellipsoid constraints this allows a simplified argument in
which Gaussian priors are nearly least favorable. The Bayes rules for these Gaussian priors are linear, and are essentially the linear minimax rules, which leads to the asymptotic
efficiency.
Section 5.4 gives the proof for the more general case, weaving in ideas from Chapter 4
in order to combine the Gaussian priors with other priors needed for co-ordinates that have
especially large or small signal to noise ratios.
The chapter concludes with a diversionary interlude, Section 5.5, that explains why the
infinite sequence model requires a compactness assumption for even as weak a conclusion
as consistency to be possible in the low noise limit.
138
139
i 2 N:
(5.1)
ai2 i2 C 2 g:
(5.2)
A pleasant surprise is that there is an explicit solution for the minimax linear estimator over
such ellipsoids.
Proposition 5.1 Suppose that the observations follow sequence model (5.1) and that is
an ellipsoid (5.2). Assume that ai are positive and nondecreasing with ai ! 1: Then the
minimax linear risk
X
RL ./ D
i2 .1 ai =/C ;
(5.3)
i
i2 ai .
ai / C D C 2 :
(5.4)
ai =/C yi ;
(5.5)
and is Bayes for a Gaussian prior C having independent components i N.0; i2 / with
i2 D i2 .=ai
1/C :
(5.6)
Some characteristics of the linear minimax estimator (5.5) deserve note. Since the ellipsoid weights ai are increasing, the shrinkage factors ci decrease with i and hence downweight the higher frequencies more. In addition, there is a cutoff at the first index i such
that ai : the estimator is zero at frequencies above the cutoff. Finally, the optimal linear
estimator depends on all the parameters C; .i /; and .ai /as they vary, so does the optimal estimator. In particular, the least favorable distributions, determined by the variances i2
change with changing noise level.
Proof The set is solid, orthosymmetric and quadratically convex. Since sup ai D 1
it is also compact. Thus the minimax linear risk is determined by the hardest rectangular
subproblem, and from Theorem 4.25,
n X 2 2
o
X
2 2
2
i i
W
a
C
:
(5.7)
RL ./ D sup RL .. // D sup
i
i
i2 C i2
2
i
This maximum may be evaluated by forming the Lagrangian
Xn
i4 o
1 X 2 2
LD
i2
ai i :
2
2
2
C
i
i
i
i
Simple calculus shows that the maximum is attained at i2 given by (5.6). The positive
140
part constraint arises because i2 cannot bePnegative. The Lagrange multiplier parameter
is uniquely determined by the equation
ai2 i2 D C 2 ; which on substitution for i2
yields (5.4). This equation has a unique solution since the left side is a continuous, strictly
increasing, unbounded function of . The corresponding maximum is then (5.3).
We have seen that the hardest rectangular subproblem is . /, with given by (5.6).
The minimax linear estimator for . /, recalling (4.28), is given by Oi D ci yi with
ci D
i2
i2
D 1
2
C i
ai
:
C
(5.8)
We now show that O is minimax linear for all of . Lemma (3.12) (generalized in the
obvious way to model (5.1)) evaluates the maximum risk of O over an ellipsoid (5.2) as
X
sup r.O ; / D
i2 ci2 C C 2 sup ai 2 .1 ci /2 :
2
From (5.8) it is clear that ai 1 .1 ci / equals 1 for all ai and is less than
ai > . Consequently, using (5.4) and (5.8),
X
X ai
ai
1
D
i2 ci .1 ci /:
i2
C 2 sup ai 2 .1 ci /2 D C 2 =2 D
C
i
i
i
for
which shows that O is indeed minimax linear over . Finally, from (5.8) it is evident that
O is Bayes for a prior with independent N.0; i2 / components.
Remark. The proof of the proposition first uses (5.7), which corresponds to a minimax
theorem for the payoff function r.c; / D r.Oc ; /, as seen at (4.58). The proof then goes
further than the minimax statement (4.58) to exhibit .O ; / D .Oc ; / as a saddlepoint:
the extra argument in the second paragraph shows that for all c and for all 2 ,
r.Oc ; / r.Oc ; / r.Oc ; /:
k2N
141
To describe the summation set N , observe that the weights ak k with relative error at
most O.1=k/ and so
N D N./ D fk W ak < g fk W k < 1= g:
Setting k D 1= , we have the integral approximations (compare (3.51))
k
X
k2N
: X p : pC1=
akp D
:
k D
p C 1
kD1
2
(5.10)
where, in the usual rate of convergence notation, r D 2=.2 C 1/. We finally have
X
ak : 2
1 1C1=
2
RL ./ D
D k
1
C1
k2N
1 r
2 r
:
D
2 1= D
.2 C 1/C 2
C1
C1
2.1 r/ 2r
D Pr C
;
(5.11)
where the Pinsker constant
r
.2 C 1/1
Pr D
C1
r r
.1
2 r
r/r
Fractional integration
We turn to an example of inverse problems that leads to increasing variances k2 in the sequence model. Consider the noisy indirect observations model
Y D Af C Z;
(5.12)
introduced at (3.64), When A is -fold integration, examples (ii)-(iv) of Section 3.9 showed
that the singular values bk c 1 k as k ! 1, with relative error O.1=k/. The constant
c D in the trigonometric basis, and equals 1 for the Legendre polynomial basis. So we
142
obtain an example of sequence model (5.1) with ak k as before and k c k . Proposition 5.1 allows evaluation of the minimax mean squared error over 2 .C /. A calculation
similar to that done earlier in this section yields a straightforward extension of (5.11):
RL . .C /; / Pr; C 2.1
r /
.c /2r ;
.2 C 2 C 1/1
Pr; D
C 2 C 1
2 C 1
and an analog of Lemma 3.3 holds, with different constants. For details see Domnguez et al.
(2011, Thm. 5.1) and references therein.
2. When the trigonometric basis is used one must be careful with the definition of A
due to the arbitrary constant of integration. Following Zygmund
R t (2002), consider periodic
functions with integral 0. Then one can set A1 .f /.t / D 0 f .s/ds, and define A for
2 N by iteration,
A D A1 . For > 0 non-integer, if ek .t /PD e 2 i k t for k 2 Z
P
and f .t/
ck ek .t/ with c0 D 0, one can define .A f /.t / k .i k/ ck ek .t /; and
Zygmund (2002, Vol. II, p. 135) shows that
Z t
1
f .s/.t s/ 1 ds:
.A f /.t / D
./ 1
143
:
Restricting r to positive integers, we have F .r/ D e 2r
, with
D c;1 c;2 > 0, from
which we may write our sought-after solution as D e r0 for 2 1; e / with
C2
1
log
:
r0 D
2
2
2
Now we may write the minimax risk (5.3) as ! 0 in the form
RL .; / D 2 2
r0
X
.r0 k/
:
kD1
2
log
is only logarithmically worse than the parametric rate 2 , and the dependence on .a; C /
comes, at the leading order term, only through the analyticity range and not via the scale
factor C:
if we choose weights wk D ak2 corresponding to the ellipsoid (5.2). This should be compared
with the linear minimax solution of (5.5), which we write as
M
O;k
D .1
ak =/C yk :
sup
r.O ; I /:
2
2 .C /
Thus we will take O to be either OS S or M . First, however, it is necessary to specify the order
of smoothing spline: we take the weights equal to the (squared) ellipsoid weights: wk D
144
ak2 , thus w2k D w2k 1 RD .2k/2 : When is a non-negative integer m, this corresponds
to a roughness penalty .D m f /2 : We also need to specify the value of the regularization
parameter , respectively , to be used in each case. A reasonable choice is the optimum, or
minimax value
D argmin r.
N O I /:
Here is shorthand for, respectively, S S , the minimax value for the spline family, and
M D M2 , that for the minimax family. This is exactly the calculation done in Chapter 3 at
(3.55) and (3.90) for the spline OS S and minimax OM families respectively . [Of course, the
result for the minimax family must agree with (5.11)!] In both cases, the solutions took the
form, again with r D 2=.2 C 1/,
r.
N O ; / c2 e H.r/ C 2.1
.c1 2 =C 2 /r ;
where the binary entropy function H.r/ D
r log r
c1S S D 2v =;
c2S S D vr =41 r ;
c1M
c2M
1
vN =;
2
.1
r/ log.1
v D .1
r/ 2r
;
(5.13)
r/ and
1=2/=sinc .1=2/;
2
vN r ;
vN D 2 =. C 1/.2 C 1/:
Thus the methods have the same dependence on noise level and scale C , with differences
appearing only in the coefficients. We may therefore summarize the comparison through
the ratio of maximum mean squared errors. Remarkably, the low noise smoothing spline
maximal MSE turns out to be only negligibly larger than the minimax linear risk of the
Pinsker estimate. Indeed, for D 2 .C /, using (5.13), we find that as ! 0;
RS S .; / v r 1 1
RL .; /
vN
4
<1:083 D 2
:
D 1:055 D 4
:
! 1 ! 1:
(5.14)
4v r
vN
<4:331 D 2
:
D 4:219 D 4
:
! 4 ! 1;
and so S S is approximately four times M and this counteracts the lesser shrinkage of
smoothing splines noted earlier.
Furthermore, in the discrete smoothing spline setting of Section 3.4, Carter et al. (1992)
present small sample examples in which the efficiency loss of the smoothing spline is even
smaller than these asymptotic values= (see also Exercise 5.4.) In summary, from the maximum MSE point of view, the minimax linear estimator is not so different from the Reinsch
smoothing spline that is routinely computed in statistical software packages.
145
(5.15)
ai =2
Theorem 5.2 (Pinsker) Assume that .yi / follows sequence model (5.1) with noise levels
.i /. Let D .a; C / be an ellipsoid (5.2) defined by weights .ai / and radius C > 0:
Assume that the weights satisfy conditions (i) and (ii). Then, as ! 0,
RN .; / D RL .; /.1 C o.1//:
(5.16)
Thus the linear minimax estimator (5.5) is asymptotically minimax among all estimators.
Remarks. 1. The hardest rectangular subproblem results of Section 4.8 say that RL .I /
1:25RN .I /, but this theorem asserts that, in the low noise limit, linear estimates cannot
be beaten over ellipsoids, being fully efficient.
2. The condition that sup ai D 1 is equivalent to compactness of in `2 : In Section 5.5,
it is shown for the white noise model that if is not compact, then RN .; / does not even
approach 0 as ! 0:
3. In the white noise model, i D , condition (ii) follows from (i). More generally,
condition (ii) rules out exponential growth of i2 , however it is typically satisfied if i2 grows
polynomially with i.
4. Pinskers proof is actually for an even more general situation. We aim to give the
essence of Pinskers argument in somewhat simplified settings.
(5.17)
146
Under these conditions we give a proof based on Gaussian priors. Although special, the
conditions do cover the Sobolev ellipsoid and spline examples in the previous section, and
give the flavor of the argument in the more general setting.
Pinskers linear minimax theorem provides, for each , a Gaussian prior with independent
2
i2 D 2 . =ai 1/C and the Lagrange multiplier
co-ordinates
P i N.0; i / where
satisfies i ai . ai /C D C 2 = 2 . Since the sequence .i2 / maximizes (5.7), we might
call this the least favorable Gaussian prior. It cannot be least favorable among all priors, in
the sense of Section 4.3, for example because it is not supported on . Indeed, for this prior
X
X
E
ai2 i2 D
ai2 i2 D C 2 ;
(5.18)
so that the ellipsoid constraint holds only in mean. However, we will show under our restricted conditions that a modification is indeed asymptotically concentrated on , and implements the heuristics described above. The modification is made in two steps. First, define
a Gaussian prior with slightly shrunken variances:
(5.19)
G W i N 0; .1 /i2 ;
with & 0 to be specified. We will show that G ./ ! 1 and so for the second step it
makes sense to obtain a prior supported on by conditioning
.A/ WD G .Aj 2 /:
The idea is then to show that these satisfy (5.17).
Comparing Gaussian and conditioned priors. We can do calculations easily with G
since it is Gaussian, but we are ultimately interested in ./ and its Bayes risk B. /. We
need to show that they are close, which we expect because G ./ 1:
Let E denote expectation under the joint distribution of .; y/ when G . Let O
denote the Bayes rule for prior , so that O D E j; y. The connection between G and
is captured in
Lemma 5.3
.1
k2 ; c :
(5.20)
/i2 /,
(5.21)
/RL ./:
(5.22)
k2 D EkO
k2 ; c :
If for O we take the Bayes rule for , namely O , then by definition EkO k2 j D B. /:
Now, simply combine this with the two previous displays to obtain (5.20).
Now a bound for the remainder term in (5.20), using the ellipsoid structure.
147
Lemma 5.4
Proof
EkO
k2 ; c cG .c /1=2 C 2 :
since i N.0; i2 /. Consequently E.O;i i /4 24 Ei4 D c 2 i4 . Applying the CauchyP
Schwarz inequality term by term to kO k2 D i .O;i i /2 , we find
X
E.O;i i /4 1=2
EkO k2 ; c G .c /1=2
i
cG .c /1=2
(5.23)
i2 :
i2 amin2
/RL ./
cG .c /1=2 C 2 ;
(5.24)
and so for (5.17) it remains to show that for suitable ! 0, we also have G .c / ! 0
sufficiently fast.
P
G concentrates on . Under the Gaussian prior (5.19), E ai2 i2 D .1 /C 2 and so
the complementary event
n X
o
c D W
ai2 .i2 Ei2 > C 2 :
We may write ai2 i2 as i Zi2 in terms of independent standard Gaussians Zi with
i D .1
/ai2 i2 D .1
/ 2 ai .
ai / C :
Now apply the concentration inequality for weighted 2 variates, (2.77), to obtain
G .c / expf t 2 =.32kk1 kk1 g;
with t D C 2 . Now use (5.18) and the bound x.1 x/ 1=4 to obtain
X
kk1
ai2 i2 D C 2 ;
kk1 2 2 =4:
Consequently the denominator 32kk1 kk1 8C 2 . /2 .
We now use the polynomial growth condition (i) to bound . The calculation is similar
to that in the Sobolev ellipsoid case leading to (5.10); Exercise 5.5 fills in the details. The
result is that, for a constant c now depending on ; b1 and b2 , and r D 2=.2 C 1/ and
changing at each appearance,
c C r 1 r ;
We conclude from the concentration inequality that
G .c / expf c 2 .C =/2.1
r/
g;
and hence that if is chosen of somewhat larger order than 1 r while still approaching
zero, say / 1 r , then G .c / exp. c = 2 / D o.RL .// as required.
148
Remark 5.5 Our special assumptions (i) and (ii) were used only in the concentration
inequality argument to show that G .c / RL ./. The bad
P bound comes at the end of
the proof P
of Lemma 5.4: if instead we were able to replace i2 amin2 C 2 by a bound of
the form
i2 cRL ./, then it would be enough to show G .c / ! 0. For the bound
to be in terms of RL ./, it is necessary to have the i comparable in magnitude, see (5.32)
below. This can be achieved with a separate treatment of the very large and very small values
of i , and is the approach taken in the general case, to which we now turn.
Ns ;
Ng ;
Nb ;
according as
i2 =i2 2
0; q
.q
q; 1/;
; q/;
i
0;
:
qC1
(5.25)
(5.26)
! 1;
149
q
||
q+1
k
||
q+1
k
k1
Nb
Ng
k2
k3
Ns
Figure 5.1 The big, gaussian and small signal to noise regimes for Sobolev
ellipsoids
with similar expressions for jNb j and jNs j that also increase proportionally to .C 2 = 2 /1 r .
Definition of priors D .; q/. A key role is played by the minimax prior variances
i2 found in Proposition 5.1. We first use them to build sub-ellipsoids s ; b and g ,
defined for m 2 fs; b; gg by
X
X
m D m .; q/ D f.i ; i 2 Nm / W
ai2 i2
ai2 i2 g:
Nm
ai2 i2
Nm
Since
D C , we clearly have s g b . We now define priors m D
m .; q/ supported on m , see also Figure 5.2:
s :
i nd
for i 2 Nb , set i Vi , cosine priors on i ; i , with density i 1 cos2 .i =2i /,
recall (4.43),
i nd
g :
for i 2 Ng , first define G ; which sets i N.0; .1 /i2 / for some fixed 2 .0; 1/.
Then define g by conditioning:
b :
We show that the priors m D m .; q/ have the following properties:
(a) B.s / rs .q
1=2
1=2
/ ! 1 as q ! 1,
150
"high"
"medium"
"low" b
Figure 5.2 The small components prior is supported on the extreme points of a
hyperrectangle in s ; the big component prior lives on a solid hyperrectangle in
b . The Gaussian components prior is mostly supported on g , cf. (5.34), note
that the density contours do not match those of the ellipsoid.
(b) B.b / rb .q 1=2 /Rb ./ for all , and rb .q 1=2 / ! 1 as q ! 1, and
(c) If > 0 and q D q./ are given, and if Rg ./ RL ./, then for < ./ sufficiently
small, B.g / .1 /Rg ./:
Assuming these properties to have been established, we conclude the proof as follows.
Fix > 0 and then choose q./ large enough so that both rs .q 1=2 / and rb .q 1=2 / 1 .
We obtain
B.m / .1
/Rm ./;
(5.28)
Now, if Rg ./ RL ./, then the previous display holds also for m D g and sufficiently small, by (c), and so adding, we get B. / .1 /RL ./ for sufficiently small.
On the other hand, if Rg ./ RL ./, then, again using (5.28),
B. / .1
/RL ./
Rg ./ .1
/2 RL ./:
Either way, we establish (5.17), and are done. So it remains to prove (a) - (c).
Proofs for (a) and (b). These are virtually identical and use the fact that two point and
cosine priors are asymptotically least favorable as i =i ! 0 and 1 respectively. We tackle
B.s / first. For a scalar problem y1 D 1 C 1 z1 with univariate prior .d / introduce the
notation B.; 1 / for the Bayes risk. In particular, consider the two-point priors needed
for the small signal case. By scaling, B. ; / D 2 B.= ; 1/, and the explicit formula
(2.30) for B.= ; 1/ shows that when written in the form
B. ; / D L .; /g.= /;
(5.29)
we must have g.t/ ! 1 as t ! 0. Now, using this along with the additivity of Bayes risks,
and then (5.25) and (5.27), we obtain
X
X
g.i =i /L .i ; i / rs .q 1=2 /Rs ./;
(5.30)
B.s / D
B.i ; i / D
Ns
Ns
if we set rs .u/ D inf0tu g.t/. Certainly rs .u/ ! 1 as u ! 0, and this establishes (a).
151
For the large signal case (b), we use the cosine priors V , and the Fisher information
bound (4.16), so that the analog of (5.29) becomes
B.V ; / L .; /h.= /;
with h.t/ D .t 2 C 1/=.t 2 C I.1V // ! 1 as t ! 1: The analog of (5.30), B.g /
rb .q 1=2 /Rb ./ follows with rb .q/ D inftq h.t / ! 1 as t ! 1:
Proof of (c): This argument builds upon that given in the special white noise setting in
the previous section. Let Og D E jg ; y denote the Bayes rule for g . With the obvious
substitutions, the argument leading to (5.20) establishes that
.1
k2 ; cg :
(5.31)
Ng
For i 2 Ng the variables are Gaussian and we may reuse the proof of Lemma 5.4 up through
the inequality (5.23). Now substituting the bound above, we obtain the important bound
EfkOg
(5.33)
1/2 2 .Ng /;
(5.34)
where
2 .N / D max i2
i 2N
.X
i2 :
i 2N
The bound (5.34) reflects three necessary quantities, and hence shows why the method
works. First q governs the signal to noise ratios i2 =i2 , while governs the slack in the
expectation ellipsoid. Finally 2 .Ng / is a surrogate for the number of components 1=Ng in
the unequal variance case. (Indeed, if all i2 are equal, this reduces to 1=jNg j).
P
P
Proof of (5.34). Indeed, let S D Ng ai2 i2 and Cg2 D Ng ai2 i2 : Since Ei2 D .1
/i2 and Var i2 D 2.1 /2 i4 , we have
ES D .1
VarS 2.1
/Cg2 ;
2
/
and
Cg2
maxfai2 i2 g:
Ng
152
From definition (5.6) of i2 and bounds (5.26) defining the Gaussian range Ng :
ai2 i2 D i2 ai .
ai /C 2 i2 2 .q C 1/ 2 ; 1=4;
and so
max ai2 i2
.q C 1/2 max i2
P 2 2
P 2 q 2 2 .Ng /:
4
ai i
j
Inserting bound (5.34) into (5.33), we obtain
EfkOg
k2 ; cg g c.q/.
(5.35)
We now use the hypothesis Rg ./ RL ./ to obtain a bound for .Ng /. Indeed, using
the definition of Rg and (5.7), we have
X
X
i2 Rg ./ RL ./ D
i2 .1 ai =/C
i
Ng
.=2/
i2 ;
ai =2
i 2Ng
ai
ai =2
B.g / Rg ./1
f .q; ; /. /:
Recall that > 0 and q D q./ are given. We now set D =2, and note that . / ! 0
as ! 0. Indeed, that ! 1 follows from condition (5.4), here with i2 D 2 %2i , along
with the assumption (i) that ai % 1 monotonically. Our assumption (ii) then implies that
. / ! 0: Consequently, for < .; q.//; we have f .q./; =2; /. / < =2: Thus
B.g / .1 /Rg ./ and this completes the proof of (c).
153
Of course, if RN .; / does not converge to 0; then there exists c > 0 such that every
estimator has maximum risk at least c regardless of how small the noise level might be.
This again illustrates why it is necessary to introduce constraints on the parameter space in
order to obtain meaningful results in nonparametric theory. In particular, there can be no
uniformly consistent estimator on f 2 `2 .N/ W kk2 1g, or indeed on any open set in the
norm topology.
Because there are no longer any geometric assumptions on , the tools used for the proof
change: indeed methods from testing, classification and from information theory now appear.
While the result involves only consistency and so is not at all quantitative, it nevertheless
gives a hint of the role that covering numbers and metric entropy play in a much more refined
theory (Birge, 1983) that describes how the massiveness of determines the possible rates
of convergence of RN ./:
where the expectation is taken over the Y marginal of the joint distribution of .; Y /: Let
h.q/ D q log q .1 q/ log.1 q/ be the binary entropy function. Write pe D P .O /
154
for the overall error probability when using estimator O : Fanos lemma provides a lower
bound for pe :
h.pe / C pe log.m
1/ H.jY /:
Necessity of compactness
For both parts of the proof, we use an equivalent formulation of compactness, valid in complete metric spaces, in terms of total boundedness: is totally bounded if and only if for
every , there is a finite set fi ; : : : ; m g such that the open balls B.i ; / of radius centered
at i cover W so that [m
iD1 B.i ; /: Also, since is bounded, it has a finite diameter
D supfk1 2 k W 1 ; 2 2 g:
Let > 0 be given. Since RN .; / ! 0, there exists a noise level and an estimator Q
such that
E; kQ k2 2 =2
for all 2 :
(5.38)
Let be a finite and 2 discernible subset of : each distinct pair i ; j in satisfies
ki j k > 2: From Q .y/ we build an estimator O .y/ with values confined to by
choosing a closest i 2 to Q .y/ W of course, whenever O i ; it must follow that
kQ i k : Consequently, from Markovs inequality and (5.38), we have for all i
Pi fO i g Pi fkQ
i k g
EkQ
i k2 1=2:
(5.39)
On the other hand, the misclassification inequality (5.37) provides a lower bound to the error
probability: for the noise level Gaussian sequence model, one easily evaluates
K.Pi ; Pj / D ki
j k2 =2 2 2 =2 2 ;
2 =2 2 C log 2
:
log.j j 1/
155
Combining this with (5.39) gives a uniform upper bound for the cardinality of :
log.j j
1/ 2
C 2 log 2:
Sufficiency of Compactness
Given > 0, we will construct an estimator O such that E kO k2 20 2 on for all
sufficiently small . Indeed, compactness of supplies a finite set D f1 ; : : : ; m g such
O
that [m
iD1 B.i ; /; and we will take to be the maximum likelihood estimate on the
sieve : Thus we introduce the (normalized) log-likelihood
L./ D 2 log dP; =dP0; D hy; i
1
kk2 ;
2
(5.40)
(5.41)
i Wki k4
We now show that the terms in the second sum are small when is small. Let 2 be
fixed, and choose a point in , renumbered to 1 if necessary, so that 2 B.1 ; /: To have
O D i certainly implies that L.i / L.1 /; and from (5.40)
L.i /
L.1 / D hy
1
.
2 i
C 1 /; i
1 i:
; ui 12 ki
1 k
=2;
where in the second inequality we used jh1 ; uij k1 k < , and in the third
Q
ki 1 k 3. Thus P fO D i g .=.2//;
and so from (5.41)
E kO
Q
k2 .4/2 C m2 .=.2//
20 2 ;
156
Efromovich (1996) gives an extension of the sharp minimax constant results to a variety of nonparametric
settings including binary, binomial, Poisson and censored regression models.
For further discussion of minimax estimation in the fractional integration setting, see Cavalier (2004).
The use of parameter spaces of analytic functions goes back to Ibragimov and Khasminskii (1983) and
Ibragimov and Khasminskii (1984), see also Golubev and Levit (1996).
The consistency characterization, Theorem 5.6, is a special case of a result announced by Ibragimov
and Has0 minski (1977), and extended in Ibragimov and Khasminskii (1997). The approach of maximizing
likelihood over subsets that grow with sample size was studied as the method of sieves in Grenander
(1981).
Exercises
5.1
Pinsker constant: tracking the error terms. Consider the fractional integration setting of Section
5.2 in which k D =bk and the singular values bk D c 1 k .1 C O.k 1 //. This of course
includes the special case of direct estimation with bk 1.
(a) Consider the ellipsoids 2 .C / with a2k D a2k 1 D .2k/ . Let N D N./ D fk W ak <
g and show that
k D jN./j D 1= C O.1/ D 1= .1 C O.k 1 //:
P
p
(b) For p D 0; 1; 2, let Sp D k2N bk 2 ak and show that
Sp D .2 C p C 1/
(c) Verify that RL ./ D 2 .S0
that (CHECK)
1 2 2 CpC1
c k
.1
RL . .C /; / D Pr; C 2.1
5.2
r /
5.4
Polynomial rates in severely ill-posed problems. Consider ellipsoids with a2k D a2k 1 D e k
corresponding to analytic functions as in Section 5.2. Suppose that k D e k with > 0, so
that the estimation problem is severely-ill-posed. Show that the linear minimax risk
RL .; / QC 2.1
5.3
C O.k 1 //:
/ 2
1=.2/
C .C 2 =4/ C 2 :
1=
2 :
C1
157
Exercises
(iii) Let A be the set of values .; / for which
.k/ 1
.k/ C
k0
v .1
v /dv:
[It is conjectured that this holds for most or all > 0; > 0]. Show that
=.2C1/
. C 1/.2 C 1/ C 2
:
N D
2
1=
so long as .; N
/ 2 A.
(iv) Conclude that in these circumstances,
RS S .I /
e C c .=C /2.1
RL .I /
5.5
r/
for all > 0. [Here e is the constant in the limiting efficiency (5.14).]
(Pinsker proof: bounding .) Adopt assumptions (i), (ii) in the special case of Pinskers
theorem in Section 5.3. Let k be the smallest integer so that b2 .k C 1/ =2. Show that
X
i
ai .
ai /C b1 . =2/
k
X
i:
i D1
r.
6
Adaptive Minimaxity over Ellipsoids
However beautiful the strategy, you should occasionally look at the results. (Winston
Churchill)
An estimator that is exactly minimax for a given parameter set will depend, often quite
strongly, on the details of that parameter set. While this is informative about the effect of
assumptions on estimators, it is impractical for the majority of applications in which no
single parameter set comes as part of the problem description.
In this chapter, we shift perspective in order to study the properties of estimators that can
be defined without recourse to a fixed . Fortunately, it turns out that certain such estimators
can come close to being minimax over a whole class of parameter sets. We exchange exact
optimality for a single problem for approximate optimality over a range of circumstances.
The resulting robustness is usually well worth the loss of specific optimality.
The example developed in this chapter is the use of the James-Stein estimator on blocks
of coefficients to approximately mimic the behavior of linear minimax rules for particular
ellipsoids.
The problem is stated in more detail for ellipsoids in Section 6.1. The class of linear
estimators that are constant on blocks is studied in Section 6.2, while the blockwise JamesStein estimator appears in Section 6.3. The adaptive minimaxity of blockwise James-Stein
is established; the proof boils down to the ability of the James-Stein estimator to mimic the
ideal linear shrinkage rule appropriate to each block, as already seen in Section 2.6.
While the blockwise shrinkage approach may seem rather tied to the details of the sequence model, in fact it accomplishes its task in a rather similar way to kernel smoothers
or smoothing splines in other problems. This is set out both by heuristic argument and in a
couple of concrete examples in Section 6.4.
Looking at the results of our blockwise strategy (and other linear methods) on one of
those examples sets the stage for the focus on non-linear estimators in following chapters:
linear smoothing methods, with their constant smoothing bandwidth, are ill-equipped to deal
with data with sharp transitions, such as step functions. It will be seen later that the adaptive
minimax point of view still offers useful insight, but now for a different class of estimators
(wavelet thresholding) and wider classes of parameter spaces.
Section 6.5 is again an interlude, containing some remarks on fixed versus worst
case asymptotics and on superefficiency. Informally speaking, superefficiency refers to the
possibility of exceptionally good estimation performance at isolated parameter points. In
parametric statistics this turns out, fortunately, to be usually a peripheral issue, but examples
158
159
given here show that points of superefficiency are endemic in nonparametric estimation.
The dangers of over-reliance on asymptotics based on a single are illustrated in an example
where nominally optimal bandwidths are found to be very sensitive to aspects of the function
that are difficult to estimate at any moderate sample size.
In this chapter we focus on the white noise modelsome extensions of blockwise James
Stein to linear inverse problems are cited in the Notes.
2
.D f / L2 on periodic functions in L2 0; 1 when represented in the Fourier basis
(3.8): To recall, for ; C > 0, we have
2 .C / D f 2 `2 W
1
X
ak2 k2 C 2 g;
(6.1)
kD0
with a0 D 0 and a2k 1 D a2k D .2k/ for k 1. As we have seen in previous chapters,
Pinskers theorem delivers a linear estimator O .; C /, given by (5.5), which is minimax
linear for all > 0, and asymptotically minimax among all estimators as ! 0:
As a practical matter, the constants .; C / are generally unknown, and even if one believed a certain value .0 ; C0 / to be appropriate, there is an issue of robustness of MSE performance of O .0 ; C0 / to misspecification of .; C /: One possible way around this problem
is to construct an estimator family O , whose definition does not depend on .; C /; such that
if is in fact restricted to some 2 .C /; then O has MSE appropriate to that space:
sup
2 .C /
r.O ; / c ./RN . .C /; /
as ! 0;
(6.2)
160
I D N0 D f0; 1; 2; : : :g; and we suppose that the blocks are defined by an increasing
sequence flj ; j 0g N0 :
Bj D flj ; lj C 1; : : : ; lj C1
1g;
nj D lj C1
lj :
(6.3)
In some cases, the sequence lj and associated blocks Bj might depend on noise level . If
l0 > 0, then we add an initial block B 1 D f0; : : : ; l0 1g.
p
Particular examples might include lj D j for some > 0, or lj D e j : An dependent example is given by weakly geometric blocks, with L D log 1 and lj D
L .1C1=L /j 1 . However, we will devote particular attention to the case of dyadic blocks.
in which lj D 2j , so that the j th block has cardinality nj D 2j :
In the dyadic case, we consider a variant of the ellipsoids (6.1) that is defined using
weights that are constant on the dyadic blocks: al 2j if l 2 Bj D f2j ; : : : ; 2j C1 1g.
The corresponding dyadic Sobolev ellipsoids
X
X
(6.4)
D .C / D f W
22j
l2 C 2 g:
j 0
l2Bj
Let TD;2 denote the class of such dyadic ellipsoids fD .C /; ; C > 0g:
The two approaches are norm-equivalent: write kk2F ; for the squared norm corresponding to (6.1) and kk2D; for that corresponding to (6.4). Exercise 6.1 fills in the (finicky)
details. It is then easily seen that for all 2 `2 :
kkD; kkF ; 2 kkD; :
(6.5)
Remark. For wavelet bases, ellipsoid weights that are constant on dyadic blocks are the
natural way to represent mean-square smoothnesssee Section 9.6. In this case, the index
I D .j; k/; with j 0 and k 2 f0; : : : ; 2j 1g: There is a simple mapping of doubly
indexed coefficients j;k onto a single sequence l by setting l D 2j C k (including the
special case 1;0 $ 0 , compare (9.41)).
The success of octave based thinking in harmonic analysis and wavelet methodology
makes it natural to consider dyadic blocks and for definiteness we focus on this blocking
scheme.
Block diagonal linear estimators. This term refers to the subclass of diagonal linear
estimators in which the shrinkage factor is constant within blocks: for all blocks j :
Oj;cj .y/ D cj yj ;
cj 2 R:
The mean squared error on the j th block has a simple form, directly or from (3.11),
r .Oj;cj ; j / D nj 2 cj2 C .1
cj /2 kj k2 :
(6.6)
The corresponding minimax risk among block linear diagonal estimators is then
X
RBL .; / D inf sup
r .Oj;cj ; j /:
.cj /
The minimax theorem for diagonal linear estimators, Theorem 4.25, and its proof have an
immediate analog in the block case.
161
Proof As in the proof of Theorem 4.25, we apply the Kneser-Kuhn minimax theorem, this
time with payoff function, for c D .cj / and s D .si / D .i2 /; given by
Xh
X i
f .c; s/ D 2
nj cj2 C .1 cj /2
si :
j
i2Bj
Now consider the right side of (6.7). The minimization over .cj / can be carried out term
by term in the sum. The ideal shrinkage factor on the j th block is found by minimizing (6.6).
This yields shrinkage factor c IS .j / D kj k2 =.nj 2 C kj k2 / and the corresponding ideal
estimator OjIS .y/ D c IS .j /yj has ideal risk
r .OjIS ; j / D
nj 2 kj k2
;
nj 2 C kj k2
(6.8)
r .OjIS ; j /:
(6.9)
Block Linear versus Linear. Clearly RL .; / RBL .; /: In two cases, more can be
said:
(i) Call block symmetric if is invariant to permutations of the indices I within blocks.
A variant of the argument in Section 4.7 employing random block-preserving permutations
(instead of random signs) shows that if is solid, ortho- and block- symmetric, then
RL .; / D RBL .; /
(6.10)
See Exercise 6.4. The dyadic Sobolev ellipsoids D .C / are block symmetric and so are an
example for (6.10).
(ii) For general ellipsoids .a; C / as in (5.2), and a block scheme (6.3), measure the
oscillation of the weights ak within blocks by
ak
:
osc.Bj / D max
k;k 0 2Bj ak 0
It follows, Exercise 6.3, from the linear minimax risk formula (5.3) that if ak ! 1 and
osc.Bj / ! 1; then
RL .; / RBL .; /
as ! 0:
(6.11)
j
either lj D .j C 1/ for > 0; or lj D e in either case osc .Bj / D .lj C1 = lj / ! 1:
For weakly geometric blocks, osc.Bj / D .lj C1; = lj / D .1 C 1=L / ! 1. The block
sizes must necessarily be subgeometric in growth: for dyadic blocks, lj D 2j ; the condition
fails: osc .Bj / ! 2 :
162
j <L
<yj
BJS
JS
O
O
j .y/ D j .yj / L j < J :
:
0
j J
(6.13)
(6.14)
For the earliest blocks, specified by L, no shrinkage is performed. This may be sensible
because the blocks are of small size .nj 2/; or are known to contain very strong signal,
as is often the case if the blocks represent the lowest frequency components.
No blocks are estimated after J : Usually J is chosen so that l D lJ D 2 ; which is
proportional to the sample size n in the usual calibration. This restriction corresponds to not
attempting to estimate, even by shrinkage, more coefficients than there is data.
It is now straightforward to combine earlier results to obtain risk bounds for O BJS that
will also show in many cases that it is asymptotically minimax.
Theorem 6.2 In the homoscedastic white noise model, let O BJS denote the block JamesStein estimator (6.14).
(i) For dyadic blocks, consider TD;2 D fD .C /; ; C > 0g, and let J D log2 2 . Then
for each 2 TD;2 , the estimator O BJS is adaptive minimax as ! 0:
sup r .O BJS ; / RN .; /:
(6.15)
2
(ii) For more general choices of blocks, for each 2 T2 D f2 .C /; ; C > 0g assume
that osc.Bj / ! 1 as j ! 1 or maxkj osc.Bk / ! 1 as j ! 1 and ! 0 jointly.
Suppose that the block index J in (6.14) satisfies J D o. / for all > 0. Then adaptive
minimaxity (6.15) holds also for each 2 T2 .
For known noise level we see that, unlike the Pinsker linear minimax rule, which depends
on C and details of the ellipsoid weight sequence (here ), the block James-Stein estimator
has no adjustable parameters (other than the integer limits L and J ), and yet it can achieve
asymptotically the exact minimax rate and constant for a range of values of C and .
Some remarks on the assumptions in case (ii): A definition of J such as through lJ D
2 means that necessarily J ! 1 as ! 0. The oscillation condition prevents the block
sizes from being too large while the bound J ! 0 means that the block sizes cannot be
too small. Some further discussion of qualifying block sizes may be found after the proof.
163
Proof
and employ the structure of O BJS given in (6.14). On low frequency blocks, j < L, the
estimator is unbiased and contributes only variance terms nj 2 to MSE. On high frequency
blocks, j J , only a bias term kj k2 is contributed. On the main frequency blocks,
L j < J , we use the key bound (6.13). Assembling the terms, we find
r .O BJS ; / .lL C 2J
2L/ 2 C
JX
1
r .OjIS ; j / C
j DL
l2 :
(6.16)
ll
In view of (6.9), the first right-side sum is bounded above by the block linear minimax
risk RBL .; /. Turning to the second sum, for any ellipsoid .a; C / with al % 1; define
the (squared) maximal tail bias
X
X
l2 W
al2 l2 C 2 D C 2 al 2 :
./ D sup
(6.17)
ll
(6.18)
164
In fact, this last problem arises from the bound 2 2 in (6.13), and could be reduced by
using a modified estimator
j nj 2
O
j D 1
yj ;
j J ;
(6.19)
kyj k2 C
with j D 1 C tj with tj > 0. This reduces the error at zero to essentially a large deviation
probability, see e.g. Brown et al. (1997), who use dyadic blocks and tj D 1=2, or Cavalier
and Tsybakov (2001) who use smaller blocks and tj .nj 1 log nj /1=2 .
2. Depending on the value of j close to 1 or largerone might prefer to refer to (6.19) as
a block shrinkage or a block thresholding estimator. As just noted, the value of j determines
the chance that Oj 0 given that j D 0, and this chance is small in the block thresholding
regime.
The use of block thresholding in conjunction with smaller sized blocks of wavelet coefficients has attractive MSE properties, even for function spaces designed to model spatial inhomogeneity. For example, Cai (1999) uses blocks of size log n D log 2 and j D 4:505.
We return to an analysis of a related block threshold approach in Chapters 8 and 9.
3. The original Efromovich and Pinsker (1984) estimator set
(
nj 2
kyj k2 .1 C tj /nj 2 ;
1 ky
2 yj ;
jk
O
j D
(6.20)
0
otherwise
for tj > 0 and j J . To prove adaptive minimaxity over a broad class of P
ellipsoids (5.2),
they required in part that nj C1 =nj ! 1 and tj ! 0, but slowly enough that j 1=.tj3 nj / <
1. The class of estimators (6.19) is smoother, being continuous and weakly differentiable.
Among these, the Block James-Stein estimator (6.12) makes the particular choice j D
.nj 2/=nj < 1 and has the advantage that the oracle bound (6.13) deals simply with the
events fOj D 0g in risk calculations.
4. Theorem 6.2 is an apparently more precise result than was established for Holder
classes in the white noise case of Proposition 4.22, where full attention was not given to
the constants. In fact the preceding argument goes through, since 1 .C / defined in (4.53)
satisfies all the required conditions, including block symmetry. However, D 1 .C /
lacks a simple explicit value for RN .; /, even asymptotically, though some remarks can
be made. Compare Theorem 14.2 and Section 14.4.
165
As examples, we cite
1. Weighted Fourier series. The function decreases with increasing frequency, corresponding to a downweighting of signals at higher frequencies. The parameter h controls the
actual location of the cutoff frequency band.
2. Kernel estimators.
in Section 3.3 that in the time domain, the estimator has
R 1 We saw
1
O
the form .t/ D h K.h .t s//d Y .s/; for a suitable kernel function K./, typically
symmetric about zero. The parameter h is the bandwidth of the kernel. Representation (6.21)
follows after taking Fourier coefficients. Compare Lemma 3.7 and the examples given there.
3. Smoothing splines. We saw in Sections 1.4 and 3.4 that the estimator Ok minimizes
X
X
.yk k /2 C
k 2r k2 ;
where
penalty term viewed in the time domain takes the form of a derivative penalty
R r the
.D f /2 for some integer r: In this case, Ok again has the representation (6.21) with D h4
and .hk/ D 1 C .hk/2r 1 :
In addition, many methods of choosing h or from the data y have been shown to be
asymptotically equivalent to first order these include cross validation, Generalized cross
validation, Rices method based on unbiased estimates of risk, final prediction error, Akaike
information criterionsee e.g. Hardle et al. (1988) for details and literature references. In
this section we use a method based on an unbiased estimate of risk.
The implication of the adaptivity result Theorem 6.2 is that appropriate forms of the
block James-Stein estimator should perform approximately as well as the best linear (or nonlinear) estimators, whether constructed by Fourier weights, kernels or splines, and without
the need for an explicit choice of smoothing parameter from the data.
We will see this in examples below, but first we give an heuristic explanation of the close
connection of these linear shrinkage families with the block James-Stein estimator (6.12).
Consider a Taylor expansion of .s/ about s D 0: If the time domain kernel K.t / corresponding to is even about 0, then the odd order terms vanish and .s/ D 1 C 2 s 2 =2 C
4 s 4 =4C: : : , so that for h small and a positive even integer q we have .hk/ 1 bq hq k q ,
compare (3.31).
Now consider grouping the indices k into blocks Bj for example, dyadic blocks Bj D
fk W 2j < k 2j C1 g: Then the weights corresponding to two indices k; kN in the same block
are essentially equivalent: k 2r =kN 2r 2 2 2r ; 22r so that we may approximately write
Ok .1
cj /yk ;
k 2 Bj :
(6.22)
Here cj depends on h, but this is not shown explicitly, since we are about to determine cj
from the data y anyway.
For example, we might estimate cj using an unbiased risk criterion, as described in Sections 2.5 and 2.6. Putting C D .1 cj /Inj in the Mallowss CL criterion (2.53) yields
Ucj .y/ D nj 2
(6.23)
[As noted below (2.58), this formula also follows from Steins unbiased risk estimator applied to Oj .y/ D yj cj yj ]. The value of cj that minimizes (6.23) is cOj D nj 2 =kyj k2 ;
which differs from the James-Stein estimate (6.12) only in the use of nj rather than nj 2:
Thus, many standard linear methods are closely related to the diagonal linear shrinkage
estimator (6.22). In the figures below, we compare four methods:
166
1. LPJS: apply block James-Stein estimate (6.14) on each dyadic block in the Fourier frequency domain: O LPJS .y/ D .OjLPJS .yj //. Dyadic blocking in the frequency domain is
a key feature of Littlewood-Paley theory in harmonic analysis.
2. WaveJS: apply the James-Stein estimate (6.14) on each dyadic block in a wavelet coefficient domain: the blocks yj D .yj k ; k D 1; : : : ; 2j /.
R
3. AutoSpline: Apply a smoothing spline for the usual energy penalty .f 00 /2 using a regularization parameter O chosen by minimizing an unbiased estimator of risk.
4. AutoTrunc: In the Fourier frequency domain, use a cutoff function: .hl/
O
D I fl h 1 g
and choose the location of the cutoff by an unbiased risk estimator.
Implementation details. Let the original time domain data be Y D .Y.l/; l D 1; : : : ; N / for N D 2J .
The discrete Fourier transform (DFT), e.g. as implemented in MATLAB, sets
y./ D
N
X
Y.l/e 2 i.l
1/. 1/=N
D 1; : : : ; N:
(6.24)
lD1
P
If the input Y is real, the output y 2 CN must have only N (real) free parameters. Indeed y.1/ D N
1 Y.l/
PN
and y.N=2 C 1/ D 1 . 1/l Y.l/ are real, and for r D 1; : : : ; N=2 1, we have conjugate symmetry
y.N=2 C 1 C r/ D y.N=2 C 1
r/:
(6.25)
Thus, to build an estimator, one can specify how to modify y.1/; : : : ; y.N=2 C 1/ and then impose the
constraints (6.25) before transforming back to the time domain by the inverse DFT.
1. (LPJS). Form dyadic blocks
yj D fRe.y.//; Im.y.// W 2j
< 2j g
for j D 2; : : : ; J 1. Note that nj D #.yj / D 2j . Apply the James Stein estimator (6.12) to each yj ,
while leaving y./ unchanged for D 0; 1; 2. Thus L D 2, and we take 2 D .N=2/ 2 , in view of (6.38).
2. (WaveJS). Now we use a discrete wavelet transform instead of the DFT. Anticipating the discussion
in the next chapter, Y is transformed into wavelet coefficients .yj k ; j D L; : : : ; J 1; k D 1; : : : ; 2j / and
scaling coefficients .yQLk ; k D 1; : : : ; 2L /. We use L D 2; J D J and the Symmlet 8 wavelet, and apply
Block James Stein to the blocks yj D .yj k W k D 1; : : : ; 2j /; while leaving the scaling coefficients yQL
unchanged.
3. (Autospline). We build on the discussion of periodic splines in Section 3.4. There is an obvious
relabeling of indices so that in the notation of this section, D 1 corresponds to the constant term, and
each > 1 to a pair of indices 2. 1/ 1 and 2. 1/. Hence, linear shrinkage takes the form O ./ D
c ./y./ with
c ./ D 1 C .
1/4
Note that c ./ is real and is the same for the cosine and sine terms. We observe that c1 ./ D 1 and
decree, for simplicity, that cN=2C1 ./ D 0. Then, on setting d D 1 c and applying Mallows CL
formula (2.53), we get an unbiased risk criterion to be minimized over :
U./ D N C
N=2
X
d ./2 jy./j2
4d ./;
D2
4. (AutoTruncate). The estimator that cuts off at frequency 0 is, in the frequency domain,
(
y./ 0
O0 ./ D
0
> 0 :
167
Using Mallows Cp , noting that each frequency corresponds to two real degrees of freedom, and neglecting
terms that do not change with 0 , we find that the unbiased risk criterion has the form
U0 .y/ D
N C 40 C
N=2
X
jy./j2 ;
0 2 f1; : : : ; N=2g:
0 C1
Canberra Temps
LPJS, WaveJS
20
20
15
15
10
10
10
0.2
0.4
0.6
0.8
10
0.2
AutoSpline
20
15
15
10
10
0.2
0.4
0.6
0.6
0.8
0.8
AutoTruncate
20
10
0.4
0.8
10
0.2
0.4
0.6
Figure 6.1 Top left: Canberra temperature data from Figure 1.1. Top right: block
James-Stein estimates in the Fourier (solid) and wavelet (dashed) domains. Bottom panels:
linear spline and truncation smoothers with bandwidth parameter chosen by minimizing an
unbiased risk criterion.
These are applied to two examples: (a) the minimum temperature data introduced in Section 1.1, and (b) a blocky step function with simulated i.i.d. Gaussian noise added. For
simplicity in working with dyadic blocks, we have chosen a subset of N D 256 days 1 . The
temperature data has correlated noise, so our theoretical assumptions dont hold exactly. Indeed, one can see the different noise levels in each wavelet band (cf Chapter 7.5). We used
an upper bound of O D 5 in all cases. Also, the underlying function is not periodic over this
range and forcing the estimator to be so leads to somewhat different fits than in Figure 1.1;
the difference is not central to the discussion in this section.
1
For non-dyadic sample sizes, see Section 7.8 for some references discussing wavelet transforms. In the
Fourier domain, one might simply allow the block of highest frequencies to have dimension N 2log2 N .
168
Noisy Blocks
20
25
15
20
15
10
10
5
5
0
0
5
10
0.2
0.4
0.6
0.8
10
0.2
LPJS
25
20
20
15
15
10
10
0.2
0.4
0.6
0.8
0.8
AutoSpline
25
10
0.4
0.6
0.8
10
0.2
0.4
0.6
Figure 6.2 Top panels: A blocky step function with i.i.d Gaussian noise added, N =
2048. Bottom panels: selected reconstructions by block James-Stein and by smoothing spline
(with data determined ) fail to remove all noise.
The qualitative similarity of the four smoothed temperature fits is striking: whether an
unbiased risk minimizing smoothing parameter is used with splines or Fourier weights, or
whether block James-Stein shrinkage is used in the Fourier or wavelet domains. The similarity of the linear smoother and block James-Stein fits was at least partly explained near
(6.22).
The similarity of the Fourier and wavelet James-Stein reconstructions may be explained
as follows. The estimator (6.22) is invariant with respect to orthogonal changes of basis for
the vector yj D .yk W k 2 Bj /. To the extent that the frequency content of the wavelets
spanning the wavelet multiresolution space Wj is concentrated on a single frequency octave
(only true approximately), it represents an orthogonal change of basis from the sinusoids
belonging to that octave. The James-Stein estimator (6.12) is invariant to such orthogonal
basis changes.
The (near) linear methods that agree on the temperature data also give similar, but now
unsatisfactory, results on the blocky example, Figure 6.2. Note that none of the methods are
effective at simultaneously removing high frequency noise and maintaining the sharpness of
jumps and peaks.
169
It will be the task of the next few chapters to explain why the methods fail, and how
wavelet thresholding can succeed. For now, we just remark that the blocky function, which
evidently fails to be differentiable, does not belong to any of the ellipsoidal smoothing
classes 2 .C / for 1=2, based on the expectation that the Fourier coefficients decay
at rate O.1=k/. Hence the theorems of this and the previous chapter do not apply to this
example.
Discussion
Visualizing least favorable distributions. Pinskers theorem gives an explicit construction
of
asymptotically least favorable distribution associated with the ellipsoid D f W
P the
ai2 i2 C 2 g: simply take independent variables i N.0; i2 /, with i given by (5.6).
Recalling that the i can be thought of as coefficients of the unknown function in an orthonormal basis f'
Pi g of L2 0; 1, it is then instructive to plot sample paths from the random
function X.t/ D i 'i .t/.
Figure 6.3 shows two such sample paths in the trigonometric basis (3.8) corresponding to
smoothness m D 1 and m D 2 in (3.9). Despite the different levels of smoothness, notice
that the spatial homogeneity in each casethe degree of oscillation within each figure is
essentially constant as one moves from left to right in the domain of the function.
2
1.5
0.5
1
0.5
0
0.5
1
0.5
1.5
2
1
0
200
400
600
800
1000
200
400
600
800
1000
Figure 6.3 Sample paths from two Gaussian priors corresponding to (5.6) in
Pinskers theorem, which are near least favorable for ellipsoids in the trigonometric
basis. In both cases D 0:5 and C D 500. Left: mean square derivatives D 1.
Right D 2. The corresponding computed values of .; C / are 228.94 and 741.25
respectively.
Challenges to the ellipsoid model. Of course, not all signals of scientific interest will
necessarily have this spatial homogeneity. Consider the NMR signal in Figure 1.2 or the
plethysmograph signal in Figure 6.4. One sees regions of great activity or oscillation in
the signal, and other regions of relative smoothness.
Comparing sample paths from the Gaussian priors with the data examples, one naturally
suspects that the ellipsoid model is not relevant in these cases, and asks whether linear
estimators are likely to perform near optimally (and in fact, they dont).
170
0.4
0.2
0.0
Voltage
0.6
0.8
1240
1260
1280
1300
Time (secs)
Another implicit challenge to the ellipsoid model and the fixed bandwidth smoothers
implied by (5.5) comes from the appearance of smoothing methods with locally varying
bandwidth such as LO(W)ESS, Cleveland (1979). We will see the locally varying bandwidth
aspect of wavelet shrinkage in Section 7.5.
Commentary on the minimax approach. One may think of minimax decision theory as a
method for evaluating the consequences of assumptionsthe sampling model, loss function,
and particularly the structure of the postulated parameter space : The results of a minimax
solution consist, of course, of the minimax strategy, the least favorable prior, the value, and
also, information gained in the course of the analysis.
The minimax method can be successful if the structure of is intellectually and/or scientifically significant, and if it is possible to get close enough to a solution of the resulting
minimax problem that some significant and interpretable structure emerges. Pinskers theorem is an outstanding success for the approach, since it yields an asymptotically sharp
solution, along with the important structure of linear estimators, independent Gaussian least
favorable priors, decay of shrinkage weights with frequency to a finite cutoff, and so on. For
some datasets, as we have seen, this is a satisfactory description.
The clarity of the solution, paradoxically, also reveals some limitations of the formulation.
The juxtaposition of the Pinsker priors and some other particular datasets suggests that for
some scientific problems, one needs richer models of parameter spaces than ellipsoids. This
is one motivation for the introduction of Besov bodies in Chapter 9.6 below.
171
the unknown function is kept fixed, and the risk behavior of an estimator sequence O is
analysed as ! 0. Asymptotic approximations might then be used to optimize parameters
of the estimatorsuch as bandwidths or regularization parametersor to assert optimality
properties.
This mode of analysis has been effective in large sample analysis of finite dimensional
models. Problems such as superefficiency are not serious enough to affect the practical implications widely drawn from Fishers asymptotic theory of maximum likelihood.
In nonparametric problems with infinite dimensional parameter spaces, however, fixed
asymptotics is more fragile. Used with care, it yields useful information. However, if
optimization is pushed too far, it can suggest conclusions valid only for implausibly large
sample sizes, and misleading for actual practice. In nonparametrics, superefficiency is more
pervasive: even practical estimators can exhibit superefficiency at every parameter point, and
poor behaviour in a neighbourhood of any fixed parameter point is a necessary property of
every estimator sequence.
After reviewing Hodges classical example of parametric superefficiency, we illustrate
these points, along with concluding remarks about worst-case and minimax analysis.
E .O
[For this subsection, D R:] Hodges counterexample modifies the MLE O .y/ D y in a
shrinking neighborhood of a single point:
(
p
0 jyj <
O
.y/ D
y otherwise:
p
D
1
p
in clear violation of the Fisherian program. A fuller introduction to this and related superefficiency issues appears in Lehmann and Casella (1998, Section 6.2). Here we note two
phenomena which are also characteristic of more general parametric settings:
(i) points of superefficiency are rare: in Hodges example, only at D 0: More generally,
172
r .O ; /
1:
RN .; /
(6.26)
(ii) Superefficiency entails poor performance at nearby points. For Hodges example, conp
p
sider D =2: Since the threshold zone extends 1=.2 / standard deviations to the
p
right of , it is clear that O makes a squared error of . =2/2 with high probability, so
p
p
:
2 r.O ; =2/ D 2 . =2/2 ! 1: Consequently
sup
p
j j
r.O ; /
! 1:
RN .; /
(6.27)
Le Cam, Huber and Hajek showed that more generally, superefficiency at 0 forces poor
properties in a neighborhood of 0 : Since broadly efficient estimators such as maximum likelihood are typically available with good risk properties, superefficiency has less relevance in
parametric settings.
Remark. Hodges estimator is an example of hard thresholding, to be discussed in some
detail for wavelet shrinkage in non-parametric estimation. It is curious that the points of
superefficiency that are unimportant for the one-dimensional theory become essential for
sparse estimation of high dimensional signals.
(6.28)
173
and (6.8): indeed, since L D 2 and ab=.a C b/ min.a; b/, we may write
X
X
r .O BJS ; / 2J 2 C
min.nj 2 ; kj k2 / C
l2 :
j
l>
The proof of Theorem 6.2 showed that the first and third terms were o. 2r /; uniformly over
2 : Consider, therefore, the second term, which we write as R1 .; /: For any j ; use
the variance component below j and the bias term thereafter:
X
R1 .; / 2j 2 C 2 2j
22j kj k2 :
j j
To show that R1 .; / D o. 2r /; first fix a > 0 and then choose j so that 2j 2 D 2r :
[Of course, j should be an integer, but there is no harm in ignoring this point.] It follows
that 2 2j D 2 2r ; and so
X
2r R1 .; / C 2
22j kj k2 D C o.1/;
j j
since the tail sum vanishes as ! 0; for 2 .C /: Since > 0 is arbitrary, this shows
that R1 .; / D o. 2r / and establishes (6.28).
The next result shows that for every consistent estimator sequence, and every parameter
point 2 `2 ; there exists a shrinking `2 neighborhood of over which the worst case risk
of the estimator sequence is arbitrarily worse than it is at itself. Compare (6.27). In parametric settings, such as the Hodges example, this phenomenon occurs only for unattractive,
superefficient estimators, but in nonparametric estimation the property is ubiquitous. Here,
neighborhood refers to balls in `2 norm: B.0 ; / D f W k 0 k2 < g: Such neighborhoods do not have compact closure in `2 , and fixed asymptotics does not give any hint of
the perils that lie arbitrarily close nearby.
Proposition 6.4 Suppose that O is any estimator sequence such that r .O ; 0 / ! 0: Then
there exists ! 0 such that as ! 0;
r .O ; /
! 1:
2B.0 ; / r .O ; 0 /
(6.29)
sup
Remark. The result remains true if the neighborhood B.0 ; / is replaced by its intersection with any dense set: for example, the class of infinitely differentiable functions.
p
Proof Let
2 D r .O ; 0 / W we show that D
will suffice for the argument. The
proof is a simple consequence of the fact that B.1/ D f W kk2 1g is not compact
(compare Theorem 5.6 or the example following Theorem 4.25), so that RN .B.1/; /
c0 > 0 even as ! 0: All that is necessary is to rescale the estimation problem by defining
N D 1 . 0 /; yN D 1 .y 0 /; N D 1 ; and so on. Then yN D N C N z is an instance
of the original Gaussian sequence model, and B.0 ; / corresponds to the unit ball B.1/.
Rescaling the estimator also via ON .y/
N D 1 O .y/ 0 ;
2 EkO
N 2;
k
174
! 1:
l<0
As in Section 3.3 and 6.4, represent a kernel estimator in the Fourier domain by diagonal
shrinkage
Oh;l D .hl/yl ;
(6.32)
R i st
K.t/dt is the Fourier transform of kernel K. The qth order moment
where .s/ D e
condition becomes a statement about derivatives at zero, cf. (3.31). To simplify calculations,
we use a specific choice of qth order kernel:
.s/ D .1
jsjq /C :
(6.33)
With this kernel, the mean squared error of (6.32) can be written explicitly as
X
X
r .Oh ; / D
2 .1 jhljq /2 C jhlj2q l2 C
l2 :
jljh
jlj>h
(6.34)
C bq . /h2q ;
175
0.8
0.6
0.4
0.2
0.2
0.4
0.6
0.8
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
(6.36)
with cq D 1 C .2q/ 1 . Thus the rate of convergence, 2q=.2q C 1/, reflects only the order
of the kernel used and nothing of the properties of f . Although this already is suspicious,
it would seem, so long as f is smooth, that the rate of convergence can be made arbitrarily
close to 1; by using a kernel of sufficiently high order q:
However, this is an over literal use of fixed asymptotics a hint of the problem is already
suggested by the constant term in (6.36), which depends on bq . / and could grow rapidly
with q: However, we may go further and do exact MSE calculations with formula (6.34)
using kernel (6.33). As specific test configurations in (6.31) we take
8
3
l even; l 2 l1 ; l2
<jlj
3
l D c.l1 ; l2 / jlj
(6.37)
l odd; l 2 l1 ; l2
:
0
otherwise;
and
c.l1 ; l2 / chosen so that a Sobolev 2nd derivative smoothness condition holds:
P 4with
l l2 D C 2 : Two choices are
(I) l1 D 4;
l2 D 20;
(II) l1 D 4;
l2 D 400;
C D 60;
C D 60:
176
10
10
10
10
10
10
10
10
10
10
Figure 6.6 MSE of ideal bandwidth choice for II : r .Oh . II / ; II / resulting from
qth order optimal bandwidth (6.35) for q D 2; 4; 8 with exact risks calculated using
(6.34). Also shown is the upper bound (6.16) for the risk of the dyadic blocks James
Stein estimator (6.14).
q=2
q=4
q=8
DyadicJS
10
10
10
10
10
10
10
10
10
10
Figure 6.7 Corresponding plot of MSEs and James-Stein bound for ideal
bandwidth choice for I .
The 4th order kernel will dominate q D 2 for n somewhat larger than 106 ; but q D 8 will
dominate only at absurdly large sample sizes.
Figure 6.7 shows that the situation is not so bad in the case of curve I : because the higher
frequencies are absent, the variance term in (6.34) is not so inflated in the q D 8 case.
177
However, with moderate noise levels ; a test would not be able to discriminate beween
and II : This is an instance of the nearby instability of MSE seen earlier in this section.
We can also use (6.35) to compute the relative size of optimal bandwidths for the two
functions, using Rq D h;q .1 /= h;q .2 / as a function of q. Indeed, for q D 2; 4; 8, one
computes that Rq D 1; 2:6 and 6:8.
Thus, at least for q > 2; both h . / and r.Oh ; / are very sensitive to aspects of the function that are difficult or impossible to estimate at small sample sizes. The fixed expansions
such as (6.30) and (6.36) are potentially unstable tools.
I
Remarks. 1. Block James Stein estimation. Figures 6.6 and 6.7 also show the upper
bounds (6.16) for the MSE of the dyadic blocks James-Stein estimator, and it can be seen
that its MSE performance is generally satisfactory, and close to the q D 2 kernel over small
sample sizes. Figure 6.8 compares the ratio r .O BJS ; /=r .O q ; / of the Block JS mean
squared error to the qth order kernel MSE over a much larger range of n D 2 : The James
Stein MSE bound is never much worse than the MSE of the qth order optimal bandwidth,
and in many cases is much better.
2. Smoothness assumptions. Since I and II have finite Fourier expansions, they are
certainly C 1 ; but here they behave more like functions with about two square summable
derivatives. Thus from the adaptivity Theorem 6.2, for large, one expects that Block JS
should eventually improve on the q D 4 and q D 8 kernels, and this indeed occurs in Figure
6.8 on the right side of the plot. However, the huge sample sizes show this theoretical
to be impractical. Such considerations point toward the need for quantitative measures of
smoothnesssuch as Sobolev or Besov normsthat combine the sizes of the individual
coefficients rather than qualitative hypotheses such as the mere existence of derivatives.
Risk ratio vs. sample size
10
10
JS/
JS/
JS/
JS/
JS/
JS/
q
q
q
q
q
q
=
=
=
=
=
=
2 , J = 20
4
8
2 , J = 400
4
8
5
10
10
10
Figure 6.8 Ratio of James Stein MSE bound to actual MSE for kernels of order
q D 2; 4; 8 at D I (dotted) and II (solid) over a wide range of sample sizes
n D 2.
3. Speed limits. There is a uniform version of (6.36) that says that over ellipsoids of
178
functions with mean-square derivatives, the uniform rate of convergence using the qth
order kernel is at best . 2 /2q=.2qC1/ ; no matter how large is. By contrast, the adaptivity
results of Theorem 6.2 (and its extensions) for the block James-Stein estimate show that it
suffers no such speed limit, and so might effectively be regarded as acting like an infinite
order kernel. (Exercise 1 below has further details.)
Concluding discussion. Worst case analysis is, in a way, the antithesis of fixed analysis.
The least favorable configurationwhether parameter point or prior distribution will
generally change with noise level . This is natural, since the such configurations represent
the limit of resolution attainable, which improves as the noise diminishes.
The choice of the space to be maximized over is certainly critical, and greatly affects
the least favorable configurations found. This at least has the virtue of making clearer the
consequences of assumptionsfar more potent in nonparametrics, even if hidden. It might
be desirable to have some compromise between the local nature of fixed asymptotics, and
the global aspect of minimax analysisperhaps in the spirit of the local asymptotic minimax approach used in parametric asymptotics. Nevertheless, if one can construct estimators
that deal successfully with many least favorable configurations from the global minimax
frameworkas in the blockwise James-Stein constructionsthen one can have some degree of confidence in such estimators for practical use in settings not too distant from the
assumptions.
6.6 Notes
2 and 3. The first results in the adaptive minimax setting are due to Efromovich and Pinsker (1984), who
pioneered the use of estimator (6.20), and Golubev (1987). The approach of 2 and 3 follows that of
Donoho and Johnstone (1995).
Cavalier and Tsybakov (2001) introduce the term penalized blockwise Stein rule for the variant (6.19),
and use it to establish sharp oracle inequalities and sharp asymptotic minimaxity results for very general
classes of ellipsoids, along with near optimal results for Besov spaces. They also emphasize the use of
weakly geometric blocks, which were studied by Nemirovski (2000). Cavalier and Tsybakov (2002) extend the penalized blockwise Stein approach to linear inverse problems. Efromovich (2004b) establishes
a similar oracle inequality for the Efromovich-Pinsker estimator (6.20) under weaker assumptions on the
noise model. In the spirit of the extension of these results to other nonparametric models, as discussed in
Section 3.11, we mention the sharp adaptivity results of Efromovich and Pinsker (1996) for nonparametric
regression with fixed or random design and heteroscedastic errors. Rigollet (2006) has a nice application of
these ideas to adaptive density estimation on R.
We have focused on functions of a single variable: Efromovich (2010) gives an example of use of thresholding and dyadic blocking for a series estimator in a fairly flexible multivariate setting.
4. The comparison of linear methods draws from Donoho and Johnstone (1995) and Donoho et al.
(1995). Johnstone (1994) has more on drawing sample paths from least favorable and near least favorable
priors on ellipsoids and Besov balls.
5. The first part of this section borrows from Brown et al. (1997), in particular Proposition 6.3 is a
version of Theorem 6.1 there. van der Vaart (1997) gives a review of the history and proofs around superefficiency. These articles contain full references to the work of Le Cam, Huber and Hajek.
The exact risk analysis is inspired by the study of density estimation in Marron and Wand (1992), which
in turn cites Gasser and Muller (1984). Of course, the density estimation literature also cautions against the
use of higher order (q > 2) kernels due to these poor finite sample properties. We did not try to consider
the behavior of plug-in methods that attempt to estimate h ./ variability in the data based estimates of
h ./ would of course also contribute to the overall mean squared error. Loader (1999) provides a somewhat
critical review of plug-in methods in the case q D 2.
While the choice q D 8 may seem extreme in the setting of traditional density estimation, it is standard
to use wavelets with higher order vanishing moments for example, the Daubechies Symmlet 8 discussed
179
Exercises
in Daubechies (1992, p. 198-199) or Mallat (1998, p. 252), see also Chapter 7.1. Analogs of (6.30) and
(6.36) for wavelet based density estimates appear in Hall and Patil (1993), though of course these authors
do not use the expansions for bandwidth selection.
Exercises
6.1
(Equivalence of Fourier and dyadic Sobolev norms.) Fix > 0. In the Fourier ellipsoid case,
let a0 D 0 and a2k 1 D a2k D .2k/ for k 1. For the dyadic case, let aQ 0 D 0 and aQ l D 2j
if l D 2j C k for j 0 and k D 0; 1; : : : ; 2j 1. For sequences D .l ; l 0/ define
semi-norms
1
1
X
X
jj2F ; D
al2 l2 ;
jj2D; D
aQ l2 l2 :
lD1
lD1
6.2
al aQ l al :
(c) Define the norms kk2F ; D 02 C jj2F ; , and kk2D; D 02 C j j2D; , and verify the
inequalities (6.5).
(Speed limits for qth order kernels.)
We have argued that in the Gaussian sequence model in the Fourier basis, it is reasonable to
think of a kernel estimate with bandwidth h as represented by Oh;l D .hl/yl :
(a) Explain why it is reasonable to express the statement K is a qth order kernel, q 2 N, by
the assumption .s/ D 1 cq s q C o.s q / as s ! 0 for some cq 0:
P 2 2
(b) Let .C / D f W
al l C 2 g with a2l 1 D a2l D .2l/ be, as usual, an ellipsoid of
mean square differentiable functions. If K is a qth order kernel in the sense of part (a), show
that for each > q;
inf
sup
h>0 2 .C /
6.3
[Thus, for a second order kernel, the (uniform) rate of convergence is n 4=5 , even if we consider
ellipsoids of functions with 10 or 106 derivatives. Since the (dyadic) block James Stein estimate
has rate n 2=.2C1/ over each .C /, we might say that it corresponds to an infinite order
kernel.]
P 2 2
(Oscillation within blocks.) Let .a; C / be an ellipsoid f.i / W
ai i C 2 g. Assume that
ai % 1. Let blocks Bj be defined as in (6.3) and the oscillation of ai within blocks by
osc.Bj / D max
l;l 0 2Bj
al
:
al 0
as ! 0:
(Block linear minimaxity.) Show that if is solid, orthosymmetric and block-symmetric, then
RL .; / D RBL .; /
180
6.5
6.6
(Time domain form of kernel (6.33)). Let L.t / D sin t =. t /, and assume, as in (6.33), that
.s/ D .1 jsjq /C . Show that the corresponding time domain kernel
K.t / D L.t /
6.7
6.8
(6.38)
. i /q L.q/ .t /:
Make plots of K for q D 2; 4 and compare with Figure 3.1. Why is the similarity not surprising?
(Superefficiency for Block James-Stein.) In Proposition 6.3, suppose that 2 .C / is given. Show
that conclusion (6.28) holds for any blocking scheme (6.3) for which J 2 D o. 2r /.
(Exact risk details.) This exercise records some details leading to Figures 6.56.8.
(i) For vectors x; X 2 CN , the inverse discrete Fourier transform x = ifft(X) sets x.j / D
P
2 i.j 1/.k 1/=N ; j D 1; : : : ; N . Suppose now that
N 1 N
kD1 X.k/e
X.1/ D N 0 ;
ReX.l C 1/ D N l ;
ImX.l C 1/ D N
for 1 l < N=2 and X.k/ D 0 for k > N=2. Also, set tj D j=N . Verify that
Rex.j / D f .tj
1 / D 0 C
N=2
X
l cos 2ltj
C
sin 2ltj
j D 1; : : : ; N:
1;
lD1
(ii) Consider the sequence model in the form yl D l C zl for l 2 Z. For the coefficients
specified by (6.37) and below, show that the risk function (6.34) satisfies
r.Oh ; / D 2 C 2 2
lh
X
1
2
.hl/q 2 C h2q C12
lX
2 ^lh
j 2q
l2
X
2
C C12
lh C1
lDl1
Pl2
2 D C 2=
where lh D h 1 and C12
j 2:
lDl1
(iii) Introduce functions (which also depend on l1 ; l2 and C )
V .m; nI h; q/ D
n
X
.hl/q 2 ;
2
B.m; nI p/ D C12
lDm
n^l
X2
jp
lDm_l1
l2
X
j 2q
D B.l1 ; l2 I 2q/
l1
JX
1
bD2
where Bb D
B.2b 1
1; 2b I 0/
and B D
B.2J 1
nb Bb
C B ;
nb C Bb n
C 1; l2 I 0/.
7
A Primer on Estimation by Wavelet Shrinkage
When I began to look at what Meyer had done, I realized it was very close to some ideas in
image processing. Suppose you have an image of a house. If you want to recognize simply
that it is a house, you do not need most of the details. So people in image processing had
the idea of approaching the images at different resolutions. (Stephane Mallat, quoted in
New York Times.)
When an image arrives on a computer screen over the internet, the broad outlines arrive
first followed by successively finer details that sharpen the picture. This is the wavelet transform in action. In the presence of noisy data, and when combined with thresholding, this
multiresolution approach provides a powerful tool for estimating the underlying object.
Our goal in this chapter is to give an account of some of the main issues and ideas behind
wavelet thresholding as applied to equally spaced signal or regression data observed in noise.
The purpose is both to give the flavor of how wavelet shrinkage can be used in practice,
as well as provide the setting and motivation for theoretical developments in subsequent
chapters. Both this introductory account and the later theory will show how the shortcomings
of linear estimators can be overcome by appropriate use of simple non-linear thresholding.
We do not attempt to be encyclopedic in coverage of what is now a large area, rather we
concentrate on orthogonal wavelet bases and the associated multiresolution analyses for
functions of a single variable.
The opening quote hints at the interplay between disciplines that is characteristic of
wavelet theory and methods, and so is reflected in the exposition here.
Section 7.1 begins with the formal definition of a multiresolution analysis (MRA) of
square integrable functions, and indicates briefly how particular examples are connected
with important wavelet families. We consider decompositions of L2 .R/ and of L2 .0; 1/,
though the latter will be our main focus for the statistical theory.
This topic in harmonic analysis leads directly into a signal processing algorithm: the twoscale relations between neighboring layers of the multiresolution give rise in Section 7.2 to
filtering relations which, in the case of wavelets of compact support, lead to the fast O.n/
algorithms for computing the direct and inverse wavelet transforms on discrete data.
Section 7.3 explains in more detail how columns of the discrete wavelet transform are related to the continuous wavelet and scaling function of the MRA, while Section 7.4 describes
the changes needed to adapt to finite data sequences.
Finally in Section 7.5 we are ready to describe wavelet thresholding for noisy data using the discrete orthogonal wavelet transform of n D 2J equally spaced observations. The
181
182
hidden sparsity heuristic is basic: the wavelet transform of typical true signals is largely
concentrated in a few co-ordinates while the noise is scattered throughout, so thresholding
will retain most signal while suppressing most noise.
How the threshold itself is set is a large question we will discuss at length. Section 7.6
surveys some of the approaches that have been used, and for which theoretical support exists.
The discussion in these two sections is informal, with numerical examples. Corresponding
theory is developed in later chapters.
Vj Vj C1 ;
f .x/ 2 Vj if and only if f .2x/ 2 Vj C1 ; 8j 2 Z;
[j 2Z Vj D L2 .R/:
\j 2Z Vj D f0g;
there exists ' 2 V0 such that f'.x k/ W k 2 Zg is an orthonormal basis (o.n.b) for V0 .
The function ' in (iv) is called the scaling function of the given MRA. Set 'j k .x/ D
2j=2 '.2j x k/: One says that j k has scale 2 j and location k2 j : Properties (ii) and (iv)
imply that f'j k ; k 2 Zg is an orthonormal basis for Vj : The orthogonal projection from
L2 .R/ ! Vj is then
X
Pj f D
hf; 'j k i'j k :
k
183
k; 2
Vj D ff 2 L2 .R/ W f jIj k D cj k g;
with cj k 2 R. Thus Vj consists of piecewise constant functions on intervals of length 2
and Pj f .x/ is the average of f over the interval Ij k that contains x.
Example. Box spline MRA. Given r 2 N; set
Vj D ff 2 L2 \ C r
b
' .2/ D 2
1=2b
h./b
' ./;
(7.2)
hke
i k
1=2
b
g ./b
' ./:
(7.4)
Define recentered and scaled wavelets j k .x/ D 2j=2 .2j x k/: Suppose that it is
possible to define using (7.4) so that f j k ; k 2 Zg form an orthonormal basis for Wj . Then
it may be shown from property (iii) of the MRA that the full collection f j k ; .j; k/ 2 Z2 g
forms an orthonormal basis for L2 .R/.
184
Wj ;
j J
j J
jki
jk:
(7.5)
The first is called a homogeneous expansion, while the second is said to be inhomogeneous
since it combines only the detail spaces at scales finer than J:
Figure 7.1(left) shows some examples of j k for a few values of j; k: as elements of an
orthonormal basis, they are mutually orthogonal with L2 -norm equal to 1.
A key heuristic idea is that for typical functions f , the wavelet coefficients hf; j k i are
large only at low frequencies or wavelets located close to singularities of f . This heuristic
notion is shown schematically in Figure 7.1(right) and is quantified in some detail in Section
9.6 and Appendix B.
Here is a simple result describing the wavelet coefficients of piecewise constant functions.
R
Lemma 7.2 Suppose has compact support S; S and
D 0. Suppose f is piecewise constantRwith d discontinuities. Then at level j at most .2S 1/d of the wavelet coefficients j k D f j k are non-zero, and those are bounded by c2 j=2 , with c D k k1 kf k1 .
R
Proof Let the discontinuities of f occur at x1 ; : : : ; xd . Since
D 0,
Z
Z
j k D f j k D 2 j=2 f .2 j .t C k// .t /dt
vanishes unless some xi lies in the interior of supp. j k /. In this latter case, we can use the
right hand side integral to bound jj k j kf k1 k k1 2 j=2 . The support of j k is k2 j C
2 j S; S, and the number of k for which xi 2 int.supp. j k // is at most 2S 1. So the
total number of non-zero j k at level j is at most .2S 1/d .
The construction of some celebrated pairs .'; / of scaling function and wavelet is
sketched, with literature references, in Appendix B.1. Before briefly listing some of the
well known families, we discuss several properties that the pair .'; / might possess.
Support size. Suppose that the support of is an interval of length S, say 0; S . Then
j
C 2 j 0; S . Now suppose also that f has a singularity at x0 .
j k is supported on k2
The size ofRS determines the range of influence of the singularity on the wavelet coefficients
j k .f / D f j k . Indeed, at level j , the number of coefficients that feel the singularity at
x0 is just the number of wavelet indices k for which supp j k covers x0 , which by rescaling
is equal to S (or S 1 if x0 lies on the boundary of supp j k ).
It is therefore in principle desirable to have small support for and '. These are in turn
determined by the support of the filter h, by means of the two scale relations (7.1) and
(7.3). For a filter h D .hk ; k 2 Z/; its support is the smallest closed interval containing the
non-zero values of hk : For example, Mallat (1999, Chapter 7) shows that
(i) supp ' D supp h if one of the two is compact, and
2 C1 N2 N1 C1
(ii) if supp ' D N1 ; N2 , then supp D N1 N
;
:
2
2
185
(7,95)
(6,43)
jk
(6,21)
(5,13)
(4, 8)
(3, 5)
2{jk
2{j(k+S)
(3, 3)
(6,32)
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Figure 7.1 Left panel: Wavelets (from the Symmlet-8 family), the pair .j; k/
indicates wavelet j k , at resolution level j and approximate location k2 j . Right
panel: Schematic of a wavelet j k of compact support hitting a singularity of
function f .
(7.6)
c C2
j.C1=2/
If is a positive integer, then the C assumption is just the usual notion that f has
continuous derivatives, and the constant C D kD f k1 =. For > 0 non-integer,
we use the definition of Holder smoothness of order , given in Appendix C.23/ Note the
parallel with the definition (3.25) of vanishing moments for an averaging kernel K, and the
expression (3.26) for the approximation error of a qth order kernel.
Daubechies (1988) showed that existence of p vanishing moments for an orthogonal
wavelet implied a support length for h, and hence for '; , of at least 2p 1. Thus, for
such wavelets, there is a tradeoff between short support and large numbers of vanishing moments. A resolution of this tradeoff is perhaps best made according to the context of a given
application.
O
Regularity. Given an estimate
P Of .x/ of function f , we see by writing out the wavelet
O
expansion in (7.5) as f .x/ D j k j k .x/ that the smoothness of x ! j k .x/ can impact
the visual appearance of a reconstruction. However it is the number of vanishing moments
that affects the size of wavelet coefficients at fine scales, at least in regions where f is
smooth. So both properties are in general relevant. For the common wavelet families [to be
reviewed below], it happens that regularity increases with the number of vanishing moments.
186
1.
187
Meyer
BattleLemarie
6
0.2
0.4
0.6
0.8
6
0.2
D4 Wavelet
0.4
0.6
0.8
0.2
S8 Symmlet
6
6
0.2
0.4
0.6
0.8
0.6
0.8
C3 Coiflet
0.4
6
0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
Figure 7.2 The wavelet 4;8 .x/ from the members of several common wavelet
families. The Battle-Lemarie case uses linear splines, m D 1. For the Daubechies,
Symmlet and Coiflet cases, p D 2; 8 and 3 respectively, yielding 2, 8 and 6
vanishing moments. Produced using the function MakeWavelet.m in WaveLab.
`2Z
per
'j;kCr2j .x/
per
'j;k .x/
per
.x/
j;kCr2j
per
and
D j;k .x/ for any
The definition implies that
D
integers k and r and j 0. If '; have compact support, then for j larger than some j1 ,
these sums reduce to a single term for each x 2 I: [Again, this is analogous to the discussion
of periodization of kernels at (3.17) and (3.22) (??).]
per
per
per
per
Define Vj D span f'j k ; k 2 Zg; and Wj D span f j k ; k 2 Zg W this yields an
188
orthogonal decomposition
per
L2;per .I / D VL
per
Wj ;
j L
per
Vj
per
Wj
with dim
D dim
D 2j for j 0: Meyer makes a detailed comparison of Fourier
series and wavelets on 0; 1; including remarkable properties such as uniform convergence
of the wavelet approximations of any continuous function on 0; 1:
(ii) Orthonormalization on 0; 1 For non-periodic functions on 0; 1; one must take a
different approach. We summarize results of the CDJV construction, described in detail
in Cohen et al. (1993b), which builds on Meyer (1991) and Cohen et al. (1993a). The construction begins with a Daubechies pair .'; / having p vanishing moments and minimal
support p C 1; p. For j such that 2j 2p and for k D p; : : : ; 2j p 1, the scaling
functions 'jintk D 'j k have support contained wholly in 0; 1 and so are left unchanged. At
the boundaries, for k D 0; : : : ; p 1, construct functions 'kL with support 0; p C k and 'kR
with support p k; 0, and set
'jintk D 2j=2 'kL .2j x/;
int
'j;2
j
k 1
1//:
The 2p functions 'kL ; 'kR are finite linear combinations of scaled and translated versions of
the original ' and so have the same smoothness as '. We can now define the multiresolution
spaces Vjint D spanf'jintk ; k D 0; : : : ; 2j 1g. It is shown that dimVjint D 2j , and that they
have two key properties:
(i) in order that Vjint VjintC1 , it is required that the boundary scaling functions satisfy two
scale equations. For example, on the left side
pC2k
x pX1
X
1
L L
D
'l .x/ C
hL
Hkl
p 'kL
km '.x
2
2
mDp
lD0
m/:
L
2X
1
int
k 'Lk
.x/
j k
int
j k .x/
j L kD0
kD0
int
with k D hf; 'Lk
i and j k D hf;
contains polynomials of degree p
order p.
X 2X1
int
j k i:
189
Wj
1,
Bj0
D f'j
1;k ; k
2 Zg [ f
j 1;k ; k
2 Zg
with coefficients
aj
1 k
D hf; j
1;k i;
dj
1 k
D hf;
j 1;k i;
(7.8)
aj ! faj
Sj W
faj
1 ; dj 1 g
1 ; dj 1 g
(analysis)
! aj
(synthesis)
Sj D Aj 1 D ATj :
To derive explicit expressions for Aj and Sj , rewrite the two-scale equations (7.1) and
(7.3) in terms of level j , in order to express 'j 1;k and j 1;k in terms of 'j k , using the fact
that Vj 1 and Wj 1 are contained in Vj : Rescale by replacing x by 2j x 2k and multiply
both equations by 2j=2 : Recalling the notation 'j k .x/ D 2j=2 '.2j x k/; we have
X
X
'j 1;k .x/ D
hl'j;2kCl .x/ D
hl 2k'j l .x/:
(7.9)
l
(7.10)
190
dj
1 k
(7.11)
gl
2kaj l D Rg ? aj 2k;
where R denotes
P the reversal operator Rak D a k, and ? denotes discrete convolution
a ? bk D ak lbl. Introducing also the downsampling operator Dak D a2k, we
could write, for example, aj 1 D D.Rh ? aj /. Thus the analysis, or fine-to-coarse step
Aj W aj ! .aj 1 ; dj 1 / can be described as filter with Rh and Rg and then downsample.
P
Synthesis step Sj . Since 'j 1;k 2 Vj 1 Vj , we can expand 'j 1;k as l h'j 1;k ; 'j l i'j l ,
along with an analogous expansion for j 1;k 2 Wj 1 Vj . Comparing the coefficients
(7.9) and (7.10) yields the identifications
h'j
Since 'j l 2 Vj D Vj
1;k ; 'j l i
D hl
2k;
j 1;k ; 'j l i
D gl
2k:
(7.12)
[Note that this time the sums are over k (the level j 1 index), not over l as in the analysis
step!]. Taking inner products with f in the previous display leads to the synthesis rule
X
aj l D
hl 2kaj 1 k C gl 2kdj 1 k:
(7.13)
k
To write this in simpler form, introduce the zero-padding operator Za2k D ak and
Za2k C 1 D 0, so that
aj l D h ? Zaj
1 l
C g ? Zdj
1 ; dj 1 /
1 l:
Computation. If the filters h and g have support length S, the analysis steps (7.11) each
require S multiplys and adds to compute each coefficient. The synthesis step (7.13) similarly
needs S multiplys and adds per coefficient.
The Cascade algorithm. We may represent the successive application of analysis steps
beginning at level J and continuing down to a coarser level L by means of a cascade diagram
aJ
AJ
aJ{1
dJ{1
AJ{1
aJ{2
aL+1
aL
dJ{2
dL+1
dL
191
aJ
1 ; dJ 2 ; : : : ; dL ; aL g:
(7.14)
The forward direction is the analysis operator, given by the orthogonal discrete wavelet
transform W . The reverse direction is the synthesis operator, given by its inverse, W T D
SJ SJ 1 SLC1 :
W as a matrix. W represents a change of basis from VJ D spanf'J k ; k 2 Zg to
VL WL WJ
D spanff'Lk g [ f
j k g; L
j J
1; k 2 Zg:
1I k 2 Zg and A D fI D .L; k/ W k 2
2 Z, then we have
2D
D .L; k 0 / 2 A:
r k
h.r/ n
2r k aj n D Rh.r/ ? aj 2r k:
g .r/ n
2r k aj n D Rg .r/ ? aj 2r k:
dj
r k
n
r
This formula says that the 2 -fold downsampling can be done at the end of the calculation
if appropriate infilling of zeros is done at each stage. While not necessarily sensible in computation, this is helpful in deriving a formula. The proof of this and all results in this section
is deferred to the end of the chapter.
To describe the approximation of ' and it is helpful to consider the sequence of nested
lattices 2 r Z for r D 1; : : :. Define functions ' .r/ ; .r/ on 2 r Z using the r-fold iterated
filters:
' .r/ .2 r n/ D 2r=2 h.r/ n;
.r/
.2 r n/ D 2r=2 g .r/ n:
(7.15)
192
Clearly ' .1/ and .1/ are essentially the original filters h and g, and we will show that
.r/ ! ; .r/ ! in an appropriate sense. Indeed, interpret the function ' .r/ on 2 r Z
as a (signed) measure r D ' .r/ that places mass 2 r ' .r/ .2 r n/ at 2 r n. Also interpret
the function ' on R as the density with respect to Lebesgue
R measure
R of a signed measure
D '. Then weak convergence of r to means that f dr ! f d for all bounded
continuous functions f .
Proposition 7.6 The measures ' .r/ and
respectively as r ! 1.
.r/
The left panel of Figure 7.1 illustrates the convergence for the Daubechies D4 filter.
We now describe the columns of the discrete wavelet transform in terms of these approximate scaling and wavelet functions. To do so, recall the indexing conventions D and A used
in describing .WI i /. In addition, for x 2 2 .j Cr/ Z, define
j=2 .r/ j
' .2 x
'j.r/
k .x/ D 2
.r/
j k .x/
k/;
D 2j=2
.r/
.2j x
k/:
(7.16)
Proposition 7.7 Suppose that N D 2J . The discrete wavelet transform matrix .WI i / with
I D .j; k/ and i 2 Z is given by
(
h I ; 'J i i D N 1=2 j.Jk j / .i=N / I D .j k/ 2 D;
WI i D
.J L/
h'Lk ; 'J i i D N 1=2 'Lk
.i=N / I 2 A:
Thus, the I th row of the wavelet transform matrix looks like I.J j / (where I D .j; k/),
and the greater the separation between the detail level j and the original sampling level J ,
the closer the corresponding function j.Jk j / is to the scaled wavelet j k .x/:
Cascade algorithm on sampled data. We have developed the cascade algorithm assuming
that the input sequence aJ k D hf; 'J k i. What happens if instead we feed in as inputs
aJ k a sequence of sampled values ff .k=N /g?
Suppose that f is a square integrable function on 2 J Z D N 1 Z: The columns of the
discrete wavelet transform will be orthogonal with respect to the inner product
X
hf; giN D N 1
f .N 1 n/g.N 1 n/:
(7.17)
n2Z
Proposition 7.8 If aJ n D N
aj k D h'j.Jk
j/
; f iN ;
1=2
f .N
dj k D h
.J j /
; f iN ;
jk
k 2 Z:
(7.18)
Thus, when applied on sampled data, the cascade algorithm produces discrete wavelet
coefficients which approximate the true wavelet coefficients of the underlying functions in
two steps: 1) the integral is approximated by a sum over an equally spaced grid, and 2) the
functions 'j k and j k are approximated by 'j.Jk j / and j.Jk j / .
Formulas (7.18) are an explicit representation of our earlier description that the sequences
faj k; k 2 Zg and fdj k; k 2 Zg are found from faJ k; k 2 Zg by repeated filtering and
downsampling. Formulas (7.18) suggest, without complete proof, that the iteration of this
process is stable, in the sense that as J j increases (the number of levels of cascade between
the data level J and the coefficient level j ), the coefficients look progressively more like the
continuous-time coefficients h'j k ; f i:
193
0.5
r=1
j=4,k=8, Jj=6
1
2
2
1.5
0.5
0.5
1.5
2
1
0.5
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.5
r=2
j=5,k=20, Jj=5
1
2
2
1.5
0.5
0.5
1.5
2
1
0.5
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.5
r=3
j=6,k=16, Jj=4
1
2
2
1.5
0.5
0.5
1.5
2
1
0.5
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.5
r=6
j=8,k=120, Jj=2
1
2
2
1.5
0.5
0.5
1.5
0.5
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Figure 7.4 Left: The function .r/ on 2 r Z for the Daubechies D4 filter for
various values of r. Right: rows of the wavelet transform matrix, N D 1024, for
the Daubechies D4 filter, showing scale j , location k and iteration number J D j .
Continuous world
Discrete World
aJ k D h'J k ; f i
aJ k D N
1=2 f .nN 1 /
#
.J j /
aj k D h'j k ; f i
aj k D h'j k
dj k D h
dj k D h
jk; f
; f iN
.J j /
; f iN
jk
Table 7.1 Schematic comparing the orthogonal wavelet transform of functions f 2 L2 .R/ with the
discrete orthogonal wavelet transform of square summable sequences formed by sampling such
functions on a lattice with spacing N 1 : The vertical arrows represent the outcome of r D J j
iterations of the cascade algorithm in each case.
Table 7.1 highlights a curious parallel between the continuous and discrete worlds:
the discrete filtering operations represented by the cascade algorithm, through the DWT
matrix W , are the same in both cases!
194
1;k
D aLk
1 and k D 1; : : : ; 2j
j D L; : : : ; J
k D 1; : : : ; 2L :
(7.19)
with I denoting the columns of the inverse discrete wavelet transform matrix W T . [The
bolding is used to distinguish the vector I arising in the finite transform from the function
I 2 L2 .R/.] If we set tl D l=N and adopt the suggestive notation
I .tl /
WD
I;l ;
(7.20)
i:i:d
zl N.0; 1/:
(7.21)
It is assumed, for now, that is known. The goal is to estimate f , at least at the observation
points tl : The assumption that the observation points are equally spaced is quite important
see the chapter notes for referenceswhereas the specific form of the error model and knowledge of are less crucial.
Basic strategy. The outline is simply described. First, the tranform step, which uses a finite
orthogonal wavelet transform W as described in the previous section. Second, a processing
195
step in the wavelet domain, and finally an inverse transform, which is accomplished by W T ;
since W is orthogonal.
.n
1=2
1=2
Yl /
fO.tl /
! .wI /
?
?
y
WT
(7.22)
.wO I /
D D fI W L j J
1I k D 1; : : : ; 2j g;
A D fI D .L; k/ W k D 1; : : : ; 2L g:
The transformation .wI I t/ is a scalar function of the observed coefficient wI , usually nonlinear and depending on a parameter t: We say that operates co-ordinatewise. Often, the
parameter t is estimated, usually from all or some of the data at the same level as I; yielding
the modified expression .wI I t .wj //, where I 2 Ij D f.j; k/ W k D 1; : : : ; 2j g: In some
cases, the function itself may depend on the coefficient index I or level j . Common
examples include (compare Figure 2.2) hard thresholding:
H .wI I t / D wI I fjwI j tg;
and soft thresholding:
8
<wI t
S .wI I t / D 0
:
wI C t
wI > t
jwI j t
wI < t:
These may be regarded as special cases of a more general class of threshold shrinkage
rules, which are defined by the properties
odd:
shrinks:
bounded:
threshold:
. x; t/ D .x; t /,
.x; t/ x if x 0,
x .x; t/ t C b if x 0, (some b < 1),
.x; t/ D 0 iff jxj t.
196
Here are some examples. All but the first depend on an additional tuning parameter.
1. .x; t/.x t 2 =x/C suggested by Gao (1998) based on the garotte of Breiman (1995),
2. Soft-hard thresholding (Gao and Bruce, 1997): This is a compromise between soft and
hard thresholding defined by
8
if jxj t1
<0
t1 /
.xI t1 ; t2 / D sgn.x/ t2 .jxj
if
t1 < jxj t2
t2 t1
:
x
if jxj > t2 :
3. The smooth clipped absolute deviation (SCAD) penalty threshold function of Fan and
Li (2001), and
4. .xI t; a/ constructed as the posterior median for a prior distribution that mixes a point
mass at zero with a Gaussian of specified variance (Abramovich et al., 1998), as discussed
in Section 2.4 and below.
Methods for estimating t from data will be discussed in the next section.
Another possibility is to threshold blocks of coefficients. One example is James-Stein
shrinkage applied to the whole j -th level of coefficients:
JS .wI I s.wj // D s.wj /wI ;
s.wj / D .1
.2j
2/ 2 =jwj j2 /C :
The entire signal is set to zero if the total energy is small enough, jwj j2 < .2j 2/ 2 ,
otherwise it a common, data-determined linear shrinkage applies to all co-ordinates. When
the true signal is sparse, this is less effective than thresholding, because either the shrinkage
factor either causes substantial error in the large components, or fails to shrink the noise
elements - it cannot avoid both problems simultaneously. An effective remedy is to use
smaller blocks of coefficients, as discussed in the next section and Chapters 8 and 9.
The estimator. Writing fO for the vector N 1=2 fO.tl / and y for .N 1=2 yl /, we may
summarize the process as
fO D W T .W y/:
This representation makes the important point that the scaling and wavelet functions ' and
are not required or used in the calculation. So long as the filter h is of finite length, and
the wavelet coefficient processing w ! wO is O.N /, then so is the whole calculation.
Nevertheless, the iteration that occurs within the cascade algorithm generates approximations to the wavelet, cf. Section 7.3. Thus, we may write the estimator more explicitly
as
X
fO.tl / D
I .w/ I .tl /
(7.23)
I
X
I 2A
wI 'I .tl / C
.wI /
I .tl /;
I 2D
Thus, I D N 1=2 j.Jk j / here is not the continuous time wavelet j k D 2j=2 .2j k/;
but rather the .J j /t h iterate of the cascade, after being scaled and located to match j k ;
compare (7.16) and Proposition 7.7.
197
The .I; l/th entry in the discrete wavelet transform matrix W is given by N
P
and in terms of the columns I of W , we have yl D I wI I .N 1 l/:
1=2
.J j /
.N
jk
First examples are given by the NMR data shown in Figure 1.2 and the simulated Bumps
example in Figure 7.5. The panels in Figure 1.2 correspond to the vertices of the processing
diagram (7.22) (actually transposed!). The simulated example allows a comparison of soft
and hard thresholding with the true signal and shows that hard thresholding here preserves
the peak heights more accurately.
(a) Bumps
60
60
50
50
40
40
30
30
20
20
10
10
10
0.2
0.4
0.6
0.8
10
0.2
50
50
40
40
30
30
20
20
10
10
0.2
0.4
0.6
0.6
0.8
0.8
60
10
0.4
0.8
10
0.2
0.4
0.6
Figure 7.5 Panels (a), (b): artificial Bumps signal constructed to resemble a
spectrum, formula in Donoho and Johnstone (1994a), kf kN D 7 and N D 2048
points. I.i.d. N.0; 1/ noise added to signal, so signal to noise ratio is 7. Panels (c),
(d): Discrete wavelet transform with Symmlet8 filter
p and coarse scale L D 5. Soft
(c) and hard (d) thresholding with threshold t D 2 log n 3:905:
The thresholding estimates have three important properties. They are simple, based on
co-ordinatewise operations, non-linear, and yet fast to compute
(O.n/ time).
p
The appearance of the estimates constructed with the 2 log n thresholds is noise free,
with no peak broadening, and thus showing spatial adaptivity, in the sense that more averaging is done in regions of low variability. Comparison with Figure 6.2 shows that linear
methods fail to exhibit these properties.
l/
198
The hidden sparsity heuristic. A rough explanation for the success of thresholding goes
as follows. The model (7.21) is converted by the orthogonal wavelet transform into
p
i:i:d
wI D I C zQI ;
D = n; zQI N.0; 1/:
(7.24)
Since the noise is white (i.e. independent with constant variance) in the time domain, and
the wavelet transform is orthogonal, the same property holds for the noise variables zQI in the
wavelet domainthey each contribute noise at level 2 : On the other hand, in our examples,
and more generally, it is often the case that the signal in the wavelet domain is sparse, i.e. its
energy is largely concentrated in a few components. With concentrated signal and dispersed
noise, a threshold strategy is both natural and effective, as we have seen in examples, and
will see from a theoretical perspective in Chapters 8, 9 and beyond. The sparsity of the
wavelet representation may be said to be hidden, since it is not immediately apparent from
the form of the signal in the time domain. This too is taken up in Chapter 9.
Estimation of . Assume that the signal is sparsely represented, and so most, if not all,
data coefficients at the finest level are essentially pure noise. Since there are many (2J 1 /
such coefficients, one can estimate 2 well using a robust estimator
O 2 D MADfwJ
1;k ; k
2 IJ
1 g=0:6745;
which is not affected by the few coefficients which may contain large signal. Here MAD denotes the median absolute deviation (from zero). The factor 0:6745 is the population MAD
of the standard normal distribution, and is used to calibrate the estimate.
Soft vs. Hard thresholding The choice of the threshold shrinkage rule and the selection
of threshold t are somewhat separate issues. The choice of is problem dependent. For
example, hard thresholding exactly preserves the data values above the threshold, and as
such can be good for preserving peak heights (say in spectrum estimation), whereas soft
thresholding forces a substantial shrinkage. The latter leads to smoother visual appearance
of reconstructions, but this property is often at odds with that of good fidelity as measured
for example by average squared error between estimate and truth.
Correlated data. If the noise zl in (7.21) is stationary and correlated, then the wavelet
transform has a decorrelating effect. (Johnstone and Silverman (1997) has both a heuristic and more formal discussion). In particular, the levelwise variances j2 D Var.wj k / are
independent of k. Hence it is natural to apply level-dependent thresholding
wO j k D .wj k ; tj /:
p
For example, one might take tj D O j 2 log n with O j D MADk fwj k g=0:6745:
Figure 7.6 shows an ion channel example from Johnstone and Silverman (1997) known
to have a stationary correlated noise structure. Two different level dependent choices of
thresholds are
p compared. Consistent with remarks in the next section, and later theoretical
results, the 2 log n choice is seen to be too high.
Wavelet shrinkage as a spatially adaptive kernel method. We may write the result of
thresholding using (7.19) and (7.25) in the form
X
fO.tl / D
wO I I .tl /
wO I D cI .y/wI
(7.25)
I
199
100
200
300
400
500
100
200
300
400
500
Figure 7.6 Ion channel data. Panel (a) sample trace of length 2048. Panel (b)
Dotted line: true signal,
Dashed line: reconstruction using translation invariant (TI)
p
thresholding at O j 2 log n. Solid line: reconstruction using TI thresholding at data
determined thresholds (a combination of SURE and universal).Further details in
Johnstone and Silverman (1997).
where we have here written I .w/ in the data-dependent linear shrinkage form cI .w/yI .
Inserting the wavelet transform representation (7.20) into (7.25) leads to a kernel representation for fO.tl /:
X
XX
O l ; tm /ym ;
K.t
cI .y/ I .tl / I .tm /ym D
fO.tl / D
I
cI .y/
I .s/
I .t /;
s; t 2 ftl D l=N g:
(7.26)
The hat in this kernel emphasizes that it depends on the data through the coefficients
cI .y/: The individual component kernels KI .t; s/ D I .t / I .s/ have bandwidth 2 j B
where B is the support length of the filter h: Hence, one may say that the bandwidth of KO at
tl is of order 2 j.tl / ; where
j.tl / D maxfj W cI .y/
I .tl /
0; some I 2 Ij g:
In other words, tl must lie within the support of a level j wavelet for which the corresponding data coefficient is not thresholded to zero. Alternatively, if a fine scale coefficient
estimate wO j k 0; then there is a narrow effective bandwidth near 2 j k: Compare Figure
7.7 and Exercise 7.2. By separating the terms in (7.26) corresponding to the approximation
set A and the detail set D; we may decompose
KO D KA C KO D
P
where the approximation kernel KA .tl ; tm / D
k 'I .tl /'I .tm / does not depend on the
observed data y:
200
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Figure 7.7 Spatially adaptive kernel corresponding to hard thresholding of the NMR
O l ; tm /, compare (7.26), is shown for tl;1 0:48
signal as in Figure 1.2. The kernel tm ! K.t
and tl;2 0:88. The bandwidth at 0:88 is broader because j.tl;2 / < j.tl;1 /.
Translation invariant versions. The discrete wavelet transform (DWT) is not shift invariant: the transform of a shifted signal is not the same as a shift of the transformed original.
This arises because of the dyadic downsampling between levels that makes the DWT nonredundant. For example, the Haar transform of a step function with jump at 1=2 has only
one non-zero coefficient, whereas if the step is shifted to say, 1=3, then there are log2 N
non-zero coefficients.
The transform, and the resulting threshold estimates, can be made invariant to shifts by
multiples of N 1 by the simple device of averaging. Let S denote the operation of circular
shifting by N 1 : Sf .k=N / D f ..k C 1/=N /; except for the endpoint which is wrapped
around: Sf .1/ D f .1=N /: Define
fOT I D Ave1kN .S
fO S k /:
(7.27)
The translation invariant (TI) estimator averages over all N shifts, and so would appear to
involve at least O.N 2 / calculation. However, the proposers of this method, Coifman and
Donoho (1995), describe how the algorithm can in fact be implemented in O.N log N /
operations.
It can be seen from Figure 7.8 that the extra averaging implicit in fOT I reduces artifacts
considerablycompare the bottom panels of Figure 7.5. Experience in practice has generally been that translation invariant averaging improves the performance of virtually every
method of thresholding, and its use is encouraged in situations where the log N computational penalty is not serious.
Software. The wavelet shrinkage figures in this book were produced in Matlab using
the public domain library WaveLab (version 850) available at stat.stanford.edu.
Matlab also has a proprietary wavelet toolbox. In R, the WaveThresh package is available at cran.r-project.org and is described in the book by Nason (2008).
201
Hard Haar TI
60
60
50
50
40
40
30
30
20
20
10
10
10
0.2
0.4
0.6
0.8
10
0.2
Soft Haar TI
25
20
20
15
15
10
10
0.2
0.4
0.6
0.6
0.8
0.8
Hard Haar TI
25
10
0.4
0.8
10
0.2
0.4
0.6
O
Figure 7.8 A comparison of translation invariant
p thresholding (7.27) applied to f given by
soft and hard Haar wavelet thresholding, at t D 2 log n, for n D 2048, for the test signals
Bumps of Figure 7.5 and Blocks of Figure 6.2. For direct comparisons of thresholding with
and without TI-averaging, see Coifman and Donoho (1995).
202
Fixed methods set a threshold in advance of observing data. One may use a fixed number of p
standard deviations k, or a more conservative limit, such as the universal threshold
t D 2 log n.
p
1. Universal threshold n D 2 log n. This is a fixed threshold method, and can be
used with either soft or hard thresholding. If Z1 ; : : : ; Zn are i.i.d. N.0; 1/ variates, then it
can be shown (compare (8.30)) that for n 2,
p
1
Pn D P f max jZi j > 2 log ng p
:
1i n
log n
Similarly, it can be shown 1 that the expected number of jZi j that exceed the threshold will
satisfy the same bound. For a wide range of values of n, including 64 D 26 n 220 ; the
expected number of exceedances will be between 0:15 and 0:25, so only in at most a quarter
of realizations will any pure noise variables exceed the threshold.
Since the wavelet transform is orthogonal, it follows from (7.24) that
P ffOn 0jf 0g D P fw 0j 0g D 1
Pn ! 1:
Thus, with high probability, no spurious structure is declared, and in this sense, the universal threshold leads to a noise free reconstruction. Note however that this does not mean
that fO D f with high probability when f 0, since fO is not linear in y:
The price for this admirably conservative performance is that the method chooses large
thresholds, which can lead to noticeable bias at certain signal strengths. When combined
with the soft thresholding non-linearity, the universal threshold leads to visually smooth
reconstructions, but at the cost of considerable bias and relatively high mean squared error.
This shows up in the theory as extra logarithmic terms in the rate of convergence of this
estimator, e.g. Theorem 10.9.
2. False discovery rate (FDR) thresholding. This is a data dependent method for hard
thresholding that is typically applied levelwise in the wavelet transform. Suppose that yi
N.i ; 2 / are independent, and form the order statistics of the magnitudes:
jyj.1/ jyj.2/ : : : jyj.n/ :
Fix the false discovery rate parameter q 2 .0; 1=2: Form quantiles tk D z.q=2 k=n/: Let
kOF D maxfk W jyj.k/ tk g; and set tOF D t O and use this as the hard threshold
kF
(7.28)
The boundary sequence .tk / may be thought of as a sequence of thresholds for t statistics
in model selection: the more variables (i.e. coefficients in our setting) enter, the easier it is
for still more to be accepted (i.e. pass the threshold unscathed.) Figure 7.9 shows the method
on two signals of different sparsity levels: the threshold tOF chosen is higher for the sparser
signal.
As is shown in Abramovich et al. (2006), the FDR estimator has excellent mean squared
error performance in sparse multinormal mean situationsfor example being asymptotically
adaptive minimax over `p balls. In addition (unpublished), it achieves the right rates of
1
For more detail on these remarks, see the proof of (8.30) and Table 8.1 in the next chapter.
203
6
5
4
3
3
2
2
1
0
1
0
10
15
20
20
40
60
80
100
120
Figure 7.9: Illustration of FDR thresholding at different sparsities (a) 10 out of 10,000.
convergence
over Besov function classes - thus removing the logarithmic terms present when
p
the 2 log n threshold is used. In Chapter 11, we will see that a related estimator arises from
penalized least squares model selection, and yields the correct rate of convergence results
both in the single sequence model and for wavelet function estimation, Section 12.1.
However, the choice of q is an issue requiring further study the smaller the value of q;
the larger the thresholds, and the more conservative the threshold behavior becomes.
3. Steins unbiased risk estimate (SURE) thresholding. This is a data dependent method
for use with soft thresholding, again typically level by level. It has the special feature of allowing for certain kinds of correlation in the noise. Thus, assume that y Nn .; V /, and
assume that the diagonal elements kk of the covariance matrix are constant and equal to 2 :
This situation arises, for example, if in the wavelet domain, k ! yj k is a stationary process.
At (2.71) and Exercise 2.8, we derived the unbiased risk criterion for soft thresholding,
and found that E kO k2 D E UO .t /; where (putting in the noise level 2 )
X
X
UO .t/ D 2 n C
min.yk2 ; t 2 / 2 2
I fjyk j t g:
k
Now set
tOS URE D
argmin
p
0t
UO .t /:
2 log n
The criterion UO .t/ does not depend on details of the correlation .j k ; j k/ and so can be
used in correlated data settings when the correlation structure is unknown, without the need
of estimating it. Also, UO .t/ is piecewise quadratic with jumps at jyk j, so the minimization
can be carried out in O.n log n/ time.
204
The SURE estimate also removes logarithmic terms in the rates of convergence of wavelet
shrinkage estimates over Besov classes, though a pretest is needed in certain cases to complete the proofs. See Donoho and Johnstone (1995); Johnstone (1999); Cai and Zhou (2009b)
and Exercise 12.3.
4. Empirical Bayes. This data dependent method for levelwise thresholding provides
a family of variants on soft and hard thresholding. Again assume an independent normal
means model, yi D i C zi ; with zi i.i.d standard normal. As in Section 2.4, allow i to
independently be drawn from a mixture prior distribution :
i .1
w/0 C w a :
Here w is the probability that i is non-zero, and
a .d / is a family of distributions with
scale parameter a > 0; for example the double exponential, or Laplace, density
a .d / D .a=2/e ajj d:
P
Using L1 loss kO k1 D n1 jOi i j; it was shown in Section 2.4 that the Bayes rule
for this prior is the median OEB .y/ of the posterior distribution of given y W
OEB;i .y/ D .yi I w; a/;
and that the posterior median has threshold structure:
.yI w; a/ D 0 if jyj t .w; a/;
while for large jyj; it turns out, (2.43), that jy .y/j a.
The hyperparameters .w; a/ can be estimated by maximizing the marginal likelihood of
.w; a/ given data .yi /: Indeed, the marginal of yi
Z
Z
m.yi jw; a/ D .yi i /.d / D .1 w/ .yi / C w .yi i /
a .di /
Q
and the corresponding likelihood `.w; a/ D i m.yi jw; a/.
Theory shows that the method achieves the optimal rates of convergence, while simulations suggest that the method adapts gracefully to differing levels of sparsity at different
resolution levels in the wavelet transform (Johnstone and Silverman, 2005b).
A numerical comparison. Table 7.2 is an extract from two larger tables in Johnstone
and Silverman (2004a) summarizing results of a simulation comparison of 18 thresholding
methods. The observations x D 0 IS C z are of length 1000 with IS denoting the indicator
function of a set S I D f1; : : : ; 1000g, and with noise zi being i.i.d. standard normal.
The non-zero set S is a random subset of I for each noise realization, and each of three
sizes K D jSj D 5; 50; 500 corresponding to very sparse, sparse and dense signals
respectively. Four signal strengths 0 D 3; 4; 5 and 7 were used, though only two are shown
here. There are thus 3 4 D 12 configurations. One hundred replications were carried out
for each of the values of K and 0 , with the same 100,000 noise variables used for each set
of replications.
p Among the 18 estimators, we select here: Universal soft and hard thresholding at level
2 log n 3:716, FDR thresholding with q D 0:1 and 0:01, SURE thresholding, and
205
finally empirical Bayes thresholding first with a D 0:2 fixed and w estimated, and second
with .a; w/ estimated, in both cases by marginal maximum likelihood.
For each estimation method Om and configuration c , the average total squared error was
recorded over the nr D 100 replications:
r.Om ; c / D nr 1
nr
X
c k2 :
Some results are given in Table 7.2 and the following conclusions can be drawn:
hard thresholding with the universal threshold particularly with moderate or large amounts
of moderate sized signal, can give disastrous results.
Estimating the scale parameter a is probably preferable to using a fixed value, though
it does lead to slower computations. In general, the automatic choice is quite good at
tracking the best fixed choice, especially for sparse and weak signal.
SURE is a competitor when the signal size is small (0 D 3) but performs poorly when
0 is larger, particularly in the sparser cases.
If q is chosen appropriately, FDR can outperform exponential in some cases, but in the
original larger tables, it is seen that the choice of q is crucial and varies from case to case.
Table 7.2 Average of total squared error of estimation of various methods on a mixed signal of
length 1000.
Number nonzero
Value nonzero
5
5
50
5
500
5
med
10th
max
a D 0:2
exponential
38
36
18
17
299
214
95
101
1061
857
665
783
18
7
30
30
48
52
SURE
38
42
202
210
829
835
35
151
676
FDR q=0.01
FDR q=0.1
43
40
26
19
392
280
125
113
2568
1149
656
651
44
18
91
39
210
139
universal soft
universal hard
42
39
73
18
417
370
720
163
4156
3672
7157
1578
529
50
1282
159
1367
359
An alternative way to compare methods is through their inefficiency, which compares the
risk of Om for a given configuration c with the best over all 18 methods:
"
#
r.Om ; c /
O
ineff.m ; c / D 100
1 :
minm r.Om ; c /
The inefficiency vector ineff.Om / for a given method has 12 components (corresponding to
the configurations c ) and Table 7.2 also records three upper quantiles of this vector: median,
and 10th and 12th largest. Minimizing inefficiency has a minimax flavorit turns out that
the empirical Bayes methods have the best inefficiencies in this experiment.
5. Block Thresholding
206
k2b
b D 1; : : : ; B:
1/C1 ; : : : ; ybL /;
yk2 . We can then define a block thresholding rule via the general prescription
Ob .y/ D c.Sb =/yb ;
where the function c./ has a thresholding character. Three natural choices are
p
2 L
L
H
S
;
;
c JS .s/ D 1
c .s/ D I fs > g;
c .s/ D 1
C
s
s2 C
corresponding to block hard, soft and James-Stein thresholding respectively. Each of these
may be thought of as an extension of a univariate threshold rule to blocks of size L. Thus,
O 1 / and note that the three cases reduce to ordinary hard, soft,
for D 1 write .x/
O
for .xe
and garotte thresholding (Gao (1998), see also Section 8.2) respectively.
Hard thresholding of blocks was studied by Hall et al. (1999a,b), who took L D .log n/
for > 1. Block James-Stein thresholding was investigated by Cai (1999) with L D log n
and 2 D 4:505. In Chapter 8 we will study Block soft thresholding, which has monotonicity
properties that make it easier to handle, and we will recover analogs of most of Cais results,
including the motivation for choice of L and .
One may wish to estimate both block size L and threshold from data, level by level,
for example by minimizing unbiased estimate of risk. In this way one might obtain larger
thresholds and smaller blocks for sparser signals. This is studied at length by Cai and Zhou
(2009b), see also Efromovich and Valdez-Jasso (2010).
r k
D Rh ? aj
rC1 2k;
Now hl
equals
2k D Z r
X
m
h.r
1/
1 h2r 1 l
mZ r
aj n
h.r
hm
l aj n
1/
2r
l hl
2k:
2r k and since Z r
1
2r
1/
2r k D h.r
1 hm
1/
D 0 unless m D 2r
? Zr
hn
1 l;
2r k D h.r/ n
2r k:
207
Relating h.r/ to '. Recall that the scaling function ' was defined by the Fourier
Q1 b
j
domain formula ' ./ D j D1 h.2p / . This suggests that we look at the Fourier transform of h.r/ : First
2
note that the transform of zero padding is given by
X
X
ch.!/ D
Z
e i l! Zhl D
e i 2k! hk D b
h.2!/;
Proof of Proposition 7.6.
1b p
pD0 h.2 !/: Making the substitution !
distribution ' .r/ having Fourier transform
Qr
b.2
r=2 .r/
/ D
r ,
D2
r b
Y
h.2 j /
:
p
2
j D1
(7.29)
Observe that ' .r/ ./ has period 2rC1 . This suggests, and we now verify, that ' .r/ can be
Pthought of as
a function (or more precisely, a measure) defined on 2 r Z: Indeed, a discrete measure D n mn2 r n
supported on 2 r Z has Fourier transform
Z
X
r
b
./ D e i x .dx/ D
m.2 r /:
mne i 2 n D b
n
P
Thus, the quantity 2 r=2 h.r/ .2 r / in (7.29) is the Fourier transform of a measure n 2 r=2 h.r/ n2 r n :
r
r
Secondly,
a real valued function g.2 n/ defined on 2 Z is naturally associated to the measure g D
P
r
r
r can be motivated by considering integrals of funcn 2 g.2 n/2 r n ; (the normalizing multiple 2
tions against g). Combining these two remarks shows that ' .r/ is indeed a function on 2 r Z; with
2
r .r/
'
.2
r=2 .r/
n/ D 2
n:
(7.30)
Furthermore, the measure r D ' .r/ has Fourier transform ' .r/ ./. Since ' .r/ ./ ! b
'./ for all and
b
'./ is continuous at 0, it follows from the Levy-Cramer theorem C.18, appropriately extended to signed
measures, that ' .r/ converges weakly to '.
The weak convergence for .r/ to follows similarly from the analog of (7.29)
b ./ D 2 gb.2
.r/
and convergence
r=2 .r/
/ D
r
h.2
b
g.2 1 / Y b
b ./ ! b./.
j D2
j /
.r/
P ROOF OF P ROPOSITION 7.7. We first re-interpret the results of Lemma 7.5. Suppose j < J: Since
'j k 2 VJ , we have
X
'j k D
h'j k ; 'J n i'J n ;
n
Replacing j with J
h'J
r and comparing the results with those of the Lemma, we conclude that
r;k ; 'J n i
D h.r/ n
2r k;
J r;k ; 'J n i
J /=2 .J j /
'
.2j
D g .r/ n
2r k:
j , we get
k/ D N
1=2 .J j /
'j k
.n=N /;
208
which is the second equation of Proposition 7.7. The first follows similarly.
P ROOF OF P ROPOSITION 7.8. Let r D J j; so that aj D aJ r and, using Lemma 7.5, aj k D
P .r/
2r k aJ n: From (7.15),
n h n
h.r/ n
which implies that aj k D N
exactly analogous.
2r k D 2
1
r=2 .r/
'
.2
k/ D N
.r/
1 n/f .N 1 n/
n 'j k .N
1=2 .r/
'j k .N 1 n/;
.J j /
D h'j k
7.8 Notes
1. In addition to the important books by Meyer (1990), Daubechies (1992) and Mallat (2009) already cited,
we mention a selection of books on wavelets from various perspectives: Hernandez and Weiss (1996), Chui
(1997), Wojtaszczyk (1997), Jaffard et al. (2001), Walter and Shen (2001), Cohen (2003), Pinsky (2009),
Starck et al. (2010). Heil and Walnut (2006) collects selected important early papers in wavelet theory.
Many expositions rightly begin with the continuous wavelet transform, and then discuss frames in detail
before specialising to orthogonal wavelet bases. However, as the statistical theory mostly uses orthobases,
we jump directly to the definition of multiresolution analysis due to Mallat and Meyer here in a unidimensional form given by Hernandez and Weiss (1996):
1. Warning: many authors use the opposite convention Vj C1 Vj
Conditions (i) - (iv) are not mutually independent -see for example Theorem 2.1.6 in Hernandez and
Weiss (1996).
Unequally spaced data? [TC & LW: fill in!]
More remarks on L1 loss leading to posterior median.
Include Eisenberg example?
Topics not covered here: Extensions to other data formats: time series spectral density estimation, count
data and Poisson estimation.
Books specifcially focused on wavelets in statistics include Ogden (1997), Hardle et al. (1998), Vidakovic (1999), Percival and Walden (2000), Jansen (2001) and Nason (2008). The emphasis in these books
is more on describing methods and software and less on theoretical properties. Hardle et al. (1998) is a more
theoretically oriented treatment of wavelets, approximation and statistical estimation, and has considerable
overlap in content with the later chapters of this book, though with a broader focus than the sequence model
alone.
SURE thresholding is discussed in Donoho and Johnstone (1995), which includes details of the O.n log n/
computational complexity.
Exercises
7.1
7.2
.2
m/ D 2r=2 WI;mC2r k
8
Thresholding and Oracle inequalities
Oracle, n. something regarded as an infallible guide or indicator, esp. when its action
is viewed as recondite or mysterious; a thing which provides information, insight, or
answers. (Oxford English Dictionary)
210
the n co-ordinates are grouped into blocks of size L, there arises a family of such oracle
inequalities with thresholds varying with L.
Without further information on the nature or size of , this logarithmic factor cannot be
improved. In the remainder of this chapter, we focus on the consequences of assuming that
we do have such information, namely that the signal is sparse.
A simple class of models for a sparse signal says that at most a small number of coordinates can be non-zero, k out of n say, though we do not know which ones. The minimax
risk for estimation of in such cases is studied in Sections 8.48.8, and is shown, for example, to be asymptotic to 2n2 kn log.n=kn / if the non-zero fraction kn =n ! 0. Thresholding
rules are asymptotically minimax in this case, and the upper bound is an easy consequence
of earlier results in this chapter.
The lower bound requires more preparation, being given in Section 8.6. It is based on
construction of a nearly least favorable sparse prior. We consider in parallel two different
models of sparsity. In the first, univariate model, we observe x N.; 1/ and give a prior
and study Bayes risks B./, for example over models of sparse priors. In the multivariate
model, y Nn .; n I /, we consider a high dimensional vector n with a large proportion of
components i D 0 and seek estimators that minimize the maximum risk r.O ; /. Of course,
the two models are related, as one method of generating a sparse multivariate mean vector
is to draw i.i.d. samples from a sparse prior in the univariate model.
Sections 8.5 and 8.6 are devoted to sparse versions of the univariate ( ! 0) and multivariate models (kn =n ! 0) respectively. The former introduces sparse two point priors,
supported mostly on 0 but partly on a single value > 0, this value being set up precisely
so that observed data near will, in the posterior, still be construed as most likely to have
come from the atom at 0! The latter section looks at a similar heuristic in the multivariate
model, studying independent copies of a single spike prior on a collection of kn blocks,
and arriving at a proof of the lower bound half of the 2n2 kn log.n=kn / limiting minimax
risk claim. The single spike prior approach can in particular handle the highly sparse case
in which kn remains bounded as n grows, whereas the approach based on i.i.d draws from a
univariate prior requires the extra assumption that kn ! 1.
Sections 8.7 and 8.8 consider respectively univariate and multivariate models in which
the non-zero fraction can take any positive value not necessarily approaching zero. The univariate results provide a foundation for a comprehensive statement of the limiting minimax
risk properties over multivariate models of exact sparsity, Theorem 8.20.
Notation. We continue to write y Nn .; 2 I / for the Gaussian model with noise level
, and use a distinguished notation x N.; 1/ when focusing on a single observation with
noise level one.
211
such a thing is of course not really possible, we are forced to accept extra terms, either
additive or multiplicative, in the analogs of bias and variance.
Proposition 8.1 If y N.; 2 /, there exists a constant M such that if 4
(
M 2 C . 1/ 2 if jj
rH .; /
M 2 2
if jj > :
(8.1)
Consider first the small signal case jj < : Arguing crudely,
2 2E y 2 IE C 2 2 :
E yIE
The first term is largest when j j D : In this case, if we set x D y= N.1; 1/ then
Z 1
2
2
E y IE 2
x 2 .x 1/dx 4. 1/ 2 ;
(8.2)
where we used the fact that for y 3; .y C1/2 .y/ 2.y 2 1/.y/ D 2.d=dy/ y.y/.
In the large signal case, j j > ; we use the relation y D C z to analyse by cases,
obtaining
(
z
if jyj > ;
yIE D
z y if jyj ;
so that in either case
.yIE
/2 2 2 .z 2 C 2 /:
Taking expectations gives the result, for example with M D 8. We have however deemphasized the explicit constants (which will be improved later anyway in Lemma 8.5 and
(8.18)) to emphasise the structure of the bound, which is the most important point here.
Exercise 8.2 shows how the condition > 4 can be removed.
From the proof, one sees that when the signal is small, the threshold produces zero most
of the time and the MSE is essentially the resulting bias plus a term for rare errors which
push the data beyond the threshold. When the signal is large, the data is left alone, and hence
has variance of order , except that errors of order are produced about half the time when
D
Example 8.2 Let us see how (8.1) yields rough but useful information in an n-dimensional
estimation problem. Suppose, as in the introductory example of Section 1.3, that y
Nn .; n2 I / with n P
D n 1=2 and that is assumed to be constrained to lie in an `1 -ball
n
n;1 D f 2 R W
ji j 1g. On this set, the minimax risk for linear estimation equals
1=2 (shown at (9.28) in the next chapter), but thresholding does much better. Let Bn be the
set of big coordinates ji j D n 1=2 ; and Sn D Bnc : Clearly, when 2 n;1 , the
number of big coordinates is relatively limited: jBn j n1=2 : For the small coordinates,
212
i2 n
ji j; so
X
i2 n
1=2
Bn
2
1/ 2
Sn
1=2
M n
C M n 1=2 C . 1/:
p
Choosing, for now, D 1 C log n; so that . 1/ D .0/n 1=2 ; we finally arrive at
p
EkO k2 M 0 log n= n:
While this argument does not give exactly the right rate of convergence, which is .log n=n/1=2 ,
let alone the correct constant, compare (13.39) and Theorem 13.17, it already shows clearly
that thresholding is much superior to linear estimation on the `1 ball.
<x x >
OS .x; / D 0
(8.3)
jxj
:
x C x < :
D arg min .x
/2 C 2jj;
while
(
x
OH .x; / D
0
jxj >
jxj :
D arg min .x
(8.4)
/2 C 2 I f 0g:
Similarities. These two estimators are both non-linear, and in particular have in common
the notion of a threshold region jxj , where the signal is estimated to be zero. Of course,
hard thresholding is discontinuous, while soft thresholding is constructed to be continuous,
which explains the names. Compare Figure 2.2. The threshold parameter in principle can
vary over the entire range .0; 1/, so the family includes as limiting cases the special linear
O 0/ D x and .x;
O 1/ D 0 that keep and kill the data respectively. In
estimators .x;
general, however, we will be interested in thresholds in the range between about 1.5 and
a value proportional to the square root of log-sample-size. We now make some comments
specific to each class.
Differences. Hard thresholding preserves the data outside the threshold zone, which can
be important in certain applications, for example in denoising where it is desired to preserve
213
as much as possible the heights of true peaks in estimated spectra. The mathematical consequence of the discontinuity is that the risk properties of hard thresholding are a little more
awkwardfor example the mean squared error is not monotonic increasing in 0:
Soft thresholding, on the other hand, shrinks the data towards 0 outside the threshold
zone. The mean squared error function is now monotone in 0; and we will see later
that the shrinkage aspect leads to significant smoothing properties in function estimation
(e.g. Chapter 10). In practice, however, neither soft nor hard thresholding is universally
preferablethe particular features of the application play an important role. The estimator
that we call soft thresholding has appeared frequently in the statistics literature, for example
Efron and Morris (1971), who term it a limited-translation rule.
Compromises. Many compromises between soft and hard thresholding are possible that
appear in principle to offer many of the advantages of both methods: a threshold region for
small x and exact or near fidelity to the data when x is large. Some examples were given
in Section 7.5. While these and other proposals can offer useful advantages in practice, we
concentrate here on soft and hard thresholding, because of their simplicity and the fact that
they well illustrate the main theoretical phenomena.
Soft thresholding.
We begin with soft thresholding as it is somewhat easier to work with mathematically. The
explicit risk function rS .; / D EOS .x; / 2 can be calculated by considering the
various zones separately explicit formulas are given in Section 8.10. Here we focus on
qualitative properties and bounds. We Rfirst restate for completeness some results already
proved in Section 2.7. Write .A/ D A .z/dz for the standard Gaussian measure of an
interval A and let I D ; . The risk of soft thresholding is increasing in > 0:
@
rS .; / D 2.I
@
/ 2;
(8.5)
while
rS .; 1/ D 1 C 2 ;
which shows the effect of the bias due to the shrinkage by , and
8
2 =2
.all /
Z 1
< e
p
rS .; 0/ D 2
.z /2 .z/dz
4 1 ./ . 2/:
:
4 3 ./ . large/:
(8.6)
(8.7)
(compare Exercise 8.3). A sharper bound is sometimes useful (also Exercise 8.3)
rS .; 0/ 4 3 .1 C 1:5 2 /./;
(8.8)
valid for all > 0. The risk at D 0 is small because errors are only made when the
observation falls outside the threshold zone.
We summarize and extend some of these conclusions about the risk properties:
214
Lemma 8.3 Let rNS .; / D minfrS .; 0/ C 2 ; 1 C 2 g. For all > 0 and 2 R,
1
rN .; /
2 S
(8.9)
The risk bound rNS .; / has the same qualitative flavor as the crude bound (8.1) derived
earlier for hard thresholding, only now the constants are correct. In fact, the bound is sharp
when is close to 0 or 1:
Figure 8.1 gives a qualitative picture of these bounds. Indeed, the (non-linear) soft thresholding rule can be thought of as the result of splining together three linear estimators,
namely O0 .x/ 0 when x is small and O1; .x/ D x when jxj is large. Compare Figure
2.2. The risk of O0 is 2 , while that of O1; is 1 C 2 . Thus our risk bound is essentially the
minimum of these two risk functions of linear estimators, with the addition of the rS .; 0/
term. We see therefore that rS .; 0/ C 2 is a small signal bound, most useful when jj is
small, while 1 C 2 is useful as a large signal bound.
Proof Symmetry of the risk function means that we may assume without loss that 0:
By (8.5), the partial derivative .@=@/rS .; 0 / 20 , and so
Z
Z
0
0
rS .; / rS .; 0/ D
.@=@/rS .; /d
20 d0 D 2 :
(8.10)
0
The upper bound follows from this and (8.6). For the lower bound, observe that if x > ,
then OS .x; / D x , while if x , then OS .x; / 0 and so jO j .
Writing x D C z; we have
rS .; / E.z
(8.11)
using (8.7). If ; then from monotonicity of the risk function, rS .; / rS .; /;
and applying (8.11) at D ;
/2 I fz > 0g C 2 =2 D 2 2.0/ C 1=2 .2 C 1/=2
p
with the last inequality valid if and only if 8=: In this case, the right sides of the
last two
N
/=2 and we are done. The proof of the lower bound for
p displays both exceed r.;
< 8= is deferred to Section 8.10.
rS .; / E.z
Consequences of (8.9) are well suited to showing the relation between sparsity and quality
of estimation. As was also shown in Section 2.7, using elementary properties of minima, one
may write
rS .; / rS .; 0/ C .1 C 2 / ^ 2 :
In conjunction with the bound rS .; 0/ e
(8.12)
=2
/. 2 ^ 2 /:
(8.13)
215
rS(;0) + 2
+1
rS(;)
rH(;)
1
Figure 8.1 Schematic diagram of risk functions of soft and hard thresholding when
noise level D 1 and the threshold is moderately large. Dashed lines indicate
upper bounds for soft thresholding of Lemma 8.3. Dotted line is risk of the unbiased
estimator O1 .x/ D x.
Hard thresholding
The risk function is easily written in the form
2
rH .; / D .I
z 2 .z/dz:
/ C
(8.14)
jzCj>
Q
z 2 .z/dz D 2./ C 2./
2./;
(8.15)
as ! 1. Note that the value at 1 reflects only variance and no bias, while the value
at zero is small, though larger than that for soft thresholding due to the discontinuity at :
However (8.14) also shows that there is a large risk near D when is large:
rH .; / 2 =2:
See Exercise 8.5 for more information near D .
Qualitatively, then, as increases, the risk of hard thresholding starts out quite small and
has quadratic behavior for small, then increases to a maximum of order 2 near D ,
and then falls away to a limiting value of 1 as ! 1. See Figure 8.1.
An analogue of the upper bound of Lemma 8.3 is available for hard thresholding. In this
case, define
(
minfrH .; 0/ C 1:22 ; 1 C 2 g 0
rNH .; / D
Q
1 C 2 .
/
;
216
(8.16)
p
if 2;
if 1:
Proof Again we assume without loss that 0: The upper bound for is a direct
consequence of (8.14). For 0 ; the approach is as used for (8.10), but for the details
of the bound 0 .@=@/rH .; / 2:4, we refer to Donoho and Johnstone (1994a,
Lemma 1). As a result we obtain, for 0 ,
rH .; / rH .; 0/ C 1:22 :
The alternate bound, rH .; / 1 C 2 , is immediate from (8.14).
The lower bound is actually easierby checking separately the cases and ,
it is a direct consequence of an inequality analogous to (8.11):
E OH .x; /
Q
We have
0 and define g./ D . C /2 ./:
Q
;
g 0 ./ D . C /./h./;
h./ D 2 ./=./
p
p
Q
./=
and h.0/ D 2 0 if 2: Differentiation and the bound ./
show that h is decreasing and hence negative on 0; 1/, so that g./ g.0/ D 2 =2: In the
Q
case where we only assume that 1; we have g./ 2 .1 C /2 ./
2 ; as may be
checked numerically, or by calculus.
For part (b), set D
For use in later sections, we record some corollaries of the risk bounds. First, for all ,
rH .; / 1 C 2 ;
for all ;
(8.17)
1
> 1:
(8.18)
(8.19)
Remark. In both cases, we have seen that the maximum risk of soft and hard thresholding is O.2 /. This is a necessary consequence of having a threshold region ; W if
O
.x/
is any estimator vanishing for jxj ; then simply by considering the error made by
estimating 0 when D ; we find that
O
E ..x/
/2 2 P fjxj g 2 =2
for large :
(8.20)
217
k22
.for 1/
(8.21)
(8.22)
(8.23)
(8.24)
d;
U2 .w/ D d C
2.d
1/.=w/1=2 :
(8.26)
Since U2 .1/ D d C D d.1 C 2 /, we easily obtain (8.22), essentially from the monotone
convergence theorem as ! 1.
To compute the derivative of ! r.; /, we will need some identities for noncentral
218
Since
U10 .w/
.@=@/r.; / D F;d . / C .d
(8.27)
which shows the monotonicity. Notice that when d D 1, the second right side term drops
out and we recover the derivative formula (8.5) for scalar thresholding. Borrowing from
Exercise 2.17 the inequality f;d C2 .w/ .w=d /f;d .w/, we find that the second term in
the previous display is bounded above by
Z
d 1 1 1=2
f;d .w/dw 1 F;d . /;
d
w
which completes the verification that .@=@/r.; / 1.
For the risk at zero, rewrite (8.25) as
Z 1
r.; / D C
.U2 U1 /.w/f;d .w/dw;
and note that for w we have .U2 U1 /0 D 1 C .d 1/ 1=2 w 3=2 0, so long as
1 (and d 1). Consequently r.; 0/ 21 Fd . / as was claimed.
The inequality (8.24) is part of (2.91) in Exercise 2.15.
It is now easy to establish upper and lower bounds for the risk of block soft thresholding
that have the flavor of Lemma 8.3 in the univariate case.
Proposition 8.7 The mean squared error of block soft thresholding satisfies
rS;d .; / rS;d .; 0/ C minfkk2 ; .1 C 2 /d g;
and, for
p
2,
rS;d .; / rS;d .; 0/ C
1
4
minfkk2 ; 2 d=2g:
Proof The upper bound is immediate from (8.23) and (8.22). [Of course, a stronger bound
minfrS;d .; 0/ C kk2 ; .1 C 2 /d g holds, but the form we use later is given.] For the lower
bound, again put D 2 d and r.; / D rS;d .; / and use representation (8.27) to write
Z
r.; / r.; 0/
F 0 ;d . /d 0 :
0
Suppose that =2. Exercise 8.7 shows that for 2d , we have F 0 ;d . / 1=4, and so
the display is bounded below by =4 D kk2 =4. When =2, simply use monotonicity of
the risk function and the bound just proved to get r.; / r.; =2/ r.; 0/ C =8.
219
(8.29)
as n ! 1, so that with high probability, O does not assert the presence of spurious structure. Indeed, note that if each yi is distributed independently as N.0; 2 /, then the chance
that at least one observation exceeds threshold n equals the extreme value probability
h
p
in
p
1
Q
$n D P f max jZi j 2 log ng D 1
1 2
2 log n
p
; (8.30)
i D1;:::;n
log n
valid for n 2 (see 3 in Section 8.10).
Table 8.1 compares the exact value $n of the extreme value probability with the upper
bound $n0 givenpin (8.30). Also shown is the expectation of the number Nn of values Zi
that exceed the 2 log n threshold. It is clear that the exceedance probability converges to
zero rather slowly, but also from the expected values that the number of exceedances is at
most one with much higher probability, greater than about 97%, even for n large. Compare
also Exercise 8.9. And looking at the ratios $n0 =$n , one sees that while the bound $n0 is
not fully sharp, it does indicate the (slow) rate of approach of the exceedence probability to
zero.
n
p
2 log n
$n
$nW
ENn
$n0
32
64
128
256
512
1024
2048
4096
2.63
2.88
3.12
3.33
3.53
3.72
3.91
4.08
0.238
0.223
0.210
0.199
0.190
0.182
0.175
0.169
0.248
0.231
0.217
0.206
0.196
0.188
0.180
0.174
0.271
0.251
0.235
0.222
0.211
0.201
0.193
0.186
0.303
0.277
0.256
0.240
0.226
0.214
0.204
0.196
p
Table 8.1 For i.i.d. Gaussian noise: sample size n, threshold 2 log n, exceedance probability $n ,
extreme value theory approximation $nW expected number of exceedances ENn ; upper bound $n0 of
(8.30)
The classical extreme value theory result Galambos (1978, p. 69) for the maximum of n
220
P .W t / D expf e t g;
(8.31)
p
p
p
where an D 2 log n .log log n C log 4/=.2 2 log n/ and bn D 1= 2 log n. Section
8.9 has some more information on the law of Mn .
Here we are actually more interested in maxi D1;:::;n jZi j, but this is described quite well
by M2n . (Exercise 8.11 explains why). Thus the exceedance
probability $n might be app
proximated by $nW D P .W c2n / where c2n D . 2 log n a2n /=b2n /. Although the
convergence to the extreme value distribution in (8.31) is slow, of order 1= log n (e.g. Hall
(1979), Galambos (1978, p. 140)). Table 8.1 shows the extreme value approximation to be
better than the direct bound (8.30).
A simple non-asymptotic bound follows from the Tsirelson-Sudakov-Ibragimov bound
Proposition 2.9 for a Lipschitz.1/ function f W Rn ! R of a standard Gaussian n vector
Z Nn .0; I / W
an ! W;
P fjf .Z/
Ef .Z/j t g 2e
t 2 =2
When applied to f .z/ D max jzi j; this says that the tails of max jZi j are sub-Gaussian, while
the extreme value
p result in fact suggests more: that the limiting distribution has standard
deviation O.1= log n/ about an .
Ideal Risk. Suppose that yi D i C zi ; i D 1; : : : n; with, as usual zi being i.i.d.
N.0; 1/: Given a fixed value of , an ideal linear estimator c;i
D ci yi would achieve the
best possible mean squared error among linear estimators for the given :
min r.c;i
; / D
ci
i2 2
2 21 ; 1 i2 ^ 2 :
i2 C 2
Because of the final bound, we might even restrict attention to the ideal projection, which
chooses ci from 0 or 1 to attain
min r.c;i
; / D i2 ^ 2 :
ci 2f0;1g
Thus the optimal projection choice ci . / equals 1 if i2 2 and 0 otherwise, so that
(
yi if i2 2
i .y/ D
0 if i2 2 :
One can imagine an oracle, who has partial, but valuable, information about the unknown
: for example, which co-ordinates are worth estimating and which can be safely ignored.
Thus, with the aid of a projection oracle, the best mean squared error attainable is the ideal
risk:
X
R.; 2 / D
min.i2 ; 2 /;
(8.32)
i
In Chapter 9 we will discuss further the significance of the ideal risk, and especially its
interpretation in terms of sparsity.
Of course, the statistician does not normally have access to such oracles, but we now show
that it is nevertheless possible to mimic the ideal risk with threshold estimators, at least up
to a precise logarithmic factor.
221
Proposition 8.8 pSuppose that y Nn .; 2 /. For the soft thresholding estimator (8.28) at
threshold n D 2 log n,
EkOSn
n
h
i
X
k22 .2 log n C 1/ 2 C
min.i2 ; 2 / :
(8.33)
A similar result holds for OHn , with the multiplier .2 log n C 1/ replaced by .2 log n C 1:2/.
The factor 2 log n is optimal without further restrictions on , as n ! 1,
inf sup
O 2Rn
2 C
EkO
Pn
k2
.2 log n/.1 C o.1//:
2 2
1 min.i ; /
(8.34)
Results of this type help to render the idea of ideal risk statistically meaningful: a genuine
estimator, depending only on available data, and not upon access to an oracle, can achieve
the ideal risk R.; / up to the (usually trivial) additive factor 2 and the multiplicative
factor 2 log n C 1: In turn, the lower bound (8.9) shows that the ideal risk is also a lower
bound to the mean squared error of thresholding, so that
1
R.; /
2
EkOSn
This logarithmic penalty can certainly be improved if we add extra constraints on : for
example that belong to some `p ball, weak or strong (Chapter 13). However, lower bound
(8.34) shows that the 2 log n factor is optimal for unrestricted ; at least asymptotically.
Note that the upper bounds are non-asymptotic, holding for all 2 Rn and n 1:
The upper bound extends trivially to correlated, heteroscedastic data, since thresholding
depends only on the univariate marginal distributions of the data. The only change is to replace 2 by i2 , the variance of the ith coordinate, in the ideal risk, and to modify the additive
factor to ave 1in i2 : There is also a version of the lower bound under some conditions on
the correlation structure: for details see Johnstone and Silverman (1997).
Proof Upper bound. For soft thresholding, a slightly stronger result was already established as Lemma 2.8.pFor hard thresholding, we first set D 1 and use (8.18) to establish
the bound, for n D 2 log n
rH .; / .2 log n C 1:2/.n
C 2 ^ 1/:
Q
This is clear for > 1, while for < 1, one verifies that rH .; 0/ D 2./ C 2./
1
.2 log n C 1:2/n for n 2. Finally, add over co-ordinates and rescale to noise level .
Lower bound. The proof is deferred till Section 8.6, since it uses the sparse two point
priors to be discussed in the next section.
Remark. Alan Millers variable
p selection scheme. A method of Miller (1984, 1990) offers
an interesting perspective on 2 log n thresholding. Consider a traditional linear regression
model
y D X C 2 z;
where y has N components and X has n < N columns x1 xn and the noise z
NN .0; I /: For convenience only, assume that the columns are centered and scaled: xiT 1 D 0
222
and jxi j2 D 1: Now create fake regression variables xi ; each as an independent random permutation of the entries in the corresponding column xi . Assemble X and X D
x1 xn into a larger design matrix XQ D X X with coefficients Q t D t t and
fit the enlarged regression model y D XQ Q by a forward stepwise method. Let the method
stop just before the first fake variable xi enters the model. Since the new variables xi are
approximately orthonormal among themselves and approximately orthogonal to each xi , the
estimated coefficients Oi are essentially i.i.d. N.0; 1/; and so the stopping criterion amounts
: p
to enter variables above the threshold given by maxi D1;:::;n jOi j D 2 log n:
Smaller thresholds. It is possible to obtain a bound of the form (8.33) for all 2 Rn
EkOSn
k22
n
n
h
i
X
2
C
min.i2 ; 2 / :
(8.35)
p
valid for thresholds n and bounds n notably smaller than 2 log n and 2 log n C 1
respectively. The details for (8.35) are in Section 8.10; it is shown there that n 2 .0; 1/ is
the unique solution of
.n C 1/rS .; 0/ D .1 C 2 /;
(8.36)
1.276
1.474
1.669
1.859
2.045
2.226
2.403
2.575
2.743
2.906
2.633
2.884
3.115
3.330
3.532
3.723
3.905
4.079
4.245
4.405
2.549
3.124
3.755
4.439
5.172
5.950
6.770
7.629
8.522
9.446
7.931
9.318
10.704
12.090
13.477
14.863
16.249
17.636
19.022
20.408
223
Block thresholding
We indicate the extension of Proposition 8.8 to block (soft) thresholding. Suppose that 2
Rn is partitioned into B blocks each of size L, thus we assume n D BL. While other
groupings of co-ordinates are possible, for simplicity we take contiguous blocks
b D .b.L
1/C1 ; : : : ; bL /;
b D 1; : : : ; B:
Let y be partitioned similarly; we sometimes abuse notation and write yb D .yk ; k 2 b/. As
in Chapter 6.2, we might consider block diagonal estimators Oc;b D .cb yb /. For simplicity,
we focus on projections, with cb D 0 or 1. The mean squared error of Oc;b is then either
entirely bias, kb k2 when cb D 0, or entirely variance, L 2 , when cb D 1. The ideal
projection chooses the minimum of the two and is given by
(
yb
if kb k2 L 2
b .y/ D
0
if kb k2 < L 2 :
Of course, this projection oracle requires knowledge of the block norms kb k2 , and it
achieves the block ideal risk
X
min.kb k2 ; L 2 /:
R.; I L/ D
b
P
Block soft thresholding can mimic the projection oracle. Let Sb2 D k2b yk2 and define
OB D .O;b / by
p
L
B
yb ;
b D 1; : : : ; B:
(8.38)
O;b .y/ D S;L .yb ; / D 1
Sb
C
With these definitions, and after rescaling to noise level , we can rewrite the conclusion
of Proposition 8.7 as follows.
Proposition 8.9 Suppose that y Nn .; 2 I / and that n D BL. The block soft thresholding estimator OB , (8.38), satisfies
EkOB
N L/;
k22 B 2 rS;L .; 0/ C R.; I
where rS;L .; 0/ is bounded at (8.21) and N 2 D 1 C 2 . If ; L are chosen such that
rS;L .; 0/ n 1 , then
EkOB
k22
C
B
X
min.kb k2 ; LN 2 2 /:
bD1
We turn to choice of block size and threshold. The factor LN 2 D L.1 C 2 / in the
risk bound should in principle be as small as possible consistent with a small value for
rS;L .; 0/. From (8.24), we have rS;L .; 0/ expf F .2 /L=2g 1=n so long as F .2 /
.2 log n/=L. Since F is monotone increasing, we are led to the equation
F .2 / D 2
224
L D
p
p
LD1
L
L D .log n/.1C/
L 1:
4:50524
2 log n
(8.39)
As a function of block size L, the factor L.1C2 / may be written as L.1CF 1 ..2 log n/=L//
and since F 1 .x/ max.1; x/, we find that L.1 C 2 / 2 max.L; log n/. From this perspective, then, there is little advantage to choosing block sizes of order larger than L D log n.
i D 1; : : : ; n
(8.40)
(8.41)
1=p
ji jp C p :
Results for these notions of approximate sparsity will be given in Chapters 9, 11 and 13.
1
225
In this chapter we concentrate on exact sparsity, an important case which is also technically easiest to work with. Exact sparsity may also be viewed as a limiting case of approximate sparsity as p ! 0, in the sense that
kkpp D
n
X
ji jp ! #f i W i 0g D kk0 :
iD1
Minimax risk over n k. When is restricted to a parameter set , the minimax risk
for mean squared error is defined, as in earlier chapters, by
RN .; / D inf sup E kO
O 2
k22 :
r.; i / .n
iD1
Now come the bounds for soft thresholding obtained in Section 8.2. The risk
p at zero is
bounded by (8.7), where we use, for example, the middle expression for 2. In addition, for all values of , the risk is bounded by 1 C 2 , compare (8.6). Thus, the previous
display is bounded by
4n 1 ./ C k.2 C 1/:
Set the threshold at n D
Consequently,
n
X
iD1
r.n ; i / p
2k
log.n=k/
2n =2
D .0/k=n.
C k2 log.n=k/ C 1:
Asymptotic model. Our object is to study the asymptotic behavior of the minimax risk
RN as the number of parameters n increases. We regard the noise level D n and number
of non-zero components k D kn as known functions of n: This framework accomodates a
common feature of statistical practice: as the amount of data increaseshere thought of as
a decreasing noise level n per parameterso too does the number of parameters that one
may contemplate estimating. To simplify the theory, however, we mostly set D 1. Since it
is a scale parameter, it is easy to put it back into the results, for example as in the summary
statement in Theorem 8.20.
Consider first the case in which kn =n ! 0 (the situation when kn =n ! > 0 is deferred
to Sections 8.7 and 8.8). Here the contribution from the risk at 0 in the previous display is of
smaller order than that from the risk bound for the kn nonzero components, and we arrive at
226
(8.42)
2n k
The leading term is proportional to the number of non-zero components kn , while the
multiplier 2 log.n=kn / can be interpreted as the per-component cost of not knowing the
locations of these non-zero components.
This upper bound for minimax risk over n kn turns out to be asymptotically optimal.
The result is formulated in the following theorem, to be finally proved in Section 8.6.
Theorem 8.10 Assume model (8.40) and parameter space (8.41). If kn =n ! 0 as n ! 1,
then
RN .n kn / D inf sup r.O ; / 2kn log.n=kn /:
(8.43)
O 2n kn
Of course, this is much smaller than the minimax risk for the unconstrained parameter space RN .Rn / n, which as noted at (2.51), is attained by the MLE O .y/ D y.
The assumption of sparsity n kn entails a huge reduction in minimax risk, from n to
2kn log.n=k
n /, and this reduction can, for example, be achieved using soft thresholding at
p
n D 2 log.n=kn /.
To establish Theorem 8.10, we need lower bounds on the minimax risk. These will be
obtained by computing the Bayes risks of suitable nearly least favorable priors , compare
the discussion in Chapter 4.3.
We consider two approaches to constructing these priors, to be outlined here and developed in detail over the next three sections. In the first, which we now call the multivariate problem,
we work with model (8.40) and the n-dimensional mean squared error
O / D Pn E Oi .y/ i 2 . In the second, the univariate Bayes problem, we consider a
r.;
1
scalar observation y1 D 1 Cz1 , but in addition suppose that 1 is random with distribution
1 , and that an estimator .y1 / is evaluated through its integrated MSE:
B.; / D E E .y1 /
1 2 D E r.; 1 /:
An obvious connection between the two approaches runs as follows: suppose that an estiO
mator .y/
in the multivariate problem is built by co-ordinatewise application of a univariate
estimator , so that Oi .y/ D .yP
i /; and that to a vector D .i / we associate a univariate
n
1
e
(discrete) distribution n D n
iD1 i . Then the multivariate and univariate Bayes mean
squared errors are related by
O / D
r.;
n
X
The sparsity condition n k in the multivariate problem, cf. (8.41), corresponds to requiring
that the prior D ne in the univariate problem satisfy
f1 0g k=n:
We will see that the univariate problem is easier to analyze, and that sometimes, but not
always, the multivariate minimax risk may evaluated via the univariate Bayes approach.
227
Candidates for least favorable priors. In the univariate Bayes problem, we build a prior
by n iid draws from a univariate mixture: for small, such as near k=n, suppose that
IID
iid
i .1
/0 C :
(8.44)
m
X
e i :
(8.45)
iD1
It might seem more natural to allow D eI , but this leads to slightly messier formulas, e.g. in (8.57).
228
/0 C ;
> 0;
(8.46)
1
.x /
D
;
.x / C .1 /.x/
1 C m.x/
.1 / .x/
P .f0gjx/
D
D exp. 12 2
P .fgjx/
.x /
x C 12 2 /
(8.47)
(8.48)
The prior probability on 0 is so large that even if x is larger than , but smaller than Ca,
the posterior distribution places more weight on 0 than . 3 See Figure 8.2.
The Bayes rule for squared error loss is as usual the posterior mean, which becomes
.x/ D P .fgjx/ D
:
1 C m.x/
1Ce
.z a/
(8.49)
a/g and
(8.50)
In particular, observe that ./ is small, and even . C a/ D =2 is far from .
3
Fire alarms are rare, but one may not believe that a ringing alarm signifies an actual fire without further
evidence.
229
(x)
m(x)
(+ a) = =2
=2
prior mass 1{
m(+a)=1
1
prior mass
+a
Figure 8.2 Two point priors with sparsity and overshoot a: posterior probability
ratio m.x/ and posterior mean .x/
(8.51)
(8.52)
In this case, there is a simple and important asymptotic approximation to the Bayes risk
of a sparse two point prior D ; .
Lemma 8.11 Let have sparsity and overshoot a D .2 log
Then, as ! 0,
B. / 2 :
Proof
By definition, we have
B. / D .1
/r. ; 0/ C r. ; /:
(8.53)
Thus, a convenient feature of two point priors is that to study the Bayes risk, the frequentist
risk function of only needs to be evaluated at two points. We give the heuristics first.
When is large and the overshoot a is also large (though of smaller order), then (8.50)
shows that for x N. ; 1/, the Bayes rule essentially estimates 0 with high probability,
thus making an error of about 2 . A fortiori, if x N.0; 1/, then estimates 0 (correctly)
with even higher probability. More concretely, we will show that, as ! 0,
r. ; / 2 ;
r. ; 0/ D o.2 /:
(8.54)
Inserting these relations into the Bayes risk formula (8.53) yields the result. The primary
contribution comes from the risk at D , and the large error 2 that is made there.
The first relation is relatively easy to obtain. Using (8.50), we may write
Z 1
.z/dz
2 ;
(8.55)
r. ; / D 2
.z a/ 2
1
C
e
1
230
(8.56)
(8.57)
(8.58)
This is a slight abuse of terminology, since the (now discrete valued) parameter is really 2 fe1 ; : : : ; en g
and not I per se.
231
Lemma 8.13 The prior n D S . I n/ has Bayes risk, for squared error loss, bounded as
follows
B.n / 2 E e1 1
p1n .y/2 :
Proof Write the Bayes risk in terms of the joint distribution of .; y/ when S .I n/,
and exploit the symmetry with respect to co-ordinates to reduce to the first component:
n
X
B.n / D E
O;i
i 2 D nEO;1
1 2 :
iD1
1 2 D .1
2
1=n/E D0 O;1
C .1=n/E D e1 O;1
2 :
With an appropriate choice of , to be made later, we might expect the first term on the
right side to be of smaller order than the second, compare (8.54) in the previous section. We
therefore drop the first term, and using (8.58), find that
B.n / E e1 O;1
2 D 2 E e1 p1n .y/
12 :
(8.59)
Wn D n e
n2 =2
n
X
e n zk :
kD1
n zk
n2 =2
De
we have EWn D 1. We might expect a law of large numbers to hold,
Since Ee
at least if n is not too big. However, if n is as large as n then Wn fails to be consistent.
p
iid
Lemma 8.14 Let z1 ; : : : ; zn N.0; 1/ and n D 2 log n. Then
(
p
1
if n n ! 1
Wn !
1 .v/ if n D n C v:
Remark. Part 6 of Section 8.10 briefly connects the behavior of Wn to results in the
random energy model of statistical physics.
Proof If n is small
law of large numbers applies. Fix
2
p enough, the ordinary weak
1 n2
1/ ! 0, and Wn ! 1 in probability
.1=2; 1/: if n <
log n, then Var Wn D n .e
p
by Chebychevs inequality. However, if n
log n, the variance can be large and we
must truncate, as is done in the triangular array form of the weak law of large numbers,
recalled in Proposition C.14. Put Xnk D e n zk and bn D e n n , and then introduce XN nk D
Xnk I fjXnk j bn g D e n zk I fzk n g. We must verify the truncation conditions (i) and
232
rn /:
Pn
n / C op .bn n 1 e
n2 =2
/:
probability:
p1n .y/ D 1 C Vn Wn 1 1 ;
P
2
2
where Wn 1 D .n 1/ 1 e n =2 n2 e n zi and Vn D .n 1/e n =2 n z1 .
Then by Lemma 8.14, since n n ! 1, Wn ! 1 in probability as n ! 1. For Vn ,
observe that
2n
again because n
in probability.
n2
2n z .n
n
zC /.n C n / ! 1;
(8.60)
p
Proof of Proposition 8.12. First suppose that n D 2 log n and that n n ! 1. We
then directly apply Lemmas 8.13 and 8.15: since x ! .1 x/2 is bounded and continuous
for x 2 0; 1 and p1n .y/ ! 0 in Pn e1 probability, we have B.n / n2 .1 C o.1//:
If instead it is not the case that n n ! 1, then choose n0 n which satisfies both
0
n n ^ n and also n n0 ! 1. For example n0 D n ^ n log n will do. Then use
n0 in the argument of the previous paragraph to conclude that
B.n / n02 .1 C o.1// n2 ^ 2n :
Remark. Let us say a little more about the connections between these calculations and
those of the previous section. We may identify n with for D 1=.n C 1/ and then think
233
of n as corresponding to the support point in the two point prior . The overshoot
condition a D .2 log 1 /
for 0 <
< 1=2 combined with (8.51) shows that !
1, which is the analog of n n ! 1. The overshoot condition also shows that the
posterior probability P .f gjx D C w/ ! 0 for each w 2 R as ! 0, which
corresponds to p1n .y/ ! 0 in Lemma 8.15.
(8.61)
p
Optimality of 2 log n risk bound. We are now also able establish the minimax lower
bound (8.34). Set D 1 without loss of generality and bring in a non-standard loss function
Q O ; / D
L.
1C
kO
P
k2
:
2
i min.i ; 1/
R
O / and B.
O / D r.
Q ;
Let r.
Q ;
Q O ; /.d / respectively denote risk and integrated risk for
the new loss function. The left hand side of (8.34) is the minimax risk for the new loss
function, and arguing as in Section 4.3, compare (4.13)(4.15), we obtain a lower bound
Q O ; / sup B./:
Q
RQ N D inf sup r.
Q O ; / inf sup B.
O
O
A nearly least favorable prior is again given by the independent block spike prior n D
with kn D log n blocks eacn of length mn D n= log n. Again, the remaining indices
are ignored and choose m D m log m so that m m ! 1. Since exactly one
coefficient is non-zero in each block, we have with probability one under n that
X
1C
min.i2 ; 1/ D 1 C kn :
nIB
234
kp n =2g .n/:
O n .n /
(8.62)
Proof Since the spike prior n concentrates on n . /, we have sup2n . / P .A/ P.A/,
where P denotes the joint distribution of .; y/ for n .
The argument makes use of the maximum a posteriori estimator for the spike prior n ,
given by OMAP D eIO , where IO D argmaxi P .I D ijy/ D argmaxi yi . It is the Bayes
estimator for the spike prior n and loss function L.a; / D I fa g, so that for any
O P.O / P.O MAP /.
estimator ,
Let O be a given arbitrary estimator and let O .y/ be the estimator defined from it by
choosing a point from the set f e1 ; : : : ; en g that is closest to O .y/ in (quasi-)norm k kp .
Therefore, if kO ei kp < =2 then O D ei this is obvious for p 1, while for p < 1
it follows from the triangle inequality for k kpp . Hence
P.kO
Z > g
(8.63)
Z an / D P .Mn
an0 /P .Z an0
an /:
235
For any 0 > 0, we have P .Mn 1 pan0 / ! 1, for example by (8.78) in Section 8.9. A
little algebra shows that an0
an 2
log n for some
.; 0 / > 0 and hence P .Z
0
an an / ! 1 also.
The second result is for MSE and offers an example of a non-asymptotic bound, that is
one valid for all finite n. It prepares for further non-asymptotic bounds in Section 11.4. As
might be expected, the non-asymptotic bounds are less sharp than their asymptotic cousins.
To state them, recall that for a single bounded normal mean in ; , Section 4.6 showed
that the minimax risk N .; 1/ D RN . ; ; 1/ satisfies
c0 . 2 ^ 1/ N .; 1/ 2 ^ 1;
for a suitable 0 < c0 < 1.
Proposition 8.17 Suppose that y Nn .0; I /. There exists c1 > 0 such that for all n 2,
c1 2 ^ .1 C log n/ RN .n . // .log n/
1=2
C 2 ^ .1 C 2 log n/:
Proof For the upper bound, consider the maximum risk of soft thresholding at n D
p
2 log n. Bound (8.12) says that
sup r.On ; / .n
n . /
p
2 log n at 0 for n 2
rS .n ; 0/ n 1 .log n/
1=2
Indeed, this follows from (8.8) for n 3 since then n 2, while for n D 2 we just evaluate
risk (8.80) numerically. The upper bound of the proposition is now immediate.
For p
the lower bound, this time we seek a bound for B.n / valid for all n. Introduce
`n D 1 C log n and n D ^ `n . We start from (8.59) and note that on the event En D
fy1 maxj yj g we have p1n .y/ 1=2 and so B.n / .n2 =4/Pn e1 .En /: From (8.63)
P .En / P fZ < 0; Mn
> n g D 21 P fMn
n g 12 P fMn
`n g
> `n / c0 for n 2:
g:
/0 C ;
(8.64)
236
(8.65)
where the second equality uses the minimax theorem 4.12. From the scale invariance
0 .; / D 2 0 .; 1/;
it will suffice to study the unit noise quantity 0 .; 1/, which we now write as 0 ./.
Proposition 8.18 The univariate Bayes risk 0 ./ is concave and increasing (and hence
continuous) for 0 1, with 0 ./ and 0 .1/ D 1. As ! 0, the minimax risk
0 ./ 2 log 1 ;
and an asymptotically minimax rule is given by soft thresholding at D .2 log 1 /1=2 .
Proof First, monotonicity is obvious. Concavity of 0 follows from concavity of Bayes
risk ! B./, Remark 4.1, together with convexity of the constraint defining m0 ./: if
both 0 .f0g/ and 1 .f0g/ , then of course ..1 /0 C 1 /.f0g/ .
When D 1, there is no constraint on the priors, so by (4.20), 0 .1/ D N .1; 1/ D 1:
Concavity of the Bayes risk also implies that B..1 /0 C / .1 /B.0 / C B./,
and since B.0 / D 0, maximizing over shows that 0 ./ :
Finally, we consider behavior as ! 0. For soft thresholding , we have r.; /
1 C 2 , compare Lemma 8.3. Since D .1 /0 C , we have
Z
B. ; / D .1 /r.; 0/ C r.; /.d/ r.; 0/ C .1 C 2 /:
For D .2 log 1 /1=2 large, recall from (8.7) that r.; 0/ 4 3 ./ D o./, so that
0 ./
C O./:
2m0 ./
For a lower bound we choose a sparse prior as in Lemma 8.11 with sparsity and
overshoot a D .2 log 1 /1=4 . Then, from that lemma and (8.52), we obtain
0 ./ B.;./ / 2 ./ 2 log 1 :
The existence and nature of the least favorable distribution for m0 ./ is of some interest,
and will be used later for the pth moment case, Proposition 13.5. The proof may be skipped
at a first reading without loss of continuity.
Proposition 8.19 Assume 0 < < 1. The Bayes minimax problem associated with m0 ./
and 0 ./ has a unique least favorable distribution . The measure is proper, symmetric
and has countably infinite support with 1 as the only accumulation points.
Of course, symmetry means that .B/ D . B/ for measurable sets B R.
237
Proof of Proposition 8.19 The set m0 ./ is not weakly compact; instead we regard it as a
subset of PC .R/, the substochastic measures on R with positive mass on R, with the vague
topology. [For more detail on vague convergence, see appendix C.19, and for example Huber
and Ronchetti (2009).] Since m0 ./ is then vaguely compact, we can apply Proposition
4.13 (via the remark immediately following it) to conclude the existence of a unique least
favorable prior 2 PC .R/. Since < 1, we know that .R/ > 0. In addition, is
symmetric.
A separate argument is needed to show that is proper, .R/ D 1. Suppose on the
contrary that D 1 .R/ > 0: From the Fisher information representation (4.4) and
(4.21), we know that P0 D ? minimizes I.P / for P varying over the convolution set
m0 ./? D fP D ? W 2 m0 ./g. We may therefore use the variational criterion in
the form given at (C.21). Thus, let P1 D P0 C ? for an arbitrary (prior) probability
measure on R. Let the corresponding densities be p1 and p0 , and set 0 D p00 =p0 .
Noting that p1 p0 D ? , we may take D for each 2 R, and (C.21) becomes
E 2
0
0
2
0
0:
Steins unbiased risk formula (2.58) applied to d0 .x/ D x 0 .x/ then shows that r.d0 ; /
1 for all . Since d0 .x/ D x is the unique minimax estimator of when x N.; 1/, Corollary 4.10, we have a contradiction and so it must be that .R/ D 1.
As is proper and least favorable, Proposition 4.14 yields a saddle point .O ; /. Using
the mixture representation (8.64), with D corresponding to , well defined because
> 0, we obtain from (4.22) applied to point masses D that for all
Z
r.O ; / r.O ; 0 / .d 0 /:
In particular, ! r.O ; / is uniformly bounded for all , and so is an analytic function
on R, Remark 4.2. It cannot be constant (e.g. Exercise 4.1) and so we can appeal to Lemma
4.18 to conclude that is a discrete measure with no points of accumulation in R. The
support of must be (countably) infinite, for if it were finite, the risk function of O would
necessarily be unbounded (again, Exercise 4.1).
238
p
O i ; n 2 log.n=kn // are asympand the (soft or hard) thresholding estimators Oi .y/ D .y
totically minimax.
The highly sparse case in which the number of spikes k remains fixed is included:
RN .n k; n / 2n2 k log n:
Proof The case kn =n ! 0 shown in the second display has essentially been proved in
previous sections.
p Indeed, the upper bound was established at (8.42) using soft thresholding at level n 2 log.n=kn /. The same argument works for hard thresholding, now using
global risk bound (8.17) and bound (8.19) for the risk at zero. The lower bound was obtained with the independent blocks prior in the paragraph concluding with (8.61). Finally,
the equivalence of the two displays n0 .kn =n/ 2kn log.n=kn / was shown in Proposition
8.18.
We turn now to the proof when n ! > 0. The proof uses the Bayes minimax method
sketched in Chapter 4 with both upper and lower bounds derived in terms of priors built
from i.i.d. draws from univariate priors in m0 .n /. As an intermediate, we need the class of
priors supported on n kn ,
Mn D Mn kn D f 2 P .Rn / W supp n kn g
and the subclass Men D Men kn Mn kn of exchangeable or permutation-invariant
priors.
The upper bound can now be outlined in a single display,
RN .n kn ; n / B.Mn ; n / D B.Men ; n / D nn2 0 .kn =n/:
(8.66)
To explain, recall that B.M; / D supfB./; 2 Mg: The first inequality follows because
Mn contains all point masses for 2 n kn , compare (4.18). If we start with a draw
from prior and then permute the coordinates randomly with a permutation , where the
average is taken over permutations of the n coordinates, we obtain a new, exchangeable prior
e D ave. /. Concavity of the Bayes risk, Remark 4.1, guarantees that B./ B. e /I
this implies the second equality, since Men Mn .
The univariate marginal 1 of an exchangeable prior 2 Men kn belongs to m0 .kn =n/,
and the independence trick of Lemma 4.15 says that if we make all coordinates independent
with marginal 1 , then the product prior 1n is less favorable than recall that the posterior
variance of each i in 1n depends on only yi and so is larger than the posterior variance of
i in , which may depend on all of y. As a result,
B./ B.1n / D nB.1 /:
Rescaling to noise level one and maximizing over 1 2 m0 .kn =n/, we obtain the equality
in the third part of (8.66). Note that the upper bound holds for all kn n.
The idea for the lower bound runs as follows. Using the arguments of Section 4.3,
RN .n kn ; n / supfB./; 2 Mn kn g:
An approximately least favorable prior for the right side might be constructed as D 1n ,
corresponding to taking n i.i.d. draws from a univariate prior 1 2 m.kn =n/ with 1 chosen
to be nearly least favorable for m.kn =n/. This is a version of the prior IID described in
239
Section 8.4. The same technical difficulty arises: let Nn D #fi W i 0g. Even though
ENn kn ; we do not have .Nn k/ D 1 and so is not guaranteed to belong to
Mn kn .
The Bayes minimax method of Section 4.11 patches this up by modifying the definition
of so that .Nn k/ ! 1 as n ! 1. The family of parameter spaces will be n k,
nested by k. The sequence of problems will be indexed by n, so that the noise level n and
sparsity kn depend on n. We use the exchangeable classes of priors Men defined above, with
Bayes minimax risk given by nn2 0 .kn =n/, compare (8.66). We introduce the notation
Bn .; n / D nn2 0 .=n/;
(8.67)
which is equally well defined for non-integer , compare definition (8.65). For each fixed
< 1, then, we will construct a sequence of priors n 2 Men
kn . which are built from
i.i.d. draws from a suitable one-dimensional distribution 1n . With n denoting n kn , we
will show that n has the properties, as n ! 1,
B.n /
Bn .
kn ; n /.1 C o.1//;
(8.68)
n .n / ! 1;
E fkO k2 C kk2 ; c g D o.Bn .
kn ; n //
(8.69)
(8.70)
where n ./ D n .jn /, and On is the Bayes estimator for the conditioned prior n and that
lim lim inf
!1 n!1
Bn .
kn ; n /
D 1:
Bn .kn ; n /
(8.71)
It then follows from Lemma 4.32 and the discussion after (4.71) that RN .n kn ; n /
Bn .kn ; n /.1 C o.1//: In conjunction with the upper bound in (8.66) this will complete the
proof of Theorem 8.20.
For
< 1, we may choose M and a univariate prior M 2 m0 .
/ with support contained in M; M and satisfying B.M /
0 .
/, compare Exercise 4.5. The corresponding prior n in the noise level n problem is constructed as i D n i , where
1 ; : : : ; n are i.i.d. draws from M . By construction and using 0 .n / 0 ./, we then
have
B.n / D nn2 B.M / nn2
0 .
/
Bn .
kn ; n /;
(8.72)
where the final equivalence uses (8.67) and the fact that 0 .
n / 0 .
/, a consequence
of Proposition 8.18.
Since M f1 0g
, we may bound kk0 above stochastically by a Binomial.n;
/
variable, Nn say, so that
n fcn g P fNn
ENn > kn
n g D O.n 1 /;
240
MedMn j tg e
t 2 =2
t 2 =2
P fjMn
EMn j tg 2e
p
Both MedMn and EMn are close to Ln D 2 log n. Indeed
p
jEMn MedMn j 2 log 2;
Ln
(8.73)
:
(8.74)
Ln 1 MedMn Ln ;
p
2 log 2 EMn Ln :
(8.75)
(8.76)
The bound (8.74) is Exercise 2.22. The right bound of (8.75) follows from (8.77) below, and
for the left bound see Exercise 8.13. The right hand bound of (8.76) is Proposition C.12 and
the left bound then follows from (8.74) and (8.75). Of course, asymptotic expressions for
MedMn and EMn follow with some extra work from the extreme value theorem (8.31).
In fact the distribution of Mn is confined largely to a shrinking interval of width 2 log2 Ln =Ln ,
mostly below Ln . Indeed, arguing analogously to (8.30), we have for n 2,
p
(8.77)
P fMn Ln g 1=. 2Ln /:
while Exercise 8.13 shows that for Ln 3,
P fMn Ln
1
3
exp.log2 Ln /g:
(8.78)
Qn
L
x:10
x:50
EMn
x:90
Ln
32
128
1024
4096
1.92
2.29
2.79
3.11
1.48
2.10
2.84
3.26
2.02
2.55
3.20
3.58
2.07
2.59
3.25
3.63
2.71
3.15
3.71
4.05
2.63
3.11
3.72
4.08
241
(8.79)
Z
/dx C
2 .x
Z
/dx C
.x
/2 .x
/dx
/.
(8.80)
(8.81)
/ C . C /. C /
where and denote the standard Gaussian density and cumulative distribution functions respectively,
Q
and .x/
D 1 .x/.
2 : Proof of lower bound in Lemma 8.3 for 0 2. Let be the solution in of r.; 0/ C 2 D
2
1 C 2 : Since r.; 0/ e =2 < 1; (compare (8.7) ), it is clear that > : For we may write,
using (8.5),
R
r.; 0/ C 0 2s.I s/ds
r.; /
:
D
R.; / D
r.;
N
/
r.; 0/ C 2
We first verify that R.; / is decreasing in : Indeed ! c C f1 ./=c C f2 ./ is decreasing
if both f10 ./ f20 ./ and .f1 =f2 /./ is decreasing. The former condition is evident, while the latter
R1
follows by the rescaling v D s= W for then .f1 =f2 /./ D 2 0 .I v/dv:
For , we also have R.; / R.; / since r.; / r.; / while r.;
N
/ 1 C 2 :
Consequently, for all
R.; / r.; /=r.; 0/ C 2 ;
and numerical evaluation for 0 2 shows the right side to be bounded below by .516, with the
minimum occurring for 2 :73; :74:
/n n, with
p
p
2. 2 log n/
1
Q
D p
:
2 log n/ p
D 2.
2 log n
n log n
.1
rS .; /
n
D
maxf.n C 1/rS .; 0/; 1 C 2 g D n ./;
1 C 2 ^ 1
nC1
(8.82)
say. To see this, consider two cases separately. For 1, the risk ! rS .; / increases to 1 C 2 at
D 1. For 1, the ratio on the left side is bounded using (8.10) by
rS .; 0/ C 2
nrS .; 0/:
n 1 C 2
Thus, for satisfying nrS .; 0/ 1,
rS .; / n ./fn
C min.2 ; 1/g;
and the bound (8.35) follows by adding over co-ordinates and rescaling to noise level .
Now we seek the minimum value of n ./. Since ! 1 C 2 is strictly increasing and ! rS .; 0/ is
strictly decreasing (as is seen from the integral in (8.7)), it follows that when nrS .; 0/ 1, the minimum
value of n ./ is attained for the (unique) solution n 2 .0; 1/ of the equation (8.36). Finally, we set
n D n .n / D 1 C 2
n .
242
5 : Proof of second half of (8.54). Combining (8.47) and (8.48), we have m.x/ D e .Ca x/ . Using
formula (8.49) for , then changing variables to z D x a and finally exploiting (8.48), we find that
Z
Z 1
2
.x/dx
e .Ca/z z =2 dz
.1 /E0 2 D .1 /2
D 2 .a/
:
1 C e z 2
1 C e .Ca x/ 2
1
We now verify that
.1
/E2 .z/
.a/
2 ./
.Ca/z z 2 =2
Z
dz C
1
1
.w/dw
:
1 C e .wCa/
(8.83)
Consider the final integral in the antepenultimate display, first over .0; 1/: we may replace the denominator
by 1 to obtain the first term in (8.83). Over . 1; 0/, we have e z =1 C e z 1, and with v D z this
part of the integral is bounded by
Z 1
.v a/dv
2
;
1 C e v
0
which with w D v a leads to the second term in (8.83). By dominated convergence, both right hand side
terms converge to zero as and a ! 1.
P
6 : The variable nkD1 e n zk is the basic quantity studied in the random energy model of statistical
physics, where it serves as a toy model for spin glasses, e.g.pMezard and Montanari (2009, Ch. 5). In the
current notation, it exhibits a phase transition at n D n D 2 log n, with qualitatively different behavior
in the high temperature (n < n ) and low temperature (n > n ) regimes.
P
iid
Here is a little more detail on the phase transition. Write Sn ./ D n1 e zk for z1 ; : : : ; zn N.0; 1/.
If is small enough, then heuristically the sum behaves like n times its expectation and
X
log
e zk log.nEe z1 / D .2n C 2 /=2;
p
with n D 2 log n. However, for large , it is better to approximate the sum by the dominant term
X
log
e zk z.1/ n :
The crossover in behavior occurs for near n , see in part Proposition C.12, and is formalized in the
following statement, which may be proved
p directly from the discussion of the random energy model in
Talagrand (2003, Ch. 1.1, 2.2). If n D 2 log n and an ! a > 0, then
(
p
log Sn .an n /
1 C a2 a 1
!
(8.84)
2a
a > 1:
log n
8.11 Notes
8.2, 8.3. Many of the ideas and bounds for soft and hard thresholding in Sections 8.2 and 8.3, and in
particular the oracle inequality Proposition 8.8, come from Donoho and Johnstone (1994a), see also Donoho
and Johnstone (1996) In classical antiquity an oracle, such as the priestess Pythia at the temple of Apollo at
Delphi, was a person or agency believed to convey wise counsel inspired by the gods.
Soft thresholding appeared earlier as one of the class of limited translation rules of Efron and Morris
(1971) and also in Bickel (1983), as discussed below.
The discussion of block soft thresholding leading to Propositions 8.6 and 8.9 is inspired by the study
of block James-Stein thresholding in Cai (1999): the definitions were compared in Section 7.6.5. Block
soft thresholding is somewhat easier to analyze because of the monotonicity property of its mean squared
error, cf. (8.27), but leads to similar results: Proposition 8.9 and the asymptotic dependence of threshold
L on block size L in (8.39) essentially match Cais conclusions. The paper of Cavalier and Tsybakov
(2001), already mentioned in the notes to Chapter 6, shows the broad adaptivity properties of penalized
blockwise Stein methods, and the interaction between block sizes and penalties. Some of the properties of
block soft thresholding developed here appear also in Donoho et al. (2012) where they are used to study
phase transitions in compressed sensing.
243
Exercises
Some of the methods used in Proposition 8.6 are derived from a distinct but related problem studied in
Johnstone (2001), namely threshold estimation of the noncentrality parameter based on W 2d ./.
8.4. The discussion paper Donoho et al. (1992) identified `0 sparsitythe term nearly black was used
there for sparse non-negative signalsas a property leading to significant reductions in minimax MSE:
Theorem 8.10 is established there in the case kn ! 1.
8.5. The discussion of sparse univariate two point priors builds on Donoho et al. (1992), which was
in turn influenced by Bickel (1983, 1981), where priors with atoms at 0 appear in the study of minimax
estimation of a normal mean subject to good risk properties at a point, such as D 0.
8.6. There is some discussion of single spike priors in Donoho et al. (1997), for example a version of
Proposition 8.16, but most of the development leading to the kn -fixed part of Theorem 8.10 is new to this
volume. Parallel results were obtained independently by Zhang (2012).
8.7. The study of Fisher information over classes of distributions with a sparse convolution component,
such as F D ? for 2 m0 ./ was stimulated by Mallows (1978). The key parts of Proposition 8.18 are
established in Bickel (1983). Bickel and Collins (1983) studied the minimization of Fisher information over
classes of distributions. Via Browns identity, Proposition 4.5, this leads to results for the Bayes minimax
risk (8.65). In particular Proposition 8.19 is proven there.
8.8. Theorem 8.8 is an `0 or exact sparsity version of a corresponding result for `p balls 13.17, a version
of which was first established in Donoho and Johnstone (1994b).
8.9. An alternative reference for standard extreme value theory results for Mn and jM jn is de Haan and
Ferreira (2006, Theorem 2.2.1). DasGupta (2011b) gives a detailed nonasymptotic analysis of the mean and
median ofp
Mn , and explores use of Q n D 1 .1 e
=n/,
D Eulers constant, as an improvement upon
the n D 2 log n threshold.
Lai and Robbins (1976) show that EMn .2 log n log log n/1=2 for n 3 and that this holds whatever
be the interdependence among the zi d N.0; 1/.
Exercises
8.1
Q
(Mills ratio and Gaussian tails.) The function R./ D ./=./
is sometimes called Mills
ratio. Show that the modified form
Z 1
Q
./
2
2
M./ D
D
e v v =.2 / dv;
./
0
and hence that M./ is increasing from 0 at D 0 up to 1 at D 1:
Define the l-th approximation to the Gaussian tail integral by
Q l ./ D
./
l
X
. 1/k .2k C 1/
:
k
2k 2k
kD0
Pl
k k
[Hint: induction shows that . 1/l 1 e x
0 . 1/ x =k 0 for x 0:]
As consequences, we obtain, for example, the bounds
./.1
Q
/ ./
./;
(8.85)
./1
C 3
15
C O.
/:
(8.86)
1/ D .2k
244
8.2
(alternate hard threshold bound.) Show how the proof of Proposition 8.1 can be modified so as
to show that for all > 0,
(
2 2 C 2. C 15/. 1/ 2 if jj
rH .; /
2.2 C 1/ 2
if jj > :
8.3
(Risk of soft thresholding at 0.) Let z N.0; 1/, and rS .; 0/ D E OS2 .z/ denote the mean
squared error of soft thresholding at D 0; compare (8.7).
(a) Use (8.85) and (8.86) to show that
8.4
rS .; 0/ 4
.1 C 1:5
rS .; 0/ 4
./
/./
> 0;
! 1:
p
(b) Conclude that rS .; 0/ 4 1 ./ if, say, 2.
2
(c) Let ./ D e =2 rS .; 0/: Use (8.85) to show that ./ > 0 for 0 D 2.0/:
2
(d) Show that ./ is concave for 2 0; 1; and conclude that r.; 0/ e =2 for all 0:
Derive the following inequalities for hard thresholding, which are sharper than direct application of the bounds in (8.16):
rH .; / .2 C 1/=2;
p
rH .; 0/ .2 _ 2/./;
p
rH .; 0/ .2 C 2/./
rH .; 0/ 2. C 1=/./:
8.5
p
2 log /2 C O..log /
=2;
p
rH .; C 2 log / 1 C .2 log /
1=2
8.7
/;
rH .; /
8.6
1=2
and so substantiate the claims that n .Q n / is within 10% of n and also between one third and
one half of 2 log n C 1. Verify also that the condition nrS .Q n ; 0/ 1 is satisfied.
(Crude bound for noncentral 2 .) If =2 and 2d , show that
P .2d ./ / 1=4:
245
Exercises
8.8
8.9
p
D
[One approach: write 2d ./ D 2d 1 C .Z C /2 with Z an independent standard normal
variate and exploit f2d ./ g f2d 1 C Z 2 C ; Z < 0g along with (2.90).
(Unbiased risk estimate for block soft thresholding.) The vectors of the vector field x ! x=kxk
in Rd have constant length for x 0. Nevertheless, show that its divergence r T .x=kxk/ D
.d 1/=kxk. Verify the unbiased risk formula (8.25) (8.26).
p
P
j 2 log ng.
(Number of exceedances of universal threshold.) Let Nn D n1 I fjZi p
Q
(a) If Zi are i.i.d. N.0; 1/, show that Nn Bin.n; pn / with pn D 2.
2 log n/.
2
(b) Show that P .Nn 2/ .npn / 1=. log n/.
(c) Show that the total variation distance between the distribution of Nn and that of a Poisson.npn /
variate converges to 0 as n ! 1.
ind
8.11 (Maximum of absolute values of Gaussian noise mimics M2n .) Let hi ; i D 1; : : : be independent half-normal variates (i.e. hi D jZi j for Zi N.0; 1/), and i be independent 1 variates,
independent of fhi g. Let Zi D hi i and Tn be the random time at which the number of positive
i reaches n. Show that the Zi are independent standard normal and that
D
iD1;:::;n
iD1;:::;n
max
i D1;:::;Tn
Zi D MTn ;
p
2n/= 2n ) N.0; 1/:
i:i:d:
8.12 p
(Lower bound for maximum of Gaussians.) Let zi N.0; 1/ and Mn D max zi . Let `n D
1 C log n. Show that for some c1 > 0, for all n 2,
P .Mn
`n / c1 :
246
p
8.15 (Millers selection scheme
requires many components.) Suppose that x1 D .1; 1; 0/T = 2 and
p
x2 D .0; 1; 1/T = 2: Consider the random permutations x1 and x2 described in A. Millers
selection method. Compute the distribution of hx1 ; x2 i and show in particular that it equals 0
with zero probability.
8.16 (Plotting risk functions for sparse two point priors.) Consider the two point prior (8.46), and
the associated version (8.48) having sparsity and overshoot a. At (8.54) we computed the
approximate risk function at two points 0 D 0 and 0 D ./. Here, make numerical plots of
the risk function 0 ! r. ; 0 / D E0 .x/ 0 2 ,
(a) for some sparse prior choices of .; / in (8.46),
(b) for some choices of sparsity and overshoot .; a/ (so that is determined by (8.48)).
8.17 (Lower bound in Theorem 8.20, sparse case) Adopt the setting of Section ??. Suppose that
n ! 0 and that kn ! 0. Let
< 1 be given, and build n from n i.i.d draws (scaled by
n ) from the univariate sparse prior
n with sparsity
n and overshoot .2 log.
n / 1 /1=4 ,
compare Section 8.5. Show that
1. The number Nn of non-zero components in a draw from n is distributed as Binomial.n;
n /,
and hence that n .n / ! 1 if and only if kn ! 1,
2. on n , we have k k2
1 n2 2n ENn (define n ), and
3. for all y, show that kOn k2
1 n2 2n ENn .
As a result, verify that the sequence n satisfies conditions (8.68) (8.71), and hence that
RN .n .kn /; n / Bn .kn ; n /.1 C o.1//.
9
Sparsity, adaptivity and wavelet thresholding
The guiding motto in the life of every natural philosopher should be, Seek simplicity
and distrust it. (The Concept of Nature, Alfred North Whithead)
In this chapter, we explore various measures for quantifying sparsity and the connections
among them. In the process, we will see hints of the links that these measures suggest with
approximation theory and compression. We then draw consequences for adaptive minimax
estimation, first in the single sequence model, and then in multiresolution settings. The simplicity lies in the sparsity of representation and the distrust in the quantification of error.
In Section 9.1, traditional linear approximation is contrasted with a version of non-linear
approximation that greedily picks off the largest coefficients in turn. Then a more explicitly
statistical point of view relates the size of ideal risk to the non-linear approximation error.
Thirdly, we look at the decay of individual ordered coefficients: this is expressed in terms of
a weak `p condition. The intuitively natural connections between these viewpoints can be
formalized as an equivalence of (quasi-)norms in Section 9.2.
Consequences for estimation now flow quite directly. Section 9.3 gives a lower bound for
minimax risk using hypercubes, and the oracle inequalities of the last chapter in p
terms of
ideal risk combined with the quasi-norm equivalences lead to upper bounds for 2 log n
thresholding over weak `p balls that are only a logarithmic factor worse than the hypercube
lower bounds. When p < 2, these are algebraically better rates than can be achieved by any
linear estimatorthis is seen in Section 9.5 using some geometric ideas from Section 4.8.
Up to this point, the discussion applies to any orthonormal basis. To interpret and extend these results in the setting of function estimation we need to relate sparsity ideas to
smoothness classes of functions, and it is here that wavelet bases play a role.
The fundamental idea may be expressed as follows. A function with a small number of
isolated discontinuities, or more generally singularities, is nevertheless smooth on average.
If non-parametric estimation is being assessed via a global norm, then one should expect
the rate of convergence of good estimators to reflect the average rather than worst case
smoothness.
Thus, a key idea is the degree of uniformity of smoothness that is assumed, and this is
measured in an Lp sense. Section 9.6 introduces this topic in more detail by comparing three
examples, namely uniform (p D 1), mean-squre (p D 2) and average (p D 1) smoothness
conditions, and then working up to the definition of Besov classes as a systematic framework
covering all the cases.
Focusing on the unit interval 0; 1, it turns out that many Besov classes of smoothness
247
248
are contained in weak `p./ balls, see Section 9.7. After some definitions for estimation
in the continuous Gaussian white noise problem in Section 9.8, the way is paved for earlier
results
in this chapter to yield, in Section 9.9, broad adaptive near-minimaxity results for
p
2 log n thresholding over Besov classes.
These results are for integrated mean squared error over all t 2 0; 1; Section 9.10 shows
that the same estimator, and similar proof ideas, lead to rate of convergence results for estimating f .t0 / at a single point t0 .
The final Section 9.11 gives an overview of the topics to be addressed in the second part
of the book.
In this chapter, in order to quantify sparsity and smoothness, we need two conventional
weakenings of the notion of a norm on a linear space: namely quasi-norms, which satisfy a
weakened triangle inequality, and semi-norms, which are not necessarily positive definite.
The formal definitions are recalled in Appendix C.1.
(In particular, P f D 0.) The coefficients i D hf; i i, and we will not distinguish between
f and the corresponding coefficient sequence D f : Again, using the orthonormal basis
property, we have
X
i2 :
kf PK f k22 D
iK
The operator PK is simply orthogonal projection onto the subspace spanned by f i ; i 2 Kg,
and yields the best L2 approximation of f from this subspace. In particular, PK is linear,
and we speak of best linear approximation.
Now consider the best choice of a subset K of size k: we have
ck2 .f / D inf kf PK f k22 W #.K/ k ;
or what is the same
ck2 . /
D inf
X
i2
W #.K/ k :
(9.1)
iK
Let jj.1/ jj.2/ : : : denote the amplitudes of in decreasing order. Then ck2 .f / is what
remains after choosing the k largest coefficients, and so
X
ck2 .f / D ck2 . / D
jj2.l/ ;
l>k
249
l>k
Ideal Risk
Return to estimation in a Gaussian white sequence model,
yi D i C zi ;
i 2 I;
thought of, as usual, as the coefficients of the continuous Gaussian white noise model (1.21)
in the orthonormal basis f i g:
Suppose that K I indexes a finite subset of the variables and that PK is the corresponding orthogonal projection. The variance-bias decomposition of MSE is given by
EkPK y
f k2 D #.K/ 2 C kPK f
f k2 ;
compare (2.47). The subset minimizing MSE depends on f ; to characterize this ideal
subset and its associated ideal risk it is again helpful to organize the minimization by size of
subset:
D inf k 2 C ck2 . / :
k0
(9.3)
f k2
(9.4)
(9.5)
The second and third forms show an important connection between ideal estimation and
non-linear approximation. They hint at the manner in which approximation theoretic results
have a direct implication for statistical estimation.
Write Sk D k 2 C ck2 ./ for the best MSE for model size k. The differences
Sk
Sk
D 2
jj2.k/
are increasing with k, and so the largest minimum value of k ! Sk occurs as k ! jj2.k/
crosses the level 2 , or more precisely, at the index k given by
N./ D N.; / D #fi W ji j g;
(9.6)
250
Compare Figure 9.1. [in approximation theory, this is called the distribution function of j j,
a usage related to, but not identical with, the standard statistical term.]
N(;)
k
2
1
jj(k)
jj(2)
jj(1)
(9.7)
It is also apparent that, in an orthonormal basis, the ideal subset estimation risk coincides
with our earlier notion of ideal risk, Section 8.3:
X
R.f; / D R.; / D
min.i2 ; 2 /:
The ideal risk measures the intrinsic difficulty of estimation in the basis f i g: Of course, it
is attainable only with the aid of an oracle who knows fi W ji j g:
The ideal risk is small precisely when both N./ and cN./ are. This has the following
interpretation: suppose that N.; / D k and let Kk . / be the best approximating set of
size k: Then the ideal risk consists of a variance term k 2 corresponding to estimation of
the k coefficients in Kk ./ and a bias term ck2 . / which comes from not estimating all other
coefficients. Because the oracle specifies Kk . / D fi W ji j > g; the bias term is as small
as it can be for any projection estimator estimating only k coefficients.
The rate of decay of R.; / with measures the rate of estimation of (or f ) using
the ideal projection estimator for the given basis. Again to quantify this, we define a second
sequence quasi-norm
X
kk2IR;r D sup 2r
min.i2 ; 2 /;
(9.8)
>0
where IR is mnemonic for ideal risk. In other words, kkIR;r D B means that R.; /
B 2 2r for all > 0, and that B is the smallest constant for which this is true.
Identity (9.7) says that good estimation is possible precisely when compresses well in
basis f i g, in the sense that both the number of large coefficients N./ and the compres2
sion number cN./
are small. Proposition 9.1 below uses (9.7) to show that the compression
number and ideal risk sequence quasi-norms are equivalent.
251
1=p
Here kkw`p is a quasi-norm, since instead of the triangle inequality, it satisfies only
k C 0 kpw`p 2p .kkpw`p C k 0 kpw`p /;
.p > 0/:
(9.9)
See 3 below for the proof, and also Exercise 9.1. We write w`p .C / for the (quasi-)norm
ball of radius C , or w`n;p .C / if we wish to emphasize that I D f1; : : : ; ng.
Smaller values of p correspond to faster decay for the components of : We will be
especially interested in cases where p < 1; since these correspond to the greatest sparsity.
We note some relations satisfied by w`p .C /:
1 : `p .C / w`p .C /: This follows from
k 1=p jj.k/ p k .1=k/
k
X
jjp.l/ kkp`p :
lD1
jjp.k/ C p
1
X
p 0 =p
D C p .p 0 =p/;
P1
(9.10)
>0
This representation makes it easy to establish the quasi-norm property. Indeed, since
N. C 0 ; / N.; =2/ C N. 0 ; =2/;
we obtain (9.9) immediately. Let N.; / D sup 2 N.; /. Equation (9.10) also yields
the implication
p N.; / C p for all H) w`p .C /:
(9.11)
252
p D 2=.2 C 1/;
H)
p D 2.1
r/:
(9.12)
Proposition 9.1 Let > 0; and suppose that r D r./ and p D p./ are given by (9.12).
Then, with cp D 2=.2 p/1=p ,
3
1=p
(9.13)
Proof The proof goes from right to left in (9.13). Since all the measures depend only on
the absolute values of .i /, by rearrangement we may suppose without loss of generality that
is positive and decreasing, so that k D jj.k/ :
1 : Suppose first that C D k kw`p , so that k C k 1=p . Hence
Z 1
1
X
X
2 2
2
2=p 2
min.k ; t /
min.C k
;t /
.C u 1=p /2 ^ t 2 du
k
kD1
D u t C
Here u D C p t
C 2 u1 2=p
D 1C
p
2
C p t 2r :
2r
min.k2 ; t 2 / C 2 : In
k 2r kk2 C ck2 . / C 2 :
Hence kp k
C 2 and so
ck2 . / k2r C 2 k
2r=p
.C 2 /1C2r=p :
k
X
j2 C 2 r
1 2
ck r . /
C 2 =r.k
r/2 C 2 .3=k/1C2 ;
k rC1
where for the last inequality we set r D k=2 k=3. Consequently, for all k 1,
kk2w`p D sup k 2=p k2 32=p C 2 :
k
253
(9.14)
2
(9.15)
n
X
ji jp C p g:
(9.16)
Pn
1
N.`n;p .C /; / D min.n; C p = p /:
(9.17)
/;
(9.18)
where c1 D c0 =2. Since `n;p .C / w`n;p.C / , the same lower bound applies also to the
weak `p ball. In Section 11.4, we will see that for p 2 this bound is sharp at the level of
rates, while for p < 2 an extra log term is present.
2. Products. Since N..1 ; 2 /; / D N.1 ; / C N.2 ; /, we have
N.1 2 ; / D N.1 ; / C N.2 ; /:
(9.19)
2 .C /
3. P
Sobolev ellipsoids. Suppose, following Section 3.2, that D
is the ellipsoid
2 2
2
f W 1
a
C
g
with
a
D
k
.
Since
a
is
increasing
with
k,
the
hypercube
; p
k
k
0
k k
P
p
is contained in if 2 0 k 2 C 2 . Thus we may bound N D N.; / from the equation
P 2
2 N
D C 2 . This was done carefully in Proposition 4.23 (our white noise case here
0 k
corresponds to D 0 there), and with r D 2=.2 C 1/ led to the conclusion
RN .2 ; / c./C 2.1
r/ 2r
:
If we only seek a lower bound on rates of convergence, this is certainly simpler than the
more refined arguments of Chapters 4.8 and 5.
254
c1 rn;p
.C; / RN .`n;p .C /; / RN .w`n;p .C /; /
Proof The first inequality is the hypercube bound (9.18), and the second follows from
`n;p .C / w`n;p .C /. It remains to assemble the upper bound for O U . From the soft thresholding oracle inequality, Proposition 8.8, we have
r .O U ; / .2 log n C 1/ 2 C R.; /:
(9.20)
/:
(9.21)
(9.22)
2w`n;p .C /
The minimax risks depend on parameters p; C and , whereas the threshold estimator
O U requires knowledge only of the noise level which, if unknown, can be estimated as
described in Chapter 7.5. Nevertheless, estimator O U comes within a logarithmic factor of
the minimax risk over a wide range of values for p and C . In the next section, we shall see
how much of an improvement over linear estimators this represents.
The upper bound in Theorem 9.2 can be written, for < C and n 2, as
c2 log n rn;p
.C; /
if one is not too concerned about the explicit value for c2 . Theorem 11.6 gives upper and
lower bounds that differ by constants rather than logarithmic terms.
Exercise 9.3 extends the weak `p risk bound (9.22) to general thresholds .
255
Lemma 9.3 Let 0 < p < 2,
D C = and and fg be integer and fractional part. Then
8
2
if
1
n
<C
X
2 2
2
p
p
2=p
sup
min.i ; / D
C f
g
(9.23)
if 1
n1=p
k kp C iD1
: 2
1=p
n
if
> n :
D 2 minfR.
p /; ng;
(9.24)
where R.t/ D t C ftg2=p . The least favorable configurations are given, from top to bottom
above, by permutations and sign changes of
.C; 0; : : : ; 0/;
.; : : : ; ; ; 0; : : : ; 0/
and
.; : : : ; /:
In the middle vector, there are
p coordinates equal to , and < 1 is given by p D
f
p g D
p
p .
Proof
(9.25)
If C
P 2; then the `p ball is entirely contained in the `1 cube of side , and the maximum
of i over the `p ball is attained at the spike D C.1; 0; : : : ; 0/ or permutations. This
1=p
yields the first bound in (9.23). At the other extreme, if C
P n 2 , then the `1 cube is
contained entirely within the `p ball and the maximum of
i is attained at the dense
configuration D .1; : : : ; 1/:
If < C < n1=p ; the worst case vectors are subject to the `1 constraint and are then
permutations of the vector D .; : : : ; ; ; 0; : : : ; 0/ with n0 components of size and
the remainder determined by the `p condition:
n0 p C p p D C p :
To verify that this is indeed the worst case configuration, change variables to ui D ip in
P 2=p
(9.25): the problem is then to maximize the convex function u !
ui subject to the
convex constraints kuk1 C p and kuk1 p . This forces an extremal solution to occur
on the boundary of the constraint set and to have the form described.
Thus n0 D C p = p and p D fC p = p g. Setting
p D C p = p ; we obtain
X
min.i2 ; 2 / D n0 2 C 2 2
D 2
p C 2 f
p g2=p :
A simpler, if slightly weaker, version of (9.23) is obtained by noting that the first two rows
of the right side are bounded by 2
p D C p 2 p , so that for all C > 0
sup R.; / min.n 2 ; C p 2
kkp C
/ D rn;p
.C; /:
(9.26)
An immediate corollary is a sharper version of Theorem 9.2 which, due to the restriction
to strong `p balls, comes without the constant
p D 2=.2 p/.
256
Proof
Simply insert the new bound (9.26) into the oracle inequality (9.20).
(9.27)
D
min.n ; C / D
rn;2
.C; /:
RL .`n;p .C /; / D RL .n;2 .C /; / D
(9.28)
Combining this with Theorem 9.2 or Corollary 9.4, which we may do simply by contrast
ing rn;p
with rn;2
, we see that C p 2 p C 2 exactly when C . Hence for p < 2, the
non-linear minimaxprisk is an algebraic order of magnitude smaller than the linear minimax
risk. Furthermore, 2 log n thresholding captures almost all of this gain, giving up only a
factor logarithmic in n.
257
.Rn /
plify expositionwhile the rich theory of Besov spaces Bp;q
./ on domains and Bp;q
on Euclidean space can be approached in various, largely equivalent, ways, it does take some
work to establish equivalence with the sequence form in terms of wavelet coefficients. To
.0; 1/
keep the treatment relatively self-contained, Appendix B gives the definition of Bp;q
in terms of moduli of smoothness and shows the equivalence with the sequence form using
classical ideas from approximation theory.
Some Heuristics
Some traditional measures of smoothness use Lp norms to measure the size of derivatives
of the function. Hence, for functions f for which D k 1 f is absolutely continuous, define
the semi-norm
Z
1=p
jf jWpk D
jD k f jp
;
1 p 1:
258
The Sobolev space Wpk of functions with k derivatives existing a.e. and integrable in Lp is
then the (Banach) space of functions for which the norm is finite. Again, in the case p D 1,
the seminorm is modified to yield the Holder norms
kf kC k D kf k1 C kD k f k1 :
Figure 9.2 shows two examples of how smaller p corresponds to a more averaged and
less worst-case measure of smoothness. For the function in the first panel,
p
kf 0 k1 D 1=a:
kf 0 k1 D 2;
kf 0 k2 D 1=a C 1=b;
In the 1 norm the peaks have equal weight, while in the 2 norm the narrower peak dominates, and finally in the 1 norm, the wider peak has no influence at all. The second panel
compares the norms of a function with M peaks each of width 1=N :
p
kf 0 k1 D M;
kf 0 k2 D MN ;
kf 0 k1 D N:
The 1 norm is proportional to the number of peaks, while the 1 norm measures the slope
of the narrowest peak and so is unaffected by the number of spikes. The 2 norm is a compromise between the two. Thus, again smaller values of p are more forgiving of inhomegeneity.
If, as in much of this work, the estimation error is measured as a global average (for example, as in mean integrated squared error), then we should be able to accomodate some degree
of such inhomogeneity in smoothness.
1=2
M peaks
1=2
1=N
259
issues, we work with an orthonormal wavelet basis for L2 .R/, and so assume that a square
integrable function f has expansion
X
XX
f .x/ D
Lk 'Lk .x/ C
j k j k .x/:
(9.29)
j L k
f .y/j
Theorem 9.6 Suppose that 0 < < 1 and that .'; / are C 1 and have compact support.
Then f 2 C .R/ if and only if there exists C > 0 such that
jLk j C;
jj k j C 2
.C1=2/j
j L:
(9.30)
Reflecting the uniformity in x, the conditions on the wavelet coefficients are uniform in
k, with the decay condition applying to the scales j .
Proof Assume first that f 2 C ; so that, Appendix C.23, kf k D kf k1 C jf j < 1,
where jf j D sup jf .x/ f .x 0 /j=jx x 0 j . For the coarse scale coefficients
jLk j 2
L=2
kf k1 k'k1 ;
(9.32)
We focus on .f / here, since the argument for .f / is similar and easier. Using the
decay (9.30) of the coefficients j k ,
X
X
j .f /j C
2 .C1=2/j
2j=2 j .2j x k/
.2j x 0 k/j:
j L
If the length of the support of is S, then at most 2S terms in the sum over k are non-zero.
In addition, the difference can be bounded using k 0 k1 when j2j x 2j x 0 j 1, and using
simply 2k k1 otherwise. Hence
X
j .f /j c C
2 j minf2j jx x 0 j; 1g;
j L
260
j
C 0 jx
x 0 j ;
which, together with the bound for .f / gives the Holder bound we seek.
Remark 9.7 We mention the extension of this result to 1. Let r D de. Assume
that ' and are C r with compact support, and that has at least r vanishing moments.
If f 2 C .R/, then there exists positive C such that inequalities (9.30) hold. Conversely,
if > 0 is not an integer, these inequalities imply that f 2 C .R/. The proof of these
statements are a fairly straightforward extension the arguments given above (Exercise 9.7.)
When is an integer, to achieve a characterization, a slight extension of C is needed,
see Section B.3 in the Appendix for some extra detail.
Remark 9.8 In the preceding proof, we see a pattern that recurs often with multiresolution
models: a count or error that is a function of level j increases geometrically up to some
critical level j0 and decreases geometrically above j0 . The total count or error is then determined up to a constant by the value at the critical level. While it is often easier to compute
the bound in each case as needed, we give a illustrative statement here. If ;
> 0, then on
setting r D
=. C
/ and c D .1 2 / 1 , we have
X
min.2j ; C 2
j / .c C c
/C 1 r r :
(9.33)
j 2Z
j
jk
jk
These remarks render plausible the following result, proved in Appendix B.4.
Theorem 9.9 If .; / are C r with compact support and has r C 1 vanishing moments,
then there exist constants C1 ; C2 such that
X
X
2
C1 kf k2W2r
Lk
C
22rj j2k C2 kf k2W2r :
(9.35)
k
j L;k
261
Average smoothness, p D 1. We consider functions in W11 , for which the norm measures
smoothness in an L1 sense: kf kW11 D kf k1 C jf 0 j. This is similar to, but not identical
1
with, the notion
R 0 of bounded variation of a function, cf. Appendix C.24: if f lies in W1 then
jf jT V D jf j.
We show that membership in W11 can be nearly characterized by `1 -type conditions on
wavelet coefficients. To state the result, adopt the notation j for the coefficients .j k / at
the j th level and similarly for L at coarse scale L.
Theorem 9.10 Suppose that .'; / are C 1 with compact support. Then there exist constants C1 and C2 such that
X
C1 kL k1 C sup 2j=2 kj k1 kf kW11 C2 kL k1 C
2j=2 kj k1 :
j L
j L
f 1 k k1 jDf j
2
I
Suppose that
has support
R previous bound to
R contained in S C 1; S . Applying the
wavelet coefficient j k D f j k yields a bound jj k j c 2 j=2 Ij k jDf j, where the
interval Ij k D 2 j k S C 1; k C S . For j fixed, as k varies, any given point x falls in at
most 2S intervals Ij k , and so adding over k yields, for each j L,
X
2j=2
jj k j 2S c jf jW11 :
k
A similar but easier argument shows that we also have kL k1 2L=2 2S k'k1 kf k1 :
Adding this to the last display yields the left bound. The extension of this argument to
f 2 T V is left to Exercise 9.8.
p D 1,
p D 2,
p D 1,
2.C1=2/j kj k1
2j kj k2
. 1=2/j
2
kj k1
262
Introducing the index a D C 1=2 1=p, we can see each case as a particular instance of
aj
a weighted `p norm
P cj Dq 21=qkj kp . To combine the information in cj across levels j , we
use `q norms . j L jcj j / , which spans a range of measures from worst case, q D 1,
to average case, q D 1.
We use as an abbreviation for fLk g [ fj k ; j L; k 2 Zg, and define
X
1=q
kkbp;q
D kL kp C
2aj q kj kqp
;
(9.36)
j L
kkbp;1
D kL kp C sup 2aj kj kp :
j L
k kbp;q
D
jLk jp
C
2aj q .
jj k jp /q=p
:
j L
smoothness
averaging (quasi-)norm over locations k
averaging (quasi-)norm over scales j .
The notation k kb kf kF is used for equivalence of norms: it means that there exist
constants C1 ; C2 , not depending on (or f ) such that
C1 kkb kf kF C2 kkb :
Armed with the Besov index notation, we may summarize the inequalities described in
the three function class examples considered earlier as follows:
Holder smoothness; p D 1:
Mean-square smoothness; p D 2:
Average smoothness/TV; p D 1:
k kb1;1
kf kC ;
> 0;
Z
k k2b2;2
jf j2 C jD f j2 ;
2 N;
Z
1
1 :
C1 kkb1;1
jf j C jDf j C2 kkb1;1
In the Holder case, we use the Zygmund class interpretation of C when 2 N. The average
smoothness/TV result corresponds only to D 1.
Example 9.11 Consider f .x/ D Ajxj g.x/. Here g is just a window function included
to make f integrable; for example suppose that g is equal to 1 for jxj 1=2 and vanishes
for jxj 1 and is C 1 overall. Assume that > 1=p so that f 2 Lp . Suppose that the
wavelet has compact support, and r > C 1 vanishing moments. Then it can be shown
263
f 2 Bp;1
if the ratio !r .f; t /p =t is uniformly bounded in t > 0. If instead the ratio
.
defines the semi-norm jf jBp;q
and then the norm kf kBp;q
D kf kp C jf jBp;q
The discussion in Appendix B is tailored to Besov spaces on a finite interval, say 0; 1. It
is shown there, Theorem B.9, that if .'; / are a C r scaling function and wavelet of compact
support giving rise to an orthonormal basis for L2 0; 1 by the CDJV construction, then the
sequence norm (9.36) and the function norm are equivalent
:
C1 kf kbp;q
kf kBp;q
C2 kf kbp;q
(9.37)
The constants Ci may depend on .'; ; ; p; q; L/ but not on f . The proof is given for
1 p; q 1 and 0 < < r.
Relations among Besov spaces. The parameter q in the Besov definitions for averaging
across scale plays a relatively minor role. It is easy to see, for example from (9.36), that
Bp;q
Bp;q
;
1
2
for q1 < q2
1=p 0 D
1=p, then
Bp;q
Bp0 ;q :
In fact, the proof becomes trivial using the sequence space form (9.36).
The situation can be summarized in Figure 9.3, which represents smoothness in the vertical direction, and 1=p in the horizontal, for a fixed value of q. Thus the y axis corresponds
to uniform smoothness, and increasing spatial inhomogeneity to 1=p: The imbeddings proceed down the lines of unit slope: for example, inhomogeneous smoothness .; 1=p/ with
> 1=p implies uniform smoothness (i.e. p 0 D 1) of lower degree 0 D 1=p 0 . Indeed
0
0
0
B1;1
C if 0 N.
B1;q
Bp;q
The line D 1=p represents the boundary of continuity. If > 1=p; then functions in
Bp;q
are continuous by the embedding theorem just cited. However in general, the spaces
with D 1=p may contain discontinuous functions one example is given by the contain1
1
ment B1;1
T V B1;1
:
Finally, for Bp;q .0; 1/, the line D 1=p 1=2 represents the boundary of L2 compact
ness - if > 1=p 1=2; then Bp;q
norm balls are compact in L2 : this observation is basic
to estimation in the L2 norm.
1
If .B1 ; k k1 / and .B2 ; k k2 / are normed linear spaces, B1 B2 means that for some constant C , we
have kf k2 C kf k1 for all f 2 B1 .
264
(1=p';')
=1=p
=1=p{1=2
1=2
1=p
Figure 9.3 Summarizes the relation between function spaces through the primary
parameters (smoothness) and 1=p (integration in Lp ). The middle line is the
boundary of continuity and the bottom, dashed, line is the boundary of
compactness.
Besov and Sobolev norms. While the Besov family does not match the Sobolev family
precisely, we do have the containment, for r 2 N,
r
Wpr Bp;1
:
We can write these embedding statements more explicitly. For r 2 N, there exists a
constant C such that
Z 1
kf kpBp;1
C
jf jp C jD r f jp :
(9.38)
r
0
In the other direction, for 0 < p 2 and r 2 N, there exists a constant C such that
1
Z
0
jD r f jp C kf kpbp;p
:
r
(9.39)
A proof of (9.38) appears in Appendix B after (B.27), while for (9.39), see Johnstone and
Silverman (2005b), though the case p 1 is elementary.
r
More generally, Wpr D Fp;2
belongs to the Triebel class of spaces, in which the order of
averaging over scale and space is reversed relative to the Besov class, see e.g. Frazier et al.
(1991) or Triebel (1983). In particular, this approach reveals an exceptional case in which
r
W2r D B2;2
, cf Theorem 9.9.
265
Simplified notation
Consider a multiresolution analysis of L2 0; 1 of one of the forms discussed in Section 7.1.
For a fixed coarse scale L, we have the decomposition L2 .0; 1/ D VL WL WLC1 ;
and associated expansion
f .x/ D
L
2X
1
k 'Lk .x/ C
X 2X1
j k
j k .x/:
(9.40)
j L kD0
kD0
For the statistical results to follow, we adopt a simplified notation for the Besov sequence
norms, abusing notation slightly. To this end, for j < L, define coefficients j k to collect
all the entries of .k /:
0 j < L; 0 k < 2j ;
j k D 2j Ck ;
1;0
(9.41)
D 0 :
If we now write
kkqbp;q
D
then we have an equivalent norm to that defined at (9.36). Indeed, since L is fixed and all
L
norms on a fixed finite dimensional space, here R2 , are equivalent, we have
X
1=q
k 1
k kp
2aj q kj kqp
:
jD 1
.
In the case of Besov spaces on 0; 1, we will therefore often write p;q instead of bp;q
Notation for norm balls. For C > 0, let
o
n X
2aj q kj kqp C q :
p;q .C / D W
j
aj
; for all j
1g:
(9.42)
266
Recall that the notation B1 B2 for (quasi-)normed linear spaces means that there exists
a constant c such that kxkB2 ckxkB1 for all x. See Figure 9.4.
=1=p{1=2
w`p
1=2
1=p
1=p
Figure 9.4 Besov spaces p;q on the dotted line are included in w`p .
Proof Using the simplified notation for Besov norm balls, we need to show that, for some
constant c1 allowed to depend on and p,
p;q .C / w`p .c1 C /
(9.43)
for p > p , but that no such constant exists for w`s for s < p .
Since p;q .C / p;1 .C /; it suffices to establish (9.43) for p;1 .C /, which in view of
(9.42) is just a product of `p balls `2j ;p .C 2 aj /. Hence, using (9.19) and (9.17) to calculate
dimension bounds for products of `p balls, and abbreviating D p;q .C /, we arrive at
X
N.; / 1 C
minf2j ; .C 1 2 aj /p g:
j
The terms in the sum have geometric growth up to and decay away from the maximum j
defined by equality between the two terms: thus 2j .C1=2/ D C =, independent of p > p .
Hence N.; / cp 2j where we may take cp D 3 C .1 2 ap / 1 < 1 for ap > 0,
which is equivalent to p > p . Now, from the definition of j , we have p 2j D C p , and
so
p N.; / cp C p
(9.44)
1=p
cp
:
and so, using the criterion (9.11) for weak `p , we obtain (9.43) with c1 D
For the second part, consider the Besov shells .j0 / defined as the collection of those
2 p;q .C / for which j k D 0 unless j D j0 . Note then that .j0 / `2j ;p .C 2 ja /:
Consider the shell corresponding to level j D j with j determined above: since this
shell belongs to D p;q .C / for all q, we have, from (9.17)
N.; / minf2j ; .C 2
s
1 p s p
C
2
ja
is unbounded in if s < p :
(9.45)
267
D 'L;2j Ck
1;0
D 'L;0 :
0 j < L; 0 k < 2j
I D f.j k/ W j 0; k D 0; : : : ; 2j
1g [ f. 1; 0/g:
As in Sections 7.4, 7.5, we write I D .j k/ when convenient. With this understanding, our
wavelet sequence model becomes
yI D I C zI ;
I 2 I;
(9.46)
F D Fp;q
.C / D ff W f 2 p;q .C /g;
(9.48)
268
secure in the knowledge that under appropriate conditions on the multiresolution analysis,
2E
f k2
(9.49)
k2 D RE .; /:
Here E might denote the class of all estimators. We will also be particularly interested in
certain classes of coordinatewise estimators applied to the wavelet coefficients. In the sequence model, this means that the estimator has the form OI .y/ D OI .yI /, where O belongs
to one of the four families in the following table. In the table, v is a scalar variable.
Family
Description
Form of OI
EL
OIL .v/ D cI v
ES
EH
EN
Scalar nonlinearities
of wavelet coefficients
I /C sgn.v/
.j k/ 2 I.n/ :
(9.50)
The name projected reflects the fact that the vector .n/ defined by
(
j k
.j k/ 2 I.n/
.n/
j k D
0
.j k/ 2 I nI.n/
can be viewed as the image of under orthogonal projection Pn W L2 ! VJ .
The projected model has two uses for us. First, under the calibration D n 1=2 , it provides an n-dimensional submodel of (9.46) that is a natural intermediate step in the white
noise model approximation of the Gaussian nonparametric regression model (7.21) with n
p
2 log n thresholding
269
2 log n thresholding
We combine the preceding results about Besov bodies and weak `p with properties of
pthresholding established in Chapter 8 to derive adaptive near minimaxity results for 2 log n
thresholding over Besov bodies p;q .C /. Consider the dyadic sequence model (9.46) and
apply
soft thresholding
to the first n D 2 D 2J coefficients, using threshold D
p
p
2 log 2 D 2 log n:
(
S .yj k ; / j < J
U
O
j k D
(9.51)
0
j J:
The corresponding function estimate, written using the notational conventions of the last
section, is
X
fOn .t / D
(9.52)
OjUk j k .t /:
.j k/2In
Remarks. 1. A variant that more closely reflects practice would spare the coarse scale
coefficients from thresholding: Oj k .y/ D yj k for j < L. In this case, we have
fOn .t/ D
L
2X
1
yQLk 'Lk .t / C
j
J
X1 2X1
OjUk
j k .t /
(9.53)
j DL kD0
kD0
where yQLk D h'Lk ; d Y i. Since L remains fixed (and small), the difference between (9.52)
and (9.53) will not affect the asymptotic results below.
2. Although not strictly necessary for the discussion that follows, we have in mind the
situation of fixed equi-spaced regression: yi D f .i=n/ C ei compare (2.83). After a
discrete orthogonal wavelet transform, we would arrive at the projected white noise model
(9.46), with calibration D n 1=2 : The restriction of thresholding in (9.51) to levels j < J
corresponds to what we might do with real data: namely threshold the n empirical discrete
orthogonal wavelet transform coefficients.
The next theorem gives an indication of the broad adaptation properties enjoyed by wavelet
thresholding.
Theorem 9.14 Assume model (9.46), and that > .1=p 1=2/C , 0 < p; q 1;
0p
< C < 1. If p < 2; then assume also that 1=p. Let O U denote soft thresholding at
2 log n, defined at (9.51) Then for any Besov body D p;q .C / and as ! 0,
sup r .O U ; / cp .2 log
/C 2.1
r/ 2r
.1 C o.1//
(9.54)
cp .2 log
A key aspect of this theorem is that thresholding learns the rate of convergence appropriate to the parameter space . The definition (9.51) of O U does not depend on the
270
parameters of p;q .C /, and yet, when restricted to such a set, the MSE attains the rate of
convergence appropriate to that set, subject only to extra logarithmic terms.
The constant cp depends only on .; p/ and may change at each appearance; its dependence on and p could be made more explicit using the inequalities in the proof.
Proof Let .n/ and O .n/ denote the first n coordinates i.e. .j; k/ with j < J of and
O we apply the soft
O respectively. To compute a bound on the risk (mean squared error) of ,
.n/
O
O
thresholding risk bound (8.33) of Proposition 8.8 to : Since j k 0 except in these first
n coordinates, what remains is a tail bias term:
r.O U ; / D E kO .n/
.2 log
.n/ k2 C k .n/
2
k2
(9.55)
X
kj k2 :
(9.56)
j J
Bound (9.56) is a pointwise estimate valid for each coefficient vector . We now
investigate its consequences for the worst case MSE of thresholding over Besov bodies
D p;q .C /. Given ; we set as before,
r D 2=.2 C 1/;
r/:
Now comes a crucial chain of inequalities. We use first the definition (9.8) of the ideal risk
semi-norm, then the bound for ideal risk in terms of weak `p./ (the third inequality of
(9.13)), and finally the fact that the Besov balls p;q .C / are embedded in w`p./ , specifically (9.43). Thus, we conclude that for any 2 p;q .C / and any > 0,
2r
cp C 2.1
R. .n/ ; / kk2IR;r 2r c kkp./
w`p./
r/ 2r
:
(9.57)
1=p/C
(9.58)
which follows from a picture: when p < 2; the vectors having largest `2 norm in an `p
ball are sparse, being signed permutations of the spike C.1; 0; : : : ; 0/: When p 2; the
extremal vectors are dense, being sign flips of C n 1=p .1; : : : ; 1/:
Now we combine across levels to obtain a tail bias bound. Suppose that 2 p;q .C /
p;1 .C /: we have kj kp C 2 aj . Now use (9.58) and write 0 D .1=p 1=2/C
P
0
to get kj k2 C 2 j . Clearly then j J kj k22 is bounded by summing the geometric
series and we arrive at the tail bias bound
sup
2
p;q .C /
k .n/
k2 c0 C 2 2
2 0 J
(9.59)
Inserting the ideal risk and tail bias bounds (9.57) and (9.59) into (9.56), we get the nonasymptotic bound, valid for 2 p;q .C /,
r.O U ; / .2 log
C 1/ 2 C cp C 2.1
r/ 2r
C cp C 2 4 :
0
(9.60)
Now suppose that C is fixed and ! 0. We verify that 2 D o. r /. This is trivial when
p 2, since 2 > r: When p < 2; the condition 1=p implies 2 0 D 2a 1 > r: This
completes the proof of (9.56).
p
2 log n thresholding
271
Lower Bounds. We saw in the proof of Proposition 9.13 that p;q .C / contains hypercubes of dimension N.; / c0 .C =/p./ . Hence the general hypercube lower bound
(9.15) implies that
RN .; / c1 .C =/p./ 2 D c1 C 2.1
r/ 2r
:
(9.61)
Block Thresholding*
We briefly look at how the adaptation results are modified if block thresholding, considered
in Sections 7.6, 8.2 and 8.3, is used instead of thresholding of individual coefficients. We
focus on block soft thresholding for simplicity, as the results are then a relatively direct
- D log n for the
extension of previous arguments for scalar thresholding. With a choice L
block size 2 , we will obtain improvements in the logarithmic factors that multiply the n r D
2r convergence rate in Theorem 9.14. However, our earlier lower bounds on thresholding
risk also show that for these estimators, the logarithmic terms cannot be removed.
- D 2j0 for simplicity, where j0 will grow slowly with
Consider a dyadic block size L
decreasing . At level j j0 , the 2j indices are gathered into blocks of size 2j0 , thus
jb D .j;b.L-
- /;
1/C1 ; : : : ; j;b L
b D 1; : : : ; 2j
j0
and the block data vector yjb is defined similarly. Now define the block soft thresholding
estimate on (wavelet) coefficients .yj k / by
(
S;L- .yjb ; / j < J
B
(9.62)
Ojb D
0
j J;
where S;L- is the block soft threshold rule defined at (8.38). To make sense of (9.62) for
the coarsest levels j < j0 , there are at least a couple of possibilities: (i) use the unbiased
- D 2j0 to
estimators Oj D yj , or (ii) gather all yj for j < j0 into a single block of size L
which S;L- is applied. The result below works in either case as the risk contribution of levels
below j0 is certainly o. 2r /.
- and threshold parameter so that
We choose block size L
2
2
log 2
1 .2 log n/=L:
(9.63)
272
p
- D log n and D D 4:50524, for which the left side equals
The main example has L
2. As noted in Section 8.3, when (9.63) holds we have rS;L- .; 0/ n 1 D 2 and can apply
the block oracle inequality of Proposition 8.9.
Theorem 9.15 Adopt the assumptions of Theorem 9.14 for ; p; q and C . Let O B denote
- chosen to satsify (9.63). Let D .1=p 1=2/C . Then
block soft thresholding with and L
r/
- C /2.1
cp .L
2r
r/
2.1 r/
cp L
./2r .1 C o.1//
(9.64)
RN .; /.1 C o.1//:
Here N 2 D 2 C 1. We need a bound on the block ideal risk analogous to (9.57). This
is a consequence of an extension to a block version of the ideas around weak `p balls. In
Exercises 9.4 and 9.5, we sketch the proof of an inequality that states for 2 p;q .C /,
N L/
- C /2.1
- cp .L
R. .n/ ; I
r/
N 2r :
./
With this inequality in hand, the rest of the proof follows as for Theorem 9.14.
For the lower bound, we first use the lower bound on risk for block soft thresholding,
Proposition 8.7, to obtain
XX
- 2 /:
r .O B ; / .1=8/
min.kjb k2 ; 2 L
j <J
As in the proof of Proposition 9.13, the space p;q .C / contains a copy of the `p ball .j / D
273
`2j ;p .C 2
aj
.j /
At this point, we focus on p < 2, leaving p 2 to Exercise 9.10. We first adapt Lemma
9.3, the evaluation of ideal risk over `p balls, to this setting. Regard `B;p .C / as an `p ball
of block norms; the lemma says in part that
(
B
X
2 C p = p 1 C = B 1=p
sup
min.kb k2 ; 2 /
(9.66)
B 2
C = B 1=p :
`B;p .C / bD1
Observe that if .kb k/B
bD1 2 `B;p .C / and n D LB, then the vector .k1 k; : : : ; kB k; 0; : : : ; 0/
with n B zeros belongs to `n;p .C /, so that the lower bound above applies to `n;p .C / also.
We may now apply the previous display to (9.65), making the assignments
p
C $ C 2 aj :
B $ 2j =L;
$ L;
It will be seen that the resulting bounds from (9.66) increase for j j and decrease with
j j , where j is determined by the equation C = D B 1=p , which with the identifications
just given and with p./ D 2=.2 C 1/ becomes
- 1=p
2j =p./ D L
1=2
C =./:
At j D j , the bound
- C /2.1
B 2 $ 2j ./2 D ./2r .L
r/
f 2F1;1
.C /
274
Proof
Decompose the estimation error over coarse, mid and tail scales:
X
X
X
fOn .t0 / f .t0 / D
aI C
aI C
aI :
I 2c
I 2m
(9.68)
I 2t
aI D
J
X1
X
.Oj k
j k /
j k .t0 /;
j DL k
and points to the new point in the proof. InPglobal estimation, the error kfO f k2 is expressed in terms of that of the coefficients, .OI I /2 , by Parsevals equality, using the
orthonormality of the basis functions j k . In estimation at a point t0 , there is no orthogonality in t, and instead we bound the root mean squared (RMS) error of a sum by the sum of
the RMS errors:
q
X q
2
X 2 X
Xq
2
2
EaI EaJ D
EaI2 :
(9.69)
E
aI D
EaI aJ
I
I;J
I;J
We can use previous results to bound the individual terms EaI2 . Indeed, recall from (8.9)
the mean squared error bound for a soft threshold estimator with threshold , here given for
noise level and N 2 D 1 C 2 :
rS .; I / 2 r.; 0/ C 2 ^ N 2 2
(9.70)
p
Since D 2 log n; we have from (8.7) that r.; 0/ n 1 : We use the Holderp
continuity
.C1=2/j
aCb
assumption
and
Lemma
7.3
to
bound
j
j
cC
2
.
In
conjunction
with
j
k
p
p
a C b, we obtain
q
p
N
Ea2 j I .t0 /j r.; 0/ C jI j ^
I
j <J
p
c= n C c; C 1 r nr ;
(9.71)
I 2c[m[t
275
0; : : : ; 2L
and so
Xq
EaI2 2L c' n
1=2
(9.72)
I 2c
In the tail sum over I 2 t, we have aI D I I .t0 / for I D .j k/ and j J . Using again
the Holder coefficient decay bound and the compact support of ,
X
X
jaI j c S
C 2 .C1=2/j 2j=2 cC 2 J D cC n :
(9.73)
I 2t
j J
Combining the coarse, mid and tail scale bounds (9.72), (9.71) and (9.73), we complete
the proof:
EfOn .t0 /
f .t0 /2 .c1 n
1=2
C c2 C 1 r nr C c3 C n
/ c22 C 2.1
r/ 2r
n .1
C o.1//:
Remarks. 1. The corresponding lower bound for estimation at a point over Holder classes
is of order n r , without the log term. More precisely
inf
sup
fO f 2F1;1
.C /
EfO.t0 /
f .t0 /2 cC 2.1
r/
n r:
We will not give a proof as we have not discussed estimation of linear functionals, such as
f ! f .t0 /, in detail. However, an argument in the spirit of Chapter 4 can be given relatively
easily using the method of hardest one-dimensional subfamilies, see Donoho and Liu (1991,
Sec. 2). The dependence of the minimax risk on n and C can also be nicely obtained by a
renormalization argument, Donoho and Low (1992). For Besov balls, see Exercise 9.12.
2. If weP
knew both and C; then we would be able to construct a linear minimax estimator
fOn;C D
I cI yI where the .cI / are the solution of a quadratic programming problem
depending on C; ; n (Ibragimov and Khasminskii (1982); Donoho and Liu (1991); Donoho
sup
i
fOn i D0;1 F1;1
.Ci /
Ci2.ri
1/ ri
n EfOn .t0 /
f .t0 /2 c2 logr0 n:
3. It is evident both intuitively and also from Lemma 7.3 that the full global constraint of
Holder regularity on 0; 1 is not needed: a notion of local Holder smoothness near t0 is all
that is used. Indeed Lemma 7.3 is only needed for indices I with I .t0 / 0:
276
n
X
min.i2 ; 2 /:
i D1
Now maximize
can be shown, e.g. Lemma 9.3that for 1 .C =/p
P over 2 22 `n;p .Cp/it
2 p
, and so
n, we have min.i ; / C
sup
/C p 2
2`n;p .C /
We might select to minimize the right sidepbound: this immediately leads to a proposed
choice D n 1 .C =/p and threshold D 2 log n.=C /p . Observe that as the signal to
1=p
noise
p ratio C = increases from 1 to n , the nominally optimal threshold decreases from
2 log n to 0, and no single threshold value appears optimal for anything other than a limited
set of situations.
3
the factor 2n 2 , while a looser bound than given by (8.13), leads to cleaner heuristics here.
9.12 Notes
277
9.12 Notes
1. DeVore (1998) is an excellent survey article on basic ideas of non-linear approximation. The equivalence
of the compression, ideal risk and weak `p quasi-norms was shown by Donoho (1993). Using absolute
278
P
values j j.i / (rather than squares) to define a compression norm supk k 1C1=p kiD1 j j.i / works for the
more restricted range 1 < p, e.g. DeVore and Lorentz (1993, Ch. 2, Prop. 3.3).
3. The construction of lower bounds using subsets of growing cardinality has a long history reviewed in
Tsybakov (2009, Ch. 2); important papers on the use of hypercubes include Bretagnolle and Huber (1979)
and Assouad (1983).
(Remark on p=..2 p/ as difference between weak and strong `p norm minimax risks. Also FDR
connections?).
Meyer (1990, Section 6.4) explains that it is not possible to characterize the integer Holder classes
C m .R/ in terms of moduli of wavelet coefficients.
Theorem 9.6 and Remark 9.7 extend to C .0; 1/ with the same proof, so long as the boundary wavelets
satisfy the same conditions as .
6. Meyer (1990, Chapter 3) establishes a more general form of Theorem 9.9: using a Fourier definition
of W2 and the notion of an r-regular multiresolution analysis, he establishes the equivalence (9.35) for all
real with jj < r.
Diagrams using the .; 1=p/ plane are used by Devore, for example in the survey article on nonlinear
approximation DeVore (1998).
9. Efromovich (2004a, 2005) uses lower bounds to risk for specific signal to provide insight on block
and threshold choice.
10. The pointwise estimation upper bound of Theorem 9.16 appears in Donoho and Johnstone (1996)
along with discussion of optimality of the logr n penalty in adaptation over . Cai (2002) shows that log n
is the optimal block size choice to achieve simultaneously optimal global and local adaptivity.
Exercises
9.1
9.2
9.3
p/C
1/C
.ap C b p /:
sup
r .O ; / n 2 rS .; 0/ C cp .1 C 2 /1
p=2
C p 2
p/:
p
2w`n;p .C /
p
This should be compared with bound (9.22) for D 2 log n.
(ii) Let Cn ; n dependpon n and define the normalized radius n D n
p
as n ! 1, set n D 2 log n and show that
p
p
rN .O ; w`n;p .C // cp nn2 n .2 log n /1
9.4
p=2
1=p .C = /.
n n
If n ! 0
.1 C o.1//:
[This turns out to be the minimax risk for weak `p ; compare the corresponding result for strong
`p in (13.47).]
(Block weak `p norms.) Suppose the elements of D .k ; k 2 N/ are grouped into successive
P
2 1=2 and
blocks of size L, so b D .b.L 1/C1 ; : : : ; bL /. Let kb k be the `2 norm
k2b k
with slight abuse of notation write kk.b/ for the bth largest of the ordered values of kb k, thus
279
Exercises
kk.1/ k k.2/ . Then say that belongs to block weak-`p if kk.b/ C b
kkw`p;L denote the smallest such C . Show that
p
1=p ,
and let
9.5
>0
n
L
; L.p=2
1/C
Cp
:
p
p;q .C /.
j Dr
Let p D 2=.2 C 1/. Show that for p > p and some c D cp , for all 2 p;q .C / we have
kkw`p ;L cL1=.p^2/
1=p
C;
9.6
C /2.1
r/
9.7
1=2/C
p=2
while
r.O JS ; n / cp0 n:
(Holder smoothness and wavelet coefficients.) Assume the hypotheses of Remark 9.7 and in
particular that smoothness satisfies m < < m C 1 for m 2 N. Show that the bounds
jLk j C;
jj k j C 2
.C1=2/j
imply that
jD m f .x/
9.8
D m f .y/j C 0 jx
yj
R
(Wavelet coefficients of BV functions.) Show that if
D 0 and supp
T V , we have
Z
f 12 k k1 jf jT V :
I , then for f 2
[Hint: begin with step functions.] Thus, complete the proof of the upper bound in Theorem 9.10.
280
9.9
jk/
denote
aj /
and
.j / b
and hence, for suitable choice of j 2 R, that the right side takes the value ./2r C 2.1 r/ .
9.11 (Thresholding at very fine scales.) We wish to weaken the condition 1=p in Theorem 9.14
to > 1=p 1=2: Instead of setting everything to zero at levels J and higher (compare (9.51)),
one possibility for controlling tail bias better is to apply soft thresholding at very high scales at
successively higher levels:
(
S .yj k ; j /;
j < J2
O
j k D
0
j J2
where for l D 0; 1; : : : ; J
j D
1,
q
2.l C 1/ log
Show that if, now > 1=p 1=2, then the upper risk bound in Theorem 9.14 continues to hold
with log 2 replaced by, say, .log 2 /3 .
9.12 (Pointwise estimation over Besov classes.)
(a) Show that point evaluationthe mapping f ! f .t0 / for a fixed t0 2 .0; 1/is a continu so long as > 1=p.
ous functional on Bp;q
.C / in place of the
(b) Assume then that > 1=p. Show that if we use a Besov ball Fp;q
Holder ball F1;1 .C /, then the pointwise estimation bound (9.67) holds with the slower rate
r 0 D 2 0 =.2 0 C 1/, where 0 D 1=p, in contrast with the rate for global estimation
.C / follows, for
r D 2=.2 C 1/ of Theorem 9.14. [The optimality of this slower rate for Fp;q
example, from the renormalization argument of Donoho and Low (1992).]
10
The optimal recovery approach to thresholding.
I 2 I.n/ ;
(10.1)
1g [ f. 1; 0/g is the
i id
collection of the first n wavelet coefficients. As usual is known and zI N.0; 1/:
p
We continue our study of asymptotic properties of thresholding at a level n D 2 log n,
already begun in Sections 9.9 and 9.10 which focused on adaptation results for global and
pointwise squared error respectively. In this chapter we focus on global error measures (and
parameter spaces) drawn from the Besov scale and derive two types of result.
First, the function estimates fO D f O corresponding to (9.51) are in a strong sense as
smooth as f , so that one has, with high probability, a guarantee of not discovering nonexistent features. (Theorem 10.6). Second, the threshold estimator (9.51) is simultaneously
near minimax (Theorem 10.9).
The proofs of the properties (a) and (b) FIX! exploit a useful connection with a deterministic problem of optimal recovery, and highlight the key role played by the concept of
shrinkage in unconditional bases, of which wavelet bases are a prime example.
Section 10.1 begins therefore with a description of the near minimax properties of soft
thresholding in the deterministic optimal recovery model. It introduces the modulus of continuity of the error norm with respect to the parameter space, which later plays a key role in
evaluating rates of convergence.
The statistical consequences are developed in two steps: first in a general n-dimensional
monoresolution Gaussian white noise model, in Sections 10.210.4, which makes no special mention of wavelets, and later in Sections 10.5 10.8 for the multiresolution wavelet
sequence model (10.1).
281
282
In both cases, when phrased in terms of moduli of continuity, upper bounds are direct
consequences of the deterministic results: this is set out in the monoresolution setting in
Section 10.2.
Actual evaluation of the modulus of continuity is taken up in Section 10.3, for the setting
of error norms and parameter sets defined by `p norms. As we seek to cover models of
sparsity, we include the cases 0 < p < 1, for which the `p measure is only a quasi-norm.
The main finding is that for an `p -ball and k k an `p0 -norm, the behavior of the
modulus depends on whether p p 0 , corresponding to dense least favorable configurations,
or whether p < p 0 , corresponding to sparse configurations.
Lower bounds in the statistical model do not flow directly from the deterministic one, and
so Section 10.4 collects the arguments in the monoresolution setting, with separate results
for sparse and dense cases.
Section 10.5 takes up the multiresolution model, beginning with the important fact that
wavelets provide unconditional bases for the Besov scale of spaces: this may be seen as
a formalization of the idea that shrinkage of coefficientsas in linear estimation or in
thresholdingis a stable operation. The propery of preservation of smoothness under thresholding, highlighted earlier, is a direct consequence.
Section 10.6 begins with the multiresolution analog of Section 10.2: drawing consequences of a deterministic observation model, now incorporating the notion of tail bias,
which is introduced to deal with estimation of a full sequence vector .I / with the cardinality of N based on only n observations. The main results for estimation in Besov norms
over Besov balls in statistical model (10.1) are formulated. A new phenomenon appears: a
distinct, and slower, rate of convergence for parameter combinations p in a logarithmic
zone. (The reason for the name appears after the detailed statement of Theorem 10.9.)
The details of the calculation of the modulus of continuity for Besov norms are taken
up in Section 10.7. The modulus provides a convenient summary describing the rate of
convergence corresponding to k kb 0 and b . An important tool is the use of Besov shells,
which consist in looking at signals whose only non-zero components lie in the j -th shell.
Focusing on the j -th shell alone reduces the calculations to an `p ball. By studying the
modulus as the shell index j varies, we see again the pattern of geometric decay away from
a critical level j D j .p/.
Finally, Section 10.8 presents lower bounds for the multiresolution setting. The Besov
shell device, after appropriate calibration, reduces the lower bound arguments to previous
results for `p -balls and error measures presented in Section 10.4.
juI j 1
I 2 I:
(10.2)
It is desired to recover the unknown vector ; but it is assumed that the deterministic noise
u might be chosen maliciously by an opponent, subject only to the uniform size bound. The
283
noise level is assumed known. The worst case error suffered by an estimator O is then
e.O ; I / D sup kO .x/
k:
(10.3)
juI j1
We will see that a number of conclusions for the statistical (Gaussian) sequence model
can be drawn, after appropriate calibration, from the deterministic model (10.2).
Assumptions on loss function and parameter space. Throughout this chapter we will assume:
(i) `2 .I / is solid and orthosymmetric, and
(ii) The error norm k k is also solid and orthosymmetric, in the sense that
jI j jI j 8I
kk kk:
The error norm can be convex, as usual, or at least -convex, 0 < 1, in the sense that
k C k k k C kk .
The Uniform Shrinkage Property of Soft Thresholding. Soft thresholding at threshold
can be used in the optimal recovery setting:
O;I .xI / D sgn.xI /.jxI j
/C :
(10.4)
The shrinkage aspect of soft thresholding has the simple but important consequence that the
estimate remains confined to the parameter space:
Lemma 10.1 If is solid orthosymmetric and , then 2 implies O 2 :
Proof Since soft thresholding shrinks each data coordinate xI towards 0 (but not past 0!)
by an amount that is greater than the largest possible noise value that could be used to
expand I in generating xI , it is clear that jO;I j jI j: Since is solid orthosymmetric,
this implies O 2 :
Minimax Error. The minimax error of recovery in the determinstic model is
E.; / D inf sup e.O ; I /;
O 2
O / D e.;
O I / is given by (10.3). Good bounds on this minimax error can be
where e.;
found in terms of a modulus of continuity defined by
./ D .I ; k k/ D
sup
fk0
1 k W k0
1 k1 g:
(10.5)
Thus, the modulus measures the error norm k k of differences of sequences in the parameter
space that are separated by at most in uniform norm.
Theorem 10.2 Suppose that and the error norm kk are solid and orthosymmetric. Then
.1=2/./ E.; / 2./:
In addition, soft thresholding O is near minimax simultaneously for all such parameter
spaces and error norms.
284
Proof For each noise vector u D .uI / under model (10.2), and 2 , we have O 2 by
the uniform shrinkage property. In addition, for each u,
kO
k1 kO
xk1 C kx
k1 2:
Hence .O ; / is a feasible pair for the modulus, and so it follows from the definition that
e.O ; / .2/. Since =2 by solid orthosymmetry, we also have .2/ 2./.
Turning now to a lower bound, suppose that the pair .0 ; 1 / 2 attains the value
./ defining the modulus. 1 The data sequence x D 1 is potentially observable under
O
(10.2) if either D 0 or D 1 ; and so for any estimator ,
O /
sup e.;
2
kO .1 /
sup
k ./=2:
2f0 ;1 g
We now define a modified modulus of continuity which is more convenient for calculations with `p and Besov norm balls.
.I ; k k/ D supfk k W 2 ; kk1 g:
In fact, .I ; k k/ D .I ; k k/, where D f1 2 W i 2 g is the
Minkowski sum of the sets and . If is a norm ball .C / D f W kk C g (so that
0 2 ), and if k k is -convex, then the modified modulus is equivalent to the original one:
./ ./ 21= .2
1=
/:
(10.6)
Indeed, the left inequality follows by taking pairs of the form .; 0/ in (10.5). For the right
inequality, let .0 ; 1 / be any feasible pair for (10.5) with D .C /. Then the scaled
difference D 2 1= .0 1 / 2 .C / and satisfies kk1 2 1= , so
k0
1 k D 21= k k 21= .2
1=
/:
The right inequality follows after maximizing over feasible pairs .0 ; 1 /.
i id
zi N.0; 1/; i D 1; : : : ; n:
(10.7)
The connection with the optimal recovery model, with I D f1; : : : ; ng, is made by considering the event
p
(10.8)
An D fsup jzI j 2 log ng;
I 2I
which because of the properties of maxima of i.i.d. Gaussians (c.f. Section 8.9) has probability approaching one:
p
P .An / D $n 1 1= log n % 1
as n ! 1:
1
If the supremum in (10.5) is not attained, the argument above can be repeated for an approximating sequence.
285
In the next two sections, we explore the implications for estimation over `p -balls in Rn
using error measured in `p0 norms. We need first to evaluate the modulus for this class of
and k k, and then to investigate lower bounds to match the upper bounds just proved.
n
n
X
X
p0
p0
D supf
min.i ; / W
ji jp C p g:
iD1
i D1
We show that
0
Wnp .; C / n0 0p ;
(10.9)
with the least favorable configurations being given up to permutations and sign changes by
D .0 ; : : : ; 0 ; 0; : : : ; 0/;
0 ;
(10.10)
with n0 non-zero coordinates and 1 n0 n. The explicit values of .n0 ; 0 / are shown in
Figure 10.1. The approximate equality occurs only if 1 < n0 < n and is interpreted as in
(10.11) below.
The result is a generalization of one for p 0 D 2 given in Lemma 9.3. We will therefore be
more informal: the verification is mostly by picturecompare Figure 10.1. First, however,
set xi D ji jp , so that we may rewrite
X p0 =p
X
0
W p D supf
xi
W
xi C p ; kxk1 p g:
P 0
The function f .x/ D xip =p is concave for p 0 p and strictly convex for p 0 > p, in both
cases over a convex constraint set. We take the two cases in turn.
(i) p p 0 . Let xN D ave xi , and xQ D .x;
N : : : ; x/.
N By concavity, f .x/
Q f .x/, and so the
maximum of f occurs at some vector c.1; : : : ; 1/. In this case, equality occurs in (10.9).
286
(ii) p < p 0 . Convexity implies that the maximum occurs at extreme points of the constraint set. For example, if C n 1=p C , then
with n0 p C p D C p ;
D .; : : : ; ; ; 0; : : : ; 0/;
0
Wnp ./
Hence
precisely
0
n0 0p
C
p0
1 p p0 p
C
2
0
Wnp .; C /
Wnp .; C / 2C p p
:
C p p
n1 jijp=C p
, or more
p p'
(10.11)
W p = n1{p =pC p
n0 = n
0 = Cn{1=p
0
Wp = n p
n0 = n
0 =
(1;...;1)
C n{1=p
p < p'
n1
jij =C
p
0
W p C p p {p
n0 = [(C=)p]
0
Wp = n p
n0 = n
0 =
0 =
C
C n{1=p
Wp = Cp
n0 = 1
0 = C
C
Figure 10.1 Top panel: Concave case p p 0 , Bottom panel: Convex case p < p 0 .
The approximate inequality is interpreted as in (10.11).
Thus n0 , or the ratio n0 =n, measures the sparsity of the least favorable configuration.
When p p 0 ; the least favorable configurations are always dense, since the contours of the
`p0 loss touch those of the `p norm along the direction .1; : : : ; 1/. On the other hand, when
p < p 0 ; the maximum value of `p0 error over the intersection of the `p ball and cube
is always attained on the boundary of the cube, which leads to sparser configurations when
C < n1=p :
287
For later use, note the special case when there is no constraint on kk1 :
WnIp0 ;p .1; C / D supfkkp0 W kkp C g D C n.1=p
1=p/C
(10.12)
2n0 02
(10.13)
k 21 n g .n/:
(10.14)
Dense Case. The argument uses a version of the hypercube method seen in Sections
288
4.7 and 9.3. Let .n0 ; 0 / be parameters of the worst case configuration for Wn .; C /: from
the figures
(
minf; C n 1=p g if p p 0
0 D
minf; C g
if p < p 0 :
from which it is clear that 0 : Let be the distribution on which makes i independently equal to 0 with probability 21 for i D 1; : : : ; n0 ; and all other co-ordinates 0: Since
supp ; we have for any .; y/ measurable event A;
sup P .A/ P .A/:
(10.15)
2
P
O
Suppose now that .y/
is an arbitrary estimator and let N.O .y/; / D i I fOi .y/i < 0g
be the number of sign errors made by O , summing over the first n0 coordinates. Under P ;
kO
0
0
kpp0 0p N.O .y/; /:
(10.16)
0
0
sup P kO kpp0 c0p P fN.O ; / cg:
2
0
0
S.c/ D inf sup P kO kpp0 c0p P fBin.n0 ; 1 / cg:
O
Let c D n0 0 ; and suppose that 1 > 0 . Write K.0 ; 1 / for the Kullback-Leibler divergence 0 log.0 =1 / C .1 0 / log..1 0 /=.1 1 //. At the end of the chapter we recall
the the Cramer-Chernoff large deviations principle
P fBin .n0 ; 1 / < n0 0 g e
along with the inequality K.0 ; 1 / 2.1
conclude that
1
and since
0
n0 0p
0
.1=2/Wnp .; C /,
n0 K.0 ;1 /
S.n0 0 / e
2n0 02
289
We describe here a key property of wavelet bases that allows us to establish strong properties for co-ordinatewise soft thresholding. An unconditional basis f I g for a Banach space
B can be defined by two conditions. The first is that f I g is a Schauder basis, meaning
everyP
v 2 B has a unique representation, that is a unique sequence fI g C such that
vD 1
1 I I . The second is a multiplier property: there exists a constant C such that for
every N and all sequences fmI g C with jmI j 1, we have
k
N
X
1
mI I
Ik
Ck
N
X
I
I k:
(10.17)
Several equivalent forms and interpretations of the definition are given by Meyer (1990, I,
Ch. VI). Here we note only that (10.17) says that shrinkage of coefficients can not grossly
inflate the norm in unconditional bases. This suggests that traditional statistical shrinkage
operations - usually introduced for smoothing or stabilization purposes - are best performed
in unconditional bases.
A key consequence of the sequence norm characterisation results described in Section 9.6
is that wavelets form unconditional bases for the Besov scale of function spaces. Indeed,
when viewed in terms of the sequence norms
X
1=q
kf kBp;q
k kbp;q
D kL kp C
2aj q kkqp
;
j L
recall (9.37) and (9.38), the multiplier property is trivially satisfied, since k f k depends
on j k only through jj k j. Donoho (1993, 1996) has shown that unconditional bases are in
a certain sense optimally suited for compression and statistical estimation.
Definition 10.5 Suppose that the orthonormal wavelet
ments. Consider a scale of functional spaces
C .R; D/ D fBp;q
0; 1 W 1=p < < min.R; D/g:
(10.18)
As seen in Section 9.6. these spaces are all embedded in C 0; 1, since > 1=p, and the
wavelet system f j k g forms an unconditional basis for each of the spaces in the scale, since
< min.R; D/:
Preservation of Smoothness
Suppose now that f I g is an unconditional basis for a function space F with normPk kF :
Data from deterministic model (10.2) can be used to construct an estimator of f D I I
PO
by setting fO D
;I I , where estimator O is given by (10.4). The uniform shrinkage
property combined with the multiplier property (10.17) imply that whatever be the noise u,
kfOkF C kf kF :
This means that one can assert that fO is as smooth as f . In particular, if f is identically 0,
then so is fO Furthermore, for a C R wavelet with D vanishing moments, this property
holds simultaneously for all spaces F in the scale C .R; D/ of (10.18).
Statistical model. We may immediately draw condlustions for the statistical model (10.1).
290
I 2 I.n/ ;
jI.n/ j D n:
Again, one still attempts to recover the entire object ; and the corresponding minimax
recovery error is
E.; I n/ D inf sup e.O .x .n/ /; I /:
O .x .n/ /
(10.19)
291
In the noisefree case D 0, we have e.O ; I 0/ D kPn k for any estimator, and so
E.; 0I n/ D sup k.I Pn /k. Consequently, we also have
.nI ; k k/ D E.; 0I n/:
It is then straightforward to establish the following finite data analog of Theorem 10.2.
Proposition 10.7 Suppose that is solid and orthosymmetric, and that the error norm
k k is solid, orthosymmetric and -convex. Then
maxf./=2; .n/g E.; I n/ c 2./ C .n/:
In addition, soft thresholding O is near minimax simultaneously for all such parameter
spaces and error norms.
Proof The lower bound is immediate. For the upper bound, consider the first n and the
remaining co-ordinates separately and use -convexity:
kO
k c kO.n/
.n/ k C k .n/
k c 2./ C .n/:
k 2./ C .n/g $n ! 1:
Thus thepstatistical model is not harder than the optimal recovery model, up to factors involving log n: We may say, using the language of Stone (1980), that 2./ C .n/ is an
achievable rate of convergence for all qualifying .; k k/.
Now specialize to the case of parameter space and error (quasi-)norm kk taken from the
Besov scale. Thus, recalling the sequence based definition (9.36), we use one Besov norm
k kb D k kbp;q
to define a parameter space .C / D f W kkbp;q
C g, and a typically
different Besov norm k kb 0 D k kb 00 0 for the error measure. This of course represents
p ;q
a substantial extension of the class of error measures: the squared error loss considered in
most of the rest of the book corresponds to 0 D 0; p 0 D q 0 D 2. We remark that the norm
k kb0 is -convex with D min.1; p 0 ; q 0 /, Exercise 10.1.
We first summarize the results of calculation of the Besov modulus and bounds for the
tail bias, the details being deferred to the next section. We then formulate the statistical
conclusions in terms of the modulus functionsthis is the main result, Theorem 10.9, of this
chapter.
An interesting feature is the appearance of distinct zones of parameters p D .; p; q; 0 ; p 0 ; q 0 /:
Regular
Logarithmic
R D fp 0 pg [
LD
In the critical case . C 1=2/p D . 0 C 1=2/p 0 , the behavior is more complicated and is
discussed in Donoho et al. (1997).
292
We recall that the notation a./ b./ means that there exist constants c1 ; c2 and c3 ,
here allowed to depend on p but not or C , such that for all < c3 we have the pair of
bounds c1 a./ b./ c2 a./.
Theorem 10.8 Let D p;q .C / and k k D k kb 00 0 : Assume that
p ;q
Q D
1=p 0 /C > 0:
.1=p
as ! 0:
(10.20)
for p 2 R;
rR D
Q
(b) the tail bias satisfies, with c2 D .2q
for p 2 L:
1/
1=q 0
.n/ c2 C n
If in addition > 1=p; then .n/ D o..n
1=2
(10.21)
//:
Part (b) shows that the condition Q > 0 is needed for the tail bias to vanish with increasing
n; we refer to it as a consistency condition. In particular, it forces 0 < . In the logarithmic
zone, the rate of convergence is reduced, some simple algebra shows that for p 2 L we have
rL < rR .
Some understanding of the regular and logarithmic zones comes from the smoothness parameter plots introduced in Chapter 9.6. For given values of the error norm parameters 0 and
p 0 , Figure 10.2 shows corresponding regions in the .1=p; / plane. The regular/logarithmic
boundary is given by the solid line D !=p 1=2 having slope ! D . 0 C 1=2/p 0 . The
consistency boundary corresponding to condition > 0 C .1=p 1=p 0 /C is given by the
broken line with inflection at .1=p 0 ; 0 /. Note that the two lines in fact intersect exactly at
.1=p 0 ; 0 /.
If ! > 1, or what is the same, if a0 D 0 C1=2 1=p 0 > 0, then there is a logarithmic zone.
In this case, the consistency boundary lies wholly above the continuity boundary D 1=p,
so the condition > 1=p imposes no additional constraint.
On the other hand, if ! 1 or a0 0, the zone boundary line is tangent to the consistency
line and there is no logarithmic zone. This explains why there is no logarithmic zone for
traditional squared error loss, corresponding to 0 D 0; p 0 D 2. In this case the continuity
boundary D 1=p implies a further constraint to ensure negligibility
of the tail bias.
R
2
As particular examples, one might contrast the error measure RjD f j, with 0 D 2; p 0 D
1 and ! D 5=2, which has a logarithmic zone, with the measure jf j, with 0 D 0; p 0 D 1
and ! D 1=2; which does not.
Make the normalization D n 1=2 : Using the bounds derived for the Besov modulus and
for the tail bias in Theorem 10.8 we obtain
293
L
R
(1=p';')
(1=p';')
1=p
1=p
Theorem 10.9 Assume model (10.1). Let D p;q .C / and k k D k kb 00 0 : Assume that
p ;q
Q D 0 .1=p 1=p 0 /C > 0 and that > 1=p: Then soft thresholding, (8.28), satsifies
p
sup P fkOn k c.n 1=2 log n/g 1 $n ! 1:
2.C /
k c.n
1=2
/g ! 1:
(10.22)
p
In the logarithmic case, the lower bound can be strengthened to .n 1=2 log n/.
p
Thus, soft thresholding at n D n 2 log n is simultaneously nearly minimax (up to
a logarithmic term) over all parameter spaces and loss functions in the (seven parameter)
scale C .R; D/, and indeed attains the optimal rate of convergence in the logarithmic case.
To appreciate the significance of adaptive estimation results such as this, note that an estimator that is exactly optimal for one pair .; k k/ may well have very poor properties
for other pairs: one need only imagine taking a linear estimator (e.g. from Pinskers theorem) that would be optimal for an ellipsoid 2;2 and using it on another space p;q with
p < 2 in which linear estimators are known (e.g. Chapter 9.9) to have suboptimal rates of
convergence.
Remark. In most of the book we have been concerned with statements about expected
O / D E L.;
O /. The optimal recovery approach leads more naturally to results
losses: r.;
about probabilities for losses: P fL.O ; R / > ctn g. The latter is weaker than the former,
1
though they areR related via r.O ; / D 0 P fL.O ; / > tgdt; which follows from the
1
identity EX D 0 P .X > t/dt for integrable random variables X 0.
294
k .j / kb D k .j / kbp;q
D 2aj kj kp
(10.23)
where, again, a D C 1=2 1=p. This shows that .j / is isomorphic to a scaled `p ball:
.j / 2j ;p .C 2 aj /: The modified modulus of continuity, when restricted to the j th shell,
reduces in turn to a scaled form of the `p modulus:
j ./ WD .I .j / ; k kb 0 /
0
D 2a j W2j .; C 2
aj
/ D W2j .2a j ; C 2
.a a0 /j
/;
(10.24)
where we have used the invariance bWn .; C / D Wn .b; bC /. It is easy to verify that nothing essential (at the level of rates of convergence) is lost by considering the shell moduli:
with D p 0 ^ q 0 ^ 1 and c D 21= ,
k.j .//j k`1 ./ c k.j .=c //j k`q0 :
(10.25)
[Proof of (10.25). The lower bound is easy: first ./ ./ and then restrict the supremum over to the j -th shell, so that ./ j ./ for each j . For the upper bound, first
use (10.6) to reduce to showing ./ k.j .//j k`q0 . Then using the definition of .j / ,
o
nX
X
0
0
k .j / kqb C q ; k .j / k1
k .j / kqb 0 W
q D sup
j
0
sup k .j / kqb 0 W k .j / kqb C q ; k .j / k1
since doingP
the maximizations separately can only increase the supremum. The final expres0
sion is just j qj ./ and so the upper bound follows.]
In view of (10.24) we can use the `p -modulus results to compute j ./ by making the
substitutions
nj D 2j ;
j D 2a j ;
Cj D C 2
.a a0 /j
Sparse case p < p 0 . We use the lower panel of Figure 10.1: as D j increases, the
three zones for W translate into three zones for j ! j , illustrated in the top panel of
Figure 10.3.
Zone (i): j < Cj nj 1=p . This corresponds to
2.aC1=p/j D 2.C1=2/j < C =;
295
so that the zone (i)/(ii) boundary occurs at j0 satisfying 2.C1=2/j0 D C =. In zone (i),
0
0 0
pj D nj jp D p 2.1Cp a /j ;
and with n0 D 2j , the maximum possible, this is a dense zone.
At the boundary j0 , on setting r0 D . 0 /=. C 1=2/; we have
0
r0 r0
Zone (ii): Cj nj 1=p < j < Cj . The right inequality corresponds to < C 2
the zone (ii)/(iii) boundary occurs at j1 satisfying 2aj1 D C =. In zone (ii),
0
pj Cjp jp
and observe using a D C 1=2
D C p p
.pa p 0 a0 /j
, so that
p 0 a0 D p. C 1=2/
pa
aj
p 0 . 0 C 1=2/
is positive in the regular zone and negative in the logarithmic zone, so that j is geometrically decreasing in the regular zone and geometrically increasing in the logarithmic zone.
The least favorable configuration has non-zero cardinality
n0 D .Cj =j /p D .C =/p 2
paj
D 2pa.j1
j/
p 0 .a a0 /j
pj D Cjp D C p 2
a0 D
1=p 0 / D Q > 0;
.1=p
r1 D 1
1=p/ D rL :
(10.26)
.a a0 /j1
a0 /=a
D C.=C /.a
D C1
r1 r1
The dense case, p p 0 is simpler. We refer to the bottom panel of Figure 10.3.
Zone (i) j < Cj nj 1=p . This zone is the same as in the sparse case, so for j j0 defined
by 2.C1=2/j0 D C =, we have
0
0 0
pj D n1j
and j D j0 2
. 0 /.j j0 /
p 0 =p
r0 r0
with r0 as before.
0
Cjp D C p 2
. 0 /p 0 j
296
Again we see that the geometric decay property (10.27) holds, with j D j0 and r D r0 ,
and as at all levels j , the least favorable configuration at level j0 is dense, n0 D 2j0 .
To summarize, under the assumptions of the Theorem 10.9, and outside the critical case
. C 1=2/p D . 0 C 1=2/p 0 , there exists j 2 R and D .; 0 ; p; p 0 / > 0 such that
j ./ r C 1 r 2
jj j j
(10.27)
Thus we have geometric decay away from a single critical level. In the regular case, j D j0
and r D r0 and the least favorable configuration at level j0 is dense, n0 D 2j0 . In the
logarithmic case, j D j1 and r D r1 , and the least favorable configuration at level j1 is
sparse, n0 D 1.
The evaluation (10.20) follows from this and (10.25).
Evaluation of Besov tail widths These can be reduced to calculations on Besov shells by
the same approach as used to prove (10.27). If we set
j D supfk .j / kb 0 W k .j / kb C g;
then the full tail width is related to these shell widths by
J C1 .2J I ; k kb 0 / k.j /j >J k`q0 :
(10.28)
j D 2ja 2j.1=p
1=p/C
aj
0
C2
/:
1=p/C
aj
aj
C , we find
D C2
j Q
In viewP
of (10.28), the full
.2J I / is equivalent to J D C 2 J Q D C n
P tail bias
q0
q0
q
Q 0j
Q
Q 0
Indeed j >J j C
, and so .2J / C 2 J
.2q
1/ 1 .
j >J 2
We now verify that the assumption > 1=p (continuity) guarantees negligibility of the
tail bias term: .n/ D o..n 1=2 //. From (10.21), .n/ D O.n Q /, while from (10.20),
.n 1=2 / n r=2 , so it is enough to verify that Q > r=2. If p is in the logarithmic zone,
this is immediate when > 1=p.
If p is in the regular zone, the condition Q > r=2 becomes 0 .1=p 1=p 0 /C >
. 0 /=.2 C 1/. If p 0 p this is trivial, while for p 0 > p it is the same as
2
.
2 C 1
0 / > .1=p
1=p 0 /:
Now the condition for p to be regular, namely .2 0 C 1/=.2 C 1/ < p=p 0 , is equivalent
to the previous display with the right side replaced by p.1=p 1=p 0 /. So, again using
> 1=p, we are done.
297
{
r0= |||
+1=2
p p0 {p {(pa{p0a0)j
j =C 2
pa(j {j)
n0 = 2 1
p0r1
a0j
a{a
r1=||
a
0 = 2
0
p0(1{r1)
0 0
pj =p 2(1+p a )j
j
j ! j()
n0 = 2
a0j
0 = 2
pj =C p 2{p (a{a )j
n0 = 1
0
0 = C 2{(a{a )j
2(+1=2)j0=C=
C
0
2aj1=C=
p0(1{r0)
p0r0
{0
r0= |||
+1=2
0 0
pj =p 2(1+p a )j
pj =C p 2{({ )p j
n0 = 2j
n0 = 2j
0 = 2a j
0 = C 2{({ +1=2)j
2(+1=2)j0=C=
Figure 10.3 Schematic of the Besov modulus j ./, defined by (10.24), when
viewed as a function of level j , with ; C held fixed. Top panel is sparse case,
p < p 0 (in the regular zone), bottom is dense case p p 0
.j / kb 0 g:
(10.29)
298
sup
a0 j
j kp0 2
g:
aj
(10.30)
aj /
2j ;p .C 2
Regular case. The Besov shell we use corresponds to the critical level j0 D p./ log2 .C =/,
where p./ D 2=.2 C 1/ and we set D D n 1=2 . The setting is dense because (cf.
0
top panel of Figure 10.3) there are n0 D 2j0 non-zero components with size 0 D 2j0 a .
Hence, we apply the dense `p -ball modulus lower bound, Proposition 10.4, to 2j0 ;p .C 2 aj0 /.
Hence, comparing (10.30) and (10.13), we are led to equate
2
a 0 j0
aj0
/;
after putting cp0 D .0 =2/1=p . Recalling the definition of the shell modulus, (10.24), we get
D cp0 j0 ./:
Because of the geometric decay of the shell modulus away from j0 , compare (10.27), there
exists c1 D c1 .p/ for which
./ c1 j0 ./:
(10.31)
Combining the prior two displays, we can say that
c2 ./ and hence
inf sup P fkO
O
kb0 c2 ./g 1
2n0 02
Logarithmic case. From the modulus calculation, we expect the least favorable configurations to be at shells near j1 and to be highly sparse, perhaps a single spike. We therefore use
the lower bounds derived for the bounded single spike parameter spaces n . / discussed
in Section ??.
First we note that if j C 2 aj , then 2j .j / 2j ;p .C 2 aj /. Fix > 0. If also
p
j .2 / log 2j , then from Proposition 8.16, we can say that
inf
P fkOj
sup
O 2j ;p .C 2
aj /
Bearing in mind the two conditions on j , it is clear that the largest possible value for j is
p
Nj D minf .2 / log 2j ; C 2 aj g:
The implied best bound in (10.29) that is obtainable using the j -th shell is then given by the
0
solution to
j 2 a j D Nj =2, namely
0
j D 12 2a j Nj :
p
Let |N1 D maxfj W .2 / log 2j C 2 aj g. It is clear that
j is increasing for j |N1
and (since a > a0 ) decreasing for j > |N1 , so our best shell bound will be derived from
|N1 .
Since we only observe data for levels j < log2 n D log2 2 , we also need to check that
299
|N1 < log2 2 , and this is done below. To facilitate the bounding of
|N1 , we first observe that
from the definition of |N1 , it follows that
C 2 a|N1 N|N1 C 2 a|N1 ;
p
and, after inserting again N|N1 D c |N1 ,
p 1=a
p 1=a
|N1
|N1
2 |N1 c4
;
c3
C
C
2
a 1
(10.32)
(10.33)
where c3 ; c4 depend on a and . After taking logarithms in the right bound, we obtain
|N1 C .2a/
log
log
(10.34)
From the left bound in (10.33), we have |N1 .2a/ 1 log2 2 C log2 .C 1=a c3 1 / < log2 2
for < 2 .a; C / since 2a > 1. Hence, as claimed, |N1 < log2 n for small.
Using (10.32), (10.33) and (10.34) in turn, along with definition (10.26) for rL , we find
that
p a aa0
p
p
|N1
a 2
.a a0 /|N1
|N1 2
C2
cC
cC 1 r log 1 rL c. log 1 /;
C
where the constant c D c.p/ may differ each time.
Returning to (10.29) and inserting
|N1 , we have
p
inf sup P fkO kb 0 c. log
O
1 /g
.2|N1 /
for < .a; C /; i.e. for n > n.a; C /. From (10.34) it is clear that |N1 ! 1 as n ! 1 so
that .2|N1 / ! 1.
Defining D log 0 =1 and N D log N 0 =N 1 ; rewrite the loglikelihood ratio as
L D log
dP0
D .
dP1
N
N
/B
C n0 :
n0 K
E1 e L D e
n0 K
300
1
kP
2
0 /2 :
Qk21 , see e.g.
10.10 Notes
The literature on optimal recovery goes back to Golomb and Weinberger (1959) and a 1965 Moscow dissertation of Smolyak. See also Micchelli (1975); Micchelli and Rivlin (1977) and Donoho (1994); the last
cited makes the connection with statistical estimation. These latter references are concerned with estimation
of a linear functional, while here we are concerned with the whole objectp
:
While the Besov shell structure emerges naturally here in the study of 2 log n thresholding, it provides
a basic point of reference for studying properties of other threshold selection schemes over the same range
of p. For example, this structure is used heavily in Johnstone and Silverman (2005b) to study wavelet
shrinkage using an empirical Bayes choice of threshold, introduced in Section 7.6.
Exercises
11
Penalization and Oracle Inequalities
The investigation of sparsity in previous chapters has been satisfied with demonstrating the
optimality of estimators by showing that they achieve minimax risks or rates up to terms
logarithmic in sample size or noise level. In this chapter and the next, our ultimate goal is
to obtain sharper bounds on rates of convergence - in fact exactly optimal rates, rather than
with spurious log terms.
The tools used for this purpose are of independent interest. These include model selection
via penalized least squares, where the penalty function is not `2 or even `1 but instead a
function of the number of terms in the model. We will call these complexity penalties. Also
concentration inequalities
Some of the arguments work for general (i.e. non-orthogonal) linear models, so we begin
with this important framework. We do not use this extra generality in this book, nor pursue
the now substantial literature on sparsity based oracle inequalities for linear and non-linear
models (see the Chapter Notes for some references). Instead, we derive a bound adequate
for our later results on sharp rates of convergence.
While it is natural to start with penalties proportional to the number of terms in the model,
it will turn out that for our later results on exact rates, it will be necessary to consider a
larger class of k log.p=k/ penalties, in which, roughly speaking, the penalty to enter the
k th variable is a function that decreases with k approximately like 2 log.p=k/, for 1.
Section 11.1 begins in the linear model setting with all subsets regression and introduces
penalized least squares estimation with penalties that depend on the size of the subset or
model considered.
Section 11.2 pauses to specialize to the case of orthogonal designequivalent to the sequence modelin order to help motivate the class of penalties to be studied. We show the
connection to thresholding, importantly with the thresholds tOpen now depending on the data,
and decreasing as the size kO of selected subset increases. The k log.p=k/ class of penalties
is motivated by connection to the expected size of coefficientsGaussian order statisticsin
a null model.
In Section 11.3,
pwe present the main oracle inequality for a class of penalties including
pen.k/ D k1C 2 log.p=k/2 for > 1. The Gaussian concentration of measure inequality of Section
2.8 plays an important role. Indeed, in considering all subsets of p variables,
there are pk distinct submodels with k variables, and this grows very quickly with k. In
order to control the resulting model explosion, good exponential probability inequalities for
the tails of chi-square distributions are needed.
Section 11.4 applies the oracle inequality to the Gaussian sequence model to obtain non301
302
asymptotic upper bounds for minimax risk over `p balls n;p .C /. Lower bounds are obtained via embedded product spaces. Both bounds are expressed in terms of a control function rn;p .C /, which when p < 2, clearly exhibits the transition from a zone of sparse least
favorable configurations to a dense zone. This is the second main theorem of the chapter,
and these conclusions are basic for the sharp rate results on estimation over Besov classes in
Chapter 12.
The remaining sections contain various remarks on and extensions of these results. Section 11.5 provides more detail on the connection between the complexity penalty functions
and thresholding, and on several equivalent forms of the theoretical complexity in the orthogonal case.
Section 11.6 remarks on the link between traditional forward and backward stepwise
model selection criteria and the class of penalties considered in this chapter.
Section 11.7 prepares for results in the next chapter on sharp rates for linear inverse problems by presenting a modified version of the main oracle inequality.
z Nn .0; I /:
(11.1)
ii
C zi ;
i D 1; : : : ; n:
k D 1; : : : ; p D n n:
with p D n for some > 1: If there is a single dominant frequency in the data, it is
possible that it will be essentially captured by an element of the dictionary even if it does
not complete an integer number of cycles in the sampling interval.
303
Pp
If we suppose that f has the form f D D1 , then these observation equations
become an instance of the general linear model (11.1) with
Xi D h
i ; i:
Again, the hope is that one can find an estimate O for which only a small number of components O 0.
All subsets regression. To each subset K f1; : : : ; pg of cardinality nK D jKj corresponds a regression model which fits only the variables x for 2 K: The possible fitted
vectors that could arise from these variables lie in the model space
SK D spanfx W 2 Kg:
The dimension of SK is at most nK , and could be less in the case of collinearity.
Let PK denote orthogonal projection onto SK : the least squares estimator O K of is
given by O K D PK y. We include the case K D , writing n D 0, S D f0g and
O .y/ 0. The issue in all subsets regression consists in deciding how to select a subset
KO on the basis of data y: the resulting estimate of is then O D PKO y:
Mean squared error properties can be used to motivate all subsets regression. We will use
a predictive risk1 criterion to judge an estimator O through the fit O D X O that it generates:
EkX O
Xk2 D EkO
k2 :
k2 D kPK
k2 C 2 dim SK :
A saturated model arises from any subset with dim SK D n, so that O K D y interpolates
the data. In this case the MSE is just the unrestricted minimax risk for Rn :
EkO
k2 D n 2 :
Comparing
the last two displays, we see that if lies close to a low rank subspace
P
2K x for jKj smallthen O K offers substantial risk savings over a saturated
model. Thus, it seems that one would wish to expand the dictionary D as much as possible
to increase the possibilities for sparse representation. Against this must be set the dangers
inherent in fitting over-parametrized models principally overfitting of the data. Penalized
least squares estimators are designed specifically to address this tradeoff.
This discussion also leads to a natural generalization of the notion of ideal risk introduced
1
Why the name predictive risk? Imagine that new data will be taken from the same design as used to
generate the original observations y and estimator O : y D X C z : A natural prediction of y is X O ,
and its mean squared error, averaging over the distributions of both z and z , is
E ky
X O k2 D E kX
X O k2 C n 2 ;
k2 , up to an additive factor that doesnt depend
304
at (8.32) in Chapter 8.3, and which in this chapter we denote by R1 .; /. For each mean
vector , there will be an optimal model subset K D K./ which attains the ideal risk
R1 .; / D min k
K
PK k2 C 2 dim SK :
Of course, this choice K./ is not available to the statistician, since is unknown. The
challenge, taken up below, is to see to what extent penalized least squares estimators can
mimic ideal risk, in a fashion analagous to the mimicking achieved by threshold estimators
in the orthogonal setting.
Complexity penalized least squares. The residual sum of squares (RSS) of model K is
ky
O K k2 D ky
PK yk2 ;
and clearly decreases as the model K increases. To discourage simply using a saturated
model, or more generally to discourage overfitting, we introduce a penalty on the size of the
model, pen.nK /, that is increasing in nK , and then define a complexity criterion
PK yk2 C 2 pen.nK /:
C.K; y/ D ky
(11.2)
The complexity penalized RSS estimate O pen is then given by orthogonal projection onto the
subset that minimizes the penalized criterion:
KO pen D argminK C.K; y/;
(11.3)
(11.4)
where we will take 2 D 2p to be roughly of order 2 log p. The well known AIC criterion
would set 2 D 2. In our Gaussian setting, it is equivalent to Mallows Cp , compare (2.54).
This is effective for selection among a nested sequence of models, but is known to overfit
in all-subsets settings, e.g. Nishii (1984); Foster and George (1994) and Exercise 11.1.]
The BIC criterion (Schwarz, 1978) puts 2 D log n. Foster and George (1994) took 2 D
2 log p, dubbing it RIC for Risk Inflation Criterion.
For this particular case, we describe the kind of oracle inequality to be proved in this
chapter. First, note that for pen0 .k/, minimal complexity and ideal risk are related:
min C.K; / D min k
K
Let p D .1 C
will be shown that
K
2p min
k
2 log p/ for > 1. Then for penalty function (11.4) and arbitary , it
EkO pen
where bounds for a./; b./ are given in Theorem 11.3 below, in particular a./ is decreasing in . Thus, the complexity penalized RSS estimator, for non-orthogonal and possibly
over-complete dictionaries, comes within a factor of order 2 log p of the ideal risk.
305
Remark. Another possibility is to use penalty functions monotone in the rank of the
model, pen.dim SK /, instead of pen.nK /. However, when k ! pen.k/ is strictly monotone,
this will yield the same models as minimizing (11.2), since a collinear model will always be
rejected in favor of a sub-model with the same span.
i D 1; : : : ; n;
i id
zi N.0; 1/:
(11.5)
This is the canonical form of the more general orthogonal regression setting Y D X C Z,
with N dimensional response and n dimensional parameter vector linked by an orthogonal
design matrix X satisfying X T X D In , and with the noise Z Nn .0; I /. This reduces to
(11.5) after premultiplying by X T and setting y D X T Y , D and z D X T Z.
We will see in this section that, in the orthogonal regression setting, the penalized least
squares estimator can be written in terms of a penalty on the number of non-zero elements
(Lemma 11.1). There are also interesting connections to hard thresholding, in which the
threshold is data dependent. We then use this connection to help motivate the form of penalty
function to be used in the oracle inequalities of the next section.
The columns of the design matrix implicit in (11.5) are the unit co-ordinate vectors ei ,
consisting of zeros except for a 1 in the i th position. The least squares estimator corresponding to a subset K f1; : : : ; ng is simply given by co-ordinate projection PK :
(
yi i 2 K
.PK y/i D
0 i K:
The complexity criterion (11.2) becomes
C.K; y/ D
yi2 C 2 pen.nK /;
i K
where nK D jKj still. Using jyj.l/ to denote the order statistics of jyi j, in decreasing order,
we can write
X
jyj2.l/ C 2 pen.k/:
(11.6)
min C.K; y/ D min
K
0kn
l>k
There is an equivalent form of the penalized least squares estimator in which the model
selection aspect is less explicit. Let N D #fi W i 0g be the number of non-zero
components of .
Lemma 11.1 Suppose that k ! pen.k/ is monotone increasing. In orthogonal model
(11.5), the penalized least squares estimator can be written
O pen .y/ D argmin ky
Proof The model space SK corresponding to subset K consists of vectors whose comC
ponents i vanish for i K. Let SK
SK be the subset on which the components i 0
306
C
for every i 2 K. The key point is that on SK
we have N D nK . Since Rn is the disjoint
C
C
union of all SK
using f0g in place of S
we get
min ky
k2 C 2 pen.nK /:
K 2S C
K
C
The minimum over 2 SK
can be replaced by a minimum over 2 SK without changing
C
C
the value because if 2 SK nSK
there is a smaller subset K 0 with 2 SK
0 here we use
monotonicity of the penalty. So we have recovered precisely the model selection definition
(11.3) of O pen .
Remark. Essentially all the penalties considered in this chapter are monotone increasing
in k. Our shorthand terminology 2k log.p=k/ penalties has the minor defect that k !
k log.p=k/ is decreasing for k p=e. However this is inessential and easily fixed, for
example, by using k ! k.1 C log.p=k// which is increasing for 0 k p.
Connection with thresholding. When pen0 .k/ D 2 k, we recover the `0 penalty and
the corresponding estimator is hard thresholding at , as seen in Section 2.3. To explore
the connection with thresholding for more general penalties, consider the form pen.k/ D
P
k
2
lD1 tn;l . Then the optimal value of k in (11.6) is
kO D argmin
k
jyj2.l/ C 2
l>k
k
X
2
tn;l
:
(11.7)
lD1
We show that O pen corresponds to hard thresholding at a data-dependent value tOpen D tn;kO .
Proposition 11.2 If k ! tn;k is strictly decreasing, then
jyj.kC1/
< tn;kO jyj.k/
O
O ;
(11.8)
and
(
yi
O pen;i .y/ D
0
jyi j tn;kO
otherwise:
(11.9)
Figure 11.1 illustrates the construction of estimated index kO and threshold tOpen .
P
P
2
Proof Let Sk D 2 klD1 tn;l
C l>k jyj2.l/ : For notational simplicity, we write tk instead
of tn;k . We have
Sk
Sk
D 2 tk2
jyj2.k/ :
and
jyj.kC1/
tkC1
< tkO ;
O
O
where at the last strict inequality we used the assumption on tk . Together, these inequalities
yield (11.8) and also the set identity
fi W jyi j tkO g D fi W jyi j jyj.k/
O g:
O we have shown (11.9).
Since the set on the right side is K,
307
jyj(j)
tn;j
tn;^k
^
k
Figure 11.1 Schematic showing construction of data dependent threshold from the
sequence tn;l and ordered data magnitudes jyj.l/ .
Gaussian order statistics and 2k log.n=k/ penalties. The z-test for i D 0 is based
on jzi j=. [If 2 were unknown and estimated by an independent 2 variate, this would be
a t-statistic zi =O .] Under the null model D 0, it is natural to ask for the magnitude of the
k-th largest test statistic jzj.k/ = as a calibration for whether to enter the k-th variable into
the model. It can be shown that if kn D o.n/, then as n ! 1,
Ejzj.k/ D .2 log.n=kn //.1 C o.1//;
(11.10)
2
so that a plausible threshold tn;k
for entry of the k-th variable is of order 2 log.n=k/. Hence
pen.k/ itself is of order 2k log.n=k/.
A heuristic justification for (11.10) comes from the equivalence of the event fjzj.k/ tg
with f#fi W jzi j tg kg. Under the null model D 0, the latter is a binomial event, so
Q // kg:
P fjzj.k/ tg D P fBin.n; 2.t
Q
Setting the mean value 2n.t/
of the binomial variate equal to k yields tn;k
Exercise REF has a more formal demonstration.
2 log.n=k/.
Example. FDR estimation. In Chapter 7.6, (7.28) described a data dependent threshold
2
choice that is closely related to penalized estimation as just described with tn;k
D z.kq=2n/.
O
Indeed, let kF D maxfk W jyj.k/ tn;k g denote the last crossing, and consider also the first
crossing kOG C 1 D minfk W jyj.k/ < tn;k g. If kOpen denotes the penalized choice (11.7), then
Section 11.6 shows that
kOG kOpen kOF
and in simulations it is often found that all three agree.
In Exercise 11.2, it is verified that if k, possibly depending on n, is such that k=n ! 0 as
n ! 1, then
2
tn;k
.1=k/
k
X
1
2
2
tn;l
tn;k
C 2 2 log.n=k 2=q/
(11.11)
308
(11.12)
2Lk /2
. > 1; Lk 0/:
(11.13)
This form is chosen both to approximate the 2k log.n=k/ class just introduced in the orthogonal case, in which p D n, and to be convenient for theoretical analysis. The penalty reduces
to pen0 of (11.4) if Lk is identically constant. Typically, however, the sequence Lk D Lp;k
is chosen so that Lp;k log.p=k/ and is decreasing in k. We will see in Section 11.4 and
the next chapter that this property is critical for removing logarithmic terms in convergence
rates. As a concession to our theoretical analysis, we need > 1 and the extra 1 in (11.13)
for the technical arguments. The corresponding thresholds are then a bit larger than would
otherwise be desirable in practice.
We abuse notation a little and write LK for LnK . Associated with the penalty is a constant
X
M D
e LK nK ;
(11.14)
K
kD0
The last term is uniformly bounded in p so long as 1. Thus, convergence of (11.14) and
the theorem below require that 2p .2 log p/ or larger when p is large.
(ii) Now suppose that Lk D log.p=k/ C 0 : Proceeding much as above,
!
p
1
X
X
p
p k k k 0 k X 1
0
kLk
M D
e
e . 1/k ;
(11.15)
e
p
k
k p
2k
0
kD0
k
p
using Stirlings formula, k D 2kk k e kC , with .12k/ 1 .12k C 1/ 1 . The last
sum converges so long as 0 > 1.
The first main result of this chapter is an oracle inequality for the penalized least squares
estimator.
Theorem 11.3 In model (11.1), let O be a penalized least squares estimator (11.2)(11.3)
for a penalty pen.k/ depending on > 1 and constant M defined at (11.13) and (11.14).
Then there exist constants a D a./; b D b./ such that for all ,
EkO pen
(11.16)
309
The constants may be taken respectively as a./ D .3 C 1/. C 1/2 =.
b./ D 4. C 1/3 =. 1/ 3 .
1/
and as
The constants a and b are not sharp; note however that a./ is decreasing in with limit
3 as ! 1. Section 11.7 has a variant of this result designed for (mildly) correlated noise
and inverse problems.
1 : Writing y D C z and expanding (11.2), we have
Proof
O y/ D kO O
C.K;
K
k2 C 2h
O KO ; zi C 2 kzk2 C 2 pen.nKO /:
K ; zi C 2 kzk2 :
Consequently
C.K; y/ D kPK? yk2 C 2 pen.nK / C.K; / C 2h
K ; zi C 2 kzk2 :
O y/ C.K; y/, so combining the corresponding equations and canBy definition, C.K;
celling terms yields a bound for O KO :
k2 C.K; / C 2hO KO
kO KO
K ; zi
2 pen.nKO /:
(11.17)
The merit of this form is that we can hope to appropriately apply the Cauchy-Schwarz inequality, (11.21) below, to the linear term hO KO K ; zi, and take a multiple of kO KO k2
over to the left side to develop a final bound.
2 : We outline the strategy based on (11.17). We construct an increasing family of sets
x for x > 0, with P .cx / M e x and then show for each 2 .0; 1/ that there are
constants a0 ./; b0 ./ for which we can bound the last two terms of (11.17): when ! 2 x ,
2hO KO
K ; zi 2 pen.nKO / .1 2 /kO KO
Assuming for now the truth of (11.18), we can insert it into (11.17) and move the squared
error term on the right side to the left side of (11.17). We get
kO KO
(11.19)
where X.!/ D inffx W ! 2 x g.R Clearly X.!/ > x implies that ! x , and so using the
1
bound on P .cx / gives EX D 0 P .X > x/dx M . Hence, taking expectations, then
minimizing over K, and setting a1 ./ D 2 .1 C a0 .// and b1 ./ D 2 b0 ./, we get
EkO KO
(11.20)
310
K ; zi kO K 0
K k K;K 0 ;
(11.21)
since z Nn .0; I / and clearly 2K;K 0 2.d / with degrees of freedom d D dim .SK
SK 0 / nK C nK 0 .
Now usepthe Lipschitz concentration of measure bound (2.76), which says here that
2
P f.d / > d C tg e t =2 for all t 0, and, crucially, for all non-negative integer
d . (If d D 0, then .0/ D 0.) For arbitrary x > 0, let EK 0 .x/ be the event
p
p
(11.22)
K;K 0 nK C nK 0 C 2.LK 0 nK 0 C x/;
and in the concentration bound set t 2 D 2.LK 0 nK 0 C x/. Let x D \K 0 EK 0 .x/, so that
X
P .cx / e x
e LK 0 nK 0 D M e x :
K0
p
p
p
Using a C b a C b twice in (11.22) and then combining with (11.21), we conclude
that on the set x ,
p
p
p
p
hO K 0 K ; zi kO K 0 K k nK 0 .1 C 2LK 0 / C nK C 2x:
The key to extracting kO K 0 K k2 with a coefficient less than 1 is to use the inequality
2 c 2 C c 1 2 , valid for all c > 0. Thus, for 0 < < 1 and c D 1 ,
2hO K 0
K ; zi
.1
/kO K 0
K k2 C
p
p i2
p
2 hp
nK 0 .1 C 2LK 0 / C nK C 2x : (11.23)
1
k2 C . 1 /kK
p
In the second, use pen.nK 0 / D nK 0 .1 C 2LK 0 /2 and get
.1
1C
1
1 2
pen.nK 0 / C
k2 :
1C 1 2
.2nK C 4x/:
1
Now, choose so that .1 C /=.1 / D , and then move the resulting 2 pen.nK 0 / term to
the left side of (11.23). To bound the rightmost terms in the two previous displays, set
1C 12
4.1 C 1 /
1
;
b0 ./ D
;
(11.24)
a0 ./ D max
;
1
1
O we recover the desired inequality
and note that nK pen.nK /. Finally, setting K 0 D K,
(11.18) and hence (11.20). Inserting D . 1/=. C 1/ gives the values for a./ D a1 ./
and b./ D b1 ./ quoted in the Theorem.
Orthogonal Case. An important simplification occurs in the theoretical complexity
C.K; / in the orthogonal case. As in Section 11.2, but now using rather than y,
X
C.K; / D
2i C 2 pen.nK /
iK
311
(11.25)
(11.26)
R.; / D min
0kn
2.l/ C 2 pen.k/:
l>k
Let us note some interesting special cases, for which we write the penalty in the form
pen.k/ D k2k :
First, with k , so that pen.k/ D 2 k is proportional to k, we verify that
X
R.; / D
min.2k ; 2 2 /;
(11.27)
and the ideal risk R1 .; / of Chapter 8 corresponds to choice 1. In addition, the oracle
inequalities of Sections 2.7 and 8.3, in the specific form (2.73), can be seen to have the form
(11.29).
Second, if k ! k is monotone, there is a co-ordinatewise upper bound for theoretical
complexity with a form generalizing (11.27).
Lemma 11.4 If pen.k/ D k2k with k ! 2k is non-increasing, then
R.; /
n
X
min.2.k/ ; 2k 2 /:
kD1
min.2.k/ ; 2k / D
~
X
(11.28)
kD1
Corollary 11.5 In the special case of orthogonal model (11.5), with the bound of Theorem
11.3 becomes
EkO pen
n
X
(11.29)
kD1
where the second inequality assumes also pen.k/ D k2k with k ! 2k non-increasing.
312
D n;p .C / D f 2 R W
n
X
ji jp C p g:
(11.30)
iD1
k22 :
In this section we will study non-asymptotic upper and lower bounds for the minimax risk
and will later see that these lead to the optimal rates of convergence for these classes of
parameter spaces.
The non-asymptotic bounds will have a number of consequences. We will again see a
sharp transition between the sparse case p < 2, in which non-linear methods clearly outperform linear ones, and the more traditional setting of p 2.
The upper bounds will illustrate the use of the 2k log.n=k/ type oracle inequalities established in the last section. They will also be used in the next chapter to derive exactly optimal
rates of convergence over Besov spaces for certain wavelet shrinkage estimators. The lower
bounds exemplify the use of minimax risk tools based on hyperrectangles and products of
spikes.
While the non-asymptotic bounds have the virtue of being valid for finite > 0, their disadvantage is that the upper and lower bounds may be too conservative. The optimal constants
can be found from a separate asymptotic analysis as ! 0, see Chapter 13 below.
A control function. The non-asymptotic bounds will be expressed in terms of a control
function rn;p .C; / defined separately for p 2 and p < 2. The control function captures
key features of the minimax risk RN .n;p .C /; / but is more concrete, and is simpler in
form. As with the minimax risk, it can be reduced by rescaling to a unit noise version
rn;p .C; / D 2 rn;p .C =/:
For p < 2, the control function is given by
8
2
<C
1 p=2
rn;p .C / D C p 1 C log.n=C p /
:
n
p
if C 1 C log n;
p
if 1 C log n C n1=p ;
if C n1=p :
(11.31)
(11.32)
See Figure 11.2. As will become evident from the proof, the three zones correspond to
situations where the least favorable signals are near zero, sparse and dense respectively.
A little calculus shows that Cp
! rn;p .C / is monotone increasing in C for 0 < p < 2, except
at the discontinuity at CL D 1 C log n. This discontinuity is not serious; for example we
have the simple bound
rn;p .CL /
2;
rn;p .CL C/
(11.33)
313
valid for all n and p 2 .0; 2/. Here r.CL / denotes the limit of r.C / as C & CL and C % CL
respectively. Indeed, the left side is equal to
1
.p=2/CL
log CL 2 p=2
21
p=2
2
using the crude bound x 1 log x 1=2 valid for x 1. Numerical work would show that
the bound is actually considerably less than 2, especially for n large.
rn;p(C)
n
(near
zero)
(sparse)
(dense)
C
1+logn
1=p
if C n1=p ;
if C n1=p :
(11.34)
To show that the bounds provided by the control function can be attained, we use a penalized least squares estimator
p O P for
p a specific choice of penalty of the form (11.13). Thus
pen.k/ D k2k with k D .1 C 2Ln;k /, and
Ln;k D .1 C 2/ log.n
=k/:
(11.35)
(11.36)
2
(11.37)
n;p .C /
Note that a single estimator O P , defined without reference to either p or C , achieves the
314
upper bound. We may thus speak of O P as being adaptively optimal at the level of rates of
convergence.
Constants convention. In the statement and proof, we use ci to denote constants that depend on .; ;
/ and aj to stand for absolute constants. While information is available
about each such constant, we have not tried to assemble this into the final constants a1 and
c1 above, as they would be far from sharp.
Proof 1 : Upper Bounds. We may assume, by scaling, that D 1. As we are in the
orthogonal setting, the oracle inequality of Theorem 11.3 combined with (11.25) and (11.26)
takes the form
EkO P
R./ D min
0kn
n
X
2.j / C k2k :
(11.38)
j >k
For the upper bound, then, we need then to show that when 2 n;p .C /,
R./ c2 rn;p .C /:
(11.39)
We might guess that worst case bounds for (11.38) occur at gradually increasing values of
k as C increases. In particular, the extreme zones for C will correspond to k D 0 and n. It
turns out that these two extremes cover most cases, and then the main interest in the proof
lies in the sparse zone for p < 2. Now to the details.
First put k D n in (11.38). Since 2n is just a constant, c3 say, we obtain
R./ c3 n
(11.40)
valid for all C (and all p), but useful in the dense zone C n1=p .
For p 2, simply by choosing k D 0 in (11.38), we also have
2=p
X
X
R./ n n 1
2j n n 1
jj jp
n1
2=p
C 2:
(11.41)
Combining the last two displays suffices to establish (11.39) in the p 2 case.
For p < 2, note that
n
X
2.j / C 2
.k C 1/1
j >k
2=p
1=p
jjp.j / C 2 .k C 1/1
2=p
j >k
We can now
cases. Putting k D 0, we get R./ C 2 , as is needed
p dispose of the extreme
1=p
for C 1 C log n. For C n , again use bound (11.40) corresponding to k D n.
p
We now work further on bounding R./ for the range C 2 1 C log n; n1=p . Inserting
the last display into (11.38) and ignoring the case k D n, we obtain
2=p
C k2k :
(11.42)
Now observe from the specific form of Ln;k that we have 2k
2 k n. Putting this into (11.42), we arrive at
R./ c4 min fC 2 k 1
2=p
1kn
315
c4 .1 C log.n=k// for
C k.1 C log.n=k//g:
(11.43)
We now pause to consider the lower bounds, as the structure turns out to be similar enough
that we can finish the argument for both bounds at once in part 3 below.
Remark. For Section 12.4, we need to make explicit the dependence of c4 on
. Indeed,
for
e, we have the following version of (11.39):
(11.44)
1=p
; 1/ a2 n min.C 2 n
2=p
; 1/:
For p < 2, we will use products of the single spike parameter sets m . / consisting of a
single non-zero component in Rm of magnitude at most , compare (8.56). Proposition 8.17
gave a lower bound for minimax mean squared error over such single spike sets.
Working in Rn , for each fixed number k, one can decree that each block of n=k successive coordinates should have a single spike belonging to n=k . /. Since minimax risk is
additive on products, Proposition 4.16, we conclude from Proposition 8.17 that for each k
RN .k1 n=k . // a3 k. 2 ^ .1 C logn=k/:
Now n;p .C / contains such a product of k copies of n=k . / if and only if k p C p ,
so that we may take D C k 1=p in the previous display. Therefore
RN .n;p .C // a4 max C 2 k 1
1kn
2=p
^ .k C k log.n=k//;
(11.45)
(11.46)
316
C2
g(x)
h(x)
g( x ? )
1+log n
C2n1{2=p
1
x ? (C)
Figure
11.3 Diagram of functions g and h and their intersection, when p < 2 and
p
1 C log n C n1=p .
p
x? . 1 C log n/ D 1; and x? .n1=p / D n:
p
Hence 1 x? n if and only if 1 C log n C n1=p ; this explains the choice of
transition points for C in the definition of r.C /.
We now relate the intersection value g.x? .C // to r.C /; we will show that
r.C / g.x? .C // 2r.C /:
(11.49)
One direction is easy: putting x? n into (11.47) shows that x? C p , and hence from
(11.48) that g.x? / r.C /: For the other direction, make the abbreviations
s D 1 C log.n=x? /;
and
t D 1 C log.n=C p /:
Now taking logarithms in equation (11.47) shows that s t C log s: But log s s=2 (since
s 1 whenever x? n), and so s 2t: Plugging this into (11.48), we obtain (11.49).
A detail. We are not quite done since the extrema in the bounds (11.46) should be computed over integers k; 1 k n. The following remark is convenient: for 1 x n, the
function h.x/ D x C x log.n=x/ satisfies
1
h.dxe/
2
h.x/ 2h.bxc/:
(11.50)
Indeed, h is concave and h.0/ D 0, and so for x positive, h.x/=2 h.x=2/. Since h is
317
increasing for 0 y n, it follows that if x 2y, then h.x/ 2h.y/. Since x 1 implies
both x 2bxc and dxe 2x, the bounds (11.50) follow.
For the upper bound in (11.46), take k D dx? e: since g is decreasing, and using (11.50)
and then (11.49), we find
min g C h .g C h/.dx? e/ g.x? / C 2h.x? / D 3g.x? / 6r.C /:
1kn
For the lower bound, take k D bx? c, and again from the same two displays,
max g ^ h .g ^ h/.bx? c/ D h.bx? c/ 12 h.x? / D 12 g.x? / 12 r.C /:
1kn
k.Lk
Lk / b:
(11.51)
k
2k D .k
1/.2k
2k
1/
0;
so that tk k . For the other bound, again use the definition of tk2 , now in the form
k
Setting D k
tk k
tk D
tk and D k
2k
k
1
tk2
k
Dk
C
t
k
1
k
1
C k
.k
1 C tk
k /:
(11.52)
k.k
k
k /
k.k 1 k /
D
:
1 C tk
1
(11.53)
318
k / D
1
2
and k.Lk
Lk / b, we find
p k.Lk 1 Lk /
2 b
2b
2 p
p
D
:
p
p
k
Lk 1 C Lk
.1 C 2Lk /
(11.54)
The bound Lk 2b implies 2k 4b, and if we return to first inequality in (11.53) and
simply use the crude bound tk 0 and k 1 k along with (11.54), we find that
= 2b=2k 1=2:
Returning to the second inequality in (11.53), we now have = 2k.k
again using (11.54), we get 4b=k ; which is the bound we claimed.
k /=, and
Some equivalences. We have been discussing several forms of minimization that turn out
to be closely related. To describe this, we use a modified notation. We consider
k
X
RS .s; / D min
0kn
sj C
R.s; / D
n
X
n
X
j ;
(11.55)
kC1
n
X
j ;
(11.56)
kC1
min.sk ; k /:
(11.57)
1
2
With the identifications sk $ tn;k
and k $ jyj2.k/ , the form RS recovers the objective
function in the thresholding formulation of penalization, (11.7). When using a penalty of
the form pen.k/ D k2k , compare (11.26), we use a measure of the form RC . Finally, the
co-ordinatewise minimum is perhaps simplest.
Under mild conditions on the sequence fsk g, these measures are equivalent up to constants. To state this, introduce a hypothesis:
(H) The values sk D .k=n/ for .u/ a positive decreasing function on 0; 1 with
lim u .u/ D 0;
u!0
0u1
319
is bounded below by
min.sk ; k /. The bounds with c will follow if we show that (H)
Pk
implies 1 sj c ksk for k D 0; : : : ; n. But
k
X
1
sj D
k
X
k=n
.u/du;
0
Z
.j=n/ n
2
g:
kQk yk2 2 tp;kC1
(11.58)
Note that we allow the threshold to depend on k: in practice it is often constant, but we wish
2
to allow k ! tp;k
to be decreasing.
In contrast, the backward stepwise approach starts with a saturated model and gradually
decreases model size until there appears to be no further advantage in going on. So, define
kOF D maxfk W kQk yk2
kQk
1 yk
2
2 tp;k
g:
(11.59)
k
X
jyj2.j / ;
j D1
so that
kOF D maxfk W jyj.k/ tp;k g;
(11.60)
and that kOF agrees with the FDR definition (7.28) with tp;k D z.qk=2n/: In this case, it is
critical to the method that the thresholds k ! tp;k be (slowly) decreasing.
2. In practice, for reasons of computational simplicity, the forward and backward stepwise
320
algorithms are often greedy, i.e., they look for the best variable to add (or delete) without
optimizing over all sets of size k.
The stepwise schemes are related to a penalized least squares estimator. Let
S.k/ D ky
Qk yk2 C 2
k
X
2
tp;j
;
(11.61)
j D1
Since ky
Qk yk2 D kyk2
S.k C 1/
kQk yk2 ,
2
kQkC1 yk2 C 2 tp;kC1
:
Thus
(<)
(>)
kQk yk
2
D 2 tp;kC1
:
<
Thus, if it were the case that kO2 > kOF , then necessarily S.kO2 / > S.kO2 1/, which would
contradict the definition of kO2 as a global minimum of S.k/. Likewise, kO2 < kOG is not
possible, since it would imply that S.kO2 C 1/ < S.kO2 /:
(11.62)
where A pB means that B A is non-negative definite. We modify the penalty to pen.k/ D
1 k.1 C 2Lk /2 : In order to handle the variance inflation aspect of inverse problems,
we want to replace the constant M in the variance term in (11.16) by one that excludes the
zero model:
X
M0 D
e LJ nJ :
(11.63)
J
321
(11.64)
Proof 1 : We modify the proof of the previous theorem in two steps. First fix J and assume
that Cov.z/ D I . Let EJ .x/ be defined as in (11.22), and then let 0x D \J 0 EJ 0 .x/ and
X 0 D inffx W ! 0x g. On the set JO , we have, as before,
kO JO
Now consider the event JO D . First, note that if kk2 2 pen.1/; we have on JO D
that, for all J
kO JO
Suppose, instead, that kk2 2 pen.1/; so that C.J; / 2 pen.1/ for all J here we
use the monotonicity of k ! pen.k/. Pick a J 0 with nJ 0 D 1; on 0x we have
p
p
p
hz; J i kJ k J;J 0 kJ k .1 C 2L1 / C nJ C 2x:
We now proceed as in the argument from (11.23) to (11.24), except that we bound 2 pen.1/
C.J; /, concluding that on 0x and JO D , we may use in place of (11.18),
2hz; J i .1
2 /kk2 C C.J; /
which might be compared with (11.19). Taking expectations, then minimizing over J , we
obtain again (11.20), this time with a2 ./ D 2 .2 C a.// and b1 ./ unchanged. Inserting
D . 1/=. C 1/ gives a0 ./ D a2 ./ and b./ D b1 ./ as before.
2 : The extension
to weakly correlated z is straightforward. We write y D C 1 z1 ,
p
where 1 D 1 and 1 D Cov.z1 / I . We apply the previous argument with ; z
replaced by 1 and z1 . The only point where the stochastic properties of z1 are used is in the
concentration inequality that is applied to J;J 0 . In the present case, if we put z1 D 1=2
1 Z
for Z N.0; I /, we can write
J;J 0 D kP 1=2
1 Zk;
where P denotes orthoprojection onto SJ SJ 0 . Since 1 .1=2
1 / 1, the map Z !
J;J 0 .Z/ is Lipschitz with constant at most 1, so that the concentration bound applies.
322
(11.65)
with
n D
`.nI ; / where
> e and the function `.n/ 1 and may depend on and
the noise level . For p
this choice, the constant M 0 in (11.64) satisfies (after using the Stirling
formula bound k > 2k k k e k ),
k
2
n
n
X
e
nk k k.1C2 / X 1
k
0
M
p
k n
n
2k n2
n1C2
kD1
kD1
(11.66)
k 1
1 X k 2 e
C;
e
2
2 ;
p
n
n
n
n
2k
n1C2
k1
11.8 Notes
The idea to use penalties of the general form 2 2 k log.n=k/ arose among several authors more or less
simultaneously:
P
Foster and Stine (1997) pen.k/ D 2 k1 2 log.n=j / via information theory.
i:i:d:
George and Foster (2000) Empirical Bayes approach. [i .1 w/0 C wN.0; C / followed by
estimation of .w; C /]. They argue that this approach penalizes the k th variable by about 2 2 log...n C
1/=k/ 1/.
The covariance inflation criterion of Tibshirani and Knight (1999) in the orthogonal case leads to pen.k/ D
P
2 2 k1 2 log.n=j /:
FDR - discussed above (?).
Birge and Massart (2001) contains a systematic study of complexity penalized model selection from the
specific viewpoint of obtaining non-asymptotic bounds, using a penalty class similar to, but more general
than that used here.
Add refs to Tsybakov oracle ineqs.
The formulation and proof of Theorem 11.3 is borrowed from Birge and Massart (2001). Earlier versions
in [D-J, fill in.]
2. The formulation and methods used for Theorem 11.6 are inspired by Birge and Massart (2001). See
also the St. Flour course Massart (2007).
6. Some bounds for kOF kOG in sparse cases are given in Abramovich et al. (2006).
Exercises
11.1 (Overfitting of AIC.) Consider the penalized least squares setting (??)(11.3) with penalty
pen0 .k/ D 2k along with n D p and orthogonal design matrix
pX D I . Show that the estimator
O pen .x/ D OH .x; / is given by hard thresholding with D 2.
(a) Show that the MSE at D 0 is approximatelyp
c0 n and evaluate c0 .
(b) With pen0 .k/pD 2k log n and hence D 2 log n, show that the MSE at D 0 is
approximately c1 log n and evaluate c1 .
11.2 (Gaussian quantiles and 2k log.n=k/ penalties.) Define the Gaussian quantile z./ by the
Q
equation .z.//
D .
323
Exercises
(a) Use (8.85) to show that
z 2 ./ D 2 log
log log
r./;
and that when 0:01, we have 1:8 r./ 3 (Abramovich et al., 2006).
(b) Show that z 0 ./ D 1=.z.// and hence that if 2 > 1 > 0, then
2 1
:
z.2 / z.1 /
1 z.1 /
(c) Verify (11.11) and (11.12).
11.3 (A small signal bound for R./.) Suppose that k ! k2k is increasing, and that 2 n;p .C /
for C 1 . Show that jj.k/ k for all k, and hence in the orthogonal case that R./
Pn
2
kD1 k .
p
p
11.4 (Monotonicity of penalty.) If k D .1 C 2Lk / with Lk D .1 C 2/ log.n
=k/ for k 1
(and L0 D L1 ) and
> e, verify that k ! k2k is monotone increasing for 0 k n.
11.5 (Inadequacy of .2 log n/k penalty.) If pen.k/ D k2k has the form 2k 2 log n, use (11.27)
with D 1 to show that
sup
R./ rNn;p .C /;
2n;p .C /
P1
kD1 k
r,
8
2
<.2=p/C
rNn;p .C / D .2=.2 p//C p .2 log n/1
:
2n log n
p=2
p
C < 2 log n
p
p
2 log n C < n1=p 2 log n
p
C n1=p 2 log n:
Especially for C near n1=p or larger, this is inferior by a log term to the control function rn;p .C /
obtained with penalty (11.35).
12
Exact rates for estimation on Besov spaces
We return to function estimation, for example in the continuous Gaussian white noise model
(1.21), viewed in the context of the sequence model corresponding to coefficients in an
orthonormal wavelet basis. We return also to the estimation framework of Section 9.8 with
the use of Besov bodies p;q .C / to model different degrees of smoothness and sparsity. The
plan is to apply the results on penalized estimation from the last chapter separately to each
level of wavelet coefficients.
This chapter has two main goals. The first is to remove the logarithmic terms that appear
in the upper bounds of Theorem 9.14 (and also in Chapter 10) while still using adaptive
estimators of threshold type. The reader may wish to review the discussion in Section 9.11
for some extra context for this goal.
The second aim of this chapter is finally to return to the theme of linear inverse problems,
introduced in Chapter 3 with the goal of broadening the class of examples to which the
Gaussian sequence model applies. We now wish to see what advantages can accrue through
using thresholding and wavelet bases, to parallel what we have studied at length for direct
estimation in the white noise model.
In the first section of this chapter, we apply the 2k log n=k oracle inequality of Chapter
11 and its `p ball consequences to show that appropriate penalized least squares estimates
(which have an interpretation as data dependent thresholding) adapt exactly to the correct
rates of convergence over essentially all reasonable Besov bodies. Thus, we show that for an
explicit O P ,
sup EkO P
simultaneously for all D p;q .C / in a large set of values for .; p; q; C /, although the
constant c does depend on these values.
Our approach is based on the inequalities of Chapter 11.4, which showed that the `p -ball
minimax risk could, up to multiplicative constants, be described by the relatively simple
control functions rnj ;p .C; / defined there. The device of Besov shells, consisting of vectors 2 that vanish except on level j , and hence equivalent to `p -balls, allows the study
of minimax risks on to be reduced to the minimax risks and hence control functions
Rj D rnj ;p .Cj ; j / where the parameters .nj D 2j ; Cj ; j / vary with j . Accordingly, a
study of the shell bounds j ! Rj yields our sharp rate results. Since this works for both
direct and indirect estimation models, it is postponed to Section 12.5.
We describe an alternative to the singular value decomposition, namely the waveletvaguelette decomposition, for a class of linear operators. The left and right singular function
324
325
systems of the SVD are replaced by wavelet-like systems which still have multiresolution
structure and yield sparse representations of functions with discontinuities. The function
systems are not exactly orthogonal, but they are nearly orthogonal, in the sense of frames,
and are in fact a sufficient substitute for analyzing the behavior of threshold estimators.
In Section 12.2, then, we indicate some drawbacks of the SVD for object functions with
discontinuities and introduce the elements of the WVD.
Section 12.3 lists some examples of linear operators A having a WVD, including integration of integer and fractional orders, certain convolutions and the Radon transform. The
common feature is that the stand-in for singular values, the quasi-singular values, decay at
a rate algebraic in the number of coefficients, j 2 j at level j .
Section 12.4 focuses on a particular idealisation, motivated by the WVD examples, that
we call the correlated levels model, cf (12.32). This generalizes the white noise model
by allowing noise levels j D 2j that grow in magnitude with resolution level j , a key
feature in inverting data in ill-posed inverse problems. In addition, the model allows for the
kind of near-independence correlation structure of noise that appears in problems with a
WVD.
Using co-ordinatewise thresholdingwith larger thresholds chosen to handle the variance
inflation with levelwe easily recover the optimal rate of convergence up to a logarithmic
factor. This analysis already makes it possible to show improvement in the rates of convergence, compared to use of the SVD, that are attainable by exploiting sparsity of representation in the WVD.
By returning to the theme of penalized least squares estimation with 2n log n=k penalties,
we are again able to dispense with the logarithmic terms in the rates of convergence in the
correlated levels model. The proof is begun in Section 12.4 up to the point at which the
argument is reduced to study of `p control functions on Besov shells. This topic is taken up
in Section 12.5.
j 2 N; k D 1; : : : ; 2j I
(12.1)
with zj k N.0; 1/ independently. Although a special case of the correlated levels model
discussed later in Section 12.4, we begin with this setting for simplicity and because of the
greater attention we have given to the direct estimation model. As in previous chapters, the
single subscript j refers to a vector: yj D .yj k /; j D .j k / etc.
We use a penalized least squares estimator on each level j , with the penalty term allowed
to depend on j , so that
OP .yj / D argmin kyj
j k2 C 2 penj .N j /;
(12.2)
j
326
j;k D
1C
2 log.2j
=k/ :
As discussed there, we assume that > 1 and
> e so that the oracle inequality of Theorem
11.3 may be applied, with M D M.
/ guaranteed to be finite for
> e by virtue of (11.15).
As in earlier chapters, compare Sections 9.9 and 10.6, we define a cutoff level J D
log2 2 and use the penalized least squared estimate only on levels j < J . [As noted earlier,
in the calibration D n 1=2 , this corresponds to estimating the first n wavelet coefficients
f of a function f based on n observations in a discrete regression model such as (1.13)
(with D 1 there).] We put these levelwise estimates together to get a wavelet penalized
least squares estimate O P D .OjP /:
(
OP .yj / j < J
OjP .y/ D
0
j J:
Remark 4 below discusses what happens if we estimate at all levels j .
From Proposition 11.2 and Lemma 11.7, the estimator is equivalent to hard thresholding
at tOj D tkOj j;kOj where kOj D N OP .yj / is the number of non-zero entries in OP .yj /
and tk2 D k2k .k 1/2k 1 . Observe that if kOj is large, the term 2 log.2j
=kOj /1=2
p
may be rather smaller than the universal threshold 2 log 2 , in part also becausepj < J ,
which corresponds to 2j < 2 . This reflects a practically important phenomenon: 2 log n
thresholds can be too high in some settings, for example in Figure 7.6, and lower choices of
threshold
can yield much improved reconstructions and MSE performance. The extra factor
p
> 1 and the extra constant 1 in the definition of j;k are imposed by the theoretical
approach taken here, but should not obscure the important conceptual point.
Theorem 12.1 Assume model (12.1) and let O P be the wavelet penalized least squares
estimate described above, and assume that
> e and > 1. For > .1=p 1=2/C along
with 0 < p; q 1 and C > 0, there exist constants c0 ; : : : ; c3 such that
c0 C 2.1
r/ 2r
RN .p;q .C /; /
sup EkO P
p;q .C /
k2 c1 C 2.1
r/ 2r
C c2 C 2 . 2 /2 C c3 2 log
1=p if p < 2:
Remarks. 1. The dependence of the constants on the parameters defining the estimator
and Besov space is given by c1 D c1 .;
; ; p/, c2 D c2 .; p/ and c3 D c3 .;
/, while c0
is an absolute constant.
2. Let us examine when the C 2.1 r/ 2r term dominates as ! 0. Since r < 1, the
2 log2 2 term is always negligible. If p 2, then 2 0 D 2 > r and so the tail bias term
0
C 2 . 2 /2 is also of smaller order. If p < 2, a convenient extra assumption is that 1=p,
0
for then 0 D a 1=2 > r=2, and again C 2 . 2 /2 is of smaller order.
Note that the condition 1=p is necessary for the Besov space Bp;q
to embed in spaces
of continuous functions.
0
3. One may ask more explicitly for what values of the tail bias C 2 . 2 /2 < C 2.1 r/ 2r .
327
r=.2 0 r/
where J ./ D j J kj k2 is the tail bias due to not estimating beyond level J . The
0
maximum tail bias over p;q .C / was evaluated at (9.59) and yields the bound c2 .; p/C 2 . 2 /2 .
To bound the mean squared error of OP .yj /, we appeal to the oracle inequality Theorem
11.3. Since model (12.1) is orthogonal, we in fact use Corollary 11.5. Using (11.29), then,
we obtain
EkOP .yj / j k2 c3 2 C c3 Rj .j ; /;
(12.4)
P
where c3 .;
/ D maxfa./; b./M.
/g and in accordance with (11.26), the level j theoretical complexity is given by
X
2
j.l/
C 2 k2j;k ;
(12.5)
Rj .j ; / D min
0knj
l>k
2
j.l/
where
denotes the l-th largest value among fj2k ; j D 1; : : : ; 2k g.
Summing over j < J D log2 2 , the first term on the right side of (12.4) yields the
c3 2 log 2 term
P in the upper bound of Theorem 12.1.
To bound j Rj .j ; / we use the Besov shells .j / D f 2 W I D 0 for I Ij g
introduced in Section 10.7. The maximum of Rj .j ; / over can therefore be obtained by
maximizing over .j / alone, and so
X
X
Rj .j ; /
sup
sup Rj .j ; /:
(12.6)
.j /
for
nj D 2j ; Cj D C 2
aj
The maximization of theoretical complexity over `p -balls was studied in detail in Section
11.4. Let rn;p .C; / be the control function for minimax mean squared error at noise level .
The proof of Theorem 11.6 yields the bound
328
follows from Proposition (12.41) to be proved there. The constant c1 D c1 .;
; ; p/ since
it depends both on c4 .;
/ and the parameters .; p/ of D p;q .C /.
Lower bound. We saw already in Theorem 9.14 that RN .; / cC 2.1 r/ 2r , but we
can rewrite the argument using Besov shells and control functions for `p balls. Since each
shell .j / , we have
RN .; / RN ..j / ; / RN .nj ;p .Cj /; / a1 rnj ;p .Cj ; /;
by the lower bound part of Theorem 11.6. Consequently RN .; / a1 maxj Rj , and that
this is bounded below by c0 C 2.1 r/ 2r is also shown in Proposition 12.6.
(12.7)
(12.8)
and the process g ! Z.g/ is Gaussian, with zero mean and covariance
Cov.Z.g/; Z.h// D g; h:
(12.9)
329
X
X 22
k k
min.k2 ; 2 =bk2 /:
k2 C k2
(12.10)
For a typical convolution operator A, the singular values bk decrease quite quickly, while
the coefficients k do not. Hence even the ideal linear risk for a step function in the Fourier
basis is apt to be uncomfortably large.
We might instead seek to replace the SVD bases by wavelet bases, in order to take advantage of wavelets ability to achieve sparse representations of smooth functions with isolated
singularities.
Example I. As a running example for exposition, suppose that A is given by integration
on R:
Z u
.Af /.u/ D f . 1/ .u/ D
f .t /dt:
(12.11)
1
Let f g be a nice orthonormal wavelet basis for L2 .R/: as usual we use for the double
index .j; k/, so that .t/ D 2j=2 .2j t k/. We may write
Z
A
.u/
2j=2 .2j t
D
D2
1
j
. 1/
k/dt D 2
2j=2 .
. 1/
/.2j u
k/
/ .u/:
. 1/
g
is.
Suppose
P initially that we consider an arbitrary orthonormal basis fek g for L2 .T /, so that
f D hf; ek iek : Suppose also that we can find representers gk 2 L2 .U / for which
hf; ek i D Af; gk :
According to Proposition C.6, this occurs when each ek 2 R.A /. The corresponding sequence of observations Yk D Y .gk / has mean Af; gk D hf; ek i and covariance 2 kl
where kl D P
Cov.Z.gk /; Z.gl // D gk ; gl . We might then consider using estimators of
O
the form f D k ck .Yk /ek for co-ordinatewise functions ck .Yk /, which might be linear or
threshold functions. However, Proposition 4.30 shows that in the case of diagonal linear estimators and suitable parameter sets, the effect of the correlation of the Yk on the efficiency of
estimation is determined by min ..//, the minimum eigenvalue of the correlation matrix
corresponding to covariance . In order for this effect to remain bounded even as the noise
level ! 0, we need the representers gk to be nearly orthogonal in an appropriate sense.
330
g
for
Let us turn again to wavelet bases. Suppose that f g is an orthonormal wavelet basis for
L2 .T / such that 2 D.A/ \ R.A / for every . Proposition C.6 provides a representer
g such that
hf;
i
D Af; g :
(12.13)
Suppose, in addition, that kg k D cj 1 is independent of k. Define two systems fu g; fv g 2
L2 .U / by the equations
v D j 1 A
u D j g ;
:
(12.14)
;
i;
D j v :
we may conclude
(12.15)
i
; g
D j 1 j 0 h
;
i
D :
(12.16)
0
/
2j .
/ :
331
0
0
k2 , we can set j D 2
v D .
/ ;
. 1/
/ :
We now turn to showing that the (non-orthogonal) systems fu g and fv g satisfy (12.12).
To motivate the next definition, note that members of both systems fu g and fv g have,
in our example, the form w .t/ D 2j=2 w.2j t k/. If we define a rescaling operator
.S w/.x/ D 2
j=2
w.2
.x C k//;
(12.18)
then in our example above, but not in general, .S w /.t / D w.t / is free of .
Definition 12.2 A collection fw g L2 .R/ is called a system of vaguelettes if there exist
positive constants C1 ; C2 and exponents 0 < < 0 < 1 such that for each , the rescaled
function wQ D S w satisfies
w.t
Q / C1 .1 C jtj/
Z
w.t
Q /dt D 0
jw.t
Q /
1 0
w.s/j
Q
C2 jt
(12.19)
(12.20)
sj
(12.21)
for s; t 2 R.
In some cases, the three vaguelette conditions can be verified directly. Exercise 12.2 gives
a criterion in the Fourier domain that can be useful in some other settings.
The following is a key property of a vaguelette system, proved in Appendix B.4. We use
the abbreviation kk2 for k. /k`2
Proposition 12.3 (i) If fw g is a system of vaguelettes satisfying (12.19) (12.21), then
there exists a constant C , depending on .C1 ; C2 ; ; 0 / such that
X
w
C kk2
(12.22)
2
(ii) If fu g; fv g are biorthogonal systems of vaguelettes, then there exist positive constants
c; C such that
X
X
ckk2
u
;
v
C kk2 :
(12.23)
The second part is a relatively straightforward consequence of the first key conclusion;
it shows that having two vaguelette systems that are orthogonal allows extension of bound
(12.22) to a bound in the opposite direction, which we have seen is needed in order to control
min ..//.
Thus, if we have two biorthogonal systems of vaguelettes, then each forms a frame: up to
multiplicative constants, we can compute norms of linear combinations using the coefficients
alone.
Definition 12.4 (Donoho (1995)) Let f g be an orthonormal wavelet basis for L2 .T /
and fu g, fv g be systems of vaguelettes for L2 .U /. Let A be a linear operator with domain
D.A/ dense in L2 .T / and taking values in L2 .U /. The systems f g; fu g; fv g form a
332
rj
u D .
.r/
g
.r/
/ ;
v D .
. r/
/ :
(12.24)
for 0 < < 1 and .u/ D uC 1 = ./. Define the order fractional derivative and
integral of by . / and . / respectively. The WVD of A is then obtained by setting
j D 2
u D .
. /
/ ;
v D .
. /
/ :
(12.25)
To justify these definitions, note that the Fourier transform of is given by (e.g. Gelfand
and Shilov (1964, p. 171)
c ./ D ./jj
b
;
b
where ./
equals c D ie i.
1/=2
i
333
It is easy to check that kgb k2 D kgb0 k2 22j so that we may take j D 2 j and u D j g ,
and, as in (12.14) set v D j 1 A .
Thus fu g and fv g are biorthogonal, and one checks that both systems are obtained by
translation and dilation of . / and . / in (12.25), with
b D jj b./=./;
b
. /
1 D jj ./
b b./:
. /
(12.26)
The biorthogonality relations for fu g and fv g will then follow if we verify that . / and
. /
satisfy (12.19)(12.21). The steps needed for this are set out in Exercise 12.2.
3. Convolution. The operator
Z
.Af /.u/ D
a.u
t /f .t /dt D .a ? f /.u/
R
is bounded on L2 .R/ if jaj < 1, by (C.29), so we can take D.A/ D L2 .R/. The adjoint
A is just convolution with a.u/
Q
D a. u/, and so in the Fourier domain, the representer g
is given by
OQ
gO D O =a;
(12.27)
OQ
where a./
D a.
O /.
As simple examples, we consider
a1 .x/ D e x I fx < 0g;
a2 .x/ D 21 e
jxj
(12.28)
i / 1 ;
aO 2 ./ D .1 C 2 / 1 ;
0
/ ;
g D
00
/ :
(12.29)
Either from representation (12.27), or more directly from (12.29), one finds that with
D 1 and 2 in the two cases, that [check]
(
22j as j ! 1;
kg k22
1
as j ! 1:
This is no longer homogeneous in j in the manner of fractional integration, but we can still
set j D min.1; 2 j /.
The biorthogonal systems fu g and fv g are given by (12.14). In the case of u D j g ,
the rescaling S u can be found directly from (12.29), yielding 2 j C 0 in the case
j > 0. The vaguelette properties (12.19) (12.21) then follow from those of the wavelet .
For v D j 1 A , it is more convenient to work in the Fourier domain, see Exercise 12.2.
4. Radon transform. For the Radon transform in R2 compare Section 3.9 for a version
on the unit diskDonoho (1995) develops a WVD with quasi-singular values j D 2j=2 .
The corresponding systems fu g, fv g are localized to certain curves in the .s; / plane
rather than to points, so they are not vaguelettes, but nevertheless they can be shown to have
the near-orthogonality property.
334
Here is a formulation of the indirect estimation problem when a WVD of the operator
is available, building on the examples presented above. Suppose that we observe A in the
stochastic observation model (12.7)(12.9), and that f ; u ; v g form a wavelet-vaguelette
decomposition of A. Consider the observations
Y .u / D Af; u C Z.u /:
Writing Y D Y.u /; z D Z.u / and noting that Af; u D j hf;
arrive at
Y D j C z :
i
D j , say, we
(12.30)
(12.31)
where the inequalities are in the sense of non-negative definite matrices. We say that the
noise z is nearly independent.
We are now ready to consider estimation of f from observations on Y . The reproducing
formula (12.17) suggests that we consider estimators of f of the form
X
fO D
.j 1 Y /
for appropriate univariate estimators .y/. The near-independence property makes it plausible that restricting to estimators in this class will not lead to great losses in estimation
efficiency; this is borne out by results to follow. Introduce y D j 1 Y N. ; j 2 2 /.
P
We have fO f D .y / and so, for the mean squared error,
X
X
E .y / 2 D
r. ; I j 1 /:
EkfO f k22 D
Notice that if j 2 j , then the noise level j 1 2j grows rapidly with level j . This
is the noise amplification characteristic of linear inverse problems and seen also in Chapter
3.9. In the next section, we study in detail the consequences of using threshold estimators to
deal with this amplification.
j D 2j ;
0
(12.32)
with the inequalities on the covariance matrix understood in the sense of non-negative definite matrices.
335
This is an extension of the Gaussian white noise model (12.1) in two significant ways:
(i) level dependent noise j D 2j with index of ill-posedness , capturing the noise
amplification inherent to inverting an operator A of smoothing type, and (ii) the presence
of correlation among the noise components, although we make the key assumption of nearindependence.
Motivation for this model comes from the various examples of linear inverse problems in
the previous section: when a wavelet-vaguelette decomposition exists, we have both properties (i) and (ii). The model is then recovered from (12.30)(12.31) when D .j k/ has the
standard index set, j D 2 j and yj k D j 1 Yj k .
The goals for this section are first to explore the effects of the level dependent noise
j D 2j on choice of threshold and the resulting mean squared error. Then indicate the
advantages in estimation accuracy that accrue with use of the WVD in place of the SVD via
an heuristic calculation with coefficients corresponding to a piecewise constant function.
Finally we introduce an appropriate complexity penalized estimator for the correlated levels
model. We formulate a result on minimax rates of estimation and begin the proof by using
oracle inequalities to reduce the argument to the analysis of risk control functions over Besov
shells.
Let us first examine what happens in model (12.32) when on level j we use soft thresholding with a threshold j j that depends on the level j but is otherwise fixed and non-random.
Thus OjSk D S .yj k ; j j /. Decomposing the mean squared error by levels, we have
X
r.O S ; / D
EkOjS j k2 ;
and if j D
2 log j 1 , we have from the soft thresholding risk bound (8.13) that
EkOjS
j k2 2j j j2 C .2j C 1/
min.j2k ; j2 /:
The noise term 2j j j2 D j 2.1C2 /j 2 , which shows the effect of the geometric inflation of
the variances, j2 D 22j 2 : To control this term, we might take j D 2 .1C2 /j D nj .1C2 / :
This corresponds to threshold
p
j D 2.1 C 2/ log nj ;
p
which is higher than the universal threshold U
2 log nj when the ill-posedness index
j D
> 0: With this choice we arrive at
X
EkOjS j k2 2 C c j
min.j2k ; 22j 2 /:
(12.33)
k
At this point we can do a heuristic calculation to indicate the benefits of using the sparse
representation provided by the WVD. This will also set the stage for more precise results
to follow. Now suppose that the unknown function f is piecewise constant with at most d
discontinuities. Then the wavelet tranform of f is sparse, and in particular, if the suppport
336
To find the worst level, we solve for j D j in the equation 2 j D 22j 2 , so that
2.1C2 /j D 2 . On the worst level, this is bounded by 2 j D . 2 /1=.1C2 / . The maxima on the other levels decay geometrically in jj j j away from the worst level, and so the
sum converges and as a bound for the rate of convergence for this function we obtain from
(12.33)
j 2
j
.log
/. 2 /1=.1C2 / :
Comparison with SVD. For piecewise constant f , we can suppose that the coefficients in
the singular function basis, k D hf; ek i decay as O.1=k/. Suppose that the singular values
bk k . Then from (12.10),
X
X
min.k2 ; 2 =bk2 /
min.k 2 ; k 2 2 / k 1 ;
k
where k solves k 2 D k 2 2 , so that k 1 D . 2 /1=.2C2 / . Hence, the rate of convergence using linear estimators with the singular value decomposition is O.. 2 /1=.2C2 / /,
while we can achieve the distinctly faster rate O.log 2 . 2 /1=.1C2 / / with thresholding and
the WVD.
In fact, as the discussion of the direct estimation case (Section 12.1) showed, the log 2
term can be removed by using data-dependent thresholding, and it will be the goal of the
rest of this chapter to prove such a result.
We will see that the minimax rate of convergence over p;q .C / is C 2.1 r/ 2r , with r D
2=.2 C 2 C 1/, up to constants depending only on .; p; q/ and .
This is the rate found earlier in Proposition 4.22 for the case of Holder smoothness .p D
q D 1/ and in Pinskers Theorem 5.2 for Hilbert-Sobolev smoothness .p D q D 2/. The
result will be established here for 0 < p; q 1, thus extending the result to cover sparse
cases with p < 2.
A further goal of our approach is to define an estimator that achieves the exact rate of
convergence 2r without the presence of extra logarithmic terms in the upper bounds, as
we have endured in previous chapters (8, 9, 10). In addition, we seek to do this with an
adaptive estimator, that is, one that does not use knowledge of the parameter space constants
.; p; q; C / in its construction.
These goals can be achieved using a complexity penalized estimator, constructed levelwise, in a manner analogous to the direct case, Section 12.1, but allowing for the modified
noise structure. Thus, at level j , we use a penalized least squares estimator OP .yj /, (12.2),
with penj .k/ D k2j;k . However, now
q
p
Lnj ;k D .1 C 2/ log.
nj nj =k/;
(12.34)
j;k D 1 .1 C 2Lnj ;k /;
337
where nj D 2j and,
nj
(
D
1 C .j
j /2
if j j D log2
if j > j :
(12.35)
The larger penalty constants
nj at levels j > j are required to ensure convergence of a
sum leading to the 2 log 2 term in the risk bound below, compare (12.38).
The penalized least squares estimator is equivalent to hard thresholding with level and
2
data dependent threshold tOj D tnj ;kOj where tn;k
D k2k .k 1/2k 1 2k and kOj D
N.OP .yj // is the number of non-zero entries in OP .yj /.
The levelwise estimators are combined into an overall estimator O P D .O P / with O P .y/ D
j
OP .yj / for j 0. [Note that in this model there is no cutoff at a fixed level J .]
Theorem 12.5 Assume the correlated blocks model (12.32) and that
> .2 C 1/.1=p
1=2/C :
(12.36)
For all such > 0 and 0 < p; q 1, for the penalized least squares estimator just
described, there exist constants ci such that if C 2 .; 2.C / /, then
c0 C 2.1
r/ 2r
RN .p;q .C /; /
sup EkOP
p;q .C /
k2 c1 C 2.1
r/ 2r
C c2 2 log
(12.37)
338
Turning now to the upper bound, the levelwise structure of OP implies that
X
EkOP k2 D
EkOP;j j k2 ;
j
and we will apply at each level j the inverse problem variant, Theorem 11.10, of the oracle
inequality for complexity penalized estimators.
Indeed, at level j , from (12.32) we may assume a model yj D j C j zj with dim.yj / D
nj D 2j and
0 Inj Cov.zj / 1 Inj :
The level j penalized least squares estimator OP;j .yj / is as described at (12.34). Theorem
11.10 implies the existence of constants a0 ./; b./ and Mj0 D Mj0 .;
j / such that
EkOP;j
Rj .j ; j / D
min
J f1;:::;2j g
Cj .J; /
Consequently,
EkOP
k2 b./1
Rj .j ; j /:
(12.38)
j >j
l>k
Therefore, we may argue as at (11.44) that the minimum theoretical complexity satisfies
sup Rj .j ; j / c.log
nj /rnj ;p .Cj ; j /;
2
j /C :
339
j /Rj :
j >j
We have reduced our task to that of analyzing the shell bounds Rj , to which we devote the
next section.
<C
rn;p .C; / D C p 2
: 2
n
n p
Cp
1 C log
1
p
C 1 C log n
p
1 C log n C n1=p
C n1=p :
p=2
(12.40)
We may refer to these cases, from top to bottom, as the small signal, sparse and dense
zones respectively, corresponding to the structure of the least favorable configurations in the
lower bound proof of Theorem 11.6.
We have seen that the `p ball interpretation of Besov shells .j / leads, for level j , to the
choices
nj D 2j ;
with a D C 1=2
Cj D C 2
aj
j D 2j ;
(12.41)
1=p.
Proposition 12.6 Suppose that 0 < p 1, 0 and > .2 C 1/.1=p 1=2/C . Let
Rj D rnj ;p .Cj ; j / denote the control functions (12.39) and (12.40) evaluated for the shell
and noise parameters .nj ; Cj ; j / defined at (12.41). Define r D 2=.2 C 2 C 1/. Then
there exist constants ci .; ; p/ such that if C , then
X
c1 C 2.1 r/ 2r max Rj
Rj c2 C 2.1 r/ 2r :
(12.42)
j 0
In addition, if j D log2
j 0
r/ 2r
:
(12.43)
j >j
Proof
r/ 2r
:
(12.44)
The essence of the proof is to show that the shell bounds j ! Rj peak at a critical level
340
j , and decay geometrically away from the value R at this least favorable level, so that
the series in (12.42) are summable. Note that for these arguments, j is allowed to range
over non-negative real values, with the results then specialized to integer values for use in
(12.42). The behavior for p < 2 is indicated in Figure 12.1; the case p 2 is similar and
simpler.
Rj
R=C 2(1{r)2r
R+
j+
(12.45)
(12.46)
j j
<R 2
p.j
j
/
1
p=2
Rj D R 2
(12.47)
1 C .j j /
j j < jC
:
2a.j jC /
RC 2
j jC
where R is as before and D .2 C1/.1=p 1=2/ > 0 in view of smoothness assumption (12.36). The values of and RC are given below; we show that RC is of geometrically
smaller order than R .
To complete the proof, we establish the geometric shell bounds in (12.45) and (12.47),
starting with the simpler case p 2: Apply control function (12.39) level by level. Thus,
on shell j , the boundary between small Cj and large Cj zones in the control function is
given by the equation .Cj =j /nj 1=p D 1: Inserting the definitions from (12.41), we recover
formula (12.46) for the critical level j .
341
In the large signal zone, j j , the shell risks grow geometrically: Rj D nj j2 D
2.2 C1/j 2 : The maximum is attained at j D j , and on substituting the definition of the
critical level j , we obtain (12.44).
In the small signal zone, j j , the shell bounds Rj D C 2 2 2j and it follows from
(12.46) that C 2 2 2j D R . We have established (12.45).
We turn to the case p < 2 and control function (12.40). Since Cj =j D .C =/2 .aC /j
with a C > 0 from (12.36), it is easily verified that the levels j belonging to the dense,
sparse and small signal zones in fact lie in intervals 0; j ; j ; jC and jC ; 1/ respectively,
where j is again defined by (12.46) and jC > j is the solution of
2.aC /jC 1 C jC log 21=2 D C =:
First, observe that the sparse/dense boundary, the definition of j and the behavior for
j j correspond to the small/large signal discussion for p 2.
In the sparse zone, j 2 j ; jC /, the shell risks Rj D Cjp j2 p 1 C log.nj jp Cj p /1 p=2 .
Using (12.41), the leading term
Cjp j2
D C p 2
p.a 2 =pC /j
decays geometrically for j j , due to the smoothness assumption (12.36); indeed we have
a 2.1=p 1=2/ D .2 C 1/.1=p 1=2/ > 0: The logarithmic term can be rewritten
using the boundary equation (12.46):
log.nj jp Cj p / D p. C C 1=2/.j
j / log 2:
pj
1 C .j
j /1
p=2
Putting j D j gives Rj D C p 2 p 2 pj D Cjp j2 p D nj j2 D R and yields the
middle formula in (12.47).
In the highly sparse zone j jC , the shell risks Rj D Cj2 D C 2 2 2aj decline geometrically from the maximum value RC D C 2 2 2ajC :
Having established bounds (12.47), we turn to establishing bounds (12.42) and (12.43).
The upper bound in (12.42) will follow from the geometric decay in (12.47) once we
establish a bound for RC in terms of R .
.1/
.2/
Let rn;p
and rn;p
denote the first two functions in (12.40) and set nC D nj C . Then define
.1/
RC D lim Rj D rnC;p
.Cj C ; j C /
j &jC
0
RC
.2/
D lim Rj D rnC;p
.Cj C ; j C /:
j %jC
We have
0
RC
D R 2
p.jC j /
1 C .jC
j /1
p=2
cR :
p
We saw at (11.33) the discontinuity in C ! rn;p .C / at C D 1 C log n was bounded,
342
p
1 C log nj C , then
rn ;p .CL j /
RC
D jC
2:
0
RC
rnj C ;p .CL j C/
0
Consequently RC 2RC
2cR and the upper bound in (12.42) follows.
For (12.43), we observe that the condition C < 2.C / implies that j < j and also
that when j > jC
.j
so that
X
.j
j /C R j
.j
j /C j
jC C jC
j ;
j /Rj
j j <jC
.j
jC /Rj C 2.jC
0
j /RC
j >jC
2a.j jC /
j >jC
Each of the terms on the right side may be bounded by c2 R by use of the appropriate
geometric decay bound in (12.47).
For the lower bound, it is enough now to observe that
max Rj D max.Rbj c ; Rdj e /:
The boundary of the sparse and small signal zones is described by the equation
1 D .Cj =j /.1 C log nj / 1=2 D .C =/2 .aC/j Cj=p .1 C j log 2/ 1=2 :
Using (12.46) as before, and taking base two logarithms, we find that the solution jC satisfies
.a C /jC D . C C 1
2 /j
`.jC /;
(12.48)
where `.j / D 1
2 log2 .1 C j log 2/:
It must be shown that this maximum is of smaller order than R . Indeed, using (12.44), RC =R D .C =/2r 2
log2 .RC =R / D 2.j
ajC / D
jC
2ajC
; and hence
after some algebra using (12.48) and recalling the definition of from below (12.47). For j 4 we have 2`.j / log2 j , and setting 2 D =. C C 1=2/ we arrive at
log2 .RC =R /
2 jC C log2 jC :
12.6 Notes
Remark on critical/sparse
regions in STAT paper.
p
The use of a larger threshold 2.1 C 2/ log n for dealing with noise amplification in inverse problems
was advocated by Abramovich and Silverman (1998); these authors also studied a variant of the WVD in
which the image function Af rather than f is explanded in a wavelet basis.
343
Exercises
Exercises
12.1 (Simple Fourier facts)
Recall or verify the following.
(a) Suppose that is C L with compact support. Then O ./ is infinitely differentiable and
j O .r/ ./j Cr jj
for all r:
(b) Suppose that has K vanishing moments and compact support. Then for r D 0; : : : ; K
we have O .r/ ./ D O.jjK r / as ! 0.
R
(c) For f such that the integral converges, jf .t /j .2/ 1 jfO./jd and
Z
jf .t/ f .s/j .2/ 1 jt sj jjjfO./jd :
1,
(d) If fO./ is C 2 for 0 < jj < 1 and if fO./ and fO0 ./ vanish as ! 0; 1, then
Z
1 2
jf .t /j .2/ t
jfO00 ./jd ;
12.2 (Vaguelette properties for convolution examples) (a) Let S be the rescaling operator (12.18),
and suppose that the system of functions w can be represented in the Fourier domain via sb D
S w : Show that vaguelette conditions (12.19)(12.21) are in turn implied by the existence of
constants Mi , not depending on , such that
Z
Z
(i)
jOs ./jd M0 ;
jOs00 ./jd M1 ;
Z
(ii) sO .0/ D 0;
and
(iii)
jj jOs ./jd M2 ;
;
u D j g ;
(c) Suppose A is given by fractional integration, (12.24), for 0 < < 1. Suppose that is C 3 ,
of compact support and has L D 2 vanishing moments. Show that fu g and fv g are vaguelette
systems.
(d) Suppose that A is given by convolution with either of the kernels in (12.28). Let D 1
for a1 and D 2 for a2 . Suppose that is C 2C , of compact support and has L D 2 C
vanishing moments. Show that fu g and fv g are vaguelette systems.
12.3 (Comparing DeVore diagrams for adaptive estimators.) Draw .; 1=p/ diagrams, introduced
in Section 9.6, to compare regularity conditions for exact minimax rate convergence for some
of the estimators in the literature recalled below. For these plots, ignore the regularity condition
on the wavelet , and the third Besov space parameter q.
(i) The SUREShrink estimator of Donoho and Johnstone (1995) assumes
> max .1=p; 2.1=p
1=2/C / ;
1 p 1:
1=2/C C 1=2;
2 2 1=6
> 1=p;
1 C 2
1 p 1:
344
for p < 2;
>0
for p 2:
13
Sharp minimax estimation on `p balls
i D 1; : : : ; n;
(13.1)
iid
with zi N.0; 1/ and constrained to lie in a ball of radius C defined by the `p norm:
D n;p .C / D f 2 Rn W
n
X
ji jp C p g:
(13.2)
i D1
k22 D
P O
i .i
k22 ;
i /2 : and in particular
(13.3)
and make comparisons with the corresponding linear minimax risk RL ./:
In previous chapters we have been content to describe the rates of convergence of RN ./,
or non-asymptotic bounds that differ by constant factors. In this chapter, and in the next for
a multiresolution setting, we seek an exact, if often implicit, description of the asymptotics
of RN ./. Asymptotically, we will see that RN depends on the size of n;p .C / through
n 2 times the dimension normalized radius
n D n
1=p
.C =/:
(13.4)
We also study linear and threshold estimators as two simpler classes that might or might not
come close in performance to the full class of non-linear estimators. In each case we also
aim for exact asymptotics of the linear or threshold minimax risk.
The `p constrained parameter space is permutation symmetric and certainly solid,
orthosymmetric and compact. It is thus relatively simple to study and yet yields a very sharp
distinction between linear and non-linear estimators when p < 2: The setting also illustrates
the Bayes minimax method discussed in Chapter 4.
When p < 2, this parameter space may be said to impose a restriction of approximate
sparsity on , as argued in earlier chapters (REFS). It represents a loosening of the requirement of exact sparsity studied in Chapter 8 using the `0 norm, in the sense that condition
(13.2) only requires that most components i are small, rather than exactly zero. Nevertheless, we will see that many of the techniques introduced in Chapter 8 for exact evaluation
of minimax risk under exact sparsity have natural extensions to the setting of approximate
sparsity discussed here.
345
346
We therefore follow the pattern established in the study of exact sparsity in Sections 8.4
8.8. In sparse cases, here interpreted as n ! 0 for 0 < p < 2, considered in Section 13.2,
thresholding, both soft and hard, again turns out to be (exactly) asymptotically minimax,
so long as the threshold is chosen carefully to match the assumed sparsity. Matching lower
bounds are constructed using the independent block spike priors introduced in Section 8.4:
the argument is similar after taking account of the `p constraint.
In dense cases, the asymptotic behavior of RN ./ is described by a Bayes-minimax
problem in which the components i of are drawn independently from an appropriate
univariate near least favorable distribution 1;n .
Again, as in Sections 8.5 and 8.7, the strategy is first to study a univariate problem
y D C Rz, with z N.0; 1/ and having a prior distribution , subject to a moment
constraint j jp d p . In this univariate setting, we can compare linear, threshold and
non-linear estimators and observe the distinction between p 2, with dense least favorable distributions, and p < 2, with sparse least favorable distributions placing most of
their mass at zero.
This is done in Section 13.3, while the following Section 13.4 takes up the properties of
the minimax threshold corresponding to the p-th moment constraint. This is used to show
that thresholding comes within a (small) constant of achieving the minimax risk over all
p and all moment constraints this is an analog of Theorem 4.17 comparing linear and
non-linear estimators over bounded intervals ; .
The second phase in this strategy is to lift the univariate results to the n-dimensional
setting specified by (13.1)(13.3). Here the independence of the co-ordinates of yi in (13.1)
and of the i in the least favorable distribution is crucial. The details are accomplished using
the Minimax Bayes approach sketched already in Chapter 4.
The Minimax Bayes strategy is not, however, fully successful in extremely sparse cases
when the expected number of spikes nn , remains bounded as n growsSection 13.5 also
compares the i.i.d. univariate priors with the independent block priors used in the sparse case.
Finally section 13.6 returns to draw conclusions about near minimaxity of thresholding in
the multivariate problem.
(13.5)
Theorem 9.5 says that the linear minimax risk is determined by the quadratic hull, and so
we may suppose that p 2: Our first result evaluates the linear minimax risk, and displays
the corner at p D 2.
Proposition 13.1 Let pN D p_2 and N D n
error loss is
1=pN
RL .n;p .C /; / D n 2 N 2 =.1 C N 2 /;
347
ip =C p ;
and a scalar function `.t / D t =.1 C t /, this optiIn terms of new variables ui D
mization can be rephrased as that of maximizing
X
/
`.C 2 2 u2=p
f .u/ D 2
i
i
Pn
The proposition just proved shows that n;1 has the same linear minimax risk as the solid
sphere n;2 , though the latter is much larger, for example in terms of volume. We have
already seen, in Example 8.2, that non-linear thresholding yields a much smaller maximum
risk over n;1 the exact behavior of RN .n;1 / is given at (13.39) below.
348
might be determined by the spike height and the `p -condition, that is, by a condition of the
form
kn pn D kn .2 log.n=kn //p=2 Cnp :
(13.6)
Cnp
npn ,
(13.8)
(13.9)
To verify this, start from (13.7) and use (13.8), first via the inequality n p n=n and then
via the equality n=n D n p tnp , we get
tn2 minf2 log.n=n /; 2 log ng D 2 log n=L n
D minf2 log n p C p log tn2 ; 2 log ng
tn2 C p log tn2 ;
which immediately yields (13.9).
We also recall a function first encountered in Lemma 9.3:
R.t / D t C ftg2=p :
(13.10)
where and fg denote integer and fractional parts respectively. See Figure 13.1.
Theorem 13.3 Let RN .Cn ; n / denote the minimax risk (13.3) for estimation over the `p
ball n;p .Cn / defined at (13.2). Define threshold tn by (13.7), then n by (13.8) and R.t / by
(13.10). Finally, let L n D n _ 1.
If 0 < p < 2 and n ! 0, and if n ! 2 0; 1, then
RN .Cn ; n / R.n /n2 2 log.n=L n /:
An asymptotically minimax rule is given by soft thresholding at n tn if n n
by the zero estimator otherwise.
(13.11)
1=p
and
Remarks. Expression (13.11) is the analog, for approximate `p sparsity, of the result (8.43)
obtained in the case of exact, `0 , sparsity. Indeed, the proof will show that n , or more precisely n C 1, counts the number of non-zero components in a near least favorable configuration. The expression (13.11) simplifies somewhat according to the behavior of n :
8
2
if n ! 1
n/
<n n 2 log.n=
2
2=p
RN .Cn ; n / C fg
(13.12)
n 2 log n if n ! 2 .0; 1/
: 2=p 2
n n 2 log n
if n ! 0:
349
0
0
Figure 13.1 The function R.t / D t C ftg2=p plotted against t for p D 1 (solid)
and p D 1=2 (dashed). The 45 line (dotted) shows the prediction of the Bayes
minimax method.
When n ! 1, (and when n ! 2 N), the limiting expression agrees with that found
in the case of exact (or `0 ) sparsity considered in Theorem 8.20, with n playing the role of
kn there.
p
When n < n 1=p it would also be possible to use thresholding at 2 log n p , but the
condition implies that all ji j 1 so thresholding is not really needed.
Proof Upper Bound. When n < n 1=p expression (13.8) shows that n 1, and so it
will be enough to verify that RN .Cn ; n / Cn2 . But this is immediate, since for the zero
estimator, we have r.O0 ; / D kk22 k k2p Cn2 .
p
For n n 1=p we use soft thresholding at tn D 2 log n p in bound (8.12) to get
X
X
rS .tn ; i / nrS .tn ; 0/ C
min.i2 ; tn2 C 1/:
i
We now claim that this bound dominates that in the previous display. Consider first the
case n 1, in which Cn2 tn2 , and so the middle bound in (13.13) becomes Cn2 . Since
npn D Cnp 1, we have npn =tn3 Cnp Cn2 , and so bound (13.13) indeed dominates.
350
i D1
351
(13.14)
where the second equality uses the minimax Theorem 4.12 and (4.14) of Chapter 4.
In particular, 1 .; / D N .; /; compare (4.26) and (4.19). In addition, the sparse
Bayes minimax risk 0 .; / D limp!0 p . 1=p ; /.
A remark on Notation. We use the lower case letters and for Bayes and frequentist
minimax risk in univariate problems, and the upper case letters B and R for the corresponding multivariate minimax risks.
We begin with some basic properties of p .; /, valid for all p and , and then turn to
the interesting case of low noise, ! 0, where the distinction between p < 2 and p 2
emerges clearly.
Proposition 13.4 The Bayes minimax risk p .; /, defined at (13.14), is
1. decreasing in p;
2. increasing in ,
3. strictly increasing, concave and continuous in p > 0,
4. and satisfies (i)
p .; / D 2 p .=; 1/, and
(ii)
p .a; / a2 p .; / for all a 1.
Proof First, (1) and 4(i) are obvious, while (2) and 4(ii) are Lemma 4.28
R and Exercise
Q / D supfB./ W jjp d D tg is
4.7(a) respectively. For (3), let t D p : the function .t
concave in t because ! B./ is concave and the constraint on is linear. Monotonicity
in p is clear, and continuity follows from monotonicity and 4(ii). Strict monotonicity then
follows from concavity.
The scaling property 4(i) means that it suffices to study the unit noise situation. As in
previous chapters, we use a special notation for this case: x N.; 1/; and write p ./ for
p .; 1/ where D = denotes the signal to noise ratio.
Information about the least favorable distribution follows from an extension of our earlier
results for p D 1, Proposition 4.19, and p D 0, Proposition 8.19. (For the proof, see
Exercise 13.5).
Proposition 13.5 For p and in .0; 1/, the Bayes minimax problem associated with
mp ./ and p ./ has a unique least favorable distribution . If p D 2, then is Gaussian, namely N.0; 2 /; while for p 2 instead is proper, symmetric and has discrete
support with 1 as the only possible accumulation points. When p < 2 the support must
be countably infinite.
Proposition 4.14 then assures us that the Bayes estimator corresponding to is minimax
for mp ./.
352
Thus, the only case in which completely explicit solutions are available is p D 2, for
which 2 .; / D 2 2 =. 2 C 2 / D L .; /, Corollary 4.6 and (4.28). From now on, however, we will be especially interested in p < 2; and in general we will not have such explicit
information about the value of p .; /; least favorable priors or corresponding estimators.
We will therefore be interested in approximations, either by linear rules when p 2; or
more importantly, by threshold estimators for all p > 0:
1 1=4
D log ..1
/=/ ;
(13.15)
and the resulting sparse prior, defined for < 1=2, was said to have sparsity and overshoot
a D .2 log 1 /1=4 .
Definition 13.6 The sparse `p prior p is the sparse prior ;./ with D p ./
determined by the moment condition
p ./ D p :
(13.16)
We write p ./ D .p .// for the location of the non-zero support point, and use notation
rather than for a small moment constraint.
Exercise 13.2 shows that
p this definition makes sense for sufficiently small. Recalling
from (8.52) that ./ 2 log 1 for small, one can verify that as ! 0,
p ./ p .2 log
p ./ .2 log
p 1=2
p=2
(13.17)
(13.18)
2p1
0 < p < 2:
(13.19)
p 1 p=2
353
2. Consider the special choice D n 1=2 . Then pn D n 1 .C =/p D n 1Cp=2 C p and so
2n D 2 log n p D .2 p/ log n 2p log C: Hence larger signal strength, represented both
in index p and in radius C; translates into a smaller choice of minimax
pthreshold. Note that
in a very small signal setting, pn D 1=n, we recover the choice n D 2 log n discussed in
earlier chapters.
3. The threshold estimator Op2 log p is also asymptotically minimax when p 2, Exercise 13.4.
Proof
(13.20)
Consequently B./ 2 ; and so also p ./ 2 . In the other direction, consider the
symmetric two point prior D .1=2/. C /; together with (13.19), formula (2.30) for
the Bayes risk shows that p ./ B. / 2 as ! 0.
Suppose now that p < 2: For the lower bound in (13.19), we use the priors p and the
asymptotics for their Bayes risks computed in Lemma 8.11. Note also that the p-th moment
constraint D p =p ./ implies that
2 . / 2 log 1 D 2 log
C p log 2 ./ 2 log
./ p .2 log
p 1 p=2
For the upper bound, we use an inequality for the maximum integrated risk of soft thresholding:
sup B.O ; / rS .; 0/ C p .1 C 2 /1 p=2 :
(13.21)
2mp ./
p=2
p=2
As this holds for all 2 mp ./, we obtain (13.21). Here we used symmetry of ! r.; /
about 0 to focus on those supported in 0; 1/.
Remark. There is an alternative approach to bounding supmp ./ BS .; / which looks for
the maximum of the linear function ! BS .; / among the extreme points of the convex
354
1+2
c p
Figure 13.2 Schematic for risk bound: although the picture shows a case with
p < 1, the argument works for p < 2.
mp ./ and shows that the maximum is actually of the two point form (8.46). This approach
yields (see Exercise 13.8)
Proposition 13.8 Let a threshold and moment space mp ./ be given. Then
supfB.O ; / W 2 mp ./g D sup r.; 0/ C .=/p r.; /
r.; 0/ C p 2
r.; 0/
(13.22)
(13.23)
The least favorable prior for O over mp ./ is of the two point prior form with determined
from and D by (8.46). As ! 1, we have
Q 1 .p=2/:
C
(13.24)
Hard thresholding. It is of some interest, and also explains some choices made in the
analysis of Section 1.3, to consider when hard thresholding OH; is asymptotically minimax.
Theorem 13.9 If p < 2 and ! 0, then the hard thresholding estimator OH; is asymptotically minimax over mp ./ if
(
2 log p
if 0 < p < 1
2 D
(13.25)
p
p
2 log C log.2 log / if 1 p < 2; > p 1:
The introductory Section 1.3 considered an example
with p D 1 and n D n 1=2 so that
p
2 log n 1 D log n: In this case the threshold D log n is not asymptotically minimax:
the proof below reveals that the risk at 0 is too large.
pTo achieve minimaxity for p 1, a
slightly larger threshold is needed, and in fact n D log.n log n/ works for any > 0.
Proof We adopt a variant of the approach used for soft thresholding. It is left as Exercise
13.3 to use Lemma 8.5 to establish that if c D p .1 C 2 / and 0 .p/, then
rH .; / rH .; 0/ C c p :
(13.26)
355
Since 2 log p , we obtain minimaxity for hard thresholding so long as the term due
to the risk at zero is negligible as ! 0:
./ D o.2
p p
/:
It is easily checked that for 0 < p < 1, this holds true for 2 D 2 log p , whereas for
1 p < 2, we need the somewhat larger threshold choice in the second line of (13.25).
For soft thresholding, the risk at zero rS .; 0/ 4 3 ./ is a factor 4 smaller than
for hard thresholding with the same (large) ; this explains why larger thresholds are only
needed in the hard threshold case.
(13.27)
where O refers to a soft threshold estimator (8.3) with threshold . Throughout this section,
we work with soft thresholding, sometimes emphasised by the subscript S, though some
analogous results are possible for hard thresholding (see Donoho and Johnstone (1994b).)
A goal of this section is to establish an analogue of Theorem 4.17, which in the case of a
bounded normal mean, bounds the worst case risk of linear estimators relative to all nonlinear ones. Over the more general moment spaces mp . /, the preceding sections show that
we have to replace linear by threshold estimators. To emphasize that the choice of estimator
in (13.27) is restricted to thresholds, we write
Z
O
B.; / D B. ; / D r.; /.d/:
Let BS ./ D inf B.; / denote the best MSE attainable by choice of soft threshold.
Our first task is to establish that a unique best ./ exists, Proposition 13.11 below. Then
follows a (special) minimax theorem for B.; /. This is used to derive some properties of
S;p .; / which finally leads to the comparison result, Theorem 13.15.
To begin, we need some preliminary results about how the MSE varies with the threshold.
356
Dependence on threshold. Let r .; / D .@=@/r.; /; from (8.79) and changes of
variable one obtains
Z
Z
w.w /dw:
w.w /dw C 2
r .; / D 2
1
jj
r .0; / D 4
w.w/dw < 0;
Z
and
(13.28)
1
0
r .; 0/ D 4
w.w
/dw < 0:
(13.29)
jwj.w
(13.30)
and by subtraction,
Z
r .; /
jj
r .; 0/ D 2
/dw:
jj
After normalizing by jr .; 0/j, the threshold risk derivative turns out to be monotone in
; a result reminiscent of the monotone likelihood ratio property. The proof is given at the
end of the chapter.
Lemma 13.10 For 0, the ratio
V .; / D
r .; /
jr .; 0/j
(13.31)
357
and so strict monotonicity of ! V .; / for 0 guarantees that this difference is < 0
or > 0 according as < 0 or > 0 . Consequently .@=@/B.; / has a single sign
change from negative to positive, ./ D 0 is unique and the Proposition follows.
The best threshold provided by the last proposition has a directional continuity property
that will be needed for the minimax theorem below. (For proof, see Further Details).
Lemma 13.12 If 0 and 1 are probability measures with 0 0 , and t D .1
t1 , then .t / ! .0 / as t & 0.
t /0 C
A minimax theorem for thresholding. Just as in the full non-linear case, it is useful to think
in terms of least favorable distributions for thresholding. Since the risk function r.; / is
bounded and continuous in , the integrated threshold risk B.; / is linear and weakly
continuous in . Hence
BS ./ D inf B.; /
is concave and upper semicontinuous in . Hence it attains its supremum on the weakly compact set mp ./, at a least favorable distribution 0 , say. Necessarily 0 0 , as BS .0 / D 0.
Let 0 D .0 / be the best threshold for 0 , provided by Proposition 13.11.
The payoff function B.; / is not convex in ; as is shown by consideration of, for
example, the risk function ! rS .; 0/ corresponding to D 0 . On the other hand,
B.; / is still linear in , and this makes it possible to establish the following minimax
theorem directly.
Theorem 13.13 The pair .0 ; 0 / is a saddlepoint: for all 2 0; 1/ and 2 mp . /,
B.0 ; / B.0 ; 0 / B.; 0 /;
(13.33)
and hence
inf sup B.; / D sup inf B.; /
mp . /
mp . /
and
S;p .; / D supfBS ./ W 2 mp . /g:
(13.34)
358
Proof The minimax Theorem 13.13 gives, in (13.34), a representation for S;p .; / analogous to (13.14) for p .; /, and so we may just mimic the proof of Proposition 13.4. except
in the case of monotonicity of in , for which we refer to Exercise 13.6.
We have arrived at the destination for this section, a result showing that, regardless of the
moment constraint, there is a threshold rule that comes quite close to the best non-linear
minimax rule. It is an analog, for soft thresholding, of the Ibragimov-Hasminskii bound
Theorem 4.17.
Theorem 13.15 (i) For 0 < p 1;
sup
;
S;p .; /
D .p/ < 1:
p .; /
! 0; 1:
(13.35)
Proof Most of the ingredients are present in Theorem 13.7 and Proposition 13.14, and we
assemble them in a fashion parallel to the proof of Theorem 4.17. The scaling S;p .; / D
2 S;p .=; 1/ reduces the proof to the case D 1. The continuity of both numerator and
denominator in D = shows that it suffices to establish (13.35).
For small , we need only reexamine the proof of Theorem 13.7: the upper bounds for
p ./pgiven there are in fact provided by threshold estimators, with D 0 for p 2 and
D 2 log p for p < 2.
For large ; use the trivial bound S;p .; 1/ 1, along with the property (1) that p . / is
decreasing in p to write
p . / 1=1 . / D 1=N .; 1/
(13.36)
359
Mn D f.d / W E
n
X
ji jp Cnp g:
(13.37)
which relaxes the `p -ball constraint of n D n;p .Cn / to an in-mean constraint. The set
Mn contains all point masses for 2 n ; and is convex, so using (4.18), the minimax
risk is bounded above by the Bayes minimax risk
RN .n;p .Cn // B.Mn / D supfB./; 2 Mn g
WD Bn;p .Cn ; n /:
We first show that that this upper bound is easy to evaluate in terms of a univariate quantity,
and later investigate when the bound is asymptotically sharp.
Recall the dimension normalized radius n D n 1=p .C =/: This may be interpreted as
the maximum scalar multiple in standard deviation units of the vector .1; : : : ; 1/ that is
contained within n;p .C /. Alternatively,
it is the bound on the average signal to noise ratio
P
measured in `p norm: .n 1 ji =jp /1=p n 1=p .C =/:
Proposition 13.16 Let p ./ denote the univariate Bayes minimax risk (13.14) for unit
noise, and let n D n 1=p Cn =n . Then
Bn;p .Cn ; n / D nn2 p .n /:
(13.38)
This is the p-th moment analog of the identity (8.66) for the `0 case. The proofs differ a
little since the method used for p-th moments does not preserve the `0 parameter space.
Proof We use the independence trick of Section 4.5 to show that the maximisation in
B.Mn / can be reduced to univariate priors. Indeed, for any 2 Mn , construct a prior N
from the product of the univariate marginals i of . We have the chain of relations
X
B./ B./
N D
B.i / nB. Q 1 /:
i
Indeed, Lemma 4.15 says that N is harder than , yielding the first inequality. Bayes risk is
additive for an independence
prior: this gives the equality. For the second inequality, form
P
1
the average Q 1 D n
i i and appeal to the concavity of Bayes risk.
The p-th moment of the univariate prior Q 1 is easily bounded:
Z
jj d Q 1 D n
n
X
1=p
Cn ; n /
and now the Proposition follows from the invariance relation 4(i) of Proposition 13.4.
360
E XAMPLE 13.2 continued. Let us return to our original example in which p D 1; the
noise n D n 1=2 , and the radius Cn D 1: Thus n D n 1 n1=2 D n 1=2 : It follows that
RN .1;n / Bn;1 .Cn ; n / D n .1=n/ 1 .n
1=2
/ .log n=n/1=2 ;
(13.39)
where the last equivalence uses (13.19). The next theorem will show that this rate and constant are optimal. Recall, for comparison, that RL .n;1 ; n / D 1=2:
The main result of this chapter describes the asymptotic behavior of the the nonlinear
minimax risk RN ./, and circumstances in which it is asymptotically equivalent to the
Bayes minimax risk. In particular, except in the highly sparse settings to be discussed below,
the least favorable distribution for RN ./ is essentially found by drawing n i.i.d rescaled
observations from the least favorable distribution p .n / for mp .n /: We can thus build on
the small results from the previous section.
Theorem 13.17 Let RN .Cn ; n / denote the minimax risk (13.3) for estimation over the `p
ball n;p .Cn / defined at (13.2), and n the normalized signal-to-noise ratio (13.4).
For 2 p 1, if n ! 2 0; 1, then
RN .Cn ; n / nn2 p .n /:
(13.40)
For 0 < p < 2, define threshold tn by (13.7), then n by (13.8) and R.t / by (13.10). Finally,
let L n D n _ 1.
(a) if n ! 2 0; 1 and n ! 1, then again (13.40) holds.
(b) If n ! 0 and n ! 2 0; 1, then
RN .Cn ; n / R.n /n2 2 log.n=L n /:
(13.41)
p<2
n ! > 0
dense
dense
n ! 0
dense
n ! 1, sparse
n , highly sparse
The highly sparse case is noteworthy because, as discussed below, the minimax Bayes
approach fails. The practical importance of this case has been highlighted by Mallat in a
satellite image deconvolution/denoising application. Hence we devote the next section to its
analysis.
Proof The sparse case, namely 0 < p < 2 and n ! 0, has already been established
in Theorem 13.3 and is included in the statement here for completeness. Our main task
here is to establish the equivalence (13.40), which, in view of Proposition 13.16, amounts
361
to proving asymptotic equivalence of frequentist and Bayes minimax risks. The detailed
behavior of RN and the structure of the asymptotically least favorable priors and estimators
follow from the results of the previous subsections on the univariate quantity p .; 1/ and
will be described below.
Asymptotic equivalence of RN and B: To show that the Bayes minimax bound is asymptotically sharp, we construct a series of asymptotically least favorable priors n that essentially concentrate on n D n;p .Cn /. More precisely, following the recipe of Chapter 4.11,
for each
< 1 we construct priors n satisfying
B.n /
Bn;p .
Cn ; n /.1 C o.1//;
(13.42)
n .n / ! 1; and
E fkO k2 C kk2 ; c g D o.Bn;p .
Cn ; n //
(13.43)
(13.44)
(13.45)
As indicated at Lemma 4.32 and the following discussion, if we verify (13.42) - (13.45) we
can conclude that RN .Cn ; n / Bn;p .Cn ; n /.
We will always define n by i.i.d rescaled draws from a univariate distribution 1 .d/ on
n
R (in some cases 1 D 1n depends on n): thus n .d / D 1n
.d=n /: Therefore, using
(13.38), condition (13.42) can be reexpressed as
B.1n /
p .
n /.1 C o.1//;
(13.46)
ji jp pn g:
We carry out the construction of 1n and n in three cases. First, under the assumption
that n ! 2 .0; 1 for all p 2 .0; 1: this is the dense case. Second, we suppose
that n ! 0 and p 2: this is in fact also a dense case since all components of the least
favorable prior are non-zero. Finally, for completeness, we discuss in outline the sparse case
n
n ! 0 with 0 < p < 2: in this case the i.i.d. prior n D 1n
establishes (13.41) only when
n ! 1: this was the reason for using the independent blocks prior in Theorem 13.3.
1 : Suppose first that n ! 2 .0; 1: Given
< 1, there exists M < 1 and a prior 1
in mp .
/ supported on M; M whose Bayes risk satisfies B.1 /
p .
/, compare
Exercise 4.5. Property (13.46) follows because p .
n / ! p .
/. Noting E1 jjp
p p and that ji j M; property (13.43) follows from the law of large numbers applied
to the i.i.d. draws from 1 : Since ji j M under the prior n , both kk2 and kO k2 are
bounded by n 2 M 2 , the latter because kO k2 En fkk2 j 2 n ; yg. Hence the left side
of (13.44) is bounded by 2nn2 M 2 n .cn / while Bn;p .
Cn ; n / is of exact order nn2 ; and so
(13.44) follows from (13.43). Property (13.45) follows from continuity of p , Proposition
13.4.
In summary, RN nn2 p .n /nn2 p ./, and an asymptotically minimax estimator can
be built from the Bayes estimator for a least favorable prior for mp ./.
362
2 : Now suppose that n ! 0. First, observe from (13.19) that p .
n /=p .n / !
2^p , so that (13.45) holds.
Suppose first that p 2: This case is straightforward: we know from the univariate
case that the symmetric two point priors n D n D .n C n /=2 are asymptotically
least favorable, so n satisfies (13.46) for large n: The corresponding measure n is already
supported on n ; so the remaining conditions are vacuous here.
In summary, RN nn2 2n and O D 0 is asymptotically minimax.
3 : Now suppose that 0 < p < 2. Although (13.41) was established in Theorem 13.3,
we do need to check that formulas (13.40) and (13.41) are consistent. For this, note first
from (13.8) that n ! 1 implies npn ! 1 (argue by contradiction), and in particular that
L n D n for n large, and also (cf. (13.7)) that tn2 D 2 log n p . We have the following chain
of relations
np .n / npn .2 log n p /1
p=2
The first equivalence uses (13.19), while the second equality is a rewriting of (13.8). The
third again uses (13.8), now inside the logarithm, and the fourth applies (13.9) to show that
the tnp factor is negligible.
In summary,
RN nn2 pn .2 log n p /1
p=2
(13.47)
p 1=2
and soft thresholding with n D .2 log n / n provides an asymptotically minimax estimator. Hard thresholding is also asymptotically minimax so long as the thresholds are
chosen in accordance with (13.25).
As promised, let us look at the i.i.d. prior construction in this sparse case. The univariate
prior is chosen as follows. Given
< 1, let 1n be the sparse prior p
n of Definition
13.6 and set
n D p .
n /;
n D p . n /:
(13.48)
This establishes (13.46) and hence (13.42). Everything now turns on the support condition
n
(13.43). Observe that the number NP
n of non-zero components in a draw from n D 1n is a
p p
p
Binomial.n; n / variable, and that i ji j D Nn n n : The support requirement becomes
f 2 n g D fNn Cnp =.np pn /g:
(13.49)
(13.50)
nn g c p Var Nn =.ENn /2 :
(13.51)
363
The right side of (13.51) converges to zero exactly when ENn D nn ! 1. We may
verify that nn ! 1 is equivalent to n ! 1. Indeed, insert (13.8) into the moment
condition (13.50) to obtain n =.nn / D
p .n =tn /p so that our claim follows from (13.7)
and (13.18) so long as npn 1. If, instead, npn 1 then it follows from (13.8) and (13.17)
that both n and nn remain bounded.
Thus condition (13.43) holds only on the assumption that n ! 1. In this case (13.44)
can be verified with a little more work; the details are omitted. Observe that when n !
2 .1; 1/, the minimax Bayes risk approaches 2 log.n=/, whereas the actual minimax
risk behaves like R./ 2 log.n=/. Thus Figure 13.1 shows the inefficiency of the minimax
Bayes risk for non-integer values of .
The assumption that n ! 1 ensures that ENn ! 1: In other words, that n has large
enough radius that the least favorable distribution in the Bayes minimax problem generates
an asymptotically unbounded number of sparse spikes. Without this condition, asymptotic
equivalence of Bayes and frequentist minimax risks can fail. For an example, return to the
1=2
case p D 1; D n 1=2 ; but now with small
; so that n ! 0. We
p radius Cn D n
1
1
2 log n: However, the linear minimax risk is
have n D n and hence B.Cn ; n / n
smaller: RL nn2 N 2 n 1 ; and of course the non-linear minimax
p risk RN is smaller still.
In this case ENn D nn D nn =n D 1=n ! 0; since n 2 log n:
(13.52)
The minimax risk among soft thresholding estimators over the `p -ball n;p .C / is given by
RS .C; / D RS .n;p .C /; / D inf
sup
2n;p .C /
E kO
k2 :
The next result is a fairly straightforward consequence of Theorems 13.15 and 13.17.
Theorem 13.18 Adopt the assumptions of Theorem 13.17. If n ! 2 0; 1 and, when
p < 2, if also n ! 1, then there exists .p/ < 1 such that
RS .Cn ; n / .p/RN .Cn ; n / .1 C o.1//:
(13.53)
If also n ! 0, then
RS .Cn ; n / RN .Cn ; n /:
The proof shows that .p/ can be taken as the univariate quantity appearing in Theorem
13.15, as so from the remarks there, is likely to be not much larger than 1. Thus, in the high
dimensional model (13.1), soft thresholding has bounded minimax
p efficiency among all estimators. In the case when n ! 0, the threshold choice n D n 2 log n p is asymptotically
minimax among all estimators.
Proof
For a given vector D .i /, define i D i =n and let n denote the empirical
364
P
1
measure n
i i . We can then rewrite the risk of soft thresholding at n , using our
earlier notations, respectively as
X
X
r.; i / D n 1 n2 B.; n /:
E
.O;i i /2 D n2
i
If 2 n;p .Cn /, then the empirical measure satisfies a univariate moment constraint
Z
X
ji =n jp n.Cn =n /p D pn :
(13.54)
jjp d n D n 1
Consequently n 2 mp .n /, and so
inf sup E kO
k2 nn2 inf
sup
2mp .n /
B.; /:
Now recalling definition (13.27) of S;p ./ and then Theorem 13.15, the right side equals
nn2 S;p .n / .p/ nn2 p .n / D .p/Bn;p .Cn ; n /;
where at the last equality we used the minimax Bayes structure Proposition 13.16. Putting
this all together, we get
RS .Cn ; n / .p/Bn;p .Cn ; n /
and the conclusion (13.53) now follows directly from Theorem 13.17. If n ! 0, then
S;p .n / p .n / by Theorem 13.7 and so we obtain the second statement.
Remark 13.19 There is a fuller Bayes minimax theory for thresholding, which allows for
a different choice of threshold in each co-ordinate. There is a notion of threshold Bayes
minimax risk, BS In;p .C; / for priors satisfying (13.37), and a vector version of Theorem
13.15
BSIn;p .C; / .p/Bn;p .C; /:
(13.55)
(13.56)
/,
Z
Z
R.; / D
jwj =
jwj D N./=D./;
D
and the intervals N D . jj; jj/ and D D . 1; 0/. One then checks that
D./2 .@=@/R.; / D D./N 0 ./ N./D 0 ./
Z
Z
Z
Z
D
jwj
wjwj
wjwj
jwj ;
D
365
13.8 Notes
after cancellation, and each term on the right side is positive when 0 and > 0 since
Z
jj
Z
wjwj D
w 2 .w
/
from symmetry and unimodality of . This shows the monotonicity of V .; / in . We turn
to the large limit: writing for jj, a short calculation shows that as ! 1
Z
Q
Q
N./
w.w /dw D .
/ ./
C ./ . / . /
0
2
Q
D./ D ./ C ./ ./= ;
so that R.; / e
2 =2
.1 C o.1// ! 1 as ! 1.
4: Proof of Lemma 13.12. Let D.; / D @ B.; /; from Proposition 13.11 we know
that ! D.; / has a single sign change from negative to positive at ./. The linearity
of ! D.; / yields
D.; t / D D.; 0 / C tD.; 1
0 / D D./ C tE./;
0 j < is that
for all 0 . Since D.0 / < 0 < D.0 C / and ! E./ is continuous and
bounded on 0; 0 C 1, the condition clearly holds for all t > 0 sufficiently small.
13.8 Notes
[To include:] DJ 94 `q losses and p < q vs p q.]
[Feldman, Mallat?, REM discussion/refs.]
Remark. [later?] A slightly artificial motivation for the `p balls model comes from
the continuous Gaussian
/dt C n 1=2 d W; t 2 0; 1 in which
Pnwhite noise model d Yt D f .t1=2
f has the form f D 1 k n;k ; where n;k .t / D n .nt k/: If is the indicator of
P
the unit interval 0; 1, then kfO f k2L2 D .Oi i /2 and since
Z
jf jp D np=2
n
X
jk jp ;
366
Exercises
p
13.1 (Spike heights for least favorable priors.) In the setting of Theorem 13.3, let n Dp
nn =tn and
mn be the integer part of n=.n C 1/. Let n D tn log tn and show that n D 2 log mn
n ! 1, for example as follows:
(a) if n < 1, show that
pn log tn ,
(b) if n 1, show that 2 log mn tn c=tn .
13.2 (Sparse priors are well defined.) Consider the sparse prior ;./ specified by equation (8.48)
with sparsity and overshoot a D .2 log 1 /
=2 : Let > 0 be small and consider the moment constraint equation ./p D p : Show that m./ D ./p has m.0C/ D 0 and is
increasing for > 0 sufficiently small. Show also, for example numerically, that for some
,
m./ ceases to be monotone for larger values of .
13.3 (Boundspfor hard thresholding.)
Use Lemma 8.5 to establish (13.26) by considering in turn
p
2 0; 5, 2 5; and : Give an expression for 0 .p/.
13.4 (Minimaxity of thresholding for p 2.) In the setting of Theorem 13.7, show that Op2 log
is asymptotically minimax when ! 0 for p 2, for example by using (8.7) and (8.12).
13.5 (Structure of the p-th moment least favorable distributions.) Establish Proposition 13.5 by
mimicking the proof of Proposition 8.19, allowing for the fact that mp . / is weakly compact.
(a) Let .d/ D p jjp .d / and use strict monotonicity of p . / to show that has
total mass 1.
(b) Let r./ D p jj p r.O ; / r.O ; 0/ and verify that for 0,
Z
r. / r. 0 / .d 0 /:
(c) Complete the argument using Lemma 4.18 and Exercise 4.1.
13.6 (Monotonicity of threshold minimax risk.) Let r.; I / denote the MSE of soft thresholding
at when x N.; 2 /, and r.; / D r.; I 1/. Show that the proof of monotonicity of
! S;p .; / can be accomplished via the following steps:
(a) It suffices to show that if 0 < , then r. 0 ; I 0 / r.; I / for all and .
(b) Writing 0 D =, verify that if 1,
.d=d/r.; I / D 2r.; 0 /
p/
(b) For 0 < p 2, there exists c > 0 such that the function ! r.; 1=p /, is convex for
2 .0; c and concave for 2 c ; 1/. [Assume, from e.g. Prekopa (1980, Theorem 3 and
Sec. 3), that ! .I / is log-concave on .0; 1/.]
(c) Show that the extreme points of mp . / have the form .1 /0 C 1 , but that it suffices
to take 0 D 0, and hence recover (13.22).
(d) Show that (13.23) is equivalent to solving for D in
R
ss .I /ds
2
R
R./ WD 0
D :
p
.I / 0 sds
367
Exercises
Q
Show that R. C u/ ! 1=.u/
uniformly on compact u-intervals and so conclude (13.24).
13.9 (Bayes minimax theory for thresholding.) Let D .i / be a vector of thresholds, and define
now O by O;i .y/ D OS .yi ; i /. If is a prior on 2 Rn , set B.; / D E E kO k2 :
Define Mn , the priors satisfying the n;p .C / constraint in mean, by (13.37) and then define
the Bayes-minimax threshold risk by
BSIn;p .C; / D inf sup B.; /:
2Mn
(a) Let BS ./ D inf B.; /. Show that a minimax theorem holds
inf sup B.; / D sup BS ./;
Mn
Mn
. ; /
where the sum is taken over sets I D fi1 ; i2 ; i3 g f1; : : : ; ng of cardinality three.
If 2 < 1 and 2 C 3 > 1, show that under Pn1 ,
Dn Sn .1 n /Sn .2 n /Sn .3 n /
in probability. [This would be the key step in extending the lower bound argument to the range
2 < p 3 and indicates the approach for general .]
14
Sharp minimax estimation on Besov spaces
14.1 Introduction
In previous chapters, we developed bounds for the behavior of minimax risk RNp..C /; /
over Besov bodies .C /. In Chapters 9 and 10, we showed that thresholding at 2 log 1
led to asymptotic minimaxity up to logarithmic factors O.log 1 /, while in Chapter 12 we
established that estimators derived from complexity penalties achieved asymptotic minimaxity up to constant factors.
In this chapter, we use the minimax Bayes method to study the exact asymptotic behavior
of the minimax risk, at least in the case of squared error loss. The price for these sharper
optimality results is that the resulting optimal estimators are less explicitly described and
depend on the parameters of .
In outline, we proceed as follows. In Section 14.2 we replace the minimax risk RN ..C /; /
by an upper bound, the minimax Bayes problem with value B.C; /, and state the main results of this chapter.
In Section 14.3, we begin study of the optimization over prior probability measures
required for B.C; /, and show that the least favorable distribution necessarily has independent co-ordinates, and hence the corresponding minimax rule is separable, i.e. acts coordinatewise. The B.C; / optimization is then expressed in terms of the univariate Bayes
minimax risks p .; / studied in Chapter 13.
In Section 14.4, a type of renormalization argument is used to deduce the dependence
of B.C; / on C and up to a periodic function of C =. At least in some cases, this function
is almost constant.
In Section 14.5, we show that the upper bound B.C; / and minimax risk RN ..C /; /
are in fact asymptotically equivalent as ! 0, by showing that the asymptotically least
favorable priors are asymptotically concentrated on .C /.
The minimax risk of linear estimators is evaluated in Section 14.6, using notions of
quadratic convex hull from Chapter 4revealing suboptimal rates of convergence when
p < 2.
In contrast, threshold estimators, Section 14.4 can be found that come within a constant
factor of RN ..C /; / over the full range of p; these results rely on the univariate Bayes
minimax properties of thresholding established in Chapter 13.4.
368
369
(14.1)
where I denotes the pair .j; k/, supposed to lie in the set I D [j 1 Ij , where for j 0;
Ij D f.j; k/ W k D 1; : : : ; 2j g and the exceptional I 1 D f. 1; 0/g: For the parameter
spaces, we restrict attention, for simplicity of exposition, to a particular class of Besov bodies
D p .C / D f D .I / W kj kp C 2
aj
for allj g;
a D C 1=2
1=p:
This is the q D 1 case of the Besov bodies p;q considered in earlier chapters. 1 They are
supersets of the other cases, and it turns out that the rate of convergence as ! 0 is the
same for all q Donoho and Johnstone (1998).
We note that is solid and orthosymmetric, and compact when > .1=p 1=2/C :
(Exercise 14.1(a)). The focus will be
PonO global `22 estimation: that is, we evaluate estimators
2
O
with the loss function k k2 D .I I / and the minimax risk
RN .; / D inf sup E kO
O
k22 :
P
In principle, a similar development could be carried out for the `p loss kO kpp D jOI
P
P
I jp , or weighted losses of the form j 2jr k jOj k j k jp :
(14.2)
Since M contains unit point masses at each 2 , we have RN .; / B.M; /: We will
again see that it is (relatively) easier to study and evaluate the Bayes minimax risk B.M; /.
To emphasize the dependence on C and , we sometimes write B.C; / for B.M; /.
The results build on the univariate Bayes minimax problem introduced in Section 13.3,
with Bayes minimax risk p .; / corresponding to observation y D C z and moment
constraint E jjp p for the prior . We use the notation p ./ for the normalized
O /
problem with noise D 1. Let denote the least favorable prior for p ./ and D .xI
denote the corresponding Bayes-minimax estimator, so that B. ; / D B. / D p ./:
A first key property of the Bayes minimax problem is that minimax estimators are separable into functions of each individual coordinate:
1
370
I 2 I;
(14.3)
where Oj .y/ is a scalar non-linear function of the scalar y. In fact there is a one parameter
O / be the Bayes
family of functions from which the minimax estimator is built: Let .xI
minimax estimator for the univariate Bayes minimax problem p ./ recalled above. Then
O I =I j /;
Oj .yI / D .y
where j D .C =/2
.C1=2/j
(14.4)
O / is not available, but we will see that useful approxFor p 2; the explicit form of .I
O / by threshold rules are possible.
imations of .I
Second, the exact asymptotic structure of the Bayes minimax risk can be determined.
Theorem 14.2 Suppose that 0 < p 1 and > .1=p
B.C; / P .C =/ C
2.1 r/ 2r
;
! 0;
1=2/C ,
! 0:
(14.5)
Combining Theorems 14.214.3, we conclude that the estimator O is asymptotically minimax for R as ! 0. In short: a separable nonlinear rule is asymptotically minimax.
The proofs of these results occupy the next three sections.
371
The independence structure of N means that the Bayes estimator ON is separable - since
prior and likelihood factor, so does the posterior, and so
O;I
D EN I .I jyI /:
N
P
In addition, the Bayes risk is additive: B./
N D I B.N I /: The constraint for membership
in M becomes, for N j ,
EN I jj1 jp C p 2
.apC1/j
for all j:
Let ! D C 1=2 and note that ap C 1 D !p. The optimization can now be carried out on
each level separately, and, since N j is a univariate prior, expressed in terms of the univariate
Bayes minimax risk, so that
X
B.C; / D sup B./ D supf
2j B.N j / W EN j jj1 jp C p 2 !pj g
2M
2 p .C 2
!j
; /:
(14.6)
In each case the sum is over j 1. Using the scale invariance of p .; /, Proposition
13.14, and introducing a parameter through 2! D C =, we have
X
B.C; / D 2
2j p .2 !.j / /:
(14.7)
j 0
Hence the Bayes-minimax rule must be separable. Recalling the structure of minimax
rules for p ./ from Section 13.3, we have
I .y/ D .yI =; j /
j D .C =/2
!j
2 ; we have
!.j /
/ D 2 2 P ./;
j 2Z
!.j /
/D
X
v2Z
2v p .2
!v
/:
(14.9)
372
r/, we get
r/ 2r
;
log2 .C =//:
(14.10)
To check convergence of the sum defining P ./, observe that for large negative v, we
have F .v/ D 2v p .2 !v / 2v , while for large positive v, referring to (13.19),
(
2v 2 2!v
with 2! 1 D 2 > 0
if p 2
F .v/ v
p!v 1 p=2
2 2
v
with p! 1 D p. C 1=2/ 1 > 0 if p < 2:
Continuity of P ./ follows from this convergence and the continuity of p ./. This completes the proof of Theorem 14.2.
Remark. How does the location j of the maximum term in Q.C; / depend on ? Suppose
that v is the location of the maximum of the function v ! 2v p .2 !v /. Then the maximum
in Q.C; / occurs at j D v C D v C ! 1 log2 .C =/: Using the calibration D n 1=2
and ! D C 1=2; we can interpret this in terms of equivalent sample sizes as
j D
log2 C
log2 n
C
C v :
1 C 2
C 1=2
(14.11)
The most difficult resolution level for estimation is therefore at about .log2 n/=.1 C 2/.
This is strictly smaller than log2 n for > 0, meaning that so long as the sum (14.8) converges, the primary contributions to the risk B.C; / come from levels below the finest (with
log2 n corresponding to a sample of size n).
Example. When p D 2, explicit solutions are possible because 2 ./ D 2 =.1 C 2 /
O ; 2/ D wx D 2 =.1 C 2 /x: Recall that j D .C =/2 !j D 2 !.j / decreases
and .xI
rapidly with j aboveP
D ! 1 log2 .C =/, so that Oj is essentially 0 for such j .
We have P ./ D j g.j / for
g.v/ D
e av
2v
D
1 C 22!v
1 C e bv
for a D log 2 and b D .2 C 1/ log 2 > a. An easy calculation shows that the maximum of
g occurs at v D log2 .1=.2//=.1 C 2/, compare also (14.11).
Figure 14.1 shows plots of the periodic function P ./ for several values of . For small
, the function P is very close to constant, while for larger it is close to a single sinusoidal
cycle. This may be understood from the Poisson summation formula (C.13). Indeed, since g
is smooth, its Fourier transform g./
O
will decay Rrapidly, and so the primary contribution in
1
the Poisson formula comes from P0; D g.0/
O
D 1 g.t /dt. The integral may be expressed
in terms of the beta function by a change of variables w D .1Ce bt / 1 , yielding b 1 B.c; 1
c/ D b 1 .c/.1 c/ for c D a=b. From Eulers reflection formula .z/.1 z/ D
= sin.z/, and using the normalized sinc function sinc .x/ D sin.x/=.x/, we arrive at
1
P0; D log 2 sinc..2 C 1/ 1 / :
(14.12)
Figure 14.1 shows that P0; provides an adequate summary for 2.
373
P()
1.5
1
0
0.2
0.4
0.6
0.8
2j B.j / Q. ; 1/ D
1
X
2j p . 2
!j
/:
To obtain J , we rely on convergence of the sum established in the proof of Theorem 14.2.
To obtain M and to construct the individual j , we may appeal to Exercise 4.5 as in case
(1) of the proof of Theorem 13.17 for `p balls.
To perform the translation, we focus on a subsequence of noise levels h defined by
C =h D 2!h ; for h 2 N. [Exercise 14.4 discusses other values of ]. The prior h concentrates on the 2J C 1 levels h C j centered at h D ! 1 log2 C =h : Let fj k ; k 2 Ng be an
iid sequence drawn from j : For jj j J , set
hCj;k D h j;k
k D 1; : : : ; 2hCj :
(14.13)
Hence, as ! 0, the near least favorable priors charge (a fixed number 2J.
/ C 1 of) ever
higher frequency bands.
We now verify conditions (4.67) (4.69) for the sequence h , noting that J and M are
374
fixed. Working through the definitions and exploiting the invariances (14.9), we have
B.h / D h2
hCJ
X
2j B.j
2j B.j /
jD J
j Dh J
J
X
2 h
h / D h 2
h2 2h Q. ; 1/
D Q. C; h / B. C; h /:
Recalling the definition of h and that a D C 1=2 1=p D ! 1=p; we have with
probability one under the prior h that
X
2 .C / ,
jhCj;k jp C p 2 a.hCj /p for jj j J;
k
nj h1
nj h
X
jj k jp 2
!jp
for jj j J;
kD1
where nj h D 2j Ch :
Write Xj k D jj k jp Ejj k jp and set tj D .1
p /2
on j , it follows that f .C /g [JjD J j h where
j h D
fnj h1
nj h
X
j!p
Xj k > tj g:
kD1
Since the probability P .j h / ! 0 as h ! 1 by the law of large numbers, for each of a
finite number of indices j , we conclude that h ..C // ! 1.
Finally, to check (4.69), observe first that kh k2 Eh kk2 j 2 ; y and that for
h we have, with probability one,
j Ch
kk D
h2
J
2X
X
jj k j2 M 2 2J C1 C 2.1
r/ 2r
h :
j D J kD1
Consequently,
EfkO k2 C k k2 ; c g 2c.M; J /B.C; h /h .c /
and the right side is o.B.C; h // as required, again because h .c / ! 0.
Finally, from Theorem 14.2 and (14.10), we have
B.
C; h /
2.1
B.C; h /
r/ P . C =h /
P .C =h /
D 2.1
r/ P
.! 1 log2
/
;
P .0/
375
by Theorem 9.5 the linear minimax risk is determined by the quadratic hull of . It follows
from the definitions (Exercise 14.2) that
0
QHull.p / D p0
p 0 D p _ 2; 0 D
1=p C 1=p 0 :
(14.14)
RL .p .C /; / RN .p0 ; /
C 2.1
r 0 / 2r 0
r 0 D 2 0 =.2 0 C 1/:
(14.15)
In particular, when p < 2; we have 0 D .1=p 1=2/; so that the linear rate r 0 is
strictly smaller than the minimax rate r: This property extends to all q 1 (Donoho and
Johnstone, 1998). For example, on the Besov body 11;1 corresponding to the Bump Algebra,
one finds that 0 D 1=2 and so the linear minimax rate is O./, whereas the non-linear rate
is much faster, at O. 4=3 /:
Let us conclude this section with some remarks about the structure of minimax linear
estimators. Since the spaces D p .C / are symmetric with respect to permutation of coordinates within resolution levels, it is intuitively clear that a minimax linear estimator will
have the form O D .Oj;cj /, where for each j , cj 2 0; 1 is a scalar and
Oj;cj D cj yj
(14.16)
EkOj;cj
j k2 :
(14.17)
N where
N D
A formal verification again uses the observation that RL ./ D RL ./
0
N construct N by setting N 2 avek 2 W
QHull./ D p0 as described earlier. Given 2 ;
jk
jk
N also. Formula (4.48) shows that R.. // is a concave
since p 0 2; one verifies that N 2
function of .i2 /; and hence that R..N // R.. //: Consequently, the hardest rectangular
subproblem lies among those hyperrectangles that are symmetric within levels j: Since the
minimax linear estimator for rectangle N has the form Oc.N /;I D NI2 =.NI2 C 2 /yi ; it follows
N has the form (14.16), which establishes (14.17).
that the minimax linear estimator for
376
k2 :
.j /
Over the full range of p, and for a large range of , thresholding is nearly minimax among
all non-linear estimators.
Theorem 14.4 For 0 < p 1 and > .1=p
as ! 0:
Proof The argument is analogous to that for soft thresholding on `p balls in Rn , Theorem
13.18. We bound RS .; / in terms of the Bayes minimax risk B.C; / given by (14.2) and
(14.6), and then appeal to the equivalence theorem RN .; / B.C; /.
Given D .j k /, let j k DPj k =. Let j denote the empirical measure of fj k ; k D
1; : : : ; 2j g, so that j D 2 j k j k . Recalling the definitions of threshold risk r.; /
and Bayes threshold risk B.; / for unit noise level from Chapter 13, we have
X
X
E kO k2 D
2 r.j ; j k / D
2j 2 B.j ; j /:
j
jk
Let j D .C =/2
mp .j /, so that
!j
k2
2j 2 S;p .j /;
since the minimization over thresholds j can be carried out level by level. Now apply Theorem 13.15 to bound
P S;p .j / .p/p .j /, and so bound the right side of the preceding
display by .p/ j 2j p .C 2 !j ; /. Hence, using (14.6)
RS .; / .p/B.C; /:
Our conclusion now follows from Theorem 14.3.
Remark. In principle, one could allow the thresholds to depend on location k as well as
scale j : D .j k /. Along the lines described in Remark 13.19 and Exercise 13.9, one can
define a Bayes minimax threshold risk BS .M; /, show that it is bounded by .p/B.M; /,
and that minimax choices of in fact depend only on j and not on k. Further details are in
Donoho and Johnstone (1998, 5).
Since .p/ 2:22 for p 2; and .1/ 1:6; these results provide some assurance that
threshold estimators achieve nearly optimal minimax performance. The particular choice of
threshold still depends on the parameters .; p; q; C /; however. Special choices of threshold
not depending on prior specifications of these parameters will be discussed in later chapters.
Similar results may be established for hard thresholding.
377
14.8 Notes
14.8 Notes
The results of this chapter are specialized to Besov spaces with q D 1 from Donoho and Johnstone (1998),
which considers both more general Besov spaces and also the Triebel scale. In these more general settings,
the levels j do not decouple in the fashion that led to (14.8). We may obtain similar asymptotic behavior
by using homogeneity properties of the Q.C; / problem with respect to scaling and level shifts.
Remark. Here and in preceding chapters we have introduced various spaces of moment-constrained
probability measures. These are all instances of a single method, as is shown by the following slightly
cumbersome notation. If is a probability measure on `2 .I/; let p ./ denote the sequence of marginal
pth moments
p ./I D .E jI jp /1=p :
I 2 I;
p 2 .0; 1:
mp ./ D Mp . ; /:
Mn D Mp .n;p .C //,
M.C / D M2 ..C //;
Mp .C / D Mp .p .C //:
Exercises
14.1 (Compactness criteria.) (a) Show, using the total boundedness criterion C.16, that p .C / is
`2 -compact when > .1=p 1=2/C .
(b) Show, using the tightness criterion given in C.18 that Mp .C / is compact in the topology
of weak convergence of probability measures on P .`2 / when > .1=p 1=2/C .
14.2 (Quadratic hull of Besov bodies.) Verify (14.14). [Hint: begin with finding the convex hull of
P P
sets of the form f.j k / W j . k jj k j /= 1g ]
14.3 (Threshold minimax theorem.) Formulate and prove a version of the threshold minimax theorem
13.13 in the Bayes minimax setting of this chapter.
14.4 Completing proof of asymptotic efficiency.
(a) Show that a function g on 0; 1 satisfies g.t / ! 1 as t ! 0 if and only if for some
2 .0; 1/, we have g.b j / ! 1 as j 2 N ! 1, uniformly in b 2 ; 1.
(b) Fix b 2 2 !;1 and for h 2 N define h by C =h D b2!h . Modify the argument of Section
14.5 to show that RN .; h / B.C; h /. [Hint: replace Q.1; 1/ by Q.b; 1/ in the argument.]
15
Continuous v. Sampled Data
Our theory has been developed so far almost exclusively in the Gaussian sequence model
(3.1). In this chapter, we indicate some implications of the theory for models that are more
explicitly associated with function estimation. We first consider the continuous white noise
model
Z t
Y .t/ D
f .s/ds C W .t /;
t 2 0; 1;
(15.1)
0
l D 1; : : : n;
(15.2)
fO f 2F
the expectation on the left-hand side being with respect to white noise observations Y in
(15.1) and on the right hand-side being with respect to yQ in (15.2). However, the general
equivalence result fails for 1=2 and we wish to establish results for the global estimation
problem for the unbounded loss function kfO f k2 that are valid also for Besov (and Triebel)
classes satisfying > 1=p; where p might be arbitrarily large.
In addition our development will address directly the common and valid complaint that
theory is often developed for theoretical wavelet coefficients in model (15.1) while computer algorithms work with empirical wavelet coefficients derived from the sampled data
model (15.2). We compare explicitly the sampling operators corresponding to pointwise
evaluation and integration against a localized scaling function. The approach taken in this
chapter is based on Donoho and Johnstone (1999) and Johnstone and Silverman (2004b).
379
are i.i.d standard Gaussian variables and that the noise level is known. For convenience,
suppose throughout that n D 2J for some integer J:
We have studied at length the white noise model (15.1) which after conversion to wavelet
coefficients yI D hd Y ; I i; I D hf; I i; zI D hd W; I i takes the sequence model form
yI D I C zI ;
I D .j; k/; j 0; k D 1; : : : ; 2j :
(15.4)
This leads to a possibly troubling dichotomy. Much of the theory developed to study
wavelet methods is carried out using functions of a continous variable, uses the multiresolution analysis and smoothness classes of functions on R or 0; 1; and the sequence model
(15.4). Almost inevitably, most actual data processing is carried out on discrete, sampled
data, which in simple cases might be modeled by (15.2).
There is therefore a need to make a connection between the continuous and sampled
models, and to show that, under appropriate conditions, that conclusions in one model are
valid for the other and vice versa. For this purpose, we compare minimax risks for estimation
of f based on sequence data y from (15.4) with that based on sampled data yQ from (15.2).
Hence, set
R.F ; / D inf sup EkfO.y/
f k22 ;
f k22 :
fO .y/ f 2F
fO .y/
Q f 2F
(15.5)
Note that the error of estimation is measured in both cases in the norm of L2 0; 1. The
parameter space F is defined through the wavelet coefficients corresponding to f , as at
(9.48):
F D ff W f 2 p;q .C /g:
Remark. One might also be interested in the error measured in the discrete norm
X
n 1 kfO f k2n D .1=n/
fO.tl / f .tl /2 :
(15.6)
R1
Section 15.5 shows that this norm is equivalent to 0 .fO f /2 under our present assumptions.
Assumption (A) on the wavelet. In this chapter the choice of ; p and q is fixed at the
n ! 1:
(15.7)
380
In words, there is no estimator giving a worst-case performance in the sampled-dataproblem (15.2) which is substantially better than what we can get for the worst-case performance of procedures in the white-noise-problem (15.4).
For upper bounds, we will specialize to estimators derived by applying certain coordinatewise mappings to the noisy wavelet coefficients.
For the white noise model, this means the estimate is of the form
X
fO D
.yI / I
I
where each function I .y/ either belongs to one of three specific families Linear, Soft
Thresholding, or Hard Thresholding or else is a general scalar function of a scalar argument. The families are:
(EL )
(ES )
(EH )
(EN )
(15.8)
where yI.n/ is an empirical wavelet coefficient based on the sampled data .yQi /, see Section 15.4 below, and the I belong to one of the families E . Then define the E -minimax risks
in the two problems:
RE .F ; / D inf sup EY kfO
f k2L2 0;1
(15.9)
f k2L2 0;1 :
(15.10)
fO 2E f 2F
and
fO 2E f 2F
n ! 1:
(15.11)
Our approach is to make an explicit construction transforming a sampled-data problem into a quasi-white-noise problem in which estimates from the white noise model can
be employed. We then show that these estimates on the quasi-white-noise-model data behave nearly as well as on the truly-white-noise-model data. The observations in the quasiwhite-noise problem have constant variance, but may be correlated. The restriction to coordinatewise estimators means that the correlation structure plays no role.
Furthermore, we saw in the last chapter in Theorems 14.1 14.3 that co-ordinatewise
non-linear rules were asymptotically minimax: R.F ; n / REN .F n / for the q D 1 cases
considered there, and the same conclusion holds more generally for p q (Donoho and
Johnstone, 1998).
381
Remark. The assumptions on .; p; q/ in Theorems 15.1 and 15.2 are needed for the
bounds to be described in Section 15.4. Informally, they correspond to a requirement that
point evaluation f ! f .t0 / is well defined and continuous, as is needed for model (15.2) to
zi D hJ i ; d W i;
i D 1; : : : m:
i D 1; : : : m:
(15.12)
Write y m for the projected data y1 ; : : : ; ym : When D n 1=2 ; the choice J D log2 n
yields an n-dimensional model which is an approximation to (15.2), in a sense to be explored
below.
The projected white noise model can be expressed in terms of wavelet coefficients. Indeed, since VJ D j <J Wj , it is equivalent to the 2J -dimensional submodel of the sequence
model given by
I 2 IJ ;
yI D I C zI ;
(15.13)
WT
.yI /
?
?
y
.OI .yI //
(15.14)
382
which is the same as (7.22), except that this diagram refers to observations on inner products,
(15.12), whereas the earlier diagram used observations from the sampling model (7.21), here
written in the form (15.2).
Consider now the minimax risk of estimation of f 2 F using data from the projected
model (15.12). Because of the Parseval relation (1.25), we may work in the sequence model
and wavelet coefficient domain.
Suppose, as would be natural in the projectedP
model, that O is an estimator
P which has
2
J
non-zero co-ordinates only in I : Set k k2;m D I 2I J I2 and kk22;m? D I I J I2 The
following decomposition emphasises the tail bias term that results from estimating only
up to level J W
kO k2 D kO k2 C k k2 ? :
(15.15)
2;m
2;m
k22;m :
(15.16)
m .C / D f 2 Rm W k kbp;q
C g:
sup
O 2m .C /
EkO .y m /
k22;m :
We look for a condition on the dimension m D 2J so that the minimax risk in the projected model is asymptotically equivalent to (i.e. not easier than ) the full model. For this it
is helpful to recall, in current notation, a bound on the maximum tail bias over smoothness
classes p;q that was established at (9.59).
Lemma 15.3 Let 0 D
.1=p
m ./ D sup kf
Fp;q
.C /
2J 0
p;q .C /
Suppose J D J./ D
log2 2 ; one verifies that if
> .1=.2 C 1//.= 0 /; then the
tail bias term becomes neglibible relative to the order . 2 /2=.2C1/ of the minimax risk.
It will be helpful to use the minimax Theorem 4.12 to re-express
:
RN .m .C /I / D sup B.I / D B.C; /;
(15.17)
m .C /
where, as usual, B.; / denotes the Bayes risk in (15.13) and (15.16) when the prior .d /
on m .C / is used.
We can now establish the equivalence of the projected white noise model with the full
model.
Proposition 15.4 Let J./ D
log2
RN .m .C /; / RN ..C /; /
! 0:
383
Proof An arbitrary estimator O .y m / in the projected model can be extended to an estimator in the full sequence model by appending zeroslet E m denote the class so obtained.
From (15.15) we obtain
inf sup EkO
k2 RN .m .C /; / C m ./:
O 2E m .C /
The left side exceeds RN ..C /; / while Lemma 15.3 shows that m ./ D o.RN ..C /; //,
so we conclude that the projected model is asymptotically no easier.
In the reverse direction, suppose that O is the Bayes estimator for the least favorable
prior D for RN .; /: Define m D m; as the marginal distribution of the first m
coordinates of under W clearly the corresponding Bayes estimate Om D Em .jy/ D
E.Pm jy/ depends only on y .m/ and so is feasible in the projected problem. Since both
and m are supported on ; O Om D E . Pm jy/ and so by Jensens inequality and
Lemma 15.3,
kO
Om k2 E.k
Pm k2 jy/ KC 2
J 0
D o. r /
by the choice of J./ specified in the hypotheses. Using E.X C Y /2 .EX 2 /1=2 C
.EY 2 /1=2 2 , we have then, for all 2 ;
EkOm
k2 .fE kO
and hence
RN .m .C /; / sup EkOm
k22;m :
(15.18)
384
Since f D
jI j<J0
I
I,
(15.19)
(15.20)
I
I .tl /:
We regard this as a map from .Rm ; k k2;m / to .Rn ; k kn /, where k kn is the time domain
norm (15.6). It is not a (partial) isometry since the vectors . I .tl / W l D 1; : : : ; n/ are
not orthogonal in the discrete norm. However, it comes close; at the end of the section we
establish
Lemma 15.5 Under assumption (A) on the wavelet system .
for m D 2J0 < n, then
I /,
if T is defined by (15.20)
p
The minimax risk in the sampling problem is, setting n D = n,
RQ N .m .C /I n / D inf
sup
O .y/
Q 2m .C /
sup
EkO .y/
Q
k22;m
: Q
Q
B.I
n / D B.C;
n /;
(15.21)
m .C /
Q
where we have again used the minimax theorem and now B.;
n / denotes the Bayes risk
m
in (15.19) and (15.18) when the prior .d / on .C / is used.
As remarked earlier, estimation of Pm f is easier than estimation of f , and so from (15.5)
and (15.15) we have
Q F ; n/ RQ N .m .C /I n /
R.
With all this notational preparation, we have recast the sampling is not easier theorem
as the statement
Q
B.C;
n / B.C; n /.1 C o.1//:
(15.22)
Pushing the sequence model observations (at noise level n ) through T generates some
heteroscedasticity which may be bounded using Lemma 15.5. To see this, we introduce el , a
p
p
vector of zeros except for n in the lth slot, so that kel kn D 1 and .T y/l D nhel ; T yin :
Then
Var.T y/l D nn2 Ehel ; T zi2n D 2 kT t el k22;m 2 2n
where 2n D max .T T t / D max .T t T / is bounded in the Lemma. Now let wQ be a zero mean
Gaussian vector, independent of y, with covariance chosen so that Var.T y C w/
Q D 2n 2 In .
D
By construction, then, T y C wQ D yQ D T C n zQ :
To implement the basic idea of the proof, let be a least favorable prior in the sequence
problem (15.17) so that B.; n / D B.C; n /. Let Q;n .y/
Q denote the Bayes estimator of
in the sampling model (15.19) and (15.18) with noise level n .
We construct a randomized estimator in the sequence model using the auxiliary variable
w:
Q
D
O w/
.y;
Q D Q; .T y C w/
Q D Q; .y/
Q
n
385
where the equality in distribution holds for the laws of T y C wQ and yQ given . Consequently
O w/
O I n / D E EI k.y;
B.;
Q
n
Q
k22;m D B.I
n n /:
Use of randomized rules (with a convex loss function) does not change the Bayes risk
B./see e.g. (A.13) in Appendix Aand so
Q
Q I n n /;
B.C; n / D B.I n / B.O ; I n / D B.I
n n / B.C
where the last inequality uses (15.21). Appealing to the scaling bounds for Bayes-minimax
risks (e.g. Lemma 4.28 and Exercise 4.7) we conclude that
(
2 Q
2 Q
Q I / B.C =I / B.C I / if > 1
B.C
Q I /
B.C
if 1:
In summary, using again Lemma 15.5,
Q
Q
B.C; n / .2n _ 1/B.C;
n / B.C;
n /.1 C o.1//:
This completes the proof of (15.22), and hence of Theorem 15.1.
Proof of Lemma 15.5
I J0 / is given by
I;
I 0 in
Dn
I .tl /
I;I
I 0 .tl /:
Exercise 15.1 gives bounds on the distance of these inner products from exact orthogonality:
jh
I;
I 0 in
II 0 j cn 1 2.j Cj
/=2
.I; I 0 /;
(15.23)
j=2
, hence
j0
I0
2
j=2
.1 C cn
2j _j /
j0
where in the first line we used (15.23) and bounded k 0 .I; I 0 /, the number of j 0 k 0 whose
0
supports hits that of I , by c2.j j /C . Now j J0 and the sum is over j 0 J0 and hence
P
SI 2
j=2
.1 C cn 1 J0 2J0 /
386
1
C < 1;
2 C 1 0
(15.24)
J0n D log2 n D Jn
(15.25)
Specifically, we prove
Theorem 15.6 Suppose that > 1=p; 1 p; q 1 and that .; / satisfy Assumption
A. Let E be any one of the four coordinatewise estimator classes of Section 15.1, and let mn
be chosen according to (15.24) and (15.25). Then as n ! 1;
RQ E .m .C /; n / RE .m .C /; n /.1 C o.1//:
We outline the argument, referring to the literature for full details. A couple of approaches
have been used; in each the strategy is to begin with the sampled data model (15.2) and construct from .yQl / a related set of wavelet coefficients .yQI / which satisfy a (possibly correlated)
sequence model
yQI D QI C .n/ zQ I :
(15.26)
O
We then take an estimator .y/
known to be good in the (projected) white noise model and
apply it with the sample data wavelet coefficients yQ D .yQI / in place of y. The aim then is to
show that the performance of O .y/
Q for appropriate and noise level .n/ is nearly as good
O
as that for .y/ at original noise level n .
(i) Deslauriers-Dubuc interpolation. Define a fundamental function Q satisfying the interQ
polation property .l/
D l;0 and other conditions, and then corresponding scaling functions
Q
Q
l .t/ D .nt l/; l D 1; : : : ; n: Interpolate the sampled function and data values by
PQn f .t/ D
n
X
lD1
f .l=n/Q l .t /;
yQ .n/ .t / D
n
X
yQl Q l .t /:
(15.27)
lD1
387
the advantage that, in decomposition (15.26), the interior noise variates zQ I are an orthogonal
transformation of the the original noise zQl and hence are independent with .n/ D n . The
boundary noise variates zQI may be correlated, but there are at most cJ0 of these, with uniformly bounded variances Var zQI cn2 : So in the coiflet case, we could actually take E to
be the class of all estimators (scalar or not).
We will restrict attention to estimators vanishing for levels j J0n ; where 2J0n D m D
mn is specified in (15.25). In view of the unbiasedness of yQ for Q in (15.2), it is natural to
decompose the error of estimation of in terms of Q :
O y/
k.
Q
Q 2;m C kQ
k
k2;m :
(15.28)
Concerning the second term on the right side, in either the Deslauriers-Dubuc or coiflet
settings, one verifies that
0
(15.29)
sup kQ k22;m cC 2 2 2J ;
.C /
(15.30)
Q 2
k
2;m
Q 2
k
2;m
(15.31)
For the proof, see Donoho and Johnstone (1999). Combining (15.28), (15.29) and (15.31),
we obtain
sup EkO .y/
Q
2.C /
388
let PQn fPbe given in the Deslauriers-Dubuc case by (15.27) and in the Coiflet case by PQn f D
n 1=2 f .tl /J l . Then the arguments referred to following (15.29) also show that
sup kPQn f
f k2 cC 2 n
2 0
D o.n r /:
(15.32)
F .C /
(and similarly for R:) We describe this in the Coiflet case, but a similar result would be
possible in the Deslauriers-Dubuc setting.
Given a continuous function f 2 L2 0; 1, we may consider two notions of sampling
operator:
p
.S f /l D nhf; J l i;
.S f /l D f .tl /:
Let Pn denote projection onto VJ D span fJ l g and PQn the interpolation operator, so that
X
X
Pn f D
hf; J l iJ l ;
and
PQn g D
n 1=2 g.tl /J l :
l
S f kn D kPn fO
PQn f k2 :
(15.34)
First suppose that fQ D .fQ.tl // is a good estimator for `2;n loss. Construct the interpolaP
tion fO.t/ D n 1=2 n1 fQ.tl /J l .t /: From the decomposition
fO
and the identity kfO
f D fO
PQn f k2 D kfQ
kfO
PQn f C PQn f
f k2 kfQ
f kn C o.n
r=2
so that fO has essentially as good performance for L2 loss as does fQ for loss `2;n :
Now suppose on the other hand that fO.t / is a good estimator for L2 loss. Construct a
discrete estimator fQ using scaling function coefficients fQ.tl / D .S fO/l : From the identity
(15.34) and the decomposition
Pn fO
PQn f D Pn .fO
f / C Pn f
f Cf
PQn f
we obtain first using (15.34), and then exploiting projection Pn , Lemma 15.3 and (15.32),
that
n 1=2 kfQ S f kn kfO f k2 C o.n r=2 /:
389
Exercises
Exercises
15.1 Show that
j
I 0 .s/
I 0 .t /j
c2.j Cj
/=2 j _j 0
Z ti C1
1
n f .tl /
f 12 n
tl
js
t j;
kf kL
is replaced by n
2 2j _j 0 .
16
Epilogue
Compressed sensing
sparse non-orthogonal linear models
covariance matrix estimation
related non-Gaussian results
390
Appendix A
Appendix: The Minimax Theorem
The aim of this appendix is to give some justification for the minimax Theorem 4.12, restated
below as Theorem A.5. Such statistical minimax theorems are a staple of statistical decision
theory as initiated by Abraham Wald, who built upon the foundation of the two person zerosum game theory of von Neumann and Morgenstern (1944). It is, however, difficult to find in
the published literature a statement of a statistical minimax theorem which is readily seen to
cover the situation of our nonparametric result Theorem A.5. In addition, published versions
(e.g. Le Cam (1986, Theorem 2.1)) often do not pause to indicate the connections with game
theoretic origins.
This appendix gives a brief account of von Neumanns theorem and one of its infinitedimensional extensions (Kneser, 1952) which aptly indicates what compactness and continuity conditions are needed. Following Brown (1978), we then attempt an account of how
statistical minimax theorems are derived, orienting the discussion towards the Gaussian sequence model. While the story does not in fact use much of the special structure of the
sequence model, the Gaussian assumption is used at one point to assure the separability of
L1 .
In later sections, a number of concepts and results from point set topology and functional
analysis are needed, which for reasons of space we do not fully recall here. They may of
course be found in standard texts such as Dugundji (1966) and Rudin (1973).
(A.1)
When equality holds in (A.1), the game is said to have a value. This occurs, for example,
391
392
for all i; j:
However, saddlepoints do not exist in general, as is demonstrated already by the matrix 10 01
The situation is rescued by allowing mixed or randomized strategies, which are probability
n
distributions x D .x.i//m
1 and y D ..y.j //1 on the space of nonrandomized rules for each
player. If the players use the mixed strategies x and y, then the expected payoff from I to
II is given by
X
f .x; y/ D x T Ay D
x.i /A.i; j /y.j /:
(A.2)
i;j
P
Write Sm for the simplex of probability vectors fx 2 Rn W xi 0; xi D 1g. The classical
minimax theorem of von Neumann states that for an arbitrary m n matrix A in (A.2),
min max f .x; y/ D max min f .x; y/:
x2Sm y2Sn
y2Sn x2Sm
(A.3)
For the payoff matrix 10 01 , it is easily verified that the fair coin tossing strategies x D y D
. 21 21 / yield a saddlepoint.
We establish below a more general result that implies (A.3).
x2K y2L
y2L x2K
(A.4)
A notable aspect of this extension of the von Neumann theorem is that there are no compactness conditions on L, nor continuity conditions on y ! f .x; y/ W the topological conditions are confined to the x-slot.
393
Note that if x ! f .x; y/ is lower semi-continuous for all y 2 L, then x ! supy2L f .x; y/
is also lower semi-continuous and so the infimum on the left side of (A.4) is attained for
some x0 2 K.
Here is an example where f is not continuous, and only the semicontinuity condition of
the theorem holds. Let R1 denote the space of sequences: a countable product of R with the
product topology: x .n/ !Px iff for each coordinate i, xi.n/ ! xi . Then the infinite simplex
K D fx 2 R1 W xi 0; iP
xi 1g is compact. Consider a simple extension of the payoff
function (A.2), f .x; y/ D
xi yi for y 2 L D fy W 0 yi C for all ig: Equality
(A.4) can easily be checked directly. However, the function x ! f .x; 1/ is not continuous:
the sequence x .n/ D .1=n; : : : ; 1=n; 0; 0; : : : :/ converges to 0 but f .x .n/ ; 1/ 1. However,
f .x; y/ is lsc in x, as is easily verified.
Knesers proof nicely brings out the role of compactness and semicontinuity, so we present
it here through a couple of lemmas.
Lemma A.2 Let f1 ; : : : fn be convex, lsc real functions on a compact convex set K. Suppose for each x 2 K that maxi fi .x/ > 0: Then there exists a convex combination that is
positive on K: for some 2 Sn ;
n
X
i fi .x/ > 0
for all x 2 K:
g>0
on N;
f >0
f
f
.p/ D 0
g
g
g
g
.q/ D 0
f
f
f C g 0
f C g 0:
(A.5)
g.q/
N
D f
N .q/ D
f .p/ D g.p/:
).
394
Since > 0 and g.p/ > 0, we conclude that < 1, and so in (A.5), we may increase
up to
and up to in such a way that
D 1;
f C g > 0
and
f C g > 0
(A.6)
1
D
;
1C
1C
(A.7)
Replacing f by f c does not harm any of the hypotheses, so we may assume that c D
0: The left inequality in (A.7) implies that Alternative II in the previous lemma fails, so
Alternative I holds, and so infx supy f 0; in contradiction with the right hand inequality
of (A.7)! Hence there must be equality in (A.7).
The following corollary is a trivial restatement of Theorem A.1 for the case when compactness and semicontinuity is known for the variable which is being maximised.
Corollary A.4 Let K; L be convex subsets of real vector spaces and f W K L ! R
be convex in x for each y 2 L; and concave in y for each x 2 K: Suppose also that L is
compact and that y ! f .x; y/ is upper semicontinuous for each x 2 K. Then
inf sup f .x; y/ D sup inf f .x; y/:
x2K y2L
Proof
y2L x2K
f .x; y/:
(A.8)
395
(A.9)
2P O
Our applications of this theorem will typically be to loss functions of the form L.a; / D
w.ka kp /; with w./ a continuous, increasing function. It is easy to verify that such loss
functions are lsc in a in the topology of pointwise convergence. Indeed, if ai.n/ ! ai.1/ for
each i, then for each fixed m; one has
m
X
iD1
jai.1/
i j D lim
n
m
X
iD1
jai.n/
kpp :
Theorem A.5 and other statistical minimax theorems, while closely related to Theorem
A.1, as will be seen below, do not seem to follow directly from it, using instead separating
hyperplane results (compare Lemma A.2).
A general framework for statistical decision theory, including minimax and complete class
results, has been developed by its chief exponents, including A. Wald, L. LeCam, C. Stein,
and L. Brown, in published and unpublished works. A selection of references includes Wald
(1950); LeCam (1955); Le Cam (1986); Diaconis and Stein (1983); Brown (1977, 1978).
The theory is general enough to handle abstract sample spaces and unbounded loss functions, but it is difficult to find a statement that immediately covers our Theorem A.5. We
therefore give a summary description of the steps in the argument for Theorem A.5, using
freely the version of the Wald-LeCam-Brown approach set out in Brown (1978). The theory
of Brown (1978) was developed specifically to handle both parametric and nonparametric
settings, but few nonparametric examples were then discussed explicitly. Proofs of results
396
given there will be omitted, but we hope that this outline nevertheless has some pedagogic
value in stepping through the general method in the concrete setting of the nonparametric
Gaussian sequence model.
Remark. There is a special case (which includes the setting of a bounded normal mean,
Section 4.6), in which our statistical minimax theorem can be derived directly from the
Kneser-Kuhn theorem. Indeed, if Rn is compact, and P D P ./, then P is compact for weak convergence of probability measures. Let K be the class of estimators O
O
Rwith finite risk functions on , let L D P and for the payoff function f take B. ; / D
O
r.; /.d/.
Observe that K is convex because a ! L.a; / is; that L is convex and compact; and
that B is convex-linear. Finally ! B.O ; / is continuous since in the Gaussian model
yi D i C i zi , the risk functions ! r.O ; / are continuous and bounded on the compact
set . Hence the Kneser-Kuhn Corollary (A.8) applies to provide the minimax result.
2D 2P
2P 2D
(A.11)
Knesers theorem suggests that we need a topology on decision rules with two key properties:
(P1) D is compact, and
(P2) the risk functions ! B.; / are lower semicontinuous.
Before describing how this is done, we explain how (A.9) follows from (A.11) using the
convexity assumption on the loss function. Indeed, given a randomized
R rule ; the standard
method is to construct a non-randomized rule by averaging: O .x/ D a.dajx/: Convexity
of a ! L.a; / and Jensens inequality then imply that
Z
O
L. .x/; / L.a; /.dajx/:
397
for all 2 :
(A.12)
Consequently, with convex loss functions, there is no reason ever to use a randomized decision rule, since there is always a better non-randomized one. In particular, integrating with
respect to an arbitrary yields
sup B.O ; / sup B.; /:
(A.13)
O
D sup inf B.; / sup inf B.O ; / inf sup B.O ; /;
O
O
and since the first and last terms are the same, all terms are equal.
8.g; c/ 2 L1 C:
(A.14)
398
This topology also satisfies our second requirement: that the maps ! B.; / be lsc.
Indeed, since A is second countable, the lsc loss functions can be approximated by an increasing sequence of continuous functions ci 2 C : L .a/ D limi ci .a/. This implies that
r.; / D supfb .f ; c/ W c L g:
c
The definition (A.14) says that the maps ! b .f ; c/ are each continuous, and so !
r.; / appears as the upper envelope of a family ofRcontinuous functions, and is hence lsc.
Finally Fatous lemma implies that ! B.; / D r.; /.d / is lsc.
A separation theorem
We have now established B.; / as a bilinear function on D P which for each fixed is
lsc on the compact D. What prevents us from applying Knesers minimax theorem directly
is that B.; / can be infinite. The strategy used by Brown (1978) for handling this difficulty
is to prove a separation theorem for extended real valued functions, and derive from this the
minimax result.
Slightly modified for our context, this approach works as follows. Let T D T .P ; 0; 1/
denote the collection of all functions b W P ! 0; 1 with the product topology, this space
is compact by Tychonoffs theorem. Now define an upper envelope of the risk functions by
setting D .D/ and then defining
Q D fb 2 T W there exists b 0 2 with b 0 bg:
Brown uses the D topology constructed above, along with the compactness and lower semicontinuity properties [P1] and [P2] to show that Q is closed and hence compact in T:
Using the separating hyperplane theorem for Euclidean spaces a consequence of Lemma
A.2 Brown shows
Q Then there
Theorem A.6 Suppose that Q is convex and closed in T and that b0 2 T n:
m
m
exists c > 0; a finite
P set .i /1 P and a probability vector .i /1 such that the convex
combination D i i 2 P satisfies
Q
for all b 2 :
(A.15)
It is now easy to derive the minimax conclusion (A.11). Indeed, write V D inf supP B.; /:
Q Convexity of D entails conIf V < 1; let > 0 and choose b0 V clearly b0 :
Q which is also closed in T as we saw earlier. Hence, the separation theorem
vexity of ;
produces 2 P such that
V
In other words, sup inf B.; / > V for each > 0, and hence it must equal V: If
V D 1; a similar argument using b0 m for each finite m also yields (A.11).
399
(A.16)
and hence
inf sup B.; / D sup inf B.; / D sup BS ./:
Proof The right side of (A.16) follows from the definition of .0 /. For the left side, given
an arbitrary 1 2 P , define t D .1 t /0 C t 1 for t 2 0; 1: by convexity, t 2 P . Let
t D .t / be the best threshold for t , so that B.t / D B.t ; t /. Heuristically, since 0 is
least favorable, we have .d=dt/B.t /jtD0 0, and we want to compute partial derivatives
of B.t ; t / and then exploit linearity in .
More formally, for t > 0 we have
B.t ; t /
B.0 ; 0 / D B.t ; 0 /
B.0 ; 0 / C B.0 ; t /
B.0 ; 0 / C 2 B;
B.t ; 0 /
B.0 ; t / C B.0 ; 0 /:
B.0 ; 0 / D t B.0 ; 1 /
B.0 ; 0 / C 2 B=t:
B.0 ; 1
B.t ; 0 /
B.0 ; 0 / ! 0
400
(1958). However, the general minimax theorems do not exhibit a saddlepoint, which emerges
directly from the present more specialized approach.
Exercise. Complete the induction step for the proof of Lemma A.2.
Appendix B
More on Wavelets and Function Spaces
That (iv) is equivalent to (iv) follows from the orthonormalization trick discussed below.
A key role in constructions and interpretations is played by the frequency domain and
the Fourier transform (C.9). The Plancherel identity (C.11) leads to a frequency domain
characterization of the orthnormality and Riesz basis conditions (iv) and (iv):
Lemma B.1 Suppose ' 2 L2 . The set f'.x k/; k 2 Zg is (i) orthonormal if and only if
X
jb
' . C 2k/j2 D 1
a.e.;
(B.1)
k
and (ii) a Riesz basis if and only if there exist positive constants C1 ; C2 such that
X
C1
jb
' . C 2k/j2 C2
a.e.
(B.2)
Partial Proof. We give the easy proof of (B.1) since it gives a hint of the role of frequency domain methods. The Fourier transform of x ! '.x n/ is e i nb
'./: Thus, orthonormality combined with the
Plancherel identity gives
Z 1
Z 1
1
e i n jb
'./j2 d :
0n D
'.x/'.x n/dx D
2
1
1
1
This use of the symbol, local to this appendix, should not be confused with the notation for wavelet
coefficients in the main text.
401
402
Now partition R into segments of length 2, add the integrals, and exploit periodicity of e i n to rewrite the
right hand side as
Z 2
X
1
e i n
jb
'. C 2k/j2 d D 0n :
2 0
k
The function in (B.1) has as Fourier coefficients the delta sequence 20n and so equals 1 a.e. For part (ii),
see e.g. [M, Theorem 3.4].
Then ' is a scaling function for the MRA, and so for all j 2 Z, f'j k W k 2 Zg is an
orthonormal basis for Vj :
Example. Box spline MRA. (See also Chapter 7.1.) Given r 2 N; set D I0;1 and D
r D ? ? D ?.rC1/ : Without any loss of generality, we may shift r D ?.rC1/ by
an integer so that the center of the support is at 0 if r is odd, and at 1=2 if r is even. Then it
can be shown (Meyer, 1990, p61), [M, Sec. 7.1] that
(
rC1
sin
=2
1 r even
i
=2
b
e
D
;
r ./ D
=2
0 r odd
X
jb
r . C 2k/j2 D P2r .cos =2/;
k
where P2r is a polynomial of degree 2r: For example, in the piecewise linear case r D 1;
P2 .v/ D .1=3/.1 C 2v 2 /: Using (B.2), this establishes the Riesz basis condition (iv) for this
MRA. Thus (B.3) gives an explicit Fourier domain expression for ' which is amenable to
numerical calculation. [M, pp 266-268] gives corresponding formulas and pictures for cubic
splines.
Remark. We have seen that a key point is the existence of multiresolution approximations
by Vj with better smoothness and approximation properties than those of the Haar multiresolution. Following Meyer (1990, p21), we say that a multiresolution analysis fVj g of L2 .R/
is r regular if D k .x/ is rapidly decreasing for 0 k r 2 N: [ A function f on R is
rapidly decreasing if for all m 2 N, then jf .x/j Cm .1 C jxj/ m : ]
Meyer (1990, p38) shows that a wavelet deriving from an r-regular multresolution analysis necessarily has r C 1 vanishing moments.
(b) Using finite filters. The MRA conditions imply important structural constraints on
b
h./: using (B.1) and (7.2) it can be shown that
Lemma B.3 If ' is an integrable scaling function for an MRA, then
.CMF/
.NORM/
jb
h./j2 C jb
h. C /j2 D 2
p
b
h.0/ D 2:
8 2 R
(B.4)
(B.5)
403
(B.4) is called the conjugate mirror filter (CMF) condition, while (B.5) is a normalization
requirement. Conditions (B.5) and (B.4) respectively imply constraints on the discrete filters:
X
X
p
h2k D 1:
hk D 2;
They are the starting point for a unified construction of many of the important wavelet families (Daubechies variants, Meyer, etc.) that begins with the filter fhkg, or equivalently b
h./:
Here is a key result in this construction.
Theorem B.4 If b
h./ is 2 periodic, C 1 near D 0 and (a) satisfies (B.4) and (B.5),
and (b) inf =2;=2 jb
h./j > 0; then
b
' ./ D
1 b
Y
h.2 l /
p
2
lD1
(B.6)
(B.7)
b
b
g ./h ./ C b
g . C /h . C / D 0:
(B.8)
i b
h . C /;
(B.9)
and use (B.4). To understand this in the time domain, note that if b
s./ has (real) coeffi
cients sk , then conjugation corresponds to time reversal: b
s ./ $ s k , while modulation
corresponds to time shift: e i b
s./ $ skC1 ; and the frequency shift by goes over to time
domain modulation: b
s. C / $ . 1/k sk : To summarize, interpreting (B.9) in terms of
filter coefficients, one obtains the mirror relation
gk D . 1/1
h1
k:
(B.10)
Together, (7.4) and (B.9) provide a frequency domain recipe for constructing a candidate
wavelet from ':
b.2/ D e i b
h . C /b
' ./:
(B.11)
Of course, there is still work to do to show that this does the job:
Theorem B.6 If g is defined by (B.9), and
orthonormal basis for L2 .R/:
by (7.4), then f
j k ; .j; k/
2 Z2 g is an
404
Vanishing moments. The condition that have r vanishing moments has equivalent formulations in terms of the Fourier transform of and the filter h.
Lemma B.7 Let
equivalent:
tj
D 0;
j D 0; : : : ; p
1:
(ii)
D j b.0/ D 0;
j D 0; : : : ; p
1:
(iii)
Djb
h./ D 0
(i)
j D 0; : : : ; p
1:
.VMp /
(B.13)
405
2R
2T
2T
{2
2 2 R
^ ()j2
jh
^ (+)j2
j^g()j2 =jh
^ (2)j2 =jg^()'()j2=2
j
See for example Mallat (1999, Theorem 7.4) or Hardle et al. (1998, Theorem 8.3).
Example. Daubechies wavelets. Here is a brief sketch, with a probabilistic twist, of some
of the steps in Daubechies construction of orthonormal wavelets of compact support. Of
course, there is no substitute for reading the original accounts (see Daubechies (1988),
Daubechies (1992, Ch. 6), and for example the descriptions by Mallat (2009, Ch. 7) and
Meyer (1990, Vol I, Ch. 3)).
1
The approach is to build a filter h D fhk gN
with hk 2 R and transfer function b
h./ D
0
PN 1
i k
satisfying the conditions of Theorem B.4 and then derive the conjugate filter
kD0 hk e
g and the wavelet from (B.9), (B.11) and Theorem B.6. The vanishing moment condition
of order p (VMp ) implies that b
h./ may be written
1 C e
b
h./ D
2
i
p
r./;
r./ D
m
X
rk e
i k
Pm
Pm
ik
is
if r./ D 0 rk e ik ; with rk 2 R; then jr./j2 D r./r ./ D r./r. / D
m sk e
both real and even, so s k D sk and hence it is a polynomial of degree m in cos D 1 2 sin2 .=2/: In
addition, j.1 C e i /=2j2 D cos2 .=2/:
406
y/p P .y/ C y p P .1
y/ D 1
0 y 1:
(B.14)
To have the support length N as small as possible, one seeks solutions of (B.14) of minimal
degree m: One solution can be described probabilistically in terms of repeated independent
tosses of a coin with Pr.Heads / D y: Either p tails occur before p heads or vice versa, so
P .y/ WD Prfp T s occur before p H s g=.1
!
p 1
X pCk 1
D
yk
k
y/p
kD0
More on vanishing moments. We now give the proof that vanishing moments of
imply rapid decay of wavelet coefficients, and look at analogs for scaling functions ' and
the interval 0; 1.
Proof of Lemma 7.3 We first recall that Holder functions can be uniformly approximated
by (Taylor) polynomials, cf. (C.25). So, let p.y/ be the approximating Taylor polynomial
3
407
j k ij
2
j=2
C2
jvj j .v/jdv:
j=2
f .k2
/j c C 2
j.C1=2/
Proof Modify the proof of Lemma 7.3 by writing the approximating polynomial at xk D
k2 j in the form p.y/
R D f .xk / C p1 .y/ where p1 is also of degree r 1, but with no
constant term, so that p1 ' D 0: Then
Z
Z
f 'j k 2 j=2 f .xk / D 2 j=2 f .xk C 2 j v/ f .xk / p1 .2 j v/'.v/dv
and so jhf; 'j k i
j=2
f .xk /j 2
j=2
C2
jvj j'.v/jdv:
1
X
k l '.t
k/ 2 Pl
l D 0; 1; : : : ; p
1:
(B.15)
kD 1
The condition (B.15) says that Pp 1 Vj and further (see Cohen et al. (1993b)) that for j
J ; Pp 1 Vj 0; 1the multiresolution spaces corresponding to the CDJV construction
described at the end of Section 7.1. Consequently Pp 1 ? Wj 0; 1 and so for j J ;
k D 1; : : : ; 2j ; we have
Z
t l jintk .t /dt D 0;
l D 0; 1; : : : ; p 1:
408
;
j
Lp
the Besov norm integrates over location at each scale and then combines over scale, while
, and more
D FPp;p
the Triebel semi-norm reverses this order. They merge if p D q: BP p;p
generally are sandwiched in the sense that BP p;p^q FPp;q BP p;p_q . Despite the importance
k
of the Triebel scaleFp;2
equals the Sobolev space Wpk , for examplewe will not focus on
them here.
These are the homogeneous definitions: if ft .x/ D f .x=t /=t , then the semi-norms satisfy a scaling relation: kft kBP D t .1=p/ 1 kf kBP . These are only semi-norms since they
vanish on any polynomial. The inhomogeneous versions are defined by bringing in a
low frequency function ' with the properties that supp b
' 2; 2, and b
' c > 0
on 5=3; 5=3. Then
kf kBp;q
D k' ? f kLp C
X
2j kfj kLp
q 1=q
j 1
. These are norms for 1 p; q 1, otherwise
with a corresponding definition for kf kFp;q
they are still quasi-norms.
Many of the traditional function spaces of analysis (and non-parametric statistics) can be
identified as members of either or both of the Besov and Triebel scales. A remarkable table
may be found in Frazier et al. (1991), from which we extract, in each case for > 0:
Holder
Hilbert-Sobolev
Sobolev
C B1;1
W2 B2;2
Wp Fp;2
N
1<p<1
409
P
If the window function also satisfies the wavelet condition j jb.2
it is straightforward to verify that jf jBP as defined above satisfies
2;2
Z
b./j2 d ;
jf jBP jj2 jf
2;2
R
corresponding with the Fourier domain definition of .D f /2 .
The Besov and Triebel function classes on R (and Rn ) have characterizations in terms of
wavelet coefficients. Using the Meyer wavelet, Lemarie and Meyer (1986) established the
characterization for homogeneous Besov norms for 2 R and 1 p; q 1: This result
is extended to 0 < p; q 1 and the Triebel scale by Frazier et al. (1991, Theorem 7.20)
After a discussion of numerous particular spaces, the inhomogenous Besov case is written
out in Meyer (1990, Volume 1, Chapter VI.10).
If .'; / have lower regularitye.g. the Daubechies families of waveletsthen these
characterisations hold for restricted ranges of .; p; q/. By way of example, if ' generates
an rregular MRA, then the result from Meyer (1990) just cited shows that the equivalence
(9.37) holds for p; q 1; jj < r.
j L k
We have made frequent use of Besov norms on the coefficients D .k / and D .j / D
.j k /. To be specific, define
;
kf kbp;q
D kkp C j jbp;q
(B.16)
1=p
j jqb D jjqbp;q
D
2aj kj kp q :
(B.17)
j L
In these definitions, one can take 2 R and p; q 2 .0; 1 with the usual modification for p
or q D 1.
This appendix justifies the term Besov norm by showing that these sequence norms are
equivalent to standard definitions of Besov norms on functions on Lp .I /.
We use the term CDJV multiresolution to describe the multiresolution analysis of L2 0; 1
resulting from the construction reviewed in Section 7.1. It is based on a Daubechies scaling
function ' and wavelet with compact support. If in addition, is C r which is guaranteed for sufficiently large S , we say that the MRA is r-regular.
This section aims to give a more or less self contained account of the following result.
Theorem B.9 Let r be a positive integer and suppose that fVj g is a r-regular CDJV multresolution analysis of L2 0; 1. Suppose that 1 p; q 1 and 0 < < r. Let the Besov
410
by (B.16). Then the two norms are equivalent: there exist constants C1 ; C2 depending on
.; p; q/ and the functions .; /, but not on f so that
:
C1 kf kbp;q
kf kBp;q
C2 kf kbp;q
Equivalences of this type were first described by Lemarie and Meyer (1986) and developed in detail in Meyer (1992, Chapters 6 - 8). for I D R: Their Calderon-Zygmund
operator methods make extensive use of the Fourier transform and the translation invariance
of R:
The exposition here, however, focuses on a bounded interval, for convenience 0; 1, since
this is needed for the white noise models of nonparametric regression. On bounded intervals, Fourier tools are less convenient, and our approach is an approximation theoretic one,
inspired by Cohen et al. (2000) and DeVore and Lorentz (1993). The survey of nonlinear approximation, DeVore (1998), although more general in coverage than needed here, contains
much helpful detail.
The conditions on ; p; q are not the most general. For example, Donoho (1992) develops
a class of interpolating wavelet transforms using an analog of L2 multiresolution analysis
for continuous functions with coefficients obtained by sampling rather than integration. For
this transform, Besov (and Triebel) equivalence results are established for 0 < p; q 1,
but with now in the range .1=p; r/.
An encyclopedic coverage of Besov and Triebel function spaces and their characterizations may be found in the books Triebel (1983, 1992, 2006, 2008).
Outline of approach. One classical definition of the Besov function norm uses a modulus
of smoothness based on averaged finite differences. We review this first. The modulus of
smoothness turns out to be equivalent to the K functional
K.f; t/ D inffkf
which leads to the view of Besov spaces as being interpolation spaces, i.e. intermediate
between Lp .I / and Wp .I /.
The connection between multiresolution analyses fVj g and Besov spaces arises by comparing the K functional at scale 2 rk , namely K.f; 2 rk /, with the approximation error
due to projection onto Vk ,
ek .f / D kf
Pk f kp :
This comparison is a consequence of two key inequalities. The direct or Jackson inequality, Corollary B.17 below, bounds the approximation error in terms of the rth derivative
kf
Pk f kp C 2
rk
kf .r/ kp :
Its proof uses bounds on kernel approximation, along with the key property that each Vj contains Pr 1 . The inverse or Bernstein inequality, Lemma B.19 below, bounds derivatives
of g 2 Vk :
kg .r/ kp C 2rk kgkp :
DeVore (1998) has more on the role of Jackson and Bernstein inequalities.
From this point, it is relatively straightforward to relate the approximation errors ek .f /
411
with the wavelet coefficient norms (B.17). The steps are collected in the final equivalence
result, Theorem B.9, in particular in display (B.48).
f .x/ D .Th
I /f .x/:
!
r
X
r
. 1/r
I /r f .x/ D
k
f .x C kh/:
(B.18)
kD0
To describe sets over which averages of differences can be computed, we need the (one
sided) erosion of A: set Ah D fx 2 A W x C h 2 Ag. The main example: if A D a; b; then
Ah D a; b h. The r th integral modulus of smoothness of f 2 Lp .A/ is then
!r .f; t /p D sup krh .f; /kp .Arh /:
0ht
For p < 1, this is a measure of smoothness averaged over A; the supremum ensures monotonicity in t. If p D 1, it is a uniform measure of smoothness, for example
!1 .f; t/1 D supfjf .x/
f .y/j; x; y 2 A; jx
yj tg:
The differences rh .f; x/ are linear in f , and so for p 1, there is a triangle inequality
!r .f C g; t /p !r .f; t /p C !r .g; t /p :
(B.19)
(B.20)
!k .f; t /p :
(B.21)
!r .f; nt /p nr !r .f; t /p :
(B.22)
412
When derivatives exist, the finite difference can be expressed as a kernel smooth of bandwidth h of these derivatives:
Lemma B.10 Let be the indicator of 0; 1, and ?r be its r th convolution power. Then
r Z
r d
r
f .x C hu/?r .u/du
(B.23)
h .f; x/ D h
dx r
Z
D hr f .r/ .x C hu/?r .u/du;
(B.24)
the latter inequality holding if f 2 Wpr .
The easy proof uses induction. A simple consequence of (B.24) is the bound
!r .f; t /p t r jf jWpr .I / ;
(B.25)
R
valid for all t 0: Indeed, rewrite the right side of (B.24) as hr K.x; v/f .r/ .v/dv, using
the kernel
K.x; v/ D h 1 ?r .h 1 .v
x//
for x 2 Ih and v D x C hu 2 I . Now apply Youngs inequality (C.27), which says that the
operator with kernel K is bounded on Lp . Note that both M1 and M2 1 since ?r is a
probability density, so that the norm of K is at most one. Hence
krh .f; /kp .Irh / hr jf jWpr .I / ;
and the result follows from the definition of !r .
B.11 Uniform smoothness. There are two ways to define uniform smoothness using moduli. Consider 0 < 1. The first is the usual Holder/Lipschitz definition
jf jLip./ D sup t
t >0
!1 .f; t /1 ;
which is the same as (C.24). The second replaces the first-order difference by one of (possibly) higher order. Let r D C 1 denote the smallest integer larger than and put
jf jLip ./ D sup t
t >0
!r .f; t /1 :
Clearly these coincide when 0 < < 1. When D 1, however, Lip .1/ D Z is the
Zygmund space, and
kf kLip .1/ D kf k1 C
sup
x;xh2A
jf .x C h/
2f .x/ C f .x C h/j
:
h
It can be shown (e.g. DeVore and Lorentz (1993, p. 52)) that Lip .1/ Lip.1/ and that the
containment is proper, using the classical example f .x/ D x log x on 0; 1.
Besov spaces. Let > 0 and r D bc C 1: Let A D R; or an interval a; b. The Besov
space Bp;q
.A/ is the collection of f 2 Lp .A/ for which the semi-norm
Z 1 ! .f; t / q dt 1=q
r
p
D
(B.26)
jf jBp;q
t
t
0
413
:
kf kBp;q
D kf kp C jf jBp;q
(B.27)
, Theorem B.12,
arguments to follow: first in the interpolation space identification of Bp;q
and second in Theorem B.20 relating approximation error to the K-functional. This indicates
why it is the Zygmund spaceand more generally Lip ./that appears in the wavelet
The main fact about K.f; t/ for us is that it is equivalent to the r th modulus of smoothness
!r .f; t/p see Theorem B.13 below.
First some elementary remarks about K.f; t /. Since smooth functions are dense in Lp , it
is clear that K.f; 0/ D 0. But K.f; t / vanishes for all t > 0 if and only if f is a polynomial
of degree at most r 1. Since K is the pointwise infimum of a collection of increasing linear
functions, it is itself increasing and concave. Further, for any f
K.f; t / min.t; 1/K.f; 1/;
(B.28)
(B.29)
A sort of converse to (B.28) will be useful. We first state a result which it is convenient
414
to prove later, after Proposition B.16. Given g 2 Wpr , let r 1 g be the best (in L2 .I /)
polynomial approximation of degree r 1 to g. Then for C D C.I; r/,
kg
1 gkp
C kg .r/ kp :
(B.30)
Now, let f 2 Lp and g 2 Wpr be given. From the definition of K and (B.30),
K.f; t/ kf
1 gkp
kf
gkp C kg
kf
gkp C C kg .r/ kp ;
1 gkp
(B.31)
The K functional K.f; t/ trades off between Lp and Wpr at scale t. Information across
scales can be combined via various weighting functions by defining, for 0 < < 1,
.f /;q D
1h
Z
0
K.f; t / iq dt 1=q
t
t
0<q<1
(B.32)
K.f; t /.
K.f; t/R by min.1; t / in the integral (B.32) leads to the sum of two integrals
R 1Replacing
1
.1 /q 1
t
dt
and 1 t q 1 dt , which both converge if and only if 0 < < 1. Hence
0
property (B.28) shows that in order for .f /;q to be finite for any f other than polynomials,
it is necessary that 0 < < 1:
On the other hand, property (B.29) shows that
.f /;q c q kf kWpr :
(B.33)
if 1 > 2 ; or if 1 D 2 and q1 q2 :
The main reason for introducing interpolation spaces here is that they are in fact Besov
spaces.
Theorem B.12 For r 2 N, and 1 p 1, 0 < q 1; 0 < < r,
This follows from the definitions and the next key theorem (Johnen, 1972), which shows
that the K-functional is equivalent to the integral modulus of continuity.
415
t > 0:
(B.34)
Proof We work on the left inequality first: from the triangle inequality (B.19) followed by
(B.20) and derivative bound (B.25), we have for arbitrary g,
!r .f; t /p !r .f
g; t /p C !r .g; t /p
gkp C t r jgjWpr :
2 kf
(B.36)
where the second inequality follows because ?r is a probability density supported on 0; r,
and the third uses (B.22).
Now estimate kg .r/ kp . Use expansion (B.18) for rtu .f; x/, noting that the k D 0 term
cancels f .x/ in (B.35). Differentiate and then use (B.23) to obtain
!
r
r Z
X
r
.r/
kC1 d
f .x C k t u/?r .u/du
g .x/ D
. 1/
dx r
k
kD1
!
r
X
r
. 1/kC1 .k t / r rk t .f; x/:
D
k
kD1
!r .f; k t /p 2r !r .f; t /p :
kD1
Putting this last inequality and (B.36) into the definition of K.f; t r / yields the right hand
bound with C2 D r r C 2r .
If A D 0; 1; then g is defined in (B.35) for x 2 I1 D 0; 3=4 if t 1=4r 2 : By symmetry,
one can make an analogous definition and argument for I2 D 1=4; 1. One patches together
the two subinterval results, and takes care separately of t > 1=4r 2 : For details see DeVore
and Lorentz (1993, p. 176, 178).
For work with wavelet coefficients, we need a discretized version of these measures.
Lemma B.14 Let L 2 N be fixed. With constants of proportionality depending on I; r; ; q
and L but not on f ,
1
X
.f /q;q
2 rj K.f; 2 rj /q :
(B.37)
j DL 1
416
Proof Since K.f; t/ is concave in t with K.f; 0/ D 0, we have K.f; t / K.f; t /, and
since it is increasing in t , we have for 2 r.j C1/ t 2 rj ,
2 r K.f; 2
rj
/ K.f; 2
r.j C1/
/ K.f; t / K.f; 2
rj
/:
From this it is immediate that, with a D 2 r.L 1/ , the sum SL .f / in (B.37) satisfies
Z ah
K.f; t / iq dt
SL .f /
t
t
0
with constants of proportionality depending only on .; q; r/. From (B.31),
Z 1
K.f; t / q dt
C K.f; a/a q
t
t
a
where C depends on .I; L; r; ; q/. With a D 2
sum SL .f /, completing the proof.
r.L 1/
MRAs on 0; 1
We use the term CDJV multiresolution to describe the multiresolution analysis of L2 0; 1
resulting from the construction reviewed in Section 7.1. It is based on a scaling function
' and wavelet with support in S C 1; S and for which has S vanishing moments.
The MRA of L2 0; 1 is constructed using S left and S right boundary scaling functions
'kL ; 'kR ; k D 0; : : : S 1.
Choose a coarse level L so that 2L 2S. For j L, we obtain scaling function spaces
Vj D spanf'j k g of dimension 2j . The orthogonal projection operators Pj W L2 .I / ! Vj
have associated kernels
X
j k .x/j k .y/;
Ej .x; y/ D
k
If in addition, is C r which is guaranteed for sufficiently large S we say that the MRA
is r-regular. Since
is C r it follows (e.g. by Daubechies (1992, Corollary 5.5.2)) that
has r vanishing moments. The CDJV construction then ensures that Pr 1 , the space of
polynomials of degree r 1 on 0; 1 is contained in VL . In fact, we abuse notation and write
VL 1 D Pr 1 . The corresponding orthogonal projection operator PL 1 W L2 .I / ! VL 1
has kernel
r 1
X
r 1 .x; y/ D
pk .x/pk .y/
x; y 2 I:
(B.38)
kD0
417
A simple fact for later use is that Pj have uniformly bounded norms on Lp 0; 1. Define
aq .'/ D maxfk'kq ; k'kL kq ; k'kR kq ; k D 0; : : : ; S
1g:
(B.40)
Lemma B.15
1 p 1,
(B.41)
kPL 1 kp C.r/:
Proof
(B.42)
R
R
from which it follows that jEj .x; y/jdy 2Sa1 .'/a1 .'/ and similarly for jEj .x; y/jdx.
We argue similarly for j D L 1 using the bounds
r 1
X
jpk .x/j C r
3=2
Z
jpk .y/jdy 1:
kD0
L2 0; 1 D VL j L Wj :
R
I
for 2 Pr
(ii) Kh .x; y/ D 0 if jy
1;
xj > Lh;
Kh f kp C hr kD r f kp ;
h > 0:
1. Assumption
418
where KQ h .x; u/ is a new kernel on I I , about which we need only know a bound, easily
derived from the above, along with conditions (ii) and (iii):
(
cM h 1 .Lh/r if jx uj Lh
jKQ h .x; u/j
0
otherwise :
R
Since I jKQ h .x; u/jdu 2cLrC1 M hr ; with a similar bound for the corresponding integral over x 2 I , our result follows from Youngs inequality (C.27) with M1 D M2 D
2cLrC1 M hr :
A common special case occurs when Kh .x; y/ D h 1 K.h 1 .x y// is a scaled translation
R k invariant kernel on R. Condition (i) is equivalent to the vanishing moment property
t K.t/dt D k0 for k D 0; 1; : : : ; r 1. If K.y/ is bounded and has compact support,
then properties (ii) and (iii) are immediate.
As a second example, consider orthogonal polynomials on I D 0; 1 and the associated
kernel r 1 .x; y/ given in (B.38). Assumptions (i) - (ii) hold for h D L D 1. The bound
(B.39) shows that (iii) holds with M D r 2 . Consequently, for f 2 Wpr .I / we obtain the
bound kf r 1 f kp C kf .r/ kp for C D C.r/, which is just (B.30).
Our main use of Proposition B.16 is a Jackson inequality for multiresolution analyses.
Corollary B.17 Suppose that fVj g is a CDJV multresolution analysis of L2 0; 1. Let Pj
be the associated orthogonal projection onto Vj , and assume that 2j 2S . Then there
exists a constant C D C.'/ such that for all f 2 Wpr .I /,
kf
Pj f k p C 2
rj
jf jWpr :
Proof We claim that assumptions (i)-(iii) hold for the kernel Ej with h taken as 2 j . The
CDJV construction guarantees that Pr 1 Vj so that (i) holds. In addition the construction
implies that (ii) holds with L D 2S and that
#fk W 'j k .x/'j k .y/ 0g 2S:
2
It follows that (iii) holds with M D 2Sa1
.'/.
419
Bernstein-type Inequalities
First a lemma, inspired by Meyer (1990, p.30), which explains the occurence of terms like
2j.1=2 1=p/ in sequence norms.
Lemma B.18 Let f
j k ; k 2 Kg be an orthonormal sequence of functions satisfying
X
(i)
j
j k .x/j b1 2j=2 ;
and
k
Z
(ii)
max
k
j j k j b1 2
j=2
1=p/
kkp :
(B.43)
to be C r .
Proof
where the functions
j k are formed from the finite set fD r ; D r k0 ; D r k1 g by exactly the
same set of linear operations as used to form j k from the set f; k0 ; k1 g.
420
Since the fj k g system satisfy the conditions (i) and (ii) of Lemma B.18, the same is true
of the f
j k g system. From the right side of that Lemma,
X
kD r gkp D 2jr k
k
j k kp c2 2jr 2j.1=2 1=p/ kkp :
Now apply the left side of the same Lemma to the (orthogonal!) fj k g system to get
X
kD r gkp C2 C1 1 2jr k
k j k kp D b1 b1 2jr kgkp :
Pk f kp :
We will show that the rate of decay of ek .f / is comparable to that of K.f; 2 rk /, using the
Jackson and Bernstein inequalities, Corollary B.17 and Lemma B.19 respectively. In order
to handle low frequency terms, we use the notation VL 1 to refer to the space of polynomials
of degree at most r 1, and adjoin it to the spaces Vk ; k L of the multiresolution analysis.
Theorem B.20 Suppose that fVj g is a r-regular CDJV multresolution analysis of L2 0; 1.
Let r 2 N be given. For 1 p 1; 0 < q < 1 and 0 < < r. With constants depending
on .; r; '/ , but not on f , we have
1
X
2k ek .f /q
L 1
Proof
1
X
2k K.f; 2
rk
/q :
(B.44)
L 1
kr
/ C2
k
X
.k j /r
ej .f /;
(B.45)
j DL 1
with constants Ci D Ci .'; r/. For the left hand inequality, let f 2 Lp and g 2 Wpr be fixed.
Write f Pk f as the sum of .I Pk /.f g/ and g Pk g, so that
ek .f / k.I
Pk /.f
g/kp C ek .g/:
gkp C 2
rk
jgjWpr :
k
X
j DL
j jWpr c
k
X
L
2rj k
j kp c
k
X
L
2rj ej
1 .f
/ C ej .f /:
421
kr
/ kf
Pk f kp C 2
k
X
.1 C 2rC1 c/
kr
jPk f jWpr
.k j /r
ej .f /:
j DL 1
2 : The left to right bound in (B.44) is immediate from (B.45). For the other inequality,
let bk D 2k ek .f / andPck D 2k K.f; 2 rk / for k L 1 and 0 otherwise. Then bound
(B.45) says that ck 1
1, where ak D C2 2 k.r / I fk 0g.
j DL 1 ak j bj for k L
Our bound kckq cr C2 kbkq now follows from Youngs inequality (C.30).
1=p.
Proof
j L
j L
The first equivalence follows from Lemma B.18 and the Remark 2 following it:
kQj f kp 2j.1=2
1=p/
kj kp ;
(B.46)
,
or
equivalently
k
j
kj
kj k
2j ej
.k j / k
2 k :
kj
1=p/
kkp :
(B.47)
422
t
t
0
Z 1
K.f; s/ q ds
s
s
0
X
(B.48)
j
2 K.f; 2 rj /q
j L 1
2j ej .f /q
j L 1
Although the ranges of summation differ, this is taken care of by inclusion of the Lp norm
of f , as we now show. In one direction this is trivial since the sum from L is no larger than
the sum from L 1. So, moving up the preceding chain, using also (B.47) with (B.41), we
get
kf kb D kkp C j jb C kPL f kp C C jf jB C.kf kp C jf jB / D C kf kB :
In the other direction, we connect the two chains by writing jf jB C eL 1 .f / C jjb
and observing from (B.42) that eL 1 .f / kI PL 1 kp kf kp C kf kp . Consequently,
kf kB D kf kp C jf jB C.kf kp C j jb /:
Now kf kp eL .f / C kPL f kp which is in turn bounded by C.jjb C kkp / by (B.49)
and (B.47). Putting this into the last display finally yields kf kB C kf kb .
kj/
1 0
(B.50)
Z
w .x/dx D 0;
jw .x 0 /
w .x/j C2 2j.1=2C/ jx 0
(B.51)
xj :
(B.52)
423
Proof of Proposition 12.3. (i) (Meyer and Coifman, 1997, Ch. 8.5) Let K0 D w wN 0 ,
our strategy is to use Schurs Lemma C.28 to show that K is bounded on `2 . The ingredients
0
are two bounds for jK0 j. To state the first, use (B.50) to bound jK0 j C 2 jj j j=2 L0 ,
where L0 is the left side of the convolution bound
Z
0
C
2j ^j dx
; (B.53)
.1 C j2j x kj/1C0 .1 C j2j 0 x k 0 j/1C0
.1 C 2j ^j 0 jk 0 2 j 0 k2 j j/1C0
0
1C
verified in Exercise B.1. Denoting the right side by CM
, the first inequality states
0
jK0 j C1 2
jj 0 j j=2
1C
:
M
0
(B.54)
For the next inequality, use the zero mean and Holder hypotheses, (B.51) and (B.52), to
argue, just as at (9.31) and (9.32), that for j 0 j ,
Z
0
jK0 j C 2j.1=2C/ jx k 0 2 j j jw0 .x/jdx:
Using again (B.50) to bound w0 and then < 0 to assure convergence of the integral, we
arrive at the second inequality
jK0 j C2 2
jj 0 j j.1=2C/
(B.55)
jj 0 j j.1=2C/
1C
M
0
(B.56)
k0
X
k0
1C
M
0
X
k0
.1 C jk
0
j 0, then
d
d k 0 j/1C
2
Z
C
dt
C ;
.1 C jtj/1C
1C
while if j 0 < j with " D 2j j , the terms M
C.1 C jk 0 k"j/ 1 have sum
0
P
1C
0
over k uniformly bounded in k and " 1. Hence in bothPcases, k 0 M
is bounded by
0
0
0
.j j /C
jj j j
C 2
. Since u C juj 2uC D 0, we have S C j 2
C uniformly in
as required.
P
P
P
(ii). The biorthogonality means that j j2 D h u ; v i, and hence by CauchySchwarz that
X
X
kk2 k
u kk
v k:
P
P
From part (i), we have k v k C kk, so it follows that k P u k C 1 kk.
Reverse the roles of u and v to establish the same lower bound for k v k.
424
Proof of Theorem 9.9 We abbreviate kf kW2r by kf kr and the sequence norm in (9.35) by
jjjf jjj2r . The approach is to establish kf kr C jjjf jjjr for f 2 VJ and then to use a density
argument to complete the proof. For f 2 VJ we can differentiate term by term to get
r
D f D
.r/
k 0k
J X
X
2jr j k
.r/
jk
D D r f0 C D r f1 :
j D0 k
Under the hypotheses on , it was shown in Section 12.3, example 1, that f. .r/ / g is a
system of vaguelettes and hence by Proposition 12.3 satisfies the frame bounds (9.34). Apply
Lemma B.18 (for p D 2; j D 0
the frame bound to conclude that kD r f1 k2 C jjjf jjjr andP
with orthogonality not required) to obtain kD r f0 k2 C
k2 . Putting these together, we
get kf kr C jjjf jjjr for f 2 VJ . The density argument says that for f 2 W2r , we have
PJ f ! f in L2 and that D r PJ f is an L2 Cauchy sequence (since kD r .PJ f PK f /k2
C jjjPJ f PK f jjjr ) so PJ ! f in W2r .
P
jr .r/
In the other direction, for f 2 VJ , we have D r f D
j J;k 2
, since the sum
converges in L2 at J D 1 from the frame bound. Hence
X
X
22rj j2k
.2rj j k /2 C 2 kD r f k22 ;
j 0;k
while
j J;k
k2 kf k22 . Add the bounds to get jjjf jjj2r C 2 kf k2r and extend by density.
B.5 Notes
Exercises
B.1
0
0
2j x
.1 C jt
k; D 2j
j0
and D k
dt
C.
/
j/ .1 C jt j/
.1 C /
.1 C /.1 C jtj/
t 0;
<.1 C =2/.1 C t /
0 t < =.2/;
g.t/
:.1 C t =/
t =:
Appendix C
Background Material
The reader ... should not be discouraged, if on first reading of 0, he finds that he does
not have the prerequisites for reading the prerequisites. (Paul Halmos, Measure Theory).
Here we collect bits of mathematical background, with references, that are used in the
main text, but are less central to the statistical development (and so, in that important sense,
are not prerequisites). Not a systematic exposition, this collection has two aims: initially to
save the reader a trip to an authoritative source, and later, if that trip is needed, to point to
what is required. References in brackets, like [1.4], indicate sections of the main text that
refer here.
i2I
C.1 Norms etc. A norm k k on a real or complex linear space X satisfies three properties:
(i) (definiteness) kxk D 0 if and only if x D 0, (ii) (scaling) kaxk D jajkxk for any scalar
a, and (iii) (triangle inequality) kx C yk kxk C kyk.
Two norms k k1 and k k2 on X are called equivalent if there exist C1 ; C2 > 0 such that
for all x 2 X,
C1 kxk1 kxk2 C2 kxk1 :
A semi-norm j j on X satisfies (ii) and (iii) but not necessarily the definiteness condition
(i). For a quasi-norm k k on X , the triangle inequality is replaced by
kx C yk C.kxk C kyk/;
for some constant C , not depending on x or y.
C.2 Compact operators, Hilbert-Schmidt and Mercer theorems. [3.9]
We begin with some definitions and notation, relying for further detail on Reed and Simon
(1980, Ch. VI.5,6) and Riesz and Sz.-Nagy (1955, Ch. VI, 97,98).
Let H and K be Hilbert spaces, with the inner product denoted by h; i, with subscripts H
425
426
Background Material
with n 2 R and n ! 0 as n ! 1:
bn2 > 0:
The set f'n g need not be complete! However A A D 0 on the subspace N.A/ D N.A A/
orthogonal to the closed linear span of f'n g. Define
n
The set f
ng
A'n
D bn 1 A'n :
kA'n k
is orthnormal, and
A'n D bn
A
n;
D bn 'n :
(C.2)
It can be verified that f n g is a complete orthonormal basis for the closure of the range of
A, and hence that for any f 2 H, using (C.2)
X
X
hAf; n i n D
bn hf; 'n i n :
(C.3)
Af D
n
Relations (C.2) and (C.3) describe the singular value decomposition of A, and fbn g are the
singular values.
We have also
X
f D
bn 1 hAf; n i'n C u;
u 2 N.A/:
(C.4)
In (C.3) and (C.4), the series converge in the Hilbert norms of K and H respectively.
C.4 Kernels, Mercers theorem. [3.10, 3.9] An operator A 2 L.H/ is Hilbert-Schmidt if
for some orthobasis fei g
X
kAk2HS D
jhei ; Aej ij2 < 1:
(C.5)
i;j
kAk2HS
The value of
does not depend on the orthobasis chosen: regarding A as an infinite
2
matrix, kAkHS D tr A A: Hilbert-Schmidt operators are compact. An operator A is HilbertSchmidt if and only if its singular values are square summable.
427
Background Material
Further, if H D L2 .T; d/, then A is Hilbert-Schmidt if and only if there is a squareintegrable function A.s; t/ with
Z
Af .s/ D A.s; t /f .t /d.t /;
(C.6)
and in that case
kAk2HS
(C.7)
Suppose now that T D a; b R and that A W L2 .T; dt / ! L2 .T; dt / has kernel A.s; t /.
The kernel A.s; t/ is called (i) continuous if .s; t / ! A.s; t / is continuous on T T , (ii)
symmetric if A.s; t/ D A.t; s/, and (iii) non-negative definite
if .Af; f / 0 for all f .
These conditions imply that A is square-integrable, T T A2 .s; t /dsdt < 1, and hence
that A is self-adjoint, Hilbert-Schmidt and thus compact and so, by the Hilbert-Schmidt
theorem, A has a complete orthonormal basis f'n g of eigenfunctions with eigenvalues 2n :
Theorem C.5 (Mercer) If A is continuous, symmetric and non-negative definite, then the
series
X
A.s; t / D
2n 'n .s/'n .t /
n
Proof We prove (iii) ) (i) ) (ii) ) (iii). If D A g, then (i) follows from the definition
of A . Then (i) ) (ii) follows from the Cauchy-Schwarz inequality with C D kgk2 .
(ii) ) (iii). The linear functional Lh D hA 1 h; i is well defined on R.A/ since A is
one-to-one. From the hypothesis, for all h D Af , we have jLhj D jhf; ij C khk2 . Thus
L is bounded on R.A/ and so extends by continuity to a bounded linear functional on R.A/.
The Riesz representation theorem gives a g 2 R.A/ such that
Af; g D L.Af / D hf; i
D A g.
428
Background Material
[4.2, Lemma 4.7]. An extended form of the dominated convergence theorem, due to
Young (1911) and rediscovered by Pratt (1960), has an easy proof, e.g. Bogachev (2007, Vol
I, Theorem 2.8.8)
Theorem C.7 If fn ; gn and Gn are -integrable functions and
1. fn ! f , gn ! g and Gn ! G a.e. (), with g and G integrable,
2. g
R n fn R Gn forR all n, and
R
3. gn ! g and Gn ! G,
R
R
then f is integrable, and fn ! f .
Covariance inequality. [Exer. 4.1]. Let Y be a real valued random variable and suppose
that f .y/ is increasing and g.y/ is decreasing. Then, so long as the expectations exist,
Ef .Y /g.Y / Ef .Y /Eg.Y /:
(C.8)
[Jensens inequality.]
429
Background Material
C.9 [7.1, 12.2, B.1]. The Fourier transform of an integrable function on R is defined by
Z 1
b
f ./ D
f .x/e i x dx:
(C.9)
1
f .x
y/g.y/dy
b./b
f ? g./ D f
g ./:
(C.10)
R1
f .x/e
2 i kx
jf j2 .x/dx D
jck j2 :
k2Z
[3.5, 14.4]. The Poisson summation formula (Folland, 1999, Sec. 8.3) states that if
b./j are bounded, then
.1 C jxj2 /jf .x/j and .1 C jj2 /jf
X
X
b.2k/:
f .j / D
f
(C.12)
j 2Z
k2Z
b) alone.]
[Dym and McKean (1972, p. 111) gives a sufficient condition on f (or f
When applied to f .x/ D g.x C t /; this yields a representation for the periodization of g
X
X
g.t C j / D
e 2 i k t b
g .2k/;
t 2 R:
(C.13)
j
1
There are several conventions for the placement of factors involving 2 in the definition of the Fourier
transform, Folland (1999, p. 278) has a comparative discussion.
430
Background Material
R
C.10 The Fourier transform of a probability measure, b
./ D e i . / is also called
the characteristic function. The convolution property (C.10) extends to convolution of probability measures: ? ./ D b
./b
./.
The characteristic function of an N.; 2 / distributions is expfi 2 2 =2g. It follows
from the convolution property that if the convolution ? of two probability measures is
Gaussian, and if one of the factors is Gaussian, then so must be the other factor.
.z/ D .2/
z 2 =2
.z/ D
Q
.z/
D
.u/du;
.u/du:
z
From .u/ uz
.u/, we obtain the simplest bound for Mills ratio, (Mills, 1926),
Q
.z/=.z/
z
.z > 0/:
(C.14)
=2
(C.15)
p
2 log n:
(C.17)
C.13 Brownian motion, Wiener integral. [1.4, 3.10]. A process fZ.t /; t 2 T g is Gaussian if all finite-dimensional distributions .Z.t1 /; : : : ; Z.tk // have Gaussian distributions for
all .t1 ; t2 ; : : : ; tk / 2 T k and positive integer k: It is said to be continuous in quadratic mean
if EZ.t C h/ Z.t/2 ! 0 as h ! 0 at all t.
The following basic facts about Brownian motion and Wiener integrals may be found, for
example, in Kuo (2006, Ch. 2). Standard Brownian motion on the interval 0; 1 is defined
as a Gaussian process fW .t/g with mean zero and covariance function Cov .W .s/; W .t // D
s ^ t: It follows that fW .t/g has independent increments: if 0 t1 < t2 < < tn , then the
increments W .tj / W .tj 1 / are independent. In addition, the sample paths t ! W .t; !/
are continuous with probability one.
Background Material
431
R1
with the series converging almost surely (Shepp, 1966). Particular examples
p for which this
representation was known earlier include the trigonmetric basis k .t / D 2 cos.k 21 / t
(Wiener) and the Haar basis j k .t / D 2j=2 h.2j t k/ for h.t / equal to 1 on 0; 21 and to 1
on 12 ; 1 (Levy).
If C.s;Rt/ is a square integrable kernel on L2 .0; 12 /, then the Gaussian random function
1
F .s/ D 0 C.s; t/d W .t/ 2 L2 0; 1 almost surely, having mean zero and finite variance
R1 2
P
2 0; 1. If C.s; t / has the
expansion i ci 'i .s/'i .t / with
0 C .s; t/dt for almost all sP
P
square summable coefficients i ci2 < 1, then F .s/ D i ci I.'i /'i .s/.
[8.6]. Weak law of large numbers for triangular arrays. Although designed for variables without finite second moment, the truncation method works well for the cases of
rapidly growing variances that occur here. The following is taken from Durrett (2010, Thm
2.2.6).
Proposition C.14 For each n let Xnk ; 1 k n; be independent. Let bn > 0 with
bn P
! 1; and let XN nk D Xnk I fjXnk j bn g. Suppose that as n ! 1,
(i) nkD1
PPn .jXnk j 2> bn / ! 0, and
2
N
(ii) bn
kD1 E Xnk ! 0 as n ! 1.
P
Let Sn D Xn1 C : : : C Xnn and put an D nkD1 E XN nk . Then
Sn D an C op .bn /:
432
Background Material
[Convex Set]
[l.s.c. and max on compact]
[metric space: seq cty = cty]
C.16 [complete, separable, metrizable, Borel field, Radon measure,
second countable, Hausdorff....]
A subset K of a metric space is compact if every covering of K by open sets has a finite
subcover.
A subset K of a metric space is totally bounded if it can be covered by finitely many balls
of radius for every > 0:
[Ref: Rudin FA p 369] If K is a closed subset of a complete metric space, then the following three properties are equivalent: (a) K is compact, (b) Every infinite subset of K has
a limit point in K, (c) K is totally bounded.
is lower semicontinuous.
C.17 If X is compact, then an lsc function f attains its infimum: infx2X f D f .x0 / for
some x0 2 X:
[If X is 1st countable, then these conditions may be rewritten in terms of sequences as
f .x/ lim inf f .xn / whenever xn ! x:]
A function g is upper semicontinuous if f D g is lsc.
C.18 Weak convergence of probability measures. [4.4]. Let be a complete separable
metric spacefor us, usually a subset of Rn for some n. Let P ./ denote the collection of
probability measures on with the Borel algebra generated by the open sets. We say
that n ! in the weak topology if
Z
Z
d n !
d
(C.18)
for all bounded continuous W ! R.
When D R or Rd , the Levy-Cramer theorem provides Ra convergence criterion in
terms of the Fourier transform/characteristic function b
./ D e i .d /, namely that
n ! n weakly if and only if b
n ./ ! b
./ for all with b
./ being continuous at 0
(Cramer, 1999, p. 102), (Chung, 1974, p.101).
A collection P P ./ is called tight if for all > 0, there exists a compact set K
for which .K/ > 1 for every 2 P .
Prohorovs theorem (Billingsley, 1999, Ch. 1.5) provides a convenient description of compactness in P ./: a set P P ./ has compact closure if and only if P is tight.
Background Material
433
C.19 Vague convergence. [4.4]. Let D R and PC .R/ be the collection of sub-stochastic
N for R
N D R [ f1g, allowing mass at 1. We
measures on R. Equivalently, PC D P .R/
say that n ! in the vague topology if (C.18) holds for all continuous with compact
support, or (equivalently) for all continuous that vanish at 1.
Clearly weak convergence implies vague convergence, and if P P .R/ is weakly compact, then it is vaguely compact. However P .R/ is not weakly compact (as mass can escape
N
to 1) but PC .R/ is vaguely compact, e.g. from Prohorovs theorem applied to P .R/.
C.20 [4.2, 8.7]. The Fisher information for location of a distribution P on R is
R 0 2
dP
;
(C.19)
I.P / D sup R 2
dP
1
1
where
R 2 the supremum is taken over the set C0 of C functions of compact support for which
dP > 0. For this definition and the results quoted here, we refer to Huber and Ronchetti
(2009, Chapter 4), [HR] below.
It follows from this definition that I.P / is a convex function of P . The definition is however equivalentRto the usual one: I.P / < 1 if and only
R if P has an absolutely continuous
density p, and p 02 =p < 1. In either case, I.P / D p 02 =p.
Given P0 ; P1 with I.PR0 /; I.P1 / < 1 and 0 t 1, let Pt D .1 t /P0 C tP1 .
Differentiating I.Pt / D pt02 =pt under the integral sign (which is justified in HR), one
obtains
Z
d
p002
2p00 0
I.Pt /jtD0 D
.p1 p00 /
.p1 p0 /
dt
p0
p02
(C.20)
Z
0
2
D 2 0 p1
I.P0 /;
0 p1 dx
where we have set 0 D p00 =p0 for terms multiplying p10 and p1 and observed that the
terms involving only p00 and p0 collapse to I.P0 /.
Since I.P / is the supremum of a set of vaguely (resp. weakly) continuous functions, it
follows that P ! I.P / is vaguely (resp. weakly) lower semicontinuous2 . Consequently,
from C.17, if P PC .R/ is vaguely compact, then there is an P0 2 P minimizing I.P /.
Formula (C.20) yields a helpful variational criterion for characterizing a minimizing P0 .
Let P1 D fP1 2 P W I.P1 / < 1g and for given P0 and P1 , let Pt D .1 t /P0 C tP1 . Since
I.P / is convex in P , a distribution P0 2 P minimizes I.P / if and only if .d=dt /I.Pt / 0
at t D 0 for each P1 2 P1 .
A slight reformulation of this criterion is also useful. The first term on the right side of
2
indeed, if V .P / denotes
R 2 the ratio in (C.19), then fP W I.P / > tg is the union of sets of the form
fP W V .P / > t;
dP > 0g and hence is open.
434
(C.20) is
only if
Background Material
0
0 .p1
p00 / D
0
0 .p1
0
0
p0 / 0:
(C.21)
C.21 (Uniqueness). Suppose (i) that P is convex and P0 2 P minimizes I.P / over P with
0 < I.P0 / < 1, and (ii) that the set on which p0 is positive is an interval and contains the
support of every P 2 P . Then P0 is the unique minimizer of I.P / in P .
In our applications, P is typically the marginal distribution ? for a (substochastic)
prior measure . (For this reason, the notation uses P ? for classes of distributions P , which
in these applications correspond to classes P of priors through P ? D fP D ? ; 2 P g.)
In particular, in the uniqueness result, p0 is then positive on all of R and so condition (ii)
holds trivially.
C.22 Steins Unbiased Estimate of Risk. [2.6]. We provide some extra definitions and
details of proof for the unbiased risk identity that comprises Proposition 2.4. As some important applications of the identity involve functions that are only almost differentiable,
we begin with some remarks on weak differentiability, referring to standard sources, such as
Gilbarg and Trudinger (1983, Chapter 7), for omitted details.
A function g W Rn ! R is said to be weakly differentiable if there exist functions hi W
n
R ! R; i D 1; : : : n; such that
Z
Z
hi D
.Di /g
for all 2 C01 ;
where C01 denotes the class of C 1 functions on Rn of compact support. We write hi D
Di g:
To verify weak differentiability in particular cases, we note that it can be shown that
g is weakly differentiable if and only if it is equivalent to a function gN that is absolutely
continuous on almost all line segments parallel to the co-ordinate axes and whose (classical)
partial derivatives (which consequently exist almost everywhere) are locally integrable (e.g.
Ziemer (1989, Thm. 2.1.4)).
For approximation arguments, such as in the proof of Proposition 2.4 below, it is convenient to use the following criterion (e.g. Gilbarg and Trudinger (1983, Thm 7.4)): Suppose
that g and h are integrable on compact subsets of Rn . Then h D Di g if and only if there
exist C 1 functions gm ! g such that also Di gm ! h where in both cases the convergence
is in L1 on compact subsets of Rn . [Exercise 2.24 outlines a key part of the proof.]
A C r partition of unity is a collection
of C r functions m .x/ 0 of compact support
P
n
such that for every x 2 R we have m m .x/ D 1 and on some neighborhood of x, all but
finitely many m .x/ D 0. We add the nonstandard requirement that for some C < 1,
X
jDi m .x/j C
for all x:
(C.22)
m
Exercise C.1 adapts a standard construction (e.g. Rudin (1973, Thm 6.20)) to exhibit an
example that suffices for our needs.
435
Background Material
Proof of Proposition 2.4 First note that by a simple translation of parameter, it suffices to
consider D 0: Next, consider scalar C 1 functions g W Rn ! R of compact support. We
aim to show that EXi g.X/ D EDi g.X /, but this is now a simple integration by parts:
Z
Z
xi g.x/.x/dx D g.x/ Di .x/dx
Z
(C.23)
D Di g.x/.x/dx:
Now use the criterion quoted above to extend to weakly differentiable g with compact
support: use that fact that for compact K Rn , convergence fm ! f in L1 .K/ implies
fm h ! f h, also in L1 .K/, for any function h bounded on K (such as xi or .x/).
Finally for extension to weakly differentiable g satisfying EjXi g.X /j C jDi g.X /j < 1,
let fm g be a C r partition of unity satsifying (C.22). Let gm D g.1 C C m /. Equality
(C.23) extends from the compactly supported gm to g after a few uses of the dominated
convergence theorem.
For a vector function g W Rn ! Rn , just apply the preceding argument to the components
gi and add. Formula (2.58) follows immediately from (2.57) (since E kX k2 D n).
C.23 Holder spaces. [4.7, 7.1, 9.6, B.3]. The Holder spaces C .I / measure smoothness uniformly on an interval I, with smoothness parameter . The norms have the form
kf kC D kf k1;I C jf j , with the sup norm added because the seminorm jf j which
reflects the dependence on will typically vanish on a finite dimensional space.
If is a positive integer, then we require that f have continuous derivatives, and set
jf j D kD f k1;I .
For 0 < < 1, we require finiteness of
o
n jf .x/ f .y/j
jf j D sup
;
x;
y
2
I
:
(C.24)
jx yj
If m is a positive integer and m < < m C 1, then we require both that f have m
uniformly continuous derivatives and also finiteness of
jf j D jD m f j
m:
We note also that Holder functions can be uniformly approximated by (Taylor) polynomials. Indeed, we can say that f 2 C .I / implies that there exists a constant C such that
for each x 2 I , there exists a polynomial px .y/ of degree de 1 such that
jf .x C y/
px .y/j C jyj ;
if x C y 2 I:
(C.25)
The
C can be taken as jf j =c , where c equals 1 if 0 < < 1 and equals
Q constant
1
j / if 1:
j D0 .
C.24 Total Variation. [ 9.6] When I D a; b, this semi-norm is defined by
n
X
jf jT V .I / D supf
jf .ti /
i D1
f .ti
1 /j
436
Background Material
1=p
kf kp :
(C.27)
Proof For p D 1 the result is immediate. For 1 < p < 1, let q be the conjugate exponent
1=q D 1 1=p. Expand jK.x; y/j as jK.x; y/j1=q jK.x; y/j1=p and use Holders inequality:
hZ
i1=q hZ
i1=p
jKf .x/j
jK.x; y/j.dy/
jK.x; y/jjf .y/jp .dy/
;
so that, using (ii),
p
jKf .x/j
M2p=q
Now integrate over x, use Fubinis theorem and bound (i) to obtain (C.27). The proof for
p D 1 is similar and easier.
Background Material
437
R
Remark. The adjoint .K g/.y/ D g.x/K.x; y/.dx/ maps Lp .X / ! Lp .Y / with
kK gkp M11
1=p
M21=p kgkp :
(C.28)
(ii) If ck D
j 2Z
ak
j bj ,
kKf kp kKk1 kf kp :
(C.29)
(C.30)
then
Another consequence, in the L2 setting, is a version with weights. Although true in the
measure space setting of Theorem C.26, we need only the version for infinite matrices.
Corollary C.28 (Schurs Lemma) [15.3, B.4]. Let K D .K.i; j //i;j 2N be an infinite
matrix and let .p.i// and .q.j // be sequences of positive numbers. Suppose that
X
p.i /K.i; j / M1 q.j /
j 2 N;
and
(i)
i
(ii)
i 2 N;
Proof
Use the argument for Theorem C.26, this time expanding jK.i; j /j as
jK.i; j /j1=2 q.j /1=2 jK.i; j /j1=2 q.j /
1=2
Theorem C.29 (Minkowskis integral inequality) [B.3]. Let .X; BX ; / and .Y; BY ; /
be finite measure spaces, and let f .x; y/ be a jointly measurable function. Then for
1 p 1,
p
1=p Z Z
1=p
Z Z
p
f .x; y/.dy/ .dx/
jf .x; y/j .dx/
.dy/:
(C.31)
C.30 Gauss hypergeometric function [3.9]. is defined for jxj < 1 by the series
F .; ;
I x/ D
1
X
./n ./n x n
nD0
. /n
1/; ./0 D 1 is
438
Background Material
the Pochhammer symbol. For Re
> Re > 0 and jxj < 1, Eulers integral representation
says that
Z 1
1
t 1 .1 t /
1 .1 tx/ dt;
F .; ;
I x/ D B.;
/
0
where B.;
/ D ./.
/=. C
/ is the beta integral. These and most identities given
here may be found in Abramowitz and Stegun (1964, Chs. 15, 22) See also Temme (1996,
Chs. 5 and 6) for some derivations. Gelfand and Shilov (1964, 5.5) show that this formula
can be interpreted in terms of differentiation of fractional order
1
.1 x/C
x
x
1
F .; ;
I x/ D D
C
:
(C.32)
.
/
./
D D
Z x
1
1
F .; ;
C I x/ D B.
; /
t
F .; ; I t /.x
t /
dt:
(C.33)
C.31 Jacobi polynomials arise from the hypergeometric function when the series is finite
nCa
a;b
Pn .1 2x/ D
F . n; a C b C n C 1; a C 1I x/;
n
where the generalized binomial coefficient is .n C a C 1/= .n C 1/.a C 1/. The polynomials Pna;b .w/; n 0 are orthogonal with respect to the weight function .1 w/a .1 C w/b
on 1; 1. Special cases include the Legendre polynomials Pn .x/, with a D b D 0, and the
Chebychev polynomials Tn .x/ and Un .x/ of first and second kinds, with a D b D 1=2
and a D b D 1=2 respectively.
The orthogonality relations, for the corresponding weight function on 0; 1, become
Z 1
2
Pma;b .1 2x/Pna;b .1 2x/ x a .1 x/b dx D ga;bIn
nm ;
0
.a C b C n C 1/
n
:
2n C a C b C 1 .a C n C 1/.b C n C 1/
(C.34)
Exercises
C.1
(Partition of unity for Proof of Proposition 2.4.) For x 2 Rn , let kxk1 D max jxk j.
(a) Exhibit a C r function .x/ 0 for which .x/ D 1 for kxk1 1 and .x/ D 0 for
kxk1 2. (Start with n D 1.)
(b) Let pi ; i D 1; 2; : : : be an enumeration of the points in Zn , and set i .x/ D .x pi /. Let
1 D 1 and for i 1, iC1 D .1 1 / .1 i /iC1 . Show that
1 C C m D 1
and hence that fi .x/g is a
C < 1 such that
Cr
.1
1 / .1
m /;
jDi 1 C C m .x/j
m
X
j D1
jDi j .x/j C:
Exercises
439
Appendix D
To Do List
The general structure of the book is approaching final form, unless feedback nowwhich
is welcome!should lead to changes. Nevertheless, many smaller points still need clean-up.
Especially, each chapter needs bibliographic notes discussing sources and references; this
is by no means systematic at present.
Specific Sections:
A section/epilogue on topics not covered
Appendix C needs to be organized.
Overall:
Each chapter clean up, attention to explaining flow. Also bibliographic notes, sources.
Table of Symbols/Acronyms,
Index.
440
Bibliography
Abel, N. 1826. Resolution dun probleme de mecanique. J. Reine u. Angew. Math, 1, 153157. [83]
Abramovich, F., and Silverman, B. W. 1998. Wavelet decomposition approaches to statistical inverse problems. Biometrika, 85, 115129. [342]
Abramovich, F., Sapatinas, T., and Silverman, B. W. 1998. Wavelet thresholding via a Bayesian approach.
J. Royal Statistical Society, Series B., 60, 725749. [196]
Abramovich, F., Benjamini, Y., Donoho, D., and Johnstone, I. 2006. Adapting to Unknown Sparsity by
controlling the False Discovery Rate. Annals of Statistics, 34, 584653. [xi, 202, 203, 322, 323]
Abramowitz, M., and Stegun, I. A. 1964. Handbook of mathematical functions with formulas, graphs, and
mathematical tables. National Bureau of Standards Applied Mathematics Series, vol. 55. For sale by the
Superintendent of Documents, U.S. Government Printing Office, Washington, D.C. [438]
Adler, R. J., and Taylor, J. E. 2007. Random fields and geometry. Springer Monographs in Mathematics.
New York: Springer. [51]
Anderson, G. W., Guionnet, A., and Zeitouni, O. 2010. An Introduction to Random Matrices. Cambridge
University Press. [134]
Ash, R. B., and Gardner, M. F. 1975. Topics in Stochastic Processes. Academic Press. [88]
Assouad, P. 1983. Deux remarques sur lestimation. C. R. Acad. Sci. Paris Ser. I Math., 296(23), 10211024.
[278]
Beckner, W. 1989. A generalized Poincare inequality for Gaussian measures. Proc. Amer. Math. Soc.,
105(2), 397400. [51]
Belitser, E., and Levit, B. 1995. On Minimax Filtering over Ellipsoids. Mathematical Methods of Statistics,
4, 259273. [135, 155]
Berger, J. O. 1985. Statistical decision theory and Bayesian analysis. Second edn. Springer Series in
Statistics. New York: Springer-Verlag. [134]
Bergh, J., and Lofstrom, J. 1976. Interpolation spaces An Introduction. New York: Springer Verlag. [413]
Berkhin, P., and Levit, B. Y. 1980. Asymptotically minimax second order estimates of the mean of a normal
population. Problems of information transmission, 16, 6079. [134]
Bertero, M. 1989. Linear inverse and ill-posed problems. Pages 1120 of: Advances in Electronics and
Electron Physics, vol. 75. New York: Academic Press. [427]
Bickel, P. J., and Collins, J. R. 1983. Minimizing Fisher information over mixtures of distributions. Sankhya
Ser. A, 45(1), 119. [243]
Bickel, P. J. 1981. Minimax estimation of the mean of a normal distribution when the parametr space is
restricted. Annals of Statistics, 9, 13011309. [121, 134, 243]
Bickel, P. J. 1983. Minimax estimation of a normal mean subject to doing well at a point. Pages 511528 of:
Rizvi, M. H., Rustagi, J. S., and Siegmund, D. (eds), Recent Advances in Statistics. New York: Academic
Press. [242, 243]
Billingsley, P. 1999. Convergence of probability measures. Second edn. Wiley Series in Probability and
Statistics: Probability and Statistics. New York: John Wiley & Sons Inc. A Wiley-Interscience Publication. [432]
Birge, L. 1983. Approximation dans les e spaces metriques et theorie de lestimation. Z. Wahrscheinlichkeitstheorie und Verwandte Gebiete, 65, 181237. [153]
441
442
Bibliography
Birge, L., and Massart, P. 2001. Gaussian Model Selection. Journal of European Mathematical Society, 3,
203268. [51, 244, 322]
Birkhoff, G., and Rota, G.-C. 1969. Ordinary Differential Equations. Blaisdell. [91]
Bogachev, V. I. 2007. Measure theory. Vol. I, II. Berlin: Springer-Verlag. [428]
Bogachev, V. I. 1998. Gaussian Measures. American Mathematical Society. [98, 395]
Borell, C. 1975. The Brunn-Minkowski inequality in Gauss space. Invent. Math., 30(2), 207216. [51]
Born, M., and Wolf, E. 1975. Principles of Optics. 5th edn. New York: Pergamon. [86]
Breiman, L. 1968. Probability. Reading, Mass.: Addison-Wesley Publishing Company. [94]
Breiman, L. 1995. Better subset selection using the non-negative garotte. Technometrics, 37, 373384.
[196]
Bretagnolle, J., and Huber, C. 1979. Estimation des densites: risque minimax. Z. Wahrscheinlichkeitstheorie
und Verwandte Gebiete, 47, 119137. [278]
Brown, L., DasGupta, A., Haff, L. R., and Strawderman, W. E. 2006. The heat equation and Steins identity:
connections, applications. J. Statist. Plann. Inference, 136(7), 22542278. [134]
Brown, L. D., and Low, M. G. 1996a. Asymptotic equivalence of nonparametric regression and white noise.
Annals of Statistics, 3, 23842398. [92, 95, 378]
Brown, L. D., and Purves, R. 1973. Measurable selections of extrema. Ann. Statist., 1, 902912. [50, 134]
Brown, L. D. 1971. Admissible estimators, recurrent diffusions and insoluble boundary value problems.
Annals of Mathematical Statistics, 42, 855903. Correction: Ann. Stat. 1 1973, pp 594596. [50, 134]
Brown, L. D. 1986. Fundamentals of statistical exponential families with applications in statistical decision theory. Institute of Mathematical Statistics Lecture NotesMonograph Series, 9. Hayward, CA:
Institute of Mathematical Statistics. [97]
Brown, L. D., and Gajek, L. 1990. Information Inequalities for the Bayes Risk. Annals of Statistics, 18,
15781594. [134]
Brown, L. D., and Low, M. G. 1996b. A constrained risk inequality with applications to nonparametric
functional estimation. Ann. Statist., 24(6), 25242535. [275]
Brown, L. D., Low, M. G., and Zhao, L. H. 1997. Superefficiency in nonparametric function estimation.
Annals of Statistics, 25, 26072625. [164, 172, 174, 178]
Brown, L. D., Carter, A. V., Low, M. G., and Zhang, C.-H. 2004. Equivalence theory for density estimation,
Poisson processes and Gaussian white noise with drift. Ann. Statist., 32(5), 20742097. [96]
Brown, L. D. 1966. On the admissibility of invariant estimators of one or more location parameters. Ann.
Math. Statist, 37, 10871136. [51]
Brown, L. 1977. Closure theorems for sequential-design processes. In: Gupta, S., and Moore, D. (eds),
Statistical Decision Theory and Related Topics II. Academic Press, New York. [395]
Brown, L. 1978. Notes on Statistical Decision Theory. Unpublished Lecture Notes. [391, 395, 397, 398]
Buhlmann, P., and van de Geer, S. 2011. Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer. [49]
Cai, T. T. 1999. Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann.
Statist., 27(3), 898924. [54, 164, 206, 242, 272]
Cai, T. T. 2002. On block thresholding in wavelet regression: adaptivity, block size, and threshold level.
Statist. Sinica, 12(4), 12411273. [278]
Cai, T. T., and Zhou, H. H. 2009a. Asymptotic equivalence and adaptive estimation for robust nonparametric
regression. Ann. Statist., 37(6A), 32043235. [98]
Cai, T., and Zhou, H. 2009b. A data-driven block thresholding approach to wavelet estimation. Ann. Statist,
37, 569595. [204, 206, 343]
Cand`es, E., and Romberg, J. 2007. Sparsity and incoherence in compressive sampling. Inverse Problems,
23(3), 969985. [49]
Carter, A. V. 2011. Asymptotic Equivalence of Nonparametric Experiments Bibliography. webpage at
University of California, Santa Barbara, Department of Statistics. [98]
Carter, C., Eagleson, G., and Silverman, B. 1992. A comparison of the Reinsch and Speckman splines.
Biometrika, 79, 8191. [78, 143, 144, 156]
Casella, G., and Strawderman, W. E. 1981. Estimating a bounded normal mean. Annals of Statistics, 9,
870878. [120]
Bibliography
443
Cavalier, L. 2004. Estimation in a problem of fractional integration. Inverse Problems, 20(5), 14451454.
[156]
Cavalier, L., and Tsybakov, A. B. 2001. Penalized blockwise Steins method, monotone oracles and sharp
adaptive estimation. Math. Methods Statist., 10(3), 247282. Meeting on Mathematical Statistics (Marseille, 2000). [164, 178, 242]
Cavalier, L. 2011. Inverse problems in statistics. Pages 396 of: Inverse problems and high-dimensional
estimation. Lect. Notes Stat. Proc., vol. 203. Heidelberg: Springer. [99]
Cavalier, L., and Tsybakov, A. 2002. Sharp adaptation for inverse problems with random noise. Probab.
Theory Related Fields, 123(3), 323354. [178]
Chatterjee, S. 2009. Fluctuations of eigenvalues and second order Poincare inequalities. Probab. Theory
Related Fields, 143(1-2), 140. [51]
Chaumont, L., and Yor, M. 2003. Exercises in probability. Cambridge Series in Statistical and Probabilistic
Mathematics, vol. 13. Cambridge: Cambridge University Press. A guided tour from measure theory to
random processes, via conditioning. [51]
Chen, S. S., Donoho, D. L., and Saunders, M. A. 1998. Atomic decomposition by basis pursuit. SIAM J.
Sci. Comput., 20(1), 3361. [49]
Chernoff, H. 1981. A note on an inequality involving the normal distribution. Ann. Probab., 9(3), 533535.
[51]
Chui, C. K. 1992. An Introduction to Wavelets. San Diego: Academic Press. [404]
Chui, C. K. 1997. Wavelets: a mathematical tool for signal processing. SIAM Monographs on Mathematical
Modeling and Computation. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM).
With a foreword by Gilbert Strang. [208]
Chung, K. L. 1974. A course in probability theory. Second edn. Academic Press [A subsidiary of Harcourt
Brace Jovanovich, Publishers], New York-London. Probability and Mathematical Statistics, Vol. 21.
[432]
Cirelson, B., Ibragimov, I., and Sudakov, V. 1976. Norm of Gaussian sample function. Pages 2041 of:
Proceedings of the 3rd Japan-U.S.S.R. Symposium on Probability Theory. Lecture Notes in Mathematics,
550. [51]
Cleveland, W. S. 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the
American Statistical Association, 74, 829836. [170]
Cogburn, R., and Davis, H. T. 1974. Periodic splines and spectral estimation. Ann. Statist., 2, 11081126.
[98]
Cohen, A. 1966. All admissible linear estimates of the mean vector. Annals of Mathematical Statistics, 37,
456463. [36, 51]
Cohen, A., Daubechies, I., Jawerth, B., and Vial, P. 1993a. Multiresolution analysis, wavelets, and fast
algorithms on an interval. Comptes Rendus Acad. Sci. Paris (A), 316, 417421. [188]
Cohen, A., Dahmen, W., and Devore, R. 2000. Multiscale Decompositions on Bounded Domains. Transactions of American Mathematical Society, 352(8), 36513685. [410]
Cohen, A. 1990. Ondelettes, analyses multiresolutions et filtres miroir en quadrature. Annales Institut Henri
Poincare, Analyse Non Lineaire, 7, 439459. [403]
Cohen, A. 2003. Numerical analysis of wavelet methods. Studies in Mathematics and its Applications, vol.
32. Amsterdam: North-Holland Publishing Co. [208]
Cohen, A., and Ryan, R. 1995. Wavelets and Multiscale Signal Processing. Chapman and Hall. [403]
Cohen, A., Daubechies, I., and Vial, P. 1993b. Wavelets on the Interval and Fast Wavelet Transforms.
Applied Computational and Harmonic Analysis, 1, 5481. [188, 194, 407]
Coifman, R., and Donoho, D. 1995. Translation-Invariant De-Noising. In: Antoniadis, A. (ed), Wavelets
and Statistics. Springer Verlag Lecture Notes. [xi, 200, 201]
Courant, R., and Hilbert, D. 1953. Methods of Mathematical Physics, Volume 1. Wiley-Interscience. [91,
92]
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. Wiley. [132, 153]
Cox, D. D. 1983. Asymptotics for M -type smoothing splines. Ann. Statist., 11(2), 530551. [78, 98]
Cox, D. D. 1988. Approximation of method of regularization estimators. Ann. Statist., 16(2), 694712.
[78]
444
Bibliography
Bibliography
445
Donoho, D., Johnstone, I., and Montanari, A. 2012. Accurate Prediction of Phase Transitions in Compressed
Sensing via a Connection to Minimax Denoising. IEEE Transactions on Information Theory. in press.
[242]
Donoho, D. L., and Huo, X. 2001. Uncertainty principles and ideal atomic decomposition. IEEE Trans.
Inform. Theory, 47(7), 28452862. [49]
Donoho, D. 1992. Interpolating Wavelet Transforms. Tech. rept. 408. Department of Statistics, Stanford
University. [381, 410]
Donoho, D. 1993. Unconditional bases are optimal bases for data compression and statistical estimation.
Applied and Computational Harmonic Analysis, 1, 100115. [277, 289]
Donoho, D. 1994. Statistical Estimation and Optimal recovery. Annals of Statistics, 22, 238270. [275,
300]
Donoho, D. 1995. Nonlinear solution of linear inverse problems by Wavelet-Vaguelette Decomposition.
Applied Computational and Harmonic Analysis, 2, 101126. [331, 333, 427]
Donoho, D. 1996. Unconditional Bases and Bit-Level Compression. Applied Computational and Harmonic
Analysis, 3, 388392. [289]
Donoho, D., and Low, M. 1992. Renormalization exponents and optimal pointwise rates of convergence.
Annals of Statistics, 20, 944970. [275, 280]
Dugundji, J. 1966. Topology. Allyn and Bacon, Boston. [391]
Durrett, R. 2010. Probability: theory and examples. Fourth edn. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. [79, 431]
Dym, H., and McKean, H. P. 1972. Fourier Series and Integrals. Academic Press. [135, 429]
Efromovich, S. 1996. On nonparametric regression for IID observations in a general setting. Ann. Statist.,
24(3), 11251144. [155]
Efromovich, S. 1999. Nonparametric curve estimation. Springer Series in Statistics. New York: SpringerVerlag. Methods, theory, and applications. [17]
Efromovich, S. 2004a. Analysis of blockwise shrinkage wavelet estimates via lower bounds for no-signal
setting. Ann. Inst. Statist. Math., 56(2), 205223. [278]
Efromovich, S. 2004b. Oracle inequalities for Efromovich-Pinsker blockwise estimates. Methodol. Comput.
Appl. Probab., 6(3), 303322. [178]
Efromovich, S. 2005. A study of blockwise wavelet estimates via lower bounds for a spike function. Scand.
J. Statist., 32(1), 133158. [278]
Efromovich, S. 2010. Dimension reduction and adaptation in conditional density estimation. J. Amer.
Statist. Assoc., 105(490), 761774. [178]
Efromovich, S., and Pinsker, M. 1996. Sharp-optimal and adaptive estimation for heteroscedastic nonparametric regression. Statist. Sinica, 6(4), 925942. [178]
Efromovich, S., and Samarov, A. 1996. Asymptotic equivalence of nonparametric regression and white
noise model has its limits. Statist. Probab. Lett., 28(2), 143145. [98]
Efromovich, S., and Valdez-Jasso, Z. A. 2010. Aggregated wavelet estimation and its application to ultrafast fMRI. J. Nonparametr. Stat., 22(7), 841857. [206]
Efromovich, S., and Pinsker, M. 1984. A learning algorithm for nonparametric filtering. Automat. i Telemeh., 11, 5865. (in Russian), translated in Automation and Remote Control, 1985, p 1434-1440. [159,
164, 178]
Efron, B. 1993. Introduction to James and Stein (1961) Estimation with Quadratic Loss. Pages 437
442 of: Kotz, S., and Johnson, N. (eds), Breakthroughs in Statistics: Volume 1: Foundations and Basic
Theory. Springer. [51]
Efron, B. 2001. Selection criteria for scatterplot smoothers. Ann. Statist., 29(2), 470504. [98]
Efron, B. 2011. Tweedies formula and selection bias. Tech. rept. Department of Statistics, Stanford
University. [50]
Efron, B., and Morris, C. 1971. Limiting the Risk of Bayes and Empirical Bayes Estimators Part I: The
Bayes Case. J. American Statistical Association, 66, 807815. [50, 213, 242]
Efron, B., and Morris, C. 1972. Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical
Bayes case. J. Amer. Statist. Assoc., 67, 130139. [50]
446
Bibliography
Efron, B., and Morris, C. 1973. Steins estimation rule and its competitorsan empirical Bayes approach.
J. Amer. Statist. Assoc., 68, 117130. [39]
Erdelyi, A., Magnus, W., Oberhettinger, F., and Tricomi, F. 1954. Tables of Integral Transforms, Volume 1.
McGraw-Hill. [73]
Eubank, R. L. 1999. Nonparametric regression and spline smoothing. Second edn. Statistics: Textbooks
and Monographs, vol. 157. New York: Marcel Dekker Inc. [98]
Fan, J., and Gijbels, I. 1996. Local polynomial modelling and its applications. Monographs on Statistics
and Applied Probability, vol. 66. London: Chapman & Hall. [79]
Fan, J., and Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties.
J. Amer. Statist. Assoc., 96(456), 13481360. [196]
Fan, K. 1953. Minimax theorems. Prob. Nat. Acad. Sci. U.S.A., 39, 4247. [392]
Feldman, I. 1991. Constrained minimax estimation of the mean of the normal distribution with known
variance. Ann. Statist., 19(4), 22592265. [134]
Feller, W. 1971. An introduction to probability theory and its applications, Volume 2. New York: Wiley.
[99]
Folland, G. B. 1984. Real analysis. Pure and Applied Mathematics (New York). New York: John Wiley &
Sons Inc. Modern techniques and their applications, A Wiley-Interscience Publication. []
Folland, G. B. 1999. Real analysis. Second edn. Pure and Applied Mathematics (New York). New York:
John Wiley & Sons Inc. Modern techniques and their applications, A Wiley-Interscience Publication.
[429, 436, 437]
Foster, D. P., and George, E. I. 1994. The risk inflation criterion for multiple regression. Ann. Statist., 22(4),
19471975. [304]
Foster, D., and Stine, R. 1997. An information theoretic comparison of model selection criteria. Tech. rept.
Dept. of Statistics, University of Pennsylvania. [322]
Frazier, M., Jawerth, B., and Weiss, G. 1991. Littlewood-Paley Theory and the study of function spaces.
NSF-CBMS Regional Conf. Ser in Mathematics, 79. Providence, RI: American Mathematical Society.
[257, 264, 408, 409]
Freedman, D. 1999. On the Bernstein-von Mises Theorem with Infinite Dimensional Parameters. Annals
of Statistics, 27, 11191140. [92]
Galambos, J. 1978. The asymptotic theory of extreme order statistics. John Wiley & Sons, New YorkChichester-Brisbane. Wiley Series in Probability and Mathematical Statistics. [219, 220]
Gao, F., Hannig, J., and Torcaso, F. 2003. Integrated Brownian motions and exact L2 -small balls. Ann.
Probab., 31(3), 13201337. [90]
Gao, H.-Y. 1998. Wavelet Shrinkage DeNoising Using The Non-Negative Garrote. J. Computational and
Graphical Statistics, 7, 469488. [196, 206]
Gao, H.-Y., and Bruce, A. G. 1997. Waveshrink with firm shrinkage. Statistica Sinica, 7, 855874. [196]
Gasser, T., and Muller, H.-G. 1984. Estimating regression functions and their derivatives by the kernel
method. Scand. J. Statist., 11(3), 171185. [178]
Gelfand, I. M., and Shilov, G. E. 1964. Generalized functions. Vol. I: Properties and operations. Translated
by Eugene Saletan. New York: Academic Press. [84, 332, 438]
George, E. I., and McCulloch, R. E. 1997. Approaches for Bayesian Variable Selection. Statistica Sinica,
7, 339374. [50]
George, E. I., and Foster, D. P. 2000. Calibration and Empirical Bayes Variable Selection. Biometrika, 87,
731747. [322]
Gilbarg, D., and Trudinger, N. S. 1983. Elliptic Partial Differential Equations of Second Order. Second
edition edn. Springer-Verlag. [434]
Golomb, M., and Weinberger, H. F. 1959. Optimal approximation and error bounds. Pages 117190 of: On
Numerical Approximation. University of Wisconsin Press. [300]
Golub, G. H., and Van Loan, C. F. 1996. Matrix Computations. 3rd edn. Johns Hopkins University Press.
[44]
Golubev, G. K., and Levit, B. Y. 1996. Asymptotically efficient estimation for analytic distributions. Math.
Methods Statist., 5(3), 357368. [156]
Bibliography
447
Golubev, G. K., Nussbaum, M., and Zhou, H. H. 2010. Asymptotic equivalence of spectral density estimation and Gaussian white noise. Ann. Statist., 38(1), 181214. [98]
Golubev, G. 1987. Adaptive asymptotically minimax estimates of smooth signals. Problemy Peredatsii
Informatsii, 23, 5767. [178]
Gorenflo, R., and Vessella, S. 1991. Abel integral equations. Lecture Notes in Mathematics, vol. 1461.
Berlin: Springer-Verlag. Analysis and applications. [83, 99]
Gourdin, E., Jaumard, B., and MacGibbon, B. 1994. Global Optimization Decomposition Methods for
Bounded Parameter Minimax Risk Evaluation. SIAM Journal of Scientific Computing, 15, 1635. [120,
121]
Grama, I., and Nussbaum, M. 1998. Asymptotic Equivalence for Nonparametric Generalized Linear Models. Probability Theory and Related Fields, 111, 167214. [97]
Gray, R. M. 2006. Toeplitz and Circulant Matrices: A review. Foundations and Trends in Communications
and Information Theory, 2, 155239. [48]
Green, P., and Silverman, B. 1994. Nonparametric Regression and Generalized Linear Models. London:
Chapman and Hall. [13, 17, 70]
Grenander, U. 1981. Abstract inference. New York: John Wiley & Sons Inc. Wiley Series in Probability
and Mathematical Statistics. [156]
Grenander, U., and Rosenblatt, M. 1957. Statistical Analysis of Stationary Time Series, Second Edition
published 1984. Chelsea. [97]
Groeneboom, P. 1996. Lectures on inverse problems. Pages 67164 of: Lectures on probability theory and
statistics (Saint-Flour, 1994). Lecture Notes in Math., vol. 1648. Berlin: Springer. [99]
Groeneboom, P., and Jongbloed, G. 1995. Isotonic estimation and rates of convergence in Wicksells problem. Ann. Statist., 23(5), 15181542. [99]
Hall, P., and Patil, P. 1993. Formulae for mean integrated squared error of nonlinear wavelet-based density
estimators. Tech. rept. CMA-SR15-93. Australian National University. To appear, Ann. Statist. [179]
Hall, P. G., Kerkyacharian, G., and Picard, D. 1999a. On Block thresholding rules for curve estimation
using kernel and wavelet methods. Annals of Statistics, 26, 922942. [206]
Hall, P. 1979. On the rate of convergence of normal extremes. J. Appl. Probab., 16(2), 433439. [220]
Hall, P., and Hosseini-Nasab, M. 2006. On properties of functional principal components analysis. J. R.
Stat. Soc. Ser. B Stat. Methodol., 68(1), 109126. [89]
Hall, P., and Smith, R. L. 1988. The Kernel Method for Unfolding Sphere Size Distributions. Journal of
Computational Physics, 74, 409421. [99]
Hall, P., Kerkyacharian, G., and Picard, D. 1999b. On the minimax optimality of block thresholded wavelet
estimators. Statist. Sinica, 9(1), 3349. [206]
Hardle, W., Hall, P., and Marron, S. 1988. How far are automatically chosen regression smoothing parameters from their minimum? (with discussion). J. American Statistical Association, 83, 86101. [165]
Hardle, W., Kerkyacharian, G., Picard, D., and Tsybakov, A. 1998. Wavelets, approximation, and statistical
applications. Lecture Notes in Statistics, vol. 129. New York: Springer-Verlag. [208, 405]
Hardy, G. H., and Littlewood, J. E. 1928. Some properties of fractional integrals. I. Math. Z., 27(1),
565606. [99]
Hart, J. D. 1997. Nonparametric smoothing and lack-of-fit tests. Springer Series in Statistics. New York:
Springer-Verlag. [98]
Hastie, T., Tibshirani, R., and Wainwright, M. 2012. L1 regression? Chapman and Hall? forthcoming. [49]
Hastie, T. J., and Tibshirani, R. J. 1990. Generalized Additive Models. Chapman and Hall. [70]
Hedayat, A., and Wallis, W. D. 1978. Hadamard matrices and their applications. Ann. Statist., 6(6), 1184
1238. [51]
Heil, C., and Walnut, D. F. 2006. Fundamental Papers in Wavelet Theory. Princeton University Press. [208]
Hernandez, E., and Weiss, G. 1996. A First Course on Wavelets. CRC Press. [208, 404]
Hida, T. 1980. Brownian Motion. Springer. [98]
Hoerl, A. E., and Kennard, R. W. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems.
Technometrics, 12, 5567. [51]
Huber, P. J., and Ronchetti, E. M. 2009. Robust Statistics. Wiley. [237, 433]
448
Bibliography
Hwang, J. T., and Casella, G. 1982. Minimax confidence sets for the mean of a multivariate normal distribution. Ann. Statist., 10(3), 868881. [51]
Ibragimov, I., and Khasminskii, R. 1997. Some estimation problems in infinite-dimensional Gaussian white
noise. Pages 259274 of: Festschrift for Lucien Le Cam. New York: Springer. [156]
Ibragimov, I. A., and Has0 minski, R. Z. 1977. Estimation of infinite-dimensional parameter in Gaussian
white noise. Dokl. Akad. Nauk SSSR, 236(5), 10531055. [156]
Ibragimov, I. A., and Khasminskii, R. Z. 1980. Asymptotic properties of some nonparametric estimates in
Gaussian white nose. In: Proceedings of Third International Summer School in Probability and Mathematical Statistics, (Varna 1978), Sofia. in Russian. [16]
Ibragimov, I. A., and Khasminskii, R. Z. 1982. Bounds for the risks of non-parametric regression estimates.
Theory of Probability and its Applications, 27, 8499. [275]
Ibragimov, I. A., and Khasminskii, R. Z. 1984. On nonparametric estimation of the value of a linear
functional in Gaussian white noise. Theory of Probability and its Applications, 29, 1832. [119, 156]
Ibragimov, I., and Hasminskii, R. 1981. Statistical estimation : asymptotic theory. New York: Springer.
[16, 17, 153]
Ibragimov, I., and Khasminskii, R. 1983. Estimation of distribution density. Journal of Soviet Mathematics,
21, 4057. [156]
Ingster, Y. I., and Suslina, I. A. 2003. Nonparametric goodness-of-fit testing under Gaussian models. Lecture Notes in Statistics, vol. 169. New York: Springer-Verlag. [17]
Jaffard, S., Meyer, Y., and Ryan, R. D. 2001. Wavelets. Revised edn. Philadelphia, PA: Society for Industrial
and Applied Mathematics (SIAM). Tools for science & technology. [208]
James, W., and Stein, C. 1961. Estimation with quadratic loss. Pages 361380 of: Proceedings of Fourth
Berkeley Symposium on Mathematical Statistics and Probability Theory. University of California Press.
[21, 39, 51, 134]
Jansen, M. 2001. Noise reduction by wavelet thresholding. Lecture Notes in Statistics, vol. 161. New York:
Springer-Verlag. [208]
Johnen, H. 1972. Inequalities connected with the moduli of smoothness. Mat. Vesnik, 9(24), 289303.
[414]
Johnson, N. L., and Kotz, S. 1970. Distributions in Statistics: Continuous Univariate Distributions - 2.
Wiley, New York. [39]
Johnstone, I. M. 1994. Minimax Bayes, Asymptotic Minimax and Sparse Wavelet Priors. Pages 303326
of: Gupta, S., and Berger, J. (eds), Statistical Decision Theory and Related Topics, V. Springer-Verlag.
[178]
Johnstone, I. M. 1999. Wavelet shrinkage for correlated data and inverse problems: adaptivity results.
Statistica Sinica, 9, 5183. [204]
Johnstone, I. M., and Silverman, B. W. 1990. Speed of Estimation in Positron Emission Tomography and
related inverse problems. Annals of Statistics, 18, 251280. [99]
Johnstone, I. M., and Silverman, B. W. 1997. Wavelet Threshold estimators for data with correlated noise.
Journal of the Royal Statistical Society, Series B., 59, 319351. [xi, 198, 199, 221]
Johnstone, I. M., and Silverman, B. W. 2004a. Needles and straw in haystacks: Empirical Bayes estimates
of possibly sparse sequences. Annals of Statistics, 32, 15941649. [50, 204]
Johnstone, I. M. 2001. Chi Square Oracle Inequalities. Pages 399418 of: de Gunst, M., Klaassen, C., and
van der Waart, A. (eds), Festschrift for Willem R. van Zwet. IMS Lecture Notes - Monographs, vol. 36.
Institute of Mathematical Statistics. [51, 243]
Johnstone, I. M. 2010. High dimensional Bernstein-von Mises: simple examples. IMS Collections, 6,
8798. [98]
Johnstone, I. M., and Silverman, B. W. 2004b. Boundary coiflets for wavelet shrinkage in function estimation. J. Appl. Probab., 41A, 8198. Stochastic methods and their applications. [378, 386, 387]
Johnstone, I. M., and Silverman, B. W. 2005a. EbayesThresh: R Programs for Empirical Bayes Thresholding. Journal of Statistical Software, 12(8), 138. [51]
Johnstone, I. M., and Silverman, B. W. 2005b. Empirical Bayes selection of wavelet thresholds. Ann.
Statist., 33(4), 17001752. [x, 170, 204, 264, 300]
Bibliography
449
Joshi, V. M. 1967. Inadmissibility of the usual confidence sets for the mean of a multivariate normal
population. Ann. Math. Statist., 38, 18681875. [51]
Kagan, A. M., Linnik, Y. V., and Rao, C. R. 1973. Characterization problems in mathematical statistics.
John Wiley & Sons, New York-London-Sydney. Translated from the Russian by B. Ramachandran,
Wiley Series in Probability and Mathematical Statistics. [50]
Kahane, J., de Leeuw, K., and Katznelson, Y. 1977. Sur les coefficients de Fourier des fonctions continues.
Comptes Rendus Acad. Sciences Paris (A), 285, 10011003. [290]
Katznelson, Y. 1968. An Introduction to Harmonic Analysis. Dover. [83, 123]
Keller, J. B. 1976. Inverse problems. Amer. Math. Monthly, 83(2), 107118. [83]
Kempthorne, P. J. 1987. Numerical specification of discrete least favorable prior distributions. SIAM J. Sci.
Statist. Comput., 8(2), 171184. [120]
Kneser, H. 1952. Sur un theor`eme fondamental de la theorie des jeux. C. R. Acad. Sci. Paris, 234, 2418
2420. [391, 392]
Kolaczyk, E. D. 1997. Nonparametric Estimation of Gamma-Ray Burst Intensities Using Haar Wavelets.
The Astrophysical Journal, 483, 340349. [96]
Komlos, J., Major, P., and Tusnady, G. 1975. An approximation of partial sums of independent RVs and
the sample DF. I. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 32, 111131. [96]
Koo, J.-Y. 1993. Optimal rates of convergence for nonparametric statistical inverse problems. Ann. Statist.,
21(2), 590599. [99]
Kotelnikov, V. 1959. The Theory of Optimum Noise Immunity. McGraw Hill, New York. [16]
Krantz, S. G., and Parks, H. R. 2002. A primer of real analytic functions. Second edn. Birkhauser Advanced
Texts: Basler Lehrbucher. [Birkhauser Advanced Texts: Basel Textbooks]. Boston, MA: Birkhauser
Boston Inc. [428]
Kuhn, H. 1953. Review of Kneser (1952). Mathematical Reviews, 14, 301. [392]
Kuo, H.-H. 1975. Gaussian Measures in Banach Spaces. Springer Verlag, Lecture Notes in Mathematics #
463. [98]
Kuo, H.-H. 2006. Introduction to stochastic integration. Universitext. New York: Springer. [430]
Lai, T. L., and Robbins, H. 1976. Maximally dependent random variables. Proceedings of the National
Academy of Sciences, 73(2), 286288. [243]
Laurent, B., and Massart, P. 1998. Adaptive estimation of a quadratic functional by model selection. Tech.
rept. Universite de Paris-Sud, Mathematiques. [51]
Le Cam, L. 1986. Asymptotic Methods in Statistical Decision Theory. Berlin: Springer. [10, 93, 391, 395]
Le Cam, L., and Yang, G. L. 2000. Asymptotics in statistics. Second edn. Springer Series in Statistics. New
York: Springer-Verlag. Some basic concepts. [93]
LeCam, L. 1955. An extension of Walds theory of statistical decision functions. Annals of Mathematical
Statistics, 26, 6981. [395]
Ledoux, M. 1996. Isoperimetry and Gaussian Analysis. In: Bernard, P. (ed), Lectures on Probability Theory
and Statistics, Ecole dEte de Probabilities de Saint Flour, 1994. Springer Verlag. [51]
Ledoux, M. 2001. The concentration of measure phenomenon. Mathematical Surveys and Monographs,
vol. 89. Providence, RI: American Mathematical Society. [45, 46, 51]
Lehmann, E. L., and Casella, G. 1998. Theory of Point Estimation. Second edn. Springer Texts in Statistics.
New York: Springer-Verlag. [39, 40, 50, 51, 96, 107, 171]
Lehmann, E. L., and Romano, J. P. 2005. Testing statistical hypotheses. Third edn. Springer Texts in
Statistics. New York: Springer. [28, 50, 80, 81, 107]
Lemarie, P., and Meyer, Y. 1986. Ondelettes et bases Hilbertiennes. Revista Matematica Iberoamericana,
2, 118. [409, 410]
Lepskii, O. 1991. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability and
its Applications, 35, 454466. [275]
Levit, B. 2010a. Minimax revisited. I. Math. Methods Statist., 19(3), 283297. [134]
Levit, B. 2010b. Minimax revisited. II. Math. Methods Statist., 19(4), 299326. [134]
Levit, B. Y. 1980. On asymptotic minimax estimates of second order. Theory of Probability and its Applications, 25, 552568. [134]
450
Bibliography
Levit, B. Y. 1982. Minimax estimation and positive solutions of elliptic equations. Theory of Probability
and its Applications, 82, 563586. [134]
Levit, B. Y. 1985. Second order asymptotic optimality and positive solutions of Schrodingers equation.
Theory of Probability and its Applications, 30, 333363. [134]
Loader, C. R. 1999. Bandwidth selection: Classical or plug-in? Annals of Statistics, 27, 415438. [178]
Mallat, S. 1998. A Wavelet Tour of Signal Processing. Academic Press. [179]
Mallat, S. 1999. A Wavelet Tour of Signal Processing. Academic Press. 2nd, expanded, edition. [184, 187,
405]
Mallat, S. 2009. A wavelet tour of signal processing. Third edn. Elsevier/Academic Press, Amsterdam. The
sparse way, With contributions from Gabriel Peyre. [182, 195, 208, 401, 405]
Mallows, C. 1973. Some comments on Cp . Technometrics, 15, 661675. [51]
Mallows, C. 1978. Minimizing an Integral. SIAM Review, 20(1), 183183. [243]
Mandelbaum, A. 1984. All admissible linear estimators of the mean of a Gaussian distribution on a Hilbert
space. Annals of Statistics, 12, 14481466. [62]
Mardia, K. V., Kent, J. T., and Bibby, J. M. 1979. Multivariate Analysis. Academic Press. [26]
Marr, R. B. 1974. On the reconstruction of a function on a circular domain from a sampling of its line
integrals. J. Math. Anal. Appl., 45, 357374. [86]
Marron, J. S., and Wand, M. P. 1992. Exact mean integrated squared error. Ann. Statist., 20(2), 712736.
[178]
Massart, P. 2007. Concentration inequalities and model selection. Lecture Notes in Mathematics, vol. 1896.
Berlin: Springer. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July
623, 2003, With a foreword by Jean Picard. [17, 55, 322]
McMurry, T. L., and Politis, D. N. 2004. Nonparametric regression with infinite order flat-top kernels. J.
Nonparametr. Stat., 16(3-4), 549562. [100]
Meyer, Y. 1986. Principe dincertitude, bases hilbertiennes et algebres doperateurs. Seminaire Bourbaki,
662. [404]
Meyer, Y. 1990. Ondelettes et Operateurs, I: Ondelettes, II: Operateurs de Calderon-Zygmund, III: (with
R. Coifman), Operateurs multilineaires. Paris: Hermann. English translations of Vol I. and Vols II-III
(combined) published by Cambridge University Press. [182, 187, 208, 257, 278, 289, 402, 405, 409,
419]
Meyer, Y. 1991. Ondelettes sur lintervalle. Revista Matematica Iberoamericana, 7, 115133. [188]
Meyer, Y. 1992. Wavelets and Operators. Vol. 1. Cambridge University Press. [410]
Meyer, Y., and Coifman, R. 1997. Wavelets. Cambridge Studies in Advanced Mathematics, vol. 48.
Cambridge: Cambridge University Press. Calderon-Zygmund and multilinear operators, Translated from
the 1990 and 1991 French originals by David Salinger. [423]
Mezard, M., and Montanari, A. 2009. Information, physics, and computation. Oxford Graduate Texts.
Oxford: Oxford University Press. [242]
Micchelli, C. A. 1975. Optimal estimation of linear functionals. Tech. rept. 5729. IBM. [300]
Micchelli, C. A., and Rivlin, T. J. 1977. A survey of optimal recovery. Pages 1–54 of: Micchelli, C. A., and Rivlin, T. J. (eds), Optimal Estimation in Approximation Theory. New York: Plenum Press. [300]
Miller, A. J. 1984. Selection of subsets of regression variables (with discussion). J. Roy. Statist. Soc., Series A, 147, 389–425. [221]
Miller, A. J. 1990. Subset Selection in Regression. Chapman and Hall, London, New York. [221]
Mills, J. P. 1926. Table of the Ratio: Area to Bounding Ordinate, for Any Portion of Normal Curve. Biometrika, 18, 395–400. [430]
Nason, G. P. 2008. Wavelet methods in statistics with R. Use R! New York: Springer. [200, 208]
Nason, G. 2010. wavethresh: Wavelets statistics and transforms. R package version 4.5. [x, 170]
Nemirovski, A. 2000. Topics in non-parametric statistics. Pages 85–277 of: Lectures on probability theory and statistics (Saint-Flour, 1998). Lecture Notes in Math., vol. 1738. Berlin: Springer. [17, 178]
Nikolskii, S. 1975. Approximation of Functions of Several Variables and Imbedding Theorems. Springer,
New York. [408]
Nishii, R. 1984. Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist., 12(2), 758–765. [304]
Nussbaum, M. 1996. Asymptotic equivalence of density estimation and white noise. Annals of Statistics, 24, 2399–2430. [96]
Nussbaum, M. N. 2004. Équivalence asymptotique des expériences statistiques. Journal de la Société Française de Statistique, 145(1), 31–45. (In French). [93]
Ogden, R. T. 1997. Essential wavelets for statistical applications and data analysis. Boston, MA: Birkhäuser Boston Inc. [208]
Peck, J., and Dulmage, A. 1957. Games on a compact set. Canadian Journal of Mathematics, 9, 450–458. [392]
Peetre, J. 1975. New Thoughts on Besov Spaces, I. Raleigh, Durham: Duke University Mathematics Series.
[263, 408, 411]
Percival, D. B., and Walden, A. T. 2000. Wavelet methods for time series analysis. Cambridge Series in
Statistical and Probabilistic Mathematics, vol. 4. Cambridge: Cambridge University Press. [208]
Pinsker, M. 1980. Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission, 16, 120–133. Originally in Russian in Problemy Peredachi Informatsii, 16, 52–68. [61, 138]
Pinsky, M. A. 2009. Introduction to Fourier analysis and wavelets. Graduate Studies in Mathematics, vol.
102. Providence, RI: American Mathematical Society. Reprint of the 2002 original. [208]
Pratt, J. W. 1960. On interchanging limits and integrals. Annals of Mathematical Statistics, 31, 74–77. [428]
Prékopa, A. 1980. Logarithmic concave measures and related topics. Pages 63–82 of: Stochastic programming (Proc. Internat. Conf., Univ. Oxford, Oxford, 1974). London: Academic Press. [366]
Ramsay, J. O., and Silverman, B. W. 2005. Functional data analysis. Second edn. Springer Series in
Statistics. New York: Springer. [89]
Reed, M., and Simon, B. 1980. Functional Analysis, Volume 1, revised and enlarged edition. Academic
Press. [425, 427]
Rice, J., and Rosenblatt, M. 1981. Integrated mean squared error of a smoothing spline. J. Approx. Theory, 33(4), 353–369. [98]
Riesz, F., and Sz.-Nagy, B. 1955. Functional Analysis. Ungar, New York. [425]
Rigollet, P. 2006. Adaptive density estimation using the blockwise Stein method. Bernoulli, 12(2), 351–370. [178]
Robbins, H. 1956. An empirical Bayes approach to statistics. Pages 157–163 of: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I. Berkeley and Los Angeles: University of California Press. [50]
Rudin, W. 1973. Functional Analysis. McGraw Hill. [391, 397, 434]
Ruggeri, F. 2006. Gamma-Minimax Inference. In: Encyclopedia of Statistical Sciences. John Wiley &
Sons. [134]
Schervish, M. J. 1995. Theory of statistics. Springer Series in Statistics. New York: Springer-Verlag. [94,
134]
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6, 461–464. [304]
Serfling, R. J. 1980. Approximation theorems of mathematical statistics. New York: John Wiley & Sons
Inc. Wiley Series in Probability and Mathematical Statistics. [45]
Shao, P. Y.-S., and Strawderman, W. E. 1994. Improving on the James-Stein positive-part estimator. Ann. Statist., 22(3), 1517–1538. [40]
Shepp, L. A. 1966. Radon-Nikodym derivatives of Gaussian measures. Annals of Mathematical Statistics, 37, 321–354. [89, 431]
Silverman, B. W. 1984. Spline smoothing: the equivalent variable kernel method. Annals of Statistics, 12, 898–916. [73]
Simonoff, J. S. 1996. Smoothing methods in statistics. Springer Series in Statistics. New York: Springer-Verlag. [98]
Simons, S. 1995. Minimax theorems and their proofs. Pages 1–23 of: Du, D.-Z., and Pardalos, P. (eds), Minimax and Applications. Kluwer Academic Publishers. [392]
Sion, M. 1958. On general minimax theorems. Pacific Journal of Mathematics, 8, 171–176. [392]
Speckman, P. 1985. Spline smoothing and optimal rates of convergence in nonparametric regression models. Annals of Statistics, 13, 970–983. [78, 98]
Srinivasan, C. 1973. Admissible Generalized Bayes Estimators and Exterior Boundary Value Problems. Sankhyā, Ser. A, 43, 1–25. [50, 134]
Starck, J.-L., Murtagh, F., and Fadili, J. M. 2010. Sparse image and signal processing. Cambridge:
Cambridge University Press. Wavelets, curvelets, morphological diversity. [208]
Stein, C. 1956. Efficient nonparametric estimation and testing. Pages 187–195 of: Proc. Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1. University of California Press, Berkeley, CA. [21, 51]
Stein, C. 1981. Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9, 1135–1151. [37, 39, 51]
Stoffer, D. 1991. Walsh-Fourier Analysis and Its Statistical Applications. Journal of the American Statistical Association, 86, 461–479. [96]
Stone, C. 1980. Optimal rates of convergence for nonparametric estimators. Annals of Statistics, 8, 1348–1360. [291]
Strawderman, W. E. 1971. Proper Bayes minimax estimators of the multivariate normal mean. Ann. Math. Statist., 42(1), 385–388. [51]
Sudakov, V. N., and Cirel'son, B. S. 1974. Extremal properties of half-spaces for spherically invariant measures. Zap. Naučn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI), 41, 14–24, 165. Problems in the theory of probability distributions, II. [51]
Szegő, G. 1967. Orthogonal Polynomials, 3rd edition. American Mathematical Society. [103, 416]
Talagrand, M. 2003. Spin glasses: a challenge for mathematicians. Ergebnisse der Mathematik und ihrer
Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics [Results in Mathematics and Related Areas. 3rd Series. A Series of Modern Surveys in Mathematics], vol. 46. Berlin: Springer-Verlag.
Cavity and mean field models. [242, 430]
Tao, T. 2011. Topics in Random Matrix Theory. Draft book manuscript. [51]
Temme, N. M. 1996. Special functions. A Wiley-Interscience Publication. New York: John Wiley & Sons
Inc. An introduction to the classical functions of mathematical physics. [438]
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1), 267–288. [49]
Tibshirani, R., and Knight, K. 1999. The covariance inflation criterion for adaptive model selection. Journal of the Royal Statistical Society, Series B, 61, 529–546. [322]
Tikhonov, A. N., and Arsenin, V. Y. 1977. Solutions of ill-posed problems. V. H. Winston & Sons, Washington, D.C.; John Wiley & Sons, New York. Translated from the Russian, preface by translation editor Fritz John, Scripta Series in Mathematics. [51]
Triebel, H. 1983. Theory of Function Spaces. Basel: Birkhäuser Verlag. [257, 264, 408, 410, 411, 413]
Triebel, H. 1992. Theory of Function Spaces II. Basel: Birkhäuser Verlag. [408, 410, 411]
Triebel, H. 2006. Theory of function spaces. III. Monographs in Mathematics, vol. 100. Basel: Birkhäuser Verlag. [410]
Triebel, H. 2008. Function spaces and wavelets on domains. EMS Tracts in Mathematics, vol. 7. European Mathematical Society (EMS), Zürich. [410]
Tsybakov, A. B. 2009. Introduction to Nonparametric Estimation. Springer. [17, 61, 131, 135, 155, 278,
300]
Tsybakov, A. 1997. Asymptotically Efficient Signal Estimation in L2 Under General Loss Functions. Problems of Information Transmission, 33, 78–88. Translated from Russian. [155]
van der Vaart, A. W. 1997. Superefficiency. Pages 397–410 of: Festschrift for Lucien Le Cam. New York: Springer. [178]
van der Vaart, A. W. 1998. Asymptotic statistics. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 3. Cambridge: Cambridge University Press. [56]
van der Vaart, A. 2002. The statistical work of Lucien Le Cam. Ann. Statist., 30(3), 631682. Dedicated to
the memory of Lucien Le Cam. [93]
Van Trees, H. L. 1968. Detection, Estimation and Modulation Theory, Part I. New York: Wiley. [109]
Vidakovic, B. 1999. Statistical Modelling by Wavelets. John Wiley and Sons. [208]
Vidakovic, B., and DasGupta, A. 1996. Efficiency of linear rules for estimating a bounded normal mean. Sankhyā Ser. A, 58(1), 81–100. [134]
Vogel, C. R. 2002. Computational methods for inverse problems. Frontiers in Applied Mathematics, vol.
23. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM). With a foreword by H.
T. Banks. [51]
von Neumann, J., and Morgenstern, O. 1944. Theory of Games and Economic Behavior. Princeton, NJ: Princeton University Press. [391]
Wahba, G. 1978. Improper priors, spline smoothing and the problem of guarding against model errors in regression. J. Roy. Statist. Soc. Ser. B., 40, 364–372. [90]
Wahba, G. 1983. Bayesian confidence intervals for the cross-validated smoothing spline. J. Roy. Statist. Soc. Ser. B., 45, 133–150. [90]
Wahba, G. 1985. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Annals of Statistics, 13, 1378–1402. [98]
Wahba, G. 1990. Spline Methods for Observational Data. Philadelphia: SIAM. [70, 90]
Wald, A. 1950. Statistical Decision Functions. Wiley. [10, 395]
Walter, G. G., and Shen, X. 2001. Wavelets and other orthogonal systems. Second edn. Studies in Advanced
Mathematics. Chapman & Hall/CRC, Boca Raton, FL. [208]
Wand, M. P., and Jones, M. C. 1995. Kernel smoothing. Monographs on Statistics and Applied Probability,
vol. 60. London: Chapman and Hall Ltd. [98]
Wasserman, L. 2006. All of nonparametric statistics. Springer Texts in Statistics. New York: Springer. [17]
Watson, G. S. 1971. Estimating Functionals of Particle Size Distribution. Biometrika, 58, 483–490. [84]
Wicksell, S. D. 1925. The corpuscle problem. A mathematical study of a biometric problem. Biometrika, 17, 84–99. [84]
Williams, D. 1991. Probability with Martingales. Cambridge University Press, Cambridge. [79]
Wojtaszczyk, P. 1997. A Mathematical Introduction to Wavelets. Cambridge University Press. [208, 411]
Woodroofe, M. 1970. On choosing a delta sequence. Annals of Mathematical Statistics, 41, 1665–1671. [174]
Young, W. H. 1911. On semi-integrals and oscillating successions of functions. Proc. London Math. Soc. (2), 9, 286–324. [428]
Zhang, C.-H. 2012. Minimax ℓq risk in ℓp balls. Pages 78–89 of: Contemporary Developments in Bayesian Analysis and Statistical Decision Theory: A Festschrift for William E. Strawderman. IMS Collections, vol. 8. Institute of Mathematical Statistics. [243]
Ziemer, W. P. 1989. Weakly differentiable functions. Graduate Texts in Mathematics, vol. 120. New York:
Springer-Verlag. Sobolev spaces and functions of bounded variation. [434]
Zygmund, A. 1959. Trigonometric Series, Volume I. Cambridge University Press, Cambridge. [123]
Zygmund, A. 2002. Trigonometric series. Vol. I, II. Third edn. Cambridge Mathematical Library.
Cambridge: Cambridge University Press. With a foreword by Robert A. Fefferman. [142]